DevOps is about accelerating delivery of new products and services at scale, reliably and affordably. Doing this requires automation — using software to build, configure, deploy, scale, update, and manage other software.
We usually think of monitoring as happening alongside this process: its job is to alert operators when things go wrong, help analyse issues, and confirm compliance with service-level objectives (SLOs). But it’s better practice to treat monitoring as a vital part of ops automation. A modern, full-featured monitoring platform can be a powerful automation engine in its own right, and a critical enabler for larger automation initiatives in application and infrastructure lifecycle management and problem mitigation. In many cases, it can even enable autonomous operations like self-scaling and self-healing.
Here are some of the ways your monitoring system can help you get more done, eliminate human error, and meet (not just comply with) SLOs:
Streamlined, automated monitoring system deployment and lifecycle management. On-premises monitoring solutions can co-reside with monitored infrastructure, both in classic private clouds and data centres and in provider-hosted virtual private clouds (VPCs). This lets them comply with security, privacy, data governance and other regulations, and helps them overcome the bandwidth and cost barriers that can limit the scalability of SaaS monitoring solutions. On-premises monitoring must still be deployed, scaled, and updated, however, and this can be daunting for all but very simple, single-server configurations.
Forward-looking makers of this kind of monitoring platform are starting to exploit popular deployment automation frameworks like Ansible, Puppet, and Chef (the same ones DevOps teams use to automate infrastructure deployment and routine operations) to streamline monitoring-system deployment in scaled-out, highly available configurations. For operator convenience, they’re hiding deployment-tool complexity behind web UIs and simplified configurators, though the standard tooling remains accessible to DevOps folks who wish to dovetail monitoring-system or metrics-collector deployment with infrastructure roll-outs, which is a best practice. Details of monitoring can then be defined and maintained as part of definitive “infrastructure as code” repositories.
Automated agent deployment and monitored-object registration via API. Standard deployment tools like Ansible can also be used to inject, configure, and update monitoring components (endpoint agents, required libraries, etc.) on hosts. The same tools can extract facts from deployment manifests or directly from hosts at deploy time, then use monitoring-system APIs to rapidly configure monitoring for host infrastructure and applications, as well as “unmonitor” hosts at end of life. Routinely putting systems under monitoring as soon as they’re deployed enables rapid detection of issues in staging or production, and can be used to trigger rollbacks if required. This is an important best practice for continuous delivery.
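To make the registration step concrete, here is a minimal sketch in Python. It assumes a hypothetical deployment manifest (a list of host records) and a hypothetical monitoring API that accepts one JSON object per host; the field names, the role-to-check mapping, and the function names are all illustrative and not taken from any real product. The sketch only plans the changes, leaving the actual API calls to whatever client your monitoring platform provides.

```python
import json

# Hypothetical mapping from a host's role to the checks it should receive.
ROLE_CHECKS = {
    "web": ["http_response", "cpu_load", "disk_usage"],
    "db": ["db_connections", "cpu_load", "disk_usage"],
}
DEFAULT_CHECKS = ["cpu_load", "disk_usage"]


def build_registration(host):
    """Turn one manifest entry into a payload for a hypothetical monitoring API."""
    role = host.get("role", "generic")
    return {
        "name": host["name"],
        "address": host["address"],
        "hostgroup": host.get("environment", "unassigned"),
        "checks": ROLE_CHECKS.get(role, DEFAULT_CHECKS),
    }


def plan_changes(manifest, monitored_names):
    """Compare the manifest with the hosts already under monitoring.

    Returns (payloads for hosts to register, sorted names of hosts to unmonitor).
    """
    wanted = {h["name"]: h for h in manifest}
    to_register = [build_registration(h) for name, h in wanted.items()
                   if name not in monitored_names]
    to_remove = sorted(set(monitored_names) - set(wanted))
    return to_register, to_remove


# Example data (invented for illustration).
manifest = [
    {"name": "web-01", "address": "10.0.0.11", "role": "web", "environment": "staging"},
    {"name": "db-01", "address": "10.0.0.21", "role": "db", "environment": "staging"},
]
to_register, to_remove = plan_changes(manifest, monitored_names={"web-01", "old-07"})
print(json.dumps(to_register, indent=2))
print("unmonitor:", to_remove)
```

Run as part of a deployment pipeline, a step like this keeps the set of monitored hosts in line with the manifest: new hosts are registered as they appear, and hosts that have dropped out of the manifest are queued for removal.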
CMDB ingestion. Some monitoring platforms can ingest data from operations management tools and configuration management databases (CMDBs), such as those offered by ServiceNow and similar vendors. This lets operators quickly and confidently configure monitoring for existing infrastructure, applications, and full business services — avoiding laborious and error-prone manual compilation of system facts.
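As a rough illustration of what ingestion involves, the Python sketch below groups configuration items by the business service they support, so that each service can be monitored as a unit. The record fields (`business_service`, `ci_type`, `status`, and so on) are assumptions made for this example; a real CMDB export will use its own schema.

```python
from collections import defaultdict


def group_by_service(cmdb_records):
    """Group operational configuration items by business service.

    Items whose status is anything other than "operational" are skipped.
    """
    services = defaultdict(list)
    for rec in cmdb_records:
        if rec.get("status") != "operational":
            continue
        services[rec["business_service"]].append(
            {"name": rec["name"], "address": rec["ip_address"], "type": rec["ci_type"]}
        )
    return dict(services)


# Example records (invented for illustration).
records = [
    {"name": "web-01", "ip_address": "10.0.0.11", "ci_type": "server",
     "business_service": "online-store", "status": "operational"},
    {"name": "db-01", "ip_address": "10.0.0.21", "ci_type": "database",
     "business_service": "online-store", "status": "operational"},
    {"name": "old-07", "ip_address": "10.0.0.99", "ci_type": "server",
     "business_service": "online-store", "status": "retired"},
]
services = group_by_service(records)
print(services)
```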
Discovery and auto-monitoring. Sophisticated monitoring solutions use an increasing range of methods, including direct access to hosts via SSH and indirect access via configuration repositories like Active Directory and services like Windows Discovery, to extract facts from existing infrastructure and speed up monitoring configuration by operators. Leading-edge products are now moving towards automating the process completely: creating comprehensive maps of infrastructure, apps, and complete business services, and monitoring them without any manual intervention or direction.
Alert processing, notification, escalation, integration. Alerting is, of course, a powerful form of automation. It entails decision-making, which may be simple (e.g., some metric has surpassed a given threshold) or significantly more complex (e.g., several metrics, from separate systems, have entered states predictive of a particular kind of known failure for a critical business service). It involves sophisticated assignment and escalation based on issue, team rotas, time/date and other variables. It demands outbound integration with communications methods such as email or with multi-mode notification platforms such as PagerDuty, or more sophisticated integration with issue-management (e.g., Jira) or operations workflow management (e.g., ServiceNow) tools, as well as collaboration (e.g., Slack) and other solutions. All this automation power works together to get the right alert to the right person at the right time while avoiding over-alerting and fatigue, which smooths operations and helps teams avoid downtime and meet SLO commitments.
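The decision-making described above can be sketched in a few lines of Python. This is an illustrative toy, not how any particular product works: the metric names, the rota structure, the day/night shift boundaries, and the 15-minute escalation window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float

    def breached(self, metrics):
        # A missing metric is treated as "not breached" in this sketch.
        return metrics.get(self.metric, float("-inf")) > self.limit


def simple_alert(rule, metrics):
    """Simple case: one metric has surpassed its threshold."""
    return rule.breached(metrics)


def composite_alert(rules, metrics):
    """Complex case: several metrics, possibly from separate systems, are all breached."""
    return bool(rules) and all(r.breached(metrics) for r in rules)


def pick_recipient(rota, hour, minutes_unacknowledged, escalate_after=15):
    """Choose who to notify from the time of day and how long the alert has waited."""
    shift = "day" if 8 <= hour < 20 else "night"
    level = "secondary" if minutes_unacknowledged >= escalate_after else "primary"
    return rota[shift][level]


# Example data (invented for illustration).
metrics = {"db.replication_lag_s": 42.0, "web.error_rate": 0.07, "web.cpu": 0.55}
cpu_rule = Threshold("web.cpu", 0.9)
failure_signature = [Threshold("db.replication_lag_s", 30), Threshold("web.error_rate", 0.05)]
rota = {
    "day": {"primary": "alice", "secondary": "bob"},
    "night": {"primary": "carol", "secondary": "dave"},
}

print(simple_alert(cpu_rule, metrics))              # False: CPU is under its limit
print(composite_alert(failure_signature, metrics))  # True: both conditions hold
print(pick_recipient(rota, hour=3, minutes_unacknowledged=20))  # dave
```

Note that the composite rule fires even though no single CPU threshold has been crossed. That is the value of combining metrics from separate systems: the pattern, rather than any one reading, predicts the failure.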
Proactive issue mitigation. Finally, sophisticated monitoring solutions now provide the ability to execute scripts on hosts, or to trigger centralised automation (e.g., Ansible) to perform tasks based on monitored conditions: from rebooting a failed server to scaling up an infrastructure cluster. Over the next decade, developments in machine learning will gradually improve the ability of monitoring systems to deduce the abstract structure and function of business services, monitor them automatically, predict their failure modes, repair them and optimise their performance, either autonomously or by optimal allocation of operator resources to tasks.
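A condition-to-action mapping of this kind might look like the Python sketch below. The condition names and actions are hypothetical, and the actions only return a description of what they would do; in practice each would run a script on the host or trigger a centralised automation job. The retry cap is an assumption added here so that a remediation which keeps failing is handed to a human rather than repeated indefinitely.

```python
def reboot_host(event):
    # Placeholder: a real action would run a script or trigger an automation job.
    return f"reboot {event['host']}"


def scale_out(event):
    # Placeholder: a real action would ask the cluster manager for more capacity.
    return f"add 1 node to {event['cluster']}"


# Hypothetical mapping from monitored condition to remediation action.
PLAYBOOK = {
    "host_unreachable": reboot_host,
    "cluster_cpu_high": scale_out,
}


def mitigate(event, attempts, max_attempts=2):
    """Return the action to take for an event, or hand over to an operator.

    `attempts` counts automated fixes already tried per (condition, target).
    """
    action = PLAYBOOK.get(event["condition"])
    key = (event["condition"], event.get("host") or event.get("cluster"))
    if action is None or attempts.get(key, 0) >= max_attempts:
        return "escalate to operator"
    attempts[key] = attempts.get(key, 0) + 1
    return action(event)


# Example: the same host fails three times in a row.
attempts = {}
event = {"condition": "host_unreachable", "host": "web-01"}
results = [mitigate(event, attempts) for _ in range(3)]
print(results)
```

The third failure is escalated rather than retried, which keeps automated remediation from masking a problem that needs human attention.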
By John Jainschigg, content strategy lead at Opsview