IT monitoring: Don’t monitor yourself into a madhouse

In our last article in this series, we discussed how IT monitoring provides visibility into the health, performance, and resource consumption of infrastructure, key applications (e.g., databases), and business services. But visibility needs disciplined application and careful filtering to make operations more efficient. Too much decontextualised data sows confusion and stress, increases costs, and can kill IT scalability and stunt business growth.

Imagine you’re an IT operations generalist. You have broad technical expertise about your organisation’s large IT estate. You deeply understand the business value of critical services and are held accountable for maintaining their availability. And now, for some reason, you’re getting text messages from your monitoring system — saying that a metric called MAX_CONNECTIONS has gone to critical on a particular MySQL database cluster.

What does that even mean? As an ops generalist, you probably have no idea. Ask an application architect or MySQL specialist, however, and they’ll give you a nuanced answer. In MySQL, max_connections is the system variable that caps simultaneous client connections (its default of 151 is low for busy workloads), and the related Max_used_connections status variable records the peak number of connections in use since the last server restart. A MAX_CONNECTIONS alert therefore typically records the fact that, at some point since the last restart, connections spiked past a threshold tied to that limit. It conveys important information, but only to people who can use it in context: it should prompt examination of upstream workloads and their traffic, of why and how applications create and relinquish connections over time, and of options for connection pooling and other variables, followed by a decision on whether (and how) the database or apps need tuning, scaling, or other TLC.
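To make the specialist's reading concrete, here is a minimal sketch of how such a check might be evaluated. It assumes the check compares MySQL's Max_used_connections status variable (the peak since the last restart) with the max_connections system variable. The 80% and 95% thresholds and the sample values are illustrative assumptions, not taken from any particular monitoring pack.

```python
def connection_utilisation(variables, status):
    """Return peak connection usage as a fraction of the configured limit."""
    limit = int(variables["max_connections"])
    peak = int(status["Max_used_connections"])
    return peak / limit


def classify(utilisation, warning=0.80, critical=0.95):
    """Map a utilisation ratio to a check state (thresholds are assumptions)."""
    if utilisation >= critical:
        return "CRITICAL"
    if utilisation >= warning:
        return "WARNING"
    return "OK"


# Hard-coded sample values. In practice they would come from
# SHOW GLOBAL VARIABLES LIKE 'max_connections' and
# SHOW GLOBAL STATUS LIKE 'Max_used_connections'.
variables = {"max_connections": "151"}
status = {"Max_used_connections": "149"}

ratio = connection_utilisation(variables, status)
print(f"peak usage {ratio:.0%} of limit -> {classify(ratio)}")
```

Note that the result says nothing about why connections peaked or whether the limit is simply set too low, which is exactly the context a specialist has to supply.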

In short: this is an example of a metric that IT ops generalists should probably not be alerted on (at least not directly), because it’s a bad match for their skills, role, goals, and accountabilities. Sending this alert to an ops generalist will cause either confusion or concern over an issue that they cannot address. This can create a middleman scenario: a fire drill in which the generalist calls for help from a database specialist, who in turn (unless they are familiar with the connected application) might still be unable to provide definitive answers. In this case, a real root-cause analysis is only likely to come from the specific DevOps personnel who architected, deployed, configured, stress-tested, and now manage the app and its database instance in production, and who understand how app and database are supposed to interact.

Right Info, Right Time, Right People

The lesson: visibility alone isn’t enough for efficient IT monitoring — especially in larger, more complex enterprises. To be genuinely useful, enterprise monitoring solutions need to collect raw data, convert it to actionable insight (information), and deliver a subset of that information, filtered for relevance and utility, to the right people. Key to success is accomplishing this through appropriate communications channels and within process envelopes that facilitate proactive maintenance and drive incident responses that are both effective and proportionate.
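The idea of delivering a filtered subset of information to the right people can be sketched as a simple routing table. This is only an illustration: the role names, check names, and the `ROUTES` mapping are hypothetical, and a real platform would express the same idea through its own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    check: str     # name of the service check that fired
    service: str   # component or business service affected
    severity: str  # "warning" or "critical"


# Hypothetical mapping from check name to the roles that can act on it.
ROUTES = {
    "MAX_CONNECTIONS": {"app-devops"},
    "SERVICE_AVAILABILITY": {"ops-generalist", "app-devops"},
}

# Checks with no explicit route fall back to a triage role (an assumption).
DEFAULT_ROLES = {"ops-triage"}


def recipients(alert):
    """Return the sorted list of roles that should receive this alert."""
    return sorted(ROUTES.get(alert.check, DEFAULT_ROLES))


alert = Alert(check="MAX_CONNECTIONS", service="orders-db", severity="critical")
print(recipients(alert))
```

With this table, the MAX_CONNECTIONS alert from the earlier example goes only to the team that owns the application and its database, while the ops generalist hears about service availability instead.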

The monitoring platform must then provide DevOps, IT operators, teams, managers, and business leaders with additional tools that let them coordinate and fix problems. These include solutions for collaboration, inquiry, root-cause determination, documentation of fixes, cost analysis, resolving issues with impacted customers, and post-mortem analysis and process optimisation.

Doing it Right: Integration Touchpoints

Enterprise IT monitoring solutions anticipate and enable ops/business process integration for each relevant persona. They serve the needs of larger staffs and diverse IT and business specialisations (both within the enterprise and possibly within customer organisations as well) and link with the appropriate, domain-specific IT operations and collaboration tools. Here are some of the types of integration involved:

1. Plugins and monitoring packs: Top monitoring providers work hard to maintain a delicate balance between ease of use and the ability to customise. To help IT staff with diverse skills start monitoring quickly, solution makers often combine plugins (the software needed to interface with device-specific or software-specific sources like host servers, operating systems, and databases) with preconfigured sets of recommended service checks, post-processing logic, alerting thresholds, and other configuration information.

By installing such an integration package (Opsview’s Opspacks are an example), operators can monitor nearly any device or application without needing domain-specific knowledge: install the package, aim it at the technology, and monitoring just works. One potential caveat of this approach, however, is that pre-packaged metrics, as best-practice collections, may include too much information to suit the specific needs of a given persona. A pre-built monitoring pack for MySQL databases, for example, might include instructions to alert on a variable like MAX_CONNECTIONS: useful for specialists, but not for generalists. For this reason and others, monitoring software should also provide features enabling highly granular customisation of prebuilt monitoring templates. Savvy IT operators and domain specialists can use these to create subsets and aggregates of relevant service checks in versions that serve different roles’ needs for insight and notification.

2. Notification profiles, preferences, on-call lists, escalation and contingency logic, and other platform-side alerting management features: Enterprise monitoring platforms let operators create notification profiles for applications, teams, roles, and individuals. They can precisely manage which alerts each role receives, ensure seamless delivery of alerts to role representatives currently on call, customise delivery of alerts in various channels (e.g., email, text message, Slack or other messaging solution, or via an external notification or broader-based ops management platform), drive effective escalation if specific alerts aren’t resolved in a timely fashion, and document transmission and acknowledgment of alerts for later audit, analysis, and verification. Within the bounds of agreed-on policy, these systems also enable individuals to further tweak notification behaviour, e.g., by temporarily suppressing repeated alerts on a known condition once one such alert has been acknowledged. In short: get the right alert to the right person at the right time, without bombarding them with unnecessary detail or waking them at 3 AM over something that isn’t business critical.

3. Integrations with enterprise ops management applications: Top enterprise monitoring solutions provide ease of integration with popular operations management suites (e.g., ServiceNow), notification providers (e.g., PagerDuty), ticketing and incident platforms (e.g., JIRA), collaboration frameworks (e.g., Slack), and other tooling. These integrations let the monitoring platform initiate operations and maintenance workflows in response to conditions. The goal is to reduce time to response, save money, and hold to “incidents per shift” goals by putting operations on a highly proactive footing: pushing required maintenance and other non-critical conditions into a normal process, instead of alerting on them. The resultant stress reduction and business cost savings are well worth the planning and integration.

4. Integrations with collaboration tools: Once processes and workflows have been initiated, tickets created, etc., enterprise monitoring platforms may provide extensive and sophisticated integration with API-equipped team collaboration tools like Slack. The monitoring system can use these integrations to spin up and label multimedia group chats around specific issues, connect with pre-built chats for teams, and inject relevant metrics information into shared communications channels. Team communications in these channels become part of a real-time audit trail for incident response, letting those who join later get up to speed immediately and facilitating escalation.

5. Custom dashboards, Business Service Monitoring: These are critical to providing specific roles with exactly the information they need to be effective. Top monitoring solutions let you create situational and/or role-based dashboards that aggregate important metrics, provide simple, easily grasped visualisation of key performance indicators, and simplify drill-down to detailed metrics to help specialists determine root causes. Dashboards can also help fulfil IT organisations’ need to provide KPIs to technical and business leaders in consumable forms. Business Service Monitoring (BSM) enables aggregation of metrics from all components and business logic that contribute to a specific business service. It imposes additional logic to evaluate the availability and health of that business service, based on the combined states of all the application services and components that support it. This kind of dynamic, business-relevant view can be critical for ops generalists, whose main concern is whether the actual end-user-facing services (as opposed to just the infrastructure) are available and healthy.

6. External analytics, visualisation, CMDB and other tools: Enterprise monitoring platforms provide feature-rich APIs that enable extraction of collected metrics (e.g., time-series data) by analytics and visualisation platforms (e.g., Grafana, Splunk), letting specialists integrate and apply these tools holistically. Enterprise monitoring also exchanges data with Configuration Management Databases (CMDBs): consuming information to get target systems monitored more quickly, and providing information to help keep CMDB and monitoring configurations properly aligned.

7. Reports: Comprehensive reporting capabilities let monitoring feed customised information to different roles for operations, budgeting, and other management purposes. Reports can be audited over the long term and help document compliance.
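Of the integrations listed above, Business Service Monitoring is the one whose logic is easiest to show in a few lines. The sketch below rolls component states up into a single service state. It assumes three states (OK, WARNING, CRITICAL), treats a redundant cluster as degraded rather than down while enough members remain healthy, and takes the worst component state as the service state. The service, component names, and `min_healthy` values are hypothetical.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def worst(states):
    """Return the most severe state from a non-empty collection of states."""
    return max(states, key=SEVERITY.__getitem__)


def cluster_state(member_states, min_healthy):
    """Roll up a redundant cluster: it stays usable while enough members are OK."""
    healthy = sum(1 for state in member_states if state == "OK")
    if healthy >= min_healthy:
        return "OK" if healthy == len(member_states) else "WARNING"
    return "CRITICAL"


def service_state(component_states):
    """A business service is only as healthy as its least healthy component."""
    return worst(component_states.values())


# Hypothetical "checkout" service built from two clusters and one dependency.
checkout = {
    "web-tier": cluster_state(["OK", "OK", "CRITICAL"], min_healthy=2),
    "database": cluster_state(["OK", "OK"], min_healthy=1),
    "payment-gateway": "OK",
}
print(service_state(checkout))
```

Here a single failed web node degrades the service to WARNING without declaring it down, which is the kind of business-level answer an ops generalist needs, while the per-node detail stays available for specialists.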

Trust your Platform (Then Iterate and Optimise)

For larger organisations and more complex IT estates, achieving high operational efficiency with monitoring demands ongoing work to gradually build and optimise the customisations and integrations that support team efficiency and productivity. The best practice is to implement effective monitoring using enterprise-supported, domain-specific monitoring templates, then gradually customise as appropriate for your IT estate. It makes good sense to begin by iterating over notifications, since these will impact personnel and workflow most directly: working to reduce unneeded alerts and deal with more conditions proactively, within standardised process workflows.
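As a starting point for that iteration, the notification policy itself can be written down as a small decision function. This is a sketch under stated assumptions, not any vendor's logic: the three outcomes ("page", "ticket", "suppress"), the `business_critical` flag, and the 22:00 to 07:00 quiet window are all hypothetical choices.

```python
from datetime import datetime, time


def decide(severity, business_critical, acknowledged, now,
           quiet_start=time(22, 0), quiet_end=time(7, 0)):
    """Return 'suppress', 'ticket', or 'page' for one alert occurrence.

    Assumes the quiet window wraps past midnight (quiet_start > quiet_end).
    """
    if acknowledged:
        return "suppress"  # known condition, someone is already handling it
    if severity != "critical":
        return "ticket"    # maintenance work goes into the normal process
    in_quiet_hours = now.time() >= quiet_start or now.time() < quiet_end
    if in_quiet_hours and not business_critical:
        return "ticket"    # can wait for the next working shift
    return "page"


three_am = datetime(2024, 1, 1, 3, 0)
print(decide("critical", business_critical=False, acknowledged=False, now=three_am))
print(decide("critical", business_critical=True, acknowledged=False, now=three_am))
```

Writing the policy this explicitly makes each tuning step easy to review: reducing unneeded alerts usually means moving a condition from the "page" branch to the "ticket" branch.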

Sourced by John Jainschigg, content strategy lead at Opsview

Andrew Ross

As a reporter with Information Age, Andrew Ross writes articles for technology leaders, helping them manage business-critical issues both for today and in the future.
