Keeping key enterprise applications up and running well is an absolute requirement for modern business. As estimated by Gartner, IDC and others, the cost of IT downtime averages out to around £4,200 per minute. A simple infrastructure failure might cost around £75,000, while the failure of a critical, public-facing application costs more like £378,000 to £755,000 per hour. When failures impact large-scale global logistics and cause widespread inconvenience to customers, as in last May’s British Airways operations systems failure, costs can quickly become staggering. BA estimated losing $102.19 million (£77.08 million) in hard costs, including airfare refunds to stranded passengers, plus incalculable damage to its reputation. BA’s parent company, IAG, subsequently lost $224 million (£170 million) in value, based on its then-current stock valuation.
Preventing such disasters, or intervening effectively and rapidly when they occur, means giving developers and operations staff (DevOps) visibility into IT infrastructure, networks, and applications. Modern IT monitoring solutions provide this visibility in many ways, including:
- Issue: Ingest and discovery: Manually configuring monitoring for hundreds or thousands of hosts is a time-consuming and potentially error-prone process. Operators sometimes lack a complete picture of all the hosts, apps, and business services in their purview. Solution: IT monitoring systems are increasingly able to automate or infer information from discovery, configuration management databases (CMDBs), deployment tools, cloud APIs and other sources. This helps operators identify and label entities, visualise dependencies, and configure monitoring, quickly and accurately, throughout the hybrid (i.e., on-premises and cloud-based) data centre. Discovery may be made using WMI (Windows Management Instrumentation), SNMP network discovery, and other technologies.
- Issue: Summary status display: Operators need ‘single panes of glass’ that aggregate lots of status information on monitored systems, letting them spot issues quickly and drill down to determine root causes. Solution: Mature IT monitoring platforms provide collapsible, outline-style summary displays or scheduled reports that let operators hide or reveal meaningful subsets of information about monitored hosts and systems. Colour-coded popups draw attention to issues. Clickable labels offer quick access to details of individual service checks, graphs, raw event logs and troubleshooting tools.
- Issue: Dashboards: Too much monitoring data, too densely aggregated, can be hard to work with. Operators need to be able to visualise critical metrics and status information quickly. Solution: Valuable IT monitoring systems let you create customisable dashboards with graphical widgets isolating specific hosts, metrics, and KPIs. Read-only access to prepared panels can be distributed to keep key stakeholders aware of application status, SLA compliance, etc.
- Issue: Business service monitoring: IT and DevOps need to be able to visualise the status of all infrastructure elements and systems involved in providing key business services. Solution: Business Service Monitoring (BSM) is an enhanced dashboarding capability that lets operators create interactive views of complex application ‘stacks’ (e.g., the load balancers, web/application servers, database clusters, network gear and other elements supporting a typical, scaled-out, highly-available, tiered application). It’s ideal for keeping responsible developers, product managers, and others apprised of the status of applications they own and empowering them to help effectively if system status begins to degrade.
- Issue: Reporting: Real-time status visualisations don’t tell the whole story. Proactive management and planning also mean being able to view system-wide status, resource consumption trends and other information. Solution: Comprehensive reporting lets operators track compliance. It offers insight into service level agreements and objectives and into scheduled maintenance and upgrades, and it helps operators keep track of costs and budget for scale-out, among many other uses.
- Issue: Alerting: Severe problems may require immediate operator attention, around the clock. Solution: Almost all IT monitoring solutions offer alerting via pager, email, and text message. Many also integrate directly with on-call management systems and services. The ability to route the right alert to the right person at the right time is vitally important. Enterprise monitoring platforms either have this capability or integrate with proven solutions that provide it.
- Issue: Mobility: Tying operators down to Network Operations Centres (NOCs) and desks is bad for morale and productivity. Solution: The best IT monitoring solutions offer useful mobile applications that let operators view status, key business services and other dashboards, and respond to alerts and notifications from anywhere.
- Issue: Notifications and outbound integration: Once status information is aggregated from monitored systems, how can issues be originated, tracked, assigned, collaborated over, and resolved? Solution: Top monitoring platforms offer an increasingly broad set of integrations with popular enterprise and SMB issue-tracking, service desk, and IT process management solutions. Look for integration with Slack, ServiceNow, Puppet, Ansible, etc., in an enterprise monitoring platform. Ask about extensibility: “can the platform easily extend its capabilities to integrate with future solutions?”
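To make the alerting point above concrete, here is a minimal Python sketch of routing "the right alert to the right person at the right time". It is an illustration only, not how any particular monitoring platform works: the `Alert` and `Route` types, the contact names, channels and the fallback recipient are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical severity scale, for illustration only.
SEVERITY_RANK = {"warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str      # "warning" or "critical"
    raised_at: time    # time of day the alert fired


@dataclass(frozen=True)
class Route:
    service: str
    min_severity: str  # lowest severity this route handles
    start: time        # on-call window start (inclusive)
    end: time          # on-call window end (exclusive)
    contact: str
    channel: str


def in_window(t: time, start: time, end: time) -> bool:
    """True if t falls in [start, end), allowing windows that span midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def route_alert(alert: Alert, routes, fallback=("noc-team", "email")):
    """Return (contact, channel) for the first matching route, else the fallback."""
    for r in routes:
        if (
            r.service == alert.service
            and SEVERITY_RANK[alert.severity] >= SEVERITY_RANK[r.min_severity]
            and in_window(alert.raised_at, r.start, r.end)
        ):
            return (r.contact, r.channel)
    return fallback


# Example: day and night on-call rotations for an assumed "checkout" service.
routes = [
    Route("checkout", "critical", time(9), time(17), "alice", "sms"),
    Route("checkout", "critical", time(17), time(9), "bob", "pager"),
]
night_alert = Alert("checkout", "critical", time(3, 0))
print(route_alert(night_alert, routes))
```

In this sketch a critical alert at 03:00 lands with the overnight contact, while a warning falls through to the shared fallback rather than paging anyone.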
Doing monitoring right means not seeking to visualise every possible signal. Ideally, monitoring makes visible a minimum subset of signals giving maximum actionable insight.
Each metric collected comes with associated hard and soft costs. As IT estates grow in size and complexity, overheads associated with gathering, processing, storing, analysing, displaying, querying, and reporting on metrics all increase. This eventually impacts application, network, and/or monitoring system performance.
Excess visibility also imposes a significant cognitive load on operators. Too many complex, rarely-used, or operationally-irrelevant metrics can camouflage essential signals (alerts), slowing effective incident response. Lack of selectivity about which signals to make visible, and how to evaluate and call attention to them, can quickly lead to excessive alerting. That promotes alert fatigue and burnout, and eventually causes real incidents to be ignored through a sort of “crying wolf” syndrome.
Operator time consumed investigating non-critical incidents is time lost to more important and impactful work, such as building automation. Put simply: getting visibility wrong costs real money, and can hamstring innovation.
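One common way to reduce excessive alerting is to notify only after a check has failed several times in a row, and only once per incident. The Python sketch below illustrates that idea under stated assumptions: the `AlertDamper` class, its `required_failures` threshold and the check identifier `"db"` are hypothetical, not taken from any specific product.

```python
from collections import defaultdict


class AlertDamper:
    """Notify only after N consecutive failures, and only once per incident."""

    def __init__(self, required_failures: int = 3):
        self.required_failures = required_failures
        self._streak = defaultdict(int)  # consecutive failures per check
        self._alerting = set()           # checks with an open incident

    def observe(self, check_id: str, ok: bool):
        """Record one check result; return "alert", "recovery" or None."""
        if ok:
            self._streak[check_id] = 0
            if check_id in self._alerting:
                self._alerting.discard(check_id)
                return "recovery"
            return None
        self._streak[check_id] += 1
        if (
            self._streak[check_id] >= self.required_failures
            and check_id not in self._alerting
        ):
            self._alerting.add(check_id)
            return "alert"
        return None


damper = AlertDamper(required_failures=3)
results = [False, True, False, False, False, False, True]
events = [damper.observe("db", ok) for ok in results]
print(events)
```

Here a single transient failure produces no notification; the operator hears about the check only when it has failed three times running, and once more when it recovers.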
Enormous knowledge and experience are needed to identify the necessary-and-sufficient signals to optimally monitor a given type of infrastructure, application, or business service. Without the proper tooling, understaffed, time-stretched IT staff are often hard-pressed to provide this level of assurance.
Top-flight IT monitoring solutions bridge the knowledge gap by packaging up optimal metrics sets in modules or plugins, enabling best-practice-compliant monitoring to be set up quickly and with confidence. For example, using a plugin, an operator can immediately implement the 20 to 40 service checks needed to monitor the health, performance, and resource consumption of a MySQL database.
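As a rough idea of what one such service check might look like, the Python sketch below follows the widely used Nagios plugin convention of exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is an assumption-laden illustration, not an actual plugin: the function names and the 80%/95% thresholds are hypothetical, and the values of MySQL's `Threads_connected` and `max_connections` are passed in rather than queried from a live server.

```python
# Nagios-style plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: float, warn: float, crit: float) -> int:
    """Map a metric value to a status code using 'higher is worse' thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_connection_usage(threads_connected: int, max_connections: int,
                           warn_pct: float = 80.0, crit_pct: float = 95.0):
    """Return (status_code, message) for MySQL connection usage.

    The thresholds are illustrative defaults, not recommended values.
    """
    if max_connections <= 0:
        return UNKNOWN, "UNKNOWN - max_connections not available"
    pct = 100.0 * threads_connected / max_connections
    code = evaluate(pct, warn_pct, crit_pct)
    # Text after the pipe is performance data: label=value;warn;crit
    message = (f"{LABELS[code]} - connection usage {pct:.1f}% "
               f"| usage={pct:.1f}%;{warn_pct};{crit_pct}")
    return code, message


code, message = check_connection_usage(threads_connected=85, max_connections=100)
print(code, message)
```

A plugin bundles dozens of checks of this shape, each with thresholds chosen by someone who knows the technology, which is where the time saving comes from.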
Developers use less mature application performance monitoring (APM) systems and open source toolchains to instrument software under construction and visualise application state in test and production environments. APM solutions are usually not very helpful for operators who know little about application specifics, and whose job is to keep numerous complex applications running smoothly. Unlike IT operations monitoring, APM systems are diverse and hew to a wide range of standards. For example, there are dozens of open source servers, exporters, scrapers, and other tools designed to extract metrics from HAProxy (a popular open source proxy server/load balancer) for consumption by Prometheus (a popular open source monitoring system and time-series database).
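To give a sense of what such an exporter does, the Python sketch below converts HAProxy-style CSV statistics into lines resembling the Prometheus text exposition format. It is a simplified illustration under stated assumptions: the sample input is trimmed to four columns (`pxname`, `svname`, `scur`, `stot`), and the metric names `haproxy_scur` and `haproxy_stot` are invented for this example rather than taken from any real exporter.

```python
import csv
import io


def haproxy_csv_to_prometheus(csv_text: str, fields=("scur", "stot")):
    """Turn HAProxy stats CSV into Prometheus-style metric lines.

    Metric names here are hypothetical; real exporters use their own naming.
    """
    text = csv_text.lstrip()
    if text.startswith("# "):
        text = text[2:]  # HAProxy prefixes the CSV header row with "# "
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row in reader:
        for field in fields:
            raw = row.get(field)
            if raw in ("", None):
                continue  # HAProxy leaves non-applicable fields empty
            labels = f'proxy="{row["pxname"]}",server="{row["svname"]}"'
            lines.append(f"haproxy_{field}{{{labels}}} {float(raw)}")
    return lines


# Trimmed, made-up sample; real output has many more columns.
SAMPLE = "# pxname,svname,scur,stot\nweb,FRONTEND,3,120\nweb,srv1,,\n"
metrics = haproxy_csv_to_prometheus(SAMPLE)
for line in metrics:
    print(line)
```

Every team that writes its own variant of this translation makes slightly different naming and labelling choices, which is one reason the APM tooling landscape is so varied.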
Monitoring and visibility deal with “known unknowns”: the well-understood performance characteristics/indicators and known hard failure modes of applications and components. Meanwhile, the term observability is increasingly used in discussing the superset of visibility that includes “unknown unknowns.” In particular, this refers to challenges in understanding and managing the behaviour of dynamic, self-scaling, resilient, distributed applications. Basically, visibility tells you which of a set of predictable issues might be occurring (enabling remediation), whereas observability gives you the insight to figure out what’s going on when the issue is not one you predicted (enabling further inquiry).
Enterprise monitoring vendors are working hard to provide plugins and modules that make the inner workings of container orchestration and related systems more visible. At the same time, top players are evaluating a range of strategies for extracting a few essential signals from distributed and containerised applications, making them observable. We’ll discuss some of these methods in upcoming columns.
Sourced by John Jainschigg, content strategy lead at Opsview