The lies of performance metrics

When deploying performance monitoring, organisations expect fewer trouble tickets, less downtime, better application performance and capex purchases optimised to actual need. So why aren’t they seeing all those benefits? In most cases, they are finding that the promise is not a reality.

It is an industry misconception that all IT performance metrics are created equal. In fact, performance metrics are not what we think they are. Poor metrics often literally point in the wrong direction.

Other easy-to-gather data might not be performance data at all, but utilisation metrics or error counters, which are often skewed dramatically and not representative of what is actually going on within the infrastructure.

The least invasive metrics have often been seen as the least risky as no one wanted to risk a negative impact on their production workloads.

Let’s put this in perspective: in a five-minute period, a single 16Gb connection to a storage array will push as many transactions as there are people in all of Europe. And oftentimes the traditional metrics associated with that link will be averaged into a single number for that period.

That is tantamount to taking a rolled-up average of the entire population of Europe as a single number and expecting that figure to be meaningful.

For example, you might want to pull out the statistics of gender or religious diversity. That number could tell you that on average a person in Europe is female, but that is totally misleading and completely useless as a measure of gender across Europe. In the end, the resulting averages don’t paint very clear pictures of the true information they are attempting to represent.
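To make the point about rolled-up averages concrete, here is a minimal sketch in Python. The latency values are invented for illustration, and the `summarise` helper and its nearest-rank percentile are assumptions of this example rather than anything a particular monitoring product does. It reports the mean of one interval next to the median, the 99th percentile and the maximum.

```python
import statistics

def summarise(latencies_ms):
    """Return the mean next to median, p99 and max for one interval."""
    ordered = sorted(latencies_ms)
    # Nearest-rank p99: the smallest sample with at least 99% of
    # samples at or below it (integer arithmetic avoids rounding issues).
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }

# Hypothetical five-minute window: 980 fast transactions and 20 very slow ones.
samples = [1.0] * 980 + [500.0] * 20
summary = summarise(samples)
print(summary)
```

In this made-up window the mean lands near 11 ms, a value that no transaction actually experienced: most completed in 1 ms and a few took 500 ms. The median and the p99 together describe both groups, which the single average cannot.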

Unfortunately, enterprises are far too often unable to access more beneficial statistics, and the truth is that some performance metrics are better left out.

Common performance metrics go wrong in a number of ways. Error counter and utilisation metrics don’t really tell you anything about your infrastructure performance.

People use error counters and utilisation metrics because the current economic climate means that IT teams will often settle for these less impactful metrics, as they are cheaper to gather.

More data is not always better – even with more data, organisations still find problems at the layer beneath. They need better data, not just more data.

And once they do have some solid, relevant data, the challenge then becomes how to make use of the information the data represents. For example, alarm thresholds for monitoring systems will vary depending on the user, who often has no idea where to set the threshold. This leads to a lot of false positives, and, ultimately, alarms will often be ignored because users don’t trust them.
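One way to avoid guessing a threshold is to derive it from the workload's own history. The sketch below is only one possible approach and is not taken from the article: it assumes a threshold of the historical median plus `k` times the median absolute deviation, where `baseline_threshold`, `k` and the sample values are all hypothetical.

```python
import statistics

def baseline_threshold(history_ms, k=3.0):
    """Suggest an alarm threshold from past observations.

    Assumption of this sketch: median + k * median absolute deviation.
    A robust spread measure is used so that a few past outliers do not
    inflate the threshold.
    """
    med = statistics.median(history_ms)
    mad = statistics.median([abs(x - med) for x in history_ms])
    return med + k * mad

# Hypothetical response times (ms) observed during normal operation.
history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
threshold = baseline_threshold(history)

# Only readings well outside the normal range raise an alarm.
new_readings = [10, 11, 30]
alarms = [x for x in new_readings if x > threshold]
print(threshold, alarms)
```

Because the threshold follows the observed behaviour of each workload, it does not depend on an individual user's guess, which is the source of many of the false positives described above.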

Metrics gone bad

When it comes to performance metrics, the most common question asked by CIOs is, ‘Why is it that my applications and infrastructure teams are arguing about whose metrics are correct?’

They are baffled at how one team can declare the situation a complete disaster while the other says everything is normal. Even with the metrics we have today, there seems to be a problem in these debates. These situations result in a lot of finger pointing, guesswork and not enough reliable, concrete data.

So how do they move from the scenario where everybody is pointing fingers, to having data that is shared across teams, allowing IT staff to improve the performance of the overall infrastructure?

Good data alone is not enough. To be successful, organisations need to know what data are relevant to infrastructure performance, which is why analytics is so important.

They need to be able to derive actionable insights from the data, otherwise nothing has been solved. The key is to turn the data into answers, which is where the real value lies. Doing this will drive higher performance. Ultimately, the data need to be tied to the service being provided.

IT staff want to be able to gather vital data without adding load, impacting or perturbing the system they are monitoring. Turning the monitoring for a system on full blast changes the behaviour of that system and can hide, or worse, exacerbate the initial problem.

The data need to be correlated to transactions. With data correlation, organisations should aim to get an end-to-end view from the applications all the way back through into the storage.
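As a rough illustration of correlating data to transactions, the sketch below groups per-tier timings by a shared transaction identifier. The record layout `(txn_id, tier, elapsed_ms)`, the tier names and the helper functions are hypothetical, and the sketch assumes that every tier can tag its measurements with the same identifier.

```python
from collections import defaultdict

def correlate(records):
    """Group per-tier timings by transaction id: {txn_id: {tier: ms}}."""
    by_txn = defaultdict(dict)
    for txn_id, tier, elapsed_ms in records:
        by_txn[txn_id][tier] = elapsed_ms
    return dict(by_txn)

def slowest_tier(tiers):
    """Return the tier where a transaction spent the most time."""
    return max(tiers, key=tiers.get)

# Hypothetical measurements gathered separately by each team.
records = [
    ("t1", "application", 2.0), ("t1", "network", 0.5), ("t1", "storage", 40.0),
    ("t2", "application", 30.0), ("t2", "network", 0.4), ("t2", "storage", 1.2),
]

view = correlate(records)
culprits = {txn: slowest_tier(tiers) for txn, tiers in view.items()}
print(culprits)
```

With one shared view per transaction, the application and infrastructure teams are looking at the same numbers rather than arguing over separate ones.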

Then there is the importance of historical data retention. Organisations need the ability to go through historical data to make logical comparisons. This allows them to answer questions such as, ‘How was this workload performing last month, last quarter or last year?’ or ‘How did this application behave on my previous infrastructure from a different vendor?’
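A historical comparison of this kind can be sketched as follows, assuming raw samples have been retained for both periods. The `percentile` and `compare_periods` helpers, the choice of the 90th percentile and the sample values are all assumptions made for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def compare_periods(before, after, pct=90):
    """Compare the same percentile of one metric across two periods."""
    old, new = percentile(before, pct), percentile(after, pct)
    return {
        "before": old,
        "after": new,
        "change_pct": 100.0 * (new - old) / old,
    }

# Hypothetical response times (ms) for the same workload.
last_month = [4, 5, 5, 6, 5, 4, 5, 6, 5, 20]
this_month = [4, 5, 5, 6, 5, 4, 5, 6, 30, 40]

report = compare_periods(last_month, this_month)
print(report)
```

The comparison is only possible if the retained history is granular enough to recompute the same statistic for both periods. If only a rolled-up average had been kept for last month, the percentile could not be reconstructed.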

Better performance metrics

With an open systems ecosystem, meaningful data can be pulled from the protocol layer, allowing organisations to obtain vendor-independent information from outside the system. Looking at the distribution of the data, rather than a single average, gives visibility into areas that were previously out of reach.

In the past, when organisations had relatively few components and low data volumes and speeds, obtaining metrics on a simpler level might have sufficed, while anything more granular than averages was unattainable.

But today, with so much complexity in the data centre in the form of virtualisation at every tier, cloud migrations, flash and consolidation, it is more important than ever before to obtain granular detail and value from performance metrics.

Compared to five or ten years ago, the amount of data going across the wire has dramatically increased. And the reality is that organisations are still relying on the same inaccurate rolled-up averages to tell them what’s going on in their infrastructures. But close enough is no longer good enough.

Rather than settling for the easiest metrics, it’s time for enterprises to start using the most valuable metrics, not only to better future-proof their data centres against bottlenecks and downtime, but also to achieve cost optimisation and guarantee infrastructure performance, end-to-end.

Sourced from Barry Cooks, Virtual Instruments

Ben Rossi

