The anatomy of an outage – how software performance problems are affecting the digital world

Businesses well understand the importance of providing seamless customer journeys, but we’ve seen a growing spate of digital service outages and software performance problems in recent months.

While some of these problems have been minor inconveniences, with the likes of online video streaming services or social media sites going down, others have caused far more serious concern. There have been online banking outages that have left customers unable to pay bills on time. Problems with major payments systems have left shoppers unable to use their bank cards at the checkouts. Even the daily commute has been impacted, with rail ticketing website outages leaving people unable to buy a ticket to travel. These problems all seriously disrupt people’s ability to live their day-to-day lives, so they’re becoming a growing concern for businesses and consumers alike. So, if businesses understand the importance of preventing these scenarios so clearly, why are they happening more often?

Converging complexity

The soaring complexity of technology ecosystems is the biggest contributor to the rise in service outages and software performance problems. Modern digital services reside in complex hybrid multi-cloud environments, spanning multiple platforms and technologies. They’re powered by applications running in dynamic microservices and containers, creating constant change. A single web or mobile transaction now crosses an average of 35 different technology systems or components, compared to 22 just five years ago. With digital transactions crossing such a diversity of components in a dynamic technology stack, managing performance effectively has gone beyond human capability. IT teams struggle to maintain visibility into everything that’s happening in their environment, and to quickly find the root cause of any performance problems that arise. It’s moved beyond finding a needle in a haystack, to finding a needle in a thousand haystacks in a hurricane.

Unfortunately, this trend is showing no signs of reversing or even slowing. Digital ecosystems are becoming even more complex, and IT teams are under more pressure than ever to quickly identify and resolve the root cause of any problems that arise, before customers feel any impact. If they fail to do so, the spate of digital performance problems and service outages that we’ve seen recently will only become more pronounced and occur more often. This will become increasingly critical with the advent of driverless cars and connected medical devices, which could wreak major damage if they’re impacted by performance problems.

Overcoming the outage obstacles

There are a number of reasons why it’s become impossible for businesses to manage the complexity of their digital ecosystems manually. Firstly, new technologies, infrastructure and platforms are constantly being layered onto IT stacks, requiring more monitoring tools to provide visibility and enable IT teams to manage performance. However, the digital ecosystems that have arisen around these IT stacks are also highly dynamic. Whilst this creates the agility that businesses need to thrive, it also makes it impossible for humans to stay on top of performance using traditional monitoring tools, which were built for static environments.

On top of this, these traditional monitoring tools are bombarding teams with alerts, most of which are just white noise. But working out what’s white noise and what’s important is time-consuming, and time is something most organisations simply don’t have. Given that it’s impossible for humans to overcome this challenge manually, organisations need to be able to automate as many IT operations processes as possible. They need the ability to automatically detect issues in real time and, most importantly, use AI to pinpoint the root cause with precision. These capabilities can also help organisations onto the path of auto-remediation, so their monitoring system can detect problems and apply fixes to prevent or resolve the issue before it escalates into a full-blown outage. This in turn will take the pressure off IT teams, enabling them to focus on driving innovation rather than spending endless hours in war rooms to determine where a performance problem is stemming from.
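To make the idea of automatic detection a little more concrete, here is a minimal sketch in Python. It is purely illustrative and not how any particular monitoring product works: the function name `find_anomalies`, the window size, the threshold and the sample response times are all assumptions made for this example. It flags a measurement only when it strays far from the recent baseline, which is one simple way of separating a genuine problem from routine noise.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that deviate from the rolling baseline.

    Hypothetical helper for illustration: a sample is flagged when it lies
    more than `threshold` standard deviations from the mean of the
    preceding `window` samples.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline has no spread, so any change is flagged.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up response times in milliseconds, with one obvious spike.
response_times_ms = [100, 102, 98, 101, 99, 103, 100, 450, 101, 99]
print(find_anomalies(response_times_ms))
```

In this made-up data set, only the 450 ms spike is flagged; the small fluctuations around 100 ms are ignored. A real system would need to go much further, correlating signals across many components to find the root cause, but the principle of comparing against a learned baseline rather than alerting on every change is the same.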

No turning back

While moving to the cloud has made businesses far more agile, it’s added exponential complexity to their digital ecosystems. This has had a huge impact on organisations’ ability to successfully monitor performance and rectify any issues quickly and efficiently. We’ve already seen digital performance problems and service outages affecting businesses and their customers more regularly. AI is crucial to combatting the problem. It can make the process of detecting and rectifying software performance problems much faster and more effective. Ultimately, this will enable IT teams to provide more consistent and positive user experiences, relegating the nightmare of major outages and late-night war rooms to the past.

Article written by Michael Allen, VP and EMEA CTO at Dynatrace

Michael Baxter

Michael Baxter is a tech, economic and investment journalist. He has written four books, including iDisrupted and Living in the age of the jerk. He is the editor of and the host of the ESG...