How to mitigate the impacts of an IT outage

Whether employees stick to a 9-5 routine or work more flexible hours, an IT outage can prove costly and severely disruptive to operations across the organisation. According to Gartner, downtime costs companies an average of $5,600 per minute, which works out at more than $300,000 per hour ($5,600 × 60 = $336,000).

An outage not only brings productivity to a halt; it also leaves the affected organisation unable to serve its customers, which can erode their trust.

This article explores how companies can mitigate the impact of an IT outage and, ultimately, avoid such incidents in future.

Plan for backup and disaster recovery

Because data can be lost at any time, it is vital to back it up and to have plans in place to recover it quickly, minimising the impact on operations.

“Since the onset of the pandemic, IT leaders in the UK and US have reported increases in data outages (43%), human error tampering with data (40%), phishing (28%), malware (25%) and ransomware attacks (18%),” said W. Curtis Preston, chief technical evangelist at Druva.

“If businesses are not careful, their lifeline can quickly get destroyed. When an outage occurs, organisations need their servers and data restored as quickly as possible. Companies that are successful in managing unexpected outages have a disaster recovery plan in place that has been pressure tested regularly and is readily available on demand.

“It’s also important to make sure your backup system is physically and technologically isolated from your production system to minimise the risk of impact from a natural event or malicious actor. Cloud-based solutions offer a natural air gap, while still being able to meet recovery point objectives (RPOs) of less than one hour.

“The key to managing outages is to place as many safeguards as you can before the inevitable outage happens. By taking some proactive steps and leveraging the right technology beforehand, companies will lessen the risk posed by emergencies to a great extent.”
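To make a one-hour RPO concrete: the worst-case data loss is the longest gap between successful backups, including the time since the most recent one. The Python sketch below checks a backup history against an RPO. The function names and the sample schedule are hypothetical, and in practice a backup platform would usually report this figure itself.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times: list[datetime], now: datetime) -> timedelta:
    """Longest window of data that could be lost, given successful backup times.

    This is the largest gap between consecutive backups, counting the gap
    from the most recent backup up to `now`.
    """
    if not backup_times:
        raise ValueError("no successful backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """True if no window of potential data loss exceeded the RPO."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Hypothetical schedule: backups at 00:00, 00:30, 01:00 and 02:30; checked at 02:50.
start = datetime(2024, 1, 1, 0, 0)
backups = [start + timedelta(minutes=m) for m in (0, 30, 60, 150)]
now = start + timedelta(minutes=170)

print(worst_case_data_loss(backups, now))               # 1:30:00
print(meets_rpo(backups, now, rpo=timedelta(hours=1)))  # False
```

Here a single missed backup window (01:00 to 02:30) is enough to break a one-hour RPO, even though most backups ran every 30 minutes.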

Visibility, insights, and automation

Without the right visibility and insights, it can be difficult to pinpoint where problems are arising within the network and what is causing them. Eugene Kim, director of product strategy at Cisco AppDynamics, explained how monitoring tools can provide this visibility, along with action automation.

“Without the right tools, teams can struggle to monitor, let alone manage or resolve, the performance of their applications across the IT stack,” said Kim.

“We know that in any given organisation, IT infrastructure is ever-growing in complexity. Today, very few organisations can map their IT stack with any accuracy. Silos within teams or projects, legacy infrastructure or applications, and partial migrations can all muddy the waters, making it hard to solve potential issues or outages.

“To proactively prevent or solve outages quickly, IT teams need visibility, insights, and action automation across their IT operations. With a monitoring tool, IT teams are able to separate the noise from the signal, speeding up resolution. With full-stack observability – from the customer’s device, to the back-end application, or the underlying network and infrastructure – IT teams can turn data monitoring into meaningful, actionable insights in real-time to manage and prevent outages.”
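One simple way a monitoring pipeline separates noise from signal is alert deduplication: repeats of the same failing check are suppressed for a time window, so responders see one alert per problem rather than a flood. The Python sketch below illustrates the idea; the `Alert` fields, the layer names and the five-minute window are illustrative assumptions, not features of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    layer: str        # illustrative layer, e.g. "network", "app", "database"
    check: str        # name of the failing check
    timestamp: float  # seconds since some reference time


def deduplicate(alerts: list[Alert], window_s: float = 300.0) -> list[Alert]:
    """Keep one alert per (layer, check) fingerprint per window.

    An alert is kept only if more than `window_s` seconds have passed since the
    last kept alert with the same fingerprint; otherwise it is dropped as a repeat.
    """
    last_kept: dict[tuple[str, str], float] = {}
    kept: list[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.layer, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous > window_s:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


raw = [
    Alert("network", "packet_loss", 0),
    Alert("network", "packet_loss", 60),
    Alert("app", "http_5xx_rate", 90),
    Alert("network", "packet_loss", 120),
    Alert("network", "packet_loss", 600),
]
for alert in deduplicate(raw):
    print(alert.layer, alert.check, alert.timestamp)
# network packet_loss 0
# app http_5xx_rate 90
# network packet_loss 600
```

Five raw alerts collapse to three: the repeated packet-loss alerts at 60 and 120 seconds are treated as the same ongoing problem.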

Centralised, real-time observability

It’s also a good idea to keep recovery plans in one central location, so that the tools for bouncing back from an IT outage are easy to find when the next incident strikes.

“Falling over in front of the customer is never an option, so we’re seeing increasing focus on bolstering resilience efforts to learn more about these events, dependencies and how to plan for all eventualities more effectively,” said Steve Piggott, head of enterprise resilience at Cutover.

“Where resilience efforts are falling short, there are effective ways for companies to prepare and test better, fail faster, and then recover with more confidence in their processes.

“We’re seeing increased focus on resilience activities to factor for disruption and outages, often stretching across the enterprise, with the right level of automation and real-time observability to enable teams to plan, test, recover and analyse all resilience activities in one central repository.

“This helps factor for future events but also keeps companies accountable, with the audit trail to support decisions and outputs.”
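A central repository with an audit trail can start as simply as an append-only log of every test and recovery activity. The Python sketch below writes each event as one JSON line to a shared file. The file name, fields and activity labels are hypothetical, and a production system would more likely use a database or a dedicated resilience platform with proper access controls.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def record_activity(path: str, activity: str, outcome: str, actor: str) -> dict:
    """Append one resilience activity (test, drill or real recovery) to the log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "activity": activity,
        "outcome": outcome,
        "actor": actor,
    }
    # Append mode only ever adds to the end of the file; it does not by itself
    # stop anyone editing earlier entries, so real audit trails need access controls.
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def read_audit_trail(path: str) -> list[dict]:
    """Return every recorded activity in the order it was written."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


# Hypothetical usage: a shared log file in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "resilience_log.jsonl")
record_activity(log_path, "database failover test", "passed", "ops-team")
record_activity(log_path, "backup restore drill", "failed: restore exceeded target time", "ops-team")
for entry in read_audit_trail(log_path):
    print(entry["timestamp"], entry["activity"], "->", entry["outcome"])
```

Keeping failed drills in the same trail as successful ones is what makes the record useful for accountability.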

Proactive monitoring

When it comes to monitoring the network, it pays to be proactive at all times: continuous visibility helps teams prepare for the worst.

“As organisations hasten their digital transformation – a trend accelerated by the Covid-19 pandemic – the increasing complexity of IT environments means that the likelihood of outages increases,” said Mark Banfield, CRO at LogicMonitor.

“For companies embracing digitalisation, there is no way to completely guarantee against an outage, but through proactive monitoring, the impact can be mitigated.

“Proactive monitoring is key to managing downtime as it allows IT teams to perform preventative maintenance, thereby fixing issues in the system before they result in an outage. This is not a silver bullet, but when problems do occur, comprehensive IT monitoring grants teams visibility into their IT environments, allowing for issues to be remediated faster and the length of outage times to be shortened. It is, after all, the difference between fixing a car engine with lights on or off – and while IT teams are fumbling around with the spanner, the media and upper management are demanding that the car is started immediately.

“Given this immense pressure, it’s no wonder that IT teams ask that, with proactive monitoring, the light be switched on in this scenario.”
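Preventative maintenance often comes down to spotting a trend before it turns into an outage. As an illustration, the Python sketch below fits a least-squares line to disk-usage samples and estimates how long until the disk fills. The sample figures, the 500 GB capacity and the 48-hour warning threshold are assumptions chosen for the example, and real growth is rarely this linear.

```python
from typing import Optional


def hours_until_full(samples: list[tuple[float, float]], capacity: float) -> Optional[float]:
    """Estimate hours from the latest sample until usage reaches `capacity`.

    `samples` are (hour, used) pairs. Fits an ordinary least-squares line
    and extrapolates it; returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples need distinct timestamps")
    cov_tu = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov_tu / var_t  # growth per hour
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    hour_full = (capacity - intercept) / slope  # where the fitted line reaches capacity
    latest = max(t for t, _ in samples)
    return max(0.0, hour_full - latest)


# Hypothetical data: usage in GB sampled once a day on a 500 GB volume.
usage = [(0, 400), (24, 424), (48, 448), (72, 472)]
remaining = hours_until_full(usage, capacity=500)
print(remaining)  # 28.0
if remaining is not None and remaining < 48:
    print("Schedule maintenance: disk forecast to fill within 48 hours")
```

The point is the lead time: a 28-hour warning leaves room to expand the volume or clear space during a planned window instead of during an outage.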

Collaboration with ISPs and cloud providers

Finally, organisations should consider working closely with cloud providers and Internet service providers (ISPs) to plan for when an outage occurs.

Ian Waters, senior director of EMEA marketing at ThousandEyes (part of Cisco), explained: “Businesses need to understand what and where their third-party dependencies are, the relationship between them and in turn where their traffic is flowing.

“This increasingly complex environment requires a new monitoring framework that provides full visibility into the digital stack, including the ecosystem both within and without enterprise ownership. Armed with this real-life insight into internal and external network performance, businesses should collaborate with ISPs and cloud providers to plan for known and unknown events that can trigger outages. What’s more, by mapping out baseline performance, enterprises can uncover potential bottlenecks and vulnerabilities in advance.

“When an outage does occur, it’s important for a business to fully understand the scope and cause of an outage, drill down into the affected interfaces and take action to fix it, while communicating with employees, stakeholders or clients on resolution time. Multi-layered visibility not only provides this required information, but also enables enterprises to look back once the event has happened to learn from data for future issues.”
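Mapping out baseline performance can start with something as plain as per-dependency latency statistics. The Python sketch below builds a mean and standard deviation for each third-party dependency from historical measurements and flags any whose current latency exceeds its mean by more than three standard deviations. The dependency names, the figures and the three-sigma threshold are illustrative assumptions, not how any particular monitoring product works.

```python
import statistics


def build_baseline(history: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Map each dependency to (mean, sample standard deviation) of past latency in ms.

    statistics.stdev needs at least two measurements per dependency.
    """
    return {name: (statistics.mean(xs), statistics.stdev(xs)) for name, xs in history.items()}


def flag_deviations(baseline: dict[str, tuple[float, float]],
                    current: dict[str, float], k: float = 3.0) -> list[str]:
    """Return dependencies whose current latency exceeds mean + k * stdev."""
    flagged = []
    for name, latency in current.items():
        mean, sd = baseline[name]
        if latency > mean + k * sd:
            flagged.append(name)
    return flagged


# Hypothetical latency history (ms) for two external dependencies.
history = {
    "isp-primary": [20, 22, 21, 19, 23, 21],
    "cloud-region-a": [45, 50, 48, 47, 52, 46],
}
baseline = build_baseline(history)
print(flag_deviations(baseline, {"isp-primary": 24.0, "cloud-region-a": 90.0}))
# ['cloud-region-a']
```

Comparing each dependency against its own history, rather than one global threshold, lets a slightly slow but normal link pass while a sharp change on another is caught.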

Aaron Hurst

Aaron Hurst is Information Age's senior reporter, providing news and features around the hottest trends across the tech industry.