IT department disaster preparation

Natural disasters like Cat 5 hurricanes, killer tornados and superstorms. Nightmare scenarios where enemies of the republic rain missiles down on our cities. Unstoppable superbugs that kill indiscriminately.

All of these are part of a scenario called The End Of The World As We Know It, or TEOTWAWKI, when all hell breaks loose and it’s every wo/man for her/himself.

But companies can have their own TEOTWAWKI scenario, even as the rest of the world continues to function. That scenario – the IT version of TEOTWAWKI – comes about when a company’s servers and services, internal and external, cease to function for any number of reasons (like a “glitch”), halting business, frustrating customers, and causing huge losses for the organization.

What you’re supposed to do when real-world TEOTWAWKI hits is head for the hills: grab your bug-out bag, follow your prepared escape plan, and take a very specific route to the backwoods redoubt where you will ride out the storm. That redoubt should be well stocked with supplies and equipped with defense measures to keep out the vandals and thieves desperate enough to do anything to survive.

It’s all part of the Doomsday Prepping movement, and while opinions differ on how valid fears of TEOTWAWKI really are, there are definitely important lessons IT staff and corporate management can take from the preppers.

TEOTWAWKI in the real world requires a truly dire set of circumstances – all-out nuclear war or an alien invasion a la the film Independence Day – but in business, a TEOTWAWKI can arise at any time, for a variety of reasons.

Natural disasters that flood or otherwise damage data centers are, of course, a perennial threat, but IT has its own special set of man-made TEOTWAWKI scenarios. Systems today are so complicated and sophisticated that even slight changes have the potential to take a company’s services (internal and external) completely offline.

At the New York Stock Exchange, for example, it was a “glitch” in the rollout of new software in the trading system that halted activities for hours on July 8, 2015. More recently, another “glitch” halted or significantly slowed traffic at airports around the world last September when airline check-in systems crashed. And American Airlines narrowly avoided disaster in November when it corrected yet another “glitch” that had given too many pilots time off in December, one of the busiest travel seasons of the year; left uncorrected, it would have guaranteed canceled flights and incessant delays.

But even these scenarios aren’t as bad as things can get. When a “glitch” is detected, IT staff can get busy working on it, at least patching it up temporarily to allow the organization to get back to work. Far worse is the situation that prevails at more than half of companies, according to a recent study by the University of Chicago.

That study shows that the majority (nearly 300 of the 500-some instances examined) of outages were due to “unknown” factors. That, in our opinion, is a true TEOTWAWKI for enterprises; if you don’t know what the problem is, how can you even begin to think about fixing it?

So, where does one begin? Obviously, a comprehensive resilience strategy that can handle a wide variety of failure scenarios is key: from localized issues, such as the failure of a single network, compute or storage component, through the collapse of an entire availability zone, all the way to a region-wide outage necessitating geo failover. But even with such a design, things can go wrong. Excessive or frequent changes to production systems also undermine resilience; under those circumstances, instead of order, even more chaos can emerge.
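
To make the tiers of such a strategy concrete, here is a minimal sketch of the escalation logic: classify how widespread an outage is from per-zone health data and map that to a response, from replacing a single component up to geo failover. The zone and node names and the classify_outage helper are purely illustrative assumptions; a real environment would drive this from its own monitoring and orchestration stack.

```python
"""Minimal, illustrative sketch of tiered failure classification (hypothetical data)."""
from enum import Enum
from typing import Dict

class FailureScope(Enum):
    COMPONENT = "single component"           # one network/compute/storage element
    AVAILABILITY_ZONE = "availability zone"  # a whole zone is dark
    REGION = "region"                        # every zone is dark

def classify_outage(health: Dict[str, Dict[str, bool]]) -> FailureScope:
    """Classify outage scope from zone -> {node: is_healthy} health data."""
    zones_down = [z for z, nodes in health.items() if nodes and not any(nodes.values())]
    if len(zones_down) == len(health):
        return FailureScope.REGION
    if zones_down:
        return FailureScope.AVAILABILITY_ZONE
    return FailureScope.COMPONENT

RESPONSE = {
    FailureScope.COMPONENT: "restart or route around the failed component",
    FailureScope.AVAILABILITY_ZONE: "redirect traffic to the surviving zones",
    FailureScope.REGION: "trigger geo failover to the standby region",
}

if __name__ == "__main__":
    snapshot = {
        "zone-a": {"web-1": False, "web-2": False},
        "zone-b": {"web-3": True, "web-4": True},
    }
    scope = classify_outage(snapshot)
    print(f"Detected {scope.value} failure: {RESPONSE[scope]}")
```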

The same applies to any change. Outages are far too common at data centers, and can occur for any number of reasons – even something as simple as adding more storage or applying routine patches and updates. Incorrect driver and firmware configurations, or patches applied inconsistently, can leave systems exposed.
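
As one illustration of how such inconsistencies can be caught, the sketch below compares each host’s reported driver, firmware and patch levels against an approved baseline and flags any drift. The baseline, the fleet inventory and the find_drift helper are all hypothetical; in practice the inventory would come from a CMDB or the fleet’s management tooling.

```python
"""Illustrative configuration-drift check against a baseline (hypothetical data)."""
BASELINE = {"nic_driver": "5.3.1", "storage_firmware": "2.10", "os_patch_level": "2024-06"}

FLEET = {
    "db-01":  {"nic_driver": "5.3.1", "storage_firmware": "2.10", "os_patch_level": "2024-06"},
    "db-02":  {"nic_driver": "5.3.1", "storage_firmware": "2.08", "os_patch_level": "2024-06"},
    "app-01": {"nic_driver": "5.2.9", "storage_firmware": "2.10", "os_patch_level": "2024-04"},
}

def find_drift(fleet, baseline):
    """Return {host: {setting: (found, expected)}} for every mismatch."""
    drift = {}
    for host, config in fleet.items():
        mismatches = {
            key: (config.get(key, "<missing>"), expected)
            for key, expected in baseline.items()
            if config.get(key) != expected
        }
        if mismatches:
            drift[host] = mismatches
    return drift

if __name__ == "__main__":
    for host, issues in find_drift(FLEET, BASELINE).items():
        for setting, (found, expected) in issues.items():
            print(f"{host}: {setting} is {found}, baseline expects {expected}")
```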

Of course, many changes made to the data center are far more complex – for example, a major overhaul of a private cloud architecture that involves adapting to new features and tools, and often hundreds of new vendor best practices. If incorrectly designed or performed, such a change might affect thousands of VMs and containers. With millions of possible configurations, many of them interconnected and with mutual dependencies, there is no way for a human being to keep track of them all.
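
A toy example makes the point about mutual dependencies: even in a tiny, invented dependency graph, a change to one shared component fans out across most of the environment once transitive dependencies are followed. All of the component names below are made up for illustration.

```python
"""Toy change-impact walk over an invented dependency graph."""
from collections import deque

# component -> components that depend on it
DEPENDENTS = {
    "shared-storage": ["vm-db-1", "vm-db-2"],
    "vm-db-1": ["vm-app-1", "vm-app-2"],
    "vm-db-2": ["vm-app-3"],
    "vm-app-1": ["api-gateway"],
    "vm-app-2": ["api-gateway"],
    "vm-app-3": ["reporting"],
    "api-gateway": ["web-frontend"],
}

def impacted_by(change_target: str) -> set:
    """Breadth-first walk of everything downstream of a changed component."""
    impacted, queue = set(), deque([change_target])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

if __name__ == "__main__":
    target = "shared-storage"
    print(f"Changing {target} potentially affects: {sorted(impacted_by(target))}")
```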

The best way, some believe, to significantly reduce risk and increase the resiliency of IT systems is through a deep knowledge base and automation. Indeed, a new breed of tools for continual IT configuration quality assurance has evolved to make sure IT systems are safely and consistently configured at all times.
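
In spirit, such tools encode best practices as executable rules and re-check the live configuration on every change or on a schedule. The sketch below is a simplified, hypothetical illustration of that idea, with three invented rules run against a sample configuration; it is not based on any particular vendor’s product.

```python
"""Illustrative rule-based configuration audit (invented rules and data)."""
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    check: Callable[[Dict], bool]  # returns True when the configuration passes
    remedy: str

RULES: List[Rule] = [
    Rule("replication enabled",
         lambda c: c.get("replication") == "enabled",
         "enable replication to the standby site"),
    Rule("backups recent",
         lambda c: c.get("hours_since_backup", 999) <= 24,
         "investigate the backup job; last run is older than 24h"),
    Rule("redundant storage paths",
         lambda c: c.get("storage_paths", 0) >= 2,
         "configure a second path to storage"),
]

def audit(config: Dict) -> List[str]:
    """Run every rule and return remediation advice for the failures."""
    return [f"{r.name}: {r.remedy}" for r in RULES if not r.check(config)]

if __name__ == "__main__":
    current = {"replication": "enabled", "hours_since_backup": 40, "storage_paths": 1}
    for finding in audit(current):
        print("FAIL -", finding)
```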

Advance preparation can be life-saving in the event of TEOTWAWKI – both in real life and in IT life. For the latter, ensuring that an organization can survive an outage and get back online as quickly as possible is a matter of survival akin to getting out of town to the secure redoubt. With the right preparation, both preppers and companies can live to see the beginning of a new world.

 

Sourced by Iris Zarecki, VP marketing of Continuity Software
