Falling through the net

In August 2010, thousands of citizens in the US state of Virginia found themselves unable to access welfare services, apply for driving licences or, in some cases, pay their taxes. The cause was a catastrophic failure of the state’s IT systems.

The disruption first came to light on 25 August and continued into early September. According to the Virginia Information Technologies Agency, 26 of the state’s 89 agencies were affected; the number of citizens caught up in the outage is not yet known.

Just over a month later, on the other side of the world, Australian passengers on local carrier Virgin Blue were left stranded when a newly overhauled IT system, provided by airline IT specialist Navitaire, failed. Virgin Blue was forced to cancel 400 flights, and the disruption affected 50,000 passengers.

Over the past decade, most enterprises have drafted and rehearsed contingency plans for dealing with disruptions to their business. Such plans include scenarios for extreme weather and natural disasters, pandemics, criminal acts and terrorism. But often, as the examples of both Virginia and Virgin Blue demonstrate, considerable disruption can be triggered by a simple and initially minor technical fault.

For Virgin Blue, the trigger for the disruption was the failure of a solid-state disk in a server. But why did that lead to 21 hours of service unavailability? “The service agreement that Virgin Blue has with Navitaire requires any mission-critical system outages to be remedied within a short period of time,” said the airline. “This did not happen in this instance.”

The situation was made worse, according to Virgin Blue, by a decision by its IT supplier to try to repair the component rather than bring a backup system online. “We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful,” it said. “[This] contributed to the delay in initiating a cutover to a contingency hardware platform.”

In Virginia, the failure of two circuit boards in the state’s storage area network (SAN) – a top-of-the-range EMC DMX3 – was the initial cause of the outage. According to reports in the US press, backup systems failed to operate correctly and a number of state agency databases were corrupted as a result. The difficulty of recovering that corrupted data was the main reason for the length of the outage.

A third recent example befell US clothing retailer American Eagle Outfitters, which lost eight days of online trading following the failure of systems supporting its website. According to reports, two backup disks failed, as did a backup utility. A disaster recovery site, designed to take over in precisely such circumstances, was not in a fit state to do so.

In each case, it seems, the culprit was not so much the initial technical fault as the failure of the backup and recovery arrangements designed to keep systems online. According to some observers, such failures are increasingly commonplace as businesses try to cut back on IT spending. “The financial crisis has forced companies to make decisions they would not otherwise have done,” says Ray Stanton, global head of business continuity, security and governance at BT. “Some have not made investments, or have opted to defer business continuity exercises for a year.”

At the same time, the growing complexity of IT infrastructure is making it harder to design foolproof business continuity plans.

In most large organisations, there will be IT failures or incidents on an almost daily basis. Indeed, masking that base rate of failure from the wider business, and from customers and shareholders, is a crucial part of the IT department’s job. “In my business something happens every day, but it is below the radar,” explains BT’s Stanton. “It is very predictable when it is under your control.”

There are various precautions businesses can take to minimise the risk that routine system failures snowball into long-term disruption. Conducting root cause analysis when incidents occur and inspecting trouble tickets from IT support systems can provide advance warning of a potential outage. “That data will show where you have potential problems,” says John Morency, research vice president covering business continuity at Gartner.
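To make that concrete, here is a minimal sketch of the kind of trend check Morency describes, assuming a hypothetical export of trouble tickets with just a component name and an opened date (the field names, thresholds and sample data are invented, not taken from any real service-desk product):

```python
# Hypothetical sketch: flag components whose trouble-ticket volume is rising,
# as a rough early-warning signal of a potential outage.
# The "component" and "opened" fields and the 1.5x threshold are assumptions.
from collections import Counter
from datetime import date, timedelta

def rising_components(tickets, today, window_days=30, threshold=1.5):
    """Return components whose ticket count over the last window grew by
    `threshold` times or more compared with the window before it."""
    recent_start = today - timedelta(days=window_days)
    previous_start = today - timedelta(days=2 * window_days)

    recent, previous = Counter(), Counter()
    for ticket in tickets:
        if ticket["opened"] >= recent_start:
            recent[ticket["component"]] += 1
        elif ticket["opened"] >= previous_start:
            previous[ticket["component"]] += 1

    return {
        component: (count, previous.get(component, 0))
        for component, count in recent.items()
        if count >= threshold * max(previous.get(component, 0), 1)
    }

# Made-up example data: SAN tickets are becoming more frequent.
tickets = [
    {"component": "SAN", "opened": date(2010, 8, 20)},
    {"component": "SAN", "opened": date(2010, 8, 22)},
    {"component": "SAN", "opened": date(2010, 7, 10)},
    {"component": "web", "opened": date(2010, 7, 5)},
]
print(rising_components(tickets, today=date(2010, 8, 25)))  # {'SAN': (2, 1)}
```

In practice a check like this would sit inside a monitoring or service-management tool rather than a standalone script, but the principle – compare each component’s recent failure rate against its own history – is the same.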

Recent technological developments such as storage area networks, virtualisation and even cloud computing have boosted the resilience of IT systems, not least by separating the physical and virtual layers of a system. “The trend is towards more resilience,” says Rick Cudworth, head of business continuity and resilience for EMEA at professional services firm Deloitte.

But reducing the failure rate of individual systems can only go so far when it comes to mitigating the risk of IT downtime. In most cases it is not a single incident that causes the outage but a combination of unanticipated events. Businesses might have a plan to withstand one failure, but it is when a second or third system fails that outages begin to affect customers or stakeholders.

Addressing this danger calls for a risk management approach. It requires businesses to determine which of their systems are the most critical, and which should therefore be the priority for protection and restoration. Gartner calculates that in most organisations between 10 and 18 per cent of IT systems run 60 to 70 per cent of business processes.
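One simple way to picture that prioritisation is to map business processes to the systems they depend on and count the overlaps. The sketch below does exactly that; the processes, system names and dependency mapping are invented for illustration:

```python
# Hypothetical sketch: rank systems by how many business processes depend on them,
# so that protection and restoration effort goes to the small core that matters most.
# The process-to-system mapping below is invented for illustration.
from collections import Counter

process_dependencies = {
    "take bookings":  ["web", "booking_engine", "payments", "san"],
    "check in":       ["booking_engine", "san"],
    "issue refunds":  ["payments", "san"],
    "send marketing": ["email"],
}

usage = Counter(system
                for systems in process_dependencies.values()
                for system in systems)

for system, count in usage.most_common():
    print(f"{system}: supports {count} of {len(process_dependencies)} processes")
```

Even in this toy example the pattern Gartner describes shows up: a small handful of systems underpins most of the processes, and those are the ones that justify the heaviest investment in redundancy and recovery.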

It also requires an understanding of the knock-on effects that a single system failure will have. “Business continuity means undertaking a detailed analysis of the infrastructure to find any single points of failure,” says Seamus Reilly, head of information security for Northern Europe at Ernst & Young. “You have to understand the systems from a data flow point of view. We have clients that already have business continuity and disaster recovery in place but are now taking it to the next level, to see what impact a single point of failure could have on the business.”
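The analysis Reilly describes can be thought of as a graph problem: model systems as nodes, data flows as links, and look for nodes whose loss splits the infrastructure apart. The brute-force sketch below illustrates the idea; the component names and links are invented, not drawn from any real environment:

```python
# Hypothetical sketch: model infrastructure as an undirected graph of data flows
# and find single points of failure - nodes whose removal disconnects the graph.
# The component names and links below are invented for illustration.

def is_connected(nodes, edges):
    """Depth-first traversal check that every node is reachable from any other."""
    if not nodes:
        return True
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for a, b in edges:
            neighbour = b if a == current else a if b == current else None
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen == nodes

def single_points_of_failure(nodes, edges):
    """Return the nodes whose removal splits the remaining infrastructure apart."""
    spofs = []
    for node in nodes:
        remaining_nodes = [n for n in nodes if n != node]
        remaining_edges = [(a, b) for a, b in edges if node not in (a, b)]
        if not is_connected(remaining_nodes, remaining_edges):
            spofs.append(node)
    return spofs

# Invented example: a web front end, two application servers, one SAN, one backup site.
nodes = ["web", "app1", "app2", "san", "backup_site"]
edges = [("web", "app1"), ("web", "app2"), ("app1", "san"),
         ("app2", "san"), ("san", "backup_site")]
print(single_points_of_failure(nodes, edges))  # ['san'] - every data flow crosses it
```

Real dependency maps are larger and messier, and directionality, capacity and shared facilities all matter, but even a crude map like this makes it obvious which components deserve a tested failover path.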

Another crucial but often absent component of business continuity is testing, Reilly adds. Ensuring that standby systems can be activated and data restored is especially important. This should not simply be a technical test, he says, although such tests are vital. A proper testing regime involves a crisis management exercise that brings in the corporate communications, legal and HR departments, as well as the lines of business and IT. “Testing is the number one way to prove that key systems work as expected,” advises Reilly. “And many organisations still don’t do that enough.”

Clearly, business continuity remains one of the most complex challenges facing IT executives. But recent examples prove that it is a challenge that cannot be shirked.
