Information Age: News, analysis & insight for IT & business leaders

Falling through the net

13 October 2010  

Why backup and recovery systems are so often the cause of disastrous IT failures

In August 2010, thousands of citizens in the US state of Virginia found themselves unable to access welfare services, apply for driving licences and, in some cases, pay their taxes. The reason, of course, was a catastrophic failure of the state’s IT systems.

The disruption first came to light on 25 August, and continued into early September. According to the Virginia Information Technologies Agency, 26 of the state’s 89 agencies were affected. The number of citizens affected is not yet known.

Just over a month later, on the other side of the world, Australian passengers on local carrier Virgin Blue were left stranded when a newly overhauled IT system, provided by airline specialist Navitaire, failed. Virgin Blue was forced to cancel 400 flights and the disruption affected 50,000 passengers.

Over the past decade, most enterprises have drafted and rehearsed contingency plans for dealing with disruptions to their business. Such plans include scenarios for extreme weather and natural disasters, pandemics, criminal acts and terrorism. But often, as the examples of both Virginia and Virgin Blue demonstrate, considerable disruption can be triggered by a simple and initially minor technical fault.

For Virgin Blue, the trigger for the disruption was a solid-state disk server fault. But why did that lead to 21 hours of service unavailability? “The service agreement that Virgin Blue has with Navitaire requires any mission-critical system outages to be remedied within a short period of time,” said the airline. “This did not happen in this instance.”

The situation was made worse, according to Virgin Blue, by a decision by its IT supplier to try to repair the component rather than bring a backup system online. “We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful,” it said. “[This] contributed to the delay in initiating a cutover to a contingency hardware platform.”

In Virginia, two circuit boards in the state’s storage area network (SAN) – a top-of-the-range EMC DMX-3 – were the initial cause of its outage. According to reports in the US press, backup systems failed to operate correctly and a number of state agency databases were corrupted as a result. The difficulty of recovering that corrupted data was the main reason for the length of the outage.

A third recent example befell US clothing retailer American Eagle Outfitters, which lost eight days of online trading following the failure of systems supporting its website. According to reports, two backup disks failed, as did a backup utility. The disaster recovery site designed to take over in such an event was not in a ready state.

In each case, it seems, the culprit was not so much a technical fault as the failure of backup and recovery systems that keep systems online. According to some, such failures are increasingly commonplace as businesses try to cut back on IT spending. “The financial crisis has forced companies to make decisions they would not otherwise have done,” says Ray Stanton, global head of business continuity, security and governance at BT. “Some have not made investments, or have opted to defer business continuity exercises for a year.”

At the same time, the growing complexity of IT infrastructure is making it harder to design foolproof business continuity plans.
