Disaster zone

Once or twice every decade, a catastrophic event strikes a business hub with tragic loss of life and massive business disruption. The Loma Prieta earthquake south of San Francisco in 1989, the IRA bombing of the City of London in 1992, the Kobe earthquake in Japan in 1995, and, of course, the devastating terrorist attacks on New York and Washington in September 2001 all underscore the vital – and evolving – role of business continuity and disaster recovery planning.

The frequency of such events may be grimly consistent, but the scale of the disruption they cause to IT-based business functions is growing sharply. The role of computing, especially in sectors such as financial services, meant that in New York, in the absence of an IT infrastructure, whole businesses ceased to exist – at least until their disaster recovery plans swung into action.

Additionally, the attacks on the Twin Towers sent a resounding message to business leaders responsible for ensuring their organisation can recover from a major disaster. While almost all previous situations had been defined by their local impact, the global terrorist threat means that almost any building around the world becomes a potential target – a European headquarters in London, a data centre in the US, a manufacturing plant in Asia.

That has prompted organisations to take a long, hard look at their business continuity strategies – the location of ‘hot standby’ facilities, the viability of their recovery plan, the level of automation of data back-up, and so on – and to examine how a serious disruption might severely impact their operations.

"The World Trade Center (WTC) was one of the first incidents that really tested the advanced recovery and high availability services that are now available within the business community," says Keith Tilley, managing director for Europe at services provider Sungard Availability Services.

For example, recovery technologies that rely on frequent data back-up and data mirroring helped avert massive data loss for many organisations in lower Manhattan, while arrangements with recovery service providers enabled many to move and restart their operations relatively rapidly at alternative IT sites.

Valuable lessons
Organisations that had well-planned business continuity strategies – even some located within a few hundred yards of the Twin Towers – were able to recover relatively well (see Restart from ‘Ground Zero'). But many others did not.

While most organisations had taken sufficient measures to protect their core centralised computing infrastructure, the end-user environment of businesses did not have a similar level of contingency planning, says Todd Gordon, vice president and general manager of business continuity and recovery services at IBM Global Services. Companies failed to protect end points on systems such as workstations and router configurations, and they failed to make basic provisions for items such as desks, printers, voice mail and a customer service dial tone, he adds.

Robin Gaddum, Guardian iT: “Organisations are looking at distributing their staff more widely.”

Early business recovery strategies tended to focus primarily on restoring IT equipment and connectivity after a major outage. But that is not enough. "Business continuity is about preventing problems and maintaining the business. It is about covering the business processes, the command, control and communications and people aspects," says Robin Gaddum, managing consultant at UK-based services supplier Guardian iT.

Above all else, what has become evident is the vital importance of staff in the aftermath of a major disaster. "Prior to 11 September, every business continuity plan assumed that organisations would have access to the majority of their staff," says Gaddum. That assumption has changed.

In tragic circumstances, for example, one of Sungard's customers lost every member of its business continuity planning department in the Twin Towers attack, says Tilley. "Clearly, putting all of your staff under one roof in the building next to yours has inherent security vulnerabilities. [Now] organisations are looking at distributing their [recovery] staff more widely," adds Gaddum.

In the six months since the attacks on the US, however, the emphasis has been more on re-examination than on action. Research points to little having changed in terms of investment in new services and technologies. "There is a lot of inquiring and analysis going on, but so far not the increased commitment in dollars to implement end-to-end continuity plans," says Gordon.

In one survey, IT services vendor CMG found that only 9% of UK organisations with more than 500 employees had reviewed their disaster recovery plans since the rise of the new terrorist threats. "Spending is not occurring at the rate you would think it should relative to the deficiencies [in many organisations' strategies]," says Gordon.

But business continuity strategy is not something that can be put in place in a matter of weeks. "I don't think we will be able to measure for about another two financial quarters where [new business continuity] funds are being directed," adds Gordon. New contracts at service providers, he notes, take between six and nine months to agree and implement.

One common reason cited for a lack of new activity is that many organisations already feel they have adequate business continuity provision in place. For example, Ian Campbell, CIO at UK telecoms and service provider Energis, says, "There has been no change to strategy at the company from before the events of 11 September, other than a review and assessment of any possible lessons that could be learned."

Similarly, Peter Cox, systems director at UK food retailer Waitrose, says there has been no change in its strategy because, "We had [business continuity provision] firmly in our sights already."

But such companies are probably not the internationally recognised symbols of western business that are most at threat. Companies with global brands have taken a much deeper look at their exposure – but most, not surprisingly, are unwilling to comment on their continuity plans.

The difficulty is that much of disaster recovery planning is subjective. Companies have to assess what they are trying to protect themselves against, the likelihood of it occurring, and the cost of avoidance or recovery, says Cox.

Levels of business continuity requirement also vary enormously. For many financial institutions and those with high levels of online transactions, even a few minutes of downtime is too much. Others can get by for days.

Steven Hunter, operations manager at CMP Information, the professional media publishing division of United Business Media, admits his company did not have a business continuity strategy in place before September 2001. That "kick-started" the company's plans, he says.

Like many, the company has decided that instantaneous recovery is not necessary. "The business has told [IT operations] that they can afford to be down for about two days before it starts to have any sort of major impact on them," says Hunter. "The way the business viewed the risk was that if a disaster was to strike on the day it was going to press, it would be a major problem." But as CMP spreads its publication of various journals and magazines across the calendar, managers think the organisation could survive for a couple of days, he adds.

"Organisations would love instant recovery, but what they are demanding is typically between two and four hours for major applications," says Phil Jones, director of business development for Europe, Middle East and Africa, at storage specialist Hitachi Data Systems.

Mirror, mirror
Providing fast recovery may not be easy but the technology is there to help organisations ensure downtime is minimised. One crucial technology, for some organisations, is real-time data mirroring. Investment bank Schroders, for example, synchronises the transfer of data between its two London data centres located on either side of the River Thames using data mirroring software from storage vendor EMC.

"The beauty of having that kind of failover is that an organisation's return on investment is being maximised, as there is very little redundancy. The only issue is whether the organisation has enough capacity at one of the data centres to run all of its key systems in the event of a disaster," says David Jennings, technical business consultant at EMC. Schroders tests the capability of one data centre to support the operations of both every three months.

However, replicating or mirroring data in near real-time, particularly over long distances, is not cheap. "If you ask a hundred CIOs, all hundred would like to [use geographically dispersed data centres]. But can they afford to do it? The answer, for most, is no," says Charles Rutstein of Forrester Research.

Ian Bowdidge, HP: “Is what you’re doing what the business needs?”


Data replication costs are largely determined by distance. "If you are within fibre line distance (within 50 kilometres), then it is not too expensive. But if you have T1 or T2 lines over a long distance requiring lots of telecommunications work, then it becomes very expensive," says Donna Scott, research director at IT industry research group Gartner.

Clearly, traditional methods, such as backing up to tape once or twice a day and then sending the tapes to a secure off-site location, are much cheaper, despite being more labour intensive. Now companies are looking to back up far more frequently. "We back up the data on our mainframe continuously in real-time," says Cox at Waitrose.

Without regular rehearsals of disaster scenarios, though, weaknesses can lie undiscovered. Ian Bowdidge, business continuity manager for Europe at Hewlett-Packard, warns: "We are still amazed at the number of customers who come to us and either can't read their back-up tapes, or their back-up tapes do not contain what they think they contain."

Missing data
Other lessons have emerged from the disaster recovery efforts in New York. Protection against data loss needs to be much more extensive, says IBM's Gordon. On 11 September, most of the data lost was on user devices, he says. "There were over 100,000 workstations lost or destroyed. And backing up the information on those workstations was left up to individual users. That contrasts sharply with the solid data back-up policy most companies had for their core server systems where data loss was minimal."

Over the coming months, those kinds of lessons will be applied at major corporations around the globe as non-IT executives come to appreciate the impact a loss of computing function can have on their organisations. "Business continuity has to be led from the top. If you haven't got board-level commitment, then you don't know if what you are doing is what the business needs," says Bowdidge.


Waiting for disaster

Deciding what to do after a disaster has happened is a high-risk strategy. But it is still the most prevalent approach towards business continuity and disaster recovery provision for many large organisations.

A recent survey of 287 IT managers in mid-sized and large companies in the UK shows that 77% of respondents thought their organisation's non-IT operations only became aware of the importance of disaster recovery planning after serious downtime had occurred. The research by analyst group IDC, conducted on behalf of UK-based services provider Guardian iT, suggested that two thirds of organisations still view disaster recovery and business continuity as an IT problem, instead of a critical component of the overall business.

Rod Ratsma, a director in the consulting division of service provider Guardian iT, spells out the dangers of this strategy. "If a flood or fire wipes out an organisation's main computer room or a telecoms glitch brings its website down, this could potentially cost them millions of pounds."

Most IT staff are fully aware of these risks, but to implement adequate provision requires boardroom level support. Ratsma adds, "Indeed, 46% of IT managers complain that IT departments are not given enough support from the rest of the company in this area. These findings are echoes of an all-too-familiar story where organisations are not taking business continuity seriously, and when they do, responsibility often falls to the IT manager – rather than a board-level manager who can champion the cause."

A vital component of this support is financial backing. The survey found a pitifully low amount of expenditure on disaster recovery. IDC highlighted how the average annual expenditure on disaster recovery is a paltry £50,000. A breakdown of the interviewees' expenditure on disaster recovery was split 31% on Microsoft NT servers, 28% on mainframes and data centres, 20% for Unix servers, 11% on PC-based local area networks, and 9% on IBM AS/400 systems.

This level of spend will have to change. One key reason is the significant mismatch between the demands organisations place on their IT departments – near-continuous availability across networks, systems and applications – and the resources they commit to ensuring it.


Restart from 'Ground Zero'

There are some chilling lessons to be learnt from companies that find themselves at the epicentre of disaster.

On 11 September 2001, almost one thousand IT and operations staff at a major investment bank [which requested anonymity due to the sensitivity of the situation] went to work as normal at the World Trade Center complex, supporting financial market trade settlement and reconciliation for the bank in New York and around the globe. When terrorists struck the Twin Towers only a few hundred metres away, there was one overriding priority: to get staff to safety. That task was managed, miraculously, without the loss of a single life – but only because previous small-scale incidents had meant evacuation procedures were well rehearsed, and staff knew to head to exits away from the Twin Towers.

But as people were being accounted for, the bank's operations department business recovery plan swung into action. Over that and following days, staff were relocated to recovery facilities the bank owns outside of Manhattan and later to the company's New York City headquarters. After quick consideration, the company decided against using another recovery capability it had arranged at a third-party's facility several miles away.

By 11am on 11 September, the configuration of the company's recovery facility was already underway. Around 300 workstations were brought up as IT and business employees started restoring applications and data, and testing external connectivity. At the same time, financial settlement operations staff assessed the viability of critical processes and the extent and potential impact of lost data. At the firm's head office, an initial 100 new workstations were powered up, complete with support equipment and mainframe and server access. Meanwhile, the company's use of voice-over-IP technology meant that its communications capabilities were not hampered by the loss of major telecoms hubs in lower Manhattan.

By the afternoon of 12 September, almost half the staff from the former World Trade Center building were working at one of the two recovery locations, and the reconfiguration of this new operations environment was in place by the evening of 16 September 2001, in time for the re-opening of the New York Stock Exchange the next morning.

Over the next two weeks, the company managed a controlled migration to this new operations environment, a process helped by the recovery plan having involved the pre-testing of 37 of the 40 key third-party communications links used in the bank's settlement processing.

The lessons learned over that traumatic 20 days were clear, says the head of the bank's disaster recovery group. These included:

  • Recovery is primarily a staff issue; provision for staff care is vital, including transportation, accommodation, sustenance, and, where required, counselling.
  • Pre-agreements should be in place for equipment delivery, including IT infrastructure, PCs, printers, fax machines and office supplies.
  • Fully pre-configured recovery facilities should be established and tested.
  • Where third-party facilities are to be used, these will require significant configuration on the day.
  • Throughout the recovery period, recovery costs should be tracked, as these will be needed for insurance claims.

In the light of how well the bank recovered – and how other businesses struggled under less severe circumstances – the head of the bank's recovery unit has one headline conclusion that should resound around every boardroom: Business risk and IT risk are inextricably linked.

Editor's note: Sensitive to the devastating experience that many of its employees went through, the financial institution profiled here requested anonymity.


Ben Rossi

Ben was Vitesse Media's editorial director, leading content creation and editorial strategy across all Vitesse products, including its market-leading B2B and consumer magazines, websites, research and...
