Modern business continuity thinking rightly stresses that organisations need to protect their business processes, not just their data. But even though every second counts in the online world, processes are mostly transient, and can usually be picked up again after a continuity problem. When data is lost, it can be expensive, embarrassing and even devastating.
Ask the CEO of Co-Operators Life Insurance in Canada. In January 2003 the company’s IT partner, IBM, informed the insurer that it had mislaid the financial details of as many as 180,000 of its clients. IBM eventually found the data – but only after the CEO of Co-Operators Life, fearful of large losses, had given a humiliating press conference to apologise.
Until the mid-1990s, the mainstay of data back-up was tape. Even today, many smaller companies use tape as their primary back-up medium, because it is cheap and reliable. If a disk fails, however, IT staff have to search for the correct back-up tape, then spool through it to locate the required data. This takes time. And while modern tape drives may be able to back up and restore as much as two terabytes (2,000GB) of data in under an hour, according to Graham Hunt, product marketing manager, EMEA, at storage specialist Quantum, this is still far slower than disk-to-disk copying.
Tape is prone to other problems, too. “If one file is missing, it affects the whole recovery,” says Tony Reid, director of enterprise systems for HDS EMEA.
For these reasons tape now takes second place to disk-based back-up systems, and is reserved for archiving and offline restores. Most organisations choose direct copying from one disk to another local drive as their primary business continuity back-up, providing the maximum possible speed for both back-up and restore. “It’s very fast,” says Tarek Maliti, technical director of hosting company TDM. “You can choose which files you want, point to the version and it’s restored in seconds.”
A disk-based back-up can also be organised at several levels, says Laurence James, disk business manager at Storagetek: at application or operating system level, where the application or operating system itself writes to both drives simultaneously; at an intermediate level, where a software or hardware agent monitors a server for disk writes and replicates them on the back-up drive; or at disk controller level, where any instruction to write to disk is duplicated by the hardware so that the write is repeated on the second disk. Batch back-ups, which quickly fall out of date, are effectively replaced.
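The first of those levels is simple enough to sketch in a few lines. The helper below is a hypothetical illustration, not any vendor's implementation: an application-level write routine that only reports success once the data sits on both the primary and the mirror drive.

```python
from pathlib import Path

def mirrored_write(relative_name: str, data: bytes,
                   primary: Path, mirror: Path) -> None:
    """Application-level mirroring: a write 'succeeds' only once
    both the primary copy and the back-up copy are on disk."""
    for root in (primary, mirror):
        target = root / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
```

Every save thus lands on both drives at once, so a failed primary disk costs no data – at the price of the application waiting for the slower of the two writes.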
The software-based approach is used by a number of software companies, including NeverFail, a Microsoft specialist. NeverFail sells software agents that monitor both SQL Server and Exchange Server and replicate their contents on a second server. Because the agent sits at a higher level, it may not be as fast as the pure hardware approach, but it can integrate with systems management tools, and it can read hardware information from the operating system, giving it a fuller picture of the disk’s state. It can use this information to anticipate a hardware failure and initiate a switchover to the second server.
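NeverFail's agents are proprietary, but the intermediate level they represent can be pictured with a toy sketch (the class and its in-memory "disks" are invented for illustration): the agent captures each write to the primary in a journal and replays the journal, in order, on a stand-by replica.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationAgent:
    """Hypothetical intermediate-level agent: it intercepts
    writes to the primary store and later replays them,
    in order, on a stand-by replica."""
    primary: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)

    def write(self, key: str, value: bytes) -> None:
        self.primary[key] = value          # the monitored write...
        self.journal.append((key, value))  # ...is captured in a journal

    def replicate(self) -> None:
        # Drain the journal onto the stand-by server.
        for key, value in self.journal:
            self.replica[key] = value
        self.journal.clear()
```

The journal is what makes the scheme asynchronous-friendly: the primary never waits for the replica, it only waits for the local journal append.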
“There is a perception that a tape back-up or traditional disaster recovery will help in the event of disaster,” says Steve Stobo, European sales and marketing director at NeverFail. “But when Exchange Server goes down, it usually goes down quite messily – usually for a day or so.” Stobo says NeverFail for Exchange Server can spot problems and switch the servers before the data is corrupted.
These strategies are all effective for dealing with the common problem of a local drive failure. But what if the business interruption stems from something more than a single sub-system problem – such as a flood or fire?
Only a remote back-up system can deal with this. In the event of a breakdown of the primary centre, the second centre can pick up the slack with a complete or near-complete copy of the original data.
“It’s not cheap, because you need to double everything,” says Miles Cunningham of SunGard Availability Services. “But you can get a gigabit circuit halfway across the UK for as little as GBP 50-60K a year.” An alternative, at the lower end, is to use a back-up service provider, such as Imperidata, which is able to ‘trickle’ a disk drive update to a remotely hosted copy over a standard broadband Internet connection without the user noticing.
Using remote replication, disk drives can be brought online automatically. Nor do the secondary servers need to be reconfigured, because the network address of the primary server can be virtualised using Cisco’s Hot Standby Router Protocol (HSRP) or the standards-based Virtual Router Redundancy Protocol (VRRP). A router at the apparent network address of the primary server forwards traffic to the primary’s real address while it is healthy, but redirects traffic to the second server in the event of a failure.
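The failover behaviour HSRP and VRRP provide can be modelled with a toy dispatcher – the addresses and the health-check flag below are invented for illustration, not taken from either protocol's packet format:

```python
def route(payload: bytes, primary_up: bool,
          primary: str = "10.0.0.2", standby: str = "10.0.0.3") -> tuple:
    """Forward traffic arriving at the virtual address to the
    primary server while it answers health checks; otherwise
    redirect it, unmodified, to the standby server."""
    destination = primary if primary_up else standby
    return destination, payload
```

The point is that clients never notice the switch: they address the virtual IP throughout, and only the router's choice of real destination changes.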
Virtualisation technology, whether of the network or of storage, has provided users with much more resilient systems all round. In the event of storage failure, for example, traffic from the still-functioning servers is simply redirected to the back-up network storage without any interruption in service.
When SANs were first introduced – during the 1990s – they solved some continuity problems but created others. Remote backing up of SANs was initially difficult: many SANs are based on fibre channel networking rather than standard Ethernet-based networking; they also often require expensive and proprietary virtualisation software, even for local back ups, to create easily manageable back-up processes.
This, however, has become easier. The advent of the iSCSI protocol, which forgoes the fibre channel of traditional SANs in favour of Gigabit Ethernet, has made it as easy to push iSCSI-based SAN data out over a leased line as it is to push out data from attached storage. “You’re better off deploying a SAN if you want to do remote back ups,” says Stephen Owens, EMEA product manager of Adaptec, a storage systems supplier. “It’s easier, you get the benefits of separate storage – you don’t impact the LAN – and you can deploy snapshotting, remote mirroring and storage virtualisation.”
Quantum’s Hunt concurs, adding that iSCSI will go where fibre channel can’t. “Fibre channel isn’t designed for long distances, but Ethernet has an unlimited range,” he says.
Remote back-up systems face a barrier that local systems do not, however: the speed of light. Once sites are more than roughly 16 kilometres apart, there are noticeable lags in communication between primary and secondary that can affect performance, as confirmations of arrival and requests for more data pass back and forth. With many companies now worried about large-scale terrorist attacks or natural disasters, the problem has moved from the theoretical to the practical.
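The arithmetic behind that figure is easy to check. Light in optical fibre travels at roughly two-thirds of its vacuum speed, about 200,000 km/s, and a synchronous write costs at least one round trip before it can be acknowledged (the numbers below are standard approximations, not vendor data):

```python
SPEED_IN_FIBRE_KM_S = 200_000  # ~2/3 of c, a common approximation

def round_trip_ms(distance_km: float) -> float:
    """Minimum acknowledgement delay for one synchronous write:
    the data travels to the secondary site and the
    confirmation travels back."""
    return 2 * distance_km / SPEED_IN_FIBRE_KM_S * 1000
```

At 16 km the wire adds only about 0.16 ms per write, but a server issuing thousands of dependent writes a second feels the accumulation; at 1,000 km each write waits at least 10 ms before it can be confirmed.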
Various storage firms claim to have overcome the problem, including EDS and Hitachi Data Systems. “We offer an asynchronous solution that can overcome the lag,” says John Hickman, business continuity manager, EMEA, HDS. Rather than write to both primary and secondary server simultaneously, asynchronous back up allows for lags of a few seconds before data is successfully written to the secondary server. “When we send out the data, we include metadata that specifies the order in which disk operations should be performed. The secondary server won’t perform the writes until it knows it has received all the data.”
This also prevents disk corruption at the secondary server, since it will not write data whose transmission was cut short by a failure at the primary.
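The ordering metadata Hickman describes can be modelled with a sequence number per write – a simplified sketch, since real products group writes into more elaborate consistency sets. The secondary buffers out-of-order arrivals and commits a write only once everything before it has also arrived:

```python
class AsyncReplica:
    """Secondary site in a hypothetical asynchronous mirror.
    Writes may arrive out of order; each carries a sequence
    number, and none is applied until the sequence is
    complete up to that point."""
    def __init__(self):
        self.pending = {}   # sequence number -> (key, value)
        self.next_seq = 0   # first sequence number not yet applied
        self.disk = {}

    def receive(self, seq: int, key: str, value: bytes) -> None:
        self.pending[seq] = (key, value)
        # Apply only the contiguous prefix: a gap means data is
        # still in flight (or was lost when the primary failed),
        # so later writes must not reach the disk yet.
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.disk[k] = v
            self.next_seq += 1
```

If the primary dies mid-transfer, whatever sits in `pending` beyond the last complete prefix is simply discarded, leaving the secondary's disk consistent rather than half-written.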
“It is not uncommon for some organisations to synchronously mirror their data to a secondary site 10 kilometres away,” agrees Paul Hammond, director of solutions consulting at CNT. “Some then back up asynchronously to a third site, which could be up to 100 kilometres away. An added step that creates a very sophisticated ‘belt and braces’ business continuity system is to then run back-ups of the data at the third site and keep these tapes in a fourth site.” Some organisations, at least, take their business continuity and back-up processes pretty seriously.