Continuity guaranteed
The grid computing model is promising high levels of continuity. How well are the early technologies delivering on that promise?
For almost three decades, there has been a single, fundamental model of business continuity - duplication.
On one level, organisations of all sizes have relied on backing up their data with the assumption that they can roll back to a previous state in the event of a failure. But the larger of them - banks, airlines and other companies that need to ensure little or no interruption to their transaction services - have taken that duplication a stage further by mirroring the core elements of their IT environments either on- or off-site.
Although such duplication of systems and facilities involves considerable cost, and is therefore only usually an option for high-end users, it does mean that such organisations can attain service levels approaching or even surpassing the hallowed 'five nines' - the goal of 99.999% uptime.
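To put 'five nines' in perspective, the following is a minimal sketch that converts an uptime percentage into the downtime it allows over a year. The function name and the use of an average 365.25-day year are assumptions made for illustration, not part of any vendor's service-level definition.

```python
# Minutes in an average year (365.25 days, to allow for leap years).
# This averaging convention is an assumption for illustration.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare two, three, four and five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

On that basis, 99.999% uptime leaves a little over five minutes of downtime a year, planned or unplanned.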
A variation on that theme that gained considerable traction in the late 1980s and early 1990s, especially among banks and telecoms companies, was to duplicate within the same box. So-called fault-tolerant or non-stop servers - from companies such as Tandem (now part of Hewlett-Packard) and Stratus - relied on cloning components, providing dual power supplies, standby processors, twin disk systems and so on. The system could switch to a back-up unit internally whenever a primary one failed. Again, those were the workhorses for organisations which could not tolerate downtime - no matter the expense.
However, over the last decade, systems vendors have been on a quest to bring those levels of business continuity to a much wider audience - at a much lower cost - by spreading the processing load across larger numbers of systems and thereby diluting or even eliminating the impact of any one failing. And that represents a fundamental shift in how availability is attained: from a model built on duplication to one based on redundancy.
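The 'diluting' effect can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the workload is spread perfectly evenly and can be redistributed without overhead; real systems will behave less tidily. The function name is hypothetical.

```python
def load_increase_after_one_failure(node_count: int) -> float:
    """Fractional rise in load on each surviving node when one node fails.

    Assumes an evenly balanced workload: each node carries 1/n before the
    failure and 1/(n - 1) afterwards, a relative increase of 1/(n - 1).
    """
    if node_count < 2:
        raise ValueError("at least two nodes are needed to survive a failure")
    return 1 / (node_count - 1)


if __name__ == "__main__":
    for nodes in (2, 4, 10, 50):
        rise = load_increase_after_one_failure(nodes)
        print(f"{nodes} nodes: each survivor takes on {rise:.1%} more work")
```

With two machines, the survivor must absorb double its normal load, which is why a duplicated system is sized to run everything alone. With ten, each survivor picks up roughly a ninth more.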
Today, that manifests itself in cluster systems but, increasingly, that approach is giving way to the notion of systems networked together in a grid architecture.
Clustering power
Although the notion of hooking together the processing power of multiple servers has attractions for scalability, the primary reason for clustering servers together is to obtain a high degree of availability. Jean Bozman, an analyst at industry watcher IDC, outlines the situation: "In the absence of 100% non-stop or fully fault-tolerant systems, clustered servers allow mission-critical applications to remain highly available, even if one of the individual server nodes should fail. Typically, two or more servers are connected together by shared access to storage and by an interconnect link that provides a 'heartbeat'. If the heartbeat is not detected from one of the attached servers, then clustering software initiates a failover of workload from the affected server to an alternative server."
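The heartbeat-and-failover mechanism Bozman describes can be sketched in a few lines. This is an illustrative model only, not the behaviour of any particular clustering product: the `Node` class, the five-second timeout and the 'least-loaded survivor' rule are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time, in seconds, when a heartbeat was last seen
    workloads: List[str] = field(default_factory=list)


def check_and_failover(nodes: List[Node], now: float, timeout: float = 5.0) -> List[str]:
    """Move workloads off any node whose heartbeat is older than `timeout`.

    Returns the names of the nodes judged to have failed.
    """
    alive = [n for n in nodes if now - n.last_heartbeat <= timeout]
    failed = [n for n in nodes if now - n.last_heartbeat > timeout]
    if failed and not alive:
        raise RuntimeError("no surviving node to fail over to")
    for dead in failed:
        while dead.workloads:
            # Hand each workload to whichever survivor currently has the fewest.
            target = min(alive, key=lambda n: len(n.workloads))
            target.workloads.append(dead.workloads.pop(0))
    return [n.name for n in failed]


if __name__ == "__main__":
    a = Node("a", last_heartbeat=100.0, workloads=["orders"])
    b = Node("b", last_heartbeat=92.0, workloads=["billing", "reports"])
    print(check_and_failover([a, b], now=101.0), a.workloads)
```

In the example, node `b` has been silent for nine seconds, so its two workloads are handed to node `a`. A production cluster would also have to guard against a node that is alive but merely unreachable, which this sketch ignores.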
In a survey of 325 IT executives, IDC found that respondents were pursuing high availability by deploying failover clusters or workload-balancing that allowed end users to access servers with little or no interruption even in the event of failure of hardware or software components within the cluster.
The research, released in 2004, showed that 70% were using clusters primarily to leverage their high-availability characteristics. "This reflects the importance of achieving high availability for data and applications in a networked world that values 'anytime' access and that has little tolerance for downtime - planned or unplanned," says Bozman.
But, in practice, clustering has presented some major challenges - not least the need to re-balance database and application workloads across the remaining processors in the event of a failure. Systems software companies - IBM, Oracle, Microsoft and others - have sought to address this, with varying degrees of success, by parallelising the workload and ensuring it can be balanced across all processing units.
In the IDC survey, amongst those using 'parallel' clustering, one database system dominated. Oracle, with the Real Application Clusters version of its database, was used in 69% of sites. No other database scored above 10%. RAC, available since June 2001, is a major aid to continuity at companies such as Deutsche Post.
As part of a push for efficiency and systems reliability, the German postal service wanted to bring together the 84 separate servers that powered its letter coding system into a single cluster.
Underpinning that cluster with RAC "meant we could ensure the system remained available 24x7," says Robert Leaman, director of systems architecture at Deutsche Post ITSolutions, the organisation's IT services provider. "Any systems failure would mean returning to hand sorting, which would have had a dramatic impact on the bottom line and therefore could not be allowed to happen."
Like other organisations, Deutsche Post is looking to take clustering to the next level, driven by a need to cut costs and overcome some of the limitations of clustering.
Live grid
Depending on the architecture, there are limits to the number of servers that can be clustered together, often forcing larger organisations to use high-end, expensive machines. Moreover, the act of clustering processing power establishes a single point of failure that might be vulnerable to events such as a power outage. Grid computing promises to address such issues.
"Grid computing involves networking together lots of commodity, low-cost servers acting like, and looking to the user like, one machine," says Chuck Phillips, president of Oracle. "The workload is balanced across machines as applications' needs dictate." In the company's 10g database product, Oracle has augmented its RAC software with grid control modules designed to do just that.
In many settings, grid will remove the need for duplicate hardware. If one node is unplugged or becomes unavailable for any other reason, the workload is simply spread automatically across remaining nodes.
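As a rough illustration of that automatic spreading, the sketch below redistributes the work of a lost node across the rest of a grid. It is a toy model under stated assumptions: the function name, the dictionary of node assignments and the 'least-loaded node first' rule are hypothetical and do not represent how Oracle's grid control modules work.

```python
from typing import Dict, List


def spread_after_loss(assignments: Dict[str, List[str]], lost_node: str) -> Dict[str, List[str]]:
    """Return new assignments with the lost node's work shared among the rest.

    The input mapping is left unmodified.
    """
    remaining = {node: list(work) for node, work in assignments.items() if node != lost_node}
    if not remaining:
        raise RuntimeError("no nodes left in the grid")
    for item in assignments.get(lost_node, []):
        # Give each orphaned item to the node with the least work so far.
        target = min(remaining, key=lambda node: len(remaining[node]))
        remaining[target].append(item)
    return remaining


if __name__ == "__main__":
    grid = {"n1": ["q1", "q2"], "n2": ["q3", "q4"], "n3": ["q5", "q6"]}
    print(spread_after_loss(grid, "n3"))
```

Here the two items on the unplugged node `n3` end up one apiece on `n1` and `n2`, so no single machine has to be sized as a full standby for another.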
The interest is particularly high for database grid deployments. At the halfway point of 2004, a survey by analyst group Evans Data found that 37% of database developers were implementing or planning grid computing architectures.
"While we're still in the infancy of grid computing, the technology looks very promising for database sites struggling with capacity issues," says Joe McKendrick, an analyst at Evans Data. "We just can't keep throwing more hardware at the problem."
And that thinking is driving early user demand. The UK's national mapping agency, Ordnance Survey, for example, cites a need for lower costs and higher resilience as reasons behind its adoption of Oracle 10g. It is a similar story at OGMA, the Portuguese aircraft maintenance company.
As more companies prove the model, grid will play an increasingly large role in the evolution of business continuity.





