When a web application fails it is often a very public affair. The unexpected arrival of a ‘flash crowd’ at a web site, a programming error that was missed during testing, a systems crash unsupported by hot back-up, a forced outage after a content error, a security hole that exposes data – there are plenty of reasons.
But despite the years of experience now accumulated, and the refining of process and tools for predicting and preventing such outages, the same mistakes are repeated countless times.
As Dr Peter Neumann, principal scientist at SRI International Computer Science Lab, highlights: “The underlying reality is shameful: most web application software is written oblivious to security principles, software engineering, operational implications and, indeed, common sense.”
When web problems occur at high-profile sites, the publicity is embarrassing and often damaging, but it is also typically short-lived. Sketchy details and attempts at damage limitation mean that there is little outside analysis of what went wrong. There are, however, common threads to the majority of disasters – and plenty of lessons to be learned that will help prevent other organisations falling into the same traps.
To dig deeper, Information Age pulled together what are arguably the 10 worst web application failures seen in the UK, and asked a handful of leading experts in the field to draw some guiding principles from the disasters. Some overriding conclusions emerged.
First, the risks are tangible. “All of these web-based application failures boil down to one issue – risk management,” says Vange Yianni, a technology manager at applications management software vendor Compuware. That means identifying the components most at risk and monitoring traffic, he says.
A sense of foreboding is certainly common. According to a recent survey of CIOs and CEOs by market watcher Winmark, less than half are confident that they “know what will happen” when launching a new web application. Moreover, the CIOs questioned believe that, each year, there is a ‘more than 80%’ chance that there will be a systems failure affecting their online services, corporate email, primary servers or company intranet systems as a result of the launch of a new application.
“Web application history is littered with disasters,” says Scott Miller, VP for Europe at web application performance software vendor Empirix, “yet companies fail to learn.” The root cause is common: a failure to adequately test the application for functionality, accurate content and scalability before it goes into production. “The bottom line,” says Paul Tarttelin, head of Intel Solution Services’ distributed solutions group in Europe, “is that many of these failures aren’t due to anything complex, but to a simple lack of planning.”
Nick Beitner, CTO of web applications developer the Aspect Group, which created Formula1.com and PopIdols.com, says what surprises him most is that sites continue to fail for similar reasons – most notably, being overwhelmed by their own success or the success of off-line promotions and events that drive visitors to them. He advocates some best practices:
- Plan for purpose – the level of traffic at critical events has to be better estimated, planned and catered for.
- Design for performance – clients need to ensure that it is the on-site processes, rather than the look and feel, that drive the architecture of a site.
- Engineer for performance – many site builders throw hardware at the challenge of managing high volume traffic or mission critical sites. It is better to have an intelligently engineered solution tailored to the needs of the site, rather than have a solution made to fit the hardware.
- Capacity management – it is important to balance the perceived merits of a big launch against the disproportionate cost required to support it. If possible, manage a launch to sit within planned capacity. Alternatively, if a disproportionate initial load cannot be balanced out, develop a two-tier system with a high-load but low-functionality element as the initial ‘landing site’ to absorb the atypical load.
- Test, test, test – rigorous testing and review is essential, but is frequently compromised by project time and budget overruns. Testing is not just about traffic loads – it straddles everything from business case assumptions testing, through unit testing to load testing and finally user acceptance testing. Poor testing in any of these areas is often the root cause of inevitable failure.
- Performance analysis – sites require monitoring regularly, not just in the early stages, in order to preempt problems and avoid outages or failures.
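The ‘test, test, test’ point above can be sketched in code. Below is a minimal, purely illustrative load-test harness: the `handle_request` stub and all figures are hypothetical, and a real harness would drive the live site over HTTP rather than call a local function.

```python
import concurrent.futures
import time

def handle_request(payload):
    # Stub standing in for the application under test; a real harness
    # would issue an HTTP request to the site here.
    time.sleep(0.001)
    return {"status": 200, "payload": payload}

def load_test(concurrency, requests_per_worker):
    """Fire concurrent requests and report volume, errors and worst latency."""
    latencies = []
    errors = 0

    def worker(worker_id):
        nonlocal errors
        for i in range(requests_per_worker):
            start = time.perf_counter()
            response = handle_request((worker_id, i))
            latencies.append(time.perf_counter() - start)
            if response["status"] != 200:
                errors += 1

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(worker, range(concurrency)))

    return {"requests": len(latencies), "errors": errors,
            "worst_latency": max(latencies)}

report = load_test(concurrency=20, requests_per_worker=10)
```

The same harness, run at several concurrency levels, covers both the ‘standard use case’ and the extreme spikes the experts warn about.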
“Good lifecycle management (testing, tuning, operational management) is the only way to ensure the performance of a web site,” concurs Andy Crosby, a product marketing director at applications performance management software company Mercury Interactive. “No one would think of driving a car that has not been tested for safety pre-delivery, serviced (tuned) on a regular basis, and without reading the instruments while driving (operational management). The exact same principles should be applied to web site management.”
Ensuring the code is sound is one thing, ensuring scalability is another. Crosby says that in seven out of 10 cases, sites only scale to 15% of the expected capacity at the time of their release. “There is far more to web performance than page download time – the full user experience needs to be catered for.” That end user experience consists of data entry, security, third-party data feeds, and so much more – as many of the aspects of the 10 disasters below show.
“And the risk is under-appreciated until it’s too late – the loss of reputation, the loss of customers, the loss of revenue and the risk of litigation,” adds Ian Forsythe, UK country manager for change management software company Serena Software.
As is also shown, the flaws are far too easy to avoid and far too dangerous to ignore.
1. Amazon.co.uk (www.amazon.co.uk)
Content error led to a surge in online traffic, server overload, temporary closure of site, and legal issues over customer orders.
The morning of 19 March 2003 won’t be forgotten easily at Amazon’s UK IT department. The wild mispricing of two lines of Pocket PCs caused a huge surge in order activity that forced the company to take down the site.
A data input mistake meant that customers could order a 64MB, 200MHz Hewlett-Packard iPAQ H1910, normally priced at £275, for just £7.32. Additionally, a top-of-the-range iPAQ H5450 was on offer for £23.04, rather than its £500 high street price. As word of the bargains spread via email, volumes soared and the site froze, first blocking customers from signing in and then crashing or being taken offline by systems administrators. Although tens of thousands of orders were taken, and confirmation emails sent, before the site went down, Amazon quickly issued a statement saying its conditions of sale meant it was under no obligation to honour them.
The mispricing of goods in a traditional retail store has a small-scale effect that rarely goes unnoticed for long and almost never draws public attention. It is a very different situation at an e-tailer – details spread rapidly via email, and competitors and the media are alerted.
Input errors are always going to occur, says Ian Davis, product marketing manager at online commerce software vendor ATG, but processes for approving content prior to publishing should be put in place, as should the flexibility to rapidly correct errors when they are spotted. An automated workflow system for catalogue change review should ensure pricing is correct before the data is published to the site, though that may blunt some of the dynamic nature of web commerce.
“The key is striking a balance between speed – as web pricing needs to be flexible to be competitive – and peer-approval, which reduces the chances of mispricing,” says Davis. He suggests several levels of approval process, where the length of the approval is pegged by the value of the goods. Most content management systems have this kind of workflow and editorial process built in.
In addition, there are packaged and bespoke tools that automate at least some of the checking. “Rules to check that price changes are in line with similar products and product history might have flagged this event,” says Nick Beitner at the Aspect Group. There is also software that spots spikes in usage, says Scott Miller of Empirix, and can track the cause almost instantly. If Amazon managers had been using that, he says, “then the price could have been corrected with minimal effect and the traffic [reduced] to normal levels.” Monitoring end-user transactions with content validation can highlight these errors before customers ever see them, he adds.
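Beitner’s history-based pricing rule can be sketched very simply. This is a hypothetical check – the 50% threshold and function name are assumptions, not any vendor’s product – that would have flagged the iPAQ entry before publication:

```python
def price_looks_wrong(new_price, price_history, max_drop=0.5):
    """Flag a price that falls more than max_drop below the recent average."""
    if not price_history:
        return False  # no history to compare against
    average = sum(price_history) / len(price_history)
    return new_price < average * (1 - max_drop)

# The figures quoted above: a £275 iPAQ mis-keyed as £7.32 is flagged;
# a plausible £259 discount against the same history is not.
flagged = price_looks_wrong(7.32, [275.0, 275.0, 269.0])
allowed = not price_looks_wrong(259.0, [275.0, 275.0, 269.0])
```

A rule like this would sit inside the catalogue approval workflow, holding suspect changes back for peer review rather than blocking them outright.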
2. Floodline (www.environment-agency.gov.uk/floodline)
System failed to scale during peak period, despite exhaustive load testing.
The Environment Agency’s National Floodline was set up in 2002 to provide instant information via a call centre or over the web about potential flood dangers across the UK. However, heavy rainfall over the Christmas and New Year of 2002/2003 caused a surge of activity at both channels, resulting in the web site crashing. As the risk of flooding rose, phone enquiries climbed to a peak of 32,650 calls a day, and as people failed to get through, many turned to the web site where they would execute complicated searches in order to establish the impact of flooding in their area. At the peak, on 2 January, 23,350 individuals were hitting the site. As the Environment minister told a parliamentary committee, the web site crash (which took the site out for several days) was not helped by the fact that so many people were at home over that period “and had little else to do except surf the net and look for flood information”.
“This type of scenario is all too common: organisations test for the ‘standard use case’ but fail to envisage the extreme cases,” says Scott Miller at Empirix. “Testing must involve both. Experience tells us that the most implausible situations are the ones most likely to occur. So make sure you allow for them,” says Miller.
The thinking was not clear, others suggest. “This is an example of the need to ‘plan for purpose’,” says Nick Beitner at Aspect. “Demand for this site was always going to be around information events or, indeed, emergencies, and should have been engineered to meet high spikes in traffic.”
While the Agency may not have had the funds to ramp up processing power, that may not have been necessary. “There are other solutions to scalability: a more intelligent engineering of the site processes and a more sophisticated load balanced architecture,” says Beitner.
“Testing a website can certainly help, but not always prevent an overload of a site,” says Tim Sedlack, product manager at web applications management software company NetIQ. “Setting the proper trigger levels – for instance, as you approach your previously tested limit – can alert server administrators.” The system could then have been designed to provide a ‘graceful fall-off under load’, probably through some form of throttling of high-load functions such as searching, adds Beitner.
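The trigger-level alert and the throttling of high-load functions described here might be combined in a sketch like the following. The class, thresholds and busy message are all illustrative assumptions, not a description of any monitoring product:

```python
import threading

class SearchThrottle:
    """Refuse expensive searches once concurrent load nears a tested limit."""

    def __init__(self, tested_limit, alert_fraction=0.8):
        self.tested_limit = tested_limit
        self.alert_threshold = int(tested_limit * alert_fraction)
        self.active = 0
        self.alerts = []          # stand-in for paging the administrators
        self._lock = threading.Lock()

    def try_search(self, run_search):
        with self._lock:
            if self.active >= self.tested_limit:
                # Graceful fall-off: shed the expensive function, keep the site up.
                return "Service busy - please try again shortly"
            self.active += 1
            if self.active >= self.alert_threshold:
                self.alerts.append(self.active)
        try:
            return run_search()
        finally:
            with self._lock:
                self.active -= 1
```

With a tested limit of five concurrent searches, the fourth raises an alert and the sixth is turned away with a polite message rather than crashing the whole site.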
3. Nectar (www.nectar.com)
Ill-conceived marketing campaign drove millions of people to the web site, forcing the company to suspend its service.
Customer loyalty scheme Nectar launched in September 2002 by mailing 10.5 million UK households, inviting loyalty cardholders from Barclaycard, Sainsbury’s, BP and Debenhams to sign up for a consolidated points scheme. Although Nectar expected a large number of customers to reply by mail or through its call centre, it did try to shave costs by encouraging customers to register online, offering a rewards bonus as an incentive. But in the run-up to the deadline for the bonus points (worth an estimated 50p), the move backfired when millions of customers flooded the web site. At one stage, 10,000 visitors an hour were hitting the site. Nectar was forced to suspend web registration for three days, despite having increased its Internet server capacity six-fold during the two days before the launch.
This is a common problem, says Empirix’s Scott Miller: “A lack of communication between IT and marketing to ensure the site was configured to handle the maximum demand that marketing could generate.”
Aspect CTO Nick Beitner agrees. “It was business critical for Nectar to do a big bang launch and use the web as a low-cost fulfilment device, rather than taking phone calls or processing coupons. It would be marketing’s job to estimate a reliable response via the Internet to which IT could deliver a solution. A staggered promotion, however, would have helped keep the system reliable, reduce costs and retain credibility.”
4. The Inland Revenue (www.inlandrevenue.gov.uk)
Security flaw breached rules on confidentiality and data protection, and forced system downtime, undermining campaign to win citizen confidence in online filing.
In May 2002, 10 months after its launch, the Inland Revenue’s self-assessment online tax returns service suffered a major security breach. A problem with one Internet service provider meant the system regarded two online filers as the same individual, so that they shared a single online session and could view each other’s submissions. There were 60 known ‘shared sessions’ and 13 where individuals were aware they were viewing others’ forms, but the Revenue admitted there were another 665 cases where it could not be certain a return had not been seen. In total, nearly 28,000 people used the system. A Commons Treasury Committee report said that nearly all those affected were customers of a specific Internet service provider, and that the complex problem had not involved hacking but “someone outside the revenue’s control storing information which should not have been stored”. The e-Envoy, Andrew Pinder, suggested that the problem lay in how the ISP (which was never publicly named) had interacted with the Revenue.
“Improper session handling under load is a fairly common situation,” says Scott Miller of Empirix. It is also easily fixed. “As well as testing in a lab environment, it is essential to test outside the firewall to observe how production applications behave in the real world,” he says. “In particular end-user load needs to be tested with multiple sessions… content validation would have reproduced [the Inland Revenue’s] problem before the system went live.”
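Miller’s point about testing multiple sessions with content validation can be illustrated with a toy example. The session store and names below are hypothetical; real tests would run against the deployed service from outside the firewall, but the invariant being checked is the same:

```python
import uuid

class SessionStore:
    """Toy session store: one token per login, never shared between filers."""

    def __init__(self):
        self.sessions = {}

    def login(self, taxpayer_id):
        token = uuid.uuid4().hex   # unique per login, not derived from the route
        self.sessions[token] = {"taxpayer": taxpayer_id}
        return token

    def view_return(self, token):
        return self.sessions[token]

def sessions_are_isolated(store, users):
    """Log several users in at once and check no two share a session."""
    tokens = {user: store.login(user) for user in users}
    if len(set(tokens.values())) != len(users):
        return False   # two filers resolved to the same session
    return all(store.view_return(t)["taxpayer"] == u for u, t in tokens.items())
```

Had a check of this shape been run with many concurrent logins through the ISP’s infrastructure, the shared-session behaviour would have surfaced before going live.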
However, testing for every possible security breach is not possible, others argue. “The best practices around security are those that alert you most quickly to a breach. Setting up preventative measures for the [vast number] of possibilities is time-consuming and resource intensive,” says Tim Sedlack at NetIQ.
5. 1901 Census (www.census.pro.gov.uk)
Site failed to scale to predicted demand levels, causing repeated outages and an eight-month rebuild of the site.
On 2 January 2002, the Public Records Office launched access to the computerised returns of the 1901 census of England and Wales. The interest from the public in the 32 million records was overwhelming, with more than 1.2 million people trying to access the site at one point. The site collapsed after three days and was only intermittently available in January before being suspended for two weeks in February. But the scaling problems ran deeper than the site’s multiple contractors (led by Qinetiq) appreciated, and it did not go live again until September 2002.
The expert observation was unanimous in this case. The architectural design was wrong. “The site was overwhelmed by the success of its launch. It was not designed to cope with the unprecedented levels of traffic experienced in the first three days,” says Nick Beitner at Aspect Group, which participated in the re-design.
While some amount of fine-tuning can always increase site capacity, choosing the wrong architecture at the outset can create barriers that place a hard limit on site scalability and capacity, echoes Ian Davis at ATG. “To scale sites [that have weak architectural design] your only option is to invest in expensive new hardware and that only takes you so far, meaning a complete rebuild.”
The ideal is a linearly scalable site, where the addition of a given degree of network bandwidth, CPUs and disk will increase the site capacity by a fixed number of users, says Davis. He draws a distinction between static web sites, which can be scaled by simply adding more resources, and personalised sites, which require a different architecture for true scalability.
6. Sportingindex.com (www.sportingindex.com)
Failed to cope with expected surge in activity, resulting in forced downtime and extensive loss of revenue.
On 17 June 2002, the online betting site suffered an embarrassing outage two days before the England versus Brazil World Cup game. The site was only offline for a day, but it resulted in extensive revenue loss, and the loss of customers to other betting sites during what was arguably the biggest betting event in a dozen years.
“It is truly a mission-critical failure when revenue is at stake,” says Aspect’s Beitner. “Again this is an event-based site that’s going to encounter high loads and traffic spikes. We can only assume the site had not been engineered to handle a high concentration of complex transactions. You have to design for purpose, test, and ensure you have the capacity and bandwidth. Beyond an intelligently engineered site, using a hosting partner for major events has to be the most secure and cost effective approach for this type of site.”
The guiding principle such cases highlight is the need for organisations to capacity test rather than load test, advises Vange Yianni of Compuware. “Capacity testing allows you to see the point at which your site will break and [so] plan for traffic peaks.” The situation also calls for monitoring software that can be set to alert administration staff when traffic is approaching dangerous levels. Managers then need to formulate contingency plans that kick in as soon as such an alert occurs, says Yianni.
In some cases, these don’t need to be terribly complex – a simple notice saying ‘try back later’ may suffice. “Although this isn’t ideal, it is better than allowing your site to crash,” says Yianni.
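Yianni’s distinction – capacity testing finds the point at which the site breaks, rather than confirming it survives an expected load – reduces to a simple ramp. The sketch below is illustrative, with a stand-in service that is assumed to fall over above 400 concurrent users:

```python
def find_capacity(serve, step=50, ceiling=10_000):
    """Ramp the load in steps until the service breaks; return the last safe level."""
    load = step
    while load <= ceiling:
        if not serve(load):
            return load - step   # break point found: previous level was safe
        load += step
    return ceiling

# Stand-in for the real site: survives up to 400 concurrent users.
capacity = find_capacity(lambda load: load <= 400)
alert_level = int(capacity * 0.8)   # page staff well before the break point
```

The measured capacity then feeds directly into the monitoring alerts and contingency plans described above.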
7. Egg.com (www.egg.com)
Upgrade/hardware crash triggered an outage; in the aftermath, the already slow systems could not cope with the pent-up demand.
For a whole week in January 2001, the pioneering online banking service at egg.com simply disappeared. Customers could access front-end information, but were blocked from getting to the secure servers that held their accounts. Only one explanation emerged: Egg’s developers had to take the site down while an upgrade was underway to add support for interactive TV. That was later modified to simply “a hardware problem”.
Whatever the cause, the extended length of the delay was blamed on a bottlenecking of customer activity as the service came back online. “Pent-up demand” caused soaring activity levels as customers grew increasingly frustrated – and worried – that they had no access to their bank accounts. Working with the site was still an extraordinary act of perseverance a year later in January 2002 when average web page download time over a 56K modem was 47.6 seconds – an issue that is now largely resolved.
The root of the problem here appeared to be a failure to plan maintenance and testing for times of least traffic, and a failure to communicate these plans to customers. Organisations need to monitor and understand their traffic flows and times of peak usage, and to plan and execute upgrades accordingly.
There is a simple principle here – the critically important act of thoroughly testing once changes have been made, says Yianni of Compuware. “For most organisations it is impractical and too costly for them to re-test the whole site when a change is made,” he says. As a result, deep knowledge of the application is required so that analysts can determine which parts of the application may be impacted and only re-test those.
8. Powergen (www.powergen.co.uk)
Security flaw exposed customer records, triggering large-scale customer inconvenience.
In July 2000, a Powergen customer showed that with rudimentary HTML knowledge he could gain access to the credit card details, names and addresses of 5,000 customers. “The information which had been accessed was in a file which, due to a technical error, was temporarily outside the security gate of the system,” a Powergen manager said at the time. Even though Powergen suggested the security breach was a “one-off incident”, all 5,000 customers were told to cut up their cards and were offered £50 each as compensation.
Powergen failed to diagnose a configuration error before the site went live, says Aspect’s Beitner. It could have overcome this by testing for content, not just functionality, by analysing the HTML returned in automated testing and comparing it to baseline expectations.
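The content test Beitner describes – analysing the HTML returned and comparing it to baseline expectations – might look like the sketch below. The card-number pattern is a crude illustration, not a production detector:

```python
import re

# Crude detector for an exposed card number - purely illustrative.
CARD_PATTERN = re.compile(r"\b\d{13,16}\b")

def content_check(html, required, forbidden=(CARD_PATTERN,)):
    """Compare a fetched page against baseline expectations; return problems."""
    problems = [text for text in required if text not in html]
    problems += [pattern.pattern for pattern in forbidden if pattern.search(html)]
    return problems
```

Run over every page an automated crawl can reach, a check like this flags both missing expected content and content that should never appear, such as the file Powergen left outside its security gate.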
Moreover, says Scott Miller at Empirix, a security audit should have been a must. “If you’re taking credit card numbers and anything personal you have to take every single precaution possible. There should be no exceptions, or problems like this are going to happen.”
9. Halifax (www.halifax.co.uk/sharedealing/home.shtml)
Inadequate testing resulted in systems failure and security compromise, which led to a breach of confidentiality and data protection laws.
On 26 September 1999, the Halifax suspended its Internet share-dealing service after customers found they were able to access other people’s accounts, allowing them to, theoretically, draw on each other’s bank accounts to execute trades. The company suspended its online dealing service, ShareXpress, saying an attempt to fix a software bug had gone wrong. Trying to assess the damage, Halifax had to reconstruct “each keystroke made” by customers throughout the morning’s trading.
There is a broad and general principle here, say consultants: It is essential that organisations don’t skimp on testing application functionality before production, or neglect to perform regression testing with bug fixes. By using reusable automated test scripts, regression testing can be incorporated into the bug fix process with minimal overhead.
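The reusable automated test scripts the consultants recommend amount to a suite that is re-registered once and rerun with every fix. A minimal sketch, in which the decorator and the sample test are illustrative:

```python
REGRESSION_SUITE = []

def regression_test(fn):
    """Register a test so that every bug fix reruns the whole suite."""
    REGRESSION_SUITE.append(fn)
    return fn

@regression_test
def test_own_account_only():
    # The Halifax flaw in miniature: a session must only see its own account.
    session = {"user": "customer-a", "accounts_visible": ["customer-a"]}
    assert session["accounts_visible"] == [session["user"]]

def run_suite():
    for test in REGRESSION_SUITE:
        test()
    return len(REGRESSION_SUITE)
```

Because the suite runs unattended, rerunning it after a bug fix costs almost nothing – which is precisely the ‘minimal overhead’ the consultants point to.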
10. Argos (www.argos.co.uk)
Content error led to temporary site closure and legal issues with orders.
In September 1999, retailer Argos found its newly-launched web site the centre of unwanted attention when a rudimentary programming error resulted in television sets worth £299 being ‘rounded up’ to the bargain price of £3. As word spread of the blunder, the site was swamped with more than £1 million of online orders – with one person even asking for 1,700 sets. Argos refused to honour the orders.
The Amazon error crept into a live application through manual input, making it difficult to avert. Argos’ programming error, by contrast, could have been avoided through simple testing and test management. For instance, a prerequisite to testing should have been to validate pricing as it appeared on the site.