Down and out: 3 lessons learned from Amazon’s 2-hour outage

The unprecedented AWS outage occurred when the S3 team was debugging a billing system issue and executed a command intended to remove a small number of servers from one of the S3 subsystems.

The command was entered incorrectly and, as a result, a much larger number of servers supporting S3 were removed. Restarting these servers, which serve the larger regions, and bringing them back to full capacity took longer than Amazon had estimated.

In the aftermath, what can we learn from Amazon’s outage? Here are three lessons learned.

Website downtime is really, really expensive

In e-commerce, a degradation in website performance means lost revenue – it’s as simple as that. When a site fails to load it’s the equivalent of shutting up shop, and consumers are also highly sensitive to how web pages perform – over a third of consumers say they would abandon a website that takes over 10 seconds to load.

During the Amazon outage, some of the world’s busiest e-commerce retailers took over 30 seconds to fully load their home pages.

In November 2016, Apica created the top 100 Web Performance Cyber Monday Index. Using the same list, it has evaluated how these companies were affected by the Amazon S3 outage.

  • 54 out of the top IR 100 were affected (20% performance decrease or more).
  • 3 sites suffered major performance degradation: Express, Lululemon and One Kings Lane.
  • For the affected websites, load times rose by an average of 29.7 seconds, with sites taking 42.7 seconds to load on average:
  • Disney Store – 94 seconds slower to load (1165% increase).
  • Target – 41.6 seconds slower to load (991% increase).
  • Nike – 12.3 seconds slower to load (642% increase).
  • Nordstrom – 29.8 seconds slower to load (592% increase) [Due to 3rd-party resource].
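
Each figure above pairs an absolute slowdown with a percentage increase, so the implied normal load time can be recovered with a little arithmetic. The sketch below assumes each percentage is measured relative to the pre-outage load time, which the index does not state explicitly; the function name is illustrative.

```python
def implied_load_times(extra_seconds: float, percent_increase: float) -> tuple[float, float]:
    """Return (baseline, during_outage) load times in seconds.

    Assumes percent_increase = extra_seconds / baseline * 100.
    """
    if percent_increase <= 0:
        raise ValueError("percent_increase must be positive")
    baseline = extra_seconds / (percent_increase / 100)
    return baseline, baseline + extra_seconds


# Figures quoted in the list above: (seconds slower, percentage increase).
reported = {
    "Disney Store": (94.0, 1165),
    "Target": (41.6, 991),
    "Nike": (12.3, 642),
    "Nordstrom": (29.8, 592),
}

for site, (extra, pct) in reported.items():
    baseline, outage = implied_load_times(extra, pct)
    print(f"{site}: about {baseline:.1f}s normally, about {outage:.1f}s during the outage")
```

On that assumption, the Disney Store home page normally loaded in roughly 8 seconds and took over 100 seconds during the outage.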

Putting all of your eggs in one basket is a really bad idea

The effects of the outage varied depending on how companies were using Amazon S3. Many newer websites pull data from several cloud databases stored all over the world, so they suffered only partial outages, such as slow image rendering or the loss of whatever data was stored on Amazon.

It’s critical that companies have a contingency plan for when a third-party provider goes down. One of the dangers of moving to the cloud is over-reliance on a single cloud vendor, and it is worth considering the merits of a more ‘decentralised’ cloud strategy.

Companies can deploy a multi-cloud strategy to mitigate the impact when one or more of their vendors has an outage.

Companies are also looking to the cloud for disaster recovery. Backing up data in the cloud is not as daunting a task as it sounds, nor is a cloud-based disaster recovery model markedly different from traditional disaster recovery approaches.

In a highly virtualised environment, disaster recovery sites that live in the cloud can be a safe, reliable and economical way for companies to avoid putting all their eggs in one basket.

Notably, during the outage, many of the companies on the index that had also stored data on local servers were able to pull images from those servers and use them to keep their sites up and running.
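
The fallback behaviour described above can be sketched as an ordered list of storage sources, tried in turn until one succeeds. This is a minimal illustration rather than any retailer's actual setup: `fetch_with_fallback`, `primary_cloud` and `local_server` are hypothetical names, and the two fetchers are stand-ins for real storage clients.

```python
from typing import Callable, Iterable


def fetch_with_fallback(
    key: str,
    sources: Iterable[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each (name, fetch) source in order and return the first success."""
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all sources failed for {key!r}: " + "; ".join(errors))


# Hypothetical stand-in: the primary cloud object store is down.
def primary_cloud(key: str) -> bytes:
    raise ConnectionError("primary object store unavailable")


# Hypothetical stand-in: a local server holding copies of the same images.
LOCAL_COPIES = {"hero.jpg": b"local image bytes"}


def local_server(key: str) -> bytes:
    return LOCAL_COPIES[key]


source, data = fetch_with_fallback(
    "hero.jpg",
    [("primary-cloud", primary_cloud), ("local-server", local_server)],
)
print(source)  # prints "local-server"
```

The order of the list encodes the preference, so adding a second cloud vendor ahead of the local copy is a one-line change.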

It pays to be prepared

Have a proactive monitoring system in place and a well-positioned crisis response should you need it.

While catching outages in advance is difficult, companies can use synthetic monitoring tools to determine what is failing, be it third-party services, website components, or a checkout process.

Infrastructure teams can monitor their website or other critical services from locations all over the world and be immediately alerted when any performance degradation occurs.
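
A very small synthetic check might look like the sketch below. It is an assumption-laden illustration, not a description of any monitoring product: it times only the download of a single URL rather than a full page render, the function names are hypothetical, and the 20% threshold simply mirrors the criterion used in the index above.

```python
import time
import urllib.request
from typing import Callable


def fetch_page(url: str, timeout: float = 30.0) -> None:
    """Download the response body; raises on network errors or timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def measure_seconds(url: str, fetch: Callable[[str], None] = fetch_page) -> float:
    """Time a single fetch of url using a monotonic clock."""
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def is_degraded(measured: float, baseline: float, threshold: float = 0.20) -> bool:
    """True when the load time is at least `threshold` above the baseline."""
    return measured >= baseline * (1 + threshold)


# Example with a stand-in fetcher so the sketch runs without network access.
elapsed = measure_seconds("https://example.com/", fetch=lambda url: time.sleep(0.05))
print(is_degraded(elapsed, baseline=0.01))
```

In practice the same check would run on a schedule from several locations, with the baseline taken from recent healthy measurements.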

These alerts can trigger manual or automatic failover and disaster recovery measures. Companies that utilise these tools can start root-cause analysis much more easily when an outage occurs, putting them in a better position to generate workarounds or to deploy recovery images from a local server.
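
One way to turn alerts into an automatic failover decision is sketched below. Requiring several consecutive degraded checks before switching is an assumption made here to avoid reacting to a single slow response; the class name and the default of three checks are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FailoverTrigger:
    """Signal failover after `required` consecutive degraded checks."""

    required: int = 3
    consecutive: int = 0
    active: bool = False

    def record(self, degraded: bool) -> bool:
        """Record one check result; return True while failover should be active."""
        if degraded:
            self.consecutive += 1
            if self.consecutive >= self.required:
                self.active = True
        else:
            # A healthy check resets the streak and ends failover.
            self.consecutive = 0
            self.active = False
        return self.active


trigger = FailoverTrigger(required=3)
checks = [True, True, False, True, True, True, False]
results = [trigger.record(degraded) for degraded in checks]
print(results)  # [False, False, False, False, False, True, False]
```

Ending failover on the first healthy check is a simplification; a production setup would more likely wait for a sustained recovery before switching back.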

One of the most important lessons from the Amazon outage is that preparation is key: if Amazon’s cloud can go down, any website is at risk of the same fate unless proper planning is in place.

Simple website failures and associated performance problems can be minimised just by conducting scheduled testing of your site and all its associated applications.

 
Sourced by Carmen Carey, CEO at Apica

Nick Ismail
