The unprecedented AWS outage occurred when the S3 team, while debugging a billing system issue, executed a command intended to remove a small number of servers from one of the S3 subsystems.
The command was entered incorrectly and, as a result, a much larger number of servers supporting S3 were removed. Restarting these servers, which serve the larger regions, and bringing those regions back to full capacity took longer than Amazon had estimated.
In the aftermath, what can we learn from Amazon’s outage? Here are three lessons learned.
Website downtime is really, really expensive
In e-commerce, a degradation in website performance means lost revenue – it’s as simple as that. When a site fails to load it’s the equivalent of shutting up shop, and consumers are also highly sensitive to how web pages perform – over a third of consumers say they would abandon a website that takes over 10 seconds to load.
During the Amazon outage, some of the world’s busiest e-commerce retailers took over 30 seconds to fully load their home pages.
In November 2016, Apica created the top 100 Web Performance Cyber Monday Index. Using this same list, it has evaluated how these companies were hit during the Amazon S3 outage.
- 54 out of the top IR 100 were affected (20% performance decrease or more).
- 3 sites suffered major performance degradation: Express, Lululemon and One Kings Lane.
- For the affected websites, the average increase in load time was 29.7 seconds; on average, sites were taking 42.7 seconds to load:
- Disney Store – 94 seconds slower to load (1165% increase).
- Target – 41.6 seconds slower to load (991% increase).
- Nike – 12.3 seconds slower to load (642% increase).
- Nordstrom – 29.8 seconds slower to load (592% increase), due to a third-party resource.
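To put the figures above in context, the normal load time of each site can be estimated from the two numbers reported for it. The short Python sketch below does that arithmetic. It assumes each percentage is measured relative to the site's normal load time, which the index does not state explicitly, and the function name is purely illustrative.

```python
def implied_load_times(slowdown_s, pct_increase):
    """Estimate normal and outage load times from the reported figures.

    Assumption: pct_increase is relative to the normal (baseline) load time,
    so slowdown_s = baseline * pct_increase / 100.
    """
    baseline_s = slowdown_s / (pct_increase / 100.0)
    return baseline_s, baseline_s + slowdown_s


# (seconds slower, percentage increase) as reported in the list above
reported = {
    "Disney Store": (94.0, 1165),
    "Target": (41.6, 991),
    "Nike": (12.3, 642),
    "Nordstrom": (29.8, 592),
}

for site, (slowdown_s, pct) in reported.items():
    baseline_s, outage_s = implied_load_times(slowdown_s, pct)
    print(f"{site}: about {baseline_s:.1f}s normally, about {outage_s:.1f}s during the outage")

# The averages across affected sites: 42.7s during the outage, 29.7s slower
# than usual, which implies an average normal load time of about 13 seconds.
average_baseline_s = round(42.7 - 29.7, 1)
```

On this reading, a site such as Disney Store normally loads in roughly eight seconds, so a 94-second slowdown pushes it far beyond the 10-second point at which over a third of consumers say they would leave.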
Putting all of your eggs in one basket is a really bad idea
The effects of the outage varied, depending on how each company was working with Amazon S3. Many of the newer websites pull data from various cloud databases stored all over the world, so they suffered only partial outages, such as slow image rendering or the loss of whatever data was stored on Amazon.
It’s critical that companies have a contingency plan for when a third-party provider goes down. One of the dangers of moving to the cloud is an over-reliance on a single cloud vendor, and it is worth considering the merits of a more ‘decentralised’ cloud strategy.
Companies can deploy a multi-cloud strategy to mitigate issues when one or more of their vendors has an outage.
Companies are also looking to the cloud for disaster recovery. Backing up data in the cloud is not the daunting task it sounds, nor is a cloud-based disaster recovery model markedly different from the approaches used in traditional disaster recovery.
By deploying a highly virtualised environment, disaster recovery sites that live in the cloud can be a safe, reliable and economical way to ensure that companies don’t put all their eggs in one basket.
Notably, during the outage, many of the customers on the index that had stored data across local servers were able to pull images from these servers and use them to keep sites up and running.
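As an illustration of that kind of fallback, the Python sketch below tries a list of origins in order and returns the first copy of an asset it can fetch. It is a minimal sketch, not a description of how any company on the index actually did it: the function names and the example hostnames are hypothetical, and it assumes the same asset is already stored under the same path at every origin.

```python
import urllib.request


def http_fetch(url, timeout=3.0):
    """Fetch a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(path, origins, fetch=http_fetch):
    """Return the asset at `path` from the first origin that responds.

    `origins` is an ordered list of base URLs, for example a cloud bucket
    first and a local server second (hypothetical names, see below).
    """
    last_error = None
    for origin in origins:
        url = origin.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except OSError as exc:
            # urllib's URLError/HTTPError, timeouts and connection failures
            # are all OSError subclasses; move on to the next origin.
            last_error = exc
    raise RuntimeError(f"all origins failed for {path}") from last_error


# Hypothetical usage: try the cloud bucket first, then a local image server.
# image = fetch_with_fallback(
#     "/img/hero.jpg",
#     ["https://assets.example-cloud.com", "https://images.example-local.com"],
# )
```

The short timeout matters as much as the fallback list: without it, a request to a struggling provider can hang for long enough that the page is effectively down anyway.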
It pays to be prepared
Have a proactive monitoring system in place and a well-positioned crisis response should you need it.
While catching outages in advance is difficult, companies can use synthetic monitoring tools to determine what is failing, be it third-party services, website components, or a checkout process.
Infrastructure teams can monitor their website or other critical services from locations all over the world and be immediately alerted when any performance degradation occurs.
These alerts can trigger manual or automatic failover and disaster recovery measures. Companies that utilise these tools find it much easier to start root-cause analysis when an outage occurs, putting them in a better position to generate workarounds or to deploy recovery images from a local server.
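The core of such a check can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular monitoring product works: the names are hypothetical, the 20% threshold is borrowed from the definition of "affected" used in the index above, and timing a single URL is only a rough stand-in for a full page load with all of its third-party resources.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    name: str
    elapsed_s: Optional[float]  # None when the probe failed outright
    degraded: bool


def load_url(url, timeout=30.0):
    """Probe: download one URL (an assumption; real tools load whole pages)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def run_check(name, probe, baseline_s, threshold=0.20, clock=time.monotonic):
    """Time one probe and flag it if it fails or is slower than
    baseline_s by more than `threshold` (0.20 means 20% slower)."""
    start = clock()
    try:
        probe()
    except Exception:
        # Any failure of the probe counts as degraded service.
        return CheckResult(name, None, True)
    elapsed_s = clock() - start
    return CheckResult(name, elapsed_s, elapsed_s > baseline_s * (1 + threshold))


# Hypothetical usage: check a home page against a 2-second baseline.
# result = run_check("home page", lambda: load_url("https://www.example.com/"), 2.0)
# if result.degraded:
#     print("ALERT:", result)  # here a real setup would page someone or fail over
```

Running one such check per component (home page, third-party script, checkout step) from several locations is what lets a team see which part is failing rather than only that something is slow.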
One of the most important lessons to come from the Amazon outage for any company is that preparation is key, and if Amazon’s cloud can go down, then any website is at risk of the same fate unless proper planning is in place.
Simple website failures and associated performance problems can be minimised just by conducting scheduled testing of your site and all its associated applications.
Sourced by Carmen Carey, CEO at Apica