Amazon apologises for Xmas eve cloud outage

Amazon.com has apologised for a cloud outage that took video streaming service Netflix offline on Christmas Eve.

The outage, which affected Amazon’s Elastic Load Balancing Service, occurred after data that records the state of load balancing systems – which direct network traffic to servers – was inadvertently deleted during a maintenance process.

"This process was run by one of a very small number of developers who have access to this production environment," Amazon said in a statement. "Unfortunately, the developer did not realise the mistake at the time."

Amazon says that "only a fraction" of the load balancing systems were affected, but that "the impacted load balancers saw significant impact for a prolonged period of time". It was not until 24 hours after the first issues were identified that Amazon confirmed the service was fully restored.

To prevent such an outage occurring again, Amazon has tightened up its change management process and altered its data recovery process.

The most noticeable impact of the outage was on US customers of Netflix. The company said that video streaming services were unavailable on certain devices, including games consoles, for around seven hours on Christmas Eve.

In a blog post explaining the outage from Netflix’s perspective, the company’s director of cloud architecture (and Brit) Adrian Cockcroft wrote that it uses "hundreds" of ELB instances. "Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls.

"Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them," he wrote.

"It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud," Cockcroft added.

Netflix is among Amazon.com’s largest cloud customers. The company is transparent about its technology usage, and has shared a number of open source tools for managing cloud services. These include Chaos Monkey, a cloud testing system.

Ben Rossi

Ben was Vitesse Media's editorial director, leading content creation and editorial strategy across all Vitesse products, including its market-leading B2B and consumer magazines, websites, research and...
