Failed firmware upgrade caused Microsoft data centre to overheat

Microsoft has revealed technical details of a data centre outage that affected users of its cloud email services earlier this week.

The outage, which started on Tuesday afternoon and lasted 16 hours, was triggered by a failed firmware upgrade to a component of the physical plant in one of Microsoft's data centres.

The upgrade failed "in an unexpected way", wrote Arthur de Haan, VP for test and engineering, in an official blog post. "This failure resulted in a rapid and substantial temperature spike in the data centre".

That triggered safeguards designed to protect servers from overheating, which left email inboxes hosted on those servers inaccessible. It "also prevented any other pieces of our infrastructure to automatically failover and allow continued access," de Haan wrote.

Fixing the fault required human intervention, he said, which "is not the norm for our services and added significant time to the restoration".

"We do want to sincerely apologize to anyone that was unable to access their email during the interruption," de Haan said. "Outages are something we take very seriously and invest a significant amount of our time and energy in doing our best to prevent."

The plant in a typical data centre includes air conditioning units and power infrastructure.

It is not uncommon for air conditioning failures to trigger IT outages. In 2010, for example, a failed air conditioning unit at a London data centre took music streaming site Spotify offline for a number of hours.

Pete Swabey

Pete was Editor of Information Age and head of technology research for Vitesse Media plc from 2005 to 2013, before moving on to be Senior Editor and then Editorial Director at The Economist Intelligence...
