Yesterday’s disruption to Microsoft’s Windows Azure cloud computing service, which lasted for up to 23 hours and affected customers around the world, appears to have been caused by the fact that it is a leap year. "While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year," wrote Bill Laing, Microsoft’s corporate VP of server and cloud, on a company blog.
Laing wrote that the issue had been discovered at 17:45 pm PST on Tuesday, February 28th, (01:45 on Wednesday, February 29th GMT). Writing 23 hours laster, at 16:46 pm PST yesterday, he revealed that "some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality". The UK government’s new cloud service procurement portal CloudStore, which is hosted on Azure, was confirmed as being offline at 12:20 pm GMT, but was back in action by 17:40 pm GMT. Gartner analyst Kyle Hilgendorf wrote in a blog post yesterday that the Azure outage demonstrates a few ways in which cloud service providers need to mature. Some customers told him that Azure services that Microsoft claimed were still operational on its performance monitoring portal were in fact "√degraded to such a point of being unusable", Hilgendorf wrote. "Providers must start including performance and response [service level agreements] into their standard service," he said. He criticised the lack of updates from Microsoft during the outage. "Some customers told me this morning they feel completely in the dark." However, he also added that organisations that use cloud need to design their applications to withstand such outages. "If you will be running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it," he wrote. "This may also mean that you have to hire or retain some very skilled cloud staff." This is not yet the longest lasting cloud service outage to date. In April last year, an outage at Amazon.com’s North Virginia data centre was still affecting users of its Amazon Web Service cloud offerings after five days.