Microsoft has revealed some details surrounding what it thinks caused the recent worldwide outage of Office 365 and some of its other platforms.
Users were left high and dry after Office 365 went down across the globe, with other services including Microsoft Teams, Office.com, Power Platform, and Dynamics 365 also affected.
According to Microsoft, the outage was caused by a bug in the deployment of an Azure AD service update.
A preliminary report by the company found that the update was released too early, having not gone through the company’s usual testing regime. This typically involves progressing through five “rings” before general release, allowing Microsoft to trial any changes or upgrades with a controlled group of testers.
However, this time a bug in Microsoft’s Safe Deployment Process (SDP) caused the update to be deployed to all rings at once rather than only to the first test ring.
“Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries,” Microsoft said in its preliminary post-incident report.
“Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.”
“In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade.”
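The failure Microsoft describes, where unreadable deployment metadata led to every ring being targeted, can be illustrated with a minimal Python sketch. This is not Microsoft's SDP code, and the report does not say how the defect actually behaved internally. The ring names beyond "validation" and "inner", the `target_ring` metadata key, and both functions are hypothetical, chosen only to contrast a resolver that widens its scope on bad input with one that refuses to deploy.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ring names. The report mentions a validation ring, an inner
# Microsoft-only ring, and production, spread over five rings in total.
RINGS = ["validation", "inner", "production-1", "production-2", "production-3"]


@dataclass
class Deployment:
    change_id: str
    metadata: dict  # assumed shape, e.g. {"target_ring": "validation"}


def resolve_targets_unsafe(deployment: Deployment) -> List[str]:
    """Illustrative failure mode: unreadable metadata widens to every ring."""
    ring = deployment.metadata.get("target_ring")
    if ring in RINGS:
        return [ring]
    # No usable target was found, so the change goes everywhere at once.
    return list(RINGS)


def resolve_targets_safe(deployment: Deployment) -> List[str]:
    """Fail closed: refuse to deploy when the target ring cannot be read."""
    ring = deployment.metadata.get("target_ring")
    if ring not in RINGS:
        raise ValueError(f"unreadable target ring: {ring!r}")
    return [ring]


if __name__ == "__main__":
    # A misspelled key stands in for metadata the system cannot interpret.
    broken = Deployment("change-001", {"targetRing": "validation"})
    print(resolve_targets_unsafe(broken))
    try:
        resolve_targets_safe(broken)
    except ValueError as err:
        print("deployment blocked:", err)
```

The design point is that a staged rollout only protects customers if a parsing failure stops the deployment instead of broadening it.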
Following the unexpected release, Microsoft says it attempted to roll back the change “within minutes of impact” using its automated rollback systems, which would normally have limited the duration and severity of the impact.
“However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue,” the company’s report said, explaining why the outage took so long to resolve.
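The dependency between automated rollback and intact deployment metadata can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of Microsoft's tooling: the `previous_version` key, the `roll_back` function, and the two callbacks are all hypothetical.

```python
from typing import Callable


def roll_back(
    metadata: dict,
    automated: Callable[[str], None],
    manual: Callable[[], None],
) -> str:
    """Use the automated path when metadata is usable, else fall back to manual.

    Returns the name of the path that was taken.
    """
    # Assumption: automated rollback needs to know which version to restore,
    # and it reads that from the same metadata the deployment used.
    previous = metadata.get("previous_version")
    if not isinstance(previous, str) or not previous:
        manual()  # slow path: operators restore the service by hand
        return "manual"
    automated(previous)  # fast path: restore the recorded version
    return "automated"


if __name__ == "__main__":
    def auto(version: str) -> None:
        print("restoring", version)

    def by_hand() -> None:
        print("paging operators for manual rollback")

    print(roll_back({"previous_version": "v41"}, auto, by_hand))
    print(roll_back({"previous_version": None}, auto, by_hand))
```

When the fast path and the deployment share one metadata source, a single defect can break both, which is consistent with the extended mitigation time the report describes.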
Users who were already logged in to Office 365 or any of the other services were unaffected.