By Matthew Dublin
It turns out that Amazon’s big failure last week was caused by a malfunction of its storage service, Elastic Block Store (EBS), a replicated storage resource for Amazon’s EC2 virtual compute instances.
Amazon finally came clean with a lengthy report on its Amazon Web Services blog detailing how a series of errors led to the widespread service outage. Basically, a network change made on April 21 to upgrade capacity kicked off a domino effect of cascading failures.
The upgrade went wrong when primary network traffic was mistakenly shifted onto a slower, secondary network that couldn’t handle the volume of data. It took an Amazon team about 12 hours to get the networking hiccup under control, but then the real issue was recovering customer data, a significant amount of which is reported to be permanently lost. Amazon’s usual protocol when a node fails is to re-replicate its data before the node is reused, but the replication mechanisms were maxed out, and adding enough physical capacity to accommodate the re-replication took the team two days.
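To get a feel for how a re-mirroring storm can overwhelm a cluster, here is a toy model. This is a minimal sketch, not Amazon’s actual system; the capacity, volume size, and failure counts are made-up numbers for illustration only.

```python
# Toy model of a re-mirroring storm in a replicated block-store cluster.
# All figures are hypothetical, for illustration; not Amazon's implementation.

TOTAL_CAPACITY = 1000   # storage units across the cluster
VOLUME_SIZE = 10        # units consumed by one volume replica

volumes_in_use = 85                       # volumes with healthy replicas
free = TOTAL_CAPACITY - volumes_in_use * VOLUME_SIZE

# Normally only a node or two fails at a time, and the free headroom
# easily absorbs the new replicas. A network partition instead makes
# many nodes unreachable at once, so every affected volume tries to
# create a fresh replica in that same small pool of free space.
partitioned_volumes = 40

re_mirrored = min(partitioned_volumes, free // VOLUME_SIZE)
stuck = partitioned_volumes - re_mirrored   # volumes left with nowhere to go

print(f"free space before the storm: {free} units")
print(f"volumes re-mirrored:         {re_mirrored}")
print(f"volumes stuck:               {stuck}")
```

With these numbers, only 15 of the 40 affected volumes find a new home before the cluster’s free space runs out, which is why Amazon had to truck in physical capacity before recovery could finish.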
Among the lessons learned are the need for an improved network upgrade process, more free capacity in each EBS cluster, and better isolation between Availability Zones. Despite all the “I told you so” chatter from skeptics of the cloud, StorageMojo contends that the Amazon team’s response was commendable and that, ultimately, the question is not the reliability of public clouds like Amazon’s EC2 versus private clouds, but rather the kind of architecture each makes possible. While traditional large-scale networks have been built to maximize Mean Time Between Failures (MTBF), clouds are designed for a fast Mean Time To Repair (MTTR).
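A quick back-of-the-envelope illustrates why that distinction matters. Steady-state availability is roughly MTBF / (MTBF + MTTR), so shrinking repair time pays off even when failures become more frequent. The figures below are hypothetical, chosen only to show the trade-off:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures for illustration only.
# MTBF-focused design: rare failures, but slow, manual recovery.
print(f"{availability(mtbf_hours=10_000, mttr_hours=24):.5f}")   # ~0.99761

# MTTR-focused design: failures ten times as frequent, but automated
# recovery brings nodes back in about six minutes.
print(f"{availability(mtbf_hours=1_000, mttr_hours=0.1):.5f}")   # ~0.99990
```

In this sketch the failure-prone but fast-healing system ends up more available than the one engineered never to break, which is the bet cloud architectures make.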
The AWS team concludes its failure novella with a sincere apology and a promise of service credits for affected users.