Amazon has released a lengthy explanation of its recent Elastic Compute Cloud (EC2) and Relational Database Service (RDS) downtime, blaming "stuck" data volumes and the system's inability to work around them for the failure of Amazon Web Services. According to the AWS team, a network change in an "Availability Zone" in the US East region caused the nodes in that zone to get "stuck", refusing to read or write data. Other nodes then became stuck themselves when they tried to access those initially stuck nodes.
The system is designed so that a node which cannot communicate with its original partner automatically looks for a new, stable node and re-mirrors its data there, with user access to that data taken offline for the duration. As the number of stuck nodes mounted, this caused the zone to grind to a halt. Re-mirroring usually takes milliseconds, Amazon says, but the scale of the disconnect spawned a "re-mirroring storm":
"In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state."
Amazon goes into detail as to how its engineers brought the system back online, but the important stat for subscribers is that "0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state." It's not exactly clear how much data loss that actually translates to.
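The capacity exhaustion behind the "re-mirroring storm" can be illustrated with a toy model. The sketch below is not Amazon's code: the function names, the volume sizes and the `added_gb_per_round` parameter are all hypothetical, and it assumes simply that each volume which lost its replica needs free space equal to its own size before it can re-mirror.

```python
def remirror_round(free_gb, pending_sizes_gb):
    """Run one pass in which every pending volume looks for free space.

    Returns the remaining free capacity and the sizes of the volumes
    that still could not find room for a new replica.
    """
    still_stuck = []
    for size in pending_sizes_gb:
        if size <= free_gb:
            free_gb -= size          # the new replica claims free space
        else:
            still_stuck.append(size)  # no room: the volume stays "stuck"
    return free_gb, still_stuck


def simulate(free_gb, lost_replica_sizes_gb, rounds=3, added_gb_per_round=0):
    """Return the number of stuck volumes after each retry round.

    added_gb_per_round is a hypothetical knob for capacity brought
    online between rounds; it is an assumption of this sketch.
    """
    pending = list(lost_replica_sizes_gb)
    history = []
    for _ in range(rounds):
        free_gb, pending = remirror_round(free_gb, pending)
        history.append(len(pending))
        free_gb += added_gb_per_round
    return history


if __name__ == "__main__":
    # A small disconnect: plenty of free space, nothing stays stuck.
    print(simulate(1000, [100] * 3))
    # A large disconnect: free space runs out and retrying does not help.
    print(simulate(1000, [100] * 50))
    # The same event with extra capacity arriving between rounds.
    print(simulate(1000, [100] * 50, added_gb_per_round=2000))
```

In this model a handful of disconnected volumes re-mirror in the first round, while a large simultaneous disconnect drains the free pool and leaves the remainder retrying indefinitely until more space appears.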
In the future, Amazon says it will be adding support for Virtual Private Cloud users to take advantage of system capacity in other Availability Zones, reducing the likelihood of downtime should their current zone go down. Meanwhile, it will also be boosting the monitoring and recovery tools it uses, including automating some of the processes, and giving users themselves the ability to make snapshots of "stuck" volumes that can be restored in other zones.
Finally, addressing one of the biggest complaints of the past week, Amazon says it will re-examine customer communications, giving regular, frequent updates rather than only alerting users when there is something new to say. Affected AWS customers will automatically receive a "10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone."
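For a rough sense of what that credit might amount to, the arithmetic can be sketched as below. This assumes the credit is simply ten times a customer's daily charge for the three named services in the affected zone; the function name and the dollar figures are hypothetical, and Amazon's actual billing calculation may differ.

```python
def estimate_credit(daily_charges_usd, credit_days=10):
    """Estimate the credit as credit_days times the total daily charge.

    daily_charges_usd maps a service name to its daily cost in the
    affected Availability Zone (illustrative figures only).
    """
    return credit_days * sum(daily_charges_usd.values())


if __name__ == "__main__":
    # Hypothetical daily charges for one customer, in US dollars.
    charges = {"EBS": 2.0, "EC2": 4.0, "RDS": 6.0}
    print(estimate_credit(charges))
```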