💡 ~70% of data center outages are caused by human error
🌳 AWS Outage (April 2011, us-east-1)
- Several companies using EC2 went down
- The outage lasted about 3.5 days, and a small fraction of data was permanently lost
Background
- AWS Regions are geographically separate areas (e.g., us-east-1, us-west-1)
- Availability Zones (AZs) are isolated data centers (collections of racks) within each region
- AWS EBS (Elastic Block Store) provides network-attached block storage volumes that EC2 instances can mount
- A primary, high-capacity network carries traffic between EC2, the EBS nodes, and the EBS control plane; a secondary, lower-capacity network is used as a backup/overflow
- Control plane traffic: commands, like “open a file” or “close a file”
- Data plane traffic: the actual data itself, like “EC2, fetch the contents of this file”; much more voluminous
- EBS volumes are programmed to re-replicate upon failure: aggressive re-mirroring to preserve strong consistency (see the sketch after this list)
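The sketch below is a minimal, hypothetical model of the ideas above; none of the class or method names (`Replica`, `Volume`, `EBSControlPlane`) are real AWS APIs. It separates data-plane writes from control-plane commands and shows why a primary that loses its backup immediately generates extra traffic.

```python
# Minimal sketch of the replication model described above.
# All names are illustrative, not AWS's implementation.

class Replica:
    """One copy of a volume's data living on an EBS node."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.blocks: dict[int, bytes] = {}   # data-plane payload


class Volume:
    """A volume is a primary replica plus (ideally) one backup replica."""
    def __init__(self, volume_id: str, primary: Replica, backup: "Replica | None"):
        self.volume_id = volume_id
        self.primary = primary
        self.backup = backup

    # Data-plane traffic: the bytes themselves, high volume.
    def write(self, block: int, data: bytes) -> None:
        self.primary.blocks[block] = data
        if self.backup is not None:
            self.backup.blocks[block] = data      # synchronous mirroring

    # Failure handling: aggressive re-mirroring for strong consistency.
    def on_backup_lost(self, control_plane: "EBSControlPlane") -> None:
        self.backup = None
        # A primary refuses to keep running un-mirrored; it immediately asks
        # the control plane for a new backup node, generating extra traffic.
        control_plane.request_new_mirror(self)


class EBSControlPlane:
    """Control-plane traffic: commands about volumes, not their contents."""
    def __init__(self, free_nodes: list[str]):
        self.free_nodes = free_nodes

    def create_volume(self, volume_id: str) -> Volume:
        # Control-plane command: allocate a primary and a backup node.
        return Volume(volume_id, Replica(self.free_nodes.pop()),
                      Replica(self.free_nodes.pop()))

    def request_new_mirror(self, volume: Volume) -> None:
        # If no capacity is free, the volume stays un-mirrored ("stuck")
        # and its owner will keep retrying.
        if self.free_nodes:
            volume.backup = Replica(self.free_nodes.pop())


cp = EBSControlPlane(free_nodes=["node-a", "node-b", "node-c"])
vol = cp.create_volume("vol-123")
vol.write(0, b"hello")     # data plane: goes to primary and backup
vol.on_backup_lost(cp)     # failure: re-mirror onto the one spare node
```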
Timeline
- A routine primary-network capacity upgrade was run overnight (the usual window for such updates)
- Traffic had to be shifted off the router being upgraded onto other primary-network routers
- ⚠️ HUMAN ERROR: someone accidentally shifted the traffic for one primary-network router onto a low-capacity secondary-network router instead
- This disconnected primary replicas from their backup replicas
- Each primary replica believed it had lost its backup and began generating extra traffic, looking for a new node to mirror to
- The secondary network already had low capacity, so it was quickly overwhelmed
- ~13% of EBS volumes in the affected Availability Zone became “stuck” → a re-mirroring storm (i.e., primary and backup no longer agree)
- Primary replicas no longer trusted their backups and kept aggressively trying to re-mirror; with no free capacity available, they were stuck in a retry loop (see the re-mirroring simulation sketched after the timeline)
- The network could not serve control plane requests (CreateVolume is the most common EBS API request) → these requests filled up the request buffers → AWS had to disable these control plane requests to slow things down (see the backpressure sketch after the timeline)
- ALSO ⚠️ the EBS code had a race condition, which gets triggered under high request rates
- Re-mirroring is a negotiation between the EC2 node, the EBS node, and the EBS control plane (to ensure there is only one primary)
- The race condition caused EBS nodes to fail → even more negotiation to create more replicas → a “brown out” of the EBS APIs (see the race-condition sketch after the timeline)
- The team eventually figured out how to recover the EBS servers, restoring the affected Availability Zone and control plane communication
- 3 days later, only 1% of volumes still stuck
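The first sketch below is a toy simulation of the re-mirroring storm: every volume that lost contact with its backup retries re-mirroring at once, and because spare capacity is far smaller than the number of stranded volumes, most of them stay stuck while their retries keep generating traffic. All numbers and names are made up for illustration.

```python
# Toy simulation of the re-mirroring storm: stranded primaries all hunt for
# spare capacity at the same time. Numbers are invented, not AWS's.

TOTAL_VOLUMES = 1000
SPARE_CAPACITY = 50          # far fewer spare slots than stranded volumes

def simulate(rounds: int = 5) -> None:
    spare_slots = SPARE_CAPACITY
    stuck = list(range(TOTAL_VOLUMES))   # volumes whose backup was "lost"
    for r in range(1, rounds + 1):
        retry_traffic = len(stuck)       # every stuck volume retries each round
        recovered = set()
        for vol in stuck:
            if spare_slots > 0:          # found a node with free space
                spare_slots -= 1
                recovered.add(vol)
        stuck = [v for v in stuck if v not in recovered]
        print(f"round {r}: re-mirror attempts={retry_traffic}, "
              f"still stuck={len(stuck)}, spare slots left={spare_slots}")

if __name__ == "__main__":
    simulate()
```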
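The second sketch illustrates the control-plane mitigation as a generic backpressure pattern: a bounded request queue that, once its backlog grows past a threshold, disables the most voluminous API (CreateVolume) rather than letting the buffer fill up. The class and method names are hypothetical, not AWS's implementation.

```python
# Sketch of control-plane backpressure: shed the heaviest API once the
# request backlog grows, instead of letting the buffer fill. Hypothetical.
from collections import deque

class ControlPlaneQueue:
    def __init__(self, capacity: int, shed_threshold: float = 0.8):
        self.capacity = capacity
        self.shed_threshold = shed_threshold
        self.queue: deque[str] = deque()
        self.disabled_apis: set[str] = set()

    def submit(self, api: str) -> bool:
        """Returns True if the request was accepted."""
        if api in self.disabled_apis or len(self.queue) >= self.capacity:
            return False
        self.queue.append(api)
        # Backpressure: once the backlog passes the threshold, disable the
        # most voluminous API so the system can drain.
        if len(self.queue) >= self.shed_threshold * self.capacity:
            self.disabled_apis.add("CreateVolume")
        return True

    def drain_one(self) -> "str | None":
        return self.queue.popleft() if self.queue else None


q = ControlPlaneQueue(capacity=10)
accepted = sum(q.submit("CreateVolume") for _ in range(20))
print(f"accepted {accepted} of 20 CreateVolume requests; "
      f"disabled: {q.disabled_apis}")
```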
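The third sketch illustrates a check-then-act race of the kind described above in the re-mirroring negotiation (this is not AWS's actual code). The invariant is "exactly one primary"; without mutual exclusion, two concurrent negotiations can both believe they won.

```python
# Illustration of a check-then-act race in a re-mirroring-style negotiation.
# Invariant: exactly one primary. Without a lock, two concurrent
# negotiations can both pass the check and both "become" primary.
import threading
import time

class VolumeNegotiation:
    def __init__(self):
        self.primaries = 0          # invariant: must end up == 1
        self.lock = threading.Lock()

    def claim_primary(self, use_lock: bool) -> None:
        if use_lock:
            with self.lock:
                self._check_then_act()
        else:
            self._check_then_act()

    def _check_then_act(self) -> None:
        if self.primaries == 0:     # check ...
            time.sleep(0.01)        # a concurrent negotiation sneaks in here
            self.primaries += 1     # ... then act (not atomic)

def run(use_lock: bool) -> int:
    neg = VolumeNegotiation()
    threads = [threading.Thread(target=neg.claim_primary, args=(use_lock,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return neg.primaries

print("primaries without lock:", run(False))   # usually 2: invariant broken
print("primaries with lock:   ", run(True))    # 1
```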