💡 ~70% of data center outages are caused by human error
🌳 AWS Outage (April 2011, us-east-1)
- Several companies using EC2 went down
- The outage lasted about 3.5 days, and a small fraction of data was permanently lost
Background
- AWS Regions are geographically separate areas (e.g., us-east-1, us-west-1)
- Availability Zones (AZs) are isolated data centers (collections of racks) within each region
- AWS EBS (Elastic Block Store) provides network-attached block storage volumes that EC2 instances can mount
- A primary, high-capacity network carries traffic between EC2, the EBS nodes, and the EBS control plane; a secondary, lower-capacity network is used as a backup/overflow
- Control plane traffic: commands, like “open a file” or “close a file”
- Data plane traffic: the actual data itself, like “EC2, fetch the contents of this file”; much more voluminous
- EBS volumes are programmed to re-replicate upon failure: aggressive re-mirroring to preserve strong consistency (see the sketch after this list)
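The sketch below is a minimal, hypothetical model of the ideas above; none of the class or method names (`Replica`, `Volume`, `EBSControlPlane`) are real AWS APIs. It separates data-plane writes from control-plane commands and shows why a primary that loses its backup immediately generates extra traffic.

```python
# Minimal sketch of the replication model described above.
# All names are illustrative, not AWS's implementation.

class Replica:
    """One copy of a volume's data living on an EBS node."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.blocks: dict[int, bytes] = {}   # data-plane payload


class Volume:
    """A volume is a primary replica plus (ideally) one backup replica."""
    def __init__(self, volume_id: str, primary: Replica, backup: "Replica | None"):
        self.volume_id = volume_id
        self.primary = primary
        self.backup = backup

    # Data-plane traffic: the bytes themselves, high volume.
    def write(self, block: int, data: bytes) -> None:
        self.primary.blocks[block] = data
        if self.backup is not None:
            self.backup.blocks[block] = data      # synchronous mirroring

    # Failure handling: aggressive re-mirroring for strong consistency.
    def on_backup_lost(self, control_plane: "EBSControlPlane") -> None:
        self.backup = None
        # A primary refuses to keep running un-mirrored; it immediately asks
        # the control plane for a new backup node, generating extra traffic.
        control_plane.request_new_mirror(self)


class EBSControlPlane:
    """Control-plane traffic: commands about volumes, not their contents."""
    def __init__(self, free_nodes: list[str]):
        self.free_nodes = free_nodes

    def create_volume(self, volume_id: str) -> Volume:
        # Control-plane command: allocate a primary and a backup node.
        return Volume(volume_id, Replica(self.free_nodes.pop()),
                      Replica(self.free_nodes.pop()))

    def request_new_mirror(self, volume: Volume) -> None:
        # If no capacity is free, the volume stays un-mirrored ("stuck")
        # and its owner will keep retrying.
        if self.free_nodes:
            volume.backup = Replica(self.free_nodes.pop())


cp = EBSControlPlane(free_nodes=["node-a", "node-b", "node-c"])
vol = cp.create_volume("vol-123")
vol.write(0, b"hello")     # data plane: goes to primary and backup
vol.on_backup_lost(cp)     # failure: re-mirror onto the one spare node
```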
Timeline
- A routine primary-network capacity upgrade was run overnight (the usual window for such updates)
- Traffic had to be shifted off the router being upgraded onto other primary-network routers
- ⚠️ HUMAN ERROR: someone accidentally shifted the traffic for one primary-network router onto a low-capacity secondary-network router instead
- This disconnected primary replicas from their backup replicas
- Each primary replica believed it had lost its backup and began generating extra traffic, looking for a new node to mirror to
- The secondary network already had low capacity, so it was quickly overwhelmed
- ~13% of EBS volumes in the affected Availability Zone became “stuck” → a re-mirroring storm (i.e., primary and backup no longer agree)
- Primary replicas no longer trusted their backups and kept aggressively trying to re-mirror; with no free capacity available, they were stuck in a retry loop (see the re-mirroring simulation sketched after the timeline)
- The network could not serve control plane requests (CreateVolume is the most common EBS API request) → these requests filled up the request buffers → AWS had to disable these control plane requests to slow things down (see the backpressure sketch after the timeline)
- ALSO ⚠️ the EBS code had a race condition, which gets triggered under high request rates
- Re-mirroring is a negotiation between the EC2 node, the EBS node, and the EBS control plane (to ensure there is only one primary)
- The race condition caused EBS nodes to fail → even more negotiation to create more replicas → a “brown out” of the EBS APIs (see the race-condition sketch after the timeline)
- The team eventually figured out how to recover the EBS servers, restoring the affected Availability Zone and control plane communication
- 3 days later, only 1% of volumes still stuck
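The first sketch below is a toy simulation of the re-mirroring storm: every volume that lost contact with its backup retries re-mirroring at once, and because spare capacity is far smaller than the number of stranded volumes, most of them stay stuck while their retries keep generating traffic. All numbers and names are made up for illustration.

```python
# Toy simulation of the re-mirroring storm: stranded primaries all hunt for
# spare capacity at the same time. Numbers are invented, not AWS's.

TOTAL_VOLUMES = 1000
SPARE_CAPACITY = 50          # far fewer spare slots than stranded volumes

def simulate(rounds: int = 5) -> None:
    spare_slots = SPARE_CAPACITY
    stuck = list(range(TOTAL_VOLUMES))   # volumes whose backup was "lost"
    for r in range(1, rounds + 1):
        retry_traffic = len(stuck)       # every stuck volume retries each round
        recovered = set()
        for vol in stuck:
            if spare_slots > 0:          # found a node with free space
                spare_slots -= 1
                recovered.add(vol)
        stuck = [v for v in stuck if v not in recovered]
        print(f"round {r}: re-mirror attempts={retry_traffic}, "
              f"still stuck={len(stuck)}, spare slots left={spare_slots}")

if __name__ == "__main__":
    simulate()
```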
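The second sketch illustrates the control-plane mitigation as a generic backpressure pattern: a bounded request queue that, once its backlog grows past a threshold, disables the most voluminous API (CreateVolume) rather than letting the buffer fill up. The class and method names are hypothetical, not AWS's implementation.

```python
# Sketch of control-plane backpressure: shed the heaviest API once the
# request backlog grows, instead of letting the buffer fill. Hypothetical.
from collections import deque

class ControlPlaneQueue:
    def __init__(self, capacity: int, shed_threshold: float = 0.8):
        self.capacity = capacity
        self.shed_threshold = shed_threshold
        self.queue: deque[str] = deque()
        self.disabled_apis: set[str] = set()

    def submit(self, api: str) -> bool:
        """Returns True if the request was accepted."""
        if api in self.disabled_apis or len(self.queue) >= self.capacity:
            return False
        self.queue.append(api)
        # Backpressure: once the backlog passes the threshold, disable the
        # most voluminous API so the system can drain.
        if len(self.queue) >= self.shed_threshold * self.capacity:
            self.disabled_apis.add("CreateVolume")
        return True

    def drain_one(self) -> "str | None":
        return self.queue.popleft() if self.queue else None


q = ControlPlaneQueue(capacity=10)
accepted = sum(q.submit("CreateVolume") for _ in range(20))
print(f"accepted {accepted} of 20 CreateVolume requests; "
      f"disabled: {q.disabled_apis}")
```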
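The third sketch illustrates a check-then-act race of the kind described above in the re-mirroring negotiation (this is not AWS's actual code). The invariant is "exactly one primary"; without mutual exclusion, two concurrent negotiations can both believe they won.

```python
# Illustration of a check-then-act race in a re-mirroring-style negotiation.
# Invariant: exactly one primary. Without a lock, two concurrent
# negotiations can both pass the check and both "become" primary.
import threading
import time

class VolumeNegotiation:
    def __init__(self):
        self.primaries = 0          # invariant: must end up == 1
        self.lock = threading.Lock()

    def claim_primary(self, use_lock: bool) -> None:
        if use_lock:
            with self.lock:
                self._check_then_act()
        else:
            self._check_then_act()

    def _check_then_act(self) -> None:
        if self.primaries == 0:     # check ...
            time.sleep(0.01)        # a concurrent negotiation sneaks in here
            self.primaries += 1     # ... then act (not atomic)

def run(use_lock: bool) -> int:
    neg = VolumeNegotiation()
    threads = [threading.Thread(target=neg.claim_primary, args=(use_lock,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return neg.primaries

print("primaries without lock:", run(False))   # usually 2: invariant broken
print("primaries with lock:   ", run(True))    # 1
```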