A major set of problems at Amazon Web Services East Region started around 10:30 AM on December 7. Fortunately, GiveSignup | RunSignup continued to process transactions and serve our customers’ tens of thousands of websites during the entire time. Here is the post-mortem description from AWS.
The impact was very widespread:
- Ticketmaster had to postpone selling seats to Adele’s 2022 tour, as did Eventbrite, Race Roster, and many other ticket and registration systems.
- Netflix, Roku and Disney+
- Alexa, Ring, Amazon Warehouses
- Venmo and Cash App
- Robinhood and Coinbase
- Delta Airlines and more as seen in this graph from Down Detector:
GiveSignup | RunSignup Transaction Impact
We are happy to report there was no impact on transactions on our systems. This eye chart shows the number of transactions per minute. December is our slowest month of the year, but it was still an important day for the Miami Marathon, which picked 1,500 lottery winners yesterday, all of whom were able to sign up. Overall we averaged 10.7 transactions per minute during the prime outage timeframe, with a low of 2 and a high of 27 transactions per minute.
Our Timeline and Issues
10:30 – Stephen and Matt happened to be on a screen share yesterday morning when we started to get alerts of AWS issues. The good news was that the way we do load balancing and host our web servers in multiple Availability Zones at Amazon allowed our base functions to keep operating well. They brought in Kristian and Bruce before 11AM, and alerted Bob around 11:50 that they had completed all of the work they could do to help our customers.
They noticed issues with S3 (where files are stored – like images) and SQS (a queuing service we use extensively to have applications work together and pass data). Fortunately, for most of our application, if S3 or SQS is down our code automatically retries against a backup region we always have available. So for the bulk of our application, we got automatic failover. There were a few corner cases in our code that did not auto-failover; for example, we had about 3 failures between 10:30 and 11 where users trying to upload their profile image got an error.
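The auto-failover idea above can be sketched in a few lines. This is illustrative only, not our actual code: the region names and the stand-in upload functions are assumptions, and the real version would wrap S3/SQS client calls.

```python
# Sketch of the auto-failover pattern: try the primary region, and on any
# error transparently retry the same operation against the backup region.
# Region names and the stand-in callables below are hypothetical.
from typing import Callable, TypeVar

T = TypeVar("T")

PRIMARY_REGION = "us-east-1"   # assumption: primary region
BACKUP_REGION = "us-west-2"    # assumption: always-available backup region

def with_failover(primary_call: Callable[[], T],
                  backup_call: Callable[[], T]) -> T:
    """Run the primary-region call; if it fails, retry in the backup region."""
    try:
        return primary_call()
    except Exception:
        # Primary region (e.g. S3/SQS in us-east-1) is failing,
        # so retry the operation in the backup region instead.
        return backup_call()

# Stand-ins for a real S3 upload during the outage:
def primary_upload():
    raise ConnectionError("us-east-1 S3 unavailable")

def backup_upload():
    return "stored in us-west-2"

result = with_failover(primary_upload, backup_upload)
print(result)  # "stored in us-west-2"
```

The corner cases mentioned above (like the profile-image uploads) were code paths that called the primary region directly instead of going through a wrapper like this.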
Around 11AM, we implemented the switch we have to use the backup regions for S3 and SQS so the code did not need to wait for a failover. This allowed most of our applications like photos to work properly.
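The difference between the automatic failover and the switch we flipped at 11AM is that the switch pins calls directly to the backup region, so requests never wait on a failing primary attempt. A minimal sketch, assuming a hypothetical FORCE_BACKUP_REGION flag (the real flag and config are not shown in the post):

```python
# Sketch of the manual region switch: once the flag is flipped, all S3/SQS
# calls go straight to the backup region with no failover delay.
# Flag name and region names are illustrative assumptions.
PRIMARY_REGION = "us-east-1"
BACKUP_REGION = "us-west-2"

FORCE_BACKUP_REGION = True  # flipped around 11AM during the outage

def region_for(service: str) -> str:
    """Choose the region for an S3/SQS call, skipping the failing primary
    entirely when the backup-region switch is on."""
    if FORCE_BACKUP_REGION:
        return BACKUP_REGION
    return PRIMARY_REGION

print(region_for("s3"))  # "us-west-2" while the switch is on
```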
The one application that did not survive well was our Analytics system. It uses the AWS API Gateway, which was one of the primary services that went down at AWS. This means the pageview totals for December 7 are low, since we did not record all of the pageviews, and we lost source tracking (like attributing a Facebook ad) for about 6 hours. Here is an example where the pageviews for December 7 are lower than December 6 even though the number of registrations increased, because some of the pageviews were not collected:
We also spent time updating several areas of our application to improve user error messages in case a service is down.
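One way to do this is to map each downstream-service failure to a clear, user-facing message instead of a raw error. A minimal sketch; the service names and message wording are illustrative assumptions, not our actual copy:

```python
# Sketch: translate a failing dependency into a friendly user-facing message.
# Service keys and message text are hypothetical examples.
FRIENDLY_MESSAGES = {
    "s3": "Image uploads are temporarily unavailable. Please try again shortly.",
    "analytics": "Page statistics may be delayed; registrations are unaffected.",
}

DEFAULT_MESSAGE = (
    "A supporting service is temporarily unavailable. "
    "Your registration is safe; please try again in a few minutes."
)

def user_message_for(failed_service: str) -> str:
    """Return a friendly message for the named failing service."""
    return FRIENDLY_MESSAGES.get(failed_service, DEFAULT_MESSAGE)

print(user_message_for("s3"))
```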
Finally, we spent a fair amount of time sorting through a deluge of alerts and emails from the various monitoring tools we use. AWS CloudWatch and Logging both had problems, and both are primary services we use for monitoring. Ironically, our monitor on CloudWatch itself was emailing us every minute to say the service was down 🙂
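A common way to tame that kind of alert storm is to suppress repeat alerts for the same condition within a time window. A sketch of the idea, not something the post says we run; the 15-minute window is an assumed example:

```python
# Sketch: suppress duplicate alerts (e.g. "CloudWatch is down" every minute)
# so each condition only alerts once per window. Window size is an assumption.
import time

class AlertThrottle:
    def __init__(self, window_seconds=900.0):
        self.window = window_seconds
        self._last_sent = {}  # alert key -> last send time

    def should_send(self, alert_key, now=None):
        """True if this alert has not fired within the window."""
        now = time.time() if now is None else now
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window; suppress it
        self._last_sent[alert_key] = now
        return True

throttle = AlertThrottle(window_seconds=900)
print(throttle.should_send("cloudwatch-down", now=0))    # True (first alert)
print(throttle.should_send("cloudwatch-down", now=60))   # False (suppressed)
print(throttle.should_send("cloudwatch-down", now=1000)) # True (window passed)
```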
Knock on wood, we survived another major outage that impacted probably a hundred thousand plus websites and services. Obviously, no system is perfect, but we are happy we made the investments in infrastructure that kept our customers running (pun intended).
And just to keep track, we have had one 4 minute outage over the past 6+ years.