Like much of the web, RunSignup and GiveSignup host our infrastructure on Amazon Web Services (AWS). Unfortunately, AWS had a major outage on the Wednesday before Thanksgiving in its Northern Virginia (US-EAST-1) region. It took down major websites and services like 1Password, Acorns, Adobe Spark, Anchor, Autodesk, Capital Gazette, Coinbase, DataCamp, Getaround, Glassdoor, Flickr, iRobot, The Philadelphia Inquirer, Pocket, RadioLab, Roku, RSS Podcasting, Tampa Bay Times, Vonage, The Washington Post, and WNYC. Apparently the trigger was a rush to get new capacity in place to handle the post-Thanksgiving ecommerce surge.
The good news for RunSignup customers is that the outage had zero impact on their ability to take registrations, set up their races, run reports, or take donations. The reason is the massive investment we have made in our technology stack, along with thoughtful design decisions. For example, we run across multiple AWS Regions and Availability Zones with automated failover. We also have multiple redundancies built into our monitoring, so even though the AWS CloudWatch monitoring system failed, we were aware of the issues and were able to track both the problems and the automated re-routing that took place in our AWS implementation.
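To make the idea of redundant monitoring concrete, here is a minimal sketch in Python, not our actual implementation. It assumes a hypothetical set of independent health probes and a priority-ordered list of regions; the function name `pick_healthy_region`, the probe names, and the region names are all illustrative. The point it shows is that a monitor which is itself down should count as "no information" rather than block the failover decision.

```python
from typing import Callable, Iterable, Optional

Probe = Callable[[str], bool]  # takes a region name, returns True if it looks healthy


def pick_healthy_region(
    regions: Iterable[str],
    probes: Iterable[Probe],
    min_agreeing: int = 1,
) -> Optional[str]:
    """Return the first region, in priority order, that enough probes report healthy."""
    probe_list = list(probes)
    for region in regions:
        healthy_votes = 0
        for probe in probe_list:
            try:
                if probe(region):
                    healthy_votes += 1
            except Exception:
                # A broken monitor (for example, the cloud provider's own
                # monitoring service being down) contributes no vote.
                continue
        if healthy_votes >= min_agreeing:
            return region
    return None  # no region passed; a real system would page someone here


# Hypothetical probes for illustration only.
def provider_monitor(region: str) -> bool:
    raise RuntimeError("monitoring service unavailable")


def external_ping(region: str) -> bool:
    return region != "us-east-1"  # pretend the primary region is failing


chosen = pick_healthy_region(["us-east-1", "us-west-2"], [provider_monitor, external_ping])
```

In this toy run the provider's monitor throws on every call, yet the external probe alone is enough to route traffic to the second region.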
This is similar to past outages that took out major websites like Twitter and CNN. As we wrote at the time, technology is hard. For example, some cloud-based race timing services are built on AWS Lambda, which was one of the services that went down on Wednesday. We do not run our production environment on Lambda (although we use it for non-critical items like our Photo Platform). Imagine you were a timer in a regular year, the race you were timing had 15,000 people, and all of a sudden your scoring and results system went down. Not a good Thanksgiving…
We really respect AWS for the very transparent post-mortem it published on these problems. The root cause was a thread count limit in the configuration of its Kinesis servers, which then cascaded into problems across many other services, including Auto Scaling, Lambda, and CloudWatch.
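As a rough illustration of how a thread limit can bite when capacity is added, here is a small Python sketch. It rests on our reading of the AWS summary, namely that each Kinesis front-end server keeps one operating system thread per other server in the fleet; the function names and all of the numbers are hypothetical.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one server needs: its own baseline plus one per peer in the fleet."""
    return baseline_threads + max(fleet_size - 1, 0)


def max_safe_fleet_size(os_thread_limit: int, baseline_threads: int = 0) -> int:
    """Largest fleet for which every server stays at or under the OS thread limit."""
    return os_thread_limit - baseline_threads + 1


# Hypothetical numbers: a 1,200-thread limit and 200 baseline threads per server.
limit, baseline = 1200, 200
safe = max_safe_fleet_size(limit, baseline)
over_limit = threads_per_server(safe + 1, baseline) > limit
```

Because every server's thread count grows with the size of the whole fleet, adding even one server past the safe size pushes all servers over the limit at the same moment, which is why this kind of problem cascades instead of staying local.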
While nothing can ever be perfect, we continue to care about the details and invest in high availability and continuous improvement.