Note from Dec. 15, 2022 – We have deployed an upgraded session management layer that adapts to high demand and removes the problem with the Apache web server cache mechanism that caused slow performance on Thanksgiving morning.
Over 750,000 people signed up for the 730 Turkey Trots that used our platform this Thanksgiving. It is by far the busiest day of the year, and it puts a huge load on our systems: last-minute registrations, timers posting results and photos, and people checking directions, start times, and results. We did not perform perfectly this year. As we review below, some users experienced slow responses around 10 AM Eastern and again around 11:15. Here are some quick stats:
- Peak rate – 24,000 page views per minute (400 per second!)
- Registrations on Thanksgiving morning – 22,165
- Result TXT Notifications Sent – 177,390
- Result Email Notifications Sent – 129,537
- Photos Uploaded – 25,802 (average is usually ~10 views per photo)
- Unique Visitors to Race Websites – 490,655
- Pageviews – 2,240,709
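As a quick sanity check on the figures above, the per-second peak and a couple of derived numbers can be recomputed directly. This is only a sketch: the variable names are illustrative, and it assumes the pageview and visitor counts cover the same period.

```python
# Figures copied from the stats list above.
PEAK_PAGEVIEWS_PER_MINUTE = 24_000
TXT_NOTIFICATIONS = 177_390
EMAIL_NOTIFICATIONS = 129_537
UNIQUE_VISITORS = 490_655
PAGEVIEWS = 2_240_709


def per_second(per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


peak_per_second = per_second(PEAK_PAGEVIEWS_PER_MINUTE)
total_notifications = TXT_NOTIFICATIONS + EMAIL_NOTIFICATIONS
pageviews_per_visitor = PAGEVIEWS / UNIQUE_VISITORS

print(f"Peak: {peak_per_second:.0f} page views per second")
print(f"Result notifications sent: {total_notifications:,}")
print(f"Average page views per visitor: {pageviews_per_visitor:.1f}")
```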
We have an advanced multi-level infrastructure that runs on AWS. This is a high-level diagram of our infrastructure:
We started the day with 5 NGINX servers (one more than usual) and 4 web servers. This should have been enough to handle the load, but we ran into an issue that needs further investigation and that we believe is due to APC cache. APC is a local data cache that saves us from hitting the database and makes our pages super fast. It seems that under high load it held threads or connections open, so the servers ran out of internal resources even though we had plenty of CPU capacity.
Here is a graph of the requests per minute (a combination of people clicking on web pages and timer scoring software posting results). Peak was around 24,000 to 25,000 requests per minute. The orange indicates where our monitoring software showed slow response times for some requests:
Those orange areas were triggered by monitoring of the average and max response times. We could see the average response time on our servers going from 100 milliseconds to 300+ milliseconds (0.3 seconds):
But things get very visible when you look at the max response time, which is the worst case that at least one user experienced. That max of 22,000 milliseconds is 22 seconds. Not good, but at least the average was much better, so if that user had simply refreshed the page they would likely have gotten a decent response:
When we saw this, we began adding web servers and increasing their size, although to be honest we are not sure that had a big impact one way or the other. Eventually a web server hung with the httpd error “AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting”. This happened 6 times over the next hour and a half. That is not as bad as it sounds, as we have failover and user impact was actually fairly minimal.
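One way to see how a server can hit its worker limit while CPU stays low is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average response time. The sketch below applies it to the peak rate from this post. The even split across servers and the per-server limit of 256 are hypothetical values chosen for illustration, not the actual configuration, and the 1,000 ms and 3,000 ms rows are what-if scenarios rather than measured averages.

```python
def in_flight(requests_per_second: float, avg_response_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate * time in system."""
    return requests_per_second * avg_response_seconds


PEAK_RPS = 24_000 / 60   # peak rate from this post: 400 requests per second
WEB_SERVERS = 4          # web servers at the start of the day
WORKER_LIMIT = 256       # hypothetical per-server MaxRequestWorkers value

for avg_ms in (100, 300, 1000, 3000):
    total = in_flight(PEAK_RPS, avg_ms / 1000)
    per_server = total / WEB_SERVERS  # assumes an even split across servers
    status = "over limit" if per_server > WORKER_LIMIT else "ok"
    print(f"{avg_ms:>5} ms avg -> {total:6.0f} in flight, "
          f"{per_server:5.0f} per server ({status})")
```

CPU never appears in the formula. A worker that is waiting on a cache call or a session lock is occupied just as much as one doing real work, so slow requests alone can exhaust the worker pool.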
Here are our server graphs:
Here is a graph of webserver CPU load. As you can see, max utilization only hit about 60% and just on one server:
Same with our front end NGINX load balancers:
Same with our Database servers:
These are our Memcache session servers. They allow users to be switched between web servers and are part of our auto-failover capability. The high CPU utilization they show is misleading, since they are not the root cause. We think the cause is the APC threads calling the memcache server: after heavy load, APC calls seem to slow down, which makes requests take longer. The longer requests ramp up database connections but, more importantly, lock sessions. While waiting on a session lock, a request would retry every 5 milliseconds (200 times per second) for up to 65 seconds. That really chewed up memcache CPU and exacerbated the issue. As a quick attempt at a fix, we changed the retry interval from 5 ms to 10 ms. Anyway, here are the Memcache server reports:
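The cost of the session lock retries described above can be estimated with a little arithmetic. This is a rough sketch: the function names are illustrative, and the number of waiting requests is a hypothetical figure, since the real count during the incident is not stated here.

```python
def max_attempts(timeout_seconds: float, interval_ms: float) -> int:
    """Upper bound on lock attempts one blocked request makes before giving up."""
    return int(timeout_seconds * 1000 // interval_ms)


def polls_per_second(waiting_requests: int, interval_ms: float) -> float:
    """Lock checks per second sent to memcache by all waiting requests."""
    return waiting_requests * (1000 / interval_ms)


TIMEOUT_S = 65   # lock wait limit from this post
WAITING = 50     # hypothetical number of requests blocked on a session lock

for interval_ms in (5, 10):
    print(
        f"{interval_ms} ms interval: up to "
        f"{max_attempts(TIMEOUT_S, interval_ms)} attempts per request, "
        f"{polls_per_second(WAITING, interval_ms):.0f} polls/s "
        f"from {WAITING} waiters"
    )
```

Doubling the interval halves the polling load on memcache, at the price of a blocked request noticing a released lock up to 10 ms late instead of 5 ms.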
Further Investigation and Actions
We are actually kind of excited to dig into this more next week and beyond. We are in the midst of a bunch of infrastructure upgrades that we will expound upon in an end-of-year Availability Report. The plans include improvements to our auto-scaling, and we were planning to use Redline13 for load testing at high loads. Between tracking down the core limitation/bug and the auto-scaling work, we should continue to be the most advanced and highest-performance system in the registration and ticket market. We are looking forward to an even bigger Thanksgiving in 2023!