Availability Update

We had a partial problem that affected about 1/8 of our users on Thursday from 3:24 PM until 3:28 PM. It disrupted one person who was in the process of signing up for a race: they were redirected back to the starting page to sign up again. Other users may have needed to refresh the page they were on.

There was no data corruption, as our systems are transactional: people don’t get double charged, double entered, or charged and not entered (or vice versa). Surprisingly few registration systems are transactional.

The problem was again a memcached server. We are investigating changes to our configuration and monitoring, since this is the second time we have seen this in the past year. The good news is that our high availability configuration limited the impact on users, and we were able to fix this issue quickly (4 minutes this time).

UPDATE 8/5/2014: We have made several changes. First, if we experience a memcache error, users will see the usual page and not a 500 error. Since the system is designed to survive the failure of a memcache server, the user will now be able to continue what they were doing. Second, we have further automated the process of replacing a dysfunctional memcache server (we have 8 of them running at all times, so we have the capacity to do this temporarily). We should now recover normal performance within a minute.
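
The first change can be sketched roughly as follows. This is a minimal illustration, not the production code: it assumes a cache client that raises an exception when a server is unreachable, and the names `CacheUnavailable`, `FlakyCache`, and `cached_fetch` are hypothetical. The idea is simply that a cache failure is treated like a cache miss, so the page is still built from the underlying source instead of returning a 500 error.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when a cache server cannot be reached."""


class FlakyCache:
    """Stand-in for a memcache client (assumption, for illustration only).

    Every call fails while `healthy` is False.
    """

    def __init__(self):
        self.healthy = True
        self._data = {}

    def get(self, key):
        if not self.healthy:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key, value):
        if not self.healthy:
            raise CacheUnavailable(key)
        self._data[key] = value


def cached_fetch(cache, key, load_from_source):
    """Return the value for `key`, treating any cache failure as a miss."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except CacheUnavailable:
        pass  # cache is down: fall through and load from the source

    value = load_from_source(key)

    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # could not store the value; the request still succeeds
    return value
```

With a healthy cache the loader runs once per key. With the cache down, every request goes to the loader, which is slower but still returns the usual page.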
