We did a deployment this afternoon at around 2:15 PM Eastern that updated some of our infrastructure (a new version of SQS, the queuing service from Amazon that we use). We had tested it on our test servers, but once it was deployed it caused issues.
The net effect, before we were able to pull back, was 5 registrations that were not completed successfully. Worse, we do not have those runners' information, so we cannot reach back out to them. This may sound like a very small number given our overall volume, but it is something we take very seriously. We sincerely apologize to the races that these people were signing up for, and we have sent an email to each of those race directors asking them to let us know if they hear from a frustrated runner who did not have a successful registration experience.
This is the first registration interruption we have had since last September. As we have said in the past, we try to learn from our mistakes. We have since built a mechanism that will save those partial transactions if something like this happens again, so we can at least reach out to the customers and work on completing their transactions. We will also be building an automated process to serve as a backup in case we have another issue with SQS. We hope to have this up within the week.
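For readers curious what saving partial transactions can look like, here is a minimal sketch and not our actual implementation. It assumes a hypothetical `send_to_queue` callable standing in for the real queue client and a local JSON-lines file as the fallback store. Every name in it is made up for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def submit_registration(
    registration: Dict,
    send_to_queue: Callable[[str], None],  # hypothetical stand-in for the queue client
    fallback_path: Path,                   # hypothetical local fallback file
) -> bool:
    """Try the queue first; if it fails, persist the registration locally.

    Returns True if the queue accepted the message, False if the
    registration was written to the fallback file instead.
    """
    payload = json.dumps(registration)
    try:
        send_to_queue(payload)
        return True
    except Exception:
        # Queue unavailable: append one JSON record per line so the
        # contact details survive and someone can follow up later.
        with fallback_path.open("a", encoding="utf-8") as f:
            f.write(payload + "\n")
        return False


def load_saved_registrations(fallback_path: Path) -> List[Dict]:
    """Read back every registration that was saved during an outage."""
    if not fallback_path.exists():
        return []
    with fallback_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The point of the design is that a queue failure no longer loses the customer's details: the record lands somewhere durable, and the saved entries can be replayed or used to contact the customer once the queue is healthy again.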
We will continue to report our availability and mistakes. We think it is through transparency that we learn and get better, and we think that transparency is useful for customers to judge us by. Thanks for your support and patience.