UPDATE (4/30/2018): In addition to the steps discussed below, we also refactored the code. We implemented a new model that reduces the processing needed to set up new participants and spectators. This new code was used this weekend for several races, including the Kentucky Derby Marathon. We had 2.2 million configuration records (the number of participants and spectators multiplied by the number of timing points), more than at Glass City, and processed over 118,000 notifications for 8,500 users without any issues.
ORIGINAL POST: A confluence of issues resulted in poor performance of RaceJoy for some users this past weekend. As usual, we want to publicly document our issues, what we learned from them, and what we plan to do in the future to correct them. This fits with our core guiding principles of being open and learning.
Many users used RaceJoy productively, sending over 6,000 cheers and delivering over 68,000 progress alerts. On the other hand, many users who loaded RaceJoy within an hour of the race had trouble getting registered and configured properly. In addition, progress alerts were delayed in reaching users. There were two major issues:
Issue 1: The fundamental problem was that, near the beginning of the race, 7,000+ new RaceJoy users were creating profiles in a concentrated period of time. This created a lot of “writes” to the database. Glass City was one of the first users of RaceJoy (thank you!), but we had kept their database configured on a disk drive from 2013 on Amazon AWS. It turns out that the newer storage units we use have 15X the capacity for moving information in and out (I/O) of the drive (200 IOPS vs. 3,000 today). Our monitoring showed this slow I/O and a huge backlog of requests, which degraded the performance of the app.
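To make the I/O gap concrete, here is a rough back-of-the-envelope sketch. The 200 and 3,000 IOPS figures come from the numbers above; the arrival rate of 500 requests per second is purely an illustrative assumption, not a measurement from race day, and the function name `backlog_after` is made up for this example. It also assumes each request costs exactly one I/O operation, which is a simplification.

```python
def backlog_after(arrival_rate: float, iops: float, seconds: float) -> float:
    """Requests still waiting after `seconds` of steady arrivals.

    Simplified model: every request costs one I/O operation and the
    drive completes at most `iops` operations per second.
    """
    return max(0.0, (arrival_rate - iops) * seconds)


OLD_IOPS = 200      # 2013-era drive (figure from the post)
NEW_IOPS = 3_000    # newer storage units (figure from the post)
ASSUMED_ARRIVAL_RATE = 500  # hypothetical requests per second, for illustration only

print(NEW_IOPS / OLD_IOPS)                                # 15.0
print(backlog_after(ASSUMED_ARRIVAL_RATE, OLD_IOPS, 60))  # 18000.0
print(backlog_after(ASSUMED_ARRIVAL_RATE, NEW_IOPS, 60))  # 0.0
```

Under these assumptions, the old drive falls 18,000 requests behind after one minute, while the newer unit keeps up with room to spare.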
Resolution 1: We will be moving the Glass City data to a new storage unit. The other databases were already on upgraded units, which is why our load testing had not revealed this issue and why we’ve been able to easily serve larger numbers of users at previous events.
Issue 2: This race used the previous RaceJoy Timer Integration model rather than just RaceJoy’s phone tracking coupled with RunSignUp Results. We still support a handful of legacy RaceJoy customers who want to show timing data directly in RaceJoy based on data collection from RaceJoy. In Glass City’s case, they have 28 chip timing points on their course. Each registered user (including all of those 7,000 users coming in right before and during the early part of the race) had to have notification tables set up for each of those timing points. This created part of the load explained in Issue 1.
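As a rough illustration of how this setup work multiplies, the sketch below computes the number of notification entries from the figures above. The function name `notification_entries` is made up for this example, and it assumes exactly one entry per user per timing point, which may not match the actual table layout.

```python
def notification_entries(users: int, timing_points: int) -> int:
    """Entries to create, assuming one per user per timing point."""
    return users * timing_points


# Figures from the post: roughly 7,000 new users and 28 chip timing points.
print(notification_entries(7_000, 28))  # 196000
```

Under that assumption, the late-arriving users alone account for about 196,000 entries, all created in the same window as the profile writes described in Issue 1.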
Resolution 2: We are sunsetting this RaceJoy timing integration capability in favor of a more distributed architecture using RunSignUp Results technology. Currently, RunSignUp offers timing alerts outside of RaceJoy when a timer uses The Race Director, RunScore, Agee Timing, RM Timing, or RaceTec as their scoring software. (Unfortunately, this race uses scoring software that is not integrated with RunSignUp Results, so we were trying to be flexible and support the older integration for at least another year for Glass City.) The integration with RunSignUp Results provides the same capability, but in a more distributed manner that scales much better. In this case, it would have eliminated the bottleneck.
If we had had only one of these issues, we would not have had a performance bottleneck. But we had both, and we are sorry for the impact on the Glass City Marathon and its participants. We are working with Glass City to provide restitution.