RunSignup Blog

Upgrade to Aurora 2 Database – Potential Downtime

We are planning to upgrade to Amazon AWS Aurora MySQL 2 sometime in the next week and a half. We expect less than 5 minutes of potential system impact. We will publish another blog post the day before we do the upgrades.

Planned Impact

There should be minimal impact, with the whole switchover taking less than 5 minutes (details below). The final switch from Aurora 1 to Aurora 2 will pause writes to the database. This has different impacts on different parts of the system:

What is Aurora?

Amazon AWS came out with Aurora in 2016 as a MySQL service that provided automated features for Read Replicas (a way to scale a database), automated backups and more. We were a Beta Test site for the service and were one of the early adopters when we moved to Aurora in 2016 (with zero downtime). What Aurora has meant to us and our users is a faster site that is more reliable and scalable to meet large demands. It has also lowered our cost of maintenance and support for the database tier.

We run a main database with a read replica and a shard database that also has a read replica. The read replicas allow us to fail over automatically in the event of a database server problem. We also have a high-speed caching layer in front of the databases to reduce the chance of the database becoming a bottleneck and to speed up our site. Here is a diagram of our system:

Why are we Upgrading?

Amazon AWS will sunset support for security and maintenance updates to Aurora 1 in 2023. We want to be ahead of that to assure our users of high-quality, secure operations. If you have other vendors using Amazon, they are likely using Aurora, and we recommend you ask them about their timelines for migration.

We have continually invested in our infrastructure, constantly learning and improving. We also share our availability investments and failures publicly on this blog hoping to educate ourselves and our customers and even share lessons learned with competitors.

We have been lucky to have talented people at RunSignup who continue to upgrade our infrastructure. The combination of people, design and leveraging Amazon’s capabilities has given us a remarkable record of only 4 minutes of system impact since 2015. The one occurrence was a release of a new feature that impacted the system, and it took us 4 minutes to see the error and roll back the system. We average about 2,000 releases of our software per year.

We applaud Eventbrite for also sharing their issues publicly, and for their recent statements about wanting to reinvest in their infrastructure after apparently not doing so for some time. In the blog link above they state it will take 3 years. They are about 1 year into it and are still seeing many issues, which they share on their Twitter Status Page and which we are sure are frustrating to users and Eventbrite:

We wish all platform companies would be open about their efforts to make their systems secure and reliable like Eventbrite and RunSignup.

Upgrade Process Details

Upgrading a major system component is always risky, and we want to minimize that risk. Our CTO Stephen and Founder Bob have a side company, ZipCodeAPI.com. This past weekend Stephen did a practice upgrade on that system, and the good news is that it went well.

For the RunSignup system, we plan on adding some additional automation to make the transition as fast as possible and to avoid manual errors. We will do an upgrade in our test environment first. Then we will do an upgrade of the Shard Database on production. Data in the Shard is typically non-critical, so this step should not cause any real problems. Then we will do the primary database. For each of these, we will follow these steps:

As you can see, this may take less than a minute of impact to the system. Keep your fingers crossed!
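
For readers curious about the ordering described above (test environment, then Shard, then primary), here is a minimal Python sketch of a staged rollout that stops at the first failure. It is only an illustration, not our actual automation: the stage names, `run_staged_upgrade`, and the `upgrade` callback are hypothetical, and the per-database steps are deliberately left out.

```python
from typing import Callable, List, Tuple

# Order taken from the post: test environment first, then the production
# Shard, then the primary. The stage names themselves are illustrative.
STAGES = [
    "test environment",
    "production shard database",
    "production primary database",
]


def run_staged_upgrade(
    stages: List[str], upgrade: Callable[[str], bool]
) -> Tuple[List[str], bool]:
    """Upgrade each stage in order and stop at the first failure.

    `upgrade` is a hypothetical callback that performs the upgrade for one
    stage and returns True on success. Returns the stages that completed
    and whether the whole rollout succeeded.
    """
    completed: List[str] = []
    for stage in stages:
        if not upgrade(stage):
            # Do not touch later (more critical) stages after a failure.
            return completed, False
        completed.append(stage)
    return completed, True
```

The point of the ordering is that a problem shows up on the least critical system first, so the primary database is only touched after the earlier stages have gone cleanly.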
