One of the things we want to do at RunSignUp is be transparent about how we operate. Being direct about it helps us understand what is working and what we can improve.
As you may know, we run on the Amazon Cloud. This gives us a large number of advantages in terms of scalability. Amazon is also the largest and most respected Cloud provider in the market. On the downside, when they have a problem, so do we. And they had a problem (http://news.cnet.com/8301-1001_3-57453815-92/amazon-web-services-recovers-from-partial-outage/) on the evening of June 14. Our service was down from about midnight until about 9:30 AM.
We have monitoring and alerting: a separate service outside Amazon “pings” our service every 5 minutes to make sure things are up and running. So we knew about the problem within minutes. However, because we were running on the Amazon servers, we were held hostage a bit.
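To illustrate the kind of external check described above, here is a minimal sketch in Python. It is not our actual monitoring service. The function names (`is_up`, `monitor`), the timeout, and the `alert` callback are all hypothetical; only the 5-minute interval comes from the description above.

```python
import time
import urllib.error
import urllib.request

# The only value taken from the post: one check every 5 minutes.
CHECK_INTERVAL_SECONDS = 5 * 60


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status within the timeout.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error statuses all count as "down".
        return False


def monitor(url, alert, checks=None, sleep=time.sleep, probe=is_up):
    """Probe the URL on a fixed interval and call alert(url) on every failed check.

    checks=None runs forever; a number limits how many probes are made.
    """
    done = 0
    while checks is None or done < checks:
        if not probe(url):
            alert(url)
        done += 1
        if checks is None or done < checks:
            sleep(CHECK_INTERVAL_SECONDS)
```

A check like this only helps if it runs somewhere that does not share a failure mode with the service it watches, which is why ours lives outside Amazon.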
We are going to be making some improvements this summer to our setup and recovery abilities. We will move to a multi-region setup so that if systems fail in the Amazon East datacenter, we can move to another region quickly and, we hope, without users noticing an outage. Amazon is coming out with some features that will give us some good options for this (such as Amazon RDS features for availability: http://aws.amazon.com/rds/#features).
Every system has occasional problems, whether you run the systems yourself, rely on an outside hosting service, or use Amazon. We have high confidence that we are on the most reliable platform available and on the right track to make our service the most highly available in the running sign-up business. We will also start keeping track of our downtime, including system maintenance and upgrades. So far in 2012 we have had about 14 hours of downtime out of 4,128 hours, which is an uptime of 99.66%. We hope to improve that.
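For anyone who wants to check the arithmetic, the uptime figure works out as follows. This is just a small sketch; the function name `uptime_percent` is made up for illustration, and the two inputs are the approximate numbers quoted above.

```python
def uptime_percent(downtime_hours, total_hours):
    """Share of the period the service was available, as a percentage."""
    return 100.0 * (1 - downtime_hours / total_hours)


# About 14 hours of downtime out of 4,128 hours so far in 2012.
print(f"{uptime_percent(14, 4128):.2f}%")  # prints 99.66%
```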