We recently made a major upgrade to our management and monitoring tools by moving to New Relic. It gives us a number of things:
Monitoring & Alerting. We had been depending on our own simple scripts for this. New Relic now checks the site every minute and alerts us if there is a problem.
SLA. We get a Service Level Agreement report, which is basically an uptime report. We have only been running it since July 3, but so far we are at 100% uptime (which I am sure will go down, since everyone has downtime and issues occasionally). Our manually tracked uptime since January 1 was 99.65%; from now on it will be tracked in a much more automated way. We intend to publish this type of information on a continuing basis for transparency.
Site Performance. We can also track the performance of our servers. Here is an example showing that our server responds in 0.255 seconds on average.
End User Performance. We can also track how fast our site is from the end user's browser or smartphone. We get about 20% of our traffic from mobile devices now, so we need to keep looking for ways to improve performance for everyone. This graphic shows the fastest browser is Chrome on Mac computers. The slowest is Safari on Android mobile devices.
Transaction Tracking. This allows us to identify slow transactions and look into each database call and function call. It will help us identify bottlenecks and improve performance.
Error Checking. This will not only send us alerts when errors are encountered in the system, but also give us the ability to look inside the errors. It even identifies potential problems like slow SQL.
System Load Monitoring. We have “over-designed” our standard running system with lots of extra capacity and failover points. However, we now have a way to see if bottlenecks develop and identify them early. This is especially important for large races that might have 10,000 to 50,000 runners signing up all in one morning.
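
To make the once-a-minute check and the uptime figures above a little more concrete, here is a minimal Python sketch of the general idea. It is not our old script and not how New Relic works internally; the function names (`site_is_up`, `run_checks`, `uptime_percent`, `implied_downtime_minutes`) and the example URL are hypothetical, chosen only for illustration.

```python
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection failures, timeouts, and HTTP error responses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def run_checks(url, num_checks, check=site_is_up, alert=print, wait=None):
    """Run num_checks checks, call alert() on each failure, return the results."""
    results = []
    for i in range(num_checks):
        ok = check(url)
        results.append(ok)
        if not ok:
            alert(f"ALERT: {url} failed check {i + 1}")
        # Pause between checks, e.g. wait=lambda: time.sleep(60)
        # for a once-a-minute schedule (hypothetical setting).
        if wait is not None and i < num_checks - 1:
            wait()
    return results


def uptime_percent(results):
    """Uptime as the percentage of checks that succeeded."""
    if not results:
        raise ValueError("no checks recorded")
    return 100.0 * sum(results) / len(results)


def implied_downtime_minutes(uptime_pct, total_minutes):
    """Minutes of downtime implied by an uptime percentage over a period."""
    return total_minutes * (100.0 - uptime_pct) / 100.0
```

As a rough worked example, if we assume the manual tracking period from January 1 to early July was about 183 days (an assumption for illustration; the exact length is not stated above), `implied_downtime_minutes(99.65, 183 * 24 * 60)` comes to roughly 922 minutes, or a little over 15 hours of downtime in that period.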