Surviving the Flood | Arfon Smith

It’s been nearly a month since the launch of Galaxy Zoo 2, and with close to 15 million classifications, the volume of traffic has exceeded even our wildest expectations.

I joined the Galaxy Zoo team in January this year and, in the six weeks before launch, worked pretty much non-stop to re-implement the Galaxy Zoo 2 beta site in Rails and to write a web service to capture the results. Now, I know it’s almost always best to avoid the big rewrite, but we had many good reasons for moving away from the old infrastructure and codebase.

The original Galaxy Zoo project was really an accidental success - the team had no idea that what they had created would become so popular so quickly and the story of the melting web server is Zoo folklore these days. With this in mind we were keen for a smooth launch of Zoo 2.

I think one of the most significant moves we made for the launch was to host the new Galaxy Zoo website and API on Amazon Web Services (AWS). AWS has a pay-by-the-hour pricing model, which was perfect for our very public launch. Below is a diagram of the production web stack we were running on launch day. Blue (LB) nodes are HAProxy load balancers (one for the web nodes and one for the API nodes). Pink (WEB) nodes serve the Galaxy Zoo website, yellow (API) nodes run the API backend of the Galaxy Zoo site (serving up images and capturing classifications), and the green/white nodes are the MySQL Master/Slave databases.
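For anyone who hasn’t run a load balancer before, here is a minimal sketch of what each LB node is doing for its pool: handing requests to the nodes behind it in rotation. It is an illustration only. The node names are made up, the post doesn’t record which balancing algorithm our HAProxy config used, and real HAProxy adds health checks, weights and connection handling on top of this.

```python
from itertools import cycle


class RoundRobinPool:
    """Hands out backend nodes in rotation, one per request."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("a pool needs at least one node")
        self.nodes = list(nodes)
        self._rotation = cycle(self.nodes)

    def next_node(self):
        """Return the next node in the rotation."""
        return next(self._rotation)


# Hypothetical node names; the launch stack had 5 web and 5 API nodes,
# each group sitting behind its own load balancer.
web_pool = RoundRobinPool([f"web{i}" for i in range(1, 6)])
api_pool = RoundRobinPool([f"api{i}" for i in range(1, 6)])


def route(kind):
    """Pick a backend for a request of the given kind ('web' or 'api')."""
    pool = {"web": web_pool, "api": api_pool}[kind]
    return pool.next_node()
```

Keeping the web and API pools separate means either layer can be grown without touching the other.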

All nodes were EC2 ‘small’ instances running Ubuntu Hardy (8.04) and Apache with Phusion Passenger, deployed using Capistrano and Vehicle Assembly.

So on the morning of launch we were running a stack of 14 servers: 2 load balancers, 5 web nodes, 5 API nodes and a database layer. Because AWS makes it so easy, we were also taking hourly EBS snapshots of the database, stored in S3.

This setup kept us going for about the first hour, until Chris appeared on BBC Breakfast and the web traffic went through the roof. Thanks to some seriously smart auto-bootstrapping of AWS EC2 nodes, we were able to scale the web layer to 10 servers easily to handle the load. That, combined with a beefier MySQL AWS instance and some on-the-fly code optimisations, kept the site up.
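To show why hourly pricing suits a launch like this, here is a rough sketch of the arithmetic. The node counts come from the figures above, and the database count is simply what is left over from the stated total of 14. The rate is a parameter because no prices are quoted here; any number you pass in is your own assumption. It also treats every node as costing the same, which ignores the bigger database instance.

```python
# Node counts from the launch-morning stack; the database count is
# inferred from the stated total of 14 servers.
TOTAL_SERVERS = 14
stack = {"load_balancer": 2, "web": 5, "api": 5}
stack["database"] = TOTAL_SERVERS - sum(stack.values())


def hourly_cost(stack, rate_per_instance_hour):
    """Cost of running every node in the stack for one hour."""
    return sum(stack.values()) * rate_per_instance_hour


def run_cost(stack, rate_per_instance_hour, hours):
    """Cost of running the stack for a given number of hours."""
    return hourly_cost(stack, rate_per_instance_hour) * hours


# The scaled-up stack: same layout, with the web layer doubled to 10 nodes.
scaled_stack = dict(stack, web=10)
```

Comparing `hourly_cost(stack, rate)` with `hourly_cost(scaled_stack, rate)` gives the extra spend per hour while the surge lasts; once the extra web nodes are shut down, the bill drops back.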

I’ve been lucky enough to work on some big Rails projects in the past, but this was my first experience of Rails in a high-traffic environment. If I had to do the launch again, would I do anything differently? Sure. Could we have done with some more time to test the production stack? Definitely. But we survived the flood and I can’t wait for the next big launch…