Scaling Galaxy Zoo With SQS

Over the last 8 months we've received close to 45 million classification clicks from the fantastic Galaxy Zoo community. Averaged over the 8 months that's roughly 8,000 clicks per hour - not bad going! The challenge for our development team has been to design a web stack that's able to cope with big traffic spikes like the one we had earlier this week from APOD but to also keep the hosting costs reasonably low. As I've mentioned before, the pricing model of Amazon Web Services means that we can scale our web and API layers based on how busy we are however what's not so straightforward is scaling the database layer in realtime.

» The Problem

If scaling databases is hard (and we don't want to buy our way out of the problem) then is there an alternative strategy that we can employ? It turns out there is and the solution is asynchronous processing of our classifications. In the past, when you reached the end of the classification decision tree on the Galaxy Zoo site there was a pause between answering the final question and the page refreshing with the next galaxy for analysis. During this pause the classification was being saved back to the database and then a second request made to get a new galaxy. The problem with this approach is obvious - the busier the site gets, the busier (and slower) the database becomes and the longer it takes for the page to refresh. A better approach then would be to decouple the classification-saving from the website user interface and remove the delay between classifications.

» The Solution - Asynchronous Processing

About 3 weeks ago we made a change to the Galaxy Zoo site to remedy this situation - the solution was to use a message queue. Message queues are basically a web-hosted queue of small snippets (or messages) of information - in our case a classification! Handy for us, Amazon have a message queueing service called Simple Queue Service (SQS) and we're using it to help us scale. The old model of saving a classification was to send an XML snippet back to the Galaxy Zoo API and wait for confirmation of a successful save to the database. The difference now is that this XML snippet is written to a SQS queue and we have a separate daemon that processes the queue. By posting the XML classifications to SQS I'm pleased to say that we've dramatically improved the responsiveness of the Galaxy Zoo site and managed to avoid paying for a significantly more expensive database!

» A resounding success?

Before I get too self-congratulatory here it's important to realise that whilst using a message queue has helped us a great deal, there are some undesirable consequences that can arise during busy periods. By using a queue we haven't actually increased the rate at which we can save classifications back to the database, instead we've just created a buffer that we can store the classifications in until the site quietens down and we can process the backlog. Typically there are less than 5 messages in the queue (i.e. we're keeping up with the current classification rate) however during very busy periods this isn't the case: Earlier this week we had a couple of very busy days which meant that at one point there were 30,000 classifications in the queue waiting to be saved! The consequence of these messages being queued is that it's possible that you could classify a whole bunch of galaxies but not see them in your recent history in 'My Galaxies' until minutes or hours later.

» Conclusions

Overall we've been very pleased with the new queue-based system - we've successfully managed to decouple the user interface from a database that's starting to get a little sluggish. The issue of 'My Galaxies' being slightly out of date only arises during particularly busy periods and usually resolves itself within less than an hour. With the launch of Amazon's RDS this week realtime resizing of a database may finally be a reality, but for now message queueing can definitely be used as an effective scaling strategy.