Rails / Sinatra / Metal Shootout

I’m currently in Las Vegas attending RailsConf 2009. This morning I heard Heroku’s Adam Wiggins give an excellent overview of Rails Metal, Rack and Sinatra.

Some time ago, Rails adopted Rack as its middleware layer. For those not in the know (myself included before Adam’s talk), here’s how the RailsGuides describe Rack:

Rack provides a minimal, modular and adaptable interface for developing web applications in Ruby. By wrapping HTTP requests and responses in the simplest way possible, it unifies and distills the API for web servers, web frameworks, and software in between (the so-called middleware) into a single method call.
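
In practice, a Rack application is just an object that responds to call(env) and returns a three-element array of status, headers and body. A minimal (purely illustrative) example looks like this:

class HelloRack
  # Rack passes in the request environment hash and expects
  # [status, headers, body] back; the body must respond to #each.
  def self.call(env)
    [200, {"Content-Type" => "text/plain"}, ["Hello from Rack"]]
  end
end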

Metal is essentially a thin wrapper around the Rack middleware layer of Rails. Why is this important? Well, by dropping down to Metal it’s possible to bypass the Rails framework completely and squeeze the absolute maximum performance out of your stack. Specifically, this is useful for a commonly hit request where response time is crucial and you want to avoid the overhead of passing through the Rails routing mechanism before serving a response.

» The Test

OK, so enough talk, what do the numbers look like? This is by no means meant to be a thorough test of all possible permutations of Rails, Sinatra and Metal; rather, I’m interested in replacing a simple API method with a Sinatra application and a Metal endpoint. The API I’m testing is the Galaxy Zoo API layer. Within the Galaxy Zoo API we have the concept of ‘Assets’. An Asset is something like an SDSS galaxy image, and a frequently accessed API URL looks like:

http://api_url/api/assets/:id

This API call returns a simple XML snippet that looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<asset>   
  <id>1</id>
  <location>http://s3.aws.com/1.jpg</location>
  <project_id>1</project_id>   
  <external_ref></external_ref>
</asset>

I used ApacheBench (ab) to test each option. Passenger 2.2.2 / Rails 2.3.2 and my MacBook Pro (2.53 GHz) were used to serve the application. Also, to ensure a reasonably fair test, I rebooted the OS for each variant and ‘warmed up’ Apache by running the test four times, taking the benchmark results from the fifth and final run of this command:

ab -n 1000 -c 4 http://api_url/api/assets/1

This is basically making 1000 requests with 4 concurrent connections.

» The Results

OK, so first up I used a standard Rails controller action, the code for which is shown below:

def show
  @asset = Asset.find(params[:id])

  respond_to do |format|
    format.xml { render :xml => @asset.to_xml }
  end
end
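
For completeness, the standard Rails version reaches this action through a regular RESTful route - something along these lines (illustrative only; the real Galaxy Zoo routes file contains rather more than this):

# config/routes.rb (Rails 2.3) - illustrative only
ActionController::Routing::Routes.draw do |map|
  # Maps GET /api/assets/:id to AssetsController#show
  map.resources :assets, :path_prefix => '/api', :only => [:show]
end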

This came out at a very reasonable 230 requests per second:
Requests per second: 229.64 [#/sec] (mean)

Next up I added a Sinatra ‘application’ to respond to the api/assets/:id URL. Because of the way that Rails uses Rack, Sinatra/Metal endpoints are picked up before the Rails routing mechanism kicks in, so no modification to the routes.rb config is required for the Sinatra application to handle the request URL.

By default, Sinatra/Metal endpoints are picked up if they are placed in RAILS_ROOT/app/metal/ and have a class name that matches the filename, for example:

RAILS_ROOT/app/metal/sinatra_asset.rb

require 'sinatra'

class SinatraAsset < Sinatra::Application
  # Run Sinatra in production mode (Sinatra expects a symbol here)
  set :environment, :production

  get '/api/assets/:id' do
    Asset.find(params[:id]).to_xml
  end
end
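
One small caveat I didn’t worry about for the benchmark: as written, the Sinatra action returns the XML string with Sinatra’s default text/html content type. If your API clients care about the header, Sinatra’s content_type helper will set it:

get '/api/assets/:id' do
  content_type 'application/xml'  # otherwise Sinatra defaults to text/html
  Asset.find(params[:id]).to_xml
end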

Benchmarking Sinatra produces the following results:
Requests per second: 416.61 [#/sec] (mean)

Wow! So we’ve gone from ~230 requests per second using a standard Rails controller action up to over 400 requests per second using Sinatra. This is obviously a pretty serious speed boost, and for really not very much work.

Finally I tested a Metal endpoint to intercept the same request URL. Once again, Metal endpoints need to be installed in:

RAILS_ROOT/app/metal/metal_asset.rb

class MetalAsset
  def self.call(env)
    url_pattern = /^\/api\/assets\/(\d+)$/
    if m = env['PATH_INFO'].match(url_pattern)
      asset = Asset.find(m[1])
      # Rack bodies should respond to #each, so wrap the XML string in an array
      [ 200, {"Content-Type" => "text/xml"}, [asset.to_xml] ]
    else
      # Returning a 404 lets Rails pass the request on to the rest of the stack
      [ 404, {"Content-Type" => "text/html"}, ["Not Found"] ]
    end
  end
end

So Sinatra was fast - how fast is Metal? Well it’s pretty nippy:
Requests per second: 522.12 [#/sec] (mean)

» Conclusions

As I mentioned earlier, this is by no means meant to be a thorough test of how Rails controller actions perform compared to their Sinatra and Metal equivalents; however, the numbers are pretty spectacular: a bare Metal endpoint more than doubles the number of requests this application can handle per second. This is not to say that the Sinatra results weren’t pretty damn good too - using Sinatra gave an 80% speed boost for this simple API request.

It seems clear that a significant speed boost can be had by getting down to ‘the metal’. Personally I prefer the clear syntax of Sinatra over the URL regex that Metal requires to achieve the same result, although the additional ~100 requests per second that Metal offers over Sinatra is hard to ignore.

David Heinemeier Hansson talked this week about the refactoring that’s going on with the Rails routing mechanism for the upcoming Rails 3 release, so it’s possible that these numbers could change significantly when Rails 3 makes it into the wild. For now though, if you’ve got a Rails application with a frequently accessed URL, drop in a Sinatra application or a Metal endpoint and watch it fly!

Confessions of a Zoonometer™ Addict

Last week at Galaxy Zoo, as part of the 100 hours of Astronomy, we challenged the Zooites to do 1 million clicks in 100 hours - a big challenge. In the week before the 100 hours we’d received about 1 million clicks, so although the challenge of reaching 1 million was a big one, it seemed perfectly realistic. I don’t know about everyone else, but I couldn’t stop refreshing the Galaxy Zoo homepage to check on the latest total. In the end we reached our goal of 1 million clicks at about 12:45pm on the Saturday, a mere 72 hours into the challenge!

» 1.45 million clicks in 100 hours

I wondered what would happen once we’d reached 1 million - would people stop classifying? Absolutely not! In the final 28 hours we added a further 450,000 clicks to the Zoonometer™ total reaching a grand total of 1.45 million clicks in 100 hours… Or did we?

» What the Zoonometer™ should have been reading

As I mentioned earlier, in the week before the 100 hours challenge we’d had about 1 million clicks and so with all the extra publicity surrounding the 100 hours of Astronomy I was secretly hoping that we might get closer to 2 million clicks. It turns out we did…

When writing the code for the Zoonometer™ I had to make a few changes to the Galaxy Zoo website and API. Without really thinking, I decided that rather than count the total number of clicks each time we wanted to update the Zoonometer™ (a MySQL query that takes about 6 seconds), I’d keep the total as a separate counter. Each time someone classified a galaxy I’d add 1 to the total; this way the current total could be checked very quickly and we could update the Zoonometer™ more frequently.
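
Concretely, that meant a classification_count column on the projects table, added with something like this migration (the column name is the real one; the rest is illustrative):

class AddClassificationCountToProjects < ActiveRecord::Migration
  def self.up
    # Running total of classifications for each project
    add_column :projects, :classification_count, :integer, :default => 0
  end

  def self.down
    remove_column :projects, :classification_count
  end
end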

What a great idea, Arfon! Erm, no… It turns out that this was a really bad idea, and here’s why.

In the API we have a Project and Classification model. The Project has_many classifications and so I was keeping a counter column on the Galaxy Zoo project entry. In the code I had something like this as an after_create callback on the Classification model:

def update_counter
  self.project.classification_count = self.project.classification_count + 1
  self.project.save
end

Simple, right? When a classification comes in, add one to the project total and keep going. I had tests, the method worked, everything looked peachy. What I didn’t consider is what happens when you’re getting 30-40 classifications per second. Let’s consider what happens when two (or more) classifications are processed simultaneously. If the database is very busy, it’s quite possible that both after_create callbacks read the same value of the project’s classification_count before either has saved its update. That is, if both callbacks read a value of 1000 for the current classification_count, they will both write back the new value of 1001 and one click is lost. Oh dear.
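
As an aside, the fix is straightforward: let the database do the increment atomically rather than reading, adding one and writing back in Ruby. Here’s a sketch of roughly what I should have written, using ActiveRecord’s increment_counter, which issues a single relative UPDATE:

class Classification < ActiveRecord::Base
  belongs_to :project

  after_create :update_counter

  private

  # UPDATE projects SET classification_count = classification_count + 1 WHERE id = ...
  # Two concurrent callbacks each add one; neither can stomp on the other's value.
  def update_counter
    Project.increment_counter(:classification_count, project_id)
  end
end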

So what does this mean? Well, the bad news is that the Zoonometer™ was reporting the wrong total. The great news is that we didn’t record 1.45 million clicks in the 100 hours of Astronomy - we actually had 2,617,570! Yes, you heard me: 2,617,570 clicks.

Turns out that Zoonometer™ was a little off the mark…

» A retrospective

So 2,617,570, not 1,450,000 clicks? Pretty impressive stuff. I knew we were busier than the Zoonometer™ was reporting; I just couldn’t figure out why it wasn’t counting properly! 2,617,570 is an amazing number to have reached in just 100 hours and I’d like to thank all the people who worked so hard to help us reach this total.

I’m putting this down to experience. To be honest I’ve never worked on a project quite so popular as Galaxy Zoo and problems like this only arise in very busy environments such as ours. When we next have to bring out the Zoonometer™ you can be assured of an accurate total!

Master/Slave Databases with Rails

Getting ActiveRecord to talk to multiple databases is easier than you might think.  It’s possible to override the connection settings in database.yml at the model level by doing something like:

establish_connection(
  :adapter  => "mysql",
  :host     => "localhost",
  :username => "myuser",
  :password => "mypass",
  :database => "somedatabase"
)

Calling the establish_connection method at the model level simply overrides the ActiveRecord connection object for that model. At Galaxy Zoo we needed to do something a little different: initially we were write-dominated at the database layer - 16 million classifications in the last month, peaking at about 50 classifications per second on launch day. However, as things have settled down and we’ve been adding more user-centric features to the Galaxy Zoo site, we’ve been finding that a significant amount of our database load has been coming from more complicated queries (reads) rather than lots of writes. An ideal solution for a situation like this is to introduce some kind of MySQL replication, thus distributing the load across multiple databases. Rather than introducing the complexity of offset primary keys in a Master/Master configuration, we’ve opted for a standard MySQL Master/Slave configuration, sending writes to the Master and reads to the Slave. But how to accomplish this with ActiveRecord?

» Enter Masochism

Masochism is a Rails plugin by Rails core team member technoweenie (Rick Olson). It works by overriding the ActiveRecord connection object with a ConnectionProxy that (by default) sends writes to the Master MySQL database and reads to the Slave. We’ve been running Masochism in production now for about 2 weeks and so far there’s not much to say other than it works!
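
Setup is pleasantly small. From memory (so do check the plugin’s README before copying this), you point a master_database entry in database.yml at the Master, leave your normal environment entry pointing at the Slave, and turn the proxy on once Rails has initialised:

# config/environment.rb - inside the Rails::Initializer.run block
config.after_initialize do
  # Swap ActiveRecord's connection for masochism's master/slave proxy
  ActiveReload::ConnectionProxy.setup!
end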

We’ve made a couple of optimisations along the way after examining the production logs: when writing a classification to the database there are a couple of writes, then some reads, then some more writes… In the log you see something like this:

Switching to Master
Switching to Slave
Switching to Master
Switching to Slave
Switching to Master
Switching to Slave

Obviously, switching back and forth between the Master and Slave databases within a single action like this is less than ideal. Thankfully it’s possible to override the behaviour so that only one of the databases is used for the whole action:

around_filter ActiveReload::MasterFilter, :only => [:create]
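
For context, that filter sits in the controller handling incoming classifications (the controller name below is illustrative), pinning everything inside create, reads included, to the Master:

class ClassificationsController < ApplicationController
  # Use the Master database for the whole of #create so we don't
  # bounce between Master and Slave part-way through the request
  around_filter ActiveReload::MasterFilter, :only => [:create]

  def create
    # ... create the classification as normal ...
  end
end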

Masochism is a nice solution to a common problem - using ActiveRecord in a replicated database environment. I can already see us outgrowing Masochism: specifically, it doesn’t support multiple slave databases, which is a shame. When that day comes we’ll no doubt look to an alternative such as FiveRuns’ DataFabric or MySQL Proxy. But for now, Masochism works, and I can highly recommend it.

Surviving the Flood

It’s been nearly a month now since the launch of Galaxy Zoo 2 and with close to 15 million classifications the volume of traffic has exceeded even our wildest expectations.

I joined the Galaxy Zoo team in January this year and in the six weeks before launch worked pretty much non-stop to re-implement the Galaxy Zoo 2 beta site in Rails as well as write a web service to capture back the results. Now I know it’s almost always best to avoid the big rewrite but we had many good reasons for moving away from the old infrastructure and codebase.

The original Galaxy Zoo project was really an accidental success - the team had no idea that what they had created would become so popular so quickly and the story of the melting web server is Zoo folklore these days. With this in mind we were keen for a smooth launch of Zoo 2.

I think one of the most significant moves we made for the launch was to host the new Galaxy Zoo website and API on Amazon Web Services (AWS). AWS has a pay by the hour pricing model which was perfect for our very public launch. Below is a diagram of the production web stack we were running on for launch day. Blue (LB) nodes are HAProxy load balancers (one for the web nodes and one for the API nodes). Pink (WEB) nodes are serving up the Galaxy Zoo website, yellow (API) nodes are running the API backend of the Galaxy Zoo site (serving up images, capturing back classifications) and finally the green/white nodes are the MySQL Master/Slave databases.

All nodes were EC2 ‘small’ instances running Ubuntu Hardy (8.04), Apache with Phusion Passenger and deployed using Capistrano and Vehicle Assembly.

So on the morning of launch we were running a stack of 14 servers - two load balancers, 5 web nodes, 5 API nodes and a database layer. Because AWS makes it so easy, we were also taking hourly EBS snapshots of the database, stored in S3.

This setup kept us going for about the first hour, until Chris appeared on BBC Breakfast and the web traffic went through the roof. Thanks to some seriously smart auto-bootstrapping of AWS EC2 nodes we were able to scale the web layer to 10 servers to handle the load; combined with a beefier MySQL AWS instance and some on-the-fly code optimisations, we managed to keep the site up.

I’ve been lucky enough to work on some big Rails projects in the past but this was my first experience of Rails in a high-traffic environment. If I had to do the launch again would I do anything different? Sure. Could we have done with some more time to test the production stack? Definitely. But we survived the flood and I can’t wait for the next big launch…