Twitter Streaming with EventMachine and DynamoDB

This week Amazon Web Services launched their latest database offering ‘DynamoDB’ - a highly-scalable NoSQL database service.

We’ve been using a couple of NoSQL database engines at work for a while now: Redis and MongoDB. Mongo allowed us to simplify many of our data models and represent the underlying entities in our applications more faithfully, while Redis is used for those projects where we need to make sure that a person only classifies an object once.1

Whether you’re using MongoDB or MySQL, scaling the performance and size of a database is non-trivial and is a skillset in itself. DynamoDB is a fully managed database service aiming to offer high-performance data storage and retrieval at any scale, regardless of request traffic or storage requirements. Unusually for Amazon Web Services, they’ve made a lot of noise about some of the underlying technologies behind DynamoDB, in particular the use of SSDs for storage. I guess this is designed to give us a hint at the performance characteristics we might expect from the service.

» A worked example

As with all AWS products there are a number of articles outlining how to get started with DynamoDB. This article is designed to provide an example use case where DynamoDB really shines - parsing a continual stream of data from the Twitter API. We’re going to use the Twitter streaming API to capture tweets and index them by user_id and creation time.

DynamoDB has the concept of a ‘table’. Tables can be created using the AWS Console, by making requests to the DynamoDB web service or by using one of the abstractions such as the AWS Ruby SDK. There are a number of factors you need to consider when creating a table, including read/write capacity and how the records are indexed. The read/write capacity can be modified after table creation but the primary key cannot. DynamoDB assumes a fairly random access pattern for your records across the primary key - a poor choice of primary key could in theory lead to sub-par performance.

DynamoDB is schema-less (NoSQL) and so all we need to define upfront is the primary key for indexing records. Primary keys can be defined in two ways:

Simple hash-type primary key - a single attribute on which an unordered hash index is built.

Hash and range type primary key - a composite of two attributes. In this situation two indexes are made on the records - an unordered hash index on the hash attribute and a sorted range index on the range attribute.

» Why should I care about my choice of primary key?

Choosing an appropriate primary key is especially important with DynamoDB as it is only the primary key that is indexed. That’s right - at this point in time it is not possible to create custom indexes on your records. This doesn’t mean that querying by item attributes isn’t possible, it is, but you have to use the Scan API, which is limited to 1MB of data per request (you can use a continuation token for more records), and as each scan has to read the full table, performance will degrade as the database grows.

For this example we’re going to go with the composite hash and range key (user_id as the hash_key and created_at as the range_key) so that we can search for Tweets by a particular user in a given time period.
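For illustration, table creation looks roughly like this sketch, modelled on the v1-era aws-sdk gem’s `AWS::DynamoDB` interface (the exact signature may differ between SDK versions; the 10/5 read/write capacities are placeholder values and `dynamo_db` is assumed to be an authenticated client):

```ruby
# Create the 'tweets' table with a composite primary key:
# user_id as the hash key, created_at as the range key.
def create_tweets_table(dynamo_db)
  dynamo_db.tables.create(
    'tweets', 10, 5,                       # name, read capacity, write capacity
    hash_key:  { user_id:    :number },
    range_key: { created_at: :string }
  )
end
```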

» Writing records

DynamoDB implements an HTTP/JSON API and this is the sole interface to the service. As it’s GOP debate season we’re going to search the Twitter API for mentions of the term ‘Romney’ or ‘Gingrich’. When a tweet matches the search we’re going to print the Twitter user_id, a timestamp, the tweet_id, the tweeter’s location and their screen_name.
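A minimal sketch of that field extraction, using the stdlib JSON parser (field names follow the Twitter streaming API payload; the actual stream consumption, e.g. via a streaming-API gem, is omitted):

```ruby
require 'json'

# Pull the fields we care about out of a raw tweet JSON string.
def extract_fields(raw)
  t = JSON.parse(raw)
  {
    'user_id'     => t['user']['id'],
    'created_at'  => t['created_at'],
    'tweet_id'    => t['id'],
    'location'    => t['user']['location'].to_s,
    'screen_name' => t['user']['screen_name']
  }
end
```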

Next we want to write the actual records to DynamoDB.
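The write itself can be sketched like this, modelled on the aws-sdk gem’s item collection API (`table.items.create`); `table` is assumed to be a handle to the ‘tweets’ table and `attrs` a hash of tweet fields:

```ruby
# Create one item per tweet. user_id (hash key) and created_at (range key)
# must both be present for DynamoDB to accept the write.
def store_tweet(table, attrs)
  raise ArgumentError, 'missing key attributes' unless attrs['user_id'] && attrs['created_at']
  table.items.create(attrs)
end
```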

We can see the tweet count in our table increasing over time using the following script.

» Write performance

As things currently stand, our write performance is limited by the time taken by a single Ruby process to complete an HTTP request to the DynamoDB API endpoint. Regardless of your write capacity and network performance you’re never going to realise the full performance of DynamoDB using a single-threaded process like this. What we want instead is a multi-threaded tweet parsing monster.

» Enter EventMachine

Given that write performance is limited by the HTTP request cycle, when we receive notification of a new tweet we want to pass off the task of posting that tweet to DynamoDB to a separate thread. EventMachine has a simple way to handle such a scenario: the EM.defer method.
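A sketch of the hand-off (in the real app this is just `EM.defer { ... }` inside `EM.run`; the `deferrer` parameter stands in for EM itself purely so the logic can be exercised without a running reactor, and `table` is the DynamoDB table handle):

```ruby
require 'json'

# Hand the blocking DynamoDB write to EventMachine's internal thread pool so
# the reactor thread stays free to keep consuming the Twitter stream.
def handle_tweet(raw, table, deferrer)
  deferrer.defer do
    tweet = JSON.parse(raw)
    table.items.create(
      'user_id'    => tweet['user']['id'],
      'created_at' => tweet['created_at'],
      'tweet_id'   => tweet['id']
    )
  end
end
```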

I’ve set up an EventMachine threadpool_size of 50 (the default is 20) which means that we can have 50 threads simultaneously writing to DynamoDB. Awesome. If you choose a popular term on Twitter (such as Bieber) you’ll see dramatically improved write performance using EventMachine this way.

» Reading back those records

As I mentioned earlier, you’re limited to relatively simple indexes with DynamoDB and so the primary key you choose will have a significant effect on the query performance of your application. Below are some examples of querying using the index and using the Scan API.
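Sketched against the v1-era aws-sdk interface (method names here are illustrative and may differ between SDK versions), the two access paths look like:

```ruby
# Query: served by the hash + range index, so it stays fast as the table grows.
def tweets_by_user(table, user_id, since)
  table.items.query(hash_value: user_id, range_gte: since)
end

# Scan: reads the whole table and filters on an arbitrary attribute -
# fine for small tables, increasingly slow as the data grows.
def tweets_mentioning(table, term)
  table.items.where(:text).contains(term)
end
```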

For the simplest use cases executing queries using the Scan API will suffice. As data volumes grow, however, common queries will need to be made more performant by keeping separate tables indexing the attributes you’re querying on. This is one of the core differences between DynamoDB and other NoSQL solutions such as MongoDB today - if you want to produce complex indexes of your records then you’ll need to do the heavy lifting yourself.

» Conclusions

DynamoDB is an exciting new technology from AWS and I can already think of a number of use cases where we’d be very seriously considering it. As I see it, DynamoDB sits somewhere in between Redis and MongoDB, providing a richer query API than something like Redis but requiring more of your application (or a change in your data model) to support queries against a range of attributes.

One significant factor favouring DynamoDB is that, just like RDS, it’s a database as a service. We’ve been using RDS in production now for well over two years and the sheer convenience of having someone else manage your database, making it easy to provision additional capacity or set up read-slaves, can’t be overstated. Our largest MongoDB instance today is about 5GB in size and very soon we’re going to have to begin thinking about how to scale it across a number of servers. Regardless of the potential additional complexity of DynamoDB, not having to think about scaling is a massive win and makes it worthy of consideration for any NoSQL scenario.

1. We keep a Redis key for each user with the ids of the objects that they have classified. Subtracting that from a key with all available ids and then returning a random member is super-fast in Redis.

Now is the Time for Astroinformatics

What follows has been bouncing around in my head for about a year now; I’ve decided to write it down finally after hearing reports of Alex Szalay’s talk here in Oxford (sadly I was away) and seeing this recent paper on the arXiv. My apologies if this is a little too much about me, I promise not to do this too often.

» Nottingham circa 2002

In late 2002 I began my PhD at The University of Nottingham in the Astrochemistry group. Like many new PhD students, it wasn’t entirely clear to me before I arrived exactly which project I’d be working on but I arrived full of enthusiasm and looking forward to applying my understanding of Chemistry to the Cosmos. To be honest though I had little idea what a PhD entailed and after the standard ‘go away and read these 50 papers about subject xxxx’ I found myself needing to look through and match a large number of infrared sources from the IRAS and 2MASS catalogues.

After two days of manually going through the objects I realised there had to be a better way and that I should probably write a ‘program’ to go through the candidate list more rapidly. Following a lunchtime chat with one of the older PhD students I duly picked up a copy of a C programming book and spent the rest of the week writing about 40 lines of C code (yeah, not very impressive, I know).

Over the next year this small C program became a complete monster. Hundreds or even thousands of lines of buggy, repetitive, undocumented code that on a good day seemed to produce answers that looked reasonable and on a bad day took me all morning to spot the smallest typo. Having a single copy of this program would have been bad enough but I had about 50 different versions, each very slightly different and with its own bugs. I can honestly say that I lost well over a month of my (PhD) life debugging just that one piece of code; I’d rather not think about how much time in total I spent debugging.

» Not Unusual

Was my experience as a new PhD student unusual? No. At the time I saw a number of my peers having exactly the same experience and I see it still today with the new students here in Oxford. Most students’ first experience of programming is either to inherit an existing codebase that’s probably just as buggy as my IR source matcher or to write something from scratch that solves an immediate problem with their research. Either way, it’s highly unlikely that they have any formal training in how to go about writing readable, efficient, maintainable code.

So with all of this terrible code being written, how on Earth do we ever get PhDs? I think the short version is that we muddle through. Students are smart enough to pick up a programming book and relatively quickly code up a solution, or to learn a new language by the trial and error editing of an existing script.

But this trial and error style of coding is exactly the problem. At no point in the process is any real attention paid to how the code is written and, to be honest, why should there be? I don’t recall being asked about my test coverage during my viva defence, let alone having regular code reviews with my supervisor.

The truth is that for the majority of researchers the process of software development stops the very instant they believe their code is doing what they expected.

» Future Nottingham circa 2022

With your permission, I’m going to time-travel roughly 10 years into the future. A PhD student has just begun their PhD and they’ve been asked by their supervisor to look for radio sources with particular characteristics in a data cube from the Square Kilometre Array telescope1. In theory this is fine: they just need to step through the dataset by position and frequency searching for matches, except that the data cube they need to process is many hundreds of terabytes in size.

Faced with this formidable task the student would hopefully realise that the problem can be solved relatively easily in parallel by splitting up the data cube into many millions of smaller chunks. This means that the challenge faced by the student isn’t just to figure out what code to write to search for sources within each sub-cube but also to write a whole load of job management code to chunk up the cube and farm out these bite-sized searches. They’d also hopefully have heard of the MapReduce programming model and realise that the Hadoop project solves exactly this kind of problem and that they didn’t need to reinvent the wheel for their studies. Hopefully.

The more likely reality is that, because they didn’t have a computer science background, they’d spend a huge amount of time writing run-once custom scripts that took days/weeks/months to run or, worse still, conclude that searching to this fidelity at such scale was impossible.

I believe in the next decade there’s a real possibility that the pace of research within astrophysics is going to be severely limited by the lack of awareness amongst researchers of modern approaches to data-intensive computation.

» Informatics for Astrophysics

After completing my PhD in 2006 I worked in the Production Software Team at the Wellcome Trust Sanger Institute building high-throughput software to manage the sequencing of DNA samples. Working here taught me two key things:

  1. How to write high-quality, production-grade software.
  2. That the biosciences are well ahead of astronomers when it comes to building software for science.

Biologists are ahead of astrophysicists because they’ve realised that working with large quantities of biological data is a research specialism in itself. Between the Sanger Institute and the European Bioinformatics Institute (on the same site) there are probably ~300 bioinformaticians on campus. That’s 300 people whose expertise is not in pure genomics research; rather, they are world leaders at understanding how to produce, store and analyse biological data.

It’s not that within the astronomy community as a whole there aren’t people developing high-quality software or using modern toolsets to solve common problems (see e.g. Wiley et al.) - quite the opposite. The problem is that the efforts of these researchers are not widely recognised as an academic specialism. Until we as a community (and that means funding bodies too) recognise and value the research and development efforts of researchers building new tools, we are unlikely to see any significant advances.

We need to stop relying on occasional gems of brilliance by individual researchers and move to a fully-fledged research area that combines astrophysics with data-intensive computer science.

» Building for the Future

In the next decade, astronomers are likely to face data rates of unprecedented volumes from both the Large Synoptic Sky Telescope and the Square Kilometre Array. If the current situation continues then I think we’re going to have serious problems as a community storing, analysing and sharing the data products from these facilities.

All is not lost however. There are concrete steps that we can start to take today that will improve our chances of developing smart solutions in the future:

  1. Train our students: We need to start teaching new undergraduates (and postgraduates) about the fundamentals of software development best practices. That means learning about (and using) version control, understanding how to structure code and writing tests to cover core functionality2.
  2. Build for reuse and share: We need to stop writing run-once ‘scripts’ and start building ‘software’ for reuse. Sharing code with others is a good way to encourage improvements in code quality.
  3. Recognise advances in other fields: Astronomers are not unique in experiencing the ‘data flood’. Recognising innovation and advances in techniques for data processing in other fields (e.g. the biosciences) will be crucial.
  4. Value informatics: Recognising and funding astroinformatics as a research specialism in its own right is vital for success. That means having more meetings like this one and funding research programmes aimed at developing innovative solutions that solve real problems.

I believe astroinformatics has a bright future. The old adage ‘innovate or die’ seems to be a pretty good fit here. Many of the solutions for tomorrow’s astrophysical data problems are already available in other research areas or industry; I just hope that we as a community can begin to recognise the advantages of devoting time and effort to exploring these possibilities.

1. The SKA is going to be massive. Data products from it are going to be rather large and unwieldy.
2. I’m not going to turn this into an essay about how to write better software. There are plenty of books/blogs/conferences that can help you with that.

Building a data archiving service using the GitHub API, figshare and Zenodo

Over the past couple of weeks we’ve seen a couple of great examples of service integrations from figshare and Zenodo that use the GitHub outbound API to automatically archive GitHub repositories. While the implementation of each solution is likely to be somewhat different I thought it might be useful to write up in general terms how to go about building such a service.

In a nutshell we need a tool that does the following:

  • Authenticates against the GitHub API
  • Configures an outbound API endpoint for repository events to be posted to
  • Responds to a GitHub repository event by grabbing a copy of the code
  • Issues a DOI for the code bundle

A while ago, together with Kaitlin Thaney (@MozillaScience) and Mark Hahnel (@figshare), I put together a proof of concept implementation called Fidgit that basically does the above. You can read more about how to run your own version of this service in the README here.

» Tuning in to the GitHub outbound API

GitHub has both an inbound API (i.e. for sending commands to GitHub) and an outbound notifications API called webhooks. By configuring the webhooks for a repository, it’s possible to receive an event notification from GitHub with some information about what has changed on the repo. Check out this list for all of the event types it’s possible to tune into, but for the purposes of this article we’re going to focus on the event type that is generated when a new release is created.

Whether it’s Zenodo, figshare or a reference implementation like Fidgit, they all rely upon listening to the outbound GitHub API webhooks and responding with some actions based upon the content in the JSON payload received.
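The payload handling boils down to something like this sketch (field names follow GitHub’s ‘release’ event JSON; the webserver receiving the POST is omitted):

```ruby
require 'json'

# Pull the bits an archiving service needs out of a GitHub 'release'
# webhook payload.
def parse_release_event(body)
  payload = JSON.parse(body)
  {
    repo:        payload['repository']['full_name'],
    tag:         payload['release']['tag_name'],
    zipball_url: payload['release']['zipball_url']
  }
end
```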

» Creating a webhook on a GitHub repo

Creating a webhook for a GitHub repo is something that can only be done by someone who has permissions to modify the repository state in some way. In order for figshare and Zenodo to set up the webhooks on your GitHub repos, both applications ask you to log in with your GitHub credentials and authorise their applications to administer your webhooks; they do this using OAuth. While a full OAuth login flow is the ‘complete’ way to do this, Fidgit simply requires a personal access token from your GitHub profile and uses this to authenticate and create a webhook. Note that it’s possible to ask for only conservative permissions on a GitHub user’s account by requesting just the webhook-administration OAuth scopes. You can read more about these scopes here.
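As a sketch of what the personal-access-token route looks like, here’s a request builder for GitHub’s hooks endpoint (the payload shape follows the GitHub v3 hooks API; `endpoint_url` and `token` are placeholders, and the request is built but not sent):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Build (but don't send) the POST that creates a 'release' webhook on a repo,
# authenticated with a personal access token.
def build_hook_request(repo, token, endpoint_url)
  uri = URI("https://api.github.com/repos/#{repo}/hooks")
  req = Net::HTTP::Post.new(uri)
  req['Authorization'] = "token #{token}"
  req['Content-Type']  = 'application/json'
  req.body = JSON.generate(
    'name'   => 'web',
    'active' => true,
    'events' => ['release'],                                  # only fire on new releases
    'config' => { 'url' => endpoint_url, 'content_type' => 'json' }
  )
  req
end
```

Sending it is then a single `Net::HTTP.start` call against `api.github.com` with `use_ssl: true`.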

» Archiving the code

Once notified of a change in the repository (such as a new release) then we need to go and grab that code. This could be in the form of a ‘git clone’ of the repository and all of its history, but Fidgit, Zenodo and figshare all choose to just grab a snapshot of the code from the GitHub raw download service. At the bottom right of the page of every GitHub repository there’s a link to ‘Download ZIP’. This basically gives us a copy of the current state of the repository but without any Git (or GitHub) information attached, such as the Git history. As these files can be reasonably large it makes sense to grab this code bundle in a background worker process. That happens in the Fidgit worker here, which basically uses plain old curl to grab the zip archive and then pushes the code up to figshare through their API.
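For reference, the snapshot URL follows a predictable pattern, so a worker only needs the repo’s full name and a ref (the curl invocation in the comment is the kind of fetch a Fidgit-style worker shells out to):

```ruby
# Build the GitHub code-snapshot (zip) URL for a repo and ref.
def snapshot_url(full_name, ref = 'master')
  "https://github.com/#{full_name}/archive/#{ref}.zip"
end

# e.g. shell out with: curl -L -o snapshot.zip <url>
```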

» Putting a DOI on it

This step is left as an exercise for the reader (just kidding). Fidgit doesn’t do this; figshare, Zenodo and Dryad handle this bit themselves and so it’s out of scope for this article.

Autoscaling on AWS Without Bundling AMIs

At Zooniverse HQ we typically run between 15 and 20 AWS instances to support our current set of projects. It’s a mix of fairly vanilla Apache/Rails webservers, AWS RDS MySQL instances, a couple of LAMP machines and a bit of MongoDB goodness for kicks. Over the past year we’ve migrated pretty much all of our infrastructure to make full use of the AWS toolset, including elastic load balancing and auto-scaling for our web and API tiers, SQS for asynchronous processing of requests and RDS for our database layer.

Overall I’m pretty happy with the setup, but one pain point in the deployment process has always been the bundling of AMIs for auto-scaling. I’ve described before the basic configuration required when setting up an auto-scaling group - the step that always takes the most time is saving out a machine image with the latest version of the production code so that when the group scales the machine is ready to go.

The problem is that for a typical deploy the changes to the code are minimal and really don’t require a new machine image to be built; rather, we just need to be sure that when the machine boots it’s serving the latest version of the application code. Over the past few months I’ve considered a number of different options for streamlining this process, the best of which was an automated checkout of the latest code from version control when the machine boots. This is all very well, but we host our code at GitHub. Now don’t get me wrong, I love their service but I really don’t want to build in a dependency on GitHub for our auto-scaling.

» An alternative?

A couple of weeks ago we made a change to the way that we deploy our applications and I can honestly say it’s been a revolution. The basic flow is this:

  1. Work on new feature, finish up and commit (don’t forget to run your tests)
  2. Push code up to GitHub
  3. Git ‘archive’ the repo locally, tar and zip it up
  4. Push the git archive up to S3
  5. Reboot each of the production nodes in turn
  6. Done!

» Eh?

So that might not look like a big deal, but the secret sauce is what happens when the machine reboots. There’s a simple script that executes on boot to pull down the latest bundle of the production code from S3, put it in the correct place and voilà, you’re running the latest version of the code. We use Capistrano for on-the-fly deployment and so it’s important that this script doesn’t get in the way of that - upon downloading a new bundle of the code the script timestamps the new ‘release’ and symlinks the config files and ‘current’ directory. That way, if we need to, we can still cap deploy a hotfix to a running server.
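A minimal sketch of a boot script like this (bucket and paths are placeholders; the S3 fetch is shown as a comment since the download tool varies - s3cmd, the AWS CLI, and so on):

```shell
# Unpack a code bundle as a timestamped release and flip the 'current' symlink.
# In production the bundle is first fetched from S3, e.g.:
#   s3cmd get s3://my-deploy-bucket/release.tar.gz /tmp/release.tar.gz
deploy_release() {
  bundle="$1"   # path to the downloaded tar.gz
  root="$2"     # application root, e.g. /var/www/app
  ts=$(date +%Y%m%d%H%M%S)
  mkdir -p "$root/releases/$ts"
  tar -xzf "$bundle" -C "$root/releases/$ts"
  ln -sfn "releases/$ts" "$root/current"   # Capistrano-style 'current' pointer
}
```

Because it uses the same releases/current layout as Capistrano, a subsequent cap deploy hotfix still works against the running server.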

» Show me the code already

The Capistrano task used here is super-simple and can, I’m sure, be further improved. Below is the example for our latest project Old Weather.

Scaling Galaxy Zoo With SQS

Over the last 8 months we’ve received close to 45 million classification clicks from the fantastic Galaxy Zoo community. Averaged over the 8 months that’s roughly 8,000 clicks per hour - not bad going! The challenge for our development team has been to design a web stack that’s able to cope with big traffic spikes like the one we had earlier this week from APOD but also to keep the hosting costs reasonably low. As I’ve mentioned before, the pricing model of Amazon Web Services means that we can scale our web and API layers based on how busy we are; what’s not so straightforward, however, is scaling the database layer in realtime.

» The Problem

If scaling databases is hard (and we don’t want to buy our way out of the problem) then is there an alternative strategy that we can employ? It turns out there is: asynchronous processing of our classifications. In the past, when you reached the end of the classification decision tree on the Galaxy Zoo site there was a pause between answering the final question and the page refreshing with the next galaxy for analysis. During this pause the classification was being saved back to the database and then a second request was made to get a new galaxy. The problem with this approach is obvious - the busier the site gets, the busier (and slower) the database becomes and the longer it takes for the page to refresh. A better approach, then, is to decouple the classification-saving from the website user interface and remove the delay between classifications.

» The Solution - Asynchronous Processing

About 3 weeks ago we made a change to the Galaxy Zoo site to remedy this situation - the solution was to use a message queue. A message queue is basically a web-hosted queue of small snippets (or messages) of information - in our case a classification! Handily for us, Amazon has a message queueing service called Simple Queue Service (SQS) and we’re using it to help us scale. The old model of saving a classification was to send an XML snippet back to the Galaxy Zoo API and wait for confirmation of a successful save to the database. The difference now is that this XML snippet is written to an SQS queue and we have a separate daemon that processes the queue. By posting the XML classifications to SQS I’m pleased to say that we’ve dramatically improved the responsiveness of the Galaxy Zoo site and managed to avoid paying for a significantly more expensive database!
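Both halves of the flow can be sketched like this (`queue` duck-types an SQS queue handle with send_message/receive_message, as in the AWS Ruby libraries; the block passed to the daemon stands for whatever persists a classification to the database):

```ruby
# Producer side: runs in the API request cycle; returns as soon as SQS
# accepts the message, so the user never waits on a slow database write.
def enqueue_classification(queue, xml)
  queue.send_message(xml)
end

# Consumer side: the daemon drains the queue at whatever rate the database
# can sustain, deleting each message only once it has been safely saved.
def drain_queue(queue)
  while msg = queue.receive_message
    yield msg.body
    msg.delete
  end
end
```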

» A resounding success?

Before I get too self-congratulatory here it’s important to realise that whilst using a message queue has helped us a great deal, there are some undesirable consequences that can arise during busy periods. By using a queue we haven’t actually increased the rate at which we can save classifications back to the database; instead we’ve just created a buffer that we can store the classifications in until the site quietens down and we can process the backlog. Typically there are fewer than 5 messages in the queue (i.e. we’re keeping up with the current classification rate), however during very busy periods this isn’t the case: earlier this week we had a couple of very busy days which meant that at one point there were 30,000 classifications in the queue waiting to be saved! The consequence of these messages being queued is that you could classify a whole bunch of galaxies but not see them in your recent history in ‘My Galaxies’ until minutes or hours later.

» Conclusions

Overall we’ve been very pleased with the new queue-based system - we’ve successfully managed to decouple the user interface from a database that’s starting to get a little sluggish. The issue of ‘My Galaxies’ being slightly out of date only arises during particularly busy periods and usually resolves itself within less than an hour. With the launch of Amazon’s RDS this week realtime resizing of a database may finally be a reality, but for now message queueing can definitely be used as an effective scaling strategy.

Getting Started With Elastic MapReduce and Hadoop Streaming

A couple of months ago I wrote about how the astrophysics community should place more value on those individuals building tools for their community - the informaticians. One example of a tool that I don’t think is particularly well known in many areas of research is the Apache Hadoop software framework.

Hadoop is a tool designed for distributing data-intensive processes across a large number of machines and its development was inspired by the Google MapReduce and Google File System papers. One of the largest contributors to the Hadoop open source project is Yahoo! who use it to build their search indexes based upon vast quantities of data crawled by their search bots.

» Elastic MapReduce

Hadoop allows you to execute parallel tasks on a number of ‘slave’ machine nodes (the ‘map’ function) before combining the results of the processing into a result (the ‘reduce’ step). This requires understanding how to configure a ‘master’ job-management node as well as a number of ‘slave’ worker nodes to form a Hadoop cluster. Wouldn’t it be nice if you didn’t have to spend your days installing software and configuring firewalls? Well, thanks to Amazon and their Elastic MapReduce service you don’t have to.

Elastic MapReduce is a web service built on top of the Amazon cloud platform. Using EC2 for compute and S3 for storage, it allows you to easily provision a Hadoop cluster without having to worry about set-up and configuration. Data to be processed is pulled down from S3 and processed by an auto-configured Hadoop cluster running on EC2. Like all of the Amazon Web Services, you only pay for what you use - when you’re not using your 10,000 node Hadoop cluster you don’t pay for it.

» Processing with Hadoop

Data processing workflows with Hadoop can be developed in a number of ways, including writing MapReduce applications in Java, using SQL-like interfaces which allow you to query a large dataset in parallel (e.g. Pig), or, perhaps most exciting for people with existing software, using something called Hadoop Streaming. Streaming allows you to develop MapReduce applications with scripts in any language you like for the mapper and reducer, provided that they read input from STDIN and return their output to STDOUT.

While Hadoop can be a valuable tool for any data-rich research domain, building applications using Streaming is a great way to leverage the power of Hadoop without having the overhead of learning a new programming language.

» A Very Simple Streaming Example - aka ‘Who loves Justin Bieber?’

As an example of how to use Hadoop Streaming on Elastic MapReduce I’m going to capture all of the tweets over a 12-hour period that have the word ‘bieber’ in them and search for the word ‘love’. To save these tweets I’ve used the Twitter Streaming API and the simple PHP script below that writes out the tweets to a file in hourly snapshots.

» The Map Script

As I mentioned earlier, the streaming functions/scripts can in theory be written in any language - Elastic MapReduce supports Cascading, Java, Ruby, Perl, Python, PHP, R, or C++. In this example I’m going to use Ruby. Elastic MapReduce is going to pass the input files from the S3 bucket to this script running on the parallel slave nodes, hence the use of ARGF.each - we’re reading from STDIN.

Data from the streaming API PHP script are saved as JSON strings to hourly snapshotted files. Each line in the file is a potential tweet so we’re stepping through the file line by line (tweet by tweet), verifying that we can parse the tweet using the Crack Rubygem by John Nunemaker and also checking if the tweet text has the word ‘love’ in it. If we find ‘love’ then we print to STDOUT the word ‘love’ - this is the streaming output from the map step and forms the input for the reduce function.
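The map script boils down to something like this (I’ve used the stdlib JSON parser here rather than the Crack gem, but the logic is the same; in the real map.rb the entry point would just be `map_tweets(ARGF, $stdout)`):

```ruby
require 'json'

# Map step: read one JSON-encoded tweet per line and emit 'love' for each
# tweet whose text contains the word.
def map_tweets(input, output)
  input.each_line do |line|
    begin
      tweet = JSON.parse(line)
    rescue JSON::ParserError
      next                      # skip truncated or garbled lines
    end
    output.puts 'love' if tweet['text'].to_s =~ /love/i
  end
end
```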

» The Reduce Script

The reduce script is super-simple. The output from the map script is automatically streamed to the reduce function by Hadoop. All we do with the reduce script is count the number of lines returned (i.e. the number of times the word ‘love’ is found).
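It looks something like this sketch (entry point in the real reduce.rb would be `reduce_loves(ARGF, $stdout)`):

```ruby
# Reduce step: count the 'love' lines emitted by the mappers and print a total.
def reduce_loves(input, output)
  count = 0
  input.each_line { |line| count += 1 if line.strip == 'love' }
  output.puts "love\t#{count}"
end
```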

» Bootstrapping Nodes

In the map script we have a Rubygem dependency for the Crack gem. As we can’t specify the machine image (AMI) that Elastic MapReduce uses we need to execute a bootstrap script to install Rubygems and the Crack gem when the machine boots.
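The bootstrap script is just a short shell script along these lines (package manager and flags depend on the AMI in use, so treat this as a sketch rather than a copy-paste recipe):

```shell
#!/bin/bash
# Bootstrap action: runs on each node at launch, before the streaming job starts.
# Install Rubygems, then the Crack gem our map script depends on.
sudo apt-get -y install rubygems          # or yum, depending on the AMI's distro
sudo gem install crack --no-ri --no-rdoc  # skip docs for a faster boot
```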

» Configuring the Workflow

Now we’ve got our map and reduce scripts, we need to upload them to S3 together with the raw data pulled from the Twitter API. We’ve placed our files in the following locations:

  • Input path (Twitter JSON files): s3n://ruby-flow/bieber_input
  • Output path: s3n://ruby-flow/bieber_output
  • Map script: s3n://ruby-flow/bieber_scripts/map.rb
  • Reduce script: s3n://ruby-flow/bieber_scripts/reduce.rb
  • Bootstrap script: s3n://ruby-flow/bieber_scripts/

Next we need to configure a streaming workflow with the values above.

Then we need to add a custom bootstrap action to install Rubygems and our dependencies on launch of the Amazon nodes, review the summary and launch the workflow.

» Executing the Workflow

Once we have the workflow configured, click the ‘Create Job Flow’ button to start processing. Clicking this button launches the Hadoop cluster, bootstraps each node with the specified script and begins the processing of the data from the S3 bucket.

As the Elastic MapReduce nodes are instance-store backed rather than EBS-backed they take a couple of minutes to launch. You can review the status of the job in the Elastic MapReduce Jobs view and also see the status of the cluster on the EC2 tab.

» Closing Down and Finishing Up

Once the job has completed, the reduce function writes the output from the script to the output directory configured in the job setup and the cluster closes itself down.

So it turns out that a fair few people love Justin Bieber: over the 12 hours there were about 200,000 tweets mentioning ‘bieber’ and we find the following in our output: 8096 ‘loves’ for Justin. That’s a whole lotta love.

» Conclusions

Obviously this is a pretty silly example, but I’ve tried to keep it as simple as possible so that we can focus on the setup and configuring of a Hadoop Elastic MapReduce cluster.

Hadoop is a great tool but it can be fiddly to configure. With Elastic MapReduce you can focus on the design of your map/reduce workflow rather than figuring out how to get your cluster set up. Next I’m planning on making some small changes to software used by radio astronomers to find astrophysical sources in data cubes of the sky so that it works with Hadoop Streaming - bring it on SKA!

A First Look at the Amazon Relational Database Service

This morning Jeff Barr announced a new service offering from Amazon Web Services (AWS): Amazon Relational Database Service. The Amazon Relational Database Service (or RDS for short) is a new web service that allows users to create, manage and scale relational databases without having to configure the server instance, database software and storage that the database runs on. In short, this is a service that has the potential to take much of the headache out of database management.

» The current setup

At Galaxy Zoo we run our database on a combination of MySQL 5.1, EC2/Ubuntu Hardy and XFS/EBS storage. While there are some excellent guides on how best to configure a database running on AWS, operating a database in a virtualised environment requires that you plan for the worst case scenario of the virtualised server failing and the filesystem disappearing along with it. Because of this the steps required to configure a new database using ‘persistent’ storage (i.e. on an Elastic Block Store volume) are numerous:

1. Launch a new EC2 instance

Launching a new EC2 instance and installing the database software is pretty simple; for convenience, however, we have the Galaxy Zoo database image saved out as a custom AMI.

2. Update server and install database engine

Launching a new EC2 image without a database engine installed means that you probably need to update the server software (e.g. apt-get update on Ubuntu) and then install MySQL or your database of choice.

3. Create a new Elastic Block Store volume and attach it to your instance

Next up you need to create a new EBS volume, attach it to the EC2 instance and format the filesystem on the EBS volume.

4. Create database

Setting up a new database instance on EC2 is clearly non-trivial and requires knowledge of EBS, mount points, filesystems, not to mention configuring the MySQL settings for the chosen size of EC2 instance that you have.

» A different way?

With the introduction of RDS, Amazon has removed almost all of the difficulty in setting up and configuring a new MySQL database that is both scalable and reliable. Creating a new database instance now is as simple as issuing a single command:

>> rds-create-db-instance --db-instance-identifier mydatabase --allocated-storage 20 --db-instance-class db.m1.small --engine MySQL5.1 --master-username root --master-user-password password

With this command I have created a new m1.small MySQL 5.1 database server with 20 GB of storage and configured the master username and password. Provisioning a new RDS instance took a few minutes, and during the provisioning you can check on the progress with the following command:

>> rds-describe-db-instances

Once available, your new RDS instance is given a hostname that you can then use to connect on the standard MySQL port of 3306. Actually, it’s not quite that simple - before you can connect you need to specify which AWS security groups are allowed to connect to your RDS instances. I found this step a little confusing, but essentially what you need to configure is which EC2 instances, running under their respective security groups, are allowed to connect. For Galaxy Zoo, we have a default security group for all of our web servers called ‘web’ and so to allow access from these servers I had to add this ‘web’ security group to the defaults for the RDS servers:

>> rds-authorize-db-security-group-ingress default --ec2-security-group-name web --ec2-security-group-owner-id 1234567789

» The devil is in the details

At this point you have an RDS instance running MySQL 5.1, ready and waiting to serve up your databases. That’s not where the benefits stop though - not only do you get the ease of creating new database instances, there are also some very nice extras you get by using the service.

» Scaling/resizing

At Galaxy Zoo, we usually run our main ‘classifications’ database on a single EC2 small instance. In the last 8 months we’ve received something close to 45 million classifications and, while the database has started to get a little sluggish, by writing user classifications to SQS and processing them asynchronously we are able to keep the site feeling nippy. Each month, however, we try to send a newsletter to our 250,000-strong community, and the increased load that this places on the database means that for a couple of days we switch to an m1.large instance. The overhead of switching database servers is pretty annoying - place the site into maintenance mode, stop the MySQL server, detach the EBS volume, launch a new EC2 instance, attach the EBS volume to the new server… the list goes on. With RDS, not only can you change the amount of disk space available to your RDS instance, you can also dynamically resize the server (i.e. RAM/CPU). I can see that this is going to be a real win for us.

» A tuned MySQL instance

If there’s one thing that my time at The Sanger Institute taught me, it’s that managing and scaling large databases is a dark art. For the majority of small web applications it’s not crucial whether the MySQL server configuration you’re running is absolutely optimised for your hardware; however, now that we’re reaching the limits of our current instance size, making sure the MySQL server is well configured is becoming important. Deciding how large your innodb_buffer_pool or key_buffer should be is not obvious for most of us, and so having a MySQL server configured to work well with the resources available to it is very comforting. Over the next couple of days I’m going to be benchmarking our standard MySQL setup to see how it compares against an RDS instance with the same resources. Watch this space!
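For a sense of what that tuning involves, a hand-rolled my.cnf for a small dedicated MySQL box might contain something like this (values are illustrative rules of thumb, not RDS’s actual configuration):

```ini
# Illustrative my.cnf fragment for a dedicated ~1.7GB (m1.small) MySQL host.
[mysqld]
innodb_buffer_pool_size = 1G    # rule of thumb: 50-70% of RAM on a dedicated host
key_buffer_size         = 64M   # MyISAM index cache; keep small if mostly InnoDB
innodb_log_file_size    = 128M  # larger logs smooth out heavy write bursts
max_connections         = 150
```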

Elastic Load Balancing on EC2 redux

A few months back I wrote about how we switched the Galaxy Zoo HAProxy load balancers to Amazon Web Services (AWS) Elastic Load Balancers (ELB). At that point we had basically just swapped out HAProxy (running on its own EC2 small instance) for an ELB but weren’t making any use of the auto-scaling features also on offer. For the past few days I’ve been playing around with auto-scaling our API layer with the ELB that’s already in place and this morning I pushed the changes into production.

» Getting started

As I mentioned earlier, we already had an ELB in place so we didn’t need to create a new one - instead we’re adding auto-scaling to an ELB that’s already in place. For completeness, however, this is the command used to create the original ELB:

>> elb-create-lb ApiLoadBalancer --zones us-east-1b --listener "lb-port=80, instance-port=80, protocol=TCP" --listener "lb-port=443, instance-port=8443, protocol=TCP"

» De-register existing ELB instances

As we already had a couple of instances registered with the ELB I found the easiest way to get auto-scaling up and running was to remove the existing instances before proceeding:

>> elb-describe-instance-health ApiLoadBalancer
INSTANCE i-abcdefgh InService
INSTANCE i-ijklmnop InService
>> elb-deregister-instances-from-lb -lb ApiLoadBalancer --instances i-abcdefgh i-ijklmnop
No Instances currently registered to LoadBalancer

» Create a launch configuration

Before you can introduce auto-scaling you need to have a couple of things in place: an Amazon Machine Image (AMI) that, upon boot, is immediately ready to serve your application, and a launch configuration compatible with your current ELB-balanced nodes (security groups etc.). Depending upon your setup, always having an AMI ready to launch with the latest version of your production codebase is probably the hardest thing to achieve here. Once you have your AMI in place and your security group and key-pair settings to hand, you’re ready to create your launch configuration:

>> as-create-launch-config ApiLaunchConfig --image-id ami-myamiid --instance-type m1.small --key ssh_keypair --group "elb security group name"
OK-Created launch config

» Create an auto-scaling group

Once you have a launch configuration in place it’s time to create an auto-scaling group. At a minimum, an auto-scaling group needs to know which launch configuration and load balancer to use, which availability zones to launch into, and the minimum and maximum number of instances to scale between. We never run the Galaxy Zoo API on fewer than 2 nodes and so to create our auto-scaling group I issued a command something like this:

>> as-create-auto-scaling-group ApiScalingGroup --launch-configuration ApiLaunchConfig --availability-zones us-east-1b --min-size 2 --max-size 6 --load-balancers ApiLoadBalancer
OK-Created AutoScalingGroup

At this point it’s worth noting that, although we’d removed all of the instances being load balanced by the ApiLoadBalancer ELB, the auto-scaling group sets a minimum of 2 instances, and so checking the status of the auto-scaling group showed that 2 new instances were already spinning up:

>> as-describe-scaling-activities ApiScalingGroup
ACTIVITY 78bf4e0d-f72b-4b5b-a044-6b99942088ed 2009-08-24T07:19:28Z Successful "At 2009-08-24 07:16:12Z a user request created an AutoScalingGroup changing the desired capacity from 0 to 2. At 2009-08-24 07:17:17Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 2."

I don’t know about you but I think that’s pretty AWESOME!

» Create some launch triggers

To complete the auto-scaling configuration, you need to define the rules that increase and decrease the number of load-balanced instances. Currently we have a very simple rule based upon CPU load: if the average CPU load over the past 120 seconds is greater than 60% we introduce a new instance; if the average drops below 20% we remove an instance:

>> as-create-or-update-trigger ApiCPUTrigger --auto-scaling-group ApiScalingGroup --namespace "AWS/EC2" --measure CPUUtilization --statistic Average --dimensions "AutoScalingGroupName=ApiScalingGroup" --period 60 --lower-threshold 20 --upper-threshold 60 --lower-breach-increment=-1 --upper-breach-increment 1 --breach-duration 120
OK-Created/Updated trigger

These triggers will almost certainly require refinement but helpfully the as-create-or-update-trigger command will create a new trigger if one doesn’t exist or update an existing trigger based upon the new parameters.

» That’s it!

Once again it’s been a breeze to introduce the latest AWS features into our production stack. Moving Galaxy Zoo to AWS has completely changed the way we think about running our web applications - we’ve gone from having a group of ‘pet’ servers that we each know by name to having a fault-tolerant, auto-scaled web stack ready for the future.

Hunting for Supernovae

Over the last few days we’ve been running a little project to see if the Galaxy Zoo community can help find new supernovae from the Palomar Transient Factory. Turns out it works pretty well.

Developing the website to power Supernova Zoo was a fun challenge; the Supernova and Galaxy Zoo websites look pretty similar and obviously share many features but there were some new problems to solve that we hadn’t faced before…

» A Moving Target

Galaxy Zoo 2 had a static number of galaxy images to classify. Within the Galaxy Zoo domain model we refer to an image as an ‘Asset’; Galaxy Zoo 2 had 245,609 Assets to classify. One of the most exciting things about Supernova Zoo is that images are taken from the Palomar Transient Factory in near-realtime and sent up to the website for analysis. Being able to handle these images in an automated fashion is crucial, and so we’ve built API methods for uploading and auto-registering new Assets with the Supernova Zoo website.
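Such an upload/registration call might be sketched like this (the endpoint, payload fields and helper name are assumptions for illustration, not the real Supernova Zoo API):

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build the POST request used to register a new Asset with the site.
# The '/api/assets' endpoint and the payload shape are hypothetical.
def build_asset_request(base_url, image_url, taken_at)
  uri = URI.join(base_url, '/api/assets')
  req = Net::HTTP::Post.new(uri)
  req['Content-Type'] = 'application/json'
  req.body = JSON.generate(location: image_url, taken_at: taken_at)
  [uri, req]
end

# A caller would then hand the request to Net::HTTP, e.g.:
#   uri, req = build_asset_request('http://example.com', img_url, time)
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```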

» Priority Assets

For Galaxy Zoo, each Asset in the database has an equal priority of being shown to a Zooite. Upon reaching the classification interface, the Asset presented is therefore essentially random. For a static number of Assets this works well; for Supernova Zoo, however, we wanted something a little different. The very nature of supernova hunting means that you want to find the newest supernovae as quickly as possible, and so to help us achieve this we implemented a couple of new features:

Asset priority - When serving up an Asset to the Supernova Zoo interface we pick from the most recent supernova candidates. That way we are always going to be classifying the newest candidates first before heading further back in time to look at older ones.

Asset escalation - So that we could alert Mark and Sarah at the WHT to new supernova candidates as rapidly as possible, we needed a mechanism for escalating the priority of an Asset in the system. We achieved this by essentially ‘scoring’ your classifications as they came in. When creating the decision tree we attached a score to some of the answers, and when your classification was complete we updated a running average score for the Asset you had just classified. By keeping track of the scores in this way, if you gave the ‘correct’ sequence of answers for a potentially real candidate, it would immediately become a higher-priority target to show to other Zooites.
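The escalation mechanism described above can be sketched roughly as follows (the class and method names are hypothetical, not the actual Supernova Zoo code):

```ruby
# A sketch of score-based Asset escalation. Each classification carries
# a score derived from the answers given; an Asset keeps a running
# average, and high-scoring (likely real) candidates are served first.
class Asset
  attr_reader :id, :created_at, :average_score

  def initialize(id, created_at)
    @id = id
    @created_at = created_at
    @average_score = 0.0
    @classification_count = 0
  end

  # Fold a new classification score into the running average.
  def record_classification(score)
    @classification_count += 1
    @average_score += (score - @average_score) / @classification_count
  end
end

# Pick the next Asset to show: highest average score first,
# breaking ties in favour of the most recently created Asset.
def next_asset(assets)
  assets.max_by { |a| [a.average_score, a.created_at] }
end
```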

» A Retrospective

Supernova Zoo was our first opportunity to test the codebase that myself and the team at SIUE have been working hard on for the last few months. Handling a continual stream of new Assets and changing the behaviour of the system in real time based on your classifications has been a fun challenge and overall we’re pretty happy with the results.

In the next day or so Supernova Zoo will be taken offline so that we can have a good look at the results from the past few days. Based upon your excellent feedback there will almost certainly be some tweaks to the classification interface and refinements to the decision tree. Supernova hunting is a very different challenge to galaxy classification and we’re delighted that our Zooites appear to be equally adept at classifying galaxy morphologies as they are at finding new supernovae!

Elastic Load Balancing on EC2

For the past few months we’ve been load balancing the Galaxy Zoo web and API layers using HAProxy. Overall this has worked pretty well; HAProxy is easy to configure and hasn’t missed a beat. However, having to spend $150 per month just to load balance our other EC2 nodes seems a little excessive.

For some time Amazon has been promising load balancing and auto-scaling as part of its EC2 offering, and a few weeks back a public beta of the auto-scaling and load balancing products was announced on their blog.

It’s been a busy few weeks at the Zoo and so I’ve only just got around to playing with the new tools and I have to say, I’m impressed. In approximately 15 minutes I’ve managed to swap out one of our HAProxy nodes for an elastic load balancer (ELB). Count the steps:

1. Create a new load balancer

First we need to create an elastic load balancer. Note that I’m load balancing both HTTP and HTTPS; unfortunately ELB doesn’t have SSL termination capability so you need to route traffic on port 443 to an alternative port (in my case I’m routing SSL to port 8443).

>> elb-create-lb LoadBalancerName --zones us-east-1b --listener "lb-port=80, instance-port=80, protocol=TCP" --listener "lb-port=443, instance-port=8443, protocol=TCP"

2. Register the instances to be load balanced

>> elb-register-instances-with-lb LoadBalancerName --instances instance_id

3. Create a CNAME record for the elastic load balancer

Each load balancer is given an AWS hostname. This needs to be aliased to the actual hostname you want to use for your load balancer using a CNAME record.

4. Add a health check

The last thing to do is add an instance health check to the load balancer so that it doesn’t send requests to an unresponsive node. You can configure a health check like this:

>> elb-configure-healthcheck LoadBalancerName --healthythreshold 2 --interval 30 --target "TCP:8443" --timeout 3 --unhealthythreshold 2

This health check verifies the status of each load-balanced node every 30 seconds on port 8443, removing a node from service if it fails the check twice in a row (and returning it to service after two consecutive successful checks).

5. Done!

And that’s it. A couple of points to note: at the moment it’s a limitation of the service that you can’t load balance a root domain URL using ELB. This is basically because you can’t have a CNAME record pointing to the root of a domain; it’s a known limitation and should be fixed in the next release. Also, elastic load balancing obviously isn’t free (what is these days?). The good news though is that, at $0.025/hour, running an elastic load balancer is significantly cheaper than running a single EC2 HAProxy node ($0.10/hour).

» What’s next?

Next up is configuring auto-scaling and monitoring using Cloudwatch. More of that later…