How the Zooniverse Works: The Domain Model

This was originally posted on the Zooniverse blogs here. I’m reposting as this blog sometimes feels like a personal diary for my time with the Zooniverse.

We talk a lot in the Zooniverse about research, whether it’s interesting stories from the community, a new paper based upon the combined efforts of the volunteers and the science teams or conferences we might be going to.

One thing we don’t spend much time talking about is the technology solutions we use to build the Zooniverse sites, the lessons we’ve learned as a team building more than twenty five citizen science projects over the past five years and where we think the technical challenges still remain in building out the Zooniverse into something bigger and better.

There’s a lot to write here so I’m going to break this into three separate blog posts. The first is going to be entirely about the domain model that we use to describe what we do. When it seems relevant I’ll talk a little more about implementation details of these domain entities in our code too. The second will be about technologies and the infrastructure we run the Zooniverse atop of and the third will be about making smarter systems.

» Why bother with a domain model?

Firstly it’s worth spending a little time talking about why we need a domain model. In my mind the primary reason for having a domain model is that it gives the team, whether it’s the developers, scientists, educators or designers working on a new project a shared vocabulary to talk about the system we’re building together. It means that when I use the term ‘Classification’ everyone in the team understands that I’m talking about the thing we store in the database that represents a single analysis/interaction of a Zooniverse volunteer with a piece of data (such as a picture of a galaxy), which by the way we call a ‘Subject’.

Technology wise the Zooniverse is split into a core set of web services (or Application Programming Interface, API) that serve up data and collect it back (more about that later) and some web applications (the Zooniverse projects) that talk to these services. The domain model we use is almost entirely a description of the internals of the core Zooniverse API called Ouroboros and this is an application that is designed to support all of the Zooniverse projects which means that some of the terms we use might sound overly generic. That’s the point.

» The core entities

The domain model is actually pretty simple. We typically think most about the following entities:

User

People are core to the Zooniverse. When talking publically about the Zooniverse I almost always use the term ‘citizen scientist’ or ‘volunteer’ because it feels like an appropriate term for someone who donates their time to one of our projects. When writing code however, the shortest descriptive term that makes sense is usually selected so in our domain model the term we use is User.

A User is exactly what you’d expect, it’s a person, it has a bunch of information associated with it such as a username, an email address, information about which projects they’ve helped with and a host of other bits and bobs. Crucially though for us, a User is the same regardless of which project they’re working - that is Users are pan-Zooniverse. Whether you’re classiying galaxies over at Galaxy Zoo or identifying animals on Snapshot Serengeti we’re associating your efforts with the same User record each time which turns out to be useful for a whole bunch of reasons (more later).

Subject

Just as people are core, as are the things that they’re analysing to help us do research. In Old Weather it’s a scanned image of a ship log book, in Planet Hunters it’s a light curve but regardless of the project internally we call all of these things Subjects. A Subject is the thing that we present to a User when we want to them to do something.

Subjects are one of the core entities that we want to behave differently in our system depending upon their particular flavour. A log book in Old Weather is only viewed three times before being retired whereas an image in Galaxy Zoo is shown more than 30 times before retiring. This means that for each project we have a specific Subject class (e.g. GalaxyZooSubject) that inherits its core functionality from a parent Subject class but then extends the functionality with the custom behaviour we need for a particular project.

Subjects are then stored in our database with a collection of extra information a particular Subject sub-class can use for each different project. For example in Galaxy Zoo we might store some metadata associated with the survey telescope that imaged the galaxy and in Cyclone Center we store information about the date, time and position the image was recorded.

Workflow/Task

These two entities are grouped together as they’re often used to mean broadly the same thing. When a User is presented with a Subject on one of our projects we ask them to do something. This something is called the Task. These Tasks can be grouped together into a Workflow which is essentially just a grouping entity. To be honest we don’t use Workflow very much as most projects just have a single Workflow but in theory it allows us to group a collection of Tasks into a logical unit. In Notes from Nature each step of the transcription (such as ‘What is the location?’) is a separate Task, in Galaxy Zoo, each step of the decision tree is a Task too.

Classification

It’s no accident that I’ve introduced these three entities, User, Subject and Task first as a combination of these is what we call a Classification. The Classification is the core unit of human effort produced by the Zooniverse community as it represents what a person saw and what they said about it. We collect a lot of these - across all of the Zooniverse projects to date we must be getting close to 500 million Classifications recorded.

I’ll talk more about what we store in a Classification in a followup the next post about technologies suffice to say now that they store a full description of what the User said about the object. In previous versions of the Zoonivese API software we tried to break these records out into smaller units called Annotations but we don’t do that any more – it was an unnecessary generalisation.

Group

Sometimes we need to group Subjects together for some higher level function. Perhaps it’s to represent a season’s worth of images in Snapshot Serengeti or a particular cell dye staining in Cell Slider. Whatever the reason for grouping, the entity we use to describe this is ‘Group’.

Grouping records is both one of the most useful features Ouroboros offers but also one of the things it took the longest for us to find the right level of abstraction. While a Group can represent an astronomical survey in Galaxy Zoo (such as the Hubble CANDELS survey) or a Ship in Old Weather, it isn’t just enough for a bunch of Subjects to all be associated with each other. There’s often a lots of functionality that goes along with a Group or the Subjects within that is custom for each Zooniverse project. Ultimately we’ve solved this in a similar fashion to Subject - by having per-project subclasses of Groups (i.e. there is a SerengetiGroup that inherits from Group) that can implement custom behaviour as required.

Project

Ouroboros (the Zooniverse API) hosts a whole bunch of different Zooniverse projects so it’s probably no surprise that we represent the actual citizen science project within our domain model. No prize for guessing the name of this entity - it’s called Project.

A Project is really just the overarching named entity that Subjects, Classifications and Groups are associated with. Project in Ouroboros does some other cool stuff for us though. It’s the Project that knows about the Groups, its current status (such as how many Subjects are complete) and other adminstrative functions. We also sometimes deliver a slightly different user experience to different Users in what are known as A/B splits - it’s the Project that manages these too.

» Finishing up.

So that’s about it. There are a few more entities routinely in discussion in the Zooniverse team such as Favourite (something a User favourites when they’re on one of our projects) but they’re really not core to the main operation of a project.

The domain description we’re using today is informed by everthing we’ve learnt over the past five years of building proejcts. It’s also a consequence of how the Zooniverse has been functioning - we try lots of projects in lots of different research domains and so we need a domain model that’s flexible enough to support something like Notes from Nature, Planet Four and Snapshot Seregeti but not so generic that we can’t build rich user experiences.

We’ve also realised that the vast majority of what’s differenct about each project is the user experience and classification interface. We’re always likely to want to put significant effort into developing the best user experience possible and so having an API that abstracts lots of the complexity away and allows us to focus on what’s different about each project is a big win.

Our domain model has also been heavily influenced by the patterns that have emerged working with science teams. In the early years we spent a lot of time abstracting out each step of the User interaction with a Subject into distinct descriptive entities called Annotations. While in theory these were a more ‘complete’ description of what a User did, the science teams rarely used them and almost never in realtime operations. The vast majority of Zooniverse projects to date collect large numbers of Classifications that are write once, read very much later. Realising this has allowed us to worry less about exactly what we’re storing at a given time and focus on storing data structures that are a convenient for the scientists to work with.

Overall the Zooniverse domain model has been a big success. When designing for the Zooniverse we really were developing a new system unlike anything else we knew of. It’s terminology is pervasive in the collaboration and makes conversations much more focussed and efficient which can only be a good thing.

Twitter Streaming with EventMachine and DynamoDB

This week Amazon Web Services launched their latest database offering ‘DynamoDB’ - a highly-scalable NoSQL database service.

We’ve been using a couple of NoSQL database engines at work for a while now: Redis and MongoDB. Mongo allowed us to simplify many of our data models and represent more faithfully the underlying entities we were trying to represent in our applications and Redis is used for those projects where we need to make sure that a person only classifies an object once.1

Whether you’re using MongoDB or MySQL, scaling the performance and size of a database is non-trivial and is a skillset in itself. DynamoDB is a fully managed database service aiming to offer high-performance data storage and retrieval at any scale, regardless of request traffic or storage requirements. Unusually for Amazon Web Services, they’ve made a lot of noise about some of the underlying technologies behind DynamoDB, in particular they’ve utilised SSD hard drives for storage. I guess telling us this is designed to give us a hint at the performance characteristics we might expect from the service.

» A worked example

As with all AWS products there are a number of articles outlining how to get started with DynamoDB. This article is designed to provide an example use case where DynamoDB really shines - parsing a continual stream of data from the Twitter API. We’re going to use the Twitter streaming API to capture tweets and index them by user_id and creation time.

DynamoDB has the concept of a ‘table’. Tables can either be created in using the AWS Console, by making requests to the DynamoDB web service or by using one of the abstractions such as the AWS Ruby SDK. There are a number of factors you need to consider when creating a table including read/write capacity and how the records are indexed. The read/write capacity can be modified after table creation but the primary key cannot. DynamoDB assumes a fairly random access pattern for your records across the primary key - a poor choice of primary key could in theory lead to sub-par performance.

DynamoDB is schema-less (NoSQL) and so all we need to define upfront is the primary key for indexing records. Primary keys can be defined in two ways:

Simple hash-type primary - Simple hash-type primary key where a hash index is made using this key only.

Hash and range type primary - Composite hash and range primary key. In this situation two indexes are made on the records - an unordered hash index and a sorted range index.

» Why should I care about my choice of primary key?

Choosing an appropriate primary key is especially important with DynamoDB as it is only the primary key that is indexed. That’s right - at this point in time it is not possible to create custom indexes on your records. This doesn’t mean that querying by item attributes isn’t possible, it is, but you have to use the Scan API which is limited to 1MB of data per request (you can use a continuation token for more records) and as each query has to read the full table the performance will degrade as the database grows.

For this example we’re going to go with the composite hash and range key (user_id as the hash_key and created_at as the range_key) so that we can search for Tweets by a particular user in a given time period.

» Writing records

DynamoDB implements a HTTP/JSON API and this is the sole interface to the service. As it’s GOP debate season we’re going to search the Twitter API for mentions of the term ‘Romney’ or ‘Gingrich’. When a tweet matches the search we’re going to print the Twitter user_id, a timestamp, the tweet_id, the tweeter location and their screen_name.

Next we want to write the actual records to DynamoDB.

We can see the tweet count in out table increasing over time using the following script.

» Write performance

As things currently stand, our write performance is limited by the time taken by a single Ruby process to complete a HTTP request to the DynamoDB API endpoint (e.g. http://dynamodb.us-east-1.amazonaws.com). Regardless of your write capacity and network performance you’re never going to realise the performance of DynamoDB using a single threaded process like this. What we want instead is a multi-threaded tweet parsing monster.

» Enter EventMachine

Given the write performance is limited by the HTTP request cycle when we receive notification of a new tweet we want to pass of the task of posting that tweet to DynamoDB to a separate thread. EventMachine has a simple way to handle such a scenario using the EM.defer method

I’ve set up an EventMachine threadpool_size of 50 (default is 20) which means that we can have 50 processes simultaneously writing to DynamoDB. Awesome. If you choose a popular term on Twitter (such as Bieber) you’ll see dramatically improved write performance using EventMachine this way.

» Reading back those records

As I mentioned earlier, you’re limited to relatively simple indexes with DynamoDB and so the primary key you choose will have a significant affect on the query performance of your application. Below are some examples of querying using the index and using the Scan API.

For the most simple use cases executing queries using the Scan API will suffice. As data volumes grow however, common queries will need to be made more performant by keeping separate tables indexing the attributes you’re querying on. And this is one of the core differences between DynamoDB and other NoSQL solutions such as MongoDB today - if you want to produce complex indexes of your records then you’ll need to do the heavy lifting yourself.

» Conclusions

DynamoDB is an exciting new technology from AWS and I can already think of a number of use cases where we’d be very seriously considering it. As I see it, DynamoDB sits somewhere in between Redis and MongoDB, providing a richer query API than something like Redis but requiring more of your application (or a change in your data model) to support queries against a range of attributes.

One significant factor favouring DynamoDB is that just like RDS it’s a database as a service. We’ve been using RDS in production now for well over two years and the sheer convenience of having someone else manage your database, making it easy to provision additional capacity or set up read-slaves can be spoken of too highly. Our largest MongoDB instance today is about 5GB in size and very soon we’re going to have to being thinking about how to scale it across a number of servers. Regardless of the potential additional complexity of DynamoDB, not having to think about scaling is a massive win and makes it worthy of consideration for any NoSQL scenario.

1. We keep a Redis key for each user with the ids of the objects that they have classified. Subtracting that from a key with all available ids and then returning a random member is super-fast in Redis.

Now is the Time for Astroinformatics

What follows has been bouncing around in my head for about a year now; I’ve decided to write it down finally after hearing reports of Alex Szalay’s talk here in Oxford (sadly I was away) and seeing this recent paper on the arXiv. My apologies if this is a little too much about me, I promise not to do this too often.

» Nottingham circa 2002

In late 2002 I began my PhD at The University of Nottingham in the Astrochemistry group. Like many new PhD students, it wasn’t entirely clear to me before I arrived exactly which project I’d be working on but I arrived full of enthusiasm and looking forward to applying my understanding of Chemistry to the Cosmos. To be honest though I had little idea what a PhD entailed and after the standard ‘go away and read these 50 papers about subject xxxx’ I found myself needing to look through and match a large number of infrared sources from the IRAS and 2MASS catalogues.

After two days of manually going through the objects I realised there had to be a better way and that I should probably write a ‘program’ to go through the candidate list more rapidly. Following a lunchtime chat with one of the older PhD students I duly picked up a copy of a C programming book and spent the rest of the week writing about 40 lines of C code (yeah, not very impressive, I know).

Over the next year this small C program became a complete monster. Hundreds or even thousands of lines of buggy, repetitive undocumented code that on a good day seemed to produce answers that looked reasonable and on a bad day took me all morning to spot the smallest typo. Having a single copy of this program would have been bad enough but I had about 50 different versions, each very slightly different and with its own bugs. I can honestly say that I lost well over a month of my (PhD) life debugging just that one piece of code; I’d rather not think about how much time in total I spent debugging something.

» Not Unusual

Was my experience as a new PhD student unusual? No. At the time I saw a number of my peers having exactly the same experience and I see it still today with the new students here in Oxford. Most students’ first experience of programming is to either to inherit a existing codebase that’s probably just as buggy as my IR source matcher or to write something from scratch that solves an immediate problem with their research. Either way, it’s highly unlikely that they have any formal training in how to go about writing readable, efficient, maintainable code.

So with all of this terrible code being written, how on Earth do we ever get PhDs? I think the short version is that we muddle through. Students are smart enough to pick up a programming book and relatively quickly code up a solution, or to learn a new language by the trial and error editing of an existing script.

But this trial and error style of coding is exactly the problem. At no point in the process is any real attention paid to how to code is written and to be honest, why should it? I don’t recall being asked about my test coverage during my viva defence let alone having regular code reviews with my supervisor.

The truth is that for the majority of researchers the process of software development stops the very instant they believe their code is doing what they expected.

» Future Nottingham circa 2022

With your permission, I’m going to time-travel roughly 10 years into the future. A PhD student has just begun their PhD and they’ve been asked by their supervisor to look for radio sources with particular characteristics in a data cube from the Square Kilometre Array telescope1. In theory this is fine, they just need to step through the dataset by position and frequency searching for matches except that the data cube that they need to process is many hundreds of terabytes in size.

Faced with this formidable task the student would hopefully realise that the problem can be solved relatively easily in parallel by splitting up the date cube into many millions of smaller chunks. This means that the challenge faced by the student isn’t just to figure out what code to write to search for sources within each sub-cube but also to write a whole load of job management code to chunk up the cube and farm out these bite-sized searches. They’d also hopefully have heard of the MapReduce programming model and realise that the Hadoop project solved for exactly this kind of problem and that they didn’t need to reinvent the wheel for their studies. Hopefully.

The more likely reality is that because they didn’t have a computer science background they’d spend a huge amount of time writing run-once custom scripts that took days/weeks/months to run or worse still conclude that searching to this fidelity at such scale was impossible.

I believe in the next decade there’s a real possibility that the pace of research within astrophysics is going to be severely limited by the lack of awareness amongst researchers of modern approaches to data-intensive computation.

» Informatics for Astrophysics

After completing my PhD in 2006 I worked in the Production Software Team at the Wellcome Trust Sanger Institute building high-throughput software to manage the sequencing of DNA samples. Working here taught me two key things:

  1. How to write high-quality, production-grade software.
  2. That the biosciences are well ahead of astronomers when it comes to building software for science.

Biologists are ahead of astrophysicists because they’ve realised that working with large quantities of biological data is a research specialism in itself. Between the Sanger Institute and the European Bioinformatics Institute (on the same site) there are probably ~ 300 bioinformaticians on campus. That’s 300 people who’s expertise is not in pure genomics research, rather they are world leaders at understanding how to produce, store and analyse biological data.

It’s not that within the astronomy community as a whole there aren’t people developing high-quality software or using modern toolsets to solve common problems (see e.g. Wiley et al.), quite the opposite. The problem is that the efforts of these researchers are not widely recognised as an academic specialism. Until we as a community (and that means funding bodies too) recognise and value the research and development efforts of researchers building new tools then we are unlikely to see any significant advances.

We need to stop relying on occasional gems of brilliance by individual researchers and move to a fully-fledged research area that combines astrophysics with data-intensive computer science.

» Building for the Future

In the next decade, astronomers are likely to face data rates of unprecedented volumes from both the Large Synoptic Sky Telescope and the Square Kilometre Array. If the current situation continues then I think we’re going to have serious problems as a community storing, analysing and sharing the data products from these facilities.

All is not lost however. There are concrete steps that we start to take today that will improve our chances of developing smart solutions in the future:

  1. Train our students: We need to start teaching new undergraduates (and postgraduates) about the fundamentals of software development best practices. That means learning about (and using) version control, understanding how to structure code and writing tests to cover core functionality2.
  2. Build for reuse and share: We need to stop writing run-once ‘scripts’ and start building ‘software’ for reuse. Sharing code with others is a good way to encourage improvements in code quality.
  3. Recognise advances in other fields: Astronomers are not unique in experiencing the ‘data flood’. Recognising innovation and advances in techniques for data processing in other fields (e.g. the biosciences) will be crucial.
  4. Value informatics: Recognising and funding astroinformatics as a research specialism in its own right is vital for success. That means having more meetings like this one and funding research programmes aimed at developing innovative solutions that solve real problems.

I believe astroinformatics has a bright future. The old adage ‘innovate or die’ seems to be a pretty good fit here. Many of the solutions for tomorrow’s astrophysical data problems are already available in other research areas or industry, I just hope that we as a community can begin to recognise the advantages of devoting time and effort into exploring these possibilities.

1. The SKA is going to be massive. Data products from it are going to be rather large and unwieldy.
2. I’m not going to turn this into an essay about how to write better software. There are plenty of books/blogs/conferences that can help you with that.

Building a data archiving service using the GitHub API, figshare and Zenodo

Over the past couple of weeks we’ve seen a couple of great examples of service integrations from figshare and Zenodo that use the GitHub outbound API to automatically archive GitHub repositories. While the implementation of each solution is likely to be somewhat different I thought it might be useful to write up in general terms how to go about building such a service.

In a nutshell we need a tool that does the following:

  • Authenticates against the GitHub API
  • Configures an outbound API endpoint for repository events to be posted to
  • Respond to a GitHub repository event by grabbing a copy of the code
  • Issues a DOI for the code bundle

A while ago, together with Kaitlin Thaney (@MozillaScience) and Mark Hahnel (@figshare) I put together a proof of concept implementation called Fidgit that basically does the above. You can read more about how to run your own version of this service in the README here.

» Tuning in to the GitHub outbound API

GitHub has both an inbound (i.e. send commands to the API) and outbound notifications API called webhooks. By configuring the webhooks for a repository, it’s possible to receive an event notification from GitHub with some information about what has changed on the repo. Check out this list for all of the event types it’s possible to tune into but for the purposes of this article we’re going to focus on the event type that is generate when a new release is created.

Whether it’s Zenodo, figshare or a reference implementation like fidgit, they all rely upon listening to the [outbound GitHub API webhooks and responding with some actions based upon the content in the JSON payload received.

» Creating a webhook on a GitHub repo

Creating a webhook for a GitHub repo is something that can only be done by someone who has permissions to modify the repository state in some way. In order for figshare and Zenodo to set up the webhooks on your GitHub repos, both applications ask you to log in with your GitHub credentials and authorise their applications to administer your webhooks, they do this using OAuth. While a full OAuth login flow is the ‘complete’ way to do this, Fidgit requires a personal access token from your GitHub profile and uses this to authenticate and create a webhook. Note, it’s possible to only ask for the conservative permissions on a GitHub user’s account to just administer OAuth webhook scopes. You can read more about these scopes here.

» Archiving the code

Once notified of a change in the repository (such as a new release) then we need to go and grab that code. This could be in the form of a ‘Git clone’ of the repository and all of its history but Fidgit, Zenodo and figshare all choose to just grab a snapshot of the code from the GitHub raw download service. At the bottom right of the page of every GitHub repository, there’s a link to ‘Download ZIP’. This basically gives us a copy of the current status of the repository but without an Git (or GitHub) information attached such as Git history. As these files can be reasonably large it makes sense to grab this code bundle in a background worker process. That happens in Fidgit worker here which basically uses plain old curl to grab the zip archive and then push the code up to figshare through their API.

» Putting a DOI on it

This step is left as an exercise for the reader (just kidding). Fidgit doesn’t do this, figshare, Zenodo and Dryad are doing this bit and so it’s out of scope for this article.

Autoscaling on AWS Without Bundling AMIs

At Zooniverse HQ we typically run between 15 and 20 AWS instances to support our current set of projects. It’s a mix of fairly vanilla Apache/Rails webservers, AWS RDS MySQL instances, a couple of LAMP machines and a bit of MongoDB goodness for kicks. Over the past year we’ve migrated pretty much all of our infrastructure to make full use of the AWS toolset including elastic load balancing and auto-scaling our web and API tiers, SQS for asynchronous processing of requests and RDS for our database layer. Overall I’m pretty happy with the setup but one pain point in the deployment process has always been the bundling of AMIs for auto-scaling. I’ve described before the basic configuration required when setting up an auto-scaling group - the step that always takes the most time is saving out a machine image with the latest version of the production code so that when the group scales the machine is ready to go. The problem is that for a typical deploy, the changes to the code are minimal and really don’t require a new machine image to be built, rather we just need to be sure that when the machine boots it’s serving the latest version of the application code. Over the past few months I’ve considered a number of different options for streamlining this process, the best of which being an automated checkout of the latest code from version control when the machine boots. This is all very well but we host our code at GitHub. Now don’t get me wrong, I love their service but I really don’t want to build in a dependency of GitHub for our auto-scaling.

» An alternative?

A couple of weeks ago we made a change to the way that we deploy our applications and I can honestly say it’s been a revolution. The basic flow is this:

  1. Work on new feature, finish up and commit (don’t forget to run your tests)
  2. Push code up to GitHub
  3. Git ‘archive’ the repo locally, tar and zip it up
  4. Push the git archive up to S3
  5. Reboot each of the production nodes in turn
  6. Done!

» Eh?

So that not might look like a big deal but the secret sauce is what happens when the machine reboots. There’s a simple script that executes on boot to pull down the latest bundle of the production code from S3 put it in the correct place and voila, you’re running the latest version of the code. We use Capistrano for on-the-fly deployment and so it’s important that this script doesn’t get in the way of that - upon downloading a new bundle of the code the script timestamps the new ‘release’ and symlinks the config files and ‘current’ directory. That way, if we need to we can still cap deploy a hotfix to a running server.

» Show me the code already

The capistrano task used here is super-simple and can, I’m sure be further improved. Below is the example for our latest project Old Weather.

Scaling Galaxy Zoo With SQS

Over the last 8 months we’ve received close to 45 million classification clicks from the fantastic Galaxy Zoo community. Averaged over the 8 months that’s roughly 8,000 clicks per hour - not bad going! The challenge for our development team has been to design a web stack that’s able to cope with big traffic spikes like the one we had earlier this week from APOD but to also keep the hosting costs reasonably low. As I’ve mentioned before, the pricing model of Amazon Web Services means that we can scale our web and API layers based on how busy we are however what’s not so straightforward is scaling the database layer in realtime.

» The Problem

If scaling databases is hard (and we don’t want to buy our way out of the problem) then is there an alternative strategy that we can employ? It turns out there is and the solution is asynchronous processing of our classifications. In the past, when you reached the end of the classification decision tree on the Galaxy Zoo site there was a pause between answering the final question and the page refreshing with the next galaxy for analysis. During this pause the classification was being saved back to the database and then a second request made to get a new galaxy. The problem with this approach is obvious - the busier the site gets, the busier (and slower) the database becomes and the longer it takes for the page to refresh. A better approach then would be to decouple the classification-saving from the website user interface and remove the delay between classifications.

» The Solution - Asynchronous Processing

About 3 weeks ago we made a change to the Galaxy Zoo site to remedy this situation - the solution was to use a message queue. Message queues are basically a web-hosted queue of small snippets (or messages) of information - in our case a classification! Handy for us, Amazon have a message queueing service called Simple Queue Service (SQS) and we’re using it to help us scale. The old model of saving a classification was to send an XML snippet back to the Galaxy Zoo API and wait for confirmation of a successful save to the database. The difference now is that this XML snippet is written to a SQS queue and we have a separate daemon that processes the queue. By posting the XML classifications to SQS I’m pleased to say that we’ve dramatically improved the responsiveness of the Galaxy Zoo site and managed to avoid paying for a significantly more expensive database!

» A resounding success?

Before I get too self-congratulatory here it’s important to realise that whilst using a message queue has helped us a great deal, there are some undesirable consequences that can arise during busy periods. By using a queue we haven’t actually increased the rate at which we can save classifications back to the database, instead we’ve just created a buffer that we can store the classifications in until the site quietens down and we can process the backlog. Typically there are less than 5 messages in the queue (i.e. we’re keeping up with the current classification rate) however during very busy periods this isn’t the case: Earlier this week we had a couple of very busy days which meant that at one point there were 30,000 classifications in the queue waiting to be saved! The consequence of these messages being queued is that it’s possible that you could classify a whole bunch of galaxies but not see them in your recent history in ‘My Galaxies’ until minutes or hours later.

» Conclusions

Overall we’ve been very pleased with the new queue-based system - we’ve successfully managed to decouple the user interface from a database that’s starting to get a little sluggish. The issue of ‘My Galaxies’ being slightly out of date only arises during particularly busy periods and usually resolves itself within less than an hour. With the launch of Amazon’s RDS this week realtime resizing of a database may finally be a reality, but for now message queueing can definitely be used as an effective scaling strategy.

Getting Started With Elastic MapReduce and Hadoop Streaming

A couple of months ago I wrote about how the astrophysics community should place more value on those individuals building tools for their community - the informaticians. One example of a tool that I don’t think is particularly well known in many areas of research is the Apache Hadoop software framework.

Hadoop is a tool designed for distributing data-intensive processes across a large number of machines and its development was inspired by the Google MapReduce and Google File System papers. One of the largest contributors to the Hadoop open source project is Yahoo! who use it to build their search indexes based upon vast quantities of data crawled by their search bots.

» Elastic MapReduce

Hadoop allows you to execute parallel tasks on a number of ‘slave’ machine nodes (the ‘map’ function) before combining the results of the processing into a result (the ‘reduce’ step). This requires understanding how to configure a ‘master’ job-management node as well as a number of ‘slave’ worker nodes to form a Hadoop cluster. Wouldn’t it be nice if you didn’t have to spend your days installing software and configuring firewalls? Well, thanks to Amazon and their Elastic MapReduce service you don’t have to.

Elastic MapReduce is a web service built on top of the Amazon cloud platform. Using EC2 for compute S3 for storage it allows you to easily provision a Hadoop cluster without having to worry about set-up and configuration. Data to be processed is pulled down from S3 and processed by an auto-configured Hadoop cluster running on EC2. Like all of the Amazon Web Services, you only pay for what you use - when you’re not using your 10,000 node Hadoop cluster you don’t pay for it.

» Processing with Hadoop

Data processing workflows with Hadoop can be developed in a number of ways including writing MapReduce applications in Java, using SQL-like interfaces which allows you to query a large dataset in parallel (e.g. Pig) or perhaps most exciting for people with existing software using something called Hadoop Streaming. Streaming allows you to develop MapReduce applications with scripts in any language you like for the mapper and reducer provided that they read input from STDIN and return their output to STDOUT.

While Hadoop can be a valuable tool for any data-rich research domain, building applications using Streaming is a great way to leverage the power of Hadoop without having the overhead of learning a new programming language.

» A Very Simple Streaming Example - aka ‘Who loves Justin Bieber?’

As an example of how to use Hadoop Streaming on Elastic MapReduce I’m going to capture all of the tweets over a 12-hour period that have the word ‘bieber’ in them and search for the word ‘love’. To save these tweets I’ve used the Twitter Streaming API and the simple PHP script below that writes out the tweets to a file in hourly snapshots.

» The Map Script

As I mentioned earlier, the streaming functions/scripts can in theory be written in any language - Elastic MapReduce supports Cascading, Java, Ruby, Perl, Python, PHP, R, or C++. In this example I’m going to use Ruby. Elastic MapReduce is going to pass the input files from the S3 bucket to this script running on the parallel slave nodes, hence the use of ARGF.each - we’re reading from STDIN.

Data from the streaming API PHP script are saved as JSON strings to hourly snapshotted files. Each line in the file is a potential tweet so we’re stepping through the file line by line (tweet by tweet), verifying that we can parse the tweet using the Crack Rubygem by John Nunemaker and also checking if the tweet text has the word ‘love’ in it. If we find ‘love’ then we print to STDOUT the word ‘love’ - this is the streaming output from the map step and is forms the input for the reduce function.

» The Reduce Script

The reduce script is super-simple. The output from the map script is automatically streamed to the reduce function by Hadoop. All we do with the reduce script is count the number of lines returned (i.e. the number of times the word ‘love’ is found).

» Bootstrapping Nodes

In the map script we have a Rubygem dependency for the Crack gem. As we can’t specify the machine image (AMI) that Elastic MapReduce uses we need to execute a bootstrap script to install Rubygems and the Crack gem when the machine boots.

» Configuring the Workflow

Now we’ve got our map and reduce scripts we need to upload them to S3 together with the raw data pulled from the Twitter API. We’ve placed out files in the following locations:

  • Input path (Twitter JSON files): s3n://ruby-flow/bieber_input
  • Output path: s3n://ruby-flow/bieber_output
  • Map script: s3n://ruby-flow/bieber_scripts/map.rb
  • Reduce script: s3n://ruby-flow/bieber_scripts/reduce.rb
  • Bootstrap script: s3n://ruby-flow/bieber_scripts/boot.sh

Next we need to configure a streaming workflow with the values above.

Then we need to add a custom bootstrap action to install Rubygems and our dependencies on launch of the Amazon nodes, review the summary and launch the workflow.

» Executing the Workflow

Once we have the workflow configured click the ‘Create Job Flow’ button to start processing. Clicking this button launches the Hadoop cluster, bootstraps each node with the specified script (boot.sh) and begins the processing of the data from the S3 bucket.

As the Elastic MapReduce nodes are instance-store backed rather than EBS volumes they take a couple of minutes to launch. You can review the status of the job on the Elastic MapReduce Jobs view but also see the status of the cluster on the EC2 tab.

» Closing Down and Finishing Up

Once the job has completed, the reduce function writes the output from the script to the output directory configured in the job setup and the cluster closes itself down.

So it turns out that a fair few people love Justin Bieber: over the 12 hours there were about 200,000 tweets mentioning ‘beiber’ and we find the following in our output: 8096 ‘loves’ for Justin. That’s a whole lotta love.

» Conclusions

Obviously this is a pretty silly example, but I’ve tried to keep it as simple as possible so that we can focus on the setup and configuring of a Hadoop Elastic MapReduce cluster.

Hadoop is a great tool but it can be fiddly to configure. With Elastic MapReduce you can focus on the design of your map/reduce workflow rather than figuring out how to get your cluster setup. Next I’m planning on making some small changes to software used by radio astronomers to find astrophysical sources in data cubes of the sky to make it work with Hadoop Streaming - bring it on SKA!

A First Look at the Amazon Relational Database Service

This morning Jeff Bar announced a new service offering from Amazon Web Services (AWS): Amazon Relational Database Service. The Amazon Relational Database Service (or RDS for short) is a new web service that allows users to create, manage and scale relational databases without having to configure the server instance, database software and storage that the database runs on. In short, this is a service that has the potential to take much of the headache out of database management.

» The current setup

At Galaxy Zoo we run our database on a combination of MySQL 5.1, EC2/Ubuntu Hardy and XFS/EBS storage. While there are some excellent guides on how best to configure a database running on AWS, operating a database in a virtualised environment requires that you plan for the worst case scenario of the virtualised server failing and the filesystem disappearing along with it. Because of this the steps required to configure a new database using ‘persistent’ storage (i.e. on an Elastic Block Store volume) are numerous:

1. Launch a new EC2 instance

Launching a new EC2 instance and installing the database software is pretty simple however for convenience we have the Galaxy Zoo database image saved out as a custom AMI.

2. Update server and install database engine

Launching a new EC2 image without a database engine installed means that you probably need to update the server software (e.g. apt-get update on Ubuntu) and then install MySQL or your database of choice.

3. Create a new Elastic Block Store volume and attach it to your instance

Next up you need to create a new EBS volume, attach it to the EC2 instance and format the filesystem on the EBS volume.

4. Create database

Setting up a new database instance on EC2 is clearly non-trivial and requires knowledge of EBS, mount points, filesystems, not to mention configuring the MySQL settings for the chosen size of EC2 instance that you have.

» A different way?

With the introduction of RDS, Amazon has removed almost all of the difficulty in setting up and configuring a new MySQL database that is both scalable and reliable. Creating a new database instance now is as simple as issuing a single command:

` » rds-create-db-instance –db-instance-identifier mydatabase –allocated-storage 20 –db-instance-class db.m1.small –engine MySQL5.1 –master-username root –master-user-password password `

With this command I have created a new m1.small MySQL 5.1 database server with 20Gb of storage and configured the master username and password. Provisioning a new RDS instance took a few minutes and during the provisioning you can check on the progress with the command: >> rds-describe-db-instances Once available, your new RDS instance is given a hostname that you can then use to connect with the standard MySQL port of 3306. Actually, it’s not quite that simple - before you can connect you need to assign which AWS security groups are allowed to connect to your RDS instances. I found this step a little confusing but essentially you need to configure is which EC2 instances running under their respective security groups are allowed to connect. For Galaxy Zoo, we have a default security group for all of our web servers called ‘web’ and so to allow access from these servers I had to add this ‘web’ security group to the defaults for the RDS servers:

>> rds-authorize-db-security-group-ingress default --ec2-security-group-name web --ec2-security-group-owner-id 1234567789

» The devil is in the details

At this point you have a RDS instance running MySQL 5.1 ready and waiting to serve up your databases. That’s not where the benefits stop though - not only do you get the ease of creating new database instances but there are some very nice extras you also get by using the service.

» Scaling/resizing

At Galaxy Zoo, we usually our main ‘classifications’ database on a single EC2 small instance. In the last 8 months we’ve received something close of 45 million classifications and while the database has started to get a little sluggish, by writing user classifications to SQS and processing them asynchronously we are able to keep the site feeling nippy. Each month however, we try to send a newsletter to our 250,000 strong community and the increased load that this causes on the database means that for a couple of days we switch to a m1.large instance. The overhead of switching database servers is pretty annoying - place the site into maintenance mode, stop the MySQL server, detach the EBS volume, launch a new EC2 instance, attach the EBS volume to the new server… the list goes on. With RDS not only can you change the amount of disk space available to your RDS instance but you can also dynamically resize the server size (i.e. RAM/CPU). I can see that this is going to be a real win for us.

» A tuned MySQL instance

If there’s on thing that my time at The Sanger Institute taught me, it’s that managing and scaling large databases is a dark art. For the majority of small web applications it’s not crucial whether the MySQL server configuration you’re running is absolutely optimised for your hardware however now that we’re reaching the limits of our current instance size, making sure the MySQL server is well configured is becoming important. Deciding how large your innodb_buffer_pool or key_buffer size should be is not obvious for most of us and so having a MySQL server configured to work well for the resources available to it is very comforting. Over the next couple of days I’m going to be benchmarking our standard MySQL setup to see how it compares against a RDS instance with the same resources. Watch this space!

Elastic Load Balancing on EC2 redux

A few months back I wrote about how we switched the Galaxy Zoo HAProxy load balancers to Amazon Web Services (AWS) Elastic Load Balancers (ELB). At that point we had basically just swapped out HAProxy (running on its own EC2 small instance) for an ELB but weren’t making any use of the auto-scaling features also on offer. For the past few days I’ve been playing around with auto-scaling our API layer with the ELB that’s already in place and this morning I pushed the changes into production.

» Getting started

As I mentioned earlier, we already had an ELB in place so we didn’t need to create a new one - instead we’re adding here auto-scaling to an ELB that’s already in place. For completeness however, this is the command used to create the original ELB:

>> elb-create-lb ApiLoadBalancer --zones us-east-1b --listener "lb-port=80, instance-port=80, protocol=TCP" --listener "lb-port=443, instance-port=8443, protocol=TCP"

» De-register existing ELB instances

As we already had a couple of instances registered with the ELB I found the easiest way to get auto-scaling up and running was to remove the existing instances before proceeding:

>> elb-describe-instance-health ApiLoadBalancer
INSTANCE i-abcdefgh InService INSTANCE i-ijklmnop InService >> elb-deregister-instances-from-lb -lb ApiLoadBalancer --instances i-abcdefgh i-ijklmnop No Instances currently registered to LoadBalancer

» Create a launch configuration

Before you can introduce auto-scaling you need to have a couple of things in place - an Amazon Machine Image (AMI) that upon boot is immediately ready to serve your application and a launch configuration compatible with your currently ELB-scaled nodes (security groups etc.). Depending upon your setup, always having an AMI ready to launch with the latest version of your production codebase is probably the hardest thing to achieve here. Once you have your AMI in place and your security group and key-pair settings to hand you’re ready to create your launch configuration:

>> as-create-launch-config ApiLaunchConfig --image-id ami-myamiid --instance-type m1.small --key ssh_keypair --group "elb security group name"
OK-Created launch config

» Create an auto-scaling group

Once you have a launch configuration in place it’s time to create an auto-scaling group. Auto-scaling groups need as a minimum to know what launch configuration, which load-balancer to use, which availability zone and the minimum and maximum to scale to. We never run the Galaxy Zoo API on anything less than 2 nodes and so to create our auto-scaling group I issued a command something like this:

>> as-create-auto-scaling-group ApiScalingGroup --launch-configuration ApiLaunchConfig --availability-zones us-east-1b --min-size 2 --max-size 6 --load-balancers ApiLoadBalancer
OK-Created AutoScalingGroup

At this point it’s worth noting that although we’d removed all of the instances being load balanced by the ApiLoadBalancer ELB, because the auto-scaling group set a minimum number of instances of 2 checking the status of the auto-scaling group showed that 2 new instances were spinning up:

>> as-describe-scaling-activities ApiScalingGroup
ACTIVITY 78bf4e0d-f72b-4b5b-a044-6b99942088ed 2009-08-24T07:19:28Z Successful "At 2009-08-24 07:16:12Z a user request created an AutoScalingGroup changing the desired capacity from 0 to 2. At 2009-08-24 07:17:17Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 2."

I don’t know about you but I think that’s pretty AWESOME!

» Create some launch triggers

To complete the auto-scaling configuration, you need to define the rules that increase and decrease the number of load-balanced instances. Currently we have a very simple rule based upon CPU load - if the average CPU load over the past 120 seconds is greater than 60% we introduce a new instance, if the CPU average drops below 20% then we remove an instance:

>> as-create-or-update-trigger ApiCPUTrigger --auto-scaling-group ApiScalingGroup --namespace "AWS/EC2" --measure CPUUtilization --statistic Average --dimensions "AutoScalingGroupName=ApiScalingGroup" --period 60 --lower-threshold 20 --upper-threshold 60 --lower-breach-increment=-1 --upper-breach-increment 1 --breach-duration 120
OK-Created/Updated trigger

These triggers will almost certainly require refinement but helpfully the as-create-or-update-trigger command will create a new trigger if one doesn’t exist or update an existing trigger based upon the new parameters.

» That’s it!

Once again it’s been a breeze to introduce the latest AWS features into our production stack. Moving Galaxy Zoo to AWS has completely changed the way we think about running our web applications - we’ve gone from having a group of ‘pet’ servers we each know the name of to having a fault-tollerant, auto-scaled web-stack ready for the future.

Hunting for Supernovae

Over the last few days we’ve been running a little project to see if the Galaxy Zoo community can help find new supernovae from the Palomar Transient Factory). Turns out it works pretty well.

Developing the website to power Supernova Zoo was a fun challenge; the Supernova and Galaxy Zoo websites look pretty similar and obviously share many features but there were some new problems to solve that we hadn’t faced before…

» A Moving Target

Galaxy Zoo 2 had a static number of galaxy images to classify. Within the Galaxy Zoo domain model, we refer to an image as an ‘Asset’, Galaxy Zoo 2 had 245609 Assets to classify. One of the most exciting things about Supernova Zoo is that images are taken from the Palomar Transient Factory in near-realtime and sent up to the website for analysis. Being able to handle these images in an automated fashion is crucial and we’ve built API methods for uploading and auto-registering new Assets with the Supernova Zoo website.

» Priority Assets

For Galaxy Zoo, each Asset in the database has an equal priority of being shown to a Zooite. Upon reaching the classification interface, the Asset presented is therefore essentially random. For a static number of Assets this works well however for Supernova Zoo we wanted something a little different. The very nature of Supernova hunting means that you want to find the newest supernovae as quickly as possible and so to help us with achieve this we implemented a couple of new features:

Asset priority - When serving up an Asset to the Supernova Zoo interface we pick from the most recent supernova candidates. That way we are always going to be classifying the newest candidates first before heading further back in time to look at older ones.

Asset escalation - So that we could alert Mark and Sarah at the WHT to new supernova candidates as rapidly as possible we needed a mechanism for escalating the priority of the Asset in the system. We achieved this by essentially ‘scoring’ your classifications as they came in. When creating the decision tree we attached a score to some of the answers. When your classification was complete we kept an average total score for the Asset that you had just classified. By keeping track of the scores as you classified, if you gave the ‘correct’ sequence of answers for a potentially real candidate, then it would immediately become a higher priority target to show to other Zooites.

» A Retrospective

Supernova Zoo was our first opportunity to test the codebase that myself and the team at SIUE have been working hard on for the last few months. Handling a continual stream of new Assets and changing the behaviour of the system in real time based on your classifications has been a fun challenge and overall we’re pretty happy with the results.

In the next day Supernova Zoo will be taken offline so that we can have a good look at the results from the past few days. Based upon your excellent feedback there will almost certainly be some tweaks to the classification interface and refinements to the decision tree. Supernova hunting is a very different challenge to galaxy classification and we’re delighted that our Zooites appear to equally adept at classifying galaxy morphologies as finding new supernovae!