A couple of months ago I wrote about how the astrophysics community should place more value on those individuals building tools for their community - the informaticians. One example of a tool that I don't think is particularly well known in many areas of research is the Apache Hadoop software framework.
Hadoop is a tool designed for distributing data-intensive processes across a large number of machines and its development was inspired by the Google MapReduce and Google File System papers. One of the largest contributors to the Hadoop open source project is Yahoo! who use it to build their search indexes based upon vast quantities of data crawled by their search bots.
Hadoop allows you to execute parallel tasks on a number of 'slave' machine nodes (the 'map' function) before combining the results of the processing into a result (the 'reduce' step). This requires understanding how to configure a 'master' job-management node as well as a number of 'slave' worker nodes to form a Hadoop cluster. Wouldn't it be nice if you didn't have to spend your days installing software and configuring firewalls? Well, thanks to Amazon and their Elastic MapReduce service you don't have to.
Elastic MapReduce is a web service built on top of the Amazon cloud platform. Using EC2 for compute S3 for storage it allows you to easily provision a Hadoop cluster without having to worry about set-up and configuration. Data to be processed is pulled down from S3 and processed by an auto-configured Hadoop cluster running on EC2. Like all of the Amazon Web Services, you only pay for what you use - when you're not using your 10,000 node Hadoop cluster you don't pay for it.
Data processing workflows with Hadoop can be developed in a number of ways including writing MapReduce applications in Java, using SQL-like interfaces which allows you to query a large dataset in parallel (e.g. Pig) or perhaps most exciting for people with existing software using something called Hadoop Streaming. Streaming allows you to develop MapReduce applications with scripts in any language you like for the mapper and reducer provided that they read input from STDIN and return their output to STDOUT.
While Hadoop can be a valuable tool for any data-rich research domain, building applications using Streaming is a great way to leverage the power of Hadoop without having the overhead of learning a new programming language.
As an example of how to use Hadoop Streaming on Elastic MapReduce I'm going to capture all of the tweets over a 12-hour period that have the word 'bieber' in them and search for the word 'love'. To save these tweets I've used the Twitter Streaming API and the simple PHP script below that writes out the tweets to a file in hourly snapshots.
As I mentioned earlier, the streaming functions/scripts can in theory be written in any language - Elastic MapReduce supports Cascading, Java, Ruby, Perl, Python, PHP, R, or C++. In this example I'm going to use Ruby. Elastic MapReduce is going to pass the input files from the S3 bucket to this script running on the parallel slave nodes, hence the use of
ARGF.each - we're reading from STDIN.
Data from the streaming API PHP script are saved as JSON strings to hourly snapshotted files. Each line in the file is a potential tweet so we're stepping through the file line by line (tweet by tweet), verifying that we can parse the tweet using the Crack Rubygem by John Nunemaker and also checking if the tweet text has the word 'love' in it. If we find 'love' then we print to STDOUT the word 'love' - this is the streaming output from the map step and is forms the input for the reduce function.
The reduce script is super-simple. The output from the map script is automatically streamed to the reduce function by Hadoop. All we do with the reduce script is count the number of lines returned (i.e. the number of times the word 'love' is found).
In the map script we have a Rubygem dependency for the Crack gem. As we can't specify the machine image (AMI) that Elastic MapReduce uses we need to execute a bootstrap script to install Rubygems and the Crack gem when the machine boots.
Now we've got our map and reduce scripts we need to upload them to S3 together with the raw data pulled from the Twitter API. We've placed out files in the following locations:
Next we need to configure a streaming workflow with the values above.
Then we need to add a custom bootstrap action to install Rubygems and our dependencies on launch of the Amazon nodes, review the summary and launch the workflow.
Once we have the workflow configured click the 'Create Job Flow' button to start processing. Clicking this button launches the Hadoop cluster, bootstraps each node with the specified script (boot.sh) and begins the processing of the data from the S3 bucket.
As the Elastic MapReduce nodes are instance-store backed rather than EBS volumes they take a couple of minutes to launch. You can review the status of the job on the Elastic MapReduce Jobs view but also see the status of the cluster on the EC2 tab.
Once the job has completed, the reduce function writes the output from the script to the output directory configured in the job setup and the cluster closes itself down.
So it turns out that a fair few people love Justin Bieber: over the 12 hours there were about 200,000 tweets mentioning 'beiber' and we find the following in our output: 8096 'loves' for Justin. That's a whole lotta love.
Obviously this is a pretty silly example, but I've tried to keep it as simple as possible so that we can focus on the setup and configuring of a Hadoop Elastic MapReduce cluster.
Hadoop is a great tool but it can be fiddly to configure. With Elastic MapReduce you can focus on the design of your map/reduce workflow rather than figuring out how to get your cluster setup. Next I'm planning on making some small changes to software used by radio astronomers to find astrophysical sources in data cubes of the sky to make it work with Hadoop Streaming - bring it on SKA!