Twitter Streaming with EventMachine and DynamoDB

This week Amazon Web Services launched their latest database offering 'DynamoDB' - a highly-scalable NoSQL database service.

We've been using a couple of NoSQL database engines at work for a while now: Redis and MongoDB. Mongo allowed us to simplify many of our data models and represent more faithfully the underlying entities we were trying to represent in our applications and Redis is used for those projects where we need to make sure that a person only classifies an object once.1

Whether you're using MongoDB or MySQL, scaling the performance and size of a database is non-trivial and is a skillset in itself. DynamoDB is a fully managed database service aiming to offer high-performance data storage and retrieval at any scale, regardless of request traffic or storage requirements. Unusually for Amazon Web Services, they've made a lot of noise about some of the underlying technologies behind DynamoDB, in particular they've utilised SSD hard drives for storage. I guess telling us this is designed to give us a hint at the performance characteristics we might expect from the service.

» A worked example

As with all AWS products there are a number of articles outlining how to get started with DynamoDB. This article is designed to provide an example use case where DynamoDB really shines - parsing a continual stream of data from the Twitter API. We're going to use the Twitter streaming API to capture tweets and index them by user_id and creation time.

DynamoDB has the concept of a 'table'. Tables can either be created in using the AWS Console, by making requests to the DynamoDB web service or by using one of the abstractions such as the AWS Ruby SDK. There are a number of factors you need to consider when creating a table including read/write capacity and how the records are indexed. The read/write capacity can be modified after table creation but the primary key cannot. DynamoDB assumes a fairly random access pattern for your records across the primary key - a poor choice of primary key could in theory lead to sub-par performance.

DynamoDB is schema-less (NoSQL) and so all we need to define upfront is the primary key for indexing records. Primary keys can be defined in two ways:

Simple hash-type primary - Simple hash-type primary key where a hash index is made using this key only.

Hash and range type primary - Composite hash and range primary key. In this situation two indexes are made on the records - an unordered hash index and a sorted range index.

» Why should I care about my choice of primary key?

Choosing an appropriate primary key is especially important with DynamoDB as it is only the primary key that is indexed. That's right - at this point in time it is not possible to create custom indexes on your records. This doesn't mean that querying by item attributes isn't possible, it is, but you have to use the Scan API which is limited to 1MB of data per request (you can use a continuation token for more records) and as each query has to read the full table the performance will degrade as the database grows.

For this example we're going to go with the composite hash and range key (user_id as the hash_key and created_at as the range_key) so that we can search for Tweets by a particular user in a given time period.

» Writing records

DynamoDB implements a HTTP/JSON API and this is the sole interface to the service. As it's GOP debate season we're going to search the Twitter API for mentions of the term 'Romney' or 'Gingrich'. When a tweet matches the search we're going to print the Twitter user_id, a timestamp, the tweet_id, the tweeter location and their screen_name.

Next we want to write the actual records to DynamoDB.

We can see the tweet count in out table increasing over time using the following script.

» Write performance

As things currently stand, our write performance is limited by the time taken by a single Ruby process to complete a HTTP request to the DynamoDB API endpoint (e.g. http://dynamodb.us-east-1.amazonaws.com). Regardless of your write capacity and network performance you're never going to realise the performance of DynamoDB using a single threaded process like this. What we want instead is a multi-threaded tweet parsing monster.

» Enter EventMachine

Given the write performance is limited by the HTTP request cycle when we receive notification of a new tweet we want to pass of the task of posting that tweet to DynamoDB to a separate thread. EventMachine has a simple way to handle such a scenario using the EM.defer method

I've set up an EventMachine threadpool_size of 50 (default is 20) which means that we can have 50 processes simultaneously writing to DynamoDB. Awesome. If you choose a popular term on Twitter (such as Bieber) you'll see dramatically improved write performance using EventMachine this way.

» Reading back those records

As I mentioned earlier, you're limited to relatively simple indexes with DynamoDB and so the primary key you choose will have a significant affect on the query performance of your application. Below are some examples of querying using the index and using the Scan API.

For the most simple use cases executing queries using the Scan API will suffice. As data volumes grow however, common queries will need to be made more performant by keeping separate tables indexing the attributes you're querying on. And this is one of the core differences between DynamoDB and other NoSQL solutions such as MongoDB today - if you want to produce complex indexes of your records then you'll need to do the heavy lifting yourself.

» Conclusions

DynamoDB is an exciting new technology from AWS and I can already think of a number of use cases where we'd be very seriously considering it. As I see it, DynamoDB sits somewhere in between Redis and MongoDB, providing a richer query API than something like Redis but requiring more of your application (or a change in your data model) to support queries against a range of attributes.

One significant factor favouring DynamoDB is that just like RDS it's a database as a service. We've been using RDS in production now for well over two years and the sheer convenience of having someone else manage your database, making it easy to provision additional capacity or set up read-slaves can be spoken of too highly. Our largest MongoDB instance today is about 5GB in size and very soon we're going to have to being thinking about how to scale it across a number of servers. Regardless of the potential additional complexity of DynamoDB, not having to think about scaling is a massive win and makes it worthy of consideration for any NoSQL scenario.

1. We keep a Redis key for each user with the ids of the objects that they have classified. Subtracting that from a key with all available ids and then returning a random member is super-fast in Redis.