Building a data archiving service using the GitHub API, figshare and Zenodo

Over the past couple of weeks we’ve seen some great examples of service integrations from figshare and Zenodo that use the GitHub outbound API to automatically archive GitHub repositories. While the implementation of each solution is likely to be somewhat different, I thought it might be useful to write up in general terms how to go about building such a service.

In a nutshell we need a tool that does the following:

Authenticates against the GitHub API
Configures an outbound API endpoint for repository events to be posted to
Responds to a GitHub repository event by grabbing a copy of the code
Issues a DOI for the code bundle

A while ago, together with Kaitlin Thaney (@MozillaScience) and Mark Hahnel (@figshare), I put together a proof-of-concept implementation called Fidgit that does the above. You can read more about how to run your own version of this service in the README here.

» Tuning in to the GitHub outbound API

GitHub has both an inbound API (i.e. one you send commands to) and an outbound notifications API called webhooks. By configuring the webhooks for a repository, it’s possible to receive an event notification from GitHub with some information about what has changed on the repo. Check out this list for all of the event types it’s possible to tune into, but for the purposes of this article we’re going to focus on the event type that is generated when a new release is created.

Whether it’s Zenodo, figshare or a reference implementation like Fidgit, they all rely upon listening to the outbound GitHub API webhooks and responding with some actions based upon the content in the JSON payload received.
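To make the "respond to the JSON payload" step concrete, here is a minimal sketch in Python of picking the useful bits out of a release notification. It is not how Fidgit, figshare or Zenodo do it. The function name `parse_release_event` is made up for illustration, and it assumes the event type has already been read from the `X-GitHub-Event` request header and that only the `published` action is of interest.

```python
import json


def parse_release_event(event_type, body):
    """Return what an archiver needs from a release delivery, else None.

    event_type: value of the X-GitHub-Event header sent by GitHub.
    body: the raw JSON request body as a string.
    """
    if event_type != "release":
        return None  # some other event type we did not ask for

    payload = json.loads(body)

    # Assumption: we only archive when a release is published.
    if payload.get("action") != "published":
        return None

    release = payload["release"]
    return {
        "repository": payload["repository"]["full_name"],
        "tag": release["tag_name"],
        "zipball_url": release["zipball_url"],
    }
```

A web framework handler would call this with the incoming header and body, and hand any non-`None` result to whatever does the archiving.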

» Creating a webhook on a GitHub repo

Creating a webhook for a GitHub repo is something that can only be done by someone who has permissions to modify the repository state in some way. In order for figshare and Zenodo to set up the webhooks on your GitHub repos, both applications ask you to log in with your GitHub credentials and authorise their applications to administer your webhooks. They do this using OAuth. While a full OAuth login flow is the ‘complete’ way to do this, Fidgit requires a personal access token from your GitHub profile and uses this to authenticate and create a webhook. Note that it’s possible to ask for only conservative permissions on a GitHub user’s account: OAuth scopes that just cover administering webhooks. You can read more about these scopes here.
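As a rough illustration of the personal access token route, the sketch below builds the request that asks GitHub to create a webhook subscribed to release events. It is written in Python rather than taken from Fidgit. The helper name `build_create_webhook_request` and the callback URL are placeholders, and it assumes the token carries a webhook scope such as `write:repo_hook`.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_create_webhook_request(full_name, token, callback_url):
    """Build (but do not send) a request creating a release webhook.

    full_name: "owner/repo" of the repository to watch.
    token: a personal access token allowed to manage webhooks.
    callback_url: where GitHub should POST events (our own service).
    """
    body = {
        "name": "web",              # GitHub expects "web" for webhooks
        "active": True,
        "events": ["release"],      # only notify us about releases
        "config": {"url": callback_url, "content_type": "json"},
    }
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{full_name}/hooks",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Keeping the building and sending steps separate makes it easy to inspect the request before it touches a real repository.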

» Archiving the code

Once notified of a change in the repository (such as a new release), we need to go and grab that code. This could be in the form of a ‘Git clone’ of the repository and all of its history, but Fidgit, Zenodo and figshare all choose to just grab a snapshot of the code from the GitHub raw download service. At the bottom right of the page of every GitHub repository, there’s a link to ‘Download ZIP’. This basically gives us a copy of the current state of the repository but without any Git (or GitHub) information attached, such as Git history. As these files can be reasonably large, it makes sense to grab this code bundle in a background worker process. That happens in the Fidgit worker here, which basically uses plain old curl to grab the zip archive and then pushes the code up to figshare through their API.
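The background-worker idea can be sketched as follows. This is a Python stand-in, not Fidgit's worker: it swaps curl for `urllib`, uses a thread and an in-process queue instead of a separate worker process, and leaves out the upload to figshare. The names `archive_url`, `download` and `start_worker` are made up for illustration, and the archive URL pattern is assumed to match what the ‘Download ZIP’ link points at for a tag.

```python
import queue
import shutil
import threading
import urllib.request


def archive_url(full_name, tag):
    """URL of the zip snapshot for a tag (assumed 'Download ZIP' pattern)."""
    return f"https://github.com/{full_name}/archive/{tag}.zip"


def download(url, dest_path):
    """Stream the archive to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


def start_worker(jobs, fetch=download):
    """Start a thread that processes (url, dest_path) jobs until it sees None."""
    def run():
        while True:
            job = jobs.get()
            if job is None:          # sentinel: stop the worker
                jobs.task_done()
                return
            try:
                fetch(*job)
            except Exception as exc:  # keep the worker alive on failures
                print(f"archive failed for {job[0]}: {exc}")
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The webhook handler then only has to put a `(url, dest_path)` tuple on the queue and return straight away, so GitHub is not kept waiting while a large archive downloads.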

» Putting a DOI on it

This step is left as an exercise for the reader (just kidding). Fidgit doesn’t do this; figshare, Zenodo and Dryad are doing this bit, so it’s out of scope for this article.