Chatops-Driven Publishing

JOSS is coming up on its third birthday, and today we published our 500th paper. This has been no small feat and I’m hugely indebted to our wonderful group of volunteer editors and reviewers who have made all this possible. We had some help, though, from our editorial robot, Whedon. In this post I’ll spend some time introducing Whedon, its capabilities, and potential future directions.

It’s been about 2 years and 10 months now since we started JOSS, our developer-friendly journal for publishing open source research software. Today we passed 500 accepted papers - a remarkable achievement for our low-cost, volunteer-run journal.

When starting JOSS, I always thought that automation was going to be a big part of how things would work if the journal became successful. It turns out automation is a huge part of what we do. I like to call this Chatops-driven publishing.

When JOSS launched, I had a personal goal to keep the amount of custom infrastructure for the journal to an absolute minimum. This wasn’t the first journal I’d started and some of the challenges of building a more complex tool for projects such as The Open Journal of Astrophysics1 led me to believe that it was possible to build something really lean based upon the GitHub API and GitHub issues.

JOSS is essentially three things:

  1. A simple website that captures information about a paper when authors submit and lists accepted papers
  2. GitHub (issues) for the majority of the editorial process and GitHub (repositories) for hosting of article PDFs and Crossref metadata.
  3. An automated bot called Whedon.

» Whedon

Whedon is our editorial robot that hangs out on JOSS reviews and can be asked to do things by editors, authors, and reviewers. On reflection after publishing more than 500 papers over the last ~3 years, I think Whedon is the major innovation of JOSS.

Whedon is an example of a ‘Chatops’ bot – a term coined by GitHub for the way they deployed their robot ‘Hubot’. When JOSS was first launched, I was working at GitHub and got to see first-hand how Hubot helped GitHub2, 3 streamline their operations, and it seemed like the repetitive work involved in editing a journal was perfect for an automated bot.

The general idea behind Chatops is that robots (Hubot/Whedon) can be used to do repetitive work in a reliable, repeatable way. At GitHub, Hubot was available in the Slack channels and was able to do things like deploy applications and services, monitor the health of core parts of the GitHub infrastructure, defend against DDoS attacks, and even order pizza. All of this functionality was exposed as commands in chat (Slack), e.g. hubot deploy github4.

For JOSS (and sister journal JOSE), Whedon is available on all of the review issues and has a range of abilities including: assigning reviewers and editors to a paper, compiling a preview of a submitted paper, carrying out pre-flight checks for submissions (e.g. looking for a valid open source license), and accepting a paper by depositing metadata with Crossref. Whedon understands the different roles of people involved in reviews on GitHub too, which means authors and reviewers have access to a subset of commands, whereas editors and editors-in-chief have access to editorial commands that control the flow of the editorial process.

By having all of Whedon’s functionalities exposed as chatops commands, the vast majority of the editorial work at JOSS is heavily automated, and authors, editors, and reviewers learn how JOSS works by example. In fact, an author’s first exposure to JOSS after submitting is in a ‘pre-review’ issue where Whedon invites the author to make sure their auto-generated paper proof is formatted correctly and to suggest reviewers.

[Figure: Whedon saying hello in a pre-review issue]

» How Whedon works

Under the hood, Whedon is a relatively small application (Whedon-API) that receives events from GitHub and handles them differently depending upon the contents of the event. As an event is received, Whedon’s logic determines how to respond. This can be as simple as saying ‘hello’ or, if asked, something more complex such as compiling a PDF proof of the paper with Pandoc and checking Crossref for missing DOIs in the references.
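
To make that flow a little more concrete, here’s a minimal sketch of the pattern in Ruby (the language Whedon is written in). This is an illustration only, not the actual Whedon-API code; the route name and the queue_job helper are invented for the example.

    # A toy chatops dispatcher, sketched with Sinatra. Not the real Whedon-API:
    # the route name and the job-queueing helper are invented for this example.
    require 'sinatra'
    require 'json'

    post '/dispatch' do
      event = JSON.parse(request.body.read)

      # GitHub sends an 'issue_comment' event whenever someone comments on a review issue.
      comment = event.dig('comment', 'body').to_s
      issue   = event.dig('issue', 'number')

      # Only react to lines addressed to the bot.
      case comment
      when /\A@whedon generate pdf/i
        queue_job(:compile_proof, issue: issue)
      when /\A@whedon assign @(\S+) as reviewer/i
        queue_job(:assign_reviewer, issue: issue, reviewer: Regexp.last_match(1))
      when /\A@whedon check references/i
        queue_job(:check_references, issue: issue)
      end

      status 200
    end

    # In a real bot this would enqueue background work (compiling PDFs, calling
    # the GitHub API to comment back, and so on). Here it just logs the intent.
    def queue_job(name, payload)
      puts "Would run #{name} with #{payload.inspect}"
    end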

Much of the paper processing (the Pandoc work) lives in a RubyGem, also called Whedon, which does a lot of the heavy lifting and can be run locally on an editor’s laptop if they want to process papers manually.
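
As a rough illustration of the kind of work the gem does (this is a sketch, not its actual code), compiling a proof ultimately comes down to shelling out to Pandoc:

    # Illustrative sketch only: the real Whedon gem does considerably more
    # (metadata extraction, templates, Crossref XML), but the core proof step
    # is roughly a call out to Pandoc.
    require 'open3'

    def compile_proof(paper_path, output_path = 'paper.pdf')
      _stdout, stderr, status = Open3.capture3('pandoc', paper_path, '-o', output_path)
      raise "Pandoc failed: #{stderr}" unless status.success?
      output_path
    end

    compile_proof('paper.md') if __FILE__ == $PROGRAM_NAME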

» Whedon’s powers

Initially Whedon’s abilities were mostly about assigning editors and reviewers to submissions. Over time though, Whedon’s abilities have grown substantially, such that for the vast majority of submissions a JOSS editor can work exclusively in GitHub issues (i.e. no other tools are needed, only instructions to Whedon in a GitHub issue).

Assign reviewers/editors
Whedon assigns an editor or reviewers to a paper by GitHub handle, e.g. @whedon assign @arfon as editor

Checking for an open source license and programming languages
When a new submission is opened, Whedon looks for an open source license using Licensee (the same way GitHub detects the license of a repository) and also tries to figure out which programming languages are being used, using Linguist.

Starting reviews
Once the editor and reviewer(s) have been identified, Whedon sets up the main JOSS review issue together with a review checklist for the reviewers to follow: @whedon start review

Help authors and editors find reviewers
We have a long list of potential JOSS reviewers which you can join too :-). Whedon knows how to call up this list: @whedon list reviewers

Update metadata during the review process
During the review process, version numbers for software are often bumped, DOIs for software archives are created, etc. Whedon knows how to update these, e.g. @whedon set 10.xxxx/xxxxx as archive

Generate proofs of papers
As a review is carried out, the paper associated with the submission is frequently updated. Authors, reviewers, and editors can all request a new proof from Whedon: @whedon generate pdf

[Figure: paper proofs]

Remind authors and reviewers
Sometimes an author or reviewer needs time to make updates to their submission or carry out their review. Editors can ask Whedon to create reminders: @whedon remind @reviewer in 2 weeks

Checking for missing DOIs
We deposit reference metadata with Crossref and like our references to have DOIs (where they exist). Whedon uses the Crossref API to check for potential missing DOIs: @whedon check references
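
As a sketch of how a check like that can work (this isn’t Whedon’s actual implementation), you can ask the public Crossref REST API for the best match for a reference and see whether a plausible DOI comes back:

    # Illustrative sketch (not Whedon's actual implementation): look up a likely
    # DOI for a reference via the public Crossref REST API.
    require 'net/http'
    require 'json'
    require 'uri'

    def suggest_doi(reference_text)
      uri = URI('https://api.crossref.org/works')
      uri.query = URI.encode_www_form('query.bibliographic' => reference_text, 'rows' => 1)
      response = Net::HTTP.get_response(uri)
      return nil unless response.is_a?(Net::HTTPSuccess)

      best = JSON.parse(response.body).dig('message', 'items')&.first
      best && best['DOI']
    end

    puts suggest_doi('Astropy: A community Python package for astronomy')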

Create final proofs of papers in pull requests on GitHub
When we’re getting close to accepting a paper, editors can ask Whedon to generate final proofs of the JOSS paper and associated Crossref submission metadata: @whedon accept

Accept the paper, for real
Once the editor is happy with the paper, one of our editors in chief can then take the final step of accepting the paper into JOSS: @whedon accept deposit=true

[Figure: an accepted JOSS paper]

» Powers we might give Whedon someday too

It turns out that pretty much everything you might want to do as part of the standard editorial process is something that can be automated and turned into a chatops command. Some future functionalities we’re planning on developing for Whedon include:

Smart reviewer recommendations using machine learning
We have a reviewer pool of ~600 volunteers5 and a growing collection of papers and reviews to learn from. We think Whedon should be able to make smart recommendations for potential reviewers based on topic modelling, historical reviews, current review workload etc.
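
Nothing like this exists yet, but even a crude version is easy to sketch: score each volunteer by the overlap between their declared topics and the submission’s keywords, down-weighted by their current review load. Everything below is hypothetical; a real implementation would use topic models trained on past reviews.

    # Hypothetical sketch of a naive reviewer recommender; nothing like this is
    # implemented in Whedon today, and the reviewer pool here is made up.
    Reviewer = Struct.new(:handle, :topics, :active_reviews)

    def recommend(submission_keywords, reviewers, count: 3)
      reviewers
        .map do |r|
          overlap = (r.topics & submission_keywords).size.to_f
          # Penalise reviewers who already have a heavy review load.
          [r.handle, overlap / (1 + r.active_reviews)]
        end
        .reject { |_, score| score.zero? }
        .sort_by { |_, score| -score }
        .first(count)
    end

    pool = [
      Reviewer.new('@astro-dev', %w[python astronomy time-series], 2),
      Reviewer.new('@bio-person', %w[r bioinformatics genomics], 0),
      Reviewer.new('@ml-fan', %w[python machine-learning], 1)
    ]

    p recommend(%w[python astronomy], pool)
    # Ranks @astro-dev first, then @ml-fan; @bio-person has no topic overlap.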

More editorial checks
Spelling, grammar, and paper formatting are still things our editors have to spend some time on. JOSS papers are deliberately fairly vanilla in their formatting so layout issues are rare, but having Whedon spot basic spelling and grammar issues would, we think, be beneficial.

Support multiple submission types (e.g. LaTeX)
Currently all papers submitted to JOSS have to be in Markdown and we use Pandoc to turn these into PDFs. Lots of our authors would rather work in LaTeX and so supporting paper submissions in different forms has been on our wishlist for a while6.

Possibly: Work outside of GitHub/Open Journals ecosystem
One deployment of Whedon supports multiple journals (currently JOSS and JOSE), but Whedon can’t easily work outside of the Open Journals ecosystem. Lots of what Whedon does is generically useful (e.g. compiling papers, depositing metadata with Crossref) and we’ve thought about generalizing some of Whedon to work outside of the Open Journals ecosystem.

» Whedon: a key factor in keeping our costs low

As described in our statement about our business model on the JOSS website, the Journal of Open Source Software is an open access journal committed to running at minimal costs, with zero publication fees (article processing charges) or subscription fees. In hindsight, the napkin-math summary we outline there of ~$3.50 per paper turns out to be about right.

As with many (all?) journals, the major cost of publishing is human effort, and for JOSS this comes from our dedicated team of volunteer editors as well as the amazing collection of reviewers on GitHub.

Close to three years into this experiment with low-cost, chatops-driven publishing, the submission rate to JOSS continues to grow, and we’re expanding our editorial team accordingly. We’ve also recently announced a collaboration with AAS publishing to bring software review to their publications. It’s not clear what the next three years of JOSS will look like, but I’m pretty sure Whedon’s going to be a big part of it.

1. Which has since moved to Scholastica in part because of the complexity of rolling your own.
2. https://speakerdeck.com/holman/how-to-build-a-github
3. https://speakerdeck.com/holman/unsucking-your-teams-development-environment
4. Now lots of this stuff is extracted into the probot framework and GitHub Actions
5. As well as tens of millions of GitHub users…
6. This would also make it much easier to fork JOSS and set up a journal with longer-form papers…

Announcing The Journal of Open Source Software

The Journal of Open Source Software (JOSS) is a new take on an idea that’s been gaining some traction over the last few years, that is, to publish papers about software.

On the face of it, writing papers about software is a weird thing to do, especially if there’s a public software repository, documentation and perhaps even a website for users of the software. But writing a paper about software is currently the only sure way for authors to gain career credit as it creates a citable entity1 (a paper) that can be referenced by other authors.

If an author of research software is interested in writing a paper describing their work then there are a number of journals such as Journal of Open Research Software and SoftwareX dedicated to reviewing such papers. In addition, professional societies such as the American Astronomical Society have explicitly stated that software papers are welcome in their journals. In most cases though, submissions to these journals are full-length papers that conflate two things: 1) A description of the software and 2) Some novel research results generated using the software.

» The big problem with software papers & our current credit system

The problem with software papers though is exactly what I wrote earlier: they’re the only sure way for authors to gain career credit2. Or, put differently, they’re a necessary hack for a crappy metrics system. Buckheit & Donoho nailed it in their article about reproducible research when they said:

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

As academics, it’s important for us to be able to measure the impact of our work, but available tools & metrics are woefully lacking when it comes to tracking research output that doesn’t look like a paper. A 2009 survey of more than 2000 researchers found that > 90% of them consider software important or very important to their work — but even if you’ve followed this GitHub guide for archiving a GitHub repository with Zenodo (and acquired a DOI in the process), citations to your work probably aren’t being counted by the people that matter.

» Embracing the hack

In the long term we should move away from closed/proprietary solutions such as Scopus and Web of Science that primarily track papers and their citations, and instead move to tools that can track things without DOIs such as http://depsy.org. However, that’s the long-term fix, and not the one that helps research software engineers and researchers who are already spending significant amounts of time writing code today.

If software papers are currently the best solution for gaining career credit for software, then shouldn’t we make it as easy as possible to create a software paper? Building high quality software is already a lot of work — what if we could make the process of writing a software paper take less than an hour?

» The Journal of Open Source Software

The Journal of Open Source Software is an open source3 developer-friendly journal for research software packages. It’s designed to make it as easy as possible to create a software paper for your work. If a piece of software is already well documented, then paper preparation (and submission) should take no more than an hour.

The JOSS “paper” is deliberately extremely short and only allowed to include:

  • A short abstract describing the high-level functionality of the software (and perhaps a figure)
  • A list of the authors of the software (together with their affiliations)
  • A list of key references including a link to the software archive

Papers are not allowed to include other things such as descriptions of API functionality, as this should be included in the software documentation. You can see an example of a paper here.
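
For a sense of shape, here’s an illustrative sketch of what a JOSS paper looks like: a short Markdown file with a little metadata at the top. The names and metadata fields below are invented for illustration rather than copied from the official template.

    ---
    title: 'Widgetizer: a toy package for frobnicating widgets'
    tags:
      - Python
      - example
    authors:
      - name: Jane Q. Researcher
        affiliation: 1
    affiliations:
      - name: University of Somewhere
        index: 1
    date: 11 May 2016
    bibliography: paper.bib
    ---

    # Summary

    A paragraph or two describing the high-level functionality of the software,
    who it is for, and the research applications it enables.

    # References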

» Oh cool. You’re going to publish a bunch of crappy papers!

Not at all. Remember, software papers are just advertising and JOSS “papers” are essentially just abstracts that point to a software repository. The primary purpose of a JOSS paper is to enable citation credit to be given to authors of research software.

We’re also not going to let just any old software through: JOSS has a rigorous peer review process and a first-class editorial board highly experienced at building (and reviewing) high-quality research software.

» So what’s the review for?

JOSS reviews are designed to improve the software being submitted. Our review process is based upon the tried-and-tested approach taken by the rOpenSci collaboration and happens openly on GitHub. Peer reviews of software papers rarely improve the code submitted4 but they do often improve the documentation, a critical part of making usable software, so our review process is about making sure the pieces are in place for open, (re)usable, well-documented code.

» To the future!

To be clear, we believe software papers are a nasty hack on a broken academic-credit system and that the ideal solution is to move away from papers as the only creditable research product.

None of this helps the students/postdocs/early career researchers of today who have to make very hard decisions about whether to spend time improving the software they’ve written for their research (and others in their community) or whether they should crank out a few papers to make them look like a “productive” researcher.

JOSS exists because we believe that after you’ve done the hard work of writing great software, it shouldn’t take weeks or months to write a paper about your work.

» I’m in, how can I help?

Great! There are two main ways you can help:

  1. Cite JOSS software papers when they exist for a piece of software you’ve used, and
  2. Perhaps volunteer to review some stuff for us?

» Thanks

Finally, I’d like to say thanks to all the people who’ve helped shape JOSS into its current form. Thanks to Karthik Ram, Kevin M. Moerman, Dan Katz, and Kyle Niemeyer who’ve helped refine what JOSS is (and isn’t) and all the people who’ve agreed to be on the editorial board.

And, of course, you might like to submit something — take a look at our author guidelines and let us know what you think.

1. You can of course cite other things, they just don’t necessarily count towards your h-index.
2. This assumes of course that authors remember to cite your software paper.
3. There’s not a huge amount to look at right now but if you’re interested then head over to https://github.com/openjournals/joss
4. Citation needed. Ask your friends/colleagues who have written a software paper whether they think the reviewer even looked at the code.

Open Source Licensing of Research Software on GitHub

It’s nearly two years now since we (GitHub) worked with the folks at Zenodo to develop an integration1 designed to make it easier to archive a software repository and issue a DOI. Since there’s now a reasonably long timeframe to work with, I thought it might be interesting to look at both the usage of the integration and, while we’re at it, the licenses being used by authors of research software.

Stating my assumptions: I’m assuming that if a user has gone to the effort of following the GitHub Guide to deposit their software in Zenodo then it’s very likely to be research-focused software (or perhaps they have a perverse interest in DOIs).

» How many repositories are being archived?

Firstly, some raw stats on integration usage. At the time of writing, close to 5000 (4859 to be exact) unique repositories have configured the Zenodo integration.

GitHub’s license API uses Licensee to detect the license of a software repository. Of these repositories, ~63% of them have a detectable open source license.


+------------------------+---------------+-------------------+
| @zenodo_licensed_total | @zenodo_total | @license_fraction |
+------------------------+---------------+-------------------+
|                   3037 |          4859 |      62.502572500 |
+------------------------+---------------+-------------------+


On the face of it, ~63% of Zenodo-archived code doesn’t sound very good, but this is actually significantly higher than the ~17% of public repositories on GitHub that have a detectable open source license3, 4.

Also, having a license is great, but some people care a lot about what license authors have picked and what restrictions they place on users of the software.


+--------------+-------+---------+
| license      | count | percent |
+--------------+-------+---------+
| mit          |   863 | 28.4162 |
| gpl-3.0      |   504 | 16.5953 |
| unknown      |   442 | 14.5538 |
| gpl-2.0      |   336 | 11.0635 |
| apache-2.0   |   304 | 10.0099 |
| bsd-3-clause |   246 |  8.1001 |
| bsd-2-clause |    93 |  3.0622 |
| agpl-3.0     |    73 |  2.4037 |
| cc0          |    57 |  1.8769 |
| lgpl-3.0     |    44 |  1.4488 |
| lgpl-2.1     |    25 |  0.8232 |
| isc          |    20 |  0.6585 |
| unlicense    |    10 |  0.3293 |
| epl-1.0      |     8 |  0.2634 |
| artistic-2.0 |     6 |  0.1976 |
| mpl-2.0      |     5 |  0.1646 |
+--------------+-------+---------+


Also, for many use cases, whether or not the license is ‘permissive’ matters2.


-- Permissive: 'mit', 'apache-2.0', 'bsd-3-clause', 'bsd-2-clause', 'cc0', 'isc', 'unlicense', 'epl-1.0',
--             'mpl-2.0', 'artistic-2.0'
-- Non-permissive: 'gpl-3.0', 'gpl-2.0', 'agpl-3.0', 'lgpl-3.0', 'lgpl-2.1'

+----------------------+--------------------------+
| @permissive_fraction | @non_permissive_fraction |
+----------------------+--------------------------+
|         53.078696000 |             32.334540600 |
+----------------------+--------------------------+
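
Those two fractions can be reproduced from the license counts above with a few lines of code (a quick sketch using the table’s numbers rather than the live API; the 3,037 licensed total comes from the first table):

    # Recompute the permissive / non-permissive split from the license counts above.
    counts = {
      'mit' => 863, 'gpl-3.0' => 504, 'unknown' => 442, 'gpl-2.0' => 336,
      'apache-2.0' => 304, 'bsd-3-clause' => 246, 'bsd-2-clause' => 93,
      'agpl-3.0' => 73, 'cc0' => 57, 'lgpl-3.0' => 44, 'lgpl-2.1' => 25,
      'isc' => 20, 'unlicense' => 10, 'epl-1.0' => 8, 'artistic-2.0' => 6,
      'mpl-2.0' => 5
    }

    permissive = %w[mit apache-2.0 bsd-3-clause bsd-2-clause cc0 isc unlicense
                    epl-1.0 mpl-2.0 artistic-2.0]
    copyleft   = %w[gpl-3.0 gpl-2.0 agpl-3.0 lgpl-3.0 lgpl-2.1]

    licensed_total = 3037.0 # @zenodo_licensed_total from the first table

    share = ->(group) { 100 * counts.values_at(*group).sum / licensed_total }

    puts format('permissive: %.1f%%, copyleft: %.1f%%',
                share.call(permissive), share.call(copyleft))
    # => permissive: 53.1%, copyleft: 32.3%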


» Wrapping up

Remember, public code isn’t open source without a proper license. Without an open source license that tells others how they can modify and reuse your work, you’ve only shown others your code; you haven’t shared it. Next time you’re releasing code (or are using someone else’s) make sure there’s a license - and if there isn’t, try opening a pull request asking them to add one.

1. https://guides.github.com/activities/citable-code/
2. Channeling my inner VanderPlas
3. https://github.com/blog/1964-open-source-license-usage-on-github-com
4. If you only take into account repositories with two or more collaborators this number is much higher.

On the Possibility of Service Oriented Astronomy

There’s an idea in the technical world called Service Oriented Architecture. It’s pretty widely recognized as a good way of building big, complex systems in a reliable way and, crucially, with reusable components. Famously, Amazon’s Jeff Bezos realised the value of this approach and as a consequence something like 100 separate services are used to render a typical product page on Amazon.com.

Wikipedia defines SOA as:

A service-oriented architecture (SOA) is an architectural pattern in computer software design in which application components provide services to other components via a communications protocol, typically over a network. The principles of service-orientation are independent of any vendor, product or technology.

Over the past few months I’ve been noodling on the idea of Service Oriented Astronomy; my working definition is something like this:

Service oriented astronomy (SOA) is an approach whereby researchers develop novel methods for common (or specialist) data analysis tasks that are shared as hosted services with their peers.

In this post I’m going to try and convince you why SOA (of the Astronomy kind) might be a good idea.

» Why LSST is exciting (and a future headache)

In 2020, following more than two decades of planning and construction the Large Synoptic Survey Telescope (LSST) will see first light. If all goes to plan then shortly afterwards it will enter survey mode and will begin producing vast quantities of data1.

Most of the headlines about LSST are focussed on the data products: a continual ‘Level 1’ alert stream of things (transients) that have changed since that area of the sky was last observed, and annual ‘Level 2’ data releases of calibrated images. With at least three ‘Vs’ (Volume, Velocity and Variety) considered defining characteristics, LSST is generally considered a ‘Big Data’ project2. For me though, the most exciting part of LSST isn’t that it’s big data, it’s a combination of the following factors:

It’s live
Doing things live is hard. Especially if you want to be responsive (and timely) to potentially interesting events.

It’s noisy
A large number of the events being produced are likely to be of little interest to any single individual/research group.

It’s big enough that most of us are woefully unprepared to deal with the data volume
LSST isn’t Facebook/Google ‘big’ but it’s certainly big enough to present difficulties given that the vast majority of us have never received any formal training in software development. Building fast, scalable and reliable tools for processing LSST datasets is not something we’re likely to find easy.

It’s open
Open is a pre-requisite for large-scale innovation. If everyone3 has access to LSST data there’s an opportunity for the creation of secondary data products and a marketplace for their consumption.

» Science that requires services

Although each Level 1 alert itself is going to be pretty small (some positional information, basic photometry and a small thumbnail image), there’s going to be lots of them (Volume), coming at a very quick rate4 (Velocity) and given that they could be low flying rocks, variable stars, supernovae or signals of currently unknown astrophysics, they’re certainly going to be Varied.

So while ingesting Level 1 data might be OK, doing something intelligent, i.e. making decisions on the fly, is going to be much much harder. To add a further complication, many of the research efforts interested in consuming the alert stream are time-sensitive.

If, for example, we’re interested in capturing high-resolution spectra of distant Type Ia supernovae then we have the additional compounding factors of being very time-sensitive and highly intolerant of false-positives (because followup is expensive).

Supernova science is just one of many use cases where a large volume of data alerts needs to be turned into a much smaller stream of high-quality candidates for (timely) followup observation5.

» Modern research: a combination of many steps

Continuing to use Type Ia supernovae discovery as a good use case, what does a typical research and discovery process look like6?

  • Producing the alert stream (source of candidates - in our case LSST)
  • Consuming the alert stream and doing a first level analysis/classification
  • Scheduling followup observation
  • Long-term followup/photometry
  • Data-reduction/modelling/analysis
  • Sharing your work (writing papers/generating figures/publishing data/methods)

Individually, none of these steps present an insurmountable problem: it’s the combination of a noisy (and high-volume) alert stream that requires a sophisticated (fast & accurate) first-level classification followed by an expensive followup observation that means we’ve got problems.

» Composable analysis workflows or “what can open source and SOA teach us?”

Open source is modular — your software is the value-add on top of a rich ecosystem of reusable components. In many ways, Service Oriented Architecture is an implementation of the same ideas: let’s build the thing we need once and then have everyone use the reference (best?) version. Of course, it’s possible that people might choose to re-implement a component but in open source that’s usually out of preference rather than necessity.

Across astronomy, there are a small number of teams combining deep theoretical grounding in novel methods with implementing them in high-quality software that solves real astrophysical problems; hell, some of them even have Bay Area startups.

That means that there are people who are getting really, really good at solving some of the very hard problems that LSST is going to present us with - why wouldn’t you want to incorporate their knowledge into your analysis?

What if there was a way to easily plug the premier Type Ia detection algorithm into your research and subscribe to that alert stream rather than the raw LSST Level 1 alerts?


This is what I think Service Oriented Astronomy could be.
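
To make the idea a little more concrete, here’s a purely hypothetical sketch of what consuming such a service might feel like: a group running the premier Type Ia classifier exposes a filtered stream, and your code subscribes to that instead of the raw Level 1 firehose. None of the endpoints or field names below exist; they’re invented for the sketch.

    # Entirely hypothetical: what subscribing to a curated, pre-classified alert
    # stream might look like instead of drinking from the raw Level 1 firehose.
    # The service URL, payload fields and threshold are invented for this sketch.
    require 'net/http'
    require 'json'
    require 'uri'

    SERVICE = URI('https://sn-classifier.example.org/streams/type-ia')

    def promising_candidates(min_probability: 0.9)
      response = Net::HTTP.get_response(SERVICE)
      return [] unless response.is_a?(Net::HTTPSuccess)

      JSON.parse(response.body)
          .select { |alert| alert['p_type_ia'].to_f >= min_probability }
          .map    { |alert| alert.values_at('id', 'ra', 'dec', 'p_type_ia') }
    end

    # A downstream consumer might hand these straight to a follow-up scheduler.
    promising_candidates.each do |id, ra, dec, prob|
      puts "Candidate #{id} at (#{ra}, #{dec}) with P(Ia) = #{prob}"
    end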

» Recognising good candidates for SOA

Not everything needs to be a service. Here are some ways to identify potentially good candidates:

  • Where implementing a (good) solution is hard (either because of the methods or technologies).
  • Where your peers will easily recognise the value of your service.
  • Where it’s straightforward to define the inputs and outputs. For example, calculating precision radial velocity measurements for exoplanet detection
  • Where the underlying implementation is undesirable (for example a large chunk of old IDL code that people might not want to run - I’m looking at you Solar Physics community).

» Things we probably need to fix to make this a reality

It would also be disingenuous of me not to talk about some significant barriers to making this happen. Namely, even if someone built an incredible piece of software and ran it as a service that a large fraction of the community used, they’d still not be guaranteed a good career in academia because of our paper-citation obsession as a community. So, some things to address:

  • Services typically require support. Who would provide this?
  • If a whole bunch of people become reliant on SOA then we’re going to have to be really sure the implementation is good. Code/data or it didn’t happen.
  • We’re going to need to work out how to credit people for running these services7.

» Service Oriented Astronomy Science

Of course, none of this discussion is unique to astronomy. LSST just feels like a good opportunity (through necessity) to drive significant change in the way we do our research. The same could be said for many other areas of science where there is a large source of open data. Perhaps we should just be calling this Service Oriented Science (SOS).

Service Oriented Architecture (and its close relative Microservices) has dramatically changed the way that companies deliver software and data services. How long before we might say the same for Service Oriented Science?

1. Exactly how much still remains to be seen.
2. Some folks disagree with this statement. Some people also seem to have more Vs for you.
3. Those researchers not in the US or a partner country presumably just need to find themselves a suitably geographically-located collaborator.
4. Nobody seems to know exactly how many, but there are likely to be upwards of tens of millions, and perhaps even hundreds of millions, per night.
5. Of course we could just build a giant army of robotic telescopes to observe a larger fraction of the candidates but this seems like a losing battle.
6. Full disclosure: I know literally nothing about finding supernovae which is somewhat ironic given this
7. I’d love your thoughts on some related efforts I’ve been working on here

Me, Myself and I

Every ounce of Britishness in my body is deeply uncomfortable writing this post but over the past 12 months I’ve twice been invited to speak at careers panels at conferences and both times I’ve had really positive feedback from young academics looking to hear more about alternative career paths post-PhD. So to save me from giving this talk again, this post outlines (as best I can remember) some of the decisions I’ve made in my career and how I’ve landed where I am today.

» Who am I?

This seems like a reasonable place to start: My name is Arfon Smith, I am a lapsed academic (I have a PhD in Astrochemistry) and I work for a company called GitHub leading their engagement with the research community.

» I have no idea what I am doing

Seriously. Looking backwards, it’s usually possible to construct some deliberately followed pathway between the things you have done with your life, but if you’d asked me when I was 17 what I thought I was going to do with my life I would have told you I wanted to be a commercial airline pilot. If you’d asked me towards the end of my degree I would have said I wanted to learn how to make wine professionally1, and if you’d asked me post-PhD I would have told you that I was going to re-train and study medicine. None of these happened; all I’ve done through my career is be responsive to interesting opportunities and not be (too) scared of taking a leap of faith.

» So what did I study?

At high school I studied Maths, Physics and Chemistry. Mostly because they were my favourite subjects, but I also had heard (correctly, it turns out) that by combining these three subjects (especially the Maths and Physics) it was more like studying for 2.5 A-levels because there was so much overlap in the content.

I then enrolled at The University of Sheffield in the Chemistry program.

I got pretty good grades at high-school and so when picking a degree program I was optimising for my deep interest in going somewhere with excellent climbing but also somewhere where I could get a ‘decent’ degree. In the late 90’s, Sheffield had a pretty good Physics program but a really good Chemistry department. I knew I didn’t want to study Maths any further and liked Physics and Chemistry pretty much equally. Psyched about the climbing in and around Sheffield I picked Chemistry.

» Chemistry, not so much…

So it turns out, I’m not that into Chemistry. After the first semester I realised that while Chemistry was interesting enough, some people were waaaay more into it than me. Plus we always seemed to be in the lab when other people were in the bar.

Towards the end of my first year I had reached the firm conclusion that I wanted to stop studying Chemistry as soon as possible. I’d supplemented my first year studies with all of the undergraduate Astrophysics modules (taught in the Physics department) and had really enjoyed these, and so I started talking with university administrators about switching degree programs to Physics/Astrophysics. This was pretty much impossible without completely re-doing my first year (I’d missed all of the Maths that they teach in year 1 Physics). Not wanting to do this, I decided to get my head down and get out as soon as possible by switching from a 4-year MSc program to a 3-year BSc.

» Significant decision #1

Finishing up a degree I wasn’t interested in was hard. I found it difficult to motivate myself to attend lectures and realised that with a modest amount of work I could probably scrape a 2.1 (which I did). So during my third year I pretty much checked out but did just enough work to exit with a reasonable qualification.

Luckily for me, my final year project was supervised by a professor named Tony Ryan. He noticed that I wasn’t particularly motivated and challenged me about this. We chatted for a few minutes one evening in his office and talked about how I was lacking motivation for Chemistry. He asked what I was interested in and I mentioned the Astrophysics I’d studied in my first year and how that was the most interesting thing from the last three years. He then said something that turned out to be pretty significant:

You know, there is this thing called ‘Astrochemistry’

It turned out there were funded PhD positions available just 40 miles down the road at The University of Nottingham in a subject that I was really interested in (space stuff) and that needed someone with a solid background in Chemistry. A couple of conversations with my (future) PhD supervisor later and I was signed up to work with him for the next 3-4 years. Here’s a picture of me using a giant telescope in Australia:

» The PhD years

Firstly, I should point out that until someone pretty much offered me a PhD position I hadn’t really thought that hard about a career in academia. To be honest, I didn’t really have a very firm idea of what I wanted to do after finishing up as an undergraduate and the option of spending another few years at university seemed like a pretty reasonable choice.

Given that I haven’t followed a ‘traditional’ academic path, people sometimes ask if I regret spending time getting a PhD. In short, no. I have very fond memories of my time as a PhD student (even with the standard tensions with your supervisor), I got to travel the world, met interesting people (including my wife!) and had ample time to learn new skills which turned out to be pretty important later on.

If I could describe my PhD in one line it would be this:

As interested in the code I was writing as the results being produced.

Basically I found academic research to be interesting, but not interesting enough. If you ever experienced impostor syndrome when arriving at university as an undergraduate then try being a new PhD student. By definition you’re working with some of the smartest people around and I realised that my peers were both better at research than me and more motivated by the work they were doing. For most, academia is a labour of love (you do get paid, but relatively modestly) and so you have to be willing to put in the hours to stay on top of the literature and ahead of your peers. It is widely recognised that there are far more PhD positions than there are permanent positions in academia and so if the ultimate goal2 for a PhD student is to one day be a professor then the vast majority of people fail.

» Everything else you do during your PhD

For many people, a PhD affords you a remarkable amount of freedom. Sure you need to check in with your supervisor fairly regularly but I often wouldn’t see mine for 2-3 weeks at a time. This meant that there was a fair amount of time available to work on things not directly related to my studies. For me, I spent time noodling around writing code and building websites.

My first experience of programming was in my first year as a PhD student and because I was in a Chemistry department the programming language was of course Fortran3. Around the same time I took a basic HTML course at the university library and very quickly graduated on from Fortran to languages like Perl. Around the same time I started to acquire data that needed analysing for my research and so after a few frustrating months of typing commands like a robot into a terminal it was pointed out to me that repetitive tasks could be automated (again with Perl).

Over the next couple of years I started building more and more websites in my spare time, for myself, friends, colleagues and eventually clients in a freelance capacity. The more time I spent writing code, the more I became interested in the way the code was put together and the programming languages I was using. Around this time I was introduced to Ruby on Rails which had just reached 1.0. Working with Rails was astonishing: not only was the programming language a pleasure, but the framework enforced really smart conventions that helped level up my understanding of software design and development.

» The post-PhD blues

As I reached the end of my PhD I realised I wasn’t cut out for a career in academia. Research was interesting but it was clear to me that some of my peers were on a very different path to me. I finished writing up my work outside of Nottingham (after funds had dried up) and was awarded my PhD in late 2006.

In the weeks and months post PhD I was pretty depressed about what I was going to do next. I had ‘failed’ as an academic and didn’t really know what to do next. For a short while I was a kept man, that is, I was at home trying (and failing) to be a successful freelance developer while my wife brought home the £££. Struggling to know what to do next it was at this point I started to think about re-training as a medical doctor4 as there was a family history of starting late in medical careers. I spent some time volunteering on the wards at a local hospital and while this was rewarding in its own way, it made me realise this wasn’t what I wanted to do.

In a stroke of luck, a good friend of mine then pointed me in the direction of a junior Ruby on Rails developer position at a local new media agency. A couple of weeks later I was gainfully employed as a developer on their web team.

» Significant decision #2

It wasn’t that the company I was working for was particularly bad, but after six months of doing client work at a new media agency I was pretty much done with the commercial sector. Deadlines were always tight, clients were often assholes and there was little interest in the quality of the work we were doing from a technical standpoint as long as ‘it looked good’.

At a similar time to me taking this role with the media agency, a good friend of mine Matt Wood (we’d studied for our PhDs together) had landed a role leading the Production Software Group at the Wellcome Trust Sanger Institute - the site responsible for sequencing about one third of the original human genome.

Matt was hiring a team and was looking for Rails developers to build out the laboratory management software to support the next generation sequencing platforms that the Sanger had invested in. While this job sounds like a dream gig, I should point out that I very nearly didn’t apply for this job. Not because I didn’t want it, but because I didn’t think I was qualified.


The imposter syndrome and insecurities many feel post-PhD can be crippling. The best advice I’ve ever received when applying for jobs is to let the interview committee decide whether you are qualified or not. Thankfully I did exactly this and I was soon working at one of the largest bioinformatics institutes in the world.

» A year in bioinformatics

Looking back now, it’s clear to me that Sanger was a formative role for me professionally. I was employed to write software but was working day-to-day with lots of academics. In many ways this role was a hybrid of my PhD days and my time doing client work in the new media agency. It was also eye-opening to work in an environment where the role (and value) of software in research was well understood. There were probably 800 people on site at Sanger and around 100 of those were developing software. Because of the highly specialised nature of some of the work many of these developers had PhDs in a related field. So here I was, working on a campus full of people with skills like me building tools to facilitate the research of others. It was exciting and I began to see where professionally I could have impact.


Thankfully, Sanger made a large amount of money available for professional development (Ruby training, conferences, etc.) and so over the next 12 months I worked in a small team (~5 people) building web-based tools to support the research of academics and honing the craft of building high-quality software.

» Significant decision #3 - co-founding the Zooniverse

A year into my time at Sanger, I had a serendipitous conversation at a friend’s wedding about a ‘citizen science’ project called Galaxy Zoo5. The project had been wildly successful and hundreds of thousands of people had taken part. The group had secured a grant for two people to work full time on the project building it out into a network of citizen science projects, all with the same basic idea: find research challenges where human cognition exceeds the abilities of computers and crowd-source the science with members of the public.

Chris Lintott (one of the creators of Galaxy Zoo) and myself ended up being those two people who spent the next few years working full time building out what became the Zooniverse.

I can honestly say that this is the first time in my career where the job felt right. Working at Sanger had shown me how it was possible for academic research to reap the rewards of well-written software and this was my chance to start something new in a research domain I knew.

So 18 months after leaving academia, I found myself back in a university department as a postdoc6.

Long story short, I spent the next five years in Oxford and later at the Adler Planetarium as the Director of Citizen Science leading a team of ~15 designers, developers, educators and researchers all working on Zooniverse projects. It was a lot of fun.

A common pattern was emerging though in the kinds of people we were hiring. They were typically very accomplished individuals, often with a research background, but because they’d acquired significant programming skills they didn’t quite fit into the standard academic model. Zooniverse was a good home for these folks as they could come and apply their software (and other) skills to research problems, but it was a small-scale fix for a much wider problem: researchers who spend a large amount of time writing software typically suffer a significant career penalty, as time spent writing code is time not spent writing papers (which is where credit is awarded).


» GitHub

I can’t quite remember when I first met someone from GitHub but it was probably Tim Clem at a Science Hack Day in San Francisco. Tim and I bumped into each other a few times over the ~2.5 years I was working at the Adler and each time we’d talk about the role of software in academia, what open source communities were doing right (and where academia was going wrong supporting software development) and how GitHub was offering value to the academy as a place for researchers to publish their work.

At some point in those ~2.5 years it became clear that GitHub wanted to hire someone to work in this space. To engage with the wider academic community and work on how to better support academics using the platform. Zooniverse had given me the opportunity to change the way academic research was carried out and this opportunity at GitHub presented an evolution of that challenge but at a much larger scale.

And so that’s where I am today. Working at GitHub to help make the careers of those people who develop software as part of their academic life more successful. Unusually for GitHub (and perhaps all engineering companies) this isn’t just a new product that GitHub needs to build or a discount they need to offer a particular community. Large-scale change in how we credit research products other than papers is a huge challenge for the global academic community and one that GitHub is a small but important part of.

» Hindsight is 20/20

As I said at the start of this post, looking backwards it’s easy to construct a narrative around how you ended up at a particular point in your career. But as I also said, I generally have no idea what I’m doing. For the vast majority of my career thus far there has been no grand plan.

It’s now nearly nine years since I finished my PhD and it’s only in the last three years that I’ve really begun to understand what I want to do with my career. Could I have predicted that I’d end up at GitHub five years ago? Definitely not. But did GitHub hire me because of the professional experience I had? Absolutely.

» Some advice

Most PhD graduates don’t end up becoming tenured professors so here are some recommendations for things you can do to make yourself more employable outside of academia:

Results are important, but how you got there might count more in the end
Outside of academia, employers are likely to care less about your publications, but the skills you acquired when doing research will always be valuable. Whether it’s statistical methods, technologies such as version control, or programming, these are the transferable skills that will get you a job in industry.

Share (and license) your work
When considering you as a candidate, employers will look you up online and it’s a pretty well-established fact that people get hired because of their GitHub profiles. If you’re writing some code as part of your research, spend the time writing some documentation describing what it does, slap a license on it and put it up on GitHub7.

Take a course (there are lots out there)
Whether it’s Software or Data Carpentry or something else, there are a number of professional development opportunities out there for academics to learn industry applicable skills. The good news is that time spent on these courses will also help you with your research too!

Choose transferable technologies (if you can)
Sometimes you have a choice about what technologies to use in your research. When you do have a choice, try and pick ones that are used outside of academia. Astronomers, that means ditching IDL8 for something like Python. If there’s a chance to learn how to work with databases and SQL then seize that opportunity and stop working with big CSV files.

» Wrapping up

I feel very lucky to have been able to spend the majority of my career thus far only working on problems that are interesting to me. All I can say for sure is that I’ve tried where possible to take opportunities that felt like a good decision at the time but have not always had the confidence to make those decisions on my own. I’ve been very fortunate to receive excellent career advice from a number of people, sometimes friends, sometimes colleagues, sometimes my boss at the time. If you can, find people who know you and whose opinion you trust. Listen to them as they’re often going to have a better perspective than you.


Branching out of our comfort zone and leaving academia can seem like a big decision, and in many ways it is. People just like you make this decision every day and they’re probably working on fun and interesting problems too!

1. This would obviously have been awesome.
2. Obviously this isn’t the case for many people.
3. 77 of course. I had no idea how much (retro) street cred/sympathy this would later earn me.
4. As if the 7 years I already spent at university weren’t enough.
5. Galaxy Zoo is a website that asks members of the public to judge the shape of millions of galaxies collected by robotic telescopes. tl;dr - there are many analysis problems where machines are far inferior to humans (shape classification is one of these tasks) and some researchers at The University of Oxford had built a website that invited the public to help analyse the shape of 1 million galaxies.
6. Almost exclusively writing code (i.e. not trying to do research).
7. Other code hosting platforms are available.
8. The programming language you can buy.

JSON-LD for software discovery, reuse and credit

This is a continuation of some work I’ve been doing with the Mozilla Science Lab and their ‘code as a research object’ program. There’s multiple aspects to this project including work on code and GUI prototypes, discussions around best practices for making code reusable and software citation. This post explores some ideas around linked data and machine readable descriptions of software repositories with the goal being to make software more discoverable and therefore increase reuse.

» JSON-LD

JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

 { "name" : "Arfon" }

when there’s an entity called name you know that it means the name of a person and not a place.

If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving it semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API providers (and consumers) to share data more easily with little or no ambiguity about what the data they’re describing means.

» So what about software?

Over the past few months there’s been a lot of talk about finding ways for researchers to derive (more) credit for code. There are lots of issues at play here but one major factor is that a prerequisite to receiving credit for some piece of code you’ve written is that a peer needs to both be able to find your work and then reuse it.

The problem is, it can be pretty hard to find software unless there’s a standard place to share tools in that language and the author of the code has chosen to publish there. Ruby has RubyGems.org, Python has PyPI, Perl has CPAN but where do I go if I’m looking to find an obscure library written in C++?

Discovering domain, language and function specific software is an even harder problem to crack. Sure, if I write Ruby I can head over to RubyGems to look for a Gem that might solve my problem but I’m relying on both the author to write a descriptive README and my ability to search for terms that include similar language to the author of the package.

For many subjects where common languages don’t benefit from canonical package indexes and the function of the software is relatively niche, then just finding code that might be useful is a problem.

» Towards a (machine readable) description of software

One way to address this discoverability problem is to find a standard way of describing software with context for the terms used. A design goal here should be that these files can be almost entirely automatically generated.

Inspired by the package.json format prescribed by the npm community, and using an ontology described on http://schema.org, below is a relatively short JSON-LD document that describes the Fidgit codebase. Let’s call it code.jsonld for now.

» Minimal citable form
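
A minimal code.jsonld might look something like this (the values here are illustrative):

    {
      "@context": "http://schema.org",
      "@type": "Code",
      "name": "fidgit",
      "description": "An example project from the 'code as a research object' work",
      "codeRepository": "https://github.com/arfon/fidgit",
      "authors": [
        {
          "@type": "Person",
          "name": "Arfon M. Smith"
        }
      ]
    }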

Note that the first two attributes (@context and @type) define the context for the key/value pairs in the JSON structure so that name means the name of the codebase. You can see the full ontology for Code here but this should mostly be straightforward to understand1.

Once we get to the authors attribute we’re now entering a new context, that of an individual. As we’re still using the schema.org ontology for type Person we only need to set the @type attribute here.

There are a bunch more attributes that we could set here but this feels like a minimal set of information that is sufficient for citation (and therefore credit and attribution for the author).

» For data archivers

This next example is a slightly modified version of the minimal one. It includes multiple authors2 and now also has the keywords that folks like figshare and Zenodo require. (Note these keywords should probably be more explicitly structured rather than relying on comma-delimited strings.)
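
Again, the values are illustrative, but the shape is something like:

    {
      "@context": "http://schema.org",
      "@type": "Code",
      "name": "fidgit",
      "description": "An example project from the 'code as a research object' work",
      "codeRepository": "https://github.com/arfon/fidgit",
      "keywords": "github, figshare, doi, metadata",
      "authors": [
        {
          "@type": "Person",
          "name": "Arfon M. Smith"
        },
        {
          "@type": "Person",
          "name": "A. N. Other"
        }
      ]
    }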

» For discovery?

I started by describing the problem of software discovery and how domain, function and language specific searches for tools is hard. So far these JSON-LD snippets don’t really help with this problem as we still only have keywords and a description for describing the software function and domain.

The schema.org Code ontology includes a programmingLanguage attribute which solves the problem of language-specific searches. At GitHub we’re pretty good at detecting this automatically with Linguist and so it’s not even clear that an author of a piece of software would need to manually specify this (a win).

The challenge when designing a more ‘complete’ code.jsonld document is that it’s seemingly rather tough to automate a description of what subject domain the software has been designed for and what the software does.

PLOS ONE has a pretty decent subject taxonomy that I’ve extracted into a machine readable form here and so it’s possible something along these lines could be used to assign a subject domain. Thus far, I’ve been unable to find a good schema for describing academic subjects (or any subject domains). Going deeper and attempting to also describe the function of the software is proving challenging.

» Feedback please!

At this point I’d love some feedback on these ideas. The goal here is to promote software discovery and reuse, so framing this in what’s possible today is a good place to start reflecting on these ideas. Right now it’s possible to do a pretty advanced search for code on GitHub with facets for programming language, file extension, creation date, username and more. Imagine if you could do the same but add in subject area and software function?

One major pitfall with this idea is that in order for an index of code.jsonld files to be useful, people have to start making them - a classic chicken and egg problem. All is not lost though: pretty much all of the minimal code.jsonld file can be auto-generated and perhaps submitted to authors as a pull request patch by a friendly robot?

One of the biggest barriers to reusing research software is finding the damn stuff in the first place - does this help?

1. Note: the Code ontology on schema.org originally didn’t include a license attribute, which seemed like an oversight (it does now).
2. It’s not clear that this is allowed!

Scientific Software and the Open Collaborative Web

I gave a talk back in November at the WSSSPE workshop as part of Super Computing 2013. The slides are available here but are likely of limited value without notes. Encouraged by @juretriglav I’m going to try and summarise my talk here.

It’s also important to point out here that much of this really is a bunch of recycled ideas originally written up by people far more eloquent than me. I’ve included links inline where possible.

» Some Background

My research background is in astrochemistry which is an area of research that focuses on the chemistry of astrophysical environments. My particular interest was in studying the environments between stars - the interstellar medium - and the primary method used to interrogate these large clouds of dust and gas is spectroscopy.

When I used to meet people and tell them what I did for a living, I suspect most people had ideas about this wonderful life I was leading, jetsetting around the world travelling to exotic locations to use these awesome telescopes. Don’t get me wrong, being a researcher is absolutely a privilege but it’s definitely not as sexy as you might think. The reality of observing is long nights trying to stay awake in a weird state of jetlag and sleep deprivation. Throw in a bunch of caffeine, marginal cuisine, an uncomfortable bed1, the worry of messing up your observations and it’s an altogether pretty stressful experience. And that’s just the beginning - once you’ve collected your data the next step is to actually do some science…

» Data Reduction - AKA getting to know your CCD

The goal of data reduction is basically to remove the effect of the instrument (telescope, spectrograph, observing conditions etc.) from the data you’ve collected. Doing this properly means that results from different researchers using different facilities can then be compared. Optical spectroscopy works by pointing your telescope at a star or galaxy, collecting some light from the object, splitting that light up by its wavelength and recording the intensity of that light on a detector called a Charge-Coupled Device (CCD). This CCD is similar to the one in your digital camera but is typically bigger, much more expensive and in order to perform a proper data reduction you need to get to know this CCD pretty damn well.

CCDs are composed of pixels and sometimes these pixels go a little crazy and need to be excluded from the dataset. Removing these pixels requires you to create something called a bad pixel mask which is essentially just a plain text file composed of x,y coordinates describing regions of a CCD that should not be included in the reduced dataset.

Creating this bad pixel mask is a pretty mind-numbing task: you basically open up an image viewer and look at how the CCD responds as you manipulate the dynamic range of the pixel values. Some bad pixels are very obvious, others less so. Depending on how experienced you are, and how old and how big the CCD is, this task might take about two days to complete. Now I’d love to tell you that when you leave the observatory you’re sent away with a bunch of resources to help you with your data reduction, including an up-to-date bad pixel mask file for your CCD, but in my day this was far from the case. The reality is that most people end up repeating this task and as a result a vast amount of time is wasted. How much time? Well let’s try and work it out…

Assume it takes on average two days for a researcher to produce their bad pixel mask and that your average telescope churns through three observing runs per week. If the CCD in use lasts for ~15 years, that’s roughly 3 × 52 × 15 ≈ 2,340 observing runs, and at two days apiece collectively more than 4,680 days of human effort is spent producing slightly different versions of the same file. If ever there were a case for version control then this has to be it.

This example is deliberately extreme and you might take issue with some of my assumptions here (data reduction pipelines have got much better since I was a student). Regardless, the core point remains the same: as researchers we’re taught to think mostly about the future results of our work rather than the tools and approaches we take to derive these results. Initially this is simply down to a lack of training - versioning and sharing our tools is not something that’s taught to most aspiring researchers2.

Later as careers progress this training barrier still exists but even if presented with the tools to share and collaborate more effectively it’s not clear researchers would take advantage of this opportunity. The incentives for this kind of activity are all wrong: if the ultimate goal of a researcher is to be a professor then to secure a tenure-track position you need to publish a bunch of papers in the highest profile journals you can find with as few co-authors as possible. No-one on your tenure committee is going to ask for your GitHub username.

The sad thing is that initially we behave like this because we don’t know any different; later we choose to behave in a way that limits the research efficiency of the community at large.

This is wasteful at any level, but as research becomes ever more data- and compute-oriented an increasing fraction of the research method is being captured in these research products that we’re not sharing. Victoria Stodden’s talk at OKCon last year nailed it when she called out a ‘credibility crisis’. I don’t know for sure, but I’d wager that as research domains become more compute-oriented the effort required to verify a result in the literature increases.

This behaviour of not sharing our tools isn’t universal though. There are researchers out there publishing code, but this is the exception rather than the rule. Until we work out ways to credit these kinds of activities, we can’t expect this to be anything other than a rarity.

The travesty of this situation is that while the academy is struggling with how to be more research-efficient and collaborate more effectively, there are whole communities of people working together seamlessly every day - these are the open source communities we see flourishing on GitHub. IPython lead Fernando Perez summarises the difference between how academic and open source communities share their work with the simplest of statements: open source communities only succeed because their work is ‘reproducible by necessity’. Simply put, open source communities are better at collaborating because they have to be.

The Web we’ve all created over the last few decades has given us a bunch of incredible platforms for seamless communication and sharing of work with others, and yet the only tool I know of that’s pretty much universally adopted in academia is video conferencing.

» The Open Collaborative Web

In his blog post, Marcio von Muhlen correctly (I think) identifies how on GitHub significant contributions are recognised post-publication, and that we should be reaching for something similar in academia. In environments where the publication and distribution of digital media is essentially free, prestige can come post-publication through things like usage metrics (downloads, forks or stars on GitHub) or inbound referrals from other environments, thus increasing a project’s visibility to things like search engines. Altmetrics folks like those at Impactstory are working hard on giving researchers an easy way to measure their impact outside of purely journal-based content, but the success of these efforts seems to depend on those in positions of power factoring them into their hiring decisions.

We need to be able to derive meaningful metrics from open contributions and for these to be valued by our peers and tenure committees.

» Towards Collaborative Versioned Science

von Muhlen also suggests that effecting significant change may require a ‘nimble’ funding agency to come along and offer incentives for a change in behaviour. Recently the Moore and Sloan foundations announced $37.8M of funding for three data science programs at Berkeley, NYU and UW specifically aimed at achieving the following core goals:

  • Develop meaningful and sustained interactions and collaborations between researchers with backgrounds in specific subjects;
  • Establish career paths that are long-term and sustainable, using alternative metrics and reward structures to retain a new generation of scientists whose research focuses on the multi-disciplinary analysis of massive, noisy, and complex scientific data and the development of the tools and techniques that enable this analysis; and
  • Build on current academic and industrial efforts to work towards an ecosystem of analytical tools and research practices that is sustainable, reusable, extensible, learnable, easy to translate across research areas and enables researchers to spend more time focusing on their science.

It’s hard not to get excited about this program as it touches on so many of the key issues necessary to move research towards the networked age. If a stimulus is required for significant change then this might just be it. Focussing funding on a domain such as data science is smart because it’s a research area that’s relatively new, is inherently interdisciplinary and relies heavily upon software and tooling to produce results. At the launch event for the program hosted by the White House OSTP, Ed Lazowska stated that academic environments of today ‘do not reward tool builders’. This funding seems explicitly designed to develop a level of maturity in a research domain where sharing methods is at its absolute core.

» Change is coming…

Right now I think it’s fair to say that in most domains it’s still the products of research (i.e. papers) that are the most highly valued, not the methods. Even popular tools such as Astropy are resorting to what David Donoho describes as ‘advertising’ by publishing a paper about their software, presumably to collect some citations. With 210 stars (bookmarks) and over 150 forks, Astropy is a seriously popular GitHub project and yet has fewer than 20 citations.

Now more than ever though feels like a time of change. I always thought it was rather remarkable how BP managed to redraw their logo to look like a flower and start using ‘Beyond Petroleum’. BP presumably saw a future where energy doesn’t come primarily from fossil fuels and started to change their public persona and business model to fit. Digital Science feels like Macmillan’s bet on the future, one where their current business model fades away. Alan Kay is quoted way too much already but what the hell:

“The best way to predict the future is to invent it.” - Alan Kay

One of the leaders in the field of altmetrics, Altmetric.com, is funded at least in part by Digital Science. In a time when what constitutes publishing is undergoing massive change, what better way to protect that billion dollar revenue stream than to own the innovations that curate this mass of information in the future?

» So what now?

If I were an aspiring researcher today working in a semi-technical domain then I think I’d be hedging my bets. It’s definitely important to keep writing papers and building a ‘traditional’ career profile but it’s also important to realise that a move towards a culture of reuse - where researchers are sharing more routinely - is one that can be good for both the individual and the community at large. Take a look at the contributions to Dan FM’s emcee as a great example of the community iterating, improving and expanding on the original functionality of his work.

Open source projects understand this mutually beneficial relationship; even companies like AT&T fund developers to work on open source technologies their business relies upon. ‘Open’ has won in the software world - with governments and federal agencies mandating open access and open data policies, how long before we’re saying the same about research?

Code, data, manuscripts, teaching resources - these are just some of the day-to-day products of our time as academics, and by sharing them more routinely we can begin to change the cultural norms in our respective fields. Do me a favour - next time you’re writing a paper, try posting the code and data up somewhere online3; you can even mint DOIs for the stuff you share. I’d put money on the fact that someone in the next few years is going to figure this altmetrics stuff out and those folks that have been sharing for a while are going to be the ones that reap the rewards.

1. As you might be able to tell, I never observed at Paranal.
2. There is of course the wonderful work of Software Carpentry these days.
3. When you share make sure you put a licence on it. It’s The Right Thing To Do™

How the Zooniverse Works: Keeping it Personal

This was originally posted on the Zooniverse blogs here.

This is the third post in a series about how, at a high level, the Zooniverse collection of citizen science projects works. In the first post I described the core domain model that we use – something that turns out to be a crucial part of facilitating conversation between scientists and developers. In the second I covered some of the core technologies that keep things running smoothly. In this and the next few posts I’m going to talk about parts of the Zooniverse that are subtle but important optimisations. Things such as how we pick which Subject to show to someone next, how we decide when a Subject is complete, and measuring the quality of a person’s Classifications.

Much of what I’m about to describe probably isn’t obvious to the casual observer but these are some of the pieces of the Zooniverse technical puzzle that as a team we’re most proud of and have taken many iterations over the past five years to get right. This post is about how we decide what to show to you next.

» A Quick Refresher

At its most basic, a Zooniverse citizen science project is simply a website that shows you some data - images, audio or plots - asks you to perform some kind of analysis or interpretation of it, and collects back what you said. As I described in my previous post we’ve abstracted most of the data-part of that workflow into an API called Ouroboros which handles functionality such as login, serving up Subjects and collecting back user-generated Classifications.

» Keeping it Fast

The ability for our infrastructure to scale quickly and predictably is a major technical requirement for us. We’ve been fortunate over the past few years to receive a fair bit of attention in the press which can result in tens or hundreds of thousands of people coming to our projects in a very short period of time. When you’re dealing with visitor numbers at that scale ideally you want everyone to have a pleasant experience.

Let’s think a little more about what absolutely has to happen when a person visits, for example, Galaxy Zoo.

  1. We need to show a login/signup form and send the information provided by the individual back to the server.
  2. Once registration/login is complete we need to serve back some personal information (such as a screen name).
  3. We need to pick some Subjects to show.

For many of the operations that happen in the Zooniverse, a record is written to a database somewhere. When trying to improve the performance of code that involves databases, a key strategy is to try and avoid querying those databases as much as possible, especially if the queries are complex and the databases are large, as these are often the slowest parts of your application.

What counts as ‘complex’ and ‘big’ in database terms varies based upon the types of records that you are storing, the choices you’ve made about how to index them and the resources you provide to the database server, i.e. how much RAM/CPU you have available.

» Keeping it personal

If there’s one place that complex queries are guaranteed to reside in a Zooniverse project codebase then it’s the part where we decide what to show to a particular person next. It’s complex, in need of optimisation and potentially slow for a number of reasons:

  1. When selecting a Subject we need to pick one that a particular User hasn’t seen before.
  2. Often Subjects are in Groups (such as a collection of records in Notes from Nature) and so these queries have to happen within a particular scope.
  3. We often want to prioritise a certain subset of the Subjects.
  4. These queries happen a lot, at least n * the total number of Subjects (where n is the number of repeat classifications each Subject receives).
  5. The list of Subjects we’re selecting from is often large (many millions).

On first inspection, writing code to achieve the requirements above might not seem that hard but if you add in the requirement that we’d like to be able to select Subjects hundreds of times per second for many thousands of Users then it starts to get tricky.

A ‘poor man’s’ version of this might look something like this:
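(The snippet originally embedded here isn’t reproduced in this repost, so below is a minimal Rails-flavoured sketch of the approach it described; the model and column names are assumptions.)

```ruby
# A naive Subject selector (illustrative only - model and column names are assumed).
# First grab the Subject ids from all of this User's existing classifications...
seen_subject_ids = Classification.where(user_id: user.id).pluck(:subject_id)

# ...then do a SQL select for the first Subject whose id isn't in that list.
next_subject = Subject.where('id NOT IN (?)', seen_subject_ids).first
```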

What we’re doing here is finding all the classifications for a given User and grabbing all of the Subject ids for them. Then we do a SQL select to grab the first record that doesn’t have an id matching one of the ones from existing classifications.

While this code is perfectly valid and would work OK for small-scale datasets there are a number of core issues with it:

  1. It’s pretty much guaranteed to get slower over time – as the number of classifications grows for a user, retrieving their recent classifications becomes a bigger and bigger query.
  2. It’s slow from the start – NOT IN queries are notoriously slow.
  3. It’s wasteful – every time we grab a new Subject for a User we essentially run the same query to grab the recent classification Subject ids.

These factors combined make for some serious potential performance issues if we want to execute code like this frequently, for large numbers of people and across large datasets all of which are requirements for the Zooniverse.

» A better way

It turns out that there are technologies out there designed to help with this sort of scenario. When we select the next Subject for a user there’s no reason why this operation has to actually happen in the database that the Subjects are stored in; instead we can keep ‘proxy’ records stored in lists or sets. That means that if we have a big list of ids of things that are available to be classified and a list of ids of things that each user has seen so far, then when we want to select a Subject for someone we just subtract those two things, pick randomly from the difference and pluck that record from the database.

In the diagram above, when Rob (in the middle) comes to one of our sites, we subtract from the big list of Subjects that still need classifying (in blue) the list of things that he’s already seen (in green) and then pick randomly from the resulting set. Going by this diagram it looks like we must keep a list of available Subjects for each project together with a separate list of Subjects per project per user so that we can do this subtraction - and that’s exactly the case. The database technology that we use to do this is called Redis and it’s designed for operations just like this.

» The result

Maturing our codebase to a point where the queries described above are straightforward has been a lot of work, mostly by this guy. What does it look like to actually require this kind of behaviour in code? Just two lines:
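(Again, the embedded snippet isn’t reproduced in this repost; a hypothetical flavour of those two lines, written against the redis gem with made-up key names, might be:)

```ruby
require 'redis'

redis = Redis.new

# Hypothetical sketch: subtract the set of Subject ids this User has already seen
# from the set of ids still needing classification, then pick one at random.
# The key names here are made up for illustration.
candidate_ids = redis.sdiff('galaxy_zoo:subject_ids', "galaxy_zoo:user:#{user.id}:seen_ids")
next_subject  = Subject.find(candidate_ids.sample)
```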

Not only is it now simple for us to implement this kind of Subject selection behaviour, using Redis to perform these selection operations means that everything is insanely quick, typically returning from Redis in ~30ms even for databases with many tens of thousands of Subjects to be classified.

Making the routinely hard stuff easier is a continual goal for the Zooniverse development team. That way we can focus maximum effort on the front-end experience and what’s different and hard about each new project we build.

How the Zooniverse Works: Tools and Technologies

This was originally posted on the Zooniverse blogs here.

In my last post I described at length the domain model that we use to describe conceptually what the Zooniverse does. That wouldn’t mean much without an implementation of that model and so in this post I’m going to describe some of the tools and technologies that we use to actually run our citizen science projects.

» The lifecycle of a Zooniverse project

Let’s think a little more about what happens when you visit a project such as Snapshot Serengeti. Ignoring all of the to-and-fro that your web browser does to work out where the domain name ‘snapshotserengeti.org’ points to, once it’s figured this and a few other details out you basically get sent a website that your browser renders for you. For the website to function as a Zooniverse project a few things are essential:

  1. You need to be able to view images (or listen to audio or watch a video) that we and the science team need your help analysing.
  2. You need to be able to log in with your Zooniverse account.
  3. We need to capture back what you said when doing the citizen science analysis task.
  4. Save your favourite images to your profile.
  5. View recent images you’ve seen in your profile.
  6. Discuss these images with the community.

It turns out that pretty much all of the functionality mentioned above is delivered for us by an application we call Ouroboros acting as an API layer, with a website (such as Snapshot Serengeti) talking to it.

» Ouroboros – or ‘why the simplest API that works is probably all you need’.

So what is Ouroboros? It provides an API (REST/JSON) that allows you to build a Zooniverse project that has all of the core components (1-6) listed above. Technology-wise it’s a custom Ruby on Rails application (Rails 3.2) that uses MongoDB to store data and Redis as a query cache all running on Amazon Web Services. It’s probably utterly useless to anyone but us but for our needs it’s just about perfect.

At the Zooniverse we’re optimised for a few different things. In no particular order of priority they are:

  1. Volume – we want to be able to build lots of projects.
  2. Science – we want it to be easy to do science with the efforts of our community.
  3. Scale/performance – we want to be able to have millions of people come to our projects and for them to stay up.
  4. Availability – we’d prefer our websites to be ‘up’ and not ‘down’.
  5. Cost – we want to keep costs at a manageable level.

Pretty much all of these requirements point to having a shared API (Ouroboros) that serves a large number of projects (I’ll argue #4 in the pub with anyone who really wants to push me on it).

Running a core API that serves many projects makes you take the maintenance and health of that application pretty seriously. Should Ouroboros throw a wobbly then we’d currently take out about 10 Zooniverse projects at once and this is only set to increase. This means we’ve thought a lot about how to scale the application for times when we’re busy and we also spend significant amounts of time monitoring the application performance and tuning code where necessary. I mentioned that cost is a factor – running a central API means that when the Zooniverse is quiet and there aren’t many people about we can scale back the number of servers we’re running (automagically on Amazon Web Services) to a minimal level.

We’ve not always built our projects this way. The original Galaxy Zoo (2007) was an ASP/web forms application, projects between Galaxy Zoo 2 and SETI Live were all separate web applications, many of them built using an application called The Juggernaut. Building standalone applications every time not only made it difficult to maintain our projects but we also found ourselves writing very similar (but subtly different) code many times between projects, code for things like choosing which Subject to show next.

Ouroboros is an evolution of our thinking about how to build projects, what’s important and generalisable and what isn’t. At its most basic it’s a really fast Subject allocator and Classification collector. Our realisation over the last few years was that the vast majority of what’s different about each project is the user experience and classification interface and this has nothing to do with the API.

» The actual projects

The point of having a central API is that when we want to build a new project we’re already working with a very familiar toolset – the way we log people in, do signup forms, ask for a Subject, send back Classifications – all of this is completely standard. In fact, if you’re building in JavaScript (which we almost always are these days) then there’s a client library called ‘Zooniverse’ (meta, I know) available here on GitHub.

Having a standard API and client library for talking to it meant that we built the Zooniverse project Planet Four in less than 1 week! That’s not to say it’s trivial to build projects, it’s definitely not, but it is getting easier. And having this standardised way of communicating with the core Zooniverse means that the bulk of the effort when building Planet Four was exactly where it should be – the fan drawing tools – the bit that’s different from any of our other projects.

So how do we actually build our projects these days? We build them as JavaScript web applications using frameworks such as Spine JS, Backbone or something completely custom. The point being that all of the logic for how the interface should behave is baked into the JavaScript application – Ouroboros doesn’t try and help with any of this stuff.

Currently the majority of our projects are hosted using the Amazon S3 static website hosting service. The benefits of this are numerous but key ones for us are:

  1. There’s no webserver serving the site content, that is http://www.galaxyzoo.org resolves to an S3 bucket. When you access the Galaxy Zoo site S3 does all of the hard work and we just pay for the bandwidth from S3 to your computer.
  2. Deploying is easy. When we want to put out a new version of any of our sites we just upload new timestamped versions of the files and your browser starts using them instead (see the sketch after this list).
  3. It’s S3! – Amazon S3 is a quite remarkable service – a significant fraction of the web is using it. Currently hosting more than 2 trillion (yes that’s 12 zeroes) objects and regularly serving more than 1 million requests for data per second the S3 service is built to scale and we get to use it (and so can you).
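To give a flavour of how lightweight a deploy can be, here’s a minimal sketch using the aws-sdk-s3 gem (the gem choice, bucket name and build directory are assumptions, not our actual deploy script):

```ruby
require 'aws-sdk-s3'

# Minimal static-site 'deploy': copy the freshly built files up to the S3 bucket
# that the site's domain points at. Bucket name and build directory are illustrative.
bucket = Aws::S3::Resource.new(region: 'us-east-1').bucket('www.galaxyzoo.org')

Dir.glob('build/**/*').reject { |path| File.directory?(path) }.each do |path|
  key = path.sub('build/', '')
  bucket.object(key).upload_file(path) # timestamped asset filenames mean browsers pick up new versions
end
```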

Amazon S3 is a static webhost (i.e. you can’t have any server-side code running), so how do we make a static website into a Zooniverse project you can log in to when we can’t access database records? The main site functions just fine – these JavaScript applications (such as the current Galaxy Zoo or any recent Zooniverse project) implement what is different about the project’s interface. We then use a small invisible iFrame on each website that actually points to api.zooniverse.org, which is Ouroboros. When you use a login form we actually set a cookie on this domain and then send all of our requests back to the API through this iFrame. This approach is a little unusual and with browsers tightening up the restrictions on third-party cookies it looks like we might need to swap it out for a different approach, but for now it’s working well.

» Summing up

If you’re like me then when you read something you read the opening, look at the pictures and then skip to the conclusions. I’ll summarise here just in case you’re doing that too:

In the Zooniverse there’s a clear separation between the API (Ouroboros) and the citizen science projects that the community interact with. Ouroboros is a custom-built, highly scalable application built in Ruby on Rails, that runs on Amazon Web Services and uses MongoDB, Redis and a few other fancy technologies to do its thing.

The actual citizen science projects that people interact with are these days all pure JavaScript applications that are hosted on Amazon S3 and they’re pretty much all open source. They’re generally still bespoke applications each time but share common code for talking to Ouroboros.

What I didn’t talk about in this post are the hardest bits we’ve solved in Ouroboros – namely all of the logic about how to make finding Subjects for people quickly and other ‘smart stuff’. That’s coming up next.

How the Zooniverse Works: The Domain Model

This was originally posted on the Zooniverse blogs here. I’m reposting as this blog sometimes feels like a personal diary for my time with the Zooniverse.

We talk a lot in the Zooniverse about research, whether it’s interesting stories from the community, a new paper based upon the combined efforts of the volunteers and the science teams or conferences we might be going to.

One thing we don’t spend much time talking about is the technology solutions we use to build the Zooniverse sites, the lessons we’ve learned as a team building more than twenty five citizen science projects over the past five years and where we think the technical challenges still remain in building out the Zooniverse into something bigger and better.

There’s a lot to write here so I’m going to break this into three separate blog posts. The first is going to be entirely about the domain model that we use to describe what we do. When it seems relevant I’ll talk a little more about the implementation details of these domain entities in our code too. The second will be about the technologies and infrastructure we run the Zooniverse on, and the third will be about making smarter systems.

» Why bother with a domain model?

Firstly it’s worth spending a little time talking about why we need a domain model. In my mind the primary reason for having a domain model is that it gives the team - whether it’s the developers, scientists, educators or designers working on a new project - a shared vocabulary to talk about the system we’re building together. It means that when I use the term ‘Classification’ everyone in the team understands that I’m talking about the thing we store in the database that represents a single analysis/interaction of a Zooniverse volunteer with a piece of data (such as a picture of a galaxy), which by the way we call a ‘Subject’.

Technology wise the Zooniverse is split into a core set of web services (or Application Programming Interface, API) that serve up data and collect it back (more about that later) and some web applications (the Zooniverse projects) that talk to these services. The domain model we use is almost entirely a description of the internals of the core Zooniverse API called Ouroboros and this is an application that is designed to support all of the Zooniverse projects which means that some of the terms we use might sound overly generic. That’s the point.

» The core entities

The domain model is actually pretty simple. We typically think most about the following entities:

User

People are core to the Zooniverse. When talking publicly about the Zooniverse I almost always use the term ‘citizen scientist’ or ‘volunteer’ because it feels like an appropriate term for someone who donates their time to one of our projects. When writing code however, the shortest descriptive term that makes sense is usually selected, so in our domain model the term we use is User.

A User is exactly what you’d expect: it’s a person, and it has a bunch of information associated with it such as a username, an email address, information about which projects they’ve helped with and a host of other bits and bobs. Crucially for us though, a User is the same regardless of which project they’re working on - that is, Users are pan-Zooniverse. Whether you’re classifying galaxies over at Galaxy Zoo or identifying animals on Snapshot Serengeti, we’re associating your efforts with the same User record each time, which turns out to be useful for a whole bunch of reasons (more later).

Subject

Just as people are core, so are the things that they’re analysing to help us do research. In Old Weather it’s a scanned image of a ship’s log book, in Planet Hunters it’s a light curve, but regardless of the project we internally call all of these things Subjects. A Subject is the thing that we present to a User when we want them to do something.

Subjects are one of the core entities that we want to behave differently in our system depending upon their particular flavour. A log book in Old Weather is only viewed three times before being retired whereas an image in Galaxy Zoo is shown more than 30 times before retiring. This means that for each project we have a specific Subject class (e.g. GalaxyZooSubject) that inherits its core functionality from a parent Subject class but then extends the functionality with the custom behaviour we need for a particular project.

Subjects are then stored in our database with a collection of extra information a particular Subject sub-class can use for each different project. For example in Galaxy Zoo we might store some metadata associated with the survey telescope that imaged the galaxy and in Cyclone Center we store information about the date, time and position the image was recorded.
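To make the inheritance idea a little more concrete, here’s a rough sketch of the shape of these classes (Mongoid-flavoured because Subjects live in MongoDB, but the field names and retirement logic are illustrative rather than the real Ouroboros code):

```ruby
# Illustrative only - not the actual Ouroboros models.
class Subject
  include Mongoid::Document

  field :location, type: String                           # where the image/audio/plot lives
  field :metadata, type: Hash                              # per-project extra information
  field :classification_count, type: Integer, default: 0

  # Default retirement policy; project-specific subclasses override this.
  def retire?
    classification_count >= retirement_limit
  end

  def retirement_limit
    10
  end
end

class GalaxyZooSubject < Subject
  # Galaxy Zoo images are shown more than 30 times before retiring.
  def retirement_limit
    30
  end
end

class OldWeatherSubject < Subject
  # Old Weather log book pages are only viewed three times before retiring.
  def retirement_limit
    3
  end
end
```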

Workflow/Task

These two entities are grouped together as they’re often used to mean broadly the same thing. When a User is presented with a Subject on one of our projects we ask them to do something. This something is called the Task. These Tasks can be grouped together into a Workflow, which is essentially just a grouping entity. To be honest we don’t use Workflow very much as most projects just have a single Workflow, but in theory it allows us to group a collection of Tasks into a logical unit. In Notes from Nature each step of the transcription (such as ‘What is the location?’) is a separate Task; in Galaxy Zoo, each step of the decision tree is a Task too.

Classification

It’s no accident that I’ve introduced these three entities, User, Subject and Task first as a combination of these is what we call a Classification. The Classification is the core unit of human effort produced by the Zooniverse community as it represents what a person saw and what they said about it. We collect a lot of these - across all of the Zooniverse projects to date we must be getting close to 500 million Classifications recorded.

I’ll talk more about what we store in a Classification in the next post about technologies; suffice to say for now that they store a full description of what the User said about the object. In previous versions of the Zooniverse API software we tried to break these records out into smaller units called Annotations but we don’t do that any more – it was an unnecessary generalisation.
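As a rough sketch, a single stored Classification might look something like this (the field and task names are made up for illustration; the real schema differs):

```ruby
# Illustrative only - the rough shape of a Classification document: who saw what,
# and what they said about it, kept together in one record rather than split
# into separate Annotation entities.
classification = {
  user_id:    user.id,
  subject_id: subject.id,
  project:    'galaxy_zoo',
  answers: [
    { task: 'smooth_or_features', answer: 'features' },
    { task: 'spiral_arms',        answer: 'yes' }
  ],
  created_at: Time.now.utc  # written once, read (much) later
}
```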

Group

Sometimes we need to group Subjects together for some higher level function. Perhaps it’s to represent a season’s worth of images in Snapshot Serengeti or a particular cell dye staining in Cell Slider. Whatever the reason for grouping, the entity we use to describe this is ‘Group’.

Grouping records is one of the most useful features Ouroboros offers, but it’s also one of the things for which it took us longest to find the right level of abstraction. While a Group can represent an astronomical survey in Galaxy Zoo (such as the Hubble CANDELS survey) or a Ship in Old Weather, it isn’t enough for a bunch of Subjects just to be associated with each other. There’s often a lot of functionality that goes along with a Group, or the Subjects within it, that is custom for each Zooniverse project. Ultimately we’ve solved this in a similar fashion to Subject - by having per-project subclasses of Group (i.e. there is a SerengetiGroup that inherits from Group) that can implement custom behaviour as required.

Project

Ouroboros (the Zooniverse API) hosts a whole bunch of different Zooniverse projects so it’s probably no surprise that we represent the actual citizen science project within our domain model. No prize for guessing the name of this entity - it’s called Project.

A Project is really just the overarching named entity that Subjects, Classifications and Groups are associated with. Project in Ouroboros does some other cool stuff for us though. It’s the Project that knows about the Groups, its current status (such as how many Subjects are complete) and other administrative functions. We also sometimes deliver a slightly different user experience to different Users in what are known as A/B splits - it’s the Project that manages these too.

» Finishing up.

So that’s about it. There are a few more entities routinely in discussion in the Zooniverse team such as Favourite (something a User favourites when they’re on one of our projects) but they’re really not core to the main operation of a project.

The domain description we’re using today is informed by everything we’ve learnt over the past five years of building projects. It’s also a consequence of how the Zooniverse has been functioning - we try lots of projects in lots of different research domains and so we need a domain model that’s flexible enough to support something like Notes from Nature, Planet Four and Snapshot Serengeti, but not so generic that we can’t build rich user experiences.

We’ve also realised that the vast majority of what’s different about each project is the user experience and classification interface. We’re always likely to want to put significant effort into developing the best user experience possible, so having an API that abstracts lots of the complexity away and allows us to focus on what’s different about each project is a big win.

Our domain model has also been heavily influenced by the patterns that have emerged working with science teams. In the early years we spent a lot of time abstracting out each step of the User interaction with a Subject into distinct descriptive entities called Annotations. While in theory these were a more ‘complete’ description of what a User did, the science teams rarely used them and almost never in realtime operations. The vast majority of Zooniverse projects to date collect large numbers of Classifications that are written once and read very much later. Realising this has allowed us to worry less about exactly what we’re storing at a given time and focus on storing data structures that are convenient for the scientists to work with.

Overall the Zooniverse domain model has been a big success. When designing for the Zooniverse we really were developing a new system unlike anything else we knew of. Its terminology is pervasive in the collaboration and makes conversations much more focussed and efficient, which can only be a good thing.