Scientific Software and the Open Collaborative Web

I gave a talk back in November at the WSSSPE workshop as part of Supercomputing 2013. The slides are available here but are likely of limited value without notes. Encouraged by @juretriglav I'm going to try and summarise my talk here.

It's also important to point out here that much of this really is a bunch of recycled ideas originally written up by people far more eloquent than me. I've included links inline where possible.

» Some Background

My research background is in astrochemistry which is an area of research that focuses on the chemistry of astrophysical environments. My particular interest was in studying the environments between stars - the interstellar medium - and the primary method used to interrogate these large clouds of dust and gas is spectroscopy.

When I used to meet people and tell them what I did for a living, I suspect most people had ideas about this wonderful life I was leading, jetsetting around the world travelling to exotic locations to use these awesome telescopes. Don't get me wrong, being a researcher is absolutely a privilege but it's definitely not as sexy as you might think. The reality of observing is long nights trying to stay awake in a weird state of jetlag and sleep deprivation. Throw in a bunch of caffeine, marginal cuisine, an uncomfortable bed [1], the worry of messing up your observations and it's an altogether pretty stressful experience. And that's just the beginning - once you've collected your data the next step is to actually do some science...

» Data Reduction - AKA getting to know your CCD

The goal of data reduction is basically to remove the effect of the instrument (telescope, spectrograph, observing conditions etc.) from the data you've collected. Doing this properly means that results from different researchers using different facilities can then be compared. Optical spectroscopy works by pointing your telescope at a star or galaxy, collecting some light from the object, splitting that light up by its wavelength and recording the intensity of that light on a detector called a charge-coupled device (CCD). This CCD is similar to the one in your digital camera but is typically bigger and much more expensive, and in order to perform a proper data reduction you need to get to know this CCD pretty damn well.

CCDs are composed of pixels and sometimes these pixels go a little crazy and need to be excluded from the dataset. Removing these pixels requires you to create something called a bad pixel mask which is essentially just a plain text file composed of x,y coordinates describing regions of a CCD that should not be included in the reduced dataset.
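To make the idea concrete, here is a minimal sketch of reading and applying such a mask. The file layout used here (one rectangular region per line as `x1 x2 y1 y2`, 1-indexed and inclusive, with `#` comments) is an assumption for illustration only; real reduction packages each define their own format, and the function names are hypothetical.

```python
def parse_bad_pixel_mask(text):
    """Parse mask lines of the assumed form 'x1 x2 y1 y2' (1-indexed, inclusive)."""
    regions = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue  # skip blank or comment-only lines
        x1, x2, y1, y2 = (int(value) for value in line.split())
        regions.append((x1, x2, y1, y2))
    return regions


def apply_mask(image, regions):
    """Return a copy of image (a list of rows) with bad pixels replaced by None."""
    masked = [list(row) for row in image]
    for x1, x2, y1, y2 in regions:
        for y in range(y1 - 1, y2):      # convert 1-indexed rows to 0-indexed
            for x in range(x1 - 1, x2):  # convert 1-indexed columns to 0-indexed
                masked[y][x] = None
    return masked


# Hypothetical mask: one bad column and one isolated hot pixel.
mask_text = """
# x1 x2 y1 y2
3 3 1 4
1 1 2 2
"""

image = [[10, 11, 12, 13] for _ in range(4)]  # toy 4x4 'CCD' frame
regions = parse_bad_pixel_mask(mask_text)
masked = apply_mask(image, regions)
```

Because the mask is just a small text file, it is exactly the kind of artefact that could be kept under version control and shared between observers.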

Creating this bad pixel mask is a pretty mind-numbing task: you basically open up an image viewer and look at how the CCD responds as you manipulate the dynamic range of the pixel values. Some bad pixels are very obvious, others less so. Depending on how experienced you are at this, and how old and how big the CCD is, this task might take about two days to complete. Now I'd love to tell you that when you leave the observatory you're sent away with a bunch of resources to help you with your data reduction including an up-to-date bad pixel mask file for your CCD but in my day this was far from the case. The reality is that most people end up repeating this task and as a result a vast amount of time is wasted. How much time? Well let's try and work it out...

Assume it takes on average two days for a researcher to produce their bad pixel mask and that your average telescope is churning through three observing runs per week. If the CCD in use lasts for ~15 years then collectively more than 4,680 days of human effort is used producing slightly different versions of the same file. If ever there were a case for version control then this has to be it.
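The back-of-the-envelope figure can be reproduced in a few lines. The two days, three runs per week and 15 years come from the text; the 52 weeks per year is my assumption about how the number was reached, and the names are purely illustrative.

```python
DAYS_PER_MASK = 2        # days to build one bad pixel mask (from the text)
RUNS_PER_WEEK = 3        # observing runs per telescope per week (from the text)
WEEKS_PER_YEAR = 52      # assumption: whole weeks only, no downtime
CCD_LIFETIME_YEARS = 15  # approximate CCD lifetime (from the text)


def duplicated_effort_days(days_per_mask, runs_per_week, weeks_per_year, years):
    """Total person-days spent if every observing run rebuilds the mask."""
    total_runs = runs_per_week * weeks_per_year * years
    return days_per_mask * total_runs


total_days = duplicated_effort_days(
    DAYS_PER_MASK, RUNS_PER_WEEK, WEEKS_PER_YEAR, CCD_LIFETIME_YEARS
)
print(total_days)  # 4680
```

Changing any of the inputs (say, one run per week instead of three) still leaves a number in the hundreds or thousands of days, which is the point.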

This example is deliberately extreme and you might take issue with some of my assumptions here (data reduction pipelines have got much better since I was a student). Regardless, the core point remains the same: as researchers we're taught to think mostly about the future results of our work rather than the tools and approaches we take to derive these results. Initially this is simply down to a lack of training - versioning and sharing our tools is not something that's taught to most aspiring researchers [2].

Later as careers progress this training barrier still exists but even if presented with the tools to share and collaborate more effectively it's not clear researchers would take advantage of this opportunity. The incentives for this kind of activity are all wrong: if the ultimate goal of a researcher is to be a professor then to secure a tenure-track position you need to publish a bunch of papers in the highest profile journals you can find with as few co-authors as possible. No-one on your tenure committee is going to ask for your GitHub username.

The sad thing is that initially we behave like this because we don't know any different; later we choose to behave in a way that limits the research efficiency of the community at large.

This is wasteful at any level but as research becomes ever more data- and compute-oriented an increasing fraction of the research method is being captured in these research products that we're not sharing. Victoria Stodden's talk at OKCon last year nailed it when she called out a 'credibility crisis'. I don't know for sure but I'd wager that as research domains become more compute-oriented the effort required to verify a result in the literature increases.

This behaviour of not sharing our tools isn't universal though. There are researchers out there publishing code but this is the exception rather than the rule. Until we work out ways to credit these kinds of activities then we can't expect this to be anything other than a rarity.

The travesty of this situation is that while the academy is struggling with how to be more research-efficient and collaborate more effectively there are whole communities of people working together seamlessly every day - these are the open source communities we see flourishing on GitHub. IPython lead Fernando Perez summarises the difference between how work is shared in academic and open source communities with the simplest of statements: open source communities only succeed because their work is 'reproducible by necessity'. Simply put, open source communities are better at collaborating because they have to be.

This Web we've all created over the last few decades has led to the creation of a bunch of incredible platforms that allow for seamless communication and sharing of work with others and yet the only tool I know of that's pretty much universally adopted in academia is video conferencing.

» The Open Collaborative Web

In his blog post, Marcio von Muhlen correctly (I think) identifies how on GitHub, significant contributions are recognised post-publication and that we should be reaching for something similar to this in academia. In environments where the publication and distribution of digital media is essentially free then prestige can come post-publication through things like usage metrics (downloads, forks or stars on GitHub) or inbound referrals from other environments, thus increasing a project's visibility to things like search engines. Altmetrics folks like those at Impactstory are working hard on giving researchers an easy way to measure their impact outside of purely journal-based content but the success of these efforts seems to depend on those in positions of power factoring them into their hiring decisions.

We need to be able to derive meaningful metrics from open contributions and for these to be valued by our peers and tenure committees.

» Towards Collaborative Versioned Science

von Muhlen also suggests that to effect significant change may require a 'nimble' funding agency to come along and offer incentives for a change in behaviour. Recently the Moore and Sloan foundations announced $37.8M funding for three data science programs at Berkeley, NYU and UW specifically aimed at achieving the following core goals:

It's hard not to get excited about this program as it touches on so many of the key issues necessary to move research towards the networked age. If a stimulus is required for significant change then this might just be it. Focussing funding on a domain such as data science is smart because it's a research area that's relatively new, is inherently interdisciplinary and relies heavily upon software and tooling to produce results. In the launch event for the program hosted by the White House OSTP, Ed Lazowska stated that academic environments of today 'do not reward tool builders'. This funding seems explicitly designed to develop a level of maturity in a research domain where sharing methods is at its absolute core.

» Change is coming...

Right now I think it's fair to say that in most domains it's still the products of research (i.e. papers) that are the most highly valued, not the methods. Even popular tools such as Astropy are resorting to what David Donoho describes as 'advertising' by publishing a paper about their software to presumably collect some citations. With 210 stars (bookmarks) and over 150 forks Astropy is a seriously popular GitHub project and yet has fewer than 20 citations.

Now more than ever though feels like a time of change. I always thought it was rather remarkable how BP managed to redraw their logo to look like a flower and start using 'Beyond Petroleum'. BP presumably saw a future where energy doesn't come primarily from fossil fuels and started to change their public persona and business model to fit. Digital Science feels like Macmillan's bet on the future, one where their current business model fades away. Alan Kay is quoted way too much already but what the hell:

"The best way to predict the future is to invent it." - Alan Kay

One of the leaders in the field of altmetrics is funded at least in part by Digital Science. In a time when what constitutes publishing is undergoing massive change, what better way to protect that billion-dollar revenue stream than to own the innovations that curate this mass of information in the future?

» So what now?

If I were an aspiring researcher today working in a semi-technical domain then I think I'd be hedging my bets. It's definitely important to keep writing papers and building a 'traditional' career profile but it's also important to realise that a move towards a culture of reuse - where researchers are sharing more routinely - is one that can be good for both the individual and the community at large. Take a look at the contributions to Dan FM's emcee as a great example of the community iterating, improving and expanding on the original functionality of his work.

Open source projects understand this mutually beneficial relationship; even companies like AT&T fund developers to work on open source technologies their business relies upon. 'Open' has won in the software world. With governments and federal agencies mandating open access and open data policies, how long before we're saying the same about research?

Code, data, manuscripts, teaching resources - these are just some of the day-to-day products of our time spent as academics, and by sharing them more routinely we can begin to change the cultural norms in our respective fields. Do me a favour - next time you're writing a paper, try posting the code and data up somewhere online [3]; you can even mint DOIs for the stuff you share. I'd put money on the fact that someone in the next few years is going to figure this altmetrics stuff out and those folks that have been sharing for a while are going to be the ones that reap the rewards.

1. As you might be able to tell, I never observed at Paranal.
2. There is of course the wonderful work of Software Carpentry these days.
3. When you share make sure you put a licence on it. It's The Right Thing To Do™