JSON-LD for software discovery, reuse and credit

This is a continuation of some work I've been doing with the Mozilla Science Lab and their 'code as a research object' program. There are multiple aspects to this project, including work on code and GUI prototypes, discussions around best practices for making code reusable, and software citation. This post explores some ideas around linked data and machine-readable descriptions of software repositories, with the goal of making software more discoverable and therefore increasing reuse.


JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

 { "name" : "Arfon" } 

when there's an entity called name you know that it means the name of a person and not a place.
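As a rough sketch of what that added context looks like, the snippet below takes the record above and attaches an @context and an @type before printing it as JSON. Using schema.org's Person type here is an assumption made for illustration.

```python
import json

# A plain JSON record: nothing says what "name" refers to.
plain = {"name": "Arfon"}

# The same record with JSON-LD context added. Choosing schema.org's
# Person type is an assumption made for this illustration.
person = {
    "@context": "http://schema.org",
    "@type": "Person",
    **plain,
}

print(json.dumps(person, indent=2))
```

With the context in place, a consumer that understands schema.org can tell that name is the name of a person.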

If you haven't heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it's a lightweight way of organising JSON-formatted data and giving it semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API. JSON-LD offers a way for API providers (and consumers) to share data more easily, with little or no ambiguity about what the data they're sharing actually means.

» So what about software?

Over the past few months there's been a lot of talk about finding ways for researchers to derive (more) credit for code. There are lots of issues at play here but one major factor is that a prerequisite to receiving credit for some piece of code you've written is that a peer needs to both be able to find your work and then reuse it.

The problem is, it can be pretty hard to find software unless there's a standard place to share tools in that language and the author of the code has chosen to publish there. Ruby has RubyGems.org, Python has PyPI, Perl has CPAN but where do I go if I'm looking to find an obscure library written in C++?

Discovering domain, language and function specific software is an even harder problem to crack. Sure, if I write Ruby I can head over to RubyGems to look for a Gem that might solve my problem, but I'm relying both on the author to write a descriptive README and on my ability to search with terms similar to the ones the author of the package used.

For many subjects where the common languages don't benefit from canonical package indexes and the function of the software is relatively niche, just finding code that might be useful is a problem.

» Towards a (machine readable) description of software

One way to address this discoverability problem is to find a standard way of describing software with context for the terms used. A design goal here should be that these files can be almost entirely automatically generated.

Inspired by the package.json format prescribed by the npm community, and using an ontology described on http://schema.org, below is a relatively short JSON-LD document that describes the Fidgit codebase. Let's call it code.jsonld for now.

» Minimal citable form

Note that the first two lines (@context and @type) define the context for the key/value pairs in the JSON structure, so that name means the name of the codebase. You can see the full ontology for Code here but this should mostly be straightforward to understand [1].

Once we get to the authors attribute we're now entering a new context, that of an individual. As we're still using the schema.org ontology for type Person we only need to set the @type attribute here.

There are a bunch more attributes that we could set here but this feels like a minimal set of information that is sufficient for citation (and therefore credit and attribution for the author).
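A document along the lines described in this section might look something like the sketch below, built as a Python dictionary and serialised to JSON. This is not the actual Fidgit file: every value is a placeholder, the helper name minimal_code_jsonld is hypothetical, and the choice of keys (codeRepository, description and the schema.org spelling author for the authors attribute) is an assumption about what a minimal set could contain.

```python
import json


def minimal_code_jsonld(name, repository_url, description, author_name):
    """Build a minimal code.jsonld-style record (hypothetical helper)."""
    return {
        # These two keys set the context for everything that follows.
        "@context": "http://schema.org",
        "@type": "Code",
        "name": name,
        "codeRepository": repository_url,
        "description": description,
        # Nested entity: same ontology, so only @type is needed here.
        "author": {"@type": "Person", "name": author_name},
    }


# All values below are placeholders, not Fidgit's real metadata.
doc = minimal_code_jsonld(
    name="Fidgit",
    repository_url="https://example.org/placeholder/fidgit",
    description="Placeholder description of the codebase",
    author_name="Placeholder Author",
)

print(json.dumps(doc, indent=2))
```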

» For data archivers

This next example is a slightly modified version of the minimal one. It includes multiple authors [2] and now also has the keywords that archives like figshare and Zenodo require. (Note these keywords should probably be more explicitly structured rather than relying on comma-delimited strings.)
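As a sketch of what this variant could look like, the snippet below holds several authors as a list and the keywords as a single comma-delimited string, then shows one possible way of turning that string into a structured list. The author names and keywords are placeholders, split_keywords is a hypothetical helper, and whether a list of authors is valid here is exactly the open question raised in note 2.

```python
import json

# Placeholder values throughout; not the real Fidgit metadata.
doc = {
    "@context": "http://schema.org",
    "@type": "Code",
    "name": "Fidgit",
    "author": [
        {"@type": "Person", "name": "Placeholder Author One"},
        {"@type": "Person", "name": "Placeholder Author Two"},
    ],
    "keywords": "placeholder keyword one, placeholder keyword two",
}


def split_keywords(value):
    """Turn a comma-delimited keyword string into a list of trimmed terms."""
    return [term.strip() for term in value.split(",") if term.strip()]


# A copy of the document with keywords stored as an explicit list.
structured = dict(doc, keywords=split_keywords(doc["keywords"]))

print(json.dumps(structured, indent=2))
```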

» For discovery?

I started by describing the problem of software discovery and how domain, function and language specific searches for tools are hard. So far these JSON-LD snippets don't really help with this problem, as we still only have keywords and a description for describing the software's function and domain.

The schema.org Code ontology includes a programmingLanguage attribute, which takes care of language-specific searches. At GitHub we're pretty good at detecting this automatically with Linguist, so it's not even clear that the author of a piece of software would need to specify this manually (a win).

The challenge when designing a more 'complete' code.jsonld document is that it's seemingly rather tough to automate a description of what subject domain the software has been designed for and what the software does.

PLOS ONE has a pretty decent subject taxonomy that I've extracted into a machine readable form here and so it's possible something along these lines could be used to assign a subject domain. Thus far, I've been unable to find a good schema for describing academic subjects (or any subject domains). Going deeper and attempting to describe the function of software is proving challenging too.

» Feedback please!

At this point I'd love some feedback on these ideas. The goal here is to promote software discovery and reuse, so framing this in what's possible today is a good place to start reflecting on these ideas. Right now it's possible to do a pretty advanced search for code on GitHub with facets for programming language, file extension, creation date, username and more. Imagine if you could do the same but add in subject area and software function?
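To make the idea concrete, here is a minimal sketch of a faceted search over an in-memory index of code.jsonld-style records. The records are invented, the search function is hypothetical, and subjectArea in particular is a made-up key: as noted above, I haven't found a good schema for subject domains.

```python
def search(index, **facets):
    """Return the records whose fields equal every requested facet value."""
    return [
        record
        for record in index
        if all(record.get(key) == value for key, value in facets.items())
    ]


# Invented records; "subjectArea" is a hypothetical key, not a schema.org term.
index = [
    {"name": "tool-a", "programmingLanguage": "C++", "subjectArea": "Astronomy"},
    {"name": "tool-b", "programmingLanguage": "Python", "subjectArea": "Astronomy"},
    {"name": "tool-c", "programmingLanguage": "C++", "subjectArea": "Ecology"},
]

matches = search(index, programmingLanguage="C++", subjectArea="Astronomy")
print([record["name"] for record in matches])  # prints ['tool-a']
```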

One major pitfall with this idea is that in order for an index of code.jsonld files to be useful people have to start making them: a classic chicken-and-egg problem. All is not lost though. Pretty much all of the minimal code.jsonld file can be auto-generated and perhaps submitted to authors as a pull request by a friendly robot?
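A friendly robot's first step could be as simple as the sketch below, which maps repository metadata onto a minimal record. The repo dictionary and its field names (name, url, owner, description, language) are hypothetical and do not correspond to any particular API response, and code_jsonld_from_repo is an illustrative helper rather than an existing tool.

```python
import json


def code_jsonld_from_repo(repo):
    """Map repository metadata onto a minimal code.jsonld-style record.

    `repo` is a hypothetical dictionary, not any real API's response.
    """
    doc = {
        "@context": "http://schema.org",
        "@type": "Code",
        "name": repo["name"],
        "codeRepository": repo["url"],
        "author": {"@type": "Person", "name": repo["owner"]},
    }
    # Optional fields are copied only when the metadata has a value.
    if repo.get("description"):
        doc["description"] = repo["description"]
    if repo.get("language"):
        doc["programmingLanguage"] = repo["language"]
    return doc


# Placeholder metadata, e.g. as gathered by an automated crawler.
repo = {
    "name": "example-tool",
    "url": "https://example.org/placeholder/example-tool",
    "owner": "Placeholder Author",
    "description": "",
    "language": "C++",
}

generated = code_jsonld_from_repo(repo)
print(json.dumps(generated, indent=2))
```

The printed text is what would be saved as code.jsonld and proposed to the author for review.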

One of the biggest barriers to reusing research software is finding the damn stuff in the first place - does this help?

» Links

1 Note the Code ontology on schema.org does now include a license attribute. When this post was first written it didn't, which seemed like an oversight.
2 It's not clear that this is allowed!