It's been nearly two years since we (GitHub) worked with the folks at Zenodo to develop an integration[1] that makes it easier to archive a software repository and issue a DOI. With a reasonably long timeframe now available, I thought it would be interesting to look at both the usage of the integration and, while we're at it, the licenses chosen by authors of research software.

Stating my assumptions: if a user has gone to the effort of following the GitHub Guide to deposit their software in Zenodo, then it's very likely to be research-focused software (or perhaps they have a perverse interest in DOIs).

How many repositories are being archived?

First, some raw stats on integration usage. At the time of writing, close to 5000 (4859 to be exact) unique repositories have configured the Zenodo integration:

[Figure: zenodo-configurations]

GitHub's license API uses Licensee to detect the license of a software repository. Of these repositories, ~63% have a detectable open source license.

+------------------------+---------------+-------------------+
| @zenodo_licensed_total | @zenodo_total | @license_fraction |
+------------------------+---------------+-------------------+
|                   3037 |          4859 |      62.502572500 |
+------------------------+---------------+-------------------+

On the face of it, ~63% of Zenodo-archived code doesn't sound very good, but it is significantly higher than the ~17% of public repositories on GitHub that have a detectable open source license[2].
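For readers who want to see what GitHub detects for a single repository, here is a rough sketch that queries the REST API's repository license endpoint. It is not how the numbers in this post were produced. The helper names and the trimmed-down sample payload are illustrative assumptions, and the sketch assumes that a 404 response means no license was detected.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def license_key_from_payload(payload: dict) -> Optional[str]:
    """Return the detected license key (e.g. 'mit'), or None if absent."""
    license_info = payload.get("license") or {}
    return license_info.get("key")


def fetch_license_key(owner: str, repo: str) -> Optional[str]:
    """Ask the GitHub REST API which license it detected for owner/repo."""
    url = f"https://api.github.com/repos/{owner}/{repo}/license"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return license_key_from_payload(json.load(response))
    except urllib.error.HTTPError as error:
        if error.code == 404:  # assumption: no license detected for this repo
            return None
        raise


# Hypothetical, trimmed-down payload used to exercise the parser offline.
sample_payload = {
    "license": {"key": "mit", "name": "MIT License", "spdx_id": "MIT"}
}
print(license_key_from_payload(sample_payload))
```

Calling `fetch_license_key("owner", "repo")` with a real owner and repository name performs the network request; unauthenticated requests are rate-limited, so anything beyond a handful of lookups would need a token.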

Having a license is great, but some people care a lot about which license authors have picked and what restrictions it places on users of the software.

+--------------+-------+---------+
| license      | count | percent |
+--------------+-------+---------+
| mit          |   863 | 28.4162 |
| gpl-3.0      |   504 | 16.5953 |
| unknown      |   442 | 14.5538 |
| gpl-2.0      |   336 | 11.0635 |
| apache-2.0   |   304 | 10.0099 |
| bsd-3-clause |   246 |  8.1001 |
| bsd-2-clause |    93 |  3.0622 |
| agpl-3.0     |    73 |  2.4037 |
| cc0          |    57 |  1.8769 |
| lgpl-3.0     |    44 |  1.4488 |
| lgpl-2.1     |    25 |  0.8232 |
| isc          |    20 |  0.6585 |
| unlicense    |    10 |  0.3293 |
| epl-1.0      |     8 |  0.2634 |
| artistic-2.0 |     6 |  0.1976 |
| mpl-2.0      |     5 |  0.1646 |
+--------------+-------+---------+

For many use cases, whether the license is 'permissive' matters[3].

-- Permissive: 'mit', 'apache-2.0', 'bsd-3-clause', 'bsd-2-clause', 'cc0', 'isc', 'unlicense', 'epl-1.0',
--             'mpl-2.0', 'artistic-2.0'
-- Non-permissive: 'gpl-3.0', 'gpl-2.0', 'agpl-3.0', 'lgpl-3.0', 'lgpl-2.1'

+----------------------+--------------------------+
| @permissive_fraction | @non_permissive_fraction |
+----------------------+--------------------------+
|         53.078696000 |             32.334540600 |
+----------------------+--------------------------+
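The two fractions can be reproduced from the per-license counts. This is a minimal sketch, assuming the counts in the license table and the 3037 licensed total reported earlier. The variable and function names are illustrative and are not taken from the original query.

```python
# Per-license counts copied from the license table above.
LICENSE_COUNTS = {
    "mit": 863, "gpl-3.0": 504, "unknown": 442, "gpl-2.0": 336,
    "apache-2.0": 304, "bsd-3-clause": 246, "bsd-2-clause": 93,
    "agpl-3.0": 73, "cc0": 57, "lgpl-3.0": 44, "lgpl-2.1": 25,
    "isc": 20, "unlicense": 10, "epl-1.0": 8, "artistic-2.0": 6,
    "mpl-2.0": 5,
}
LICENSED_TOTAL = 3037  # @zenodo_licensed_total from the earlier table

PERMISSIVE = {
    "mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "cc0",
    "isc", "unlicense", "epl-1.0", "mpl-2.0", "artistic-2.0",
}
NON_PERMISSIVE = {"gpl-3.0", "gpl-2.0", "agpl-3.0", "lgpl-3.0", "lgpl-2.1"}


def share(keys, counts=LICENSE_COUNTS, total=LICENSED_TOTAL):
    """Percentage of licensed repositories whose license key is in `keys`."""
    return 100 * sum(counts.get(key, 0) for key in keys) / total


permissive_pct = share(PERMISSIVE)
non_permissive_pct = share(NON_PERMISSIVE)
print(f"permissive: {permissive_pct:.2f}%")          # permissive: 53.08%
print(f"non-permissive: {non_permissive_pct:.2f}%")  # non-permissive: 32.33%
```

The two groups do not add up to 100% because repositories whose license was reported as 'unknown' (about 14.6% in the table) belong to neither group.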

Wrapping up

Remember, public code isn't open source without a proper license. Without an open source license that tells others how they can modify and reuse your work, you've only shown others your code; you haven't shared it. Next time you're releasing code (or using someone else's), make sure there's a license. If there isn't, try opening a pull request asking the authors to add one.


  1. https://guides.github.com/activities/citable-code/

  2. If you only take into account repositories with two or more collaborators this number is much higher.

  3. Channeling my inner VanderPlas