This post was originally published in German on the TIB Blog (March 21, 2018).
Findability Through Persistent Identifiers
In addition to implementing the FAIR Data Principles for research data there is a need to support the sustainable development, use, and publication of research software. This blog post represents an interim status of the development of these competences.
It also introduces a series of articles which, like the earlier post on the FAIR Data Principles for research data (Kraft 2017), will recommend concrete examples and actions to scientific software projects, summarize the specialist literature (see the bibliography), and set goals for our own software projects and those we support.
In the various thematic sections some of the recommendations are addressed separately to scientists and their institutions. In either case they are ranked roughly in the following order:
- low-hanging fruit,
- next achievable step,
- solid 80% solution, and
- really good implementation.
These labels are meant to convey how demanding each recommendation is to realise. Of course, the recommendations should also be understood as a basis for discussion, which is why we are very happy to receive comments, questions, suggestions, criticism, and reports of experience.
We’ll use the FAIR Data Principles for orientation purposes.
For this article we’ll focus on the first of the FAIR principles:
F – Findable
As with research data, software should be easy to find for both people and computer systems. Basic machine-readable descriptive metadata enable the discovery of interesting software programs and services. For this first principle, only minor distinctions (if any) need to be drawn between finding datasets and finding software.
F1. Software projects receive a globally unique and persistent identifier
What does that mean?
Each software project should be represented on the Internet under a globally unique and persistent identifier (PID). This can be, for example, a Uniform Resource Locator (URL), or a Digital Object Identifier (DOI), so that software and metadata can be found and referred to. Principle F1 can be classified as the most important, as it is difficult to comply with the other FAIR principles without “globally unique and persistent identifiers”.
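To illustrate how a DOI doubles as a globally resolvable address, here is a minimal Python sketch that turns a DOI, however it was pasted, into its `https://doi.org/` resolver URL. The function name `doi_to_url`, the loose validation pattern and the list of accepted prefixes are assumptions made for this example, not part of any official DOI tooling.

```python
import re
from urllib.parse import quote

# Loose pattern for a DOI name: "10.", a registrant code, "/", and a suffix.
# This is an assumption for illustration, not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that people commonly paste along with the DOI itself.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(raw: str) -> str:
    """Return the https://doi.org/ resolver URL for a DOI string."""
    candidate = raw.strip()
    lowered = candidate.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix but keep the DOI's original spelling.
            candidate = candidate[len(prefix):]
            break
    if not DOI_PATTERN.match(candidate):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    # Percent-encode unusual characters, but keep the "/" separator.
    return "https://doi.org/" + quote(candidate, safe="/")


if __name__ == "__main__":
    # The DOI of this article, as given in its citation.
    print(doi_to_url("doi:10.25815/D6C3-MD17"))
```

The resulting URL stays valid even if the landing page behind it moves, which is what distinguishes it from an ordinary project URL.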
FAIR Software — The scientist’s role
- If a software project does not already have its own website, a rudimentary one can be created by means of (an account and) a repository on a code hosting platform. Since Git is the most widely used source code management program, its use in combination with GitHub as the most popular code hosting platform is advised. This already creates a globally unique URL under which the software is found.
- Since a GitHub repo (short for “repository”) can officially be integrated with Zenodo, a DOI and a landing page can be generated, as favored by the Software Citation Principles (Smith 2016). This means that the project can be cited globally and persistently. As a bonus, Zenodo also automatically grabs a backup copy, and does so again for each subsequent release version. The Open Science Framework and the project “Code as a Research Object” by the Mozilla Science Lab offer similar functionality. The latter primarily uses FigShare, about whose long-term archiving one may have reservations: even if a record has been deleted, its metadata should still be displayed on a so-called “tombstone” page. Unfortunately, FigShare doesn’t do this at the moment and leaves the data grave unmarked, so to speak. Besides Zenodo’s auto-backup option, Git repos can also be downloaded manually and submitted to generic research data repositories such as RADAR, where they are archived and become citable.
- DOIs should only be assigned to objects that have found a long-term digital home. To identify drafts, pre-releases, etc., an Archival Resource Key (ARK) can be used instead. Here is the list of ARK-issuing institutions.
- Step 2 might even be omitted once SoftwareHeritage.org is broadly available! This “ambitious initiative for the collection, preservation and joint management of the entire corpus of publicly accessible software source code” (Di Cosmo 2017) might make citing software as easy as possible (Katz 2017). Software Heritage automatically creates persistent IDs (PIDs), so that a source code repo becomes citable as soon as it is published.
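To give an impression of what such automatically created identifiers can look like, the following Python sketch computes a content-level identifier in the style of the Software Heritage persistent identifier scheme (`swh:1:cnt:` followed by the same SHA-1 that Git assigns to a file's contents). The article itself does not describe the scheme, so treat this as an illustration under that assumption. The function names are hypothetical, and identifiers for directories, revisions or whole repository snapshots are built differently and are not covered here.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash file contents the way Git does for a blob object."""
    # Git prepends a header "blob <size in bytes>\0" before hashing.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def swhid_for_content(content: bytes) -> str:
    """Build a content identifier of the form swh:1:cnt:<hash>.

    Assumption: version 1 of the identifier scheme, object type "cnt".
    """
    return "swh:1:cnt:" + git_blob_sha1(content)


if __name__ == "__main__":
    # Any two files with identical bytes receive the same identifier.
    print(swhid_for_content(b"test content\n"))
```

Because the identifier is derived from the bytes themselves, anyone holding a copy of the file can recompute it and check that the copy is the one that was cited.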
FAIR Software — The institution’s role
- Scientific institutions should provide their members with alternatives (such as git.TIB.eu) to the popular, centralized platforms for modern source code hosting, and guarantee their long-term availability.
- To make such decentralized source code platforms more attractive, institutions should tick all the functional completeness checkboxes (Pages, CI, etc.) and ensure that the system is kept up to date. They should also ensure public accessibility (as, for example, at the University College Hamburg or in the Department of Computer Science at the University of Bremen), so that the project URLs can be used as public identifiers. Thankfully, GitLab supports persistence: if a user, group or repo name is changed, almost all necessary URL redirections are set up automatically. Although such URL redirects don’t attain the same level of persistence as URNs or PURLs, redirected project URLs also play their part in ensuring the findability of digital objects by people and search engine crawlers.
- Similar to the Zenodo DOI-ification of GitHub repos, institutional platforms should also offer DOI registration services, preferably by sponsoring corresponding Free Software modules.
- Institutions that run their own code hosting should prepare its integration with SoftwareHeritage.org in order to simplify citation of the works of their members.
- Even if an institution’s software projects are scattered across different hosting platforms, there is a remedy. A tool like DOE Code (source code) can provide developers with a central archive and a single place to search.
Thanks for reading! This was some introductory advice on how to make research software projects more sustainable, and especially improve their discoverability by means of persistent identifiers (PIDs). As mentioned, we welcome your comments, questions, suggestions, criticism, experience reports etc. Soon this series will continue with other concrete advice for FAIRer scientific software.
Di Cosmo, Roberto, and Stefano Zacchiroli. 2017. “Software Heritage: Why and How to Preserve Software’s Source Code” In iPRES 2017: 14th International Conference on Digital Preservation. Kyoto, Japan. https://hal.archives-ouvertes.fr/hal-01590958/document
Katz, Daniel S. 2017. “Software Heritage and Repository Metadata: A Software Citation Solution” Daniel S. Katz’s Blog. September 25, 2017. https://danielskatzblog.wordpress.com/2017/09/25/software-heritage-and-repository-metadata-a-software-citation-solution/
Kraft, Angelina. 2017. “The FAIR Data Principles For Research Data” TIB Blog. September 12, 2017. https://blogs.tib.eu/wp/tib/2017/09/12/the-fair-data-principles-for-research-data/
Smith, Arfon M., Daniel S. Katz, and Kyle E. Niemeyer. 2016. “Software Citation Principles” PeerJ Computer Science 2 (September): e86. https://doi.org/10.7717/peerj-cs.86
Video (supplementary material)
Video from The Leibniz “Mathematical Modeling and Simulation” (MMS) Days 2018 Leipzig, Feb / Mar 2018
Slides as PDF
Video DOI: https://doi.org/10.5446/35351 from TIB AV Portal
Leinweber, Katrin. ‘Killed By A Thousand Paper Cuts? A Newcomer’s Perspective On Possibilities And Gaps In Software Citation Workflows’. Weierstraß-Institut für Angewandte Analysis und Stochastik (WIAS), Technische Informationsbibliothek (TIB), 2018. https://doi.org/10.5446/35351.
Citation format: The Chicago Manual of Style, 17th Edition
Leinweber, Katrin. ‘Concrete Advice for FAIR Software’, 2018. https://doi.org/10.25815/D6C3-MD17