Software Preservation: A Stepping Stone for Software Citation

Image: The Software Heritage archive growth graph (view live at https://www.softwareheritage.org/larchive-software-heritage/)

In recent years software has become a legitimate product of research, gaining more attention from the scholarly ecosystem than ever before, and researchers increasingly feel the need to cite the software they use or produce. Unfortunately, there is no well-established best practice for doing this, and citations quite often use ephemeral URLs or other identifiers that offer little or no guarantee that the cited software can be found later on.

But for software to be findable, it must have been preserved in the first place: hence software preservation is actually a prerequisite of software citation.

Preservation: why is it important?

Software preservation is not a simple task. There are many use cases, and the complexity of software may call for a different solution for each of them. Various approaches to describing a software system exist; for example, the paper ‘A framework for software preservation’ (Matthews et al. 2010) proposes an identification schema with four elements:

  • Product,
  • Version,
  • Variant, and
  • Instance,

but up to now the focus has been mainly on archiving software executables. While an executable software artifact can be reused in certain circumstances (if the hardware and operating system for which it was built still exist, a ‘big if’ as time goes by), it is often stripped of all the human knowledge that source code may contain and is readable only by a machine. The executable is definitely important as a tool, but it can’t be interpreted, studied or modified. That’s why the preservation of the source code is crucial if we want to keep the technical, functional, and cultural knowledge that software may contain, especially when dealing with research software.

In the scholarly ecosystem, the quest to make scientific results reproducible and to pass knowledge on to future generations depends on the preservation of three main pillars: the scientific articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation (Di Cosmo and Zacchiroli 2017).

Image: The pillars of knowledge preservation

Software Heritage: preserving the source code

Software Heritage is an initiative aiming to collect, preserve and share all software source code, our software commons. The project was started in 2015 by Inria (the French Institute for Research in Computer Science and Automation) and has grown over time into a small and dedicated team led by Roberto Di Cosmo and Stefano Zacchiroli. As a non-profit organization, Software Heritage will provide an infrastructure capable of responding to multiple stakeholders in a variety of situations.

Behind the scenes, the engineers created a mechanism that actively crawls repositories (a task that we call listing) and collects everything new it finds (a task that we call loading). The Software Heritage archive is the largest software source code library to date and contains more than eighty-three million repositories as of 6 June 2018.
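
The listing and loading tasks, together with the visit policy the author describes in the comments below (each visit that finds nothing new doubles the delay before the next one, up to a limit), can be sketched roughly as follows. This is a toy illustration, not Software Heritage's scheduler: the Origin class, the function names, the one-day starting interval, the 64-day cap and the reset to one day after new content are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    """A listed repository URL with its (hypothetical) visit schedule."""
    url: str
    interval_days: int = 1  # assumed starting interval between visits


MAX_INTERVAL_DAYS = 64  # assumed cap; the real limit is not stated in the post


def list_origins(forge_index):
    """Listing: enumerate the repository URLs a forge exposes."""
    return [Origin(url) for url in forge_index]


def load(origin, has_new_content):
    """Loading: visit one origin and adapt the delay before the next visit."""
    if has_new_content:
        # New content would be archived here; assume we then visit again soon.
        origin.interval_days = 1
    else:
        # Nothing new: double the delay, up to the assumed cap.
        origin.interval_days = min(origin.interval_days * 2, MAX_INTERVAL_DAYS)
    return origin.interval_days


origins = list_origins(["https://example.org/repo-a"])
quiet = origins[0]
delays = [load(quiet, has_new_content=False) for _ in range(8)]
print(delays)  # [2, 4, 8, 16, 32, 64, 64, 64]
```

An inactive repository quickly drifts to the longest interval, which leaves more visits for repositories that are still changing.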

The current sources of the software are:

  • live and updated regularly: GitHub, Debian
  • one shot archival: Gitorious, Google Code and GNU
  • in progress: Bitbucket

This is a significant head start, but there is still a long way to go to achieve the monumental task of archiving all software source code from development forges, package managers, repositories, FOSS distributions and even single URLs not hosted on major forges, as is the case for the many researchers who publish their software on a personal page.

Software deposit: publish the source code and promote Open Science

Researchers in many domains use software for their work, and some create software to support their research. For data, there are many organizations, initiatives and working groups promoting the FAIR data principles, yet software is a new actor in the field, and making software discoverable and open source is still not the default. Furthermore, finding the correct metadata to cite software is even more difficult. The CodeMeta initiative (CodeMeta 2017) and the CITATION file format (Druskat 2017) are two projects providing a metadata schema to enable citation by including a metadata file inside the source code. Unfortunately, these metadata files are still scarce today.

To promote Open Science and Open Source Software, a new collaboration has emerged between Software Heritage, Hal-Inria (the open archive of Inria) and the CCSD (the Center for Direct Scientific Communication). It has resulted in a new type of scientific deposit in the French national open archive: researchers can now deposit software source code on Hal-Inria (Barborini et al. 2018). With this new possibility, research software is pushed with the submitted metadata to the Software Heritage archive, and a swh-id (an intrinsic identifier) is returned by Software Heritage. The swh-id is an intrinsic identifier because it is calculated from the content of the digital artifact, which means that if another organization calculates it with the same cryptographic hash (SHA1, BLAKE256, etc.), it will obtain the same identifier. We use it in our archive’s resolver with a specific semantic schema. The citation format on Hal-Inria, inspired by the software citation principles (Smith et al. 2016), includes the swh-id, which gives direct access to the archived software source code.
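
To make "intrinsic" concrete, here is a minimal sketch of how an identifier can be derived from nothing but the bytes of a file. It assumes the Git-compatible SHA1 scheme for file contents (a "blob <length>" header followed by a NUL byte and the content) and the "swh:1:cnt:" prefix; the function name is hypothetical, and the persistent identifiers documentation remains the authoritative definition.

```python
import hashlib


def content_swh_id(data: bytes) -> str:
    """Return an identifier derived only from the bytes of a file."""
    # Assumed scheme: Git-style blob hashing, i.e. a short header carrying
    # the content length, a NUL byte, then the content itself.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


source = b"print('hello')\n"
first = content_swh_id(source)
second = content_swh_id(bytes(source))  # an independent copy of the same bytes
print(first)
print(first == second)  # True
```

Because no registry or counter is involved, anyone holding the same bytes can recompute the identifier and check that the archived copy matches.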

The swh-id: a digital fingerprint for a software artifact

Source code is massively duplicated across projects and across forges. Hence, as explained in more detail in (Di Cosmo and Zacchiroli 2017), the data model used for the Software Heritage archive is a Merkle Directed Acyclic Graph (DAG) (Merkle 1987), commonly known as a hash tree. Using this structure, each object present in the Software Heritage archive is associated with an intrinsic identifier computed through cryptographic hashes.

Image: Software Heritage data model: a uniform Merkle DAG containing source code artifacts and their development history
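
A toy version of such a structure shows why it copes well with duplication. The sketch below is an illustration only: the Archive class and its hashing format are invented for the example and are not the archive's actual object model. The point is that a node's identifier is a hash of its own content plus the identifiers of its children, so identical files shared by two projects are stored once.

```python
import hashlib


class Archive:
    """Toy content-addressed store: each node is kept once, keyed by its hash."""

    def __init__(self):
        self.objects = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # Hypothetical format: node type, a NUL byte, then the payload.
        node_id = hashlib.sha1(kind.encode() + b"\x00" + payload).hexdigest()
        self.objects.setdefault(node_id, (kind, payload))
        return node_id

    def add_file(self, data: bytes) -> str:
        return self._put("file", data)

    def add_directory(self, entries: dict) -> str:
        # A directory's identifier depends on the names and identifiers of its
        # children, so any change below it changes the identifier above it.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode())


archive = Archive()
licence = archive.add_file(b"MIT License\n")
project_a = archive.add_directory(
    {"LICENSE": licence, "a.py": archive.add_file(b"x = 1\n")})
project_b = archive.add_directory(
    {"LICENSE": archive.add_file(b"MIT License\n"),
     "b.py": archive.add_file(b"y = 2\n")})
print(len(archive.objects))  # 5: three distinct files and two directories
```

The licence file appears in both projects but occupies a single entry in the store, and both directories point to it by the same identifier.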

These identifiers are guaranteed to remain stable over time, and are resolved with a persistent identifier schema directly on https://archive.softwareheritage.org/<swh-id>, described in the persistent identifiers documentation. To access the persistent identifiers, a permalinks box is available on a side tab; it provides identifiers for the current directory, revision (a commit on a particular branch) or snapshot (the complete set of branches in a version control system):

Image: Access Gensim source code in the archive https://archive.softwareheritage.org/swh:1:dir:774f7d3f4a99f9754e785e1335fe718a4234eba7;origin=https://github.com/RaRe-Technologies/gensim/
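
The identifier in the caption above can be taken apart with a few lines of code. The following is a rough sketch, assuming the layout visible in that example (the "swh" scheme, a version, an object type, a 40-character hexadecimal hash and optional "key=value" qualifiers after semicolons); the function names are hypothetical, and the persistent identifiers documentation is the reference for the exact grammar.

```python
from typing import NamedTuple

RESOLVER = "https://archive.softwareheritage.org/"
# Object type codes; the text above names directory, revision and snapshot.
OBJECT_TYPES = {"cnt": "content", "dir": "directory", "rev": "revision",
                "rel": "release", "snp": "snapshot"}


class SwhId(NamedTuple):
    version: str
    object_type: str
    object_hash: str
    qualifiers: dict


def parse_swh_id(text: str) -> SwhId:
    """Split an swh-id into its parts, raising ValueError if it looks wrong."""
    core, *extras = text.split(";")
    scheme, version, object_type, object_hash = core.split(":")
    if scheme != "swh" or object_type not in OBJECT_TYPES:
        raise ValueError(f"not a recognised swh-id: {text!r}")
    if len(object_hash) != 40 or any(c not in "0123456789abcdef"
                                     for c in object_hash):
        raise ValueError(f"expected a 40-character hex hash in {text!r}")
    qualifiers = dict(extra.split("=", 1) for extra in extras)
    return SwhId(version, object_type, object_hash, qualifiers)


def resolver_url(text: str) -> str:
    """Build the resolver link for a well-formed swh-id."""
    parse_swh_id(text)  # validate before building the link
    return RESOLVER + text


example = ("swh:1:dir:774f7d3f4a99f9754e785e1335fe718a4234eba7"
           ";origin=https://github.com/RaRe-Technologies/gensim/")
parsed = parse_swh_id(example)
print(OBJECT_TYPES[parsed.object_type], parsed.qualifiers["origin"])
print(resolver_url(example))
```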

The great advantage of the swh-ids is that they are already available for all the billions of software artifacts stored in the Software Heritage archive: yes, that means that you can reference all kinds of software, not just the few projects that have proper metadata attached!

The metadata challenge

Many challenges are still ahead. As noted in (Katz 2017), giving credit to the developers and finding the appropriate metadata for citation can’t be solved by Software Heritage alone. However, Software Heritage fulfills the need for software preservation, which is a stepping stone to proper software citation, and we are planning to extract the metadata included in the source code in AUTHORS, CONTRIBUTORS, README, LICENCE, codemeta.json, CITATION.cff and other metadata files. That’s why we urge all researchers and developers to include metadata files in their source code and to keep these files updated.
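
As a quick self-check, a developer can list which of these files a source tree already contains. The sketch below is only an illustration and not the extraction tooling mentioned above; the function name is hypothetical, and matching file names case-insensitively while ignoring extensions (so that README.md counts as README) is an assumption of the example.

```python
import os
import tempfile

# File names taken from the paragraph above; matching is case-insensitive and
# ignores extensions such as README.md (an assumption of this sketch).
METADATA_NAMES = {"authors", "contributors", "readme", "licence",
                  "codemeta.json", "citation.cff"}


def find_metadata_files(root: str) -> list:
    """Return sorted relative paths of files whose names look like metadata."""
    found = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            lowered = name.lower()
            stem = os.path.splitext(lowered)[0]
            if lowered in METADATA_NAMES or stem in METADATA_NAMES:
                found.append(os.path.relpath(os.path.join(folder, name), root))
    return sorted(found)


# Build a throwaway source tree to demonstrate the check.
with tempfile.TemporaryDirectory() as repo:
    for name in ("README.md", "codemeta.json", "main.py"):
        with open(os.path.join(repo, name), "w") as handle:
            handle.write("")
    found_files = find_metadata_files(repo)

print(found_files)  # ['README.md', 'codemeta.json']
```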

You are welcome to visit our online browsable archive at https://archive.softwareheritage.org/, where you can search through more than eighty million origin URLs, browse the contents, obtain a swh-id and download a directory or a revision. Enjoy!

References

B. Matthews, A. Shaon, J. Bicarregui, and C. Jones, “A framework for software preservation”, International Journal of Digital Curation, vol. 5, no. 1, pp. 91-105, 2010. doi:10.2218/ijdc.v5i1.145

Roberto Di Cosmo and Stefano Zacchiroli. 2017. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017.

Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Daniel S. Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. doi:10.5063/schema/codemeta-2.0

Druskat, Stephan. 2017. Citation File Format (CFF). Zenodo. doi:10.5281/zenodo.1003150

Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno Marmol, et al. The creation of a new type of scientific deposit: Software. RDA Eleventh Plenary Meeting, Berlin, Germany, Mar 2018. https://www.rd-alliance.org/rda-11th-plenary-poster-session (hal-01738741)

Smith, Arfon M., Katz, Daniel S., Niemeyer, Kyle E., & FORCE11 Software Citation Working Group. 2016. Software citation principles. PeerJ Computer Science 2, e86. https://doi.org/10.7717/peerj-cs.86

Ralph C. Merkle. 1987. A Digital Signature Based on a Conventional Encryption Function. In Advances in Cryptology – CRYPTO ’87, A Conference on the Theory and Applications of Cryptographic Techniques, Santa Barbara, California, USA, August 16-20, 1987, Proceedings (Lecture Notes in Computer Science), Carl Pomerance (Ed.), Vol. 293. Springer, 369–378. https://doi.org/10.1007/3-540-48184-2_32

Persistent identifiers documentation, Software Heritage. https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html

Daniel S. Katz. 2017. Software Heritage and repository metadata: a software citation solution. Daniel S. Katz’s blog. https://danielskatzblog.wordpress.com/2017/09/25/software-heritage-and-repository-metadata-a-software-citation-solution/


DOI: 10.25815/0ZBH-2W14

Citation format: The Chicago Manual of Style, 17th Edition

Gruenpeter, Morane. ‘Software Preservation: A Stepping Stone for Software Citation’, 2018. https://doi.org/10.25815/0ZBH-2W14.


Posted by Morane Gruenpeter

After several years as a professional harpist, Morane found a new career path in software engineering. Morane joined the Software Heritage team as an intern in 2017 while finishing a Master's degree in Computer Science at University Pierre et Marie Curie, Paris, France. After a successful internship, she continues her research on the software metadata challenge by building the Semantic Web of FOSS projects. She is also the metadata liaison on the Software Heritage and CROSSMINER partnership, enabling the application of the CROSSMINER analysis tools to the Software Heritage archive.

5 Replies to “Software Preservation: A Stepping Stone for Software Citation”

  1. Katrin Leinweber

    Dear Morane,

    I am interested in learning how often SH lists/ingests new content from, for example, GitHub. It doesn’t seem to be a continuous, automatic process, does it?

    Kind regards,

    Katrin

    1. Morane Gruenpeter July 4, 2018 at 9:56 am

      Dear Katrin,

      Content ingestion happens when new content has been observed during a scheduled visit. Since there are more than 83 million repositories, the time lapse between visits depends on the activity of the repository (each visit without any updates doubles the time until the next visit, up to a certain limit). There are many inactive repositories that don’t require as many visits as an active repository does.

      It is an automatic and continuous task, but there will be delays between a new commit on GitHub and its archival on SWH.

      I hope this answers your question. For further reading, here is a blog-post by Avi Kelman, describing how listers can be created and how they work: https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/.

      Don’t hesitate to ask for more details.

      Cheers,
      Morane

  2. Katrin Leinweber

    Hm, OK, thanks for the explanation.

    Maybe this turns into a bug report then: https://github.com/TIBHannover/ currently has 9 repos, but
    https://archive.softwareheritage.org/browse/search/ > “TIBHannover” finds only the 3 oldest ones.

    Or, how often does the _listing_ occur?

    1. Morane Gruenpeter

      Dear Katrin,

      The listing occurs all the time on a very large number of repositories. The policy I described above was about the ingestion part called loading where we capture new content for a listed origin.

      Concerning the TIBHannover repos, 3 have been listed, as you can see, but are awaiting the loading process; the others have not yet been listed, so our crawler hasn’t archived them yet, but it will eventually.

      We are a small team working hard to make the archive more complete every day. We are working to stream updates from ghtorrent to reduce the delay between the updates on the repository and the archive. Also we are working to enable a “save code now” feature which might be useful to your particular case. All the code is open source on: https://forge.softwareheritage.org/
      and we are happy to welcome contributors.

      Cheers,
      Morane
