Image: The Software Heritage archive growth graph

In recent years software has become a legitimate product of research, gaining more attention from the scholarly ecosystem than ever before, and researchers increasingly feel the need to cite the software they use or produce. Unfortunately, there is no well-established best practice for doing this, and citations quite often use ephemeral URLs or other identifiers that offer little or no guarantee that the cited software can be found later on.

But for software to be findable, it must have been preserved in the first place: hence software preservation is actually a prerequisite of software citation.

Preservation: why is it important?

Software preservation is not a simple task. There are many use cases, and the complexity of software may call for a different solution for each of them. There are various approaches to describing a software system; for example, the paper ‘A framework for software preservation’ (Matthews et al. 2010) proposes an identification schema with four elements:

  • Product,
  • Version,
  • Variant, and
  • Instance,

but up to now the focus has been mainly on archiving software executables only. While an executable software artifact can be reused in certain circumstances (if the hardware and operating system for which it was built still exist, a ‘big if’ as time goes by), it is often stripped of all the human knowledge that source code may contain and is readable only by a machine. The executable is definitely important as a tool, but it can’t be interpreted, studied or modified. That’s why the preservation of the source code is crucial if we want to keep the technical, functional, and cultural knowledge that software may contain, especially when dealing with research software.

In the scholarly ecosystem, the quest to make scientific results reproducible and to pass knowledge on to future generations depends on the preservation of three main pillars: the scientific articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation (Di Cosmo and Zacchiroli 2017).

Image: The pillars of knowledge preservation

Software Heritage: preserving the source code

Software Heritage is an initiative aiming to collect, preserve and share all software source code, our software commons. The project was started in 2015 by Inria (the French Institute for Research in Computer Science and Automation) and has grown over time into a small and dedicated team led by Roberto Di Cosmo and Stefano Zacchiroli. As a non-profit organization, Software Heritage will provide an infrastructure capable of responding to multiple stakeholders in a variety of situations.

Behind the scenes, the engineers created a mechanism that actively crawls repositories (a task that we call listing) and collects everything new it finds (a task that we call loading). The Software Heritage archive is the largest software source code library to date and contains more than eighty-three million repositories as of 6 June 2018.

The current sources of the software are:

  • live and updated regularly: GitHub, Debian
  • one shot archival: Gitorious, Google Code and GNU
  • in progress: Bitbucket

This is a significant head start, but there is still a long way to go to achieve the monumental task of archiving all software source code from development forges, package managers, repositories, FOSS distributions and even single URLs not hosted on major forges — which is the case with many researchers publishing their software on their personal page.

Software deposit: publish the source code and promote Open Science

Researchers in many domains are using software for their work and some are creating software to support their research. For data, there are many organizations, initiatives and working groups promoting the FAIR data principles, yet software is a new actor in the field and making software discoverable and open source is still not the default. Furthermore, finding the correct metadata to cite software is even more difficult. The CodeMeta initiative (CodeMeta 2017) and the CITATION file format (Druskat 2017) are two projects providing a metadata schema to enable citation by including a metadata file inside the source code. Unfortunately, these metadata files are really scarce today.

To promote Open Science and Open Source Software, a new collaboration has emerged between Software Heritage, Hal-Inria (the open archive of Inria) and the CCSD (The Center for Direct Scientific Communication). It has resulted in a new type of scientific deposit in the French national open archive: researchers can now deposit software source code on Hal-Inria (Barborini et al. 2018). With this new possibility, research software is pushed with the submitted metadata to the Software Heritage archive, and Software Heritage returns a swh-id. The swh-id is an intrinsic identifier because it is calculated from the content of the digital artifact: if another organization calculates it with the same cryptographic hash (SHA1, BLAKE256, etc.), it will obtain the same identifier. We use it in our archive’s resolver with a specific semantic schema. The citation format on Hal-Inria, inspired by the software citation principles (Smith et al. 2016), includes the swh-id, which gives direct access to the archived software source code.
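The property described above, that anyone hashing the same bytes obtains the same identifier, can be illustrated with a minimal Python sketch. It assumes SHA1 applied to the content behind a Git-style `blob` header; the function name `intrinsic_id` is hypothetical, and the actual swh-id scheme is the one defined in the persistent identifiers documentation, not this sketch.

```python
import hashlib


def intrinsic_id(content: bytes) -> str:
    """Return a hex digest that depends only on the bytes of `content`."""
    # Assumption: Git-style framing, i.e. a short header carrying the
    # length, followed by the raw bytes of the artifact.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two parties hashing the same bytes independently get the same identifier.
source = b"print('hello, world')\n"
id_at_archive = intrinsic_id(source)
id_elsewhere = intrinsic_id(bytes(source))
assert id_at_archive == id_elsewhere

# Changing a single byte yields a different identifier.
assert intrinsic_id(source + b" ") != id_at_archive
```

Because the identifier is derived from the artifact itself, no central registry is needed to assign it, and any holder of a copy can check that the copy matches the identifier.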

The swh-id: a digital fingerprint for a software artifact

Source code is massively duplicated across projects and across forges. Hence, as explained in more detail in (Di Cosmo and Zacchiroli 2017), the data model used for the Software Heritage archive is a Merkle Directed Acyclic Graph (DAG) (Merkle 1987), commonly known as a hash tree. Using this structure, each object present in the Software Heritage archive is associated with an intrinsic identifier computed through cryptographic hashes.

Image: Software Heritage data model: a uniform Merkle DAG containing source code artifacts and their development history
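To make the deduplication argument concrete, here is a toy Merkle DAG in Python. It is a sketch under stated assumptions, not the archive's implementation: the helper names (`content_id`, `directory_id`, `add_file`, `add_directory`), the in-memory `store` and the serialization of directory entries are all hypothetical. The point it shows is that a directory's identifier is computed from the identifiers of its children, so a file shared by two projects is stored only once.

```python
import hashlib
from typing import Dict


def content_id(data: bytes) -> str:
    """Identifier of a leaf node: a hash of the file bytes."""
    return hashlib.sha1(b"content\0" + data).hexdigest()


def directory_id(entries: Dict[str, str]) -> str:
    """Identifier of an inner node: a hash of its sorted (name, child id) pairs."""
    h = hashlib.sha1(b"directory\0")
    for name in sorted(entries):
        h.update(name.encode("utf-8") + b"\0" + entries[name].encode("ascii") + b"\n")
    return h.hexdigest()


# Toy object store keyed by identifier: identical objects are stored once.
store: Dict[str, object] = {}


def add_file(data: bytes) -> str:
    oid = content_id(data)
    store[oid] = data
    return oid


def add_directory(entries: Dict[str, str]) -> str:
    oid = directory_id(entries)
    store[oid] = dict(entries)
    return oid


license_text = b"MIT License\n"
# Two hypothetical projects on different forges ship the same LICENSE file.
project_a = add_directory({"LICENSE": add_file(license_text), "a.py": add_file(b"x = 1\n")})
project_b = add_directory({"LICENSE": add_file(license_text), "b.py": add_file(b"y = 2\n")})

# Four files were added, but only three distinct ones are stored,
# plus the two distinct directories.
assert len(store) == 5
assert project_a != project_b
```

Since any change to a file changes its identifier, it also changes the identifier of every directory above it, which is what lets a single root identifier stand for an entire tree.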

These identifiers are guaranteed to remain stable over time, and are resolved with a persistent identifier schema directly on<swh-id>, described in the persistent identifiers documentation. To access the persistent identifiers, a permalinks box is available on a side tab; it provides identifiers for the current directory, revision (a commit on a particular branch) or snapshot (the complete set of branches in a version control system):

Image: Access Gensim source code in the archive

The great advantage of the swh-ids is that they are already available for all the billions of software artifacts stored in the Software Heritage archive: yes, that means that you can reference all kinds of software, not just the few projects that have proper metadata attached!

The metadata challenge

Many challenges are still ahead. As noted in (Katz 2017), giving credit to the developers and finding the appropriate metadata for citation can’t be solved by Software Heritage alone. However, Software Heritage fulfills the need for software preservation, which is a stepping stone to proper software citation, and we are planning to extract the metadata included in the source code in AUTHORS, CONTRIBUTORS, README, LICENCE, codemeta.json, CITATION.cff and other metadata files. That’s why we urge all researchers and developers to include metadata files in their source code and to keep these files updated.
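As a rough illustration of the first step of such an extraction, the Python sketch below looks for the metadata files named above at the top level of a source tree. It is only a sketch: the function name `find_metadata_files` is hypothetical, matching on the file name without its extension (so that README.md is found) is an assumption, and so is accepting the LICENSE spelling alongside LICENCE.

```python
from pathlib import Path
from typing import List

# Names taken from the text. Matching ignores the extension for these.
# "LICENSE" is an added spelling variant (an assumption of this sketch).
METADATA_STEMS = {"AUTHORS", "CONTRIBUTORS", "README", "LICENCE", "LICENSE"}
# These two are matched on their exact file name.
METADATA_FILES = {"codemeta.json", "CITATION.cff"}


def find_metadata_files(root: Path) -> List[Path]:
    """Return the metadata files found directly under `root`, sorted by path."""
    found = []
    for path in root.iterdir():
        if not path.is_file():
            continue  # only top-level files are considered in this sketch
        if path.name in METADATA_FILES or path.stem.upper() in METADATA_STEMS:
            found.append(path)
    return sorted(found)
```

Calling `find_metadata_files(Path("."))` in the root of a checkout lists the candidate files; parsing their contents, which differs for each format, is left out here.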

You are welcome to visit our online browsable archive, where you can search through more than eighty million origin URLs, browse the contents, obtain a swh-id and download a directory or a revision. Enjoy!


B. Matthews, A. Shaon, J. Bicarregui, and C. Jones, “A framework for software preservation”, International Journal of Digital Curation, vol. 5, no. 1, pp. 91-105, 2010. doi:10.2218/ijdc.v5i1.145

Roberto Di Cosmo and Stefano Zacchiroli. 2017. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of the 14th International Conference on Digital Preservation, iPRES 2017.

Matthew B. Jones, Carl Boettiger, Abby Cabunoc Mayes, Arfon Smith, Peter Slaughter, Kyle Niemeyer, Yolanda Gil, Martin Fenner, Krzysztof Nowak, Mark Hahnel, Luke Coy, Alice Allen, Mercè Crosas, Ashley Sands, Neil Chue Hong, Patricia Cruse, Daniel S. Katz, Carole Goble. 2017. CodeMeta: an exchange schema for software metadata. Version 2.0. KNB Data Repository. doi:10.5063/schema/codemeta-2.0

Druskat, Stephan. 2017. Citation File Format (CFF). Zenodo. doi:10.5281/zenodo.1003150

Yannick Barborini, Roberto Di Cosmo, Antoine R. Dumont, Morane Gruenpeter, Bruno Marmol, et al. The creation of a new type of scientific deposit: Software. RDA Eleventh Plenary Meeting, Berlin, Germany, Mar 2018.

Smith, Arfon M., Katz, Daniel S., Niemeyer, Kyle E., & FORCE11 Software Citation Working Group. 2016. Software citation principles. PeerJ Computer Science 2, e86.

Ralph C. Merkle. 1987. A Digital Signature Based on a Conventional Encryption Function. In Advances in Cryptology – CRYPTO ’87, A Conference on the Theory and Applications of Cryptographic Techniques, Santa Barbara, California, USA, August 16-20, 1987, Proceedings (Lecture Notes in Computer Science), Carl Pomerance (Ed.), Vol. 293. Springer, 369–378.

Persistent identifiers documentation, Software Heritage –

Daniel S. Katz. 2017. Software Heritage and repository metadata: a software citation solution. Daniel S. Katz’s blog.

DOI: 10.25815/0ZBH-2W14

Citation format: The Chicago Manual of Style, 17th Edition

Gruenpeter, Morane. ‘Software Preservation: A Stepping Stone for Software Citation’, 2018.