Reproducible research hackathon: experience report
Last week, on July 3rd, we held an on-line hackathon on reproducible research issues. This hackathon was a collaborative effort to bring GNU Guix to concrete examples inspired by contributions to the recent Ten Years Reproducibility Challenge organized by ReScience.
We were ~15 people connected on the #guix-hpc channel.
The day was punctuated by three video chats: the first to share interests, backgrounds, and a working plan; the second to report on work in progress; and the last to review achievements and list future ideas.
Here’s a recap.
Growing the Guix-Past channel
The aim of the Guix-Past channel is to bring software from the past to the present: it gives you packages from “back then” that you can deploy here and now.
The hackathon was the occasion to add packages of historical interest:
- Perl 5.14
- Boost 1.58, 1.55, and 1.44
- GNU Scientific Library 1.16
- SimGrid 3.3 and GTnetS 2009, added in an attempt to reproduce the software environment described in this ReScience submission by Arnaud Legrand.
While working on old packages, two concerns about discoverability were raised:
The release date of a package matters: it makes it easier to find the version that was current when a paper was published. We discussed where to record it: in the synopsis, in the description, or in a code comment? The policy we settled on is to use the package's properties field:
(properties `((release-date . "2015-04-17")))
The next step is to add a user interface to view properties from the command line.
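To illustrate why a machine-readable release date helps, here is a minimal shell sketch that picks the version that was current on a given publication date. It assumes a hypothetical input of "VERSION RELEASE-DATE" lines, for instance collected by hand from release-date properties. The function names and the release history of "example-lib" are made up for illustration and are not part of Guix.

```shell
# Print the newest version released on or before CUTOFF (YYYY-MM-DD).
# Reads "VERSION RELEASE-DATE" lines on stdin.
# Hypothetical helper, not a Guix command.
version_current_at() {
    cutoff="$1"
    # ISO 8601 dates order correctly as plain strings; appending ""
    # forces awk to compare them as strings rather than numbers.
    awk -v cutoff="$cutoff" '($2 "") <= (cutoff "") { print $2, $1 }' |
        sort | tail -n 1 | cut -d " " -f 2
}

# Made-up release history for a hypothetical package "example-lib".
example_lib_releases() {
    printf '%s\n' \
        "1.0 2010-01-15" \
        "2.0 2013-06-30" \
        "3.0 2015-04-17"
}

# Which version was current for a paper published on 2014-06-01?
example_lib_releases | version_current_at 2014-06-01
```

The function prints nothing when every release is newer than the cutoff date, which a caller can treat as "no suitable version packaged".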
The guix time-machine command allows users to build and install previous package versions. However, it is not possible to “jump” to a Guix revision older than version 0.15.0, released in July 2018. For example, old Boost versions had already been packaged in Guix, but they are unreachable this way and had to be backported to the Guix-Past channel, after locating them with bare Git commands such as:
git -C /path/to/guix-checkout log | grep -B4 "boost: Update"
Version history is already available on the Guix Data Service; one idea is to extend it to support this kind of historical search.
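The manual search through the Git history can be wrapped in a small helper. The sketch below filters a one-line Git log for commits whose subject mentions an update of a given package. It assumes that commit subjects contain "PACKAGE: Update", as in the grep pattern above; the function name package_update_commits is hypothetical and the sample log lines are made up.

```shell
# Keep only log lines whose subject mentions an update of package $1.
# Reads one commit per line on stdin, e.g. the output of
#   git -C /path/to/guix-checkout log --date=short --format='%h %ad %s'
# Hypothetical helper; assumes subjects contain "PACKAGE: Update".
package_update_commits() {
    # -F treats the pattern as a fixed string, so dots or pluses in the
    # package name are not interpreted as regular-expression operators.
    # The leading space keeps "boost" from matching "python-boost".
    grep -F -- " $1: Update"
}

# Made-up log excerpt standing in for real `git log` output.
sample_log() {
    printf '%s\n' \
        "1111111 2015-04-20 gnu: boost: Update to 1.58.0." \
        "2222222 2015-04-21 gnu: perl: Update to 5.16.1." \
        "3333333 2013-12-01 gnu: boost: Update to 1.55.0."
}

sample_log | package_update_commits boost
```

Each matching line starts with an abbreviated commit hash and a date, which is enough to locate the old package definition in the checkout before backporting it.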
Reviving the old Python ecosystem
For reproducibility purposes, people are interested in being able to deploy Python software from the last decade. For instance, GeneNetwork is a collection of linked data sets and tools, with 25 years of legacy, used to study complex networks of genes, molecules, and higher-order gene function and phenotypes; the project needs to generate time machines of version 1 of the platform. NumPy and Matplotlib are frequently used, and specific bioinformatics tools require them.
The hackathon was the occasion to add the following, in more than 16 commits:
- NumPy/numarray 1.5.2, 1.1.1, 1.0.4, and 1.2.1 (numarray is one of the two predecessors of NumPy)
- Matplotlib 1.1.0 and this
- Python 2.4
- Nose, dateutil 2.1, Six 1.4.1, Pytest 2.4.2 and Argparse
So much for vintage Python!
Towards long-term and archivable reproducibility
The ambition of Software Heritage is to collect, preserve, and share all software that is publicly available in source code form. And Guix is able to interact with this archive.
Guix can submit a request for archiving via guix lint -c archival. Once the package is ready, if its source uses git-fetch, linting ensures the source code is saved on Software Heritage. The hackathon reminded us that support for other version control systems, such as Subversion and Mercurial, is still missing.
In the long run, one cannot assume that source code hosting sites will remain available—here’s a fresh example. In such cases, Guix falls back to Software Heritage and downloads from there if the source code is archived. During the hackathon, we found a regression in that fall-back path and fixed it.
A one-day on-line get-together is a great opportunity to tackle longstanding topics while helping each other and welcoming newcomers on board. Thanks to everyone for joining! It’s been a pleasant and productive experience, so stay tuned for other rounds!