Reproducible research hackathon: experience report

Simon Tournier, Ludovic Courtès — July 10, 2020

Last week, on July 3rd, we held an on-line hackathon on reproducible research issues. This hackathon was a collaborative effort to bring GNU Guix to concrete examples inspired by contributions to the recent Ten Years Reproducibility Challenge organized by ReScience.

We were about 15 people connected on the #guix-hpc channel. The day was punctuated by three video chats: the first to share interests, backgrounds and the working plan, the second to report on work in progress, and the last to review achievements and list future ideas. Here’s a recap.

Growing the Guix-Past channel

The aim of the Guix-Past channel is to bring software from the past to the present: it gives you packages from “back then” that you can deploy here and now.
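As a rough sketch of what "deploy here and now" can look like in practice, the helper below writes a channels file that adds Guix-Past next to the default channels. The function name `write_past_channels` is made up for illustration, and the channel URL is deliberately left as an argument because it is not given in this report; look it up in the channel's own documentation.

```shell
# Sketch: write a channels file that adds the Guix-Past channel.
# usage: write_past_channels FILE URL
# FILE and URL are supplied by the caller; nothing here is specific to a host.
write_past_channels() {
    file=$1
    url=$2
    cat > "$file" <<EOF
(cons (channel
        (name 'guix-past)
        (url "$url"))
      %default-channels)
EOF
}

# Illustrative usage (requires Guix; <url> and <package> are placeholders):
#   write_past_channels ./channels.scm <url>
#   guix time-machine -C ./channels.scm -- build <package>
```

Passing the file to `guix time-machine -C` keeps the experiment isolated from the channels used by your regular `guix pull`.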

The hackathon was the occasion to add packages of historical interest.

People also started work on addressing issues with Fortran 77, GNU Octave 3.4.3 with glibc 2.31 and opflow (1998).

While working on old packages, two concerns about discoverability were raised:

  • The release date of a package matters because it makes it easier to find the version that was current when a paper was published. We discussed where to specify it: in the synopsis, in the description, or in a comment in the code? The policy we settled on is to use the extra properties field:

    (properties `((release-date . "2015-04-17")))

    The next step is to add a UI to view properties from the command line.

  • The guix time-machine command allows users to build and install previous package versions. However, it is not possible to “jump” to a Guix revision older than version 0.15.0, released in July 2018. For example, old Boost versions had already been packaged in Guix, but they are unreachable and had to be backported to the Guix-Past channel. Finding them meant digging with bare Git commands such as:

    git -C /path/to/guix-checkout log | grep -B4 "boost: Update"

    Version history is already available on the Guix Data Service, and one idea is to extend such historical search.
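The Git search and the time-machine step described above can be sketched as two small shell helpers. The names `find_updates` and `build_at_commit` are hypothetical, and the second helper only works for commits that `guix time-machine` can reach, that is, version 0.15.0 or later.

```shell
# Sketch: list the commits whose message mentions an update of PACKAGE.
# usage: find_updates /path/to/guix-checkout boost
find_updates() {
    checkout=$1
    package=$2
    # --grep filters on the commit message;
    # %h = short hash, %ad = author date, %s = subject line.
    git -C "$checkout" log --date=short --format='%h %ad %s' \
        --grep="$package: Update"
}

# Sketch: build PACKAGE with the Guix revision identified by COMMIT.
# usage: build_at_commit COMMIT PACKAGE
build_at_commit() {
    guix time-machine --commit="$1" -- build "$2"
}
```

Compared with piping `git log` through `grep -B4`, the `--format` option prints one line per matching commit, which is easier to scan when looking for the revision that shipped a given version.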

Reviving the old Python ecosystem

For reproducibility purposes, people are interested in being able to deploy Python software from the last decade. For instance, GeneNetwork gathers 25 years of legacy linked data sets and tools used to study complex networks of genes, molecules, and higher-order gene function and phenotypes, and the project needs to generate time machines of version 1 of the platform. NumPy and Matplotlib are frequently used, and specific bioinformatics tools require them.

The hackathon was the occasion to add over 16 commits.

So much for vintage Python!

Towards long-term and archivable reproducibility

The ambition of Software Heritage is to collect, preserve, and share all software that is publicly available in source code form. Guix is able to interact with this archive.

Guix can submit archival requests via guix lint -c archival. Once a package is ready, if its origin uses git-fetch, linting ensures that the source code is saved on Software Heritage. The hackathon reminded us that support for other version control systems, such as Subversion and Mercurial, is missing from guix lint.
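For illustration, the lint step can be wrapped in a one-line helper. The name `request_archival` is hypothetical; the command it runs is the `guix lint -c archival` invocation mentioned above, which accepts one or more package names.

```shell
# Sketch: run only the "archival" checker on the given packages.
# usage: request_archival PACKAGE...
request_archival() {
    guix lint -c archival "$@"
}

# Illustrative usage (requires Guix; <package> is a placeholder):
#   request_archival <package>
```

Running the checker right after a package is added or updated is a cheap way to make sure its source is known to the archive before the upstream hosting site has a chance to disappear.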

In the long run, one cannot assume that source code hosting sites will remain available—here’s a fresh example. In such cases, Guix falls back to Software Heritage and downloads from there if the source code is archived. During the hackathon, we found a regression in that fall-back path and fixed it.

A one-day on-line get-together is a great opportunity to tackle longstanding topics while helping each other and welcoming newcomers on board. Thanks to everyone for joining! It’s been a pleasant and productive experience, so stay tuned for other rounds!
