Is reproducibility practical?

Ludovic Courtès — July 21, 2022

Our attention was recently caught by a nice slide deck on the methods and tools for reproducible research in R. Among those, the talk mentions Guix, stating that it is “for professional, sensitive applications that require ultimate reproducibility”, which is “probably a bit overkill for Reproducible Research”. While we were flattered to see Guix suggested as a good tool for reproducibility, the very notion that there’s a kind of “reproducibility” that is “ultimate” and, essentially, impractical, is something that left us wondering: What kind of reproducibility do scientists need, if not the “ultimate” kind? Is “reproducibility” practical at all, or is it more of a horizon?

In this post, we question the way we Guix people have been discussing “reproducibility” in the context of software deployment. We identify sources of confusion and show that reproducibility is a means that can help achieve different goals. Our conclusion, perhaps unsurprisingly, is that the kinds of “reproducibilities” offered by a tool like Guix are not a luxury for a professional elite: they’re a foundation for reliable software deployment and for verifiable research.

Two kinds of reproducibility

When we talk about “reproducibility” in the context of Guix, we really have two related but different goals in mind. The first goal is being able to redeploy the same software environment on different machines or at different points in time, with little effort.

This first goal is very practical: it’s about letting everyone on a team use the same software, it’s about letting you install the same software on two different machines, whether it’s a laptop running Guix System, a virtual machine running Debian, or a supercomputer running CentOS, and it’s about letting you rerun the computational experiment of a scientific article months later.
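
To make this concrete, here is roughly what that workflow looks like with the Guix command-line tools; the R package names are merely examples, and channels.scm is whatever file name you pick for the pinned channel description.

    # Enter an environment containing R and a couple of packages.
    guix shell r r-ggplot2 r-dplyr

    # Record the exact Guix revision this environment was built from.
    guix describe --format=channels > channels.scm

    # Months later, possibly on another machine, redeploy the very same
    # environment from that pinned revision.
    guix time-machine -C channels.scm -- shell r r-ggplot2 r-dplyr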

The second goal is verifiability. Let’s imagine a scenario where you publish an article and, as accompanying material, you publish source code together with a Docker image on Zenodo containing the code that was supposedly used to produce the results in the article and that supposedly corresponds to that source code.

We say “supposedly” because you cannot tell for sure unless you check. There are two hypotheses one might want to verify:

  1. That the source code matches the binary in the Docker image;
  2. That the program produces the output shown in the article.

Scientific conferences now often have Artifact Evaluation Committees, which in practice verify that source code is available, and, when things go well, that the container image can produce the results shown in the article—the source/binary correspondence is all too often left out as a technical detail. Reproducible research is about being able to verify research outcomes though, and executable artifacts are one such outcome.

“Professional” vs. “good enough”

“I see where you’re headed”, you note, “but bit-for-bit reproducibility is overkill, I don’t need it.” Wait, we didn’t even mention bit-for-bit reproducibility (yet)!

Let’s get back to the first of our two goals: the ability to deploy the same software environment, anytime anywhere. Maybe there are “good enough” approaches, not as “overkill” as what Guix does, and yet that achieve that goal?

Maybe. The slide deck mentioned above is concerned primarily with GNU R software. At almost 30 years of age, R is all wisdom and reliability. The language rarely changes, and its developers pay attention to backward compatibility, minimizing breakage for the thousands of user-contributed packages available on CRAN. If your software environment consists entirely of R packages, the Packrat tool can do wonders: it can create snapshots of the package name/version pairs used in your session and later restore those snapshots by looking up those name/version pairs. It is “good enough” in the sense that the restored environment is “likely” to behave “similarly” to the original one. It is not “ultimate reproducibility” because many things could lead to different behavior: you might be restoring with a different version of R, or one built or configured differently, with a different set of dependencies, or on a different operating system.
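
For the record, here is roughly what that snapshot/restore cycle looks like with Packrat, written as shell invocations of Rscript; treat it as a sketch rather than a complete project setup.

    # In the project directory: start tracking packages and record the
    # name/version pairs currently in use.
    Rscript -e 'packrat::init(); packrat::snapshot()'

    # Later, possibly elsewhere: fetch and rebuild those name/version
    # pairs from CRAN.
    Rscript -e 'packrat::restore()'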

This approach falls short for software environments that are not 100% R. This is not uncommon: think of R packages that wrap C/C++ libraries (zlib, Cairo, cURL, Eigen, etc.). Those libraries are beyond the scope of Packrat; whether Packrat can restore an R package that depends on C/C++ libraries depends on external factors: whether those libraries were pre-installed through some other means, whether the “right” versions are available, whether a C/C++ compiler is available, and so on. It might succeed, or it might fail at build time (due to the lack of a suitable compiler or dependencies) or at run time (due to binary incompatibilities, different dependency versions or build options, etc.). What’s “good enough” for 100% R projects isn’t good enough to let you redeploy polyglot environments.

Other package management tools that have only a partial view of the dependency graph—from pip and Conda to EasyBuild and Spack—suffer from the same shortcoming. They may or may not be able to redeploy software packages; those packages might fail to build because their build environment is not tightly controlled, or they might fail at run time due to binary incompatibilities. These are very practical problems.
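
By contrast, a tool that sees the entire dependency graph can deploy the C/C++ layers along with the R layer. With Guix, for example, asking for an R binding of libcurl brings in the matching curl library, pinned like everything else; the package names below are just examples.

    # r-curl wraps the libcurl C library; Guix deploys both, down to the
    # exact versions and build options recorded in its package graph.
    guix shell --container r r-curl

    # Visualize the dependency graph, C libraries included (requires Graphviz).
    guix graph r-curl | dot -Tpdf > r-curl-graph.pdf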

Bit for bit

This brings us to our second goal: verifiability. For us developers of package management tools, the question is: how can we enable users to independently verify the source/binary correspondence? In our artifact evaluation scenario, we might want to provide reviewers with a Docker image for convenience, but how can we let them verify that the binaries in that image correspond to the accompanying source code?
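
One possible answer, sketched below under the assumption that the image is built with Guix in the first place, is to publish the build recipe and the pinned Guix revision alongside the image, so that reviewers can rebuild it from source and compare the result with what was published; the package names are examples.

    # Build a Docker image containing R and the packages used in the article.
    guix pack --format=docker r r-ggplot2

    # Pin the Guix revision used to produce it...
    guix describe --format=channels > channels.scm

    # ...so that reviewers can later rebuild the same image from source
    # and compare it against the published one.
    guix time-machine -C channels.scm -- pack --format=docker r r-ggplot2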

This is where reproducible builds come in: as a means to allow for independent verification of the source/binary correspondence. The definition that many in the field agree on states:

A build is reproducible if given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts.

“Bit-by-bit identical copies”. That phrase suggests perfection. Perfection doesn’t exist though, and it’s not unusual for scientists and practitioners to stop reading at “bit-by-bit”, saying: “nah—this is nice in theory but just impractical and overkill”.

Think about it though: how hard can it be to make a software build process reproducible bit-for-bit? Fortunately, compilers behave in a deterministic fashion: given the same input, they produce the same output. Experience with software distributions as large as Debian, Arch Linux, NixOS, and Guix has shown that there is a core of well-identified sources of non-reproducibility. Addressing them takes some effort but is not insurmountable: more than 90% of Debian packages and at least 75% of Guix packages are indeed reproducible bit-for-bit. Guix provides users with tools that, we hope, are accessible to those who are not professionals in the field of bit-for-bit reproducibility.
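
Guix exposes these checks directly on the command line; the package name and substitute server URLs below are examples.

    # Rebuild a package locally and check that the result is bit-identical
    # to the one already in the store.
    guix build --check --no-grafts r-curl

    # Compare local binaries against those served by independent build farms.
    guix challenge r-curl \
      --substitute-urls="https://ci.guix.gnu.org https://bordeaux.guix.gnu.org"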

The same goes at a higher level. Earlier we wrote that a tool like Packrat can let you restore an environment “likely to behave similarly” to the original one. How would one define “similarly”, though? If the computation produces different output, what conclusion can you draw? Will you blame the method, when you know your software environment doesn’t faithfully mirror the one that was originally used? No: at best, you’ll have a lot of guesswork to do before you can draw any conclusion. Conversely, if you know you deployed the same software, bit-for-bit, then you’ve significantly reduced the search space in case the computation produces different output. Bit-for-bit reproducibility might sound like overkill, but it’s the only practical way to determine whether a computational process is reproducible.

Practicality

This blog post was prompted by a slide deck. Perhaps what the author alluded to when they mentioned “ultimate reproducibility” and Guix being “overkill” is that Guix as a project is on a quixotic quest for reproducibility; but perhaps what they suggested by framing it as “professional” is that using it is difficult.

The answer is that if you liked pip install or apt install, you’ll love guix install. Over ten years of development, we’ve worked hard on the user interface and documentation to make it easier to get started. That doesn’t mean everything’s perfect—one of the talks at the upcoming Ten Years of Guix event is about making Guix more approachable and we’re always eager to get feedback from newcomers—but at least the basics should be accessible to anyone who has used the command line before, or even just Jupyter.
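
For instance, the everyday commands look like this; the package name is just an example.

    guix search bioinformatics    # look for packages by keyword
    guix install r-ggplot2        # install into your default profile
    guix remove r-ggplot2         # ...and remove it again
    guix pull                     # update Guix and its package collection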

Our message is that it is possible to achieve these two types of “reproducibility”: the ability to deploy the same environment anywhere anytime, and the ability to verify the source/binary correspondence of an existing deployment. “Good enough” solutions are good enough in narrow cases only. We can and must demand more of our deployment tools.

Beyond reproducibility

This post focuses on reproducibility, but we should keep in mind that the scientific process does not consist in merely reproducing experiments as-is—it’s about experimenting, fiddling with the computation to evaluate the impact of a parameter on the output, changing parts of the code, and so forth. In a thoughtful article, Hinsen identifies four “essential possibilities” for reproducible computations:

  1. The possibility to inspect all the input data and all the source code that can possibly have an impact on the results.

  2. The possibility to run the code on a suitable computer of one’s own choice in order to verify that it indeed produces the claimed results.

  3. The possibility to explore the behavior of the code, by inspecting intermediate results, by running the code with small modifications, or by subjecting it to code analysis tools.

  4. The possibility to verify that published executable versions of the computation, proposed as binary files or as services, do indeed correspond to the available source code.

These four items might sound uncontroversial, but their practical implications are wide-ranging. The first item is unlocked by publishing scientific software under a free license—as UNESCO recommends—and the two kinds of “reproducibility” discussed in this article support #2 and #4. To explore the behavior of the code, we need more. Guix eases exploration with “package transformation options”, which let users deploy variants of the software environment, for example by applying a patch somewhere in the software stack or swapping one dependency for another. A “frozen” application bundle such as a Docker image does not offer that lever.
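
As a sketch of what that exploration can look like on the command line (the package names, version, and patch file are placeholders):

    # Redeploy the environment with one dependency swapped for another version.
    guix shell r r-curl --with-input=curl=curl@7.84.0

    # Redeploy it with a local patch applied somewhere down the stack.
    guix shell r r-curl --with-patch=curl=./my-fix.patch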

That most scientific processes now involve software should be an opportunity to improve reproducibility and provenance tracking and to facilitate experimentation, not an obstacle to them.

Acknowledgments

Many thanks to Ricardo Wurmus who provided valuable feedback on an earlier draft of this post.

Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).
