CRAN, a practical example for being reproducible at large scale using GNU Guix
A recent study published in Nature Scientific Data in February 2022 gives empirical insight into the success rate of reproducing R scripts obtained from Harvard’s Dataverse:
We re-executed R code from each of the replication packages using three R software versions, R 3.2, R 3.6, and R 4.0, in a clean environment. […] We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices.
Given that more than half of the published R files failed to run even when tried with three different R versions, recording the exact environment that software is supposed to run in could be declared a good coding practice for scientific publications.
The R ecosystem itself provides tools to capture and restore R software environments, including Packrat and its successor renv which both originate from within the RStudio project. Two replication packages in the study above used renv while the others did not record the environment at all.
Looking at renv more closely reveals that it is able to capture the current R version and installed packages in a lockfile called renv.lock. However, as noted before, restoring an environment comes with a few caveats:
First of all, renv does not install a different version of R if the
recorded and current version disagree. This is a manual step and up to
the user. The same is true for packages with external dependencies. Those
libraries, their headers and binaries also need to be installed by the
user in the correct version, which is not recorded in the lockfile.
Furthermore renv supports restoring packages installed from git
repositories, but fails if the user did not install git beforehand.
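To make the first caveat concrete, here is a minimal shell sketch that compares the R version recorded in a lockfile with a given version. The helper names lockfile_r_version and r_version_matches are hypothetical, and the parsing assumes that the first "Version" field in renv.lock belongs to the "R" entry; a real JSON parser would be more robust.

```shell
# Hypothetical helper: print the R version recorded in an renv lockfile.
# Assumption: the first "Version" field in the file belongs to the "R" entry.
lockfile_r_version() {
    awk -F'"' '/"Version"/ { print $4; exit }' "$1"
}

# Hypothetical helper: succeed only if the lockfile's R version equals
# the version passed as second argument.
r_version_matches() {
    [ "$(lockfile_r_version "$1")" = "$2" ]
}
```

A call such as r_version_matches renv.lock 4.1.2 || echo "R version differs from lockfile" would only report the mismatch. Installing the matching R version remains a manual step.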
None of this guesswork and none of these manual installation steps are required when using GNU Guix, since software in its repositories is bit-for-bit reproducible. It also provides scripts (“importers”) to turn packages from various language-specific repositories, like PyPI for Python, crates.io for Rust and CRAN for R, into Guix package recipes.
An example workflow for the CRAN package zoid, which is not available in Guix proper, would look like this:
Import the package into a manifest.
$ guix import cran -r zoid > manifest.scm
Edit manifest.scm to import the required modules and return a usable manifest containing the package and R itself.
(use-modules (guix packages)
             (guix download)
             (guix licenses)
             (guix build-system r)
             (gnu packages cran)
             (gnu packages statistics))

(define-public r-zoid …)

(packages->manifest (list r-zoid r))
Run your code.
$ guix shell -m manifest.scm -- R -e 'library(zoid)'
Although Guix displays hints about which modules are missing when trying to use an incomplete manifest, editing the manifest file to include all of them can be quite tedious.
For R specifically, the R package guix.install provides a way to automate this import. It also uses guix import, but references dependencies using package specifications like (specification->package "r-bh"). This way no extra logic to figure out the correct module imports is required. It then extends the package search path, including the newly written file at ~/.Rguix/packages.scm, installs the package into the default Guix profile at ~/.guix-profile and adds this profile to R’s search path.
While this approach works well for individual users, Guix installations with a larger user-base, for instance institution-wide, would benefit from the default availability of the entire CRAN package collection with pre-built substitutes to speed up installation times. Additionally, reproducing environments would include fewer steps if the package recipes were available to anyone by default.
Introducing guix-cran
GNU Guix provides a mechanism called “channels”, which can extend the package collection in Guix proper. guix-cran does exactly that: it provides all CRAN packages missing in Guix proper in a channel and has all of the properties mentioned above. It can be installed globally via /etc/guix/channels.scm and packages can be pre-built on a central server.
As of commit cc7394098f306550c476316710ccad20a510fa4b there are 17431 packages available in guix-cran. 95% of them are buildable and only 0.5% of these builds are not reproducible via guix build --check. It is also possible to use old package versions via guix time-machine, similar to what MRAN offers. However, that time-frame only spans about two months right now.
Creating and updating guix-cran is fully automated and happens without any human intervention. Improvements to the already very good CRAN importer also improve the channel’s quality. The channel itself is always in a usable state, because updates are tested with guix pull before committing and pushing them. However, some packages may not build or work, because (usually undeclared) build or runtime dependencies are missing. This could be improved through better auto-detection in the CRAN importer.
Currently, building the channel derivation is very slow, most likely due to Guile performance issues. For this reason, packages are split into files by the first letter of their name. This way the file containing a package can still be determined from the package name alone. Since the number of loadable modules is limited to 8192, creating one module file per package is not possible, and putting them all into the same file is even slower.
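The naming scheme can be illustrated with a small shell sketch. The function name module_file and the guix-cran/packages/<letter>.scm layout are assumptions made for illustration, not taken from the channel's source. The point is only that a package's file follows from its name.

```shell
# Hypothetical helper: print the file that would hold a package, given
# either its Guix name (r-zoid) or its CRAN name (zoid).
# Assumption: files are laid out as guix-cran/packages/<letter>.scm.
module_file() {
    cran_name=${1#r-}    # drop the "r-" prefix of Guix package names
    letter=$(printf '%s' "$cran_name" | cut -c1 | tr '[:upper:]' '[:lower:]')
    printf 'guix-cran/packages/%s.scm\n' "$letter"
}
```

Under these assumptions, module_file r-zoid prints guix-cran/packages/z.scm, so roughly two dozen modules cover the whole collection, far below the module limit.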
The channel is not signed, because all changes are automated anyway.
Usage
Using guix-cran requires the following steps:
Create channels.scm:
(cons (channel
        (name 'guix-cran)
        (url "https://github.com/guix-science/guix-cran.git"))
      %default-channels)
Create manifest.scm:
(specifications->manifest '("r-zoid" "r"))
Run:
$ guix time-machine -C channels.scm -- shell -m manifest.scm -- R -e 'library(zoid)'
For true reproducibility it’s necessary to pin the channels to a specific commit by running
$ guix time-machine -C channels.scm -- describe -f channels > channels.pinned.scm
once and using channels.pinned.scm instead of channels.scm from there on.
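The "pin once, then reuse" routine could be wrapped in a small shell function. This is a sketch only: the function name run_pinned is hypothetical, and it uses nothing beyond the two guix time-machine invocations shown above.

```shell
# Hypothetical helper: pin the channels on first use, then run any guix
# subcommand through the pinned channel file.
run_pinned() {
    if [ ! -f channels.pinned.scm ]; then
        # Write to a temporary file first so a failed run leaves no partial pin.
        if guix time-machine -C channels.scm -- describe -f channels \
                > channels.pinned.scm.tmp; then
            mv channels.pinned.scm.tmp channels.pinned.scm
        else
            rm -f channels.pinned.scm.tmp
            return 1
        fi
    fi
    guix time-machine -C channels.pinned.scm -- "$@"
}
```

With this in place, run_pinned shell -m manifest.scm -- R -e 'library(zoid)' would use the same commits on every run, provided channels.pinned.scm is kept alongside the manifest, for example under version control.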
Appendix
Ludovic Courtès, Simon Tournier and Ricardo Wurmus provided valuable feedback to the draft of this post.
The channel statistics above can be reproduced using the following channel file (channels.scm):
(list
 (channel
  (name 'guix)
  (url "https://git.savannah.gnu.org/git/guix.git")
  (branch "master")
  (commit
   "4781f0458de7419606b71bdf0fe56bca83ace910")
  (introduction
   (make-channel-introduction
    "9edb3f66fd807b096b48283debdcddccfea34bad"
    (openpgp-fingerprint
     "BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA"))))
 (channel
  (name 'guix-cran)
  (url "https://github.com/guix-science/guix-cran.git")
  (branch "master")
  (commit
   "cc7394098f306550c476316710ccad20a510fa4b")))
And the following Scheme code to obtain a list of all packages provided by guix-cran (list-packages.scm):
(use-modules (guix discovery)
             (gnu packages)
             (guix modules)
             (guix utils)
             (guix packages))

;; Collect every package whose defining module lives under the
;; guix-cran namespace, then print one package name per line.
(let* ((modules (all-modules (%package-module-path)))
       (packages (fold-packages
                  (lambda (p accum)
                    ;; Derive the module name from the file that defines P.
                    (let ((mod (file-name->module-name
                                (location-file (package-location p)))))
                      (if (member (car mod) '(guix-cran))
                          (cons p accum)
                          accum)))
                  '() modules)))
  (for-each (lambda (p) (format #t "~a~%" (package-name p))) packages))
And this Bash script:
#!/bin/bash
# Pull both channels into a local profile and activate it.
guix pull -p guix-profile -C channels.scm
export GUIX_PROFILE="$(pwd)/guix-profile"
source "$GUIX_PROFILE/etc/profile"
# List all guix-cran packages, one name per line.
guix repl list-packages.scm > packages
# Make sure the directory for the GC root symlinks exists.
mkdir -p builds
# Build every package, then rebuild it with --check to detect non-reproducible outputs.
cat packages | parallel -j 4 'rm -f builds/{} && guix build --no-grafts --timeout=300 -r builds/{} -q {} 2>&1 && guix build --no-grafts --timeout=300 --check -q {} 2>&1' | tee build.log
echo "total" && wc -l packages
echo "success" && sort -u build.log | grep '^/gnu/store' | wc -l
echo "failure" && sort -u build.log | grep 'failed$' | wc -l
echo "non-reproducible" && sort -u build.log | grep 'differs$' | wc -l
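The counts printed by the script can be turned into percentages like the ones quoted earlier. The following shell sketch is one way to do that. The function names percent and summarize are hypothetical, it assumes the packages and build.log files have the shape produced by the script above, and it takes the non-reproducible share relative to the successful builds.

```shell
# Hypothetical helper: print PART/TOTAL as a percentage with one decimal.
percent() {
    awk -v part="$1" -v total="$2" \
        'BEGIN { if (total + 0 > 0) printf "%.1f\n", 100 * part / total; else print "n/a" }'
}

# Hypothetical helper: summarize a package list and a build log using the
# same grep patterns as the script above.
summarize() {
    packages_file=$1
    log_file=$2
    total=$(wc -l < "$packages_file")
    # grep -c exits non-zero on zero matches, so ignore its status.
    success=$(sort -u "$log_file" | grep -c '^/gnu/store' || true)
    nonrepro=$(sort -u "$log_file" | grep -c 'differs$' || true)
    echo "buildable: $(percent "$success" "$total")%"
    echo "non-reproducible: $(percent "$nonrepro" "$success")%"
}
```

Running summarize packages build.log after the script has finished would print both ratios.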