CRAN, a practical example for being reproducible at large scale using GNU Guix

Lars-Dominik Braun — December 21, 2022

A recent study published in Nature Scientific Data in February 2022 gives empirical insight into the success rate of reproducing R scripts obtained from Harvard’s Dataverse:

We re-executed R code from each of the replication packages using three R software versions, R 3.2, R 3.6, and R 4.0, in a clean environment. […] We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices.

Given that more than half of the published R files failed to run even when tried with three different R versions, recording the exact environment in which software is supposed to run could be declared a good coding practice for scientific publications.

The R ecosystem itself provides tools to capture and restore R software environments, including Packrat and its successor renv, both of which originate from within the RStudio project. Two replication packages in the study above used renv, while the others did not record the environment at all.

Looking at renv more closely reveals that it is able to capture the current R version and installed packages in a lockfile called renv.lock. However, restoring an environment comes with a few caveats. First of all, renv does not install a different version of R if the recorded and current versions disagree. This is a manual step and up to the user. The same is true for packages with external dependencies. Those libraries, their headers and binaries also need to be installed by the user in the correct version, which is not recorded in the lockfile. Furthermore, renv supports restoring packages installed from git repositories, but fails if the user did not install git beforehand.
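The manual version check described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the helper names lock_r_version and r_version_matches are hypothetical, and the lockfile is read with a line-based heuristic (the first "Version" key is assumed to belong to the "R" section) rather than a real JSON parser.

```shell
#!/bin/sh
# Heuristic: the "R" section is assumed to come first in renv.lock, so the
# first "Version" key is taken to be the recorded R version.
lock_r_version() {
    sed -n 's/.*"Version": *"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed if the version recorded in the lockfile ($1) equals the given
# version string ($2).
r_version_matches() {
    [ "$(lock_r_version "$1")" = "$2" ]
}

# Only runs when a lockfile and an R installation are actually present.
if [ -f renv.lock ] && command -v Rscript >/dev/null 2>&1; then
    current=$(Rscript -e 'cat(as.character(getRversion()))')
    r_version_matches renv.lock "$current" ||
        echo "renv.lock records R $(lock_r_version renv.lock), but R $current is installed" >&2
fi
```

Even when the versions match, the external libraries a package links against remain unchecked, since the lockfile does not record them.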

None of the guesswork and manual installation steps are required when using GNU Guix, since software in its repositories is bit-for-bit reproducible. It also provides scripts (“importers”) to turn packages from various language-specific repositories like PyPI for Python, crates.io for Rust and CRAN for R into Guix package recipes.

An example workflow for the CRAN package zoid, which is not available in Guix proper, would look like this:

  1. Import the package into a manifest.

    $ guix import cran -r zoid > manifest.scm

  2. Edit manifest.scm to import the required modules and return a usable manifest containing the package and R itself.

    (use-modules (guix packages)
                 (guix download)
                 (guix licenses)
                 (guix build-system r)
                 (gnu packages cran)
                 (gnu packages statistics))
    
    (define-public r-zoid )
    
    (packages->manifest (list r-zoid r))

  3. Run your code.

    $ guix shell -m manifest.scm -- R -e 'library(zoid)'

Although Guix displays hints about which modules are missing when trying to use an incomplete manifest, editing the manifest file to include all of them can be quite tedious.

For R specifically the R package guix.install provides a way to automate this import. It also uses guix import, but references dependencies using package specifications like (specification->package "r-bh"). This way no extra logic to figure out the correct module imports is required. It then extends the package search path, including the newly written file at ~/.Rguix/packages.scm, installs the package into the default Guix profile at ~/.guix-profile and adds this profile to R’s search path.

While this approach works well for individual users, Guix installations with a larger user base, for instance institution-wide ones, would benefit from the default availability of the entire CRAN package collection with pre-built substitutes to speed up installation. Additionally, reproducing environments would involve fewer steps if the package recipes were available to anyone by default.

Introducing guix-cran

GNU Guix provides a mechanism called “channels”, which can extend the package collection in Guix proper. guix-cran does exactly that: It provides all CRAN packages missing in Guix proper in a channel and has all of the properties mentioned above. It can be installed globally via /etc/guix/channels.scm and packages can be pre-built on a central server.

As of commit cc7394098f306550c476316710ccad20a510fa4b there are 17431 packages available in guix-cran. 95% of them are buildable and only 0.5% of these builds are not reproducible via guix build --check. It is also possible to use old package versions via guix time-machine, similar to what MRAN offers. However, that time-frame only spans about two months right now.

Creating and updating guix-cran is fully automated and happens without any human intervention. Improvements to the already very good CRAN importer also improve the channel’s quality. The channel itself is always in a usable state, because updates are tested with guix pull before committing and pushing them. However, some packages may not build or work, because (usually undeclared) build or runtime dependencies are missing. This could be improved through better auto-detection in the CRAN importer.

Currently, building the channel derivation is very slow, most likely due to Guile performance issues. For this reason, packages are split into files by the first letter of their name. This way, the file containing a package can still be determined from its name alone. Since the number of loadable modules is limited to 8192, creating one module file per package is not possible, and putting them all into the same file is even slower.
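To make the first-letter split concrete, here is a minimal sketch of how a package name could be mapped to its file. The helper module_file_for is hypothetical, and two details are assumptions not stated in this post: that the files live at guix-cran/packages/<letter>.scm, and that the letter is taken from the name after the "r-" prefix.

```shell
#!/bin/sh
# Hypothetical helper: print the file that is assumed to hold a given
# guix-cran package, based on the first letter of its R package name.
module_file_for() {
    name=${1#r-}    # drop the "r-" prefix used for R packages in Guix
    first=$(printf '%s' "$name" | cut -c1 | tr '[:upper:]' '[:lower:]')
    printf 'guix-cran/packages/%s.scm\n' "$first"
}

module_file_for r-zoid
module_file_for r-bh
```

Under these assumptions, the mapping needs no index or lookup table, which is what makes the split deterministic.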

The channel is not signed, because all changes are automated anyway.

Usage

Using guix-cran requires the following steps:

  1. Create channels.scm:

    (cons
      (channel
        (name 'guix-cran)
        (url "https://github.com/guix-science/guix-cran.git"))
      %default-channels)

  2. Create manifest.scm:

    (specifications->manifest '("r-zoid" "r"))

  3. Run:

    $ guix time-machine -C channels.scm -- shell -m manifest.scm -- R -e 'library(zoid)'

For true reproducibility it’s necessary to pin the channels to a specific commit by running

$ guix time-machine -C channels.scm -- describe -f channels > channels.pinned.scm

once and using channels.pinned.scm instead of channels.scm from there on.
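A quick sanity check can catch the mistake of passing an unpinned file by accident. The sketch below is a line-based heuristic, not a Scheme parser: the hypothetical helper channels_pinned merely checks that the file contains as many (commit fields as (channel forms, assuming each appears at most once per line.

```shell
#!/bin/sh
# Hypothetical check: succeed only if every (channel ...) form in the file
# given as $1 is accompanied by a (commit ...) field.
channels_pinned() {
    [ -r "$1" ] || return 1
    channels=$(grep -c -F '(channel' "$1")
    commits=$(grep -c -F '(commit' "$1")
    [ "$channels" -gt 0 ] && [ "$channels" -eq "$commits" ]
}

# Only runs when the pinned file from the step above exists.
if [ -f channels.pinned.scm ] && ! channels_pinned channels.pinned.scm; then
    echo "channels.pinned.scm is not fully pinned" >&2
fi
```

Note that %default-channels is not counted as a channel form, so a file that relies on it for the guix channel and pins every channel it lists explicitly would still pass this check.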

Appendix

Ludovic Courtès, Simon Tournier and Ricardo Wurmus provided valuable feedback to the draft of this post.

The channel statistics above can be reproduced using the following channel file (channels.scm):

(list
  (channel
    (name 'guix)
    (url "https://git.savannah.gnu.org/git/guix.git")
    (branch "master")
    (commit
      "4781f0458de7419606b71bdf0fe56bca83ace910")
    (introduction
      (make-channel-introduction
        "9edb3f66fd807b096b48283debdcddccfea34bad"
        (openpgp-fingerprint
          "BBB0 2DDF 2CEA F6A8 0D1D  E643 A2A0 6DF2 A33A 54FA"))))
  (channel
    (name 'guix-cran)
    (url "https://github.com/guix-science/guix-cran.git")
    (branch "master")
    (commit
      "cc7394098f306550c476316710ccad20a510fa4b")))

And the following Scheme code to obtain a list of all packages provided by guix-cran (list-packages.scm):

(use-modules (guix discovery)
             (gnu packages)
             (guix modules)
             (guix utils)
             (guix packages))
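;; Collect every package whose defining file lives under guix-cran/,
;; i.e. whose module name starts with the symbol guix-cran.
(let* ((modules (all-modules (%package-module-path)))
       (packages (fold-packages
                   (lambda (p accum)
                     (let ((mod (file-name->module-name (location-file (package-location p)))))
                       (if (member (car mod) '(guix-cran))
                         (cons p accum)
                         accum)))
                   '() modules)))
  ;; Print one package name per line.
  (for-each (lambda (p) (format #t "~a~%" (package-name p))) packages))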

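And this shell script:

#!/bin/sh

# Pull Guix and guix-cran at the pinned commits into a local profile.
guix pull -p guix-profile -C channels.scm
export GUIX_PROFILE="$(pwd)/guix-profile"
# Load the profile's environment; the POSIX "." works under any /bin/sh.
. "$GUIX_PROFILE/etc/profile"
guix repl list-packages.scm > packages
# Build each package, then rebuild it with --check to detect non-reproducible builds.
cat packages | parallel -j 4 'rm -f builds/{} && guix build --no-grafts --timeout=300 -r builds/{} -q {} 2>&1 && guix build --no-grafts --timeout=300 --check -q {} 2>&1' | tee build.log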

echo "total" && wc -l packages
echo "success" && sort -u build.log | grep '^/gnu/store' | wc -l
echo "failure" && sort -u build.log | grep 'failed$' | wc -l
echo "non-reproducible" && sort -u build.log | grep 'differs$' | wc -l
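The raw counts printed above can be turned into percentages with a small helper. This is a sketch only: the function name summarize is hypothetical, it reuses the same three line patterns as the script, and the choice of denominators (all packages for build results, successful builds for reproducibility) is an assumption based on how the figures are worded in this post.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the package list, $2 the build log.
summarize() {
    total=$(wc -l < "$1")
    sort -u "$2" | awk -v total="$total" '
        /^\/gnu\/store/ { ok++ }        # store path printed by a successful build
        /failed$/       { failed++ }    # build failure message
        /differs$/      { differs++ }   # --check found differing output
        END {
            if (total == 0 || ok == 0) exit 1
            printf "buildable: %.1f%%\n", 100 * ok / total
            printf "failed: %.1f%%\n", 100 * failed / total
            printf "non-reproducible: %.1f%% of successful builds\n", 100 * differs / ok
        }'
}

# Only runs when the files produced by the script above exist.
if [ -f packages ] && [ -f build.log ]; then
    summarize packages build.log
fi
```

Like the counting commands above, this treats every unique store path as one successful package, so packages with several outputs would be counted more than once.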

Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).
