Guix-HPC — Reproducible software deployment for high-performance computing

Adventures on the quest for long-term reproducible deployment
Ludovic Courtès, 2024-03-13

<p>Rebuilding software five years later, how hard can it be? It can’t be
<em>that</em> hard, especially when you pride yourself on having a tool that
can <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-time_002dmachine.html">travel in
time</a>
and that does a good job at ensuring <a href="https://reproducible-builds.org/docs/definition/">reproducible
builds</a>, right?</p><p>In hindsight, we can tell you: it’s more challenging than it
seems. Users attempting to travel 5 years back with <code>guix time-machine</code>
are (or <em>were</em>) unavoidably going to hit bumps on the road—a real
problem because that’s one of the use cases Guix aims to support well,
in particular in a <a href="https://hpc.guix.info/blog/tag/reproducibility/">reproducible
research</a> context.</p><p>In this post, we look at some of the challenges we face while traveling
back, how we are overcoming them, and open issues.</p><h1>The vision</h1><p>First of all, one clarification: Guix aims to support time travel, but
we’re talking of a time scale measured in years, not in decades. We
know all too well that this is already very ambitious—it’s something
that probably nobody except <a href="https://nixos.org">Nix</a> and Guix is even
trying. More importantly, software deployment at the scale of decades
calls for very different, more radical techniques; it’s the work of
archivists.</p><p>Concretely, Guix 1.0.0 was <a href="https://guix.gnu.org/en/blog/2019/gnu-guix-1.0.0-released/">released in
2019</a> and
our goal is to allow users to travel as far back as 1.0.0 and redeploy
software from there, as in this example:</p><pre><code>$ guix time-machine -q --commit=v1.0.0 -- \
environment --ad-hoc python2 -- python
guile: warning: failed to install locale
Python 2.7.15 (default, Jan 1 1970, 00:00:01)
[GCC 5.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>></code></pre><p>(The command above uses <code>guix environment</code>, the <a href="https://guix.gnu.org/en/blog/2021/from-guix-environment-to-guix-shell/">predecessor of <code>guix shell</code></a>,
which didn’t exist back then.)
It’s only 5 years ago but it’s pretty much remote history on the scale
of software evolution—in this case, that history comprises major
changes <a href="https://guix.gnu.org/en/blog/2021/the-big-change/">in Guix
itself</a> and
<a href="https://guix.gnu.org/en/blog/2020/guile-3-and-guix/">in Guile</a>.
How well does such a command work? Well, it depends.</p><p>The project has two build farms; <code>bordeaux.guix.gnu.org</code> has been
keeping substitutes (pre-built binaries) of everything it built since
roughly 2021, while <code>ci.guix.gnu.org</code> keeps substitutes for roughly two
years, but there is currently no guarantee on the duration
substitutes may be retained.
Time traveling to a period where substitutes are available is
fine: you end up downloading lots of binaries, but that’s OK, you rather
quickly have your software environment at hand.</p><h1>Bumps on the build road</h1><p>Things get more complicated when targeting a period in time for which
substitutes are no longer available, as was the case for <code>v1.0.0</code> above.
(And really, we should assume that substitutes won’t remain available
forever: fellow NixOS hackers recently had to seriously consider
<a href="https://discourse.nixos.org/t/nixos-s3-long-term-resolution-phase-1/36493">trimming their 20-year-long history of
substitutes</a>
because the costs are not sustainable.)</p><p>Apart from the long build times, the first problem that arises in the
absence of substitutes is source code unavailability. I’ll spare you
the details for this post—that problem alone would deserve a book.
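As a brief aside, content archives such as Software Heritage identify source code intrinsically: for a single file, the identifier (SWHID) is a Git-style “blob” SHA-1 of its bytes. A minimal sketch (file name and contents are made up for illustration):

```shell
# Sketch: compute the intrinsic Software Heritage identifier (SWHID) of
# a file's contents.  For a single file it is the Git "blob" SHA-1 of
# its bytes, independent of where the file came from.
printf 'hello\n' > /tmp/example.txt
printf 'swh:1:cnt:%s\n' "$(git hash-object /tmp/example.txt)"
# prints: swh:1:cnt:ce013625030ba8dba906f756967f9e9ca394464a
```

Because the identifier depends only on the bytes, a source file can be looked up in the archive years later without trusting any particular hosting URL.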
Suffice to say that we’re lucky that we started working on <a href="https://guix.gnu.org/en/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/">integrating
Guix with Software
Heritage</a>
years ago, and that there has been great progress over the last couple
of years to get closer to <a href="https://ngyro.com/pog-reports/latest/">full package source code
archival</a> (more precisely: 94% of
the source code of packages available in Guix in January 2024 is
archived, versus 72% of the packages available in May 2019).</p><p>So what happens when you run the <code>time-machine</code> command above? It
brings you to May 2019, a time for which none of the official build
farms had substitutes until a few days ago. Ideally, thanks to
<a href="https://guix.gnu.org/manual/devel/en/html_node/Build-Environment-Setup.html">isolated build
environments</a>,
you’d build things for hours or days, and in the end all those binaries
will be here just as they were 5 years ago. In practice though, there
are several problems that isolation as currently implemented does <em>not</em>
address.</p><p><img src="/static/images/blog/safety-last.jpg" alt="Screenshot of movie “Safety Last!” with Harold Lloyd hanging from a clock on a building’s façade." /></p><p>Among those, the most frequent problem is <em>time traps</em>: software build
processes that fail after a certain date (these are also referred to as
“time bombs” but we’ve had enough of these and would rather call for a
ceasefire). This plagues a handful of packages out of almost 30,000 but
unfortunately we’re talking about packages deep in the dependency graph.
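In its simplest form, a time trap boils down to a comparison against a hard-coded date, as in this illustrative shell sketch (the expiry date is made up, standing in for something like a test certificate’s <code>notAfter</code> field):

```shell
# Illustrative "time trap": a test that passes only before a hard-coded
# date, mimicking a test suite whose X.509 certificates expire.
# (The expiry date is made up.)
expiry=$(date -u -d '2020-01-01' +%s)  # e.g., a certificate's notAfter date
now=$(date -u +%s)
if [ "$now" -gt "$expiry" ]; then
    echo "test FAILED: certificate expired"
else
    echo "test passed"
fi
```

Run today, this prints “test FAILED: certificate expired”; run with the clock set before 2020, it passes.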
Here are some examples:</p><ul><li><a href="https://issues.guix.gnu.org/56137">OpenSSL</a> unit tests fail
after a certain date because some of the X.509 certificates they use
have expired.</li><li><a href="https://issues.guix.gnu.org/44559">GnuTLS</a> had similar issues;
newer versions rely on
<a href="https://packages.guix.gnu.org/packages/datefudge/">datefudge</a> to
fake the date while running the tests and thus avoid that problem
altogether.</li><li>Python 2.7, found in Guix 1.0.0, also <a href="https://issues.guix.gnu.org/65378">had that
problem</a> with its TLS-related
tests.</li><li>OpenJDK <a href="https://issues.guix.gnu.org/68333">would fail to build at some
point</a> with this interesting
message: <code>Error: time is more than 10 years from present: 1388527200000</code> (the build system would consider that its data about
currencies is likely outdated after 10 years).</li><li>Libgit2, a dependency of Guix, had (has?) <a href="https://issues.guix.gnu.org/55326">time-dependent
tests</a>.</li><li>MariaDB tests <a href="https://issues.guix.gnu.org/34351">started failing in
2019</a>.</li></ul><p>Someone traveling to <code>v1.0.0</code> will hit several of these, preventing
<code>guix time-machine</code> from completing. A serious bummer, especially to
those who’ve come to Guix from the perspective of making their <a href="https://hpc.guix.info/blog/2023/06/a-guide-to-reproducible-research-papers/">research
workflow
reproducible</a>.</p><p>Time traps are the main road block, but there’s more! In rare cases,
there’s software influenced by kernel details not controlled by the
build daemon:</p><ul><li>Tests of the hwloc hardware locality library <a href="https://issues.guix.gnu.org/54767">would fail when
running on a Btrfs file system</a>.</li></ul><p>In a handful of cases, but important ones, builds might fail when
performed on certain CPUs. We’re aware of at least two cases:</p><ul><li>Python 3.9 to 3.11 would set a signal handler stack <a href="https://github.com/python/cpython/issues/91124">too small for
use on Intel Sapphire Rapids Xeon
CPUs</a> (it’s more
complicated than this but the end result is: it will no longer build
on modern hardware).</li><li>Firefox would reportedly <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1882015">crash on Raptor Lake CPUs running a buggy
version of their
firmware</a>.</li></ul><p>Neither time traps nor those obscure hardware-related issues can be
avoided with the isolation mechanism currently used by the build daemon.
This harms time traveling when substitutes are unavailable. Giving up
is not in the ethos of this project though.</p><h1>Where to go from here?</h1><p>There are really two open questions here:</p><ol><li>How can we tell which packages need to be “fixed”, and how:
building at a specific date, on a specific CPU?</li><li>How can we keep those aspects of the build environment (time, CPU
variant) under control?</li></ol><p>Let’s start with #2. Before looking for a solution, it’s worth
remembering where we come from. The build daemon runs build processes
with a <a href="https://www.man7.org/linux/man-pages/man2/chroot.2.html">separate root file
system</a>, under
dedicated user IDs, and in separate <a href="https://www.man7.org/linux/man-pages/man7/namespaces.7.html">Linux
namespaces</a>,
thereby minimizing interference with the rest of the system and ensuring
a <a href="https://guix.gnu.org/manual/devel/en/html_node/Build-Environment-Setup.html">well-defined build
environment</a>.
This technique was
<a href="https://archive.softwareheritage.org/browse/revision/9397cd30c8a6ffd65fc3b85985ea59ecfb72672b/">implemented</a>
by Eelco Dolstra for Nix in 2007 (with namespace support <a href="https://archive.softwareheritage.org/browse/revision/df716c98d203ab64cdf05f9c17fdae565b7daa1c/">added
in
2012</a>),
at a time when the word <em>container</em> had to do with boats and before
“Docker” became the name of a software tool. In short, the approach
consists in <em>controlling the build environment</em> in every detail (it’s at
odds with the strategy that consists in achieving reproducible builds
<a href="https://tests.reproducible-builds.org/debian/index_variations.html"><em>in spite</em> of high build environment
variability</a>).
That these are mere processes with a bunch of bind mounts makes this
approach inexpensive and appealing.</p><p>Realizing we’d also want to control the build environment’s date,
we naturally turn to Linux namespaces to address that—Dolstra, Löh, and
Pierron already suggested something along these lines in the conclusion
of their <a href="https://edolstra.github.io/pubs/nixos-jfp-final.pdf">2010 <em>Journal of Functional Programming</em>
paper</a>. Turns out
there <em>is</em> now a <a href="https://www.man7.org/linux/man-pages/man7/time_namespaces.7.html">time
namespace</a>.
Unfortunately it’s limited to <code>CLOCK_MONOTONIC</code> and <code>CLOCK_BOOTTIME</code>
clocks; the manual page states:</p><blockquote><p>Note that time namespaces do not virtualize the <code>CLOCK_REALTIME</code>
clock. Virtualization of this clock was avoided for reasons of
complexity and overhead within the kernel.</p></blockquote><p>I hear you say: <em>What about
<a href="https://packages.guix.gnu.org/packages/datefudge/">datefudge</a> and
<a href="https://packages.guix.gnu.org/packages/libfaketime/">libfaketime</a>?</em>
These rely on the <code>LD_PRELOAD</code> environment variable to trick the dynamic
linker into pre-loading a library that provides symbols such as
<code>gettimeofday</code> and <code>clock_gettime</code>. This is a fine approach in some
cases, but it’s too fragile and too intrusive when targeting arbitrary
build processes.</p><p>That leaves us with essentially one viable option: virtual machines
(VMs). The full-system QEMU lets you specify the initial real-time
clock of the VM with the <code>-rtc</code> flag, which is exactly what we need
(“user-land” QEMU such as <code>qemu-x86_64</code> does not support it). And of
course, it lets you specify the CPU model to emulate.</p><h1>News from the past</h1><p>Now, the question is: where does the VM fit? The author considered
writing a <a href="https://guix.gnu.org/manual/devel/en/html_node/Package-Transformation-Options.html">package
transformation</a>
that would change a package such that it’s built in a well-defined VM.
However, that wouldn’t really help: this option didn’t exist in past
revisions, and it would lead to a different build anyway from the
perspective of the daemon—a different
<a href="https://guix.gnu.org/manual/devel/en/html_node/Derivations.html"><em>derivation</em></a>.</p><p>The best strategy appeared to be
<a href="https://guix.gnu.org/manual/devel/en/html_node/Daemon-Offload-Setup.html"><em>offloading</em></a>:
the build daemon can offload builds to different machines over SSH; we
just need to let it send builds to a suitably-configured VM. To do
that, we can reuse some of the machinery initially developed for
<a href="https://guix.gnu.org/manual/devel/en/html_node/Virtualization-Services.html#index-childhurd_002c-offloading"><em>childhurds</em></a>
that takes care of setting up offloading to the VM: creating substitute
signing keys and SSH keys, exchanging secret key material between the
host and the guest, and so on.</p><p>The end result is a <a href="https://guix.gnu.org/manual/devel/en/html_node/Virtualization-Services.html#Virtual-Build-Machines">service for Guix System
users</a>
that can be configured in a few lines:</p><pre><code class="language-scheme">(use-modules (gnu services virtualization))
(operating-system
  ;; …
  (services (append (list (service virtual-build-machine-service-type))
                    %base-services)))</code></pre><p>The default setting above provides a 4-core VM whose initial date is
January 2020, emulating a Skylake CPU from that time—the right setup for
someone willing to reproduce old binaries. You can check the
configuration like this:</p><pre><code>$ sudo herd configuration build-vm
CPU: Skylake-Client
number of CPU cores: 4
memory size: 2048 MiB
initial date: Wed Jan 01 00:00:00Z 2020</code></pre><p>To enable offloading to that VM, one has to explicitly start it, like
so:</p><pre><code>$ sudo herd start build-vm</code></pre><p>From there on, every native build is offloaded to the VM. The key part
is that with almost no configuration, you get everything set up to build
packages “in the past”. It’s a Guix System-only solution; if you run
Guix on another distro, you can set up a similar build VM, but you will
have to go through, by hand, the cumbersome process that is taken care
of automatically here.</p><p>Of course it’s possible to choose different configuration parameters:</p><pre><code class="language-scheme">(service virtual-build-machine-service-type
         (virtual-build-machine
          (date (make-date 0 0 00 00 01 10 2017 0)) ;further back in time
          (cpu "Westmere")
          (cpu-count 16)
          (memory-size (* 8 1024))
          (auto-start? #t)))</code></pre><p>With a build VM with its date set to January 2020, we have been able to
rebuild Guix and its dependencies along with a bunch of packages such as
<code>emacs-minimal</code> from <code>v1.0.0</code>, overcoming all the time traps and other
challenges described earlier. As a side effect, substitutes
are now available from <code>ci.guix.gnu.org</code> so you can even try this at
home without having to rebuild the world:</p><pre><code>$ guix time-machine -q --commit=v1.0.0 -- build emacs-minimal --dry-run
guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
38.5 MB would be downloaded:
/gnu/store/53dnj0gmy5qxa4cbqpzq0fl2gcg55jpk-emacs-minimal-26.2</code></pre><p>For the fun of it, we went as far as <code>v0.16.0</code>, <a href="https://guix.gnu.org/blog/2018/gnu-guix-and-guixsd-0.16.0-released/">released in December
2018</a>:</p><pre><code>$ guix time-machine -q --commit=v0.16.0 -- \
environment --ad-hoc vim -- vim --version</code></pre><p>This is the furthest we can go since
<a href="https://guix.gnu.org/manual/devel/en/html_node/Channels.html">channels</a>
and the underlying mechanisms that make time travel possible did not
exist before that date.</p><p>There’s one “interesting” case we stumbled upon in that process: in
OpenSSL 1.1.1g (released April 2020 and packaged <a href="https://archive.softwareheritage.org/browse/revision/c4868e38289baf3a9a74bdf32166d321f7365725/">in December
2020</a>),
some of the test certificates are not valid <em>before</em> April 2020, so the
build VM needs to have its clock set to May 2020 or thereabouts.
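Certificate validity windows, the culprit here, can be inspected directly with OpenSSL. A sketch that generates a throwaway self-signed certificate and prints its bounds (paths and subject are made up):

```shell
# Sketch: create a throwaway self-signed certificate and print its
# validity window -- the very dates that turn test suites into time
# traps.  (Paths and subject are made up for illustration.)
openssl req -x509 -newkey rsa:2048 -nodes -subj '/CN=example' \
    -keyout /tmp/key.pem -out /tmp/cert.pem -days 365 2>/dev/null
openssl x509 -in /tmp/cert.pem -noout -dates
# prints lines of the form:
#   notBefore=...
#   notAfter=...
```

Checking the <code>notBefore</code>/<code>notAfter</code> bounds of a package’s test certificates is a quick way to tell what clock value a build VM needs.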
Booting the build VM with a different date can be done without
reconfiguring the system:</p><pre><code>$ sudo herd stop build-vm
$ sudo herd start build-vm -- -rtc base=2020-05-01T00:00:00</code></pre><p>The <code>-rtc …</code> flags are passed straight to QEMU, which is handy when
exploring workarounds…</p><p>The <a href="https://ci.guix.gnu.org/jobset/time-travel"><code>time-travel</code> continuous integration
jobset</a> has been set up to
check that we can, at any time, travel back to one of the past releases.
This at least ensures that Guix itself and its dependencies have
substitutes available at <code>ci.guix.gnu.org</code>.</p><h1>Reproducible research workflows reproduced</h1><p>Incidentally, this effort rebuilding 5-year-old packages has allowed us
to fix embarrassing problems. Software that accompanies research papers
that followed our <a href="https://hpc.guix.info/blog/2023/06/a-guide-to-reproducible-research-papers/">reproducibility
guidelines</a>
could no longer be deployed, at least not without this clock twiddling
effort:</p><ul><li><a href="https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone">code</a>
of <a href="https://doi.org/10.5281/zenodo.3886739"><em>[Re] Storage Tradeoffs in a Collaborative Backup Service for
Mobile Devices</em></a>, submitted
as part of the ReScience <a href="https://rescience.github.io/ten-years/"><em>Ten Years Reproducibility
Challenge</em></a> in June 2020,
and which is precisely about showcasing reproducible deployment with
Guix;</li><li><a href="https://archive.softwareheritage.org/browse/revision/707f00afef8f6ef1f29a7a4c961dd714f82833f5/">code</a>
of the 2022 Nature Scientific Data article entitled <a href="https://doi.org/10.1038/s41597-022-01720-9"><em>Toward
practical transparent verifiable and long-term reproducible research
using Guix</em></a>, which
relied on an April 2020 revision of Guix to deploy (Simon Tournier
who co-authored the paper <a href="https://simon.tournier.info/posts/2023-12-21-repro-paper.html">reported
earlier</a>
on a failed attempt showing just how challenging it was).</li></ul><p>It’s good news that we can now re-deploy these 5-year-old software
environments with minimum hassle; it’s bad news that holding this
promise took extra effort.</p><p>The ability to reproduce the environment of software that accompanies
research work should not be considered a mundanity or an exercise that’s
<a href="https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/">“overkill”</a>.
The ability to rerun, inspect, and modify software is the natural
extension of the scientific method. Without a companion reproducible
software environment, research papers <em>are merely the advertisement of
scholarship</em>, to paraphrase Jon Claerbout.</p><h1>The future</h1><p>The astute reader surely noticed that we didn’t answer question #1
above:</p><blockquote><p>How can we tell which packages need to be “fixed”, and how: building
at a specific date, on a specific CPU?</p></blockquote><p>It’s a fact that Guix so far lacks information about the date, kernel,
or CPU model that should be used to build a given package.
<a href="https://guix.gnu.org/manual/devel/en/html_node/Derivations.html">Derivations</a>
purposefully lack that information on the grounds that it cannot be
enforced in user land and is <em>rarely</em> necessary—which is true, but
“rarely” is not the same as “never”, as we saw. Should we create a
catalog of date, CPU, and/or kernel annotations for packages found in
past revisions? Should we define, for the long-term, an
all-encompassing derivation format? If we did and effectively required
virtual build machines, what would that mean from a
<a href="https://guix.gnu.org/en/blog/tags/bootstrapping/">bootstrapping</a>
standpoint?</p><p>Here’s another option: build packages in VMs running in the year 2100,
say, and on a baseline CPU. We don’t need to require all users to set
up a virtual build machine—that would be impractical. It may be enough
to set up the project build farms so they build everything that way.
This would allow us to catch time traps and <a href="https://en.wikipedia.org/wiki/Year_2038_problem">year 2038
bugs</a> before they bite.</p><p>Before we can do that, the <code>virtual-build-machine</code> service needs to be
optimized. Right now, offloading to build VMs is as heavyweight as
offloading to a separate physical build machine: data is transferred
back and forth over SSH over TCP/IP. The first step will be to run SSH
over a paravirtualized transport instead, such as <a href="https://www.man7.org/linux/man-pages/man7/vsock.7.html"><code>AF_VSOCK</code>
sockets</a>.
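Schematically, and purely as an illustration (the guest CID and port are made up, and this assumes a socat build with vsock support), the client side of SSH-over-vsock could be a <code>~/.ssh/config</code> fragment like:

```
# Hypothetical ~/.ssh/config fragment: reach the build VM over an
# AF_VSOCK socket (guest CID 3, port 22 assumed) instead of TCP/IP,
# using socat as the transport.
Host build-vm
    ProxyCommand socat - VSOCK-CONNECT:3:22
```

This removes the virtual network interface and TCP stack from the data path while keeping the existing SSH-based offloading machinery intact.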
Another avenue would be to make <code>/gnu/store</code> in the guest VM an overlay
over the host store so that inputs do not need to be transferred and
copied.</p><p>Until then, happy software (re)deployment!</p><h1>Acknowledgments</h1><p>Thanks to Simon Tournier for insightful comments on a previous version
of this post.</p><blockquote><p><em>Originally published <a href="https://guix.gnu.org/blog/2024/adventures-on-the-quest-for-long-term-reproducible-deployment/">on the Guix
blog</a>.</em></p></blockquote>

Guix-HPC Activity Report, 2023
Céline Acary-Robert, Emmanuel Agullo, Ludovic Courtès, Marek Felšöci, Konrad Hinsen, Arun Isaac, Ontje Lünsdorf, Pjotr Prins, Simon Tournier, Philippe Virouleau, Ricardo Wurmus, 2024-02-16

<p>We are pleased to publish the sixth Guix-HPC annual report.
Launched in 2017, Guix-HPC is a collaborative effort to <strong>bring
reproducible software deployment to scientific workflows and
high-performance computing</strong> (HPC). Guix-HPC builds upon the
<a href="https://guix.gnu.org">GNU Guix</a> software deployment tool to
empower HPC practitioners and scientists who
need reliability, flexibility, and reproducibility; it aims to
support Open Science and reproducible research.</p><p>Guix-HPC started as a joint software development project involving three
research institutes: <a href="https://www.inria.fr/en/">Inria</a>, the <a href="https://www.mdc-berlin.de/">Max
Delbrück Center for Molecular Medicine
(MDC)</a>, and the <a href="https://ubc.uu.nl/">Utrecht Bioinformatics
Center (UBC)</a>. GNU Guix for HPC and reproducible
research has since received contributions from many individuals and
organizations, including <a href="https://www.cnrs.fr/en">CNRS</a>, <a href="https://u-paris.fr/en/">Université
Paris Cité</a>, the <a href="https://uthsc.edu/">University of Tennessee Health
Science Center</a> (UTHSC), <a href="https://www.csl.cornell.edu/">Cornell
University</a>, and
<a href="https://www.amd.com">AMD</a>. HPC remains a conservative domain but over
the years, we have reached out to many organizations and people who
share our goal of improving upon the status quo when it comes to
software deployment.</p><p>This report highlights key achievements of Guix-HPC between <a href="https://hpc.guix.info/blog/2022/02/guix-hpc-activity-report-2021/">our
previous
report</a>
a year ago and today, February 2024. This year was marked by exciting
developments for HPC and reproducible workflows: the organization of a
<a href="https://hpc.guix.info/events/2023/workshop">three-day workshop in
November</a> on this very topic
where 120 researchers and HPC practitioners met, the expansion of the
package collection available to Guix users—including significant
contributions by AMD and new channels giving access to all of
Bioconductor, along with more ground work to meet the needs of HPC and
reproducible research.</p><h1>Outline</h1><p>Guix-HPC aims to tackle the following high-level objectives:</p><ul><li><em>Reproducible scientific workflows.</em> Improve the GNU Guix tool set
to better support reproducible scientific workflows and to simplify
sharing and publication of software environments.</li><li><em>Cluster usage.</em> Streamline Guix deployment on HPC clusters, and
provide interoperability with clusters not running Guix.</li><li><em>Outreach & user support.</em> Reach out to the HPC and scientific
research communities and organize training sessions.</li></ul><p>The following sections detail work that has been carried out in each of
these areas.</p><h1>Reproducible Scientific Workflows</h1><p>Supporting reproducible research workflows is a major goal for Guix-HPC.
This section looks at progress made on packaging and tooling.</p><h2>Packages</h2><p>The package collection available from Guix keeps growing: as of this
writing, Guix itself provides more than 29,000 packages, all free
software, making it <a href="https://repology.org/"><strong>the fifth largest free software
distribution</strong></a>. With the addition of scientific
computing <em>channels</em>, users have access to more than 52,000 packages!</p><p>We updated the
<a href="https://github.com/UMCUGenetics/hpcguix-web">hpcguix-web</a> package
browser and <a href="https://hpc.guix.info/browse">its Guix-HPC instance</a> to make it
easier to search these channels, to navigate them, and to get set up
using them. The <a href="https://hpc.guix.info/channels">channels</a> page lists
channels commonly used by the scientific community. A noteworthy
example is <a href="https://github.com/guix-science/guix-science">Guix-Science</a>,
now home to hundreds of packages. Most of these channels are under
<em>continuous integration</em>, with pre-built binaries being published from
build farms such as <a href="https://guix.bordeaux.inria.fr">that hosted by
Inria</a>.</p><p><img src="/static/images/blog/guix-cran.png" alt="Logo of Guix-Bioc." /></p><p>Expanding on the introduction of the
<a href="https://github.com/guix-science/guix-cran"><code>guix-cran</code></a> channel last
year, we are happy to announce the new
<a href="https://github.com/guix-science/guix-bioc"><code>guix-bioc</code></a> channel.
This new channel makes most of the
<a href="https://bioconductor.org">Bioconductor</a> collection of R packages
available as Guix packages. Substitutes are provided by the build
farm at <code>guix.bordeaux.inria.fr</code> to speed up installation times. The
channel <strong>augments the collection of R packages</strong> provided by the
Guix default channel and the <code>guix-cran</code> channel. Creating and
updating <code>guix-bioc</code> is fully automated and happens without any human
intervention. The channel itself is always in a usable state, because
updates are tested with <code>guix pull</code> before committing and pushing
them. The same limitations of the <code>guix-cran</code> channel with regard to
potential build failures due to undeclared build or runtime
dependencies also apply to this channel. Improvements to the CRAN
importer in Guix, however, have allowed us to reduce the failure rate
and raise the quality of both channels.</p><p>These two automated channels grow the number of R packages available
in reproducible Guix environments by 21,635 to a total of 24,187.
Unlike other efforts that aim to provide binaries of R packages, the
collection of R packages in Guix fully captures all dependencies,
including those that would otherwise be considered “system
dependencies”, insulating Guix environments from system-level changes
over time. The increasing coverage of package sources archived by
<a href="https://www.softwareheritage.org">Software Heritage</a> puts Guix in a
unique position as a solid foundation for reliable long-term
reproducible research with R.</p><p><img src="/static/images/blog/rocm-logo.png" alt="AMD ROCm logo." /></p><p>A major highlight this year is the <a href="https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/"><strong>100+ packages contributed by
AMD</strong></a>
for its ROCm and HIP toolchain for GPUs. Those include 5 versions of
the entire <a href="https://hpc.guix.info/package/hipamd">HIP</a>/<a href="https://hpc.guix.info/package/rocm-toolchain">ROCm
toolchain</a>, all the way
down to LLVM and including support in communication libraries
<a href="https://hpc.guix.info/package/ucx">ucx</a> and
<a href="https://hpc.guix.info/package/openmpi">Open MPI</a>. Anyone who has tried
to package or to build this will understand that this is a major
contribution: the software stack is complex, requiring careful assembly
of the right versions or variants of each component.</p><p>Those packages are a boost for supercomputer users. We have been
able to use them to run HIP/ROCm benchmarks on the French national
supercomputer
<a href="https://genci.fr/en/centre-informatique-national-de-lenseignement-superieur-cines">Adastra</a>,
which features AMD Instinct MI250X GPUs, leveraging <code>guix pack</code> to ship
the code. We expect this joint effort with AMD to continue so we can
deliver other parts of the stack—e.g., rocBLAS, rocFFT, and related math
libraries—and to enable ROCm support in other packages such as PyTorch and Tensorflow.</p><p>For those systems where the HIP/ROCm stack cannot be used, the <a href="https://hpc.guix.info/channel/guix-science-nonfree">Guix
Science Nonfree
channel</a> provides
various versions of CUDA and cuDNN. This channel now also provides
CUDA-enabled variants of packages from the <a href="https://hpc.guix.info/channel/guix-science">Guix Science
channel</a> that only support
CPU-based inference. Of note is the addition of both the CPU- and
CUDA-enabled variants of JAX, the machine learning framework for
accelerated linear algebra and automated differentiation of numerical
functions. Recent versions of <strong>Tensorflow 2</strong> and related Tensorflow
libraries are now also available, thanks to the addition of a Bazel
build system abstraction in the <a href="https://hpc.guix.info/channel/guix-science">Guix Science
channel</a>.</p><p>Other notable additions to the <a href="https://hpc.guix.info/channel/guix-hpc">Guix-HPC
channel</a> include the plethora of
dependencies needed to build <a href="https://github.com/GEOS-DEV/GEOS">GEOS</a>, a
geophysical simulation framework, and
<a href="https://hpc.guix.info/package/medinria">medInria</a>, a medical image
processing and visualization package, both contributed by Inria
engineers.</p><h2>Guix Packager, a Packaging Assistant</h2><p><a href="https://guix.gnu.org/manual/devel/en/html_node/Defining-Packages.html">Defining
packages</a>
for Guix is not all that hard but, as always, it is much harder the first
time you do it, especially when starting from a blank page or not
being familiar with the programming environment of Guix. <a href="https://guix-hpc.gitlabpages.inria.fr/guix-packager/">Guix
Packager</a> is a <strong>new
web user interface to get you started</strong>.</p><p><img src="/static/images/blog/guix-packager.gif" alt="Screenshot of Guix Packager." /></p><p>The interface aims to be intuitive: fill in forms on the left and it
produces a correct, ready-to-use package definition on the right.
Importantly, it helps avoid pitfalls that trip up many newcomers:
adding an input adds the right variable name and modules; turning tests
on and off or adding configure flags can be achieved without prior
knowledge of the likes of keyword arguments and G-expressions.</p><p>While the tool's feature set provides a great starting point, there are still a
few things that may be worth implementing. For instance, only the GNU and
CMake build systems are supported so far; it would make sense to include
a few others (Python-related ones might be good candidates).</p><p>Ultimately, Guix Packager does not intend to provide a full package definition
editor, but rather a simple entry point for people looking into starting to
write package definitions.
It complements a set of steps we've taken over time to make packaging in Guix
approachable. Indeed, while package definitions are
actually code written in the Scheme language, the <code>package</code> “language”
was designed <a href="https://arxiv.org/abs/1305.4584">from the get-go</a> to be
fully declarative—think JSON with parentheses instead of curly braces and
semicolons.</p><h2>Nesting Containerized Environments</h2><p>The <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-shell.html"><code>guix shell --container</code></a>
(or <code>guix shell -C</code>) command lets users create isolated software
environments—<em>containers</em>—providing nothing but the packages specified
on the command line. This has proved to be a great way to ensure the
run-time environment of one’s software is fully controlled, free from
interference from the rest of the system.</p><p>Recently though, a new use case came up, calling for support of <strong>nested
containers</strong>. As <a href="https://issues.guix.gnu.org/62411">Konrad Hinsen
explained</a>, the need for nested
containers arises, for example, when dealing with workflow execution
engines such as Snakemake and CWL: users may wish to use Guix to
deploy both the engine itself <em>and</em> the software environment of the
tasks the engine spawns.</p><p>This is now possible thanks to the new <code>--nesting</code> or <code>-W</code> option, to be
used in conjunction with <code>--container</code> or <code>-C</code>. This option lets users
create <em>nested containerized environments</em> as in this example:</p><pre><code>guix shell --container --nesting coreutils -- \
guix shell --container python</code></pre><p>The “outer” <code>shell</code> creates a container that contains nothing but
<code>coreutils</code>—the package that provides <code>ls</code>, <code>cp</code>, and other core
utilities; the “inner” <code>shell</code> creates a new container that contains
nothing but Python. For a Snakemake workflow, one would run:</p><pre><code>guix shell --container --nesting snakemake -- \
snakemake …</code></pre><p>… which in turn allows the individual tasks of the workflow to run <code>guix shell</code> as well.</p><h2>Concise Common Workflow Language</h2><p>The <a href="https://hpc.guix.info/blog/2022/01/ccwl/">Concise Common Workflow Language
(ccwl)</a> is a <strong>concise syntax to
express Common Workflow Language (CWL) workflows</strong>. It is implemented as an
EDSL (Embedded Domain Specific Language) in Guile Scheme. Unlike workflow
languages such as the Guix Workflow Language (GWL), ccwl is agnostic to
deployment. It does not use Guix internally to deploy applications. It
merely picks up applications from <code>PATH</code> and thus interoperates well with Guix
and any other package managers of the user's choice. ccwl also compiles to
CWL and thus reuses all tooling built to run CWL workflows. Workflows written
in ccwl may be freely reused by CWL users without impediment, thus ensuring
smooth collaboration between ccwl and CWL users.</p><p><a href="https://ccwl.systemreboot.net">ccwl 0.3.0 was released in January 2024</a>.
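For a flavor of that concise syntax, here is a hello-world-style sketch modeled on ccwl's documented examples; treat the exact keywords as assumptions and refer to the ccwl manual for the authoritative syntax:

```scheme
;; Sketch of a ccwl workflow modeled on its documented examples;
;; keyword details may differ across ccwl versions.
(define print
  (command #:inputs (message #:type string)
           #:run "echo" message
           #:outputs (printed-message #:type stdout)))

(workflow ((message #:type string))
  (print #:message message))
```

Compiling such a file with ccwl produces plain CWL that any CWL runner can execute.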
ccwl 0.3.0 comes with significantly better compiler diagnostics, detecting
errors early and providing helpful messages to users. ccwl 0.3.0 also
adds new constructs to express scattering workflow steps and other more
complex workflows.</p><h2>Ensuring Source Code Availability</h2><p>Our joint effort with <a href="https://www.softwareheritage.org">Software
Heritage</a> (SWH) has made major strides
this year on the two main fronts: <strong>increasing archive coverage, and
improving source code recovery capabilities</strong>. The two are closely
related but involve different work; together, they contribute to making
Guix a tool of choice for reproducible research workflows.</p><p><img src="/static/images/blog/swh-guix.png" alt="Medley of the Software Heritage and Guix logos, by Marla Da Silva." /></p><p>Timothy Sample has been leading the archival effort and closely
monitoring it. His latest <a href="https://ngyro.com/pog-reports/2024-01-26/"><em>Preservation of Guix
Report</em></a>, published in
January 2024, reveals that 94% of the package source code referred to by
Guix at that time is archived in SWH. That number has been steadily
increasing since we started this effort in 2019. Archival coverage for
the entire 2019–2024 period is 85%. Having identified the missing bits,
the SWH team is now retroactively <a href="https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/5222">ingesting package source code of
historical Guix
revisions</a>.</p><p>Guix’s ability to recover source code from SWH has improved in part
thanks to the newly-added support for bzip2-compressed archives in
<a href="https://ngyro.com/software/disarchive.html">Disarchive</a>, the tool
designed to allow Guix to recover exact copies of source code <em>tarballs</em>
such as <code>.tar.gz</code> and <code>.tar.bz2</code> files.</p><p>A longstanding issue for automatic recovery from SWH is a mismatch
between the cryptographic hashes used in Guix and in SWH to refer to
content—a problem identified <a href="https://guix.gnu.org/en/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/">early
on</a>.
This has been addressed by a recent SWH feature deployed in
January 2024: SWH now computes and exposes nar SHA256 hashes for
directories—the very hashes used in Guix package definitions. Those
hashes are added as an extension of the SWH data model called <em>external
identifiers</em> or <em>ExtIDs</em>; the HTTP interface lets us obtain the SWHID
corresponding to a <code>nar-sha256</code> ExtID, which is exactly what was
necessary to ensure <em>content-addressed access</em> in all cases.
Consequently, the fallback code in Guix was changed to use that method.
This will allow Guix to recover source code for version control systems
(VCS) other than Git, which was previously not possible.</p><p><img src="/static/images/blog/hpcguix-web-swh-badge.png" alt="Software Heritage badge as shown by the hpcguix-web package browser." /></p><p>To make SWH archival more tangible to users and packagers, we modified
the hpcguix-web package browser, visible <a href="https://hpc.guix.info/browse">on the Guix-HPC web
site</a>, to include a <strong>source code archival
badge</strong> on every package page. The badge, served by SWH, is currently
shown both for packages whose source code is fetched from a Git
repository, and for packages whose source code is fetched from a
tarball. The information is comparable to that checked by the <code>guix lint -c archival</code> command.</p><h2>Reproducible Research in Practice</h2><p>In February 2023, Marek Felšöci defended his PhD thesis entitled <a href="https://theses.hal.science/tel-04077474"><em>Fast
solvers for high-frequency
aeroacoustics</em></a>. The thesis
was part of a collaboration between Inria and Airbus and deals with
direct methods for solving coupled sparse/dense linear systems.
Chapter 8 of the manuscript explains the strategy that was used to
achieve reproducible and verifiable results and how Guix, Software
Heritage, and other tools support it. It is another testimonial showing
how <strong>reproducible computational workflows</strong> can be achieved, even in a
demanding HPC context.</p><p>In a talk entitled <a href="https://hpc.guix.info/events/2023/workshop/video/everyone-can-learn-how-to-guix/"><em>Everyone Can Learn How to
Guix</em></a>,
medical doctor Nicolas Vallet defended a similar thesis: tools such as
Guix can support reproducible research workflows and be viewed as key
enablers even in scientific domains one might think of as detached from
software deployment considerations.</p><p><a href="https://numpex.org/">NumPEx</a> is the <strong>French national program for
exascale HPC</strong>, launched in mid-2023 with a 41 M€ budget for 6 years.
Its <a href="https://numpex.org/exadi-development-and-integration/">Development and Integration
project</a> aims to
ensure the dozens of HPC libraries and applications developed by French
researchers can easily be deployed on national and European clusters,
with high quality assurance levels. Guix is one of the deployment tools
used to achieve those goals and is well poised to do so. The project has
just recruited two engineers to help with packaging, continuous
integration, and training in this context.</p><p>We hope this will not only help create synergies with the broader Guix
community, but also contribute to increasing awareness about
reproducible deployment in HPC circles. Meanwhile, conducting
reproducible research on supercomputers that lack Guix is already possible:
by creating an image with <code>guix pack</code>, deploying it on
the supercomputer, and setting up the host environment properly.
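Concretely, producing such an image boils down to building a relocatable pack along these lines (a sketch; the package names are illustrative and should be replaced by the actual software stack):

```
# Build a doubly-relocatable tarball; package names are illustrative.
$ guix pack --relocatable --relocatable openmpi hwloc
# Copy the resulting archive to the cluster and unpack it, e.g. in the
# user's home directory.
```

Passing <code>--relocatable</code> twice makes the packed binaries fall back to PRoot-based relocation when unprivileged user namespaces are unavailable on the target kernel, a common situation on HPC systems.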
Experiments have shown that it
does not lead to any significant performance difference compared to
the same code and software stack deployed natively. The motivation,
technical details, and performance study were presented in a talk
entitled <a href="https://hpc.guix.info/events/2023/workshop/video/reconciling-high-performance-computing-with-the-use-of-third-party-libraries-/"><em>Reconciling high-performance computing with the use of
third-party
libraries?</em></a></p><p>Another aspect related to reproducibile HPC research and development
is the environment used to write code, document it, post-process data,
produce scientific reports. Offering researchers and developers a way
to share the <strong>exact same working environment</strong> is one way to facilitate
collaboration. The <a href="https://elementaryx.gitlabpages.inria.fr/">Elementary Emacs configuration coupled with
Guix</a> (ElementaryX) project
is an attempt towards such an elementary yet reproducible environment.</p><h1>Cluster Usage and Deployment</h1><p>The sections below highlight the experience of cluster administration
teams and report on tooling developed around Guix for users and
administrators on HPC clusters.</p><h2>Usage at the German Aerospace Center</h2><p>The Institute of Networked Energy Systems of the German Aerospace Center
(DLR) has <strong>set up a Guix installation in its HPC system</strong> and transitioned
several workflows to Guix, which are related to <a href="https://ads.atmosphere.copernicus.eu/cdsapp#!/dataset/cams-solar-radiation-timeseries?tab=overview">remote sensing and
solar surface
radiation</a>
and feed data into the European Copernicus Atmosphere Monitoring Service
<a href="https://atmosphere.copernicus.eu/">CAMS</a>. Similar to containers, Guix
software stacks are almost independent of the host system. However,
container support in HPC systems is limited and still evolving. Guix
relocation options offer more flexibility, and the software stack has
been successfully deployed in HPC clusters available to the DLR (like
<a href="https://www.dlr.de/en/research-and-transfer/research-infrastructure/hpc-cluster/cara">CARA</a>,
<a href="https://www.dlr.de/en/research-and-transfer/research-infrastructure/hpc-cluster/caro">CARO</a>
and
<a href="https://www.dlr.de/en/latest/news/2023/02/a-new-era-in-geoinformation-with-terrabyte">terrabyte</a>),
thereby enabling easy scaling of the radiation services.</p><h2>Guix System Cluster at GLiCID</h2><p><a href="https://www.glicid.fr/">GLiCID</a> is the HPC center for research in the
French region <em>Pays de la Loire</em>, resulting from the merger of
pre-existing HPC centers in the region.</p><p>The installation of new machines in June 2023 has led to the launch of a
new common system infrastructure—identity management, SLURM services,
databases, etc.—mostly independent from the solutions provided by the
manufacturers. Installed on two remote data centers, the infrastructure
needs to be highly available, and its deployment can be complex. The
team wanted to guarantee simple, predictable redeployment of the
infrastructure in the event of problems.</p><p>Guix, already offered to all cluster users, has a proven track record of
reproducibility, a desirable feature not just for scientific software
but also for the infrastructure itself. That is why the team embarked
on an effort to <strong>build its infrastructure with Guix System</strong>, which led to
the development of Guix System services for HPC—for OpenLDAP, SLURM, and
more. <a href="https://hpc.guix.info/events/2023/workshop/video/reproducible-virtual-machine-management-with-guix/">They
reported</a>
on the impact of these choices at the Workshop in Montpellier, and are
currently making progress to reach a 100% <em>Guixified</em> infrastructure.</p><h2>Pangenome Genetics Research Cluster at UTHSC</h2><p>At UTHSC, Memphis (USA), we are running a 16-node large-memory <a href="http://genenetwork.org/facilities/">Octopus
HPC cluster</a> (438 real CPU cores)
dedicated to pangenome and genetics research. In 2023, the cluster
effectively doubled in size with the addition of 192 CPU cores running at
4 GHz, 144,000 GPU cores, and SSDs, with storage adding up to 200 TB of
LizardFS fiber-optic-connected distributed network storage.</p><p>Notable about this HPC cluster is that it is <em>administered by the users
themselves</em>. Thanks to Guix, <strong>we install, run and manage the cluster as
researchers</strong>—and roll back in case of a mistake. UTHSC IT manages the
infrastructure—i.e., physical placement, electricity, routers and
firewalls—but beyond that there are no demands on IT. Thanks to
out-of-band access, we can completely (re)install machines
remotely. Octopus runs Guix on top of a minimal Debian install and we
are experimenting with pure Guix virtual machines and nodes that can be
run on demand. Almost all deployed software has been packaged in Guix
and can be installed on the head-node by regular users on the cluster
without root access. This same software is shared through NFS on the
nodes. See the
<a href="https://git.genenetwork.org/guix-bioinformatics/">guix-bioinformatics</a>
channel for all deployment configuration.</p><p>At FOSDEM 2023, Arun Isaac presented Tissue, our <a href="https://archive.fosdem.org/2023/schedule/event/tissue/">minimalist Git+plain text issue tracker</a> that allows us to move away from GitHub source code hosting, continuous integration (CI), and issue trackers.
We have also started to use Guix with the <a href="https://hpc.guix.info/blog/2022/01/ccwl-for-concise-and-painless-cwl-workflows/">Concise Common Workflow Language (CCWL)</a> for reproducible pangenome workflows (see above) on our Octopus HPC.</p><h2>Supporting RISC-V</h2><p>RISC-V is making inroads with HPC, e.g. <a href="https://riscv.org/blog/2023/07/risc-v-summit-europe-2023-highlights-from-barcelona/">in Barcelona</a> and with the new Barcelona Supercomputing Center Sargantana chip.</p><p>Christopher Batten (Cornell) and Michael Taylor (University of Washington) are in charge of <strong>creating the NSF-funded RISC-V supercomputer</strong> with 2,000 cores per node and 16 nodes in a rack (NSF PPoSS grant 2118709), targeting Guix driven pangenomic workloads by Erik Garrison, Arun Isaac, Andrea Guarracino, and Pjotr Prins.</p><p>The supercomputer will incorporate Guix and the GNU Mes bootstrap, with input from Arun Isaac, Efraim Flashner and others.
<a href="https://nlnet.nl">NLNet</a> funds work by Efraim Flashner on RISC-V support for the Guix <code>riscv64</code> target, as well as the GNU Mes RISC-V bootstrap project with Ekaitz Zarraga, Andrius Štikonas, and Jan Nieuwenhuizen. The bootstrap is now working from stage0 to tcc-boot0.</p><p>TinyCC compiles for the RISC-V target, but still has some issues to resolve. The next steps include compiling the GNU C library, various versions of GCC, and packages beyond.
GNU Mes 0.25.1 was released with RISC-V support and a <code>bootstrappable-tcc</code> branch. Both are available in Guix, though the RISC-V bootstrap is not yet enabled by default.</p><h1>Outreach and User Support</h1><p>Guix-HPC is in part about “spreading the word” about our approach to
reproducible software environments and how it can help further the goals of
reproducible research and high-performance computing development. This section
summarizes talks and training sessions given this year.</p><h2>Talks</h2><p>Since last year, we gave the following talks:</p><ul><li><a href="https://fosdem.org/2024/schedule/event/fosdem-2024-2651-making-reproducible-and-publishable-large-scale-hpc-experiments/"><em>Making reproducible and publishable large-scale HPC
experiments</em></a>,
HPC & Big Data track, FOSDEM, Feb. 2024 (Philippe Swartvagher)</li><li><a href="https://simon.tournier.info/posts/2023-12-14-seminar-pasteur.html"><em>Toward practical transparent, verifiable and long-term reproducible
research using
Guix</em></a>,
Institut Pasteur, Dec. 2023 (Simon Tournier)</li><li><a href="https://nextcloud.init.mpg.de/index.php/s/RgB7H9L4yart69z"><em>Reproducible software deployment in scientific computing</em></a>,
Event of the Max Planck Society, Sept. 2023 (Ricardo Wurmus)</li><li><em>Guix: Funktionale Paketverwaltung zur wirklichen Reproduzierbarkeit</em>,
Second IT4Science Days, Meeting of the Helmholtz Association and the Max Planck Society, Sept. 2023 (Ricardo Wurmus)</li><li><a href="https://2023.programming-conference.org/track/programming-2023-papers#program"><em>Building a Secure Software Supply Chain with
GNU Guix</em></a>,
Programming Conference, March 2023 (Ludovic Courtès)</li><li><a href="https://simon.tournier.info/posts/2023-02-23-seminar-irill.html"><em>Functional programming paradigm applied to package management: toward
reproducible computational
environment</em></a>,
IRILL, Feb. 2023 (Simon Tournier)</li><li><a href="https://archive.fosdem.org/2023/schedule/event/openresearch_guix/"><em>Guix, toward practical transparent, verifiable and long-term reproducible
research</em></a>,
Open Research Tools and Technology track,
FOSDEM, Feb. 2023 (Simon Tournier)</li><li><a href="https://scienceouverte.unistra.fr/websites/science-ouverte/science_ouverte/fichiers_23/rllr-1.pdf"><em>Vers une étude expérimentale reproductible avec GNU
Guix</em></a>,
<a href="https://scienceouverte.unistra.fr/formations/rencontres-logiciels-libres-de-recherche">Rencontres sur les logiciels libres de
recherche</a>,
Université de Strasbourg, Feb. 2023 (Marek Felšöci)</li><li><a href="https://archive.fosdem.org/2023/schedule/event/cpu_tuning_gnu_guix/"><em>Reproducibility and performance: why
choose?</em></a>,
HPC & Big Data track, FOSDEM, Feb. 2023 (Ludovic Courtès)</li></ul><p>To this list we should add 11 talks given for the First Workshop on
Reproducible Software Environments for Research and High-Performance
Computing, held in November 2023, for which <a href="https://hpc.guix.info/events/2023/workshop/program/">videos are now
on-line</a>.</p><h2>Events</h2><p>As in previous years, Pjotr Prins and Manolis Ragkousis spearheaded the organization of the
<a href="https://archive.fosdem.org/2023/schedule/track/declarative_and_minimalistic_computing/">“Declarative and minimalistic computing”
track</a>
at <strong>FOSDEM 2023</strong>, which was home to several Guix talks, along with the
satellite <strong>Guix Days</strong> where 50 Guix contributors gathered.</p><p>This year, we held a <a href="https://hpc.guix.info/blog/2023/05/reproducible-research-hackathon-let-redo/">second <strong>on-line reproducible research
hackathon</strong></a>. This hackathon was a collaborative effort to
leverage Guix to achieve reproducible software deployment for articles contributed to the online
journal <a href="https://rescience.github.io/">ReScience C</a>. As outlined in our <a href="https://hpc.guix.info/blog/2023/07/reproducible-research-hackathon-experience-report/">write-up on the experience</a>, this served as an excellent opportunity to put into practice our <a href="https://hpc.guix.info/blog/2023/06/a-guide-to-reproducible-research-papers/">guide to reproducible research
papers</a>,
and it helped us identify open issues for long-term and archivable
reproducibility.</p><p><img src="/static/images/workshop-group-photo-2023.jpg" alt="Group picture of the attendees on Friday, November 10th, 2023. By Tess Gobain." /></p><p>This year we organized the <a href="https://hpc.guix.info/events/2023/workshop/"><strong>First Workshop on Reproducible Software
Environments for Research and High-Performance
Computing</strong></a>, which took
place in Montpellier, France, in November 2023. Coming from France
primarily but also from Czechia, Germany, the Netherlands, Slovakia,
Spain, and the United Kingdom, among other countries, 120 people—scientists,
high-performance computing (HPC) practitioners, system administrators,
and enthusiasts alike—came to listen to the talks, attend the tutorials,
and talk to one another.</p><p>Our ambition was to gather people from diverse backgrounds with a shared
interest in improving their research workflows and development
practices. The 11 talks and 8 tutorials, along with the hallway
discussions and group dinner, have allowed us to share skills and
experience. Videos of the talks edited by the video team at Institut
Agro, our host, are available <a href="https://hpc.guix.info/events/2023/workshop/program/">on the event’s web
site</a>.</p><p>Many thanks to our publicly-funded academic sponsors who made this event
possible: ISDM, our primary sponsor for this event, Institut Agro for
hosting the workshop in such a beautiful place, and EuroCC² and Inria
Academy for their financial and logistical support. We look forward to
organizing a second edition!</p><h2>Training Sessions</h2><p>For the French HPC Guix community, we continued the monthly on-line event called <a href="https://hpc.guix.info/events/2022/café-guix/"><strong>Café Guix</strong></a>, originally started in October 2021. Each month, a user or developer informally presents a Guix feature or workflow and answers questions. These sessions are now recorded and are available on the web page, gathering up to 70 people. This is <a href="https://hpc.guix.info/events/2024/café-guix/">continuing in 2024</a>.</p><p>Pierre-Antoine Bouttier and Ludovic Courtès ran a 4-hour Guix training
session as part of the <a href="https://calcul.math.cnrs.fr/2023-06-anf-ust4hpc.html"><strong>User Tools for
HPC</strong></a> (UST4HPC) event
organized by CNRS (<em>action nationale de formation</em>, ANF) in June 2023.
The session targeted an audience of HPC system administrators with no
prior experience with Guix. Material (in French) is <a href="https://gitlab.inria.fr/guix-hpc/ust4hpc-2023">available
on-line</a>.</p><p>Marek Felšöci and Ludovic Courtès ran a 4-hour tutorial as part of the
<a href="https://2023.compas-conference.fr/">Compas</a> HPC conference, in
June 2023. The tutorial showed how to devise reproducible research workflows
combining the literate programming facilities of Org-Mode with Guix.
Supporting material <a href="https://gitlab.inria.fr/tutoriel-guix-compas-2023/">is available
on-line</a>.</p><p>On September 27, Ricardo Wurmus hosted a 3-hour tutorial on the use of
Guix for reproducible science as a session at the second <em>IT4Science
Days</em>, a joint meeting of representatives of the <em>Helmholtz
Association of German Research Centres</em> and the <em>Max Planck Society</em>.
The workshop was attended by system administrators and scientists hailing from research institutes all over Germany.</p><p>The <a href="https://hpc.guix.info/events/2023/workshop/"><strong>workshop on reproducible software
environments</strong></a> that took
place in Montpellier, France, in November 2023 was home to 8 tutorials,
half of which about Guix. Each Guix tutorial had a different target
audience: users-to-be (people with no prior experience with Guix),
novice packagers, experienced packagers, and system administrators.
Supporting material is available on the web page of the event.</p><p>A new <strong>MOOC on Reproducible Research practices</strong> has almost been completed. It will be stress-tested in February 2024 and open to the public on the <a href="https://www.fun-mooc.fr/">platform FUN</a> in spring. One of its three modules is about reproducible computational environments, introducing the various obstacles to reproducibility and presenting practical solutions. One of them is Guix, and in particular Guix containers defined by manifest files and frozen in time through channel files. Exporting such containers to Docker and Singularity is also discussed, because of the importance of these technologies in HPC.</p><h1>Personnel</h1><p>As part of Guix-HPC, participating institutions
have dedicated work hours to the project, which we summarize here.</p><ul><li>Inria: 3.5 person-years (Ludovic Courtès and Romain Garbage;
contributors to the Guix-HPC channel: Emmanuel Agullo, Julien
Castelnau, Luca Cirrottola, Marek Felšöci, Marc Fuentes, Nathalie
Furmento, Gilles Marait, Florent Pruvost, Philippe Swartvagher;
system administrator in charge of Guix on the
PlaFRIM and Grid’5000 clusters: Julien Lelaurain)</li><li>University of Tennessee Health Science Center (UTHSC): 3+ person-years (Efraim Flashner, Bonface Munyoki, Fred Muriithi, Arun Isaac, Andrea Guarracino, Erik Garrison and Pjotr Prins)</li><li>CNRS: 0.2 person-year (Konrad Hinsen)</li><li>CNRS and Université Grenoble-Alpes (GRICAD): 0.2 person-year (Céline Acary-Robert, Pierre-Antoine Bouttier)</li><li>Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC): 2 person-years
(Ricardo Wurmus, Navid Afkhami, and Mădălin Ionel Patrașcu)</li><li>Université Paris Cité: 0.75 person-year (Simon Tournier)</li></ul><p>Guix itself is a collaborative effort, receiving code contributions
from about 100 people every month, along with lots of crucial non-coding
contributions: organizing events, writing documentation, giving
tutorials, and more.</p><h1>Perspectives</h1><p>As the second decade dawns on the GNU Guix project, we shall take the
opportunity not only to look back on past achievements but also to evaluate
our current position with respect to our goals and adjust our
trajectory if necessary. Previous issues of the Activity Report had a
common refrain: the importance of continuous efforts to <strong>connect the
communities</strong> that meet at the intersection of Open Science,
reproducible research, software development, system administration,
and systems design. This issue is no different—the Guix-HPC effort
remains committed to strengthening the ability of these communities to
establish practices that further Open Science and make reproducible
research workflows accessible.</p><p>The <a href="https://hpc.guix.info/events/2023/workshop/"><strong>workshop on reproducible software
environments</strong></a> in
Montpellier may serve as an example of what this may look like in
practice. The presenters in these sessions discussed issues of
reproducible research and showcased the various roles Guix can assume
in a diverse community of research practitioners: whether as the core
of a platform for ad-hoc research environments; as the nexus that
binds medical data, the tools of interpretation, and the scientific
publication; or as the workhorse for reliably deploying entire HPC
sites. As a project whose development prioritizes increasing user
autonomy, Guix has clearly found its niche among enthusiastic Open
Science practitioners in a wide range of scientific fields.</p><p>While these activities are certainly encouraging, we need to
acknowledge the fact that this level of engagement is not
representative of the impact Guix has had on the wider scientific
community. Challenges remain in bringing all the benefits and
guarantees that Guix provides to <strong>where researchers actually do their
computing</strong>, to the systems that system administrators get to build and
maintain, and to the existing platforms and networks that represent
the landscape in which computer-aided research takes place.</p><p>On the technical side, this could mean contributing extensions to
existing workflow systems like Snakemake or Nextflow; developing tools
and implementing adapters for deploying Guix containers and virtual
machine images to platforms like OpenStack; or bridging gaps to
support users of commercial third-party cloud computing platforms
whose moats remain difficult to cross without leaving user autonomy
behind.</p><p>These technical goals are, of course, informed by the needs of members
of the reproducible research community who are currently represented
in the Guix-HPC efforts. In the coming year, we want to continue to
reach out to the wider community by organizing training sessions and
workshops, and to gain better insight into how we can improve Guix to
serve their needs. It is our mission to put the tools we build in the
hands of practitioners at large—and to shape these tools together.
Let’s talk—we’d love to <a href="https://hpc.guix.info/about">hear from
you</a>!</p>HIP and ROCm come to GuixLudovic Courtès, Thomas Gibson, Kjetil Haugen, Florent Pruvostguix-devel@gnu.org2024-01-30T15:30:00Z<p>We have some exciting news to share: AMD has just contributed 100+ Guix
packages adding several versions of the whole HIP and ROCm stack!
<a href="https://github.com/ROCm/ROCm">ROCm</a> is AMD’s <em>Radeon Open Compute
Platform</em>, a set of low-level support tools for general-purpose
computing on graphics processing units (GPGPUs), and
<a href="https://github.com/rocm-developer-tools/hip">HIP</a> is the <em>Heterogeneous
Interface for Portability</em>, a language one can use to write code
(computational kernels) targeting GPUs or CPUs. The whole stack is free
and “open source” software—a breath of fresh air!—and is seeing
increasing adoption in HPC. <em>And</em>, it can now be deployed with Guix!</p><p>In this post, written by AMD engineers and Inria research software
engineers, we look at the packages AMD contributed and how you can use
them, and we discuss the use cases at AMD and relation with the French and
European supercomputing environments.</p><p><img src="/static/images/blog/rocm-logo.png" alt="AMD ROCm logo." /></p><h1>More than 100+ packages</h1><p>The 100+ packages Kjetil Haugen and Thomas Gibson of AMD contributed
to the <a href="https://hpc.guix.info/channel/guix-hpc">Guix-HPC channel</a>
include 5 versions of the entire
<a href="https://hpc.guix.info/package/hipamd">HIP</a>/<a href="https://hpc.guix.info/package/rocm-toolchain">ROCm
toolchain</a>, all the way
down to LLVM and including support in communication libraries
<a href="https://hpc.guix.info/package/ucx">ucx</a> and
<a href="https://hpc.guix.info/package/openmpi">Open MPI</a>. Anyone who’s tried
to package or to build this will understand that this is a major
contribution: the software stack is complex, requiring careful assembly
of the right versions or variants of each component.</p><p>As always with Guix, a key element here is that the package set is
<em>self-contained</em>: these packages as well as those that depend on them do
not and in fact cannot rely on an external ROCm installation, contrary
to what is customary in HPC
environments.
This is what has allowed us to run the exact same software stack both at
AMD and on the French HPC clusters, as we will see below.</p><p>The foci of the initial packaging effort are to create a solid
interface between Guix and ROCm, and to provide the components
needed to start leveraging Guix for developing and deploying ROCm
applications. To that end we provide two primary packages as the
foundation for the AMD ROCm stack:</p><ol><li>The <a href="https://hpc.guix.info/package/rocm-toolchain">ROCm toolchain</a></li><li>The HIP runtime for the AMD platform: <a href="https://hpc.guix.info/package/hipamd">hipamd</a></li></ol><p>Note that all ROCm packages in Guix are considered experimental as
the modest patching required to adapt to the Guix ecosystem
implies that they deviate from the officially released ROCm binaries.
Also note that we may modify the design as we gain experience with
using Guix in our daily work.</p><p>The ROCm toolchain is analogous to
<a href="https://hpc.guix.info/package/clang-toolchain"><code>clang-toolchain</code></a>,
and provides the ROCm variants of core LLVM components, such
as clang, clang runtime, lld, libomp, and associated headers/binaries.
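</p><p>As a quick illustration (a sketch; the exact output depends on the packaged version), one can enter an environment containing just the toolchain and query its compiler:</p><pre><code class="language-bash">guix shell rocm-toolchain -- clang --version</code></pre><p>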
In addition, the ROCm toolchain also provides the necessary
ROCr/HSA runtimes and device libraries required for GPU offloading
support. All <a href="https://rocm.docs.amd.com/en/docs-5.5.1/release/gpu_os_support.html#linux-supported-gpus">supported GPU architectures</a>
can be found via AMD’s official ROCm documentation.</p><p>The implementation of the HIP runtime for AMD GPUs, <a href="https://hpc.guix.info/package/hipamd">hipamd</a>, is an extension of the ROCm
toolchain which provides necessary headers and the compiler wrapper
<code>hipcc</code>. This is the primary user-facing package for developing
or deploying applications using HIP; it provides a basic toolchain
for most GPU kernel development, but does not include math libraries such
as rocBLAS or rocFFT. Math libraries will be provided at a later date.</p><p>Because both hardware and software advance quite
rapidly, we make generous use of generator functions that enable
the installation of multiple versions of ROCm/HIP to ensure that both
existing stable versions and the latest releases can be made easily
available. Having older versions available ensures that projects
relying on a particular release of ROCm/HIP are not disrupted.
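</p><p>To see which versions are available at a given channel revision, one can list the matching packages; the set of versions returned will of course vary over time:</p><pre><code class="language-bash">guix package --list-available=hipamd</code></pre><p>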
This also enables developers to examine performance impacts between
versions to help guide their optimization efforts and track
regressions/improvements.</p><p>As an application developer using Guix, you can use the
<code>guix shell</code> command to create environments (on top of your
system environment or completely isolated) with a fully functional
HIP toolchain for any version you specify. For example:</p><pre><code class="language-bash">guix shell hipamd@5.7.1</code></pre><p>This shell will contain not only the standard ROCm-based Clang toolchain
and its associated compilers/linkers, but will also
provide <code>hipcc</code> and its associated utilities such as <code>hipconfig</code>
(for HIP and Clang versions, include paths, and built-in flags)
and <code>rocminfo</code> (for querying device information).</p><pre><code class="language-shell">[env]$ ls -l `which hipcc`
lrwxrwxrwx 1 root root 66 Dec 31 1969 /gnu/store/2j5hqm1rk7q8h3ivwklpwmiv8nzkq15v-profile/bin/hipcc -> /gnu/store/kcfisihalab9fh75dd15rzwj30mv34bk-hipamd-5.7.1/bin/hipcc
[env]$ hipcc --version
HIP version: 5.7.1
clang version 17.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /gnu/store/r9zz6hjmgs2c79091s0s9zc43d0zq9vc-rocm-toolchain-5.7.1/bin</code></pre><p>As an illustrative example, we can clone the open-source STREAM project for GPUs,
<a href="https://github.com/UoB-HPC/BabelStream">BabelStream</a>, and directly compile and
run the HIP implementation of the benchmark:</p><pre><code class="language-shell">[env]$ git clone git@github.com:UoB-HPC/BabelStream.git</code></pre><p>Once the repository is cloned, we can build the project using CMake
as shown below:</p><pre><code class="language-shell">[env]$ cd BabelStream/
[env]$ cmake -Bbuild -H. -DMODEL=hip -DCMAKE_CXX_COMPILER=hipcc
[env]$ cmake --build build</code></pre><p>If neither Git nor CMake is available on your system, you can simply add both <code>git</code>
and <code>cmake</code> to your <code>guix shell</code> command to automatically install them into your environment!</p><p>And finally, you can run the executable and immediately observe
the measured streaming performance:</p><pre><code class="language-shell">[env]$ ./build/hip-stream
BabelStream
Version: 5.0
Implementation: HIP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using HIP device AMD Radeon RX 6800 XT
Driver: 50731921
Memory: DEFAULT
Init: 0.150206 s (=5361.344563 MBytes/sec)
Read: 0.212430 s (=3790.920912 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        520715.707  0.00103     0.00104     0.00103
Mul         450652.522  0.00119     0.00120     0.00119
Add         438387.222  0.00184     0.00186     0.00184
Triad       448402.828  0.00180     0.00180     0.00180
Dot         438838.728  0.00122     0.00123     0.00123</code></pre><p>This example shows how to obtain an interactive development environment
with <code>guix shell</code>, but if all you want is BabelStream, there’s a
ready-to-use <a href="https://hpc.guix.info/package/babelstream-hip">package</a>.</p><h1>Benchmarks</h1><p><img src="/static/images/blog/adastra-banner.jpg" alt="Banner showing of one of the Adastra racks." /></p><p>Adastra, one of the French national supercomputers, builds upon AMD
GPUs. It’s a <a href="https://genci.fr/en/centre-informatique-national-de-lenseignement-superieur-cines">78 PFlop
machine</a>
that was <a href="https://genci.fr/actualites/adastra-near-30-grands-challenges-towards-more-sustainable-science-and-already-first">ranked #3 in the November 2023 edition of
Green500</a>.
ROCm and HIP are available pre-installed on Adastra, but naturally, we
at Inria wanted to ensure that those packages that had been tested at
AMD would also give the expected performance on this machine.
Guix is currently unavailable on Adastra, so
we created a bundle of <a href="https://hpc.guix.info/package/hpcg">hpcg</a>, a
synthetic benchmark that exercises HIP, to ship it over to Adastra:</p><pre><code>guix pack -RR hpcg bash-minimal -S /bin=bin</code></pre><p>After unpacking, the resulting bundle lets us run <code>hpcg</code> on a single
node of Adastra—each node contains 4 AMD MI250X GPUs, each with 2
Graphics Compute Dies (GCDs) for a total of 8 GCDs per node. We’d first
allocate 8 CPUs on one node with SLURM:</p><pre><code>salloc --time=01:00:00 --nodes=1 --ntasks-per-node=8 --cpus-per-task=8 \
--gpus-per-task=1 --threads-per-core=1 --exclusive --account=ces1926 \
--constraint=MI250 --mem=256000
ssh $SLURM_NODELIST</code></pre><p>… and then run our Guix-built <code>hpcg</code> on the compute node, with 8 MPI
processes:</p><pre><code>module purge
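# Beforehand, unpack the relocatable archive produced by ‘guix pack’
# (the tarball name below is illustrative):
#   mkdir -p $HOME/guix/hpcg && tar xf hpcg-pack.tar.gz -C $HOME/guix/hpcg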
GUIX_ROOT=$HOME/guix/hpcg
${GUIX_ROOT}/bin/mpirun -n 8 --map-by L3CACHE \
--launch-agent ${GUIX_ROOT}/bin/orted \
-x GUIX_EXECUTION_ENGINE=performance \
${GUIX_ROOT}/bin/rochpcg 280 280 280 180</code></pre><p>Notice that we’re using the Guix-provided <code>mpirun</code>. We run <code>module purge</code> to avoid interference from environment modules available on the
system. By setting <code>GUIX_EXECUTION_ENGINE</code> to <code>performance</code>, we
instruct the Guix-provided wrapper of <code>hpcg</code> to <a href="https://hpc.guix.info/blog/2020/05/faster-relocatable-packs-with-fakechroot/">select a relocation
mechanism with no
overhead</a>.</p><p>The benchmark prints the kind of output we expected:</p><pre><code>Total Time: 181.62 sec
Setup Time: 0.06 sec
Optimization Time: 0.12 sec
DDOT = 1809.6 GFlop/s (14476.5 GB/s) 226.2 GFlop/s per process ( 1809.6 GB/s per process)
WAXPBY = 804.0 GFlop/s ( 9648.2 GB/s) 100.5 GFlop/s per process ( 1206.0 GB/s per process)
SpMV = 1465.6 GFlop/s ( 9229.1 GB/s) 183.2 GFlop/s per process ( 1153.6 GB/s per process)
MG = 1935.1 GFlop/s (14934.8 GB/s) 241.9 GFlop/s per process ( 1866.9 GB/s per process)
Total = 1795.6 GFlop/s (13616.4 GB/s) 224.4 GFlop/s per process ( 1702.1 GB/s per process)
Final = 1647.8 GFlop/s (12495.8 GB/s) 206.0 GFlop/s per process ( 1562.0 GB/s per process)</code></pre><p>The software stack was packaged once and can now be used on a variety of
machines without spending hours or days in deployment and testing. That
alone is no small feat in a world where <em>ad hoc</em> HPC cluster deployments
remain the norm.</p><h1>Guix at AMD</h1><p><img src="/static/images/blog/amd-lab-notes.png" alt="Logo of “AMD lab notes”." /></p><p>Currently, the use of Guix within AMD is a grassroots effort among members
of the Data Center GPU Software Solutions Group. The team engages in porting
and optimization of HPC applications across a variety of engineering
disciplines, organizes ROCm training and hackathons, provides feedback to
ROCm development teams, and participates in the bring-up process
preceding the release of new hardware. More details about our activities
can be found at <a href="https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-readme/">AMD lab notes</a>.</p><p>Compared to most engineers, we touch a larger number of applications, across a
larger number of HPC systems, and with a greater variety of software dependencies
and GPU architectures. An immediate consequence is that the overhead of dependency
management can become quite significant. Moreover, the effort is often duplicated
between engineers working on applications with similar dependencies, system
administrators providing environment modules, and deployment engineers preparing
container images and recipes.</p><p>As a functional package manager, Guix promises deduplication and reproducibility.
In other words, <strong>if a package description is created by someone somewhere, it can
be used by anyone anywhere</strong>! Guix is already providing a lot of value for individual
engineers. The primary use case is to allow the use of less contested resources
for development (workstations with gaming cards) and to reserve more contested resources
for performance testing (nodes with emerging GPU architectures). We are currently
considering using Guix to <a href="https://hpc.guix.info/blog/2022/05/back-to-the-future-modules-for-guix-packages/">create environment
modules</a>
and are working on integrating
<a href="https://guix.gnu.org/en/cuirass/">Cuirass</a> into engineering workflows.</p><p>After using Guix extensively to package ROCm, there are two things missing to
better support GPU-based development. First, a mechanism for running unit tests
on the GPU. This is currently impossible because the isolated environments in
which Guix builds packages do not expose the GPU. Second, a mechanism to
specify the target GPU architecture on the fly—e.g., through package transformations.
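</p><p>Were such a transformation available, its usage might resemble Guix’s existing package transformation options; the <code>--with-gpu-arch</code> flag below is purely hypothetical:</p><pre><code class="language-bash">guix shell hipamd --with-gpu-arch=hipamd=gfx90a</code></pre><p>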
The size of many GPU libraries is proportional to the number of GPU architectures
supported and limiting the support only to the GPUs available on the system of
interest is good software hygiene and may significantly reduce compilation time.</p><p>Beyond that, we are mostly happy with the range of functionality Guix offers.
However, we would like a more interactive debugging environment. Keeping the build
directory with <code>guix build -K</code> and subsequently running <code>guix shell --container</code>
on that directory as described in the <a href="https://guix.gnu.org/manual/devel/en/html_node/Debugging-Build-Failures.html">Guix manual</a>
gets us close, but providing a <code>gdb</code>-like user experience where we can set breakpoints,
and list, inspect, step through, modify, and rerun build phases would be helpful.</p><h1>HIP, Guix, and HPC in Europe</h1><p>HPC research teams at Inria develop software ranging from run-time
support libraries such as <a href="https://hpc.guix.info/package/starpu">StarPU</a>
and <a href="https://hpc.guix.info/package/hwloc">hwloc</a>, to linear algebra
solvers such as <a href="https://hpc.guix.info/package/chameleon">Chameleon</a>, to
numerical simulation libraries. Having the HIP/ROCm stack packaged in
Guix allows us to deploy and run those even more complex stacks on
supercomputers and readily take advantage of their processing power
without going through a tedious installation and testing process.</p><p>This makes even more of a difference considering the breadth and depth
of HPC software developed in <a href="https://numpex.org/">NumPEx</a>. NumPEx is
the French national program for exascale HPC, launched in mid-2023 with
a 41 M€ budget for 6 years. Its <a href="https://numpex.org/exadi-development-and-integration/">Development and Integration
project</a> aims to
ensure the dozens of HPC libraries and applications developed by French
researchers can easily be deployed on national and European clusters,
with high quality assurance levels. Guix is one of the deployment tools
used to achieve those goals, and it is well poised to do so; having a
well-tested GPGPU package set makes it an even better fit.</p><p>It remains to be seen whether
<a href="https://eurohpc-ju.europa.eu/jules-verne-consortium-will-host-new-eurohpc-exascale-supercomputer-france-2023-06-20_en">Jules-Verne</a>,
the EuroHPC exascale supercomputer to be hosted in France in 2025, will
provide AMD GPUs. Given that the software stack for these GPUs is free
software, this would send a strong signal in favor of Open Science, in
line with the <a href="https://en.unesco.org/science-sustainable-future/open-science/recommendation">recommendations of
UNESCO</a>
and those of the <a href="https://www.ouvrirlascience.fr/second-national-plan-for-open-science/">French Plan for Open
Science</a>.</p><h1>This is just the beginning</h1><p>All these packages are available from the <a href="https://hpc.guix.info/channel/guix-hpc">Guix-HPC
channel</a>; they are
continuously built on <a href="https://guix.bordeaux.inria.fr/eval/latest/dashboard?spec=guix-hpc">the build farm at
Inria</a>,
providing users with readily usable binaries.</p><p>With the HIP and ROCm foundations in place, there’s a lot on our agenda:
providing rocBLAS, rocFFT, and related math libraries, taking advantage
of these in the linear algebra and numerical simulation packages
developed at Inria and in NumPEx, and working with the broader Guix
community to provide ROCm-enabled variants of major packages like
<a href="https://hpc.guix.info/package/python-pytorch">PyTorch</a>. We plan to
make the ROCm/HIP packages part of the main Guix channel once we have
gained enough experience. The other important benefit we expect from
this collaboration is to better cater to the needs of engineers at AMD.</p><p>Working together in the open has been a fruitful and pleasant experience
and we can already foresee lots of opportunities to keep this going!</p>Videos of the 2023 workshop are on-lineCéline Acary-Robert, Pierre-Antoine Bouttier, Ludovic Courtès, Alexandre Dehne-Garcia, Simon Tournierguix-devel@gnu.org2024-01-29T10:00:00Z<p>Back in November, the <a href="https://hpc.guix.info/events/2023/workshop/">First Workshop on Reproducible Software
Environments for Research and High-Performance
Computing</a> was held in
Montpellier, France. Coming from France primarily but also from
Czechia, Germany, the Netherlands, Slovakia, Spain, and the United
Kingdom to name a few, 120 people—scientists, high-performance computing
(HPC) practitioners, system administrators, and enthusiasts alike—came
to listen to the talks, attend the tutorials, and talk to one another.</p><p><img src="https://hpc.guix.info/static/images/workshop-group-photo-2023.jpg" alt="Group picture of the attendees on Friday, November 10th, 2023. By Tess Gobain." /></p><p>Our ambition was to gather people from diverse backgrounds with a shared
interest in improving their research workflows and development
practices. The 11 talks and 8 tutorials, along with the hallway
discussions and group dinner (very nice!), have allowed us to share
skills and experience.</p><p>Today, we’re publishing <a href="https://hpc.guix.info/events/2023/workshop/program"><strong>videos of the
talks</strong></a> including
short interviews with the
<a href="https://hpc.guix.info/events/2023/workshop/speakers">speakers</a>
(tutorials were not recorded but supporting material is linked from the
<a href="https://hpc.guix.info/events/2023/workshop/program">program</a>).</p><p>Our gratitude goes to the video team at Institut Agro for taking care of
the live stream during the event and for editing those videos—thank you!
Many thanks to our publicly-funded academic sponsors who made this event
possible: <a href="https://isdm.umontpellier.fr/">ISDM</a>, our primary sponsor for
this event, <a href="https://www.institut-agro.fr/en">Institut Agro</a> for hosting
the workshop in such a beautiful place, and
<a href="https://www.eurocc-access.eu/">EuroCC²</a> and <a href="https://www.inria-academy.fr/">Inria
Academy</a> for their financial and
logistical support.</p><p>“When will the <em>second</em> workshop take place?”, participants asked as we were
wrapping up. We don’t know yet, but if you’d like to host the next
edition or to sponsor it, do <a href="mailto:contact-guixhpc-days@services.cnrs.fr">get in touch with
us</a>!</p><p>The <em>bonus video</em> below will give you a feel of what the event in
Montpellier was like…</p><p><img src="/static/videos/workshop-2023/99-aftermovie.webm" alt="Short video giving an overview of the event and the venue." /></p><blockquote><p><em>Video by Institut Agro’s video team, published under
<a href="https://creativecommons.org/licenses/by-nc/3.0/">CC-BY-NC 3.0</a>.
<a href="https://git.savannah.gnu.org/cgit/guix/guix-artwork.git/tree/promotional/guix-hpc-workshop-2023">Guix
artwork</a>
by Luis Felipe published under
<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC-BY-SA 4.0</a>.</em></p></blockquote><p>Enjoy!</p>Announcing the First Workshop on Reproducible Software EnvironmentsSimon Tournier, Ludovic Courtèsguix-devel@gnu.org2023-09-18T14:30:00Z<p>We’re excited to announce the <a href="https://hpc.guix.info/events/2023/workshop/">First Workshop on Reproducible Software
Environments for Research and High-Performance Computing
(HPC)</a>, which will take
place in <strong>Montpellier, France</strong>, on <strong>November 8–10th, 2023</strong>! The
<a href="https://hpc.guix.info/events/2023/workshop/program/">preliminary
program</a> is
on-line, and now’s the time for you to
<a href="https://repro4research.sciencesconf.org/registration">register</a>!</p><p>This event can be seen as a follow-up to the research session of the <a href="https://10years.guix.gnu.org/program/#Friday">Ten
Years of Guix</a> event and
the earlier <a href="https://hpc.guix.info/events/2021/atelier-reproductibilit%C3%A9-environnements/">French-speaking Workshop on Reproducible Software
Environments</a>.</p><p>The <a href="https://hpc.guix.info/events/2023/workshop/program">program</a> features
talks by scientists, engineers, and system administrators from different
backgrounds who will share their experience with Guix, as well as tutorials on
GitLab, Guix, and other tools that support scientific workflows—from bioinfo
analyses to HPC and source code archival.</p><p>The <a href="https://hpc.guix.info/events/2023/workshop/speakers">list of
speakers</a> shows a
variety of positions and scientific disciplines—psychology, linear
algebra, biophysics, medicine, bioinformatics, system administration—which we believe
shows that software environment reproducibility is a cross-cutting
concern that can be tackled whether or not one identifies themself as a
“geek”.</p><p><a href="https://hpc.guix.info/events/2023/workshop/speakers/#Yann-Dupont">Yann
Dupont</a>,
system architect at the GLiCID HPC center in France, writes:</p><blockquote><p>Guix is the perfect Swiss Army knife that every digital plumber
should have in their toolkit. We use it extensively, not only to
enhance the software we offer to researchers, but also to build the
GLiCID infrastructure.</p></blockquote><p>Working with bioinformatics and genomics research teams, <a href="https://hpc.guix.info/events/2023/workshop/speakers/#Ricardo-Wurmus">Ricardo
Wurmus</a>,
also known for his outstanding contributions to Guix, subscribes to this view:</p><blockquote><p>As a software engineer working in large and often changing teams I
depend on Guix to ensure that development environments as well as
complicated production deployments are free from surprises. In my role
to support researchers with complex scientific software environments I
cannot think of a more flexible and reliable foundation for reproducibly
customizable deployments to laptops, HPC systems, and the cloud. I’m
excited to see that the program is full of experience reports and
tutorials by experienced HPC practitioners, and I can’t wait to get a
chance to learn more about how Guix is used in other research
environments.</p></blockquote><p>To <a href="https://hpc.guix.info/events/2023/workshop/speakers/#Nicolas-Vallet">Nicolas
Vallet</a>,
medical doctor and researcher in the Hematology and Cell Therapy department of
the University Hospital of Tours (France), it’s about sharing research results:</p><blockquote><p>As a scientist, I've experienced frustration when attempting to run packages
described in research papers but encountering compatibility issues with my
system. My goal is to ensure that my research will be accessible to a wide
audience, regardless of their location or technical expertise. Guix has
provided me with a solution to achieve it. I'm now proud to share not only
the raw data and analysis pipeline from my projects, but also detailed
instructions on how to recreate the transparent computational environment
used, making my research more accessible to others.</p></blockquote><p>Whether you’re a scientist, a practitioner, a newcomer, or a power user, we’d
love to see you in November.</p><p>Stay tuned for updates!</p>Reproducible research hackathon: experience reportSimon Tournier, Ludovic Courtèsguix-devel@gnu.org2023-07-12T15:20:00Z<p>Two weeks ago, on June 27th, we held an second <a href="https://hpc.guix.info/blog/2023/05/reproducible-research-hackathon-let-redo/">on-line
hackathon</a>
on reproducible research issues. This hackathon was a collaborative effort to
bring GNU Guix to concrete examples inspired by contributions to the online
journal <a href="https://rescience.github.io">ReScience C</a>.</p><p>A small but enthusiastic group of about 5 people connected to the
<code>#guix-hpc</code> IRC channel on Libera.chat and hacked the good
reproducibility hack.
The day was interspersed with three video chats: the first to discuss
interests, backgrounds, and the working plan; the second to report on work in
progress; and the last to review achievements and list future ideas.</p><p>As we have been
<a href="https://hpc.guix.info/blog/2023/06/a-guide-to-reproducible-research-papers/">advocating</a>,
this command line:</p><pre><code>guix time-machine -C channels.scm -- shell -m manifest.scm</code></pre><p>… captures all the requirements for redeploying the same computational
environment. Specifically:</p><ul><li><code>channels.scm</code> pins a specific revision of Guix and potentially <a href="https://hpc.guix.info/channels/">other
channels</a>;</li><li><code>manifest.scm</code> specifies the packages required by the computational
environment.</li></ul><p>The three goals of the hackathon were:</p><ol><li>Pick a <a href="https://github.com/ReScience/submissions/issues?q=is%3Aissue+is%3Adone+">ReScience C
submission</a>
and add these two files: <code>channels.scm</code> and <code>manifest.scm</code>.</li><li>If needed, <a href="https://guix.gnu.org/manual/en/html_node/Defining-Packages.html">define
packages</a>.
These could then go to Guix itself or one of the relevant
dedicated channels:
<a href="https://github.com/guix-science/guix-science">Guix-Science</a>,
<a href="https://gitlab.inria.fr/guix-hpc/guix-past">Guix-Past</a>, etc.</li><li>Identify open issues that hinder reproducibility of software environment
environments.</li></ol><p>Here’s a recap. TLDR, it was a success!</p><h1>Complete “Guixification”</h1><p>These two papers based on Python software were considered:</p><ul><li><a href="https://rescience.github.io/bibliography/Torre-Ortiz_2021.html"><em>[Re] Neural Network Model of Memory
Retrieval</em></a>,
ReScience C 6, 3, #8, 2021.</li><li><a href="https://rescience.github.io/bibliography/Misiek_2022.html"><em>[Re] A general model of hippocampal and dorsal striatal learning and
decision
making</em></a>,
ReScience C 8, 1, #4, 2022.</li></ul><p>Writing the two files, <code>channels.scm</code> and <code>manifest.scm</code>, was rather
straightforward. This led to two pull requests against the original papers:
<a href="https://github.com/c-torre/replication-recanatesi-2015/pull/1">here</a> and
<a href="https://github.com/thomasMisiek/mixed-coordination-models/pull/9">there</a>.
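</p><p>For illustration, here is roughly what such files look like; the commit hash and package names below are placeholders rather than those of the actual papers:</p><pre><code class="language-scheme">;; channels.scm: pin the exact Guix revision used for the paper.
(list (channel
       (name 'guix)
       (url "https://git.savannah.gnu.org/git/guix.git")
       (commit "0123456789abcdef0123456789abcdef01234567")))

;; manifest.scm: the packages the computational environment needs.
(specifications->manifest
 (list "python" "python-numpy" "python-matplotlib"))</code></pre><p>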
Nothing fancy: most of the work consisted in “translating” the
<code>requirements.txt</code> file used by <code>pip</code> to <code>manifest.scm</code>.</p><p>On a side note, would it be possible to take advantage of GitHub’s
continuous integration,
<em>GitHub Actions</em>, to guide the review process? The first idea would be to
let GitHub Actions run part of the numerical processing. However, the
resources offered by GitHub are limited or are not suitable for numerical
experiments. Instead, GitHub Actions can be used to
<a href="https://guix.gnu.org/manual/devel/en/guix.html#Invoking-guix-pack">pack</a> the
software environment and publish the resulting artifact. For instance, Docker
images are popular and Guix can <a href="https://guix.gnu.org/manual/devel/en/guix.html#index-Docker_002c-build-an-image-with-guix-pack">produce
them</a>;
for details about producing Docker images using Guix on top of GitHub
Actions, see <a href="https://github.com/zimoun/mixed-coordination-models/commit/322e17a60e09e2b17af6c6b04ffdcc67d9990dfa">this
example</a>
based on the ReScience article above (8, 1, #4, 2022). In a nutshell, GitHub Actions
runs the following command:</p><pre><code>guix time-machine -C channels.scm \
-- pack -f docker --save-provenance -m manifest.scm</code></pre><p>A reviewer could then load this Docker image artifact produced by Guix. Or
they could directly generate the software environment from the files
<code>channels.scm</code> and <code>manifest.scm</code>. Either way, a reviewer is thus able
to inspect the software environment of the submission. Last, because of the
<code>--save-provenance</code> option, the Docker image carries the Guix provenance
information needed for <a href="https://hpc.guix.info/blog/2021/10/when-docker-images-become-fixed-point/">reproducing
itself</a>.</p><h1>Partial port to Guix</h1><p>Other papers tracked by ReScience were also considered:</p><ul><li><a href="https://rescience.github.io/bibliography/Wallrich_2022.html"><em>[Re] Groups of diverse problem-solvers outperform groups of
highest-ability problem-solvers - most of the
time</em></a>,
8, 1, #6, 2022.</li><li><a href="https://rescience.github.io/bibliography/Boersch-Supan_2021.html"><em>[Re] Modeling Insect Phenology Using Ordinal Regression and Continuation
Ratio</em></a>,
7, 1, #5, 2021.</li><li><a href="https://github.com/ReScience/submissions/issues/69"><em>[Re] A circuit model of auditory
cortex</em></a>,
review still pending.</li><li><a href="https://github.com/ReScience/submissions/issues/43"><em>[Re] Particle Image Velocimetry with Optical
Flow</em></a>; the initial paper dates
from 1998 and the reproduction was submitted to the <a href="http://rescience.github.io/ten-years/">Ten Years
Reproducibility Challenge</a>.</li></ul><p>We did not complete the reproduction of all of these papers using Guix
due to lack of time or computational resources. Progress on the first
paper is visible in this <a href="https://github.com/civodul/diversity_abm_replication/tree/guix-environment">Git
repository</a>.
The main pitfall illustrated by this paper is that not all of the
experiment’s source code was available in the repository; some of it was
stored elsewhere on-line and transparently downloaded and run <em>via</em>
Python’s <a href="https://github.com/operatorequals/httpimport"><code>httpimport</code></a>.
This is problematic for several reasons: that code might simply vanish,
it could be modified between the time the authors submitted the paper
and the time someone else attempts to reproduce it, or it could be
<em>maliciously</em> modified. The solution was to get the current copy of the
relevant code inside the repository and to remove uses of <code>httpimport</code>.
This experiment is computationally very expensive though, and we could
not run it in time on our local cluster.</p><p>For the second paper, the main difficulty was related to time zones: the
variable <code>TZDIR</code> required an adjustment. Fortunately, thanks to the
<a href="https://guix.gnu.org/manual/devel/en/guix.html#Inferiors">inferiors</a> Guix
feature, a custom manifest combining two different Guix revisions made it possible to
generate the software environment, based on the R ecosystem, in which the numerical
experiment of the paper can be run.</p><p>The ReScience reviewer of the third paper took advantage of the hackathon to
resume the review and try Guix for the software environment. The files <code>channels.scm</code>
and <code>manifest.scm</code> were created without any big issue. The paper’s
computational experiment runs in a Jupyter Notebook, and it works
out-of-the-box with the <code>--pure</code> option of <code>guix shell</code>—running it with <code>--container</code>,
for improved isolation, is left as an exercise for the reader. One
drawback was that the paper’s author invokes <code>apt install</code> in the middle of
the notebook. On the Guix side, one difficulty was
<a href="https://guix.gnu.org/manual/devel/en/guix.html#Using-TeX-and-LaTeX">finding the right TeX Live
packages</a>;
another one was the interaction with the Python library <code>matplotlib</code>, which can be
troublesome. The session was a double opportunity: to dive into Guix-specific
details—this hackathon was the right place to share knowledge!—and to advance
this specific review, which started in March and is now almost finished. Win-win!</p><p>The fourth and last paper was a challenge: produce a software environment
where C code from 1998 can run. And that’s a <a href="https://github.com/ReScience/submissions/issues/43#issuecomment-1611213643">positive
result</a>!
The two tables agree with those in the paper. The C code compiles and runs,
although some warnings are raised (they can be silenced with specific
compiler flags), and the Bash shell scripts are not fully portable and required
minor tweaks. The C code has no dependencies, which significantly
simplifies portability and eases reproducibility.</p><h1>Towards long-term and archivable reproducibility</h1><p>Over the years of running Guix daily in a scientific context, we have
already identified many potential roadblocks to achieve long-term
reproducible software environments—from unfixed bugs to
unimplemented features. Verifiable environment deployment
can only be achieved when all the following conditions are met:</p><ul><li>availability of all the source code;</li><li>backward-compatibility of the Linux kernel system call interface;</li><li>some compatibility of the hardware (CPU, etc.);</li><li>no “time bomb”—software whose behavior is a function of the current
time.</li></ul><p>This hackathon was a nice opportunity to check the status of these
conditions and to list what already works and what still remains, all based on a
concrete example:</p><ul><li><a href="http://rescience.github.io/bibliography/Courtes_2020.html"><em>[Re] Storage Tradeoffs in a Collaborative Backup Service for Mobile
Devices</em></a>,
ReScience C 6, 1, #6, 2020.</li></ul><p>This paper runs Guix end-to-end: it uses Guix to compile all the requirements,
run all the experiments and finally generate the final report. Let us check whether two
independent observers are able to verify the same result with three years
between the two observations (2020–2023).</p><p>We know that this paper’s computational experiment is reproducible with
Guix today under “normal circumstances” (<a href="https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone/-/blob/master/README.md">try
it!</a>),
so we set out to experiment with an <em>extreme</em> worst-case scenario: no
pre-built binaries are available—everything needs to be <em>rebuilt from
source</em>—and none of the source code hosting sites is reachable, with the
exception of the <a href="https://www.softwareheritage.org/">Software Heritage
archive</a>. The ambition of Software
Heritage is to collect, preserve, and share all software that is
publicly available in source code form. Guix <a href="https://guix.gnu.org/en/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/">fetches code from
Software
Heritage</a>
as a fallback when source code hosting sites disappear. To our
knowledge, redeploying software under such extreme conditions is
practically impossible, unless of course one is using Guix—or at least
that’s what we wanted to verify.</p><p>In summary, the outcome of this experiment is impressive. Considering
this extreme worst-case setup, it's awesome that it <em>almost</em> works
out-of-the-box. The remaining open issues we identified are:</p><ul><li>Guix user interface annoyances: manual <code>--fallback</code> or <code>--no-substitutes</code>
options and inconsistent error messages.</li><li>Holes in Software Heritage and Disarchive coverage of the source code
we needed.</li><li>Source origin hash mismatches between Guix normalization and Software
Heritage normalization.</li><li>“Time bomb”: the test suite of some packages is failing because it is
time-dependent (<a href="https://issues.guix.gnu.org/56137">example</a>).</li><li>Weaknesses in the <a href="https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/">full-source
bootstrap</a>.</li><li>The archive of all the binary seeds of this bootstrap.</li></ul><p>For the interested reader, take a look at the <a href="https://simon.tournier.info/posts/2023-06-23-hackathon-repro.html">complete
details</a>.
Does this mean we have a roadmap for the next hackathon? If you are interested,
we’d love to <a href="https://hpc.guix.info/about">hear your ideas</a>!</p><p>Last but not least, a one-day on-line get-together is a great opportunity to
tackle longstanding topics while helping each other and welcoming newcomers on
board. Thanks to everyone for joining! It’s been a pleasant and productive
experience, so <a href="https://hpc.guix.info/blog/">stay tuned</a> for other rounds!</p>A guide to reproducible research papersLudovic Courtès, Marek Felšöci, Konrad Hinsen, Philippe Swartvagherguix-devel@gnu.org2023-06-23T12:00:00Z<p>A core tenet of science is the ability to independently <em>verify</em>
research results. When computations are involved, verifiability implies
reproducibility: one should be able to re-run the computations to ensure
they get the same results, at which point they may want to start
experimenting with variants of the computational methods, feeding them
different data sets, and so on. This is the motivation behind our work
on Guix: we want to empower scientists by providing a tool in support of
reproducible computations <em>and</em> experimentation.</p><p>This article is a guide to using Guix for reproducible research work:
producing research articles with enough information so that anyone,
anytime can re-run the computational experiments it describes. Before
showing how to get this done with Guix, let’s look at existing practices
and see where they fall short.</p><h1>On the difficulty of sharing computational processes</h1><p>A citation attributed to Jon Claerbout summarizes the problem:</p><blockquote><p><em>Published documents are merely the advertisement of scholarship
whereas the computer programs, input data, parameter values,
etc. embody the scholarship itself.</em></p></blockquote><p>Authors of research papers often realize that they need to share
not only the data and source code but also the
software environment they used, somehow. There are two common ways to
do that, sometimes used in combination: recording software package names
and versions in the paper (in an appendix), and providing a ready-to-use
application bundle such as a Docker or virtual machine image.</p><p>Recording name/version pairs is appealing. The intuition is that by
communicating the names and version numbers of my dependencies, someone
can recreate the same environment that I used. However, where should I
stop? If it’s an R program, should I only list R packages? What about
R itself? Should I include the linear algebra libraries R depends on?
And if my code is C/C++, should I include the compiler version number?
The C library? Fellow researcher Konrad Hinsen <a href="https://10years.guix.gnu.org/video/guix-as-a-tool-for-computational-science/">gives this
definition</a>:</p><blockquote><p><em>“Code” is the code you care about.
“Environment” is code you don’t care about.</em></p></blockquote><p>The problem is that all of “the environment” influences the results
produced by the code you care about; it’s hard to make a judgment call
to decide that some things should be excluded and others not.</p><p><img src="/static/images/blog/bioinfo-paper-installation-instructions.png" alt="Installation instructions from a bioinfo paper that leave a bit to be desired." /></p><p>So we see articles with software environment descriptions ranging from
“<em>I used Ubuntu 22.04</em>” to long lists of package name/version—in
research domains where R is used a lot, authors often
provide the output of
<a href="https://rdrr.io/r/utils/sessionInfo.html"><code>sessionInfo()</code></a>
as an appendix—some
even including environment variable definitions! One
obvious issue with those package name/version lists is that they are not
actionable: you’re not going to build and install every single package
version by hand, it’s just not practical. So in the end, they act more
as a hint: if the software behaves differently than what’s described in
the paper, it <em>might be</em> because I’m using a slightly different version
of some dependency.
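</p><p>To make this concrete, here is a small sketch in Python (all package
data below is hypothetical, chosen only for illustration): two environments
that report identical name/version pairs may nonetheless correspond to
different builds once build inputs become part of the description, which is
precisely the kind of information Guix records.</p>

```python
import hashlib
import json

def env_fingerprint(packages):
    """Deterministic fingerprint of an environment description."""
    canonical = json.dumps(packages, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Two machines report the very same name/version list...
listing_a = [{"name": "numpy", "version": "1.24.2"}]
listing_b = [{"name": "numpy", "version": "1.24.2"}]
assert env_fingerprint(listing_a) == env_fingerprint(listing_b)

# ...yet the fingerprints diverge as soon as build inputs (linear
# algebra backend, patches, and so on) enter the description:
build_a = [{"name": "numpy", "version": "1.24.2", "blas": "openblas"}]
build_b = [{"name": "numpy", "version": "1.24.2", "blas": "mkl"}]
assert env_fingerprint(build_a) != env_fingerprint(build_b)
```

<p>A name/version list is thus a coarse projection of the real dependency
graph: two environments can agree on it and still differ.</p><p>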
The second problem is that a name/version pair fails to capture the
complexity of a package dependency graph: it doesn’t tell which build
options were used, whether patches were applied, which optional
dependencies were enabled, and so on.</p><p><img src="/static/images/blog/sc-environment-variables.png" alt="An Artifact Evaluation appendix proudly showing its environment variables." /></p><p>To address that, the other option is to <em>ship the bits</em>: provide a Docker
or virtual machine image containing the software of interest. This
is what more and more conference Artifact Evaluation Committees have
come to recommend. It sure lets you run the code in the right software
environment, but the cost is high: you can’t tell what code
you’re running. The image is a big binary blob that was produced by a
complex computational process (<code>apt install</code>, <code>pip install</code>, <code>make</code>,
etc.) but usually one cannot map its contents back to source code.</p><p>You may object that, if you have the <code>Dockerfile</code>, then it’s fine. It’s
not. <code>Dockerfile</code>s describe a process that is usually not reproducible
since it depends on external resources such as the set of binary
packages distributed by, say, Ubuntu at a given point in time. Even if
it were reproducible, the whole process is fundamentally opaque: it
assembles opaque binaries, starting with a full operating system image
and piling binaries fetched by <code>pip</code> or other tools.</p><p>Conversely, Guix is, at its core, about providing <a href="https://reproducible-builds.org/"><em>a verifiable path
from source code to binary</em></a>. Guix
packages are essentially <a href="https://guix.gnu.org/manual/en/html_node/Defining-Packages.html">source
code</a>
that describes how to build software from source.</p><p>Our goal in the remainder of this article is to provide a step-by-step
guide on using Guix to manage the software environment of your research
software.</p><h1>Executable provenance meta-data</h1><p>With Guix as the basis of your computational workflow, you can get
what’s in essence <em>executable provenance meta-data</em>: it’s like that long
list of package name/version pairs, except more precise and immediately
deployable. Let’s see how this can be achieved.</p><h2>Step 1: Setting up the environment</h2><p>The first step will be to identify precisely what packages you need in
your software environment. Assuming you have a Python script that uses
NumPy, you can start by creating an environment that contains these two
packages and <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-shell.html">try to run your code in that
environment</a>:</p><pre><code>guix shell -C python python-numpy -- python3 ./myscript.py</code></pre><p>The <code>-C</code> flag here (or <code>--container</code>) instructs <code>guix shell</code> to create
that environment in an isolated container with nothing but the two
packages you asked for. That way, if <code>./myscript.py</code> needs more than these
two packages, it’ll fail to run and you’ll immediately notice. On some
systems <code>--container</code> is not supported; in that case, you can resort to
<code>--pure</code> instead.</p><p>Perhaps you’ll find that you also need Pandas and add it to the
environment:</p><pre><code>guix shell -C python python-numpy python-pandas -- \
python3 ./myscript.py</code></pre><p>If you fail to guess the name of the package (this one was easy!), try
<code>guix search</code>.</p><p>Environments for Python, R, and similar high-level languages are
relatively easy to set up. For C/C++ code, you may find you need many more
packages:</p><pre><code>guix shell -C gcc-toolchain cmake coreutils grep sed make openmpi -- …</code></pre><p>Or perhaps you’ll find that you could just as well provide a
<a href="https://guix.gnu.org/manual/devel/en/html_node/Defining-Packages.html">definition</a>
for your package.</p><p>Eventually, you’ll have a list of packages that satisfies your needs.</p><blockquote><p><strong>What if a package is missing?</strong> Guix and the main scientific and
HPC channels provide about <a href="https://hpc.guix.info/browse">25,000
packages</a> today. Yet, there’s always
the possibility that the one package you need is missing. In that
case, you will need to provide a <a href="https://guix.gnu.org/manual/devel/en/html_node/Defining-Packages.html">package
definition</a>
for it in a <a href="https://guix.gnu.org/manual/devel/en/html_node/Creating-a-Channel.html">dedicated
channel</a>
of yours. For software in Python, R, and other high-level languages,
most of the work can usually be automated by using <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-import.html"><code>guix import</code></a>.
<a href="https://guix.gnu.org/contact/">Join the friendly Guix community</a> to
get help!</p></blockquote><h2>Step 2: Recording the environment</h2><p>Now that you have that <code>guix shell</code> command line with a list of
packages, the best course of action is to save it in a <em>manifest</em>
file—essentially a software bill of materials—that Guix can then
ingest. There are <a href="https://guix.gnu.org/manual/devel/en/html_node/Writing-Manifests.html">other ways to do
that</a>
but the easiest way to get started is by “translating” your command line
into a manifest:</p><pre><code>guix shell python python-numpy python-pandas \
--export-manifest > manifest.scm</code></pre><p>Put that manifest under version control! From there anyone can redeploy
the software environment described by the manifest and run code in that
environment:</p><pre><code>guix shell -C -m manifest.scm -- python3 ./myscript.py</code></pre><p>Here’s what <code>manifest.scm</code> reads:</p><pre><code class="language-scheme">;; What follows is a "manifest" equivalent to the command line you gave.
;; You can store it in a file that you may then pass to any 'guix' command
;; that accepts a '--manifest' (or '-m') option.
(specifications->manifest
(list "python" "python-numpy" "python-pandas"))</code></pre><p>It’s a code snippet that lists packages. Notice that there are no version
numbers! Indeed, these version numbers are specified in package definitions,
located in Guix channels. To allow others to reproduce the exact same
environment as the one you’re running, you need to <em>pin Guix itself</em>, by
<a href="https://guix.gnu.org/manual/devel/en/html_node/Replicating-Guix.html">capturing the current Guix channel commits with <code>guix describe</code></a>:</p><pre><code>guix describe -f channels > channels.scm</code></pre><p>This <code>channels.scm</code> file is similar in spirit to “lock files” that some
deployment tools employ to pin package revisions. You should also keep
it under version control in your code, and possibly update it once in a
while when you feel like running your code against newer versions of its
dependencies. With this file, anyone, <em>at any time and on any machine</em>,
can now reproduce the exact same environment by running:</p><pre><code>guix time-machine -C channels.scm -- shell -C -m manifest.scm -- \
python3 ./myscript.py</code></pre><p>In this example we rely solely on the <code>guix</code> channel, which provides the
Python packages we need. Perhaps some of the packages you need live <a href="https://hpc.guix.info/channels">in
other channels</a>—maybe <code>guix-cran</code> if you
use R, maybe <code>guix-science</code>. That’s fine: <code>guix describe</code> also captures
that.</p><p>Of course, do include a <code>README</code> file giving the exact command to run the
code. Not everyone uses Guix so it can be helpful to also provide
minimal non-Guix setup instructions: which package versions are used,
how software is built, etc. As we have seen, such instructions would
likely be inaccurate and inconvenient to follow at best. Yet, it can be
a useful starting point for someone trying to recreate a <em>similar</em>
environment using different tools. It should probably be presented as
such, with the understanding that the only way to get the <em>same</em>
environment is to use Guix.</p><h2>Step 3: Ensuring long-term source code archival</h2><p>We insisted on version control before: for the <code>manifest.scm</code> and
<code>channels.scm</code> files, but of course also for your own code. Our
recommendation is to have these two <code>.scm</code> files in the same repository
as the code they’re about.</p><p><img src="/static/images/blog/software-heritage-logo-title.svg" alt="Logo of Software Heritage" /></p><p>Since the goal is enabling reproducibility, source code
availability is a prime concern. Source code hosting services come and
go and we don’t want our code to vanish on a whim and render our
published research work unverifiable. <a href="https://www.softwareheritage.org/">Software
Heritage</a> (SWH for short) is <em>the</em> solution
for this: SWH archives public source code and provides unique intrinsic
identifiers to refer to
it—<a href="https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html"><em>SWHIDs</em></a>.
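</p><p>As an aside, you can programmatically check whether a given origin is
already in the archive. The sketch below uses SWH’s public web API; the
endpoint shape follows the SWH API documentation, but treat it as an
assumption to double-check, and note that calling it performs a real
network request.</p>

```python
from urllib.error import HTTPError
from urllib.request import urlopen

SWH_API = "https://archive.softwareheritage.org/api/1"

def origin_lookup_url(origin_url):
    # The web API looks up an origin by its full URL, embedded as-is.
    return f"{SWH_API}/origin/{origin_url}/get/"

def is_archived(origin_url):
    """Return True if the archive knows this origin, False on a 404."""
    try:
        with urlopen(origin_lookup_url(origin_url)) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise

# Example (network access required):
# is_archived("https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone")
```

<p>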
Guix itself is <a href="https://guix.gnu.org/en/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/">connected to
SWH</a>
to (1) ensure that the source code of its packages is archived, and
(2) to fall back to downloading from the SWH archive should code vanish
from its original site.</p><p>Once your own code is available in a public version-control repository,
such as a Git repository on your lab’s hosting service, you can ask SWH
to archive it by going to its <a href="https://archive.softwareheritage.org/save/">Save Code
Now</a> interface. SWH will
process the request asynchronously and eventually you’ll find your code
has made it into <a href="https://archive.softwareheritage.org/">the archive</a>.</p><h2>Step 4: Referencing the software environment</h2><p>This brings us to the last step: referring to our code <em>and</em> software
environment in our beloved paper. We already have all our code and Guix
files in the same repository, which is archived on SWH. Thanks to SWH,
we now have a SWHID, which uniquely identifies the relevant revision of
our code.</p><p>Following <a href="https://www.softwareheritage.org/howto-archive-and-reference-your-code/">SWH’s own
guide</a>,
we’ll pick an <code>swh:dir</code> kind of identifier, which refers to the
directory of the relevant revision/commit of our repository, and we’ll
keep <em>contextual info</em> for clarity—that includes the original URL.
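</p><p>A qualified SWHID has a regular shape: a core
<code>swh:1:&lt;type&gt;:&lt;hash&gt;</code> part followed by
<code>;key=value</code> qualifiers. The Python sketch below splits one
apart; its validation is deliberately simplified and is no substitute for
the official SWH tooling.</p>

```python
import re

CORE_SWHID = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")

def parse_swhid(swhid):
    """Split a qualified SWHID into its core part and its qualifiers."""
    core, *qualifiers = swhid.split(";")
    match = CORE_SWHID.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    object_type, object_id = match.groups()
    return {"type": object_type, "id": object_id,
            **dict(q.split("=", 1) for q in qualifiers)}

parsed = parse_swhid(
    "swh:1:dir:cc8919d7705fbaa31efa677ce00bef7eb374fb80"
    ";origin=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone")
assert parsed["type"] == "dir"   # an swh:dir identifier, as chosen above
assert parsed["origin"].startswith("https://gitlab.inria.fr")
```

<p>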
Putting it all together, we’ll conclude our paper with a sentence along
these lines:</p><blockquote><p>The source code used to produce this study, as well as instructions to
run it in the right software environment using GNU Guix, is archived
on Software Heritage as
<a href="https://archive.softwareheritage.org/swh:1:dir:cc8919d7705fbaa31efa677ce00bef7eb374fb80;origin=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone;visit=swh:1:snp:71a4d08ef4a2e8455b67ef0c6b82349e82870b46;anchor=swh:1:rev:36fde7e5ba289c4c3e30d9afccebbe0cfe83853a"><code>swh:1:dir:cc8919d7705fbaa31efa677ce00bef7eb374fb80;origin=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone;visit=swh:1:snp:71a4d08ef4a2e8455b67ef0c6b82349e82870b46;anchor=swh:1:rev:36fde7e5ba289c4c3e30d9afccebbe0cfe83853a</code></a>.</p></blockquote><p>With this information, the reader can:</p><ul><li>get the source code;</li><li>reproduce its software environment with <code>guix time-machine</code> and run
the code;</li><li>inspect and possibly modify both the code and its environment.</li></ul><p>Mission accomplished!</p><h1>Examples</h1><p>Perhaps you don’t feel adventurous enough to be the first one to follow this
methodology. Worry not: you won’t be the first! Here are examples of
reproducible papers built along the lines of this guide (with some variations),
in several different fields:</p><p><img src="/static/images/blog/paper-swh-link.png" alt="Research paper linking to a repository using its SWHID." /></p><ul><li>Philippe Swartvagher et al., <em>Tracing task-based runtime systems: feedbacks
from the StarPU case</em>. This article studies the impact of tracing complex
HPC applications, especially what are the sources of performance
degradation when an application execution is traced; evaluates the
solutions to reduce the tracing overhead; and explores clock
synchronization issues when distributed applications are traced. The paper
is still under review but its content is available in Philippe's
<a href="https://theses.hal.science/tel-03989856">thesis</a>. Considered applications
are C programs using MPI, launched with Slurm, then Python scripts are used
to process results and generate plots. The <a href="https://gitlab.inria.fr/pswartva/paper-starpu-traces-r13y">companion
repository</a>
contains instructions and scripts to reproduce the whole study.</li><li>Emmanuel Agullo, Marek Felšöci, Guillaume Sylvand, <a href="https://hal.inria.fr/hal-03263603"><em>A comparison of selected
solvers for coupled FEM/BEM linear systems arising from discretization of
aeroacoustic problems</em></a> with the
associated <a href="https://hal.inria.fr/hal-03263620">technical report</a> describing
the experimental environment and providing instructions for reproducing the
experiments. Experiments in this study rely on private
industrial code and can thus be reproduced only by a limited
number of people. However, the publicly available material provides everyone
with a fully documented example of building reproducible experimental
studies within a constrained industrial context thanks to the association of
GNU Guix and <a href="http://www.literateprogramming.com">literate programming</a>
in <a href="https://orgmode.org">Org mode</a>.</li><li>Vic-Fabienne Schumann et al., <a href="https://doi.org/10.1016/j.scitotenv.2022.158931"><em>SARS-CoV-2 infection dynamics
revealed by wastewater sequencing analysis and
deconvolution</em></a>
(<a href="https://www.medrxiv.org/content/10.1101/2021.11.30.21266952v3">preprint</a>).
The pipeline used to compute the results shown in the article is
made with <a href="https://bioinformatics.mdc-berlin.de/pigx/">PIGx</a>, a tool
and collection of genomics pipelines that builds upon Guix. The
“Data/Code Availability” section links to a
<a href="https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/BIMSBbioinfo/pigx_sars-cov-2">repository</a>
that contains the manifest and channels files that were used and
instructions to run the analysis.</li><li><p>Three contributions to the <a href="https://rescience.github.io/ten-years/">Ten Years Reproducibility
Challenge</a> organized by the
ReScience C journal. In each article, the link to the code repository is at the
bottom of the first page.</p><ul><li><p>Ludovic Courtès, <em>[Re] Storage Tradeoffs in a Collaborative
Backup Service for Mobile Devices</em>, <a href="http://dx.doi.org/10.5281/zenodo.3886739">ReScience C 6, 1,
#6</a>. This article
reproduces the results of a 10-year-old article. Experiments in
the original article involved a complex software stack and did
not use Guix (it actually predates Guix!). The article shows
how to come up with a similar software environment a decade
later, and how to use Guix to produce a pipeline that goes
<a href="https://hpc.guix.info/blog/2020/06/reproducible-research-articles-from-source-code-to-pdf/"><em>from source code to
PDF</em></a>.</p></li><li><p>Konrad Hinsen, <em>[¬Rp] Stokes drag on conglomerates of spheres</em>,
<a href="http://dx.doi.org/10.5281/zenodo.3889694">ReScience C 6, 1,
#7</a>. Tries to
reproduce a study in computational fluid dynamics, based on
Fortran code published in 1993. Ultimately fails because some of
the code was lost, but the surviving code works nicely in a
reproducible Guix environment.</p></li><li><p>Konrad Hinsen, <em>[Rp] Structural flexibility in proteins — impact
of the crystal environment</em>, <a href="http://dx.doi.org/10.5281/zenodo.3886447">ReScience C 6, 1,
#5</a>. Describes the
reproduction of a computation of the normal modes of protein
crystals, originally done in 2008 using Python scripts that no
longer work with modern Python versions. A Guix environment
based on the channel
<a href="https://gitlab.inria.fr/guix-hpc/guix-past">guix-past</a>
makes it possible to run historical versions of Python and some
of its libraries.</p></li></ul></li></ul><h1>Wrap-up</h1><p>The key takeaways of this guide for reproducible papers are:</p><ul><li>Recording package name/version is often of little help when it comes
to running the code; conversely, providing an opaque image makes it
easy to run the code but prevents verifiability and experimentation.</li><li>Guix lets you record the software environment with two files:
<code>manifest.scm</code>, which lists software packages, and <code>channels.scm</code>,
which pins Guix and its channels to a specific revision.</li><li>A combined command consumes these files and reproduces the exact
same software environment: <code>guix time-machine -C channels.scm -- shell -m manifest.scm</code>.</li><li>With these files and your code under version control and archived on
Software Heritage, it’s enough to share one SWHID in your paper.</li></ul><p>Here are resources to learn more about this whole process:</p><ul><li><a href="https://doi.org/10.1038/s41597-022-01720-9"><em>Toward practical transparent verifiable and long-term reproducible
research using Guix</em></a>,
Nature Scientific Data article (volume 9, Oct. 2022) by N. Vallet
<em>et al.</em></li><li><a href="https://10years.guix.gnu.org/video/guix-as-a-tool-for-computational-science/"><em>Guix as a tool for computational
science</em></a>,
talk by K. Hinsen at the Ten Years of Guix event</li><li><a href="https://10years.guix.gnu.org/video/using-guix-for-scientific-reproducible-and-publishable-experiments/"><em>Using Guix for scientific, reproducible, and publishable
experiments</em></a>,
talk by P. Swartvagher at the same venue</li><li><a href="https://10years.guix.gnu.org/video/archive-reference-describe-and-cite-software-source-code-a-pathway-to-reproducibility/"><em>Archive, reference, describe and cite software source code: a
pathway to
reproducibility</em></a>,
talk by M. Gruenpeter at the same venue</li><li><a href="https://tuto-techno-guix-hpc.gitlabpages.inria.fr/guidelines/">Guix and Org mode, a powerful association for building a reproducible
research
study</a>, a
self-contained tutorial by M. Felšöci.</li></ul><p>If you’re interested, please join our next <a href="https://hpc.guix.info/blog/2023/05/reproducible-research-hackathon-let-redo/">Reproducible Research
Hackathon</a>,
which will take place on-line on June 27th, 2023, come to the <a href="/events/2023/workshop">Workshop
on Reproducible Software Environments</a> in
November 2023, and/or subscribe to the <a href="https://guix.gnu.org/en/contact/"><code>guix-science</code> mailing
list</a>!</p>Reproducible Research Hackathon—let’s redo!Simon Tournierguix-devel@gnu.org2023-05-12T12:00:00Z<p>It's time to run the second Reproducible Research hackathon! The first one
dates back
to... <a href="/blog/2020/07/reproducible-research-hackathon-experience-report/">2020</a>,
already! The date: <strong>Tuesday, June 27th</strong>. Start: 9:30 (CEST). End: 17:30.</p><blockquote><p><strong>Update</strong>: Check out <a href="https://hpc.guix.info/blog/2023/07/reproducible-research-hackathon-experience-report/">this
report</a>
about the hackathon.</p></blockquote><p>We propose to collectively tackle some of the issues about reproducible
research:</p><ul><li>identify stumbling blocks in using Guix to write end-to-end pipelines,</li><li>document how to achieve this,</li><li>feed the <a href="https://gitlab.inria.fr/guix-hpc/guix-past">Guix-Past</a> channel
with more old packages,</li><li>provide a <code>guix.scm</code> for some papers already published.</li></ul><p>Anyone is welcome! Feel free to join if you would like to hack with us.</p><p>We suggest picking articles from the <a href="https://rescience.github.io/">ReScience
C</a> or <a href="https://computo.sfds.asso.fr/">COMPUTO</a> –
they provide a high level of transparency about the materials required for
redoing. The best experiment would be to choose articles from
<a href="https://rescience.github.io/read/#volume-6-2020">2020</a>. As a warm-up, maybe
<a href="https://rescience.github.io/bibliography/Courtes_2020.html">Courtès, L., <em>Storage Tradeoffs in a Collaborative Backup Service for Mobile
Devices</em></a>? Or else
from the <a href="https://rescience.github.io/ten-years/">Ten Years Reproducibility
Challenge</a> which took advantage of
GNU Guix. If you prefer to work on a topic of your own that you would like to
redo, you are welcome.</p><p>We will meet <strong>Tuesday 27th June</strong> at <strong>9:30 CEST</strong> on the <code>#guix-hpc</code> channel
of irc.libera.chat. You can use this <a href="https://web.libera.chat/?nick=PotentialUser-?#guix-hpc">web
client</a> (set the
nickname you wish) to reach us. We will provide a link to a BigBlueButton
instance (video meeting), stay tuned!</p><blockquote><p>▶ Join us on <a href="https://bbb.inria.fr/cou-pmi-g09-ute">BigBlueButton</a> at
9:30 CEST! ◀</p><p>Here’s a <a href="https://codimd.math.cnrs.fr/DZB3kDhZTT-HxwIm-wc0Dw">pad 🗒</a>
for note-taking during the day.</p></blockquote><p>At the end of the day, we would like to sketch an experiment report
summarizing the successes and the roadblocks.</p><hr /><p>There's a lot we can do and we'd love to <a href="https://hpc.guix.info/about">hear your
ideas</a>!</p><p>Drop us an email at <a href="mailto:guix-science@gnu.org"><code>guix-science@gnu.org</code></a>.</p>Continuous integration and continuous delivery for HPCLudovic Courtèsguix-devel@gnu.org2023-03-06T15:00:00Z<p>Will those binaries <em>actually work</em>? This is a central question for HPC
practitioners and one that’s sometimes hard to answer: increasingly
complex software stacks being deployed, and often on a variety of
clusters. Will that program pick the right libraries? Will it perform
well? With each cluster having its own hardware characteristics,
portability is often considered unachievable. As a result, HPC
practitioners rarely take advantage of <a href="https://en.wikipedia.org/wiki/Continuous_integration">continuous integration and
continuous
delivery</a> (CI/CD):
building software locally on the cluster is common, and software
validation is often a costly manual process that has to be repeated on
each cluster.</p><p>We discussed before that use of pre-built binaries is not inherently an
obstacle to performance, be it for
<a href="https://hpc.guix.info/blog/2019/12/optimized-and-portable-open-mpi-packaging/">networking</a>
or for
<a href="https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/">code</a>—a
property often referred to as <em>performance portability</em>. Thanks to
performance portability, continuous delivery <em>is</em> an option in HPC. In
this article, we show how Guix users and system administrators have
benefited from continuous integration and continuous delivery on HPC
clusters.</p><h1>Hermetic builds</h1><p>But first things first: before we talk about continuous integration, we
need to talk about <em>hermetic</em> or <em>isolated</em> builds. One of the key
insights of the <a href="https://edolstra.github.io/pubs/phd-thesis.pdf">pioneering work of Eelco Dolstra on the Nix package
manager</a> is this: by
building software in isolated environments, we can eliminate
interference with the rest of the system and practically achieve
<a href="https://reproducible-builds.org/docs/definition">reproducible builds</a>.
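</p><p>Checking that property boils down to comparing build outputs bit for
bit. The sketch below is an illustration only, not Guix’s actual mechanism
(the <code>guix challenge</code> command is the real tool for comparing
build results): it reduces a build output tree to a single content hash
that two independent builders can compare.</p>

```python
import hashlib
import os

def tree_hash(root):
    """Reduce a directory tree to one content hash: relative file paths
    and file contents, visited in a deterministic sorted order."""
    digest = hashlib.sha256()
    for dirpath, _subdirs, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")  # separator between path and contents
            with open(path, "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()

# Two independent builds of the same package should yield the same
# tree_hash; a differing hash pinpoints a non-reproducible build.
```

<p>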
Simply put, if Alice runs a build process in an isolated environment on
a supercomputer, and Bob runs the same build process in an isolated
environment on their laptop, they’ll get the same output (unless of
course the build process is not deterministic).</p><p>From that perspective, pre-built binaries in Guix (and Nix) are merely
<a href="https://guix.gnu.org/en/manual/en/html_node/Substitutes.html"><em>substitutes for local
builds</em></a>:
you can choose to build things locally, but <em>as an optimization</em> you may
just as well fetch the build result from someone you trust—since it’s
the same as what you’d get anyway.</p><p>A closely related property is full control of the software package
dependency graph. Guix package definitions stand alone: they can only
refer to one another and cannot refer to software that happens to be
available on the machine in <code>/usr/lib64</code>, say—that directory is not even
visible in the isolated build environment! Thus, a package in Guix has
its dependencies fully specified, down to the C library—and even
<a href="https://guix.gnu.org/en/blog/2020/guix-further-reduces-bootstrap-seed-to-25/">further
down</a>.</p><p>Thanks to hermetic builds and standalone dependency graphs, sharing
binaries is safe: by shipping the package and all its dependencies,
without making any assumptions on software already available on the
cluster, you control what you’re going to run.</p><h1>Continuous integration & continuous delivery</h1><p>Guix uses continuous integration to build its more than 22,000 packages
on several architectures: x86_64, i686, AArch64, ARMv7, and POWER9. The
project has two independent build farms. The main one, known as
<a href="https://ci.guix.gnu.org"><code>ci.guix.gnu.org</code></a>, was generously donated by
the <a href="https://www.mdc-berlin.de/">Max Delbrück Center for Molecular Medicine
(MDC)</a> in Germany; it has more than twenty
64-core x86-64/i686 build machines and a dozen build machines for the
remaining architectures.</p><p><img src="/static/images/blog/package-workflow.png" alt="Diagram showing the Guix packaging workflow." /></p><p>The diagram above illustrates the packaging workflow in Guix, which can
be summarized as follows:</p><ol><li>packagers write a <a href="https://guix.gnu.org/manual/en/html_node/Defining-Packages.html">package
definition</a>;</li><li>they test it locally by using <code>guix build</code>;</li><li>eventually someone with commit access pushes the changes to the Git
repository;</li><li>build farms pull from the repository and build the new package.</li></ol><p>Build farms are a quality assurance tool for packagers. For instance,
<a href="https://ci.guix.gnu.org"><code>ci.guix</code></a> runs
<a href="https://guix.gnu.org/en/cuirass">Cuirass</a>. The web interface often
surprises newcomers—it sure looks different from those of Jenkins or
GitLab-CI!—but the key part is that it provides a dashboard that one can
navigate to look for packages that fail to build, fetch build logs, and
so on.</p><p>A big difference from traditional continuous integration tools is that
build results from the build farm are not thrown away: by running <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-publish.html"><code>guix publish</code></a>
on the build farm, those binaries are made accessible to Guix users. Any
Guix user may add <code>ci.guix.gnu.org</code> to their <a href="https://guix.gnu.org/manual/en/html_node/Getting-Substitutes-from-Other-Servers.html">list of substitute
URLs</a>
and they will transparently get binaries from that server.</p><p>One can check whether pre-built binaries of specific packages are
available on substitute servers by running <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-weather.html"><code>guix weather</code></a>:</p><pre><code>$ guix weather gromacs petsc scotch
computing 3 package derivations for x86_64-linux...
looking for 5 store items on https://ci.guix.gnu.org...
https://ci.guix.gnu.org ☀
100.0% substitutes available (5 out of 5)
at least 41.5 MiB of nars (compressed)
109.6 MiB on disk (uncompressed)
0.112 seconds per request (0.2 seconds in total)
8.9 requests per second
looking for 5 store items on https://bordeaux.guix.gnu.org...
https://bordeaux.guix.gnu.org ☀
100.0% substitutes available (5 out of 5)
at least 30.0 MiB of nars (compressed)
109.6 MiB on disk (uncompressed)
0.051 seconds per request (0.2 seconds in total)
19.7 requests per second</code></pre><p>That way, one can immediately tell whether deployment will be quick or
whether they’ll have to <a href="https://xkcd.com/303/">wait for compilation to
complete</a>…</p><h1>Publishing binaries for third-party channels</h1><p>Our research institutes typically have
<a href="https://hpc.guix.info/channels">channels</a> providing packages for their
own software or software related to their field. How can they benefit
from continuous integration and continuous delivery?</p><p><img src="/static/images/blog/cuirass-evaluation-page.png" alt="Screenshot of Cuirass showing failing and succeeding package builds." /></p><p>At Inria, we set up a <a href="https://guix.bordeaux.inria.fr">build farm</a> that
runs Cuirass and publishes its binaries with <code>guix publish</code>. Cuirass is
configured to build the packages of selected channels such as
<a href="https://gitlab.inria.fr/guix-hpc/guix-hpc"><code>guix-hpc</code></a> and
<a href="https://github.com/guix-science/guix-science"><code>guix-science</code></a> (the Guix
manual
<a href="https://guix.gnu.org/manual/en/html_node/Continuous-Integration.html#index-cuirass_002dservice_002dtype">explains</a>
how to set up Cuirass on Guix System; you can also check out the
<a href="https://gitlab.inria.fr/guix-hpc/sysadmin/-/blob/master/head-node.scm">configuration</a>
of this build farm for details). That way, it complements the official
build farms of the Guix project.</p><p>The HPC clusters that the teams at Inria use, in particular
<a href="https://plafrim.fr">PlaFRIM</a> and <a href="https://grid5000.fr">Grid’5000</a>, are
set up to fetch substitutes from <code>https://guix.bordeaux.inria.fr</code> in
addition to Guix’s default substitute servers. When deploying
packages from our channels on one of these clusters, binaries are
readily available—a significant productivity boost! That also applies
to <a href="https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/">binaries tuned for a specific CPU
micro-architecture</a>.</p><p>The Grid’5000 setup takes advantage of this flexibility in interesting
ways. Grid’5000 is a “cluster of clusters” with 8 sites, each of which
has its own Guix installation. To share binaries among sites, each site
runs a <code>guix publish</code> instance, and each site has the other sites in its
list of substitute URLs. That way, if a site has already built, say,
Open MPI, the other sites will transparently fetch Open MPI binaries
from it instead of rebuilding it.</p><p>While Cuirass is a fine continuous integration tool tightly integrated
with Guix, it’s also entirely possible to use one of the mainstream
tools instead. Here are examples of computing infrastructures that
publish pre-built binaries:</p><ul><li><a href="https://www.glicid.fr/">GliCID</a>, the Tier-2 cluster for the region
of Nantes (France), builds packages with Cuirass and <a href="https://guix-substitutes.glicid.fr">publishes
binaries</a>.</li><li><a href="https://leibniz-psychology.org">ZPID</a> <a href="https://substitutes.guix.psychnotebook.org/">publishes
binaries</a> of relevant
packages built with a simple <a href="https://github.com/leibniz-psychology/psychnotebook-deploy/blob/a314d74d1eae432419bd5218bd37cbe49dcef31d/src/zpid/machines/yamunanagar/ci.scm#L16">cron
script</a>.</li><li>GeneNetwork runs continuous integration jobs with
<a href="https://ci.genenetwork.org/">Laminar</a> and
<a href="http://guix.genenetwork.org">publishes</a> the resulting binaries.</li><li>Phil Beadling of Quantile Technologies
<a href="https://www.cloudbees.com/videos/purely-functional-ci-cd-pipeline-using-jenkins-with-guix">explained</a>
how they integrated Guix in their Jenkins CI/CD pipeline.</li></ul><p>As you can see, there’s a whole gamut of possibilities, ranging from the
“low-tech” setup to the fully-featured CI/CD pipeline. In all of these,
<code>guix publish</code> takes care of the publication part. If your focus is on
delivering binaries for a small set of packages, a periodic cron job as
shown above is good enough. If you’re dealing with a large package set
and are also interested in quality assurance, a tool like Cuirass may be
more appropriate.</p><h1>Wrapping up</h1><p>We computer users all too often work in silos. Developers might have
their own build and deployment machinery that they use for continuous
integration (GitLab-CI with some custom Docker image?); system
administrators might deploy software on clusters in their own way
(Singularity image? environment modules?); and users might end up
running yet other binaries (locally built? custom-made?). We got used
to it, but if we take a step back, it looks like this is one and the
same activity with a different cloak depending on who you’re talking to.</p><p>Guix provides a unified approach to software deployment; building,
deploying, publishing binaries, and even <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-pack.html">building container
images</a>
all build upon the same fundamental mechanisms. We have seen in this
blog post that this makes it easy to continuously build and publish
package binaries. The productivity boost is twofold: local
recompilation goes away, and site-specific software validation is
reduced to a minimum.</p><p>For HPC practitioners and hardware vendors, this is a game changer.</p><h1>Acknowledgments</h1><p>Thanks to Lars-Dominik Braun, Simon Tournier, and Ricardo Wurmus for
their insightful comments on an earlier draft of this post.</p>Guix-HPC Activity Report, 2022Céline Acary-Robert, Ludovic Courtès, Yann Dupont, Marek Felšöci, Konrad Hinsen, Ontje Lünsdorf, Pjotr Prins, Philippe Swartvagher, Simon Tournier, Ricardo Wurmusguix-devel@gnu.org2023-02-10T15:45:00Z<p><em>This document is also available as
<a href="https://hpc.guix.info/static/doc/activity-report-2022.pdf">PDF</a>
(<a href="https://hpc.guix.info/static/doc/activity-report-2022-booklet.pdf">printable
booklet</a>).</em></p><p>Guix-HPC is a collaborative effort to bring reproducible software
deployment to scientific workflows and high-performance computing (HPC).
Guix-HPC builds upon the <a href="https://guix.gnu.org">GNU Guix</a> software
deployment tools and aims to make them useful for HPC practitioners
and scientists concerned with dependency graph control and customization and, uniquely, reproducible research.</p><p>Guix-HPC was launched in September 2017 as a joint software development
project involving three research institutes:
<a href="https://www.inria.fr/en/">Inria</a>, the <a href="https://www.mdc-berlin.de/">Max Delbrück Center for
Molecular Medicine (MDC)</a>, and the <a href="https://ubc.uu.nl/">Utrecht
Bioinformatics Center (UBC)</a>. GNU Guix for HPC and
reproducible science has received contributions from additional
individuals and organizations, including <a href="https://www.cnrs.fr/en">CNRS</a>,
<a href="https://u-paris.fr/en/">Université Paris Cité</a>, the <a href="https://uthsc.edu/">University of
Tennessee Health Science Center</a> (UTHSC), the
<a href="https://leibniz-psychology.org/">Leibniz Institute for Psychology</a>
(ZPID), <a href="https://www.csl.cornell.edu/">Cornell University</a>, and a
growing number of organizations deploying Guix on their HPC clusters.</p><p>This report highlights key achievements of Guix-HPC between <a href="https://hpc.guix.info/blog/2022/02/guix-hpc-activity-report-2021/">our
previous
report</a>
a year ago and today, February 2023. This year was marked by exciting
developments for HPC and reproducible workflows: the release of
<a href="https://guix.gnu.org/en/blog/2022/gnu-guix-1.4.0-released/">GNU Guix 1.4.0 in
December</a>,
the celebration of ten years of Guix with a <a href="https://10years.guix.gnu.org">three-day
conference</a>, several releases of the Guix
Workflow Language (GWL), more work on supporting RISC-V processors, and
more publications relying on Guix as a foundation for reproducible
computational workflows.</p><h1>Outline</h1><p>Guix-HPC aims to tackle the following high-level objectives:</p><ul><li><em>Reproducible scientific workflows.</em> Improve the GNU Guix tool set
to better support reproducible scientific workflows and to simplify
sharing and publication of software environments.</li><li><em>Cluster usage.</em> Streamline Guix deployment on HPC clusters and
provide interoperability with clusters not running Guix.</li><li><em>Outreach &amp; user support.</em> Reach out to the HPC and scientific
research communities and organize training sessions.</li></ul><p>The following sections detail work that has been carried out in each of
these areas.</p><h1>Reproducible Scientific Workflows</h1><p><img src="https://hpc.guix.info/static/images/blog/lab-book.svg" alt="Lab book." /></p><p>Supporting reproducible research workflows is a major goal for Guix-HPC.</p><h2>Guix Workflow Language</h2><p>The <a href="https://workflows.guix.info">Guix Workflow Language</a> (or GWL) is
a scientific computing extension to GNU Guix's declarative language
for package management. It allows for the declaration of scientific
workflows, which will always run in reproducible environments that GNU
Guix automatically prepares. The general idea with the GWL is a
simple inversion of priorities: put reproducible software deployment
<em>first</em> and <em>extend</em> the deployment infrastructure provided by Guix
with tools to declare and run workflows. As a consequence, the GWL
benefits directly from the continued development of Guix's salient
features pertaining to software reproducibility and reliable,
predictable deployment. Much of the work on the GWL is thus aimed at
recasting these features through the lens of a domain-specific
language for describing workflows as a graph of processes that are
inextricably linked with their associated software stacks.</p><p>The year 2022 saw three releases of the Guix Workflow Language:
version 0.4.0 on January 28, version 0.5.0 on July 21, and version
0.5.1 on November 13, representing the cumulative efforts of four
contributors. The changes include fixes to errors discovered in
active use of the GWL for scientific workflows, adjustments in the
details of how the GWL extends Guix, and laying the groundwork for
improved performance.</p><p><img src="/static/images/blog/gwl-logo-black.png" alt="Logo of the Guix Workflow Language." /></p><p>The German National Research Data Infrastructure—specifically its
engineering sciences branch <a href="https://nfdi4ing.de/">NFDI4Ing</a>—
recognizes workflow management systems as an important tool towards
reproducible and reusable scientific workflows. A special interest
group discussed and compared several workflow management systems,
including GWL, along three different user story perspectives. The
discussion paper entitled “Evaluation of tools for describing,
reproducing and reusing scientific workflows” highlights GWL’s
abilities to easily reproduce compute environments and to provide
precise software provenance tracking as well as its flexible workflow
definition. The special interest group recommends the GWL to
specialists with high requirements on software reproducibility and
integrity. The <a href="https://preprints.inggrid.org/repository/view/5/">preprint of the discussion paper is available
here</a>.</p><h2>Reproducible GNU R Environments</h2><p>The R language is widely used for statistics in general and notably in
bioinformatics. A common practice for
installing R packages, from within the <code>R</code> session, is to run the
<code>install.packages</code> utility: it allows users to download and install packages
from <a href="https://cran.org">CRAN</a> and CRAN-like
repositories such as <a href="https://bioconductor.org/">Bioconductor</a>, or from
local files.</p><p>While convenient, use of <code>install.packages</code> raises the question of the
level of control over the software “supply chain”. Some R packages are not just plain
R scripts: they also contain C, C++, or Fortran parts, mainly for performance,
or require external system-wide dependencies unmanaged by
<code>install.packages</code>, such as linear algebra libraries. Therefore, computational
environments populated with the built-in utility <code>install.packages</code> might not
be reproducible from one machine to another.</p><p>This is where the <code>r-guix-install</code> package comes in. <code>r-guix-install</code>,
<a href="https://CRAN.R-project.org/package=guix.install">which is available on
CRAN</a>, allows users to
install R packages <em>via</em> Guix from within the running <code>R</code> session,
similarly to <code>install.packages</code>, but where the complete supply chain is
controlled by Guix. In addition, if the requested R package does not
exist in Guix at this time, the package and all its missing dependencies
will be imported recursively and the generated package definitions will
be written to <code>~/.Rguix/packages.scm</code>. This record of imported packages
can be used later to reproduce the environment, and to add the packages
in question to a proper Guix channel (or to Guix itself). <code>guix.install()</code>
not only supports installing packages from CRAN, but also from
Bioconductor or even arbitrary Git or Mercurial repositories, replacing
the need for installation <em>via</em> <code>devtools</code>.</p><p>While this approach works well for individual users, Guix installations with a
larger user base, for instance institution-wide, would benefit from the
default availability of the entire CRAN package collection with pre-built
substitutes to speed up installation times. Additionally, reproducing
environments would include fewer steps if the package recipes were available
to anyone by default.</p><p><img src="/static/images/blog/guix-cran.png" alt="Logo of Guix-CRAN." /></p><p>The new <a href="https://github.com/guix-science/guix-cran"><code>guix-cran</code></a>
channel was built to address that issue. It extends the package collection by providing all CRAN
packages missing in Guix proper and has all of the properties mentioned above.</p><p>Creating and updating <code>guix-cran</code> is fully automated and happens without any
human intervention. The channel itself is always in a usable state, because
updates are tested with <code>guix pull</code> before committing and pushing them.
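</p><p>Using the channel only requires declaring it, for instance in <code>~/.config/guix/channels.scm</code>; the following is a sketch (see the channel’s documentation for the authoritative snippet):</p><pre><code>(cons (channel
        (name 'guix-cran)
        (url "https://github.com/guix-science/guix-cran"))
      %default-channels)
</code></pre><p>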
However, some packages may not build or work, because build or runtime
dependencies (usually undeclared in CRAN itself) are missing. Any improvement
to the already very good Guix CRAN importer, like enhanced auto-detection of
these missing dependencies, also improves the channel’s quality. More
details are available in a
<a href="https://hpc.guix.info/blog/2022/12/cran-a-practical-example-for-being-reproducible-at-large-scale-using-gnu-guix/">blog post</a>.</p><h2>Packages</h2><p>As of this writing, Guix comes with more than 22,000 packages, which
makes it one of the ten biggest free software distributions <a href="https://repology.org">according
to Repology</a>. This is the result of more than
15,000 commits made by 343 people since last year—an impressive level of
activity sustained thanks to the Guix tooling and continuous integration
services.</p><p>Many scientific packages have been added or upgraded in Guix. As an
example, Bioconductor, the R suite for bioinformatics, was upgraded to
3.16; OCaml 5 with support for shared memory parallelism and effect
handlers was introduced; the <code>snakemake</code> package in Guix received an
important bug fix, making <code>snakemake</code> usable for parallel execution on
HPC clusters. The most common scientific and HPC packages were updated
and improved: Open MPI and its many dependencies, SLURM, OpenBLAS,
Scotch, SUNDIALS, and GROMACS, to name a few. The Julia package set is
still growing; Julia was upgraded to 1.6.7 and then to 1.8.3, with fixes
for i686 and improvements to the Julia build system.</p><p>In addition to the growing collection of curated packages provided as
part of the main Guix channel, we maintain a number of special-purpose
channels that provide additional packages for scientific and
high-performance computing. An up-to-date list of Guix channels
maintained by members of the Guix HPC effort is <a href="https://hpc.guix.info/channels/">available on the
project page</a>. The <a href="https://hpc.guix.info/browse">on-line package
browser</a> also makes it easier to
navigate channels.</p><p>The <a href="https://github.com/guix-science/guix-science">Guix-Science
channel</a>, initiated in
2021, now provides more than 600 packages, complementing the rich
scientific package collection available in Guix proper. Chief among the
changes it received this year are an update of
<a href="https://hpc.guix.info/package/rstudio">R Studio</a> and improvements to
the Jupyter Lab and Jupyter Hub packaging, and the addition of Integrative
Genomics Viewer (IGV).</p><h2>Ensuring Source Code Availability</h2><p>The <a href="https://10years.guix.gnu.org/">10 Years of Guix</a> event was an
opportunity for developers of Guix and Software Heritage (SWH) to discuss
<em>intrinsic identifiers</em>. An intrinsic identifier depends only on the
data content itself; it requires three ingredients for its
computation: a representation of the <em>structure</em> of this content
(serializer), a cryptographic hash algorithm, and an encoding for the
resulting byte string. While converting from one encoding to another is
trivial—e.g., between base64 and base32—it is, naturally, impossible to
“convert” a cryptographic hash to the hash computed by a different
function. All three parameters can be selected with command-line
options to the <code>guix hash</code> command.</p><p>By default Guix computes a SHA256 hash over the Nar serialization of
source archives and version-control checkouts (“Nar” stands for
<em>normalized archive</em>; it is the serialization format inherited from
Nix). Instead, the SWH archive computes the <a href="https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html">SHA1 hash of a
Git-serialized representation of the
files</a>.
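</p><p>Both flavors can be computed locally with <code>guix hash</code>; here is a sketch, with option names as documented in the Guix manual, run from the top of a checkout:</p><pre><code>$ guix hash --serializer=nar --hash=sha256 --format=nix-base32 .  # Guix’s default
$ guix hash --serializer=git --hash=sha1 --format=base16 .        # SWH-style
</code></pre><p>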
This discrepancy deprives Guix of a simple and reliable way to query the
SWH archive by content hash. This led to a
<a href="https://gitlab.softwareheritage.org/swh/meta/-/issues/4538">discussion</a>
about the possibility for SWH to compute and preserve Nar hashes as
additional information for code it archives—so-called
<code>ExtID</code> identifiers. Doing so could improve archive coverage for the source code
referenced by Guix packages, in particular for Subversion checkouts as
used by most of the TeX Live packages.</p><p><img src="/static/images/blog/swh-guix.png" alt="Medley of the Software Heritage and Guix logos, by Marla Da Silva." /></p><p>As discussed in last year’s report, Guix contributor Timothy Sample was
awarded a grant <a href="https://www.softwareheritage.org/2022/01/13/preserving-source-code-archive-files/">by Software Heritage and the Alfred P. Sloan
Foundation</a>
to further their work on Disarchive.
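</p><p>In a nutshell, Disarchive offers two commands, sketched here with illustrative file names (<code>assemble</code> fetches the actual file contents, for example from the SWH archive):</p><pre><code>$ disarchive disassemble foo-1.0.tar.gz foo-1.0.dis  # extract tarball metadata
$ disarchive assemble foo-1.0.dis foo-1.0.tar.gz     # restore the original tarball
</code></pre><p>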
<a href="https://ngyro.com/software/disarchive.html">Disarchive</a> bridges the gap
between source code archives (<em>tarballs</em>) packages refer to and content
stored in the SWH archive. It does so by providing a command to
<em>extract</em> the metadata of a tarball, and another command to <em>reassemble</em>
metadata and content, thereby restoring the original tarball. This work
is key to improving source code availability for the many packages built
from source code tarballs.</p><p>Last year, the Guix project deployed infrastructure to continuously
build and publish a Disarchive database at
<a href="https://disarchive.guix.gnu.org/"><code>disarchive.guix.gnu.org</code></a>. Guix is able
to combine Disarchive and SWH as a fallback when downloading a tarball from its
original URL fails, significantly improving source code archival coverage.</p><p>This work was <a href="https://gitlab.softwareheritage.org/swh/devel/swh-model/-/issues/2430#note_23708">initiated</a>
a few years back and is still ongoing. A
proposal to integrate Disarchive into the SWH archive is <a href="https://gitlab.softwareheritage.org/swh/devel/swh-model/-/merge_requests/267">being
discussed</a>.
We believe Disarchive integration would be a great step forward, not
just for Guix, but for all the distributions and tools that rely on
source tarball availability.</p><h2>Reproducible Research in Practice</h2><p>This section highlights scientific productions made with GNU Guix.</p><p>Guix was used to ensure the reproducibility of experiments for the study of
memory contention between computations and communications on several different
HPC clusters. A <a href="https://gitlab.inria.fr/pswartva/paper-model-memory-contention-r13y">public
companion</a>
explains how to reproduce the experiments with and without GNU Guix.</p><ul><li>Alexandre Denis et al., <a href="https://hal.inria.fr/hal-03871630v1"><em>Predicting Performance
of Communications and Computations under Memory Contention in Distributed
HPC Systems</em></a></li></ul><p>The <a href="https://gitlab.inria.fr/pswartva/paper-starpu-traces-r13y">reproducible
paper</a> about the
impact of tracing on complex HPC application executions, mentioned in the
previous Guix-HPC Activity Report, is still under review for publication.
However, initial feedback from reviewers called for several complementary
experiments, which were carried out about a year after the
initial experiments presented in the paper. Having a complete workflow based on
GNU Guix really helped to dive back into the experimental context and
configurations used a year before!</p><p>Philippe Swartvagher defended his <a href="https://theses.hal.science/tel-03989856">PhD
thesis</a> on the interactions between HPC
task-based runtime systems and communication libraries. In an appendix of the
manuscript, he explains how he used GNU Guix, packages from the Guix-HPC
channel, and Software Heritage on different HPC clusters to ensure the
reproducibility of his experiments.</p><p><img src="/static/images/blog/guix-repro-swartvagher.png" alt="Screenshot of an article referencing its companion code that includes Guix channel and manifest data." /></p><p>The PhD thesis of Marek Felšöci (to be defended in February
2023), which is part of a collaboration between Inria and Airbus, is set in an industrial
aeroacoustic context and deals with direct methods for solving coupled
sparse/dense linear systems. Within the thesis, the author dedicates a full
chapter to the topic of reproducibility. Throughout this chapter, he addresses
the challenges of ensuring a reproducible research study in computer science in
general and in the context of the thesis in particular. The questions related to
the usage of non-free software are discussed as well. The author then presents
the strategy he adopts to face these challenges including working principles,
software tools and their alternatives. To share the resulting guidelines, he
provides a minimal working example of a reproducible research study on solvers
for coupled sparse/dense systems. Moreover, he introduces and references
examples of actual studies from the thesis following the advocated principles
and techniques for improving reproducibility.</p><p><img src="/static/images/blog/pigx.svg" alt="Logo of PiGx." /></p><p>The latest addition to the <a href="https://bioinformatics.mdc-berlin.de/pigx/">PiGx framework of reproducible scientific
workflows backed by Guix</a>
is PiGx SARS-CoV-2, a pipeline for analysing
data from sequenced wastewater samples and identifying given lineages
of SARS-CoV-2. The output of the PiGx SARS-CoV-2 pipeline is
summarized in a report which provides an intuitive visual overview
about the development of variant abundance over time and location.
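</p><p>Reproducing such a software environment with <code>guix time-machine</code> typically boils down to a single command, sketched here with the file names conventionally shipped alongside a paper:</p><pre><code>$ guix time-machine -C channels.scm -- shell -m manifest.scm
</code></pre><p>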
This is the first of the released PiGx pipelines that comes with
concise yet comprehensive instructions on how to use <code>guix time-machine</code> to reproduce the software environment used for the
analyses presented in the paper:</p><ul><li>Vic-Fabienne Schumann et al., <a href="https://doi.org/10.1016/j.scitotenv.2022.158931"><em>SARS-CoV-2 infection dynamics
revealed by wastewater sequencing analysis and
deconvolution</em></a></li></ul><p>Guix was used as the computational environment manager of
biomedical research on the administration of the azithromycin drug after
allogeneic hematopoietic stem cell transplantation for hematologic
malignancies. Studying 240 samples from patients randomized in this phase 3
controlled clinical trial was a unique opportunity to better understand the
mechanisms underlying relapse, the leading cause of mortality after
transplantation. The various data processing scripts and associated
computational environments using
<code>manifest.scm</code>
and <code>channels.scm</code>
files for use with <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-time_002dmachine.html"><code>guix time-machine</code></a>
and <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-shell.html"><code>guix shell</code></a>
are available <a href="https://gitlab.com/nivall/azimut-blood">here</a>,
<a href="https://gitlab.com/nivall/azimut-in-vitro">there</a> or
<a href="https://gitlab.com/nivall/azimutscrna">there</a>.</p><ul><li>Nicolas Vallet et al. <a href="https://doi.org/10.1182/blood.2022016926"><em>Azithromycin promotes relapse by disrupting immune
and metabolic networks after allogeneic stem cell
transplantation</em></a></li></ul><h1>Cluster Usage and Deployment</h1><p>As part of our effort to streamline Guix deployment on HPC clusters, we
updated and improved our <a href="https://guix.gnu.org/cookbook/en/html_node/Installing-Guix-on-a-Cluster.html">cluster installation
guide</a>,
which is now part of the Guix Cookbook. The guide describes the steps
needed to get Guix running on a typical HPC cluster where nodes come
with a distribution other than Guix System, such as CentOS or Rocky
Linux.</p><p>The sections below highlight the experience of cluster administration
teams and report on tooling developed around Guix for users and
administrators on HPC clusters.</p><h2>Genetics Research Cluster at UTHSC</h2><p>At the University of Tennessee Health Science Center (UTHSC) in Memphis
(USA), we are running an 11-node large-memory <a href="http://genenetwork.org/facilities/">HPC Octopus
cluster</a> (264 cores) dedicated to
pangenome and genetics research. In 2022, more storage was added. What is notable
about this HPC is that it is <em>administered by the users themselves</em>.
Thanks to GNU Guix we install, run and manage the cluster as researchers
(and roll back in case of a mistake). UTHSC’s information technology
(IT) department manages the infrastructure—i.e., physical placement,
routers and firewalls—but beyond that there are no demands on IT.
Thanks to out-of-band access we can completely (re)install machines
remotely.</p><p>Octopus runs GNU Guix on top of a minimal Debian install and we are
experimenting with Guix System nodes that can be run on demand.
LizardFS is used for distributed network storage. Almost all deployed
software has been packaged in GNU Guix and can be installed by regular
users on the cluster without root access, see the
<a href="https://gitlab.com/genenetwork/guix-bioinformatics">guix-bioinformatics</a>
channel.</p><h2>Tier-2 Research Cluster at GliCID</h2><p><a href="https://www.glicid.fr/">GliCID</a>, a Tier-2 cluster in Nantes (France), will
have a new computing cluster installed in the summer of 2023. To retain
control over the system and avoid proprietary tools specific to this
type of facility, GliCID chose to build an independent cluster
infrastructure into which the newly delivered cluster will be
integrated.</p><p>This infrastructure consists of virtual machines (VMs) generated from
Guix operating system definitions and providing services such as
identity management, databases, monitoring, high availability, login
machines, Slurm servers, and documentation servers—over 20 VMs in total.
The generated images are pushed directly to Ceph RBD volumes and
consumed by KVM hypervisors, which avoids a deployment phase. Now
fully operational, this infrastructure is entering a test phase. The
choice of Guix has proven well suited to controlling the whole
infrastructure and to obtaining redeployable, reproducible, and easily
scalable machines.</p><p>Compute nodes are a mix of virtual compute machines running Guix System,
and physical machines from a previous cluster running another
distribution. Making native and “foreign” Guix installations coexist
while guaranteeing the consistency of the profiles turned out to be
challenging. One specific issue GliCID overcame was managing an
independent <code>/gnu/store</code> shared by all the nodes, as per the <a href="https://guix.gnu.org/cookbook/en/html_node/Installing-Guix-on-a-Cluster.html">standard
cluster setup
instructions</a>,
and merging the <code>/gnu/store</code> directory of native nodes via overlayfs.</p><p>In 2023, GliCID plans to increase the share of infrastructure machines
running Guix System, to consolidate more code and improve the quality of
operating system definitions, packages, and services that have been
<a href="https://gitlab.univ-nantes.fr/glicid-public/guix-glicid">developed
internally</a>,
and to contribute more of these upstream.</p><h2>Packages as Environment Modules</h2><p><img src="/static/images/blog/modules-logo.svg" alt="Environment Modules logo." /></p><p>To support seasoned HPC users and system administrators, we developed
<a href="https://hpc.guix.info/blog/2022/05/back-to-the-future-modules-for-guix-packages/">Guix-Modules</a>,
a tool to generate <a href="http://modules.sourceforge.net/"><em>environment
modules</em></a>. Environment modules are a
venerable tool that lets HPC users “load” the software environment of
their choice <em>via</em> the <code>module load</code> command. This gives a lot of flexibility:
users can use their favorite software packages without interfering with
one another, and they can also manipulate different environments. The
downside of this tool is that modules are all too often handcrafted on each
cluster: an <code>openmpi/4.1.4</code> module might be called differently on
another cluster, or it might be a different version, or it might be
built with different options. In other words, use of modules is usually
specific to one cluster, and users have to “port” their code when
switching to a different cluster, as they cannot expect to find the same
modules.</p><p>Nevertheless, the <code>module</code> command remains widespread, well-known, and
convenient. Guix-Modules generates modules for the chosen Guix
packages, such that users can then run <code>module load</code> to use them,
without having any knowledge of Guix. For system administrators, the
benefit is obvious: instead of having to build and maintain tens of
modules for scientific software, they can instead generate them all at
once and provide users with battle-tested packages found in Guix. For
users, the immediate benefit is a smooth transition to Guix, but also
reproducibility and provenance tracking: the generated modules record
provenance information, which allows users to deploy the exact same
software elsewhere or at a different point in time.</p><p>A similar interoperability layer was previously developed for the Spack
and EasyBuild package managers with similar motivations. In the case of
Guix, we hope this will help users accustomed to <code>module</code> migrate towards
reproducible deployment without having to change their habits overnight.</p><h2>Containers, Singularity, and Docker</h2><p>For HPC environments that do not support running native Guix software
deployment, Guix supports building lightweight, reproducible containers
that only have the software that is really needed. At UTHSC we are
distributing binary deployments as Docker containers that run on
state-of-the-art HPC systems. These containers were developed and
tested first on a separate computer with GNU Guix installed, and
produced with <code>guix pack</code>.</p><p>Research teams at Inria resort to <code>guix pack</code> as well when targeting
supercomputers where Guix is not installed. Scientists can deploy their
software using Guix directly on clusters that support it, such as
Grid’5000, PlaFRIM, and some of the Tier-2 clusters; when they need to
deploy it on Tier-1 supercomputers, they build a Singularity image that
they ship and run there. This is both a productivity boost—no need to
manually rebuild software!—and the guarantee that they <em>are</em> running the
same software.</p><p>Having Guix available on those supercomputers would of course make the
process even smoother; we plan to engage with those cluster
administration teams to make Guix available in the future.</p><h2>Supporting POWER9 and RISC-V CPUs</h2><p>While it is perhaps early days to call RISC-V an HPC platform, there are indicators that this may happen in the near future with investments from the <a href="https://www.tomshardware.com/news/risc-v-cluster-demonstrated">USA</a>, the <a href="https://www.european-processor-initiative.eu/epi-epac1-0-risc-v-test-chip-samples-delivered/">EU</a>, India, and China. RISC-V hardware platforms and vendors will become common in the coming years.</p><p><img src="/static/images/blog/risc-v.png" alt="RISC-V logo." /></p><p>Together with Chris Batten of Cornell and Michael Taylor of the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs responsible for leading the NSF-funded <a href="https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge">RISC-V supercomputer for pangenomics</a>. It will incorporate GNU Guix and the <a href="https://guix.gnu.org/en/blog/2019/guix-reduces-bootstrap-seed-by-50/">GNU Mes bootstrap</a>, with input from Arun Isaac, Efraim Flashner and others. <a href="https://nlnet.nl/project/current.html">NLNet</a> is funding RISC-V support for GNU Guix with Efraim Flashner and the GNU Mes RISC-V bootstrap project with Ekaitz Zarraga and Jan Nieuwenhuizen. We aim to continue adding RISC-V support to GNU Guix at a rapid pace. After the Guix days in Paris, Alexey Abramov was the first to bootstrap GNU Guix for RISC-V on the Polarfire platform.</p><p>Why is the combination of GNU Mes and GNU Guix exciting for RISC-V? First of all, RISC-V is a very modern modular open hardware architecture that provides further guarantees of transparency and security. It extends reproducibility to the transistor level and for that reason generates interest from the Bitcoin community, for example. 
Because there are no licensing fees involved, RISC-V is already a major force in IoT and will increasingly penetrate hardware solutions, such as storage microcontrollers and network devices, going all the way to GPU-style parallel computing and many-core solutions with thousands of cores on a single die. GNU Mes and GNU Guix are particularly suitable for RISC-V because Guix can optimize generated code for different RISC-V targets and is able to parameterize deployed software packages for included/excluded RISC-V modules.</p><h1>Outreach and User Support</h1><p>Guix-HPC is in part about “spreading the word” about our approach to
reproducible software environments and how it can help further the goals
of reproducible research and high-performance computing development.
This section summarizes articles, talks, and training sessions given
this year.</p><h2>Articles</h2><p>The following refereed articles about Guix were published:</p><ul><li>Nicolas Vallet, David Michonneau, and Simon Tournier, <a href="https://doi.org/10.1038/s41597-022-01720-9"><em>Toward
practical transparent verifiable and long-term reproducible research
using Guix</em></a>, Nature
Scientific Data, volume 9 issue 1, October 2022</li><li>Ludovic Courtès, <a href="https://hal.inria.fr/hal-03604971"><em>Reproducibility and Performance: Why
Choose?</em></a>, IEEE CiSE volume 4,
issue 3, June 2022</li><li>Ludovic Courtès, <a href="https://doi.org/10.22152/programming-journal.org/2023/7/1"><em>Building a Secure Software Supply Chain with
GNU Guix</em></a>,
Programming Journal, volume 7 issue 1, June 2022</li></ul><p>The following refereed articles about research that uses Guix were published:</p><ul><li>Alexandre Denis et al., <a href="https://hal.inria.fr/hal-03871630v1"><em>Predicting Performance
of Communications and Computations under Memory Contention in Distributed
HPC Systems</em></a></li><li>Vic-Fabienne Schumann et al., <a href="https://doi.org/10.1016/j.scitotenv.2022.158931"><em>SARS-CoV-2 infection dynamics
revealed by wastewater sequencing analysis and
deconvolution</em></a>,
Science of the Total Environment, volume 853, December 2022</li><li>Nicolas Vallet et al. <a href="https://doi.org/10.1182/blood.2022016926"><em>Azithromycin promotes relapse by disrupting immune
and metabolic networks after allogeneic stem cell
transplantation</em></a></li></ul><p>Over the year we published <a href="https://hpc.guix.info/blog/">six articles on the Guix-HPC
blog</a> touching topics such as environment
modules, reproducible R environments, and reproducibility.</p><h2>Talks</h2><p>Since last year, we gave the following talks at the following venues:</p><ul><li><a href="https://archive.fosdem.org/2022/schedule/event/commonworkflowlang/"><em>Concise Common Workflow Language—Concision and Elegance in a
Workflow Language Using
Lisp</em></a>,
FOSDEM, Feb. 2022 (Arun Isaac)</li><li><a href="https://carrv.github.io/2022/"><em>Using Guix in Computer Architecture Research</em> at both the gem5
users' workshop and the Sixth Workshop on Computer Architecture
Research with RISC-V (CARRV'22) in New York City,
NY</a>, June 2022, Christopher Batten (Cornell University), Pjotr Prins, Efraim Flashner, Arun Isaac (The University of Tennessee Health Science Center), Jan van Nieuwenhuizen (Joy of Source), Ekaitz Zarraga (ElenQ Technologies), Tuan Ta, Austin Rovinski (Cornell University), Erik Garrison (The University of Tennessee Health Science Center)</li><li><a href="https://10years.guix.gnu.org/program/#gnu-guix-and-the-risc-v-future"><em>GNU Guix and the RISC-V
Future</em></a>,
Ten Years of Guix, Sep. 2022, (Pjotr Prins)</li><li><a href="https://communs.numerique.gouv.fr/posts/annonce-programme-journee-bluehats-2022/"><em>GNU Guix, vers la reproductibilité
computationnelle</em></a>,
BlueHats session of <a href="https://www.opensource-experience.com/en/">Open Source
Experience</a>, Nov. 2022
(Simon Tournier)</li><li><a href="https://www.ibens.ens.fr/spip.php?article172&lang=en"><em>Toward practical transparent verifiable and long-term reproducible research
using Guix</em></a>,
bioinfo seminar at <a href="https://www.ibens.ens.fr/?lang=en">Institut de Biologie de l'École Normale Supérieure
(IBENS)</a>, Dec. 2022 (Simon
Tournier)</li></ul><h2>Events</h2><p>As in previous years, Pjotr Prins spearheaded the organization of the
<a href="https://archive.fosdem.org/2022/schedule/track/declarative_and_minimalistic_computing/">“Declarative and minimalistic computing”
track</a>
at FOSDEM 2022, which was home to several Guix talks.</p><p><img src="https://10years.guix.gnu.org/static/images/photos/2022_0916_15334700.small.jpg" alt="Group photo around the birthday cake. By Christopher Baines, CC0." /></p><p>This year was also the tenth year of Guix as a project. Its first lines
of code were written in April 2012, and it has since received code
contributions by more than 800 people at an impressive rate, not to
mention non-coding contributions in many areas—from helping out
newcomers, to designing graphics, to translating documentation.</p><p>To celebrate, we organized <a href="https://10years.guix.gnu.org/"><em>Ten Years of
Guix</em></a>, a three-day event that took place
in Paris, France, in September 2022, with <a href="https://10years.guix.gnu.org/sponsors">support from research and
non-profit organizations</a>. About
50 people came to Paris and the event was also live-streamed.</p><p>This event was one of a kind: it brought together scientists and free
software hackers, two communities that evidently have shared values—as
the <em>open science</em> movement demonstrates—and that benefit from one
another. The program was organized as follows:</p><ul><li>Friday, September 16th, was dedicated to <em>reproducible deployment
for reproducible research</em>. Scientists and practitioners shared
their experience building reproducible research workflows, using
Guix and other tools.</li><li>Saturday focused on development <em>with</em> Guix and <em>on</em> Guix, as well
as community topics.</li><li>Sunday had more in-depth presentations of Guix as well as informal
discussions and skill-sharing sessions.</li></ul><p>A total of 34 talks were given and <a href="https://10years.guix.gnu.org/program/">videos are available
on-line</a>—many thanks to the
Debian video team for making it possible!</p><p><img src="https://10years.guix.gnu.org/static/images/photos/2022_0917_15530400.small.jpg" alt="The cake! Picture by Chrisopher Baines, CC0." /></p><p>Oh and of course, we ate not one but two birthday cakes.</p><h2>Training Sessions</h2><p>For the French HPC Guix community, we continued the monthly on-line
event called <a href="https://hpc.guix.info/events/2022/café-guix/">“Café
Guix”</a>, originally started
in October 2021. Each month, a user or developer informally presents a
Guix feature or workflow and answers questions. These sessions are now recorded
and are available on the webpage.</p><p>A mini-tutorial about Guix was presented by Simon Tournier on May 19,
2022 during the French Higher Education and Research Days on Networking (JRES). The
one-hour <a href="https://replay.jres.org/w/3TuYmocHwKtzs7q1VtL1GB">video</a> and the
<a href="https://conf-ng.jres.org/2021/document_revision_2595.html?download">slides</a>
are available (in French). In June, <a href="https://www.inrae.fr/en">INRAE</a>
(the French institute for research in agriculture, food, and environment)
organized in Montpellier a training session covering tools such as
Kubernetes and OpenStack, and hosted a session dedicated to computational
reproducibility where Simon Tournier
<a href="https://gitlab.com/zimoun/fcc-inrae">presented</a>
how Guix can help.</p><p>On May 30, 2022 the Max Delbrück Center for Molecular Medicine in the
Helmholtz Association (MDC) hosted a Guix workshop as part of the Data
Science Café in Berlin. The workshop was entitled “Managing
reproducible and transparent software environments with GNU Guix” and
was presented by Ricardo Wurmus.</p><p>The Inria research center in Nancy (France)
periodically organizes afternoon technical seminars, referred to as “Tuto
Techno”, about a technology or programming language. On June 14, 2022
the research center hosted Marek Felšöci who gave a <a href="https://tuto-techno-guix-hpc.gitlabpages.inria.fr/slides/tuto-techno-guix-hpc.pdf">presentation</a>
on the use of Guix combined with literate
programming with Org Mode for building reproducible research studies. The
presentation was followed by a hands-on session. Attendees were guided
through the process of constructing a standalone Git repository containing a
research study entirely reproducible thanks to Guix and the literate description
of the experimental environment, source code and methods in Org mode. At the end
of the hands-on session, participants learned how to use
Software Heritage to guarantee the long-term availability of their work.
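Concretely, rebuilding such a study boils down to a single pinned command along these lines (file names illustrative; it assumes the repository provides a <code>channels.scm</code> and a <code>manifest.scm</code>):</p><pre><code class="language-console">$ guix time-machine -C channels.scm -- \
    shell -m manifest.scm -- make</code></pre><p>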
The tutorial is self-contained and <a href="https://tuto-techno-guix-hpc.gitlabpages.inria.fr/guidelines/">publicly
available</a> for
anyone who would like to try it out.</p><p>A training session was given during the <a href="https://osd-uga-2022.sciencesconf.org/">Open Science
Days</a>, which took place in
Grenoble, France, 13–15 December 2022. Entitled <a href="https://archive.softwareheritage.org/swh:1:dir:731f3a71e4676c7ec0ab83c72aa060c9a094630a">“<em>Déploiement
reproductible des logiciels scientifiques avec
GNU Guix</em>”</a>
(“Reproducible scientific software deployment with GNU Guix”) and given
by Ludovic Courtès, Konrad Hinsen, and Simon Tournier, the session
introduced the use of <code>guix shell</code> and <code>guix time-machine</code> as the
building blocks of reproducible workflows. Training material is
<a href="https://gitlab.inria.fr/guix-hpc/open-science-days-tutorial">available
on-line</a>.</p><p>Another training session was organized by SARI (part of the DevLog
knowledge network at CNRS) in Grenoble on the 8th of December 2022. It
aimed to help people use Guix on the GriCAD HPC cluster.</p><p>Work has started on a sequel to the <a href="https://www.fun-mooc.fr/en/courses/reproducible-research-methodological-principles-transparent-scie/">Reproducible Research MOOC</a> by Inria Learning Lab, which will include an introduction to Guix for managing software environments for reproducible research.</p><h1>Personnel</h1><p>GNU Guix is a collaborative effort, receiving contributions from more
than 90 people every month—a 50% increase compared to last year. As
part of Guix-HPC, participating institutions have dedicated work hours
to the project, which we summarize here.</p><ul><li>Inria: 2.5 person-years (Ludovic Courtès; contributors to the Guix-HPC
channel: Emmanuel Agullo, Luca Cirrottola, Marek Felšöci, Marc
Fuentes, Nathalie Furmento, Gilles Marait, Florent Pruvost, Matthieu
Simonin, Philippe Swartvagher, Mathieu Verite; system administrator in
charge of Guix on the PlaFRIM and Grid’5000 clusters: Julien
Lelaurain)</li><li>Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC): 2 person-years
(Ricardo Wurmus and Mădălin Ionel Patrașcu)</li><li>University of Paris Cité: 0.75 person-year (Simon Tournier)</li><li>University of Tennessee Health Science Center (UTHSC): 3+ person-years (Efraim Flashner, Bonface Munyoki, Fred Muriithi, Arun Isaac, Jorge Gomez, Erik Garrison and Pjotr Prins)</li><li>CNRS and UGA (GRICAD): 0.3 person-year (Céline Acary-Robert, Pierre-Antoine Bouttier, Oliver Henriot)</li></ul><h1>Perspectives</h1><p>With <a href="https://en.unesco.org/science-sustainable-future/open-science/recommendation">UNESCO’s Recommendation on Open
Science</a>
and the many Open Science initiatives at the national and institutional
levels, awareness of the Open Science and reproducible research
principles is on the rise. Its implications are also better understood,
in particular when it comes to software: software publication and
licensing, issues of software deployment, provenance tracking, and
reproducibility are becoming central to scientific practices.
Addressing these issues requires commitment of the scientific community
at large: scientists, but also research software engineers (RSEs) and
system administrators.</p><p>The Guix-HPC effort is unique in its ability to connect these
communities. This Activity Report as well as the program of the Ten
Years of Guix event earlier this year are proof that researchers,
engineers, and system administrators all have a stake in what we are
building. Together, we shape tools and practices that further Open
Science and make reproducible research workflows practical.</p><p>Bringing these tools and practices to the scientific community is a key challenge for the project.
While Guix gets more recognition as an enabler for reproducible research,
misconceptions persist: that Guix only caters to <a href="https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/">the needs of
“reproducibility
professionals”</a>,
or that reproducibility is <a href="https://hal.inria.fr/hal-03604971/">antithetical to
performance</a>. In the coming year,
we want to reach out to broader user communities—again scientists,
engineers, and system administrators—and to provide training sessions.
It is our mission to put the tools we build in the hands of
practitioners at large.</p><p>There are technical challenges ahead for the coming year, in line with
what we have been doing: improving the user experience for scientists,
improving the user story when running software on a Guix-less cluster,
bridging the gap with users that do not interact with software <em>via</em> the
command line or Jupyter, bringing Guix System and <code>guix deploy</code> to HPC
cluster administrators, and achieving 100% coverage of package source
code in the Software Heritage archive.</p><p>The GNU Guix project turned ten this year. It started with the
development of a “package manager” and is now providing a complete
<em>deployment toolbox</em>: a package manager, but also a development
environment manager, a container provisioning tool, a standalone
operating system, and a cluster deployment tool. Besides its technical
achievements, it has raised the bar of what one can expect in terms of
software deployment—reproducibility, provenance tracking, and
transparency. We are determined to make more strides in that direction.</p><p>There’s a lot we can do and we’d love to <a href="https://hpc.guix.info/about">hear your ideas</a>!</p>Guix-HPC at FOSDEMLudovic Courtèsguix-devel@gnu.org2023-01-24T14:00:00Z<p>As has been the case <a href="https://guix.gnu.org/en/blog/tags/fosdem/">for 9 years
(!)</a>, Guix will be present at
<a href="https://fosdem.org/2023">FOSDEM</a>, the big annual free software
developer conference in Europe. There will be <a href="https://guix.gnu.org/blog/2023/meet-guix-at-fosdem-2023/">no less than ten
Guix-related
talks</a>, of
which the following are particularly relevant to the HPC and
reproducible research communities:</p><ul><li>In the <a href="https://fosdem.org/2023/schedule/track/open_research_tools_and_technology/">Open Research Tools
track</a>,
<a href="https://fosdem.org/2023/schedule/event/openresearch_guix/"><em>Guix, toward practical transparent, verifiable and long-term
reproducible
research</em></a>
will be an introduction to Guix (by Simon Tournier) for an audience
of scientists interested in coming up with scientific practices that
improve verifiability and transparency.</li><li>In the <a href="https://fosdem.org/2023/schedule/track/risc_v/">RISC-V
track</a>, Efraim
Flashner will talk about the latest breakthroughs in <a href="https://fosdem.org/2023/schedule/event/rv_gnu_guix/"><em>Porting
RISC-V to
GNU Guix</em></a>—and
the other way around.</li><li>In the <a href="https://fosdem.org/2023/schedule/track/hpc_big_data_and_data_science/">HPC
track</a>,
Ludovic Courtès will give a lightning talk about CPU tuning in Guix
entitled <a href="https://fosdem.org/2023/schedule/event/cpu_tuning_gnu_guix/"><em>Reproducibility and performance: why
choose?</em></a>.</li></ul><p>There are lots of exciting talks in each of these tracks; check them out!
Talks will be live-streamed so you can join and chat with us even if
you’re not physically present.</p><p>Prior to FOSDEM, the community will meet in person for the <a href="https://guix.gnu.org/blog/2023/meet-guix-at-fosdem-2023/">Guix
Days</a>, two
days to informally discuss organizational matters, technical issues, and
road maps.</p><p>See you in Brussels!</p>CRAN, a practical example for being reproducible at large scale using GNU GuixLars-Dominik Braunguix-devel@gnu.org2022-12-21T15:50:00Z<p>A recent <a href="https://doi.org/10.1038/s41597-022-01143-6">study published in <em>Nature Scientific Data</em> in February
2022</a> gives empirical insight
into the success rate of reproducing R scripts obtained from Harvard’s
Dataverse:</p><blockquote><p><em>We re-executed R code from each of the replication packages using
three R software versions, R 3.2, R 3.6, and R 4.0, in a clean
environment.</em>
[…]
<em>We find that 74% of R files failed to complete without
error in the initial execution, while 56% failed when code cleaning
was applied, showing that many errors can be prevented with good
coding practices.</em></p></blockquote><p>Given that more than half of the published R files failed to run even when
trying to run it with three different R versions, recording the exact
environment software is supposed to run in could be declared a <em>good
coding practice</em> for scientific publications.</p><p>The R ecosystem itself provides tools to capture and restore R software
environments, including <a href="https://rstudio.github.io/packrat/">Packrat</a>
and its successor <a href="https://rstudio.github.io/renv/">renv</a>
which both originate from within the RStudio project. Two replication
packages in the study above used renv while the others did not record
the environment at all.</p><p>Looking at renv more closely reveals that it is able to
capture the current R version and installed packages in a lockfile
called <code>renv.lock</code>. However, <a href="https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/">as noted
before</a>,
restoring an environment comes with a few
<a href="https://rstudio.github.io/renv/articles/renv.html#caveats">caveats</a>:
First of all, renv does not install a different version of R if the
recorded and current version disagree. This is a manual step and up to
the user. The same is true for packages with external dependencies. Those
libraries, their headers and binaries also need to be installed by the
user in the correct version, which is <em>not</em> recorded in the lockfile.
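For reference, the basic renv workflow is just two calls (a sketch, assuming renv is already installed):</p><pre><code class="language-console">$ R -e 'renv::snapshot()'  # record the R version and package versions in renv.lock
$ R -e 'renv::restore()'   # reinstall the recorded packages (but not R itself)</code></pre><p>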
Furthermore renv supports restoring packages installed from git
repositories, but fails if the user did not install git beforehand.</p><p>None of the guesswork and manual installation steps are required
when using GNU Guix, since software in its repositories is
bit-for-bit reproducible. It also provides scripts (“importers”)
to turn packages from various language-specific repositories like
<a href="https://pypi.org/">PyPi</a> for Python, <a href="https://crates.io/">crates.io</a>
for Rust and <a href="https://cran.r-project.org/">CRAN</a> for R into Guix package
recipes.</p><p>An example workflow for the CRAN package
<a href="https://CRAN.R-project.org/package=zoid">zoid</a>, which is not available
in Guix proper, would look like this:</p><ol><li><p>Import the package into a manifest.</p><pre><code class="language-console">$ guix import cran -r zoid > manifest.scm</code></pre></li><li><p>Edit <code>manifest.scm</code> to import the required modules and return a
usable manifest containing the package and R itself.</p><pre><code class="language-scheme">(use-modules (guix packages)
(guix download)
(guix licenses)
(guix build-system r)
(gnu packages cran)
(gnu packages statistics))
(define-public r-zoid …)
(packages->manifest (list r-zoid r))</code></pre></li><li><p>Run your code.</p><pre><code class="language-console">$ guix shell -m manifest.scm -- R -e 'library(zoid)'</code></pre></li></ol><p>Although Guix displays hints about which modules are missing when trying to
use an incomplete manifest, editing the manifest file to include all of
them can be quite tedious.</p><p>For R specifically, the R package
<a href="https://CRAN.R-project.org/package=guix.install">guix.install</a> provides
a way to automate this import. It also uses <code>guix import</code>, but references
dependencies using package specifications like <code>(specification->package "r-bh")</code>. This way no extra logic to figure out the correct module
imports is required. It then extends the package search path, including
the newly written file at <code>~/.Rguix/packages.scm</code>, installs the package
into the default Guix profile at <code>~/.guix-profile</code> and adds this profile
to R’s search path.</p><p>While this approach works well for individual users, Guix installations
with a larger user-base, for instance institution-wide, would benefit
from the default availability of the entire CRAN package collection with
pre-built substitutes to speed up installation times. Additionally,
reproducing environments would include fewer steps if the package
recipes were available to anyone by default.</p><h2>Introducing guix-cran</h2><p>GNU Guix provides a mechanism called “channels”,
which can extend the package collection in Guix
proper. <a href="https://github.com/guix-science/guix-cran">guix-cran</a> does
exactly that: It provides all CRAN packages missing in Guix proper in
a channel and has all of the properties mentioned above. It can be
installed globally via <code>/etc/guix/channels.scm</code> and packages can be
pre-built on a central server.</p><p>As of commit <code>cc7394098f306550c476316710ccad20a510fa4b</code> there are 17431
packages available in guix-cran. 95% of them are buildable and only 0.5%
of these builds are not reproducible via <code>guix build --check</code>. It is
also possible to use old package versions via <code>guix time-machine</code>, similar
to what <a href="https://mran.microsoft.com/documents/rro/reproducibility">MRAN</a>
offers. However, that time-frame only spans about two months right now.</p><p>Creating and updating guix-cran is <a href="https://github.com/guix-science/guix-cran-scripts">fully
automated</a> and happens
without any human intervention. Improvements to the already very good
CRAN importer also improve the channel’s quality. The channel itself
is always in a usable state, because updates are tested with <code>guix pull</code>
before committing and pushing them. However, some packages may not build
or work, because (usually undeclared) build or runtime dependencies are
missing. This could be improved through better auto-detection in the
CRAN importer.</p><p>Currently building the channel derivation is very slow, most
likely due to Guile performance issues. For this reason packages
are split into files by the first letter of their name. This way they can
still be referenced deterministically by their first letter.
Since the number of loadable modules is <a href="https://www.mail-archive.com/guile-devel@gnu.org/msg16244.html">limited to
8192</a>,
creating one module file per package is not possible and putting them
all into the same file is even slower.</p><p>The channel is not signed, because all changes are automated anyway.</p><h2>Usage</h2><p>Using guix-cran requires the following steps:</p><ol><li><p>Create <code>channels.scm</code>:</p><pre><code class="language-scheme">(cons
(channel
(name 'guix-cran)
(url "https://github.com/guix-science/guix-cran.git"))
%default-channels)</code></pre></li><li><p>Create <code>manifest.scm</code>:</p><pre><code class="language-scheme">(specifications->manifest '("r-zoid" "r"))</code></pre></li><li><p>Run:</p><pre><code class="language-console">$ guix time-machine -C channels.scm -- shell -m manifest.scm -- R -e 'library(zoid)'</code></pre></li></ol><p>For true reproducibility it’s necessary to pin the channels to a
specific commit by running</p><pre><code class="language-console">$ guix time-machine -C channels.scm -- describe -f channels > channels.pinned.scm</code></pre><p>once and using <code>channels.pinned.scm</code> instead of <code>channels.scm</code> from there on.</p><h2>Appendix</h2><p>Ludovic Courtès, Simon Tournier and Ricardo Wurmus provided valuable
feedback to the draft of this post.</p><p>The channel statistics above can be reproduced using the following
manifest (<code>channels.scm</code>):</p><pre><code class="language-scheme">(list
(channel
(name 'guix)
(url "https://git.savannah.gnu.org/git/guix.git")
(branch "master")
(commit
"4781f0458de7419606b71bdf0fe56bca83ace910")
(introduction
(make-channel-introduction
"9edb3f66fd807b096b48283debdcddccfea34bad"
(openpgp-fingerprint
"BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA"))))
(channel
(name 'guix-cran)
(url "https://github.com/guix-science/guix-cran.git")
(branch "master")
(commit
"cc7394098f306550c476316710ccad20a510fa4b")))</code></pre><p>And the following Scheme code to obtain a list of all packages provided
by guix-cran (<code>list-packages.scm</code>):</p><pre><code class="language-scheme">(use-modules (guix discovery)
(gnu packages)
(guix modules)
(guix utils)
(guix packages))
(let* ((modules (all-modules (%package-module-path)))
(packages (fold-packages
(lambda (p accum)
(let ((mod (file-name->module-name (location-file (package-location p)))))
(if (member (car mod) '(guix-cran))
(cons p accum)
accum)))
'() modules)))
(for-each (lambda (p) (format #t "~a~%" (package-name p))) packages))</code></pre><p>And this Bash script:</p><pre><code class="language-bash">#!/bin/sh
guix pull -p guix-profile -C channels.scm
export GUIX_PROFILE=`pwd`/guix-profile
source guix-profile/etc/profile
guix repl list-packages.scm > packages
cat packages | parallel -j 4 'rm -f builds/{} && guix build --no-grafts --timeout=300 -r builds/{} -q {} 2>&1 && guix build --no-grafts --timeout=300 --check -q {} 2>&1' | tee build.log
echo "total" && wc -l packages
echo "success" && sort -u build.log | grep '^/gnu/store' | wc -l
echo "failure" && sort -u build.log | grep 'failed$' | wc -l
echo "non-reproducible" && sort -u build.log | grep 'differs$' | wc -l</code></pre>Is reproducibility practical?Ludovic Courtèsguix-devel@gnu.org2022-07-21T15:00:00Z<p>Our attention was recently caught by a nice slide deck on the methods
and tools <a href="https://web.archive.org/web/20220620120430/https://umr-astre.pages.mia.inra.fr/presentations/reproducible-research-in-r/#/principles-of-reproducible-research">for reproducible research in
R</a>.
Among those, the talk <a href="https://web.archive.org/web/20220620120430/https://umr-astre.pages.mia.inra.fr/presentations/reproducible-research-in-r/#/guix">mentions
Guix</a>,
stating that it is “<em>for professional, sensitive applications that
require <strong>ultimate reproducibility</strong></em>”, which is “<em>probably a bit
overkill for Reproducible Research</em>”. While we were flattered to see
Guix suggested as a good tool for reproducibility, the very notion that
there’s a kind of “reproducibility” that is “ultimate” and, essentially,
impractical, is something that left us wondering: What kind of
reproducibility do scientists need, if not the “ultimate” kind? <em>Is
“reproducibility” practical at all</em>, or is it more of a horizon?</p><p>In this post, we question the way we Guix people have been discussing
“reproducibility” in the context of software deployment. We identify
sources of confusion and show that reproducibility is a <em>means</em> that can
help achieve different goals. Our conclusion, perhaps unsurprisingly,
is that the kinds of “reproducibilities” offered by a tool like Guix are
not a luxury for a professional elite: they’re a foundation for reliable
software deployment and for verifiable research.</p><h1>Two kinds of reproducibility</h1><p>When we talk about “reproducibility” in the context of Guix, we really
have two related but different goals in mind. The first goal is being
able to <em>redeploy the same software environment</em> on different machines
or at different points in time, with little effort.</p><p>This first goal is very practical: it’s about letting everyone on a team
use the same software, it’s about letting you install the same software
on two different machines, whether it’s a laptop running Guix System, a
virtual machine running Debian, or a supercomputer running CentOS, and
it’s about letting you rerun the computational experiment of a
scientific article months later.</p><p>The second goal is <em>verifiability</em>. Let’s imagine a scenario where you
publish an article and, as accompanying material, you publish source
code together with a Docker image on Zenodo containing the code that was
<em>supposedly</em> used to produce the results in the article and that
<em>supposedly</em> corresponds to that source code.</p><p>I say “supposedly” because you cannot tell for sure unless you verify.
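</p><p>One way to shrink that “supposedly” is to produce the image itself with a
tool whose output can be rebuilt and compared. A sketch with Guix (the
package selection here is merely illustrative):</p><pre><code>guix pack -f docker -S /bin=bin python python-numpy
</code></pre><p>The command prints the <code>/gnu/store</code> file name of the resulting image
tarball; anyone using the same Guix commit can re-run it and compare hashes.</p><p>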
There are two hypotheses one might want to verify:</p><ol><li>That the source code matches the binary in the Docker image;</li><li>That the program produces the output shown in the article.</li></ol><p>Scientific conferences now often have Artifact Evaluation Committees,
which in practice verify that source code is available, and, when things
go well, that the container image can produce the results shown in the
article—the source/binary correspondence is all too often left out as a
technical detail. Reproducible research is about being able to verify
research outcomes though, and executable artifacts are one such outcome.</p><h1>“Professional” vs. “good enough”</h1><p>“<em>I see what you’re headed to</em>”, you note, “<em>but bit-for-bit
reproducibility is overkill, I don’t</em> need <em>it</em>.” Wait, we didn’t even
mention bit-for-bit reproducibility (yet)!</p><p>Let’s get back to the first of our two goals: the ability to deploy the
same software environment, anytime anywhere. Maybe there are “good
enough” approaches, not as “overkill” as what Guix does, yet ones that
achieve that goal?</p><p>Maybe. The slide deck mentioned above is concerned primarily with
<a href="https://www.r-project.org">GNU R</a> software. At almost 30 years, R is
all wisdom and reliability. The language rarely changes, and its developers
pay attention to backward compatibility, minimizing breakage for the
thousands of user-contributed packages available on
<a href="https://cran.r-project.org/">CRAN</a>. If your software environment
consists entirely of R modules, the
<a href="https://rstudio.github.io/packrat/">Packrat</a> tool can do wonders: it
can create snapshots of the package name/version pairs used in your
session and eventually <a href="https://rstudio.github.io/packrat/walkthrough.html">restore those
snapshots</a> by
looking up those name/version pairs. It is “good enough” in the sense
that the restored environment is “likely” to behave “similarly” to
the original environment. It is not “ultimate
reproducibility” because there are many things that could lead to
different behavior: you might be restoring with a different version of
R, or one built or configured differently, with a different set of
dependencies, or it might run on a different operating system.</p><p>This approach falls short for software environments that are not 100% R.
This is not uncommon, if you think about R packages that wrap C/C++
libraries (zlib, Cairo, cURL, Eigen, etc.). Those libraries are beyond
the scope of Packrat; whether Packrat can restore an R package that
depends on C/C++ libraries depends on external factors: whether those
libraries were pre-installed through some other means, whether the
“right” versions are available, whether a C/C++ compiler is available,
and so on. It might succeed, or it might fail at build time (due to the
lack of a suitable compiler or dependencies) or at run time (due to
binary incompatibilities, different dependency versions or build
options, etc.). What’s “good enough” for 100% R projects isn’t good
enough to let you redeploy polyglot environments.</p><p>Other package management tools that have a partial vision of the
dependency graph—from <code>pip</code> and Conda to EasyBuild and Spack—suffer from
that shortcoming. They may or may not be able to redeploy software
packages; those packages might fail to build, because <a href="https://hpc.guix.info/blog/2017/09/reproducibility-and-root-privileges/">their build
environment is not tightly
controlled</a>,
or they might fail at run time <a href="https://hpc.guix.info/blog/2021/09/whats-in-a-package/">due to binary
incompatibilities</a>.
These are very practical problems.</p><h1>Bit for bit</h1><p>This brings us to our second goal: verifiability. For us developers of
package management tools, the question is: how can we enable users to
<em>independently verify</em> the source/binary correspondence? In our
artifact evaluation scenario, we might want to provide reviewers with a
Docker image for convenience, but how can we let them verify that the
binaries in that image correspond to the accompanying source code?</p><p>This is where <em>reproducible builds</em> come in: as a <a href="https://guix.gnu.org/en/blog/2015/reproducible-builds-a-means-to-an-end/">means to allow for
independent verification of the source/binary
correspondence</a>.
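</p><p>Guix ships a tool for precisely this kind of independent verification:
<code>guix challenge</code> compares the store items you have with what
independent build farms publish, hash by hash. A quick sketch:</p><pre><code>guix challenge openblas \
  --substitute-urls="https://ci.guix.gnu.org https://bordeaux.guix.gnu.org"
</code></pre><p>Any store item whose contents differ across builders is reported—a sign of
a non-reproducible, or possibly compromised, build.</p><p>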
The definition that many in the field agree on
<a href="https://reproducible-builds.org/docs/definition/">states</a>:</p><blockquote><p>A build is reproducible if given the same source code, build
environment and build instructions, any party can recreate bit-by-bit
identical copies of all specified artifacts.</p></blockquote><p>“Bit-by-bit identical copies”. That phrase suggests perfection.
Perfection doesn’t exist though, and it’s not unusual for scientists and
practitioners to stop reading at “bit-by-bit”, saying: “<em>nah—this is
nice in theory but just impractical and overkill</em>”.</p><p>Think about it though: how hard can it be to make a software build
process reproducible bit-for-bit? Fortunately, compilers behave in a
deterministic fashion: given the same input, they produce the same
output. Experience with software distributions as large as Debian, Arch
Linux, NixOS, and Guix has shown that there’s a core of <a href="https://reproducible-builds.org/docs/">well-identified
sources of non-reproducibility</a>.
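</p><p>To make those sources concrete, here is a self-contained illustration,
outside of Guix, using GNU tar: archive metadata—not file contents—breaks
bit-for-bit equality, and normalizing it restores it:</p><pre><code>dir=$(mktemp -d); cd "$dir"
mkdir demo && echo hello > demo/file

tar -cf a.tar demo
sleep 1 && touch demo/file        # bump the timestamp; contents unchanged
tar -cf b.tar demo
cmp -s a.tar b.tar || echo "archives differ"

# Normalize file ordering, ownership, and timestamps:
tar --sort=name --owner=root --group=root --mtime='@0' -cf c.tar demo
tar --sort=name --owner=root --group=root --mtime='@0' -cf d.tar demo
cmp -s c.tar d.tar && echo "archives identical"
</code></pre><p>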
Addressing them takes some effort but is not insurmountable: <a href="https://isdebianreproducibleyet.com/">more than 90% of Debian
packages</a> and <a href="https://data.guix.gnu.org/repository/1/branch/master/latest-processed-revision/package-reproducibility">at least 75% of
Guix
packages</a>
are indeed reproducible bit-for-bit. Guix provides users
<a href="https://guix.gnu.org/manual/devel/en/html_node/On-Trusting-Binaries.html">with</a>
<a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-challenge.html">tools</a>
that, we hope, are accessible to those who are not professionals in the
field of bit-for-bit reproducibility.</p><p>The same goes at a higher level. Earlier we wrote that a tool like
Packrat can let you restore an environment “likely to behave similarly”
compared to the original one. How would one define “similarly” though?
If the computation produces different output, what conclusion can you
draw? Will you incriminate the method, when you know your software
environment doesn’t faithfully mirror the one that was originally used?
No, you’ll have at best a lot of guesswork to do before you can draw any
conclusion. Conversely, if you know you deployed the same software,
bit-for-bit, then you’ve significantly reduced the search space in case
the computation produces different output. Bit-for-bit reproducibility
might <em>sound</em> overkill, but it’s the only practical way to determine
whether a computational process is reproducible.</p><h1>Practicality</h1><p>This blog post was ignited by a slide deck. Perhaps what the author
alluded to when they mentioned “<em>ultimate reproducibility</em>” and Guix
being “<em>overkill</em>” is that Guix as a project is on a quixotic quest for
reproducibility; but perhaps what they suggested by framing it as
“<em>professional</em>” is that using it is difficult.</p><p>The answer is that if you liked <code>pip install</code> or <code>apt install</code>, you’ll
love <code>guix install</code>. Over ten years of development, we’ve worked hard
on the user interface and documentation to make it easier to <a href="https://guix.gnu.org/manual/devel/en/html_node/Getting-Started.html">get
started</a>.
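</p><p>In day-to-day use, that means commands in the spirit of the sketch
below—per-user, no root privileges required, and transactional:</p><pre><code>guix install python python-numpy   # install into your own profile
guix package --list-installed      # see what you have
guix package --roll-back           # undo the last transaction
</code></pre><p>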
That doesn’t mean everything’s perfect—one of the talks at the upcoming
Ten Years of Guix event is about <a href="https://10years.guix.gnu.org/program/#how-to-make-gnu-guix-irresistible-in-2022-and-beyond">making Guix more
approachable</a>
and we’re always eager to get feedback from newcomers—but at least the
basics should be accessible to anyone who has used the command line
before, or even just
<a href="https://hpc.guix.info/blog/2019/10/towards-reproducible-jupyter-notebooks/">Jupyter</a>.</p><p>Our message is that it <em>is</em> possible to achieve these two types of
“reproducibility”: the ability to deploy the same environment anywhere
anytime, and the ability to verify the source/binary correspondence of
an existing deployment. “Good enough” solutions are good enough in
narrow cases only. We can and must demand more of our deployment tools.</p><h1>Beyond reproducibility</h1><p>This post focuses on reproducibility, but we should keep in mind that the
scientific process does not consist in merely reproducing experiments
as-is—it’s about experimenting, fiddling with the computation to evaluate
the impact of a parameter on the output, changing parts of the
code, and so forth. In a thoughtful article, Hinsen identifies <a href="https://blog.khinsen.net/posts/2020/11/20/the-four-possibilities-of-reproducible-scientific-computations/">four
“essential
possibilities”</a>
for reproducible computations:</p><blockquote><ol><li><p>The possibility to inspect all the input data and all the source
code that can possibly have an impact on the results.</p></li><li><p>The possibility to run the code on a suitable computer of one’s own
choice in order to verify that it indeed produces the claimed
results.</p></li><li><p>The possibility to explore the behavior of the code, by inspecting
intermediate results, by running the code with small modifications,
or by subjecting it to code analysis tools.</p></li><li><p>The possibility to verify that published executable versions of the
computation, proposed as binary files or as services, do indeed
correspond to the available source code.</p></li></ol></blockquote><p>These four items might look consensual but their practical implications
are wide-ranging. The first item is unlocked by publishing scientific
software under a free license—as UNESCO
<a href="https://www.unesco.org/en/natural-sciences/open-science">recommends</a>—and
the two kinds of reproducibilities discussed in this article support #2
and #4. To <em>explore</em> the behavior of the code, we need more. Guix
eases exploration with <a href="https://guix.gnu.org/manual/devel/en/html_node/Package-Transformation-Options.html">“package transformation
options”</a>,
which let users deploy variants of the software environment, for example
by applying a patch somewhere in the software stack or swapping one
dependency for another. A “frozen” application bundle such as a Docker
image does not provide this lever.</p><p>That most scientific processes now involve software should be an
opportunity to <em>improve</em> reproducibility and provenance tracking and to
facilitate experimentation, not the other way around.</p><h1>Acknowledgments</h1><p>Many thanks to Ricardo Wurmus who provided valuable feedback on an
earlier draft of this post.</p>Celebrating 10 years of Guix in Paris, 16–18 SeptemberLudovic Courtès, Tanguy Le Carrour, Simon Tournierguix-devel@gnu.org2022-06-13T15:00:00Z<p>It’s been <a href="https://guix.gnu.org/en/blog/2022/10-years-of-stories-behind-guix/">ten years of
GNU Guix</a>! To
celebrate, and to share knowledge and enthusiasm, a <a href="https://10years.guix.gnu.org">birthday
event</a> will take place on <strong>September
16–18th, 2022</strong>, in Paris, France. The program is being finalized, but
you can <a href="https://10years.guix.gnu.org">already register</a>!</p><blockquote><p><strong>Update</strong> (2022-07-12): <a href="https://10years.guix.gnu.org/program">Preliminary
program</a> published!</p></blockquote><p><img src="/static/images/blog/10-years-of-guix_colorful-10.gif" alt="10 year anniversary artwork" /></p><p>This is a community event with several twists to it:</p><ul><li>Friday, September 16th, is dedicated to <strong>reproducible research
workflows and high-performance computing</strong> (HPC)—the focuses of the
<a href="https://hpc.guix.info">Guix-HPC</a> effort. It will consist of talks
and experience reports by scientists and practitioners.</li><li>Saturday targets <strong>Guix and free software enthusiasts</strong>, users and
developers alike. We will reflect on ten years of Guix, show what
it has to offer, and present on-going developments and future
directions.</li><li>On Sunday, users, developers, developers-to-be, and other
contributors will <strong>discuss technical and community topics</strong> and
join forces for hacking sessions, <a href="https://en.wikipedia.org/wiki/Unconference">unconference
style</a>.</li></ul><p><a href="https://10years.guix.gnu.org">Check out the web site</a> and consider
registering as soon as possible so we can better estimate the size of
the birthday cake!</p><p>If you’re interested in presenting a topic, in facilitating a session,
or in organizing a hackathon, please get in touch with the organizers at
<code>guix-birthday-event@gnu.org</code> and we’ll be happy to make room for you.
We’re also looking for people to help with logistics, in particular
during the event; please let us know if you can give a hand.</p><p>Whether you’re a scientist, an enthusiast, or a power user, we’d love to
see you in September. Stay tuned for updates!</p><blockquote><p><em>Originally published <a href="https://guix.gnu.org/en/blog/2022/celebrating-10-years-of-guix-in-paris/">on the Guix
blog</a>.</em></p></blockquote>Back to the future: modules for Guix packagesLudovic Courtèsguix-devel@gnu.org2022-05-06T14:45:00Z<p>Some things in our software world are timeless. The venerable
<a href="http://modules.sourceforge.net/">Environment Modules</a> are one of these.
If you’ve ever used a high-performance cluster in the last three
decades, chances are you’re already familiar with it. Modules is about
managing software environments, just like Guix is—or, perhaps more
accurately, <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-shell.html"><code>guix shell</code></a>.</p><p>You will be delighted, or surprised, to learn that Guix
now has a <a href="https://gitlab.inria.fr/guix-hpc/guix-modules">compatibility layer with
Modules</a>.</p><p><img src="/static/images/blog/modules-logo.svg" alt="Environment Modules logo." /></p><h1>The legacy of Modules</h1><p>As Furlani’s <a href="http://modules.sourceforge.net/docs/Modules-Paper.pdf">1991 introductory paper
explains</a>,
Modules were—and still are—a key enabler for Unix users, especially in
high-performance computing (HPC). The <code>module</code> command lets users
manipulate their software environment in terms of packages, without
having to be Unix or shell experts; they let them <em>compose</em> packages and
build the software environment of their choice, without interfering with
other users; they give a level of flexibility that Unix alone wouldn’t
provide. The command-line interface is easily understood:</p><pre><code>module load gcc/11.2</code></pre><p>“loads” GCC 11.2 in your shell. You can
“load” and “unload” software components at will:</p><pre><code>module load python/3.8
module unload gcc</code></pre><p>As an <em>interface</em>, Modules are easy to use and understand.
However, they leave it up to sysadmins (sometimes users) to
actually <em>deploy</em> the software. The common approach has been for
sysadmins to build and install, <em>by themselves</em>, the software that
Modules refer to. The end result is that modules vary from machine to
machine. For example, the <code>gcc</code> module shown above might refer to
GCC 11.2 on one cluster and GCC 8 on another; it might have an entirely
different name on a third cluster. Likewise, the <code>python/3.8</code> module
above might refer to different patch-level versions of Python 3.8, or
it might refer to a variant of Python
built with different dependencies or different build flags.</p><p>These issues have been largely mitigated by package managers such as
<a href="https://easybuild.io/">EasyBuild</a> and <a href="https://spack.io/">Spack</a>: both
automate package builds, and both can generate <a href="https://modules.readthedocs.io/en/stable/modulefile.html"><em>module
files</em></a>—Tcl
snippets that define environment variables to set when “loading” a
module. With EasyBuild and Spack, it becomes possible to not only
automate deployment and module file generation, but also to deploy
<em>similar</em> software on different machines.</p><p>“Similar”, though, does not mean “the same”. Software built with Spack
or EasyBuild depends on software already available on the host system:
it is built <em>on top</em> of a GNU/Linux distribution, which could be
CentOS 7.4 (released in 2017), or Ubuntu 22.04, or really anything else.
Thus, software installed with these tools depends on software provided
by the underlying distribution, at build time and at run time.</p><p>This “hidden dependency” makes it hard to redeploy the exact same
environment on a different machine or at a different point in time: the
same build process
<a href="https://github.com/easybuilders/easybuild-easyconfigs/issues/10666">might</a>
<a href="https://github.com/spack/spack/issues/16780">fail</a>, or it might succeed
but the resulting software might <a href="https://github.com/easybuilders/easybuild-easyconfigs/issues/3408">behave
differently</a>.
<a href="https://hal.inria.fr/hal-01161771/en">Our approach in Guix</a> is to <em>not</em>
have that “hidden dependency”. Instead, the package dependency graph
that Guix manipulates is <em>self-contained</em>: it includes package
definitions for <em>all</em> the user-land software one may use.</p><h1>From Guix to Modules</h1><p>The news today is the release of
<a href="https://gitlab.inria.fr/guix-hpc/guix-modules">Guix-Modules</a>, a new tool to
generate module files from
Guix packages. The primary goal, as with the module file generation
tools in EasyBuild and Spack, is to make it easy for HPC cluster
sysadmins to provide a set of modules for their users—more on that
below. Guix-Modules is an extension of Guix. To use it, you need to
install it and to set the <code>GUIX_EXTENSIONS_PATH</code> environment variable,
like so:</p><pre><code>guix install guix-modules
export GUIX_EXTENSIONS_PATH="$HOME/.guix-profile/share/guix/extensions"</code></pre><p>That gives you a new <code>guix module</code> sub-command.</p><p>Let’s say you want to generate modules to <code>/opt/modules</code> for selected
packages; you can do so by running:</p><pre><code>guix module create -o /opt/modules \
coreutils gcc-toolchain python python-numpy</code></pre><p>As with all Guix commands, it will build or download the packages if they’re not
around already and populate <code>/opt/modules</code> with a bunch of module files.
If <code>/opt/modules</code> already existed, its previous contents are backed up under
<code>/var/guix/profiles</code>, which lets you roll back to the previous modules
should you regret your changes.</p><p>As an admin, you can periodically update the set of modules by running:</p><pre><code>guix pull
guix module create -o /opt/modules …</code></pre><p>The good thing is that users can still access the previous module set,
until you explicitly remove it, under <code>/var/guix/profiles</code>.</p><p>Instead of having those long <code>guix module create</code> command lines, you can
opt for listing the packages of interest in a <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-package.html#index-profile-manifest"><em>manifest
file</em></a>,
which you can keep under version control. As with most other <code>guix</code>
commands, you can pass the manifest with:</p><pre><code>guix module create -m my-modules.scm -o /opt/modules</code></pre><p>Once the modules have been generated, you can happily load and unload
them using the familiar <code>module</code> sub-commands:</p><pre><code>unset MODULEPATH
module use /opt/modules
module load gcc-toolchain/11.2.0
module load python/3.9.9</code></pre><p>Voilà! If you’re a sysadmin, here’s a new way to offer scientific
software to your users without asking them to change their habits. The
generated module files work equally well with <a href="http://modules.sourceforge.net/">the “original” Module
implementation</a> and with
<a href="https://lmod.readthedocs.io/">Lmod</a>.</p><h1>Provenance tracking</h1><p>Since we, Guix developers, pride ourselves on providing a deployment
tool with good support for provenance tracking, we couldn’t just let
that <code>guix module</code> command generate module files of unclear provenance.
Users—we think—ought to be able to determine the provenance of the
modules they use. We want to avoid the scenario many HPC practitioners
are familiar with whereby, six months after publishing an article, you
can no longer reproduce the computational results it contains because
the relevant modules have been upgraded or removed from under your feet
and you just don’t know how to reproduce them.</p><p>Thus, <code>guix module create</code> records provenance data in the module files
it generates. You can view that info by running <code>module help</code>:</p><pre><code>$ module help openblas
----------- Module Specific Help for 'openblas/0.3.18' ------------
This module was generated from a GNU Guix package.
Provenance data (channels):
(list (channel
(url "https://git.savannah.gnu.org/git/guix.git")
(branch "master")
(commit
"4ba35ccd18f90314caa76ea1833ffc383559401c")
(name 'guix)
(introduction
(make-channel-introduction
"9edb3f66fd807b096b48283debdcddccfea34bad"
(openpgp-fingerprint
"BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA")))))
</code></pre><p>What <code>module help</code> shows is the list of
<a href="https://guix.gnu.org/manual/en/html_node/Channels.html"><em>channels</em></a>
from which this particular package was built. The information is in a
format that <code>guix time-machine</code> can readily consume. Assuming you
store the <code>(list (channel …))</code> snippet in file <code>channels.scm</code>, you can
go to another machine, at a later point in time, and deploy <em>the exact
same software</em> with this command:</p><pre><code>guix time-machine -C channels.scm -- \
shell gcc-toolchain openblas</code></pre><p>For users, it makes a big difference: modules are no
longer ephemeral—they’re now a reproducible artifact <em>that
you can redeploy with Guix anywhere, anytime</em>.</p><h1>Customization</h1><p>HPC users are often demanding when it comes to customizing
software build processes. Guix supports this need with a gamut of
<a href="https://guix.gnu.org/manual/en/html_node/Package-Transformation-Options.html">package transformation
options</a>
available from the command line as well as through <a href="https://guix.gnu.org/manual/en/html_node/Defining-Package-Variants.html">programming
interfaces</a>.
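</p><p>For instance (package and file names below are merely illustrative), one
can rewrite the dependency graph straight from the command line:</p><pre><code># Swap one dependency for another throughout the stack:
guix build python-numpy --with-input=openblas=blis

# Apply a local patch to a package before building:
guix build python-numpy --with-patch=python-numpy=./local-fix.patch
</code></pre><p>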
Good news: <code>guix module create</code> honors package transformation options.</p><p>Among those, the <code>--tune</code> option, which instructs Guix to <a href="https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/">optimize
relevant packages for the host
micro-architecture</a>,
may come in handy. If you know your cluster contains only Skylake CPUs,
you’d rather make sure relevant packages are optimized for Skylake. To
do that, you would run, say:</p><pre><code>guix module create --tune=skylake \
gcc-toolchain openblas gsl</code></pre><p>In this particular case, <a href="https://hpc.guix.info/package/gsl">GSL</a> gets
built for Skylake, using GCC’s <code>-march=skylake</code> option (OpenBLAS itself
<a href="https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/">chooses optimized routines at run
time</a>
so it is unaffected).</p><p>“But what about reproducibility?”, you ask. The chosen package
transformation option(s)—<code>--tune</code> in this case—are <em>also</em> recorded as
part of the provenance data. This is what <code>module help</code> reports:</p><pre><code>$ module help gsl
----------- Module Specific Help for 'gsl/2.7' --------------------
This module was generated from a GNU Guix package.
Provenance data (channels):
(list (channel
(url "https://git.savannah.gnu.org/git/guix.git")
(branch "master")
(commit
"4ba35ccd18f90314caa76ea1833ffc383559401c")
(name 'guix)
(introduction
(make-channel-introduction
"9edb3f66fd807b096b48283debdcddccfea34bad"
(openpgp-fingerprint
"BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA")))))
Package transformations:
((tune . "skylake"))
</code></pre><p>The “Package transformations” bit is self-explanatory; it can be
passed as-is to
<a href="https://guix.gnu.org/manual/en/html_node/Defining-Package-Variants.html#index-options_002d_003etransformation"><code>options->transformation</code></a>
in a manifest.</p><p>We strongly believe one <a href="https://hal.inria.fr/hal-03604971">shouldn’t have to choose between performance
and reproducibility</a> and this is what
this feature set supports.</p><h1>Why all the fuss?</h1><p>Guix is <a href="https://guix.gnu.org/en/blog/2022/10-years-of-stories-behind-guix/">ten years
old</a>,
Guix-HPC itself is <a href="https://hpc.guix.info/blog/2017/09/guix-hpc-debut/">turning five this
year</a>, so you might
wonder why after all these years we’re adding a Modules compatibility layer. After
all, <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-shell.html"><code>guix shell</code></a>
can set up software environments on-the-fly in a way that is comparable to
<code>module load</code>. For instance, to start a shell to use GCC and Python as
in the example above, you would type:</p><pre><code>guix shell gcc-toolchain@11 python@3.8</code></pre><p>More generally, Guix puts users in control: it lets them upgrade when
they want to and allows them to <a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-time_002dmachine.html">travel in
time</a>;
it lets them <a href="https://guix.gnu.org/manual/en/html_node/Package-Transformation-Options.html">customize
packages</a>,
and it lets them <a href="https://guix.gnu.org/manual/en/html_node/Replicating-Guix.html">replicate the same
environment</a>
elsewhere or at a different point in time.</p><p>Using Guix directly remains the most empowering approach for users, but
module files created from Guix packages can satisfy a number of user
needs:</p><ol><li>Matching user habits. For some communities, not having to learn a
new command—even if it’s not all that different, even if it has
more to offer—is a big plus. It’s not uncommon for cluster admins
to offer Modules <em>in addition</em> to Guix or other tools for that
reason.</li><li>Supporting incremental software environment construction. With
<code>module</code>, you can “load” and “unload” modules until you obtain the
desired environment, whereas <code>guix shell</code> currently expects a list
of packages upfront. While exploring a problem space, the
incremental mode might be more convenient—and indeed, <a href="https://issues.guix.gnu.org/54375">patches have
recently been discussed</a> to
support an incremental mode in <code>guix shell</code>.</li><li>Supporting simple Guixy cluster setups. The <a href="https://hpc.guix.info/blog/2017/11/installing-guix-on-a-cluster/">typical Guix cluster
setup</a>
requires running the build daemon, ensuring it can access the
network to download source or binaries, making it accessible to
front nodes and (optionally) build nodes, and setting up a couple
of NFS exports. Sysadmins who’d rather not do that can instead use
<code>guix module create</code> and offer those modules to users. The
<code>/gnu/store</code> directory still needs to be exported over NFS, but
that’s a read-only export, and it’s all that’s needed—a simpler
setup.</li></ol><p>If you’re an HPC cluster user or system administrator, we’d love to hear
your thoughts <a href="https://hpc.guix.info/about/">on the <code>guix-science</code> mailing list or <code>#guix-hpc</code> channel
on Libera.chat</a>!</p>Guix-HPC Activity Report, 2021Pierre-Antoine Bouttier, Ludovic Courtès, Yann Dupont, Marek Felšöci, Felix Gruber, Konrad Hinsen, Arun Isaac, Pjotr Prins, Philippe Swartvagher, Simon Tournier, Ricardo Wurmusguix-devel@gnu.org2022-02-03T14:00:00Z<p><em>This document is also available as
<a href="https://hpc.guix.info/static/doc/activity-report-2021.pdf">PDF</a>
(<a href="https://hpc.guix.info/static/doc/activity-report-2021-booklet.pdf">printable
booklet</a>).</em></p><p>Guix-HPC is a collaborative effort to bring reproducible software
deployment to scientific workflows and high-performance computing (HPC).
Guix-HPC builds upon the <a href="https://guix.gnu.org">GNU Guix</a> software
deployment tools and aims to make them useful for HPC practitioners
and scientists concerned with dependency-graph control and customization and, uniquely, with reproducible research.</p><p>Guix-HPC was launched in September 2017 as a joint software development
project involving three research institutes:
<a href="https://www.inria.fr/en/">Inria</a>, the <a href="https://www.mdc-berlin.de/">Max Delbrück Center for
Molecular Medicine (MDC)</a>, and the <a href="https://ubc.uu.nl/">Utrecht
Bioinformatics Center (UBC)</a>. GNU Guix for HPC and
reproducible science has received contributions from additional
individuals and organizations, including <a href="https://www.cnrs.fr/en">CNRS</a>,
the <a href="https://u-paris.fr/en/">University of Paris (Diderot)</a>,
the <a href="https://uthsc.edu/">University of Tennessee Health Science Center</a>
(UTHSC), the <a href="https://leibniz-psychology.org/">Leibniz Institute for
Psychology</a> (ZPID),
<a href="https://www.cray.com">Cray, Inc.</a> (now HPE), and <a href="http://tourbillion-technology.com/">Tourbillion
Technology</a>.</p><p>This report highlights key achievements of Guix-HPC between <a href="https://hpc.guix.info/blog/2021/02/guix-hpc-activity-report-2020/">our
previous
report</a>
a year ago and today, February 2022. This year was marked by exciting
developments for HPC and reproducible workflows: the release of
<a href="https://guix.gnu.org/en/blog/2021/gnu-guix-1.3.0-released/">GNU Guix 1.3.0 in
May</a>, the
ability to tune packages for a CPU micro-architecture with the <code>--tune</code>
option, improved Software Heritage support, new releases of Guix-Jupyter
and the Guix Workflow Language (GWL), support for POWER9 CPUs and
on-going work porting to RISC-V, and more.</p><h1>Outline</h1><p>Guix-HPC aims to tackle the following high-level objectives:</p><ul><li><em>Reproducible scientific workflows.</em> Improve the GNU Guix tool set
to better support reproducible scientific workflows and to simplify
sharing and publication of software environments.</li><li><em>Cluster usage.</em> Streamlining Guix deployment on HPC clusters, and
providing interoperability with clusters not running Guix.</li><li><em>Outreach & user support.</em> Reaching out to the HPC and scientific
research communities and organizing training sessions.</li></ul><p>The following sections detail work that has been carried out in each of
these areas.</p><h1>Reproducible Scientific Workflows</h1><p><img src="https://hpc.guix.info/static/images/blog/lab-book.svg" alt="Lab book." /></p><p>Supporting reproducible research workflows is a major goal for Guix-HPC.
The ability to <em>reproduce</em> and <em>inspect</em> computational
experiments—today’s lab notebooks—is key to establishing a rigorous
scientific method. <a href="https://en.unesco.org/science-sustainable-future/open-science/recommendation">UNESCO’s Recommendation on Open
Science</a>,
published in November 2021, recognizes the importance of free software
in research and further notes (§7d):</p><blockquote><p>In the context of open science, when open source code is a component
of a research process, enabling reuse and replication generally
requires that it be accompanied with open data and open specifications
of the environment required to compile and run it.</p></blockquote><p>This key point is often overlooked: the ability to reproduce and inspect
the software environments of experiments <em>is a prerequisite</em> for
transparent and reproducible research workflows.</p><p>To that end, we work not only on deployment issues, but also <em>upstream</em>—ensuring
source code is archived at Software Heritage—and
<em>downstream</em>—devising
tools and workflows for scientists to use. The sections below summarize
the progress made on these fronts and include experience reports by
two PhD candidates showing in concrete terms how Guix fits in
reproducible HPC workflows.</p><h2>Workflow Languages</h2><p>The <a href="https://workflows.guix.info">Guix Workflow Language</a> (or GWL) is
a scientific computing extension to GNU Guix's declarative language
for package management. It allows for the declaration of scientific
workflows, which will always run in reproducible environments that GNU
Guix automatically prepares. In the past year the GWL has received
several bug fixes and infrastructure for detailed logging; it also
gained a DRMAA process engine to submit generated jobs to any HPC
scheduler with an implementation of DRMAA, such as Slurm and Grid
Engine. This was made possible through the newly released <a href="https://lists.gnu.org/archive/html/guile-user/2021-04/msg00081.html">high-level
Guile bindings to DRMAA version
1</a>.
We <a href="https://lists.gnu.org/archive/html/gwl-devel/2022-01/msg00000.html">released version 0.4.0 of the
GWL</a>
on January 29.</p><p>Earlier in January, we announced <a href="https://ccwl.systemreboot.net/">ccwl</a>, the Concise Common
Workflow Language. ccwl is a workflow language with a concise syntax
compiling to the <a href="https://www.commonwl.org/">Common Workflow
Language</a> (CWL). While GWL offers a novel
workflow language with integrated deployment <em>via</em> Guix, ccwl instead
aims to leverage tooling around the popular Common Workflow Language
while addressing some of its limitations.
We published a <a href="https://hpc.guix.info/blog/2022/01/ccwl-for-concise-and-painless-cwl-workflows/">detailed
article introducing
ccwl</a>
and expounding its merits. ccwl significantly cuts short on the
verbosity of CWL, thus removing one of the barriers to its wider
adoption. ccwl is implemented as a domain specific language embedded
in GNU Guile, and interoperates with GNU Guix to provide
reproducibility. ccwl also aims to minimize frustration for users by
providing strong compile-time error checking and high-quality error
messages. We also plan to pre-package commonly used command-line
scientific tools into ready-made ccwl workflows. Work on these
exciting new features is already underway.</p><h2>Reproducible Software Deployment for Jupyter</h2><p>We <a href="https://hpc.guix.info/blog/2019/10/towards-reproducible-jupyter-notebooks/">announced
Guix-Jupyter</a>
two years ago, with two goals: making notebooks <em>self-contained</em> or
“deployment-aware” so that they automatically deploy the software (and
data!) that they need, and making said deployment <em>bit-reproducible</em>.
Earlier this year, we published version 0.2.2 as a bug-fix release.</p><p><img src="/static/images/blog/guix-jupyter/guix-jupyter.png" alt="Guix-Jupyter logo." /></p><p>Guix-Jupyter is implemented as a Jupyter <em>kernel</em>: it acts as a proxy
between the notebook and the programming language notebook cells are
written in. It interprets annotations found in the notebook to deploy
precisely the right software packages needed to run the notebook. We
believe this is a robust approach to address the Achilles’ heel that
software deployment represents for reproducible computations with
Jupyter.</p><p>Yet, because <a href="https://mybinder.org/">Binder</a> and its associated services
and tools are a popular way to deploy Jupyter notebooks, we wanted to
offer an alternative solution integrated with Binder. Under the hood,
Binder builds upon
<a href="https://repo2docker.readthedocs.io/en/latest/">repo2docker</a>, a tool to
build Docker images straight from source code repositories. Repo2docker
has a number of back-ends called <em>buildpacks</em> to handle packaging
metadata in a variety of formats: when a <code>setup.py</code> file is available,
software is deployed using standard Python tools, the presence of an
<code>install.R</code> file leads to deployment using GNU R, an <code>apt.txt</code> file
instructs it to install software using Debian’s package manager, and so
on.</p><p>As part of a three-month internship at Inria, Hugo Lecomte implemented a
Guix buildpack for repo2docker. If a <code>guix.scm</code> or a <code>manifest.scm</code>
file is found in the source repository, repo2docker uses it to populate
the Docker image being built. Additionally—and this is a significant
difference compared to other buildpacks—software deployed with Guix
can be <em>pinned</em> at a specific revision: if a
<code>channels.scm</code> file is
found, the buildpack passes it to <code>guix time-machine</code>; this ensures that
software is deployed from the exact Guix revision specified in
<code>channels.scm</code>.</p><p>This Guix buildpack for repo2docker has been <a href="https://github.com/jupyterhub/repo2docker/pull/1048">submitted upstream and
reviewed</a>, but as
of this writing it has yet to be merged. We believe it provides another
convenient way for Jupyter Notebook users to ensure their code runs in
the right software environment.</p><h2>Ensuring Source Code Availability</h2><p>Guix lets users re-deploy software environments, for instance <em>via</em>
<a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-time_002dmachine.html"><code>guix time-machine</code></a>.
This is possible because Guix can rebuild software, which, in turn, is
only possible if source code is permanently available. <a href="https://www.softwareheritage.org/2019/04/18/software-heritage-and-gnu-guix-join-forces-to-enable-long-term-reproducibility/">Since
2019</a>
Guix developers have collaborated with Software Heritage (SWH) to make that a
reality. A lot has been achieved since then, but challenges remain before we
can be sure that SWH archives every piece of source code Guix
packages refer to.</p><p>One of the main roadblocks <a href="https://hpc.guix.info/blog/2019/03/connecting-reproducible-deployment-to-a-long-term-source-code-archive/">we identified early
on</a>
is source code archives—
<code>tar.gz</code> and similar files, colloquially known
as “tarballs”. SWH, quite sensibly, stores the <em>contents</em> of these archives,
but it does not store the archives themselves. Yet, most Guix package
definitions refer to tarballs; Guix expects to be able to download those
tarballs and to verify that they match.
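To make this concrete, here is roughly what a package's source looks like in a Guix package definition (a hypothetical sketch with a made-up URL and a placeholder hash, not a real package):

```scheme
;; Hypothetical excerpt of a package definition: Guix downloads the
;; tarball from this URL and checks it against the expected SHA256 hash.
(origin
  (method url-fetch)
  (uri "https://example.org/releases/foo-1.0.tar.gz")    ;made-up URL
  (sha256
   (base32 "0000000000000000000000000000000000000000000000000000")))
```

Note that the hash covers the tarball itself, not its extracted contents.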
How do we deal with this impedance mismatch?</p><p><img src="/static/images/blog/disarchive-swh-diagram.png" alt="Diagram showing Disarchive and Software Heritage." /></p><p>Last year, Guix developer Timothy Sample <a href="https://hpc.guix.info/blog/2021/02/guix-hpc-activity-report-2020/">had just started work to
address
this</a>.
Timothy developed a tool called
<a href="https://ngyro.com/software/disarchive.html">Disarchive</a> that supports
two operations: “disassembling” and “reassembling” tarballs. In the
former case, it extracts tar and compression metadata along with an
identifier (SWHID) pointing to contents available at SWH; in the latter
case, Disarchive assembles content and metadata to <em>recreate</em> the
tarball as it initially existed. From there we create a <em>Disarchive
database</em> that maps cryptographic hashes of tarballs to their metadata.</p><p>This year we deployed, on the Guix build farm, infrastructure to
<a href="https://ci.guix.gnu.org/jobset/disarchive">continuously build the
database</a> and to publish it
at <a href="https://disarchive.guix.gnu.org"><code>disarchive.guix.gnu.org</code></a>. We
added support in Guix so that it can use Disarchive + SWH as a fallback
when downloading a tarball from its original URL fails, significantly
improving source code archival coverage.</p><p>Beyond Guix, this work is crucial for all the deployment tools that rely
on the availability of tarballs—Brew, Gentoo, Nix, Spack, and other
package managers, but also scientific workflow tools such as
<a href="https://maneage.org/">Maneage</a> and individual <code>Dockerfile</code>s and
scripts. This led SWH and the Sloan Foundation to <a href="https://www.softwareheritage.org/2022/01/13/preserving-source-code-archive-files/">allocate a
grant</a>
so that Timothy Sample could address some of the remaining challenges.</p><p>Among those, Timothy has already been able to expand Disarchive
compression support beyond gzip—version 0.4.0 adds support for xz, the
second most popular compression format for tarballs. To have a clear
vision of the progress being made, Timothy has been publishing periodic
<em>Preservation of Guix Reports</em>. The <a href="https://ngyro.com/pog-reports/2022-01-16/">latest
one</a> shows that archival
coverage for all the Guix revisions since version 1.0.0 is at 72%; the
breakdown by revision shows that coverage reaches 86% for recent
commits. Simon Tournier has been carefully monitoring coverage and
discussing with other Guix developers and with the SWH team to identify
reasons why specific pieces of source code would not be archived.
Ludovic Courtès had the pleasure of joining the <a href="https://www.softwareheritage.org/news/events/swh5years/">SWH Fifth Anniversary
event</a>, on
behalf of the Guix team, to show all the progress made and to discuss
the road ahead.</p><h2>Tuning Packages for a CPU</h2><p>GNU Guix is now well known for supporting “reproducibility”, which is
really twofold: it is first the ability to re-deploy the same software
stack on another machine or at a different point in time, and second the
ability to <em>verify</em> that binaries being run match the source code—the
latter is what <a href="https://reproducible-builds.org/docs/definition/">reproducible
builds</a> are concerned
with.</p><p><img src="/static/images/blog/cpu-tuning-poster.png" alt="Illustration of CPU tuning." /></p><p>However, in HPC circles there is the entrenched perception that
reproducibility is antithetical to performance. Practitioners are
especially concerned with the performance of the Message Passing
Interface (MPI) implementations on high-speed network devices, and with
the ability of code to use single-instruction/multiple-data (SIMD)
extensions of the latest CPUs—such as AVX-512 on x86_64, or NEON on ARMv8. We
showed that these concerns are largely unfounded in a <a href="https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/">2018 article on
achieving performance with portable
binaries</a>
and in a <a href="https://hpc.guix.info/blog/2019/12/optimized-and-portable-open-mpi-packaging/">2019 article on
Open MPI</a>.</p><p>The former article showed how performance-sensitive C code is already
taking advantage of <em>function multi-versioning</em> (FMV). There remain
cases, though, where this technique is not applicable. As a result,
GNU/Linux distributions—from Guix to Debian and CentOS—that distribute
binaries built for the <em>baseline</em> x86_64 architecture miss out on SIMD
optimizations. A notorious example of packages that do not support FMV
is C++ header-only libraries, such as the Eigen linear algebra library.</p><p>To address this, <a href="https://hpc.guix.info/blog/2022/01/tuning-packages-for-a-cpu-micro-architecture/">we introduced what we call <em>package
multi-versioning</em></a>:
with the new <code>--tune</code> package transformation option, Guix users can
obtain a package variant specifically tailored for the host CPU. Yet,
users can avoid time-consuming local builds if a pre-built
binary for the same CPU variant is available on-line.</p><p>While building a package with <code>-march=native</code> (instructing the compiler
to optimize for the CPU of the build machine) leaves no trace, the use of
Guix’s <code>--tune</code> is properly recorded in metadata. For example, a Docker
image built with <code>guix pack --tune --save-provenance</code> contains, in its
metadata, the CPU type for which it was tuned, allowing for independent
verification of its binaries. This is to our knowledge the first
implementation of CPU tuning that does not sacrifice reproducibility.</p><h2>Packaging</h2><p>The package collection that comes with Guix keeps growing. It now contains
more than 20,000 curated packages, including many scientific packages ranging from
run-time support software such as implementations of the Message Passing
Interface (MPI), to linear algebra software, to statistics and bioinformatics
modules for R.</p><p>The Julia programming language has been gaining traction in the
scientific community and efforts in Guix reflect that momentum. At the time of the previous report,
February 2021, Guix included a dozen Julia packages. Today, January 2022,
it includes more than 260 Julia packages, from bioinformatics software
such as BioSequences.jl to
machine learning software like Zygote.jl. Under the hood, the Julia build
system in Guix has been
improved; in particular, it now supports both parallel builds and
parallel tests, providing a significant speedup. It also
allows the built-in Julia package manager <code>Pkg</code> to find packages
already installed by Guix.</p><p>In 2021, we added the popular PyTorch machine learning framework to our
package collection. While it had long been available <em>via</em> <code>pip</code>, the
Python package manager, we highlighted <a href="https://hpc.guix.info/blog/2021/09/whats-in-a-package/">in a blog
post</a> things that we as users do not
notice about packages: what is <em>inside</em> of them, and the work behind it.
We showed that the requirements for Guix packages to build software from
source and to avoid bundling external dependencies are key to
transparency, auditability, and provenance tracking—all of which are
ultimately the foundations of reproducible research.</p><p>Many scientific packages were upgraded: the Dune finite element
libraries have been updated to 2.7.1, the <a href="https://hpc.guix.info/package/python-pygmsh">Python bindings to
Gmsh</a> were updated to
7.1.11, <a href="https://hpc.guix.info/packages/petsc">PETSc</a> and related
packages were updated to 3.16.1, to name a few. Run-time support
packages such as MPI libraries also received a number of updates.</p><p>Statistical and bioinformatics packages for the R programming language
have seen regular comprehensive upgrades, closely following updates to
the popular CRAN and Bioconductor repositories. At the time of this
writing Guix provides a collection of more than 1900 reproducibly
built R packages, making R one of the best supported programming
environments in Guix.</p><p>Core packages have seen important changes; in particular, packages are
now built with GCC 10.3 by default (instead of 7.5), using the GNU C
Library version 2.33. The style of package inputs has been
<a href="https://guix.gnu.org/en/blog/2021/the-big-change/">considerably
simplified</a>; together
with the introduction of <a href="https://guix.gnu.org/manual/devel/en/html_node/Invoking-guix-style.html"><code>guix style</code></a>
for automatic formatting, we hope it will make it easier to get started
writing new packages.</p><h2>Supporting POWER9 and RISC-V CPUs</h2><p>In April 2021, Guix <a href="https://guix.gnu.org/en/blog/2021/new-supported-platform-powerpc64le-linux/">gained support for POWER9
CPUs</a>,
a platform that some HPC clusters build upon. While support in Guix—and
in the broader free software stack—is not yet on par with that of
x86_64, it is gradually improving. The project’s build farm now has two
beefy POWER9 build machines.</p><p>While it is perhaps early days to call RISC-V an HPC platform, there are indicators that this may happen in the near future with investments from the <a href="https://www.tomshardware.com/news/risc-v-cluster-demonstrated">USA</a>, the <a href="https://www.european-processor-initiative.eu/epi-epac1-0-risc-v-test-chip-samples-delivered/">EU</a>, India, and China.</p><p>Together with Chris Batten of Cornell and Michael Taylor of the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs responsible for creating a new NSF-funded <a href="https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge">RISC-V supercomputer for pangenomics</a>. It will incorporate GNU Guix and the <a href="https://guix.gnu.org/en/blog/2019/guix-reduces-bootstrap-seed-by-50/">GNU Mes bootstrap</a>, with input from Arun Isaac, Efraim Flashner and others. <a href="https://nlnet.nl/project/current.html">NLNet</a> is also funding the GNU Mes RISC-V bootstrap project with Ekaitz Zarraga and Jan Nieuwenhuizen. We aim to continue adding RISC-V support to GNU Guix at a rapid pace.</p><p>Why is the combination of GNU Mes and GNU Guix exciting for RISC-V? First of all, RISC-V is a very modern modular open hardware architecture that provides further guarantees of transparency and security. It extends reproducibility to the transistor level and for that reason generates interest from the Bitcoin community, for example. Because there are no licensing fees involved, RISC-V is already a major force in IoT and will increasingly penetrate hardware solutions, such as storage microcontrollers and network devices, going all the way to GPU-style parallel computing and many-core solutions with thousands of cores on a single die. 
GNU Mes and GNU Guix are particularly suitable for RISC-V because Guix can optimize generated code for different RISC-V targets and is able to parameterize deployed software packages for included/excluded RISC-V modules.</p><h2>On the way to a reproducible PhD thesis</h2><p>GNU Guix and <a href="https://www.orgmode.org">Org mode</a> form a powerful association
when it comes to setting up a PhD thesis workflow. On one hand, GNU Guix allows
us to ensure an experimental software environment is reproducible across various
high-performance testbeds. On the other hand, we can take advantage of the
literate programming paradigm using Org mode to describe the experimental
environment as well as the experiments themselves, then post-process and reuse
the results in final scientific publications.</p><p>The
<a href="https://mfelsoci.gitlabpages.inria.fr/thesis/">ongoing work of Marek Felšöci</a>
at Inria is an attempt at a <strong>reproducible PhD thesis</strong> relying on the
conjunction of GNU Guix and Org mode. The thesis project resides in a Git
repository where a dedicated Org file describes and explains all of the source
code and procedures involved in the construction of the experimental software
environment, the execution of experiments as well as the gathering and the
post-processing of the results. This includes a Guix channel file, scripts for
running the experiments, parsing the output logs, producing figures and so on.</p><p>Other Org documents of the repository may then build on these results and
produce the final publications, such as research reports, articles and
slideshows, in various formats. As an existing publication example we can cite
the <a href="https://hal.inria.fr/hal-03263603">research report #9412</a> and the
associated <a href="https://hal.inria.fr/hal-03263620">technical report #0513</a> providing
a literate description of the environment and of the experiments that the study
presented in the research report relies on.</p><p><img src="/static/images/blog/reproducible-thesis-structure.svg" alt="Partial structure of the repository containing the thesis" /></p><p>In the end, the entire process of setting up the software environment, running
experiments, post-processing results and publishing documents is automated
using continuous integration.</p><p><img src="/static/images/blog/reproducible-thesis-ci.svg" alt="Simplified continuous integration scheme" /></p><p>The result of the continuous integration is
<a href="https://mfelsoci.gitlabpages.inria.fr/thesis/">publicly available</a> as a
collection of web pages and PDF documents hosted using
<a href="https://docs.gitlab.com/ce/user/project/pages/">GitLab Pages</a>.</p><p>The initiative does not stop here. There is an effort to transform this monolithic
setup into independent modules with the aim of sharing and reusing portions of the
setup in other projects within the research team.</p><h2>Feedback from using Guix to ensure reproducible HPC experiments</h2><p>Philippe Swartvagher (Inria) took the opportunity of writing an article on
the impact of execution tracing on complex HPC
applications to discover how GNU Guix could help perform reproducible
experiments. The article studies the impact of tracing on application
performance, evaluates solutions to reduce this impact, and explores clock
synchronization issues when distributed applications are traced. The paper is
still under review.</p><p>The software stack considered in the article is made of several libraries
(<a href="https://starpu.gitlabpages.inria.fr/">StarPU</a>,
<a href="https://solverstack.gitlabpages.inria.fr/chameleon/">Chameleon</a>,
<a href="https://pm2.gitlabpages.inria.fr/">PM2</a> and
<a href="https://savannah.nongnu.org/projects/fkt">FxT</a>), all of them being already
packaged in GNU Guix, in the Guix-HPC channel. Manually installing this
software stack can be painful, the set of compilation options is wide and
desired options can change from an experience to another, to see their impact.
Correctly compiling the software stack before each experiment and
tracking its current state can be pretty tedious.</p><p>This source of headaches disappears with GNU Guix, especially with the help of
<a href="https://guix.gnu.org/en/manual/en/html_node/Package-Transformation-Options.html">package
transformations</a>.
For instance, <code>--with-input</code> allowed us to use PM2 instead of Open MPI
as the communication engine, <code>--with-commit</code> was handy to select a
specific commit of a library (for instance to compare performance before
and after a specific change), and <code>--with-patch</code> was
convenient to apply code modifications for specific experiments (for
instance modifications not suited to be included upstream, but required for the
experiment). These package transformations, used with <code>guix environment</code> (the
predecessor of <a href="https://guix.gnu.org/en/manual/devel/en/html_node/Invoking-guix-shell.html"><code>guix shell</code></a>),
remove the burden of compiling the correct version of each piece of software before
each experiment.</p><p>This intense use of package transformations exercised some corner cases of
GNU Guix and <a href="https://issues.guix.gnu.org/49697">raised</a>
<a href="https://issues.guix.gnu.org/49696">several</a>
<a href="https://issues.guix.gnu.org/50335">issues</a>.</p><p>To ensure reproducibility of experiments made with GNU Guix, software versions
have to be pinned and saved along with scripts to launch the experiments.
<a href="https://guix.gnu.org/en/manual/en/html_node/Invoking-guix-describe.html"><code>guix describe</code></a>
and <a href="https://guix.gnu.org/en/manual/en/html_node/Invoking-guix-time_002dmachine.html"><code>guix time-machine</code></a>
are the two Guix commands to pin revisions and execute
applications built from these precise revisions. Making the experimental
scripts <a href="https://gitlab.inria.fr/pswartva/paper-starpu-traces-r13y">publicly
available</a> is
another step toward a reproducible article. It requires us to clearly
organize experiments, describe their goals and workings, and ensure maximum
independence from cluster specificities (or document which changes are
necessary to launch the experiments on another cluster). When the repository
describing the experiments is complete, archiving it on Software Heritage and
providing the obtained ID in the paper to easily retrieve the scripts is
effortless.</p><p>This first paper with GNU Guix was a great opportunity to discover the help
provided by GNU Guix, its ecosystem and support. It also showed areas where
documentation can be improved regarding the workflow to ensure reproducibility
of the experiments—from using <code>guix describe</code> to pin versions, to obtaining
an ID to easily cite the scripts in a paper. Moreover, there are still pending
questions about the best way to generalize experimentation scripts and make them
independent from the clusters being used—e.g., how to deal with different
job schedulers and file systems, and how to provide
instructions to replicate experiments even <em>without</em> Guix.</p><h1>Cluster Usage and Deployment</h1><p>At UTHSC, Memphis (USA), we are running an 11-node large-memory <a href="http://genenetwork.org/facilities/">HPC Octopus cluster</a> (264 cores) dedicated to pangenome and genetics research. In 2021 more SSDs and RAM were added. Notably, this cluster is <em>administered by the users themselves</em>. Thanks to GNU Guix we install, run and manage the cluster as researchers (and roll back in case of a mistake). UTHSC IT manages the infrastructure, i.e., physical placement, routers and firewalls, but beyond that there are no demands on IT. Thanks to out-of-band access we can completely (re)install machines remotely. Octopus runs GNU Guix on top of a minimal Debian install and we are experimenting with pure GNU Guix nodes that can be run on demand. LizardFS is used for distributed network storage. Almost all deployed software has been packaged in GNU Guix and can be installed by regular users on the cluster without root access.</p><p>At GLiCID (Nantes, France) we are in the process of merging two existing
HPC clusters (10,000+ cores). The first cluster (based on Slurm +
CentOS) has offered the <code>guix</code> command to our users, as well as some
specific software from our own Guix channel, for a few years now. This
merger involves a lot of change, including identity management. We
wanted to take advantage of this profound change to be more ambitious
and explore automated generation of part of the core infrastructure,
using virtual machines generated by <code>guix system</code>, deployed on
KVM+Ceph. We aim to eventually replace as many of these deployed machines
as possible, adjusting Guix system services and implementing new ones
as we go, benefiting the wider community.</p><h1>Outreach and User Support</h1><h2>Articles</h2><p>The following articles were published in the <a href="https://www.societe-informatique-de-france.fr/bulletin/1024-numero-18/">November 2021 edition of
<em>1024</em></a>,
the magazine of the <em>Société Informatique de France</em> (SIF), the French
computer science society:</p><ul><li>Konrad Hinsen, <a href="https://doi.org/10.48556/SIF.1024.18.11"><em>La Reproductibilité des calculs
coûteux</em></a></li><li>Ludovic Courtès, <a href="https://dx.doi.org/10.48556/SIF.1024.18.15"><em>Reproduire les environnements logiciels : un
maillon incontournable de la recherche
reproductible</em></a></li></ul><p>This article appeared in the <a href="https://connect.ed-diamond.com/GNU-Linux-Magazine/GLMFHS-113">March 2021 special edition of
French-speaking <em>GNU/Linux Magazine
France</em></a>:</p><ul><li>Ludovic Courtès, <a href="https://hal.inria.fr/hal-03418210"><em>Déploiements reproductibles dans le temps avec
GNU Guix</em></a></li></ul><p>The following article introducing the most recent addition to the PiGx
framework of reproducible workflows backed by Guix is awaiting
peer-review and has been submitted to the medRxiv preprint server:</p><ul><li>Vic-Fabienne Schumann et al., <a href="https://doi.org/10.1101/2021.11.30.21266952"><em>COVID-19 infection dynamics
revealed by SARS-CoV-2 wastewater sequencing analysis and
deconvolution</em></a></li></ul><h2>Talks</h2><p>Since last year, we gave the following talks at the following venues:</p><ul><li><a href="https://jcad2021.sciencesconf.org/resource/page/id/8">JCAD conference,
Dec. 2021</a>
(Ludovic Courtès)</li><li><a href="https://events.unesco.org/event?id=1423818652&lang=1033">Software Heritage Fifth Anniversary, joint event with
UNESCO, Nov. 2021</a>
(Ludovic Courtès)</li><li><a href="https://trex-coe.eu/events/trex-build-system-hackathon-8-12-nov-2021">TREX Build System Hackathon,
Nov. 2021</a>
(Ludovic Courtès)</li><li><a href="https://packaging-con.org/">PackagingCon, Nov. 2021</a> (Ludovic Courtès)</li><li><a href="https://reproducibility.gricad-pages.univ-grenoble-alpes.fr/web/medias_251121.html#medias_251121">“<em>Pour une recherche reproductible</em>”, MAiMoSIne, SARI, GRICAD,
Nov. 2021</a>
(P.-A. Bouttier)</li><li><a href="https://www.rd-alliance.org/plenaries/rda-18th-plenary-meeting-virtual/software-source-code-and-reproducibility">RDA 18th plenary, Software Source Code, Nov. 2021</a> (P.-A. Bouttier)</li><li><a href="https://datascience.cancer.gov/news-events/events/reproducible-fair-workflows-and-ccwl">Reproducible FAIR+ Workflows and the CCWL, at the US NIH National
Cancer Institute in
Oct. 2021</a>
(Pjotr Prins, Arun Isaac)</li><li><a href="https://www.societe-informatique-de-france.fr/journee-reproductibilite/">special event on reproducibility of the <em>Société Informatique de
France</em> (French computer science society), May
2021</a>
(Konrad Hinsen, Ludovic Courtès)</li></ul><p>We also organised the following events:</p><ul><li>the first <a href="https://hpc.guix.info/events/2021/atelier-reproductibilité-environnements/">on-line workshop on the reproducibility of software
environments</a>
for French-speaking scientists, engineers, and system
administrators, on May 17–18th, 2021 with up to 80 participants.</li><li><a href="https://archive.fosdem.org/2021/schedule/track/declarative_and_minimalistic_computing/">“Declarative and minimalistic computing”
track</a>
at FOSDEM</li></ul><h2>Training Sessions</h2><p>A training session on computational reproducibility for high-energy
physics took place at the Centre de Physique des Particules de
Marseille in April/May 2021. It included a hands-on session about Guix.</p><p>For the French HPC Guix community, we have set up a monthly on-line
event called <a href="https://hpc.guix.info/events/2021/café-guix/">“Café
Guix”</a>, started in
October 2021. Each month, a user or developer informally presents a
Guix feature or workflow and answers questions.</p><h1>Personnel</h1><p>GNU Guix is a collaborative effort, receiving contributions from more
than 90 people every month—a 50% increase compared to last year. As
part of Guix-HPC, participating institutions have dedicated work hours
to the project, which we summarize here.</p><ul><li>Inria: 2 person-years (Ludovic Courtès and the contributors to the
Guix-HPC channel: Emmanuel Agullo, Marek Felšöci, Nathalie Furmento,
Hugo Lecomte, Gilles Marait, Florent Pruvost, Matthieu Simonin,
Philippe Swartvagher)</li><li>Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC): 2 person-years
(Ricardo Wurmus and Mădălin Ionel Patrașcu)</li><li>University of Tennessee Health Science Center (UTHSC): 3+ person-years (Efraim Flashner, Bonface Munyoki, Fred Muriithi, Arun Isaac, Jorge Gomez, Erik Garrison and Pjotr Prins)</li><li>Utrecht Bioinformatics Center (UBC): 0.1 person-year (Roel Janssen)</li><li>University of Paris (Diderot): 0.5 person-year (Simon Tournier)</li></ul><h1>Perspectives</h1><p>Guix availability on scientific computing clusters remains a top priority.
More HPC practitioners—researchers, engineers, and system administrators—are
adopting Guix and showing interest, from reproducible research to flexible
deployment of virtual machines. We expect to continue to work on these two
complementary fronts: streamlining the use of reproducible packs, and reaching
out to system administrators and cluster users, notably through training
sessions.</p><p>Upstream, we will continue to work with Software Heritage with the goal of
achieving complete archive coverage of the source code Guix refers to. We have
identified challenges related to source code availability; this will probably
be one of the main efforts in this area for the coming year.</p><p>Downstream, a lot of work has happened in the area of reproducible research
tools. Our package collection has grown to include more and more
scientific tools. Tools like the Guix Workflow Language and Guix-Jupyter have
matured; along with the <a href="https://www.psychnotebook.org/">PsychNotebook
service</a>, they bridge the gap between
reproducible software deployment and reproducible scientific tools and
workflows. We also showed how to achieve high performance while
preserving provenance tracking, which we hope dispels the entrenched perception in HPC circles
that reproducibility and performance are antithetical.</p><p>Our work happens in a context of growing awareness of the importance of
software and software environments in research workflows. <a href="https://en.unesco.org/science-sustainable-future/open-science/recommendation">UNESCO’s
Recommendation on Open
Science</a>
and, for example, the <a href="https://www.ouvrirlascience.fr/second-national-plan-for-open-science/">Second French Plan for Open
Science</a>
are two illustrations of that.</p><p>We gave demonstrations of what Guix brings to scientific workflows
and we expect to continue to show that reproducible
scientific workflows are <em>indeed</em> a possibility.
Working on the tools and workflows directly in the hands of scientists will be
a major focus of the coming year. We want to contribute to raising the bar of
what scientists come to expect in terms of reproducible workflows.</p><p>There’s a lot we can do and we’d love to <a href="https://hpc.guix.info/about">hear your ideas</a>!</p><h1>ccwl for concise and painless CWL workflows</h1><p><em>Arun Isaac, guix-devel@gnu.org, 2022-01-10</em></p><p>In modern science, analysis is required to process data. When the data-flow
is linear, such a process is easily represented by tools such as the standard
<a href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">Unix pipeline</a>. However, this
data-flow is often modeled by a <a href="https://en.wikipedia.org/wiki/Directed_graph">directed
graph</a>: each processing node may
have one or more inputs and the outputs may be directed to different
processing nodes. This directed graph, used in many fields, including
bioinformatics, medical imaging, and astronomy, is called a
<a href="https://en.wikipedia.org/wiki/Bioinformatics_workflow_management_system"><em>workflow</em></a>.</p><p>The <a href="https://www.commonwl.org/">Common Workflow Language (CWL)</a> is a
specification to describe computational workflows that makes them easy to
reproduce and to port to different hardware and software environments. But, why
do we need workflow languages such as CWL? Why will a simple shell script or
a Makefile not suffice?</p><h1>Why not shell scripts?</h1><h2>Housekeeping tasks</h2><p>With shell scripts, you need to not only code the actual command invocations
but also add a lot of boilerplate to perform housekeeping tasks such as
managing intermediate inputs/outputs. This makes the script hard to read and
the logic of the pipeline less obvious. Even with a Makefile, the programmer
needs to explicitly handle cleanup tasks, typically with a <code>clean</code> target.</p><p>Workflow languages allow the programmer to focus only on the actual command
invocations—the essence—of the workflow and let the workflow language deal
with the housekeeping tasks. For instance, CWL automatically deals with input
and output files produced by a command, and ensures that only the necessary
intermediate files are exposed to the next command.</p><p>When there is an error in a step, shell scripts usually leave the user with
arcane error messages, or worse, mindlessly march on as though nothing went
wrong. But workflow languages can clearly indicate which step failed.</p><h2>Portability to different software and hardware environments</h2><p>Workflows often need to be deployed to different software and hardware
environments—to a cluster, to containers in the cloud, etc. When a shell
script workflow needs to be deployed in a new environment, it will most likely
need to be tweaked a little. Even Makefiles invoke commands using a shell,
and thus suffer from the same portability issues. Workflow languages, on the
other hand, aim to handle this transparently. This leads to higher confidence
in the workflow, and allows a wide community to reproduce and deploy the
workflow easily.</p><h2>Data types, type conversion and static type checking</h2><p>For better or for worse, due to historical reasons, shells (and by extension,
Makefiles) revolve around only a single data type—the string. For instance,
all command line arguments passed into a shell script, or indeed any other
command, are strings. Some of these strings really are just text, but
often they represent numbers, names of files, etc. It is up to the
programmer to convert these string arguments to suitable types, and deal with
any errors that may arise in that conversion.</p><p>Workflow languages can handle this type conversion automatically. For
example, they can ensure arguments representing numbers indeed contain only
digits, or that there indeed exist files whose names are mentioned in the
arguments. And some workflow languages such as CWL,
<a href="https://github.com/tweag/funflow">funflow</a> and
<a href="https://github.com/pveber/bistro">bistro</a> even have static typing so that
typing errors can be detected at compile-time, instead of at run-time.</p><h2>Human-readable and machine-readable</h2><p>And finally, workflow languages need to be easy not just for a human to read
and write, but also for machines to inspect. For instance, it should be
tractable for a computer to read a workflow and generate a graphical
visualization of the steps to be executed and the dependencies between those
steps. This is where CWL stands out. Another way to understand this is that
it is possible to automatically convert a CWL workflow into a shell script,
but not the other way around. In this regard, Makefiles are a little better
than shell scripts. But, with their many complex features to ease
human-writability, Makefiles sacrifice machine-readability.</p><h1>So, what's wrong with CWL?</h1><p>So, CWL has all these nice properties. Why do we need anything else?</p><h2>Limitations of YAML</h2><p>CWL is, in effect, a special purpose programming language built into YAML
syntax. CWL is fundamentally limited by this constraint, and often has
verbose constructs to express relatively simple ideas. For example, there are
at least three different fields that together build up the command to be
executed!</p><h2>Too many files</h2><p>Even simple workflows have to be spread out over multiple files. Each command
or step in the workflow needs its own CWL file. And all these individual
commands need to be wired up together in another CWL file that specifies the
overall workflow. Human short-term memory is limited, and if one has to
juggle several files and associated tabs/buffers, the overhead is often
too much.</p><h1>Why ccwl?</h1><p>What if instead of manually writing a CWL workflow, we could treat CWL as a
compilation target and auto-generate it? We would then be free to use a more
human-friendly frontend language without losing any of the machine-readability
of CWL. This is exactly what <a href="https://ccwl.systemreboot.net/">ccwl, the Concise Common Workflow
Language,</a> does.</p><p>ccwl is a domain-specific language embedded into <a href="https://www.gnu.org/software/guile/">GNU
Guile</a>, a Scheme implementation. Lisp
dialects such as Scheme are programmable programming languages and among the
few that allow you to directly hack the compiler. As such, Scheme is extremely
well suited to embedding domain-specific languages.</p><p>To the uninitiated, writing in a lisp may seem less <em>human-friendly</em> than
writing in YAML. But, if you try it, you might like it so much that you'll
never want to write anything else! And, if you're not convinced, there's
always <a href="https://www.draketo.de/software/wisp">wisp</a>, a Python-like
whitespace-significant syntax for GNU Guile. In fact, this is what the <a href="https://guixwl.org/">Guix
Workflow Language (GWL)</a>, another excellent workflow
language written in GNU Guile, favors.</p><h2>Human-readable and writable</h2><p>For the user, ccwl aims to be as easy to write as a shell script, or at least
a Makefile. But, by compiling to CWL, ccwl preserves all the benefits of CWL.</p><h2>Compile-time error checking</h2><p>Detecting errors as early as possible, preferably at compile time,
significantly improves the user experience. There is nothing more frustrating
than running a long workflow for several hours, only to have it error out
midway and be forced to restart all over again without knowing for sure if
it will succeed this time. ccwl, by virtue of the very hackable Scheme
compiler that it is built on, aims to provide excellent compile-time error
checking along with source references. ccwl isn't quite there yet, but
hopefully will be in the coming releases.</p><h2>Interface with external CWL workflows</h2><p>Not everybody might convert to ccwl. And often, it will be necessary to reuse
CWL workflows written by others. ccwl is pragmatic and allows calling
external CWL workflows as part of a larger ccwl workflow. If CWL grows to
become a common compilation target for many different workflow languages, this
feature could enable seamless collaboration between communities.</p><h2>Pre-packaged commands</h2><p>In the future, ccwl might also provide pre-packaged ccwl commands for
commonly used tools in bioinformatics, astronomy, etc. so that the
user is freed from having to write these wrappers and can instead
focus on writing only the workflow.</p><h2>Reproducibility with GNU Guix</h2><p>ccwl leaves all the hard work of reproducibility in Guix's capable hands. CWL
(and, by consequence, ccwl) is agnostic to deployment. As long as a tool can
be found in PATH, it does not care how that tool got there. This
means we can offload all reproducibility responsibilities to Guix. We could
simply fire up a Guix shell with the required packages in the environment, and
run our workflow from within that environment. If we pin the Guix commit
we are running from, we can perfectly reproduce our workflow.</p><pre><code>$ guix shell ccwl cwltool package1 package2 ...
[env]$ ccwl compile workflow.scm > workflow.cwl
[env]$ cwltool workflow.cwl</code></pre><p>In contrast, the <a href="https://guixwl.org/">Guix Workflow Language (GWL)</a> uses Guix
internally to prepare a reproducible environment. It is thus deployment-aware
and tied to Guix.</p><h2>A taste of ccwl</h2><p>This article is not a ccwl tutorial. So, we will stop short of describing how
to write your own ccwl workflows. But, just to provide a taste for the
syntax, here is an example spell check workflow from the ccwl manual, followed
by a graphical visualization of it.</p><pre><code class="language-scheme">(define split-words
(command #:inputs text
#:run "tr" "--complement" "--squeeze-repeats" "A-Za-z" "\\n"
#:stdin text
#:outputs (words #:type stdout)))
(define downcase
(command #:inputs words
#:run "tr" "A-Z" "a-z"
#:stdin words
#:outputs (downcased-words #:type stdout)))
(define sort
(command #:inputs words
#:run "sort" "--unique"
#:stdin words
#:outputs (sorted #:type stdout)))
(define find-misspellings
(command #:inputs words dictionary
#:run "comm" "-23" words dictionary
#:outputs (misspellings #:type stdout)))
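;; The `workflow' form below wires the commands above into a graph:
;; `pipe' chains steps sequentially, `tee' runs branches in parallel,
;; and `rename' gives an intermediate output a new name for later
;; steps.  (sort-words) and (sort-dictionary) label two separate
;; instances of the same `sort' command.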
(workflow (text-file dictionary)
(pipe (tee (pipe (split-words #:text text-file)
(downcase #:words words)
(sort (sort-words) #:words downcased-words)
(rename #:sorted-words sorted))
(pipe (sort (sort-dictionary) #:words dictionary)
(rename #:sorted-dictionary sorted)))
(find-misspellings #:words sorted-words
#:dictionary sorted-dictionary)))</code></pre><p><img src="/static/images/blog/spell-check.svg" alt="Spell-check workflow visualized as a graph" /></p><h1>Contact</h1><p>ccwl development happens <a href="https://github.com/arunisaac/ccwl">on GitHub</a>.
Please do drop by to raise issues and offer suggestions. You may also peruse
the <a href="https://ccwl.systemreboot.net/manual/dev/en/">ccwl manual</a> for a detailed
introduction to ccwl. Thank you!</p><h1>Tuning packages for a CPU micro-architecture</h1><p><em>Ludovic Courtès, guix-devel@gnu.org, 2022-01-06</em></p><p>It should come as no surprise that the execution speed of programs is a
primary concern in high-performance computing (HPC). Many HPC
practitioners would tell you that among their top concerns are the
performance of high-speed networks used by the Message Passing Interface
(MPI) and the use of the latest vectorization extensions of modern CPUs.</p><p>This post focuses on the latter: tuning code for specific CPU
micro-architectures, to reap the benefits of modern CPUs, with the
introduction of a new tuning option in Guix. But first, let us consider
this central question in the HPC and scientific community: can
“reproducibility” be achieved <em>without</em> sacrificing performance? Our
answer is a resounding “yes”, but that deserves clarifications.</p><h1>Reproducibility & high performance</h1><p>The author remembers advice heard at the beginning of their
career in HPC—advice still given today:
that to get optimal MPI performance, you would have
to use the vendor-provided MPI library; that to get your code to perform
well on this new cluster, you would have to recompile the complete software
stack locally; that using generic, pre-built binaries from a GNU/Linux
distribution just won’t give you good performance.</p><p>From a software engineering viewpoint, this looks like a sad situation
and an inefficient approach, dismissing the benefits of automated
software deployment as pioneered by Debian, Red Hat, and others in the
90’s or, more recently, as popularized with container images. It also means doing away
with reproducibility, where “reproducibility” is to be understood in two
different ways: first as the ability to re-deploy the same software
stack on another machine or at a different point in time, and second as
the ability to <em>verify</em> that binaries being run match the source
code—the latter is what <a href="https://reproducible-builds.org/docs/definition/">reproducible
builds</a> are concerned
with.</p><p>But does it really have to be this way? Engineering efforts to support
<em>performance portability</em> suggest otherwise. We saw earlier that an MPI
implementation like Open MPI, today, <a href="https://hpc.guix.info/blog/2019/12/optimized-and-portable-open-mpi-packaging/">does achieve performance
portability</a>—that
it takes advantage of the high-speed networking hardware at run-time
without requiring recompilation.</p><p>Likewise, <a href="https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/">in a 2018
article</a>,
we looked at how generic, pre-built binaries can and indeed often do
take advantage of modern CPUs by selecting at run-time the most
efficient implementation of performance-sensitive routines for the host
CPU. The article also highlighted cases where this is <em>not</em> the case;
these are those we will focus on here.</p><h1>The jungle of SIMD extensions</h1><p>While major CPU architectures such as x86_64, AArch64, and POWER9 were
defined years ago, CPU vendors regularly extend them. Extensions that
matter most in HPC are vector extensions: <a href="https://en.wikipedia.org/wiki/SIMD">single instruction/multiple
data</a> instructions and registers.
In this area, a <em>lot</em> has happened on x86_64 CPUs since the baseline
instruction set architecture (ISA) was defined. As shown in the diagram
below, Intel and AMD have been tacking ever more powerful SIMD
extensions to their CPUs over the years, from
<a href="https://en.wikipedia.org/wiki/SSE3">SSE3</a> to
<a href="https://en.wikipedia.org/wiki/AVX-512">AVX-512</a>, leading to a wealth of
CPU “micro-architectures”.</p><p><img src="/static/images/blog/cpu-simd-extensions.png" alt="Overview of x86_64 SIMD extensions" /></p><p>This gives a high-level view, but just looking at generations of Intel
processors by their code name shows an already more complicated story:</p><p><img src="/static/images/blog/cpu-intel-families.png" alt="Overview of Intel CPU families." /></p><p>Linear algebra routines that scientific software relies on greatly
benefit from SIMD extensions. For example, on a modest Intel Core i7
processor (of the Skylake generation, which supports AVX2), the
AVX2-optimized version of the dense matrix multiplication routines of
<a href="https://eigen.tuxfamily.org">Eigen</a>, built with GCC 10.3, peaks at
≅40 Gflops/s, compared to ≅11 Gflops/s for its baseline x86_64
version—four times faster!</p><h1>When function multi-versioning isn’t enough</h1><p>In our <a href="https://hpc.guix.info/blog/2018/01/pre-built-binaries-vs-performance/">2018
post</a>,
we contemplated <em>function multi-versioning</em> (FMV) as the solution to
performance portability: the implementation provides multiple versions
of “hot” routines, one for each relevant CPU micro-architecture, and
picks the best one for the host CPU at run time. Many pieces of
performance-critical software already use this technique; software that
doesn’t do that yet can easily do so thanks to compiler toolchain
support.</p><p>To make the case for FMV, we wanted to see what it would take us to
actually add FMV support to code that would benefit from it. In the
spirit of the <a href="https://github.com/clearlinux/make-fmv-patch">Clear Linux automatic FMV patch
generator</a>, we wrote an
<a href="https://gitlab.inria.fr/guix-hpc/function-multi-versioning">automatic FMV tool for
Guix</a>: you
would give it a package name, and it would:</p><ol><li><p>Build the package with the <code>-fopt-info-vec</code> compiler flag to gather
information about vectorization opportunities and their source code
location.</p></li><li><p>Generate a patch that, for each C function with vectorization
opportunities, adds the <a href="https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute"><code>target_clones</code>
attribute</a>
to generate a couple of vectorized versions—generic, AVX2, and
AVX-512.</p></li><li><p>Build the package with this FMV patch.</p></li></ol><p>The tool can successfully FMV-patch a variety of packages written in C,
such as the <a href="https://www.gnu.org/software/gsl">GNU Scientific Library</a>,
which contains plain sequential implementations of a variety of math
routines. It was an exciting engineering experiment… but we found it to
be all too often inapplicable, for two reasons: performance-critical
software already does FMV, or it’s not written in C.</p><p>We realized there’s a common pattern where FMV isn’t applicable, or at
least isn’t applied: C++ header-only libraries. There’s no shortage of
C++ header-only math libraries providing hand-optimized SIMD versions of
their routines or otherwise supporting SIMD programming:
<a href="https://eigen.tuxfamily.org">Eigen</a>,
<a href="https://github.com/aff3ct/MIPP">MIPP</a>,
<a href="https://github.com/QuantStack/xsimd">xsimd</a> and
<a href="https://xtensor.readthedocs.io/en/latest/">xtensor</a>, <a href="https://github.com/simd-everywhere/simde">SIMD Everywhere
(SIMDe)</a>,
<a href="https://github.com/google/highway">Highway</a>, and many more (C++
meta-programming for SIMD appears to be an attractive engineering
effort). All these, except Highway, have in common that they do <em>not</em>
support FMV and run-time implementation selection. Since they “just”
provide headers, it is up to <em>each</em> package using them to figure out
what to do in terms of performance portability.</p><p>In practice though, software using these C++ header-only libraries
rarely makes provisions for performance portability. Thus, when compiling those
packages for the baseline ISA, one misses out on all the vectorized
implementations <a href="https://gitlab.com/libeigen/eigen/-/tree/master/Eigen/src/Core/arch">that libraries like Eigen
provide</a>.
This is a known issue <a href="https://gitlab.com/libeigen/eigen/-/issues/2344">in search of a
solution</a>. It is a bit
of a problem considering, for instance, the sheer number of packages
depending on Eigen:</p><p><img src="/static/images/blog/eigen-dependents.svg" alt="Graph showing packages that directly depend on Eigen." /></p><p>Fundamentally, run-time dispatch is at odds with the all-compile-time
approach that header-only C++ template libraries are about.
Furthermore, Eigen, for example, supports fine-grain vectorization; it
may be used to operate on small matrices, as is common in computer
graphics, and in that case inlining matrix operations is key to good
performance—run-time dispatch would have to be done at a higher level.</p><h1>Package multi-versioning</h1><p>With our packaging hammer, one could envision a solution to these
problems: if we cannot do function multi-versioning, what about
implementing <em>package</em> multi-versioning? Guix makes it easy to <a href="https://guix.gnu.org/manual/devel/en/html_node/Defining-Package-Variants.html">define
package
variants</a>,
so we can define package variants optimized for a specific CPU—compiled
with
<a href="https://gcc.gnu.org/onlinedocs/gcc-11.2.0/gcc/x86-Options.html"><code>-march=skylake</code></a>,
for instance. What we need is to define those variants “on the fly”.</p><p>The new <a href="https://guix.gnu.org/manual/devel/en/html_node/Package-Transformation-Options.html"><code>--tune</code> package transformation
option</a>,
which landed in Guix <code>master</code> a week ago, works along those lines.
Users can pass <code>--tune</code> to any of the command-line tools (<code>guix install</code>, <code>guix shell</code>, etc.) and that causes “tunable” packages to be
optimized for the host CPU. For example, here is how you would run
Eigen’s <a href="https://gitlab.com/libeigen/eigen/-/blob/4716040703be1ee906439385d20475dcddad5ce3/bench/benchBlasGemm.cpp">matrix multiplication
benchmark</a>
from the
<a href="https://hpc.guix.info/package/eigen-benchmarks"><code>eigen-benchmarks</code></a>
package, both with and without micro-architecture tuning:</p><pre><code>$ guix shell eigen-benchmarks -- \
benchBlasGemm 240 240 240
240 x 240 x 240
cblas: 0.239963 (13.826 GFlops/s)
eigen : 0.267135 (12.419 GFlops/s)
l1: 32768
l2: 262144
$ guix shell --tune eigen-benchmarks -- \
benchBlasGemm 240 240 240
guix shell: tuning eigen-benchmarks@3.3.8 for CPU skylake
240 x 240 x 240
cblas: 0.208547 (15.908 GFlops/s)
eigen : 0.0720303 (46.06 GFlops/s)
l1: 32768
l2: 262144</code></pre><p>There are several things happening behind the scenes. First, <code>--tune</code>
determines the name of the host CPU as recognized by GCC’s (and Clang’s)
<code>-march</code> option; it does that using
<a href="https://git.savannah.gnu.org/cgit/guix.git/tree/guix/cpu.scm?id=92faad0adb93b8349bfd7c67911d3d95f0505eb2#n83">code</a>
inspired by that <a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/driver-i386.c;h=f844a168ddb6c064f51a559745bda39a56d2657e;hb=7ca388565af176bd4efd4f8db1e5e9e11e98ef45#l372">used by GCC’s
<code>-march=native</code></a>,
though it’s currently limited to x86_64.</p><p>Users can also override auto-detection by passing a CPU name—e.g.,
<code>--tune=skylake-avx512</code>. However, the set of recognized CPU names varies
between GCC 11 and GCC 10, between GCC and Clang, and so on; passing the
wrong name to <code>-march</code> could result in obscure compilation errors. To
handle that gracefully, we instead <a href="https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/gcc.scm#n528?id=92faad0adb93b8349bfd7c67911d3d95f0505eb2">add metadata to the compiler
packages</a>
in Guix that lists the CPU names they know. This allows <code>--tune</code> to
emit a meaningful error when a CPU name unknown to the compiler is
given:</p><pre><code>$ guix install eigen-benchmarks --tune=x86-64-v4
guix install: tuning eigen-benchmarks@3.3.8 for CPU x86-64-v4
The following package will be installed:
eigen-benchmarks 3.3.8
guix install: error: compiler gcc@10.3.0 does not support micro-architecture x86-64-v4</code></pre><p>As mentioned earlier, we made the conscious choice of letting <code>--tune</code>
operate solely on packages explicitly marked as “tunable”, which
packagers can do along these lines:</p><pre><code class="language-scheme">(define-public eigen-benchmarks
(package
(name "eigen-benchmarks")
;; …
(properties '((tunable? . #true)))))</code></pre><p>This is to ensure Guix does not end up rebuilding packages that could
not possibly benefit from micro-architecture-specific optimizations,
which would be a waste of resources.
(For the same reason, we rejected the idea of defining separate system
types for the various x86_64 CPU micro-architectures <a href="https://discourse.nixos.org/t/nix-2-4-released/15822#other-features-2">the way Nix 2.4
did</a>.)</p><p>In the spirit of avoiding needless package rebuilds, <code>--tune</code> leverages
the <a href="https://guix.gnu.org/manual/en/html_node/Security-Updates.html">“graft”
mechanism</a>:
package variants are <em>grafted</em> to the dependency graph, such that
dependents of a tuned package do not need to be rebuilt. To illustrate
that, consider the figure below:</p><p><img src="/static/images/blog/cpu-tuning-graft.png" alt="Dependency graph of OpenCV, where the tuned variant of VTK is grafted." /></p><p>OpenCV depends on VTK, which depends on Eigen, as shown by the dotted
arrows. VTK is marked as tunable so it can benefit from SIMD
optimizations in Eigen. When <code>--tune</code> is passed, the optimized variant
of VTK built with <code>-march=skylake</code> is generated and grafted onto the
dependency graph, such that OpenCV itself does not need to be recompiled
and instead is relinked against the optimized VTK variant.</p><p>Importantly, this implementation of package multi-versioning does
not sacrifice reproducibility. When <code>--tune</code> is used, from Guix’s
viewpoint, it is just an alternate, but well-defined dependency graph
that gets built. Guix records package transformation options that were
used so it can “replay” them, for example by exporting a faithful
manifest:</p><pre><code>$ guix shell eigen-benchmarks --tune
guix shell: tuning eigen-benchmarks@3.3.8 for CPU skylake
[env]$ guix package --export-manifest -p $GUIX_ENVIRONMENT
;; This "manifest" file can be passed to 'guix package -m' to reproduce
;; the content of your profile. This is "symbolic": it only specifies
;; package names. To reproduce the exact same profile, you also need to
;; capture the channels being used, as returned by "guix describe".
;; See the "Replicating Guix" section in the manual.
(use-modules (guix transformations))
(define transform1
(options->transformation '((tune . "skylake"))))
(packages->manifest
(list (transform1
(specification->package "eigen-benchmarks"))))</code></pre><p>The dependency graph resulting from tuning is recorded and can be
replayed—much unlike stealthily passing <code>-march=native</code> during a build.
Like other transformation options, <code>--tune</code> is accepted by all the
commands, so you could just as well build a Singularity image tuned for
a particular CPU:</p><pre><code>guix pack -f squashfs -S /bin=bin \
eigen-benchmarks bash --tune</code></pre><p>This comes in handy if you want to prepare an image to run on another
cluster, where you know you can rely on a given CPU extension.</p><p>The Guix build farm is set up <a href="https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/ci.scm?id=92faad0adb93b8349bfd7c67911d3d95f0505eb2#n414">to build a few optimized package
variants</a>.
That way, users of <code>--tune</code> are likely to get substitutes (pre-built
binaries) even for the optimized variants, making deployment just as
fast as with non-tuned packages. To achieve this, <code>--tune</code> skips
running test suites when building packages: we cannot be sure that build
machines implement the CPU micro-architecture at hand.</p><h1>Conclusion and outlook</h1><p>We implemented what we call “package multi-versioning” for C/C++ software that
lacks function multi-versioning and run-time dispatch, a notable example
of which is optimized C++ header-only libraries. The new <code>--tune</code>
option is just one <code>guix pull</code> away; users and packagers can already
take advantage of it. It is another way to ensure that users do not
have to trade reproducibility for performance.</p><p>The scientific programming landscape has been evolving over the last few
years. It is encouraging to see that <a href="https://julialang.org">Julia</a>
<a href="https://docs.julialang.org/en/v1/devdocs/sysimg/">offers function multi-versioning for its “system
image”</a>, and that,
similarly, Rust supports it <a href="https://docs.rs/multiversion/0.6.1/multiversion/">with annotations similar to GCC’s
<code>target_clones</code></a>.
Hopefully these new development environments will support performance
portability well enough that users and packagers will not need to
worry about it.</p><blockquote><p><em>Illustrations were taken from a
<a href="https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/jcad-2021/talk.20211214.pdf">talk</a>
given at <a href="https://jcad2021.sciencesconf.org/resource/page/id/8">JCAD
2021</a>.</em></p></blockquote><h1>Acknowledgments</h1><p>Thanks to Ricardo Wurmus for insightful comments and suggestions on an
earlier draft of this article.</p><h1>When Docker images become fixed-point</h1><p><em>Simon Tournier, guix-devel@gnu.org, 2021-10-22</em></p><p>We like to say that Docker images are like
<a href="https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/in2p3-2019/images/smoothie.pdf">smoothies</a>:
you can immediately tell whether it’s to your liking, but you can hardly
guess what the ingredients are. Although containers are an efficient way to <em>ship</em>
things, the core question is how these things are produced.</p><p>The aim of this post is to demonstrate that the issue is not Docker
images by themselves. Instead, the concrete question when talking about
reproducibility is: where do binaries come from, and using which tool?</p><p>The scenario below illustrates how one can ship reproducible <em>and
verifiable</em> Docker images built by <code>guix pack</code>. It had initially been
written as a comment while reviewing
<a href="http://issues.guix.gnu.org/45919#10">patch #45919</a>.</p><h2>Alice generates a Docker image</h2><p>Alice is working on a standard scientific stack using Python.
She stores alongside her project the files <code>manifest.scm</code> containing the
package set and <code>channels.scm</code> containing the state of Guix (in other words,
its revision). With these two files, one can redeploy using
<a href="https://guix.gnu.org/manual/devel/en/guix.html#Invoking-guix-time_002dmachine"><code>guix time-machine</code></a>
the exact same computational environment.</p><p>Concretely, <code>manifest.scm</code> reads:</p><pre><code class="language-scheme">(specifications->manifest
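  ;; Each entry is a package specification; a version can be pinned
  ;; as in "python@3.8" (an illustrative example, not part of the
  ;; original manifest).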
(list
"python"
"python-numpy"))</code></pre><p>Alice produces the <code>channels.scm</code> file by running <a href="https://guix.gnu.org/manual/devel/en/guix.html#Invoking-guix-describe"><code>guix describe -f channels</code></a>,
which returns this:</p><pre><code class="language-scheme">(list (channel
(name 'guix)
(url "https://git.savannah.gnu.org/git/guix.git")
(commit
"fb32a38db1d3a6d9bc970e14df5be95e59a8ab02")
(introduction
(make-channel-introduction
"9edb3f66fd807b096b48283debdcddccfea34bad"
(openpgp-fingerprint
"BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA")))))</code></pre><p>So far, so good. Because Alice needs to run this stack on some infrastructure
not running Guix but instead running Docker, she just
<a href="https://guix.gnu.org/manual/devel/en/guix.html#Invoking-guix-pack">packs</a> her
scientific stack with this command:</p><pre><code>guix pack -f docker --save-provenance -m manifest.scm</code></pre><p>For the next step, one option is to locally load the generated
tarball using Docker tools, like so:</p><pre><code>$ docker load < /gnu/store/6rga6pz60di21mn37y5v3lvrwxfvzcz9-python-python-numpy-docker-pack.tar.gz
Loaded image: python-python-numpy:latest
$ docker images
REPOSITORY            TAG      IMAGE ID       CREATED        SIZE
python-python-numpy   latest   ea2d5e62b2d2   51 years ago   431MB</code></pre><p>… then running <code>docker push</code> to upload the image to a registry. (The “51 years ago” creation date is deliberate: Guix resets all timestamps to the Unix epoch so that image builds are reproducible.)</p><p>The second option is to transfer the image to the target computer, and
to run over there the Docker commands shown above. Once the image has
been loaded on the target machine, running Python from that image <em>just
works</em>:</p><pre><code>$ docker run -ti python-python-numpy:latest python3
Python 3.8.2 (default, Jan 1 1970, 00:00:01)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
import numpy as np
>>> A = np.array([[1,0,1],[0,1,0],[0,0,1]])
A = np.array([[1,0,1],[0,1,0],[0,0,1]])
>>> _, s, _ = np.linalg.svd(A); s; abs(s[0] - 1./s[2])
_, s, _ = np.linalg.svd(A); s; abs(s[0] - 1./s[2])
array([1.61803399, 1. , 0.61803399])
0.0
>>> quit()</code></pre><p>Neat!</p><p>On a side note, the Docker image is produced directly by Guix. That is,
Guix manages everything, from the binary packages and all the requirements to
the Docker image itself — no <code>Dockerfile</code> involved. To <code>guix pack</code>,
Docker images are one container format among others; for instance <code>guix pack -f squashfs --save-provenance -m manifest.scm</code> generates a
<a href="https://singularity.hpcng.org/">Singularity</a> image (another container format)
with the exact same binaries inside.</p><h2>Bob retrieves and runs code from Alice’s image</h2><p>Bob works with Alice's Docker image. He needs to run these exact same
versions on another machine using plain relocatable tarballs, for
example. Or he needs to scrutinize how all the binaries in this stack are
produced, because maybe he found a bug and wants to know if all the results
obtained with this Docker image are correct or not. Or maybe he wants to study
a specific aspect to better understand a specific result. Bob is doing
science and thus Bob needs transparency.</p><p>The files <code>manifest.scm</code> and <code>channels.scm</code> sadly disappeared a long time ago,
probably at the end of Alice's postdoc. Had the Docker image been
produced with a <code>Dockerfile</code>, the game would most likely be over:
running <code>docker build</code> on that <code>Dockerfile</code> would probably give a
different result than back then (for instance because it starts by
running <code>apt-get update</code>), or it may simply fail because some of
the resources it refers to have vanished from the Internet. There are
ways to mitigate it, for instance by resorting to
<a href="https://snapshot.debian.org/">Debian’s snapshot service</a> and/or using
<a href="https://github.com/debuerreotype/debuerreotype">debuerreotype</a> to
recreate the image, assuming everything in the image was taken from
Debian. But overall, it’s safe to assume that a regular <code>Dockerfile</code>
does <em>not</em> describe a reproducible build process.</p><p>Fortunately, Bob remembers this Docker image had been produced with Guix
(<code>pack --save-provenance</code>). Let’s recover the recipe of this smoothie.</p><p>First, let’s start the container, which makes it easier to export as a
plain tarball. Second, let’s extract the embedded <a href="https://guix.gnu.org/manual/en/html_node/Getting-Started.html#index-profile">Guix
profile</a>:</p><pre><code>$ docker run -d python-python-numpy:latest python3
e1775ff836915dc55195eafd1710eec07106bd1677bde153e5842a0ded43395d
$ docker export -o /tmp/re-pack.tar $(docker ps -a --format "{{.ID}}"| head -n1)
$ tar -xf /tmp/re-pack.tar $(tar -tf /tmp/re-pack.tar | grep 'profile/manifest')
$ tree gnu
gnu
└── store
└── ia1sxr3qf3w9dj7y48rwvwyx289vpfgi-profile
└── manifest
2 directories, 1 file</code></pre><p>Wow! Is it really a regular profile? Yes, it is! Because that profile
contains <em>provenance metadata</em> (thanks to <code>--save-provenance</code>), we can ask
Guix to export that metadata in the form of a list of channels and a
manifest:</p><pre><code>$ guix package -p gnu/store/ia1sxr3qf3w9dj7y48rwvwyx289vpfgi-profile --export-channels
;; This channel file can be passed to 'guix pull -C' or to
;; 'guix time-machine -C' to obtain the Guix revision that was
;; used to populate this profile.
(list
(channel
(name 'guix)
(url "https://git.savannah.gnu.org/git/guix.git")
(commit
"fb32a38db1d3a6d9bc970e14df5be95e59a8ab02")
(introduction
(make-channel-introduction
"9edb3f66fd807b096b48283debdcddccfea34bad"
(openpgp-fingerprint
"BBB0 2DDF 2CEA F6A8 0D1D E643 A2A0 6DF2 A33A 54FA"))))
)
$ guix package -p gnu/store/ia1sxr3qf3w9dj7y48rwvwyx289vpfgi-profile --export-manifest
;; This "manifest" file can be passed to 'guix package -m' to reproduce
;; the content of your profile. This is "symbolic": it only specifies
;; package names. To reproduce the exact same profile, you also need to
;; capture the channels being used, as returned by "guix describe".
;; See the "Replicating Guix" section in the manual.
(specifications->manifest
(list "python" "python-numpy"))</code></pre><p>Awesome, isn't it? These last two outputs are equivalent to Alice's
<code>manifest.scm</code> and <code>channels.scm</code> files. At this stage, Bob’s a happy
person: he can now take these two files anywhere and rebuild the exact
same image at any time:</p><pre><code>guix time-machine -C new-channels.scm \
-- pack -f docker --save-provenance -m new-manifest.scm</code></pre><p>The command should produce the exact same <code>docker-pack.tar</code> that Alice
provided,
<a href="https://reproducible-builds.org/docs/definition/">bit for bit</a>. If it
does not, then either the original image had been tampered with, or one
of the package build processes involved is non-deterministic — something
we would invite you to <a href="https://guix.gnu.org/en/contribute/">report as a
bug</a>!</p><p>Join the fun, join <a href="https://hpc.guix.info/about/">us</a>!</p>What’s in a packageLudovic Courtèsguix-devel@gnu.org2021-09-20T14:00:00Z<p>There is no shortage of package managers. Each tool makes its own set
of tradeoffs regarding speed, ease of use, customizability, and
reproducibility. <a href="https://guix.gnu.org">Guix</a> occupies a sweet spot,
providing reproducibility <em>by design</em> as pioneered by
<a href="https://nixos.org">Nix</a>, package customization à la
<a href="https://github.com/spack/spack">Spack</a> from the command line, the
ability to <a href="https://hpc.guix.info/blog/2017/10/using-guix-without-being-root/">create container
images</a>
without hassle, and more.</p><p>Beyond the “feature matrix” of the tools themselves, a topic that is
often overlooked is packages—or rather, what’s inside of them. Chances
are that a given package may be installed <a href="https://xkcd.com/1654/">using any of the many tools
at your disposal</a>. But are you really getting
the same thing regardless of the tool you are using? The answer is
“no”, contrary to what one might think. The author realized this very
acutely while fearlessly attempting to package the
<a href="https://github.com/pytorch/pytorch/">PyTorch</a> machine learning
framework for Guix.</p><p>This post is about the journey packaging PyTorch <em>the Guix way</em>, the
rationale, a glimpse at what other PyTorch packages out there look like,
and conclusions we can draw for high-performance computing and
scientific workflows.</p><h1>Getting PyTorch in Guix</h1><p>One can install PyTorch in literally seconds with <code>pip</code>:</p><pre><code>$ time pip install torch
Collecting torch
Downloading https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl (831.4MB)
|████████████████████████████████| 831.4MB 91kB/s
Collecting typing-extensions (from torch)
Downloading https://files.pythonhosted.org/packages/74/60/18783336cc7fcdd95dae91d73477830aa53f5d3181ae4fe20491d7fc3199/typing_extensions-3.10.0.2-py3-none-any.whl
Installing collected packages: typing-extensions, torch
real 0m24.502s
user 0m19.711s
sys 0m3.811s</code></pre><p>Since it’s on <a href="https://pypi.org/">PyPI</a>, the Python Package Index, one
might think it’s a simple Python package that can be imported in Guix
<a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-import.html">the easy
way</a>.
That’s unfortunately not the case:</p><pre><code>$ guix import pypi torch
guix import: error: no source release for pypi package torch 1.9.0</code></pre><p>The reason <code>guix import</code> bails out is that the only thing PyPI provides
is a binary-only <a href="https://www.python.org/dev/peps/pep-0427/">“wheels” package</a>: the
<code>.whl</code> file downloaded above contains pre-built binaries only, not
source.</p><p>In Guix we insist on building software from source: it’s a matter of
transparency, auditability, and provenance tracking. We want to make
sure our users can see the source code that corresponds to the code they
run; we want to make sure they can build it locally, should they choose
not to trust <a href="https://guix.gnu.org/manual/en/html_node/Official-Substitute-Server.html">the project’s pre-built
binaries</a>;
or, when they do use pre-built binaries, we want to make sure they can
<a href="https://guix.gnu.org/manual/en/html_node/Invoking-guix-challenge.html"><em>verify</em></a>
that those binaries correspond to the source code they claim to match.</p><p>Transparency, provenance tracking, verifiability: it’s about extending
the scientific method <em>to the whole computational experiment</em>, including
software that powers it.</p><h1>Bundling</h1><p>The first surprise when starting packaging PyTorch is that, despite
being on PyPI, PyTorch is <a href="https://github.com/pytorch/pytorch/">first and
foremost</a> a large C++ code base.
It does have a
<a href="https://github.com/pytorch/pytorch/blob/master/setup.py"><code>setup.py</code></a> as
commonly found in pure Python packages, but that file delegates the bulk
of the work to
<a href="https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt">CMake</a>.</p><p>The second surprise is that PyTorch bundles (or “vendors”, as some would
say) source code for <a href="https://github.com/pytorch/pytorch/tree/master/third_party">no less than 41
dependencies</a>,
ranging from small Python and C++ helper libraries to large C++ neural
network tools. Like other distributions <a href="https://www.debian.org/doc/debian-policy/ch-source.html#embedded-code-copies">such as
Debian</a>,
Guix avoids bundling: we would rather have one Guix package for each of
these dependencies. The rationale is manifold, but it <a href="https://www.debian.org/doc/debian-policy/ch-source.html#id18">boils down
to</a>
keeping things auditable, reducing resource usage, and making security
updates practical.</p><p>Long story short: “unbundling” is often tedious, all the more so in
this case. We ended up packaging about ten dependencies that were not
already available or were otherwise outdated or incomplete, including
big C++ libraries like the
<a href="https://hpc.guix.info/package/xnnpack">XNNPACK</a> and
<a href="https://hpc.guix.info/package/onnx">onnx</a> neural network helper
libraries. Each of these typically bundles code for yet another bunch
of dependencies. Often, the CMake-based build system of these packages
would need patching so we could use our own copies of the dependencies.
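</p><p>To make this concrete, here is a minimal sketch of what such an “unbundling” patch amounts to. Everything here is hypothetical (the <code>example</code> project and the <code>foo</code> library are made up), and real patches are of course more involved:</p>

```shell
# Hypothetical sketch of "unbundling" a CMake-based project: delete the
# vendored copy of a dependency, then patch the build system so it looks
# up an externally provided package instead.
mkdir -p example/third_party/foo
printf 'add_subdirectory(third_party/foo)\n' > example/CMakeLists.txt

# Drop the bundled copy...
rm -rf example/third_party/foo

# ...and rewrite the build rule to use the system package.
sed -i 's|add_subdirectory(third_party/foo)|find_package(foo REQUIRED)|' \
    example/CMakeLists.txt

cat example/CMakeLists.txt
```

<p>In a Guix package definition, this kind of deletion and substitution is typically done in a build phase right after unpacking the source.</p><p>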
Curious readers can take a look at the commits <a href="https://git.savannah.gnu.org/cgit/guix.git/log?id=b402a3ec86ebac4df4eed6a4030923bc62683d1d">leading to
XNNPACK</a>
and those <a href="https://git.savannah.gnu.org/cgit/guix.git/log?id=630c39d8df7557b6a0941c1d5ee879e487de0f5e">leading to
onnx</a>.
Another interesting thing is the use of derivatives: PyTorch depends on
both <a href="https://github.com/pytorch/qnnpack">QNNPACK</a> and
<a href="https://github.com/google/XNNPACK">XNNPACK</a>, even though the latter is
a derivative of the former, and of course, it bundles both.</p><p>Icing on the cake: most of these machine learning software packages do
not have proper releases—no Git tag, nothing—so we were left to pick the
commit <em>du jour</em> or the one explicitly referred to by Git submodules.</p><p>Most PyTorch dependencies were unbundled. The end result is a <a href="https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/machine-learning.scm?id=a537ef5e0ceca76de0073541c98999bb206052b3#n2591">PyTorch
package in its full
glory</a>,
actually built from source. Phew! Its dependency graph looks like this
(only showing dependencies at distance 2 or less):</p><p><img src="/static/images/blog/pytorch-dependency-graph.svg" alt="Excerpt from the PyTorch package dependency graph." /></p><p>With this many dependencies bundled,
these projects resemble the <a href="https://dustycloud.org/blog/javascript-packaging-dystopia/">JavaScript
dystopia</a>
Christine Lemmer-Webber described. Anyway, PyTorch is now <em>also</em>
<a href="https://hpc.guix.info/package/python-pytorch">installable with Guix</a> in
seconds when enabling pre-built binaries:</p><pre><code>$ time guix install python-pytorch
The following package will be installed:
python-pytorch 1.9.0
52.3 MB will be downloaded
python-pytorch-1.9.0 49.9MiB 6.2MiB/s 00:08 [##################] 100.0%
The following derivation will be built:
/gnu/store/yvygv6nlichbzyynvg4w04xa7xarx3rp-profile.drv
applying 16 grafts for /gnu/store/6qgcb3a7x1wg4havsryjh6zsy3za7h3b-python-pytorch-1.9.0.drv ...
building profile with 2 packages...
real 0m20.697s
user 0m3.604s
sys 0m0.118s</code></pre><p>This time though, one can view the self-contained package definition by
running <code>guix edit python-pytorch</code> and, say, rebuild it locally to
<em>verify</em> the source/binary correspondence:</p><pre><code>guix build python-pytorch --no-grafts --check</code></pre><p>… or at least it will be possible once NNPACK’s build system <a href="https://issues.guix.gnu.org/50672">generates
code in a deterministic order</a>.</p><h1>pip & CONDA</h1><p>Having done all this work, the author entered a soul-searching phase:
sure, the rationale is well documented, but <em>is it worth it</em>? It looks
as though <em>everyone</em> (everyone?) is installing PyTorch using <code>pip</code>
anyway and considering it good enough. Also, why was it so much work to
package PyTorch for Guix? Could it be that we’re missing packaging
tricks that make it so easy for others to provide PyTorch & co.?</p><p>To answer these questions, let’s first take a look at what <code>pip</code>
provides. The <code>pip install</code> command above completed after less than
thirty seconds, and most of that time went into downloading an 831 MiB
archive—no less. What’s in there? Those <code>.whl</code> files are actually zip
archives, which one can easily inspect:</p><pre><code>$ wget -qO /tmp/pytorch.zip https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl
$ unzip -l /tmp/pytorch.zip | grep '\.so'
29832 06-12-2021 00:37 torch/_dl.cpython-38-x86_64-linux-gnu.so
29296 06-12-2021 00:37 torch/_C.cpython-38-x86_64-linux-gnu.so
372539384 06-12-2021 00:37 torch/lib/libtorch_cpu.so
43520 06-12-2021 00:37 torch/lib/libnvToolsExt-3965bdd0.so.1
28964064 06-12-2021 00:37 torch/lib/libtorch_python.so
46351784 06-12-2021 00:37 torch/lib/libcaffe2_detectron_ops_gpu.so
1159370040 06-12-2021 00:37 torch/lib/libtorch_cuda.so
4862944 06-12-2021 00:37 torch/lib/libnvrtc-builtins.so
168720 06-12-2021 00:37 torch/lib/libgomp-a34b3233.so.1
116240 06-12-2021 00:37 torch/lib/libtorch.so
523816 06-12-2021 00:37 torch/lib/libcudart-80664282.so.10.2
222224 06-12-2021 00:37 torch/lib/libc10_cuda.so
36360 06-12-2021 00:37 torch/lib/libshm.so
47944 06-12-2021 00:37 torch/lib/libcaffe2_module_test_dynamic.so
22045456 06-12-2021 00:37 torch/lib/libnvrtc-08c4863f.so.10.2
12616 06-12-2021 00:37 torch/lib/libtorch_global_deps.so
21352 06-12-2021 00:37 torch/lib/libcaffe2_nvrtc.so
842376 06-12-2021 00:37 torch/lib/libc10.so
552808 06-12-2021 00:37 torch/lib/libcaffe2_observers.so
46651272 06-12-2021 00:37 caffe2/python/caffe2_pybind11_state.cpython-38-x86_64-linux-gnu.so
47391432 06-12-2021 00:37 caffe2/python/caffe2_pybind11_state_gpu.cpython-38-x86_64-linux-gnu.so
$ unzip -l /tmp/pytorch.zip | grep '\.so' | wc -l
21</code></pre><p>Twenty-one pre-compiled shared libraries in there! Most are part of
PyTorch, but some are external dependencies. First there’s libgomp,
GCC’s <a href="https://gcc.gnu.org/onlinedocs/libgomp/">OpenMP and OpenACC run-time support
library</a>; we can guess it’s
shipped to avoid incompatibilities with the user-installed libgomp, but
it could also be a fork of the official libgomp—hard to tell. Then
there’s <code>libcudart</code> and <code>libnvToolsExt</code>, both of which are proprietary
NVIDIA GPU support libraries—a bit of a surprise, and a bad one, as
nothing indicated that <code>pip</code> fetched proprietary software alongside
PyTorch. What’s also interesting is dependencies that are <em>not</em> there,
such as onnx and XNNPACK; we can only guess that they’re statically
linked within <code>libtorch.so</code>.</p><p>Will these binaries work? On my system, they won’t work without
tweaks, such as setting <code>LD_LIBRARY_PATH</code> so that these libraries can find
those they depend on. Using <a href="https://linux.die.net/man/1/ldd"><code>ldd</code></a> shows
the “system libraries” that are assumed to be available; this includes
GNU libstdc++ and GCC’s run-time support library:</p><pre><code>$ ldd torch/lib/libtorch_cpu.so
linux-vdso.so.1 (0x00007ffca6d31000)
libgomp-a34b3233.so.1 => /tmp/pt/torch/lib/libgomp-a34b3233.so.1 (0x00007ff435723000)
…
libstdc++.so.6 => not found
libgcc_s.so.1 => not found</code></pre><p>Not providing those libraries, or providing a variant that is not
binary-compatible with what <code>libtorch_cpu.so</code> expects, is the end of the
game. Fortunately these two libraries rarely change, so the assumption
made here is that “most” users will have them. It’s interesting that
the authors deemed it necessary to ship <code>libgomp.so</code> and not
<code>libstdc++.so</code>—maybe a mixture of insider knowledge and dice roll.</p><p>How were these binaries built in the first place? Essentially, by
running <code>python setup.py bdist_wheel</code> “on some system” which, as we saw,
invokes <code>cmake</code> to build PyTorch and all its bundled dependencies. But
the PyTorch project does <a href="https://github.com/pytorch/pytorch/tree/7dc3858deb98f85a2353e4ea377b370b3d5c8e95/.circleci/README.md"><em>a little bit
more</em></a>
than this to build and publish binaries for pip and CONDA. The entry
point for both is
<a href="https://github.com/pytorch/pytorch/tree/7dc3858deb98f85a2353e4ea377b370b3d5c8e95/.circleci/scripts/binary_linux_build.sh"><code>binary_linux_build.sh</code></a>,
which in turn delegates to scripts living in another repo,
<a href="https://github.com/pytorch/builder/blob/d371104fb25cf57f3de9e8b168f9172c700962ee/conda/build_pytorch.sh"><code>build_pytorch.sh</code></a>
for CONDA or one of <a href="https://github.com/pytorch/builder/tree/d371104fb25cf57f3de9e8b168f9172c700962ee/manywheel">the wheels
scripts</a>;
it’s one of these scripts that’s <a href="https://github.com/pytorch/builder/blob/d371104fb25cf57f3de9e8b168f9172c700962ee/manywheel/build.sh#L110-L290">in charge of embedding <code>libgomp.so</code>,
<code>libcudart.so</code>, and other libraries present on the
system</a>.</p><p>And where do these libraries come from? They come from the GNU/Linux
distribution underlying the build environment which, going back to the initial repository,
<a href="https://github.com/pytorch/pytorch/tree/7dc3858deb98f85a2353e4ea377b370b3d5c8e95/.circleci/docker">may typically be some version of Ubuntu or
CentOS</a>
running on the machines of CircleCI or Microsoft Azure.</p><p>At the end of the process is a bunch of wheel or CONDA archives ready to
be uploaded as-is <a href="https://github.com/pytorch/builder/blob/d371104fb25cf57f3de9e8b168f9172c700962ee/conda/publish_conda.sh">to
Anaconda</a>
or <a href="https://github.com/pytorch/builder/blob/d371104fb25cf57f3de9e8b168f9172c700962ee/wheel/upload_wheels_to_pypi.sh">to
PyPI</a>.</p><p>Looking at these scripts gives useful hints. But going back to the code
pip and CONDA users are actually running: is <code>libgomp-a34b3233.so.1</code>
<em>the</em> libgomp, or is it a modified version? Is <code>libtorch_cpu.so</code>
<em>really</em> obtained by building <a href="https://github.com/pytorch/pytorch/releases/tag/v1.9.0">source from the <code>1.9.0</code> Git
tag</a>?</p><p>Let’s make it clear: verifying the source/binary correspondence for all
the bits in the pip and CONDA packages is <em>practically infeasible</em>. Merely
rebuilding them locally is hard. Reasoning about the build process is
hard because of all the layers involved and because of the ball of
spaghetti that these scripts are. Such a setup rightfully raises red
flags for any security-minded person—we’ll get to that below—or
freedom-conscious user: it’s also about <a href="https://gnu.tools/en/documents/free-software/">user
freedom</a>. Is PyPI
conveying the Corresponding Source of libgomp, as per <a href="https://www.gnu.org/licenses/gpl-3.0.html#section6">Section 6 of its
license</a>? Probably
not. PyTorch’s own license doesn’t have this requirement, but there’s
certainly a tacit agreement that <code>pip install torch</code> provides <em>the</em>
PyTorch, and it’s unpleasant at best that this claim is unverifiable in
practice. <em>This</em> should be a red flag for anyone doing reproducible
science—in other words, science.</p><h1>Source-based distros</h1><p>CONDA and pip (at least the “wheels” part of it) are essentially “binary
distros”: they focus on distributing pre-built binaries without concern
for how they were built, nor for whether they can actually be built from
source. Without a conscious effort to require <a href="https://reproducible-builds.org/">reproducible
builds</a> so that anyone can
independently verify binaries, these tools are doomed to be not only
unsafe but also opaque—and there are to date no signs of CONDA and
PyPI/pip moving in that direction.</p><blockquote><p>Update (2021-09-21): Bovy on Twitter
<a href="https://nitter.net/benbovy/status/1440027976364552199#m">mentions</a>
<a href="https://conda-forge.org">conda-forge</a> as a possible answer. Public
build recipes (here’s <a href="https://github.com/conda-forge/pytorch-cpu-feedstock/tree/master/recipe">that of
PyTorch</a>)
and automated builds improve transparency compared to binaries
uploaded straight from developer machines, but build reproducibility
remains to be addressed.</p></blockquote><p>Like Guix, Spack and Nix are source-based: their primary job is to build
software from source, and the use of pre-built binaries is “an optimization”.
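</p><p>Building from source only enables independent verification when builds are reproducible, and reproducibility mostly comes down to hunting non-determinism. As a small, self-contained illustration of the principle (file names made up; GNU tar assumed), normalizing archive metadata makes two independent runs produce bit-identical output:</p>

```shell
# Two tar runs normally differ (timestamps, ordering, ownership).
# Normalizing that metadata, a standard reproducible-builds technique,
# makes independent runs bit-identical.
mkdir -p demo && printf 'hello\n' > demo/file

tar --sort=name --mtime='@0' --owner=0 --group=0 --numeric-owner \
    -cf run1.tar demo
tar --sort=name --mtime='@0' --owner=0 --group=0 --numeric-owner \
    -cf run2.tar demo

cmp run1.tar run2.tar && echo "bit-for-bit identical"
```

<p>The same principle is why Guix-produced Docker images carry Unix-epoch timestamps, which Docker displays as “created 51 years ago”.</p><p>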
The <a href="https://github.com/spack/spack/blob/730720d50a8ef2afb3087d69fb44cd9ec93801e1/var/spack/repos/builtin/packages/py-torch/package.py">Spack
package</a>
and the <a href="https://github.com/NixOS/nixpkgs/blob/f8420fd6df9b70b10b88a66d4bfd085863e2f9d4/pkgs/development/python-modules/pytorch/default.nix">Nixpkgs
package</a>
are all about building it all <em>from source</em>. The Spack package avoids
using some of the bundled dependencies, though it does use large ones:
XNNPACK and onnx; the Nixpkgs package makes no such effort and builds it
all as-is.</p><p>Unlike Nix or Guix, Spack assumes core packages—for some definition of
“core”, but that includes at least a C/C++ compiler, a C library, and a
Python interpreter—are already available. Thus, by definition, the
Spack package is not self-contained and may fail to build, plain and
simple, if some of the implicit assumptions are not met. When fetching
pre-built binaries from a <a href="https://spack.readthedocs.io/en/latest/binary_caches.html">“binary
cache”</a>, the
problems are similar to those of CONDA and pip: binaries might not work
if assumptions about system libraries are not met (though Spack
mitigates this risk by tying binaries to the underlying GNU/Linux
distro), and it may be hard to verify them through rebuilding, again
because these implicit assumptions have an impact on the bits in the
resulting binaries.</p><h1>On convenience, security, and reproducible science</h1><p>The convenience and ease of use of pip and CONDA have undeniable appeal.
That one can, in a matter of minutes, install the tool <em>and</em> use it to
deploy a complex software stack like that of PyTorch has
certainly contributed to their success. Our view though, as Guix
packagers, is that we should take a step back and open the package—look
at what’s inside and the impact it has.</p><p>What we see when we look inside PyPI wheels and CONDA packages is
<em>opaque binaries</em> built on a developer’s machine and later uploaded to
the central repository. They are opaque because, lacking reproducible
build methodology and tooling, one cannot independently verify that they
correspond to the presumed source code. They may also be deceptive: you
get not just PyTorch but also the binary of a proprietary piece of
software.</p><p>In their <a href="https://dl.acm.org/doi/10.1145/3468264.3468592">ESEC/FSE 2021 paper on
LastPyMile</a>, Duc-Ly Vu
<em>et al.</em> empirically show that “<em>the last mile from source to package</em>”
on PyPI is indeed the weakest link in the software supply chain, and
that actual differences between packaged source code and upstream source
code <em>are</em> observed in the wild. And this is only source code—for
binaries as found in the <code>torch</code> wheel, there is just no practical way
to verify that they genuinely correspond to that source code.</p><p>Machine-learning software is fast-moving. The desire to be fast already
shows in upstream development practices: lack of releases for important
dependencies, careless dependency bundling. Coupled with the user’s
legitimate demand for “easy installation”, this turned PyPI, in the
footsteps of CONDA, into a huge software supply chain vulnerability
waiting to be exploited. It’s a step several years back in time, to when
Debian hadn’t yet put an end to its <a href="https://archive.fosdem.org/2015/schedule/event/distributions_boring_solved_problem/">“dirtiest
secret”</a>—that
Debian packages would be non-reproducible, built on developer machines,
and uploaded to the servers. <a href="https://reproducible-builds.org">Reproducible
builds</a> should be the norm; <a href="https://guix.gnu.org/en/blog/tags/bootstrapping/">building
from source</a>, too,
should be the norm.</p><p>It is surprising that such a blatant weakness goes unnoticed, especially
on high-performance computing clusters that are usually subject to
strict security policies. Even more so at a time when <a href="https://www.computer.org/csdl/magazine/sp/2021/02/09382367/1saZVPHhZew">awareness about
software supply chain security
grows</a>,
and when the US Government’s <a href="https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/">Executive Order on
cybersecurity</a>,
for example, explicitly calls for work on subjects as concrete as
“<em>using administratively separate build environments</em>” and “<em>employing
automated tools (…) to maintain trusted source code supply chains</em>”.</p><p>Beyond security, what are the implications for scientific workflows?
Can we build reproducible computational workflows using software that is
itself non-reproducible, non-verifiable? The answer is “yes”, one can
do that. However, just like one wouldn’t build a house on a quagmire,
building scientific workflows on shaky foundations is inadvisable. Far
from being an abstract principle, it has concrete implications:
scientists and their peers need to be able to reproduce the software
environment, <em>all of it</em>; they need the ability to customize it
and experiment with it, as opposed to merely running code from an “inert”
binary.</p><p>It is time to stop running opaque binaries and to value transparency and
verifiability for our foundational software, as much as we value
transparency and verifiability for scientific work.</p><h1>Acknowledgments</h1><p>The author thanks Ricardo Wurmus and Simon Tournier for insightful
feedback and suggestions on an earlier draft of this post.</p>