Modern science relies on computational analysis to process data. When the data flow is linear, such processing is easily represented with tools like the standard Unix pipeline. Often, however, the data flow is better modeled by a directed graph: each processing node may have one or more inputs, and its outputs may feed into different processing nodes. Such a directed graph, used notably in bioinformatics, medical imaging, and astronomy, among many other fields, is called a workflow.
It should come as no surprise that the execution speed of programs is a primary concern in high-performance computing (HPC). Many HPC practitioners would tell you that their top concerns include the performance of the high-speed networks used by the Message Passing Interface (MPI) and the use of the latest vectorization extensions of modern CPUs.
There is no shortage of package managers. Each tool makes its own set of tradeoffs regarding speed, ease of use, customizability, and reproducibility. Guix occupies a sweet spot, providing reproducibility by design as pioneered by Nix, package customization à la Spack from the command line, the ability to create container images without hassle, and more.
We are organizing the first French-speaking workshop on the reproducibility of software environments for scientists, engineers, and system administrators. The workshop will take place online on May 17–18, 2021, from 09:00 to 12:30 CEST. Stay tuned for more reproducible research events!
We are pleased to announce the first French-speaking workshop on the reproducibility of software environments, which will take place online on the mornings of May 17th and 18th, 2021; the program and practical information are available on the event page.
This workshop follows up on the interest the French-speaking scientific computing community has shown in reproducibility, notably during the Action Nationale de Formation UST4HPC 2021 and at the reproducibility day of the Société Informatique de France (SIF) taking place on May 10th. It is also part of the activities of the Guix-HPC group.
The program features seven experience reports from scientists and system administrators on software deployment in computing centers with Guix, but also Spack or module, and on building reproducible research pipelines with Debian, Org-Mode, and Guix.
These talks will be followed by discussions of everyone's expectations and proposals, both from a scientific standpoint and in terms of computing center administration.
Attendance is open and free of charge, but we nevertheless invite you to register.
With the increased usage of GNU Guix at scientific institutions comes a growing need to package software used in research and teaching. The best place for these packages has been, and still is, Guix's main repository, where the software is accessible to and maintainable by the entire Guix community.
Early this year, ReScience, which is concerned with publishing replications (successful or not) of previously-published articles, organized the Ten Years Reproducibility Challenge. The idea is simple: pick a paper of yours that is at least ten years old, and try to replicate its results. The first difficulty is usually getting the source code of the software used to produce the results, and then getting that code to build and run. Once again, the challenge highlighted ways in which research practices can and must be improved. We took it as an opportunity to devise new practices and tools to ensure reproducibility and provenance tracking for articles, end-to-end: from source code to PDF.
The guix pack command creates “application bundles” that can be used to deploy software on machines that do not run Guix (yet!), such as HPC clusters. Since its inception, it has seen a number of improvements, such as the ability to create Docker and Singularity container images. Some clusters lack these tools, though, and the addition of relocatable packs was a way to address that. This post looks at a new execution engine for relocatable packs that has just landed, with the goal of improving performance.
Version 1.1.0 of Guix was announced yesterday. As the announcement points out, some 200 people contributed more than 14,000 commits since the previous release. This post focuses on important changes for HPC users, admins, and scientists made since version 1.0.1 was released in May 2019.
This post is about reproducible computations, so let's start with a computation. A short, though rather uninteresting, C program is a good starting point. It computes π in three different ways:
High-performance networks have constantly been evolving, in sometimes hard-to-decipher ways. Once upon a time, hardware vendors would pre-install an MPI implementation (often an in-house fork of one of the free MPI implementations) specially tailored for their hardware. Fortunately, those days appear to be gone. Nevertheless, there is still a widespread belief that MPI cannot be packaged in a way that achieves the best performance on a variety of contemporary high-speed networking hardware.
Jupyter Notebooks are becoming a key component of the researcher's toolbox for sharing and reproducing computational experiments. Jupyter notebooks not only allow users to intermingle a narrative with supporting code, in a way reminiscent of literate programming; they also make it easy to interact with the code and, thus, to build on each other's work.
The book Evolutionary Genomics was published in July this year. Of particular interest to Guix-HPC is the chapter entitled “Scalable Workflows and Reproducible Data Analysis for Genomics”, by Francesco Strozzi et al.:
GNU Guix can be used as a “package manager” to install and upgrade software packages as is familiar to GNU/Linux users, or as an environment manager, but it can also provision containers or virtual machines, and manage the operating system running on your machine.
In the quest for truly reproducible workflows, I set out to create an example of a reproducible workflow using GNU Guix, IPFS, and CWL. GNU Guix provides content-addressable, reproducible, and verifiable software deployment. IPFS provides content-addressable storage, and CWL describes workflows in a way that any supported backend can execute. In principle, this combination of tools should be enough to provide reproducibility with provenance and improved security.
In December 2018, the Akalin lab at the Berlin Institute of Medical Systems Biology (BIMSB) published a paper about PiGx, a collection of reproducible genomics pipelines made available through GNU Guix. The article was awarded third place in the GigaScience ICG-13 Prize. Representing the authors, Ricardo Wurmus was invited to present the work on PiGx and Guix at ICG-13 in Shenzhen, China.
Ricardo urged the audience of wet lab scientists and bioinformaticians to apply the same rigorous standards of experimental design to experiments involving software: all variables need to be captured and constrained. To demonstrate that this does not need to be complicated, Ricardo reported the experiences of the Akalin lab in building a collection of reproducibly built automated genomics workflows using GNU Guix.
Due to technical difficulties the recording of the talk was lost, so Ricardo re-recorded the talk a few weeks later.
I’m happy to announce that the bioinformatics group at the Max Delbrück Center that I’m working with has released a preprint of a paper on reproducibility with the title Reproducible genomics analysis pipelines with GNU Guix.
Guix follows a transparent source/binary deployment model: it downloads pre-built binaries when they are available, like yum, and otherwise falls back to building from source. Most of the time, the project's build farm provides binaries, so users don't have to spend resources building from source. Pre-built binaries may be missing when you install a custom package, or when the build farm hasn't caught up yet. However, deployment of binaries is often seen as incompatible with high-performance requirements: binaries are “generic”, so how can they take advantage of cutting-edge HPC hardware? In this post, we explore the issue and solutions.
In the previous post, we saw that Guix's build daemon needs to run as root, and for a good reason: that's currently the only way to create isolated build environments for packages on GNU/Linux. This requirement means that you cannot use Guix on a cluster where the sysadmins have not already installed it. In this article, we discuss how to take advantage of Guix on clusters that lack a proper Guix installation.
Guix is a good fit for multi-user environments such as clusters: it allows non-root users to install packages at will without interfering with one another. However, a common complaint is that installing Guix itself requires administrator privileges. More precisely, guix-daemon, the system-wide daemon that spawns package builds and downloads on behalf of users, must be running as root. This is not much of a problem on one's laptop, but it surely makes it harder to adopt Guix on an HPC cluster.
This post marks the debut of Guix-HPC, an effort to optimize GNU Guix for reproducible scientific workflows in high-performance computing (HPC). Guix-HPC is a joint effort between Inria, the Max Delbrück Center for Molecular Medicine (MDC), and the Utrecht Bioinformatics Center (UBC). Ludovic Courtès, Ricardo Wurmus, Roel Janssen, and Pjotr Prins are driving the effort at these institutes, each focusing on specific areas of interest within the overall Guix-HPC effort. Our institutes have in common that they are HPC users and that, as scientific research institutes, they have an interest in using reproducible methodologies to carry out their research.