Guix-HPC Activity Report, 2019
This document is also available as PDF (printable booklet).
Guix-HPC is a collaborative effort to bring reproducible software deployment to scientific workflows and high-performance computing (HPC). Guix-HPC builds upon the GNU Guix software deployment tool and aims to make it a better tool for HPC practitioners and scientists concerned with reproducible research.
Guix-HPC was launched in September 2017 as a joint software development project involving three research institutes: Inria, the Max Delbrück Center for Molecular Medicine (MDC), and the Utrecht Bioinformatics Center (UBC). GNU Guix for HPC and reproducible science has received contributions from additional individuals and organizations, including CNRS, Cray, Inc., the University of Tennessee Health Science Center (UTHSC), and Tourbillion Technology.
This report highlights key achievements of Guix-HPC between our previous report a year ago and today, February 2020. This year was marked by a major milestone: the release in May 2019 of GNU Guix 1.0, seven years and more than 40,000 commits after its inception.
Outline
Guix-HPC aims to tackle the following high-level objectives:
- Reproducible scientific workflows. Improving the GNU Guix tool set to better support reproducible scientific workflows and to simplify sharing and publication of software environments.
- Cluster usage. Streamlining Guix deployment on HPC clusters, and providing interoperability with clusters not running Guix.
- Outreach & user support. Reaching out to the HPC and scientific research communities and organizing training sessions.
The following sections detail work that has been carried out in each of these areas.
Reproducible Scientific Workflows
Supporting reproducible research in general remains a major goal for Guix-HPC. The ability to reproduce and inspect computational experiments—today’s lab notebooks—is key to establishing a rigorous scientific method. We believe that a prerequisite for this is the ability to reproduce and inspect the software environments of those experiments. We have made further progress to ensure Guix addresses this use case.
Better Support for Reproducible Research
Guix has always supported reproducible computations by design, but two obstacles stood in the way of actually using it for that purpose: the user interface to its reproducibility features was somewhat clumsy, and documentation, both practical and background, was scarce.
Supporting reproducible computations requires addressing four aspects:
- Finding the dependencies of a computation.
- Ensuring that there are no hidden dependencies, such as utility programs from the environment that are “just there”.
- Providing a record of the dependencies from which they can be reconstructed.
- Reproducing a computation from such a record.
Step 1 is highly situation-dependent and therefore cannot be fully automated. Step 2 is supported by guix environment and step 3 by guix describe. Step 4 used to require a rather unintuitive use of guix pull (whose main use case is updating Guix), but is now supported in a more straightforward way by guix time-machine, which provides direct access to older versions of Guix and of all the packages it defines.
A post on the Guix-HPC blog walks through these four steps and explains how Guix ensures bit-for-bit reproducibility through comprehensive dependency tracking.
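As an illustration, steps 2 to 4 can be carried out with commands along the following lines (a minimal sketch; the Python packages are arbitrary examples):

    $ guix environment --pure --ad-hoc python python-numpy   # step 2: run in an environment with no hidden dependencies
    $ guix describe -f channels > channels.scm                # step 3: record the exact Guix revision in use
    $ guix time-machine -C channels.scm -- \
        environment --pure --ad-hoc python python-numpy       # step 4: recreate that same environment later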
Reproducible Deployment for Jupyter Notebooks
Jupyter Notebooks have become a tool of choice for scientists who wish to share and reproduce computational experiments. Yet, nothing in a notebook specifies which software packages it relies on, which puts reproducibility at risk.
Together with Pierre-Antoine Rouby, as part of a four-month internship at Inria in 2018, we started work on Guix-Jupyter, a Guix “kernel” for Jupyter Notebook. In a nutshell, Guix-Jupyter allows notebook writers to specify the software environment the notebook depends on: the Guix packages and the Guix commit. Furthermore, all the code in the notebook runs in an isolated environment (a “container”). This ensures that someone replaying the notebook will run it in the exact environment the author intended.
Guix-Jupyter reached its first release in October 2019. Many people on Jupyter forums were enthusiastic about the approach. Compared to other approaches, which revolve around building container images, Guix-Jupyter addresses the deployment problem at its root, providing a maximum level of transparency. Guix-Jupyter notebooks are already being used in bioinformatics courses, for example at the University of Tennessee.
The Guix Workflow Language
The Guix Workflow Language (or GWL), an extension of Guix for the description and execution of scientific workflows, has seen continuous improvements over the past year. The core idea remains unchanged: rather than grafting software deployment onto a workflow language, extend a mature software deployment solution just enough to accommodate the needs of users and authors of scientific workflows.
User testing revealed a desire for a syntax more familiar to users of other workflow systems, without compromising the benefits of embedding a domain-specific language in a general-purpose language, as demonstrated by Guix itself. As a result of these tests and discussions, the Guix Workflow Language now accepts workflow definitions written in a Python-like syntax called Wisp and provides about a dozen macros and procedures to simplify common tasks, such as embedding foreign code snippets, string interpolation, and file name expansion. Of course, workflows can also be written in plain Scheme, or even in a mix of both styles.
One of the benefits of “growing” a workflow language out of Guix is that non-trivial features implemented in Guix are readily available for co-option. For example, the GWL now uses the mature implementation of containers in Guix to provide support for evaluating processes in isolated container environments.
Work has begun to leverage the features of both guix pack and guix deploy to not only execute workflows on systems that share a Guix installation but also to provision remote Guix systems from scratch to run a distributed workflow without a traditional HPC scheduler. To that end, a first prototype of a Guile library to manage storage and compute resources through Amazon Web Services (AWS) has been developed, which will be integrated with the Guix Workflow Language in future releases.
You can read more about the many changes to the GWL in the release notes of version 0.2.0.
Ensuring Source Code Availability
In April 2019, Software Heritage and GNU Guix announced their collaboration to enable long-term reproducibility. Being able to rely on a long-term source code archive is crucial to support the use cases that matter to reproducible science: what good would it be if guix time-machine failed because upstream source code had vanished? Since the beginning of 2019, Guix has been able to fall back to Software Heritage should upstream source code vanish.
We worked to improve coverage of the Software Heritage archive—making sure the source code that Guix packages refer to is archived. That led to the addition of an archival tool to guix lint, our helper for package developers, which instructs Software Heritage to archive source code it currently lacks, before the package even makes it into Guix itself. We also helped review work carried out by NixOS developer “lewo” to further improve archive coverage.
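For example, package developers can now run this checker on a package before submitting it, asking Software Heritage to archive its source code if it is not already covered (a minimal sketch; “hello” stands in for the package being worked on):

    $ guix lint --checkers=archival hello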
Packaging
The core package collection that comes with Guix went from 9,000 packages a year ago to more than 12,000 as of this writing. This rapid growth benefits users of all application domains, notably HPC practitioners and scientists.
The message passing interface (MPI) is a key component for our HPC users and an important factor for the performance of multi-node parallel applications. We have worked on improving Open MPI support for a wide range of high-speed network devices, making sure our openmpi package achieves peak performance by default on each of them—it is all about portable performance. This work is described in our blog post entitled “Optimized and portable Open MPI packaging”. It led to improvements in the packages for high-speed network drivers and fabrics, such as UCX, PSM, and PSM2, improvements in the Open MPI package itself, the addition of a package for the Intel MPI Benchmarks, and the addition of an MPICH package.
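As an illustration, a user can enter a development environment containing this optimized Open MPI stack and build and run an MPI program along these lines (a sketch; hello.c stands in for the user's own program):

    $ guix environment --pure --ad-hoc openmpi gcc-toolchain   # spawn a shell containing only these packages
    $ mpicc hello.c -o hello-mpi                               # run inside the spawned shell
    $ mpirun -np 4 ./hello-mpi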
Numerical simulation is one of the key activities on HPC systems. Within GNU Guix, a simulation module has been established to gather packages used in this field. Popular packages such as OpenFOAM and FEniCS have already been included, with FEniCS having received a recent update. The Gmsh package in the maths module allows for sophisticated grid generation and post-processing of results. This year the FreeCAD package was added to the engineering module. It allows for the definition of complex two-dimensional and three-dimensional geometries, often needed as the first step in the simulation process. Engineers and scientists using Guix can now conduct simulations and numerical experiments spanning a wide range of applications. Plans for the near future include updates to Gmsh and OpenFOAM and the addition of a specialised solver for the shallow water equations.
HPC environments typically run an underlying GNU/Linux distribution such as Red Hat, Debian, or Ubuntu, complemented by user-land build systems such as Conda. Conda has the downside of not being reproducible, because its bootstrap normally depends on the underlying distribution. Guix, however, supports a reproducible Conda bootstrap. This means that HPC managers can keep supporting distribution-level software installs (e.g., through apt-get), while users are empowered to install software themselves from the thousands of packages GNU Guix supports (plus more through Guix channels, see below) and the thousands of Conda packages. In practice, as system administrators, we find that we hardly ever have to build packages from source anymore, and that we are rarely bothered by our (scientific) users.
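In practice, installing a tool into one's own profile is a one-liner and requires no administrator intervention (a minimal sketch; the package name is an arbitrary example):

    $ guix install samtools          # installs into the user's profile, no root privileges needed
    $ guix package --list-installed  # list what is currently in that profile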
Many other key HPC packages have been added, upgraded, or improved, including the SLURM batch scheduler, the HDF5 data management suite, the LAPACK reference linear algebra package, the Julia and Rust programming languages, the PyOpenCL Python interface to OpenCL, and many more.
Statistical and bioinformatics packages for the R programming language in particular have seen regular comprehensive upgrades, closely following updates to the popular CRAN and Bioconductor repositories. At the time of this writing Guix provides a collection of more than 1,300 reproducibly built R packages, making R one of the best-supported programming environments in Guix.
In addition to the packages in core Guix, we have been developing channels providing packages that are closely related to the research work of teams at our institutes. One such example is the Guix-HPC channel, developed by HPC research teams at Inria, and which now contains about forty packages. Active bioinformatics channels include that of the BIMSB group at the Max Delbrück Center for Molecular Medicine (MDC) (130+ packages), that of the genetics group at UMC Utrecht (400+ packages), and the genomics channel by Erik Garrison.
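Using such a channel only requires declaring it in ~/.config/guix/channels.scm and running guix pull; the sketch below shows what this might look like for the Guix-HPC channel (the Git URL is shown for illustration):

    $ cat > ~/.config/guix/channels.scm <<EOF
    (cons (channel
            (name 'guix-hpc)
            (url "https://gitlab.inria.fr/guix-hpc/guix-hpc.git"))
          %default-channels)
    EOF
    $ guix pull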
Cluster Usage
This year Guix has become the deployment tool of choice on more clusters. We are notably aware of new deployments at several academic clusters such as GriCAD (France), CCIPL (France), and UTHSC (USA). Discussions are on-going with other academic and industrial partners who have shown interest in deploying Guix.
In order to improve the availability of binary substitutes for the more than 12,000 packages defined in Guix, the Max Delbrück Center for Molecular Medicine (MDC) in Berlin (Germany) generously provided funds to purchase 30 new servers to replace a number of outdated and failing build nodes in the distributed build farm. These new servers are now hosted at the MDC data center in Berlin and continuously build binaries for several of the architectures supported by Guix. The binaries are archived on a dedicated storage array and offered for download to all users of Guix.
We have further improved guix pack to support users who wish to take advantage of Guix while deploying software on machines where Guix is not available. One noteworthy improvement is the addition of the -RR option, which we like to refer to as “reliably relocatable”: guix pack -RR creates a relocatable tarball that automatically falls back to using PRoot for relocation when unprivileged user namespaces are not supported, thereby providing a “universal” relocatable archive. The Docker and Singularity back-ends of guix pack have also seen improvements, in particular the addition of the --entry-point option to specify the default entry point, and of a --save-provenance option to save provenance meta-data in the container image.
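For example, one can build a "universal" relocatable tarball, or a Docker image with provenance meta-data and a default entry point, along these lines (a sketch; "hello" is an arbitrary package):

    $ guix pack -RR -S /bin=bin hello                                       # relocatable tarball with PRoot fallback
    $ guix pack -f docker --save-provenance --entry-point=bin/hello hello   # Docker image with provenance meta-data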
Outreach and User Support
Guix-HPC is in part about “spreading the word” about our approach to reproducible software environments and how it can help further the goals of reproducible research and high-performance computing development. This section summarizes articles, talks, and training sessions given this year.
Articles
The book Evolutionary Genomics, published in July 2019, contains a chapter entitled “Scalable Workflows and Reproducible Data Analysis for Genomics”, by Francesco Strozzi et al. that discusses workflow and deployment tools, in particular looking at the GNU Guix Workflow Language, the Common Workflow Language, Snakemake, as well as Docker, CONDA, and Singularity.
We published seven articles on the Guix-HPC blog this year, touching on topics such as efficient Open MPI packaging, Guix-Jupyter, Software Heritage integration, and a hands-on tutorial on using Guix for reproducible workflows and computations.
Talks
Since last year, we have given talks at the following venues:
- INRA MIA Seminar, Feb. 2019 (Ludovic Courtès)
- IN2P3/CNRS ComputeOps Workshop, March 2019 (Ludovic Courtès)
- ARAMIS Plenary Session on Reproducibility, May 2019 (Ludovic Courtès)
- JCAD, Oct. 2019 (Ludovic Courtès)
- SciCloj Web Meeting, Jan. 2020 (Ludovic Courtès)
- FOSDEM, Feb. 2020 (Ludovic Courtès, Efraim Flashner, Pjotr Prins)
We also organised the GNU Guix Days, which attracted 35 Guix contributors and ran for two days before FOSDEM 2020.
Training Sessions
The PRACE/Inria High-Performance Numerical Simulation School that took place in November 2019 contained an introduction to Guix and used it throughout its hands-on sessions. A Guix training session also took place at Inria (Bordeaux) in October 2019.
Personnel
GNU Guix is a collaborative effort, receiving contributions from more than 60 people every month—a 50% increase compared to last year. As part of Guix-HPC, participating institutions have dedicated work hours to the project, which we summarize here.
- CNRS: 0.25 person-year (Konrad Hinsen)
- Inria: 2 person-years (Ludovic Courtès, Maurice Brémond, and the contributors to the Guix-HPC channel: Florent Pruvost, Gilles Marait, Marek Felsoci, Emmanuel Agullo, Adrien Guilbaud)
- Max Delbrück Center for Molecular Medicine (MDC): 2 person-years (Ricardo Wurmus and Mădălin Ionel Patrașcu)
- Tourbillion Technology: 0.7 person-year (Paul Garlick)
- Université de Paris: 0.25 person-year (Simon Tournier)
- University of Tennessee Health Science Center (UTHSC): 0.8 person-year (Efraim Flashner and Pjotr Prins)
- Utrecht Bioinformatics Center (UBC): 1 person-year (Roel Janssen)
Perspectives
Making Guix more broadly usable on HPC clusters remains one of our top priorities. Features added this year to guix pack are one way to approach it, and we will keep looking for ways to improve it. In addition to this technical approach, we will keep working with cluster administrators to allow them to deploy Guix directly on their clusters. We have seen more cluster administrators deploy Guix this year and we are confident that this trend will continue.
Last year, we advocated for tight integration of reproducible deployment capabilities through Guix in scientific applications. The GNU Guix Workflow Language and Guix-Jupyter have since matured, giving us more insight into the benefits of the approach and opening new perspectives that we will explore. We would additionally like to investigate a complementary approach: adding Guix support to existing tools, such as jupyter-repo2docker.
For the Guix Workflow Language we will continue to explore its suitability in scheduler-less compute environments, such as ad-hoc clusters of short-lived virtual servers, that are becoming increasingly popular. We think that the properties of bit-reproducible builds and package-level granularity unlock hitherto unavailable sharing among independent parts of workflow environments to an extent that is impossible when using monolithic container images. This increase in storage and deployment efficiency is expected to result in significant cost savings when computations are offloaded to externally hosted and metered resources.
We have witnessed increasing awareness in the scientific community of the limitations of container-based tooling when it comes to building transparent and reproducible workflows. We are happy to be associated with the “Ten Years Reproducibility Challenge” where we plan to demonstrate how Guix can help reproduce computational experiments. In the same vein, we are also interested in adapting Mohammad Akhlaghi’s reproducible paper template to take advantage of Guix.
There’s a lot we can do and we’d love to hear your ideas!