Faster relocatable packs with Fakechroot

Ludovic Courtès — May 18, 2020

The guix pack command creates “application bundles” that can be used to deploy software on machines that do not run Guix (yet!), such as HPC clusters. Since its inception in 2017, it has seen a number of improvements, such as the ability to create Docker and Singularity container images. Some clusters lack these tools, though, and the addition of relocatable packs was a way to address that. This post looks at a new execution engine for relocatable packs that has just landed with the goal of improving performance.

Before we get into that, let’s recap how relocatable packs work.

Relocatable packs

Essentially, a relocatable pack is a plain old tarball that contains the applications of your choosing along with all their dependencies, such that you can run them on any GNU/Linux machine. To create a pack containing Python and NumPy, run:

guix pack -RR python python-numpy -S /bin=bin

The -RR flag asks for the creation of what we jokingly refer to as a reliably relocatable pack (more on that below), while the -S flag asks for the creation of a /bin symbolic link in the tarball.

The result of that command is a tarball that you can send to another machine and unpack; you can then run Python directly from there without any special privileges:

tar xf pack.tar.gz
./bin/python

That’s it! All you need on the target machine is tar, and the rest just works.

Relocation with PRoot

guix pack -R (with a single -R) creates relocatable packs that require kernel support for unprivileged user namespaces. However, some systems have them disabled, and older systems do not support them at all—the ./bin/python command above wouldn’t work on them.

The -RR option we saw above adds a universal fallback option: on a system where unprivileged user namespaces are not available, the ./bin/python command above automatically falls back to using PRoot. PRoot achieves file system virtualization by intercepting the process’ system calls with ptrace.

The advantage is that it always works: PRoot doesn’t rely on any special kernel feature, since ptrace has “always been there”, so to speak. The drawback is that it incurs significant overhead at every system call. This is acceptable for an interactive program or, say, for a single-threaded number-crunching application. But the performance hit is prohibitive for an MPI or multi-threaded application, for example, because input/output and synchronization happen via system calls.
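To make the fallback concrete, here is a minimal sketch in C of how a program might test whether unprivileged user namespaces are usable. It is an illustration only, not the code shipped in the pack wrappers, and the function name user_namespaces_available is made up for this example. The probe runs in a forked child so that the calling process keeps its own namespaces.

```c
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWUSER */
#include <sys/types.h>  /* pid_t */
#include <sys/wait.h>   /* waitpid, WIFEXITED, WEXITSTATUS */
#include <unistd.h>     /* fork, _exit */

/* Return 1 if this process may create an unprivileged user namespace,
   0 otherwise.  The name of this function is hypothetical.  */
int
user_namespaces_available (void)
{
  pid_t child = fork ();
  if (child < 0)
    return 0;                   /* cannot probe: assume unavailable */

  if (child == 0)
    /* Child: try to enter a fresh user namespace and report the outcome
       through the exit status.  */
    _exit (unshare (CLONE_NEWUSER) == 0 ? 0 : 1);

  int status;
  if (waitpid (child, &status, 0) < 0)
    return 0;
  return WIFEXITED (status) && WEXITSTATUS (status) == 0;
}
```

On a machine where the feature is disabled, the unshare call in the child fails and the function returns 0, which is the situation in which a fallback engine is needed.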

Enter Fakechroot

To address this performance issue, we have just added a third execution engine to relocatable packs, one that relies on ELF trickery. Users of relocatable packs can now choose an execution engine at run time by setting the GUIX_EXECUTION_ENGINE environment variable. If you choose the performance engine, the application will use user namespaces or, if they are not supported, fall back to the new fakechroot engine:

export GUIX_EXECUTION_ENGINE=performance
./bin/python
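The selection rule can be written down in a few lines of C. The sketch below models only the two settings described in this post (performance, and the default used for any other value); the enum and the choose_engine function are hypothetical names, not taken from the actual wrapper source.

```c
#include <string.h>   /* strcmp, NULL */

/* Hypothetical names for the three engines discussed in this post.  */
enum engine { ENGINE_USERNS, ENGINE_PROOT, ENGINE_FAKECHROOT };

/* Choose an engine from SETTING, the value of GUIX_EXECUTION_ENGINE (NULL
   when the variable is unset), and from USERNS_AVAILABLE, which is nonzero
   when unprivileged user namespaces can be used on this machine.  */
enum engine
choose_engine (const char *setting, int userns_available)
{
  /* Both the default and the "performance" settings prefer user
     namespaces when the kernel allows them.  */
  if (userns_available)
    return ENGINE_USERNS;

  /* Otherwise "performance" falls back to Fakechroot...  */
  if (setting != NULL && strcmp (setting, "performance") == 0)
    return ENGINE_FAKECHROOT;

  /* ... and the default falls back to PRoot.  */
  return ENGINE_PROOT;
}
```

A caller would typically pass getenv ("GUIX_EXECUTION_ENGINE") as the first argument, so that an unset variable selects the default behavior.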

guix pack -RR wraps the application executables, in this case python. Those wrappers are small statically-linked programs that implement the execution engines. The new fakechroot engine works like this:

  1. The PT_INTERP segment of the wrapped executable contains the file name of the dynamic linker, ld.so, under /gnu/store. Since /gnu/store doesn’t exist on the host machine, the dynamic linker is invoked directly, with its file name computed relative to the wrapper’s file name.

  2. The loader is told to preload the Fakechroot shared library, which interposes on the file system functions of the C library (open, stat, etc.) and “translates” /gnu/store absolute file names to their actual location.

  3. The RUNPATH of Guix executables and shared libraries lists the /gnu/store directories that contain the libraries they depend on. The open calls that ld.so itself makes are not interposable, so Fakechroot doesn’t help here. Fortunately, the little-known audit interface of the GNU dynamic linker comes in handy: its la_objsearch hook allows you to alter the way ld.so looks for shared libraries. Thus, a few lines of C are all it takes to get ld.so to translate /gnu/store file names. Neat!
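To give an idea of what such an audit module looks like, here is a minimal sketch, which is not the module shipped with Guix. It relies on the la_version and la_objsearch hooks declared in the GNU C Library's link.h header. The PACK_ROOT environment variable and the translate_store_name helper are assumptions made for this example: PACK_ROOT is taken to name the directory where the tarball was unpacked.

```c
#define _GNU_SOURCE
#include <link.h>     /* LAV_CURRENT, la_version, la_objsearch */
#include <stdint.h>   /* uintptr_t */
#include <stdio.h>    /* snprintf */
#include <stdlib.h>   /* getenv */
#include <string.h>   /* strlen, strncmp */

#define STORE "/gnu/store"

/* Hypothetical helper.  Write into BUF the location of NAME once the pack
   is unpacked under ROOT: "/gnu/store/x/lib/libm.so" becomes
   "ROOT/gnu/store/x/lib/libm.so".  Return BUF, or NAME itself when NAME is
   not a store file name or when the result does not fit in SIZE bytes.  */
const char *
translate_store_name (const char *name, const char *root,
                      char *buf, size_t size)
{
  size_t len = strlen (STORE);

  if (strncmp (name, STORE, len) != 0
      || (name[len] != '/' && name[len] != '\0'))
    return name;

  int n = snprintf (buf, size, "%s%s", root, name);
  if (n < 0 || (size_t) n >= size)
    return name;
  return buf;
}

/* Called once by ld.so when it loads the audit module.  */
unsigned int
la_version (unsigned int version)
{
  (void) version;
  return LAV_CURRENT;
}

/* Called by ld.so for each file name it is about to try while searching
   for a shared library; the return value is the name it actually uses.  */
char *
la_objsearch (const char *name, uintptr_t *cookie, unsigned int flag)
{
  static char translated[4096];
  const char *root = getenv ("PACK_ROOT");   /* hypothetical variable */

  (void) cookie;
  (void) flag;
  if (root == NULL)
    return (char *) name;
  return (char *) translate_store_name (name, root,
                                        translated, sizeof translated);
}
```

Built as a shared library, a module like this is activated by listing it in the LD_AUDIT environment variable before the dynamic linker starts. The static buffer keeps the sketch short; a real module would have to think harder about buffer sizes and concurrent lookups.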

The fakechroot engine incurs very little overhead, and only on file system function calls, making it a great option for HPC workloads. The default engine remains user namespaces with a fallback to PRoot, so be sure to set GUIX_EXECUTION_ENGINE=performance. See the manual for more info.

A call to HPC system administrators

guix pack -RR allows you to deploy software stacks on a Guix-less cluster that lacks both support for unprivileged user namespaces and a container facility such as Singularity, without loss of performance. A similar combination of execution engines for unprivileged users can be found in udocker, though the tool has different goals. Having discussed these techniques, it’s good to take a step back and look at the bigger picture.

All these shenanigans would be unnecessary if unprivileged user namespaces were universally available. In fact, when we released guix pack -R two years ago, we thought (hoped?) that widespread availability of unprivileged user namespaces was imminent. After all, the feature had already been available in the Linux kernel since version 3.8, released in 2013.

Unfortunately, today, major academic HPC clusters still run a derivative of Red Hat Enterprise Linux (RHEL) or CentOS 7, released in 2014 with Linux 3.10, where the decision was made to disable user namespaces. RHEL 8 and derivatives are documented as having an easy way to set up user namespaces.

We encourage HPC system administrators to consider enabling unprivileged user namespaces. They allow unprivileged users to deploy pre-built software, be it through a relocatable Guix pack or via container run-time support tools like runC, with virtually no overhead. More generally, user namespaces enable reproducible software environments, a prerequisite for reproducible scientific experiments!

Acknowledgments

Many thanks to Carlos O’Donell, steward for the GNU C Library, for reviewing initial revisions of the fakechroot execution engine and for suggesting the use of the ld.so audit interface.

Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).
