Pre-built binaries vs. performance
Guix follows a transparent source/binary deployment model: it downloads pre-built binaries when they are available, like apt-get or yum, and otherwise falls back to building from source. Most of the time the project's build farm provides binaries, so users don't have to spend resources building from source. Pre-built binaries may be missing when you're installing a custom package, or when the build farm hasn't caught up yet. However, deployment of binaries is often seen as incompatible with high-performance requirements: binaries are "generic", so how can they take advantage of cutting-edge HPC hardware? In this post, we explore the issue and solutions.
Building portable binaries
CPU architectures are a moving target. The x86_64 instruction set architecture (ISA), for instance, has a whole family of extensions, AVX and AVX2 being the most prominent. These extensions are often critical for the performance of computational programs. For example, fused multiply-add (FMA), which can have a significant impact on some applications, was only introduced in relatively recent AMD and Intel processors, and new versions of these extensions keep being deployed. Each x86_64 machine typically supports only a subset of these extensions.
Package distributions that provide pre-built binaries—Guix, but also Debian, Fedora, CentOS, and so on—have one important constraint: they must provide binaries that work on all the computers of the target architecture. Therefore, those binaries target the lowest common denominator of that architecture. For x86_64, that means not using instructions from AVX & co. Put this way, pre-built binaries look unattractive from an HPC viewpoint.
Run-time selection
In Guix land, this has been the topic of lengthy discussions over the years. Distro developers know that this issue is not new, and that the concern is not specific to HPC. Many pieces of software, from video players to the C library, can, and do, greatly benefit from some of these ISA extensions. How do they address this dilemma, providing portable binaries without compromising on performance?
The solution is to select the most appropriate implementation of "hot" code at run time. Video players like MPlayer and number-crunching software like the GNU multiprecision library have used this "trick" since their inception: using the cpuid instruction, they determine at run time which ISA extensions are available and branch to routines optimized for those extensions. Many other applications include similar ad-hoc mechanisms.
GNU/Linux, which runs on 100% of the Top 500 supercomputers, now provides generic mechanisms for this in its toolchain. First, the GNU C Library (glibc) has long shipped optimized implementations of its string and math routines, selected at run time.
The underlying mechanisms have been generalized in glibc in the form of indirect functions or “IFUNCs”, which work along these lines:
- Application developers provide libc with a resolver: a function that selects the "best" optimized implementation for the CPU at hand and returns it. As an example, glibc's resolver for memcmp looks like this.
- Resolvers are called by the run-time linker, ld.so, once and for all; selection thus happens only once, at load time.
- To simplify the use of IFUNCs, GCC provides an ifunc attribute to decorate functions that have an associated resolver.
IFUNCs are starting to be used outside glibc proper, for instance by the Nettle cryptographic library (code), though there are currently restrictions to be aware of.
Better yet, since version 6, GCC supports automatic function multi-versioning (FMV): the target_clones function attribute allows users to instruct GCC to generate several optimized variants of a function, along with a resolver that selects the right one based on the CPUID.
This LWN article nicely shows how code can benefit from FMV. The article links to a script to automatically annotate FMV candidates with target_clones; there's even a tutorial!
Problem solved?
When upstream software lacks run-time selection
It turns out that not all software packages, especially scientific software, use these techniques. Some do (for instance, OpenBLAS supports run-time selection when compiled with DYNAMIC_ARCH=1), but many don't.
For example, FFTW insists on being compiled with -mtune=native and provides configuration options to statically select CPU optimizations (update: FFTW 3.3.7 and later can select optimized routines at run time); ATLAS optimizes itself for the CPU it is being built on. We can always say that the "right" solution would be to "fix" these packages upstream so that they use run-time selection, but how do we handle them in Guix today?
Depending on the situation, we have so far resorted to different solutions. ATLAS depends so heavily on configure-time tuning that we simply don't distribute pre-built binaries for it. Instead, running guix package -i atlas unconditionally builds it locally, as the upstream authors intended.
For FFTW, BLIS, and other packages where optimizations are selected at configure-time, we simply build the generic version, like Debian and others do. This is the most unsatisfactory situation: we have portable binaries at the cost of degraded performance.
However, we also programmatically provide optimized package variants for these. For BLIS, we have a make-blis function that we use to generate a blis-haswell package optimized for Intel Haswell CPUs, a blis-knl package, and so on. Likewise, for FFTW, we have an fftw-avx package that uses AVX2-specific optimizations. We don't provide binaries for these optimized packages, but users can install the variant that corresponds to their machine.
Dependency graph rewriting
Having optimized package variants is nice, but how can users take advantage of them? For instance, the julia and octave packages depend on the generic (unoptimized) fftw package, which allows us to distribute pre-built binaries. What if you want Octave to use the AVX2-optimized FFTW?
One option is to rewrite the dependency graph of Octave so that occurrences of the generic fftw package are replaced by fftw-avx. This can be done from the command line using the --with-input option:
guix package -i octave --with-input=fftw@3.3.5=fftw-avx
The above command does that graph rewriting. Consequently, it ends up building from source the part of Octave's dependency graph that depends on fftw. This is not ideal, because rebuilding can take a while, but it is readily applicable.
When the library and its replacement (fftw and fftw-avx here) are known to have the same application binary interface (ABI), as is the case here, another option is to simply let the run-time linker pick up the optimized version instead of the unoptimized one. This can be done by setting the LD_LIBRARY_PATH environment variable:
LD_LIBRARY_PATH=`guix build fftw-avx`/lib octave
Here Octave will pick up the optimized libfftw.so. (/etc/ld.so.conf would be another possibility, but the glibc package in Guix currently ignores that file, since it could lead to loading binary-incompatible .so files when using Guix on a distro other than GuixSD.)
Where to go from here?
As we have seen, Guix does not sacrifice performance. In the worst case, it requires users to explicitly install optimized package variants, which get built from source. This is not as simple as we would like though, so people have been looking for ways to improve the situation.
The first option is to work with upstream software developers to introduce run-time selection, an option that benefits everyone. That is something we can always do in the background, but it takes time. It does pay off in the long run, though; for instance, BLIS has recently introduced support for run-time selection. Like Clear Linux, we could also start applying function multi-versioning based on compiler feedback in key packages, and use that as a starting point for discussions with upstream developers.
Some have proposed making CPU features a first-class concept in Guix. That way, one could install with, say, --cpu-features=avx2, and end up downloading or building binaries optimized for AVX2. The downsides are that this would be a big change, and that it's not clear how to tell package build systems to enable a given optimization in a generic way.
Another option on the table, inspired by Fedora and Debian, is to provide a mechanism that makes it easy for users to switch between implementations of an interface without needing recompilation. This could work for BLAS or MPI implementations that are known to share the same ABI. Similarly, support for something akin to ld.so.conf would help, though it would have to be per-user rather than limited to root, to retain the freedom that Guix provides to users. Such dynamic software composition could work against Guix's reproducibility mantra, though, since software behavior would then depend on site-specific configuration outside Guix's control.
With its transparent source/binary deployment model, Guix offers both the advantages of pre-built binaries à la apt-get and those of built-from-source, optimized software à la EasyBuild or Spack when it must. The challenge ahead will be to streamline that experience.
Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).