High-performance networks have been evolving constantly, sometimes in hard-to-decipher ways. Once upon a time, hardware vendors would pre-install an MPI implementation (often an in-house fork of one of the free MPI implementations) specially tailored for their hardware. Fortunately, that era appears to be over. Even so, there is still a widespread belief that MPI cannot be packaged in a way that achieves the best performance on a variety of contemporary high-speed networking hardware.

This post is about our journey towards portable performance of the Open MPI package in GNU Guix. Spoiler: we reached that goal, but the road was bumpy.

Portable performance in theory

Blissfully ignorant of the details and complexity of real life, the author of this post initially thought that portable high-performance networking with Open MPI was a solved problem. Open MPI comes with “drivers” (shared libraries) for all the major kinds of high-speed networking hardware, and in particular various flavors of OpenFabrics devices (including InfiniBand or “IB”) and Omni-Path (by Intel; abbreviated as “OPA”). At run-time, Open MPI looks at the available networking hardware, dynamically loads drivers that “match”, and picks up “the best” match—or at least, that’s the goal.

The actual implementation of the networking primitives is left to lower-level libraries. rdma-core provides the venerable Verbs library (libibverbs), the historical driver for InfiniBand hardware, typically hardware from around 2015, referred to as mlx4. PSM supports the InfiniPath/TrueScale hardware sold by QLogic until some years ago. PSM2 supports Intel Omni-Path hardware, the successor of TrueScale. Omni-Path is to be discontinued, according to a July 2019 Intel announcement, but it is still widely used, for example on the brand new Jean Zay supercomputer.

Open MPI has an interface to each of these. Given the proliferation of high-speed networks (many of which are variants of InfiniBand), engineers came up with unified programming interfaces, the idea being that MPI implementations would use those interfaces instead of talking directly to the lower-level drivers that we’ve seen above. libfabric (a.k.a. OpenFabrics Interfaces, or OFI) is one of these “unified” interfaces (I guess you can see the oxymoron, can’t you?). Libfabric actually bundles Verbs, PSM, PSM2, and more, and provides a single interface over them. UCX has a similar “unification” goal, but with a different interface. In addition to the lower-level PSM, PSM2, and Verbs, Open MPI can use libfabric and UCX directly, which in turn may drive a variety of networking interfaces.

The abstraction level of all these interfaces is not quite the same, because the level of hardware support differs among the different types of network. Thus, adding to this alphabet soup, Open MPI defines these categories:

the point-to-point management layer (PML) for high-level interfaces like UCX;
the matching transport layer (MTL) for PSM, PSM2, and OFI;
the byte transfer layer (BTL) for TCP, OpenIB, etc.

The general idea is that higher layers provide better performance on supported hardware.

Still here?

So what does Open MPI do with all these drivers and meta-drivers? Well, if you build Open MPI with all these dependencies, Open MPI picks the right driver at run time for your high-speed network. Thus, Open MPI is designed to support performance portability: you can have a single Open MPI build (a single package) that will do the right thing whether it runs on machines with Omni-Path hardware or on machines with InfiniBand networking. At least, that’s the theory…
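
One way to see which of these components a given build actually carries is to filter the output of ompi_info, the introspection tool that ships with Open MPI. The sketch below assumes that ompi_info prints one "MCA <framework>: <component> ..." line per component; the helper name list_transports is ours, purely for illustration. Note that this only tells you what was built, not what gets selected at run time.

```shell
# Keep only the point-to-point transport frameworks (pml, mtl, btl)
# from `ompi_info` output read on stdin, printing "<framework> <component>".
# Assumption: each component appears on a line of the form
#   MCA <framework>: <component> (MCA v..., API v..., Component v...)
list_transports() {
  awk '$1 == "MCA" && ($2 == "pml:" || $2 == "mtl:" || $2 == "btl:") {
         sub(":", "", $2)
         print $2, $3
       }'
}

# Usage, in an environment where Open MPI is available:
#   ompi_info | list_transports
```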

When reality gets in the way

How can one check whether practice matches theory? It turns out to be tricky because Open MPI, as of version 4.0.2, does not display the driver and networking hardware that it chose. Looking at strace or ltrace logs for your Open MPI program won’t necessarily help either, because Open MPI may dlopen most or all of the drivers even if it picks just one of them in the end. Setting OMPI_MCA_mca_verbose=stderr,level:50 as an environment variable, or something like OMPI_MCA_pml_base_verbose=100, doesn’t quite help; surely there must be some setting that produces useful debugging logs, but the author was unable to find it.

One way to make sure you get the right performance for a given type of network is to run, for example, the ping-pong benchmark of the Intel MPI benchmarks. We’re lucky that our local cluster, PlaFRIM, contains a heterogeneous set of machines with different networking technologies: Omni-Path, TrueScale, InfiniBand (mlx4), with some machines having both Omni-Path and InfiniPath/TrueScale. A perfect playground. So we set out to test the openmpi package of Guix on all these networks to confirm (so we thought!) that we get the peak bandwidth and optimal latency for each of these:

# Here we ask SLURM to give us two Omni-Path nodes.
guix environment --pure --ad-hoc \
  openmpi openssh intel-mpi-benchmarks slurm -- \
  salloc -N 2 -C omnipath \
  mpirun -np 2 --map-by node IMB-MPI1 PingPong

And guess what: we’d get a much lower bandwidth than the expected 10 GiB/s (the theoretical peak bandwidth is 100 Gib/s, that is, 12.5 GiB/s). You’d think you can force the use of PSM2 by passing --mca mtl psm2 to mpirun (this is the “MTL” we’ve seen above), but that is still not enough to get the right performance. Why is that? Is PSM2 used at all? Hard to tell. A bit of trial and error shows that explicitly disabling UCX with --mca pml ^ucx solves the problem and gives us the expected 10 GiB/s peak bandwidth and a latency of around 2 μs for small messages. We’re on the right track!
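
To make this kind of trial and error less tedious, the comparison can be scripted. The sketch below wraps the ping-pong invocation shown earlier so that extra MCA options can be passed through, and pulls the peak value out of the result table. The helper names run_pingpong and peak_bandwidth are ours, and the parser assumes that each result row of IMB-MPI1 PingPong consists of four numeric columns with the bandwidth last; adjust it if your version of the benchmark prints something different.

```shell
# Run the IMB ping-pong between two nodes, passing any extra
# arguments (e.g. MCA options) straight through to mpirun.
run_pingpong() {
  mpirun -np 2 --map-by node "$@" IMB-MPI1 PingPong
}

# Print the highest value found in the last column of the result table,
# exactly as the benchmark printed it (no unit conversion).
# Assumption: result rows consist of exactly four numeric fields
# (#bytes, #repetitions, t[usec], Mbytes/sec).
peak_bandwidth() {
  awk 'NF == 4 && $1 ~ /^[0-9]+$/ && ($4 + 0) > max { max = $4 + 0; best = $4 }
       END { print best }'
}

# Usage, inside a SLURM allocation as in the command above:
#   run_pingpong | peak_bandwidth                  # default selection
#   run_pingpong --mca mtl psm2 | peak_bandwidth   # request the PSM2 MTL
#   run_pingpong --mca pml ^ucx | peak_bandwidth   # exclude UCX
```

Comparing the three numbers side by side is what reveals that only the last variant reaches the expected bandwidth on Omni-Path.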

This is when we wonder:

Why isn’t UCX giving the peak performance, even though it claims to support Omni-Path?
Why is Open MPI selecting UCX if PSM2 does a better job on Omni-Path?

The answer to question #1 is that UCX implements InfiniBand support, which also happens to work on Omni-Path, only with sub-optimal performance: PSM2 is the official high-performance driver for Omni-Path, whereas InfiniBand is merely a standards-compliant compatibility mode.

To answer question #2, we need to take a closer look at Open MPI’s driver selection method. At run time, Open MPI dlopens all its transport plugins. It then asks each plugin (via its init function) whether it supports the available networking interfaces and filters out those that don’t. If there’s more than one transport plugin left, it picks the one with the highest priority. Priorities can be changed on the command line; for instance, passing --mca pml_ucx_priority 20 sets the priority of UCX to 20. Default priorities are hard-coded. As it turns out, the UCX component has a higher default priority than PSM2; since it also claims to support Omni-Path, it takes precedence. A similar issue comes up with PSM.
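
The selection logic described above boils down to "filter, then take the maximum". Here is a toy model of it, deliberately simplified: the function name select_component and the priority values are hypothetical and chosen only for illustration, they are not Open MPI’s actual defaults.

```shell
# Toy model of the selection step: each input line describes one candidate
# component as "<name> <priority> <usable>", where <usable> is "yes" when
# the component's init function reports that it can drive the hardware.
select_component() {
  awk '$3 == "yes" { print $2, $1 }' |  # drop unusable candidates
    sort -k1,1nr |                      # highest priority first
    head -n 1 |
    awk '{ print $2 }'                  # keep only the component name
}

# Hypothetical priorities, for illustration only (not Open MPI's real defaults).
printf '%s\n' 'ucx 50 yes' 'psm2 40 yes' 'psm 30 no' | select_component
# prints "ucx": it claims support for the hardware and outranks psm2
```

In this model, the only ways to get psm2 selected are to exclude ucx altogether or to change the relative priorities, which mirrors what we observed on the Omni-Path nodes.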

Getting the best performance

To achieve optimal performance by default on Omni-Path, TrueScale, and InfiniBand networks, we thus had to raise the default priority of the PSM2 component and that of the PSM component relative to that of the UCX component.
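
For readers stuck with an Open MPI build whose defaults have not been adjusted, a similar effect can presumably be obtained at run time by changing priorities through the environment. The sketch below relies on the OMPI_MCA_ prefix and on the pml_ucx_priority parameter mentioned earlier; the helper name with_ucx_priority and the value 5 are ours and purely illustrative. Whether a given value is low enough depends on the default priorities of the other components in your build, so treat it as a starting point rather than as what the Guix package does.

```shell
# Run a command with the UCX PML priority overridden through the environment.
# OMPI_MCA_<name>=<value> is the environment-variable counterpart of
# passing "--mca <name> <value>" to mpirun.
with_ucx_priority() {
  value=$1
  shift
  env OMPI_MCA_pml_ucx_priority="$value" "$@"
}

# Usage (5 is an arbitrary low value chosen for illustration):
#   with_ucx_priority 5 mpirun -np 2 --map-by node IMB-MPI1 PingPong
```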

This wasn’t quite the end of the road, though. PSM, which is apparently unmaintained, would segfault at initialization time; turning off its malloc statistics feature works around the problem. TrueScale is old and the PSM component will be gone in future Open MPI versions anyway so, assuming UCX works correctly on this hardware, this will not be a problem anymore.

Finally, with these changes in place, we are able to get the optimal performance on mlx4, InfiniPath, and Omni-Path networks on our cluster. We also checked on the GriCAD and MCIA clusters and confirmed that we achieve peak performance there too. The latter does not provide Guix (yet!), so we built a Singularity image with guix pack:

guix pack -S /bin=bin -f squashfs bash \
  openmpi intel-mpi-benchmarks

… that we sent over to the cluster and ran there, using the system’s salloc and mpirun commands:

salloc -N2 mpirun -np 2 --map-by node -- \
  singularity exec intel-mpi-benchmarks.sqsh IMB-MPI1 PingPong

Conclusions

We are glad that we were able to show that, with quite an effort, practice matches theory: we can get an Open MPI package that achieves optimal performance for the available high-speed network interconnect. As in the case of pre-built binaries, and contrary to a common belief, the software is designed to support portable performance without requiring a custom build for the target machine. We were only able to test the three most common interconnects of recent years though, so we’d be happy to get your feedback if you’re using a different kind of hardware!

There are other conclusions to be drawn. First, we found it surprisingly difficult to get feedback from Open MPI. It would be tremendously useful to have an easy way to display the transport components that were selected and used when running an application. As far as default priorities go, it is hard to have a global picture and ensure the various relative priorities all make sense.

The interconnect driver situation is a bit dire. The coexistence of vendor-provided drivers and “unified” interfaces adds to the confusion. Efforts like UCX are a step in the right direction, but only insofar as they manage to actually supersede the more specific implementations, which is not yet the case, as we have seen with Omni-Path.

The last conclusion is on the importance of joining forces on packaging work. Getting to an Open MPI package in Guix that performs well and in a portable way has been quite a journey. The result is now under version control, available for all to use on their cluster, and regressions can be tracked. It is unreasonable to expect cluster admin teams to perform the same work for their own cluster, in an ad-hoc fashion, with a home-grown collection of modules.

Acknowledgments

I would like to thank Emanuel Agullo who was instrumental in starting this work as part of the 2019 Inria HPC School, Brice Goglin for tirelessly explaining Open MPI internals, and François Rue and Julien Lelaurain of the PlaFRIM team for their support.
