HIP and ROCm come to Guix

Ludovic Courtès, Thomas Gibson, Kjetil Haugen, Florent Pruvost — January 30, 2024

We have some exciting news to share: AMD has just contributed 100+ Guix packages adding several versions of the whole HIP and ROCm stack! ROCm is AMD’s Radeon Open Compute Platform, a set of low-level support tools for general-purpose computing on graphics processing units (GPGPUs), and HIP is the Heterogeneous-computing Interface for Portability, a language one can use to write code (computational kernels) targeting GPUs or CPUs. The whole stack is free and “open source” software—a breath of fresh air!—and is seeing increasing adoption in HPC. And, it can now be deployed with Guix!

In this post, written by AMD engineers and Inria research software engineers, we look at the packages AMD contributed and how you can use them, and we discuss use cases at AMD and their relation to the French and European supercomputing environments.

AMD ROCm logo.

More than 100 packages

The 100+ packages Kjetil Haugen and Thomas Gibson of AMD contributed to the Guix-HPC channel include 5 versions of the entire HIP/ROCm toolchain, all the way down to LLVM and including support in the communication libraries ucx and Open MPI. Anyone who’s tried to package or build this stack will understand that this is a major contribution: the software stack is complex, requiring careful assembly of the right versions or variants of each component.

As always with Guix, a key element here is that the package set is self-contained: these packages, as well as those that depend on them, do not and in fact cannot rely on an external ROCm installation, contrary to what is customary in HPC environments. This is what has allowed us to run the exact same software stack both at AMD and on the French HPC clusters, as we will see below.

The initial packaging effort focuses on creating a solid interface between Guix and ROCm and on providing the components needed to start leveraging Guix for developing and deploying ROCm applications. To that end, we provide two primary packages as the foundation of the AMD ROCm stack:

  1. The ROCm toolchain
  2. The HIP runtime for the AMD platform: hipamd

Note that all ROCm packages in Guix are considered experimental as the modest patching required to adapt to the Guix ecosystem implies that they deviate from the officially released ROCm binaries. Also note that we may modify the design as we gain experience with using Guix in our daily work.

The ROCm toolchain is analogous to clang-toolchain and provides the ROCm variants of core LLVM components, such as clang, the clang runtime, lld, libomp, and their associated headers and binaries. In addition, the ROCm toolchain provides the ROCr/HSA runtimes and device libraries required for GPU offloading support. The list of supported GPU architectures can be found in AMD’s official ROCm documentation.

The implementation of the HIP runtime for AMD GPUs, hipamd, is an extension of the ROCm toolchain that provides the necessary headers and the compiler wrapper hipcc. This is the primary user-facing package for developing or deploying applications using HIP; it provides a basic toolchain for most GPU kernel development, but does not include math libraries such as rocBLAS or rocFFT. Math libraries will be provided at a later date.

Because both hardware and software advance quite rapidly, we make generous use of generator functions that enable the installation of multiple versions of ROCm/HIP, ensuring that both existing stable versions and the latest releases are easily available. Having older versions available ensures that projects relying on a particular release of ROCm/HIP are not disrupted. This also enables developers to examine performance impacts between versions to help guide their optimization efforts and track regressions/improvements.
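Since several versions coexist in the channel, you can check which hipamd versions are currently available with guix package’s --list-available option (its output depends on the channel revision you have pulled):

guix package --list-available=hipamd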

As an application developer using Guix, you can use the guix shell command to create environments (on top of your system environment or completely isolated) with a fully functional HIP toolchain for any version you specify. For example:

guix shell hipamd@5.7.1

This shell will contain not only the standard ROCm-based Clang toolchain and its associated compilers/linkers, but will also provide hipcc and its associated utilities such as hipconfig (for HIP and Clang versions, include paths, and built-in flags) and rocminfo (for querying device information).

[env]$ ls -l `which hipcc`
lrwxrwxrwx 1 root root 66 Dec 31  1969 /gnu/store/2j5hqm1rk7q8h3ivwklpwmiv8nzkq15v-profile/bin/hipcc -> /gnu/store/kcfisihalab9fh75dd15rzwj30mv34bk-hipamd-5.7.1/bin/hipcc
[env]$ hipcc --version
HIP version: 5.7.1
clang version 17.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /gnu/store/r9zz6hjmgs2c79091s0s9zc43d0zq9vc-rocm-toolchain-5.7.1/bin
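For the “completely isolated” variant mentioned above, guix shell can also spawn a container with -C; for ROCm to see the GPU from inside it, the relevant device nodes have to be exposed explicitly. A minimal sketch, assuming the usual /dev/kfd and /dev/dri device nodes (the exact set may vary with your system):

guix shell -C hipamd@5.7.1 --expose=/dev/kfd --expose=/dev/dri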

As an illustrative example, we can clone the open-source STREAM project for GPUs, BabelStream, and directly compile and run the HIP implementation of the benchmark:

[env]$ git clone git@github.com:UoB-HPC/BabelStream.git

Once the repository is cloned, we can build the project using CMake as shown below:

[env]$ cd BabelStream/
[env]$ cmake -Bbuild -H. -DMODEL=hip -DCMAKE_CXX_COMPILER=hipcc
[env]$ cmake --build build

If neither Git nor CMake is available on your system, you can simply add git and cmake to your guix shell command to make them available in your environment!
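For example, a single command that provides the HIP toolchain together with Git and CMake could look like this:

guix shell hipamd@5.7.1 git cmake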

And finally, you can run the executable and immediately observe the measured streaming performance:

[env]$ ./build/hip-stream 
BabelStream
Version: 5.0
Implementation: HIP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using HIP device AMD Radeon RX 6800 XT
Driver: 50731921
Memory: DEFAULT
Init: 0.150206 s (=5361.344563 MBytes/sec)
Read: 0.212430 s (=3790.920912 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        520715.707  0.00103     0.00104     0.00103     
Mul         450652.522  0.00119     0.00120     0.00119     
Add         438387.222  0.00184     0.00186     0.00184     
Triad       448402.828  0.00180     0.00180     0.00180     
Dot         438838.728  0.00122     0.00123     0.00123   

This example shows how to obtain an interactive development environment with guix shell, but if all you want is BabelStream, there’s a ready-to-use package.
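If the pre-built benchmark is all you need, something along these lines should be enough (assuming the package is indeed named babelstream in the Guix-HPC channel):

guix shell babelstream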

Benchmarks

Banner showing one of the Adastra racks.

Adastra, one of the French national supercomputers, builds upon AMD GPUs. It’s a 78 PFlop machine that was ranked #3 in the November 2023 edition of the Green500. ROCm and HIP come pre-installed on Adastra, but naturally, we at Inria wanted to ensure that the packages that had been tested at AMD would also deliver the expected performance on this machine. Guix is currently unavailable on Adastra, so we created a bundle of hpcg, a synthetic benchmark that exercises HIP, to ship it over to Adastra:

guix pack -RR hpcg bash-minimal -S /bin=bin
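guix pack prints the /gnu/store file name of the resulting tarball; copying it to the cluster and unpacking it could look like this (the host name and the elided store file name are placeholders):

scp /gnu/store/…-pack.tar.gz adastra:
ssh adastra 'mkdir -p ~/guix/hpcg && tar xf ~/…-pack.tar.gz -C ~/guix/hpcg'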

After unpacking, the resulting bundle lets us run hpcg on a single node of Adastra—each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. We’d first allocate a node with SLURM, requesting eight tasks (one GCD and eight CPU cores per task):

salloc --time=01:00:00 --nodes=1 --ntasks-per-node=8 --cpus-per-task=8 \
  --gpus-per-task=1 --threads-per-core=1 --exclusive --account=ces1926 \
  --constraint=MI250 --mem=256000
ssh $SLURM_NODELIST

… and then run our Guix-built hpcg on the compute node, with 8 MPI processes:

module purge
GUIX_ROOT=$HOME/guix/hpcg
${GUIX_ROOT}/bin/mpirun -n 8 --map-by L3CACHE \
  --launch-agent ${GUIX_ROOT}/bin/orted       \
  -x GUIX_EXECUTION_ENGINE=performance        \
  ${GUIX_ROOT}/bin/rochpcg 280 280 280 180

Notice that we’re using the Guix-provided mpirun. We run module purge to avoid interference from environment modules available on the system. By setting GUIX_EXECUTION_ENGINE to performance, we instruct the Guix-provided wrapper of hpcg to select a relocation mechanism with no overhead.

The benchmark prints the kind of output we expected:

Total Time: 181.62 sec
Setup Time: 0.06 sec
Optimization Time: 0.12 sec

DDOT   =  1809.6 GFlop/s (14476.5 GB/s)     226.2 GFlop/s per process ( 1809.6 GB/s per process)
WAXPBY =   804.0 GFlop/s ( 9648.2 GB/s)     100.5 GFlop/s per process ( 1206.0 GB/s per process)
SpMV   =  1465.6 GFlop/s ( 9229.1 GB/s)     183.2 GFlop/s per process ( 1153.6 GB/s per process)
MG     =  1935.1 GFlop/s (14934.8 GB/s)     241.9 GFlop/s per process ( 1866.9 GB/s per process)
Total  =  1795.6 GFlop/s (13616.4 GB/s)     224.4 GFlop/s per process ( 1702.1 GB/s per process)
Final  =  1647.8 GFlop/s (12495.8 GB/s)     206.0 GFlop/s per process ( 1562.0 GB/s per process)

The software stack was packaged once and can now be used on a variety of machines without spending hours or days in deployment and testing. That alone is no small feat in a world where ad hoc HPC cluster deployments remain the norm.

Guix at AMD

Logo of “AMD lab notes”.

Currently, the use of Guix within AMD is a grassroots effort among members of the Data Center GPU Software Solutions Group. The team engages in porting and optimization of HPC applications across a variety of engineering disciplines, organizes ROCm training and hackathons, provides feedback to ROCm development teams, and participates in the bring-up process preceding the release of new hardware. More details about our activities can be found at AMD lab notes.

Compared to most engineers, we touch a larger number of applications, across a larger number of HPC systems, and with a greater variety of software dependencies and GPU architectures. An immediate consequence is that the overhead of dependency management can become quite significant. Moreover, the effort is often duplicated between engineers working on applications with similar dependencies, system administrators providing environment modules, and deployment engineers preparing container images and recipes.

As a functional package manager, Guix promises deduplication of effort and reproducibility. In other words, if a package description is created by someone somewhere, it can be used by anyone anywhere! Guix is already providing a lot of value for individual engineers. The primary use case is to allow the use of less contested resources for development (workstations with gaming cards) and to reserve more contested resources for performance testing (nodes with emerging GPU architectures). We are currently considering using Guix to create environment modules and are working on integrating Cuirass into engineering workflows.

After using Guix extensively to package ROCm, we find that two things are missing to better support GPU-based development. First, a mechanism for running unit tests on the GPU: this is currently impossible because the isolated environments in which Guix builds packages do not expose the GPU. Second, a mechanism to specify the target GPU architecture on the fly—e.g., through package transformations. The size of many GPU libraries is proportional to the number of GPU architectures supported, and limiting support to the GPUs available on the system of interest is good software hygiene and may significantly reduce compilation time.

Beyond that, we are mostly happy with the range of functionality Guix offers. However, we would like a more interactive debugging environment. Keeping the build directory (guix build -K) and subsequently running guix shell --container on that directory, as described in the Guix manual, gets us close, but a gdb-like user experience where we can set breakpoints and list, inspect, step through, modify, and rerun build phases would be helpful.
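For reference, the workflow referred to above looks roughly like this; the build-directory and source-tree names depend on the package, and hipamd@5.7.1 is used here purely as an example (the last command enters the unpacked source tree, whose name may differ):

guix build hipamd@5.7.1 -K                 # keep the failed build tree
cd /tmp/guix-build-hipamd-5.7.1.drv-0      # the kept build directory
guix shell -C -D hipamd@5.7.1              # container with hipamd’s build inputs
[env]# source ./environment-variables
[env]# cd hipamd-5.7.1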

HIP, Guix, and HPC in Europe

HPC research teams at Inria develop software ranging from run-time support libraries such as StarPU and hwloc, to linear algebra solvers such as Chameleon, to numerical simulation libraries. Having the HIP/ROCm stack packaged in Guix allows us to deploy and run those even more complex stacks on supercomputers and readily take advantage of their processing power without going through a tedious installation and testing process.

This makes even more of a difference considering the breadth and depth of HPC software developed in NumPEx. NumPEx is the French national program for exascale HPC, launched in mid-2023 with a 41 M€ budget for 6 years. Its Development and Integration project aims to ensure the dozens of HPC libraries and applications developed by French researchers can easily be deployed on national and European clusters, with high quality assurance levels. Guix is one of the deployment tools used to achieve those goals and well poised to do so; having a well-tested GPGPU package set makes it an even better fit.

It remains to be seen whether Jules-Verne, the EuroHPC exascale supercomputer to be hosted in France in 2025, will provide AMD GPUs. Given that the software stack for these GPUs is free software, this would send a strong signal in favor of Open Science, in line with the recommendations of UNESCO and those of the French Plan for Open Science.

This is just the beginning

All these packages are available from the Guix-HPC channel; they are continuously built on the build farm at Inria, providing users with readily usable binaries.
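To use them, add the Guix-HPC channel to your channel list and run guix pull; a channels.scm along these lines should do (see the channel’s own documentation for the authoritative snippet):

;; ~/.config/guix/channels.scm
(cons (channel
        (name 'guix-hpc)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc.git"))
      %default-channels)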

With the HIP and ROCm foundations in place, there’s a lot on our agenda: providing rocBLAS, rocFFT, and related math libraries, taking advantage of these in the linear algebra and numerical simulation packages developed at Inria and in NumPEx, and working with the broader Guix community to provide ROCm-enabled variants of major packages like PyTorch. We plan to make the ROCm/HIP packages part of the main Guix channel once we have gained enough experience. The other important benefit we expect from this collaboration is to better cater to the needs of engineers at AMD.

Working together in the open has been a fruitful and pleasant experience and we can already foresee lots of opportunities to keep this going!

Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).
