What’s in a package

Ludovic Courtès — September 20, 2021

There is no shortage of package managers. Each tool makes its own set of tradeoffs regarding speed, ease of use, customizability, and reproducibility. Guix occupies a sweet spot, providing reproducibility by design as pioneered by Nix, package customization à la Spack from the command line, the ability to create container images without hassle, and more.

Beyond the “feature matrix” of the tools themselves, a topic that is often overlooked is packages—or rather, what’s inside of them. Chances are that a given package may be installed using any of the many tools at your disposal. But are you really getting the same thing regardless of the tool you are using? The answer is “no”, contrary to what one might think. The author realized this very acutely while fearlessly attempting to package the PyTorch machine learning framework for Guix.

This post is about the journey of packaging PyTorch the Guix way: the rationale, a glimpse at what other PyTorch packages out there look like, and the conclusions we can draw for high-performance computing and scientific workflows.

Getting PyTorch in Guix

One can install PyTorch in literally seconds with pip:

$ time pip install torch
Collecting torch
  Downloading https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl (831.4MB)
     |████████████████████████████████| 831.4MB 91kB/s 
Collecting typing-extensions (from torch)
  Downloading https://files.pythonhosted.org/packages/74/60/18783336cc7fcdd95dae91d73477830aa53f5d3181ae4fe20491d7fc3199/typing_extensions-3.10.0.2-py3-none-any.whl
Installing collected packages: typing-extensions, torch

real    0m24.502s
user    0m19.711s
sys     0m3.811s

Since it’s on PyPI, the Python Package Index, one might think it’s a simple Python package that can be imported in Guix the easy way. That’s unfortunately not the case:

$ guix import pypi torch
guix import: error: no source release for pypi package torch 1.9.0

The reason guix import bails out is that the only thing PyPI provides for this package is a binary-only “wheel”: the .whl file downloaded above contains pre-built binaries, not source code.
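
One can cross-check this from the pip side: the command below, a sanity check whose output is not reproduced here, asks pip for a source distribution only, refusing binary wheels, and is expected to come up empty for this release.

# Expected to fail: the torch project publishes binary wheels only.
$ pip download --no-deps --no-binary :all: torch==1.9.0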

In Guix we insist on building software from source: it’s a matter of transparency, auditability, and provenance tracking. We want to make sure our users can see the source code that corresponds to the code they run; we want to make sure they can build it locally, should they choose not to trust the project’s pre-built binaries; or, when they do use pre-built binaries, we want to make sure they can verify that those binaries correspond to the source code they claim to match.

Transparency, provenance tracking, verifiability: it’s about extending the scientific method to the whole computational experiment, including software that powers it.

Bundling

The first surprise when starting to package PyTorch is that, despite being on PyPI, PyTorch is first and foremost a large C++ code base. It does have a setup.py as commonly found in pure Python packages, but that file delegates the bulk of the work to CMake.
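
Readers who want to see this for themselves can peek at the source tree; the commands below are merely illustrative, and the grep is a rough indicator of how much setup.py defers to CMake.

$ git clone --depth 1 --branch v1.9.0 https://github.com/pytorch/pytorch
$ grep -c -i cmake pytorch/setup.py   # count the lines mentioning CMake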

The second surprise is that PyTorch bundles (or “vendors”, as some would say) source code for no fewer than 41 dependencies, ranging from small Python and C++ helper libraries to large C++ neural network tools. Like other distributions such as Debian, Guix avoids bundling: we would rather have one Guix package for each of these dependencies. The rationale is manifold, but it boils down to keeping things auditable, reducing resource usage, and making security updates practical.
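
In the checkout above, the vendored copies mostly live as Git submodules under third_party/. A rough way to get a feel for their number is shown below; note that this only counts top-level submodules, and several of them bundle dependencies of their own.

$ ls pytorch/third_party
$ git -C pytorch submodule status | wc -l   # top-level submodules only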

Long story short: “unbundling” is often tedious, all the more so in this case. We ended up packaging about ten dependencies that were not already available or were otherwise outdated or incomplete, including big C++ libraries like the XNNPACK and onnx neural network helper libraries. Each of these typically bundles code for yet another bunch of dependencies. Often, the CMake-based build system of these packages would need patching so we could use our own copies of the dependencies. Curious readers can take a look at the commits leading to XNNPACK and those leading to onnx. Another interesting thing is the use of derivatives: PyTorch depends on both QNNPACK and XNNPACK, even though the latter is a derivative of the former, and of course, it bundles both.

Icing on the cake: most of these machine learning software packages do not have proper releases—no Git tag, nothing—so we were left to pick the commit du jour or the one explicitly referred to by Git submodules.

Most PyTorch dependencies were unbundled. The end result is a PyTorch package in its full glory, actually built from source. Phew! Its dependency graph looks like this (only showing dependencies at distance 2 or less):

Excerpt from the PyTorch package dependency graph.
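
A graph like this one can be regenerated locally with guix graph; the sketch below assumes Graphviz is installed so that the dot program is available.

$ guix graph --max-depth=2 python-pytorch | dot -Tsvg > python-pytorch-graph.svg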

With this many dependencies bundled, these projects resemble the JavaScript dystopia Christine Lemmer-Webber described. Anyway, PyTorch is now also installable with Guix in seconds when enabling pre-built binaries:

$ time guix install python-pytorch
The following package will be installed:
    python-pytorch 1.9.0

52.3 MB will be downloaded
 python-pytorch-1.9.0  49.9MiB                          6.2MiB/s 00:08 [##################] 100.0%
The following derivation will be built:
   /gnu/store/yvygv6nlichbzyynvg4w04xa7xarx3rp-profile.drv

applying 16 grafts for /gnu/store/6qgcb3a7x1wg4havsryjh6zsy3za7h3b-python-pytorch-1.9.0.drv ...
building profile with 2 packages...

real    0m20.697s
user    0m3.604s
sys     0m0.118s

This time though, one can view the self-contained package definition by running guix edit python-pytorch and, say, rebuild it locally to verify the source/binary correspondence:

guix build python-pytorch --no-grafts --check

… or at least it will be possible once NNPACK’s build system generates code in a deterministic order.
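
When that day comes, guix challenge will provide another angle on the same question: it compares the binaries reported by substitute servers with one another and, when available, with a local build of the same derivation, flagging any discrepancy.

guix challenge python-pytorch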

pip & CONDA

Having done all this work, the author entered a soul-searching phase: sure, the rationale is well documented, but is it worth it? It looks as though everyone (everyone?) is installing PyTorch using pip anyway and considering it good enough. Also, why was it so much work to package PyTorch for Guix? Could it be that we’re missing packaging tricks that make it so easy for others to provide PyTorch & co.?

To answer these questions, let’s first take a look at what pip provides. The pip install command above completed in under thirty seconds, and most of that time went into downloading an archive weighing no less than 831 MB. What’s in there? Those .whl files are actually zip archives, which one can easily inspect:

$ wget -qO /tmp/pytorch.zip https://files.pythonhosted.org/packages/69/f2/2c0114a3ba44445de3e6a45c4a2bf33c7f6711774adece8627746380780c/torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl
$ unzip -l /tmp/pytorch.zip | grep '\.so'
    29832  06-12-2021 00:37   torch/_dl.cpython-38-x86_64-linux-gnu.so
    29296  06-12-2021 00:37   torch/_C.cpython-38-x86_64-linux-gnu.so
372539384  06-12-2021 00:37   torch/lib/libtorch_cpu.so
    43520  06-12-2021 00:37   torch/lib/libnvToolsExt-3965bdd0.so.1
 28964064  06-12-2021 00:37   torch/lib/libtorch_python.so
 46351784  06-12-2021 00:37   torch/lib/libcaffe2_detectron_ops_gpu.so
1159370040  06-12-2021 00:37   torch/lib/libtorch_cuda.so
  4862944  06-12-2021 00:37   torch/lib/libnvrtc-builtins.so
   168720  06-12-2021 00:37   torch/lib/libgomp-a34b3233.so.1
   116240  06-12-2021 00:37   torch/lib/libtorch.so
   523816  06-12-2021 00:37   torch/lib/libcudart-80664282.so.10.2
   222224  06-12-2021 00:37   torch/lib/libc10_cuda.so
    36360  06-12-2021 00:37   torch/lib/libshm.so
    47944  06-12-2021 00:37   torch/lib/libcaffe2_module_test_dynamic.so
 22045456  06-12-2021 00:37   torch/lib/libnvrtc-08c4863f.so.10.2
    12616  06-12-2021 00:37   torch/lib/libtorch_global_deps.so
    21352  06-12-2021 00:37   torch/lib/libcaffe2_nvrtc.so
   842376  06-12-2021 00:37   torch/lib/libc10.so
   552808  06-12-2021 00:37   torch/lib/libcaffe2_observers.so
 46651272  06-12-2021 00:37   caffe2/python/caffe2_pybind11_state.cpython-38-x86_64-linux-gnu.so
 47391432  06-12-2021 00:37   caffe2/python/caffe2_pybind11_state_gpu.cpython-38-x86_64-linux-gnu.so
$ unzip -l /tmp/pytorch.zip | grep '\.so' | wc -l
21

Twenty-one pre-compiled shared libraries in there! Most are part of PyTorch, but some are external dependencies. First there’s libgomp, GCC’s OpenMP and OpenACC run-time support library; we can guess it’s shipped to avoid incompatibilities with the user-installed libgomp, but it could also be a fork of the official libgomp—hard to tell. Then there’s libcudart and libnvToolsExt, both of which are proprietary NVIDIA GPU support libraries—a bit of a surprise, and a bad one, as nothing indicated that pip fetched proprietary software alongside PyTorch. What’s also interesting is dependencies that are not there, such as onnx and XNNPACK; we can only guess that they’re statically linked within libtorch.so.
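
Short of a full analysis, a crude way to test that guess is to extract the wheel and look for traces of those projects inside libtorch_cpu.so. The commands below are only a heuristic (symbols may be renamed, stripped, or inlined away) and their output is not reproduced here.

$ unzip -q /tmp/pytorch.zip -d /tmp/pt
$ strings /tmp/pt/torch/lib/libtorch_cpu.so | grep -c -i xnnpack
$ nm -D /tmp/pt/torch/lib/libtorch_cpu.so | grep -c -i onnx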

Will these binaries work? On my system, they won’t work without tweaks, such as setting LD_LIBRARY_PATH so that these libraries can find the ones they depend on. Running ldd shows the “system libraries” that are assumed to be available; this includes GNU libstdc++ and libgcc_s, GCC’s run-time support library:

$ ldd torch/lib/libtorch_cpu.so 
        linux-vdso.so.1 (0x00007ffca6d31000)
        libgomp-a34b3233.so.1 => /tmp/pt/torch/lib/libgomp-a34b3233.so.1 (0x00007ff435723000)
        …
        libstdc++.so.6 => not found
        libgcc_s.so.1 => not found

Not providing those libraries, or providing a variant that is not binary-compatible with what libtorch_cpu.so expects, is the end of the game. Fortunately these two libraries rarely change, so the assumption made here is that “most” users will have them. It’s interesting that the authors deemed it necessary to ship libgomp.so and not libstdc++.so—maybe a mixture of insider knowledge and dice roll.
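
As an illustration of the tweak mentioned earlier, the snippet below makes the loader search one extra directory; /path/to/gcc/libs is a placeholder for wherever libstdc++.so.6 and libgcc_s.so.1 happen to live on a given machine, which is exactly the implicit assumption these wheels make.

$ export LD_LIBRARY_PATH=/path/to/gcc/libs:$LD_LIBRARY_PATH
$ ldd /tmp/pt/torch/lib/libtorch_cpu.so | grep 'not found'   # ideally prints nothing now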

How were these binaries built in the first place? Essentially, by running python setup.py bdist_wheel “on some system” which, as we saw, invokes cmake to build PyTorch and all its bundled dependencies. But the PyTorch project does a little bit more than this to build and publish binaries for pip and CONDA. The entry point for both is binary_linux_build.sh, which in turn delegates to scripts living in another repo, build_pytorch.sh for CONDA or one of the wheels scripts; it’s one of these scripts that’s in charge of embedding libgomp.so, libcudart.so, and other libraries present on the system.

And where do these libraries come from? They come from the GNU/Linux distribution underlying the build environment which, going back to the initial repository, is typically some version of Ubuntu or CentOS running on CircleCI or Microsoft Azure machines.

At the end of the process is a bunch of wheel or CONDA archives ready to be uploaded as-is to Anaconda or to PyPI.

Looking at these scripts gives useful hints. But going back to the code pip and CONDA users are actually running: is libgomp-a34b3233.so.1 the genuine, unmodified libgomp, or is it a patched variant? Is libtorch_cpu.so really obtained by building the source at the 1.9.0 Git tag?

Let’s make it clear: verifying the source/binary correspondence for all the bits in the pip and CONDA packages is practically infeasible. Merely rebuilding them locally is hard. Reasoning about the build process is hard because of all the layers involved and because these scripts are a ball of spaghetti. Such a setup rightfully raises red flags for any security-minded person (we’ll get to that below), and it is also a concern for freedom-conscious users: is PyPI conveying the Corresponding Source of libgomp, as required by Section 6 of its license? Probably not. PyTorch’s own license has no such requirement, but there is certainly a tacit understanding that pip install torch provides the genuine PyTorch, and it is unpleasant at best that this claim is unverifiable in practice. This should be a red flag for anyone doing reproducible science, which is to say, science.

Source-based distros

CONDA and pip (at least the “wheels” part of it) are essentially “binary distros”: they focus on distributing pre-built binaries without concern for how they were built, or whether they can actually be built from source. Without a conscious effort to require reproducible builds, so that anyone can independently verify binaries, these tools are doomed to be not only unsafe but also opaque; and there are, to date, no signs of CONDA and PyPI/pip moving in that direction.

Update (2021-09-21): Bovy on Twitter mentions conda-forge as a possible answer. Public build recipes (here’s that of PyTorch) and automated builds improve transparency compared to binaries uploaded straight from developer machines, but build reproducibility remains to be addressed.

Like Guix, Spack and Nix are source-based: their primary job is to build software from source, and the use of pre-built binaries is an optimization. The Spack package and the Nixpkgs package are both about building it from source. The Spack package avoids some of the bundled dependencies, though it does use large bundled ones such as XNNPACK and onnx; the Nixpkgs package makes no such effort and builds everything as-is, bundled copies included.

Unlike Nix or Guix, Spack assumes core packages—for some definition of “core”, but that includes at least a C/C++ compiler, a C library, and a Python interpreter—are already available. Thus, by definition, the Spack package is not self-contained and may fail to build, plain and simple, if some of the implicit assumptions are not met. When fetching pre-built binaries from a “binary cache”, the problems are similar to those of CONDA and pip: binaries might not work if assumptions about system libraries are not met (though Spack mitigates this risk by tying binaries to the underlying GNU/Linux distro), and it may be hard to verify them through rebuilding, again because these implicit assumptions have an impact on the bits in the resulting binaries.

On convenience, security, and reproducible science

The convenience and ease of use of pip and CONDA have undeniable appeal. That one can, in a matter of minutes, install the tool and use it to deploy a complex software stack like that of PyTorch has certainly contributed to their success. Our view though, as Guix packagers, is that we should take a step back and open the package: look at what is inside and consider the impact it has.

What we see when we look inside PyPI wheels and CONDA packages is opaque binaries built on a developer’s machine and later uploaded to the central repository. They are opaque because, lacking reproducible build methodology and tooling, one cannot independently verify that they correspond to the presumed source code. They may also be deceptive: you get not just PyTorch but also the binary of a proprietary piece of software.

In their ESEC/FSE 2021 paper on LastPyMile, Duc-Ly Vu et al. empirically show that “the last mile from source to package” on PyPI is indeed the weakest link in the software supply chain, and that actual differences between packaged source code and upstream source code are observed in the wild. And this is only source code—for binaries as found in the torch wheel, there is just no practical way to verify that they genuinely correspond to that source code.

Machine-learning software is fast-moving. The desire to move fast already shows in upstream development practices: a lack of releases for important dependencies, careless dependency bundling. Coupled with users’ legitimate demand for “easy installation”, this has turned PyPI, in the footsteps of CONDA, into a huge software supply chain vulnerability waiting to be exploited. It is a step back to several years ago, when Debian had not yet put an end to its “dirtiest secret”: that Debian packages were non-reproducible, built on developer machines, and uploaded to its servers. Reproducible builds should be the norm; building from source, too, should be the norm.

It is surprising that such a blatant weakness goes unnoticed, especially on high-performance computing clusters that are usually subject to strict security policies. Even more so at a time when awareness of software supply chain security is growing, and when the US Government’s Executive Order on cybersecurity, for example, explicitly calls for work on subjects as concrete as “using administratively separate build environments” and “employing automated tools (…) to maintain trusted source code supply chains”.

Beyond security, what are the implications for scientific workflows? Can we build reproducible computational workflows on top of software that is itself non-reproducible and non-verifiable? The answer is “yes”, one can do that. However, just as one would not build a house on a quagmire, building scientific workflows on shaky foundations is inadvisable. Far from being an abstract principle, this has concrete implications: scientists and their peers need to be able to reproduce the software environment, all of it; they need the ability to customize it and experiment with it, as opposed to merely running code from an “inert” binary.

It is time to stop running opaque binaries and to value transparency and verifiability for our foundational software, as much as we value transparency and verifiability for scientific work.

Acknowledgments

The author thanks Ricardo Wurmus and Simon Tournier for insightful feedback and suggestions on an earlier draft of this post.

Unless otherwise stated, blog posts on this site are copyrighted by their respective authors and published under the terms of the CC-BY-SA 4.0 license and those of the GNU Free Documentation License (version 1.3 or later, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts).
