Towards reproducible Jupyter notebooks

Ludovic Courtès — October 10, 2019

Jupyter Notebooks are becoming a key component of the researcher’s toolbox when it comes to sharing and reproducing computational experiments. Jupyter notebooks not only let users intermingle a narrative with supporting code in a way reminiscent of literate programming, they also make it easy to interact with the code and, thus, to build on each other’s work.

Update: The slides and video of my talk (in French) on this topic at JCAD 2019 are now available on-line.

To give a few examples, Jupyter notebooks are one of the topics of Inria Learning Lab’s new MOOC on reproducible science and a central part of the “Reproducible Science Curriculum”; researchers in different domains view Jupyter notebooks as the tool of choice to support reproducible research and, possibly, as the seed for changes in scientific publication practices.

In this post we explore a solution to Jupyter Notebook’s Achilles’ heel: software deployment. We have been working on a solution to this dubbed Guix-Jupyter that we’re releasing today, but first, let us explain how Jupyter’s promises for reproducible research are hindered by a lack of support for reproducible software deployment.

The extent of Notebook reproducibility

A Jupyter Notebook is essentially a program, and the jupyter notebook command provides an interactive user interface in your web browser to that program. Of course, to execute the program, an interpreter and a Jupyter kernel for the language the program is written in must be available in the execution environment of Jupyter. If the program refers to external libraries or data, well, those also have to be available in the execution environment.

Unfortunately, it is up to you, the user, to run Jupyter in the right environment, with the right interpreter and libraries available. Say I publish a notebook containing a Python 3 program that uses NumPy and SciPy. If you try to run it on a Jupyter instance that only has Python 2, or that lacks SciPy, it won’t run; if you have SciPy but a different version than the one I used, well, it may fail to run because of API changes, or you could get different results.

What’s worse, Jupyter does not offer a standard way for notebooks to express their software dependencies. You can only hope the author provides detailed instructions and then walk your way through deployment.

Simple solutions!

People have got used to resorting to a simple solution: running pip install or conda install right at the top of their notebook! This isn’t great for several reasons. As a researcher, you’re interested in reproducing or deriving someone else’s research; you’re certainly much less interested in installing arbitrary packages on your system as you do that. Plus, that doesn’t necessarily quite solve the dependency issue: pip can only install Python dependencies, not Python itself or anything written in another language; the JupyterLab Package Installer streamlines the pip install trick but doesn’t address this core limitation. conda doesn’t have this drawback, but it still assumes the availability of “system software” packages, and is generally not very good at reproducing software environments at different points in time or space.

I hear you… we have our next go-to solution: containers to the rescue! That is true, thanks for suggesting it! Indeed, many people have developed solutions to the Jupyter deployment problem around containers. As a user, you can entrust a service such as Binder with your code and data, which conveniently takes care of building a Docker image for the software environment of your notebook (using the nifty repo2docker) and spawning a Jupyter Notebook instance in that environment.

Research institutes and computing centers have also started offering “Jupyter Notebook as a service” in a similar way. Administrators of those systems can “just” deploy JupyterHub with Kubernetes on OpenStack, possibly with a drop of BinderHub for good measure.

Mind you, this software stack does an amazing job, but its complexity is baffling if we think about the simple deployment problem we’re trying to solve. And maybe you don’t want to turn your computers into mere Web terminals when you could run notebooks locally.

Last but not least, we still haven’t solved the core issue, which is that notebooks are not self-contained: they do not describe the dependencies they need. Binder’s configuration files, such as environment.yml for Anaconda, get us close to that, but they fail to capture a complete environment, thereby making it hard to impossible to reproduce the same environment on different machines or at different points in time.

Making Notebooks “deployment-aware”

What if we could make notebooks “deployment-aware” from the start? What if the notebook itself could describe its dependencies? What if reproducible software deployment was an integral part of the notebook?

We started working in that direction a year ago when Pierre-Antoine Rouby wrote a first version of the Guix kernel for Jupyter.

Today, we’re happy to announce the first beta release of Guix-Jupyter, a Guix kernel for Jupyter!

[Figure: Guix-Jupyter logo.]

The Guix kernel is still very much a work-in-progress but it already lays the foundation for self-contained, reproducible notebooks: notebooks that automatically run in the right software environment, regardless of the machine where you run them or the time at which you run them. We’re pretty excited to share it today!

So, what does the Guix kernel have to offer? First and foremost, it allows you to define environments in which the notebook code is going to be executed. An environment consists of any number of Guix packages and one of them must be a Jupyter kernel—e.g., python-ipykernel for Python 3 or r-irkernel for GNU R. And of course, you can add any Python or R libraries or really any package you need to use in those environments. Subsequent cells are automatically executed in that environment, using the Jupyter kernel it contains.
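To make the "one of them must be a Jupyter kernel" rule concrete, here is a minimal Python sketch. It is not Guix-Jupyter's actual code (which is written in Guile Scheme); it only models an environment as a named list of packages and looks up the kernel among them. The `Environment` class, the `KNOWN_KERNELS` table, and the `kernel_package` method are hypothetical names introduced for illustration. Only the two kernel package names come from the paragraph above.

```python
from dataclasses import dataclass

# Hypothetical table mapping kernel packages to the language they serve.
# Only these two package names appear in the post; a real list would be longer.
KNOWN_KERNELS = {
    "python-ipykernel": "Python 3",
    "r-irkernel": "GNU R",
}


@dataclass(frozen=True)
class Environment:
    """A named set of Guix packages (illustrative model only)."""
    name: str
    packages: tuple

    def kernel_package(self):
        """Return the first package that is a known Jupyter kernel."""
        for package in self.packages:
            if package in KNOWN_KERNELS:
                return package
        raise ValueError(f"environment {self.name!r} contains no Jupyter kernel")


# Two environments, as a multi-lingual notebook might define them.
python_env = Environment("python-env", ("python-ipykernel", "python-numpy", "python-scipy"))
r_env = Environment("r-env", ("r-irkernel", "r-ggplot2"))
```

In this model, an environment without any kernel package is rejected as soon as something tries to run code in it, which mirrors the constraint stated above.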

In fact, a single notebook can define several environments, each with a possibly different Jupyter kernel, which allows you to create a multi-lingual notebook:

[Figure: Multi-lingual notebook.]

(The IPython kernel has a built-in mechanism to interface with languages other than Python, but that’s a wholly different approach.)

How does that differ from running pip install or similar right from the notebook? First, it doesn’t fiddle with your home directory or similar—the environments are one-off environments created on the fly. Second, it’s not limited to a particular language. And third, it’s reproducible.

Namely, since Guix is able to reproduce software environments at any point in time and space, you can not only specify packages to include in the environment, but also pin a specific revision of the Guix channels:

[Figure: Pinning a Guix revision.]

How do you obtain the commit ID that you want to pin to in the first place? If you’re using Guix, you can obtain it by running guix describe on a configuration that works for you. Beyond that, the brand-new Guix Data Service will come in handy. For example, it can show you the history of upgrades to any given package, say python-scipy. By hooking up Guix-Jupyter and the Data Service, we could make it easier to do time traveling. We’ll see!
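If you would rather extract the commit ID programmatically, something along these lines might work. This is only a sketch: it assumes that guix describe is run with its JSON output format and that the result is a list of channel objects carrying "name" and "commit" fields. The helper names `pinned_commits` and `describe_current_guix` are made up for this example.

```python
import json
import subprocess


def pinned_commits(describe_json):
    """Map channel names to commit IDs.

    Assumes the input is a JSON list of objects with "name" and "commit"
    keys, which is how we expect `guix describe --format=json` to look.
    """
    return {channel["name"]: channel["commit"] for channel in json.loads(describe_json)}


def describe_current_guix():
    """Run `guix describe` and return its channel-to-commit mapping (requires Guix)."""
    result = subprocess.run(
        ["guix", "describe", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return pinned_commits(result.stdout)
```

The parsing step is kept separate from the subprocess call so that it can be tried on saved output, on a machine without Guix.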

Isolated execution environments

What’s more, the notebook code runs in an isolated environment: it cannot access any of your files and cannot fetch data from the Internet (more on the implementation below). That’s good for security (you can now run untrusted notebooks locally), but that’s also good for reproducibility: the notebook cannot have undeclared dependencies. In fact, we’re adapting the functional model of build processes pioneered by Nix to an interactive execution environment. In other words, we’re saying that a reproducible notebook is a pure function, and we create an isolated execution environment to make it happen.

So far so good, but if you’ve paid attention, you’re probably wondering: how do I get my data in that environment? To refer to data, the notebook must use a ;;guix download directive containing a URL and expected SHA256 hash of the data:

[Figure: The download magic.]

In practice, data is only downloaded the first time. Subsequent executions reuse the pre-downloaded data. In Nix/Guix terms, this is a “fixed-output derivation”. Since the hash of the data is specified, we make sure the notebook operates on the intended data, and an error is raised if the downloaded data has a different hash.
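The download-once, verify-always behavior described above can be sketched in a few lines of Python. This is an illustration of the idea, not the kernel's implementation: `fetch_verified` and its `fetch` parameter are hypothetical, the cache layout is invented, and the hash is compared in hexadecimal form, whereas Guix tools may expect a different encoding.

```python
import hashlib
from pathlib import Path


def fetch_verified(url, expected_sha256, cache_dir, fetch):
    """Return the path of the data at `url`, downloading it at most once.

    `fetch` is any callable mapping a URL to bytes (for instance a thin
    wrapper around urllib.request.urlopen); it is passed in so the sketch
    can be exercised offline.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key is the expected hash, not the URL: the content is what matters.
    target = cache_dir / expected_sha256
    if target.exists():
        return target  # downloaded and verified by an earlier run

    data = fetch(url)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"hash mismatch for {url}: expected {expected_sha256}, got {actual}"
        )
    target.write_bytes(data)
    return target
```

Keying the cache on the hash rather than on the URL is what makes the result "fixed-output": two URLs serving the same bytes end up as the same file, and a URL whose content changed is rejected instead of silently replacing the data.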

If you’ve used Guix before, the ;;guix annotations are similar to the interface of command-line tools like guix environment, guix pull, and guix describe.
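Because the annotations read like a command line, they can be split the way a shell would split one. The Python sketch below shows one possible way to do that. The `parse_magic` function is hypothetical, and the exact directive spellings used in the usage examples (a `pin` subcommand, an `<-` arrow after the environment name) are assumptions, not taken from the text.

```python
import shlex

MAGIC_PREFIX = ";;guix"


def parse_magic(cell_source):
    """Split a cell into (subcommand, arguments), or return None for plain code.

    Only the first non-blank line is inspected, much like a shell reads a
    single command line.
    """
    lines = cell_source.strip().splitlines()
    if not lines or not lines[0].startswith(MAGIC_PREFIX):
        return None
    words = shlex.split(lines[0])
    if words[0] != MAGIC_PREFIX or len(words) < 2:
        return None
    return words[1], words[2:]
```

With this sketch, `parse_magic(";;guix pin 0123abc")` yields the subcommand `pin` with a single argument, while an ordinary cell such as `print(1)` yields `None` and would be left for the language kernel to run.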

Guix kernel architecture

The Guix support for Jupyter we presented above is not implemented as a standard Jupyter extension, but rather as a Jupyter Kernel that stands alone and acts as a proxy between the clients and the actual kernels you use.

Thus, as a user, you first have to install Guix on your machine, and then Jupyter and the Guix kernel:

guix install jupyter guix-jupyter

At that point, you can start a notebook:

jupyter notebook

… and select the “Guix” kernel.

[Figure: Selecting the kernel.]

Then you don’t need to explicitly install any other Jupyter kernel since you can just add them to your notebook via ;;guix environment annotations. That’s the nice thing about implementing it as a kernel.

Technically, the kernel implements the whole Jupyter messaging protocol in Guile Scheme, in a type-safe way: JSON messages are converted to Scheme records and back, which allows us to catch certain mistakes at compile time. The kernel maintains state, such as the list of environments and proxied kernels running. It inspects execute_request messages to see if they might contain a ;;guix magic, handles that if needed, and otherwise passes them on to the relevant proxied kernel. Other messages such as complete_request (for code completion) are treated similarly. Processes in separate namespaces are created using Guix’s container API.
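The proxying logic can be pictured with a toy model in Python. This is emphatically not the Guile implementation: the `ProxyKernel` class and its methods are hypothetical, messages are plain dictionaries rather than typed records, and proxied kernels are stand-in callables instead of processes in separate namespaces.

```python
class ProxyKernel:
    """Toy model of a kernel that forwards requests to per-environment kernels."""

    def __init__(self):
        self.kernels = {}    # environment name -> callable standing in for a proxied kernel
        self.current = None  # environment used by subsequent cells

    def register(self, name, kernel):
        """Record a proxied kernel and make its environment the current one."""
        self.kernels[name] = kernel
        self.current = name

    def handle(self, message):
        """Dispatch one Jupyter-style message, given as a plain dict."""
        msg_type = message["header"]["msg_type"]
        code = message["content"].get("code", "")
        if msg_type == "execute_request" and code.lstrip().startswith(";;guix"):
            return self.handle_magic(code)
        if self.current is None:
            raise RuntimeError("no environment defined yet")
        # Everything else (plain code, completion requests, ...) goes to the
        # kernel of the current environment.
        return self.kernels[self.current](message)

    def handle_magic(self, code):
        # A real kernel would build the environment and spawn a kernel in a
        # container here; the sketch only acknowledges the directive.
        return {"status": "ok", "handled_by": "proxy", "magic": code.strip()}
```

The point of the design is visible even in this reduced form: the client only ever talks to one kernel, and which language kernel answers is a piece of state held by the proxy.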

As a bonus, there’s, of course, a built-in kernel for GNU Guile, the great Scheme implementation that powers Guix. Pictures, relational programming, delimited continuations, and whatnot in your notebook!

One downside to the proxying approach is that since a notebook is normally monolingual, there’s no way to tell Jupyter that some cells are Python, while others are R, Guile, and so on.

It must be said though that we’re much more familiar with Guix than with Jupyter. So if you’re a Jupyter hacker, do share any piece of advice you may have!

Conclusion

The Guix Jupyter kernel is still “beta” but it already demonstrates most of the things we had in mind when we toyed with the idea of “notebooks with reproducible deployment built-in”. There are many improvements we can make, notably to the user interface: things like showing a progress bar when an environment is being built, providing widgets to navigate environments or packages, etc.

It remains to be seen how convenient Guix-Jupyter is for “real-world” notebooks, and we’d very much like to hear from intrepid Jupyter users who’d want to try and add ;;guix annotations to their favorite notebooks.

A practical question is: what happens if you publish a notebook for the Guix-Jupyter kernel but your collaborators don’t have that kernel? If your notebook uses a single environment (say, a single Python environment), they’ll be able to run it provided they remove or skip the ;;guix annotations. But then, of course, they’re on their own when it comes to deploying the environment of that notebook. If you use ;;guix download or multiple environments, then the notebook won’t be readily usable to someone who doesn’t have Guix-Jupyter. That’s a limitation, but one that’s probably hard to avoid.

Is a kernel the right approach to adding reproducible deployment to Jupyter? Should it be a built-in feature of Notebook or of JupyterLab? Maybe. There’s an engineering argument that Jupyter probably shouldn’t be tied to a specific deployment tool, and in that sense, handling it as a kernel or as an extension leaves Jupyter users a freedom of choice.

No matter what approach is used, our best-practices book should be updated so that Jupyter notebooks lacking deployment information become a thing of the past!
