In modern science, analysis is required to process data. When the data-flow is linear, such a process is easily represented by tools such as the standard Unix pipeline. However, this data-flow is often modeled by a directed graph: each processing node may have one or more inputs and the outputs may be directed to different processing nodes. This directed graph, mainly used in the fields of bioinformatics, medical imaging and astronomy, among many others, is called a workflow.
The book Evolutionary Genomics was published in July this year. Of particular interest to Guix-HPC is the chapter entitled “Scalable Workflows and Reproducible Data Analysis for Genomics”, by Francesco Strozzi et al.:
In the quest for truly reproducible workflows I set out to create an example of a reproducible workflow using GNU Guix, IPFS, and CWL. GNU Guix provides content-addressable, reproducible, and verifiable software deployment. IPFS provides content-addressable storage, and CWL describes workflows that can run on specifically supported backend hardware system. In principle, this combination of tools should be enough to provide reproducibility with provenance and improved security.