ccwl for concise and painless CWL workflows
Modern science relies on computational analysis to process data. When the data flow is linear, such a process is easily represented by tools such as the standard Unix pipeline. Often, however, the data flow is better modeled by a directed graph: each processing node may have one or more inputs, and its outputs may be directed to different processing nodes. Such a directed graph, common in bioinformatics, medical imaging and astronomy among many other fields, is called a workflow.
The Common Workflow Language (CWL) is a specification for describing computational workflows in a way that makes them easy to reproduce and to port to different hardware and software environments. But why do we need workflow languages such as CWL? Why will a simple shell script or a Makefile not suffice?
Why not shell scripts?
With shell scripts, you need not only to code the actual command invocations but also to add a lot of boilerplate for housekeeping tasks such as managing intermediate inputs and outputs. This makes the script hard to read and obscures the logic of the pipeline. Even with a Makefile, the programmer needs to handle cleanup tasks explicitly, typically with a dedicated cleanup target.
Workflow languages allow the programmer to focus only on the actual command invocations—the essence—of the workflow and let the workflow language deal with the housekeeping tasks. For instance, CWL automatically deals with input and output files produced by a command, and ensures that only the necessary intermediate files are exposed to the next command.
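To make the housekeeping burden concrete, here is a minimal sketch of a two-step shell pipeline that manages its own intermediate files. The function name count_unique_words and the file names are hypothetical and chosen only for illustration. Only the two lines marked as actual work express the pipeline; the rest is bookkeeping that a workflow language would take over.

```shell
#!/bin/sh
# Hypothetical helper: print the number of distinct words in a text file.
count_unique_words() {
    input=$1

    # Housekeeping: create a scratch directory for intermediate files.
    workdir=$(mktemp -d) || return 1

    # Actual work, step 1: put each word on its own line.
    if ! tr -cs 'A-Za-z' '\n' < "$input" > "$workdir/words.txt"; then
        rm -rf "$workdir"   # Housekeeping: clean up on failure.
        return 1
    fi

    # Actual work, step 2: keep a single copy of each word.
    if ! sort -u "$workdir/words.txt" > "$workdir/unique.txt"; then
        rm -rf "$workdir"   # Housekeeping: clean up on failure.
        return 1
    fi

    # Count the non-empty lines. grep exits non-zero when the count is 0,
    # so that status is deliberately ignored.
    count=$(grep -c . "$workdir/unique.txt" || :)

    # Housekeeping: remove the intermediate files before reporting.
    rm -rf "$workdir"
    echo "$count"
}
```

It would be called as count_unique_words followed by the path to a text file.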
When there is an error in a step, shell scripts usually leave the user with arcane error messages, or worse, mindlessly march on as though nothing went wrong. But workflow languages can clearly indicate which step failed.
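The "march on" behaviour is easy to reproduce. In the sketch below, false stands in for any failing step, and the function names run_unchecked and run_checked are hypothetical. Without an explicit check, or a shell option such as set -e, the step after the failure still runs.

```shell
#!/bin/sh
# Hypothetical three-step pipeline in which the second step fails.

# No error handling: the failure of step 2 goes unnoticed.
run_unchecked() {
    echo "step 1 ok"
    false                       # step 2 fails silently
    echo "step 3 ran anyway"
}

# Hand-written error handling: report the failing step and stop.
run_checked() {
    echo "step 1 ok"
    false || { echo "step 2 failed" >&2; return 1; }
    echo "step 3 ran anyway"
}
```

In a shell without set -e, run_unchecked prints both the step 1 and step 3 messages and returns success, whereas run_checked stops at step 2 with a non-zero status. Every step needs such a check written by hand, which is the kind of bookkeeping a workflow language does on the user's behalf.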
Portability to different software and hardware environments
Workflows often need to be deployed to different software and hardware environments—to a cluster, to containers in the cloud, etc. When a shell script workflow needs to be deployed in a new environment, it will most likely need to be tweaked a little. Even Makefiles invoke commands using a shell, and thus suffer from the same portability issues. Workflow languages, on the other hand, aim to handle this transparently. This leads to higher confidence in the workflow, and allows a wide community to reproduce and deploy the workflow easily.
Data types, type conversion and static type checking
For better or for worse, due to historical reasons, shells (and by extension, Makefiles) revolve around a single data type: the string. For instance, all command line arguments passed to a shell script, or indeed to any other command, are strings. These strings may actually represent strings, but often they represent numbers, names of files, etc. It is up to the programmer to convert these string arguments to suitable types, and to deal with any errors that may arise in that conversion.
Workflow languages can handle this type conversion automatically. For example, they can ensure arguments representing numbers indeed contain only digits, or that there indeed exist files whose names are mentioned in the arguments. And some workflow languages such as CWL, funflow and bistro even have static typing so that typing errors can be detected at compile-time, instead of at run-time.
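For comparison, this is roughly what the same checks look like when written by hand in a shell script. The helper names is_integer and require_file are hypothetical; they are a sketch of the validation a script author has to remember to write, not code from any particular tool.

```shell
#!/bin/sh
# Hypothetical validators that a shell script has to provide itself.

# Succeed only if the argument is a non-empty string of digits.
is_integer() {
    case $1 in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeed only if the argument names an existing regular file.
require_file() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
}
```

A script would call these on each argument before starting any real work, for example is_integer "$1" || exit 1. With a typed workflow language, declaring the input as an integer or a file makes such checks unnecessary.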
Human-readable and machine-readable
And finally, workflow languages need to be easy not just for a human to read and write, but also for machines to inspect. For instance, it should be tractable for a computer to read a workflow and generate a graphical visualization of the steps to be executed and the dependencies between those steps. This is where CWL stands out. Another way to understand this is that it is possible to automatically convert a CWL workflow into a shell script, but not the other way around. In this regard, Makefiles are a little better than shell scripts. But, with their many complex features to ease human-writability, Makefiles sacrifice machine-readability.
So, what's wrong with CWL?
So, CWL has all these nice properties. Why do we need anything else?
Limitations of YAML
CWL is, in effect, a special purpose programming language built into YAML syntax. CWL is fundamentally limited by this constraint, and often has verbose constructs to express relatively simple ideas. For example, there are at least three different fields that together build up the command to be executed!
Too many files
Even simple workflows have to be spread out over multiple files. Each command or step in the workflow needs its own CWL file. And all these individual commands need to be wired up together in another CWL file that specifies the overall workflow. Human short term memory is limited, and if one has to juggle around several files and associated tabs/buffers, the overhead is often too much.
What if instead of manually writing a CWL workflow, we could treat CWL as a compilation target and auto-generate it? We would then be free to use a more human-friendly frontend language without losing any of the machine-readability of CWL. This is exactly what ccwl, the Concise Common Workflow Language, does.
ccwl is a domain-specific language embedded into GNU Guile, a Scheme implementation. Lisp dialects such as Scheme are programmable programming languages, and among the few that allow you to directly hack the compiler. As such, Scheme is extremely well suited for embedding domain-specific languages.
To the uninitiated, writing in a lisp may seem less human-friendly than writing in YAML. But, if you try it, you might like it so much that you'll never want to write anything else! And, if you're not convinced, there's always wisp, a Python-like whitespace-significant syntax for GNU Guile. In fact, this is what the Guix Workflow Language (GWL), another excellent workflow language written in GNU Guile, favors.
Human-readable and writable
For the user, ccwl aims to be as easy to write as a shell script, or at least a Makefile. But, by compiling to CWL, ccwl preserves all the benefits of CWL.
Compile-time error checking
Detecting errors as early as possible, preferably at compile time, significantly improves the user experience. There is nothing more frustrating than running a long workflow for several hours, only to have it fail midway and be forced to start all over again without knowing whether it will succeed this time. ccwl, by virtue of the very hackable Scheme compiler that it is built on, aims to provide excellent compile-time error checking along with source references. ccwl isn't quite there yet, but hopefully will be in the coming releases.
Interface with external CWL workflows
Not everybody might convert to ccwl. And often, it will be necessary to reuse CWL workflows written by others. ccwl is pragmatic and allows calling external CWL workflows as part of a larger ccwl workflow. If CWL grows to become a common compilation target for many different workflow languages, this feature could enable seamless collaboration between communities.
In the future, ccwl might also provide pre-packaged ccwl commands for commonly used tools in bioinformatics, astronomy, etc. so that the user is freed from having to write these wrappers and can instead focus on writing only the workflow.
Reproducibility with GNU Guix
ccwl leaves all the hard work of reproducibility in Guix's capable hands. CWL (and, by consequence, ccwl) is agnostic to deployment: as long as a tool can be found in PATH, CWL does not care how that tool got there. This means we can offload all reproducibility responsibilities to Guix. We can simply start a Guix shell with the required packages in the environment, and run our workflow from within that environment. If we pin the Guix commit we are running from, we can reproduce our workflow exactly.
    $ guix shell ccwl cwltool package1 package2 ...
    [env]$ ccwl compile workflow.scm > workflow.cwl
    [env]$ cwltool workflow.cwl
In contrast, the Guix Workflow Language (GWL) uses Guix internally to prepare a reproducible environment. It is thus deployment-aware and tied to Guix.
A taste of ccwl
This article is not a ccwl tutorial. So, we will stop short of describing how to write your own ccwl workflows. But, just to provide a taste for the syntax, here is an example spell check workflow from the ccwl manual, followed by a graphical visualization of it.
    (define split-words
      (command #:inputs text
               #:run "tr" "--complement" "--squeeze-repeats" "A-Za-z" "\\n"
               #:stdin text
               #:outputs (words #:type stdout)))

    (define downcase
      (command #:inputs words
               #:run "tr" "A-Z" "a-z"
               #:stdin words
               #:outputs (downcased-words #:type stdout)))

    (define sort
      (command #:inputs words
               #:run "sort" "--unique"
               #:stdin words
               #:outputs (sorted #:type stdout)))

    (define find-misspellings
      (command #:inputs words dictionary
               #:run "comm" "-23" words dictionary
               #:outputs (misspellings #:type stdout)))

    (workflow (text-file dictionary)
      (pipe (tee (pipe (split-words #:text text-file)
                       (downcase #:words words)
                       (sort (sort-words) #:words downcased-words)
                       (rename #:sorted-words sorted))
                 (pipe (sort (sort-dictionary) #:words dictionary)
                       (rename #:sorted-dictionary sorted)))
            (find-misspellings #:words sorted-words
                               #:dictionary sorted-dictionary)))
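For readers more at home in the shell, the same steps can be sketched as a hand-written shell function. This is an illustrative rendering under our own assumptions, not output generated by ccwl or CWL: the function name spell_check and the temporary file names are hypothetical, and the short POSIX options -c, -s and -u are used in place of the GNU long options in the workflow.

```shell
#!/bin/sh
# Hypothetical shell rendering of the spell check workflow.
# Usage: spell_check TEXT_FILE DICTIONARY
spell_check() {
    text_file=$1
    dictionary=$2

    # Scratch directory for the two sorted intermediate files.
    tmp=$(mktemp -d) || return 1

    # First branch of the tee: split-words, downcase, sort.
    tr -cs 'A-Za-z' '\n' < "$text_file" \
        | tr 'A-Z' 'a-z' \
        | sort -u > "$tmp/sorted-words"

    # Second branch of the tee: sort the dictionary.
    sort -u "$dictionary" > "$tmp/sorted-dictionary"

    # find-misspellings: print words that are not in the dictionary.
    comm -23 "$tmp/sorted-words" "$tmp/sorted-dictionary"
    status=$?

    # Remove the intermediate files and pass on comm's exit status.
    rm -rf "$tmp"
    return $status
}
```

Note how the two branches that the workflow expresses with tee have to be serialized by hand here, and how the intermediate files need explicit creation and cleanup.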