Building Your Own Computational Workflow for Social Scientists: Key Points

Introduction

Name your files consistently
Keep it short but descriptive
Share/establish a naming convention when working with collaborators
Consider generating output file names dynamically
Avoid special characters or spaces to keep it machine-compatible
Use capitals or underscores to keep it human-readable
Use consistent date formatting, for example ISO 8601 (YYYY-MM-DD), so that files sort chronologically by default
Include a version number when applicable
Record a naming convention in your data management plan
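
The naming points above can be sketched in a few lines of Nextflow (Groovy) code. This is only an illustration: the parameter names project and version and the "estimates" label are hypothetical, not part of any convention the lesson prescribes.

```nextflow
// Hypothetical parameters, used here only to illustrate a naming convention.
params.project = 'survey'
params.version = 'v01'

workflow {
    // ISO 8601 date (YYYY-MM-DD) so that file names sort chronologically.
    def run_date = new Date().format('yyyy-MM-dd')

    // Pattern: <project>_estimates_<YYYY-MM-DD>_<version>.csv
    // No spaces or special characters; underscores keep it readable.
    def out_name = "${params.project}_estimates_${run_date}_${params.version}.csv"

    println out_name
}
```

Building the name from parameters means a collaborator can change the project label or version in one place and every output file follows the same convention.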

A workflow is a sequence of tasks that process a set of data.
A workflow management system (WfMS) is a computational platform that provides an infrastructure for the set-up, execution and monitoring of workflows.
Nextflow scripts consist of channels for controlling inputs and outputs, and processes for defining workflow tasks.
You run a Nextflow script using the nextflow run command.

Pipeline parameters are specified by prepending the prefix params to a variable name, separated by a dot character.
To specify a pipeline parameter on the command line for a Nextflow run, use the --variable_name syntax.
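
As a minimal sketch of the two points above, assuming a script saved under the hypothetical name hello_params.nf and a hypothetical parameter called greeting:

```nextflow
// hello_params.nf (hypothetical file name)

// Default value; the "params." prefix makes this a pipeline parameter.
params.greeting = 'hello'

workflow {
    println "greeting is: ${params.greeting}"
}

// Run with the default:       nextflow run hello_params.nf
// Override on the command line: nextflow run hello_params.nf --greeting hola
```

Note the double dash: pipeline parameters use --name, whereas Nextflow's own options (such as -resume or -profile) use a single dash.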

Channels must be used to import data into Nextflow.
Nextflow has two different kinds of channels: queue channels and value channels.
Data in value channels can be used multiple times in a workflow.
Data in queue channels are consumed when they are used by a process or an operator.
Channel factory methods, such as Channel.of, are used to create channels.
Channel factory methods have optional parameters, e.g. checkIfExists, that can be used to alter the creation and behaviour of a channel.
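
The sketch below illustrates the two kinds of channel and a factory option. The channel names and the path data/survey.csv are hypothetical; with checkIfExists set, the run is expected to stop with an error if that file is missing.

```nextflow
workflow {
    // Queue channel: emits 1, 2 and 3 as separate items.
    numbers_ch = Channel.of(1, 2, 3)

    // Value channel: holds a single value that can be read any number of times.
    label_ch = Channel.value('wave1')

    // Queue channel of files; checkIfExists asks Nextflow to raise an error
    // if the path does not exist. 'data/survey.csv' is a hypothetical path.
    files_ch = Channel.fromPath('data/survey.csv', checkIfExists: true)

    numbers_ch.view()
    label_ch.view()
    files_ch.view()
}
```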

A Nextflow process is an independent step in a workflow.
Processes contain up to five definition blocks: directives, inputs, outputs, a when clause and finally a script block.
The script block contains the commands you would like to run.
A process should have a script block, but the other four blocks are optional.
Inputs are defined in the input block with a type qualifier and a name.

Outputs from a process are defined using the output block.
You can group input and output data from a process using the tuple qualifier.
The execution of a process can be controlled using the when declaration and conditional statements.
Files produced within a process and defined as output can be saved to a directory using the publishDir directive.
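
A minimal sketch of a process that uses all five blocks is shown below. The process name COUNT_ROWS, the results/counts directory and the input file data/wave1.csv are hypothetical and chosen only for illustration.

```nextflow
process COUNT_ROWS {
    // Directive: copy declared outputs into a results directory (hypothetical path).
    publishDir 'results/counts', mode: 'copy'

    input:
    // tuple groups an identifier with its file so they travel together.
    tuple val(sample_id), path(csv_file)

    output:
    tuple val(sample_id), path("${sample_id}_rows.txt")

    when:
    // Only run the task for files with a .csv extension.
    csv_file.name.endsWith('.csv')

    script:
    """
    wc -l < ${csv_file} > ${sample_id}_rows.txt
    """
}

workflow {
    // Hypothetical input: one (id, file) pair.
    input_ch = Channel.of(['wave1', file('data/wave1.csv')])
    COUNT_ROWS(input_ch)
}
```

Only the script block is needed for a working process; the directive, input, output and when blocks are added as the task requires them.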

A Nextflow workflow is defined by invoking processes inside the workflow scope.
A process is invoked like a function inside the workflow scope, passing any required inputs as arguments, e.g. ESTIMATION(estimation_channel).
Process outputs can be accessed using the out attribute for the respective process object or assigning the output to a Nextflow variable.
Multiple outputs from a single process can be accessed using the list syntax [] and the output's index, or by referencing a named process output.
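
The following sketch shows a process invoked inside the workflow scope and both ways of reaching its outputs. The body of ESTIMATION and the output names simulation_ch and log_ch are assumptions made for illustration, not the lesson's actual process.

```nextflow
process ESTIMATION {
    input:
    val x

    output:
    // "emit" gives each output a name that can be used after ".out".
    path 'estimate.txt', emit: simulation_ch
    path 'log.txt',      emit: log_ch

    script:
    """
    echo "estimate for ${x}" > estimate.txt
    echo "finished ${x}" > log.txt
    """
}

workflow {
    estimation_channel = Channel.of(1, 2, 3)

    // Invoke the process like a function.
    ESTIMATION(estimation_channel)

    // First output, accessed by index.
    ESTIMATION.out[0].view()

    // Second output, accessed by its emit name.
    ESTIMATION.out.log_ch.view()
}
```

Named outputs tend to be easier to read than indices, and they keep working if the order of the output declarations changes.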

Nextflow operators are methods that allow you to modify, set or view channels.
Operators can be separated into several groups: filtering, transforming, splitting, combining, forking and maths operators.
To use an operator, use the dot notation after the channel object, e.g. ESTIMATION.out.simulation_ch.view().
You can parse text items emitted by a channel, that are formatted using the CSV format, using the splitCsv operator.
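
A short sketch of operators chained with dot notation, using made-up values. The CSV text and its column names id and age are hypothetical.

```nextflow
workflow {
    // Filtering then transforming: keep even numbers, then multiply by 10.
    Channel.of(1, 2, 3, 4)
        .filter { n -> n % 2 == 0 }
        .map { n -> n * 10 }
        .view()

    // splitCsv parses CSV-formatted text; header: true turns each row
    // into a map keyed by the column names in the first line.
    Channel.of('id,age\na,31\nb,45')
        .splitCsv(header: true)
        .view { row -> "${row.id} is ${row.age}" }
}
```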

Nextflow can produce a custom execution report with run information using the log command.
You can generate a report or timeline using the template specified by Nextflow.

Nextflow configuration can be managed using a Nextflow configuration file.
Nextflow configuration files are plain text files containing a set of properties.
You can define process-specific settings, such as cpus and memory, within the process scope.
You can assign different resources to different processes using the process selectors withName or withLabel.
You can define a profile for different configurations using the profiles scope. These profiles can be selected when launching a pipeline execution by using the -profile command-line option.
Nextflow configuration settings are evaluated in the order in which they are read in.
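
A configuration file that combines these points might look like the sketch below. The resource values, the label big_mem, the profile names and the queue name are all hypothetical and would need to match your own pipeline and system.

```nextflow
// nextflow.config (illustrative values only)

process {
    // Defaults applied to every process.
    cpus   = 1
    memory = '2 GB'

    // Override for a single process, selected by name.
    withName: 'ESTIMATION' {
        cpus   = 4
        memory = '8 GB'
    }

    // Override for every process carrying the label 'big_mem'.
    withLabel: 'big_mem' {
        memory = '32 GB'
    }
}

profiles {
    // Selected with: nextflow run main.nf -profile local
    local {
        process.executor = 'local'
    }

    // Selected with: nextflow run main.nf -profile cluster
    cluster {
        process.executor = 'slurm'
        process.queue    = 'short'   // hypothetical queue name
    }
}
```

Keeping resources and executors in the configuration file leaves the workflow script unchanged when it moves from a laptop to a cluster.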

The Docker Hub is an online repository of container images.
Find a container recipe file that works for your project and customise this.
Nextflow can pull a Docker image from Docker Hub and convert it to an Apptainer image.
Docker is not permitted on most HPC systems; Apptainer SIF files are used instead.
Containers are important to reproducible workflows and portability of workflows across environments.
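
As a rough sketch, the configuration below asks Nextflow to run processes with Apptainer using an image pulled from Docker Hub. The image name and the cache directory are placeholders chosen for illustration; substitute the image your project actually needs.

```nextflow
// nextflow.config fragment (illustrative values only)

apptainer {
    enabled    = true
    autoMounts = true
    // Hypothetical directory where converted images are stored.
    cacheDir   = 'apptainer_cache'
}

process {
    // The docker:// prefix tells Apptainer to fetch the image from a
    // Docker registry and convert it. The image name is a placeholder.
    container = 'docker://rocker/r-ver:4.3.1'
}
```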

Nextflow automatically keeps track of all the processes executed in your pipeline via checkpointing.
Nextflow caches intermediate data in task directories within the work directory.
Nextflow caching and checkpointing allow re-entrancy into a workflow after a pipeline error or when using new data, skipping steps that have already been executed successfully.
Re-entrancy is enabled using the -resume option.

Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system.
The Nextflow configuration file can define the target platform on which we intend to run our workflow.
We can specify a profile for our target platform through the -profile option.