Introduction
- Name your files consistently
- Keep it short but descriptive
- Share/establish a naming convention when working with collaborators
- Consider generating output file names dynamically
- Avoid special characters or spaces to keep it machine-compatible
- Use capitals or underscores to keep it human-readable
- Use consistent date formatting, for example ISO 8601 (`YYYY-MM-DD`), to maintain default order
- Include a version number when applicable
- Record a naming convention in your data management plan
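As a rough illustration of generating output file names dynamically, the sketch below builds a name from an ISO 8601 date, a project label and a version number inside a Nextflow script. The parameter names `project` and `version` and their values are hypothetical and only serve the example.

```nextflow
// Hypothetical parameters, used only to illustrate the naming pattern
params.project = 'growth_model'
params.version = 'v01'

workflow {
    // LocalDate.toString() yields an ISO 8601 date (YYYY-MM-DD)
    def run_date = java.time.LocalDate.now().toString()

    // No spaces or special characters; underscores separate the parts
    def output_name = "${run_date}_${params.project}_${params.version}.csv"
    println output_name
}
```

Putting the date first means that an alphabetical listing of the files is also a chronological one.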
Hello Nextflow
- A workflow is a sequence of tasks that process a set of data.
- A workflow management system (WfMS) is a computational platform that provides an infrastructure for the set-up, execution and monitoring of workflows.
- Nextflow scripts comprise channels for controlling inputs and outputs, and processes for defining workflow tasks.
- You run a Nextflow script using the `nextflow run` command.
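A minimal sketch of how these pieces fit together is shown here: one channel feeding one process. The file name `hello.nf`, the process name `SAY_HELLO` and the channel contents are assumptions made for the example.

```nextflow
// hello.nf (hypothetical file name)
// Run with: nextflow run hello.nf

process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    // The channel controls what flows into the process
    names_ch = Channel.of('Nextflow', 'world')

    // The process defines the task that runs once per item
    SAY_HELLO(names_ch)
    SAY_HELLO.out.view()
}
```

Each item in the channel triggers a separate task, so the process runs twice in this sketch.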
Parameters
- Pipeline parameters are specified by prepending the prefix `params` to a variable name, separated by a dot character.
- To specify a pipeline parameter on the command line for a Nextflow run, use the `--variable_name` syntax.
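The two points above can be sketched as follows. The parameter name `greeting`, its default value and the file name `params_demo.nf` are hypothetical.

```nextflow
// params_demo.nf (hypothetical file name)

// Default value, used when nothing is given on the command line
params.greeting = 'Hello'

workflow {
    println "Greeting is: ${params.greeting}"
}

// Override the default at launch time with a double-dash option:
//   nextflow run params_demo.nf --greeting 'Bonjour'
```

Note the double dash: pipeline parameters use `--name`, whereas options of Nextflow itself use a single dash.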
Channels
- Channels must be used to import data into Nextflow.
- Nextflow has two different kinds of channels: queue channels and value channels.
- Data in value channels can be used multiple times in a workflow.
- Data in queue channels are consumed when they are used by a process or an operator.
- Channel factory methods, such as `Channel.of`, are used to create channels.
- Channel factory methods have optional parameters, e.g. `checkIfExists`, that can be used to alter the creation and behaviour of a channel.
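As a small sketch of the two kinds of channel and of a factory option, consider the example below. The sample names, the threshold value and the `data/*.csv` path are placeholders, and the `fromPath` line is commented out so that the script runs without any input files.

```nextflow
workflow {
    // Queue channel: emits each item in turn
    samples_ch = Channel.of('sample_a', 'sample_b', 'sample_c')

    // Value channel: holds a single value that can be read repeatedly
    threshold_ch = Channel.value(0.05)

    // checkIfExists makes the factory fail early if no file matches.
    // The path is a hypothetical placeholder, so the line is left commented out.
    // data_ch = Channel.fromPath('data/*.csv', checkIfExists: true)

    samples_ch.view()
    threshold_ch.view()
}
```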
Modules
- A Nextflow module is an independent step in a workflow.
- Modules contain up to five definition blocks: directives, inputs, outputs, a when clause and finally a script block.
- The script block contains the commands you would like to run.
- A module should have a script but the other four blocks are optional.
- Inputs are defined in the input block with a type qualifier and a name.
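The layout described above might look like the following sketch. The process name `COUNT_LINES` and the input name `data_file` are hypothetical, and the optional when clause is left out here.

```nextflow
process COUNT_LINES {
    // Directives: optional settings for the task
    cpus 1

    // Input block: type qualifier 'path', name 'data_file'
    input:
    path data_file

    // Output block: capture whatever the script prints
    output:
    stdout

    // Script block: the commands to run
    script:
    """
    wc -l < ${data_file}
    """
}
```

Only the script block is essential; removing the directive or the output block would still leave a valid definition.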
Modules Part 2
- Outputs of a process are defined using the output block.
- You can group input and output data from a process using the tuple qualifier.
- The execution of a process can be controlled using the `when` declaration and conditional statements.
- Files produced within a process and defined as `output` can be saved to a directory using the `publishDir` directive.
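A sketch combining these features in one definition is given below. The process name `SUMMARISE`, the `results` directory, the variable names and the `control` condition are all assumptions made for illustration.

```nextflow
process SUMMARISE {
    // Copy declared outputs into the 'results' directory
    publishDir 'results', mode: 'copy'

    // The tuple qualifier keeps the sample ID and its file together
    input:
    tuple val(sample_id), path(data_file)

    output:
    tuple val(sample_id), path("${sample_id}_summary.txt")

    // Only run the task when the condition holds
    when:
    sample_id != 'control'

    script:
    """
    wc -l < ${data_file} > ${sample_id}_summary.txt
    """
}
```

Only files declared in the output block are published; other files written by the script stay in the task's work directory.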
Workflow
- A Nextflow workflow is defined by invoking processes inside the `workflow` scope.
- A process is invoked like a function inside the `workflow` scope, passing any required input parameters as arguments, e.g. `ESTIMATION(estimation_channel)`.
- Process outputs can be accessed using the `out` attribute of the respective `process` object or by assigning the output to a Nextflow variable.
- Multiple outputs from a single process can be accessed using the list syntax `[]` and its index, or by referencing a named process output.
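The access patterns listed above can be sketched as follows. The call `ESTIMATION(estimation_channel)` comes from the text; the body of the process, the output names `estimates` and `logs`, and the channel contents are hypothetical.

```nextflow
process ESTIMATION {
    input:
    val sample_id

    output:
    path "${sample_id}.txt", emit: estimates
    path "${sample_id}.log", emit: logs

    script:
    """
    echo "estimate for ${sample_id}" > ${sample_id}.txt
    echo "done" > ${sample_id}.log
    """
}

workflow {
    estimation_channel = Channel.of('s1', 's2')

    // Invoke the process like a function
    ESTIMATION(estimation_channel)

    // First output, accessed by index
    ESTIMATION.out[0].view()

    // Second output, accessed by its name
    ESTIMATION.out.logs.view()
}
```

Named outputs tend to be easier to maintain than indices, because they keep working if the order of the output declarations changes.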
Operators
- Nextflow operators are methods that allow you to modify, set or view channels.
- Operators can be separated into several groups: filtering, transforming, splitting, combining, forking and maths operators.
- To use an operator, use dot notation after the Channel object, e.g. `ESTIMATION.simulation_ch.view()`.
- You can parse text items emitted by a channel that are formatted as CSV using the `splitCsv` operator.
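A short sketch of chaining operators with dot notation, including `splitCsv`, is given below. The numbers and the CSV text are made-up example data.

```nextflow
workflow {
    // Filtering then transforming: keep even numbers, then square them
    Channel.of(1, 2, 3, 4)
        .filter { n -> n % 2 == 0 }
        .map { n -> n * n }
        .view()

    // Splitting: parse CSV text into one record per row.
    // header: true uses the first line as the column names.
    Channel.of('id,value\ns1,10\ns2,20')
        .splitCsv(header: true)
        .view { row -> "${row.id}: ${row.value}" }
}
```

Each operator returns a new channel, which is what allows the calls to be chained.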
Reporting
- Nextflow can produce a custom execution report with run information using the `log` command.
- You can generate a report or timeline using the template specified by Nextflow.
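One possible way to switch on the report and the timeline is through a configuration file, as in the fragment below. The output file names are placeholders. The `-with-report` and `-with-timeline` command-line options do the same for a single run, and `nextflow log` on its own lists previous runs.

```nextflow
// nextflow.config (fragment); the file names are placeholders

report {
    enabled   = true
    file      = 'execution_report.html'
    overwrite = true   // replace the report left by a previous run
}

timeline {
    enabled   = true
    file      = 'execution_timeline.html'
    overwrite = true
}
```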
Nextflow configuration
- Nextflow configuration can be managed using a Nextflow configuration file.
- Nextflow configuration files are plain text files containing a set of properties.
- You can define process-specific settings, such as cpus and memory, within the `process` scope.
- You can assign different resources to different processes using the process selectors `withName` or `withLabel`.
- You can define a profile for different configurations using the `profiles` scope. These profiles can be selected when launching a pipeline execution by using the `-profile` command-line option.
- Nextflow configuration settings are evaluated in the order they are read in.
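The scopes and selectors named above could be combined as in this sketch of a configuration file. The label `big_mem`, the profile names and all resource values are hypothetical; `ESTIMATION` stands for any process name in the pipeline.

```nextflow
// nextflow.config (sketch)

process {
    // Defaults applied to every process
    cpus   = 1
    memory = '2 GB'

    // Override for one process, selected by name
    withName: 'ESTIMATION' {
        cpus = 4
    }

    // Override for every process carrying this label
    withLabel: 'big_mem' {
        memory = '16 GB'
    }
}

profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
    }
}
```

With this file in place, a run could select the second profile with `nextflow run main.nf -profile cluster`, where `main.nf` is a placeholder script name.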
Auxiliary Tools
- The Docker Hub is an online repository of container images.
- Find a container recipe file that works for your project and customise it.
- Nextflow can pull a Docker image from Docker Hub and convert it to an Apptainer image.
- Docker is not permitted on most HPC environments; Apptainer SIF files are used instead.
- Containers are important to reproducible workflows and portability of workflows across environments.
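As a sketch of how this might be configured, the fragment below enables Apptainer and points every process at an image hosted on Docker Hub. The image `ubuntu:22.04` is only a placeholder for whichever container suits the project.

```nextflow
// nextflow.config (fragment)

apptainer {
    enabled    = true
    autoMounts = true
}

process {
    // The docker:// prefix asks for a Docker Hub image,
    // which is pulled and converted to an Apptainer image
    container = 'docker://ubuntu:22.04'
}
```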
Resuming a Workflow
- Nextflow automatically keeps track of all the processes executed in your pipeline via checkpointing.
- Nextflow caches intermediate data in task directories within the work directory.
- Nextflow caching and checkpointing allow re-entrancy into a workflow after a pipeline error or when using new data, skipping steps that have already been executed successfully.
- Re-entrancy is enabled using the `-resume` option.
Portability of Workflow
- Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system.
- The Nextflow configuration file can help define the target platform on which we intend to run our workflow.
- We can specify a profile for our target platform through the `-profile` option.