Summary and Schedule

Computational social science (CSS) brings computational approaches to social science questions. Nextflow is workflow management software that enables scalable and reproducible scientific workflows. This half-day workshop motivates the use of Nextflow for operationalising reproducible social science research.

This is a student-led introduction to computational workflows. No previous knowledge of Nextflow or other workflow software is required.

Checklist

Optional

It helps to be familiar with a programming language, to the level of Plotting and Programming in Python or R for Reproducible Scientific Analysis, although this lesson does not rely specifically on Python or R. Software Carpentry Lessons lists a fuller set of recommended courses and resources to explore.

The workshop offers an overview of Nextflow. Nextflow integrates with software package and environment management systems such as Docker, Singularity, and Conda. It allows existing pipelines written in common scripting languages, such as R and Python, to be coupled together. It also simplifies implementing and running workflows on cloud or high-performance computing (HPC) infrastructure.

Explore the Material

Screenshot of the workshop website navigation interface showing a sidebar menu with links to different sections such as Introduction, Prerequisites, and Explore the Material. The main content area displays instructions for navigating the material.

Schedule

The schedule for the workshop is listed below.

Each section is listed with its time, followed by the topics covered.

1. Introduction (00h 25m)
  • What are the FAIR research principles?
  • How do FAIR principles apply to software?
  • How does folder organisation help me?

2. Hello Nextflow (00h 50m)
  • What is Nextflow?
  • Why should I use a workflow management system?
  • What are the features of Nextflow?
  • What are the main components of a Nextflow script?
  • How do I run a Nextflow script?

Break (10m)

3. Parameters (01h 00m)
  • How can I change the data a workflow uses?
  • How can I parameterise a workflow?
  • How can I add my parameters to a file?

4. Channels (01h 40m)
  • How do I move data around in Nextflow?
  • How do I handle different types of input, e.g. files and parameters?
  • How can I use pattern matching to select input files?

Break (10m)

5. Modules (02h 00m)
  • How do I run tasks/modules in Nextflow?
  • How do I get data, files and values, into a module?

Finish Introductory Material (02h 20m)

6. Modules Part 2 (optional)
  • How do I get data, files, and values, out of processes?
  • How do I handle grouped input and output?
  • How can I control when a process is implemented?
  • How do I control resources, such as number of CPUs and memory, available to processes?
  • How do I save output/results from a process?

7. Workflow (optional)
  • How do I connect channels and processes to create a workflow?
  • How do I invoke a process inside a workflow?

8. Operators (optional)
  • How do I perform operations, such as filtering, on channels?
  • What are the different kinds of operations I can perform on channels?
  • How do I combine operations?
  • How can I use a CSV file to process data into a Channel?

9. Reporting (optional)
  • How do I get information about my pipeline run?
  • How can I see what commands I ran?
  • How can I create a report from my run?

10. Nextflow configuration (optional)
  • How do I configure a Nextflow workflow?
  • How do I assign different resources to different processes?
  • How do I separate and provide configuration for different computational systems?

11. Auxiliary Tools (optional)
  • When should I use a pre-built container?
  • How can I customise a container?
  • What is a remote codespace?

12. Resuming a Workflow (optional)
  • How can I restart a Nextflow workflow after an error?
  • How can I add new data to a workflow without starting from the beginning?
  • Where can I find intermediate data and results?

13. Portability of Workflow (optional)
  • How can I move my analysis to a computer cluster?

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.


Set-up Material

To follow along with the practical component, we recommend using GitHub Codespaces, which requires a stable internet connection. If you are not signed in to GitHub, you may be prompted to sign in when you open the material in GitHub Codespaces.

Step 1: Set up Coding Environment

Step 2: Clear Template Content

BASH

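# Create a folder to hold the template files
mkdir templates

# Move the existing (non-hidden) files and folders into it
mv * templates/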
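
Because `*` also matches the new `templates` folder, `mv` reports that it cannot move a directory into itself; the other files are still moved. A minimal sketch that avoids the message is given here, assuming it is run from the top-level folder of the codespace. The function name `move_into_templates` is illustrative and not part of the workshop material.

```bash
# Hypothetical helper: move every visible item in the current directory
# into templates/, skipping the templates folder itself.
move_into_templates() {
  mkdir -p templates
  for item in *; do
    # Skip the destination folder
    if [ "$item" = "templates" ]; then
      continue
    fi
    # Skip the literal "*" left behind when the directory is empty
    if [ ! -e "$item" ] && [ ! -L "$item" ]; then
      continue
    fi
    mv -- "$item" templates/
  done
}
```

Calling `move_into_templates` has the same effect as the two commands in Step 2. As with `mv *`, hidden files such as `.git` are left in place, because `*` does not match names that start with a dot.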

Step 3: Load Material for Workshop

BASH

git clone --branch workflow-scripts --single-branch https://github.com/omiridoue/sgsss-workflow.git

Step 4: Change Working Directory

BASH


cd sgsss-workflow
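
After Step 4 it can be reassuring to confirm that the clone landed on the expected branch. The sketch below is one way to do this, assuming `git` is available in the terminal; the function name `current_branch` is illustrative.

```bash
# Hypothetical helper: print the branch checked out in the current repository.
current_branch() {
  git symbolic-ref --short HEAD
}
```

Run `current_branch` inside `sgsss-workflow`. If Step 3 completed, it should print `workflow-scripts`, the branch named in the `git clone` command.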

Online Learning Environment

GitHub Codespaces is a cloud development environment for teams to develop software efficiently and securely. We use it for training because it provides a consistent and thoroughly tested environment. It requires an internet connection and is accessed through your web browser.

You can create a free GitHub account from the GitHub home page. You can upgrade your account to an Education account from the GitHub Education home page using your institutional or student email address.

Running GitHub Codespaces

You can click the “Open in GitHub Codespaces” button, which is displayed on many pages of the training portal.

Once you are logged in to GitHub, you can open this link in your browser to open the training environment: https://codespaces.new/nextflow-io/training?quickstart=1&ref=master.

You should be presented with a page where you can create a new GitHub Codespace. You can click “Change options” to configure the machine used.

Using a machine with more cores allows you to take greater advantage of Nextflow’s ability to parallelize workflow execution.

For the hands-on component, we recommend using a 4-core machine.
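
Once the codespace is running, you can confirm how many cores it has from the terminal. This is a small sketch rather than part of the workshop material: it assumes either `nproc` (GNU coreutils) or `getconf` is available, and the function name `count_cores` is illustrative.

```bash
# Hypothetical helper: report the number of CPU cores available to this machine.
# nproc ships with GNU coreutils; getconf is used as a fallback.
count_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  else
    getconf _NPROCESSORS_ONLN
  fi
}

echo "Available cores: $(count_cores)"
```

On the recommended machine type this should report 4 cores.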

The free GitHub plan includes 120 core-hours of Codespaces compute per month, which amounts to 30 hours of a 4-core machine. Opening a new GitHub Codespaces environment for the first time can take several minutes.

Explore GitHub Codespaces

After GitHub Codespaces has loaded, you should see the welcome page:

GitHub Codespaces welcome

This is the interface of the VSCode IDE, a popular code development application that we recommend using for Nextflow development.

  • The sidebar allows you to customize your GitHub Codespace environment and perform basic tasks (copy, paste, open files, search, git, etc.). You can click the explorer button to see which files are in this repository.
  • The terminal allows you to run all the programs in the repository. For example, both nextflow and docker are installed and can be executed.
  • The file explorer allows you to view and edit files. Clicking on a file in the explorer will open it within the main window.
  • The main editor shows a preview of the README.md file. When you open code or data files, they will open there.
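
To check from the terminal that the tools mentioned above are available, something like the following sketch can be used. The function name `check_tools` is illustrative; it relies only on the shell built-in `command -v`.

```bash
# Hypothetical helper: report whether each named command is on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: not found"
    fi
  done
}

check_tools nextflow docker
```

In the training environment both tools are expected to be reported as found.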

Reopening a GitHub Codespaces session

Once you have created an environment, you can easily resume or restart it and continue from where you left off. Your environment will time out after 30 minutes of inactivity and will save your changes for up to 2 weeks.

You can reopen an environment from https://github.com/codespaces/.

Previous environments will be listed. You can manage these sessions by freezing or removing them. For now, click a session to resume it, but be mindful of your usage. If you have saved the URL of your previous GitHub Codespaces environment, you can simply open it in your browser.

Alternatively, click the same button that you used to create it in the first place:

Open in GitHub Codespaces

You should see the previous session; the default option is to resume it:

Resume a GitHub Codespace

Saving files from GitHub Codespaces to your local machine

To save any file from the explorer panel, right-click the file and select Download.

GitHub Codespaces quotas

GitHub Codespaces gives you up to 15 GB-months of storage and 120 core-hours of compute per month. This is equivalent to around 60 hours of runtime in the default environment using the standard workspace (up to 2 cores, 8 GB RAM, and 32 GB storage).

GitHub Codespaces environments are configurable. You can create them with more resources, but this will consume your free usage faster and you will have fewer hours of access to this space.
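
The trade-off between machine size and available hours is a simple division of the monthly allowance by the number of cores. The sketch below works through the figures quoted in this lesson; the function name `hours_for_cores` is illustrative, and the 120 core-hour allowance is the value stated above.

```bash
# Hypothetical helper: hours of runtime that a monthly core-hour allowance
# covers for a machine with the given number of cores (integer division).
hours_for_cores() {
  allowance=$1
  cores=$2
  echo $(( allowance / cores ))
}

echo "2-core machine: $(hours_for_cores 120 2) hours"
echo "4-core machine: $(hours_for_cores 120 4) hours"
```

This prints 60 hours for the 2-core default and 30 hours for the 4-core machine recommended for the hands-on component.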

More information can be found in the GitHub docs: About billing for GitHub Codespaces