
Reproducibility in Bioinformatics

Deb Triant & Marcus Bobar

Research Computing, University of Virginia dtriant@virginia.edu, mb5wt@virginia.edu


Introductions

Workshop outline

Difficulties in achieving reproducibility

Potential problems with bioinformatics pipelines

Some helpful tools

Snakemake & Nextflow examples

Reproducibility in science

  • Reproducibility - redo a scientific experiment & generate similar results
    • Same sample, software, data, code - same result?
  • Replication - different data, same methods - conclusions consistent?
  • Reusability - Will someone be able to use your pipeline in the future?
    • Will you be able to use it?

Reproducibility Problem

Where did you do the analysis - laptop, server, lab computer? Which environment?

Are you using the most recent versions of scripts, datasets, and analyses?

“We just used the default settings!”

Studies in reproducibility

Nature (2016) - Survey found that more than 70% of researchers have failed to reproduce another researcher’s results & more than half have failed to reproduce their own

PLoS Biology (2024) - 72% of surveyed biomedical researchers reported a “reproducibility crisis”

Genome Biol (2024) - Reproducibility in the bioinformatics era

Challenges of Bioinformatics

  • So many tools, often with:
    • Multiple versions & releases
    • Complex dependencies, hidden parameters & random starting seeds
    • Running tools locally vs on HPC
    • Formatting conversions between software
    • Scalability - how tools handle datasets increasing in size
    • Keeping code organized!

Aspects of reproducibility

Version control

Environment management

Data storage

Containers

Tool/software maintenance

Saving document versions

Version Control

GitHub: https://github.com

Track and manage changes to your code & files

Store and label changes at every step

Small or large projects

Collaborate on projects and minimize conflicting edits

Works on multiple platforms (MacOS, Windows, Linux)
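The tracking-and-labeling workflow above can be sketched in a few git commands (the directory, file name, and commit message here are illustrative):

```shell
# Minimal git workflow sketch - filenames and messages are illustrative
cd "$(mktemp -d)"                        # work in a scratch directory
git init -q .                            # start tracking this directory
git config user.name "Your Name"         # identify yourself to git (once per repo, or globally)
git config user.email "you@virginia.edu"
echo 'cutadapt -a AACCGGTT -o out.fastq in.fastq' > trim.sh
git add trim.sh                          # stage the new script
git commit -q -m "Add trimming script"   # record a labeled snapshot
git log --oneline                        # one line per recorded change
```

Each commit stores a labeled snapshot, so any earlier version of a script can be recovered later.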


[Screenshots: GitHub website, cutadapt repository]

Environment Management

  • Conda/Mamba environments
    • Isolated spaces for each project with specific tool versions
    • Manage Python versions and dependencies
    • Install packages and software directly into environment
    • Stable and reproducible place to run code and applications
    • Not limited to Python - can also run bash or Rscript
    • YAML configuration file to create or export and transfer an environment
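A YAML environment file for this might look like the following sketch (the environment name and package versions are illustrative):

```yaml
# example environment.yml - name and versions are illustrative
name: variant_calling
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - cutadapt=4.9
  - bwa=0.7.17
```

Create the environment with `conda env create -f environment.yml`, and export an existing one with `conda env export > environment.yml` so it can be rebuilt elsewhere.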

Storing results

  • Public repositories for sequence data - required by most journals
    • NCBI: https://www.ncbi.nlm.nih.gov
    • Ensembl: https://www.ensembl.org/index.html
  • Always document and archive changes, especially if unpublished:
    • genome assembly versions
    • sequence data: SNPs, isoforms


[Screenshots: NCBI, Ensembl, and UCSC Santa Cruz websites]

Containers

Portable environments that run across different computing systems

Contain packages, software and dependencies that remain isolated from host infrastructure

A standalone unit of software that can produce the same results on a different machine or server

Ruoshi Sun - Research Computing Workshop Series

1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM

2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM

Bioinformatic Pipelines

  • Typical bioinformatics workflows involve many steps:
    • FASTQ → QC → Alignment → Sorting → Variant Calling → Annotation
    • FASTQ files need quality checking and trimming
  • Example tools for these steps:
    • Cutadapt
    • BWA
    • Samtools
    • FreeBayes
    • VCFtools
  • Create a pipeline to string the software together into a “final” output

Bioinformatic Pipeline challenges

Complex dependencies between steps

Formatting inconsistencies

Hard to reproduce results - scalability, parameters, version changes

Difficult to parallelize efficiently

Manual scripts often fail on HPC

Bioinformatic Pipelines on HPC

Which modules were loaded?

Where are the scripts being run?

Tracking paths - are they hard-coded in scripts?

Out/error files - software vs Slurm conflicts

Goal: Automate and track these workflows

Snakemake

https://snakemake.github.io/

Snakemake is a workflow management system designed for scientific pipelines

Created by Johannes Köster, first released in 2012

Based on UNIX make - originally created in 1976 and still in standard use

Python based - “snake” + “make”

Free and open source, available on Mac, Windows, Unix

https://snakemake.readthedocs.io/en/stable/

https://github.com/snakemake


Make is a command-line interface software tool that performs actions ordered by configured dependencies as defined in a configuration file called a makefile. It is commonly used for build automation to build executable code from source code. 

Snakemake format

Similar to writing shell scripts, but a snakefile contains sets of rules

Format is based on Python structure

Snakemake reads from a snakefile that defines the rules

Each snakefile rule has a target output

Snakemake uses pattern matching to follow the inputs, outputs and commands contained in rules to reach the final target output
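As a sketch of that pattern matching, a single generalized rule can cover every sample via a wildcard (the `{sample}` wildcard name, paths, and adapter are illustrative):

```snakemake
# One generalized rule replaces a per-sample copy - paths and adapter are illustrative
rule trim:
    input:
        "reads/{sample}.fastq"
    output:
        "trimmed_reads/{sample}-trimmed.fastq"
    shell:
        "cutadapt -a AACCGGTT -o {output} {input}"
```

Asking Snakemake for `trimmed_reads/sample1-trimmed.fastq` matches `{sample}` to `sample1` and runs the rule with the corresponding input file.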

Snakemake Core Idea

Instead of defining steps, you define rules that produce files.

rule align:
    input:
        "reads.fastq"
    output:
        "aligned.bam"
    shell:
        "bwa mem ref.fa {input} | samtools view -b - > {output}"

Snakemake builds a directed acyclic graph (DAG) automatically.

Fastq → Cutadapt → BWA → Sorted BAM → Freebayes → VCF

Recommended Pipeline Directory Structure

Benefits:

separates workflow logic from data

easier debugging

easier collaboration

Common practice:

config/ → parameters and sample tables

envs/ → reproducible environments

rules/ → modular workflow steps

results/ → generated outputs

Example:

bioinformatics_pipeline/
├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/

A clean directory structure makes pipelines easier to maintain and reproduce.


.yml file can indicate how to make conda environment and what packages and dependencies you need

Snakefile breakdown

Fastq files that need trimming - input: sample.fastq

Cutadapt - output: sample-trimmed.fastq

BWA - align trimmed fastq to assembly - output: sample-aligned.sam

Samtools sorting, indexing - output: sample-sorted.bam

Freebayes variant calling - output: sample-variants.vcf

Example snakefile

rule all:
    input:
        "variants/sample1.vcf"

rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    shell:
        "cutadapt -a TCCGGGTS -o {output} {input}"

rule align:
    input:
        "trimmed_reads/sample1-trimmed.fastq"
    output:
        "bam/sample1.bam"
    threads: 1
    shell:
        "bwa mem -t {threads} ref.fa {input} | samtools view -Sb - > {output}"

rule call_variants:
    input:
        "bam/sample1.bam"
    output:
        "variants/sample1.vcf"
    shell:
        "freebayes -f ref.fa {input} > {output}"

Snakemake takes the first rule as the target,

then constructs a graph of dependencies

{wildcards} serve as placeholders within rules to operate on multiple files via pattern matching


Snakemake builds the entire pipeline graph automatically.

Snakemake exercises on HPC

Class data:

/project/hpc_training/reproducibility/snakemake

$ cp -r /project/hpc_training/reproducibility/snakemake .

    - GCF_000005845.2_ASM584v2_genomic.fna - genome assembly

    - SRR2584863_1.fastq - fastq sequence file, paired-1

    - SRR2584863_2.fastq - fastq sequence file, paired-2

    - *.smk - snakemake files

    - config_variant.yml - configuration file

    - submit_snakemake.sh - sample slurm file


YAML - originally “Yet Another Markup Language,” now “YAML Ain’t Markup Language”

Running jobs on interactive node

Run interactively - good for testing

$ ijob -c 1 -A hpc_training -p interactive -v -t 2:00:00

$ cp -r /project/hpc_training/reproducibility/snakemake .


Default execution here is local, so everything is running in my ijob session on a compute node. If we wanted these processes to run non-interactively, we would make sure to use the executor flag in our snakemake call: “--executor slurm”

Work in scratch

Modules

$ module spider <package>

- shows specifics and available versions of a package

$ module spider snakemake

$ module load snakemake/9.8.1

$ module list

$ snakemake --help

Other modules needed for today

$ module load bwa/0.7.17

$ module load cutadapt/4.9

$ module load snakemake/9.8.1

$ module load freebayes/1.3.10

$ module load samtools/1.21

Running snakemake - genome alignment

Snakefile - file.smk, contains rules for snakemake

$ snakemake -c 1 -s align.smk

--dry-run (-np) - good to test first without producing output

-n only show steps, don’t run; -p print shell commands

-c number of cores

-s needed if using a named snakefile (if it’s just called “Snakefile”, the -s flag isn’t needed)

$ snakemake --dag -s align.smk | dot -Tpng > dag_align.png

Running snakemake - variant detection

Snakefile - file.smk, contains rules for snakemake

$ snakemake -c 1 -s variant-call.smk

--dry-run

-c number of cores

-s needed if using a named snakefile (if just called “Snakefile”, the -s flag isn’t needed)

$ snakemake --dag -s variant-call.smk | dot -Tpng > dag_variant.png

Snakemake Examples on HPC

Not recommended to hard-code files within a snakefile

Can organize sample names, file paths, and software parameters in a YAML configuration file

YAML - serialization language that transforms data into a format that can be shared between systems

With snakemake, configuration file is a reference for the workflow


YAML - originally “Yet Another Markup Language,” now “YAML Ain’t Markup Language.” Easy to keep things organized within a single file. Often better to have separate config files rather than one huge one with sections commented out.
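A configuration file for the workflow might look like the following sketch (the keys, sample names, and values here are illustrative, not the contents of the class config_variant.yml):

```yaml
# sketch of a workflow config file - keys and values are illustrative
samples:
  - SRR2584863
genome: "GCF_000005845.2_ASM584v2_genomic.fna"
adapter: "AACCGGTT"
```

Inside a rule, these values are available through Snakemake’s `config` dictionary, e.g. `config["genome"]`, so changing a path or parameter means editing one file instead of every rule.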

Running snakemake with config file

Snakefile - file.smk, contains rules for snakemake

$ snakemake -c 1 -s variant-yml.smk --configfile config_variant.yml

--configfile - directs snakemake to a config file

-c number of cores

-s needed if using a named snakefile

Reproducible environments

Snakemake supports reproducible environments.

Example with Conda:

rule fastqc:
    input:
        "reads.fastq"
    output:
        "qc.html"
    conda:
        "~/.conda/envs/fastqc_env"  # path to conda environment
    shell:
        "fastqc {input}"

Benefits: Easy dependency management, portable workflows


.yml file can indicate how to make conda environment and what packages and dependencies you need

Using Environments

├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/

Can also create an environment .yml file listing the conda channels and what to install

name: bwa

channels:

- conda-forge

- bioconda

dependencies:

- bwa=0.7.17


.yml file can indicate how to make conda environment and what packages and dependencies you need

Snakemake with conda environment

$ module load miniforge

$ conda create -n <env_name>

$ conda activate <env_name>

$ snakemake <command>

screen / tmux

- keeps the session running when disconnected

- make sure to reconnect to the same login node

- confirm the login node with: hostname

Can create different conda environment for different rules
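A sketch of per-rule environments (the environment file names and adapter are illustrative); running with `--use-conda` (or `--sdm conda` in newer Snakemake releases) tells Snakemake to build and activate each environment as needed:

```snakemake
# each rule gets its own environment file - file names and adapter are illustrative
rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    conda:
        "envs/cutadapt.yml"
    shell:
        "cutadapt -a AACCGGTT -o {output} {input}"

rule align:
    input:
        "trimmed_reads/sample1-trimmed.fastq"
    output:
        "bam/sample1.bam"
    conda:
        "envs/bwa.yml"
    shell:
        "bwa mem ref.fa {input} | samtools view -b - > {output}"
```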

Snakemake and containers

Snakemake also supports containers:

rule align:
    container:
        "docker://biocontainers/bwa"

Advantages:

identical software environments

portable across HPC systems

easier collaboration

Ruoshi Sun - Research Computing Workshop Series

1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM

2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM

Best Practices for HPC

Recommendations:

Use threads and resources properly

Avoid huge single jobs

Break workflows into modular rules

Use conda or containers

Use --dry-run before submitting large workflows

Store configuration in YAML files
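The threads-and-resources recommendation can be sketched inside a rule (the values here are illustrative), so Snakemake can report them to the scheduler instead of them being hard-coded in shell commands:

```snakemake
# resources declared per rule - values are illustrative
rule align:
    input:
        "trimmed_reads/sample1-trimmed.fastq"
    output:
        "bam/sample1.bam"
    threads: 4
    resources:
        mem_mb=8000
    shell:
        "bwa mem -t {threads} ref.fa {input} | samtools view -b - > {output}"
```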

Common HPC Pitfalls with workflow managers

Examples:

requesting too many cores per rule

forgetting to specify memory

submitting thousands of tiny jobs

running Snakemake or Nextflow themselves on a login node

Key Takeaways with workflow managers

Snakemake & Nextflow provide:

reproducible pipelines

automatic dependency tracking

scalable HPC execution

environment management

workflow portability

Nextflow


What is Nextflow?

Nextflow is a workflow management system that helps automate and organize multi-step computational pipelines.

At a high level, it connects software steps together, manages how data moves between them, and handles execution across local machines, HPC schedulers like SLURM, or cloud platforms.

Nextflow Pipelines

  • Key concepts:
    • Processes, workflows, and parameters
  • In general, we are going to:
    • Create processes to execute desired commands
    • Specify parameters to represent workflow settings
    • Define a workflow to execute processes in a specific order
  • Key files:
    • main.nf and nextflow.config


Parameters are user-adjustable values that control how a workflow runs. They can specify input files, output locations, software options, reference files, or general pipeline behavior.
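A parameter sketch at the top of main.nf might look like this (the parameter names and default values are illustrative); values given on the command line override these defaults:

```nextflow
// defaults declared at the top of main.nf - names and values are illustrative
params.reads   = 'reads/*.fastq'   // input files
params.outdir  = 'results'         // where published outputs go
params.adapter = 'AACCGGTT'        // cutadapt adapter sequence
```

Running `nextflow run main.nf --reads 'other/*.fastq'` then overrides `params.reads` for that run.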

Toy example: print the text “Hello World!”

First, create a process called HELLO with our shell command:

process HELLO {

    script:
    """
    echo "Hello World!"
    """
}

Then we execute this process in our workflow:

workflow {
    HELLO()
}


Let’s start with a very simple toy example for echo’ing the text “Hello World!” And then we’ll build to our bioinformatics example.

Create a new file called main.nf

process HELLO {

    script:
    """
    echo "Hello World!"
    """
}

workflow {
    HELLO()
}


We can create a new file called main.nf with these lines.

Show and execute main.nf in the terminal. Show where the output goes: it went to the .command.out file in the ‘work’ directory for that specific process.

Let’s make some changes

process hello {

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}


We want to send the text to a file called ‘hello.txt’. Now we can update our shell command to redirect the text to a file, and we can add an output block to our process to define the file name; since our output is a file, we specify the output type as a path.

Run main.nf in terminal and show it still went to ‘work’ directory

This was better, but we still have to dig around for the file, so let’s add one more thing to our process.

Add a publishDir for output file destination

process hello {

    publishDir "results/", mode: "copy"

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}


Now let’s try sending our output to a directory called ‘results’ - we can add a publishDir to our process and specify the mode. “copy” is safest, but you can do other things like move the file or even create links to it. Re-run main.nf in the terminal and show that the file goes to results; since we used copy, it also still goes to work. Point out that we need to be mindful of any extra data we’re creating so we don’t unnecessarily duplicate everything.

Let’s look at our snakemake “trim” rule from earlier

rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    shell:
        "cutadapt -a TCCGGGTS -o {output} {input}"


Here we specified our inputs/outputs and our shell command.

What do we need to update in Nextflow?

process HELLO {

    publishDir "results/", mode: "copy"

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}


So, looking at our HELLO process, what do we need to add? We already have a publishDir, an output, and script, so let’s update those for cutadapt.

Update for running cutadapt

process CUTADAPT {

    publishDir "results/", mode: "copy"

    output:
    path 'trimmed.fastq'

    script:
    """
    cutadapt -a AACCGGTT -o trimmed.fastq ~/sample1.fastq
    """
}

workflow {

CUTADAPT()

}


We can keep ‘results’ as our publishDir for this example, but we’ll need to change our output to trimmed.fastq, and we’ll change the command to cutadapt with our adapter and our input and output file names. Because Nextflow executes each task in its own work directory, we need to provide the full path. Our workflow just becomes running the CUTADAPT process. Does this work? Yes, it does. However, we are hard-coding everything, which is not really flexible and does not allow us to scale.

More common approach for input files

process CUTADAPT {

    publishDir "results/", mode: "copy"

    input:
    path reads_var

    output:
    path 'trimmed.fastq'

    script:
    """
    cutadapt -a AACCGGTT -o trimmed.fastq $reads_var
    """
}

workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}


A better approach is to pass the file into the process with Channel.fromPath() and use input: path reads_var. The “input:” block declares an input variable, not a literal source file location. We then use this variable in our shell command: $reads_var refers to the local process input variable, i.e. the actual input file that was provided to Nextflow for this task via our workflow. We can also use the variable for other things, like dynamically naming output files.

Dynamically scaling to many samples

process CUTADAPT {

    publishDir "results/", mode: "copy"

    input:
    path reads_var

    output:
    path "${reads_var.simpleName}_trimmed.fastq"

    script:
    """
    cutadapt -a AACCGGTT -o ${reads_var.simpleName}_trimmed.fastq $reads_var
    """
}

workflow {
    CUTADAPT(Channel.fromPath('*.fastq', checkIfExists: true))
}


Now we can start to use the flexibility nextflow provides to name our output files dynamically based on sample name and we also can start to scale up by using the wildcard to grab all the fastq files in our example ‘reads’ directory. Here nextflow is going to create a new separate process for each of our samples.

Parameter options for input files

Add a parameter for ‘--reads’ in your ’nextflow run’ command

Add a params.reads at the top of your main.nf file

Add a params.reads to a nextflow.config file

Works for one file (‘reads/sample1.fastq’) or many (‘reads/*.fastq’)


As with many things in Nextflow, we have multiple different ways we can accomplish this. We’ll talk about nextflow.config shortly.
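One of those options is a params block in nextflow.config (a sketch; the names and values are illustrative), which keeps main.nf free of hard-coded paths:

```nextflow
// nextflow.config - illustrative defaults
params {
    reads   = 'reads/*.fastq'
    outdir  = 'results'
    adapter = 'AACCGGTT'
}
```

main.nf can then reference `params.reads` directly, and a run-specific value can still be supplied with `--reads` on the command line.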

Less hard-coding = more reproducibility

From:

workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}

To:

workflow {
    CUTADAPT(Channel.fromPath(params.reads, checkIfExists: true))
}


And if we use one of those parameter methods, instead of our workflow having a hard-coded path for our inputs, we can dynamically provide our input file names and clean things up in our workflow even further.

Loading software – main.nf

Use a ‘beforeScript’ in the CUTADAPT process in main.nf

beforeScript runs specified shell command(s) before running the script command

Load the cutadapt module: beforeScript 'module load cutadapt'

Can also do other things like export variables or create directories

beforeScript """
    module purge
    module load cutadapt
    mkdir results
    export PATH="$PATH:/opt/tools"
"""


We can definitely load the software in our process, but we just cleaned that thing up, so let’s put it somewhere better to keep our main.nf focused on workflow logic. To do this, let’s go ahead and start to build a nextflow.config file.

Loading software – nextflow.config

Again, we use a ‘beforeScript’ specific to the CUTADAPT process

process {
    withName: CUTADAPT {
        beforeScript = '''
            module purge
            module load cutadapt
        '''
    }
}


Now when we specifically run our CUTADAPT process, these commands will run before our script command and set up our process environment. Ok, so now we have our software dialed in for cutadapt. But we need to think about where we are running these processes. By default, nextflow is running shell commands locally, so that means if we’re just at the command line, we’d be running the processes on the login nodes, which is a no-no.

Adding SLURM options – nextflow.config

process {
    withName: CUTADAPT {
        beforeScript = '''
            module purge
            module load cutadapt
        '''
        executor = 'slurm'
        queue = 'standard'
        cpus = 2
        memory = '16 GB'
        time = '1h'
        clusterOptions = '--account=hpc_build'
    }
}


So, we need to let Nextflow know that we want to use SLURM to execute our processes, and we do this by specifying SLURM as our executor. We can also use this block to specify various other options - Nextflow doesn’t have explicit options for every possible SLURM setting, so we can supplement with any additional options we need via ‘clusterOptions.’ Again, there are multiple ways to configure everything - you can also set global SLURM options, but often different parts of the workflow need different resources. We could also specify these SLURM options in the CUTADAPT process in our main.nf, but we’re trying to keep that tidy and focused on the workflow logic.

Now we have:

Workflow logic in main.nf

Software and slurm options in nextflow.config


As you can imagine, there are also multiple ways to set this up.

Extend to CUTADAPT → BWA_ALIGN → FREEBAYES

  • Same rules apply – largely rinse and repeat for additional processes
  • Create processes for each step: inputs/outputs, commands, etc.
  • Software and slurm options in nextflow.config
  • Main difference is our workflow - more processes and channels
    • Send channel into process
    • Process produces output
    • Output becomes new channel for next process.


With Nextflow, channels carry data and processes do work on that data. You link them together by sending a channel into a process, and if that process produces output, its output can become a new channel for the next process.

Workflow for CUTADAPT → BWA_ALIGN → FREEBAYES

workflow {

reads_ch = Channel.fromPath("${params.reads_dir}/*.fastq", checkIfExists: true)

trimmed_ch = CUTADAPT(reads_ch)

aligned_ch = BWA_ALIGN(trimmed_ch)

FREEBAYES(aligned_ch)

}


Here’s how we could link the trim, align and variant calling together. So now we’ll put it all together and run the entire workflow from end to end on the system.

Additional links

https://nf-co.re/rnaseq/3.23.0/

https://training.nextflow.io

https://github.com/nextflow-io/nextflow

Workflows for computational data analysis

  • https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
  • https://github.com/pditommaso/awesome-pipeline
  • Galaxy platform - bioinformatic software, pipeline and workflows:
    • https://usegalaxy.org