Reproducibility in Bioinformatics
Deb Triant & Marcus Bobar
Research Computing, University of Virginia dtriant@virginia.edu, mb5wt@virginia.edu

Introductions
Workshop outline
Difficulties in achieving reproducibility
Potential problems with bioinformatics pipelines
Some helpful tools
Snakemake & Nextflow examples

Reproducibility in science
- Reproducibility - redo a scientific experiment & generate similar results
- Same sample, software, data, code - same result?
- Replication - different data, same methods - conclusions consistent?
- Reusability - Will someone be able to use your pipeline in the future?
- - Will you be able to use it?

Reproducibility Problem
Where did you do the analysis? Laptop, server, lab computer - and in which environment?
Are you using the most recent versions (scripts, datasets, analyses)?
"We just used the default settings!"


Studies in reproducibility
Nature (2016) - Found that 70% of researchers have failed to reproduce another researcher's results & >50% have failed to reproduce their own
PLoS Biology (2024) - Biomedical researchers - 72% reported a "reproducibility crisis"
Genome Biol (2024) - Reproducibility in bioinformatics era


Challenges of Bioinformatics
- So many tools, often with:
- Multiple versions & releases
- Complex dependencies & hidden parameters, starting seeds
- Running tools locally vs on HPC
- Formatting conversions between software
- Scalability - how tools handle datasets increasing in size
- Keeping code organized!

Aspects of reproducibility
Version control
Environment management
Data storage
Containers
Tool/software maintenance

Saving document versions


Version Control

GitHub: https://github.com
Track and manage changes to your code & files
Store and label changes at every step
Small or large projects
Collaborate on projects and minimize conflicting edits
Works on multiple platforms (MacOS, Windows, Linux)
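A minimal sketch of the track-label-collaborate cycle with git (the directory name `my_pipeline`, the file `trim.sh`, and the tag name are hypothetical; assumes git is installed):

```shell
# Put a pipeline script under version control and label a version.
git init -q my_pipeline
git -C my_pipeline config user.email "you@virginia.edu"   # needed on a fresh machine
git -C my_pipeline config user.name "Your Name"
echo 'cutadapt -a AACCGGTT -o trimmed.fastq sample1.fastq' > my_pipeline/trim.sh
git -C my_pipeline add trim.sh
git -C my_pipeline commit -q -m "Add read-trimming step"
git -C my_pipeline tag v0.1        # label this version of the pipeline
git -C my_pipeline log --oneline   # one line per recorded change
```

Every later change to `trim.sh` gets its own commit, so any previous version of the analysis can be recovered.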

Website for github, cutadapt repository
Environment Management
- Conda/Mamba environments
- Isolated spaces for each project with specific tool versions
- Manage Python versions and dependencies
- Install packages and software directly into environment
- Stable and reproducible place to run code and applications
- Not limited to Python, can run bash, Rscript
- YAML configuration file to create or export and transfer an environment
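A sketch of such an environment file (the environment name and pinned versions here are illustrative, not from the workshop materials):

```yaml
# environment.yml - recreate with: conda env create -f environment.yml
name: varcall
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - cutadapt=4.9
  - samtools=1.21
```

`conda env export > environment.yml` captures an existing environment so a collaborator (or future you) can rebuild it exactly.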

Storing results
- Public repositories for sequence data - required for most journals
- NCBI: https://www.ncbi.nlm.nih.gov
- Ensembl: https://www.ensembl.org/index.html
- Always document and archive changes, especially if unpublished:
- - genome assembly versions
- - sequence data: SNPs, isoforms

Websites: NCBI, Ensembl, Santa Cruz
Containers
Portable environments that run across different computing environments
Contain packages, software and dependencies that remain isolated from host infrastructure
Standalone unit of software and can produce same results on different machine or server
Ruoshi Sun - Research Computing Workshop Series
1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM
2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM

Bioinformatic Pipelines
- Typical bioinformatics workflows involve many steps:
- FASTQ → QC → Alignment → Sorting → Variant Calling → Annotation
- - FASTQ files need quality check and trimming
- Cutadapt
- BWA
- Samtools
- Freebayes
- VCFtools
- Create pipeline to string software together for “final” output

Bioinformatic Pipeline challenges
Complex dependencies between steps
Formatting inconsistencies
Hard to reproduce results - scalability, parameters, version changes
Difficult to parallelize efficiently
Manual scripts often fail on HPC

Bioinformatic Pipelines on HPC
Which modules were loaded?
Where are the scripts being run?
Tracking paths - hard-coded in scripts?
Out/error files - software vs slurm conflicts
Goal: Automate and track these workflows


Snakemake
https://snakemake.github.io/
Snakemake is a workflow management system designed for scientific pipelines
Created by Johannes Köster, first released in 2012
Based on UNIX make - originally created in 1976 but still in standard use
Python-based - "snake-make"
Free and open source, available on Mac, Windows, Unix
https://snakemake.readthedocs.io/en/stable/
https://github.com/snakemake
Make is a command-line interface software tool that performs actions ordered by configured dependencies as defined in a configuration file called a makefile. It is commonly used for build automation to build executable code from source code.
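The dependency-driven idea that Snakemake inherits can be sketched in a two-line makefile (hypothetical file names; `bwa` assumed on PATH; note the command line must start with a tab character):

```make
# aligned.bam is rebuilt only when reads.fastq is newer than it
aligned.bam: reads.fastq
	bwa mem ref.fa reads.fastq > aligned.bam
```

Running `make` a second time does nothing, because the target is already up to date - the same logic Snakemake applies to every rule.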
Snakemake format
Similar to writing shell scripts, but Snakefiles contain sets of rules
Format is based on Python structure
Snakemake reads from snakefile that defines the rules
Snakefile rules have a target output
Snakemake uses pattern matching to follow the inputs, outputs and commands contained in rules to reach final target output

Snakemake Core Idea
Instead of defining steps, you define rules that produce files.
rule align:
    input:
        "reads.fastq"
    output:
        "aligned.bam"
    shell:
        "bwa mem ref.fa {input} > {output}"
Snakemake builds a directed acyclic graph (DAG) automatically.
Fastq → Cutadapt → BWA → Sorted BAM → Freebayes → VCF

Recommended Pipeline Directory Structure
Benefits:
separates workflow logic from data
easier debugging
easier collaboration
Common practice:
config/ → parameters and sample tables
envs/ → reproducible environments
rules/ → modular workflow steps
results/ → generated outputs
Example:
bioinformatics_pipeline/
├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/
A clean directory structure makes pipelines easier to maintain and reproduce.
.yml file can indicate how to make conda environment and what packages and dependencies you need
Snakefile breakdown
Fastq files that need trimming - input: sample.fastq
Cutadapt - output: sample-trimmed.fastq
BWA - align trimmed fastq to assembly output: sample-aligned.sam
Samtools sorting, indexing - output: sample-sorted.bam
Freebayes variant calling - output: sample-variants.vcf
Example snakefile
rule all:
    input:
        "variants/sample1.vcf"
rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    shell:
        "cutadapt -A TCCGGGTS -o {output} {input}"
rule align:
    input:
        "trimmed_reads/sample1-trimmed.fastq"
    output:
        "bam/sample1.bam"
    threads: 1
    shell:
        "bwa mem -t {threads} ref.fa {input} | samtools view -Sb - > {output}"
rule call_variants:
    input:
        "bam/sample1.bam"
    output:
        "variants/sample1.vcf"
    shell:
        "freebayes -f ref.fa {input} > {output}"
Snakemake takes the first rule as the target,
then constructs a graph of dependencies.
{wildcards} serve as placeholders within rules to operate
on multiple files via pattern matching.
Snakemake builds the entire pipeline graph automatically.
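As a sketch of that pattern matching (the sample names and paths here are hypothetical, not the workshop files), the trim rule generalized with a wildcard looks like:

```python
SAMPLES = ["sample1", "sample2"]   # hypothetical sample names

rule all:
    input:
        expand("trimmed_reads/{sample}-trimmed.fastq", sample=SAMPLES)

rule trim:
    input:
        "reads/{sample}.fastq"
    output:
        "trimmed_reads/{sample}-trimmed.fastq"
    shell:
        "cutadapt -A TCCGGGTS -o {output} {input}"
```

One rule now covers every sample: Snakemake matches `{sample}` against each requested output and fills in the corresponding input.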
Snakemake exercises on HPC
Class data:
/project/hpc_training/reproducibility/snakemake
$ cp -r /project/hpc_training/reproducibility/snakemake .
- GCF_000005845.2_ASM584v2_genomic.fna - genome assembly
- SRR2584863_1.fastq - fastq sequence file, paired-1
- SRR2584863_2.fastq - fastq sequence file, paired-2
- *. smk - snakemake files
- config_variant.yml - configuration file
- submit_snakemake.sh - sample slurm file

YAML - originally "Yet Another Markup Language," now "YAML Ain't Markup Language"
Running jobs on interactive node
Run interactively - good for testing
$ ijob -c 1 -A hpc_training -p interactive -v -t 2:00:00
$ cp -r /project/hpc_training/reproducibility/snakemake .

Default execution here is local, so everything runs in my ijob session on a compute node. If we wanted these processes to run non-interactively, we would make sure to use the executor flag in our snakemake call: "--executor slurm"
Work in scratch
Modules
$ module spider <package>
- specifics and version of package available
$ module spider snakemake
$ module load snakemake/9.8.1
$ module list
$ snakemake --help

Other modules needed for today
$ module load bwa/0.7.17
$ module load cutadapt/4.9
$ module load snakemake/9.8.1
$ module load freebayes/1.3.10
$ module load samtools/1.21

Running snakemake - genome alignment
Snakefile - file.smk, contains rules for snakemake
$ snakemake -c 1 -s align.smk
--dry-run (-n) - good to test first without producing output
-n only show steps, don't run; -p print shell commands (combine as -np)
-c number of cores
-s needed if using a named snakefile (if just called "snakefile", don't need the -s flag)
$ snakemake --dag -s align.smk | dot -Tpng > dag_align.png

Running snakemake - variant detection
Snakefile - file.smk, contains rules for snakemake
$ snakemake -c 1 -s variant-call.smk
--dry-run
-c number of cores
-s needed if using a named snakefile (if just called "snakefile", don't need)
$ snakemake --dag -s variant-call.smk | dot -Tpng > dag_variant.png

Snakemake Examples on HPC
Not recommended to hard-code file names within the snakefile
Can organize sample names, file paths, and software parameters in a YAML configuration file
YAML - serialization language that transforms data into a format that can be shared between systems
With snakemake, configuration file is a reference for the workflow
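A minimal sketch of what such a config file might contain (these keys and values are illustrative, not the actual contents of config_variant.yml):

```yaml
# hypothetical config file for the variant-calling workflow
samples:
  - SRR2584863
genome: GCF_000005845.2_ASM584v2_genomic.fna
adapter: TCCGGGTS
```

Inside the Snakefile, these values become available as `config["genome"]`, `config["adapter"]`, and so on once the file is passed with --configfile.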

YAML - originally "Yet Another Markup Language," now "YAML Ain't Markup Language." Easy to keep things organized within a single file. While showing: it is good to have separate config files rather than one huge one with sections commented out.
Running snakemake with config file
Snakefile - file.smk, contains rules for snakemake
$ snakemake -c 1 -s variant-yml.smk --configfile config_variant.yml
--configfile - directs snakemake to a config file
-c number of cores
-s needed if using a named snakefile

Reproducible environments
Snakemake supports reproducible environments.
Example with Conda:
rule fastqc:
    input:
        "reads.fastq"
    output:
        "qc.html"
    conda:
        "~/.conda/envs/fastqc_env"   # path to conda environment
    shell:
        "fastqc {input}"
Benefits: Easy dependency management, portable workflows
.yml file can indicate how to make conda environment and what packages and dependencies you need
Using Environments
├── Snakefile
├── config/
│   └── config.yml
├── envs/
│   └── bwa.yml
├── rules/
│   ├── alignment.smk
│   ├── qc.smk
│   └── variant_calling.smk
├── scripts/
│   └── custom_processing.py
├── data/
│   └── raw/
├── results/
│   ├── bam/
│   ├── qc/
│   └── variants/
└── logs/
Can also create an environment .yml file (here, envs/bwa.yml) listing the conda channels and packages to install:
name: bwa
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa=0.7.17

.yml file can indicate how to make conda environment and what packages and dependencies you need
Snakemake with conda environment
$ module load miniforge
$ conda create -n <env_name> <packages>
$ conda activate <env_name>
$ snakemake <options>
$ screen / tmux
- keeps session running when disconnected
- make sure to reconnect to the same login node
- confirm the login node with: hostname
Can create a different conda environment for different rules
Snakemake and containers
Snakemake also supports containers:
rule align:
    container:
        "docker://biocontainers/bwa"
Advantages:
identical software environments
portable across HPC systems
easier collaboration
Ruoshi Sun - Research Computing Workshop Series
1. Using Containers on HPC - Monday, March 30, 2026 - 9:00AM
2. Building Containers on HPC - Monday April 6, 2026 - 9:00AM
Best Practices for HPC
Recommendations:
Use threads and resources properly
Avoid huge single jobs
Break workflows into modular rules
Use conda or containers
Use --dry-run before submitting large workflows
Store configuration in YAML files

Common HPC Pitfalls with workflow managers
Examples:
requesting too many cores per rule
forgetting to specify memory
submitting thousands of tiny jobs
running Snakemake or Nextflow themselves on a login node
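The first two pitfalls can be avoided by declaring resources per rule. A sketch in Snakemake (the numbers and paths are illustrative; `mem_mb` and `runtime` in minutes are the conventional resource keys in recent Snakemake versions):

```python
rule align:
    input:
        "trimmed_reads/{sample}-trimmed.fastq"
    output:
        "bam/{sample}.bam"
    threads: 4                 # cores requested for this rule only
    resources:
        mem_mb=16000,          # memory the scheduler should reserve
        runtime=60             # wall time in minutes
    shell:
        "bwa mem -t {threads} ref.fa {input} | samtools view -Sb - > {output}"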

Key Takeaways with workflow managers
Snakemake & Nextflow provide:
reproducible pipelines
automatic dependency tracking
scalable HPC execution
environment management
workflow portability

Nextflow

What is Nextflow?
Nextflow is a workflow management system that helps automate and organize multi-step computational pipelines.
At a high level, it connects software steps together, manages how data moves between them, and handles execution across local machines, HPC schedulers like SLURM, or cloud platforms.

Nextflow Pipelines
- Key concepts:
- Processes, workflows, and parameters
- In general, we are going to:
- Create processes to execute desired commands
- Specify parameters to represent workflow settings
- Define a workflow to execute processes in a specific order
- Key files:
- main.nf and nextflow.config

Parameters are user-adjustable values that control how a workflow runs. They can specify input files, output locations, software options, reference files, or general pipeline behavior.
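As a sketch (the parameter name and default are hypothetical), a parameter gets a default in the script or config and can be overridden on the command line:

```groovy
// In main.nf or nextflow.config: default value for the reads parameter
params.reads = 'reads/*.fastq'
```

Running `nextflow run main.nf --reads 'reads/sample1.fastq'` then overrides the default without editing the workflow.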
Toy example: print the text “Hello World!”
First, create a process called HELLO with our shell command:
process HELLO {
    script:
    """
    echo "Hello World!"
    """
}
Then we execute this process in our workflow:
workflow {
    HELLO()
}

Let's start with a very simple toy example for echoing the text "Hello World!" Then we'll build up to our bioinformatics example.
Create a new file called main.nf
process HELLO {
    script:
    """
    echo "Hello World!"
    """
}

workflow {
    HELLO()
}

We can create a new file called main.nf with these lines.
Show and execute main.nf in the terminal. Show where the output goes: to the .command.out file in the 'work' directory for the specific process.
Let’s make some changes
process HELLO {
    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}

We want to send the text to a file called 'hello.txt'. We can update our shell command to redirect the text to a file, and add an output block in our process to define the file name; since the output is a file, we specify its type as a path.
Run main.nf in terminal and show it still went to ‘work’ directory
This was better, but we still have to dig around for the file, so let’s add one more thing to our process.
Add a publishDir for output file destination
process HELLO {
    publishDir "results/", mode: "copy"

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}

Now let's try sending our output to a directory called 'results' - we can add a publishDir to our process and specify the mode. "copy" is safest, but you can do other things like move the file or create links to it. Re-run main.nf in the terminal and show that the file now goes to results/ - but since we used copy, it still also appears under work/. Point out that we need to be mindful of any extra data we're creating so we don't unnecessarily duplicate everything.
Let’s look at our snakemake “trim” rule from earlier
rule trim:
    input:
        "reads/sample1.fastq"
    output:
        "trimmed_reads/sample1-trimmed.fastq"
    shell:
        "cutadapt -A TCCGGGTS -o {output} {input}"

Here we specified our inputs/outputs and our shell command.
What do we need to update in Nextflow?
process HELLO {
    publishDir "results/", mode: "copy"

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}

So, looking at our HELLO process, what do we need to add? We already have a publishDir, an output, and script, so let’s update those for cutadapt.
Update for running cutadapt
process CUTADAPT {
    publishDir "results/", mode: "copy"

    output:
    path 'trimmed.fastq'

    script:
    """
    cutadapt -a AACCGGTT -o trimmed.fastq ~/sample1.fastq
    """
}
workflow {
    CUTADAPT()
}

We can keep 'results' as our publishDir for this example, but we'll need to change our output to trimmed.fastq, and we'll change the command to cutadapt with our adapter and our input and output file names. Because Nextflow executes each task in its own work directory, we need to provide the full path to the input. Our workflow just becomes running the CUTADAPT process. Does this work? Yes - but we are hard-coding everything, which is not flexible and does not allow us to scale.
More common approach for input files
process CUTADAPT {
    publishDir "results/", mode: "copy"

    input:
    path reads_var

    output:
    path 'trimmed.fastq'

    script:
    """
    cutadapt -a AACCGGTT -o trimmed.fastq $reads_var
    """
}
workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}

A better approach is to pass the file into the process with Channel.fromPath() and declare "input: path reads_var". The "input:" block declares an input variable, not a literal file location. We then use this variable in our shell command: $reads_var refers to the actual input file that was provided to Nextflow for this task via our workflow. We can also use the variable for other things, like dynamically naming output files.
Dynamically scaling to many samples
process CUTADAPT {
    publishDir "results/", mode: "copy"

    input:
    path reads_var

    output:
    path "${reads_var.simpleName}_trimmed.fastq"

    script:
    """
    cutadapt -a AACCGGTT -o ${reads_var.simpleName}_trimmed.fastq $reads_var
    """
}

workflow {
    CUTADAPT(Channel.fromPath('*.fastq', checkIfExists: true))
}

Now we can start to use the flexibility Nextflow provides to name our output files dynamically based on sample name, and we can also start to scale up by using the wildcard to grab all the fastq files in our example 'reads' directory. Here Nextflow will create a separate task for each of our samples.
Parameter options for input files
Add a parameter for ‘--reads’ in your ’nextflow run’ command
Add a params.reads at the top of your main.nf file
Add a params.reads to a nextflow.config file
Works for one file (‘reads/sample1.fastq’) or many (‘reads/*.fastq’)

As with many things in Nextflow, we have multiple different ways we can accomplish this. We will talk about nextflow.config shortly.
Less hard-coding = more reproducibility
From:
workflow {
    CUTADAPT(Channel.fromPath('~/sample1.fastq', checkIfExists: true))
}
To:
workflow {
    CUTADAPT(Channel.fromPath(params.reads, checkIfExists: true))
}

And if we use one of those parameter methods, instead of our workflow having a hard-coded path for our inputs, we can dynamically provide our input file names and clean things up in our workflow even further.
Loading software – main.nf
Use a ‘beforeScript’ in the CUTADAPT process in main.nf
beforeScript runs specified shell command(s) before running the script command
Load the cutadapt module: beforeScript ‘module load cutadapt’
Can also do other things like export variables or create directories
beforeScript """
    module purge
    module load cutadapt
    mkdir -p results
    export PATH="$PATH:/opt/tools"
"""

We can definitely load the software in our process, but we just cleaned that thing up, so let’s put it somewhere better to keep our main.nf focused on workflow logic. To do this, let’s go ahead and start to build a nextflow.config file.
Loading software – nextflow.config
Again, we use a ‘beforeScript’ specific to the CUTADAPT process
process {
    withName: CUTADAPT {
        beforeScript = '''
        module purge
        module load cutadapt
        '''
    }
}

Now when we specifically run our CUTADAPT process, these commands will run before our script command and set up our process environment. Ok, so now we have our software dialed in for cutadapt. But we need to think about where we are running these processes. By default, nextflow is running shell commands locally, so that means if we’re just at the command line, we’d be running the processes on the login nodes, which is a no-no.
Adding SLURM options – nextflow.config
process {
    withName: CUTADAPT {
        beforeScript = '''
        module purge
        module load cutadapt
        '''
        executor = 'slurm'
        queue = 'standard'
        cpus = 2
        memory = '16 GB'
        time = '1h'
        clusterOptions = '--account=hpc_build'
    }
}

So, we need to let Nextflow know that we want to use SLURM to execute our processes, and we do this by specifying SLURM as our executor. We can also use this to specify various other options - Nextflow doesn't have explicit settings for every possible SLURM option, so we can supplement with any additional options we need via 'clusterOptions.' Again, there are multiple ways to configure everything - you can also set global SLURM options, but often different parts of the workflow need different resources. We could potentially specify these SLURM options in the CUTADAPT process in our main.nf, but we're trying to keep that tidy and focused on the workflow logic.
Now we have:
Workflow logic in main.nf
Software and slurm options in nextflow.config

As you can imagine, there are also multiple ways to set these options.
Extend to CUTADAPT → BWA_ALIGN → FREEBAYES
- Same rules apply – largely rinse and repeat for additional processes
- Create processes for each step: inputs/outputs, commands, etc.
- Software and slurm options in nextflow.config
- Main difference is our workflow - more processes and channels
- Send channel into process
- Process produces output
- Output becomes new channel for next process.

With Nextflow, channels carry data and processes do work on that data. You link them together by sending a channel into a process, and if that process produces output, its output can become a new channel for the next process.
Workflow for CUTADAPT → BWA_ALIGN → FREEBAYES
workflow {
    reads_ch = Channel.fromPath("${params.reads_dir}/*.fastq", checkIfExists: true)
    trimmed_ch = CUTADAPT(reads_ch)
    aligned_ch = BWA_ALIGN(trimmed_ch)
    FREEBAYES(aligned_ch)
}

Here’s how we could link the trim, align and variant calling together. So now we’ll put it all together and run the entire workflow from end to end on the system.
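A sketch of what the middle process might look like (the reference path `params.ref`, the publish directory, and the output naming are assumptions for illustration, not the workshop's exact code):

```groovy
process BWA_ALIGN {
    publishDir "results/bam/", mode: "copy"

    input:
    path trimmed                               // one trimmed fastq from CUTADAPT

    output:
    path "${trimmed.simpleName}.bam"           // becomes aligned_ch for FREEBAYES

    script:
    """
    bwa mem ${params.ref} $trimmed | samtools view -Sb - > ${trimmed.simpleName}.bam
    """
}
```

FREEBAYES would follow the same pattern, consuming aligned_ch as its input channel and emitting a VCF.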
Additional links
https://nf-co.re/rnaseq/3.23.0/
https://github.com/nextflow-io/nextflow

Workflows for computational data analysis
- https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
- https://github.com/pditommaso/awesome-pipeline
- Galaxy platform - bioinformatic software, pipeline and workflows:
- https://usegalaxy.org
