umi-pipeline-nf

Nextflow pipeline to analyze ONT-UMI-Sequencing data

https://github.com/genepi/umi-pipeline-nf

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com
- ○ Academic email domains
- ✓ Institutional organization owner: Organization genepi has institutional domain (genepi.i-med.ac.at)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.6%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: genepi
License: mpl-2.0
Language: Python
Default Branch: main
Homepage:
Size: 179 MB

Statistics

Stars: 4
Watchers: 3
Forks: 4
Open Issues: 2
Releases: 5

Created about 4 years ago · Last pushed 10 months ago

Metadata Files

Readme, Changelog, License, Citation

Umi-pipeline-nf

Umi-pipeline-nf creates highly accurate single-molecule consensus sequences for unique molecular identifier (UMI)-tagged amplicons from nanopore sequencing data.
The pipeline processes FastQ files (typically from the fastq_pass folder of your nanopore run) and outputs high-quality aligned consensus sequences in BAM format for each UMI cluster. The optional variant calling step creates a VCF file for all variants found in the consensus sequences.
The newest version of the pipeline supports live analysis of the clusters during sequencing and seamless polishing of the clusters as soon as enough clusters are found.

Umi-pipeline-nf originated from a Snakemake-based analysis pipeline (pipeline-umi-amplicon; originally developed by Karst et al., Nat Methods 18:165–169, 2021). We migrated the pipeline to Nextflow and added several optimizations and new functionality.

Workflow

The pipeline is organized into four main subworkflows, each with its own processing steps and outputs:

LIVE UMI PROCESSING
- Purpose: Real-time processing of raw FastQ files.
- Steps:
  - Merge and filter raw FastQ files.
  - Align reads to the reference genome.
  - Extract UMI sequences.
  - Cluster UMI-tagged reads.
- Outputs:
  - Processed UMI clusters are passed on to later stages.
  - Raw alignment files (e.g., in <output>/<barcodeXX>/raw/align/ or <output>/<barcodeXX>/<target>/fastq_filtered/raw/).
  - Filtered FastQ files and clustering statistics.
To stop the pipeline when it's in live mode, create a CONTINUE file in the output directory:
touch <output>/CONTINUE
OFFLINE UMI PROCESSING
- Purpose: Batch processing with an optional subsampling step.
- Steps:
  - Merge and filter FastQ files.
  - Optionally subsample the merged reads.
  - Perform alignment, UMI extraction, and clustering similar to LIVE processing.
- Outputs:
  - Processed UMI clusters.
  - Alignment and subsampling reports (e.g., in <output>/<barcodeXX>/raw/subsampling/ and <output>/<barcodeXX>/<target>/fastq_filtered/raw/).
UMI POLISHING
- Purpose: Refine UMI clusters to generate high-quality consensus sequences.
- Steps:
  - Polish clusters using medaka.
  - Realign consensus sequences to the reference genome.
  - Re-extract and re-cluster UMIs from consensus reads.
  - Parse final consensus clusters.
- Outputs:
  - Consensus BAM and FastQ files (e.g., in <output>/<barcodeXX>/<target>/align/consensus/ and <output>/<barcodeXX>/<target>/fastq/consensus/).
  - Polishing logs and detailed cluster statistics.
VARIANT CALLING
- Purpose: Identify genetic variants from the consensus data.
- Steps:
  - Perform variant calling using one of the supported callers: freebayes, lofreq, or mutserve.
- Outputs:
  - VCF files with variant calls (e.g., in <output>/<barcodeXX>/<target>/<freebayes/mutserve/lofreq>/).

See the output documentation for a detailed overview of the pipeline outputs and directory structure.
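
The output locations listed above follow a regular pattern, so downstream scripts can derive them instead of hard-coding them. The following is a minimal sketch, assuming only the example paths shown in the Workflow section; the helper names (consensus_bam_dir, variant_dir, list_consensus_bams) are hypothetical and not part of the pipeline, and the authoritative layout is the one in the output documentation.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. The directory layout is taken
# from the example paths in the Workflow section, not from the pipeline code.

# Print the consensus BAM directory for one barcode and target.
consensus_bam_dir() {
  local output="$1" barcode="$2" target="$3"
  printf '%s/%s/%s/align/consensus\n' "$output" "$barcode" "$target"
}

# Print the variant-call directory for one barcode, target, and caller.
# Only the three callers named in the README are accepted.
variant_dir() {
  local output="$1" barcode="$2" target="$3" caller="$4"
  case "$caller" in
    freebayes|lofreq|mutserve) ;;
    *) echo "unsupported caller: $caller" >&2; return 1 ;;
  esac
  printf '%s/%s/%s/%s\n' "$output" "$barcode" "$target" "$caller"
}

# List consensus BAM files for all barcodes and targets under an output directory.
list_consensus_bams() {
  local output="$1"
  find "$output" -type f -path '*/align/consensus/*.bam' | sort
}
```

For example, `list_consensus_bams results` would print every consensus BAM found below a hypothetical `results` output directory, one path per line.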

Main Adaptations

It ships with a Docker/Singularity container, which makes installation simple, the pipeline easy to run on clusters, and the results highly reproducible.
The pipeline is optimized for parallelization.
Additional UMI cluster splitting step to remove admixed UMI clusters.
Read filtering strategy per UMI cluster was adapted to preserve the highest quality reads.
Three commonly used variant callers (freebayes, lofreq, and mutserve) are supported by the pipeline.
The raw reads can be optionally subsampled.
The raw reads can be filtered by read length and quality.
GPU acceleration for cluster polishing by Medaka is available when using the docker profile. Tested with an RTX 4080 SUPER GPU (16 GB).
Allows multi-line BED files to run the pipeline for several targets at once.
Supports live analysis of the clusters during sequencing and seamless polishing of the clusters as soon as enough clusters are found.
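
Because one BED file can describe several targets, it can help to sanity-check the file before starting a run. The sketch below assumes standard tab-separated BED records (chromosome, start, end, optional name) with one target per line; the function name count_bed_targets is hypothetical, and which columns the pipeline itself requires is not stated here.

```shell
#!/usr/bin/env bash
# Sketch only: count_bed_targets is a hypothetical helper, not part of the pipeline.
# Assumes standard tab-separated BED records: chrom, start, end[, name ...].

# Print the number of target records in a BED file.
# Returns non-zero if any record is malformed (fewer than 3 columns,
# non-numeric coordinates, or start >= end).
count_bed_targets() {
  awk -F'\t' '
    /^#/ || /^track/ || /^browser/ || NF == 0 { next }   # skip comments, headers, blank lines
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
      printf "line %d: malformed BED record\n", NR > "/dev/stderr"
      bad = 1
      next
    }
    { n++ }
    END { print n + 0; exit bad + 0 }
  ' "$1"
}
```

Calling `count_bed_targets targets.bed` on a hypothetical `targets.bed` prints how many targets the pipeline would be asked to process and fails if any line looks malformed.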

To see all available parameters, run:
nextflow run genepi/umi-pipeline-nf -r main --help

Quick Start

1. Install nextflow.
2. Download the pipeline and test it on a minimal dataset with a single command.

nextflow run genepi/umi-pipeline-nf -r v1.0.0-beta -profile test,docker

3. Start running your own analysis!
3.1 Download and adapt the config/custom.config with paths to your data (relative and absolute paths possible).

nextflow run genepi/umi-pipeline-nf -r v1.0.0-beta -c <custom.config> -profile custom,<docker,singularity>
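
The run command above takes either docker or singularity as the container profile. As a minimal sketch, a wrapper script could pick whichever is installed and assemble the command; the function names below are hypothetical, and only the flags already shown in this section (-r, -c, -profile) are used. The command is printed rather than executed so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not part of the pipeline.

# Pick a container profile based on what is installed (docker is tried first).
detect_container_profile() {
  if command -v docker >/dev/null 2>&1; then
    echo docker
  elif command -v singularity >/dev/null 2>&1; then
    echo singularity
  else
    echo "neither docker nor singularity found" >&2
    return 1
  fi
}

# Print (do not execute) the run command for a custom config.
# Arguments: config path, container profile, optional revision.
print_run_command() {
  local config="$1" container="$2" revision="${3:-v1.0.0-beta}"
  printf 'nextflow run genepi/umi-pipeline-nf -r %s -c %s -profile custom,%s\n' \
    "$revision" "$config" "$container"
}
```

A typical use would be `print_run_command custom.config "$(detect_container_profile)"`, followed by running the printed line once it looks right.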

Citation

If you use the pipeline, please cite our paper:

Amstler S, Streiter G, Pfurtscheller C, Forer L, Di Maio S, Weissensteiner H, Paulweber B, Schoenherr S, Kronenberg F, Coassin S. Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex lipoprotein(a) KIV-2 VNTR. Genome Med 16, 117 (2024). https://doi.org/10.1186/s13073-024-01391-8

Credits

The pipeline was written by @StephanAmstler.
Nextflow template pipeline: EcSeq.
Snakemake-based ONT pipeline for UMI nanopore sequencing analysis: nanoporetech/pipeline-umi-amplicon.
UMI-corrected nanopore sequencing analysis first shown by: SorenKarst/longread_umi.

Owner

Name: Institute of Genetic Epidemiology
Login: genepi
Kind: organization
Location: Innsbruck, Austria

Website: http://genepi.i-med.ac.at
Repositories: 55
Profile: https://github.com/genepi

Medical University of Innsbruck

Citation (CITATION.cff)

cff-version: "1.2.0"
message: "If you use this software, please cite it as below."
title: "Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex lipoprotein(a) KIV-2 VNTR"
authors:
  - family-names: "Amstler"
    given-names: "Stephan"
  - family-names: "Streiter"
    given-names: "Gertraud"
  - family-names: "Pfurtscheller"
    given-names: "Cathrin"
  - family-names: "Forer"
    given-names: "Lukas"
  - family-names: "Di Maio"
    given-names: "Silvia"
  - family-names: "Weissensteiner"
    given-names: "Hansi"
  - family-names: "Paulweber"
    given-names: "Bernhard"
  - family-names: "Schoenherr"
    given-names: "Sebastian"
  - family-names: "Kronenberg"
    given-names: "Florian"
  - family-names: "Coassin"
    given-names: "Stefan"
doi: "10.1186/s13073-024-01391-8"
date-released: "2024-10-08"
license: "Apache-2.0"
repository-code: "https://github.com/genepi/umi-pipeline-nf"
preferred-citation:
  type: "article"
  authors:
  - family-names: "Amstler"
    given-names: "Stephan"
  - family-names: "Streiter"
    given-names: "Gertraud"
  - family-names: "Pfurtscheller"
    given-names: "Cathrin"
  - family-names: "Forer"
    given-names: "Lukas"
  - family-names: "Di Maio"
    given-names: "Silvia"
  - family-names: "Weissensteiner"
    given-names: "Hansi"
  - family-names: "Paulweber"
    given-names: "Bernhard"
  - family-names: "Schoenherr"
    given-names: "Sebastian"
  - family-names: "Kronenberg"
    given-names: "Florian"
  - family-names: "Coassin"
    given-names: "Stefan"
  doi: "10.1186/s13073-024-01391-8"
  journal: "Genome Medicine"
  day: 8
  month: 10
  title: "Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex Lipoprotein(a) KIV-2 VNTR"
  year: 2024

GitHub Events

Total

Create event: 14
Release event: 9
Issues event: 21
Watch event: 3
Delete event: 17
Issue comment event: 16
Push event: 110
Pull request event: 5
Fork event: 3

Last Year

Create event: 14
Release event: 9
Issues event: 21
Watch event: 3
Delete event: 17
Issue comment event: 16
Push event: 110
Pull request event: 5
Fork event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 11
Total pull requests: 3
Average time to close issues: about 1 year
Average time to close pull requests: 2 minutes
Total issue authors: 6
Total pull request authors: 1
Average comments per issue: 0.91
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 6
Pull requests: 3
Average time to close issues: 27 days
Average time to close pull requests: 2 minutes
Issue authors: 4
Pull request authors: 1
Average comments per issue: 1.5
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

AmstlerStephan (4)
webbchen (3)
coro1c (3)
cmbrunet (1)
ebwinter95 (1)
cnk113 (1)
camcl (1)

Pull Request Authors

AmstlerStephan (14)
vmelichar (1)

Dependencies

.github/workflows/ci-tests.yml actions

actions/checkout v2 composite
actions/setup-java v2 composite
docker/build-push-action v5 composite
docker/setup-buildx-action v3 composite
nf-core/setup-nextflow v1 composite

Dockerfile docker

ubuntu 22.04 build

environment.yml conda

bedtools 2.30.0.*
freebayes 1.3.2.*
lofreq 2.1.5.*
medaka >2.0.0
minimap2 2.24.*
openjdk 11.0.9.*
pip 22.2.2.*
python >=3.8
samtools 1.15.1.*
seqtk 1.3.*
unzip 6.0.*
vcflib 1.0.0.*
vsearch 2.21.2.*
