umi-pipeline-nf

Nextflow pipeline to analyze ONT-UMI-Sequencing data

https://github.com/genepi/umi-pipeline-nf

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
    Organization genepi has institutional domain (genepi.i-med.ac.at)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: genepi
  • License: mpl-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 179 MB
Statistics
  • Stars: 4
  • Watchers: 3
  • Forks: 4
  • Open Issues: 2
  • Releases: 5
Created almost 4 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation

README.md

Umi-pipeline-nf

Umi-pipeline-nf creates highly accurate single-molecule consensus sequences for unique molecular identifier (UMI)-tagged amplicons from nanopore sequencing data.
The pipeline processes FastQ files (typically from the fastq_pass folder of your nanopore run) and outputs high-quality aligned consensus sequences in BAM format for each UMI cluster. The optional variant calling step creates a VCF file with all variants found in the consensus sequences.
The newest version of the pipeline supports live analysis of the clusters during sequencing and seamless polishing of the clusters as soon as enough clusters are found.

Umi-pipeline-nf originated from a Snakemake-based analysis pipeline (pipeline-umi-amplicon; originally developed by Karst et al., Nat Methods 18:165–169, 2021). We have migrated the pipeline to Nextflow and incorporated several optimizations and additional functionalities.

Workflow

The pipeline is organized into four main subworkflows, each with its own processing steps and outputs:

  1. LIVE UMI PROCESSING

    • Purpose: Real-time processing of raw FastQ files.
    • Steps:
      • Merge and filter raw FastQ files.
      • Align reads to the reference genome.
      • Extract UMI sequences.
      • Cluster UMI-tagged reads.
    • Outputs:
      • Processed UMI clusters are passed on to later stages.
      • Raw alignment files (e.g., in <output>/<barcodeXX>/raw/align/ or <output>/<barcodeXX>/<target>/fastq_filtered/raw/).
      • Filtered FastQ files and clustering statistics.

    To stop the pipeline when it's in live mode, create a CONTINUE file in the output directory:
    touch <output>/CONTINUE

  2. OFFLINE UMI PROCESSING

    • Purpose: Batch processing with an optional subsampling step.
    • Steps:
      • Merge and filter FastQ files.
      • Optionally subsample the merged reads.
      • Perform alignment, UMI extraction, and clustering similar to LIVE processing.
    • Outputs:
      • Processed UMI clusters.
      • Alignment and subsampling reports (e.g., in <output>/<barcodeXX>/raw/subsampling/ and <output>/<barcodeXX>/<target>/fastq_filtered/raw/).
  3. UMI POLISHING

    • Purpose: Refine UMI clusters to generate high-quality consensus sequences.
    • Steps:
      • Polish clusters using medaka.
      • Realign consensus sequences to the reference genome.
      • Re-extract and re-cluster UMIs from consensus reads.
      • Parse final consensus clusters.
    • Outputs:
      • Consensus BAM and FastQ files (e.g., in <output>/<barcodeXX>/<target>/align/consensus/ and <output>/<barcodeXX>/<target>/fastq/consensus/).
      • Polishing logs and detailed cluster statistics.
  4. VARIANT CALLING

    • Purpose: Identify genetic variants from the consensus data.
    • Steps:
      • Call variants on the consensus alignments with freebayes, lofreq, or mutserve.
    • Outputs:
      • VCF files with variant calls (e.g., in <output>/<barcodeXX>/<target>/<freebayes/mutserve/lofreq>/).
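The clustering and cluster-filtering idea behind the subworkflows above can be illustrated with a small Python sketch. This is a simplified stand-in, not the pipeline's implementation: it groups UMIs greedily by Hamming distance, and the function names and the `max_dist` and `min_reads` parameters are hypothetical names chosen for this example.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count mismatching positions; unequal lengths are treated as maximally distant."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umis: List[str], max_dist: int = 1) -> Dict[str, List[int]]:
    """Greedily assign each UMI to the first centroid within max_dist.

    Returns a mapping of centroid UMI -> indices of the reads in that cluster.
    (Hypothetical helper; the real pipeline does not use this function.)
    """
    clusters: Dict[str, List[int]] = {}
    for idx, umi in enumerate(umis):
        for centroid in clusters:
            if hamming(umi, centroid) <= max_dist:
                clusters[centroid].append(idx)
                break
        else:
            # No existing centroid is close enough: start a new cluster.
            clusters[umi] = [idx]
    return clusters


def keep_large_clusters(clusters: Dict[str, List[int]], min_reads: int) -> Dict[str, List[int]]:
    """Drop clusters with fewer than min_reads members before polishing."""
    return {c: members for c, members in clusters.items() if len(members) >= min_reads}


if __name__ == "__main__":
    umis = ["AAAA", "AAAT", "CCCC", "AAAA", "CCCG", "GGGG"]
    clusters = cluster_umis(umis, max_dist=1)
    print(keep_large_clusters(clusters, min_reads=2))
```

Greedy assignment depends on input order, which is one reason dedicated clustering tools are preferable for real data; the sketch only shows why small clusters are discarded before consensus building.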

See the output documentation for a detailed overview of the pipeline outputs and directory structure.
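For downstream scripting, the consensus output locations quoted in the workflow description can be assembled programmatically. The following is a minimal sketch that assumes only the two directory patterns named above (`align/consensus` and `fastq/consensus`); the helper names and the `*.bam` file suffix are assumptions made for illustration, so check the output documentation for the actual file names.

```python
from pathlib import Path
from typing import Dict, List


def consensus_dirs(output: str, barcode: str, target: str) -> Dict[str, Path]:
    """Build the consensus directories for one barcode and target.

    Follows the patterns <output>/<barcodeXX>/<target>/align/consensus/ and
    <output>/<barcodeXX>/<target>/fastq/consensus/ from the README.
    """
    base = Path(output) / barcode / target
    return {
        "consensus_bam": base / "align" / "consensus",
        "consensus_fastq": base / "fastq" / "consensus",
    }


def find_consensus_bams(output: str, barcode: str, target: str) -> List[Path]:
    """List BAM files in the consensus alignment directory (suffix is an assumption)."""
    bam_dir = consensus_dirs(output, barcode, target)["consensus_bam"]
    return sorted(bam_dir.glob("*.bam"))


if __name__ == "__main__":
    for name, path in consensus_dirs("results", "barcode01", "targetA").items():
        print(name, path)
```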

Main Adaptations

  • It ships with a Docker/Singularity container, which simplifies installation, makes the pipeline easy to use on clusters, and keeps results highly reproducible.
  • The pipeline is optimized for parallelization.
  • Additional UMI cluster splitting step to remove admixed UMI clusters.
  • Read filtering strategy per UMI cluster was adapted to preserve the highest quality reads.
  • Three commonly used variant callers (freebayes, lofreq or mutserve) are supported by the pipeline.
  • The raw reads can be optionally subsampled.
  • The raw reads can be filtered by read length and quality.
  • GPU acceleration for cluster polishing by Medaka is available when using the docker profile. Tested with an RTX 4080 SUPER GPU (16 GB).
  • Accepts multi-line BED files, so the pipeline can be run for several targets at once.
  • Supports live analysis of the clusters during sequencing and seamless polishing of the clusters as soon as enough clusters are found.
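The read filtering and optional subsampling mentioned in the list above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the pipeline's code: the `Read` type, the function names, and the threshold parameters are hypothetical, and the quality measure used here is the plain arithmetic mean of Phred scores, which may differ from the measure the pipeline applies.

```python
import random
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str  # Phred+33 encoded quality string


def mean_phred(quality: str) -> float:
    """Arithmetic mean of the per-base Phred scores (Phred+33 encoding)."""
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_length: int, min_mean_quality: float) -> Iterator[Read]:
    """Yield reads that are long enough and have a high enough mean quality."""
    for read in reads:
        if (
            len(read.sequence) >= min_length
            and read.quality  # skip reads without quality values
            and mean_phred(read.quality) >= min_mean_quality
        ):
            yield read


def subsample(reads: List[Read], n: int, seed: int = 0) -> List[Read]:
    """Return up to n reads drawn without replacement, reproducibly via seed."""
    if n >= len(reads):
        return list(reads)
    return random.Random(seed).sample(reads, n)


if __name__ == "__main__":
    reads = [
        Read("r1", "ACGTACGT", "IIIIIIII"),
        Read("r2", "ACG", "III"),
        Read("r3", "ACGTACGT", "!!!!!!!!"),
    ]
    kept = list(filter_reads(reads, min_length=5, min_mean_quality=20))
    print([r.name for r in subsample(kept, n=1)])
```

Fixing the random seed keeps a subsampled run repeatable, which matters when results from different parameter settings are compared.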

To see all available parameters, run:

nextflow run genepi/umi-pipeline-nf -r main --help

Quick Start

  1. Install nextflow.

  2. Download the pipeline and test it on a minimal dataset with a single command.

nextflow run genepi/umi-pipeline-nf -r v1.0.0-beta -profile test,docker

  3. Start running your own analysis!
    3.1 Download config/custom.config and adapt it with the paths to your data (relative and absolute paths are both possible).

nextflow run genepi/umi-pipeline-nf -r v1.0.0-beta -c <custom.config> -profile custom,<docker,singularity>
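When the pipeline is launched from a script, for example once per sequencing run, the command line above can be assembled in Python. This is a minimal sketch using only the options shown in this README (`-r`, `-c`, `-profile`); the function name `build_command` and its parameters are hypothetical.

```python
import shlex
from typing import List, Optional


def build_command(revision: str, profiles: List[str], config: Optional[str] = None) -> List[str]:
    """Assemble the nextflow invocation as an argument list.

    Only uses the options that appear in the Quick Start commands.
    """
    cmd = ["nextflow", "run", "genepi/umi-pipeline-nf", "-r", revision]
    if config is not None:
        cmd += ["-c", config]
    # Nextflow expects multiple profiles as a comma-separated list.
    cmd += ["-profile", ",".join(profiles)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("v1.0.0-beta", ["custom", "docker"], config="custom.config")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if the config path contains spaces.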

Citation

If you use the pipeline, please cite our paper:

Amstler S, Streiter G, Pfurtscheller C, Forer L, Di Maio S, Weissensteiner H, Paulweber B, Schoenherr S, Kronenberg F, Coassin S. Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex lipoprotein(a) KIV-2 VNTR. Genome Med 16, 117 (2024). https://doi.org/10.1186/s13073-024-01391-8

Credits

The pipeline was written by @StephanAmstler.
Nextflow template pipeline: EcSeq.
Snakemake-based ONT pipeline for UMI nanopore sequencing analysis: nanoporetech/pipeline-umi-amplicon.
UMI-corrected nanopore sequencing analysis first shown by: SorenKarst/longread_umi.

Owner

  • Name: Institute of Genetic Epidemiology
  • Login: genepi
  • Kind: organization
  • Location: Innsbruck, Austria

Medical University of Innsbruck

Citation (CITATION.cff)

cff-version: "1.2.0"
message: "If you use this software, please cite it as below."
title: "Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex lipoprotein(a) KIV-2 VNTR"
authors:
  - family-names: "Amstler"
    given-names: "Stephan"
  - family-names: "Streiter"
    given-names: "Gertraud"
  - family-names: "Pfurtscheller"
    given-names: "Cathrin"
  - family-names: "Forer"
    given-names: "Lukas"
  - family-names: "Di Maio"
    given-names: "Silvia"
  - family-names: "Weissensteiner"
    given-names: "Hansi"
  - family-names: "Paulweber"
    given-names: "Bernhard"
  - family-names: "Schoenherr"
    given-names: "Sebastian"
  - family-names: "Kronenberg"
    given-names: "Florian"
  - family-names: "Coassin"
    given-names: "Stefan"
doi: "10.1186/s13073-024-01391-8"
date-released: "2024-10-08"
license: "Apache-2.0"
repository-code: "https://github.com/genepi/umi-pipeline-nf"
preferred-citation:
  type: "article"
  authors:
  - family-names: "Amstler"
    given-names: "Stephan"
  - family-names: "Streiter"
    given-names: "Gertraud"
  - family-names: "Pfurtscheller"
    given-names: "Cathrin"
  - family-names: "Forer"
    given-names: "Lukas"
  - family-names: "Di Maio"
    given-names: "Silvia"
  - family-names: "Weissensteiner"
    given-names: "Hansi"
  - family-names: "Paulweber"
    given-names: "Bernhard"
  - family-names: "Schoenherr"
    given-names: "Sebastian"
  - family-names: "Kronenberg"
    given-names: "Florian"
  - family-names: "Coassin"
    given-names: "Stefan"
  doi: "10.1186/s13073-024-01391-8"
  journal: "Genome Medicine"
  day: 8
  month: 10
  title: "Nanopore sequencing with unique molecular identifiers enables accurate mutation analysis and haplotyping in the complex Lipoprotein(a) KIV-2 VNTR"
  year: 2024

GitHub Events

Total
  • Create event: 14
  • Release event: 9
  • Issues event: 21
  • Watch event: 3
  • Delete event: 17
  • Issue comment event: 16
  • Push event: 110
  • Pull request event: 5
  • Fork event: 3
Last Year
  • Create event: 14
  • Release event: 9
  • Issues event: 21
  • Watch event: 3
  • Delete event: 17
  • Issue comment event: 16
  • Push event: 110
  • Pull request event: 5
  • Fork event: 3

Issues and Pull Requests

All Time
  • Total issues: 11
  • Total pull requests: 3
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 2 minutes
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 0.91
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 3
  • Average time to close issues: 27 days
  • Average time to close pull requests: 2 minutes
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • AmstlerStephan (4)
  • webbchen (3)
  • coro1c (3)
  • cmbrunet (1)
  • ebwinter95 (1)
  • cnk113 (1)
  • camcl (1)
Pull Request Authors
  • AmstlerStephan (14)
  • vmelichar (1)

Dependencies

.github/workflows/ci-tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-java v2 composite
  • docker/build-push-action v5 composite
  • docker/setup-buildx-action v3 composite
  • nf-core/setup-nextflow v1 composite
Dockerfile docker
  • ubuntu 22.04 build
environment.yml conda
  • bedtools 2.30.0.*
  • freebayes 1.3.2.*
  • lofreq 2.1.5.*
  • medaka >2.0.0
  • minimap2 2.24.*
  • openjdk 11.0.9.*
  • pip 22.2.2.*
  • python >=3.8
  • samtools 1.15.1.*
  • seqtk 1.3.*
  • unzip 6.0.*
  • vcflib 1.0.0.*
  • vsearch 2.21.2.*