nano-rave

Nextflow pipeline designed for rapid onsite QC and variant calling of Oxford Nanopore data (following basecalling and demultiplexing with Guppy).

https://github.com/sanger-pathogens/nano-rave

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization sanger-pathogens has institutional domain (www.sanger.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

Nextflow pipeline designed for rapid onsite QC and variant calling of Oxford Nanopore data (following basecalling and demultiplexing with Guppy).

Basic Info
  • Host: GitHub
  • Owner: sanger-pathogens
  • License: agpl-3.0
  • Language: Nextflow
  • Default Branch: main
  • Homepage:
  • Size: 428 KB
Statistics
  • Stars: 10
  • Watchers: 6
  • Forks: 4
  • Open Issues: 1
  • Releases: 1
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

nano-rave

Nextflow pipeline designed for rapid onsite QC and variant calling of Oxford Nanopore data (following basecalling and demultiplexing with Guppy).

Pipeline summary

Image of nano-rave pipeline workflow

Getting started

Running on a personal computer

  1. Install nextflow.

  2. Install Docker.

  3. Run pipeline direct from Github repository.

Example: bash nextflow run github.com/sanger-pathogens/nano-rave --sequencing_manifest ./test_data/pipeline/inputs/test_manifest.csv --reference_manifest ./test_data/pipeline/inputs/reference_manifest.csv --variant_caller medaka_haploid --min_barcode_dir_size 5 --results_dir my_output

See usage for all available pipeline options.

  1. Once your run has finished, check output in the results_dir and clean up any intermediate files. To do this (assuming no other pipelines are running from the current working directory) run:

bash rm -rf work .nextflow*

Running on the farm (Sanger HPC clusters)

  1. Add the following line to ~/.bashrc (if not already present):

bash [[ -f /software/pathogen/farm5 ]] && source /software/pathogen/etc/pathogen.profile

  1. Source the updated .bashrc file

bash source ~/.bashrc

  1. Load the module bash module load nano-rave/<version>

  2. The pipeline should now be directly available with the command nano-rave bash nano-rave --help

  3. before excuting nano-rave, it is recommended to set the $SINGULARITY_CACHEDIR and $NXF_SINGULARITY_CACHEDIR environment variables so that they both point to a folder with enough space. This location is that one where singularity images supporting the pipeline dependencies will be downloaded; by default it is downloaded inside your home directory (spcifically in ${HOME}/.singularity/cache), which has space limitations and will rapidly fill up, causing the pipeline to fail. On the Sanger HPC, it is recommended to point to a location on your lustre scratch space.

  4. Start your analysis

To use the appropriate Sanger configuration, please run with -profile sanger_local option. Here is an example command: bash bsub -o nano-rave.o -e nano-rave.e -q long -n 4 -R "select[mem>16000] rusage[mem=16000]" -M16000 \ nano-rave -profile sanger_local --sequencing_manifest ./test_data/pipeline/inputs/test_manifest.csv --reference_manifest ./test_data/pipeline/inputs/reference_manifest.csv --variant_caller medaka_haploid --min_barcode_dir_size 5 --results_dir my_output This will run the whole pipeline i.e. all per-sample processes within a single siubmitted job, so please tailor your resource request accordingly.

NB: we are working on providing a sanger_lsf profile that will enable th e proper use of LSF cluster integration, meaning that each process is executed by submitting as a separate job on the HPC; under such configuration, you would be advised to submitted the main job (workflow head process) to the oversubscribed queue.

See usage for all available pipeline options.

  1. Once your run has finished, check output in the results_dir and clean up any intermediate files. To do this (assuming no other pipelines are running from the current working directory) run:

bash rm -rf work .nextflow*

Usage

``` nextflow run main.nf

Options: --sequencingmanifest Manifest containing paths to sequencing directories and sequencing summary files (mandatory) --referencemanifest Manifest containing reference identifiers and paths to fastq reference files (mandatory) --resultsdir Specify results directory default: ./nextflow_results --variantcaller Specify a variant caller to use medaka (default), medaka_haploid, freebayes, clair3 --clair3args Specify clair3 variant calling parameters - must include model e.g. --clair3args "--modelpath /opt/models/r941promsupg5014" (optional) --minbarcodedirsize Specify the expected minimum size of the barcode directories, in MB. Must be > 0. default: 10 --keepbam_files Save BAM files in results directory default: false --help Print this help message (optional) ```

Note:

  • Please refer to https://github.com/HKU-BAL/Clair3#usage for a comprehesive list of options that can be used with --clair3_args. Currently, by default, the software will assume you will want to variant call human chromosome contigs. If this is not the case, or you wish to use a custom set of contigs, please see clair3 options --include_all_ctgs or --ctg_name. You may also want to skip phasing with clair3 option --no_phasing_for_fa if this is not required or useful for you.

Sequencing manifest format

The sequencing manifest is in a csv format and contains two columns

  • sequencing_dir : folder containing all the Oxford Nanopore sequencing data
  • sequence_summary_file : required for QC - usually found in the sequencing directory. In this file, the paths to the fast5 read files (first column) must be full/absolute paths.

The pipeline assumes that sequencing_dir contains Guppy output for a particular sample. In particular, the parent and child folders of the given sequencing_dir assume the following structure: <sample>/<sequencing_dir>/fastq_pass/barcode* Where each barcode* directory contains fastq.gz files. Only barcode* directories whose total size on disk exceeds the threshold set with --min_barcode_dir_size are considered.

Example manifest: sequencing_dir,sequence_summary_file ./test_data/PIPELINE/inputs/sample/sequencing_dir,./test_data/PIPELINE/inputs/sample/sequencing_dir/sequencing_summary.txt

Note: When using relative paths in the manifest, they are relative to the current working directory (from which nextflow is run).

Reference manifest format

The reference manifest is in csv format and contains two columns

  • reference_id : identifier for the reference (e.g. gene name or reference genome name)
  • reference_path : path to the reference file (fasta format)

Example for amplicon data: reference_id,reference_path ama1,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_ama1.fasta crt,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_crt.fasta csp,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_csp.fasta dhfr,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_dhfr.fasta dhps,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_dhps.fasta eba175_3d7,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_eba175_3d7.fasta k13,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_k13.fasta mdr1,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_mdr1.fasta msp1,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_msp1.fasta msp2,./test_data/PIPELINE/inputs/references/ref_target_gene_cds_seq_msp2.fasta

Note: When using relative paths in the manifest, they are relative to the current working directory (from which nextflow is run).

Variant callers

Three variant callers are currently supported:

Software versions

The pipeline makes use of docker images to ensure reproducibility. This version of the pipeline uses the following software dependencies:

| Software | Version | Image URL | |-----------|---------|-------------------------------------------------------| | bedtools | 2.29.2 | quay.io/biocontainers/bedtools:2.29.2--hc088bd40 | | clair3 | 1.0.0 | docker.io/hkubal/clair3@sha256:3c4c6db3bb6118e3156630ee62de8f6afef7f7acc9215199f9b6c1b2e1926cf8 | | freebayes | 1.3.5 | docker.io/gfanz/freebayes@sha256:d32bbce0216754bfc7e01ad6af18e74df3950fb900de69253107dc7bcf4e1351 | | medaka | 1.4.4 | quay.io/biocontainers/medaka:1.4.4--py38h130def00 | | minimap2 | 2.17 | quay.io-biocontainers-minimap2:2.17--hed695b03 | | nanoplot | 1.38.0 | quay.io/biocontainers/nanoplot:1.38.0--pyhdfd78af0 | | pycoqc | 2.5.2 | quay.io/biocontainers/pycoqc:2.5.2--py0 | | samtools | 1.15.1 | quay.io/biocontainers/samtools:1.15.1--h11701150 | | tabix | 1.11 | quay.io/biocontainers/tabix:1.11--hdfd78af_0 |

Contributions and testing

Developer contributions to this pipeline will only be accepted if all pipeline tests pass. To check:

  1. Make your changes.

  2. Download the test data. A utility script is provided:

python3 scripts/download_test_data.py

  1. Install nf-test (>=0.7.0) and run the tests:

nf-test test tests/*.nf.test

If running on Sanger HPC cluster, add the option --profile sanger_local.

  1. Submit a PR.

Citations

If you use this pipeline for your analysis, please cite our paper:

Drug resistance and vaccine target surveillance of Plasmodium falciparum using nanopore sequencing in Ghana

Sophia T. Girgis, Edem Adika, Felix E. Nenyewodey, Dodzi K. Senoo Jnr, Joyce M. Ngoi, Kukua Bandoh, Oliver Lorenz, Guus van de Steeg, Alexandria J. R. Harrott, Sebastian Nsoh, Kim Judge, Richard D. Pearson, Jacob Almagro-Garcia, Samirah Saiid, Solomon Atampah, Enock K. Amoako, Collins M. Morang’a, Victor Asoala, Elrmion S. Adjei, William Burden, William Roberts-Sengier, Eleanor Drury, Megan L. Pierce, Sónia Gonçalves, Gordon A. Awandare, Dominic P. Kwiatkowski, Lucas N. Amenga-Etego & William L. Hamilton

Nature Microbiology 8:2365–2377 (2023); doi: 10.1038/s41564-023-01516-6.

Which was initially released as a pre-print:

Nanopore sequencing for real-time genomic surveillance of Plasmodium falciparum

Sophia T. Girgis, Edem Adika, Felix E. Nenyewodey, Dodzi K. Senoo Jnr, Joyce M. Ngoi, Kukua Bandoh, Oliver Lorenz, Guus van de Steeg, Sebastian Nsoh, Kim Judge, Richard D. Pearson, Jacob Almagro-Garcia, Samirah Saiid, Solomon Atampah, Enock K. Amoako, Collins M. Morang’a, Victor Asoala, Elrmion S. Adjei, William Burden, William Roberts-Sengier, Eleanor Drury, Sónia Gonçalves, Gordon A. Awandare, Dominic P. Kwiatkowski, Lucas N. Amenga-Etego, William L. Hamilton

bioRxiv 2022.12.20.521122; doi: 10.1101/2022.12.20.521122

This pipeline was adapted from the nf-core/nanoseq pipeline.

A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines

Ying Chen, Nadia M. Davidson, Yuk Kei Wan, Harshil Patel, Fei Yao, Hwee Meng Low, Christopher Hendra, Laura Watten, Andre Sim, Chelsea Sawyer, Viktoriia Iakovleva, Puay Leng Lee, Lixia Xin, Hui En Vanessa Ng, Jia Min Loo, Xuewen Ong, Hui Qi Amanda Ng, Jiaxu Wang, Wei Qian Casslynn Koh, Suk Yeah Polly Poon, Dominik Stanojevic, Hoang-Dai Tran, Kok Hao Edwin Lim, Shen Yon Toh, Philip Andrew Ewels, Huck-Hui Ng, N.Gopalakrishna Iyer, Alexandre Thiery, Wee Joo Chng, Leilei Chen, Ramanuj DasGupta, Mile Sikic, Yun-Shen Chan, Boon Ooi Patrick Tan, Yue Wan, Wai Leong Tam, Qiang Yu, Chiea Chuan Khor, Torsten Wüstefeld, Ploy N. Pratanwanich, Michael I. Love, Wee Siong Sho Goh, Sarah B. Ng, Alicia Oshlack, Jonathan Göke, SG-NEx consortium

bioRxiv 610741; doi: 10.1101/610741

A full list of citations for tools used in the pipeline is given in CITATIONS.md

Copyright

Copyright (C) 2022,2023 Genome Research Ltd.

Owner

  • Name: Pathogen Informatics, Wellcome Sanger Institute
  • Login: sanger-pathogens
  • Kind: organization
  • Location: Hinxton, Cambs., UK

Citation (CITATIONS.md)

# nano-rave: Citations

## [nano-rave](https://github.com/sanger-pathogens/nano-rave)

> Sophia T. Girgis, Edem Adika, Felix E. Nenyewodey, Dodzi K. Senoo Jnr, Joyce M. Ngoi, Kukua Bandoh, Oliver Lorenz, Guus van de Steeg, Sebastian Nsoh, Kim Judge, Richard D. Pearson, Jacob Almagro-Garcia, Samirah Saiid, Solomon Atampah, Enock K. Amoako, Collins M. Morang’a, Victor Asoala, Elrmion S. Adjei, William Burden, William Roberts-Sengier, Eleanor Drury, Sónia Gonçalves, Gordon A. Awandare, Dominic P. Kwiatkowski, Lucas N. Amenga-Etego, William L. Hamilton. Nanopore sequencing for real-time genomic surveillance of Plasmodium falciparum. [bioRxiv 521122](https://www.biorxiv.org/content/10.1101/2022.12.20.521122v1); doi: [10.1101/521122](https://doi.org/10.1101/2022.12.20.521122)

## [nf-core/nanoseq](https://github.com/nf-core/nanoseq/)

> Chen Y, Davidson NM, Wan YK, Patel H, Yao F, Low HM, Hendra C, Watten L, Sim A, Sawyer C, Iakovleva V, Lee PL, Xin L, Ng HEV, Loo JM, Ong X, Ng HQA, Wang J, Koh WQC, Poon SYP, Stanojevic D, Tran H-D, Lim KHE, Toh SY, Ewels PA, Ng H-H, Iyer N.G, Thiery A, Chng WJ, Chen L, DasGupta R, Sikic M, Chan Y-S, Tan BOP, Wan Y, Tam WL, Yu Q, Khor CC, Wüstefeld T, Pratanwanich PN, Love MI, Goh WSS, Ng SB, Oshlack A, Göke J, SG-NEx consortium. A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. [bioRxiv 610741](https://www.biorxiv.org/content/10.1101/2021.04.21.440736v1). doi: [10.1101/610741](https://doi.org/10.1101/2021.04.21.440736)

## [nf-core](https://www.ncbi.nlm.nih.gov/pubmed/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://www.ncbi.nlm.nih.gov/pubmed/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [Clair3](https://doi.org/10.1038/s43588-022-00387-x)

  > Zheng Z, Li S, Su J, Leung A W-S, Lam T-W, Luo R. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nature Computational Science. 2022;2(12):797–803. doi: 10.1038/s43588-022-00387-x. [bioRxiv 474431](https://www.biorxiv.org/content/10.1101/2021.12.29.474431v2)

- [Minimap2](https://pubmed.ncbi.nlm.nih.gov/29750242/)

  > Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191. PMID: 29750242; PMCID: PMC6137996.

- [Medaka](https://github.com/nanoporetech/medaka)

- [NanoPlot](https://pubmed.ncbi.nlm.nih.gov/29547981/)

  > De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018 Aug 1;34(15):2666-2669. doi: 10.1093/bioinformatics/bty149. PubMed PMID: 29547981; PubMed Central PMCID: PMC6061794.

- [pycoQC](https://doi.org/10.21105/joss.01236)

  > Leger A, Leonardi T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. Journal of Open Source Software. 2019;4(34):1236. doi: 10.21105/joss.01236

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

GitHub Events

Total
  • Issues event: 2
  • Issue comment event: 4
  • Member event: 1
  • Fork event: 3
Last Year
  • Issues event: 2
  • Issue comment event: 4
  • Member event: 1
  • Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • phuongvnl (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

Dockerfile docker
  • bhklab/samtools-1.9.0 latest build
  • quay.io/biocontainers/bedtools 2.29.2--hc088bd4_0 build
  • quay.io/biocontainers/minimap2 2.17--hed695b0_3 build
  • quay.io/biocontainers/sniffles 1.0.12--h8b12597_1 build