measeq

Measles Sequence Analysis and Automation

https://github.com/phac-nml/measeq

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: phac-nml
License: mit
Language: Nextflow
Default Branch: main
Homepage:
Size: 33.2 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 6

Created about 1 year ago · Last pushed 10 months ago

MeaSeq: Measles Sequence Analysis Automation

Current Updates
- 2025-09-03
Introduction
Installation
Resource Requirements
Usage
Outputs
Steps
- Illumina Steps
- Nanopore Steps
Troubleshooting
Credits
Citations
Contributing
Legal

Current Updates

2025-09-03

Illumina and Nanopore workflows fully functional with the same (or equivalent) outputs
Dependency management fully available with Docker, Singularity, and Conda
Can assign DSIds from reference multi-fasta file and give new N450s a Novel-hash label
- With --dsid_fasta <FASTA>
- If no DSId fasta file is available, every N450 is labeled Novel-hash, and identical sequences receive identical hashes

Introduction

MeaSeq is a measles virus (MeV) specific pipeline established for use in surveillance and outbreak analysis. This pipeline uses a reference-based read mapping approach for Whole Genome or Amplicon sequencing data from both the Illumina and Nanopore platforms to output MeV consensus sequences (whole genome and N450), variant data, sequencing quality information, and custom summary reports.

MeaSeq Workflow Diagram

This project aims to implement an open-source, easy to run, MeV Whole Genome Sequence analysis pipeline that works on both Illumina and Nanopore data. The end goal of this project is to deploy a standardized pipeline focused on final reporting metrics and plots for rapid detection and response to MeV outbreaks in Canada and abroad.

The basis of the pipeline comes from two other pipelines: the Illumina side from nf-core's viralrecon pipeline and the Nanopore side from the artic pipeline. Most additions were made for measles-specific QC or reporting.

Installation

[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test_illumina before running the workflow on actual data.

Installation requires both nextflow at a minimum version of 24.10.0 and a dependency management system to run.

Steps:

Download and install nextflow
1. Download and install with conda
  - Conda command: conda create -n nextflow -c conda-forge -c bioconda nextflow
2. Install with the instructions at https://www.nextflow.io/
Determine which dependency management system works best for you

Note: The plotting process currently uses a custom Docker container, which should work with both Docker and Singularity.

Run the pipeline with one of the following profiles to handle dependencies (or use your own institutional profile if you have one):
- conda
- mamba
- singularity
- docker

Resource Requirements

By default, the bwamem2 step has a minimum resource usage allocation set to 12 cpus and 72GB memory using the nf-core process_high label.

This can be adjusted (along with the other labels) by creating and passing a custom configuration file with -c <config>. More information can be found in the usage docs.

The pipeline has also been tested with as few as 2 cpus and 8GB memory. A few steps are throttled at that level, but the pipeline remains functional.

Usage

Illumina

First, prepare a samplesheet with your input data that looks as follows for Illumina paired-end data:

samplesheet.csv:

    sample,fastq_1,fastq_2
    MeVSample01,/PATH/TO/inputread1_S1_L002_R1_001.fastq.gz,/PATH/TO/inputread1_S1_L002_R2_001.fastq.gz
    PosCtrl01,/PATH/TO/inputread2_S1_L003_R1_001.fastq.gz,/PATH/TO/inputread2_S1_L003_R2_001.fastq.gz
    Sample3,/PATH/TO/inputread3_S1_L004_R1_001.fastq.gz,/PATH/TO/inputread3_S1_L004_R2_001.fastq.gz

Each row represents a sample and its associated paired-end Illumina read data.
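A samplesheet like the one above could be generated from a directory of reads rather than typed by hand. The following is a minimal sketch, not part of MeaSeq: the helper names `find_pairs` and `write_samplesheet` are hypothetical, and the sketch assumes files follow the `<sample>_S<n>_L<lane>_R1_001.fastq.gz` naming seen in the example paths.

```python
import csv
import re
from pathlib import Path

R1_SUFFIX = "_R1_001.fastq.gz"
R2_SUFFIX = "_R2_001.fastq.gz"
# Assumed Illumina naming: <sample>_S<n>_L<lane>_R1_001.fastq.gz
R1_PATTERN = re.compile(r"^(?P<sample>.+?)_S\d+_L\d{3}_R1_001\.fastq\.gz$")


def find_pairs(fastq_dir):
    """Return (sample, R1 path, R2 path) tuples for every complete read pair."""
    rows = []
    for r1 in sorted(Path(fastq_dir).glob("*" + R1_SUFFIX)):
        match = R1_PATTERN.match(r1.name)
        if match is None:
            continue  # file name does not follow the assumed convention
        r2 = r1.with_name(r1.name[: -len(R1_SUFFIX)] + R2_SUFFIX)
        if r2.exists():  # samples without a mate file are skipped
            rows.append((match.group("sample"), str(r1.resolve()), str(r2.resolve())))
    return rows


def write_samplesheet(rows, out_path):
    """Write rows under the sample,fastq_1,fastq_2 header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        writer.writerows(rows)
```

For example, `write_samplesheet(find_pairs("/PATH/TO"), "samplesheet.csv")` would write one row per sample that has both an R1 and an R2 file.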

You can then run the pipeline using:

    nextflow run phac-nml/measeq \
        -profile <docker/singularity/.../institute> \
        --input <SAMPLESHEET> \
        --outdir <OUTDIR> \
        --reference <REFERENCE FASTA> \
        --platform illumina

Nanopore

For nanopore data, prepare a samplesheet that looks as follows:

samplesheet.csv

    sample,fastq_1,fastq_2
    MeVSample01,/PATH/TO/inputread1.fastq.gz,
    PosCtrl01,/PATH/TO/inputread2.fastq.gz,
    Sample3,/PATH/TO/inputread3.fastq.gz,

Each row represents a sample and its single-end nanopore data.
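Because both platforms share the same three-column header but fill `fastq_2` differently, a quick check before launching a run can catch a mismatched samplesheet. This is a hedged sketch rather than the pipeline's own validation: `check_samplesheet` is a hypothetical helper, and the rule that Illumina rows must always carry `fastq_2` is an assumption based on the paired-end example.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2"]


def check_samplesheet(text, platform):
    """Return a list of problems found in samplesheet CSV text (empty if none)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"line 1: unexpected header {header!r}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            continue
        sample, fastq_1, fastq_2 = (field.strip() for field in row)
        if not sample or not fastq_1:
            problems.append(f"line {line_no}: sample and fastq_1 are required")
        if platform == "nanopore" and fastq_2:
            problems.append(f"line {line_no}: fastq_2 should be empty for nanopore")
        if platform == "illumina" and not fastq_2:
            # Assumption of this sketch: Illumina input is paired-end.
            problems.append(f"line {line_no}: fastq_2 is missing for paired-end data")
    return problems
```

An empty list means the sheet matches the layout shown for the chosen platform. It does not confirm that the listed files exist.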

You can then run the pipeline using:

    nextflow run phac-nml/measeq \
        --input <SAMPLESHEET> \
        --outdir <OUTDIR> \
        --reference <REFERENCE FASTA> \
        --platform nanopore \
        --model <CLAIR3_MODEL> \
        -profile <docker/singularity/institute/etc>

Clair3 Models

The Nanopore pipeline uses Clair3 to call variants, which requires a model chosen based on the flowcell, pore, translocation speed, and basecalling model.

Some models are built into clair3 and some need to be downloaded. The pre-trained clair3 models can be downloaded automatically when running the pipeline using artic get_models and are specified with the --model <MODEL> parameter.

Additional or local models can also be used. To do so, provide a path to the model with the --local_model <PATH> parameter instead.
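When runs are launched from a script, the choice between `--model` and `--local_model` can be made explicit. The sketch below only assembles the command shown earlier as an argument list. The function name `nanopore_command` is hypothetical, and the requirement that exactly one of the two model options be given is an assumption of this sketch, not a documented rule of the pipeline.

```python
import shlex


def nanopore_command(samplesheet, outdir, reference, profile, model=None, local_model=None):
    """Return the argument list for a nanopore run with one model option."""
    # Assumption of this sketch: exactly one of model / local_model is supplied.
    if (model is None) == (local_model is None):
        raise ValueError("provide exactly one of model or local_model")
    argv = [
        "nextflow", "run", "phac-nml/measeq",
        "-profile", profile,
        "--input", samplesheet,
        "--outdir", outdir,
        "--reference", reference,
        "--platform", "nanopore",
    ]
    if model is not None:
        argv += ["--model", model]
    else:
        argv += ["--local_model", local_model]
    return argv


# Placeholders are kept as-is; substitute real values before running.
print(shlex.join(nanopore_command(
    "samplesheet.csv", "results", "reference.fasta", "docker", model="<CLAIR3_MODEL>"
)))
```

The returned list can be passed to `subprocess.run` without shell quoting concerns, since each argument is a separate element.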

Amplicon and Primer Files

Both Illumina and Nanopore support running amplicon data using a primer scheme file. To run amplicon data, all you need is a primer bed file in which the primers have been mapped to their locations in the reference genome used. Pass it with the --primer_bed <PRIMER_BED> parameter. An example primer bed file looks as such:

primer.bed

    <CHROM>     <START>  <END>  <PRIMER_NAME>  <POOL>  <DIRECTION>
    MH356245.1  1        25     MSV_1_LEFT     1       +
    MH356245.1  400      425    MSV_2_LEFT     2       +
    MH356245.1  500      525    MSV_1_RIGHT    1       -
    MH356245.1  900      925    MSV_2_RIGHT    2       -

To properly pair the primers, make sure that the names match up until the _LEFT or _RIGHT that mark the primer direction in the primer name. You can also use the following direction extensions in pairing:

_LEFT and _RIGHT
_L and _R
_FORWARD and _REVERSE
_F and _R

Note: The first line in the example file is just to display what each line expects and should not be included when creating a primer bed file
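The pairing rule described above can be sketched in a few lines, which is also a handy way to spot primers whose names do not pair up before starting a run. This is an illustration under stated assumptions and not the pipeline's own code: `pair_primers` and `unpaired` are hypothetical helpers, and columns are assumed to be whitespace-separated.

```python
import re
from collections import defaultdict

# Direction suffixes accepted for pairing, as listed above.
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<side>LEFT|RIGHT|FORWARD|REVERSE|L|R|F)$")
LEFT_SIDES = {"LEFT", "FORWARD", "L", "F"}


def pair_primers(bed_text):
    """Group primers by name prefix: {base: {"left": (start, end), "right": (start, end)}}."""
    pairs = defaultdict(dict)
    for line in bed_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        start, end, name = fields[1:4]
        match = SUFFIX.match(name)
        if match is None:
            raise ValueError(f"no direction suffix in primer name: {name}")
        side = "left" if match.group("side") in LEFT_SIDES else "right"
        pairs[match.group("base")][side] = (int(start), int(end))
    return dict(pairs)


def unpaired(pairs):
    """Return the sorted base names that lack a left or a right primer."""
    return sorted(base for base, sides in pairs.items() if set(sides) != {"left", "right"})


EXAMPLE = """\
MH356245.1 1 25 MSV_1_LEFT 1 +
MH356245.1 400 425 MSV_2_LEFT 2 +
MH356245.1 500 525 MSV_1_RIGHT 1 -
MH356245.1 900 925 MSV_2_RIGHT 2 -
"""
print(pair_primers(EXAMPLE))
print(unpaired(pair_primers(EXAMPLE)))
```

With the example file, `MSV_1` and `MSV_2` each receive a left and a right primer, so `unpaired` returns an empty list.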

DSIds

While 24 MeV genotypes were initially identified, only 2 have been detected since 2021: B3 and D8. Due to this, the Distinct Sequence Identifier (DSId) system was created to designate a unique 4-digit identifier based on the precise N450 sequence as a sub-genotype nomenclature. The Measles Nucleotide Surveillance database (MeaNS) is the global resource for these measles virus genetic sequences that is maintained by the WHO. N450 sequences can be submitted to the database to generate a distinct sequence identifier (DSId) for each unique sequence.

There is no way to query the current database so a multifasta file with DSId calls is required to match them up locally. If a match is found, the matching DSId is assigned! If no match is found, the distinct sequence is given a Novel-<MD5 HASH> (first 7 characters for now) identifier so that it can be submitted to the database. To do this, use the parameter --dsid_fasta <FASTA>. The fasta file would look as such:

dsid_fasta

```
>1931 D8
GTCAGTTCCACATTGGCATCTGAACTCG
>2001 D8
GTCAGTTCCACATTGGCATCAGAACTCG
>2418 B3
GTCAGTTCCACAGTGGCATCTGAACTCG
```

If this parameter is not given, the DSIds will still be generated as hashes to group up samples in the dsid.tsv and in the final report.
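The matching and fallback labeling described in this section can be illustrated with a short sketch. It is not the pipeline's implementation: `read_dsid_fasta` and `assign_dsid` are hypothetical names, the DSId is assumed to be the first word of each FASTA header, and hashing the uppercase sequence string is an assumption about how the MD5 is computed.

```python
import hashlib


def read_dsid_fasta(text):
    """Map each N450 sequence to its DSId (assumed to be the first header word)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line.upper()
    return {sequence: dsid for dsid, sequence in records.items()}


def assign_dsid(n450, known):
    """Return the matching DSId, or a Novel-<first 7 MD5 characters> label."""
    sequence = n450.strip().upper()
    if sequence in known:
        return known[sequence]
    # Assumption: the hash is taken over the uppercase sequence string.
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return f"Novel-{digest[:7]}"


KNOWN = read_dsid_fasta(
    ">1931 D8\nGTCAGTTCCACATTGGCATCTGAACTCG\n"
    ">2001 D8\nGTCAGTTCCACATTGGCATCAGAACTCG\n"
)
print(assign_dsid("GTCAGTTCCACATTGGCATCTGAACTCG", KNOWN))
print(assign_dsid("GTCAGTTCCACAGTGGCATCTGAACTCG", KNOWN))
```

Because the label is derived only from the sequence, two samples with the same unmatched N450 receive the same Novel label, which is what lets them group together when no DSId fasta is supplied.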

More Run Options

For more detailed running options including adding metadata, adjusting parameters, adding in DSID matches, and more, please refer to the usage docs.

[!WARNING] Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
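Since parameters belong on the CLI or in a params file, one option is to generate that file from a script. Nextflow's -params-file accepts JSON or YAML. The sketch below writes JSON using only parameter names that appear in this README. `write_params` and the file names are hypothetical.

```python
import json


def write_params(path, **params):
    """Write pipeline parameters to a JSON file, dropping unset (None) values."""
    cleaned = {key: value for key, value in params.items() if value is not None}
    with open(path, "w") as handle:
        json.dump(cleaned, handle, indent=2)
    return cleaned
```

For example, `write_params("params.json", input="samplesheet.csv", outdir="results", reference="reference.fasta", platform="illumina", primer_bed=None, dsid_fasta=None)` produces a file that could then be passed as `nextflow run phac-nml/measeq -profile docker -params-file params.json`.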

Testing

To test the MeaSeq pipeline and confirm that everything works on your system, a small set of Illumina D8 genotype samples from SRA BioProject PRJNA480551 has been included in the test_data/fastqs directory.

To run the pipeline on these samples run the following command:

    nextflow run phac-nml/measeq -profile test_illumina,<docker/singularity/institute/etc>

Outputs

The main outputs of the pipeline are the consensus sequences (N450 and Full), the overall.qc.csv summary table, and the MeaSeq_Report.html. The final MeaSeq report gives a summary of the run including sample quality metrics, plots, and any additional information. Detailed pipeline outputs are described in the output docs.

Steps

More detailed steps are available in the output docs.

Illumina Steps

Generate Reference and Primer Intermediates
FastQC
Illumina Consensus Workflow
1. FastP
2. BWAMem2
3. iVar Trim (Amplicon input only)
4. Freebayes
5. Process Freebayes VCF
6. Make Depth Mask
7. Bcftools Consensus (Ambiguous and Consensus variants)
Nextclade (N450 and Custom datasets, N450 fasta output)
Samtools depth
Compare DSId (Optional with --dsid_fasta parameter)
Make sample QC
Amplicon Summary Workflow (Amp only data)
1. Bedtools Coverage
2. Summarize Amplicon Depth
3. Summarize Amplicon Completeness
4. MultiQC Amplicon Report
Report Workflow
1. Samtools mpileup
2. Pysamstats
3. Rmarkdown

Nanopore Steps

Generate Reference and Primer Intermediates
FastQC
Nanopore Consensus Workflow
1. Artic Get Models
2. NanoQ
3. Minimap2
4. Amplicon
  1. Artic Align Trim
  2. Clair3 Pool
  3. Artic VCF Merge
5. Clair3 No Pool (non-amplicon)
6. Make Depth Mask
7. VCF Filter
8. Artic Mask
9. Bcftools Norm
10. Bcftools Consensus
Nextclade (N450 and Custom datasets, N450 fasta output)
Samtools depth
Compare DSId (Optional with --dsid_fasta parameter)
Make sample QC
Amplicon Summary Workflow (Amp only data)
1. Bedtools Coverage
2. Summarize Amplicon Depth
3. Summarize Amplicon Completeness
4. MultiQC Amplicon Report
Report Workflow
1. Samtools mpileup
2. Pysamstats
3. Rmarkdown

Troubleshooting

For troubleshooting, please open an issue or consult the usage docs to see if they have the information you require.

Credits

MeaSeq was originally written as an Illumina-focused bash pipeline by McMaster University co-op student Ahmed Abdalla. It has since been expanded to cover Nanopore data and fully converted to Nextflow.

For questions please contact either:

Darian Hole (darian.hole@phac-aspc.gc.ca)
Molly Pratt (molly.pratt@phac-aspc.gc.ca)

Citations

A citation for this pipeline will be available soon.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Detailed citations for the tools and data used in this pipeline are found in CITATIONS.md.

Contributing

Contributions are welcome through PRs or Issues.

Legal

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

Name: National Microbiology Laboratory
Login: phac-nml
Kind: organization

Website: https://www.nml-lnm.gc.ca/
Repositories: 50
Profile: https://github.com/phac-nml

Citation (CITATIONS.md)

# phac-nml/MeaSeq: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

[Artic](https://github.com/artic-network/fieldbioinformatics)

[Clair3](https://github.com/HKU-BAL/Clair3)

> Zheng, Z.; Li, S.; Su, J.; Leung, A. W.-S.; Lam, T.-W.; Luo, R. Symphonizing Pileup and Full-Alignment for Deep Learning-Based Long-Read Variant Calling. Nature Computational Science 2022, 2 (12), 797–803. https://doi.org/10.1038/s43588-022-00387-x.

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

> Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

[fastp](https://github.com/OpenGene/fastp/)

> Chen S. (2023). Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

[BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

> Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

[iVar](https://www.ncbi.nlm.nih.gov/pubmed/30621750/)

> Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, Tan AL, Paul LM, Brackney DE, Grewal S, Gurfield N, Van Rompay KKA, Isern S, Michael SF, Coffey LL, Loman NJ, Andersen KG. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 2019 Jan 8;20(1):8. doi: 10.1186/s13059-018-1618-7. PubMed PMID: 30621750; PubMed Central PMCID: PMC6325816.

[MultiQC](https://www.ncbi.nlm.nih.gov/pubmed/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

[R](https://www.R-project.org/)

> R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

[SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

[QUAST](https://www.ncbi.nlm.nih.gov/pubmed/23422339/)

> Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PubMed PMID: 23422339; PubMed Central PMCID: PMC3624806.

[Nextclade](https://clades.nextstrain.org/)

> Aksamentov, I., Roemer, C., Hodcroft, E. B., & Neher, R. A., (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. Journal of Open Source Software, 6(67), 3773, https://doi.org/10.21105/joss.03773

[pysamstats](https://github.com/alimanfoo/pysamstats)

> Miles A. (2014). pysamstats. Available at https://github.com/alimanfoo/pysamstats

[Python](https://github.com/python/)

> Python Software Foundation. Python Language Reference, version 3.8. Available at http://www.python.org

## R Packages

[data.table](https://CRAN.R-project.org/package=data.table)

> Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T (2024). _data.table: Extension of `data.frame`_. R package version 1.15.4, https://CRAN.R-project.org/package=data.table

[DT](https://CRAN.R-project.org/package=DT)

> Xie Y, Cheng J, Tan X (2024). _DT: A Wrapper of the JavaScript Library 'DataTables'_. R package version 0.33, https://CRAN.R-project.org/package=DT

[dplyr](https://CRAN.R-project.org/package=dplyr)

> Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr

[flexdashboard](https://CRAN.R-project.org/package=flexdashboard)

> Aden-Buie G, Sievert C, Iannone R, Allaire J, Borges B (2023). _flexdashboard: R Markdown Format for Flexible Dashboards_. R package version 0.6.2, https://CRAN.R-project.org/package=flexdashboard

[htmltools](https://CRAN.R-project.org/package=htmltools)

> Cheng J, Sievert C, Schloerke B, Chang W, Xie Y, Allen J (2024). _htmltools: Tools for HTML_. R package version 0.5.8.1, https://CRAN.R-project.org/package=htmltools

[htmlwidgets](https://CRAN.R-project.org/package=htmlwidgets)

> Vaidyanathan R, Xie Y, Allaire J, Cheng J, Sievert C, Russell K (2023). _htmlwidgets: HTML Widgets for R_. R package version 1.6.4, https://CRAN.R-project.org/package=htmlwidgets

[plotly](https://plotly-r.com)

> C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020. https://plotly-r.com

[rmarkdown](https://github.com/rstudio/rmarkdown)

> Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2024). _rmarkdown: Dynamic Documents for R_. R package version 2.27, https://github.com/rstudio/rmarkdown

[tidyverse](https://doi.org/10.21105/joss.01686)

> Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” _Journal of Open Source Software_, _4_(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686

[knitr](https://yihui.org/knitr/)

> Xie Y (2024). _knitr: A General-Purpose Package for Dynamic Report Generation in R_. R package version 1.49, https://yihui.org/knitr/

[stringr](https://CRAN.R-project.org/package=stringr)

> Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.1, https://CRAN.R-project.org/package=stringr

[readr](https://CRAN.R-project.org/package=readr)

> Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R package version 2.1.5, https://CRAN.R-project.org/package=readr

[shidashi](https://CRAN.R-project.org/package=shidashi)

> Wang Z (2024). _shidashi: A Shiny Dashboard Template System_. R package version 0.1.6, https://CRAN.R-project.org/package=shidashi

GitHub Events

Total

Release event: 2
Delete event: 5
Issue comment event: 8
Member event: 1
Push event: 39
Pull request review comment event: 19
Pull request review event: 9
Pull request event: 16
Create event: 8

Last Year

Release event: 2
Delete event: 5
Issue comment event: 8
Member event: 1
Push event: 39
Pull request review comment event: 19
Pull request review event: 9
Pull request event: 16
Create event: 8

Issues and Pull Requests

All Time

Total issues: 0
Total pull requests: 7
Average time to close issues: N/A
Average time to close pull requests: 7 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.43
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 7
Average time to close issues: N/A
Average time to close pull requests: 7 days
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.43
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0