https://github.com/ajodeh-juma/rvfv-amplicon-seq

A simple pipeline for Rift Valley fever virus consensus genome generation from amplicon sequencing

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 20 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (13.6%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: ajodeh-juma
License: mit
Language: Nextflow
Default Branch: master
Size: 11.4 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme Changelog License Code of conduct

Introduction

rvfvampliconseq is a Nextflow pipeline for analyzing Rift Valley fever virus amplicon sequencing data from Illumina instruments.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, which make installation trivial and results highly reproducible.

Installation

rvfvampliconseq runs on UNIX/Linux systems. Install Miniconda3 first, then proceed with the pipeline installation:

Install nextflow
Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)
Download the pipeline

git clone https://github.com/ajodeh-juma/rvfv-amplicon-seq.git
cd rvfv-amplicon-seq
conda env create -f environment.yml
conda activate rvfvampliconseq-env

Testing

Optional: test the installation on the minimal dataset bundled with the pipeline

nextflow run main.nf -profile test

Usage

To list the minimal pipeline options, use the --help flag, e.g.

nextflow run main.nf --help

To see all the options, use the --show_hidden_params flag e.g.

nextflow run main.nf --help --show_hidden_params

Input

The input is a comma-separated values (CSV) file with the columns sample, fastq_1 and fastq_2 for paired-end sequence data. The fastq_1 and fastq_2 columns should contain the absolute paths of the FASTQ files. The sample name should correspond to the basename of the FASTQ files.

| sample | fastq_1 | fastq_2 |
| --- | --- | --- |
| AM-M1 | /home/jjuma/data/genomics/rvfv/illumina/miseq/run/09032022/testdata/AM-M1_S1_L001_R1_001.fastq.gz | /home/jjuma/data/genomics/rvfv/illumina/miseq/run/09032022/testdata/AM-M1_S1_L001_R2_001.fastq.gz |
| RU-1 | /Users/jjuma/data/genomics/rvfv/illumina/miseq/run/09032022/testdata/RU-1_S1_L001_R1_001.fastq.gz | /home/jjuma/data/genomics/rvfv/illumina/miseq/run/09032022/testdata/RU-1_S1_L001_R2_001.fastq.gz |
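A samplesheet in this layout can be generated rather than typed by hand. The following is a minimal sketch, not part of the pipeline: the function name `build_samplesheet` is hypothetical, and it assumes the Illumina file naming seen in the example rows (`<sample>_S<n>_L<lane>_R1_001.fastq.gz` with a matching `R2` file).

```python
import csv
import re
from pathlib import Path

# Assumption: files follow the Illumina naming used in the example table,
# e.g. AM-M1_S1_L001_R1_001.fastq.gz / AM-M1_S1_L001_R2_001.fastq.gz
R1_PATTERN = re.compile(r"^(?P<sample>.+)_S\d+_L\d+_R1_001\.fastq\.gz$")


def build_samplesheet(fastq_dir, out_csv):
    """Write a sample,fastq_1,fastq_2 CSV for every R1/R2 pair in fastq_dir."""
    fastq_dir = Path(fastq_dir).resolve()  # the samplesheet needs absolute paths
    rows = []
    for r1 in sorted(fastq_dir.glob("*_R1_001.fastq.gz")):
        match = R1_PATTERN.match(r1.name)
        if match is None:
            continue  # name does not follow the assumed convention
        r2 = r1.with_name(r1.name.replace("_R1_001.fastq.gz", "_R2_001.fastq.gz"))
        if not r2.exists():
            raise FileNotFoundError(f"missing mate for {r1.name}: {r2.name}")
        rows.append({"sample": match["sample"], "fastq_1": str(r1), "fastq_2": str(r2)})

    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample", "fastq_1", "fastq_2"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `build_samplesheet("/path/to/fastq", "samplesheet.csv")` writes the file and returns the rows, so the result can be inspected before it is passed to `--input`.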

Metadata

If you intend to generate plots of coverage vs Ct values, provide a metadata table in CSV format that includes the columns sample_name and Ct. For example:

| sample_name | sample_type | host | platform | instrument | strategy | date | Ct | location | country | culture |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AM-M1 | Aborted Foetus | Cow | Illumina | MiSeq | amplicon | 16-01-2021 | 21.918 | Kiambu | Kenya | No |
| RU-1 | Serum | Cow | Illumina | MiSeq | amplicon | 01-01-2018 | 25.245 | Rulindo | Rwanda | No |
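Before a run, it can help to confirm that every sample in the samplesheet has a numeric Ct value in the metadata. The sketch below is an illustration only; `check_metadata` is a hypothetical helper, and it assumes that `sample_name` in the metadata matches `sample` in the samplesheet exactly, as in the example tables.

```python
import csv


def check_metadata(samplesheet_csv, metadata_csv):
    """Return samplesheet samples that lack a numeric Ct value in the metadata."""
    with open(samplesheet_csv, newline="") as handle:
        samples = [(row["sample"] or "").strip() for row in csv.DictReader(handle)]

    with open(metadata_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        # The two columns the README names as required for coverage vs Ct plots
        missing_cols = {"sample_name", "Ct"} - set(reader.fieldnames or [])
        if missing_cols:
            raise ValueError(f"metadata is missing columns: {sorted(missing_cols)}")
        ct_by_sample = {
            (row["sample_name"] or "").strip(): (row["Ct"] or "").strip()
            for row in reader
        }

    problems = []
    for sample in samples:
        try:
            float(ct_by_sample[sample])  # KeyError: no row; ValueError: not a number
        except (KeyError, ValueError):
            problems.append(sample)
    return problems
```

An empty list means every sample has a usable Ct value; any names returned are either absent from the metadata or have a non-numeric Ct.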

A typical command for the S segment is:

nextflow run main.nf \
    --input /home/jjuma/PhD_RVF2019/projects/RVFv_amplicon_tiling_PCR/qPCR_Read_outs/09032022_samplesheet.csv \
    --segment S \
    --skip_markduplicates \
    --metadata /home/jjuma/PhD_RVF2019/projects/RVFv_amplicon_tiling_PCR/qPCR_Read_outs/metadata_09032022.csv \
    --outdir "${OUTDIR}/S-segment-outdir" \
    -work-dir "${OUTDIR}/S-segment-workdir" \
    -resume

For the other segments, replace the value of --segment with L or M.
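Since the three segments differ only in the `--segment` value and the output locations, the commands can be generated in a loop. This is a minimal sketch using only the flags shown in the command above; the function name `segment_commands` and the per-segment directory naming for M and L are assumptions modelled on the S-segment example.

```python
def segment_commands(samplesheet, metadata, outdir, segments=("S", "M", "L")):
    """Build one `nextflow run` argument list per genomic segment."""
    commands = []
    for segment in segments:
        commands.append([
            "nextflow", "run", "main.nf",
            "--input", samplesheet,
            "--segment", segment,
            "--skip_markduplicates",
            "--metadata", metadata,
            # Separate output and work directories keep the segments apart
            "--outdir", f"{outdir}/{segment}-segment-outdir",
            "-work-dir", f"{outdir}/{segment}-segment-workdir",
            "-resume",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository directory, or printed with `shlex.join(cmd)` to review the command first.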

Method details

The pipeline offers several parameters, including those highlighted below:

```
Input/output options
  --input [string] Path to comma-separated file containing information about the samples in the experiment.
  --singleend [boolean] Specifies that the input is single-end reads.
  --outdir [string] The output directory where the results will be saved. [default: ./results]
  --multiqctitle [string] MultiQC report title. Printed as page header, used for filename if not otherwise specified.
  --email [string] Email address for completion summary.

Reference genome options
  --segment [string] Genomic segment of the virus. Options are 'S', 'M' and 'L'.
  --hostfasta [string] Path to the host FASTA genome file.
  --hostbwaindex [string] Path to host genome directory or tar.gz archive for pre-built BWA index.
  --hostbowtie2index [string] Path to host genome directory or tar.gz archive for pre-built Bowtie2 index.
  --savereference [boolean] If generated by the pipeline, save the BWA index in the results directory.

Read trimming options
  --trimmer [string] Specifies the trimming tool to use - available options are 'fastp', 'trimmomatic'. [default: fastp]
  --adapters [string] Path to FASTA adapters file.
  --leading [integer] Instructs Trimmomatic to cut bases off the start of a read, if below a threshold quality [default: 3]
  --trailing [integer] Instructs Trimmomatic to cut bases off the end of a read, if below a threshold quality [default: 3]
  --averagequality [integer] Instructs Trimmomatic or Fastp the average quality required in the sliding window [default: 20]
  --minlength [integer] Instructs Trimmomatic or Fastp to drop the read if it is below a specified length [default: 20]
  --qualifiedqualityphred [integer] Instructs Fastp to apply the --qualifiedqualityphred option [default: 30]
  --unqualifiedpercentlimit [integer] Instructs Fastp to apply the --unqualifiedpercentlimit option [default: 10]
  --skiptrimming [boolean] Skip the adapter trimming step.
  --savetrimmed [boolean] Save the trimmed FastQ files in the results directory.
  --savetrimmedfail [boolean] Save failed trimmed reads.

Alignment options
  --aligner [string] Specifies the alignment algorithm to use - available options are 'bwa', 'bowtie2'. [default: bwa]
  --seqcenter [string] Sequencing center information to be added to read group of BAM files.
  --savealignintermeds [boolean] Save the intermediate BAM files from the alignment step.
  --skip_markduplicates [boolean] Skip picard MarkDuplicates step.
  --skipalignment [boolean] Skip all of the alignment-based processes within the pipeline.
  --minmapped [integer] Minimum number of mapped reads to be used as threshold to drop low mapped samples [default: 200]

Amplicon trimming options
  --primerschemeversion [string] PrimalScheme RVFV primer scheme to use: 'V1', 'V2' or 'V3'.
  --ivartrimnoprimer [boolean] Unset -e parameter for ivar trim. Reads with primers are excluded by default.
  --ivartrimminlen [integer] Minimum length of read to retain after trimming [default: 20]
  --ivartrimminqual [integer] Minimum quality threshold for sliding window to pass [default: 20]
  --ivartrimwindowwidth [integer] Size of the sliding window [default: 4]
  --ampliconleftsuffix [string] Left suffix string in the amplicons primer bed file [default: _LEFT]
  --ampliconright_suffix [string] Right suffix string in the amplicons primer bed file [default: _RIGHT]

Variant calling options
  --mpileupdepth [integer] SAMtools mpileup max per-file depth, avoids excessive memory usage.
  --minbasequality [integer] Skip bases with baseQ/BAQ smaller than this value when performing variant calling [default: 20]
  --mincoverage [integer] Skip positions with an overall read depth smaller than this value when performing variant calling [default: 10]
  --minallelefreq [number] Minimum allele frequency threshold for calling variant [default: 0.25]
  --maxallelefreq [number] Maximum allele frequency threshold for calling variant [default: 0.75]
  --save_mplieup [boolean] Save SAMtools mpileup output file.

Process skipping options
  --skipmultiqc [boolean] Skip MultiQC.
  --skipqc [boolean] Skip all QC steps except for MultiQC.
```

Output

All output is written to the results directory unless --outdir is specified. Masked and non-masked consensus genomes are located in bcftools/consensus.
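A quick way to compare the consensus genomes is to report how much of each sequence is masked. The sketch below is illustrative and makes two assumptions that are not confirmed by the pipeline documentation: masked positions are written as `N`, and the consensus files match the glob `*.fa`. Both function names are hypothetical.

```python
from pathlib import Path


def masked_fraction(fasta_path):
    """Return {record_id: fraction of N bases} for each sequence in a FASTA file."""
    fractions = {}
    name, length, n_count = None, 0, 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Store the finished record before starting the next one
                if name is not None and length:
                    fractions[name] = n_count / length
                name, length, n_count = line[1:].strip().split(" ")[0], 0, 0
            elif line and name is not None:
                length += len(line)
                n_count += line.upper().count("N")  # assumption: masked bases are N
    if name is not None and length:
        fractions[name] = n_count / length
    return fractions


def summarise_consensus(consensus_dir, pattern="*.fa"):
    """Apply masked_fraction to every FASTA matching pattern in consensus_dir."""
    return {p.name: masked_fraction(p) for p in sorted(Path(consensus_dir).glob(pattern))}
```

For the default output location this would be called as `summarise_consensus("results/bcftools/consensus")`; adjust `pattern` if the files use a different extension.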

See usage docs for all of the available options when running the pipeline.

Pipeline Summary

By default, the pipeline currently performs the following:

Sequencing quality control (FastQC)
Quality control and preprocessing (fastp) or (trimmomatic)
Reads alignment/mapping (BWA) or (Bowtie2)
Alignment summary (SAMtools)
Call variants (iVar)
Annotate variants (SnpEff) or (SnpSift)
Genome coverage (BEDTools)
Visualization (R), (ggplot2)
Overall pipeline run summaries (MultiQC)

Documentation

Generating whole genome sequences of segmented viruses has largely depended on sequencing of partial gene sequences of the viruses. Here we implement a pipeline that can be adapted to other segmented viruses in order to assemble complete genomic sequences from RNA metagenomic sequencing. We implement this pipeline to generate complete genome sequences of Rift Valley fever virus, a tripartite virus with Small (S), Medium (M) and Large (L) segments.

The pipeline comes bundled with a reference genome and annotation, and the user only has to specify the segment to obtain full genome sequences. The pipeline calls variants using iVar and annotates the variants using SnpEff and SnpSift.

Credits

rvfvampliconseq was originally written by @ajodeh-juma with inspiration from the @nf-core team, particularly the viralrecon pipeline.

We thank the following people for their extensive assistance in the development of this pipeline:

License

rvfvampliconseq is free software, licensed under MIT.

Issues

Please report any issues to the issues page.

Contribute

If you wish to fix a bug or add new features to the software, we welcome Pull Requests. We use GitHub Flow style development. Please fork the repo, make the change, then submit a Pull Request against our master branch, with details about what the change is and what it fixes or adds. We will then review your changes and merge them, or provide feedback on enhancements.

Citations

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references of tools and data used in this pipeline are as follows:

Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. Published 2012 Mar 4. doi:10.1038/nmeth.1923

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352

Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987-2993. doi:10.1093/bioinformatics/btr509

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Grubaugh, N.D.; Gangavarapu, K.; Quick, J.; Matteson, N.L.; De Jesus, J.G.; Main, B.J.; Tan, A.L.; Paul, L.M.; Brackney, D.E.; Grewal, S.; et al. An Amplicon-Based Sequencing Framework for Accurately Measuring Intrahost Virus Diversity Using PrimalSeq and IVar. Genome Biol. 2019, 20, 8, doi:10.1186/s13059-018-1618-7

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695

Owner

Name: JJ
Login: ajodeh-juma
Kind: user
Location: Nairobi, KE

Repositories: 5
Profile: https://github.com/ajodeh-juma

A biologist with interest in computational biology and bioinformatics.

Dependencies

environment.yml conda

bedtools 2.29.2.*
bioconductor-biostrings 2.58.0.*
bioconductor-complexheatmap 2.6.2.*
biopython 1.78.*
bowtie2 2.4.2.*
bwa 0.7.17.*
fastp 0.20.1.*
fastqc 0.11.9.*
ivar 1.3.1.*
multiqc 1.9.*
nextflow 20.10.0.*
pandas 1.0.5.*
picard 2.25.1.*
r-argparse 2.0.3.*
r-cowplot 1.1.0.*
r-data.table 1.13.4.*
r-ggforce 0.3.3.*
r-gplots 3.1.1.*
r-gridextra 2.3.*
r-hrbrthemes 0.8.0.*
r-markdown 1.1.*
r-optparse 1.6.6.*
r-svglite 1.2.3.2.*
r-tidyverse 1.3.0.*
r-viridis 0.5.1.*
r-zoo 1.8_8.*
samtools 1.10.*
