sangerflow

A bioinformatics pipeline that automates the Sanger amplicon sequencing data analysis of thousands of samples in parallel.

https://github.com/asadprodhan/sangerflow

Keywords

nextflow-pipeline singularity-containers

Repository

Basic Info

Host: GitHub
Owner: asadprodhan
License: gpl-3.0
Language: Nextflow
Default Branch: main
Homepage:
Size: 1.16 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 0

Created over 2 years ago · Last pushed 12 months ago

sangerFlow, an automated bioinformatics pipeline to analyse Sanger amplicon sequencing data for pest and pathogen diagnosis

M. Asaduzzaman Prodhan*

DPIRD Diagnostics and Laboratory Services, Department of Primary Industries and Regional Development

3 Baron-Hay Court, South Perth, WA 6151, Australia

*Correspondence: Asad.Prodhan@dpird.wa.gov.au

About sangerFlow

DNA barcoding is a powerful tool for identifying species. It involves i) extracting DNA or RNA from the specimen, ii) performing a Polymerase Chain Reaction (PCR) that targets a DNA barcode, and iii) high-quality sequencing, such as Sanger sequencing, of the PCR product. The sequencing data come as forward and reverse reads that require manual quality control, alignment, and sequence similarity analysis using web-based Blastn to identify the species. This manual analysis can become a limiting factor in biosecurity surveillance or diagnostic settings that require high-throughput analysis. sangerFlow addresses this challenge by automating the entire analysis (Fig. 1).

Figure 1: sangerFlow automates pest and pathogen identification using PCR Sanger sequencing data.

sangerFlow automatically analyses the forward and reverse reads from PCR Sanger sequencing data. The pipeline takes fasta files as input and returns Blastn hits, i.e., species identifications, for each specimen (Fig. 2). Because every step is automated, the pipeline scales to large numbers of samples. It is written with the workflow manager Nextflow and uses Singularity containers. As a result, end users need no software installation other than Nextflow and Singularity, no software subscription, and no programming expertise. These features make the pipeline user-friendly and well suited to large-scale Sanger amplicon sequencing data analysis.

Figure 2: sangerFlow pipeline.

How to use sangerFlow

Follow the steps below to use sangerFlow.

Step 1: Install the required software

Install conda
Create a conda environment and name it sangerFlow

conda create -n sangerFlow

Activate the conda environment sangerFlow

conda activate sangerFlow

Install Nextflow

conda install -c bioconda nextflow

Run the following command to make sure that Nextflow has been installed

nextflow -h

If you see the Nextflow options as in Fig. 3, then Nextflow has been installed

Figure 3: Nextflow options.

Install Singularity

conda install -c conda-forge singularity

Run the following command to make sure that Singularity has been installed

singularity -h

If you see the Singularity options as in Fig. 4, then Singularity has been installed

Figure 4: Singularity options.
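The two installation checks above can also be combined into one quick test. The following is a minimal sketch, not part of sangerFlow; the helper name check_tool is hypothetical, and it only tests whether each command is on your PATH.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
        return 1
    fi
}

# Check both tools that sangerFlow depends on; keep going if one is missing.
for tool in nextflow singularity; do
    check_tool "$tool" || true
done
```

Run it inside the activated sangerFlow conda environment, since that is where the tools were installed.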

Step 2: Prepare a sample description file

Fig. 5 shows an example of a sample description file. It is a tab-separated values (tsv) file.

The first column is an ID for the sample
The second column is the forward (or read1) sequence file name
The third column is the reverse (or read2) sequence file name
If your data files have the .fa extension instead of .fasta, then replace .fasta with .fa in your sample description sheet (Fig. 5). No other changes are required.

Figure 5: Sample description file.
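As an illustration of the three-column layout described above, the sketch below writes a small sample description file and checks that every line has exactly three tab-separated fields. The file name samplesheet.tsv, the sample IDs, the read file names, and the helper validate_sheet are all hypothetical; use the names shown in Fig. 5 and in your own data.

```shell
# Write a two-sample sheet. All IDs and file names here are made up.
printf 'sample1\tsample1_F.fasta\tsample1_R.fasta\n'  > samplesheet.tsv
printf 'sample2\tsample2_F.fasta\tsample2_R.fasta\n' >> samplesheet.tsv

# Hypothetical helper: succeed only if every non-empty line has
# exactly three tab-separated fields; report offending lines otherwise.
validate_sheet() {
    awk -F '\t' '
        NF > 0 && NF != 3 { printf "line %d: expected 3 fields, found %d\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

validate_sheet samplesheet.tsv && echo "samplesheet.tsv looks well-formed"
```

A check like this catches the common mistake of separating columns with spaces instead of tabs.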

Step 3: Download a blastn database from NCBI

How to download a blastn database from NCBI

Step 4: Run sangerFlow

Create a directory, say amplicon_analysis
Transfer your Sanger amplicon sequencing data to the amplicon_analysis directory

sangerFlow can be tested using its publicly available test dataset (NCBI Project ID PRJNA37833, NCBI Sample ID SAMN12109156, and NCBI Run ID SRR9339436) DOWNLOAD

Keep your sample description tsv file in the amplicon_analysis directory
Run the following command to make sure that all the files are in UNIX format

dos2unix *

Run the following command to make sure that all the files are executable

chmod +x *

Run the following command to run sangerFlow

nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --db="/path/to/your/blastn_database"

Take the VERSION-NUMBER from the sangerFlow GitHub home page. Its location is shown in the red box in Fig. 6.

Figure 6: sangerFlow version number location.
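Before launching a run, it can help to confirm that every read file named in the sample description file is actually present in the analysis directory. The sketch below is one way to do that and is not part of sangerFlow. The helper name check_inputs is hypothetical, and the sketch assumes the sheet has no header row; if yours has one, its column names will be reported as missing files.

```shell
# Hypothetical helper: list every read file named in columns 2 and 3 of
# the sheet that is absent from the data directory.
# Returns non-zero if at least one file is missing.
check_inputs() {
    sheet=$1
    dir=$2
    missing=0
    while IFS="$(printf '\t')" read -r id fwd rev || [ -n "$id" ]; do
        [ -n "$id" ] || continue          # skip blank lines
        for f in "$fwd" "$rev"; do
            if [ ! -f "$dir/$f" ]; then
                echo "missing: $f (sample $id)"
                missing=1
            fi
        done
    done < "$sheet"
    return "$missing"
}
```

From inside amplicon_analysis, usage might look like `check_inputs samplesheet.tsv .`, where samplesheet.tsv stands in for whatever your sample description file is called.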

You can adjust the blastn analysis using the following optional flags

--evalue=XX. Default is 0.1

--cpus=XX. Default is 18

--topHits=XX. Default is 5

For example

nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --evalue=0.05 --topHits=1 --cpus=16 --db="/path/to/your/blastn_database"
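When the same settings are reused across runs, keeping them in shell variables makes the command easier to review. The sketch below only prints the assembled command and does not execute it. The variable names and the helper build_cmd are hypothetical; VERSION-NUMBER and the database path are the placeholders from the text above and must be replaced with real values.

```shell
# Settings to adjust. VERSION-NUMBER and the database path are placeholders.
version="VERSION-NUMBER"
db="/path/to/your/blastn_database"
evalue=0.05
top_hits=1
cpus=16

# Hypothetical helper: assemble the run command as one line of text
# so it can be checked before it is run.
build_cmd() {
    printf 'nextflow run asadprodhan/sangerFlow -r %s --evalue=%s --topHits=%s --cpus=%s --db="%s"\n' \
        "$version" "$evalue" "$top_hits" "$cpus" "$db"
}

build_cmd    # prints the command; nothing is executed here
```

Once the printed line looks right, copy it into the terminal from inside your analysis directory.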

Outputs

When the run completes successfully, there will be three new directories (results, temp, and work) in your working directory

Results

This directory contains the blastn results, one tsv file per sample. In addition, there is a master blastn result sheet named concatenatedHits_withHeaders.tsv. This file contains the user-defined number of top Blastn hits for all the samples (Fig. 7).

Figure 7: sangerFlow master result sheet containing the user-defined number of top Blastn hits for all the samples.
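For a quick look at the master result sheet from the command line, a sketch such as the following can list its column names and count its hit rows. It assumes only that the first line of the file is a header row, as the file name suggests. The helper names show_header and count_hits are hypothetical.

```shell
# Hypothetical helper: print the column names of a tsv file, one per line.
show_header() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Hypothetical helper: count the non-empty data rows, skipping the header line.
count_hits() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

For example, `show_header results/concatenatedHits_withHeaders.tsv` followed by `count_hits results/concatenatedHits_withHeaders.tsv` gives the layout and the total number of hits reported across all samples.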

Temp

This directory contains all the intermediate files in case you need to inspect them.

Work

This directory contains one sub-directory per sample. The work directory is created by Nextflow by default. You can delete it to free up space on your computer.
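Deleting the work directory can be sketched as below. The helper name clean_work is hypothetical. It reports the size of the directory before removing it and does nothing if the directory is absent. Note that Nextflow's -resume option relies on the contents of work, so delete it only once you no longer need to resume the run.

```shell
# Hypothetical helper: remove the Nextflow work directory under the given
# run directory (default: current directory), printing its size first.
clean_work() {
    run_dir=${1:-.}
    if [ -d "$run_dir/work" ]; then
        du -sh "$run_dir/work"
        rm -rf "$run_dir/work"
    else
        echo "no work directory in $run_dir"
    fi
}
```

Usage might look like `clean_work amplicon_analysis`. Nextflow also provides its own `nextflow clean` command for removing the work files of previous runs.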

The End

Owner

Name: Asad Prodhan
Login: asadprodhan
Kind: user
Location: Perth, Australia
Company: Department of Primary Industries and Regional Development

Website: www.linkedin.com/in/asadprodhan
Twitter: Asad_Prodhan
Repositories: 2
Profile: https://github.com/asadprodhan

Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.
