ngspeciesid

Reference-free clustering and consensus forming of long-read amplicon sequencing

https://github.com/ksahlin/ngspeciesid

Last synced: 7 months ago

Repository

Basic Info

Host: GitHub
Owner: ksahlin
License: gpl-3.0
Language: Python
Default Branch: master
Homepage:
Size: 480 MB

Statistics

Stars: 59
Watchers: 4
Forks: 17
Open Issues: 5
Releases: 2

Created about 6 years ago · Last pushed 12 months ago

Metadata Files

Readme License

NGSpeciesID

NGSpeciesID is a tool for clustering and consensus forming of long-read amplicon sequencing data (it has been used with both PacBio and Oxford Nanopore data). The repository is a modified version of isONclust, to which consensus, primer-removal, and polishing features have been added.

NGSpeciesID is distributed as a Python package supported on Linux / OSX with Python v3.6.

NGSpeciesID filters reads by quality based on read Phred scores. However, we also recommend removing reads much shorter or longer than the intended target, since these often represent chimeras or contamination. This can be done by specifying --m (intended target length) and --s (maximum deviation from target length). NGSpeciesID can also subsample reads with the parameter --sample_size. Altogether, to filter out reads outside the length interval [700,800] and use a subset of 300 reads (if the dataset contains more reads), we could run

NGSpeciesID --ont --sample_size 300 --m 750 --s 50 --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output]

By default, length filtering and subsampling are not invoked if parameters are not specified.
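The length window and subsampling described above can be illustrated with a minimal Python sketch. This is not NGSpeciesID's own code: the function name `filter_and_subsample` is hypothetical, and uniform random subsampling is an assumption (the tool may select reads differently). It only shows how a target length of 750 and a maximum deviation of 50 translate into the interval [700,800].

```python
import random

def filter_and_subsample(reads, target_len=None, max_dev=None, sample_size=None, seed=0):
    """Keep reads within [target_len - max_dev, target_len + max_dev], then subsample.

    reads is a list of (name, sequence) tuples. Both steps are skipped when
    their parameters are not given, mirroring the default described above.
    """
    kept = list(reads)
    if target_len is not None and max_dev is not None:
        lo, hi = target_len - max_dev, target_len + max_dev
        kept = [(name, seq) for name, seq in kept if lo <= len(seq) <= hi]
    # Assumption: uniform random subsampling, only when more reads than requested.
    if sample_size is not None and len(kept) > sample_size:
        kept = random.Random(seed).sample(kept, sample_size)
    return kept

# Hypothetical reads of length 650, 700, 760 and 801.
reads = [("r1", "A" * 650), ("r2", "A" * 700), ("r3", "A" * 760), ("r4", "A" * 801)]
kept = filter_and_subsample(reads, target_len=750, max_dev=50)
print([name for name, _ in kept])  # only r2 and r3 fall inside [700, 800]
```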

Removing primers

If customized primers are expected in the reads, they can be detected and removed. The primer file is expected to be in fasta format. Here is an example of a primer file:

```
>MCB869ONTR
CGATCAATCCCCTAACAAACTAGG
>MCB398ONTF
TACCATGAGGACAAATATCATTCTG
```

NGSpeciesID searches for primers in a window of X bp (a parameter, default 150 bp) at the beginning and end of each consensus.

Trimming of primers is performed after consensus forming and can be invoked as

NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --primer_file [primers.fa]
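To make the windowed primer search concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not the tool's implementation: the helper names (`read_fasta`, `reverse_complement`, `trim_primers`) are hypothetical, and it uses exact string matching, whereas a real primer search would normally tolerate mismatches. The consensus is assumed to be longer than twice the window.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def trim_primers(consensus, primers, window=150):
    """Cut exact primer matches found in the first/last `window` bases."""
    start, end = 0, len(consensus)
    head = consensus[:window]
    tail_offset = max(0, len(consensus) - window)
    tail = consensus[tail_offset:]
    for seq in primers.values():
        for p in (seq, reverse_complement(seq)):
            i = head.find(p)
            if i != -1:
                start = max(start, i + len(p))   # trim up to the end of the primer
            j = tail.find(p)
            if j != -1:
                end = min(end, tail_offset + j)  # trim from the start of the primer
    return consensus[start:end]

primers = read_fasta(
    ">MCB869ONTR\nCGATCAATCCCCTAACAAACTAGG\n"
    ">MCB398ONTF\nTACCATGAGGACAAATATCATTCTG\n"
)
insert = "ACGTTGCA" * 50  # made-up 400 bp amplicon body
consensus = ("GG" + primers["MCB398ONTF"] + insert
             + reverse_complement(primers["MCB869ONTR"]) + "TT")
trimmed = trim_primers(consensus, primers)
print(len(consensus), len(trimmed))
```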

NGSpeciesID can also remove universal tails. Trimming of tails is performed after consensus forming and can be invoked as

NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --remove_universal_tails

The two options are mutually exclusive, i.e., only one of them can be run.

Output

The output consists of the polished consensus sequences along with some information about clustering.

- Polished consensus sequence(s). A folder named "medaka_cl_id_X" (or "racon_cl_id_X") is created for each predicted consensus. Each such folder contains a sequence "consensus.fasta", which is the final output of NGSpeciesID.
- Draft spoa consensus sequences of each of the clusters are given as consensus_reference_X.fasta (where X is a number).
- The final cluster information is given in a tsv file final_clusters.tsv present in the specified output folder.

In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession. For example:

0	read_X_acc
0	read_Y_acc
...
n	read_Z_acc

If there are n reads, there will be n rows. Some reads might be singletons. The rows are ordered by cluster size (largest first).
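As a small illustration of working with this two-column layout, the Python sketch below counts reads per cluster. The function name `cluster_sizes` and the example accessions are hypothetical; the only assumption taken from the description above is that each row is a tab-separated cluster ID and read accession.

```python
from collections import Counter

def cluster_sizes(tsv_text):
    """Return (cluster_id, number_of_reads) pairs, largest cluster first."""
    sizes = Counter()
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cluster_id, _read_acc = line.split("\t")[:2]
        sizes[cluster_id] += 1
    return sizes.most_common()

# Made-up content in the same layout as final_clusters.tsv.
example = "0\tread_a\n0\tread_b\n0\tread_c\n1\tread_d\n1\tread_e\n2\tread_f\n"
sizes = cluster_sizes(example)
for cluster_id, n_reads in sizes:
    print(cluster_id, n_reads)
```

For a real run, the text would be read from final_clusters.tsv in the output folder, for example with `open(path).read()`.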

EXAMPLE WORKFLOW

The bioinformatics workflow below was developed as part of a step-by-step protocol for field-deployable DNA amplicon sequencing with the Oxford Nanopore Technologies MinION. The full protocol manuscript is in submission; a link will be posted here when available. The steps below correspond to step numbers in the protocol.

P2 | Generate custom indexes for uniquely identifying samples using `barcode_generator`. This software uses Python3.

python3 barcode_generator_3.4.py none 24 40 8

Here, the parameters are set as:
- tableexcludedbarcodes = 'none'
- index length = 24 base pairs
- number of barcodes to generate = 40
- hamming distance = 8
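The Hamming distance parameter can be understood as the minimum number of positions at which any two indexes differ. The sketch below checks this for a set of barcodes. It is not part of `barcode_generator`; the function names and the short example barcodes are made up for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_hamming(barcodes):
    """Smallest Hamming distance over all pairs in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# Hypothetical 8 bp barcodes, shorter than the 24 bp indexes used above.
barcodes = ["AAAAAAAA", "CCCCCCCC", "AAAACCCC"]
min_dist = min_pairwise_hamming(barcodes)
print(min_dist)
```

A generated set satisfies the setting used above if `min_pairwise_hamming` returns at least 8.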

After lab steps are complete:

B1 | Basecalling and quality check (optional) with Guppy

These commands use the fast basecalling model from Guppy.

Basecalling for R9.4 flow cell:

guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings

Basecalling and filter reads by quality score (here, set to 7):

guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings --min_qscore 7

Basecalling for R10.3 flow cell:

guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r10.3_450bps_fast.cfg --recursive --disable_pings

B2 | Go to folder with the fastq files generated by Guppy

B3 | Concatenate all the read files into one large file

cat *.fastq > sequencing_reads.fastq

B4 | Check raw read quality/stats with NanoPlot

NanoPlot --fastq_rich sequencing_reads.fastq -o sequencing_run -p sequencing_run

B5 | Demultiplexing of the sequencing data with minibar or Guppy

Example files can be found in:
- Supplementary Data 1: 3,000 reads in fastq format from three fish species, Atlantic cod (Gadus morhua), Haddock (Melanogrammus aeglefinus), and Whiting (Merlangius merlangus), sequenced on a Flongle flow cell.
- Supplementary Data 2: index file used for demultiplexing with minibar

Supplementary Data 1 can be used as sequencing_reads.fastq and Supplementary Data 2 can be used as indexes.txt.

B5a | minibar (using example files):

python minibar.py indexes.txt sequencing_reads.fastq -T -F -e 3 -E 11

Here, the edit distance allowed between indexes (-e) is set to 3 base pairs and the edit distance allowed between primer sequences (-E) is set to 11 base pairs.
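For readers unfamiliar with edit distance, the sketch below computes the standard Levenshtein distance (substitutions, insertions, and deletions) between an expected index and an observed sequence. It is a generic textbook implementation, not minibar's matching code, and the two sequences are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

expected_index = "ACGTACGTAC"  # hypothetical index
observed = "ACGTTCGTC"         # same index with one substitution and one deletion
dist = edit_distance(expected_index, observed)
print(dist, dist <= 3)  # the second value tells whether it is within a threshold of 3
```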

B5b | Guppy:

guppy_barcoder -i sequencing_reads.fastq -s demultiplex_folder --trim_barcodes --disable_pings

B6 | Read filtering, clustering, consensus generation and polishing with NGSpeciesID

For a single sample (using example primer file):

NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq barcode0.fastq --outfolder barcode0_consensus

Here, the parameters are set as:
- the data is from ONT MinION (--ont)
- we want to generate consensus sequences (--consensus)
- subsample of reads (--sample_size) = 500 reads subsampled per sample to analyze
- intended target length (--m) = 800 base pairs
- maximum deviation from target length (--s) = 100 base pairs
- use Medaka to polish the final consensus sequences (--medaka)
- if a --primer_file is given, NGSpeciesID will check to remove any remaining primer sequence. The example primer file is available in Supplementary Data 3. The primers were developed in Mikkelsen, P.M., Bieler, R., Kappner, I., & Rawlings, T.A. (2006). Phylogeny of Veneroidea (Mollusca: Bivalvia) based on morphology and molecules. Zoological Journal of the Linnean Society, 148(3), 439-521.
- the input file of demultiplexed reads is specified by --fastq (output from step B5)
- the output consensus files will be saved to --outfolder

To run this step on more than one sample, use a bash script with a for loop:

for file in *.fastq; do
    bn=$(basename "$file" .fastq)
    NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq "$file" --outfolder "${bn}"
done

This loop uses the wildcard * to indicate you want to analyze all files with the .fastq extension and assumes the command is run from the directory that contains the read files (if not, be sure to change the file path: path/to/*.fastq).

This loop code can be entered at a UNIX/Mac terminal (be sure the spacing/indentation is correct) or saved as a script (see consensus.sh). The script should be run from the terminal, in the directory that contains the read files, as:

./consensus.sh
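For those who prefer Python over bash, the same batch step could be sketched as follows. The helper `ngspeciesid_command` is hypothetical; it only assembles the command line from step B6, with the output folder named after the fastq file without its extension. The example prints the commands rather than running them.

```python
import shlex
from pathlib import Path

def ngspeciesid_command(fastq, primer_file="primers.txt"):
    """Build the step B6 command for one demultiplexed fastq file."""
    fastq = Path(fastq)
    return [
        "NGSpeciesID", "--ont", "--consensus",
        "--sample_size", "500", "--m", "800", "--s", "100",
        "--medaka", "--primer_file", primer_file,
        "--fastq", str(fastq),
        "--outfolder", fastq.stem,  # e.g. barcode0.fastq -> barcode0
    ]

# Made-up file names; in practice use sorted(Path(".").glob("*.fastq")).
commands = [ngspeciesid_command(name) for name in ["barcode0.fastq", "barcode1.fastq"]]
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run each command: subprocess.run(cmd, check=True)
```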

B7 | Compare consensus sequences to reference database with BLAST

Create/format database for BLAST search:

makeblastdb -in database.fasta -dbtype nucl -out database

Conduct BLAST search:

blastn -db database -query barcode0_consensus.fasta -outfmt 6 -out barcode0_consensus_blast.out

Check the results and refine the search or database as needed to better identify the sequence identity of your samples!
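One way to start checking the results is to keep the highest-scoring hit for each consensus sequence. The sketch below parses the tabular output written by -outfmt 6, whose default twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. The function name `best_hits` and the two example rows are made up for illustration.

```python
import csv
import io

BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(blast_text):
    """Return {query_id: hit_dict} keeping the hit with the highest bitscore."""
    best = {}
    for row in csv.reader(io.StringIO(blast_text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best

# Hypothetical rows in the layout of barcode0_consensus_blast.out.
example = (
    "consensus_1\tref_A\t99.5\t650\t3\t0\t1\t650\t1\t650\t0.0\t1180\n"
    "consensus_1\tref_B\t92.0\t640\t50\t1\t1\t640\t5\t644\t1e-150\t900\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```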

CREDITS

Please cite [1] when using NGSpeciesID.

[1] Sahlin, K, Lim, MCW, Prost, S. NGSpeciesID: DNA barcode and amplicon consensus generation from long‐read sequencing data. Ecol Evol. 2021; 00: 1– 7. https://doi.org/10.1002/ece3.7146

LICENCE

GPL v3.0, see LICENSE.txt.

Owner

Name: Kristoffer
Login: ksahlin
Kind: user

Website: http://sahlingroup.github.io/
Repositories: 26
Profile: https://github.com/ksahlin

GitHub Events

Total

Create event: 1
Release event: 1
Issues event: 12
Watch event: 9
Issue comment event: 10
Push event: 3
Pull request event: 4
Fork event: 1

Last Year

Create event: 1
Release event: 1
Issues event: 12
Watch event: 9
Issue comment event: 10
Push event: 3
Pull request event: 4
Fork event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 149
Total Committers: 11
Avg Commits per committer: 13.545
Development Distribution Score (DDS): 0.436

Past Year

Commits: 3
Committers: 2
Avg Commits per committer: 1.5
Development Distribution Score (DDS): 0.333

Top Committers

Name	Email	Commits
Kristoffer Sahlin	k**4@p**u	84
Kristoffer Sahlin	k**n@g**m	26
Marisa Lim	m****m	14
Coppini	g**i@g**m	12
Kristoffer Sahlin	k**n@k**e	6
ksahlin	k**n@m**e	2
pashadag	p**g@g**m	1
David McCheyne	d**e@g**m	1
Sahlin	s**s@l**i	1
stefanscripts	5****s	1
jdalino	4****o	1

Committer Domains (Top 20 + Academic)

lm9-523-002.pc.helsinki.fi: 1
math.su.se: 1
kristoffersmbp.dyn.scilifelab.se: 1
psu.edu: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 27
Total pull requests: 10
Average time to close issues: 10 months
Average time to close pull requests: 3 days
Total issue authors: 21
Total pull request authors: 6
Average comments per issue: 3.37
Average comments per pull request: 1.4
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 3
Average time to close issues: about 11 hours
Average time to close pull requests: 6 days
Issue authors: 5
Pull request authors: 1
Average comments per issue: 0.4
Average comments per pull request: 1.67
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

pecholleyn (3)
atks (3)
Coppini (2)
omarkr8 (2)
AdrianoAbbOK (1)
edgardomortiz (1)
MarcV98 (1)
rkimoakbioinformatics (1)
gabyrech (1)
lixiaopi1985 (1)
PJV-Ecu (1)
PKuperus (1)
2tony2 (1)
josiah-liew (1)
mlosilla (1)

Pull Request Authors

joshuaowalker (6)
marisalim (2)
Coppini (2)
jdalino (1)
pecholleyn (1)
dwmccheyne (1)

Top Labels

Issue Labels

enhancement (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 136 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 20
Total maintainers: 1

pypi.org: ngspeciesid

Reconstructs viral consensus sequences from a set of ONT reads.

Homepage: https://github.com/ksahlin/NGSpeciesID
Documentation: https://ngspeciesid.readthedocs.io/
License: gpl-3.0
Latest release: 0.3.1
published 12 months ago

Versions: 20
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 136 Last month

Rankings

Dependent packages count: 10.0%

Stargazers count: 10.2%

Forks count: 10.2%

Average: 14.2%

Downloads: 19.1%

Dependent repos count: 21.7%

Maintainers (1)

ksahlin

Last synced: 7 months ago

Science Score: 46.0%