sbdi-phylomarkercheck

https://github.com/biodiversitydata-se/sbdi-phylomarkercheck

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: biodiversitydata-se
License: mit
Language: Nextflow
Default Branch: master
Size: 4.24 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 3

Created almost 3 years ago · Last pushed 10 months ago

Metadata Files

Readme, Changelog, Contributing, License, Code of conduct, Citation

sbdi-phylomarkercheck

Introduction

biodiversitydata-se/sbdi-phylomarkercheck is a bioinformatics pipeline that checks GTDB 16S sequences for phylogenetic signal with Sativa.

Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB are checked to confirm that their phylogenetic signal is consistent with their taxonomy.

Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns are removed, and the reverse complement of each is calculated. Subsequently, sequences are aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps are removed. The remaining sequences are analyzed with Sativa, and sequences that are not phylogenetically consistent with their taxonomy are removed.
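The filters described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: the function names are hypothetical, the thresholds are taken from the text, and treating both `-` and `.` as gap characters is an assumption about the alignment output.

```python
# Illustrative sketch of the pre-Sativa filters; names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def passes_prefilter(seq: str, max_len: int = 2000) -> bool:
    """Keep sequences of at most max_len nucleotides that contain no N."""
    return len(seq) <= max_len and "N" not in seq.upper()


def passes_gap_filter(aligned: str, max_gap_prop: float = 0.10) -> bool:
    """Keep aligned sequences whose gap proportion is at most max_gap_prop.

    Assumption: '-' and '.' both count as gaps.
    """
    if not aligned:
        return False
    gaps = sum(aligned.count(char) for char in "-.")
    return gaps / len(aligned) <= max_gap_prop
```

The gap threshold is kept as a parameter because the pipeline exposes its own setting for it (`non_gap_prop` in the parameter file).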

Files for the DADA2 (Callahan et al. 2016) methods assignTaxonomy and addSpecies are available in three versions each. The assignTaxonomy files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from assignTaxonomy with caution.) The versions differ in the maximum number of genomes included per species: 1, 5 or 20, indicated by "n1", "n5" and "n20" in the file names respectively. Using the version with 20 genomes per species should increase the chance that the addSpecies algorithm finds an exactly matching sequence, whereas many genomes per species could bias the higher-level taxonomic annotations made by assignTaxonomy. Our recommendation is hence to use the "n1" files for assignTaxonomy and the "n20" files for addSpecies.
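To make the "n1", "n5" and "n20" versions concrete, here is a minimal sketch of capping the number of genomes per species. The function name and genome identifiers are hypothetical, and keeping genomes in input order is an assumption; the pipeline may rank genomes differently before selecting them.

```python
from collections import defaultdict


def cap_per_species(records, n):
    """Keep at most n (genome_id, species) records per species.

    Assumption: input order decides which genomes are kept.
    """
    kept = []
    counts = defaultdict(int)
    for genome_id, species in records:
        if counts[species] < n:
            counts[species] += 1
            kept.append((genome_id, species))
    return kept


# Hypothetical genome identifiers, for illustration only.
records = [
    ("G1", "Escherichia coli"),
    ("G2", "Escherichia coli"),
    ("G3", "Bacillus subtilis"),
]
for n in (1, 5, 20):
    print(f"n{n}: {len(cap_per_species(records, n))} sequences")
```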

All files are gzipped fasta files with 16S sequences. The assignTaxonomy files pair each sequence with its taxonomy hierarchy from domain to species, whereas the addSpecies files have sequence identities and species names. There are also fasta files with the original GTDB sequence names, with "correct" in their names.
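The two header styles can be sketched as follows. This assumes a GTDB-style taxonomy string with rank prefixes such as `d__` and `s__`, and assumes that the prefixes are stripped and a trailing semicolon is added, as in the training files documented for DADA2. The exact headers written by the pipeline may differ; the function names, sequence identifier and taxonomy string are illustrative only.

```python
def strip_rank_prefix(name: str) -> str:
    """Drop a GTDB rank prefix such as 'd__' or 's__' if present."""
    return name.split("__", 1)[-1]


def assign_taxonomy_header(gtdb_taxonomy: str) -> str:
    """Header with the hierarchy from domain to species, semicolon-separated."""
    ranks = [strip_rank_prefix(r) for r in gtdb_taxonomy.split(";")]
    return ">" + ";".join(ranks) + ";"


def add_species_header(seq_id: str, gtdb_taxonomy: str) -> str:
    """Header with a sequence identifier followed by the species name."""
    species = strip_rank_prefix(gtdb_taxonomy.split(";")[-1])
    return f">{seq_id} {species}"


# Illustrative taxonomy string, not copied from a GTDB release.
taxonomy = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli"
)
print(assign_taxonomy_header(taxonomy))
print(add_species_header("SEQ1", taxonomy))
```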

Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: --dada_ref_taxonomy sbdi-gtdb (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the workflow estimates phylogenetic trees from the original GTDB trees. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQTREE [Nguyen et al. 2015] with a GTR+F+I+G4 model.
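As a rough illustration of the branch length step, the sketch below assembles an IQ-TREE command that keeps a user-supplied topology fixed (`-te`) while fitting the model to an alignment (`-s`, `-m`). The executable name `iqtree2`, the file names and the function name are assumptions; the pipeline's actual invocation may use other options.

```python
import shlex


def iqtree_brlen_command(alignment, tree, model="GTR+F+I+G4", prefix="sprep.brlenopt"):
    """Build an argument list for optimizing branch lengths on a fixed tree.

    Assumption: an IQ-TREE 2 executable named 'iqtree2' is on the PATH.
    """
    return [
        "iqtree2",
        "-s", alignment,     # alignment of 16S sequences
        "-te", tree,         # fixed topology, no tree search
        "-m", model,         # substitution model
        "--prefix", prefix,  # prefix for output files
    ]


# Hypothetical input file names; the command is printed, not run.
print(shlex.join(iqtree_brlen_command("sprep.alnfna", "gtdb-subset.tree")))
```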

Usage

Create a parameter file, e.g. params.yml, similar to this:

    markername: 'arc-ssu-r214'
    input: 'input/arc-ssu-r214.fna'
    hmm: 'https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm'
    hmmkey: '16S_rRNA'
    gtdb_metadata: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_metadata_r214.tsv.gz'
    n_per_species: '1,5,20'
    outdir: 'r214'
    max_cpus: 12
    non_gap_prop: 0.8
    phylogeny: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_r214.tree'
    model: 'GTR+F+I+G4'

And run the workflow like this:

    nextflow run biodiversitydata-se/sbdi-phylomarkercheck \
      -profile <docker/singularity/.../institute> \
      --outdir <OUTDIR> -params-file params.yml

Pipeline output

A set of directories under <OUTDIR> that you specified as argument for --outdir will be created. Two of these are particularly interesting:

* <OUTDIR>/correct: Fasta files with 16S sequences with a taxonomically consistent phylogenetic signal
  - <PREFIX>-n<N>.assignTaxonomy.fna.gz: N sequences per species formatted for DADA2's assignTaxonomy()
  - <PREFIX>-n<N>.addSpecies.fna.gz: N sequences per species formatted for DADA2's addSpecies()
  - <PREFIX>-n<N>.correct.fna.gz: N sequences per species in GTDB's original format
* <OUTDIR>/iqtree: Tree files and taxonomy
  - <PREFIX>-sprep.alnfna: Fasta file with aligned 16S sequences
  - <PREFIX>-sprep.brlenopt.treefile: Newick formatted tree file
  - <PREFIX>-sprep.taxonomy.tsv: Taxonomy file for phylogenetic placement purposes
  - <PREFIX>-sprep.brlenopt.iqtree: IQTREE info file
  - <PREFIX>-sprep.brlenopt.log: Log file
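The naming scheme above can be expanded into a checklist of expected files. This is a minimal sketch: the function name is hypothetical, and using `arc-ssu-r214` as the value of `<PREFIX>` is an assumption, since the relation between `<PREFIX>` and `markername` is not stated here.

```python
def expected_outputs(outdir, prefix, n_per_species="1,5,20"):
    """List the files described above for a given output directory and prefix."""
    ns = [int(n) for n in n_per_species.split(",")]
    correct = [
        f"{outdir}/correct/{prefix}-n{n}.{kind}.fna.gz"
        for n in ns
        for kind in ("assignTaxonomy", "addSpecies", "correct")
    ]
    iqtree = [
        f"{outdir}/iqtree/{prefix}-sprep.{suffix}"
        for suffix in (
            "alnfna",
            "brlenopt.treefile",
            "taxonomy.tsv",
            "brlenopt.iqtree",
            "brlenopt.log",
        )
    ]
    return correct + iqtree


# Assumed prefix; adjust to whatever prefix your run produces.
for path in expected_outputs("r214", "arc-ssu-r214"):
    print(path)
```

Comparing this list against the contents of the output directory is a quick way to confirm that a run completed every step.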

Credits

sbdi-phylomarkercheck was originally written by Daniel Lundin.

An earlier, manual, procedure to produce the corresponding files for releases up to r207 was designed in collaboration with Anders Andersson.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Owner

Name: Swedish Biodiversity Data Infrastructure
Login: biodiversitydata-se
Kind: organization

Website: https://biodiversitydata.se
Repositories: 15
Profile: https://github.com/biodiversitydata-se

Citation (CITATIONS.md)

# nf-core/phylomarkercheck: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [biopython](https://academic.oup.com/bioinformatics/article/25/11/1422/330687)

  > Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11): 1422–23. https://doi.org/10.1093/bioinformatics/btp163.

- [HMMER](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002195)

  > Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” PLoS Comput Biol 7 (10): e1002195. https://doi.org/10.1371/journal.pcbi.1002195.
 
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [IQTREE](https://academic.oup.com/mbe/article/32/1/268/2925592)

  > Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. https://doi.org/10.1093/molbev/msu300.

- [SeqKit](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0163962)

  > Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. “SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.” PLOS ONE 11 (10): e0163962. https://doi.org/10.1371/journal.pone.0163962.

- [seqtk](https://github.com/lh3/seqtk)

  > Not published.

- [EMBOSS](http://emboss.open-bio.org/)

  > Rice, P., I. Longden, and A. Bleasby. 2000. “EMBOSS: The European Molecular Biology Open Software Suite.” Trends in Genetics: TIG 16 (6): 276–77. https://doi.org/10.1016/s0168-9525(00)02024-2.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Release event: 1
Push event: 2
Pull request event: 2
Create event: 2

Last Year

Release event: 1
Push event: 2
Pull request event: 2
Create event: 2

Issues and Pull Requests

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 7 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 7 minutes
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Science Score: 57.0%
