Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: biodiversitydata-se
  • License: mit
  • Language: Nextflow
  • Default Branch: master
  • Size: 4.24 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 3
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation

README.md

sbdi-phylomarkercheck

Introduction

biodiversitydata-se/sbdi-phylomarkercheck is a bioinformatics pipeline that uses Sativa to check GTDB 16S sequences for phylogenetic signal.

Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB are checked so that their phylogenetic signal is consistent with their taxonomy.

Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns are removed, and the reverse complement of each is calculated. Subsequently, sequences are aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps are removed. The remaining sequences are analyzed with Sativa, and sequences that are not phylogenetically consistent with their taxonomy are removed.
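The filtering rules described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (2000 nucleotides, no Ns, at most 10% gaps), not the pipeline's actual code, which runs as Nextflow processes around tools such as SeqKit and HMMER. The function names, and the assumption that gaps in the alignment are written as `-` or `.`, are hypothetical.

```python
MAX_LENGTH = 2000        # length cutoff stated in the text
MAX_GAP_FRACTION = 0.10  # gap cutoff stated in the text

# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def passes_prefilter(seq: str) -> bool:
    """Keep sequences of at most MAX_LENGTH nucleotides that contain no N."""
    return len(seq) <= MAX_LENGTH and "N" not in seq.upper()


def gap_fraction(aligned: str) -> float:
    """Fraction of alignment columns that are gaps (assumed to be '-' or '.')."""
    if not aligned:
        return 1.0
    gaps = sum(1 for char in aligned if char in "-.")
    return gaps / len(aligned)


def passes_gap_filter(aligned: str) -> bool:
    """Keep aligned sequences with no more than MAX_GAP_FRACTION gaps."""
    return gap_fraction(aligned) <= MAX_GAP_FRACTION
```

A sequence would go through `passes_prefilter` before alignment and through `passes_gap_filter` afterwards; only sequences passing both would be handed to Sativa.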

Files for the DADA2 (Callahan et al. 2016) methods assignTaxonomy and addSpecies are available, in three different versions each. The assignTaxonomy files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from assignTaxonomy with caution.) The versions differ in the maximum number of genomes included per species: 1, 5 or 20, indicated by "n1", "n5" and "n20" in the file names respectively. Using the version with 20 genomes per species should increase the chance that the addSpecies algorithm finds an exactly matching sequence, whereas using a file with many genomes per species could bias the taxonomic annotations that assignTaxonomy makes at higher levels. Our recommendation is hence to use the "n1" files for assignTaxonomy and the "n20" files for addSpecies.
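To make the "n1", "n5" and "n20" versions concrete, the following sketch caps the number of sequences kept per species. It simply keeps the first N records of each species in input order; how the pipeline actually chooses which genomes to keep is not described here, so the selection order is an assumption, as are the function name and the `(seq_id, species)` record layout.

```python
from collections import defaultdict


def cap_per_species(records, n):
    """Keep at most n records per species, preserving input order.

    records: iterable of (seq_id, species) tuples (hypothetical layout).
    """
    counts = defaultdict(int)
    kept = []
    for seq_id, species in records:
        if counts[species] < n:
            counts[species] += 1
            kept.append((seq_id, species))
    return kept


records = [
    ("seq1", "Escherichia coli"),
    ("seq2", "Escherichia coli"),
    ("seq3", "Bacillus subtilis"),
    ("seq4", "Escherichia coli"),
]

n1 = cap_per_species(records, 1)  # one sequence per species
n5 = cap_per_species(records, 5)  # all four records fit under the cap
```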

All files are gzipped fasta files with 16S sequences. In the assignTaxonomy files, each sequence is associated with its taxonomy hierarchy from domain to species, whereas the addSpecies files contain sequence identities and species names. There are also fasta files with the original GTDB sequence names, with "correct" in their names.
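As an illustration of how the two header styles differ, the sketch below builds both from a GTDB-style taxonomy string. It follows the general layout that DADA2 expects for training files (semicolon-separated ranks for assignTaxonomy, and an identifier followed by the species name for addSpecies). Whether the distributed files strip the GTDB rank prefixes such as `g__` is an assumption, and the function names and the identifier `seq1` are hypothetical.

```python
def strip_rank_prefix(name: str) -> str:
    """Turn e.g. 'g__Escherichia' into 'Escherichia' (assumed convention)."""
    return name.split("__", 1)[1] if "__" in name else name


def assign_taxonomy_header(gtdb_taxonomy: str) -> str:
    """Build a header with the ranks from domain to species, semicolon-separated."""
    ranks = [strip_rank_prefix(rank) for rank in gtdb_taxonomy.split(";")]
    return ">" + ";".join(ranks) + ";"


def add_species_header(seq_id: str, gtdb_taxonomy: str) -> str:
    """Build a header with the sequence identity followed by the species name."""
    species = strip_rank_prefix(gtdb_taxonomy.split(";")[-1])
    return f">{seq_id} {species}"


taxonomy = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
    "s__Escherichia coli"
)

print(assign_taxonomy_header(taxonomy))
print(add_species_header("seq1", taxonomy))
```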

Taxonomic annotation of 16S amplicons using this data is available in the nf-core/ampliseq Nextflow workflow from version 2.1, via the optional argument --dada_ref_taxonomy sbdi-gtdb (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the workflow estimates phylogenetic trees from the original GTDB trees. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQTREE [Nguyen et al. 2015] with a GTR+F+I+G4 model.

Usage

Create a parameter file, e.g. params.yml, similar to this:

```yml
markername: 'arc-ssu-r214'
input: 'input/arc-ssu-r214.fna'
hmm: 'https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm'
hmmkey: '16S_rRNA'
gtdb_metadata: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_metadata_r214.tsv.gz'
n_per_species: '1,5,20'
outdir: 'r214'
max_cpus: 12
non_gap_prop: 0.8
phylogeny: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_r214.tree'
model: 'GTR+F+I+G4'
```

And run the workflow like this:

```bash
nextflow run biodiversitydata-se/sbdi-phylomarkercheck \
    -profile <docker/singularity/.../institute> \
    --outdir <OUTDIR> -params-file params.yml
```

Pipeline output

A set of directories will be created under the <OUTDIR> that you specified with --outdir. Two of these are particularly interesting:

* <OUTDIR>/correct: Fasta files with 16S sequences with a taxonomically consistent phylogenetic signal
  - <PREFIX>-n<N>.assignTaxonomy.fna.gz: N sequences per species formatted for DADA2's assignTaxonomy()
  - <PREFIX>-n<N>.addSpecies.fna.gz: N sequences per species formatted for DADA2's addSpecies()
  - <PREFIX>-n<N>.correct.fna.gz: N sequences per species in GTDB's original format
* <OUTDIR>/iqtree: Tree files and taxonomy
  - <PREFIX>-sprep.alnfna: Fasta file with aligned 16S sequences
  - <PREFIX>-sprep.brlenopt.treefile: Newick formatted tree file
  - <PREFIX>-sprep.taxonomy.tsv: Taxonomy file for phylogenetic placement purposes
  - <PREFIX>-sprep.brlenopt.iqtree: IQTREE info file
  - <PREFIX>-sprep.brlenopt.log: Log file
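The naming pattern of the fasta outputs can be spelled out with a small helper that lists the files to expect in <OUTDIR>/correct for a given parameter set. This is a sketch under the assumption that <PREFIX> equals the markername parameter, which the text does not state; the function name is hypothetical.

```python
def expected_fasta_outputs(outdir: str, prefix: str, n_per_species: str) -> list[str]:
    """List expected fasta paths for each N in a comma-separated n_per_species value."""
    ns = [int(n) for n in n_per_species.split(",")]
    kinds = ("assignTaxonomy", "addSpecies", "correct")
    return [
        f"{outdir}/correct/{prefix}-n{n}.{kind}.fna.gz"
        for n in ns
        for kind in kinds
    ]


# Values taken from the example parameter file; PREFIX == markername is assumed.
paths = expected_fasta_outputs("r214", "arc-ssu-r214", "1,5,20")
for path in paths:
    print(path)
```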

Credits

sbdi-phylomarkercheck was originally written by Daniel Lundin.

An earlier, manual procedure to produce the corresponding files for releases up to r207 was designed in collaboration with Anders Andersson.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Owner

  • Name: Swedish Biodiversity Data Infrastructure
  • Login: biodiversitydata-se
  • Kind: organization

Citation (CITATIONS.md)

# nf-core/phylomarkercheck: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [biopython](https://academic.oup.com/bioinformatics/article/25/11/1422/330687)

  > Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11): 1422–23. https://doi.org/10.1093/bioinformatics/btp163.

- [HMMER](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002195)

  > Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” PLoS Comput Biol 7 (10): e1002195. https://doi.org/10.1371/journal.pcbi.1002195.
 
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [IQTREE](https://academic.oup.com/mbe/article/32/1/268/2925592)

  > Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. https://doi.org/10.1093/molbev/msu300.

- [SeqKit](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0163962)

  > Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. “SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.” PLOS ONE 11 (10): e0163962. https://doi.org/10.1371/journal.pone.0163962.

- [seqtk](https://github.com/lh3/seqtk)

  > Not published.

- [EMBOSS](http://emboss.open-bio.org/)

  > Rice, P., I. Longden, and A. Bleasby. 2000. “EMBOSS: The European Molecular Biology Open Software Suite.” Trends in Genetics: TIG 16 (6): 276–77. https://doi.org/10.1016/s0168-9525(00)02024-2.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Release event: 1
  • Push event: 2
  • Pull request event: 2
  • Create event: 2
Last Year
  • Release event: 1
  • Push event: 2
  • Pull request event: 2
  • Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 minutes
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • erikrikarddaniel (1)
Top Labels
Issue Labels
Pull Request Labels