sbdi-phylomarkercheck
https://github.com/biodiversitydata-se/sbdi-phylomarkercheck
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: biodiversitydata-se
- License: mit
- Language: Nextflow
- Default Branch: master
- Size: 4.24 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
sbdi-phylomarkercheck
Introduction
biodiversitydata-se/sbdi-phylomarkercheck is a bioinformatics pipeline that checks GTDB 16S sequences for phylogenetic signal with Sativa.
Using Sativa [Kozlov et al. 2016], 16S sequences from GTDB are checked to confirm that their phylogenetic signal is consistent with their taxonomy.
Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns are removed, and the reverse complement of each is calculated. Subsequently, sequences are aligned with HMMER [Eddy 2011] using the Barrnap [https://github.com/tseemann/barrnap] archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps are removed. The remaining sequences are analyzed with Sativa, and sequences that are not phylogenetically consistent with their taxonomy are removed.
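The pre-filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own code: the function names `reverse_complement` and `prefilter` are hypothetical, and the input is assumed to be already parsed into `(name, sequence)` pairs of unambiguous DNA.

```python
from typing import Iterable, Iterator, Tuple

# Complement table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

MAX_LENGTH = 2000  # length cutoff stated in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def prefilter(
    records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, forward, reverse_complement) for sequences that pass.

    A sequence is dropped if it is longer than MAX_LENGTH nucleotides
    or contains at least one N (upper or lower case).
    """
    for name, seq in records:
        if len(seq) > MAX_LENGTH or "N" in seq.upper():
            continue
        yield name, seq, reverse_complement(seq)
```

Keeping both orientations lets a downstream alignment step pick whichever strand matches the 16S profile; how the pipeline itself chooses between them is not stated here.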
Files for the DADA2 (Callahan et al. 2016) methods assignTaxonomy and addSpecies are available, in three different versions each.
The assignTaxonomy files contain taxonomy for domain, phylum, class, order, family, genus and species.
(Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from assignTaxonomy with caution.)
The versions differ in the maximum number of genomes that we included per species: 1, 5 or 20, indicated by "n1", "n5" and "n20" in the file names respectively.
Using the version with 20 genomes per species should increase the chance that the addSpecies algorithm finds an exactly matching sequence, whereas a file with many genomes per species could bias the taxonomic annotations that assignTaxonomy makes at higher levels.
Our recommendation is hence to use the "n1" files for assignTaxonomy and "n20" for addSpecies.
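To illustrate what the "n1", "n5" and "n20" versions mean, here is a hedged Python sketch of capping the number of genomes per species. It assumes the input is a list of `(genome_id, species)` pairs already sorted in order of preference; how the pipeline actually ranks genomes within a species is not described here, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def parse_n_per_species(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,5,20' into [1, 5, 20]."""
    return [int(part) for part in value.split(",") if part.strip()]


def cap_genomes_per_species(
    records: Iterable[Tuple[str, str]], max_per_species: int
) -> List[Tuple[str, str]]:
    """Keep at most max_per_species genomes for each species.

    Assumption: records arrive in order of preference, so the first
    genomes seen for a species are the ones that are kept.
    """
    counts = defaultdict(int)
    kept = []
    for genome_id, species in records:
        if counts[species] < max_per_species:
            counts[species] += 1
            kept.append((genome_id, species))
    return kept
```

Calling `cap_genomes_per_species(records, n)` once for each value returned by `parse_n_per_species("1,5,20")` would give three subsets corresponding to the three file versions.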
All files are gzipped fasta files with 16S sequences; the assignTaxonomy files associate each sequence with its taxonomy hierarchy from domain to species, whereas the addSpecies files contain sequence identifiers and species names.
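The two header styles can be sketched as follows. DADA2 expects assignTaxonomy reference headers of the form `>Level1;Level2;...;` and addSpecies headers of the form `>ID Genus species`. Whether this pipeline strips the GTDB rank tags (`d__`, `p__`, ...) exactly like this is an assumption, and the function names are hypothetical.

```python
from typing import List


def strip_rank_prefixes(gtdb_taxonomy: str) -> List[str]:
    """Split a GTDB taxonomy string and drop rank tags such as 'd__'."""
    names = []
    for field in gtdb_taxonomy.split(";"):
        field = field.strip()
        # A GTDB rank tag is one letter followed by two underscores.
        names.append(field[3:] if field[1:3] == "__" else field)
    return names


def assign_taxonomy_header(gtdb_taxonomy: str) -> str:
    """Build a header in the semicolon-terminated assignTaxonomy style."""
    return ">" + ";".join(strip_rank_prefixes(gtdb_taxonomy)) + ";"


def add_species_header(seq_id: str, gtdb_taxonomy: str) -> str:
    """Build a header in the '>ID Genus species' addSpecies style."""
    species = strip_rank_prefixes(gtdb_taxonomy)[-1]
    return f">{seq_id} {species}"
```

In GTDB the species field already contains the genus name (for example `s__Escherichia coli`), so the last field alone is enough for the addSpecies header.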
There are also fasta files with the original GTDB sequence names, with "correct" in their names.
Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: --dada_ref_taxonomy sbdi-gtdb (https://nf-co.re/ampliseq; Straub et al. 2020).
In addition to the fasta files, the workflow estimates phylogenetic trees from the original GTDB trees. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQTREE [Nguyen et al. 2015] with a GTR+F+I+G4 model.
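As a rough illustration of the branch-length optimization, the Python sketch below assembles an IQ-TREE command line that fixes the tree topology (`-te`) and fits the model to an alignment (`-s`). The executable name `iqtree`, the file names in the usage example and the function name are assumptions; the exact command the workflow runs may differ.

```python
from typing import List


def iqtree_brlen_command(
    alignment: str,
    fixed_tree: str,
    prefix: str,
    model: str = "GTR+F+I+G4",
    threads: int = 1,
) -> List[str]:
    """Build an IQ-TREE argument list that keeps the topology fixed.

    -s    alignment file
    -te   user tree whose topology is kept (no tree search)
    -m    substitution model
    -pre  prefix for output files (.treefile, .iqtree, .log)
    -nt   number of threads
    """
    return [
        "iqtree",  # assumed executable name; may be "iqtree2" on some systems
        "-s", alignment,
        "-te", fixed_tree,
        "-m", model,
        "-pre", prefix,
        "-nt", str(threads),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. With a hypothetical prefix such as `arc-ssu-r214-sprep.brlenopt`, IQ-TREE would write files ending in `.treefile`, `.iqtree` and `.log`.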
Usage
Create a parameter file, e.g. params.yml, similar to this:
```yml
markername: 'arc-ssu-r214'
input: 'input/arc-ssu-r214.fna'
hmm: 'https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm'
hmmkey: '16S_rRNA'
gtdb_metadata: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_metadata_r214.tsv.gz'
n_per_species: '1,5,20'
outdir: 'r214'
max_cpus: 12
non_gap_prop: 0.8
phylogeny: 'https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_r214.tree'
model: 'GTR+F+I+G4'
```
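The `non_gap_prop` parameter presumably sets the minimum proportion of non-gap characters an aligned sequence must have to be kept; that reading is an assumption, not something the README states. Under that assumption, the gap filter could look like this Python sketch (function names are hypothetical, and both `-` and `.` are treated as gap characters):

```python
from typing import Iterable, List, Tuple

GAP_CHARS = frozenset("-.")  # assumed gap symbols in the alignment


def non_gap_proportion(aligned: str) -> float:
    """Return the fraction of alignment columns that are not gaps."""
    if not aligned:
        return 0.0
    non_gaps = sum(1 for ch in aligned if ch not in GAP_CHARS)
    return non_gaps / len(aligned)


def filter_by_non_gap(
    records: Iterable[Tuple[str, str]], min_non_gap_prop: float
) -> List[Tuple[str, str]]:
    """Keep aligned sequences with at least min_non_gap_prop non-gap columns."""
    return [
        (name, seq)
        for name, seq in records
        if non_gap_proportion(seq) >= min_non_gap_prop
    ]
```

With this interpretation, a value of 0.8 would tolerate up to 20% gaps, while a value of 0.9 would correspond to removing sequences with more than 10% gaps.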
And run the workflow like this:
```bash
nextflow run biodiversitydata-se/sbdi-phylomarkercheck \
    -profile <docker/singularity/.../institute> \
    --outdir <OUTDIR> \
    -params-file params.yml
```
Pipeline output
The workflow creates a set of directories under the <OUTDIR> that you specified with --outdir.
Two of these are particularly interesting:
* <OUTDIR>/correct: Fasta files with 16S sequences with a taxonomically consistent phylogenetic signal
  - <PREFIX>-n<N>.assignTaxonomy.fna.gz: N sequences per species formatted for DADA2's assignTaxonomy()
  - <PREFIX>-n<N>.addSpecies.fna.gz: N sequences per species formatted for DADA2's addSpecies()
  - <PREFIX>-n<N>.correct.fna.gz: N sequences per species in GTDB's original format
* <OUTDIR>/iqtree: Tree files and taxonomy
  - <PREFIX>-sprep.alnfna: Fasta file with aligned 16S sequences
  - <PREFIX>-sprep.brlenopt.treefile: Newick formatted tree file
  - <PREFIX>-sprep.taxonomy.tsv: Taxonomy file for phylogenetic placement purposes
  - <PREFIX>-sprep.brlenopt.iqtree: IQTREE info file
  - <PREFIX>-sprep.brlenopt.log: Log file
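The naming scheme above can be expanded into concrete paths with a short Python helper, which may be handy for checking that a run produced everything. It is a sketch only: the function name is hypothetical, and the usage example assumes that `<PREFIX>` equals the `markername` parameter, which the README does not state.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def expected_outputs(outdir: str, prefix: str, n_values: Iterable[int]) -> List[str]:
    """List the output files named in the README for one run."""
    correct = PurePosixPath(outdir) / "correct"
    iqtree = PurePosixPath(outdir) / "iqtree"
    paths = []
    # One assignTaxonomy, addSpecies and original-format file per N.
    for n in n_values:
        for kind in ("assignTaxonomy", "addSpecies", "correct"):
            paths.append(str(correct / f"{prefix}-n{n}.{kind}.fna.gz"))
    # Alignment, tree, taxonomy and IQTREE report files.
    for suffix in (
        "alnfna",
        "brlenopt.treefile",
        "taxonomy.tsv",
        "brlenopt.iqtree",
        "brlenopt.log",
    ):
        paths.append(str(iqtree / f"{prefix}-sprep.{suffix}"))
    return paths
```

For example, `expected_outputs("r214", "arc-ssu-r214", [1, 5, 20])` lists nine fasta files under `r214/correct` and five files under `r214/iqtree`.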
Credits
sbdi-phylomarkercheck was originally written by Daniel Lundin.
An earlier, manual, procedure to produce the corresponding files for releases up to r207 was designed in collaboration with Anders Andersson.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
Owner
- Name: Swedish Biodiversity Data Infrastructure
- Login: biodiversitydata-se
- Kind: organization
- Website: https://biodiversitydata.se
- Repositories: 15
- Profile: https://github.com/biodiversitydata-se
Citation (CITATIONS.md)
# nf-core/phylomarkercheck: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [biopython](https://academic.oup.com/bioinformatics/article/25/11/1422/330687)

  > Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics.” Bioinformatics 25 (11): 1422–23. https://doi.org/10.1093/bioinformatics/btp163.

- [HMMER](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002195)

  > Eddy, Sean R. 2011. “Accelerated Profile HMM Searches.” PLoS Comput Biol 7 (10): e1002195. https://doi.org/10.1371/journal.pcbi.1002195.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [IQTREE](https://academic.oup.com/mbe/article/32/1/268/2925592)

  > Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. https://doi.org/10.1093/molbev/msu300.

- [SeqKit](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0163962)

  > Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. “SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.” PLOS ONE 11 (10): e0163962. https://doi.org/10.1371/journal.pone.0163962.

- [seqtk](https://github.com/lh3/seqtk)

  > Not published.

- [EMBOSS](http://emboss.open-bio.org/)

  > Rice, P., I. Longden, and A. Bleasby. 2000. “EMBOSS: The European Molecular Biology Open Software Suite.” Trends in Genetics: TIG 16 (6): 276–77. https://doi.org/10.1016/s0168-9525(00)02024-2.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Release event: 1
- Push event: 2
- Pull request event: 2
- Create event: 2
Last Year
- Release event: 1
- Push event: 2
- Pull request event: 2
- Create event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 7 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 7 minutes
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- erikrikarddaniel (1)