https://github.com/biocore/q2-greengenes2

A QIIME 2 plugin for interaction with the Greengenes2 database

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to biorxiv.org, nature.com
✓ Committers with academic emails: 2 of 4 committers (50.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.7%) to scientific vocabulary

Last synced: 9 months ago

Repository

A QIIME 2 plugin for interaction with the Greengenes2 database

Basic Info

Host: GitHub
Owner: biocore
License: bsd-3-clause
Language: Python
Default Branch: main
Size: 155 KB

Statistics

Stars: 31
Watchers: 4
Forks: 3
Open Issues: 6
Releases: 4

Created about 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

Background

The Greengenes2 phylogeny is based on whole genome information from the Web of Life, and revised with high quality full length 16S from the Living Tree Project and full length 16S extracted from bacterial operons using uDance. A seed taxonomy is derived using the mappings from the Web of Life to GTDB. This taxonomy is then augmented using information from the Living Tree Project when possible. The augmented taxonomy is decorated onto the backbone using tax2tree.

Using this decorated backbone, all public and private 16S V4 ASVs from Qiita (pulled via redbiom and representing hundreds of thousands of samples), as well as full length mitochondrial and chloroplast 16S (sourced from SILVA), are then placed using DEPP. Fragments are resolved. The resulting tree contains > 15,000,000 tips.

Fragment resolution can result in fragments being placed on the parent edge of a named node. This can occur if the node representing a clade, such as d__Archaea, does not represent sufficient diversity for the input fragments to place. As a result, prior to reading taxonomy off of the tree, each name from the backbone is evaluated for whether its edge to parent has a single placement or a multifurcation of placements. If this occurs, the name is “promoted”, the idea being that fragments placed on a named node's edge to its parent are more like the named node than like a sibling.

Following this name promotion, the full taxonomy is then read off the tree, providing lineage information for each fragment and sequence represented in the tree. This taxonomy information can be utilized within QIIME 2 by cross referencing your input feature set against what’s present in the tree. By doing so, we can obtain taxonomy for both WGS data (if processed by Woltka) and 16S V4 ASVs. There is an important caveat though: right now, we can only classify based on sequences already represented by the tree, so unrepresented V4 ASVs will be unassigned.
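As a rough illustration of this cross referencing, the Python sketch below looks up each feature ID in a reference mapping and marks anything absent as unassigned. It is not the plugin's implementation: the function name `taxonomy_from_ids`, the `Unassigned` label, and the example IDs and lineages are all invented for illustration.

```python
def taxonomy_from_ids(feature_ids, reference_taxonomy, unassigned="Unassigned"):
    """Return a lineage for every feature ID, using the reference where possible.

    reference_taxonomy maps feature IDs (genome IDs or ASVs) to lineage strings.
    IDs that are not in the reference cannot be classified and get `unassigned`.
    """
    return {fid: reference_taxonomy.get(fid, unassigned) for fid in feature_ids}


# Hypothetical reference entries; real lineages come from the taxonomy artifact.
reference = {
    "G000000001": "d__Bacteria; p__Example_phylum",
    "TACGGAGGAT": "d__Bacteria; p__Another_phylum",
}

# A mixed feature set: one genome ID, one represented ASV, one unrepresented ASV.
features = ["G000000001", "TACGGAGGAT", "ACGTACGTAC"]

assignments = taxonomy_from_ids(features, reference)
for fid, lineage in assignments.items():
    print(fid, lineage, sep="\t")
```

The last feature is not in the reference, so it falls through to the unassigned label, which mirrors the caveat described above.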

Install

$ source activate qiime2.2022.8
$ pip install q2-greengenes2

Reference database artifacts

The reference database release contains the following artifacts. <version> refers to the version of the database, which follows a YYYY.MM format.

The feature IDs present in the artifacts use the WoL namespace for genomes. For ASVs, we provide reference files which use the ASV, MD5 hashes, and internal identifiers (asv, md5, id respectively).

The following files are provided on the FTP:

```
* <version>.backbone.full-length.fna.qza
    All the full length 16S sequences in the backbone of the tree

* <version>.backbone.tax.qza
    Taxonomy information for the backbone

* <version>.backbone.v4.fna.qza
    In silico extracted V4 sequences from the backbone based on the EMP 16S primers

* <version>.backbone.v4.nb.qza
    Naive Bayes classifier trained on the V4 sequences from the backbone

* <version>.phylogeny.asv.nwk
* <version>.phylogeny.asv.nwk.qza
* <version>.phylogeny.id.nwk
* <version>.phylogeny.id.nwk.qza
* <version>.phylogeny.md5.nwk
* <version>.phylogeny.md5.nwk.qza
    The full phylogeny. Fragments are expressed as ASVs, simple IDs, or MD5s
    as tips. Also provided as QIIME 2 QZA files.

* <version>.seqs.fna.gz
* <version>.seqs.fna.qza
    All sequences used in the construction of the tree

* <version>.taxonomy.asv.nwk
* <version>.taxonomy.asv.nwk.qza
* <version>.taxonomy.asv.tsv.gz
* <version>.taxonomy.asv.tsv.qza
* <version>.taxonomy.id.nwk
* <version>.taxonomy.id.nwk.qza
* <version>.taxonomy.id.tsv.gz
* <version>.taxonomy.id.tsv.qza
* <version>.taxonomy.md5.nwk
* <version>.taxonomy.md5.nwk.qza
* <version>.taxonomy.md5.tsv.gz
* <version>.taxonomy.md5.tsv.qza
    The full taxonomic records for the database. Fragments are expressed as
    ASVs, simple IDs, or MD5s. Also provided as QIIME 2 QZA files. The
    taxonomy is expressed both in tab delimited form and as Newick.
```

To classify

Feature tables which contain 16S V4 ASVs, WoL genome identifiers, or a combination thereof can be taxonomically characterized against Greengenes2 using the following command:

$ qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy <the_greengenes_reference> \
    --i-table <your_feature_table> \
    --o-classification <the_resulting_classifications>

The QIIME 2 Greengenes2 plugin also supports the classic method of classification through FeatureData[Sequence] artifacts:

$ qiime greengenes2 taxonomy-from-features \
    --i-reference-taxonomy <the_greengenes_reference> \
    --i-reads <your_FeatureData[Sequence]> \
    --o-classification <the_resulting_classifications>

Diversity calculations

The Greengenes2 reference tree can be readily used for feature tables based on WoL-assessed short read data, 16S V4 ASVs, or the combination of those data types. The only essential requirement is that the features represented by the table are also present in the tree.

The Greengenes2 plugin implements a rapid method for performing this filtering. The same filtering is also possible using the q2-fragment-insertion plugin; however, that plugin does not presently use the faster phylogeny parsing logic. For filtering, either the phylogeny or taxonomy tree can be used:

$ qiime greengenes2 filter-features \
    --i-feature-table <your_feature_table> \
    --i-reference <a_greengenes_tree> \
    --o-filtered-table <the_filtered_table>

From here, other phylogenetic-aware methods such as UniFrac can be performed as normal.
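Conceptually, the filtering requirement above is a set intersection between the feature IDs in the table and the tip names of the tree. The Python sketch below shows that idea on plain dictionaries. It assumes the tip names have already been read from the tree, and the function name `filter_to_tree`, the table layout, and the example IDs are hypothetical; the plugin's own parsing and filtering logic is not reproduced here.

```python
def filter_to_tree(table, tree_tips):
    """Keep only the features whose IDs appear as tips in the reference tree.

    table maps feature ID -> {sample ID: count}; tree_tips is any iterable
    of tip names. Features absent from the tree are dropped.
    """
    tips = set(tree_tips)  # set membership keeps each lookup cheap
    return {fid: counts for fid, counts in table.items() if fid in tips}


# Hypothetical table with two features in the tree and one that is not.
table = {
    "G000000001": {"sample_a": 12, "sample_b": 0},
    "TACGGAGGAT": {"sample_a": 3, "sample_b": 7},
    "ACGTACGTAC": {"sample_a": 1, "sample_b": 1},
}
tree_tips = ["G000000001", "TACGGAGGAT", "G000000002"]

filtered = filter_to_tree(table, tree_tips)
print(sorted(filtered))
```

Only the features shared by the table and the tree survive, which is the precondition phylogenetic metrics need.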

Relabeling

The Greengenes2 taxonomy and phylogeny can be expressed in three different namespaces:

Record IDs, such as Genbank accessions
md5 sequence hashes
The actual sequence

If your input FeatureTable[Frequency] was computed using default parameters from q2-dada2 or q2-deblur, then it is likely the features are expressed as md5 hashes. However, if your table is coming from a redbiom query or Qiita, then your features are most likely expressed as sequences and/or Woltka record IDs.

The q2-gg2 plugin provides a way to relabel your table. For example, the following will relabel as "md5":

$ qiime greengenes2 relabel \
    --i-feature-table <your_feature_table> \
    --i-reference-label-map gg2-<version>-label_map.qza \
    --p-as-md5 \
    --o-relabeled-table <the_relabeled_table>
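To make the three namespaces concrete, here is a small Python sketch of relabeling through a label map. It is an illustration under assumptions, not the plugin's code: the in-memory label map layout, the function names `md5_of` and `relabel`, the example sequence and record ID, and the choice to leave unmapped IDs unchanged are all hypothetical. The only library call used is `hashlib.md5` from the Python standard library.

```python
import hashlib


def md5_of(sequence):
    """Return the MD5 hex digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def relabel(table, label_map, target):
    """Rename feature IDs to the `target` namespace ("id", "md5" or "sequence").

    label_map is a list of records, each holding all three labels for one
    feature. IDs that match no record are kept as-is (a choice made for
    this sketch only).
    """
    # Index every known label to its record so any namespace can be the input.
    index = {}
    for record in label_map:
        for label in record.values():
            index[label] = record
    return {
        (index[fid][target] if fid in index else fid): counts
        for fid, counts in table.items()
    }


# Hypothetical ASV and label map record.
seq = "TACGGAGGATCC"
label_map = [{"id": "00000001", "md5": md5_of(seq), "sequence": seq}]

# A table keyed by sequence, plus one genome-style ID that is not in the map.
table = {seq: {"sample_a": 10}, "G000000001": {"sample_a": 3}}

as_md5 = relabel(table, label_map, "md5")
print(list(as_md5))
```

Because the index accepts any of the three labels as input, the same function can convert the MD5-keyed table back to record IDs or sequences by changing `target`.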

Citing

If you use Greengenes2, please cite McDonald et al., bioRxiv 2022.

Owner

Name: biocore
Login: biocore
Kind: organization
Location: Cyberspace

Website: http://biocore.github.io/
Repositories: 76
Profile: https://github.com/biocore

Collaboratively developed bioinformatics software.

GitHub Events

Total

Issues event: 3
Watch event: 6
Issue comment event: 6

Last Year

Issues event: 3
Watch event: 6
Issue comment event: 6

Committers

Last synced: over 3 years ago

All Time

Total Commits: 58
Total Committers: 4
Avg Commits per committer: 14.5
Development Distribution Score (DDS): 0.397

Top Committers

Name	Email	Commits
Daniel McDonald	d**d@u**u	35
giorgianicolaou	5**u@u**m	16
Daniel McDonald	m**t@c**u	4
giorgianicolaou	g**u@g**m	3

Committer Domains (Top 20 + Academic)

colorado.edu: 1
ucsd.edu: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 19
Total pull requests: 10
Average time to close issues: 3 months
Average time to close pull requests: about 16 hours
Total issue authors: 15
Total pull request authors: 2
Average comments per issue: 3.95
Average comments per pull request: 0.3
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 0
Average time to close issues: 9 days
Average time to close pull requests: N/A
Issue authors: 5
Pull request authors: 0
Average comments per issue: 3.8
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

wasade (3)
ilnamkang (2)
VoronDM (2)
lrt666 (1)
Micro-Biology (1)
mestaki (1)
ElDeveloper (1)
wz115 (1)
christineolson-ucsf (1)
hites77 (1)
aimirza (1)
elayton13 (1)
haolilan (1)
flopflip (1)
colinbrislawn (1)

Pull Request Authors

wasade (10)
giorgianicolaou (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 49 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 4
Total maintainers: 1

pypi.org: q2-greengenes2

Support methods for interaction with Greengenes2

Homepage: https://github.com/biocore/q2-greengenes2
Documentation: https://q2-greengenes2.readthedocs.io/
License: BSD-3-Clause
Latest release: 2024.1
published over 2 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 49 Last month

Rankings

Dependent packages count: 6.6%

Stargazers count: 19.5%

Average: 20.3%

Downloads: 21.7%

Forks count: 23.2%

Dependent repos count: 30.6%

Maintainers (1)

mcdonald

Last synced: 10 months ago

Dependencies

.github/workflows/python-package-conda.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
conda-incubator/setup-miniconda v2 composite

setup.py pypi

biom-format *
iow *
redbiom *
scikit-bio *

.github/workflows/release.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
pypa/gh-action-pypi-publish release/v1 composite
