https://github.com/computational-biology-oceanomics/ednabert_s_ecosystem_mapping
Taxonomy-independent ecosystem mapping via per-site embeddings from ASV embeddings
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.3%, to scientific vocabulary)
Repository
Taxonomy-independent ecosystem mapping via per-site embeddings from ASV embeddings
Basic Info
- Host: GitHub
- Owner: Computational-Biology-OceanOmics
- Language: Python
- Default Branch: main
- Size: 2.55 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Calculate per-site embeddings using 12S/16S finetuned DNABERT-S
Using eDNA data while ignoring taxonomy to learn more about ecosystems!
This script takes a FAIRe-formatted Excel sheet with ASV results from 12S (MiFish-U, Miya et al.) or 16S (Berry et al.) sequencing, downloads the corresponding finetuned OceanOmics DNABERT-S models, and averages the embeddings of all ASVs at each site, weighting them by per-site read counts. The result is one embedding per site!
It then stores the per-site embeddings in a parquet file and optionally runs t-SNE or UMAP on those embeddings to get low-dimensional per-site representations.
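The read-count-weighted averaging described above can be sketched as follows. This is a minimal illustration of the pooling idea, not the pipeline's exact code; the function name and array layout are assumptions.

```python
import numpy as np

def pool_site_embedding(asv_embeddings, read_counts):
    """Average ASV embeddings into one site embedding,
    weighting each ASV by its per-site read count.

    asv_embeddings: (n_asvs, dim) array of DNABERT-S embeddings
    read_counts:    (n_asvs,) read counts at this site
    """
    weights = read_counts / read_counts.sum()            # relative abundance
    return (asv_embeddings * weights[:, None]).sum(axis=0)

# toy example: 3 ASVs with 4-dimensional embeddings
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
counts = np.array([8.0, 1.0, 1.0])
print(pool_site_embedding(emb, counts))  # → [0.8 0.1 0.1 0. ]
```

The dominant ASV pulls the site vector toward its own embedding, so sites with similar community composition end up close together in embedding space.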
This is what the t-SNE clustering of those per-site embeddings looks like along a latitude gradient:
Each dot is one site, colored by latitude and shaped by sample collection device.
Usage
python calculatePerSite.py --cache-dir './cache' --12s-files run1.xlsx --16s-files run2.xlsx
--run-tsne --outdir './results' --run-umap
usage: calculatePerSite.py [-h] [--12s-files [12S_FILES ...]] [--16s-files [16S_FILES ...]] --outdir OUTDIR [--cache-dir CACHE_DIR]
[--min-asv-length MIN_ASV_LENGTH] [--max-asv-length MAX_ASV_LENGTH] [--force] [--model-12s MODEL_12S] [--model-16s MODEL_16S]
[--base-config BASE_CONFIG] [--pooling-token {mean,cls}] [--batch-size BATCH_SIZE] [--use-amp] [--max-length MAX_LENGTH]
[--weight-mode {hellinger,log,relative,softmax_tau3}] [--site-pooling {l2_weighted_mean,weighted_mean,gem_p2,gem_p3,simple_mean}]
[--run-tsne] [--run-umap] [--perplexity PERPLEXITY] [--n-neighbors N_NEIGHBORS] [--metric {cosine,euclidean}] [--seed SEED]
[--fuse {none,concat}]
eDNA DNABERT-S embedding pipeline (Excel -> ASVs -> sites -> t-SNE/UMAP)
optional arguments:
-h, --help show this help message and exit
--12s-files [12S_FILES ...]
Path(s) to Excel file(s) containing 12S data (default: [])
--16s-files [16S_FILES ...]
Path(s) to Excel file(s) containing 16S data (default: [])
--outdir OUTDIR Output directory (default: None)
--cache-dir CACHE_DIR
HuggingFace cache dir (optional) (default: None)
--min-asv-length MIN_ASV_LENGTH
Minimum ASV sequence length (optional) (default: None)
--max-asv-length MAX_ASV_LENGTH
Maximum ASV sequence length (optional) (default: None)
--force Force recalculation of all steps, ignoring existing intermediate files (default: False)
--model-12s MODEL_12S
--model-16s MODEL_16S
--base-config BASE_CONFIG
--pooling-token {mean,cls}
--batch-size BATCH_SIZE
--use-amp Enable mixed precision on CUDA (default: False)
--max-length MAX_LENGTH
Longest tokenized length for the tokenizer (default: 512)
--weight-mode {hellinger,log,relative,softmax_tau3}
--site-pooling {l2_weighted_mean,weighted_mean,gem_p2,gem_p3,simple_mean}
Method for pooling ASV embeddings to site embeddings. 'simple_mean' performs no normalisation. (default: l2_weighted_mean)
--run-tsne
--run-umap
--perplexity PERPLEXITY
Perplexity setting for tSNE (default: 5)
--n-neighbors N_NEIGHBORS
Number of neighbours for UMAP (default: 15)
--metric {cosine,euclidean}
--seed SEED
--fuse {none,concat} How to fuse 12S+16S site vectors (concat or none) (default: concat)
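The `--weight-mode` options control how read counts are turned into ASV weights before pooling. A plausible sketch of what these transforms might look like is below; these are standard formulas for the named transforms, but the exact implementation in calculatePerSite.py may differ (in particular the `softmax_tau3` temperature handling is an assumption).

```python
import numpy as np

def asv_weights(counts, mode="hellinger"):
    """Candidate implementations of the --weight-mode transforms."""
    counts = np.asarray(counts, dtype=float)
    rel = counts / counts.sum()
    if mode == "relative":
        w = rel                          # raw relative abundance
    elif mode == "hellinger":
        w = np.sqrt(rel)                 # Hellinger transform damps dominant ASVs
    elif mode == "log":
        w = np.log1p(counts)             # log damping of large counts
    elif mode == "softmax_tau3":
        z = np.log1p(counts) / 3.0       # temperature tau = 3 flattens the softmax
        w = np.exp(z - z.max())
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w / w.sum()                   # normalise so weights sum to 1

print(asv_weights([100, 10, 1], mode="relative"))
print(asv_weights([100, 10, 1], mode="hellinger"))
```

The Hellinger and log modes shrink the influence of a handful of hyper-abundant ASVs, which otherwise dominate the site embedding under plain relative-abundance weighting.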
Results
The results folder will contain at least two parquet files: the per-ASV embeddings and the per-site embeddings.
If you've turned on --run-tsne and/or --run-umap, there will also be CSV files with TSNE1/TSNE2 and UMAP1/UMAP2 coordinates for all sites.
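A downstream analysis can also run the projection itself from the per-site parquet file. A minimal sketch, assuming one row per site with embedding dimensions as columns (the file name and column layout are assumptions, not the pipeline's documented schema):

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# stand-in for: sites = pd.read_parquet("results/per_site_embeddings.parquet")
sites = pd.DataFrame(
    np.random.default_rng(0).normal(size=(10, 8)),
    index=[f"site_{i}" for i in range(10)],
)

# mirror the pipeline defaults: perplexity 5, cosine metric, fixed seed
coords = TSNE(n_components=2, perplexity=5, metric="cosine",
              init="random", random_state=42).fit_transform(sites.values)
out = pd.DataFrame(coords, columns=["TSNE1", "TSNE2"], index=sites.index)
print(out.head())
```

Note that t-SNE requires perplexity to be smaller than the number of sites, so very small datasets may need `--perplexity` lowered further.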
Runtime
It usually runs in only a few minutes, but so far I've only tested it on systems without GPUs.
Installation
There's a conda environment with the dependencies DNABERT-S needs in DNABERT_S.yml:
conda env create -f DNABERT_S.yml
AI statement
Most of this code was written by GPT-5 after a long discussion! Every mode has heaps of options; human me would never have done that.
Regression
I am working on predicting latitude/longitude from the site embeddings alone; that happens in regress.py. I am unclear how best to handle replicates, as there's a lot of variability between them. Without accounting for replicates I get a test R² of about 0.7; with grouped k-fold cross-validation stratified by replicate it goes down to about 0.5, which is the more honest estimate.
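The grouped cross-validation mentioned above can be sketched with scikit-learn's GroupKFold, which keeps all replicates of a site in the same fold so near-duplicate replicates cannot leak between train and test. This is an illustration with synthetic data, not the code in regress.py; the model and data shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))                        # stand-in for site embeddings
groups = np.repeat(np.arange(20), 3)                 # 20 sites x 3 replicates
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)   # stand-in for latitude

# all replicates of a site land in the same fold, so the test R^2
# is not inflated by replicate leakage across the train/test split
scores = cross_val_score(Ridge(), X, y, groups=groups,
                         cv=GroupKFold(n_splits=5), scoring="r2")
print(scores.mean())
```

With real replicates the grouped score drops relative to a naive KFold, which matches the 0.7 vs 0.5 gap described above: the naive split lets the model memorise site-level signal shared between replicates.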
Example data
This repository comes with a FAIRe-formatted Excel sheet from an OceanOmics transect from Perth to the Cocos (Keeling) Islands, 12S only (MiFish-U, Miya et al.).
Owner
- Name: Computational-Biology-OceanOmics
- Login: Computational-Biology-OceanOmics
- Kind: organization
- Website: https://www.uwa.edu.au/oceans-institute/Research/Minderoo-OceanOmics-Centre-at-UWA
- Repositories: 1
- Profile: https://github.com/Computational-Biology-OceanOmics
This is the research and development space for the OceanOmics team
GitHub Events
Total
- Push event: 13
- Create event: 2
Last Year
- Push event: 13
- Create event: 2