https://github.com/computational-biology-oceanomics/ednabert_s_ecosystem_mapping

Taxonomy-independent ecosystem mapping via per-site embeddings from ASV embeddings

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Taxonomy-independent ecosystem mapping via per-site embeddings from ASV embeddings

Basic Info
  • Host: GitHub
  • Owner: Computational-Biology-OceanOmics
  • Language: Python
  • Default Branch: main
  • Size: 2.55 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 7 months ago · Last pushed 7 months ago
Metadata Files
Readme

README.md

Calculate per-site embeddings using 12S/16S fine-tuned DNABERT-S

Using eDNA data while ignoring taxonomy to learn more about ecosystems!

This script takes a FAIRe-formatted Excel sheet with ASV results from 12S (MiFish-U; Miya et al.) or 16S (Berry et al.) sequencing, downloads the corresponding fine-tuned OceanOmics DNABERT-S models, and averages the embeddings for each site, weighting each ASV's embedding by its per-site read count. The result: per-site embeddings!
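The read-count-weighted average described above can be sketched in a few lines. This is a minimal illustration of the idea, not the pipeline's actual implementation (which offers several weighting and pooling modes via `--weight-mode` and `--site-pooling`); the function name is hypothetical.

```python
import numpy as np

def site_embedding(asv_embeddings, read_counts):
    """Weighted mean of ASV embeddings for one site.

    asv_embeddings: (n_asvs, dim) array of per-ASV DNABERT-S embeddings.
    read_counts:    (n_asvs,) array of that site's read counts per ASV.
    """
    weights = read_counts / read_counts.sum()            # relative abundance
    return (weights[:, None] * asv_embeddings).sum(axis=0)

# Two ASVs in 2D; the first has 3x the reads, so it dominates the site vector.
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
counts = np.array([3.0, 1.0])
print(site_embedding(emb, counts))  # -> [0.75 0.25]
```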

It then stores the per-site embeddings in a parquet file, and optionally runs t-SNE or UMAP on those embeddings to get low-dimensional per-site representations.

This is what the t-SNE clustering of those per-site embeddings looks like along a latitude gradient:


Each dot is one site, colored by latitude and shaped by sample collection device.

Usage

python calculatePerSite.py --cache-dir './cache' --12s-files run1.xlsx --16s-files run2.xlsx \
    --outdir './results' --run-tsne --run-umap
usage: calculatePerSite.py [-h] [--12s-files [12S_FILES ...]] [--16s-files [16S_FILES ...]] --outdir OUTDIR [--cache-dir CACHE_DIR]
                           [--min-asv-length MIN_ASV_LENGTH] [--max-asv-length MAX_ASV_LENGTH] [--force] [--model-12s MODEL_12S] [--model-16s MODEL_16S]
                           [--base-config BASE_CONFIG] [--pooling-token {mean,cls}] [--batch-size BATCH_SIZE] [--use-amp] [--max-length MAX_LENGTH]
                           [--weight-mode {hellinger,log,relative,softmax_tau3}] [--site-pooling {l2_weighted_mean,weighted_mean,gem_p2,gem_p3,simple_mean}]
                           [--run-tsne] [--run-umap] [--perplexity PERPLEXITY] [--n-neighbors N_NEIGHBORS] [--metric {cosine,euclidean}] [--seed SEED]
                           [--fuse {none,concat}]

eDNA DNABERT-S embedding pipeline (Excel -> ASVs -> sites -> t-SNE/UMAP)

optional arguments:
  -h, --help            show this help message and exit
  --12s-files [12S_FILES ...]
                        Path(s) to Excel file(s) containing 12S data (default: [])
  --16s-files [16S_FILES ...]
                        Path(s) to Excel file(s) containing 16S data (default: [])
  --outdir OUTDIR       Output directory (default: None)
  --cache-dir CACHE_DIR
                        HuggingFace cache dir (optional) (default: None)
  --min-asv-length MIN_ASV_LENGTH
                        Minimum ASV sequence length (optional) (default: None)
  --max-asv-length MAX_ASV_LENGTH
                        Maximum ASV sequence length (optional) (default: None)
  --force               Force recalculation of all steps, ignoring existing intermediate files (default: False)
  --model-12s MODEL_12S
  --model-16s MODEL_16S
  --base-config BASE_CONFIG
  --pooling-token {mean,cls}
  --batch-size BATCH_SIZE
  --use-amp             Enable mixed precision on CUDA (default: False)
  --max-length MAX_LENGTH
                        Longest tokenized length for the tokenizer (default: 512)
  --weight-mode {hellinger,log,relative,softmax_tau3}
  --site-pooling {l2_weighted_mean,weighted_mean,gem_p2,gem_p3,simple_mean}
                        Method for pooling ASV embeddings to site embeddings. 'simple_mean' performs no normalisation. (default: l2_weighted_mean)
  --run-tsne
  --run-umap
  --perplexity PERPLEXITY
                        Perplexity setting for tSNE (default: 5)
  --n-neighbors N_NEIGHBORS
                        Number of neighbours for UMAP (default: 15)
  --metric {cosine,euclidean}
  --seed SEED
  --fuse {none,concat}  How to fuse 12S+16S site vectors (concat or none) (default: concat)
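The `--weight-mode` choices suggest different transforms of the read counts before weighting. The sketch below is one plausible reading of those option names (Hellinger = square root of relative abundance, log = log1p of counts, temperature-scaled softmax); it is an assumption, and the script's actual formulas may differ.

```python
import numpy as np

def asv_weights(counts, mode="hellinger", tau=3.0):
    """One plausible interpretation of the --weight-mode options.

    All modes return weights normalised to sum to 1; the transforms mainly
    differ in how strongly they dampen high-read-count ASVs.
    """
    counts = np.asarray(counts, dtype=float)
    rel = counts / counts.sum()                 # relative abundance
    if mode == "relative":
        w = rel
    elif mode == "hellinger":
        w = np.sqrt(rel)                        # Hellinger transform
    elif mode == "log":
        w = np.log1p(counts)                    # log damping of dominant ASVs
    elif mode == "softmax_tau3":
        z = np.log(rel + 1e-12) / tau           # temperature-scaled softmax
        w = np.exp(z - z.max())
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w / w.sum()

# A dominant ASV (90% of reads) keeps most of its weight under 'relative'
# but is damped under 'hellinger'.
print(asv_weights([90, 9, 1], mode="relative"))
print(asv_weights([90, 9, 1], mode="hellinger"))
```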

Results

The results folder will contain at least two files: the per-ASV embeddings and the per-site embeddings, both in parquet format. If you've enabled --run-tsne and/or --run-umap, there will also be CSV files with TSNE1/TSNE2 and UMAP1/UMAP2 values for all sites.
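A typical next step is joining the t-SNE/UMAP CSV output with your own site metadata for plotting. The column and key names below are assumptions (inspect the actual CSV headers in your --outdir); in-memory frames stand in for the real files.

```python
import pandas as pd

# Stand-ins for pd.read_csv("results/<tsne output>.csv") and your metadata.
tsne = pd.DataFrame({"site": ["A", "B"],
                     "TSNE1": [0.1, -0.2],
                     "TSNE2": [1.0, 0.5]})
meta = pd.DataFrame({"site": ["A", "B"],
                     "latitude": [-32.0, -12.1]})

# Join on the shared site identifier so each point carries its latitude.
plot_df = tsne.merge(meta, on="site")
print(plot_df)
```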

Runtime

It usually runs in only a few minutes, but so far I've only tested it on systems without GPUs.

Installation

There's a conda environment with the dependencies needed for DNABERT-S in DNABERT_S.yml:

conda env create -f DNABERT_S.yml

AI statement

Most of this code was written by GPT-5 after a long discussion! Every mode has heaps of options; human-me would never have done that.

Regression

I am working on predicting latitude/longitude from the site embeddings alone; that happens in regress.py. I'm unsure how best to handle replicates, as there's a lot of variability between them: ignoring replicates gives a test R² of about 0.7, while grouped K-fold by replicate drops it to about 0.5 (the more honest estimate).
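The grouped K-fold idea is that all replicates of a site must land in the same fold, so test sites are genuinely unseen during training. Below is a minimal, dependency-light sketch of that split logic (the same idea as scikit-learn's `GroupKFold`, which regress.py may or may not use); the helper name is hypothetical.

```python
import numpy as np

def group_kfold_indices(groups, n_splits):
    """Yield (train_idx, test_idx) pairs where all samples sharing a group
    ID (e.g. all replicates of one site) stay together in the test fold."""
    uniq = np.unique(groups)
    for held_out in np.array_split(uniq, n_splits):
        test = np.isin(groups, held_out)
        yield np.where(~test)[0], np.where(test)[0]

groups = np.repeat(np.arange(10), 4)   # 10 sites x 4 replicates each
for train, test in group_kfold_indices(groups, n_splits=5):
    # No replicate of a test site ever appears in the training set.
    assert not set(groups[train]) & set(groups[test])
```

Without this grouping, replicates of the same site leak between train and test, which inflates R² exactly as described above.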

Example data

This repository comes with a FAIRe-formatted Excel sheet from an OceanOmics transect from Perth to the Cocos (Keeling) Islands, 12S only (MiFish-U; Miya et al.).

Owner

  • Name: Computational-Biology-OceanOmics
  • Login: Computational-Biology-OceanOmics
  • Kind: organization

This is the research and development space for the OceanOmics team

GitHub Events

Total
  • Push event: 13
  • Create event: 2
Last Year
  • Push event: 13
  • Create event: 2