biotrove

NeurIPS 2024 Track on Datasets and Benchmarks (Spotlight)

https://github.com/baskargroup/biotrove

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org
✓ Committers with academic emails: 3 of 10 committers (30.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Keywords

animals clip image-classification multimodal rare-species species taxonomy zero-shot-classification

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: baskargroup
Language: Jupyter Notebook
Default Branch: main
Homepage: https://baskargroup.github.io/BioTrove/
Size: 195 MB

Statistics

Stars: 31
Watchers: 2
Forks: 4
Open Issues: 3
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme

BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Data Preprocessing

Before using the code in this repository, download the metadata from Hugging Face and pre-process it with the biotrove_process library, located in the BioTrove-preprocess/biotrove_process directory. The README file in that directory describes the library in detail.

The library contains scripts that generate machine-learning-ready image-text pairs from the downloaded metadata in four steps:

1. Processing metadata files to obtain category and species distribution.
2. Filtering metadata based on user-defined thresholds and generating shuffled chunks.
3. Downloading images based on URLs in the metadata.
4. Generating text labels for the images.
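
The filtering, chunking, and labeling steps can be sketched in plain Python as follows. This is a minimal illustration and not the biotrove_process API: the record fields (`species`, `common_name`, `url`), the function names, and the caption template are all assumptions made for the example, and the image download step is left out because it needs network access.

```python
import random
from collections import Counter

def filter_and_chunk(records, min_images_per_species, chunk_size, seed=0):
    """Keep species with enough images, shuffle, and split into chunks."""
    # Step 1: species distribution over the metadata records.
    counts = Counter(r["species"] for r in records)
    # Step 2: drop species below the user-defined threshold, then shuffle.
    kept = [r for r in records if counts[r["species"]] >= min_images_per_species]
    random.Random(seed).shuffle(kept)
    return [kept[i:i + chunk_size] for i in range(0, len(kept), chunk_size)]

def make_caption(record):
    """Step 4: build a text label (hypothetical template) for one image."""
    return f"a photo of {record['species']}, commonly known as {record['common_name']}"

# Toy metadata; the URLs are placeholders.
records = (
    [{"species": "Danaus plexippus", "common_name": "monarch butterfly",
      "url": f"https://example.org/monarch_{i}.jpg"} for i in range(3)]
    + [{"species": "Apis mellifera", "common_name": "western honey bee",
        "url": f"https://example.org/bee_{i}.jpg"} for i in range(2)]
    + [{"species": "Rosalia alpina", "common_name": "rosalia longicorn",
        "url": "https://example.org/rosalia_0.jpg"}]
)

chunks = filter_and_chunk(records, min_images_per_species=2, chunk_size=2)
captions = [make_caption(r) for chunk in chunks for r in chunk]
```

With a threshold of 2, the single Rosalia alpina record is dropped and the remaining five records are split into chunks of at most two.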

Model Training

We train three models using a modified version of the BioCLIP/OpenCLIP codebase. Each model is trained for 40 epochs on BioTrove-40M, on 2 nodes, 8xH100 GPUs, on NYU's Greene high-performance compute cluster.

We optimize our hyperparameters prior to training with Ray. Our standard training parameters are as follows:

    --dataset-type webdataset --pretrained openai --text_type random --dataset-resampled --warmup 5000 --batch-size 4096 --accum-freq 1 --epochs 40 --workers 8 --model ViT-B-16 --lr 0.0005 --wd 0.0004 --precision bf16 --beta1 0.98 --beta2 0.99 --eps 1.0e-6 --local-loss --gather-with-grad --ddp-static-graph --grad-checkpointing
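
When launching runs from a script, it can help to keep these parameters in one place. The sketch below assembles the same flags from a dictionary; the names `TRAIN_PARAMS`, `SWITCHES`, and `build_flags` are made up for this example. The training entry point and the data-path flags are not given above, so they are deliberately omitted.

```python
import shlex

# Flags that take a value, copied from the standard training parameters.
TRAIN_PARAMS = {
    "dataset-type": "webdataset",
    "pretrained": "openai",
    "text_type": "random",
    "warmup": 5000,
    "batch-size": 4096,
    "accum-freq": 1,
    "epochs": 40,
    "workers": 8,
    "model": "ViT-B-16",
    "lr": 0.0005,
    "wd": 0.0004,
    "precision": "bf16",
    "beta1": 0.98,
    "beta2": 0.99,
    "eps": "1.0e-6",
}

# Boolean switches that take no value.
SWITCHES = [
    "dataset-resampled",
    "local-loss",
    "gather-with-grad",
    "ddp-static-graph",
    "grad-checkpointing",
]

def build_flags(params, switches):
    """Turn the parameter dictionary and switches into an argv-style list."""
    argv = []
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    argv += [f"--{name}" for name in switches]
    return argv

flags = build_flags(TRAIN_PARAMS, SWITCHES)
print(shlex.join(flags))
```

The switches are grouped at the end here, so the order differs from the line above; for a typical argparse-style parser the order of distinct flags does not matter.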

For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the OpenCLIP and BioCLIP documentation.

Model weights

We released three trained model checkpoints, which can be downloaded from the BioTrove-CLIP model card on Hugging Face. These CLIP-style models were trained on BioTrove-Train (40M) in the following configurations:

- BT-CLIP-O: A ViT-B/16 backbone initialized from the OpenCLIP checkpoint and trained for 40 epochs.
- BT-CLIP-B: A ViT-B/16 backbone initialized from the BioCLIP checkpoint and trained for 8 epochs.
- BT-CLIP-M: A ViT-L/14 backbone initialized from the MetaCLIP checkpoint and trained for 12 epochs.

These models were developed as an open-source product for the benefit of the AI community. We therefore request that any derivative products also be open-source.

Model Validation

To validate the zero-shot accuracy of our trained models and compare them against baseline models, we use the VLHub repository with some slight modifications.
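
For readers new to the metric, zero-shot classification with a CLIP-style model assigns each image the label whose text embedding is most similar to the image embedding, and accuracy is the fraction of images labeled correctly. The sketch below shows that computation on toy three-dimensional vectors. It is not the VLHub implementation: the function names and all embedding values are invented for illustration, and a real evaluation would take the embeddings from the model's image and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(image_emb, text_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    image_emb = normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(image_emb, normalize(emb)))
        for label, emb in text_embs.items()
    }
    return max(scores, key=scores.get)

def zero_shot_accuracy(samples, text_embs):
    """Fraction of (image embedding, true label) pairs predicted correctly."""
    correct = sum(zero_shot_predict(emb, text_embs) == label for emb, label in samples)
    return correct / len(samples)

# Toy text embeddings, one per class prompt.
text_embs = {
    "Danaus plexippus": [1.0, 0.0, 0.0],
    "Apis mellifera": [0.0, 1.0, 0.0],
}

# Toy image embeddings with their true labels; the last one is misclassified.
samples = [
    ([0.9, 0.1, 0.0], "Danaus plexippus"),
    ([0.2, 0.8, 0.1], "Apis mellifera"),
    ([0.6, 0.5, 0.0], "Apis mellifera"),
]

accuracy = zero_shot_accuracy(samples, text_embs)
```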

Pre-Run

After cloning this repository and navigating to the BioTrove/model_validation directory, we recommend installing all the project requirements into a conda environment with pip install -r requirements.txt. Before executing a command in VLHub, also add BioTrove/model_validation/src to your PYTHONPATH:

    export PYTHONPATH="$PYTHONPATH:$PWD/src"

Base Command

A basic BioTrove model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path given by the --resume flag, on the ImageNet validation set and reports the results to Weights and Biases.

    python src/training/main.py \
        --batch-size=32 \
        --workers=8 \
        --imagenet-val "/imagenet/val/" \
        --model="resnet50" \
        --zeroshot-frequency=1 \
        --image-size=224 \
        --resume "/PATH/TO/WEIGHTS.pth" \
        --report-to wandb
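
If evaluations are driven from Python rather than the shell, the same command and the PYTHONPATH setup can be assembled programmatically. The sketch below is one way to do that; `build_eval_command` and `eval_environment` are hypothetical helper names, and the flag values are copied from the command above. Nothing is executed here, since the paths are placeholders.

```python
import os
import sys

def build_eval_command(weights_path, imagenet_val, model="resnet50",
                       batch_size=32, workers=8, image_size=224,
                       report_to="wandb"):
    """Build the argv list for the base evaluation command."""
    return [
        sys.executable, "src/training/main.py",
        f"--batch-size={batch_size}",
        f"--workers={workers}",
        "--imagenet-val", imagenet_val,
        f"--model={model}",
        "--zeroshot-frequency=1",
        f"--image-size={image_size}",
        "--resume", weights_path,
        "--report-to", report_to,
    ]

def eval_environment(src_dir="src"):
    """Copy the current environment and append src_dir to PYTHONPATH."""
    env = dict(os.environ)
    parts = [env.get("PYTHONPATH", ""), os.path.abspath(src_dir)]
    env["PYTHONPATH"] = os.pathsep.join(p for p in parts if p)
    return env

cmd = build_eval_command("/PATH/TO/WEIGHTS.pth", "/imagenet/val/")
env = eval_environment("src")

# To launch from the model_validation directory:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.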

Baseline Models

We compare our trained checkpoints to three strong baselines. We describe our baselines in the table below, including the required flags to evaluate them.

| Model Name | Origin | Path to checkpoint | Runtime Flags |
|---|---|---|---|
| BioCLIP | https://arxiv.org/abs/2311.18803 | https://huggingface.co/imageomics/bioclip | --model ViT-B-16 --resume "/PATH/TO/bioclipckpt.bin" |
| OpenAI CLIP | https://arxiv.org/abs/2103.00020 | Downloads automatically | --model ViT-B-16 --pretrained=openai |
| MetaCLIP-cc | https://github.com/facebookresearch/MetaCLIP | Downloads automatically | --model ViT-L-14-quickgelu --pretrained=metaclip_fullcc |
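
The runtime flags in the table can be kept in a small lookup and combined with a benchmark flag to produce one command per baseline. The sketch below illustrates the idea; `BASELINE_FLAGS` and `baseline_command` are names invented for this example, the BioCLIP checkpoint path is the placeholder from the table, and the MetaCLIP tag is written `metaclip_fullcc`, which is how OpenCLIP spells it. Additional flags from the base command (batch size, workers, and so on) may also be needed in practice.

```python
# Runtime flags per baseline, taken from the table above.
BASELINE_FLAGS = {
    "BioCLIP": ["--model", "ViT-B-16", "--resume", "/PATH/TO/bioclipckpt.bin"],
    "OpenAI CLIP": ["--model", "ViT-B-16", "--pretrained=openai"],
    "MetaCLIP-cc": ["--model", "ViT-L-14-quickgelu", "--pretrained=metaclip_fullcc"],
}

def baseline_command(name, benchmark_flags):
    """Combine the entry point, one baseline's flags, and benchmark flags."""
    if name not in BASELINE_FLAGS:
        raise KeyError(f"unknown baseline: {name}")
    return ["python", "src/training/main.py", *BASELINE_FLAGS[name], *benchmark_flags]

# One command per baseline for a single benchmark flag.
commands = {name: baseline_command(name, ["--deepweeds"]) for name in BASELINE_FLAGS}
```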

Existing Benchmarks

In the BioTrove paper, we report results on the following established benchmarks from prior scientific literature: Birds525, BioCLIP-Rare, IP102 Insects, Fungi, Deepweeds, and Confounding Species. We also introduce three new benchmarks: BioTrove-Balanced, BioTrove-LifeStages, and BioTrove-Unseen.

Our package expects a valid path to each image to exist in its corresponding metadata file; therefore, metadata CSV paths must be updated before running each benchmark.

| Benchmark Name | Images URL | Metadata Path | Runtime Flag(s) |
|---|---|---|---|
| BioTrove-Balanced | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-Balanced.csv | --arbor-val --taxon MYTAXON |
| BioTrove-LifeStages | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-LifeStages.csv | --lifestages --taxon MYTAXON |
| BioTrove-Unseen | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-Unseen.csv | --arbor-rare --taxon MYTAXON |
| BioCLIP Rare | https://huggingface.co/datasets/imageomics/rare-species | model_validation/metadata/bioclip-rare-metadata.csv | --bioclip-rare --taxon MYTAXON |
| Birds525 | https://www.kaggle.com/datasets/gpiosenka/100-bird-species | model_validation/metadata/birds525metadata.csv | --birds /birds525 --ds-filter birds |
| Confounding Species | TBD | model_validation/metadata/confoundingspecies.csv | --confounding |
| Deepweeds | https://www.kaggle.com/datasets/imsparsh/deepweeds | model_validation/metadata/deepweedsmetadata.csv | --deepweeds |
| Fungi | http://ptak.felk.cvut.cz/plants/DanishFungiDataset/DF20M-images.tar.gz | model_validation/metadata/fungimetadata.csv | --fungi |
| IP102 Insects | https://www.kaggle.com/datasets/rtlmhjbn/ip02-dataset | model_validation/metadata/ins2_metadata.csv | --insects2 |
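
Because each metadata CSV must point at images that exist on your machine, the image paths usually need their root directory swapped after download. The sketch below shows one way to do this with the standard csv module. The column name `filepath`, the function name `rewrite_image_paths`, and the example paths are assumptions for illustration; check the header of each metadata file for the actual column name.

```python
import csv
import io

def rewrite_image_paths(csv_text, old_prefix, new_prefix, column="filepath"):
    """Return csv_text with old_prefix replaced by new_prefix in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only touch paths that start with the old root; leave others as-is.
        if row[column].startswith(old_prefix):
            row[column] = new_prefix + row[column][len(old_prefix):]
        writer.writerow(row)
    return out.getvalue()

# Toy metadata with a hypothetical layout.
original = (
    "filepath,label\n"
    "/old/root/img1.jpg,class_a\n"
    "/old/root/img2.jpg,class_b\n"
)

updated = rewrite_image_paths(original, "/old/root/", "/data/deepweeds/")
```

For files on disk, read the text with `open(path, newline="")` and write the returned string back out once you have checked a few rows.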

Acknowledgments

If you find this repository useful, please consider citing these related papers:

VLHub

    @article{feuer2023distributionally,
      title={Distributionally Robust Classification on a Data Budget},
      author={Benjamin Feuer and Ameya Joshi and Minh Pham and Chinmay Hegde},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2023},
      url={https://openreview.net/forum?id=D5Z2E8CNsD},
      note={}
    }

BioCLIP

    @misc{stevens2024bioclip,
      title={BioCLIP: A Vision Foundation Model for the Tree of Life},
      author={Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
      year={2024},
      eprint={2311.18803},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

OpenCLIP

    @software{ilharco_gabriel_2021_5143773,
      author = {Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
      title = {OpenCLIP},
      month = jul,
      year = 2021,
      note = {If you use this software, please cite it as below.},
      publisher = {Zenodo},
      version = {0.1},
      doi = {10.5281/zenodo.5143773},
      url = {https://doi.org/10.5281/zenodo.5143773}
    }

Parts of this project page were adapted from the Nerfies page.

Website License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Citation

    @misc{yang2024arboretumlargemultimodaldataset,
      title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
      author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
      year={2024},
      eprint={2406.17720},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.17720}
    }

Owner

Name: BaskarGroup
Login: BaskarGroup
Kind: organization

Repositories: 1
Profile: https://github.com/BaskarGroup

GitHub Events

Total

Issues event: 1
Watch event: 13
Issue comment event: 3
Member event: 2
Push event: 34
Fork event: 1
Create event: 2

Last Year

Issues event: 1
Watch event: 13
Issue comment event: 3
Member event: 2
Push event: 34
Fork event: 1
Create event: 2

Committers

Last synced: 11 months ago

All Time

Total Commits: 247
Total Committers: 10
Avg Commits per committer: 24.7
Development Distribution Score (DDS): 0.555

Past Year

Commits: 42
Committers: 3
Avg Commits per committer: 14.0
Development Distribution Score (DDS): 0.238

Top Committers

Name	Email	Commits
Zahid-isu	8****u	110
zahid	8****u	33
penfever	p**r@g**m	29
znjubery	z**y@g**m	28
Ben Feuer	p**r@h**m	23
André	7****3	10
ChihHsuan-Yang	6****g	5
Nirmal	n**l@i**u	4
Kelly Marshall	k**8@n**u	4
Kelly Marshall	k**8@l**u	1

Committer Domains (Top 20 + Academic)

log-1.hpc.nyu.edu: 1
nyu.edu: 1
iastate.edu: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 2
Total pull requests: 4
Average time to close issues: 3 months
Average time to close pull requests: about 1 hour
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.5
Average comments per pull request: 0.25
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

johnbradley (1)

Pull Request Authors

Km3888 (2)
hlapp (1)
ajn313 (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

Biotrove-preprocess/requirement.txt pypi

aiohttp ==3.9.5
aiosignal ==1.3.1
async-timeout ==4.0.3
attrs ==23.2.0
contourpy ==1.2.1
cramjam ==2.8.3
cycler ==0.12.1
fastparquet ==2024.5.0
fonttools ==4.53.0
frozenlist ==1.4.1
fsspec ==2024.6.0
idna ==3.7
importlib_resources ==6.4.0
kiwisolver ==1.4.5
matplotlib ==3.9.0
multidict ==6.0.5
numpy ==1.26.4
packaging ==24.1
pandas ==2.2.2
pillow ==10.3.0
pyarrow ==16.1.0
pyparsing ==3.1.2
python-dateutil ==2.9.0.post0
pytz ==2024.1
seaborn ==0.13.2
six ==1.16.0
tqdm ==4.66.4
tzdata ==2024.1
yarl ==1.9.4
zipp ==3.19.2

Biotrove-preprocess/setup.py pypi

model_training/pyproject.toml pypi

model_training/requirements-training.txt pypi

Jinja2 *
Markdown *
MarkupSafe *
PyYAML *
Werkzeug *
absl-py *
braceexpand *
cachetools *
certifi *
charset-normalizer *
filelock *
fsspec *
ftfy *
google-auth *
google-auth-oauthlib *
grpcio *
huggingface-hub *
idna *
mpmath *
networkx *
numpy *
nvidia-cublas-cu12 *
nvidia-cuda-cupti-cu12 *
nvidia-cuda-nvrtc-cu12 *
nvidia-cuda-runtime-cu12 *
nvidia-cudnn-cu12 *
nvidia-cufft-cu12 *
nvidia-curand-cu12 *
nvidia-cusolver-cu12 *
nvidia-cusparse-cu12 *
nvidia-nccl-cu12 *
nvidia-nvjitlink-cu12 *
nvidia-nvtx-cu12 *
oauthlib *
packaging *
pandas *
pillow *
protobuf *
pyasn1 *
pyasn1-modules *
python-dateutil *
pytz *
regex *
requests *
requests-oauthlib *
rsa *
safetensors *
six *
sympy *
tensorboard *
tensorboard-data-server *
timm *
tokenizers *
torch *
torchvision *
tqdm *
transformers *
triton *
typing_extensions *
tzdata *
urllib3 *
wcwidth *
webdataset *

model_training/requirements-viz.txt pypi

pandas ==2.1.2
plotly =5.18.0

model_training/requirements.txt pypi

Jinja2 *
MarkupSafe *
PyYAML *
certifi *
charset-normalizer *
cmake *
filelock *
fsspec *
ftfy *
huggingface-hub *
idna *
lit *
mpmath *
networkx *
numpy *
nvidia-cublas-cu11 *
nvidia-cuda-cupti-cu11 *
nvidia-cuda-nvrtc-cu11 *
nvidia-cuda-runtime-cu11 *
nvidia-cudnn-cu11 *
nvidia-cufft-cu11 *
nvidia-curand-cu11 *
nvidia-cusolver-cu11 *
nvidia-cusparse-cu11 *
nvidia-nccl-cu11 *
nvidia-nvtx-cu11 *
packaging *
pandas *
pillow *
python-dateutil *
pytz *
regex *
requests *
safetensors *
scipy *
six *
sympy *
tokenizers *
torch *
torchvision *
tqdm *
transformers *
triton *
typing_extensions *
tzdata *
urllib3 *
wcwidth *

model_training/setup.py pypi

model_validation/requirements.txt pypi

braceexpand *
byol-pytorch *
clip-benchmark ==1.4.0
coca-pytorch *
datasets ==2.8.0
ftfy *
fuzzywuzzy *
huggingface_hub *
nltk *
open_clip_torch *
pandas <=1.4.0
regex *
scikit-learn *
scipy *
seaborn *
tensorboard *
timm *
torch >=1.9.0
torchvision *
tqdm *
vit-pytorch *
wandb *
webdataset >=0.2.5
x-clip *

model_validation/requirements_notorch.txt pypi

braceexpand *
byol-pytorch *
coca-pytorch *
ftfy *
fuzzywuzzy *
huggingface_hub *
nltk *
open_clip_torch *
pandas <=1.4.0
regex *
scikit-learn *
scipy *
seaborn *
tensorboard *
timm *
tqdm *
vit-pytorch *
wandb *
webdataset >=0.2.5
x-clip *

model_validation/requirements_ray.txt pypi

braceexpand *
byol-pytorch *
coca-pytorch *
ftfy *
fuzzywuzzy *
huggingface_hub *
nltk *
open_clip_torch *
pandas <=1.4.0
pyarrow *
ray *
regex *
scikit-learn *
scipy *
seaborn *
tensorboard *
timm *
torch >=1.9.0
torchvision *
tqdm *
vit-pytorch *
wandb *
webdataset >=0.2.5
x-clip *
