open_clip_with_biomedica
Custom Open CLIP repo to train biomedical CLIP models
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Ale9806
- License: other
- Language: Python
- Default Branch: main
- Size: 16.9 MB
Statistics
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Arxiv: Arxiv | Website: Biomedica | Training instructions: CLIP | Models and Datasets: Hugging Face
How to Train a CLIP-like Model with BIOMEDICA?
This is a (very minor) adaptation of the OpenCLIP repository to train CLIP-style models using the Biomedical archive. All credit goes to the original developers of OpenCLIP. This repository also builds upon an original discussion on OpenCLIP's GitHub.
Introduction
OpenCLIP has gained recognition in both academic and industrial communities as an exceptional open-source framework for training CLIP-like models. However, the documentation can be lacking when it comes to fine-tuning these models for specific downstream tasks using custom datasets. For beginners, this can be overwhelming as they might not know where to begin. This guide outlines some key considerations and best practices for using OpenCLIP effectively.
Step 1: Create a Virtual Environment
To begin, we need to set up a virtual environment. Based on my own testing, Python 3.9 works well. You can create the environment using the following command:
```bash
# Create env
conda create --name train_clip python=3.9
# Activate env
conda activate train_clip
```
Step 2: Install the environment
Check your CUDA version before installing torch and its companion packages. Installing the dependencies directly with the official command is likely to produce a series of errors caused by mismatched torch and CUDA versions, so install the build that matches your system.
```bash
# Check CUDA version
nvidia-smi
```
This prints the driver version (using my local device as an example):
NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7
Then visit the official PyTorch website to get a compatible distribution. It is recommended to use pip for installation. For example, for CUDA 11.7 I used:
```bash
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
```
Lastly, verify that the installation was successful:
```python
import torch
print(torch.cuda.is_available())
# Expected output: True
```
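As a rough illustration of the version-matching step above, the sketch below extracts the CUDA version from an `nvidia-smi` header line and builds a wheel index URL in the same `cuXYZ` form as the pip command. The helper names `parse_cuda_version` and `wheel_index_url` are hypothetical, and the sketch assumes that PyTorch actually publishes wheels for that CUDA tag, which should be confirmed on the PyTorch website.

```python
import re


def parse_cuda_version(smi_header: str) -> str:
    """Return the 'major.minor' CUDA version found in an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return f"{match.group(1)}.{match.group(2)}"


def wheel_index_url(cuda_version: str) -> str:
    """Build an index URL such as '.../whl/cu117' (assumed naming pattern)."""
    major, minor = cuda_version.split(".")
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


header = "NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7"
version = parse_cuda_version(header)
print(version)                   # 11.7
print(wheel_index_url(version))  # https://download.pytorch.org/whl/cu117
```

Note that `nvidia-smi` reports the highest CUDA version the driver supports, so a wheel built for that version or an older one is generally what to look for.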
Step 3: Clone and install open_clip
```bash
# Clone repo
git clone https://github.com/mlfoundations/open_clip.git
# Enter the project root directory.
cd open_clip
# Install training dependencies
pip install -r requirements-training.txt
# Install webdataset
git clone https://github.com/minwoosun/webdataset.git
cd webdataset
git checkout hf-token
pip install -e .
# Install wandb
pip install wandb
# Set up tokens
huggingface-cli login
wandb login
```
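A quick way to confirm that the installs above succeeded is to check that the packages can be found on the import path. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list of required modules is an assumption based on the commands in Steps 2 and 3.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules that the install commands above are expected to provide (assumed list).
required = ["torch", "webdataset", "wandb"]
missing = missing_modules(required)
print("missing:", missing if missing else "none")
```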
Step 4: Choose a suitable pre-trained model
OpenCLIP officially provides many pre-trained models of the CLIP series for download and use. You can use the following command to view the specific details of these models.
The first column is the model's name, which is also the parameter for text encoding in the model. The second column indicates either the provider of the model or the scale of the training dataset used.
```python
import open_clip
open_clip.list_pretrained()
# Output (truncated):
# [('RN50', 'openai'),
#  ('RN50', 'yfcc15m'),
#  ('RN50', 'cc12m'),
#  ...,
#  ('nllb-clip-large-siglip', 'v1')]
```
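Because the same model name appears once per pre-trained tag, it can help to group the pairs by model before choosing one. The sketch below does this with a few sample entries copied from the output above; `tags_by_model` is a hypothetical helper, not part of OpenCLIP.

```python
from collections import defaultdict


def tags_by_model(pairs):
    """Group (model_name, pretrained_tag) pairs into {model_name: [tags, ...]}."""
    grouped = defaultdict(list)
    for model_name, tag in pairs:
        grouped[model_name].append(tag)
    return dict(grouped)


# Sample entries from the listing above. With open_clip installed, the full
# list could be used instead: pairs = open_clip.list_pretrained()
pairs = [
    ("RN50", "openai"),
    ("RN50", "yfcc15m"),
    ("RN50", "cc12m"),
    ("nllb-clip-large-siglip", "v1"),
]
print(tags_by_model(pairs)["RN50"])  # ['openai', 'yfcc15m', 'cc12m']
```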
Step 5: Train a custom CLIP using BIOMEDICA
To train CLIP-style models on a local webdataset (e.g. biomedica_webdataset), first download the dataset. Then run the following commands:
5.A Training CLIP using webdataset without streaming
A SLURM-ready script is already provided at: train clip
```bash
# Enter the src folder of the open_clip repository
cd open_clip/src
# Create a bash file
vim train_openclip.sh
# Add the following:
# specify which GPUs you want to use.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
# set the training args, Example:
# (the module is open_clip_train.main, as in the examples further down;
#  older open_clip checkouts name it training.main)
torchrun --nproc_per_node 6 -m open_clip_train.main \
    --batch-size 500 \
    --precision amp \
    --workers 4 \
    --report-to tensorboard \
    --save-frequency 1 \
    --logs="/path/to/your/local/logs" \
    --dataset-type csv \
    --csv-separator="," \
    --train-data /path/to/your/local/training_dict.csv \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --warmup 1000 \
    --lr=5e-6 \
    --wd=0.1 \
    --epochs=32 \
    --model ViT-B-32 \
    --pretrained /path/to/your/local/model
```
For a more detailed explanation of the arguments, please refer to: https://github.com/mlfoundations/open_clip/blob/main/src/training/params.py
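The example command reads a comma-separated file whose image column is named `filepath` and whose caption column is named `caption`. A minimal sketch of producing such a file with the standard library is shown below; the image paths and captions are made up for illustration, and `write_training_csv` is a hypothetical helper.

```python
import csv
import io


def write_training_csv(handle, rows):
    """Write (image_path, caption) rows under the header 'filepath,caption'."""
    writer = csv.writer(handle, delimiter=",", lineterminator="\n")
    writer.writerow(["filepath", "caption"])
    writer.writerows(rows)


# Hypothetical image paths and captions, for illustration only.
rows = [
    ("/path/to/images/fig_0001.jpg", "Chest X-ray, frontal view"),
    ("/path/to/images/fig_0002.jpg", "H&E stained tissue section"),
]

buffer = io.StringIO()
write_training_csv(buffer, rows)
print(buffer.getvalue())

# To create the file itself, pass a real file handle instead:
# with open("training_dict.csv", "w", newline="", encoding="utf-8") as f:
#     write_training_csv(f, rows)
```

Captions that contain the separator are quoted automatically by the `csv` module, so they remain a single field when the file is read back.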
Epochs
For larger datasets (e.g. Laion2B), we recommend setting --train-num-samples to a lower value than the full epoch, for example --train-num-samples 135646078 for 1/16 of an epoch, in conjunction with --dataset-resampled to do sampling with replacement. This gives more frequent checkpoints, so the model can be evaluated more often.
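The arithmetic behind that example value can be sketched as follows. The total dataset size used here is an assumption: it is simply 16 times the quoted value, i.e. the size implied by the text rather than an official figure.

```python
def samples_per_checkpoint(total_samples: int, checkpoints_per_epoch: int) -> int:
    """Value to pass as --train-num-samples so that one full pass over the
    data is split into `checkpoints_per_epoch` shorter 'epochs'."""
    return total_samples // checkpoints_per_epoch


# Assumed dataset size: 16 x the value quoted above.
total = 135646078 * 16
print(total)                              # 2170337248
print(samples_per_checkpoint(total, 16))  # 135646078
```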
Patch Dropout
Recent research has shown that one can drop out half to three-quarters of the visual tokens, leading to up to 2-3x faster training without loss of accuracy.
You can set this on your visual transformer config with the key patch_dropout.
In the paper, they also finetuned without the patch dropout at the end. You can do this with the command-line argument --force-patch-dropout 0.
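To make the idea concrete, the sketch below randomly keeps a subset of patch tokens for one image. It illustrates the general technique only and is not OpenCLIP's implementation; the helper name `keep_patch_indices` is hypothetical, and the token count assumes a 224x224 input cut into 32x32 patches.

```python
import random


def keep_patch_indices(num_patches, patch_dropout, rng):
    """Randomly choose which patch tokens survive for one image (sorted indices)."""
    num_keep = max(1, int(num_patches * (1.0 - patch_dropout)))
    return sorted(rng.sample(range(num_patches), num_keep))


rng = random.Random(0)
# A 224x224 image cut into 32x32 patches gives 7 * 7 = 49 patch tokens.
print(len(keep_patch_indices(49, 0.5, rng)))   # 24
print(len(keep_patch_indices(49, 0.75, rng)))  # 12
# A dropout of 0 keeps every token, which mirrors --force-patch-dropout 0.
print(len(keep_patch_indices(49, 0.0, rng)))   # 49
```

With half to three-quarters of the tokens removed, the transformer processes a much shorter sequence per image, which is where the speed-up comes from.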
Multiple data sources
OpenCLIP supports using multiple data sources, by separating different data paths with ::.
For instance, to train on CC12M and on LAION, one might use --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar".
Using --dataset-resampled is recommended for these cases.
By default, the expected number of times the model sees a sample from each source is proportional to the size of the source. For instance, when training on one data source of size 400M and one of size 10M, samples from the first source are 40x more likely to be seen in expectation.
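The proportionality described above can be checked with a few lines of arithmetic. This is a sketch only: `split_sources` and `source_probabilities` are hypothetical helpers, and the source names are placeholders.

```python
def split_sources(train_data):
    """Split a --train-data value into its individual sources on '::'."""
    return train_data.split("::")


def source_probabilities(sizes):
    """Probability of drawing from each source when sampling proportionally to size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


paths = split_sources(
    "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar"
)
print(len(paths))  # 2

# Placeholder names for a 400M-sample source and a 10M-sample source.
probs = source_probabilities({"large": 400_000_000, "small": 10_000_000})
print(round(probs["large"] / probs["small"], 6))  # 40.0
```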
Single-Node
We make use of torchrun to launch distributed jobs. The following launches a job on a node of 4 GPUs:
```bash
cd open_clip/src
torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/
```
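The `{0000..2175}` part of the `--train-data` value is a brace range that stands for one path per shard. The sketch below expands a single numeric range with the standard library to show how many shards the pattern covers and roughly how many samples each one holds. `expand_shards` is a hypothetical helper that handles only one range per pattern; it is not the expansion code that webdataset itself uses.

```python
import re


def expand_shards(pattern):
    """Expand one '{start..end}' numeric range into the list of shard paths."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, e.g. 0000
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + str(i).zfill(width) + suffix
            for i in range(int(start), int(end) + 1)]


shards = expand_shards("/data/cc12m/cc12m-train-{0000..2175}.tar")
print(len(shards))  # 2176
print(shards[0])    # /data/cc12m/cc12m-train-0000.tar
print(shards[-1])   # /data/cc12m/cc12m-train-2175.tar

# Average samples per shard implied by --train-num-samples 10968539.
print(10968539 // len(shards))  # 5040
```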
Multi-Node
The same script above works, so long as users include information about the number of nodes and host node.
```bash
cd open_clip/src
torchrun --nproc_per_node=4 \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    -m open_clip_train.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/
```
SLURM
This is likely the easiest solution to utilize. The following script was used to train our largest models:
```bash
#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition PARTITION_NAME

eval "$(/path/to/conda/bin/conda shell.bash hook)" # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data="/path_to_biomedica_tars/{00000..41455}.tar" \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=32 \
    --workers=8 \
    --model ViT-B-32 \
    --name "ViT-B-32-Vanilla" \
    --seed 0 \
    --local-loss \
    --gather-with-grad
```
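For orientation, the scale of this job can be worked out from the SBATCH settings and the `--batch-size` flag. The sketch assumes that each SLURM task drives one GPU, as the `--gres=gpu:4` and `--ntasks-per-node=4` settings suggest, and that `--batch-size` is the per-GPU batch size; the helper names are hypothetical.

```python
def world_size(nodes: int, tasks_per_node: int) -> int:
    """Total number of training processes (one per GPU under the stated assumption)."""
    return nodes * tasks_per_node


def global_batch_size(nodes: int, tasks_per_node: int, per_gpu_batch: int) -> int:
    """Number of image-text pairs processed per optimizer step across all GPUs."""
    return world_size(nodes, tasks_per_node) * per_gpu_batch


print(world_size(32, 4))              # 128
print(global_batch_size(32, 4, 256))  # 32768
```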
Resuming from a checkpoint:
```bash
python -m open_clip_train.main \
    --train-data="/path/to/train_data.csv" \
    --val-data="/path/to/validation_data.csv" \
    --resume /path/to/checkpoints/epoch_K.pt
```
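When several `epoch_K.pt` files have accumulated, the one with the largest K is usually the one to resume from. The sketch below picks it out of a list of file names; `latest_checkpoint` is a hypothetical helper, and it assumes the checkpoints follow the `epoch_K.pt` naming shown in the command above.

```python
import re


def latest_checkpoint(filenames):
    """Return the 'epoch_K.pt' name with the largest K, or None if there is none."""
    best_name, best_epoch = None, -1
    for name in filenames:
        match = re.fullmatch(r"epoch_(\d+)\.pt", name)
        if match and int(match.group(1)) > best_epoch:
            best_name, best_epoch = name, int(match.group(1))
    return best_name


names = ["epoch_1.pt", "epoch_10.pt", "epoch_2.pt", "notes.txt"]
print(latest_checkpoint(names))  # epoch_10.pt
```

In practice the list of names could come from `os.listdir` on the checkpoints directory. Comparing K as an integer matters here, since a plain string sort would place `epoch_2.pt` after `epoch_10.pt`.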
Citing
If you found this repository useful, please consider citing:
```bibtex
@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}
```
```bibtex
@inproceedings{cherti2023reproducible,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}
```
```bibtex
@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
```
```bibtex
@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}
```
Owner
- Name: Alejandro Lozano
- Login: Ale9806
- Kind: user
- Location: Stanford, California
- Company: Stanford University
- Repositories: 72
- Profile: https://github.com/Ale9806
STANFORD BMI Ph.D. Student | BMSIS YOUNG SCIENTIST PROGRAM | UCSD ABROAD PROGRAM 2020 | IPN-UPIBI BIOMEDICAL ENGINEER
Citation (CITATION.cff)
cff-version: 1.1.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Ilharco
    given-names: Gabriel
  - family-names: Wortsman
    given-names: Mitchell
  - family-names: Wightman
    given-names: Ross
  - family-names: Gordon
    given-names: Cade
  - family-names: Carlini
    given-names: Nicholas
  - family-names: Taori
    given-names: Rohan
  - family-names: Dave
    given-names: Achal
  - family-names: Shankar
    given-names: Vaishaal
  - family-names: Namkoong
    given-names: Hongseok
  - family-names: Miller
    given-names: John
  - family-names: Hajishirzi
    given-names: Hannaneh
  - family-names: Farhadi
    given-names: Ali
  - family-names: Schmidt
    given-names: Ludwig
title: OpenCLIP
version: v0.1
doi: 10.5281/zenodo.5143773
date-released: 2021-07-28
GitHub Events
Total
- Issues event: 2
- Watch event: 22
- Issue comment event: 1
- Push event: 9
- Public event: 1
Last Year
- Issues event: 2
- Watch event: 22
- Issue comment event: 1
- Push event: 9
- Public event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mayonezu216 (1)