open_clip_with_biomedica

Custom Open CLIP repo to train biomedical CLIP models

https://github.com/ale9806/open_clip_with_biomedica

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Ale9806
License: other
Language: Python
Default Branch: main
Size: 16.9 MB

Statistics

Stars: 9
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Citation

README.md

Arxiv: Arxiv | Website: Biomedica | Training instructions: CLIP | Models and Datasets Hugging Face

How to Train a CLIP-like Model with BIOMEDICA?

This is a (very minor) adaptation of the OpenCLIP repository to train CLIP-style models using the BIOMEDICA archive. All credit goes to the original developers of OpenCLIP. This repository also builds upon an original discussion on OpenCLIP's GitHub.

Introduction

OpenCLIP has gained recognition in both academic and industrial communities as an exceptional open-source framework for training CLIP-like models. However, the documentation can be lacking when it comes to fine-tuning these models for specific downstream tasks using custom datasets. For beginners, this can be overwhelming as they might not know where to begin. This guide outlines some key considerations and best practices for using OpenCLIP effectively.

Step 1: Create a Virtual Environment

To begin, we need to set up a virtual environment. Based on my own testing, Python 3.9 works well. You can create the environment using the following command:

```bash
# Create env
conda create --name train_clip python=3.9

# Activate env
conda activate train_clip
```

Step 2: Install the environment

Check your CUDA version before installing torch and the related packages. Installing the dependencies directly with the official command is likely to cause a series of errors from mismatched torch and CUDA versions, so install the versions that match your actual setup.

```bash
# Check CUDA version
nvidia-smi
```

This prints the driver and CUDA versions (using my local device as an example):

NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7

Then visit the official PyTorch website to find a compatible build. Installing with pip is recommended. For example, for CUDA 11.7 I used:

    pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117

Lastly, verify that the installation was successful:

```python
import torch

print(torch.cuda.is_available())  # should print True
```
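The relationship between the CUDA version in the `nvidia-smi` header and the wheel index used above can be sketched in Python. This is only an illustration, assuming the header format shown above. `parse_cuda_version` and `wheel_index_url` are hypothetical helper names, and the URL is simply built by following the `cu117` pattern, so a matching index may not exist for every CUDA version. Check the PyTorch website before relying on the result.

```python
import re


def parse_cuda_version(smi_header: str) -> str:
    """Extract the 'CUDA Version' field from an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return f"{match.group(1)}.{match.group(2)}"


def wheel_index_url(cuda_version: str) -> str:
    """Build a wheel index URL by following the cu117 pattern (an assumption)."""
    major, minor = cuda_version.split(".")
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


# Header line copied from the example output above.
header = "NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7"
cuda_version = parse_cuda_version(header)
print(cuda_version, wheel_index_url(cuda_version))
```

For the header above, this yields the same index URL as the pip command in this step.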

Step 3: Clone and install open_clip

```bash
# Clone repo
git clone https://github.com/mlfoundations/open_clip.git

# Enter the project root directory.
cd open_clip

# Install training dependencies
pip install -r requirements-training.txt

# Install webdataset
git clone https://github.com/minwoosun/webdataset.git
cd webdataset
git checkout hf-token
pip install -e .

# Install wandb
pip install wandb

# Set up tokens
huggingface-cli login
wandb login
```
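After the installs above, a quick check that the packages are visible to the active environment can save a failed training launch. The sketch below is a minimal example. `missing_modules` is a hypothetical helper, and the import names `torch`, `webdataset`, and `wandb` are assumed to match the packages installed in Steps 2 and 3.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the packages installed in Steps 2 and 3.
required = ["torch", "webdataset", "wandb"]
print("missing:", missing_modules(required))
```

An empty list means all three modules can be imported from the current environment.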

Step 4: Choose a suitable pre-trained model

OpenCLIP provides many pre-trained CLIP-family models for download and use. You can use the following command to view them.

The first column is the model's name, which is also the name used to select the matching text tokenizer. The second column indicates either the provider of the model or the training dataset (and its scale) used.

```python
import open_clip

open_clip.list_pretrained()
# [('RN50', 'openai'),
#  ('RN50', 'yfcc15m'),
#  ('RN50', 'cc12m'),
#  ...,
#  ('nllb-clip-large-siglip', 'v1')]
```
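Because the listing is long, it can help to group the available pre-trained tags by model name. The sketch below works on a small sample copied from the output above, so it runs without OpenCLIP installed. `group_by_model` is a hypothetical helper, and in practice the result of `open_clip.list_pretrained()` would be passed in instead of `sample`.

```python
from collections import defaultdict


def group_by_model(pairs):
    """Group (model_name, pretrained_tag) tuples by model name."""
    grouped = defaultdict(list)
    for model_name, tag in pairs:
        grouped[model_name].append(tag)
    return dict(grouped)


# Sample entries copied from the listing above; the real list is much longer.
sample = [
    ("RN50", "openai"),
    ("RN50", "yfcc15m"),
    ("RN50", "cc12m"),
    ("nllb-clip-large-siglip", "v1"),
]
print(group_by_model(sample)["RN50"])
```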

Step 5: Train a custom CLIP model using BIOMEDICA

To train CLIP-style models using a webdataset locally (e.g. biomedica_webdataset), first download the dataset locally. Then run the following commands:

5.A Training CLIP using webdataset without streaming

A SLURM-ready script is already provided at: train clip

```bash
# Enter the src folder of the open_clip repository
cd open_clip/src

# Create a bash file
vim train_openclip.sh

# Add the following:

# Specify which GPUs you want to use.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5

# Set the training args. Example:
torchrun --nproc_per_node 6 -m open_clip_train.main \
    --batch-size 500 \
    --precision amp \
    --workers 4 \
    --report-to tensorboard \
    --save-frequency 1 \
    --logs="/path/to/your/local/logs" \
    --dataset-type csv \
    --csv-separator="," \
    --train-data /path/to/your/local/training_dict.csv \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --warmup 1000 \
    --lr=5e-6 \
    --wd=0.1 \
    --epochs=32 \
    --model ViT-B-32 \
    --pretrained /path/to/your/local/model
```

For a more detailed explanation of the arguments, please refer to: https://github.com/mlfoundations/open_clip/blob/main/src/training/params.py
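The example above reads a comma-separated file whose image and caption columns are named `filepath` and `caption`. A minimal sketch of producing such a file with Python's `csv` module follows. The image paths and captions are made up for illustration, and `write_training_csv` is a hypothetical helper, not part of OpenCLIP.

```python
import csv
import tempfile
from pathlib import Path


def write_training_csv(pairs, csv_path):
    """Write (image path, caption) pairs under the header used by the example flags."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(["filepath", "caption"])
        writer.writerows(pairs)


# Hypothetical image paths and captions, for illustration only.
examples = [
    ("images/0001.jpg", "Axial CT slice of the chest"),
    ("images/0002.jpg", "Histology section, H&E stain"),
]

# Write to a temporary directory and read the file back to check the layout.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "training_dict.csv"
    write_training_csv(examples, out_path)
    with open(out_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(rows[0]["filepath"], "->", rows[0]["caption"])
```

Using `csv.writer` rather than manual string joins keeps captions that contain commas correctly quoted.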

Epochs

For larger datasets (e.g. LAION-2B), we recommend setting --train-num-samples to a lower value than the full epoch, for example --train-num-samples 135646078 for 1/16 of an epoch, in conjunction with --dataset-resampled to sample with replacement. This produces more frequent checkpoints, so the model can be evaluated more often.
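The arithmetic behind that value can be sketched as follows. The full-epoch size is not stated in the text, so here it is assumed to be exactly 16 times the example value. The function names are hypothetical, and the sketch also assumes that each training "epoch" then covers `--train-num-samples` samples.

```python
def samples_per_checkpoint(dataset_size: int, checkpoints_per_epoch: int) -> int:
    """Value to pass as --train-num-samples for the desired checkpoint frequency."""
    return dataset_size // checkpoints_per_epoch


def full_passes(epochs: int, checkpoints_per_epoch: int) -> float:
    """Approximate number of passes over the data after `epochs` shortened epochs."""
    return epochs / checkpoints_per_epoch


# Assumed dataset size: 16 times the example value from the text.
full_epoch = 16 * 135_646_078

print(samples_per_checkpoint(full_epoch, 16))  # 135646078
print(full_passes(32, 16))                     # 2.0
```

Under these assumptions, `--epochs 32` with 1/16-epoch chunks corresponds to roughly two passes over the data, with a checkpoint after every chunk when `--save-frequency 1` is set.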

Patch Dropout

Recent research has shown that one can drop out half to three-quarters of the visual tokens, giving up to 2-3x faster training without loss of accuracy.

You can set this on your visual transformer config with the key patch_dropout.

In the paper, they also finetuned without the patch dropout at the end. You can do this with the command-line argument --force-patch-dropout 0.
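To make the idea concrete, here is a minimal pure-Python sketch of dropping a random subset of patch tokens. It is an illustration of the technique, not OpenCLIP's implementation. `patch_dropout` is a hypothetical helper, and the 14x14 patch grid is an assumed example size.

```python
import random


def patch_dropout(tokens, drop_prob, rng=None):
    """Keep a random subset of patch tokens, preserving their original order."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()
    # Always keep at least one token.
    num_keep = max(1, int(len(tokens) * (1.0 - drop_prob)))
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx]


# Assumed example: a 14x14 grid of patch tokens, represented by their indices.
patches = list(range(196))
kept_half = patch_dropout(patches, drop_prob=0.5, rng=random.Random(0))
kept_quarter = patch_dropout(patches, drop_prob=0.75, rng=random.Random(0))
print(len(kept_half), len(kept_quarter))  # 98 49
```

The vision transformer then processes only the kept tokens, which is where the speed-up comes from. Setting the drop probability to 0 keeps every token, which corresponds to the final fine-tuning phase described above.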

Multiple data sources

OpenCLIP supports using multiple data sources, by separating different data paths with ::. For instance, to train on CC12M and on LAION, one might use --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar". Using --dataset-resampled is recommended for these cases.

By default, the expected number of times the model sees a sample from each source is proportional to the size of the source. For instance, when training on one data source of size 400M and one of size 10M, samples from the first source are 40x more likely to be seen in expectation.
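The `::` convention and the size-proportional sampling described above can be illustrated with a short sketch. The helper names and the `source_a`/`source_b` labels are hypothetical, and the code reproduces only the arithmetic, not OpenCLIP's data loader.

```python
def split_sources(train_data: str):
    """Split a --train-data value into its individual source patterns."""
    return train_data.split("::")


def expected_shares(source_sizes):
    """Expected fraction of seen samples per source, proportional to source size."""
    total = sum(source_sizes.values())
    return {name: size / total for name, size in source_sizes.items()}


train_data = (
    "/data/cc12m/cc12m-train-{0000..2175}.tar"
    "::/data/LAION-400M/{00000..41455}.tar"
)
sources = split_sources(train_data)
print(sources)

# Sizes from the example in the text: 400M and 10M samples.
shares = expected_shares({"source_a": 400_000_000, "source_b": 10_000_000})
ratio = shares["source_a"] / shares["source_b"]
print(round(ratio, 6))
```

The ratio of the two shares comes out at about 40, matching the 40x figure in the text.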

Single-Node

We use torchrun to launch distributed jobs. The following launches a job on a node with 4 GPUs:

    cd open_clip/src
    torchrun --nproc_per_node 4 -m open_clip_train.main \
        --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
        --train-num-samples 10968539 \
        --dataset-type webdataset \
        --batch-size 320 \
        --precision amp \
        --workers 4 \
        --imagenet-val /data/imagenet/validation/

Multi-Node

The same script above works, so long as users include information about the number of nodes and host node.

    cd open_clip/src
    torchrun --nproc_per_node=4 \
        --rdzv_endpoint=$HOST_NODE_ADDR \
        -m open_clip_train.main \
        --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
        --train-num-samples 10968539 \
        --dataset-type webdataset \
        --batch-size 320 \
        --precision amp \
        --workers 4 \
        --imagenet-val /data/imagenet/validation/

SLURM

This is likely the easiest solution to utilize. The following script was used to train our largest models:

```bash
#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition PARTITION_NAME

eval "$(/path/to/conda/bin/conda shell.bash hook)" # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data="/path_to_biomedica_tars/{00000..41455}.tar" \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=32 \
    --workers=8 \
    --model ViT-B-32 \
    --name "ViT-B-32-Vanilla" \
    --seed 0 \
    --local-loss \
    --gather-with-grad
```
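When moving between the launch configurations in this guide, it helps to keep track of the effective global batch size. The sketch below assumes that `--batch-size` is a per-GPU value, so the global batch is the product of the per-GPU batch, the GPUs per node, and the node count. `global_batch_size` is a hypothetical helper.

```python
def global_batch_size(per_gpu_batch: int, gpus_per_node: int, nodes: int = 1) -> int:
    """Effective batch size across all processes, assuming a per-GPU --batch-size."""
    return per_gpu_batch * gpus_per_node * nodes


print(global_batch_size(500, 6))      # Step 5.A example: 3000
print(global_batch_size(320, 4))      # single-node example: 1280
print(global_batch_size(256, 4, 32))  # SLURM example: 32768
```

Since the contrastive loss depends on how many pairs are compared at once, learning-rate and warmup settings tuned for one of these setups may need adjusting for another.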

Resuming from a checkpoint:

    python -m open_clip_train.main \
        --train-data="/path/to/train_data.csv" \
        --val-data="/path/to/validation_data.csv" \
        --resume /path/to/checkpoints/epoch_K.pt
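To fill in `epoch_K.pt` after an interrupted run, one option is to pick the checkpoint with the largest K in the checkpoints folder. The sketch below assumes the `epoch_K.pt` naming shown above. `latest_checkpoint` is a hypothetical helper, and the temporary directory with empty files only stands in for a real checkpoints folder.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the epoch_K.pt path with the largest K, or None if there is none."""
    pattern = re.compile(r"epoch_(\d+)\.pt$")
    best = None
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.match(path.name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return None if best is None else best[1]


# Demonstrate on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for k in (1, 2, 10):
        (Path(tmp) / f"epoch_{k}.pt").touch()
    newest_name = latest_checkpoint(tmp).name

print(newest_name)  # epoch_10.pt
```

Comparing K as an integer matters here: a plain string sort would place `epoch_2.pt` after `epoch_10.pt`.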

Citing

If you found this repository useful, please consider citing:

    @software{ilharco_gabriel_2021_5143773,
      author    = {Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
      title     = {OpenCLIP},
      month     = jul,
      year      = 2021,
      note      = {If you use this software, please cite it as below.},
      publisher = {Zenodo},
      version   = {0.1},
      doi       = {10.5281/zenodo.5143773},
      url       = {https://doi.org/10.5281/zenodo.5143773}
    }

    @inproceedings{cherti2023reproducible,
      title     = {Reproducible scaling laws for contrastive language-image learning},
      author    = {Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages     = {2818--2829},
      year      = {2023}
    }

    @inproceedings{Radford2021LearningTV,
      title     = {Learning Transferable Visual Models From Natural Language Supervision},
      author    = {Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      booktitle = {ICML},
      year      = {2021}
    }

    @inproceedings{schuhmann2022laionb,
      title     = {{LAION}-5B: An open large-scale dataset for training next generation image-text models},
      author    = {Christoph Schuhmann and Romain Beaumont and Richard Vencu and Cade W Gordon and Ross Wightman and Mehdi Cherti and Theo Coombes and Aarush Katta and Clayton Mullis and Mitchell Wortsman and Patrick Schramowski and Srivatsa R Kundurthy and Katherine Crowson and Ludwig Schmidt and Robert Kaczmarczyk and Jenia Jitsev},
      booktitle = {Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
      year      = {2022},
      url       = {https://openreview.net/forum?id=M3Y74vmsMcY}
    }

Owner

Name: Alejandro Lozano
Login: Ale9806
Kind: user
Location: Stanford, California
Company: Stanford University

Repositories: 72
Profile: https://github.com/Ale9806

STANFORD BMI Ph.D.Student | BMSIS YOUNG SCIENTIST PROGRAM | UCSD ABROAD PROGAM 2020 | IPN-UPIBI BIOMEDICAL ENGINEER

Citation (CITATION.cff)

cff-version: 1.1.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Ilharco
    given-names: Gabriel
  - family-names: Wortsman
    given-names: Mitchell
  - family-names: Wightman
    given-names: Ross
  - family-names: Gordon
    given-names: Cade   
  - family-names: Carlini
    given-names: Nicholas
  - family-names: Taori
    given-names: Rohan
  - family-names: Dave
    given-names: Achal
  - family-names: Shankar
    given-names: Vaishaal
  - family-names: Namkoong
    given-names: Hongseok
  - family-names: Miller
    given-names: John
  - family-names: Hajishirzi
    given-names: Hannaneh
  - family-names: Farhadi
    given-names: Ali
  - family-names: Schmidt
    given-names: Ludwig
title: OpenCLIP
version: v0.1
doi: 10.5281/zenodo.5143773
date-released: 2021-07-28

GitHub Events

Total

Issues event: 2
Watch event: 22
Issue comment event: 1
Push event: 9
Public event: 1

Last Year

Issues event: 2
Watch event: 22
Issue comment event: 1
Push event: 9
Public event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

open_clip_with_biomedica

Science Score: 67.0%
