open_clip_with_biomedica

Custom Open CLIP repo to train biomedical CLIP models

https://github.com/ale9806/open_clip_with_biomedica

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: Ale9806
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 16.9 MB
Statistics
  • Stars: 9
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md


Arxiv: Arxiv     |     Website: Biomedica     |     Training instructions: CLIP     |     Models and Datasets Hugging Face

How to Train a CLIP-like Model with BIOMEDICA?

This is a (very minor) adaptation of the OpenCLIP repository to train CLIP-style models using the Biomedical archive. All credit goes to the original developers of OpenCLIP. This repository also builds upon an original discussion on OpenCLIP's GitHub.

Introduction

OpenCLIP has gained recognition in both academic and industrial communities as an exceptional open-source framework for training CLIP-like models. However, the documentation can be lacking when it comes to fine-tuning these models for specific downstream tasks using custom datasets. For beginners, this can be overwhelming as they might not know where to begin. This guide outlines some key considerations and best practices for using OpenCLIP effectively.

Step 1: Create a Virtual Environment

To begin, we need to set up a virtual environment. Based on my own testing, Python 3.9 works well. You can create the environment using the following command:

```bash
# Create env
conda create --name train_clip python=3.9

# Activate env
conda activate train_clip
```

Step 2: Install the environment

Check your CUDA version before installing torch and the corresponding packages. If you install the dependencies directly with the official command, you will likely encounter a series of errors caused by mismatched torch and CUDA versions, so install the build that matches your system.

```bash
# Check CUDA version
nvidia-smi
```

The header of the output shows the driver and CUDA versions (using my local device as an example):

NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7

Then visit the official PyTorch website to get a compatible distribution. Installing with pip is recommended. For example, for CUDA 11.7 I used:

```bash
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
```

Lastly, verify that the installation was successful:

```python
import torch

print(torch.cuda.is_available())  # should print True
```
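The step from the reported CUDA version to a wheel index can be sketched in Python. This is a minimal sketch, not an official tool: the helper names `cuda_version_from_header` and `wheel_index_url` are hypothetical, and the URL pattern is only inferred from the single `cu117` example above, so a given CUDA version may not have a matching index. Note also that `nvidia-smi` reports the highest CUDA version the driver supports, so a wheel built for an older CUDA release can work as well.

```python
import re


def cuda_version_from_header(header: str) -> str:
    """Extract the 'CUDA Version: X.Y' field from an nvidia-smi header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return f"{match.group(1)}.{match.group(2)}"


def wheel_index_url(cuda_version: str) -> str:
    """Build an index URL by analogy with the cu117 example (assumed pattern)."""
    major, minor = cuda_version.split(".")
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


# Header line copied from the example output above.
header = "NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7"
version = cuda_version_from_header(header)
print(version)                   # 11.7
print(wheel_index_url(version))  # https://download.pytorch.org/whl/cu117
```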

Step 3: Clone and install open_clip

```bash
# Clone repo
git clone https://github.com/mlfoundations/open_clip.git

# Enter the project root directory
cd open_clip

# Install training dependencies
pip install -r requirements-training.txt

# Install webdataset
git clone https://github.com/minwoosun/webdataset.git
cd webdataset
git checkout hf-token
pip install -e .

# Install wandb
pip install wandb

# Set up tokens
huggingface-cli login
wandb login
```
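After the installs above, a quick check that the packages are importable can save a failed training launch. The following is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names in `required` are assumed from the install steps above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the install steps above.
required = ["torch", "webdataset", "wandb"]
missing = missing_modules(required)
print("missing:", missing if missing else "none")
```

An empty result only shows that the modules can be located; it does not confirm that their versions are compatible with each other.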


Step 4: Choose a suitable pre-trained model

OpenCLIP officially provides many pre-trained CLIP-series models for download and use. You can use the following command to list them.

The first element of each pair is the model's name, which is also the name used to select the text tokenizer for that model. The second element indicates either the provider of the model or the training dataset used.

```python
import open_clip

open_clip.list_pretrained()
# [('RN50', 'openai'),
#  ('RN50', 'yfcc15m'),
#  ('RN50', 'cc12m'),
#  ...,
#  ('nllb-clip-large-siglip', 'v1')]
```
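Because the same model name appears once per available checkpoint, it can help to group the tags by model before choosing. The sketch below works on a short sample copied from the output above so that it runs without OpenCLIP installed; `tags_by_model` is a hypothetical helper, and with the library installed the full list from `open_clip.list_pretrained()` could be passed in instead.

```python
from collections import defaultdict

# Sample pairs copied from the output above (the elided middle is left out).
pairs = [
    ("RN50", "openai"),
    ("RN50", "yfcc15m"),
    ("RN50", "cc12m"),
    ("nllb-clip-large-siglip", "v1"),
]


def tags_by_model(pairs):
    """Group pretrained tags under their model name, preserving input order."""
    grouped = defaultdict(list)
    for model_name, tag in pairs:
        grouped[model_name].append(tag)
    return dict(grouped)


grouped = tags_by_model(pairs)
print(grouped["RN50"])  # ['openai', 'yfcc15m', 'cc12m']
```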

Step 5: Train a Custom CLIP using Biomedica

To train CLIP-style models on a local webdataset (e.g. biomedica_webdataset), first download the dataset. Then run the following commands:

5.A Training CLIP using webdataset without streaming

A SLURM-ready script is already provided at: train clip

```bash
# Enter the src folder of the open_clip repository
cd open_clip/src

# Create a bash file
vim train_openclip.sh

# Add the following:

# Specify which GPUs you want to use
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5

# Set the training args. Example (older open_clip checkouts name the
# module training.main instead of open_clip_train.main):
torchrun --nproc_per_node 6 -m open_clip_train.main \
    --batch-size 500 \
    --precision amp \
    --workers 4 \
    --report-to tensorboard \
    --save-frequency 1 \
    --logs="/path/to/your/local/logs" \
    --dataset-type csv \
    --csv-separator="," \
    --train-data /path/to/your/local/training_dict.csv \
    --csv-img-key filepath \
    --csv-caption-key caption \
    --warmup 1000 \
    --lr=5e-6 \
    --wd=0.1 \
    --epochs=32 \
    --model ViT-B-32 \
    --pretrained /path/to/your/local/model
```

For a more detailed explanation of the arguments, please refer to: https://github.com/mlfoundations/open_clip/blob/main/src/training/params.py
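The CSV example above expects a file whose header contains the columns named by `--csv-img-key` and `--csv-caption-key`. A minimal sketch of writing such a file with the standard library follows; the helper name `write_training_csv`, the image paths, and the captions are all hypothetical placeholders.

```python
import csv
from pathlib import Path


def write_training_csv(rows, out_path):
    """Write (image path, caption) pairs under the header the flags above expect."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(["filepath", "caption"])
        writer.writerows(rows)


# Hypothetical image paths and captions, for illustration only.
rows = [
    ("/path/to/images/0001.jpg", "Chest X-ray, frontal view"),
    ("/path/to/images/0002.jpg", "H&E stained tissue section, 20x"),
]
write_training_csv(rows, "training_dict.csv")
print(Path("training_dict.csv").read_text(encoding="utf-8"))
```

Captions that contain the separator are quoted automatically by `csv.writer`, which keeps each row at two fields.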

Epochs

For larger datasets (e.g., LAION-2B), we recommend setting --train-num-samples to a lower value than the full epoch, for example --train-num-samples 135646078 for 1/16 of an epoch, in conjunction with --dataset-resampled to sample with replacement. This produces more frequent checkpoints, so the model can be evaluated more often.
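The arithmetic behind that example can be checked with a few lines of Python. This is a sketch under one assumption: the full-pass size is not stated in the text, so it is derived here as 16 times the quoted value. The helper name is hypothetical.

```python
def samples_per_virtual_epoch(dataset_size: int, checkpoints_per_pass: int) -> int:
    """Number of samples to pass as --train-num-samples for one 'virtual' epoch."""
    return dataset_size // checkpoints_per_pass


# Full-pass size implied by the example (16 x 135646078), not quoted in the text.
full_pass = 135_646_078 * 16
print(full_pass)                                 # 2170337248
print(samples_per_virtual_epoch(full_pass, 16))  # 135646078

# With these settings, 32 virtual epochs cover two full passes in expectation.
print(32 * samples_per_virtual_epoch(full_pass, 16) / full_pass)  # 2.0
```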

Patch Dropout

Recent research has shown that one can drop out half to three-quarters of the visual tokens, yielding up to 2-3x faster training without loss of accuracy.

You can set this on your visual transformer config with the key patch_dropout.

In the paper, they also finetuned without the patch dropout at the end. You can do this with the command-line argument --force-patch-dropout 0.
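To make the idea concrete, here is a minimal pure-Python sketch of patch dropout on a single token sequence. It illustrates the technique only and is not OpenCLIP's implementation: the function name, the list-based tokens, and the choice to always keep the first (class) token are assumptions of this sketch.

```python
import random


def patch_dropout(tokens, prob, rng=random):
    """Keep a random subset of patch tokens; the first (class) token always stays.

    tokens: list whose first element stands in for the class token.
    prob:   fraction of patch tokens to drop, with 0 <= prob < 1.
    """
    if not 0.0 <= prob < 1.0:
        raise ValueError("prob must be in [0, 1)")
    cls_token, patches = tokens[0], tokens[1:]
    if prob == 0.0 or not patches:
        return list(tokens)
    num_keep = max(1, int(len(patches) * (1.0 - prob)))
    keep_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [cls_token] + [patches[i] for i in keep_idx]


# A 14 x 14 grid of patches plus a class token, purely illustrative.
tokens = ["cls"] + [f"patch{i}" for i in range(196)]
kept = patch_dropout(tokens, prob=0.5, rng=random.Random(0))
print(len(tokens), "->", len(kept))  # 197 -> 99
```

Setting `prob` to 0 returns the sequence unchanged, which mirrors the effect of disabling patch dropout for the final fine-tuning phase.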

Multiple data sources

OpenCLIP supports using multiple data sources, by separating different data paths with ::. For instance, to train on CC12M and on LAION, one might use --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar". Using --dataset-resampled is recommended for these cases.

By default, the expected number of times the model sees a sample from each source is proportional to the size of that source. For instance, when training on one data source with 400M samples and one with 10M samples, samples from the first source are 40x more likely to be seen in expectation.
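The `::` convention and the size-proportional sampling described above can be illustrated with a short sketch. The helper names and the source labels `source_a` and `source_b` are hypothetical; the code only reproduces the arithmetic and does not call OpenCLIP.

```python
def split_sources(train_data: str):
    """Split a --train-data value into its individual source patterns."""
    return train_data.split("::")


def expected_share(sizes):
    """Fraction of sampled examples expected from each source, proportional to size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


train_data = "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar"
print(split_sources(train_data))

shares = expected_share({"source_a": 400_000_000, "source_b": 10_000_000})
ratio = shares["source_a"] / shares["source_b"]
print(round(ratio))  # 40
```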

Single-Node

We make use of torchrun to launch distributed jobs. The following launches a job on a node with 4 GPUs:

    cd open_clip/src
    torchrun --nproc_per_node 4 -m open_clip_train.main \
        --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
        --train-num-samples 10968539 \
        --dataset-type webdataset \
        --batch-size 320 \
        --precision amp \
        --workers 4 \
        --imagenet-val /data/imagenet/validation/

Multi-Node

The same script above works, so long as users include information about the number of nodes and host node.

    cd open_clip/src
    torchrun --nproc_per_node=4 \
        --rdzv_endpoint=$HOST_NODE_ADDR \
        -m open_clip_train.main \
        --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
        --train-num-samples 10968539 \
        --dataset-type webdataset \
        --batch-size 320 \
        --precision amp \
        --workers 4 \
        --imagenet-val /data/imagenet/validation/

SLURM

This is likely the easiest solution to utilize. The following script was used to train our largest models:

```bash
#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition PARTITION_NAME

eval "$(/path/to/conda/bin/conda shell.bash hook)" # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data="/path_to_biomedica_tars/{00000..41455}.tar" \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=32 \
    --workers=8 \
    --model ViT-B-32 \
    --name "ViT-B-32-Vanilla" \
    --seed 0 \
    --local-loss \
    --gather-with-grad
```
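When scaling from one node to many, it helps to keep track of the effective global batch size. The sketch below assumes, as I understand OpenCLIP's argument help, that `--batch-size` is the per-GPU batch size; the helper name is hypothetical, and the numbers are taken from the SLURM script above and the 6-GPU example in Step 5.

```python
def global_batch_size(nodes: int, gpus_per_node: int, per_gpu_batch: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return nodes * gpus_per_node * per_gpu_batch


# SLURM script above: 32 nodes x 4 GPUs, --batch-size=256.
print(global_batch_size(32, 4, 256))  # 32768

# Step 5 example: 1 node x 6 GPUs, --batch-size 500.
print(global_batch_size(1, 6, 500))   # 3000
```

Hyperparameters such as the learning rate and warmup are usually tuned with respect to this global value, so it is worth recomputing whenever the node or GPU count changes.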

Resuming from a checkpoint:

    python -m open_clip_train.main \
        --train-data="/path/to/train_data.csv" \
        --val-data="/path/to/validation_data.csv" \
        --resume /path/to/checkpoints/epoch_K.pt
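To fill in `epoch_K.pt` with the most recent checkpoint, one option is a small helper that scans the checkpoint directory. This is a sketch that assumes checkpoints follow the `epoch_K.pt` naming shown above; `latest_checkpoint` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the epoch_K.pt path with the largest K, or None if there is none."""
    pattern = re.compile(r"epoch_(\d+)\.pt$")
    best = None
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.match(path.name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return None if best is None else best[1]


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for k in (1, 2, 10):
        (Path(tmp) / f"epoch_{k}.pt").touch()
    print(latest_checkpoint(tmp).name)  # epoch_10.pt
```

Comparing the epoch numbers as integers avoids the lexicographic trap where `epoch_2.pt` would sort after `epoch_10.pt`.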

Citing

If you found this repository useful, please consider citing:

    @software{ilharco_gabriel_2021_5143773,
      author    = {Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
      title     = {OpenCLIP},
      month     = jul,
      year      = 2021,
      note      = {If you use this software, please cite it as below.},
      publisher = {Zenodo},
      version   = {0.1},
      doi       = {10.5281/zenodo.5143773},
      url       = {https://doi.org/10.5281/zenodo.5143773}
    }

    @inproceedings{cherti2023reproducible,
      title     = {Reproducible scaling laws for contrastive language-image learning},
      author    = {Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages     = {2818--2829},
      year      = {2023}
    }

    @inproceedings{Radford2021LearningTV,
      title     = {Learning Transferable Visual Models From Natural Language Supervision},
      author    = {Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      booktitle = {ICML},
      year      = {2021}
    }

    @inproceedings{schuhmann2022laionb,
      title     = {{LAION}-5B: An open large-scale dataset for training next generation image-text models},
      author    = {Christoph Schuhmann and Romain Beaumont and Richard Vencu and Cade W Gordon and Ross Wightman and Mehdi Cherti and Theo Coombes and Aarush Katta and Clayton Mullis and Mitchell Wortsman and Patrick Schramowski and Srivatsa R Kundurthy and Katherine Crowson and Ludwig Schmidt and Robert Kaczmarczyk and Jenia Jitsev},
      booktitle = {Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
      year      = {2022},
      url       = {https://openreview.net/forum?id=M3Y74vmsMcY}
    }


Owner

  • Name: Alejandro Lozano
  • Login: Ale9806
  • Kind: user
  • Location: Stanford, California
  • Company: Stanford University

STANFORD BMI Ph.D. Student | BMSIS YOUNG SCIENTIST PROGRAM | UCSD ABROAD PROGRAM 2020 | IPN-UPIBI BIOMEDICAL ENGINEER

Citation (CITATION.cff)

cff-version: 1.1.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Ilharco
    given-names: Gabriel
  - family-names: Wortsman
    given-names: Mitchell
  - family-names: Wightman
    given-names: Ross
  - family-names: Gordon
    given-names: Cade   
  - family-names: Carlini
    given-names: Nicholas
  - family-names: Taori
    given-names: Rohan
  - family-names: Dave
    given-names: Achal
  - family-names: Shankar
    given-names: Vaishaal
  - family-names: Namkoong
    given-names: Hongseok
  - family-names: Miller
    given-names: John
  - family-names: Hajishirzi
    given-names: Hannaneh
  - family-names: Farhadi
    given-names: Ali
  - family-names: Schmidt
    given-names: Ludwig
title: OpenCLIP
version: v0.1
doi: 10.5281/zenodo.5143773
date-released: 2021-07-28

GitHub Events

Total
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 1
  • Push event: 9
  • Public event: 1
Last Year
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 1
  • Push event: 9
  • Public event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mayonezu216 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels