clip-retrieval
Repository
Basic Info
- Host: GitHub
- Owner: sbrood
- License: other
- Language: Jupyter Notebook
- Default Branch: main
- Size: 5.95 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
CLIP for image-text retrieval
End of studies internship project
This repository is a fork of the mlfoundations implementation for training and using the CLIP model. Please check their README as well as this one!
It aims to provide a way to use and fine-tune the CLIP model.
🪛 Installation
```bash
docker build . -t open-clip
```
```bash
docker run --rm -ti -v ${PWD}:/home/open-clip -v /my_dataset:/home/open-clip/my_dataset open-clip:latest
```
- Add `--gpus '"device=0,1,2,3"'` to use GPUs.
- Add `-v /dev/shm:/dev/shm` if you are training on multiple GPUs (this enables access to shared memory).
⚠ This project uses Python 3.7.
🍄 Usage
Metrics
Each of these metrics is available from image to text and from text to image.
- Median rank
- Mean rank
- R@X accuracy: the fraction of queries for which the ground truth appears in the top-X ranked answers
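As a rough illustration of how these metrics relate to each other, here is a minimal sketch that computes them from a similarity matrix. It assumes a one-to-one setting where the correct answer for query `i` is candidate `i`; the function names and the toy values are illustrative and are not taken from this repository's evaluation script.

```python
from statistics import mean, median


def retrieval_metrics(similarity, tops=(1, 5, 10)):
    """Compute mean rank, median rank and R@K from a similarity matrix.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly higher than the ground truth.
        ranks.append(1 + sum(score > row[i] for score in row))
    metrics = {"mean_rank": mean(ranks), "median_rank": median(ranks)}
    for k in tops:
        # R@K: fraction of queries whose ground truth is ranked in the top K.
        metrics[f"R@{k}"] = sum(r <= k for r in ranks) / len(ranks)
    return metrics


def transpose(matrix):
    """Swap queries and candidates (image-to-text becomes text-to-image)."""
    return [list(column) for column in zip(*matrix)]


# Toy image-by-text similarity matrix (made-up values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
image_to_text = retrieval_metrics(sim, tops=(1, 2))
text_to_image = retrieval_metrics(transpose(sim), tops=(1, 2))
print(image_to_text)
print(text_to_image)
```

Transposing the matrix is what turns the image-to-text direction into the text-to-image one, so the same function serves both.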
Available models
The OpenCLIP project aims to reproduce the metrics reported in OpenAI's paper. You can choose between OpenAI's pre-trained weights and OpenCLIP's pre-trained weights, for each available architecture and pre-training dataset.
Use the following commands to list the available models and their weights, and to load a model.
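```python
import open_clip

# List the available architectures and their pre-trained weights
open_clip.list_pretrained()

# Load a model together with its training and evaluation transforms
model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_e16')
```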
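The list of available weights can be long, so it may help to group it by architecture. The sketch below assumes that `open_clip.list_pretrained()` returns `(architecture, pretrained_tag)` pairs, as it does in upstream open_clip; the helper name `group_by_architecture` is hypothetical, and the sample pairs are a small illustrative subset so that the snippet runs without the library installed.

```python
from collections import defaultdict


def group_by_architecture(pairs):
    """Map each architecture name to the list of pre-trained tags available for it."""
    grouped = defaultdict(list)
    for architecture, tag in pairs:
        grouped[architecture].append(tag)
    return dict(grouped)


# Illustrative subset; in practice, pass open_clip.list_pretrained() instead.
sample_pairs = [
    ("RN50", "openai"),
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion2b_e16"),
]
weights = group_by_architecture(sample_pairs)
print(weights["ViT-B-32"])
```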
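To load other pre-trained image weights, use `pretrained_image`:
```python
import open_clip

# A hyphen is not valid in a Python keyword argument, so the name uses an underscore.
# Note: in upstream open_clip, `pretrained_image` is a boolean flag and a local
# checkpoint path is passed through `pretrained`; check the signature in this fork.
model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained_image='my_checkpoint_path')
```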
👉 Basics
Simple inference
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='laion400m_e32')

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
```
Evaluation
Use the cliptestacc.py script:
1. Compute simple metrics on the whole dataset (may not fit in GPU memory!)
```bash
python -m retrieval.main \
    --ground_truth_csv_path GROUND_TRUTH_CSV_PATH \
    --csv_img_key CSV_IMG_KEY \
    --csv_caption_key CSV_CAPTION_KEY \
    --input_dir INPUT_DIR \
    --csv_separator CSV_SEPARATOR \
    [--network NETWORK] \
    [--checkpoint CHECKPOINT] \
    [--workers WORKERS] \
    [--device DEVICE] \
    [--pretrained PRETRAINED] \
    [--log_rate LOG_RATE] \
    [--tops TOPS]
```
where
- `ground_truth_csv_path` is the CSV file that stores each image filename, its label and its shooting id
- `csv_img_key` is the name of the column containing the filenames
- `csv_caption_key` is the name of the column containing the labels
- `input_dir` is the folder where the images are stored
- `csv_separator` is the separator character of your CSV file
- `network` is the name of the network; see Available models for details
- `checkpoint` is the filename of the checkpoint
- `workers` is the number of workers
- `device` is `cpu` or `cuda`; the default is `cpu`
- `pretrained` is the source of the pre-trained model; see Available models for details
- `log_rate` is the rate at which the metrics are printed; the default is 10
- `tops` is the list of top-X accuracies to compute, separated by spaces (e.g. `1 2 4 9`); the default is `1 2 3 5 10`
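To make the role of the CSV-related parameters concrete, here is a minimal sketch of how a ground-truth file could be read into `(image_path, caption)` pairs. It is not the repository's loader: the function name `load_ground_truth`, the column names and the file content are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
import os


def load_ground_truth(csv_file, img_key, caption_key, input_dir, separator=","):
    """Return (image_path, caption) pairs read from a ground-truth CSV."""
    reader = csv.DictReader(csv_file, delimiter=separator)
    return [
        (os.path.join(input_dir, row[img_key]), row[caption_key])
        for row in reader
    ]


# Hypothetical file content: column names and values are illustrative only.
example_csv = io.StringIO(
    "filename;label;shooting_id\n"
    "img_001.jpg;a red apple;s1\n"
    "img_002.jpg;a green pear;s1\n"
)
pairs = load_ground_truth(example_csv, "filename", "label", "images", separator=";")
for path, caption in pairs:
    print(path, "->", caption)
```

Here `img_key`, `caption_key`, `input_dir` and `separator` play the roles of `csv_img_key`, `csv_caption_key`, `input_dir` and `csv_separator` in the command above.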
2. Compute average metrics per shooting: add `--per_shooting --csv_shooting_key CSV_SHOOTING_KEY` to the retrieval command.
3. Evaluate from the training main.
From a local checkpoint:
```bash
python -m training.main \
    --val-data="/path/to/validation_data.csv" \
    --model RN101 \
    --pretrained /path/to/checkpoints/epoch_K.pt
```
From a hosted pre-trained checkpoint:
```bash
python -m training.main \
    --imagenet-val /path/to/imagenet/validation \
    --model ViT-B-32-quickgelu \
    --pretrained laion400m_e32
```
👉 Training
All the parameters can be found in training/params.py
Single GPU (example)
```bash
python -m training.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="path to train data csv" \
    --val-data="path to validation csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50
```
Multi GPUs (example)
```bash
torchrun --nproc_per_node 4 -m training.main \
    --train-data="path to train data csv" \
    --val-data="path to validation csv" \
    --csv-img-key "new filename" \
    --csv-caption-key "food label" \
    --csv-separator ',' \
    --batch-size 128 \
    --precision amp \
    --workers 4 \
    --model ViT-B-32 \
    --epochs=40 \
    --save-frequency 15 \
    --pretrained 'openai' \
    --warmup 100 \
    --lr 5.0e-5 \
    --val-frequency 2
```
🔒 LiT
LiT (Locked-image Tuning) consists of locking the image tower while leaving the text tower unlocked. OpenCLIP offers parameters to fine-tune CLIP with this technique.
Use the following parameters:
- --lock-image to lock full image tower by disabling gradients.
- --lock-image-unlocked-groups n to leave last n image tower layer groups unlocked.
- --lock-image-freeze-bn-stats to freeze BatchNorm running stats in image tower for any locked layers
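Conceptually, these flags disable gradients on the image tower so that only the text tower is updated. The following is a minimal PyTorch sketch of that idea, not the OpenCLIP implementation: `lock_tower` is a hypothetical helper, and the two `nn.Sequential` towers are toy stand-ins for the real CLIP architecture.

```python
import torch.nn as nn


def lock_tower(tower: nn.Module, freeze_bn_stats: bool = False) -> None:
    """Disable gradients for every parameter of `tower`.

    If freeze_bn_stats is True, BatchNorm layers are also switched to eval
    mode so that their running statistics stop being updated.
    """
    for param in tower.parameters():
        param.requires_grad = False
    if freeze_bn_stats:
        for module in tower.modules():
            if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                module.eval()


# Toy stand-ins for the two towers (hypothetical, not the real CLIP layers).
image_tower = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
text_tower = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 8))

lock_tower(image_tower, freeze_bn_stats=True)

# Only parameters that still require gradients would be given to the optimizer.
all_params = list(image_tower.parameters()) + list(text_tower.parameters())
trainable = [p for p in all_params if p.requires_grad]
print(len(trainable), "trainable tensors out of", len(all_params))
```

One caveat of this sketch: calling `.train()` on the tower later would put the BatchNorm layers back in training mode, so a real implementation needs a more robust way to keep the statistics frozen.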
Weights & Biases
- Log in to Weights & Biases with `wandb login`
- Add `--report-to 'wandb'` to the script parameters
- Open your WandB dashboard, and you're set!
🌶 Dataset tools
Some scripts are available in src/data for dataset management.
gather_cc.py is an OpenCLIP tool for downloading the Conceptual Captions dataset.
🔗 Resources
Articles
- CLIP: article, original code
- LiT (Zero-Shot Transfer with Locked-image text Tuning): article, code
Repositories
- OpenAI CLIP
- OpenCLIP from mlfoundations
Owner
- Name: Sarah Brood
- Login: sbrood
- Kind: user
- Location: Paris
- Repositories: 1
- Profile: https://github.com/sbrood
PhD Student at ENS Paris and LSCE working on deep learning models to retrieve forest properties from remote sensing data. 🌳
Citation (CITATION.cff)
cff-version: 1.1.0
message: If you use this software, please cite it as below.
authors:
- family-names: Ilharco
given-names: Gabriel
- family-names: Wortsman
given-names: Mitchell
- family-names: Wightman
given-names: Ross
- family-names: Gordon
given-names: Cade
- family-names: Carlini
given-names: Nicholas
- family-names: Taori
given-names: Rohan
- family-names: Dave
given-names: Achal
- family-names: Shankar
given-names: Vaishaal
- family-names: Namkoong
given-names: Hongseok
- family-names: Miller
given-names: John
- family-names: Hajishirzi
given-names: Hannaneh
- family-names: Farhadi
given-names: Ali
- family-names: Schmidt
given-names: Ludwig
title: OpenCLIP
version: v0.1
doi: 10.5281/zenodo.5143773
date-released: 2021-07-28
Dependencies
- nvidia/cuda 11.3.1-cudnn8-runtime-ubuntu20.04 build
- pytest ==7.0.1 test
- pytest-xdist ==2.5.0 test
- braceexpand *
- ftfy *
- pandas *
- regex *
- torch >=1.9.0
- torchvision *
- tqdm *
- webdataset >=0.2.5
- ftfy *
- regex *
- torch >=1.9.0
- torchvision *
- tqdm *
- braceexpand *
- ftfy *
- pandas *
- regex *
- setproctitle *
- torch *
- torchvision *
- tqdm *
- wandb *
- webdataset *