Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary

Keywords

vision-language-model
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: MingliangLiang3
  • License: other
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 12.6 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
vision-language-model
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

SW-CLIP: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

ACM MM 2023 Workshop Paper Code: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model.

Comparison of SW-CLIP and CLIP on zero-shot ImageNet1K classification. The image encoder backbone is RN50, and both models are pre-trained on CC3M for 30 epochs.

| method  | text mask         | text len | time  | ImageNet1K |
|---------|-------------------|----------|-------|------------|
| CLIP    | original, 100.00% | 32       | 1.00x | 16.9%      |
| SW-CLIP | SW, 42.30%        | 16       | 0.86x | 17.2%      |
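Zero-shot classification here means scoring an image embedding against text embeddings of class prompts (e.g. "a photo of a {label}") and predicting the closest class. A minimal cosine-similarity sketch with toy embeddings (the repo uses open_clip's zero-shot evaluation; the helper names below are illustrative, not from the codebase):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose prompt embedding is most
    similar to the image embedding."""
    return max(range(len(class_text_embs)),
               key=lambda i: cosine(image_emb, class_text_embs[i]))

# Toy 3-d embeddings for two classes; the image is closest to class 1.
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1]]
image_emb = [0.1, 0.9, 0.2]
pred = zero_shot_predict(image_emb, class_embs)
```

Accuracy in the table is the fraction of ImageNet1K validation images for which this argmax matches the ground-truth label.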

Data

We use the CC3M dataset for training. Generate the subsampling file with python3 src/data/subsampling.py.
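The subsampling step drops frequent words from captions before tokenization. A minimal sketch of word2vec-style frequent-word subsampling (the function name, threshold t, and exact keep-probability formula are assumptions for illustration, not the contents of src/data/subsampling.py):

```python
import math
import random
from collections import Counter

def subsample_caption(tokens, counts, total, t=1e-4, rng=None):
    """Drop frequent words from a caption: keep word w with probability
    min(1, sqrt(t / f(w))), where f(w) = counts[w] / total is the word's
    corpus frequency. Rare words are always kept; very frequent words
    are usually dropped."""
    rng = rng or random.Random(42)
    kept = []
    for w in tokens:
        f = counts[w] / total
        keep_prob = min(1.0, math.sqrt(t / f)) if f > 0 else 1.0
        if rng.random() < keep_prob:
            kept.append(w)
    return kept

# Toy corpus statistics: "the" is very frequent, "cat" is rare.
counts = Counter({"the": 500_000, "cat": 50})
total = 1_000_000
caption = ["the"] * 1000 + ["cat"] * 10
kept = subsample_caption(caption, counts, total)
```

With f("the") = 0.5 the keep probability is sqrt(1e-4 / 0.5) ≈ 1.4%, so nearly all occurrences of "the" are dropped while every "cat" survives; shrinking captions this way is what reduces the effective text length during pre-training.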

Train the model

Train our model on SLURM: sbatch clip_run_experiment_cluster_das_train.sh.

```
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
  --save-frequency=1 \
  --report-to=tensorboard \
  --train-data="./path/to/cc3m_train.csv" \
  --imagenet-val="./path/to/imagenet_validation" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --model=RN50 \
  --batch-size=256 \
  --lr=1e-3 \
  --wd=0.1 \
  --epochs=30 \
  --workers=8 \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text \
  --subsample \
  --name pretrain_cc3m_train_RN50_subsample
```

Fine-tune the model

Fine-tune our model without subsampling frequent words on SLURM: sbatch clip_run_experiment_cluster_das_finetune.sh.

```
torchrun --nproc_per_node=8 --master_port=25698 training/main.py \
  --save-frequency=1 \
  --report-to=tensorboard \
  --zeroshot-frequency=1 \
  --train-data="../path/to/cc3m/cc3m_train.csv" \
  --imagenet-val="./path/to/imagenet_validation" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --model=RN50 \
  --pretrained="./path/to/checkpoints/epoch_K.pt" \
  --batch-size=768 \
  --warmup=125 \
  --lr=1e-3 \
  --wd=0.1 \
  --epochs=1 \
  --workers=8 \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text \
  --name pretrain_cc3m_train_RN50_subsample_finetune
```

Evaluation

Test our model on SLURM: sbatch ml_run_with_slurm_das_test.sh

We have uploaded our pre-trained models here. Download them and put them in the model directory. Test the model with: sbatch clip_run_experiment_cluster_das_test.sh

```
python -u training/main.py \
  --report-to tensorboard \
  --imagenet-val="./path/to/imagenet_validation/" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --batch-size=256 \
  --workers=6 \
  --model=RN50 \
  --pretrained="./path/to/checkpoints/epoch_K.pt" \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text
```

Citation

```
@inproceedings{swclip2023liang,
  author    = {Liang, Mingliang and Larson, Martha},
  title     = {Subsampling of Frequent Words in Text for Pre-Training a Vision-Language Model},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications},
}
```

Our code is based on open_clip.

Owner

  • Name: Mingliang Liang
  • Login: MingliangLiang3
  • Kind: user
  • Location: Netherlands
  • Company: Radboud University

Researcher in multimedia retrieval

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
.github/workflows/clear-cache.yml actions
  • actions/github-script v6 composite
.github/workflows/python-publish.yml actions
  • actions-ecosystem/action-regex-match v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • softprops/action-gh-release v1 composite
requirements-test.txt pypi
  • pytest ==7.2.0 test
  • pytest-split ==0.8.0 test
  • timm ==0.6.11 test
  • transformers * test
requirements-training.txt pypi
  • braceexpand *
  • fsspec *
  • ftfy *
  • huggingface_hub *
  • pandas *
  • regex *
  • timm *
  • torch >=1.9.0
  • torchvision *
  • tqdm *
  • transformers *
  • webdataset >=0.2.5
requirements.txt pypi
  • ftfy *
  • huggingface_hub *
  • protobuf <4
  • regex *
  • sentencepiece *
  • timm *
  • torch >=1.9.0
  • torchvision *
  • tqdm *
setup.py pypi