Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to acm.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.9%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
SW-CLIP: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model
ACM MM 2023 Workshop Paper Code: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model.

Comparison of SW-CLIP and CLIP for zero-shot classification on ImageNet1K. The image encoder backbone is RN50, and the models are pre-trained on CC3M for 30 epochs.
| method  | text mask         | text len | time  | ImageNet1K |
|---------|-------------------|----------|-------|------------|
| CLIP    | original, 100.00% | 32       | 1.00x | 16.9%      |
| SW-CLIP | SW, 42.30%        | 16       | 0.86x | 17.2%      |
Data
We use the CC3M dataset for training.
Then, generate the subsampling file by running python3 src/data/subsampling.py.
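The subsampling step drops frequent words from captions before pre-training. A minimal sketch of word2vec-style frequency subsampling follows; the exact discard formula, threshold t, and tokenization used in src/data/subsampling.py are assumptions here:

```python
import math
import random
from collections import Counter

def subsample_caption(tokens, counts, total, t=1e-4, rng=random.random):
    # Word2vec-style subsampling: keep word w with probability
    # min(1, sqrt(t / f(w))), where f(w) = counts[w] / total is its
    # corpus frequency. Frequent words are usually dropped; rare,
    # content-bearing words are almost always kept.
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_keep = min(1.0, math.sqrt(t / f)) if f > 0 else 1.0
        if rng() < p_keep:
            kept.append(w)
    return kept
```

With a rule like this, high-frequency function words ("the", "a", "of") are removed most of the time, which is how the caption length can shrink (42.30% of text retained in the table above) while most semantic content survives.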
Train the model
Train our model on SLURM with sbatch clip_run_experiment_cluster_das_train.sh, or launch it directly with torchrun:
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--train-data="./path/to/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--batch-size=256 \
--lr=1e-3 \
--wd=0.1 \
--epochs=30 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--subsample \
--name pretrain_cc3m_train_RN50_subsample
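The --csv-img-key=image and --csv-caption-key=caption flags above imply a training CSV with one image-caption pair per row under those two column names. A toy example of the expected layout (the file name and image paths are placeholders):

```python
import csv

# Write a tiny train CSV in the layout implied by --csv-img-key=image
# and --csv-caption-key=caption: one image-caption pair per row.
rows = [
    {"image": "images/000001.jpg", "caption": "a dog running on grass"},
    {"image": "images/000002.jpg", "caption": "a city skyline at night"},
]
with open("toy_train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```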
Fine-tune the model
Fine-tune our model without subsampling frequent words on SLURM: sbatch clip_run_experiment_cluster_das_finetune.sh.
torchrun --nproc_per_node=8 --master_port=25698 training/main.py \
--save-frequency=1 \
--report-to=tensorboard \
--zeroshot-frequency=1 \
--train-data="../path/to/cc3m/cc3m_train.csv" \
--imagenet-val="./path/to/imagenet_validation" \
--csv-img-key=image \
--csv-caption-key=caption \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--batch-size=768 \
--warmup=125 \
--lr=1e-3 \
--wd=0.1 \
--epochs=1 \
--workers=8 \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text \
--name pretrain_cc3m_train_RN50_subsample_finetune
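The --warmup=125 flag ramps the learning rate up linearly before the main schedule takes over; open_clip's default schedule is linear warmup followed by cosine decay. A sketch of that shape (the total step count here is illustrative, not taken from the run above):

```python
import math

def lr_at_step(step, base_lr=1e-3, warmup=125, total_steps=1000):
    # Linear warmup over the first `warmup` steps, then cosine decay
    # to zero over the remaining steps.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Warmup matters most here because fine-tuning resumes from a checkpoint with a large batch size (768); a few warmup steps avoid a destabilizing jump to the full learning rate.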
Evaluation
Test our model on SLURM.
sbatch ml_run_with_slurm_das_test.sh
We upload our pre-trained models here. You can download them and put them in the model directory.
Test the model by: sbatch clip_run_experiment_cluster_das_test.sh
python -u training/main.py \
--report-to tensorboard \
--imagenet-val="./path/to/imagenet_validation/" \
--csv-img-key=image \
--csv-caption-key=caption \
--batch-size=256 \
--workers=6 \
--model=RN50 \
--pretrained="./path/to/checkpoints/epoch_K.pt" \
--seed=42 \
--local-loss \
--gather-with-grad \
--force-custom-text
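Zero-shot evaluation on the --imagenet-val set embeds a text prompt per class and assigns each image the class whose text embedding is most similar. A minimal sketch operating on pre-computed features, with numpy standing in for the model's image and text encoders:

```python
import numpy as np

def zero_shot_predict(image_feats, text_feats):
    # L2-normalize both sides so cosine similarity reduces to a dot
    # product; the prediction is the most similar class prompt per image.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T
    return logits.argmax(axis=1)
```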
Citation
@inproceedings{swclip2023liang,
author = {Liang, Mingliang and Larson, Martha},
title = {Subsampling of Frequent Words in Text for Pre-Training a Vision-Language Model},
year = {2023},
publisher = {Association for Computing Machinery},
booktitle = {Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications},
}
Our code is adapted from open_clip.
Owner
- Name: Mingliang Liang
- Login: MingliangLiang3
- Kind: user
- Location: Netherlands
- Company: Radboud University
- Repositories: 1
- Profile: https://github.com/MingliangLiang3
Researcher in Multimedia retrieval
Dependencies
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- actions/github-script v6 composite
- actions-ecosystem/action-regex-match v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- softprops/action-gh-release v1 composite
- pytest ==7.2.0 test
- pytest-split ==0.8.0 test
- timm ==0.6.11 test
- transformers * test
- braceexpand *
- fsspec *
- ftfy *
- huggingface_hub *
- pandas *
- regex *
- timm *
- torch >=1.9.0
- torchvision *
- tqdm *
- transformers *
- webdataset >=0.2.5
- ftfy *
- huggingface_hub *
- protobuf <4
- regex *
- sentencepiece *
- timm *
- torch >=1.9.0
- torchvision *
- tqdm *