Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary

Keywords

vision-language-model
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: MingliangLiang3
  • License: other
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 12.6 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
vision-language-model
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

SW-CLIP: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

ACM MM 2023 Workshop Paper Code: Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model.

Comparison of SW-CLIP and CLIP on zero-shot ImageNet1K classification. The image encoder backbone is RN50, and both models are pre-trained on CC3M for 30 epochs.

| method  | text mask         | text len | time  | ImageNet1K |
|---------|-------------------|----------|-------|------------|
| CLIP    | original, 100.00% | 32       | 1.00x | 16.9%      |
| SW-CLIP | SW, 42.30%        | 16       | 0.86x | 17.2%      |
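Zero-shot classification here means scoring an image embedding against text embeddings of class prompts (e.g. "a photo of a {label}") and predicting the closest class. A minimal cosine-similarity sketch with toy embeddings (the repo uses open_clip's zero-shot evaluation; the helper names below are illustrative, not from the codebase):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose prompt embedding is most
    similar to the image embedding."""
    return max(range(len(class_text_embs)),
               key=lambda i: cosine(image_emb, class_text_embs[i]))

# Toy 3-d embeddings for two classes; the image is closest to class 1.
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1]]
image_emb = [0.1, 0.9, 0.2]
pred = zero_shot_predict(image_emb, class_embs)
```

Accuracy in the table is the fraction of ImageNet1K validation images for which this argmax matches the ground-truth label.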

Data

We use the CC3M dataset for training. Generate the subsampling file with python3 src/data/subsampling.py.
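The subsampling step drops frequent words from captions before tokenization. A minimal sketch of word2vec-style frequent-word subsampling (the function name, threshold t, and exact keep-probability formula are assumptions for illustration, not the contents of src/data/subsampling.py):

```python
import math
import random
from collections import Counter

def subsample_caption(tokens, counts, total, t=1e-4, rng=None):
    """Drop frequent words from a caption: keep word w with probability
    min(1, sqrt(t / f(w))), where f(w) = counts[w] / total is the word's
    corpus frequency. Rare words are always kept; very frequent words
    are usually dropped."""
    rng = rng or random.Random(42)
    kept = []
    for w in tokens:
        f = counts[w] / total
        keep_prob = min(1.0, math.sqrt(t / f)) if f > 0 else 1.0
        if rng.random() < keep_prob:
            kept.append(w)
    return kept

# Toy corpus statistics: "the" is very frequent, "cat" is rare.
counts = Counter({"the": 500_000, "cat": 50})
total = 1_000_000
caption = ["the"] * 1000 + ["cat"] * 10
kept = subsample_caption(caption, counts, total)
```

With f("the") = 0.5 the keep probability is sqrt(1e-4 / 0.5) ≈ 1.4%, so nearly all occurrences of "the" are dropped while every "cat" survives; shrinking captions this way is what reduces the effective text length during pre-training.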

Train the model

Train our model on SLURM: sbatch clip_run_experiment_cluster_das_train.sh.

```
torchrun --nproc_per_node=8 --master_port=25678 training/main.py \
  --save-frequency=1 \
  --report-to=tensorboard \
  --train-data="./path/to/cc3m_train.csv" \
  --imagenet-val="./path/to/imagenet_validation" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --model=RN50 \
  --batch-size=256 \
  --lr=1e-3 \
  --wd=0.1 \
  --epochs=30 \
  --workers=8 \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text \
  --subsample \
  --name pretrain_cc3m_train_RN50_subsample
```

Fine-tune the model

Fine-tune our model without subsampling frequent words on SLURM: sbatch clip_run_experiment_cluster_das_finetune.sh.

```
torchrun --nproc_per_node=8 --master_port=25698 training/main.py \
  --save-frequency=1 \
  --report-to=tensorboard \
  --zeroshot-frequency=1 \
  --train-data="../path/to/cc3m/cc3m_train.csv" \
  --imagenet-val="./path/to/imagenet_validation" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --model=RN50 \
  --pretrained="./path/to/checkpoints/epoch_K.pt" \
  --batch-size=768 \
  --warmup=125 \
  --lr=1e-3 \
  --wd=0.1 \
  --epochs=1 \
  --workers=8 \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text \
  --name pretrain_cc3m_train_RN50_subsample_finetune
```

Evaluation

Test our model on SLURM: sbatch ml_run_with_slurm_das_test.sh

We have uploaded our pre-trained models here. Download them and put them in the model directory. Test the model with: sbatch clip_run_experiment_cluster_das_test.sh

```
python -u training/main.py \
  --report-to tensorboard \
  --imagenet-val="./path/to/imagenet_validation/" \
  --csv-img-key=image \
  --csv-caption-key=caption \
  --batch-size=256 \
  --workers=6 \
  --model=RN50 \
  --pretrained="./path/to/checkpoints/epoch_K.pt" \
  --seed=42 \
  --local-loss \
  --gather-with-grad \
  --force-custom-text
```

Citation

```
@inproceedings{swclip2023liang,
  author    = {Liang, Mingliang and Larson, Martha},
  title     = {Subsampling of Frequent Words in Text for Pre-Training a Vision-Language Model},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  booktitle = {Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications},
}
```

Our code is based on open_clip.

Owner

  • Name: Mingliang Liang
  • Login: MingliangLiang3
  • Kind: user
  • Location: Netherlands
  • Company: Radboud University

Researcher in multimedia retrieval

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
.github/workflows/clear-cache.yml actions
  • actions/github-script v6 composite
.github/workflows/python-publish.yml actions
  • actions-ecosystem/action-regex-match v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • softprops/action-gh-release v1 composite
requirements-test.txt pypi
  • pytest ==7.2.0 test
  • pytest-split ==0.8.0 test
  • timm ==0.6.11 test
  • transformers * test
requirements-training.txt pypi
  • braceexpand *
  • fsspec *
  • ftfy *
  • huggingface_hub *
  • pandas *
  • regex *
  • timm *
  • torch >=1.9.0
  • torchvision *
  • tqdm *
  • transformers *
  • webdataset >=0.2.5
requirements.txt pypi
  • ftfy *
  • huggingface_hub *
  • protobuf <4
  • regex *
  • sentencepiece *
  • timm *
  • torch >=1.9.0
  • torchvision *
  • tqdm *
setup.py pypi