https://github.com/amazon-science/object-centric-vol
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 642 KB
Statistics
- Stars: 10
- Watchers: 2
- Forks: 2
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
Official PyTorch Implementation of Unsupervised Open-Vocabulary Object Localization in Videos
Unsupervised Open-Vocabulary Object Localization in Videos
Ke Fan*, Zechen Bai*, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang†, Yanwei Fu†, Tong He
Introduction
We propose an unsupervised video object localization method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP.
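To make the second step concrete, here is a toy sketch of how text could be assigned to slots: pool patch features under each slot's mask, then pick the text embedding with the highest cosine similarity. This is an illustration only, not the authors' implementation; the function names, the mask-weighted pooling, and the two-dimensional toy vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pool_slot_feature(patch_features, slot_mask):
    """Mask-weighted average of patch features for one slot (assumed pooling)."""
    total = sum(slot_mask) or 1.0
    dim = len(patch_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(slot_mask, patch_features)) / total
        for d in range(dim)
    ]

def label_slots(patch_features, slot_masks, text_embeddings):
    """Return, for each slot, the name of the most similar text embedding."""
    labels = []
    for mask in slot_masks:
        feat = pool_slot_feature(patch_features, mask)
        labels.append(max(text_embeddings, key=lambda n: cosine(feat, text_embeddings[n])))
    return labels

# Toy data: four patches, two slots, two candidate class names (all hypothetical).
patch_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slot_masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
text_embeddings = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}

print(label_slots(patch_features, slot_masks, text_embeddings))  # ['dog', 'car']
```

In the real pipeline the patch features would come from the patch-based CLIP image encoder and the text embeddings from the CLIP text encoder; the toy vectors above only stand in for them.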
Installation
This codebase is tested with PyTorch 1.11.0. Configure PyTorch to match your machine and CUDA version.
git clone git@github.com:amazon-science/object-centric-vol.git
cd object-centric-vol
conda create --name OV-VOL -y python=3.9
source activate OV-VOL
conda install ipython pip
conda install pytorch=1.11.0 torchvision cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt
git clone git@github.com:MCG-NJU/VideoMAE.git
DS_BUILD_OPS=1 pip install deepspeed
Data Preparation
Download ImageNet 2012 for training the patch-based CLIP, and the ILSVRC2015 VID dataset (ImageNet-VID) for video object localization.
After downloading and unzipping ImageNet-VID, you will have a folder with the following structure:
ILSVRC/
├── Annotations/
│   └── VID
│       ├── train
│       └── val
├── Data
│   └── VID
│       ├── snippets
│       ├── test
│       ├── train
│       └── val
└── ImageSets
    ├── VID
    └── VID_val_videos.txt
Use the following commands to resize the original videos so that the short edge is 224 pixels and both the height and the width are divisible by 16.
python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/train /home/ubuntu/ILSVRC2015_224px/train --dense --level 2 --ext mp4 --to-mp4 --scale 224 --num-worker 2
python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/test /home/ubuntu/ILSVRC2015_224px/test --dense --level 1 --ext mp4 --to-mp4 --scale 224 --num-worker 2
python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/val /home/ubuntu/ILSVRC2015_224px/val --dense --level 1 --ext mp4 --to-mp4 --scale 224 --num-worker 2
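The size constraint enforced by the resize step can be sketched as a small calculation. The helper below is hypothetical and not taken from `data/resize_short_bar_resize_patch.py`; in particular, rounding each side to the nearest multiple of 16 is an assumption, and the actual script may round or pad differently.

```python
def resized_dims(width, height, short_edge=224, patch=16):
    """Scale so the short edge equals `short_edge`, then snap both sides
    to the nearest multiple of `patch` (assumed rounding rule)."""
    scale = short_edge / min(width, height)

    def snap(value):
        # Round to the nearest multiple of `patch`, never below one patch.
        return max(patch, int(value / patch + 0.5) * patch)

    return snap(width * scale), snap(height * scale)

print(resized_dims(1280, 720))  # (400, 224)
```

For a 1280x720 video the long edge scales to about 398 pixels, which snaps to 400, so the frame covers 25 x 14 patches of 16 x 16 pixels.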
Please place the dataset in the following structure:
code_root/
└── data_ckpt_logs/
    ├── ckpt
    ├── dataset
    │   ├── ILSVRC2015_224px
    │   │   ├── train
    │   │   ├── test
    │   │   └── val
    │   └── ILSVRC
    │       ├── Annotations
    │       │   └── VID
    │       │       ├── train
    │       │       └── val
    │       ├── Data
    │       │   └── VID
    │       │       ├── snippets
    │       │       ├── test
    │       │       ├── train
    │       │       └── val
    │       └── ImageSets
    │           ├── VID
    │           └── VID_val_videos.txt
    └── logs
We recommend using symbolic links (`ln -s`) to place the data rather than copying it.
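The layout above can also be created from Python. The following is a minimal sketch, assuming the raw and resized datasets already exist elsewhere on disk; the function name `link_dataset` is hypothetical, and the example paths in the trailing comment are the ones used in the resize commands.

```python
import os
from pathlib import Path

def link_dataset(code_root, ilsvrc_dir, resized_dir):
    """Create data_ckpt_logs/{ckpt,dataset,logs} under `code_root` and
    symlink the two dataset folders into data_ckpt_logs/dataset."""
    base = Path(code_root) / "data_ckpt_logs"
    for name in ("ckpt", "dataset", "logs"):
        (base / name).mkdir(parents=True, exist_ok=True)

    links = {
        base / "dataset" / "ILSVRC": Path(ilsvrc_dir),
        base / "dataset" / "ILSVRC2015_224px": Path(resized_dir),
    }
    for link, target in links.items():
        # Skip links that already exist so the function can be re-run safely.
        if not link.exists() and not link.is_symlink():
            os.symlink(target.resolve(), link, target_is_directory=True)
    return base

# Example call (paths are assumptions based on the resize commands):
# link_dataset(".", "/home/ubuntu/ILSVRC", "/home/ubuntu/ILSVRC2015_224px")
```

Symlinking keeps a single copy of the videos on disk while still presenting the directory structure the code expects.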
Finally, run the following command to generate the CSV file used for training the slot attention model:
python generate_csv.py
Alternatively, you can download the pre-generated lists: train list and val list.
Training and Evaluation
Pretraining the VideoMAE
```bash
YOUR_PATH=data_ckpt_logs
OUTPUT_DIR=${YOUR_PATH}/ckpt/pretrain-backbones
DATA_PATH=path-to-the-pretraining-video-list

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
    --master_port 12320 --nnodes=16 --node_rank=number-of-rank --master_addr=master-ip-addr \
    /home/ubuntu/GitLab/Object-Centric-VOL/run_mae_pretraining_single_frame.py \
    --data_path ${DATA_PATH} \
    --mask_type tube \
    --mask_ratio 0.9 \
    --model pretrain_videomae_base_patch16_224 \
    --decoder_depth 4 \
    --batch_size 4 \
    --num_frames 16 \
    --sampling_rate 2 \
    --opt adamw \
    --opt_betas 0.9 0.95 \
    --warmup_epochs 40 \
    --save_ckpt_freq 20 \
    --epochs 2401 \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR}
```
Refer to VideoMAE for the data pre-processing of the pretraining stage.
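The pretraining command above selects tube masking with a mask ratio of 0.9. As a rough illustration of what that means, the sketch below draws one random spatial mask and repeats it at every temporal position, in the style of VideoMAE. It is not the repository's masking code: the function name is hypothetical, the tubelet size of 2 is an assumed VideoMAE default, and the 224-pixel input with 16-pixel patches is inferred from the model name.

```python
import random

def tube_mask(num_frames=16, tubelet_size=2, image_size=224,
              patch_size=16, mask_ratio=0.9, seed=0):
    """Return a list of per-time-step masks (1 = masked, 0 = visible).
    Every time step shares the same spatial mask, forming 'tubes'."""
    patches_per_frame = (image_size // patch_size) ** 2   # 14 * 14 = 196
    temporal_positions = num_frames // tubelet_size       # 16 / 2 = 8
    num_masked = int(mask_ratio * patches_per_frame)      # 176 of 196

    spatial = [1] * num_masked + [0] * (patches_per_frame - num_masked)
    random.Random(seed).shuffle(spatial)

    # Copy the same spatial pattern to each temporal position.
    return [list(spatial) for _ in range(temporal_positions)]

mask = tube_mask()
visible = sum(row.count(0) for row in mask)
print(len(mask), len(mask[0]), visible)  # 8 196 160
```

Under these assumptions only 160 of the 1568 tokens stay visible to the encoder, and a patch that is hidden at one time step is hidden at all of them.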
Training the patch-based CLIP
Run the following command to train the patch-based CLIP:
python train.py --dist-url 'tcp://IP_OF_NODE0:FREEPORT' \
--dist-backend 'nccl' \
--multiprocessing-distributed \
--world-size 4 \
--rank the-rank-of-your-machine \
--data path-to-your-imagenet \
--epochs 200 \
--lr 1.0 \
--batch-size 4096
Train the slot attention grouping model on ImageNet-VID dataset
After self-supervised pretraining, run the following commands, one on each of the four nodes, to train the slot attention grouping model:
torchrun --nnodes=4 --node_rank 0 --master_addr ip-of-your-first-machine \
--master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint
torchrun --nnodes=4 --node_rank 1 --master_addr ip-of-your-first-machine \
--master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint
torchrun --nnodes=4 --node_rank 2 --master_addr ip-of-your-first-machine \
--master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint
torchrun --nnodes=4 --node_rank 3 --master_addr ip-of-your-first-machine \
--master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint
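The four commands above differ only in `--node_rank`. As a convenience, they can be generated with a short script such as the sketch below; the helper names are hypothetical, and the flags are copied verbatim from the commands above.

```python
def grouping_commands(master_addr, checkpoint, nnodes=4,
                      nproc_per_node=8, master_port=8899):
    """Build one torchrun command per node; only --node_rank changes."""
    commands = []
    for node_rank in range(nnodes):
        commands.append(
            f"torchrun --nnodes={nnodes} --node_rank {node_rank} "
            f"--master_addr {master_addr} --master_port {master_port} "
            f"--nproc_per_node={nproc_per_node} ./train_grouping_imagenet_vid.py "
            f"--pretrained_checkpint {checkpoint}"
        )
    return commands

def world_size(nnodes=4, nproc_per_node=8):
    """Total number of worker processes across all nodes."""
    return nnodes * nproc_per_node

for cmd in grouping_commands("ip-of-your-first-machine",
                             "path-to-pretrained-backbone-checkpoint"):
    print(cmd)
print(world_size())  # 32
```

With 4 nodes and 8 processes per node, training runs on 32 workers in total, and every node must use the same master address and port.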
Evaluation
After training the slot attention grouping model and the patch-based CLIP, use the following command to evaluate the model:
torchrun --nnodes=1 --nproc_per_node=8 test_imagenet_vid.py \
--st_grouping_ckpt_path path-to-slot-attention-grouping-checkpoint \
--clip_pacl_ckpt_path path-to-patch-based-clip-checkpoint \
--num_slots 15 --n_stmae_seeds 1 \
--seed 287 \
--output_folder evaluation_results/VideoMAE_STGrouping_15slots_8frames_299epoch
Checkpoints
Due to license restrictions, we can only provide the checkpoint of the patch-based CLIP.
Citation
If you find our paper useful for your research and applications, please cite using this BibTeX:
@InProceedings{Fan_2023_ICCV,
author = {Fan, Ke and Bai, Zechen and Xiao, Tianjun and Zietlow, Dominik and Horn, Max and Zhao, Zixu and Simon-Gabriel, Carl-Johann and Shou, Mike Zheng and Locatello, Francesco and Schiele, Bernt and Brox, Thomas and Zhang, Zheng and Fu, Yanwei and He, Tong},
title = {Unsupervised Open-Vocabulary Object Localization in Videos},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {13747-13755}
}
Acknowledgements
Our code is based on the VideoMAE, CLIP, mega, and object-centric-learning-framework repositories. Thanks to the contributors of these great codebases.
Security
See CONTRIBUTING for more information.
License
This project is licensed under the Apache-2.0 License.
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 3
- Fork event: 1
Last Year
- Watch event: 3
- Fork event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: about 2 months
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.4
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 5
Past Year
- Issues: 0
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: about 2 months
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.4
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 5
Top Authors
Issue Authors
Pull Request Authors
- dependabot[bot] (5)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- Pillow ==9.4.0
- decord ==0.6.0
- einops ==0.4.1
- ftfy *
- matplotlib ==3.5.3
- numpy ==1.24.2
- opencv-python ==4.6.0.66
- regex *
- scikit-image ==0.19.3
- scikit-learn ==1.0
- scipy ==1.8.1
- tensorboard ==2.10.0
- tensorboardX ==2.5.1
- timm ==0.4.5
- torchmetrics ==0.10.2
- tqdm *
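To check an existing environment against the pins listed above, a small standard-library script is enough. The sketch below is not part of the repository; the function names are hypothetical, and entries marked `*` are treated as "any installed version".

```python
from importlib import metadata

def parse_pin(line):
    """Split an entry such as 'Pillow ==9.4.0' or 'ftfy *' into
    (name, pinned_version), where pinned_version is None for '*'."""
    name, _, spec = line.partition(" ")
    spec = spec.strip()
    return name, (spec[2:] if spec.startswith("==") else None)

def check_pins(lines):
    """Map each package name to (wanted, installed, ok)."""
    report = {}
    for line in lines:
        name, wanted = parse_pin(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and (wanted is None or installed == wanted)
        report[name] = (wanted, installed, ok)
    return report

pins = ["Pillow ==9.4.0", "einops ==0.4.1", "timm ==0.4.5", "tqdm *"]
for name, (wanted, installed, ok) in check_pins(pins).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: wanted={wanted} installed={installed} {status}")
```

Extending `pins` with the remaining entries from the list gives a quick report of which packages deviate from the tested versions.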