https://github.com/amazon-science/object-centric-vol

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 642 KB
Statistics
  • Stars: 10
  • Watchers: 2
  • Forks: 2
  • Open Issues: 3
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Official PyTorch Implementation of Unsupervised Open-Vocabulary Object Localization in Videos

Unsupervised Open-Vocabulary Object Localization in Videos
Ke Fan*, Zechen Bai*, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang†, Yanwei Fu†, Tong He

Introduction

We propose an unsupervised video object localization method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP.
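To make the second step more concrete, here is a minimal sketch of matching slots to text. It assumes each slot has already been summarized as a single feature vector in the same space as the CLIP text embeddings, and it assumes cosine similarity as the matching rule; both are illustrative assumptions, not necessarily how this repository implements the assignment. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_slots(slot_features, text_features, class_names):
    """Assign each slot the class whose text embedding is most similar."""
    labels = []
    for slot in slot_features:
        scores = [cosine(slot, text) for text in text_features]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best])
    return labels

# Toy 3-d embeddings standing in for CLIP features (illustrative values only).
class_names = ["dog", "car"]
text_features = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
slot_features = [[0.9, 0.1, 0.3], [0.1, 0.8, 0.0]]
slot_labels = label_slots(slot_features, text_features, class_names)
print(slot_labels)  # ['dog', 'car']
```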

Installation

This codebase is tested with PyTorch 1.11.0. Adjust the PyTorch and CUDA versions to match your machine.

    git clone git@github.com:amazon-science/object-centric-vol.git
    cd object-centric-vol
    conda create --name OV-VOL -y python=3.9
    source activate OV-VOL
    conda install ipython pip
    conda install pytorch=1.11.0 torchvision cudatoolkit=10.0 -c pytorch
    pip install -r requirements.txt
    git clone git@github.com:MCG-NJU/VideoMAE.git
    DS_BUILD_OPS=1 pip install deepspeed

Data Preparation

Download ImageNet 2012 for training the patch-based CLIP and the ILSVRC2015 VID dataset (ImageNet-VID) for video object localization.

After downloading and unzipping ImageNet-VID, you will get a folder with the following structure:

    ILSVRC/
    ├── Annotations/
    │   └── VID
    │       ├── train
    │       └── val
    ├── Data
    │   └── VID
    │       ├── snippets
    │       ├── test
    │       ├── train
    │       └── val
    └── ImageSets
        ├── VID
        └── VID_val_videos.txt

Use the following commands to resize each original video so that its short edge is 224 pixels and both its height and width are divisible by 16.

    python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/train /home/ubuntu/ILSVRC2015_224px/train --dense --level 2 --ext mp4 --to-mp4 --scale 224 --num-worker 2
    python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/test /home/ubuntu/ILSVRC2015_224px/test --dense --level 1 --ext mp4 --to-mp4 --scale 224 --num-worker 2
    python data/resize_short_bar_resize_patch.py /home/ubuntu/ILSVRC/Data/VID/snippets/val /home/ubuntu/ILSVRC2015_224px/val --dense --level 1 --ext mp4 --to-mp4 --scale 224 --num-worker 2
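The resizing rule stated above (short edge of 224, both sides divisible by 16) can be sketched as a small size calculation. How the repository's script rounds the long edge is not documented here, so rounding to the nearest multiple of 16 is an assumption, and `target_size` is a hypothetical helper rather than a function from the codebase.

```python
def target_size(width, height, short_edge=224, multiple=16):
    """Return (new_width, new_height) for a resized frame.

    The short edge becomes `short_edge`; the long edge is scaled by the
    same factor and then rounded to the nearest multiple of `multiple`
    (assumed rounding rule).
    """
    if short_edge % multiple:
        raise ValueError("short_edge must itself be divisible by multiple")
    scale = short_edge / min(width, height)
    long_edge = max(width, height) * scale
    long_edge = max(short_edge, round(long_edge / multiple) * multiple)
    return (long_edge, short_edge) if width >= height else (short_edge, long_edge)

print(target_size(1280, 720))  # (400, 224)
```

For a 1280x720 video the scaled long edge is about 398.2 pixels, which this sketch rounds to 400 so that it splits evenly into 16-pixel patches.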

Please place the dataset in the following structure:

    code_root/
    └── data_ckpt_logs/
        ├── ckpt
        ├── dataset
        │   ├── ILSVRC2015_224px
        │   │   ├── train
        │   │   ├── test
        │   │   └── val
        │   └── ILSVRC
        │       ├── Annotations
        │       │   └── VID
        │       │       ├── train
        │       │       └── val
        │       ├── Data
        │       │   └── VID
        │       │       ├── snippets
        │       │       ├── test
        │       │       ├── train
        │       │       └── val
        │       └── ImageSets
        │           ├── VID
        │           └── VID_val_videos.txt
        └── logs

We recommend using symbolic links (ln -s) rather than copying the data. Finally, run the following command to generate the CSV file for training the slot attention model:

    python generate_csv.py

Alternatively, you can download the generated lists: train list and val list.
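The layout above can also be set up with a short script instead of manual `ln` calls. This is a minimal sketch, assuming the two dataset folders already exist somewhere on disk; `link_dataset` is a hypothetical helper, not part of the repository, and the example paths in the comment are the ones used in the resize commands.

```python
from pathlib import Path

def link_dataset(code_root, ilsvrc_dir, resized_dir):
    """Create data_ckpt_logs/ and symlink the two dataset folders into it."""
    base = Path(code_root) / "data_ckpt_logs"
    for name in ("ckpt", "dataset", "logs"):
        (base / name).mkdir(parents=True, exist_ok=True)
    links = {
        base / "dataset" / "ILSVRC": Path(ilsvrc_dir),
        base / "dataset" / "ILSVRC2015_224px": Path(resized_dir),
    }
    for link, target in links.items():
        # Skip links that already exist so the script can be re-run safely.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(target.resolve(), target_is_directory=True)
    return base

# Example call (paths are placeholders for your own locations):
# link_dataset(".", "/home/ubuntu/ILSVRC", "/home/ubuntu/ILSVRC2015_224px")
```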

Training and Evaluation

Pretraining the VideoMAE

```bash
YOUR_PATH=data_ckpt_logs
OUTPUT_DIR=${YOUR_PATH}/ckpt/pretrain-backbones
DATA_PATH=path-to-the-pretraining-video-list

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
    --master_port 12320 --nnodes=16 --node_rank=number-of-rank --master_addr=master-ip-addr \
    /home/ubuntu/GitLab/Object-Centric-VOL/run_mae_pretraining_single_frame.py \
    --data_path ${DATA_PATH} \
    --mask_type tube \
    --mask_ratio 0.9 \
    --model pretrain_videomae_base_patch16_224 \
    --decoder_depth 4 \
    --batch_size 4 \
    --num_frames 16 \
    --sampling_rate 2 \
    --opt adamw \
    --opt_betas 0.9 0.95 \
    --warmup_epochs 40 \
    --save_ckpt_freq 20 \
    --epochs 2401 \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR}
```

Refer to VideoMAE for the data pre-processing used in the pretraining stage.
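The pretraining command above uses tube masking with a mask ratio of 0.9 over 16-frame clips. As a rough illustration of what that means, the sketch below builds one random spatial mask and repeats it across time, so every masked patch forms a tube. The 224-pixel input and 16-pixel patch size are read off the model name, and the tubelet size of 2 is an assumption; `tube_mask` is a hypothetical helper, not the repository's masking code.

```python
import random

def tube_mask(num_frames=16, tubelet_size=2, img_size=224, patch_size=16,
              mask_ratio=0.9, seed=None):
    """Return a list of per-time-step masks (1 = masked, 0 = visible).

    The same spatial mask is repeated for every temporal step, so each
    masked patch position stays masked through the whole clip.
    """
    rng = random.Random(seed)
    patches_per_frame = (img_size // patch_size) ** 2   # 14 * 14 = 196
    temporal_steps = num_frames // tubelet_size         # 16 // 2 = 8
    num_masked = int(mask_ratio * patches_per_frame)    # int(176.4) = 176
    frame_mask = [1] * num_masked + [0] * (patches_per_frame - num_masked)
    rng.shuffle(frame_mask)
    return [list(frame_mask) for _ in range(temporal_steps)]

mask = tube_mask(seed=0)
print(len(mask), len(mask[0]), sum(mask[0]))  # 8 196 176
```

Under these assumptions only 20 of the 196 patch positions stay visible at each temporal step.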

Training the patch-based CLIP

Run the following command to train the patch-based CLIP:

    python train.py --dist-url 'tcp://IP_OF_NODE0:FREEPORT' \
        --dist-backend 'nccl' \
        --multiprocessing-distributed \
        --world-size 4 \
        --rank the-rank-of-your-machine \
        --data path-to-your-imagenet \
        --epochs 200 \
        --lr 1.0 \
        --batch-size 4096

Train the slot attention grouping model on ImageNet-VID dataset

Then run the following commands, one on each of the four machines, to train the slot attention grouping model after self-supervised pretraining:

    torchrun --nnodes=4 --node_rank 0 --master_addr ip-of-your-first-machine \
        --master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint

    torchrun --nnodes=4 --node_rank 1 --master_addr ip-of-your-first-machine \
        --master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint

    torchrun --nnodes=4 --node_rank 2 --master_addr ip-of-your-first-machine \
        --master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint

    torchrun --nnodes=4 --node_rank 3 --master_addr ip-of-your-first-machine \
        --master_port 8899 --nproc_per_node=8 ./train_grouping_imagenet_vid.py --pretrained_checkpint path-to-pretrained-backbone-checkpoint
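The four commands differ only in the node rank. As a quick sanity check of the resulting process layout, the sketch below computes the world size and the global rank of each worker, assuming the usual torchrun convention for identical nodes (global rank = node rank × processes per node + local rank). The helper names are hypothetical.

```python
def world_size(nnodes=4, nproc_per_node=8):
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node

def global_rank(node_rank, local_rank, nproc_per_node=8):
    """Global rank of one worker, assuming identical nodes."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank

print(world_size())        # 32
print(global_rank(0, 0))   # 0
print(global_rank(3, 7))   # 31
```

With 4 nodes and 8 processes each, training runs on 32 workers, and the machine started with node rank 0 must be reachable at the master address and port given on every command line.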

Evaluation

After training the slot attention grouping model and the patch-based CLIP, use the following command to evaluate the model:

    torchrun --nnodes=1 --nproc_per_node=8 test_imagenet_vid.py \
        --st_grouping_ckpt_path path-to-slot-attention-grouping-checkpoint \
        --clip_pacl_ckpt_path path-to-patch-based-clip-checkpoint \
        --num_slots 15 --n_stmae_seeds 1 \
        --seed 287 \
        --output_folder evaluation_results/VideoMAE_STGrouping_15slots_8frames_299epoch

Checkpoints

Due to license restrictions, we can only provide the checkpoint of the patch-based CLIP.

Citation

If you find our paper useful for your research and applications, please cite using this BibTeX:

    @InProceedings{Fan_2023_ICCV,
        author    = {Fan, Ke and Bai, Zechen and Xiao, Tianjun and Zietlow, Dominik and Horn, Max and Zhao, Zixu and Simon-Gabriel, Carl-Johann and Shou, Mike Zheng and Locatello, Francesco and Schiele, Bernt and Brox, Thomas and Zhang, Zheng and Fu, Yanwei and He, Tong},
        title     = {Unsupervised Open-Vocabulary Object Localization in Videos},
        booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
        month     = {October},
        year      = {2023},
        pages     = {13747-13755}
    }

Acknowledgements

Our code is based on VideoMAE, CLIP, mega and object-centric-learning-framework repositories. Thanks to the contributors of these great codebases.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 3
  • Fork event: 1
Last Year
  • Watch event: 3
  • Fork event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 months
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.4
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 0
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 months
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.4
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 5
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (5)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (5)

Dependencies

requirements.txt pypi
  • Pillow ==9.4.0
  • decord ==0.6.0
  • einops ==0.4.1
  • ftfy *
  • matplotlib ==3.5.3
  • numpy ==1.24.2
  • opencv-python ==4.6.0.66
  • regex *
  • scikit-image ==0.19.3
  • scikit-learn ==1.0
  • scipy ==1.8.1
  • tensorboard ==2.10.0
  • tensorboardX ==2.5.1
  • timm ==0.4.5
  • torchmetrics ==0.10.2
  • tqdm *