videomae
[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.9%) to scientific vocabulary
Keywords
Repository
[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Basic Info
- Host: GitHub
- Owner: innat
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2203.12602
- Size: 13.9 MB
Statistics
- Stars: 20
- Watchers: 2
- Forks: 3
- Open Issues: 3
- Releases: 2
Topics
Metadata Files
README.md
VideoMAE

Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Drawing inspiration from the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, encouraging the model to learn more effective video representations during pre-training. A minimal sketch of tube masking is shown after the highlights below. Some highlights of VideoMAE:
- Masked Video Modeling for Video Pre-Training
- A Simple, Efficient and Strong Baseline in SSVP
- High performance, but NO extra data required
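The key idea of tube masking is that one spatial mask is shared across the temporal dimension, so masked patches form "tubes" through time and cannot be trivially reconstructed from neighboring frames. The NumPy sketch below is only illustrative and not part of this repository's API (the repo exposes a `TubeMaskingGenerator`, used later in this README):

```python
import numpy as np

def tube_mask(temporal_tokens, spatial_tokens, mask_ratio=0.9, seed=None):
    """Illustrative tube masking: draw one random spatial mask and repeat it
    across every temporal token slice, producing tubes of masked patches
    through time. Hypothetical helper, not this repository's API."""
    rng = np.random.default_rng(seed)
    num_masked = int(mask_ratio * spatial_tokens)
    frame_mask = np.zeros(spatial_tokens, dtype=bool)
    frame_mask[rng.choice(spatial_tokens, num_masked, replace=False)] = True
    # repeat the same spatial pattern along the temporal dimension
    return np.tile(frame_mask, (temporal_tokens, 1)).reshape(-1)

# e.g. 16 frames with tubelet size 2 -> 8 temporal slices,
# (224 // 16) ** 2 = 196 spatial tokens, 90% masking
mask = tube_mask(8, 196, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (1568,) ~0.9
```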
This is an unofficial Keras implementation of the VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found here.
News
- [27-10-2023]: Code of Video-FocalNet in Keras is now available.
- [24-10-2023]: The Kinetics-400 test set is available on Kaggle, link.
- [15-10-2023]: Code of UniFormerV2 (UFV2) in Keras is now available.
- [09-10-2023]: TensorFlow SavedModel (format) checkpoints, link.
- [06-10-2023]: Code of Video Swin Transformer in Keras is now available.
- [06-10-2023]: VideoMAE integrated into a Hugging Face Space.
- [04-10-2023]: VideoMAE checkpoints for SSV2 and UCF101 are now available, link.
- [03-10-2023]: VideoMAE checkpoints for Kinetics-400 are now available, link.
- [29-09-2023]: GPU(s) and TPU-VM are supported for fine-tuning.
- [27-09-2023]: Code of VideoMAE in Keras is now available.
Install
```bash
git clone https://github.com/innat/VideoMAE.git
cd VideoMAE
pip install -e .
```
Usage
Several variants of the VideoMAE model are available, i.e. small, base, large, and huge, each with checkpoints for specific benchmark datasets, i.e. Kinetics-400, SSV2, and UCF101. Check the release and model zoo pages for details.
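In practice, the number of output classes must match the benchmark the checkpoint was fine-tuned on: 400 for Kinetics-400, 174 for SSv2, and 101 for UCF101. A hedged sketch using the small fine-tuned variant shown later in this README (class names for other variants and the exact checkpoint file names should be taken from the model zoo, not from this sketch):

```python
from videomae import VideoMAE_ViTS16FT

# num_classes must match the benchmark of the checkpoint you load:
# Kinetics-400 -> 400, Something-Something V2 -> 174, UCF101 -> 101
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
# model.load_weights('<checkpoint-from-model-zoo>.h5')  # placeholder path
```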
Pre-trained Masked Autoencoder
Only the inference part is provided for the pre-trained VideoMAE models. With a trained checkpoint, it is possible to reconstruct the input sample even at a high mask ratio. For the end-to-end workflow, check the reconstruction.ipynb notebook. Some highlights:
```python
from videomae import VideoMAE_ViTS16PT

# pre-trained self-supervised model
model = VideoMAE_ViTS16PT(img_size=224, patch_size=16)
model.load_weights('TFVideoMAE_B_K400_16x224_PT.h5')

# tube masking (window_size is the token grid, e.g. (8, 14, 14) for
# 16 frames at 224x224 with patch size 16 and tubelet size 2)
tube_mask = TubeMaskingGenerator(
    input_size=window_size,
    mask_ratio=0.80
)
make_bool = tube_mask()
bool_masked_pos_tf = tf.constant(make_bool, dtype=tf.int32)
bool_masked_pos_tf = tf.expand_dims(bool_masked_pos_tf, axis=0)
bool_masked_pos_tf = tf.cast(bool_masked_pos_tf, tf.bool)

# running
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
pred_tf = model(frames, bool_masked_pos_tf)
pred_tf.numpy().shape  # TensorShape([1, 1176, 1536])
```
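The output shape can be read as [batch, masked tokens, per-token pixel values]. A small arithmetic sketch, assuming the usual VideoMAE tokenization (tubelet size 2, patch size 16, 224x224 input, 16 frames):

```python
temporal_slices = 16 // 2            # 8 tubelet slices along time
spatial_tokens  = (224 // 16) ** 2   # 196 patches per slice
total_tokens    = temporal_slices * spatial_tokens  # 1568 tokens in total
token_dim       = 16 * 16 * 3 * 2    # 1536 pixel values per reconstructed tubelet patch
# the middle axis of the prediction (1176 above) is the number of masked tokens,
# i.e. a mask_ratio-dependent fraction of the 1568 total tokens
print(total_tokens, token_dim)  # 1568 1536
```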
Reconstructed results on a sample from SSV2 with mask_ratio=0.8:

Fine Tuned Model
With the fine-tuned VideoMAE checkpoints, it is possible to evaluate on the benchmark datasets and also to retrain on a custom dataset. For the end-to-end workflow, check the retraining.ipynb notebook; it supports both multi-GPU and TPU-VM retraining and evaluation. Some highlights:
```python
from videomae import VideoMAE_ViTS16FT

# fine-tuned model (Kinetics-400, 400 classes)
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
y_pred_tf = model(frames)
y_pred_tf.shape  # TensorShape([1, 400])

# map logits to class confidences, sorted from most to least likely
probabilities = tf.nn.softmax(y_pred_tf)
probabilities = probabilities.numpy().squeeze(0)
confidences = {
    label_map_inv[i]: float(probabilities[i])
    for i in np.argsort(probabilities)[::-1]
}
confidences
```

Classification results on a sample from Kinetics-400:
| Video | Top-5 |
|:---:|:---|
| (sample clip not shown) | {<br>'playing_cello': 0.6552159786224365,<br>'snowkiting': 0.0018940207082778215,<br>'deadlifting': 0.0018381892004981637,<br>'playing_guitar': 0.001778001431375742,<br>'playing_recorder': 0.0017528659664094448<br>} |
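Since the fine-tuned model is a standard Keras model, retraining on a custom dataset can follow the usual compile/fit workflow. A minimal sketch, assuming `train_ds` and `val_ds` are hypothetical tf.data pipelines yielding `(frames, labels)` batches with frames shaped `[B, 16, 224, 224, 3]` (see the retraining.ipynb notebook for the repository's actual end-to-end recipe):

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

# hypothetical custom dataset with 10 classes; the dataset pipeline
# below is a placeholder, not part of this repository
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=10)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```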
Model Zoo
The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. The following are some highlights.
Kinetics-400
For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ? | - |
? The official ViT-H backbone of VideoMAE has a weight issue in its pre-trained model; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.
Only the FLOPs of the fine-tuned (FT) encoder models are reported.
Something-Something V2
For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |
UCF101
For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |
Visualization
Some reconstructed video samples using VideoMAE with different mask ratios are shown below (the rendered clips are omitted here; only the dataset and mask ratio are listed).
- Kinetics-400 test set: mask ratios 0.8, 0.8, 0.9, 0.9
- SSv2 test set: mask ratios 0.9, 0.9
- UCF101 test set: mask ratios 0.8, 0.9
TODO
- [x] Custom fine-tuning code.
- [ ] Publish on TF-Hub.
- [ ] Support Keras V3 for a multi-framework backend.
Citation
If you use this videomae implementation in your research, please cite it using the metadata from our CITATION.cff file.
```bibtex
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```
Owner
- Name: Mohammed Innat
- Login: innat
- Kind: user
- Location: Dhaka, Bangladesh
- Company: 株式会社 調和技研 | CHOWA GIKEN Corp
- Website: https://www.linkedin.com/in/innat2k14/
- Twitter: m_innat
- Repositories: 139
- Profile: https://github.com/innat
AI Research Software Engineer | Kaggler
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: videomae-keras
message: >-
  If you use this implementation, please cite it using the
  metadata from this file
type: software
authors:
  - given-names: Mohammed
    family-names: Innat
    email: innat.dev@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/innat/VideoMAE'
    description: Keras reimplementation of VideoMAE
keywords:
  - software
license: MIT
version: 1.0.0
date-released: '2023-10-13'
```
GitHub Events
Total
- Issues event: 2
- Watch event: 5
Last Year
- Issues event: 2
- Watch event: 5
Dependencies
- opencv-python >=4.1.2
- tensorflow >=2.12