videomae

[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

https://github.com/innat/videomae

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.9%) to scientific vocabulary

Keywords

jax keras tensorflow torch video-classification video-dataset videomae
Last synced: 6 months ago

Repository

[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Basic Info
Statistics
  • Stars: 20
  • Watchers: 2
  • Forks: 3
  • Open Issues: 3
  • Releases: 2
Topics
jax keras tensorflow torch video-classification video-dataset videomae
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

VideoMAE

Badges: Palestine · arXiv · keras-2.12 · Open In Colab · Hugging Face

Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Drawing inspiration from the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, which encourages the model to learn more effective video representations during pre-training. Some highlights of VideoMAE (a minimal tube-masking sketch follows the list):

  • Masked Video Modeling for Video Pre-Training
  • A Simple, Efficient and Strong Baseline in SSVP
  • High performance, but NO extra data required
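
To make the tube-masking idea concrete, here is a minimal NumPy sketch (not the repository's TubeMaskingGenerator, just an illustration of the same idea): one random spatial mask is drawn over the patch grid and repeated across all temporal tokens, so the same patches are hidden in every frame.

```python
import numpy as np

def make_tube_mask(num_temporal_tokens, num_spatial_tokens, mask_ratio=0.9, seed=0):
    """Toy tube mask: one random spatial pattern shared by every temporal token."""
    rng = np.random.default_rng(seed)
    num_masked = int(mask_ratio * num_spatial_tokens)
    spatial_mask = np.zeros(num_spatial_tokens, dtype=bool)
    masked_idx = rng.choice(num_spatial_tokens, size=num_masked, replace=False)
    spatial_mask[masked_idx] = True
    # Repeating the same spatial pattern over time masks "tubes" through the clip.
    return np.tile(spatial_mask, num_temporal_tokens)

# 16 frames with tubelet size 2 -> 8 temporal tokens; a 224x224 input with
# 16x16 patches -> 14x14 = 196 spatial tokens.
mask = make_tube_mask(num_temporal_tokens=8, num_spatial_tokens=196, mask_ratio=0.9)
print(mask.shape, int(mask.sum()))  # (1568,) 1408 -> ~90% of tokens hidden
```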

This is an unofficial Keras implementation of the VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found here.

News

  • [27-10-2023]: Code of Video-FocalNet in Keras is available.
  • [24-10-2023]: The Kinetics-400 test set is available on Kaggle, link.
  • [15-10-2023]: Code of UniFormerV2 (UFV2) in Keras is available.
  • [09-10-2023]: TensorFlow SavedModel (format) checkpoints are available, link.
  • [06-10-2023]: Code of Video Swin Transformer in Keras is available.
  • [06-10-2023]: VideoMAE is integrated into Hugging Face Spaces.
  • [04-10-2023]: VideoMAE checkpoints for SSV2 and UCF101 are available, link.
  • [03-10-2023]: VideoMAE checkpoints on Kinetics-400 are available, link.
  • [29-09-2023]: GPU(s) and TPU-VM are supported for fine-tuning.
  • [27-09-2023]: Code of VideoMAE in Keras is available.

Install

```bash
git clone https://github.com/innat/VideoMAE.git
cd VideoMAE
pip install -e .
```

Usage

Several variants of the VideoMAE model are available, i.e. small, base, large, and huge, with benchmark-specific checkpoints for Kinetics-400, SSV2, and UCF101. Check this release and the model zoo page for details.
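
As a rough sketch of how a benchmark-specific fine-tuned variant might be instantiated (only the ViT-S classes appear in this README; the per-benchmark class counts, 400 for Kinetics-400, 174 for SSV2, and 101 for UCF101, are standard, but treat the exact constructor arguments as assumptions based on the examples below):

```python
from videomae import VideoMAE_ViTS16FT

# Classifier size is the main difference between benchmark-specific checkpoints.
NUM_CLASSES = {"kinetics-400": 400, "ssv2": 174, "ucf101": 101}

k400_model = VideoMAE_ViTS16FT(img_size=224, patch_size=16,
                               num_classes=NUM_CLASSES["kinetics-400"])
ucf_model = VideoMAE_ViTS16FT(img_size=224, patch_size=16,
                              num_classes=NUM_CLASSES["ucf101"])
```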

Pre-trained Masked Autoencoder

Only the inference part is provided for the pre-trained VideoMAE models. With a trained checkpoint, the input sample can be reconstructed even at a high mask ratio. For the end-to-end workflow, check this reconstruction.ipynb notebook. Some highlights:

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16PT

# pre-trained self-supervised model
model = VideoMAE_ViTS16PT(img_size=224, patch_size=16)
model.load_weights('TFVideoMAE_B_K400_16x224_PT.h5')

# tube masking (TubeMaskingGenerator and window_size come from this repo's utilities)
tube_mask = TubeMaskingGenerator(
    input_size=window_size,
    mask_ratio=0.80
)
make_bool = tube_mask()
bool_masked_pos_tf = tf.constant(make_bool, dtype=tf.int32)
bool_masked_pos_tf = tf.expand_dims(bool_masked_pos_tf, axis=0)
bool_masked_pos_tf = tf.cast(bool_masked_pos_tf, tf.bool)

# running (read_video and frame_sampling are helper utilities from this repo)
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
pred_tf = model(frames, bool_masked_pos_tf)
pred_tf.numpy().shape
# TensorShape([1, 1176, 1536])
```

A reconstructed result on a sample from SSV2 with mask_ratio=0.8:

Fine Tuned Model

With a fine-tuned VideoMAE checkpoint, the benchmark dataset can be evaluated and the model can also be retrained on a custom dataset. For the end-to-end workflow, check this quick retraining.ipynb notebook; it supports both multi-GPU and TPU-VM retraining and evaluation (a minimal retraining sketch follows the classification example below). Some highlights:

```python
import numpy as np
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

# read_video, frame_sampling, and label_map_inv (index -> class name)
# are helper utilities from this repo
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
y = model(frames)
y.shape
# TensorShape([1, 400])

probabilities = tf.nn.softmax(y)
probabilities = probabilities.numpy().squeeze(0)
confidences = {
    label_map_inv[i]: float(probabilities[i])
    for i in np.argsort(probabilities)[::-1]
}
confidences
```

A classification result on a sample from Kinetics-400:

| Video | Top-5 |
| :---: | :--- |
| *(sample clip)* | `{'playing_cello': 0.6552159786224365, 'snowkiting': 0.0018940207082778215, 'deadlifting': 0.0018381892004981637, 'playing_guitar': 0.001778001431375742, 'playing_recorder': 0.0017528659664094448}` |
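
Because the fine-tuned variants are regular Keras models, a multi-GPU retraining loop on a custom dataset can be sketched with tf.distribute. This is a minimal outline, not the repository's retraining notebook; build_video_dataset is a hypothetical helper yielding batches of (frames, label) with frames shaped (batch, 16, 224, 224, 3).

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

strategy = tf.distribute.MirroredStrategy()  # or tf.distribute.TPUStrategy on a TPU-VM

with strategy.scope():
    model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=10)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        # the usage example above applies softmax to the output, so the head emits logits
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# build_video_dataset is hypothetical: a tf.data pipeline of (frames, label) batches.
train_ds = build_video_dataset("train")
val_ds = build_video_dataset("val")
model.fit(train_ds, validation_data=val_ds, epochs=5)
```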

Model Zoo

The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. The following are some highlights.

Kinetics-400

For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats (a loading sketch follows the table below).

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :--: | :--: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ? | - |

? The official ViT-H backbone of VideoMAE has a weight issue in the pre-trained model; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details. Only the FLOPs of the encoder (FT) models are reported.
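
Either format loads with standard Keras calls; a hedged sketch follows (the file names below are placeholders modeled on the example above, check the release page for the actual checkpoint names):

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

# H5 weights: build the architecture first, then load weights into it.
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
model.load_weights("TFVideoMAE_S_K400_16x224_FT.h5")  # placeholder file name

# SavedModel directory: restore as a standalone Keras model.
restored = tf.keras.models.load_model("TFVideoMAE_S_K400_16x224_FT")  # placeholder path
```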

Something-Something V2

For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :------: | :-----: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |

UCF101

For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.

| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :-----: | :---: | :---: | :---: | :---: | :---: |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |

Visualization

Some reconstructed video samples using VideoMAE with different mask ratios:

Reconstruction examples (video figures omitted here): Kinetics-400 test set at mask ratios 0.8, 0.8, 0.9, and 0.9; SSv2 test set at mask ratios 0.9 and 0.9; UCF101 test set at mask ratios 0.8 and 0.9.

TODO

  • [x] Custom fine-tuning code.
  • [ ] Publish on TF-Hub.
  • [ ] Support Keras V3 to enable the multi-framework backend.

Citation

If you use this videomae implementation in your research, please cite it using the metadata from our CITATION.cff file.

```bibtex
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```

Owner

  • Name: Mohammed Innat
  • Login: innat
  • Kind: user
  • Location: Dhaka, Bangladesh
  • Company: 株式会社 調和技研 | CHOWA GIKEN Corp

AI Research Software Engineer | Kaggler

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: videomae-keras
message: >-
  If you use this implementation, please cite it using the  
  metadata from this file
type: software
authors:
  - given-names: Mohammed
    family-names: Innat
    email: innat.dev@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/innat/VideoMAE'
    description: Keras reimplementation of VideoMAE
keywords:
  - software
license: MIT
version: 1.0.0
date-released: '2023-10-13'

GitHub Events

Total
  • Issues event: 2
  • Watch event: 5
Last Year
  • Issues event: 2
  • Watch event: 5

Dependencies

requirements.txt pypi
  • opencv-python >=4.1.2
  • tensorflow >=2.12
setup.py pypi
  • opencv-python >=4.1.2
  • tensorflow >=2.12