videomae
[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.9%) to scientific vocabulary
Keywords
Repository
[NeurIPS'22] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Basic Info
- Host: GitHub
- Owner: innat
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2203.12602
- Size: 13.9 MB
Statistics
- Stars: 20
- Watchers: 2
- Forks: 3
- Open Issues: 3
- Releases: 2
Topics
Metadata Files
README.md
VideoMAE

Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Drawing inspiration from the recent ImageMAE, VideoMAE uses customized video tube masking with an extremely high masking ratio. This simple design makes video reconstruction a more challenging self-supervision task, encouraging the model to learn more effective video representations during pre-training. A minimal sketch of tube masking is shown after the highlights below. Some highlights of VideoMAE:
- Masked Video Modeling for Video Pre-Training
- A Simple, Efficient and Strong Baseline in SSVP
- High performance, but NO extra data required
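The key idea of tube masking is that one spatial mask is shared across the temporal dimension, so masked patches form "tubes" through time and cannot be trivially reconstructed from neighboring frames. The NumPy sketch below is only illustrative and not part of this repository's API (the repo exposes a `TubeMaskingGenerator`, used later in this README):

```python
import numpy as np

def tube_mask(temporal_tokens, spatial_tokens, mask_ratio=0.9, seed=None):
    """Illustrative tube masking: draw one random spatial mask and repeat it
    across every temporal token slice, producing tubes of masked patches
    through time. Hypothetical helper, not this repository's API."""
    rng = np.random.default_rng(seed)
    num_masked = int(mask_ratio * spatial_tokens)
    frame_mask = np.zeros(spatial_tokens, dtype=bool)
    frame_mask[rng.choice(spatial_tokens, num_masked, replace=False)] = True
    # repeat the same spatial pattern along the temporal dimension
    return np.tile(frame_mask, (temporal_tokens, 1)).reshape(-1)

# e.g. 16 frames with tubelet size 2 -> 8 temporal slices,
# (224 // 16) ** 2 = 196 spatial tokens, 90% masking
mask = tube_mask(8, 196, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (1568,) ~0.9
```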
This is an unofficial Keras implementation of the VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found here.
News
- [27-10-2023]: Code of Video-FocalNet in Keras is now available.
- [24-10-2023]: The Kinetics-400 test set is available on Kaggle, link.
- [15-10-2023]: Code of UniFormerV2 (UFV2) in Keras is now available.
- [09-10-2023]: TensorFlow SavedModel (format) checkpoints, link.
- [06-10-2023]: Code of Video Swin Transformer in Keras is now available.
- [06-10-2023]: VideoMAE integrated into a Hugging Face Space.
- [04-10-2023]: VideoMAE checkpoints for SSV2 and UCF101 are now available, link.
- [03-10-2023]: VideoMAE checkpoints for Kinetics-400 are now available, link.
- [29-09-2023]: GPU(s) and TPU-VM are supported for fine-tuning.
- [27-09-2023]: Code of VideoMAE in Keras is now available.
Install
```bash
git clone https://github.com/innat/VideoMAE.git
cd VideoMAE
pip install -e .
```
Usage
Several variants of the VideoMAE model are available, i.e. small, base, large, and huge, each with checkpoints for specific benchmark datasets, i.e. Kinetics-400, SSV2, and UCF101. Check the release and model zoo pages for details.
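In practice, the number of output classes must match the benchmark the checkpoint was fine-tuned on: 400 for Kinetics-400, 174 for SSv2, and 101 for UCF101. A hedged sketch using the small fine-tuned variant shown later in this README (class names for other variants and the exact checkpoint file names should be taken from the model zoo, not from this sketch):

```python
from videomae import VideoMAE_ViTS16FT

# num_classes must match the benchmark of the checkpoint you load:
# Kinetics-400 -> 400, Something-Something V2 -> 174, UCF101 -> 101
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
# model.load_weights('<checkpoint-from-model-zoo>.h5')  # placeholder path
```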
Pre-trained Masked Autoencoder
Only the inference part is provided for the pre-trained VideoMAE models. With a trained checkpoint, it is possible to reconstruct the input sample even at a high mask ratio. For the end-to-end workflow, check the reconstruction.ipynb notebook. Some highlights:
```python
from videomae import VideoMAE_ViTS16PT

# pre-trained self-supervised model
model = VideoMAE_ViTS16PT(img_size=224, patch_size=16)
model.load_weights('TFVideoMAE_B_K400_16x224_PT.h5')

# tube masking (window_size is the token grid, e.g. (8, 14, 14) for
# 16 frames at 224x224 with patch size 16 and tubelet size 2)
tube_mask = TubeMaskingGenerator(
    input_size=window_size,
    mask_ratio=0.80
)
make_bool = tube_mask()
bool_masked_pos_tf = tf.constant(make_bool, dtype=tf.int32)
bool_masked_pos_tf = tf.expand_dims(bool_masked_pos_tf, axis=0)
bool_masked_pos_tf = tf.cast(bool_masked_pos_tf, tf.bool)

# running
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
pred_tf = model(frames, bool_masked_pos_tf)
pred_tf.numpy().shape  # TensorShape([1, 1176, 1536])
```
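The output shape can be read as [batch, masked tokens, per-token pixel values]. A small arithmetic sketch, assuming the usual VideoMAE tokenization (tubelet size 2, patch size 16, 224x224 input, 16 frames):

```python
temporal_slices = 16 // 2            # 8 tubelet slices along time
spatial_tokens  = (224 // 16) ** 2   # 196 patches per slice
total_tokens    = temporal_slices * spatial_tokens  # 1568 tokens in total
token_dim       = 16 * 16 * 3 * 2    # 1536 pixel values per reconstructed tubelet patch
# the middle axis of the prediction (1176 above) is the number of masked tokens,
# i.e. a mask_ratio-dependent fraction of the 1568 total tokens
print(total_tokens, token_dim)  # 1568 1536
```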
Reconstructed results on a sample from SSV2 with mask_ratio=0.8:

Fine Tuned Model
With the fine-tuned VideoMAE checkpoints, it is possible to evaluate on the benchmark datasets and also to retrain on a custom dataset. For the end-to-end workflow, check the retraining.ipynb notebook; it supports both multi-GPU and TPU-VM retraining and evaluation. Some highlights:
```python
from videomae import VideoMAE_ViTS16FT

# fine-tuned model (Kinetics-400, 400 classes)
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
container = read_video('sample.mp4')
frames = frame_sampling(container, num_frames=16)
y_pred_tf = model(frames)
y_pred_tf.shape  # TensorShape([1, 400])

# map logits to class confidences, sorted from most to least likely
probabilities = tf.nn.softmax(y_pred_tf)
probabilities = probabilities.numpy().squeeze(0)
confidences = {
    label_map_inv[i]: float(probabilities[i])
    for i in np.argsort(probabilities)[::-1]
}
confidences
```

Classification results on a sample from Kinetics-400:
| Video | Top-5 |
|:---:|:---|
| (sample clip not shown) | {<br>'playing_cello': 0.6552159786224365,<br>'snowkiting': 0.0018940207082778215,<br>'deadlifting': 0.0018381892004981637,<br>'playing_guitar': 0.001778001431375742,<br>'playing_recorder': 0.0017528659664094448<br>} |
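Since the fine-tuned model is a standard Keras model, retraining on a custom dataset can follow the usual compile/fit workflow. A minimal sketch, assuming `train_ds` and `val_ds` are hypothetical tf.data pipelines yielding `(frames, labels)` batches with frames shaped `[B, 16, 224, 224, 3]` (see the retraining.ipynb notebook for the repository's actual end-to-end recipe):

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

# hypothetical custom dataset with 10 classes; the dataset pipeline
# below is a placeholder, not part of this repository
model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=10)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```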
Model Zoo
The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. The following are some highlights.
Kinetics-400
For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow SavedModel and H5 formats.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
| ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
| ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
| ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ? | - |
? The official ViT-H backbone of VideoMAE has a weight issue in its pre-trained model; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.
Only the FLOPs of the fine-tuned (FT) encoder models are reported.
Something-Something V2
For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
| ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |
UCF101
For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.
| Backbone | #Frame | Top-1 | Top-5 | Params [FT] (MB) | Params [PT] (MB) | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |
Visualization
Some reconstructed video samples using VideoMAE with different mask ratios are shown below (the rendered clips are omitted here; only the dataset and mask ratio are listed).
- Kinetics-400 test set: mask ratios 0.8, 0.8, 0.9, 0.9
- SSv2 test set: mask ratios 0.9, 0.9
- UCF101 test set: mask ratios 0.8, 0.9
TODO
- [x] Custom fine-tuning code.
- [ ] Publish on TF-Hub.
- [ ] Support Keras V3 for a multi-framework backend.
Citation
If you use this videomae implementation in your research, please cite it using the metadata from our CITATION.cff file.
```bibtex
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```
Owner
- Name: Mohammed Innat
- Login: innat
- Kind: user
- Location: Dhaka, Bangladesh
- Company: 株式会社 調和技研 | CHOWA GIKEN Corp
- Website: https://www.linkedin.com/in/innat2k14/
- Twitter: m_innat
- Repositories: 139
- Profile: https://github.com/innat
AI Research Software Engineer | Kaggler
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: videomae-keras
message: >-
  If you use this implementation, please cite it using the
  metadata from this file
type: software
authors:
  - given-names: Mohammed
    family-names: Innat
    email: innat.dev@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/innat/VideoMAE'
    description: Keras reimplementation of VideoMAE
keywords:
  - software
license: MIT
version: 1.0.0
date-released: '2023-10-13'
```
GitHub Events
Total
- Issues event: 2
- Watch event: 5
Last Year
- Issues event: 2
- Watch event: 5
Dependencies
- opencv-python >=4.1.2
- tensorflow >=2.12