https://github.com/animesh/multimae

Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (8.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

Basic Info

Host: GitHub
Owner: animesh
License: other
Default Branch: main
Homepage: https://multimae.epfl.ch
Size: 2.12 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of EPFL-VILAB/MultiMAE

Created about 4 years ago · Last pushed over 4 years ago

https://github.com/animesh/MultiMAE/blob/main/

# MultiMAE: Multi-modal Multi-task Masked Autoencoders

[Roman Bachmann*](https://roman-bachmann.github.io/), [David Mizrahi*](https://dmizrahi.com), [Andrei Atanov](https://andrewatanov.github.io/), [Amir Zamir](https://vilab.epfl.ch/zamir/)


 [`Website`](https://multimae.epfl.ch) | [`arXiv`](https://arxiv.org/abs/2204.01678) | [`BibTeX`](#citation)



Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.






We introduce Multi-modal Multi-task Masked Autoencoders (**MultiMAE**), an efficient and effective pre-training strategy for Vision Transformers. 
Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. 
Once pre-trained, a single MultiMAE encoder can then be used for both single-modal and multi-modal downstream transfer, yielding competitive to or significantly better results than the baselines. 

## Catalog
- [x] Pre-trained models
- [x] MultiMAE pre-training code
- [x] ImageNet-1K classification fine-tuning code
- [x] Semantic segmentation fine-tuning code (single-modal & multi-modal)
- [x] Depth estimation fine-tuning code
- [x] Taskonomy fine-tuning code
- [ ] Colab demo (coming soon)


## Pre-trained models

We provide the weights of our pre-trained MultiMAE ViT-B model, in MultiViT (multi-modal) format and [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm) (RGB-only) format. 

For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the [official MAE codebase](https://github.com/facebookresearch/mae) following the recommended settings.

| Method   	     | Arch. 	 | Pre-training
modalities 	 | Pre-training
epochs 	 | Weights
(MultiViT) 	     | Weights
(timm) 	     | Config  	                                                                  |
|----------------|---------|------------------------------|--------------------------|-----------------------------|-------------------------|----------------------------------------------------------------------------|
| MAE      	     | ViT-B 	 | RGB                        	 | 1600                   	 | [download](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/mae-b_dec512d8b_1600e_multivit-c477195b.pth)                  	 | [download](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/mae-b_dec512d8b_1600e_timm-f74f3a8d.pth)              	 | See [MAE](https://github.com/facebookresearch/mae/blob/main/PRETRAIN.md) 	 |
| **MultiMAE** 	 | ViT-B 	 | RGB+D+S                    	 | 1600                   	 | [**download**](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_multivit-afff3f8c.pth)                  	 | [**download**](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_timm-bafa5499.pth)             	 | [link](cfgs/pretrain/multimae-b_98_rgb+-depth-semseg_1600e.yaml) 	         |

These pre-trained models can then be fine-tuned using this codebase to reach the following performance:




  
    Method
    Classif. (@1)
    Semantic Segmentation (mIoU)
    Depth (1)
  


  
    
     ImageNet-1K
(RGB)

    ADE20K
(RGB)

    Hypersim
(RGB / D / RGB + D)

    NYUv2
(RGB / D / RGB + D)

    NYUv2
(RGB)

  
  
    Sup. (DeiT)
    81.8
    45.8
    33.9
    -
    -
    50.1
    -
    -
    80.7
  
  
    MAE
    83.3
    46.2
    36.5
    -
    -

    50.8
    -
    -
    85.1
  
  
    MultiMAE
    83.3
    46.2
    37.0
    38.5
    47.6
    52.0
    41.4
    56.0
    86.4
  



### Model formats

We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm), [DINO](https://github.com/facebookresearch/dino
), [MAE](https://github.com/facebookresearch/mae)), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See [`multimae/multimae.py`](multimae/multimae.py) for the documentation and implementation of MultiMAE / MultiViT.

You can convert between these formats using the provided [`vit2multimae_converter.py`](tools/vit2multimae_converter.py) and [`multimae2vit_converter.py`](tools/multimae2vit_converter.py) scripts.

## Usage

### Set-up

See [SETUP.md](SETUP.md) for set-up instructions.

### Pre-training

See [PRETRAINING.md](PRETRAINING.md) for pre-training instructions.

### Fine-tuning

See [FINETUNING.md](FINETUNING.md) for fine-tuning instructions.


## Acknowledgement

This repository is built using the [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm), [DeiT](https://github.com/facebookresearch/deit), [DINO](https://github.com/facebookresearch/dino
), [MoCo v3](https://github.com/facebookresearch/moco-v3), [BEiT](https://github.com/microsoft/unilm/tree/master/beit), [MAE-priv](https://github.com/BUPT-PRIV/MAE-priv), and [MAE](https://github.com/facebookresearch/mae) repositories.

## License

This project is under the CC-BY-NC 4.0 license. See [LICENSE](LICENSE) for details.

## Citation

If you find this repository helpful, please consider citing our work:

```BibTeX
@article{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  journal   = {arXiv preprint arXiv:2204.01678},
  year      = {2022},
}
```

Method	Classif. (@1)	Semantic Segmentation (mIoU)	Depth (1)
	ImageNet-1K (RGB)	ADE20K (RGB)	Hypersim (RGB / D / RGB + D)	NYUv2 (RGB / D / RGB + D)	NYUv2 (RGB)
Sup. (DeiT)	81.8	45.8	33.9	-	-	50.1	-	-	80.7
MAE	83.3	46.2	36.5	-	-	50.8	-	-	85.1
MultiMAE	83.3	46.2	37.0	38.5	47.6	52.0	41.4	56.0	86.4

Owner

Name: Ani
Login: animesh
Kind: user
Location: Norway
Company: Norwegian University of Science and Technology

Website: https://www.fuzzylife.org
Twitter: animesh1977
Repositories: 749
Profile: https://github.com/animesh

A medical graduate from Delhi University with post-graduation in bioinformatics from Jawaharlal Nehru University, India.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/animesh/multimae

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/animesh/MultiMAE/blob/main/

Owner

GitHub Events

Total

Last Year