o2-magvit2

Open Source Implementation of Dual Modality MAGVIT2 Tokenizer

https://github.com/cofe-ai/o2-magvit2

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Open Source Implementation of Dual Modality MAGVIT2 Tokenizer

Basic Info
  • Host: GitHub
  • Owner: cofe-ai
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 5.69 MB
Statistics
  • Stars: 16
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

O2-MAGVIT2

Video reconstruction with O2-MAGVIT2-preview (under 720p)

Introduction

We present O2-MAGVIT2, an open-source PyTorch implementation of Google's MAGVIT-v2 visual tokenizer. The name reflects its support for dual-modality (image and video) tokenization with a single tokenizer. O2-MAGVIT2 follows MAGVIT-v2 closely: it uses a lookup-free quantizer (LFQ) with a codebook size of $2^{18}$ and the same encoder, decoder, and discriminator architecture described in the original paper. To facilitate training, we wrap the trainer with Hugging Face's accelerate. We also release a preview version of the video tokenizer, trained on a Panda-70M subset, to validate its performance.
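As a rough illustration of the lookup-free quantization idea (not this repository's actual implementation; shapes and function names here are assumptions), LFQ replaces a learned codebook lookup with per-dimension binarization, so an 18-dimensional latent maps directly to one of the $2^{18}$ codes:

```python
import torch

def lfq_quantize(z: torch.Tensor):
    """Minimal lookup-free quantization sketch (illustrative only).

    z: (..., 18) latent from the encoder; each dimension is binarized by its
    sign, so the implicit codebook is the set of all 2**18 sign patterns.
    """
    # Binarize each latent dimension to {-1, +1}.
    codes = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    # Interpret the sign pattern as bits to get an integer token index.
    bits = (codes > 0).long()                              # (..., 18) in {0, 1}
    powers = 2 ** torch.arange(z.shape[-1], device=z.device)
    indices = (bits * powers).sum(dim=-1)                  # values in [0, 2**18)
    # Straight-through estimator so gradients flow back to the encoder.
    codes = z + (codes - z).detach()
    return codes, indices

# Example: a batch of 4 latent vectors with 18 channels.
z = torch.randn(4, 18)
codes, indices = lfq_quantize(z)
print(codes.shape, indices)
```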

Architecture

We re-implemented MAGVIT-v2's architecture exactly. The diagram below is reproduced from the MAGVIT-v2 paper's supplementary material.

Quick start

  • Inference: edit the arguments in scripts/run_inference.sh and run the following command to see the reconstruction result: bash scripts/run_inference.sh

    run python inference.py -h for more details.

  • Training: edit the config under configs/, then run the following command to train the model (a minimal sketch of the accelerate trainer wrapping follows this list):

```bash
NODE_RANK=0 MASTER_ADDR=localhost:25001 NUM_NODES=1 NUM_GPUS=8
bash scripts/run_train_3d.sh $NODE_RANK $MASTER_ADDR $NUM_NODES $NUM_GPUS
```
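The introduction notes that the trainer is wrapped with Hugging Face's accelerate. As a minimal, illustrative sketch of what that wrapping typically looks like (the model, optimizer, and data below are placeholders, not this repository's actual trainer):

```python
import torch
from accelerate import Accelerator

# Illustrative accelerate training loop; the real trainer in this repository
# handles the tokenizer model, GAN losses, and the multi-node launch scripts.
accelerator = Accelerator()

model = torch.nn.Linear(16, 16)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

# accelerate handles device placement and distributed wrapping.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)                        # replaces loss.backward()
    optimizer.step()
```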

Training Procedure

The whole training consists of two stages. In stage I, we train an image tokenizer on the OpenImage dataset (which contains 8M training samples) for 10 epochs with batch size 256. In stage II, we randomly sample 9.3M samples from Panda-70M and train the video tokenizer for 1 epoch with batch size 128.
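Assuming one pass over the data per epoch and no gradient accumulation (neither is stated explicitly), this corresponds to roughly 8M / 256 ≈ 31k optimizer steps per epoch in stage I (about 310k steps over 10 epochs), and 9.3M / 128 ≈ 73k steps for the single epoch of stage II.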

Hyperparameters

We adopt almost the same hyper-parameter setting as MAGVIT-v2 with minimal change. See configs/magvit2_3d_model_config.yaml for model setup details and configs/magvit2_3d_train_config.yaml for training setup.
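The configs are plain YAML files, and omegaconf is pinned in requirements.txt, so they can be inspected or overridden programmatically. A minimal sketch, assuming only that the files parse as standard OmegaConf YAML (the dotted key in the override is a hypothetical example, not a field guaranteed to exist in the configs):

```python
from omegaconf import OmegaConf

# Load the model and training configs referenced above.
model_cfg = OmegaConf.load("configs/magvit2_3d_model_config.yaml")
train_cfg = OmegaConf.load("configs/magvit2_3d_train_config.yaml")

# Print the resolved YAML to see every available field.
print(OmegaConf.to_yaml(model_cfg))

# Override a value via a dotlist; "train.batch_size" is a hypothetical key,
# check the YAML files for the actual field names.
train_cfg = OmegaConf.merge(train_cfg, OmegaConf.from_dotlist(["train.batch_size=64"]))
```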

Pretrained Models

We release a pretrained checkpoint of the video tokenizer on Hugging Face as a preview. Note that, due to the limited number of training steps, the model is certainly under-trained and may not perform well enough if used directly. We recommend treating it as a stepping stone and continuing training from it to get better results.

The checkpoint of O2-MAGVIT2-preview can be found here.
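Since huggingface-hub is already pinned in requirements.txt, the preview checkpoint can also be fetched programmatically. A minimal sketch; the repo id below is a placeholder, since the README links the checkpoint rather than naming the repository, so substitute the actual Hugging Face repo:

```python
from huggingface_hub import snapshot_download

# Download every file in the checkpoint repository to the local HF cache.
# NOTE: "cofe-ai/O2-MAGVIT2-preview" is a placeholder repo id; replace it
# with the repository actually linked from the README.
local_dir = snapshot_download(repo_id="cofe-ai/O2-MAGVIT2-preview")
print("checkpoint files downloaded to:", local_dir)
```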

Acknowledgement

We borrow some ideas and implementations from MAGVIT, vector-quantize-pytorch, praxis, LlamaGen, pytorch-image-models, and VQGAN. Many thanks for their excellent work.

Citation

If you find our work interesting, please cite the following reference and give us a star.

@misc{Fang_O2-MAGVIT2,
  author  = {Fang, Xuezhi and Yao, Yiqun and Jiang, Xin and Li, Xiang and Yu, Naitong and Wang, Yequan},
  license = {Apache-2.0},
  title   = {O2-MAGVIT2},
  year    = {2024},
  url     = {https://github.com/cofe-ai/O2-MAGVIT2}
}

Owner

  • Name: cofe-ai
  • Login: cofe-ai
  • Kind: organization
  • Location: China

Big Model AI Groups from BAAI

Citation (citation.cff)

cff-version: 1.2.0
title: O2-MAGVIT2
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: misc
authors:
  - given-names: Xuezhi
    family-names: Fang
  - given-names: Yiqun
    family-names: Yao
  - given-names: Xin
    family-names: Jiang
  - given-names: Xiang
    family-names: Li
  - given-names: Naitong
    family-names: Yu
  - given-names: Yequan
    family-names: Wang
repository-code: 'https://github.com/cofe-ai/O2-MAGVIT2'
abstract: >-
  Open Source Implementation of Dual Modality MAGVIT2
  Tokenizer
license: Apache-2.0

GitHub Events

Total
  • Issues event: 4
  • Watch event: 18
  • Issue comment event: 4
  • Push event: 1
  • Public event: 1
  • Fork event: 1
Last Year
  • Issues event: 4
  • Watch event: 18
  • Issue comment event: 4
  • Push event: 1
  • Public event: 1
  • Fork event: 1

Dependencies

requirements.txt pypi
  • PyYAML ==6.0.2
  • accelerate ==1.0.0
  • arrow ==1.3.0
  • av ==13.1.0
  • beartype ==0.19.0
  • einops ==0.8.0
  • huggingface-hub ==0.26.2
  • moviepy ==1.0.3
  • ninja *
  • numpy ==1.24.4
  • omegaconf ==2.3.0
  • pillow ==10.4.0
  • safetensors ==0.4.5
  • scikit-learn ==1.5.2
  • scipy *
  • tabulate ==0.9.0
  • tensorboard ==2.16.2
  • tensorboard-data-server ==0.7.2
  • tokenizers ==0.20.3
  • torch *
  • torch-fidelity ==0.3.0
  • torchmetrics ==1.5.2
  • torchvision *
  • tqdm ==4.66.5
  • transformers ==4.46.2