https://github.com/cliangyu/emu

Emu: An Open Multimodal Generalist


Repository

Emu: An Open Multimodal Generalist

Basic Info

Host: GitHub
Owner: cliangyu
Language: Python
Default Branch: main
Homepage: https://arxiv.org/abs/2307.05222
Size: 17 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of baaivision/Emu

Created over 2 years ago · Last pushed over 2 years ago

https://github.com/cliangyu/Emu/blob/main/



Emu: An Open Multimodal Generalist
Generative Pretraining in Multimodality

[Quan Sun](https://github.com/Quan-Sun)¹\*, [Qiying Yu](https://yqy2001.github.io)²,¹\*, Yufeng Cui¹\*, [Fan Zhang](https://scholar.google.com/citations?user=VsJ39HMAAAAJ)¹\*, [Xiaosong Zhang](https://github.com/zhangxiaosong18)¹\*, Yueze Wang¹, [Hongcheng Gao](https://hongcheng-gao.github.io/)¹,
[Jingjing Liu](https://air.tsinghua.edu.cn/en/info/1046/1194.htm)², [Tiejun Huang](https://scholar.google.com/citations?user=knvEK4AAAAAJ&hl=en)¹,³, [Xinlong Wang](https://www.xloong.wang/)¹

¹ [BAAI](https://www.baai.ac.cn/english.html), ² [THU](https://air.tsinghua.edu.cn), ³ [PKU](https://english.pku.edu.cn/)

\* Equal Contribution

|  [Paper](https://arxiv.org/abs/2307.05222) | [Demo](https://emu.ssi.plus/) |

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/generative-pretraining-in-multimodality/visual-question-answering-on-mm-vet-w-o)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?tag_filter=0)


**Emu is a multimodal generalist that can seamlessly generate images and text in a multimodal context**. **Emu** is trained with a unified autoregressive objective, *i.e.*, predict-the-next-element, where the next element can be either a visual embedding or a textual token. Trained under this objective, **Emu** can serve as a generalist interface for both image-to-text and text-to-image tasks.

![](assets/Emu.png)
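
As a rough illustration of the predict-the-next-element objective, the sketch below scores a mixed sequence position by position: cross-entropy where the next element is a text token, squared error where it is a visual embedding. The choice of these two per-element losses, the function names, and the toy data are assumptions made for this example; this is not Emu's training code.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_id]

def squared_error(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def next_element_loss(predictions, targets):
    """Average per-position loss over a mixed text/image sequence.

    targets[i] is ("text", token_id) or ("image", embedding);
    predictions[i] is the matching logits list or predicted embedding.
    """
    total = 0.0
    for pred, (kind, value) in zip(predictions, targets):
        if kind == "text":
            total += cross_entropy(pred, value)
        else:
            total += squared_error(pred, value)
    return total / len(targets)

# Toy sequence: one text token followed by one 2-d visual embedding.
targets = [("text", 1), ("image", [0.5, -0.5])]
predictions = [[0.0, 0.0], [0.5, 0.5]]
loss = next_element_loss(predictions, targets)
```

The point of the sketch is only that a single left-to-right pass can supervise both modalities, with the loss chosen by the type of the element being predicted.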

## News

* `Oct 16, 2023`: **Emu-I** achieves [state-of-the-art performance](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?tag_filter=0) on the [MM-Vet](https://github.com/yuweihao/MM-Vet) benchmark (w/o external tools like GPT-4), which assesses large multimodal models in real-world, in-the-wild scenarios.
* `Oct 13, 2023`: The code for the zero-shot evaluation of **Emu-I** has been released!
* `Sep 18, 2023`: Tools for processing the YT-Storyboard-1B dataset have been released!

## Generalist Interface

**Emu** serves as a generalist interface capable of diverse multimodal tasks, such as image captioning, image/video question answering, and text-to-image generation, together with new abilities like in-context text and image generation, and image blending:

![](assets/generalist.png)

## Setup

Clone this repository and install required packages:

```shell
git clone https://github.com/baaivision/Emu
cd Emu

pip install -r requirements.txt
```

## Model Weights

We release the pretrained and instruction-tuned weights of **Emu**. Our weights are subject to LLaMA-1's [license](https://github.com/facebookresearch/llama/blob/1076b9c51c77ad06e9d7ba8a4c6df775741732bd/LICENSE).

| Model name         | Weight                                                  |
| ------------------ | ------------------------------------------------------- |
| **Emu w/ Decoder** | [HF link](https://huggingface.co/BAAI/Emu/tree/main/pretrain) (34GB) |
| **Emu-I**          | [HF link](https://huggingface.co/BAAI/Emu/blob/main/Emu-instruct.pt) (27GB) |

## Inference

At present, we provide inference code that takes interleaved image-text and **video** as input and produces text and images as output.

For the instruction-tuned model, we provide examples for image captioning, visual question answering, and interleaved multi-image understanding:

```sh
python inference.py --instruct --ckpt-path ${INSTRUCT_CKPT_PATH}
```

For the pretrained model, we provide an example for in-context learning:

```sh
python inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}/multimodal_encoder/pytorch_model.bin
```

For image generation, we provide examples for image blending, text-to-image and in-context generation:

```sh
python image_inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}
```
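
The three commands above differ only in the script, the `--instruct` flag, and what `--ckpt-path` points to. A minimal sketch of a helper that assembles them is shown below; the function name, the mode labels, and the example path are invented for illustration, and only the scripts and flags shown above are used.

```python
import shlex

def build_inference_command(mode, ckpt):
    """Return the argv list for one of the inference modes shown above.

    mode: "instruct", "pretrain", or "image" (labels invented for this sketch).
    ckpt: checkpoint file for "instruct", checkpoint directory otherwise.
    """
    if mode == "instruct":
        return ["python", "inference.py", "--instruct", "--ckpt-path", ckpt]
    if mode == "pretrain":
        # The pretrained text model lives inside the checkpoint directory.
        weights = ckpt.rstrip("/") + "/multimodal_encoder/pytorch_model.bin"
        return ["python", "inference.py", "--ckpt-path", weights]
    if mode == "image":
        return ["python", "image_inference.py", "--ckpt-path", ckpt]
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical checkpoint location, used only to show the resulting command.
cmd = build_inference_command("pretrain", "/data/emu/pretrain")
print(shlex.join(cmd))
# To run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```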

## Evaluation

We provide **Emu-I**'s zero-shot evaluation code on MM-Vet, COCO Caption, VQAv2, OKVQA, VizWiz and VisDial benchmarks. For example, evaluating COCO captioning on a node with 8 GPUs:
```sh
# --dataset_name accepts: coco, mmvet, vqav2, okvqa, vizwiz, visdial
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env \
    eval.py \
    --instruct \
    --batch_size 4 \
    --ckpt_path ${INSTRUCT_CKPT_PATH} \
    --root_path /path/to/benchmark_root \
    --dataset_name coco \
    --output_path ./output/
```
where `/path/to/benchmark_root` should contain the following file structure:
```
benchmark_root/
    mm-vet/
        mm-vet.json
        images/
            v1_0.png
            ...
    coco/
        images/
            test2015/
                COCO_test2015_{...}.jpg
                ...
            val2014/
                COCO_val2014_{...}.jpg
                ...
        annotations/
            coco_karpathy_test.json
            coco_karpathy_test_gt.json
            coco_karpathy_val.json
            coco_karpathy_val_gt.json
            v2_OpenEnded_mscoco_val2014_questions.json
            v2_mscoco_val2014_annotations.json
            vqa_test.json
            vqa_val_eval.json
    okvqa/
        annotations/
            OpenEnded_mscoco_val2014_questions.json
            mscoco_val2014_annotations.json
            vqa_val_eval.json
    vizwiz/
        images/
            test/
                VizWiz_test_{...}.jpg
                ...
            val/
                VizWiz_val_{...}.jpg
                ...
        annotations/
            test.json
            val.json
    visdial/
        VisualDialog_test2018/
            VisualDialog_test2018_{...}.jpg
            ...
        VisualDialog_val2018/
            VisualDialog_val2018_{...}.jpg
            ...
        visdial_1.0_test.json
        visdial_1.0_val.json
```
You can also use your own file structure by modifying the corresponding data loading code; each dataset's file can be found in the `mm_eval/datasets/` directory. All benchmark files can be downloaded from the official dataset websites or from [LAVIS](https://github.com/salesforce/LAVIS).
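
Before launching an evaluation, it can help to confirm that the benchmark root matches the tree above. The sketch below is one way to do that; `EXPECTED` and `missing_entries` are names made up for this example, the paths are copied from the tree, and wildcard image files are not checked. Which entries a given `--dataset_name` actually reads is not stated here, so the check is grouped by top-level folder instead.

```python
from pathlib import Path

# Relative paths copied from the tree above, grouped by top-level folder.
EXPECTED = {
    "mm-vet": ["mm-vet.json", "images"],
    "coco": [
        "images/test2015",
        "images/val2014",
        "annotations/coco_karpathy_test.json",
        "annotations/coco_karpathy_test_gt.json",
        "annotations/coco_karpathy_val.json",
        "annotations/coco_karpathy_val_gt.json",
        "annotations/v2_OpenEnded_mscoco_val2014_questions.json",
        "annotations/v2_mscoco_val2014_annotations.json",
        "annotations/vqa_test.json",
        "annotations/vqa_val_eval.json",
    ],
    "okvqa": [
        "annotations/OpenEnded_mscoco_val2014_questions.json",
        "annotations/mscoco_val2014_annotations.json",
        "annotations/vqa_val_eval.json",
    ],
    "vizwiz": [
        "images/test",
        "images/val",
        "annotations/test.json",
        "annotations/val.json",
    ],
    "visdial": [
        "VisualDialog_test2018",
        "VisualDialog_val2018",
        "visdial_1.0_test.json",
        "visdial_1.0_val.json",
    ],
}

def missing_entries(root, folder):
    """List expected paths under root/folder that do not exist."""
    base = Path(root) / folder
    return [str(base / rel) for rel in EXPECTED[folder] if not (base / rel).exists()]

if __name__ == "__main__":
    # Replace with the directory passed to --root_path.
    for folder in EXPECTED:
        for path in missing_entries("/path/to/benchmark_root", folder):
            print("missing:", path)
```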


## Schedule

We are committed to open-sourcing all Emu-related materials, including:

- [x] The weights of **Emu** and **Emu-I**
- [x] Inference example for interleaved image-text as input, text as output
- [x] Video inference example
- [x] Weights of image decoder & image generation/blending example
- [x] YT-Storyboard-1B pretraining data
- [ ] Pretraining code
- [ ] Instruction tuning code
- [x] Evaluation code

We hope to foster the growth of our community through open-sourcing and promoting collaboration. Let's step towards multimodal intelligence together.

## Acknowledgement

We are grateful for the great work of [LLaMA](https://github.com/facebookresearch/llama), [BLIP-2](https://github.com/salesforce/LAVIS), [Stable Diffusion](https://github.com/CompVis/stable-diffusion), and [FastChat](https://github.com/lm-sys/FastChat).

## Citation

If you find Emu useful for your research and applications, please consider starring this repository and citing:

```
@article{Emu,
  title={Generative Pretraining in Multimodality},
  author={Sun, Quan and Yu, Qiying and Cui, Yufeng and Zhang, Fan and Zhang, Xiaosong and Wang, Yueze and Gao, Hongcheng and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  journal={arXiv preprint arXiv:2307.05222},
  year={2023},
}
```

## Misc



[![Stargazers repo roster for @baaivision/Emu](https://reporoster.com/stars/baaivision/Emu)](https://github.com/baaivision/Emu/stargazers)


[![Forkers repo roster for @baaivision/Emu](https://reporoster.com/forks/baaivision/Emu)](https://github.com/baaivision/Emu/network/members)


[![Star History Chart](https://api.star-history.com/svg?repos=baaivision/Emu&type=Date)](https://star-history.com/#baaivision/Emu&Date)

Owner

Name: Liangyu Chen
Login: cliangyu
Kind: user
Location: Singapore
Company: Nanyang Technological University

Website: cliangyu.com
Twitter: cliangyu_
Repositories: 1
Profile: https://github.com/cliangyu
