mindall-e

PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs

https://github.com/kakaobrain/mindall-e

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: kakaobrain
License: other
Language: Python
Default Branch: main
Homepage:
Size: 44.6 MB

Statistics

Stars: 634
Watchers: 14
Forks: 66
Open Issues: 10
Releases: 0

Created over 4 years ago · Last pushed almost 4 years ago

Metadata Files

Readme License Citation

minDALL-E on Conceptual Captions

minDALL-E, named after minGPT [8], is a 1.3B text-to-image generation model trained on 14 million image-text pairs for non-commercial purposes.

Sample images: "a painting of a bird in the style of asian painting" and "a photo of san francisco's golden gate bridge in black and white tone".

Environment Setup

- Basic setup: PyTorch == 1.8.0, CUDA >= 10.1
- Other packages: `pip install -r requirements.txt`

Model Checkpoint

Model structure (two-stage autoregressive model)
- Stage1: Unlike the original DALL-E [1], we replace the Discrete VAE with VQGAN [2] to generate high-quality samples effectively. We slightly fine-tune vqgan_imagenet_f16_16384, provided by the official VQGAN repository, on FFHQ [3] as well as ImageNet.
- Stage2: We train our 1.3B transformer from scratch on 14 million image-text pairs from CC3M [4] and CC12M [5]. For the detailed model specification, please see configs/dalle-1.3B.yaml.
- You can download the pretrained models, including the tokenizer, from this link. This requires about 5GB of disk space.
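As a rough illustration of the two-stage design, the sketch below works out how many discrete tokens stage 1 would hand to the stage 2 transformer. It assumes that the `f16` and `16384` in the checkpoint name follow the VQGAN naming convention (a 16x spatial downsampling factor and a 16384-entry codebook) and that images are 256x256, as in the transfer-learning configs. The helper `image_token_grid` is hypothetical and not part of the repository.

```python
from typing import Tuple


def image_token_grid(image_size: int, downsample_factor: int) -> Tuple[int, int]:
    """Return (grid side, number of tokens) for a square image."""
    if image_size % downsample_factor != 0:
        raise ValueError("image_size must be divisible by downsample_factor")
    side = image_size // downsample_factor
    return side, side * side


# Assumed values: 256x256 input, f16 downsampling, 16384 codebook entries.
side, n_tokens = image_token_grid(image_size=256, downsample_factor=16)
codebook_size = 16384
print(f"{side}x{side} grid -> {n_tokens} image tokens, each in [0, {codebook_size})")
```

Under these assumptions, each image becomes a 16x16 grid, so the transformer models 256 image tokens per sample.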

Sampling

Given a text prompt, the code snippet below generates candidate images and re-ranks them using OpenAI's CLIP [6].
This has been tested on a single V100 with 32GB of memory. On GPUs with less memory, lower `num_candidates` to avoid OOM errors.

```python
import numpy as np  # needed for np.transpose below
from matplotlib import pyplot as plt
import clip
from dalle.models import Dalle
from dalle.utils.utils import set_seed, clip_score

device = 'cuda:0'
set_seed(0)

prompt = "A painting of a monkey with sunglasses in the frame"
model = Dalle.from_pretrained('minDALL-E/1.3B')  # This will automatically download the pretrained model.
model.to(device=device)

# Sampling
images = model.sampling(prompt=prompt,
                        top_k=256,  # It is recommended that top_k is set lower than 256.
                        top_p=None,
                        softmax_temperature=1.0,
                        num_candidates=96,
                        device=device).cpu().numpy()
# Reorder from (N, C, H, W) to (N, H, W, C) for plotting.
images = np.transpose(images, (0, 2, 3, 1))

# CLIP Re-ranking
model_clip, preprocess_clip = clip.load("ViT-B/32", device=device)
model_clip.to(device=device)
rank = clip_score(prompt=prompt,
                  images=images,
                  model_clip=model_clip,
                  preprocess_clip=preprocess_clip,
                  device=device)

# Plot images
images = images[rank]
plt.imshow(images[0])
plt.show()
```

- For a complete Python script for sampling, see examples/sampling_ex.py.
- For an interactive demo, see examples/sampling_interactive_demo.ipynb. You may need to install ipywidgets first.
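To make the top-k and softmax temperature sampling parameters more concrete, here is a minimal plain-Python sketch of top-k sampling with temperature for a single token. It is an illustration only, not the repository's code: the function `sample_top_k` and the toy logits are hypothetical, and the real model presumably applies the same idea to tensors at every decoding step.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float], top_k: int, temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token index from the top_k highest logits."""
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and len(logits)")
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax weights (max-subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to its softmax weight.
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
draws = [sample_top_k(logits, top_k=2, temperature=1.0, rng=rng) for _ in range(1000)]
# Only the two highest-scoring tokens (indices 3 and 0) can ever be drawn.
print(sorted(set(draws)))
```

A smaller `top_k` restricts sampling to fewer high-probability tokens, which trades diversity for fidelity; a temperature below 1.0 sharpens the distribution in a similar direction.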

Samples (Top-K=256, Temperature=1.0)

"a painting of a {cat, dog} with sunglasses in the frame"
"a large {pink, black} elephant walking on the beach"
"Eiffel tower on a {desert, mountain}"

Quantitative Results

We have validated minDALL-E on the CC3M validation set (in-distribution evaluation) and MS-COCO (zero-shot evaluation).
For CC3M, we measure the cosine similarity between image and text representations from the pretrained CLIP model (ViT-B/32), referred to as CLIP-score.
For MS-COCO, we compute FID between 30K generated and real samples from MS-COCO 2017, where we randomly choose 30K captions from COCO as in DALL-E. We select the best out of 32 candidates by CLIP re-ranking.
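The CLIP-score and best-of-N selection described above can be sketched as follows. This is a simplified illustration that assumes the text and image embeddings have already been computed; the toy 3-dimensional vectors stand in for real CLIP features, and the helpers `cosine_similarity` and `rerank` are hypothetical names, not the repository's `clip_score` function.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_emb: Sequence[float], image_embs: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_emb, emb) for emb in image_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for real CLIP features.
text_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the text
    [2.0, 0.0, 0.0],  # same direction as the text
    [1.0, 1.0, 0.0],  # 45 degrees away from the text
]
order = rerank(text_embedding, candidate_embeddings)
best = order[0]  # index of the candidate kept after re-ranking
print(order, best)
```

Because cosine similarity ignores vector length, the second candidate ranks first even though its norm differs from the text embedding's.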

| Model | CC3M:CLIP-score (higher is better) | MS-COCO:FID-30K (lower is better) |
|:------|----:|----:|
| VQGAN [2] | 0.20 | - |
| ImageBART [7] | 0.23 | - |
| DALL-E [1] | - | 27.5 |
| minDALL-E | 0.26 | 14.7 |
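For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below are introduced only for this note and do not come from the repository: $\mu_r, \Sigma_r$ are the feature mean and covariance of the real images, and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, which is why the table marks FID as "lower is better".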

Transfer Learning Examples

minDALL-E, which is pre-trained on noisy text supervision, can be transferred to class-conditional and unconditional generation tasks. To validate this, we fine-tune it on ImageNet for 8 epochs, once for class-conditional generation and once for unconditional generation.

- The commands below fine-tune the pretrained DALL-E. It takes about 36 hours on 8 V100 GPUs.

```
# unconditional image generation for imagenet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-uncond-gen.yaml -u=[MODEL_CKPT] -r=[RESULT_PATH] --n-gpus=[NUM_GPUS]

# class-conditional image generation for imagenet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-clscond-gen.yaml -u=[MODEL_CKPT] -r=[RESULT_PATH] --n-gpus=[NUM_GPUS]
```

- We compute FID-50K between 50K generated samples and all ImageNet training samples, where we use top-k=256 and softmax temperature=1.0 for generation. All results are obtained without rejection sampling. Interestingly, our model is very competitive with the baselines, even though `minDALL-E` is fine-tuned for only a few epochs.

| Model | Params | FID-50K (class-cond.) | FID-50K (uncond.) |
|:-----|----:|----:|----:|
| VQ-GAN | 1.4B | 15.78 | - |
| ImageBART | 3.5B | 21.19 | - |
| minDALL-E | 1.3B | 15.55 | 37.58 |
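If you prefer to launch the fine-tuning commands from Python, for example to loop over both configs, the sketch below assembles the argument list from the flags shown in the commands above (`-d`, `-u`, `-r`, `--n-gpus`). The helper `build_finetune_command` is hypothetical, the script path is assumed to be `examples/transfer_learning_ex.py`, and the checkpoint and result paths are placeholders you would need to replace.

```python
import shlex
from typing import List


def build_finetune_command(config: str, model_ckpt: str,
                           result_path: str, n_gpus: int) -> List[str]:
    """Assemble the fine-tuning command as an argument list."""
    return [
        "python", "examples/transfer_learning_ex.py",  # assumed script path
        f"-d={config}",
        f"-u={model_ckpt}",
        f"-r={result_path}",
        f"--n-gpus={n_gpus}",
    ]


cmd = build_finetune_command(
    config="configs/transfer-imagenet-clscond-gen.yaml",
    model_ckpt="path/to/model_ckpt",  # placeholder, not a real path
    result_path="path/to/results",    # placeholder, not a real path
    n_gpus=8,
)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.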

BibTex

If you find this repository useful in your research, please cite:

@misc{kakaobrain2021minDALL-E,
  title        = {minDALL-E on Conceptual Captions},
  author       = {Saehoon Kim, Sanghun Cho, Chiheon Kim, Doyup Lee, and Woonhyuk Baek},
  year         = {2021},
  howpublished = {\url{https://github.com/kakaobrain/minDALL-E}},
}

References

[1] Ramesh et al. Zero-Shot Text-to-Image Generation, ICML 2021.
[2] Esser et al. Taming Transformers for High-Resolution Image Synthesis, CVPR 2021.
[3] Karras et al. A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019.
[4] Sharma et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, ACL 2018.
[5] Changpinyo et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, CVPR 2021.
[6] Radford et al. Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
[7] Esser et al. ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis, NeurIPS 2021.
[8] https://github.com/karpathy/minGPT

Licenses

The source code is licensed under the Apache 2.0 License.
The stage2 pretrained weights are licensed under the CC-BY-NC-SA 4.0 License.

Contact

We hope that minDALL-E helps various projects in research-oriented institutes and startups. If you would like to collaborate with us or share feedback, please e-mail us at contact@kakaobrain.com.

Limitations

Although minDALL-E is trained on a relatively small dataset (14M image-text pairs), it may still be vulnerable to malicious prompt engineering that produces socially unacceptable images. If you observe such images, please report the "prompt" and "generated images" to us.

Owner

Name: kakaobrain
Login: kakaobrain
Kind: organization
Location: Pankyo, Seongnam, Kyungki, Republic of Korea

Website: https://kakaobrain.com
Repositories: 26
Profile: https://github.com/kakaobrain

Kakao Brain Corp.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you find this repository useful in your research, please cite"
authors:
  - family-names: Kim
    given-names: Saehoon
  - family-names: Cho
    given-names: Sanghun
  - family-names: Kim
    given-names: Chiheon
  - family-names: Lee
    given-names: Doyup
  - family-names: Baek
    given-names: Woonhyuk
title: "minDALL-E on Conceptual Captions"
version: 0.1
date-released: 2021-12-14
repository-code: https://github.com/kakaobrain/minDALL-E

GitHub Events

Total

Watch event: 8
Pull request event: 1
Fork event: 2

Last Year

Watch event: 8
Pull request event: 1
Fork event: 2

Committers

Last synced: about 1 year ago

All Time

Total Commits: 6
Total Committers: 4
Avg Commits per committer: 1.5
Development Distribution Score (DDS): 0.667

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Saehoon Kim	k**a@g**m	2
Sam Kim	s**h@k**m	2
damien.re	d**e@k**m	1
clint	c**b@k**m	1
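The committer statistics above can be reproduced from the Top Committers table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption, but it does reproduce the listed value of 0.667. The helper `commit_stats` is a hypothetical name.

```python
from typing import Dict


def commit_stats(commits_by_author: Dict[str, int]) -> Dict[str, float]:
    """Summarize commit counts per author."""
    total = sum(commits_by_author.values())
    committers = len(commits_by_author)
    top = max(commits_by_author.values())
    return {
        "total_commits": total,
        "total_committers": committers,
        "avg_commits_per_committer": total / committers,
        # Assumed definition: share of commits NOT made by the top committer.
        "dds": 1 - top / total,
    }


stats = commit_stats({"Saehoon Kim": 2, "Sam Kim": 2, "damien.re": 1, "clint": 1})
print(stats["avg_commits_per_committer"], round(stats["dds"], 3))  # prints: 1.5 0.667
```

A score near 0 would mean a single person authored almost every commit, while values closer to 1 indicate work spread across many committers.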

Committer Domains (Top 20 + Academic)

kakaobrain.com: 2
kakaobrain.coom: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 20
Total pull requests: 6
Average time to close issues: about 1 month
Average time to close pull requests: 6 months
Total issue authors: 18
Total pull request authors: 4
Average comments per issue: 1.65
Average comments per pull request: 0.0
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

SeungyounShin (1)
loboere (1)
johnpaulbin (1)
neverix (1)
PyDeps (1)
smittal10 (1)
woctezuma (1)
ChristiaensBert (1)
raki-1203 (1)
MyUsernamee (1)
j-min (1)
mjohanning99 (1)
thuangb (1)
tackgeun (1)
INF800 (1)

Pull Request Authors

saehoonkim (3)
chenxwh (2)
LeeDoYup (1)
wbaek (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

requirements.txt pypi

einops *
matplotlib *
omegaconf *
pyflakes >=2.2.0
pytorch-lightning >=1.5
tokenizers >=0.10.2
torch ==1.8.0
torchvision >=0.8.2
tqdm >=4.46.0
