KandinskyVideo — multilingual end-to-end text2video latent diffusion model

https://github.com/ai-forever/kandinskyvideo

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

kandinsky latent-diffusion text-to-video video-generation
Last synced: 6 months ago

Repository

KandinskyVideo — multilingual end-to-end text2video latent diffusion model

Basic Info
  • Host: GitHub
  • Owner: ai-forever
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 387 MB
Statistics
  • Stars: 184
  • Watchers: 13
  • Forks: 20
  • Open Issues: 6
  • Releases: 0
Topics
kandinsky latent-diffusion text-to-video video-generation
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

Kandinsky Video 1.1 — a new text-to-video generation model

SoTA quality among open-source solutions on EvalCrafter benchmark

This repository is the official implementation of the Kandinsky Video 1.1 model.

Hugging Face Spaces | Telegram-bot | Habr post | Our text-to-image model | Project page

Our previous model, Kandinsky Video 1.0, divides video generation into two stages: first generating keyframes at a low FPS, then creating interpolated frames between these keyframes to increase the FPS. In Kandinsky Video 1.1, we further break keyframe generation into two steps: first generating the initial frame of the video from the textual prompt using the Kandinsky 3.0 text-to-image model, then generating the subsequent keyframes conditioned on the textual prompt and the previously generated first frame. This approach yields more consistent content across frames and significantly improves overall video quality. It also makes it possible to animate any input image as an additional feature.

Pipeline


In Kandinsky Video 1.0, the encoded text prompt enters the text-to-video U-Net3D keyframe generation model with temporal layers or blocks; the sampled latent keyframes are then sent to the latent interpolation model, which predicts three interpolated frames between each pair of keyframes. An image MoVQ-GAN decoder produces the final video. In Kandinsky Video 1.1, the text-to-video U-Net3D is additionally conditioned on the text-to-image U-Net2D, which improves content quality, and a temporal MoVQ-GAN decoder is used to decode the final video.
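The staged pipeline described above can be sketched in miniature. This is an illustrative toy, not the repository's actual API: the function names and frame representations are invented stubs that only demonstrate the data flow (first frame → keyframes → three interpolated frames per keyframe pair).

```python
# Hypothetical sketch of the Kandinsky Video 1.1 staged pipeline.
# All function names are illustrative stubs, not the real API.

def generate_first_frame(prompt):
    # Stage 1: text-to-image (Kandinsky 3.0) produces the initial frame.
    return f"frame0({prompt})"

def generate_keyframes(prompt, first_frame, n_keyframes=5):
    # Stage 2: U-Net3D generates low-FPS keyframes conditioned on
    # both the prompt and the previously generated first frame.
    return [first_frame] + [f"key{i}({prompt})" for i in range(1, n_keyframes)]

def interpolate(keyframes, frames_between=3):
    # Stage 3: the latent interpolation model predicts three frames
    # between each pair of adjacent keyframes.
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(f"interp({a},{b},{t})" for t in range(frames_between))
    video.append(keyframes[-1])
    return video

keyframes = generate_keyframes("a cat", generate_first_frame("a cat"))
video = interpolate(keyframes)
# 5 keyframes plus 3 interpolated frames per gap -> 5 + 3*4 = 17 frames
print(len(video))
```

Note how the interpolation stage quadruples the effective frame rate: N keyframes become N + 3(N-1) frames before the MoVQ decoder runs.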

Architecture details

  • Text encoder (Flan-UL2) - 8.6B
  • Latent Diffusion U-Net3D - 4.15B
  • The interpolation model (Latent Diffusion U-Net3D) - 4.0B
  • Image MoVQ encoder/decoder - 256M
  • Video (temporal) MoVQ decoder - 556M
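Summing the component sizes listed above gives a rough total parameter count for the full pipeline (a back-of-the-envelope figure, since the image MoVQ encoder/decoder and the temporal decoder are alternatives at inference time):

```python
# Component sizes reported above, in billions of parameters.
components = {
    "Flan-UL2 text encoder": 8.6,
    "Latent Diffusion U-Net3D": 4.15,
    "Interpolation U-Net3D": 4.0,
    "Image MoVQ encoder/decoder": 0.256,
    "Temporal MoVQ decoder": 0.556,
}
total = sum(components.values())
print(f"{total:.2f}B")  # ~17.56B parameters across all listed components
```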

How to use

1. text2video

```python
from kandinsky_video import get_T2V_pipeline

device_map = 'cuda:0'
t2v_pipe = get_T2V_pipeline(device_map)

prompt = "A cat wearing sunglasses and working as a lifeguard at a pool."

fps = 'medium'   # ['low', 'medium', 'high']
motion = 'high'  # ['low', 'medium', 'high']

video = t2v_pipe(
    prompt,
    width=512, height=512,
    fps=fps,
    motion=motion,
    key_frame_guidance_scale=5.0,
    guidance_weight_prompt=5.0,
    guidance_weight_image=3.0,
)

# Save the generated frames as an animated GIF (~5.5 s total duration).
path_to_save = './assets/video.gif'
video[0].save(
    path_to_save,
    save_all=True,
    append_images=video[1:],
    duration=int(5500 / len(video)),
    loop=0,
)
```


Generated video

2. image2video

```python
from io import BytesIO

import requests
from PIL import Image

from kandinsky_video import get_T2V_pipeline

device_map = 'cuda:0'
t2v_pipe = get_T2V_pipeline(device_map)

# Fetch the input image to animate.
url = 'https://media.cnn.com/api/v1/images/stellar/prod/gettyimages-1961294831.jpg'
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img.show()

prompt = "A panda climbs up a tree."

fps = 'medium'     # ['low', 'medium', 'high']
motion = 'medium'  # ['low', 'medium', 'high']

video = t2v_pipe(
    prompt,
    image=img,
    width=640, height=384,
    fps=fps,
    motion=motion,
    key_frame_guidance_scale=5.0,
    guidance_weight_prompt=5.0,
    guidance_weight_image=3.0,
)

# Save the generated frames as an animated GIF (~5.5 s total duration).
path_to_save = './assets/video2.gif'
video[0].save(
    path_to_save,
    save_all=True,
    append_images=video[1:],
    duration=int(5500 / len(video)),
    loop=0,
)
```


Input image.


Generated video.

Motion score and Noise Augmentation conditioning


Variations in generations based on different motion scores and noise augmentation levels. The horizontal axis shows noise augmentation levels (NA), while the vertical axis displays motion scores (MS).

Results


Kandinsky Video 1.1 achieves second place overall and is the best open-source model on the EvalCrafter text-to-video benchmark. VQ: visual quality; TVA: text-video alignment; MQ: motion quality; TC: temporal consistency; FAS: final average score.


Polygon-radar chart representing the performance of Kandinsky Video 1.1 on the EvalCrafter benchmark.


Human evaluation study results. The bars in the plot correspond to the percentage of “wins” in the side-by-side comparison of model generations. We compare our model with Video LDM.

Authors

BibTeX

If you use our work in your research, please cite our publication:

    @article{arkhipkin2023fusionframes,
      title   = {FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline},
      author  = {Arkhipkin, Vladimir and Shaheen, Zein and Vasilev, Viacheslav and Dakhova, Elizaveta and Kuznetsov, Andrey and Dimitrov, Denis},
      journal = {arXiv preprint arXiv:2311.13073},
      year    = {2023},
    }

Owner

  • Name: AI Forever
  • Login: ai-forever
  • Kind: organization
  • Location: Armenia

Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.

GitHub Events

Total
  • Watch event: 15
  • Issue comment event: 1
  • Fork event: 4
Last Year
  • Watch event: 15
  • Issue comment event: 1
  • Fork event: 4

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 34
  • Total Committers: 7
  • Avg Commits per committer: 4.857
  • Development Distribution Score (DDS): 0.588
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
vivasilev s****8@y****u 14
oriBetelgeuse a****8@g****m 8
Andrey Kuznetsov k****y@g****m 4
Zein Shaheen z****e@g****m 3
Andrei Filatov 4****h 3
chenxi c****e@g****m 1
Denis d****v@g****m 1
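The Development Distribution Score (DDS) reported above follows the common definition used by repository-analytics tools: one minus the top committer's share of total commits, so a score near 0 means one person wrote nearly everything. The figures in the tables above reproduce it:

```python
# DDS = 1 - (commits by the top committer / total commits).
# From the stats above: vivasilev has 14 of the 34 total commits.
total_commits = 34
top_committer_commits = 14
dds = 1 - top_committer_commits / total_commits
print(round(dds, 3))  # matches the reported all-time DDS of 0.588
```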
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 7
  • Total pull requests: 4
  • Average time to close issues: 3 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 7
  • Total pull request authors: 3
  • Average comments per issue: 0.29
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 3 days
  • Average time to close pull requests: 1 minute
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • SoftologyPro (1)
  • hcgprague (1)
  • Hangsiin (1)
  • Jzow (1)
  • l-dawei (1)
  • eisneim (1)
Pull Request Authors
  • oriBetelgeuse (2)
  • chenxwh (1)
  • zeinsh (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • accelerate *
  • albumentations *
  • av *
  • bezier *
  • datasets *
  • diffusers *
  • einops *
  • fsspec *
  • hydra-core *
  • matplotlib *
  • omegaconf *
  • pytorch_lightning ==1.7.5
  • s3fs *
  • scikit-image *
  • sentencepiece *
  • setuptools ==59.5.0
  • timm *
  • torch ==1.10.1
  • torchaudio ==0.10.1
  • torchvision ==0.11.2
  • transformers *
  • wandb *
  • webdataset *