video-compression-and-future-prediction-using-gpt
This repository presents a project focused on advanced video compression and future prediction using Generative Pre-trained Transformer (GPT) and other state-of-the-art techniques.
https://github.com/rishikesh-jadhav/video-compression-and-future-prediction-using-gpt
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.5%) to scientific vocabulary
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
| Source Video | Compressed Video | Future Prediction |
| --------------- | ---------------- | ------------------ |
A world model is a model that can predict the next state of the world given the observed previous states and actions.
World models are essential to training all kinds of intelligent agents, especially self-driving models.
commaVQ contains:
- encoder/decoder models used to heavily compress driving scenes
- a world model trained on 3,000,000 minutes of driving videos
- a dataset of 100,000 minutes of compressed driving videos
Task
Lossless compression challenge: make me smaller! $500 challenge
Losslessly compress 5,000 minutes of driving video "tokens". Go to ./compression/ to start.
Prize: highest compression rate on 5,000 minutes of driving video (~915MB). The challenge ended July 1st, 2024, at 11:59pm AoE.
Submit a single zip file containing the compressed data and a Python script that decompresses it into its original form. Top solutions are listed on comma's official leaderboard.
| Implementation | Compression rate |
| :----------------------------------------- | ---------------: |
| pkourouklidis (arithmetic coding with GPT) | 2.6 |
| anonymous (zpaq) | 2.3 |
| rostislav (zpaq) | 2.3 |
| anonymous (zpaq) | 2.2 |
| anonymous (zpaq) | 2.2 |
| 0x41head (zpaq) | 2.2 |
| tillinf (zpaq) | 2.2 |
| baseline (lzma) | 1.6 |
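The top entry pairs arithmetic coding with a GPT. A predictive model helps because an arithmetic coder spends roughly -log2 p(x) bits on a symbol the model gave probability p(x), so sharper predictions mean fewer bits. The sketch below illustrates that accounting only. It uses a simple adaptive count model as a stand-in for the GPT, and the function name `ideal_code_length_bits` is hypothetical; none of this is taken from the listed submission.

```python
import math
from collections import Counter

def ideal_code_length_bits(symbols, alphabet_size):
    """Bits an ideal arithmetic coder would spend using an adaptive count model.

    The count model is an illustrative stand-in for a learned predictor.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for s in symbols:
        # Laplace-smoothed probability estimated from the symbols seen so far.
        p = (counts[s] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)
        # Update the model after coding, so a decoder could mirror the same steps.
        counts[s] += 1
        seen += 1
    return total_bits

# A skewed toy stream: predictable data costs far fewer than 10 bits per symbol.
skewed = [0] * 900 + [1] * 100
bits = ideal_code_length_bits(skewed, alphabet_size=1024)
rate = (len(skewed) * 10) / bits  # relative to a flat 10 bits per symbol
```

Swapping the count model for a stronger predictor changes only how `p` is computed; the bit accounting stays the same.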
Overview
A VQ-VAE [1,2] was used to heavily compress each video frame into 128 "tokens" of 10 bits each. Each entry of the dataset is a "segment" of compressed driving video, i.e. 1min of frames at 20 FPS. Each file is of shape 1200x8x16 and saved as int16.
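The sizes implied by this layout can be checked with a little arithmetic. The sketch below assumes token values lie in the range 0 to 1023, as "10 bits each" suggests; the helpers `segment_sizes`, `pack_10bit`, and `unpack_10bit` are hypothetical names used for illustration and are not part of commaVQ.

```python
import numpy as np

FRAMES_PER_SEGMENT = 1200   # 1 minute at 20 FPS
TOKENS_PER_FRAME = 8 * 16   # 128 tokens per frame
BITS_PER_TOKEN = 10

def segment_sizes():
    """Return (bytes when stored as int16, bytes when packed at 10 bits per token)."""
    n_tokens = FRAMES_PER_SEGMENT * TOKENS_PER_FRAME
    return n_tokens * 2, n_tokens * BITS_PER_TOKEN // 8

def pack_10bit(tokens):
    """Pack an integer array with values in [0, 1023] into a byte string."""
    flat = np.asarray(tokens).reshape(-1).astype(np.uint16)
    # Expand each token into its 10 bits, most significant bit first.
    shifts = np.arange(BITS_PER_TOKEN - 1, -1, -1, dtype=np.uint16)
    bits = ((flat[:, None] >> shifts) & 1).astype(np.uint8)
    return np.packbits(bits.reshape(-1)).tobytes()

def unpack_10bit(data, shape):
    """Invert pack_10bit, returning an int16 array of the given shape."""
    n_tokens = int(np.prod(shape))
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    bits = bits[: n_tokens * BITS_PER_TOKEN]  # drop any padding bits
    weights = (1 << np.arange(BITS_PER_TOKEN - 1, -1, -1)).astype(np.int64)
    values = bits.reshape(n_tokens, BITS_PER_TOKEN).astype(np.int64) @ weights
    return values.astype(np.int16).reshape(shape)
```

A segment holds 1200 x 128 = 153,600 tokens, so it occupies 307,200 bytes as int16 and 192,000 bytes at 10 bits per token. At 10 bits per token, 5,000 segments come to 960,000,000 bytes, roughly 915 MiB, which is consistent with the ~915MB figure quoted for the challenge.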
A world model [3] was trained to predict the next token given a context of past tokens. This world model is a Generative Pre-trained Transformer (GPT) [4] trained on 3,000,000 minutes of driving videos following a similar recipe to [5].
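Predicting the next token from past tokens also gives a way to generate: sample a token, append it to the context, and repeat. The sketch below shows that autoregressive loop in isolation. `toy_model` is a hypothetical stand-in that simply favors repeating the last token; the real world model is the GPT used in ./notebooks/gpt.ipynb, and the function names and parameters here are assumptions made for illustration.

```python
import numpy as np

def rollout(next_token_probs, context, n_new, context_size, rng):
    """Autoregressively extend `context` by `n_new` sampled tokens."""
    tokens = list(context)
    for _ in range(n_new):
        window = tokens[-context_size:]     # keep only the most recent tokens
        probs = next_token_probs(window)    # distribution over the vocabulary
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def toy_model(window, vocab_size=1024):
    """Hypothetical stand-in for the GPT: favors repeating the last token."""
    probs = np.full(vocab_size, 0.1 / (vocab_size - 1))
    probs[window[-1]] = 0.9
    return probs

rng = np.random.default_rng(0)
future = rollout(toy_model, context=[3, 3, 3], n_new=5, context_size=2, rng=rng)
```

With 128 tokens per frame, imagining one future frame would mean sampling 128 tokens in this fashion and passing them to the decoder.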
Examples
See ./notebooks/encode.ipynb and ./notebooks/decode.ipynb for an example of how to visualize the dataset using a segment of driving video from comma's drive to Taco Bell.
See ./notebooks/gpt.ipynb for an example of how to use the world model to imagine future frames.
See ./compression/compress.py for an example of how to compress the tokens using lzma.
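For orientation, a minimal lzma round trip over a token array might look like the sketch below. It is not the contents of ./compression/compress.py: the function names are hypothetical, the data is synthetic, and measuring the rate against 10 bits per token is an assumption about how the challenge counts the original size.

```python
import lzma
import numpy as np

def compress_tokens(tokens):
    """Losslessly compress a token array with lzma (stored as int16 bytes)."""
    return lzma.compress(np.ascontiguousarray(tokens, dtype=np.int16).tobytes())

def decompress_tokens(blob, shape):
    """Recover the int16 token array from the lzma blob."""
    return np.frombuffer(lzma.decompress(blob), dtype=np.int16).reshape(shape)

def compression_rate(tokens, blob, bits_per_token=10):
    """Size at `bits_per_token` bits per token divided by the compressed size.

    The 10-bit reference size is an assumption, not a documented rule.
    """
    return (tokens.size * bits_per_token / 8) / len(blob)

# Synthetic, highly repetitive stand-in for a segment of shape (frames, 8, 16).
frame = np.arange(128, dtype=np.int16).reshape(1, 8, 16)
tokens = np.tile(frame, (50, 1, 1))

blob = compress_tokens(tokens)
restored = decompress_tokens(blob, tokens.shape)
```

Checking `np.array_equal(restored, tokens)` before reporting a rate confirms the scheme is actually lossless.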
Download the dataset
- Using huggingface datasets
```python
import numpy as np
from datasets import load_dataset

num_proc = 40 # CPUs go brrrr
ds = load_dataset('commaai/commavq', num_proc=num_proc)
tokens = np.load(ds['0'][0]['path']) # first segment from the first data shard
```
- Manually download from huggingface datasets repository: https://huggingface.co/datasets/commaai/commavq
References
[1] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).
[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
[3] https://worldmodels.github.io/
[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[5] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations. 2022.
Owner
- Name: Rishikesh Jadhav
- Login: Rishikesh-Jadhav
- Kind: user
- Repositories: 2
- Profile: https://github.com/Rishikesh-Jadhav
Robotics Masters student at the University of Maryland - College Park
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use commavq, please cite it as below."
authors:
  - family-names: "comma.ai"
title: "commavq: a dataset of tokenized driving video and a GPT model"
date-released: 2023-06-25
url: "https://github.com/commaai/commavq/"
GitHub Events
Total
- Watch event: 2
- Fork event: 1
Last Year
- Watch event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- datasets ==2.15.0
- torch ==2.2.2
- tqdm *