velvet

https://github.com/dinhanhx/velvet

Last synced: 6 months ago

Repository

velvet

Basic Info

Host: GitHub
Owner: dinhanhx
License: mit
Language: Python
Default Branch: main
Size: 85 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

Velvet

Red velvet cake is my all-time favorite cake. It’s perfect for any holiday or celebration, and it always looks so elegant!

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC)

Introduction

This repository contains all the code (training, data loading) for the prompting vision language model discussed in the master's thesis "Multitask Multilingual Vision Language: Image and Text Generation".

To increase the language range, with Vietnamese as our main focus, we have translated 6 common English datasets for image captioning and visual question answering into Vietnamese. The 6 datasets are CC3M, COCO, VQAv2, OK-VQA, TextCaps, and TextVQA. The translations were produced with Google Translate; collectively, the datasets contain millions of image-text pairs in each language (English and Vietnamese). These datasets are available at huggingface.co/dinhanhx.

We have proposed a prompting vision language model (this repository) which is trained on these datasets. The model can caption images and answer questions related to images. The model is modular and has 3 components: a frozen image model (ConvNeXt V2), a frozen text model (BLOOMZ), and a Cutie model. The Cutie model is meant to query useful information from the visual features (produced by the image model) for the text model. Our concept is heavily inspired by the BLIP-2 architecture. The docstring in velvet/model/cutie.py should describe the differences.
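To make the idea of "querying" visual features more concrete, here is a minimal, dependency-free sketch of single-head cross-attention, in which a small set of query vectors each produce a weighted average of the image feature vectors. This is only a conceptual illustration: the function names, the toy vectors, and the absence of learned projections are assumptions made for this example, and it is not the Cutie implementation in velvet/model/cutie.py.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Scaled dot-product cross-attention without learned projections.

    queries:  list of query vectors (each of dimension d)
    features: list of image feature vectors (each of dimension d)
    Returns one output vector per query, each a convex combination
    of the feature vectors.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * f for q, f in zip(query, feature)) / math.sqrt(dim)
            for feature in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * feature[j] for w, feature in zip(weights, features))
             for j in range(dim)]
        )
    return outputs


# Toy data (hypothetical): three image feature vectors, two query vectors.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[10.0, 0.0], [0.0, 10.0]]
summary = cross_attention(queries, image_features)
```

The number of output vectors equals the number of queries, regardless of how many image features there are. In a BLIP-2-style design, this fixed-size summary is what gets handed to the text model.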

Citation

If you use this source code or model weights or theory, please cite it as below.

@software{dinhanhx_Velvet_2023,
  title = {{Velvet}},
  author = {dinhanhx},
  year = 2023,
  month = sep,
  url = {https://github.com/dinhanhx/velvet},
  license = {MIT}
}

Results

Unfortunately, the results are not good. On Crossmodal-3600, our largest model achieves a CIDEr score of 0.435 in English and 0.318 in Vietnamese. It also gets 0.3404 F1 and 0.1135 BLEU on the private test set of EVJVQA.
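For readers unfamiliar with F1 as an answer-quality metric, the sketch below computes a token-level F1 between a predicted answer and a reference answer. It is an illustration under assumptions: whitespace tokenization and lowercasing are choices made here, the function name is hypothetical, and the official EVJVQA scorer may differ in its details.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the sky is blue", "blue")
```

Here the prediction contains the reference token but adds three extra tokens, so recall is perfect while precision is low.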

Project setup

Dependencies setup

See the pip requirements to set up on TPU v3-8 or v4-8.

Data setup

⚠ Please take a look at this file configs/data_dir.toml first

Assuming the data is downloaded to /mnt/storage/data/, run the following commands in that folder:

```sh
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/datasets/dinhanhx/evjvqa
git clone https://huggingface.co/datasets/dinhanhx/TextVQA-vi
git clone https://huggingface.co/datasets/dinhanhx/TextCaps-vi
git clone https://huggingface.co/datasets/dinhanhx/OK-VQA-multilang
git clone https://huggingface.co/datasets/dinhanhx/VQAv2-vi
git clone https://huggingface.co/datasets/dinhanhx/ccsbualign_multilang
git clone https://huggingface.co/datasets/dinhanhx/coco-2017-vi
git clone https://huggingface.co/datasets/dinhanhx/gcc-vi
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K
```
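Since each `git clone` creates a folder named after the last component of its URL, a quick check like the following can confirm that every dataset repository is present before moving on. This is a small convenience sketch, not part of the repository; the function name `missing_repos` is hypothetical, and the default path is the one assumed above.

```python
from pathlib import Path

# Folder names created by the git clone commands above.
EXPECTED_REPOS = [
    "evjvqa",
    "TextVQA-vi",
    "TextCaps-vi",
    "OK-VQA-multilang",
    "VQAv2-vi",
    "ccsbualign_multilang",
    "coco-2017-vi",
    "gcc-vi",
    "LLaVA-CC3M-Pretrain-595K",
]


def missing_repos(data_root):
    """Return the expected dataset folders that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_REPOS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_repos("/mnt/storage/data/")
    print("missing:", ", ".join(missing) if missing else "none")
```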

Now set up LLaVA-CC3M-Pretrain-595K:

```sh
cd LLaVA-CC3M-Pretrain-595K/
mkdir images
unzip -q images.zip -d images
```

Now set up coco-2017-images:

```sh
mkdir coco-2017-images
cd coco-2017-images
curl -JO http://images.cocodataset.org/zips/train2017.zip
unzip -q train2017.zip
```

Now set up OpenImages:

```sh
mkdir OpenImages
cd OpenImages
curl -JO https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip -q train_val_images.zip
```

Now run some Python scripts to make the data usable by velvet. Don't forget to go back to the root of this repository first:

```sh
python scripts/<do_something>.py
```

where <do_something>.py is a file in scripts/.

⚠ Please read the files (including scripts/readme.md) before running, especially things related to paths.

Experiment train

⚠ Our training code can be used on GPU too. It's not TPU exclusive. Please read pl.Trainer in velvet_exp/run_all.py carefully before running on any hardware.

Pre run

For TPU v3-8:

```sh
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
```

For TPU v4-8:

```sh
export PJRT_DEVICE=TPU
```
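If you launch training from a Python script rather than a shell, the same variables can be selected per TPU type. The sketch below only restates the two settings given above; the function name `tpu_environment` and the idea of passing the result to a subprocess are assumptions made for illustration.

```python
import os

# Environment variables from the "Pre run" instructions, keyed by TPU type.
TPU_ENV = {
    "v3-8": {"XRT_TPU_CONFIG": "localservice;0;localhost:51011"},
    "v4-8": {"PJRT_DEVICE": "TPU"},
}


def tpu_environment(tpu_type, base_env=None):
    """Return a copy of the environment with the variables for tpu_type set."""
    if tpu_type not in TPU_ENV:
        raise ValueError(f"unknown TPU type: {tpu_type!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(TPU_ENV[tpu_type])
    return env


env_v4 = tpu_environment("v4-8", base_env={})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the current process environment stays untouched.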

Main run

```sh
# Insert the config name before .json
python -m velvet_exp.run_all configs/experiments/.json
```
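As a convenience, the launch command can also be assembled in Python. The sketch below assumes that the module name follows the file velvet_exp/run_all.py and that experiment configs are JSON files in configs/experiments/ (big_model.json is the one referenced in the extraction code). The helper names are hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def list_experiment_configs(config_dir="configs/experiments"):
    """Return the names (without .json) of all config files in config_dir."""
    return sorted(path.stem for path in Path(config_dir).glob("*.json"))


def training_command(config_name, config_dir="configs/experiments"):
    """Build the argument list that launches training for one config."""
    config_path = Path(config_dir) / f"{config_name}.json"
    return [sys.executable, "-m", "velvet_exp.run_all", str(config_path)]


command = training_command("big_model")
```

Running it is then a matter of `subprocess.run(command, check=True)` from the root of the repository.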

```sh
# Kill leftover processes
pgrep -f "python -m velvet_exp.run_all" | xargs kill -9
```

PyTorch Lightning CKPT file extraction

During and after training, PyTorch Lightning produces a ckpt file (which can be opened with Torch). It contains everything, including the model weights. To extract only the model weights, please refer to the following code (which is similar to velvet_exp/run_all.py).

```python
import json

import torch
from transformers.models.bert import BertConfig
from transformers.models.bloom import BloomConfig
from transformers.models.convnextv2.modeling_convnextv2 import ConvNextV2Config

from velvet_exp.run_all import Wrapper

# Load the experiment config used for training
with open("configs/experiments/big_model.json", "r") as f:
    experiment_config = json.load(f)

# Image model config (falls back to the base model if not set)
image_model_name = experiment_config.get("image_model_name", None)
if image_model_name is None:
    image_model_name = "facebook/convnextv2-base-22k-224"
image_config = ConvNextV2Config.from_pretrained(image_model_name)

# Text model config (falls back to the 560m model if not set)
bloom_model_name = experiment_config.get("bloom_model_name", None)
if bloom_model_name is None:
    bloom_model_name = "bigscience/bloomz-560m"
bloom_config = BloomConfig.from_pretrained(bloom_model_name)

bert_config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=6,
    num_attention_heads=16,
    is_decoder=True,
    add_cross_attention=True,
)

# Rebuild the Lightning wrapper exactly as it was built for training
wrapper = Wrapper(
    experiment_config=experiment_config,
    image_config=image_config,  # type: ignore
    bert_config=bert_config,
    bloom_config=bloom_config,  # type: ignore
    bloom_name=bloom_model_name,
    learning_rate=experiment_config["learning_rate"],
    warmup_ratio=experiment_config["warmup_ratio"],
    use_lrs=experiment_config["use_learning_rate_scheduler"],
    warmup_steps=experiment_config.get("warmup_steps", None),
)

# Load the full checkpoint, then save only the model weights
wrapper.load_state_dict(
    torch.load("big_model_logs/lightning_logs/version_0/checkpoints/last.ckpt")["state_dict"]
)
torch.save(
    wrapper.visual_bloom.state_dict(),
    "big_model_logs/lightning_logs/version_0/checkpoints/visual_bloom.torch",
)
```
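The reason extraction is needed at all is that a module's state dict names each weight with the attribute path that leads to it, so the weights of `wrapper.visual_bloom` appear in the checkpoint under keys starting with `visual_bloom.`. The sketch below shows that key-prefix relationship on a plain dictionary. It is an illustration only: the helper name and the toy keys are hypothetical, and this shortcut has not been checked against a real Velvet checkpoint.

```python
def extract_submodule_state(state_dict, prefix="visual_bloom."):
    """Keep only the entries under prefix and strip the prefix from their keys."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for checkpoint["state_dict"]; real values would be tensors.
toy_state_dict = {
    "visual_bloom.cutie.weight": [0.1, 0.2],
    "visual_bloom.cutie.bias": [0.0],
    "some_other_module.weight": [1.0],
}
model_only = extract_submodule_state(toy_state_dict)
```

The code from the README above remains the safer route, because loading through the wrapper verifies that every key matches the model definition.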

Demo

⚠ Please make sure that you have done the extraction or you have obtained the model weight here

The following code doesn't use a GPU or TPU. To change the image, edit the variable url. To change the prompt, see the final for loop.

```python
import requests
import torch
from PIL import Image
from transformers.models.bert import BertConfig
from transformers.models.bloom import BloomConfig, BloomTokenizerFast
from transformers.models.convnext import ConvNextImageProcessor
from transformers.models.convnextv2.modeling_convnextv2 import (
    ConvNextV2Config,
    ConvNextV2Model,
)

from velvet.collator import ImageFeatureCollator
from velvet.model import VisualBloom

# Image model and preprocessing
image_model_name = "facebook/convnextv2-large-22k-224"
image_config = ConvNextV2Config.from_pretrained(image_model_name)
image_processor = ConvNextImageProcessor.from_pretrained(image_model_name)
image_model = ConvNextV2Model.from_pretrained(image_model_name)
image_feature_collator = ImageFeatureCollator(image_processor, image_model)

# Text model config and tokenizer
bloom_model_name = "bigscience/bloomz-1b7"
bloom_config = BloomConfig.from_pretrained(bloom_model_name)
tokenizer = BloomTokenizerFast.from_pretrained(bloom_model_name)
tokenizer.padding_side = "right"

bert_config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=6,
    num_attention_heads=16,
    is_decoder=True,
    add_cross_attention=True,
)

# Build the model and load the extracted weights
visual_bloom = VisualBloom(
    image_config, bert_config, bloom_config, bloom_model_name, use_frozen_bloom=False
)
visual_bloom.load_state_dict(
    torch.load("big_model_logs/lightning_logs/version_0/checkpoints/visual_bloom.torch")
)
visual_bloom = visual_bloom.eval()

url = "https://i.imgur.com/Y2vIAJp.jpg"

language_list = ["en", "vi"]
for language in language_list:
    # instruction = (
    #     f"Generate answer in {language}: what is the color of the sky?"
    #     if language == "en"
    #     else f"Generate answer in {language}: màu sắc của bầu trời là gì?"
    # )
    instruction = f"Generate caption in {language}:"
    print(instruction)

    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    image_features, image_attentions = image_feature_collator([image])
    instruction_inputs = tokenizer([instruction], return_tensors="pt")

    language_output = visual_bloom.generate(
        image_features,
        image_attentions,
        instruction_inputs["input_ids"],
        instruction_inputs["attention_mask"],
    )

    # Decode and keep only the text before the first period
    human_output = tokenizer.decode(language_output[0], skip_special_tokens=True)
    print(human_output.split(".")[0])
```
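The demo uses two prompt shapes: "Generate caption in {language}:" for captioning and "Generate answer in {language}: {question}" for question answering. A small helper like the one below can build either prompt and apply the same first-sentence truncation as the demo's final print. The helper names are hypothetical and not part of the velvet package.

```python
def build_instruction(language, question=None):
    """Build a captioning prompt, or a VQA prompt when a question is given."""
    if question is None:
        return f"Generate caption in {language}:"
    return f"Generate answer in {language}: {question}"


def first_sentence(text):
    """Keep only the text before the first period, as the demo does."""
    return text.split(".")[0]


caption_prompt = build_instruction("vi")
vqa_prompt = build_instruction("en", "what is the color of the sky?")
```

Either prompt string can be passed to the tokenizer in place of `instruction` in the demo loop.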

Owner

Name: dinhanhx
Login: dinhanhx
Kind: user
Location: Hanoi, Vietnam

Repositories: 10
Profile: https://github.com/dinhanhx

A Python dev :/

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Velvet
message: >-
  If you use this source code or model weights or theory,
  please cite it as below.
type: software
authors:
  - given-names: dinhanhx
repository-code: 'https://github.com/dinhanhx/velvet'
license: MIT
date-released: '2023-09-30'

Committers

Last synced: over 1 year ago

All Time

Total Commits: 122
Total Committers: 1
Avg Commits per committer: 122.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
dinhanhx	d**x@g**m	122

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

velvet

Science Score: 54.0%
