https://github.com/aim-uofa/vlmodel

Repo of HawkLlama.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: aim-uofa
  • License: bsd-2-clause
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 70 MB
Statistics
  • Stars: 16
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

# HawkLlama

[🤗**Huggingface Model**](https://huggingface.co/AIM-ZJU/HawkLlama_8b) | [🗂️**Github**](https://github.com/aim-uofa/VLModel) | [📖**Technical Report**](assets/technical_report.pdf) | [🎮️**Demo**](http://115.236.57.99:30020/)

Zhejiang University, China

This is the official implementation of HawkLlama, an open-source multimodal large language model designed for real-world vision and language understanding applications. Our model features the following highlights.

  1. HawkLlama-8B is built from:

    • Llama3-8B, the latest open-source large language model, trained on over 15 trillion tokens.
    • SigLIP, an enhancement over CLIP that employs a sigmoid loss and achieves superior performance in image recognition.
    • An efficient vision-language connector that captures high-resolution details without increasing the number of visual tokens, which helps reduce the training overhead associated with high-resolution images.
  2. For model training, we use the LLaVA-Pretrain dataset for pretraining and a curated mixed dataset for instruction tuning, which contains both multimodal and language-only data for supervised fine-tuning.

  3. HawkLlama-8B is developed on the NeMo framework, which supports 3D parallelism and leaves room to scale in future extensions.

Our model is open-source and reproducible. Please check our technical report for more details.

Setup

  1. Create the environment and activate it.

         conda create -n hawkllama python=3.10 -y
         conda activate hawkllama

  2. Clone and install this repo.

         git clone https://github.com/aim-uofa/VLModel.git
         cd VLModel
         pip install -e .
         pip install -e third_party/VLMEvalKit

Model Weights

Please refer to our HuggingFace repository to download the pretrained model weights.
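If you prefer to fetch the weights ahead of time instead of letting `from_pretrained` download them on first use, the following is a minimal sketch using `snapshot_download` from `huggingface_hub` (already listed in `requirements.txt`). The repository ID is taken from the Huggingface Model link above; the function name `download_weights` and the default `local_dir` path are illustrative assumptions, not part of this repo.

```python
# Repository ID taken from the Huggingface Model link in this README.
REPO_ID = "AIM-ZJU/HawkLlama_8b"


def download_weights(local_dir="checkpoints/HawkLlama_8b"):
    """Download every file of the model repository into local_dir.

    Both this helper and the default local_dir are hypothetical;
    pick any directory that suits your setup.
    """
    # Imported lazily so the module can be loaded without network access.
    from huggingface_hub import snapshot_download

    # Returns the path of the folder that now holds the snapshot.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` once should leave a local copy whose path can then be passed to `from_pretrained` in place of the repository ID.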

Inference

We provide example code for inference.

```Python
import torch
from PIL import Image
from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration
from HawkLlama.utils.conversation import conv_llava_llama_3, DEFAULT_IMAGE_TOKEN

# Load the processor and the model weights from the Hugging Face Hub.
processor = LlavaNextProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")

model = LlavaNextForConditionalGeneration.from_pretrained(
    "AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)
model.to("cuda:0")

image_file = "assets/coin.png"
image = Image.open(image_file).convert("RGB")

# Prepend the image placeholder token to the question.
prompt = "what coin is that?"
prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt

# Build the chat prompt: a user turn with the question and an empty assistant turn.
conversation = conv_llava_llama_3.copy()
user_role_ind = 0
bot_role_ind = 1
conversation.append_message(conversation.roles[user_role_ind], prompt)
conversation.append_message(conversation.roles[bot_role_ind], "")
prompt = conversation.get_prompt()

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=2048,
    do_sample=False,
    use_cache=True,
)

print(processor.decode(output[0], skip_special_tokens=True))
```

Evaluation

The evaluation code is adapted from the VLMEvalKit codebase.

```bash
# single GPU
python third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose

# multiple GPUs
torchrun --nproc-per-node=8 third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
```
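When scripting several evaluation runs, it can help to assemble the command line programmatically. The sketch below only rebuilds the two invocations shown above, using the flags that appear there (`--data`, `--model`, `--verbose`, `--nproc-per-node`). The helper name `build_eval_command` is hypothetical, and the underscore spellings of the dataset and model names are assumptions about how VLMEvalKit registers them.

```python
import shlex

# Path of the evaluation entry point, relative to the repository root.
RUN_SCRIPT = "third_party/VLMEvalKit/run.py"


def build_eval_command(datasets, model, num_gpus=1):
    """Return the argument list for a single- or multi-GPU evaluation run."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # One GPU: plain python. Several GPUs: torchrun with one process per GPU.
    if num_gpus == 1:
        launcher = ["python"]
    else:
        launcher = ["torchrun", f"--nproc-per-node={num_gpus}"]
    return launcher + [RUN_SCRIPT, "--data", *datasets, "--model", model, "--verbose"]


# Dataset and model names below are assumed spellings, not verified identifiers.
cmd = build_eval_command(
    ["MMBench_DEV_EN", "MMMU_DEV_VAL", "SEEDBench_IMG"],
    "hawkllama_llama3_vlm",
    num_gpus=8,
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root.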

The results are shown below:

| Benchmark       | Our Method | LLaVA-Llama3-v1.1 | LLaVA-Next |
|-----------------|------------|-------------------|------------|
| MMMU val        | 37.8       | 36.8              | 36.9       |
| SEEDBench img   | 71.0       | 70.1              | 70.0       |
| MMBench-EN dev  | 70.6       | 70.4              | 68.0       |
| MMBench-CN dev  | 64.4       | 64.2              | 60.6       |
| CCBench         | 33.9       | 31.6              | 24.7       |
| AI2D test       | 65.6       | 70.0              | 67.1       |
| ScienceQA test  | 76.1       | 72.9              | 70.4       |
| HallusionBench  | 41.0       | 47.7              | 35.2       |
| MMStar          | 43.0       | 45.1              | 38.1       |
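To see at a glance where each model leads, the table above can be tallied with a few lines of Python. This is only an illustrative summary: the scores are copied from the table, while the helper name `count_wins` and the "number of benchmarks won" tally are assumptions of this sketch, not a metric reported by the authors.

```python
from collections import Counter

MODELS = ("Our Method", "LLaVA-Llama3-v1.1", "LLaVA-Next")

# Scores copied from the results table, in the same column order as MODELS.
SCORES = {
    "MMMU val": (37.8, 36.8, 36.9),
    "SEEDBench img": (71.0, 70.1, 70.0),
    "MMBench-EN dev": (70.6, 70.4, 68.0),
    "MMBench-CN dev": (64.4, 64.2, 60.6),
    "CCBench": (33.9, 31.6, 24.7),
    "AI2D test": (65.6, 70.0, 67.1),
    "ScienceQA test": (76.1, 72.9, 70.4),
    "HallusionBench": (41.0, 47.7, 35.2),
    "MMStar": (43.0, 45.1, 38.1),
}


def count_wins(scores, models):
    """Count, per model, the benchmarks on which it has the highest score."""
    wins = Counter()
    for row in scores.values():
        best_index = row.index(max(row))
        wins[models[best_index]] += 1
    return wins


wins = count_wins(SCORES, MODELS)
for model in MODELS:
    print(f"{model}: best on {wins[model]} of {len(SCORES)} benchmarks")
```

On these numbers, HawkLlama leads on six of the nine benchmarks, and LLaVA-Llama3-v1.1 leads on the remaining three (AI2D test, HallusionBench, and MMStar).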

Training

See train with NeMo.

Demo

Welcome to try our demo!

License

For non-commercial academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact Chunhua Shen.

Acknowledgements

We express our appreciation to the following projects for their outstanding contributions in academia and code development: LLaVA, NeMo, VLMEvalKit and xtuner.

Owner

  • Name: Advanced Intelligent Machines (AIM)
  • Login: aim-uofa
  • Kind: organization
  • Location: China

A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...

GitHub Events

Total
  • Watch event: 6
  • Push event: 1
  • Fork event: 1
Last Year
  • Watch event: 6
  • Push event: 1
  • Fork event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • accelerate *
  • einops *
  • huggingface_hub *
  • matplotlib *
  • numpy ==1.23.4
  • omegaconf *
  • openai ==1.3.5
  • opencv-python >=4.4.0.46
  • openpyxl *
  • pandas >=1.5.3
  • pillow *
  • portalocker *
  • protobuf *
  • pycocoevalcap *
  • python-dotenv *
  • requests *
  • rich *
  • seaborn *
  • sentencepiece *
  • sty *
  • tabulate *
  • tiktoken *
  • timeout-decorator *
  • torch ==2.1.2
  • tqdm *
  • transformers *
  • typing_extensions ==4.7.1
  • validators *
  • visual_genome *
  • xlsxwriter *
setup.py pypi
third_party/VLMEvalKit/setup.py pypi
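
Because a few dependencies are pinned to exact versions (`numpy`, `openai`, `torch`, `typing_extensions`), a quick check of the active environment can catch mismatches before a long run. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `find_mismatches` is hypothetical, and local build suffixes such as `+cu121` are ignored on the assumption that they do not matter for the pin.

```python
from importlib import metadata

# Exact pins copied from requirements.txt as listed above.
PINNED = {
    "numpy": "1.23.4",
    "openai": "1.3.5",
    "torch": "2.1.2",
    "typing_extensions": "4.7.1",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every pin that is not met.

    installed is None when the package is missing. get_version is
    injectable so the check can be exercised without touching the
    real environment.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build suffix such as "+cu121" before comparing.
        base = installed.split("+")[0] if installed is not None else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means all four pins are satisfied in the current environment.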