lrv-instruction

[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

https://github.com/fuxiaoliu/lrv-instruction

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.3%) to scientific vocabulary

Keywords

chatgpt evaluation evaluation-metrics foundation-models gpt gpt-4 hallucination iclr iclr2024 llama llava multimodal object-detection prompt-engineering vicuna vision vision-and-language vqa
Last synced: 6 months ago

Repository

[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Basic Info
Statistics
  • Stars: 285
  • Watchers: 13
  • Forks: 14
  • Open Issues: 4
  • Releases: 0
Topics
chatgpt evaluation evaluation-metrics foundation-models gpt gpt-4 hallucination iclr iclr2024 llama llava multimodal object-detection prompt-engineering vicuna vision vision-and-language vqa
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [ICLR 2024]

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang

[Project Page] [Paper]

You can compare our models with the original models below. If the online demos don't work, please email fl3es@umd.edu. If you find our work interesting, please cite it. Thanks!

```bibtex
@article{liu2023aligning,
  title={Aligning Large Multi-Modal Model with Robust Instruction Tuning},
  author={Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan},
  journal={arXiv preprint arXiv:2306.14565},
  year={2023}
}
@article{liu2023hallusionbench,
  title={HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models},
  author={Liu, Fuxiao and Guan, Tianrui and Li, Zongxia and Chen, Lichang and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2310.14566},
  year={2023}
}
@article{liu2023mmc,
  title={MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning},
  author={Liu, Fuxiao and Wang, Xiaoyang and Yao, Wenlin and Chen, Jianshu and Song, Kaiqiang and Cho, Sangwoo and Yacoob, Yaser and Yu, Dong},
  journal={arXiv preprint arXiv:2311.10774},
  year={2023}
}
```

Both LRV-V1 and LRV-V2 support training on V100 32GB GPUs.

📺 [LRV-V2(Mplug-Owl) Demo], [mplug-owl Demo]

📺 [LRV-V1(MiniGPT4) Demo], [MiniGPT4-7B Demo]

Updates


Model Checkpoints

| Model name | Backbone | Download Link |
| --- | --- | ---: |
| LRV-Instruction V2 | Mplug-Owl | link |
| LRV-Instruction V1 | MiniGPT4 | link |

Instruction Data

| Model name | Instruction | Image |
| --- | --- | ---: |
| LRV Instruction | link | link |
| LRV Instruction (More) | link | link |
| Chart Instruction | link | link |

Visual Instruction Data (LRV-Instruction)

We update the dataset with 300k visual instructions generated by GPT-4, covering 16 vision-and-language tasks with open-ended instructions and answers. LRV-Instruction includes both positive and negative instructions for more robust visual instruction tuning. The images in our dataset come from Visual Genome, and the data can be accessed from here. Each instance looks like the following:

{'image_id': '2392588', 'question': 'Can you see a blue teapot on the white electric stove in the kitchen?', 'answer': 'There is no mention of a teapot on the white electric stove in the kitchen.', 'task': 'negative'}

For each instance, image_id refers to the image from Visual Genome, question and answer form the instruction-answer pair, and task indicates the task name. You can download the images from here.
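For quick inspection, here is a minimal sketch for loading the instruction file and resolving the corresponding images, assuming the download is a JSON list of dicts with the fields shown above; the file name lrv_instruction.json and the image folder name are placeholders, not the actual release names.

```python
import json
from collections import Counter

# Load the LRV-Instruction annotations (assumed here to be a JSON list of
# dicts with 'image_id', 'question', 'answer', and 'task' fields, matching
# the example instance above). The file name is a placeholder.
with open("lrv_instruction.json", "r") as f:
    data = json.load(f)

# How many instances per task (e.g. 'negative' vs. the positive task names)?
print(Counter(item["task"] for item in data).most_common())

# Resolve each instance to its Visual Genome image file.
image_dir = "VG_100K"  # placeholder for the downloaded Visual Genome folder
for item in data[:3]:
    image_path = f"{image_dir}/{item['image_id']}.jpg"
    print(image_path, "|", item["question"], "->", item["answer"])
```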

We provide our prompts for GPT-4 queries to better facilitate research in this domain. Please check out the prompts folder for positive and negative instance generation. negative1_generation_prompt.txt contains the prompt to generate negative instructions with Nonexistent Element Manipulation. negative2_generation_prompt.txt contains the prompt to generate negative instructions with Existent Element Manipulation. You can refer to the code here to generate more data. Please see our paper for more details.
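As a rough illustration (not the authors' released generation code, which is linked above), a GPT-4 query built from one of the released prompt templates might look like the sketch below; the OpenAI client usage, the model name, and the way dense captions are appended to the template are assumptions.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt template from the prompts/ folder (Nonexistent Element Manipulation).
with open("prompts/negative1_generation_prompt.txt") as f:
    template = f.read()

# Dense captions with bounding boxes for one Visual Genome image
# (placeholder text; in practice this comes from the VG annotations).
dense_captions = "white electric stove in the kitchen [120, 80, 260, 300]; ..."

# How the captions are appended to the template is an assumption; see the
# linked generation code for the authors' actual pipeline.
response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": template + "\n\n" + dense_captions}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```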

LRV-Instruction can equip the LMM with the ability to say no and also provide correct answers, even though there are no chart images in the LRV-Instruction dataset.


Models

🐒LRV-Instruction(V1) Setup

  • LRV-Instruction(V1) is based on MiniGPT4-7B.

1. Clone this repository

```bash
git clone https://github.com/FuxiaoLiu/LRV-Instruction.git
```

2. Install packages

```Shell
conda env create -f environment.yml --name LRV
conda activate LRV
```

3. Prepare the Vicuna weights

Our model is finetuned on MiniGPT-4 with Vicuna-7B. Please refer to the instructions here to prepare the Vicuna weights, or download them from here. Then set the path to the Vicuna weights in MiniGPT-4/minigpt4/configs/models/minigpt4.yaml at Line 15.

4. Prepare the pretrained checkpoint of our model

Download the pretrained checkpoints from here

Then set the path to the pretrained checkpoint in MiniGPT-4/eval_configs/minigpt4_eval.yaml at Line 11. This checkpoint is based on MiniGPT-4-7B. We will release the checkpoints for MiniGPT-4-13B and LLaVA in the future.
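If you would rather patch the two config files from a script than edit them by hand, a minimal sketch with PyYAML is below. The key names (llama_model for the Vicuna weights and ckpt for our checkpoint) are assumed from the upstream MiniGPT-4 configs, so verify them against the files in this repo before running.

```python
import yaml  # PyYAML

def set_config_value(path, keys, value):
    """Load a YAML config, set a nested key, and write the file back.

    Note: round-tripping with safe_load/safe_dump drops YAML comments.
    """
    with open(path) as f:
        cfg = yaml.safe_load(f)
    node = cfg
    for key in keys[:-1]:
        node = node[key]
    node[keys[-1]] = value
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

# Key names assumed from the upstream MiniGPT-4 configs; verify before running.
set_config_value("MiniGPT-4/minigpt4/configs/models/minigpt4.yaml",
                 ["model", "llama_model"], "/path/to/vicuna-7b")
set_config_value("MiniGPT-4/eval_configs/minigpt4_eval.yaml",
                 ["model", "ckpt"], "/path/to/lrv_v1_checkpoint.pth")
```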

5. Set the dataset path

After downloading the dataset, set the dataset path in MiniGPT-4/minigpt4/configs/datasets/cc_sbu/align.yaml at Line 5. The structure of the dataset folder is as follows:

/MiniGPT-4/cc_sbu_align
├── image (Visual Genome images)
└── filter_cap.json
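Before training, a quick sanity check of the folder layout can save a failed run. The sketch below assumes the MiniGPT-4 cc_sbu_align convention of a top-level annotations key in filter_cap.json and .jpg image files; both are assumptions, so adjust if the released files differ.

```python
import json
from pathlib import Path

root = Path("MiniGPT-4/cc_sbu_align")  # the dataset path set in align.yaml

image_dir = root / "image"
ann_file = root / "filter_cap.json"

assert image_dir.is_dir(), f"missing image folder: {image_dir}"
assert ann_file.is_file(), f"missing annotation file: {ann_file}"

with open(ann_file) as f:
    anns = json.load(f)

# MiniGPT-4's cc_sbu_align format keeps records under an 'annotations' key;
# this is an assumption, so fall back to the raw object if it differs.
records = anns.get("annotations", anns) if isinstance(anns, dict) else anns
print(f"{len(records)} annotation records, "
      f"{len(list(image_dir.glob('*.jpg')))} images")
```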

6. Local Demo

Try out the demo demo.py of our finetuned model on your local machine by running

```
cd ./MiniGPT-4
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```

You can try the examples here.

7. Model Inference

Set the path of the inference instruction file here, the inference image folder here, and the output location here. We don't run inference as part of the training process.

```
cd ./MiniGPT-4
python inference.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```

🐒LRV-Instruction(V2) Setup

  • LRV-Instruction(V2) is based on mPLUG-Owl-7B.

1. Install the environment according to mplug-owl.

We finetuned mPLUG-Owl on 8 V100 GPUs. If you run into any issues reproducing this on V100s, feel free to let me know!

2. Download the Checkpoint

First, download the mPLUG-Owl checkpoint from link and the trained LoRA model weights from here.

3. Edit the Code

In mplug-owl/serve/model_worker.py, edit the following code and set the path of the LoRA model weights in lora_path.

```python
self.image_processor = MplugOwlImageProcessor.from_pretrained(base_model)
self.tokenizer = AutoTokenizer.from_pretrained(base_model)
self.processor = MplugOwlProcessor(self.image_processor, self.tokenizer)
self.model = MplugOwlForConditionalGeneration.from_pretrained(
    base_model,
    load_in_8bit=load_in_8bit,
    torch_dtype=torch.bfloat16 if bf16 else torch.half,
    device_map="auto",
)
self.tokenizer = self.processor.tokenizer

peft_config = LoraConfig(
    target_modules=r'.*language_model.*\.(q_proj|v_proj)',
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
)
self.model = get_peft_model(self.model, peft_config)
lora_path = 'Your lora model path'
prefix_state_dict = torch.load(lora_path, map_location='cpu')
self.model.load_state_dict(prefix_state_dict)
```
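For reference, the LoRA adapter above targets only the q_proj and v_proj projections inside the language model, so only a small set of low-rank weights (r=8) is trained and loaded on top of the 8-bit base model; lora_path should point to the trained LoRA weights downloaded in step 2.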

4. Local Demo

When you launch the demo on your local machine, you might find there is no space for the text input. This is caused by a version conflict between Python and Gradio. The simplest fix is:

```
conda activate LRV
python -m serve.web_server --base-model 'the mplug-owl checkpoint directory' --bf16
```

5. Model Inference

First, git clone the code from mplug-owl, replace /mplug/serve/model_worker.py with our /utils/model_worker.py, and add the file /utils/inference.py. Then edit the input data file and image folder path. Finally, run:

```
python -m serve.inference --base-model 'your checkpoint directory' --bf16
```

Evaluation(GAVIE)


We introduce GPT4-Assisted Visual Instruction Evaluation (GAVIE), a more flexible and robust approach to measuring the hallucination generated by LMMs without the need for human-annotated groundtruth answers. GPT-4 takes the dense captions with bounding box coordinates as the image content and compares the human instruction with the model response. We then ask GPT-4 to act as a smart teacher and score (0-10) students' answers based on two criteria: (1) Accuracy: whether the response hallucinates relative to the image content. (2) Relevancy: whether the response directly follows the instruction. prompts/GAVIE.txt contains the GAVIE prompt.

Our evaluation set is available here. Each instance looks like: {'image_id': '2380160', 'question': 'Identify the type of transportation infrastructure present in the scene.'} For each instance, image_id refers to the image from Visual Genome and question is the instruction. answer_gt refers to the groundtruth answer from text-only GPT-4, but we don't use it in our evaluation; instead, we use text-only GPT-4 to evaluate the model output, with the dense captions and bounding boxes from the Visual Genome dataset as the visual content.

To evaluate your model outputs, first download the VG annotations from here. Second, generate the evaluation prompt according to the code here. Third, feed the prompt into GPT-4.
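To make step three concrete, here is a minimal sketch of sending one generated evaluation prompt to GPT-4; the per-instance prompt file name is hypothetical, and the exact model identifier for GPT4-32k-0314 depends on your API access, so both are assumptions.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One evaluation prompt per model answer, produced in step two: dense captions
# and bounding boxes from Visual Genome plus the instruction and the model's
# response, wrapped in the prompts/GAVIE.txt template. The per-instance file
# name below is hypothetical.
with open("gavie_prompts/2380160.txt") as f:
    gavie_prompt = f.read()

response = client.chat.completions.create(
    model="gpt-4-32k",  # assumed identifier; the leaderboard used GPT4-32k-0314
    messages=[{"role": "user", "content": gavie_prompt}],
    temperature=0.0,    # deterministic scoring
)

# GPT-4 replies with Accuracy and Relevancy scores (0-10) in the format
# defined by GAVIE.txt; parse them from the text as needed.
print(response.choices[0].message.content)
```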

Leaderboards

GPT-4 (GPT4-32k-0314) works as a smart teacher and scores (0-10) students' answers based on two criteria:

(1) Accuracy: whether the response hallucinates relative to the image content. (2) Relevancy: whether the response directly follows the instruction.

| Method | GAVIE-Accuracy | GAVIE-Relevancy |
| --- | --- | ---: |
| LLaVA1.0-7B | 4.36 | 6.11 |
| LLaVA1.5-7B | 6.42 | 8.20 |
| MiniGPT4-v1-7B | 4.14 | 5.81 |
| MiniGPT4-v2-7B | 6.01 | 8.10 |
| mPLUG-Owl-7B | 4.84 | 6.35 |
| InstructBLIP-7B | 5.93 | 7.34 |
| MMGPT-7B | 0.91 | 1.79 |
| Ours-7B | 6.58 | 8.46 |

Acknowledgement

Citation

If you find our work useful for your research and applications, please cite it using this BibTeX:

```bibtex
@article{liu2023aligning,
  title={Aligning Large Multi-Modal Model with Robust Instruction Tuning},
  author={Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan},
  journal={arXiv preprint arXiv:2306.14565},
  year={2023}
}
```

License

This repository is under the BSD 3-Clause License. Much of the code is based on MiniGPT4 and mPLUG-Owl, which are also released under the BSD 3-Clause License here.

Owner

  • Login: FuxiaoLiu
  • Kind: user

Citation (citation.txt)

@article{liu2023aligning,
  title={Aligning Large Multi-Modal Model with Robust Instruction Tuning},
  author={Liu, Fuxiao and Lin, Kevin and Li, Linjie and Wang, Jianfeng and Yacoob, Yaser and Wang, Lijuan},
  journal={arXiv preprint arXiv:2306.14565},
  year={2023}
}
@article{liu2023hallusionbench,
  title={HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V (ision), LLaVA-1.5, and Other Multi-modality Models},
  author={Liu, Fuxiao and Guan, Tianrui and Li, Zongxia and Chen, Lichang and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2310.14566},
  year={2023}
}
@article{liu2020visual,
  title={Visual news: Benchmark and challenges in news image captioning},
  author={Liu, Fuxiao and Wang, Yinghan and Wang, Tianlu and Ordonez, Vicente},
  journal={arXiv preprint arXiv:2010.03743},
  year={2020}
}
@article{liu2023covid,
  title={COVID-VTS: Fact Extraction and Verification on Short Video Platforms},
  author={Liu, Fuxiao and Yacoob, Yaser and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2302.07919},
  year={2023}
}
@article{liu2023documentclip,
  title={DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents},
  author={Liu, Fuxiao and Tan, Hao and Tensmeyer, Chris},
  journal={arXiv preprint arXiv:2306.06306},
  year={2023}
}
@article{liu2023mmc,
  title={MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning},
  author={Liu, Fuxiao and Wang, Xiaoyang and Yao, Wenlin and Chen, Jianshu and Song, Kaiqiang and Cho, Sangwoo and Yacoob, Yaser and Yu, Dong},
  journal={arXiv preprint arXiv:2311.10774},
  year={2023}
}
@article{li2023towards,
  title={Towards understanding in-context learning with contrastive demonstrations and saliency maps},
  author={Li, Zongxia and Xu, Paiheng and Liu, Fuxiao and Song, Hyemi},
  journal={arXiv preprint arXiv:2307.05052},
  year={2023}
}
@article{wang2024mementos,
  title={Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences},
  author={Wang, Xiyao and Zhou, Yuhang and Liu, Xiaoyu and Lu, Hongjin and Xu, Yuancheng and He, Feihong and Yoon, Jaehong and Lu, Taixi and Bertasius, Gedas and Bansal, Mohit and others},
  journal={arXiv preprint arXiv:2401.10529},
  year={2024}
}

GitHub Events

Total
  • Issues event: 2
  • Watch event: 37
Last Year
  • Issues event: 2
  • Watch event: 37

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 366
  • Total Committers: 1
  • Avg Commits per committer: 366.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
FuxiaoLiu 6****u 366

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 24
  • Total pull requests: 0
  • Average time to close issues: 7 days
  • Average time to close pull requests: N/A
  • Total issue authors: 23
  • Total pull request authors: 0
  • Average comments per issue: 1.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • pixas (2)
  • yytzsy (1)
  • aixiaodewugege (1)
  • deepbeepmeep (1)
  • huangjy-pku (1)
  • FuxiaoLiu (1)
  • schuper (1)
  • JKBox (1)
  • KuofengGao (1)
  • huangliang-666 (1)
  • GUOGUO-lab (1)
  • Richar-Du (1)
  • guozhiyao (1)
  • aiiph4 (1)
  • Tizzzzy (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

environment.yml pypi
  • accelerate ==0.19.0
  • aiofiles ==23.1.0
  • aiohttp ==3.8.4
  • aiosignal ==1.3.1
  • altair ==4.2.2
  • antlr4-python3-runtime ==4.9.3
  • anyio ==3.6.2
  • argon2-cffi ==21.3.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.2.3
  • asttokens ==2.2.1
  • async-lru ==2.0.2
  • async-timeout ==4.0.2
  • attrs ==23.1.0
  • babel ==2.12.1
  • backcall ==0.2.0
  • beautifulsoup4 ==4.12.2
  • bitsandbytes ==0.39.0
  • bleach ==6.0.0
  • blinker ==1.6.2
  • blis ==0.7.9
  • braceexpand ==0.1.7
  • cachetools ==5.3.0
  • catalogue ==2.0.8
  • cfgv ==3.3.1
  • click ==8.1.3
  • cmake ==3.26.3
  • comm ==0.1.3
  • confection ==0.0.4
  • contexttimer ==0.3.3
  • contourpy ==1.0.7
  • cycler ==0.11.0
  • cymem ==2.0.7
  • debugpy ==1.6.7
  • decorator ==5.1.1
  • decord ==0.6.0
  • defusedxml ==0.7.1
  • distlib ==0.3.6
  • einops ==0.6.1
  • entrypoints ==0.4
  • et-xmlfile ==1.1.0
  • executing ==1.2.0
  • fairscale ==0.4.4
  • fastapi ==0.95.2
  • fastjsonschema ==2.16.3
  • ffmpy ==0.3.0
  • filelock ==3.12.0
  • fonttools ==4.39.4
  • fqdn ==1.5.1
  • frozenlist ==1.3.3
  • fsspec ==2023.5.0
  • ftfy ==6.1.1
  • gitdb ==4.0.10
  • gitpython ==3.1.31
  • gradio ==3.31.0
  • gradio-client ==0.2.5
  • h11 ==0.14.0
  • httpcore ==0.17.1
  • httpx ==0.24.1
  • huggingface-hub ==0.14.1
  • identify ==2.5.24
  • imageio ==2.28.1
  • importlib-metadata ==6.6.0
  • importlib-resources ==5.12.0
  • iopath ==0.1.10
  • ipykernel ==6.23.1
  • ipython ==8.13.2
  • ipython-genutils ==0.2.0
  • ipywidgets ==8.0.6
  • isoduration ==20.11.0
  • jedi ==0.18.2
  • jinja2 ==3.1.2
  • joblib ==1.2.0
  • json5 ==0.9.14
  • jsonpointer ==2.3
  • jsonschema ==4.17.3
  • jupyter ==1.0.0
  • jupyter-client ==8.2.0
  • jupyter-console ==6.6.3
  • jupyter-core ==5.3.0
  • jupyter-events ==0.6.3
  • jupyter-lsp ==2.1.0
  • jupyter-server ==2.5.0
  • jupyter-server-terminals ==0.4.4
  • jupyterlab ==4.0.0
  • jupyterlab-pygments ==0.2.2
  • jupyterlab-server ==2.22.1
  • jupyterlab-widgets ==3.0.7
  • kaggle ==1.5.13
  • kiwisolver ==1.4.4
  • langcodes ==3.3.0
  • lazy-loader ==0.2
  • linkify-it-py ==2.0.2
  • lit ==16.0.5
  • markdown-it-py ==2.2.0
  • markupsafe ==2.1.2
  • matplotlib ==3.7.1
  • matplotlib-inline ==0.1.6
  • mdit-py-plugins ==0.3.3
  • mdurl ==0.1.2
  • mistune ==2.0.5
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • murmurhash ==1.0.9
  • nbclassic ==1.0.0
  • nbclient ==0.7.4
  • nbconvert ==7.4.0
  • nbformat ==5.8.0
  • nest-asyncio ==1.5.6
  • networkx ==3.1
  • nodeenv ==1.8.0
  • notebook ==6.5.4
  • notebook-shim ==0.2.3
  • nvidia-cublas-cu11 ==11.10.3.66
  • nvidia-cuda-cupti-cu11 ==11.7.101
  • nvidia-cuda-nvrtc-cu11 ==11.7.99
  • nvidia-cuda-runtime-cu11 ==11.7.99
  • nvidia-cudnn-cu11 ==8.5.0.96
  • nvidia-cufft-cu11 ==10.9.0.58
  • nvidia-curand-cu11 ==10.2.10.91
  • nvidia-cusolver-cu11 ==11.4.0.1
  • nvidia-cusparse-cu11 ==11.7.4.91
  • nvidia-nccl-cu11 ==2.14.3
  • nvidia-nvtx-cu11 ==11.7.91
  • omegaconf ==2.3.0
  • opencv-python ==4.7.0.72
  • opencv-python-headless ==4.5.5.64
  • opendatasets ==0.1.22
  • openpyxl ==3.1.2
  • orjson ==3.8.12
  • packaging ==23.1
  • pandas ==2.0.1
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • pathy ==0.10.1
  • pexpect ==4.8.0
  • pickleshare ==0.7.5
  • platformdirs ==3.5.1
  • plotly ==5.14.1
  • portalocker ==2.7.0
  • pre-commit ==3.3.2
  • preshed ==3.0.8
  • prometheus-client ==0.16.0
  • prompt-toolkit ==3.0.38
  • protobuf ==3.20.3
  • psutil ==5.9.5
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.2
  • pyarrow ==12.0.0
  • pycocoevalcap ==1.2
  • pycocotools ==2.0.6
  • pydantic ==1.10.7
  • pydeck ==0.8.1b0
  • pydub ==0.25.1
  • pygments ==2.15.1
  • pympler ==1.0.1
  • pyparsing ==3.0.9
  • pyrsistent ==0.19.3
  • python-dateutil ==2.8.2
  • python-json-logger ==2.0.7
  • python-magic ==0.4.27
  • python-multipart ==0.0.6
  • python-slugify ==8.0.1
  • pytz ==2023.3
  • pywavelets ==1.4.1
  • pyyaml ==6.0
  • pyzmq ==25.0.2
  • qtconsole ==5.4.3
  • qtpy ==2.3.1
  • regex ==2023.5.5
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rich ==13.3.5
  • safetensors ==0.3.1
  • scikit-image ==0.20.0
  • scikit-learn ==1.2.2
  • scipy ==1.9.1
  • semantic-version ==2.10.0
  • send2trash ==1.8.2
  • sentencepiece ==0.1.99
  • six ==1.16.0
  • sklearn ==0.0.post5
  • smart-open ==6.3.0
  • smmap ==5.0.0
  • sniffio ==1.3.0
  • soupsieve ==2.4.1
  • spacy ==3.5.3
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.4
  • srsly ==2.4.6
  • stack-data ==0.6.2
  • starlette ==0.27.0
  • streamlit ==1.22.0
  • sympy ==1.12
  • tenacity ==8.2.2
  • terminado ==0.17.1
  • text-unidecode ==1.3
  • thinc ==8.1.10
  • threadpoolctl ==3.1.0
  • tifffile ==2023.4.12
  • timm ==0.4.12
  • tinycss2 ==1.2.1
  • tokenizers ==0.13.3
  • toml ==0.10.2
  • tomli ==2.0.1
  • toolz ==0.12.0
  • torch ==2.0.1
  • torchvision ==0.15.2
  • tornado ==6.3.2
  • tqdm ==4.65.0
  • traitlets ==5.9.0
  • transformers ==4.29.2
  • triton ==2.0.0
  • typer ==0.7.0
  • tzdata ==2023.3
  • tzlocal ==5.0.1
  • uc-micro-py ==1.0.2
  • uri-template ==1.2.0
  • uvicorn ==0.22.0
  • validators ==0.20.0
  • virtualenv ==20.23.0
  • wasabi ==1.1.1
  • watchdog ==3.0.0
  • wcwidth ==0.2.6
  • webcolors ==1.13
  • webdataset ==0.2.48
  • webencodings ==0.5.1
  • websocket-client ==1.5.1
  • websockets ==11.0.3
  • widgetsnbextension ==4.0.7
  • yarl ==1.9.2
  • zipp ==3.15.0