aki
Official implementation of "Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs"
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 19
- Watchers: 1
- Forks: 4
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
This repo contains an official PyTorch implementation of Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs by Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi.
Overview
Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are built on decoder-only LLMs with a causal attention mechanism, which prevents earlier modalities (e.g., images) from incorporating information from later modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA), enabling image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance on 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters or increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios.
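The core idea can be sketched as a small change to the attention mask. The following is a minimal illustration, not the repo's actual implementation: it assumes image tokens occupy the first positions of the sequence and simply unlocks the image rows of a standard causal mask so they may attend to all text columns, while text tokens remain causal.

```python
import torch

def mma_mask(n_img: int, n_txt: int) -> torch.Tensor:
    """Build a modality-mutual attention mask (True = attention allowed).

    A standard causal (lower-triangular) mask forbids earlier tokens
    (images) from seeing later tokens (text). MMA additionally lets
    every image token attend to every text token; text tokens keep
    their usual causal masking.
    """
    n = n_img + n_txt
    # Start from a lower-triangular causal mask.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Unlock: image rows may attend to all text columns.
    mask[:n_img, n_img:] = True
    return mask

m = mma_mask(n_img=2, n_txt=3)
assert m[0, 4].item()       # image token 0 now sees the last text token
assert not m[2, 4].item()   # first text token still cannot see the future
```

A mask like this can be passed to an attention layer in place of the causal mask; where exactly it is injected in AKI's decoder is defined by the code under codes/open_flamingo.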

Usage
Prerequisites
Our environment is Python 3.12 with PyTorch >= 2.0.1. For more details, please check create_env.sh.
1. Clone the repo
git clone https://github.com/sony/aki.git && cd aki
2. Install the corresponding packages
bash codes/open_flamingo/scripts/create_env.sh
Pre-Training
First, run cd codes/open_flamingo.
1. Prepare datasets in the webdataset format. In this paper, we adopt the pre-training datasets from BLIP-3, including BLIP3-Kale and BLIP3-OCR-200m.
2. Start pre-training
bash scripts/run_train.sh
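Webdataset stores samples as tar "shards" in which the files belonging to one sample share a key and differ only in extension. The following is a minimal sketch of packaging image-text pairs into such a shard; the keys, the .jpg/.txt pairing, and the shard name are illustrative assumptions, so check the BLIP-3 dataset releases and the repo's data-loading code for the exact schema expected here.

```python
import io
import tarfile

def write_shard(samples, path):
    """Write (key, image_bytes, caption) samples as one webdataset tar shard.

    Each sample becomes two members of the tar archive, e.g.
    000000.jpg and 000000.txt, which webdataset-style loaders
    regroup into a single sample by their shared key.
    """
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            for suffix, payload in [(".jpg", image_bytes),
                                    (".txt", caption.encode("utf-8"))]:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Illustrative usage with a fake JPEG payload:
write_shard([("000000", b"\xff\xd8fake-jpeg", "a photo of a cat")],
            "shard-000000.tar")
```

In practice one would write many such shards (shard-000000.tar, shard-000001.tar, ...) so the training loader can stream and shuffle across them.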
Instruction Finetuning
First, run cd codes/open_flamingo.
1. Prepare the SFT datasets in their original formats
2. Start instruction finetuning
bash scripts/run_sft.sh
Evaluations
CV-Bench
The benchmark dataset is fetched from the official release.
python3.12 eval_cv_bench/eval.py {model_path}
Other VLM Benchmarks
Under construction; we are preparing a PR to VLMEvalKit.
Local Demonstration
First, run cd codes/open_flamingo.
Start the local demo
python3.12 local_demo.py
Results
Main Comparisons with the Same Configurations (Table 1)
| | MMEP | MMEC | MMB | SEEDI | LLaVAW | MMMU | MathVmini | POPE | MM-Vet | RealWorldQA | CV-Bench2D | CV-Bench3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (I&T)PT + (I&T)SFT | 1226.3 | 258.2 | 64.9 | 64.1 | 47.0 | 31.1 | 24.2 | 79.8 | 24.3 | 50.6 | 45.2 | 54.3 |
| CCA [Xing et al., 2024] | 1212.7 | 243.6 | 67.4 | 65.3 | 54.0 | 34.6 | 25.6 | 81.9 | 29.0 | 52.7 | 56.0 | 62.8 |
| (w/o T&I)PT | 1046.3 | 226.4 | 31.7 | 45.1 | 38.1 | 27.2 | 23.8 | 65.0 | 17.2 | 40.1 | 53.2 | 54.8 |
| (w/o I&T)PT | 1013.2 | 208.6 | 32.0 | 43.3 | 37.9 | 27.7 | 22.4 | 70.4 | 20.6 | 39.5 | 55.4 | 53.0 |
| (w/o T&I)SFT | 1194.8 | 289.3 | 58.5 | 61.1 | 40.2 | 28.0 | 21.9 | 79.0 | 22.8 | 47.8 | 41.4 | 63.0 |
| (w/o I&T)SFT | 1166.2 | 264.3 | 58.4 | 60.8 | 36.9 | 26.7 | 23.1 | 76.8 | 20.4 | 46.9 | 43.3 | 61.2 |
| DOT (Ours) | 1267.8 | 251.4 | 43.8 | 54.7 | 47.5 | 30.7 | 25.6 | 82.7 | 25.0 | 50.5 | 52.2 | 58.1 |
| MMA (Ours) | 1363.7 | 315.4 | 71.8 | 67.1 | 59.6 | 37.3 | 26.4 | 82.7 | 30.2 | 52.3 | 57.8 | 64.1 |
| Improvements | 10.9% | 29.5% | 4.3% | 2.8% | 10.4% | 7.8% | 3.1% | 1% | 4.1% | - | 3.2% | 2.1% |
AKI-4B (Table 2)
| | MMEP | MMEC | MMB | SEEDI | LLaVAW | MMMU | MathVmini | POPE | MM-Vet | RealWorldQA | CV-Bench2D | CV-Bench3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AKI-4B | 1491.9 | 362.9 | 73.1 | 69.4 | 74.6 | 38.7 | 32.1 | 86.9 | 40.8 | 58.9 | 62.1 | 71.8 |
Contact
For any questions or issues, please feel free to open an issue/PR or reach out: wei-yao.wang@sony.com.
Citation
If you find this repository relevant or useful to your research, please consider citing our paper:
@misc{wywang2025AKI,
  title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs},
  author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
  year={2025},
  eprint={2503.02597},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02597},
}
Acknowledgements
The training code is based on the open_flamingo repo, and the evaluation code is based on the VLMEvalKit repo. The SFT config file is built on top of the HoneyBee repo. Thank you for making your code public! We also thank the XGen-MM repo: we use their released data for pre-training and take inspiration from their model implementation.
Owner
- Name: Sony
- Login: sony
- Kind: organization
- Location: Minato-ku, Tokyo, Japan
- Website: https://www.sony.com/en/
- Repositories: 36
- Profile: https://github.com/sony
Sony Group Corporation
Citation (CITATIONS.bib)
@misc{wywang2025AKI,
  title={Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs},
  author={Wei-Yao Wang and Zhao Wang and Helen Suzuki and Yoshiyuki Kobayashi},
  year={2025},
  eprint={2503.02597},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02597},
}
GitHub Events
Total
- Issues event: 3
- Watch event: 18
- Issue comment event: 4
- Push event: 5
- Public event: 1
- Fork event: 5
Last Year
- Issues event: 3
- Watch event: 18
- Issue comment event: 4
- Push event: 5
- Public event: 1
- Fork event: 5
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Wei-Yao Wang | W****g@s****m | 9 |
| Joe Wang | 1****7 | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- Manni1000 (1)