https://github.com/bytedance/attention2probability

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.1%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: bytedance
License: apache-2.0
Language: Python
Default Branch: main
Size: 46.9 KB

Statistics

Stars: 1
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Created 10 months ago · Last pushed 10 months ago

Metadata Files

Readme License

Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

Attention2Probability (A2P) is a lightweight terminology intervention scheme for speech. It uses a cross-attention mechanism to retrieve the terms that are likely to appear in the audio, then adds these terms to the prompt of the LLM to complete the term intervention.

News

[2025-08-27] We have released the training and inference code for A2P.

Structure

The overall architecture of Attention2Probability. Audio features are extracted and then fed into a cross-attention retriever, which retrieves the Top-k terms with the highest probability of occurrence within the audio. These retrieved terms are concatenated with the prompt. Finally, the prompt and the audio features are jointly input into the speech large language model.
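The retrieval step described above can be illustrated with a small, dependency-free sketch. It is not the A2P implementation: the scoring rule (sigmoid of the attention-pooled similarity), the function names, and the prompt wording are all assumptions made for illustration, and real audio features and term embeddings would come from the model rather than hand-written vectors.

```python
import math


def term_probabilities(audio_frames, term_embeddings):
    """Score each candidate term against the audio frames.

    Assumption: a term's probability is the sigmoid of its
    attention-weighted similarity to the frames. The scoring used
    by A2P itself may differ.
    """
    dim = len(audio_frames[0])
    probs = []
    for term in term_embeddings:
        # Scaled dot product between the term (query) and each frame (key).
        logits = [
            sum(q * k for q, k in zip(term, frame)) / math.sqrt(dim)
            for frame in audio_frames
        ]
        # Softmax over frames gives the cross-attention weights.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Pool the similarities with those weights, then squash into (0, 1).
        pooled = sum(w * x for w, x in zip(weights, logits))
        probs.append(1.0 / (1.0 + math.exp(-pooled)))
    return probs


def retrieve_top_k(terms, probs, k):
    """Return the k terms with the highest estimated probability."""
    ranked = sorted(zip(terms, probs), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def build_prompt(base_prompt, retrieved_terms):
    """Append the retrieved terms to the prompt (hypothetical wording)."""
    if not retrieved_terms:
        return base_prompt
    return f"{base_prompt} Possible terms: {', '.join(retrieved_terms)}."


# Toy example: two audio frames and three candidate terms in a 2-d space.
frames = [[1.0, 0.0], [0.9, 0.1]]
terms = ["transformer", "banana", "attention"]
embeddings = [[2.0, 0.0], [-2.0, 0.0], [1.0, 0.0]]

probs = term_probabilities(frames, embeddings)
top_terms = retrieve_top_k(terms, probs, k=2)
prompt = build_prompt("Transcribe the audio.", top_terms)
```

In this toy setup the two terms whose embeddings align with the frames are retrieved and appended to the prompt, while the term pointing in the opposite direction is dropped.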

Installation

A2P is built on the open-source toolkit accelerate. Install the dependencies with:

pip3 install -r requirements.txt

Training

1. Download the data to /pathtodata. Make sure to update the audio paths in the JSON file so that they point to your local copy.
2. Download the model to /path/pretrained-modelh. You can also download Qwen2-Audio-Instruction and split it into the audio_tower, projector, and embedding.
3. Run `bash ./retriever/train.sh` on an A100-SXM-80GB GPU.
4. In the dataset configuration, the phrase_type parameter selects either word-level or phrase-level granularity. Note that models for Chinese are generally trained only at the phrase level, because word-level granularity is not meaningful for Chinese.
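Updating the audio paths in the dataset JSON (the first training step) can be scripted. The sketch below assumes the file is a JSON list of records and that each record stores its audio path under an `audio` key; that key name and both function names are hypothetical, so check the actual dataset layout before using it.

```python
import json
import os


def rebase_audio_paths(records, new_root, key="audio"):
    """Return copies of the records with `key` pointing under new_root.

    `key` is a hypothetical field name. Only the file name is kept,
    so any subdirectory structure in the old path is dropped.
    """
    rebased = []
    for record in records:
        updated = dict(record)  # leave the input record untouched
        updated[key] = os.path.join(new_root, os.path.basename(record[key]))
        rebased.append(updated)
    return rebased


def rewrite_dataset(json_in, json_out, new_root, key="audio"):
    """Read a JSON list of records, rebase its audio paths, and save it."""
    with open(json_in, encoding="utf-8") as f:
        records = json.load(f)
    with open(json_out, "w", encoding="utf-8") as f:
        json.dump(rebase_audio_paths(records, new_root, key), f,
                  ensure_ascii=False, indent=2)
```

Writing to a separate output file keeps the downloaded JSON intact in case the assumed layout turns out to be wrong.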

Inference

1. Follow steps 1-2 of Training.
2. Download the checkpoint to ckpt.
3. Run `python3 ./infer/infer.py --config ./infer/infer_config` on an A100-SXM-80GB GPU. You can change the settings in infer_config.json. Enjoy!

Citation

If you find A2P useful, please cite the paper:

@inproceedings{dy2025attention,
  title={{Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System}},
  author={Yangfan Du and Jun Zhang and Bin Wang and Jin Qiu and Lu Huang and Yuan Ge and Xiaoqian Liu and Tong Xiao and Jingbo Zhu},
}

Owner

Name: Bytedance Inc.
Login: bytedance
Kind: organization
Location: Singapore

Website: https://opensource.bytedance.com
Twitter: ByteDanceOSS
Repositories: 255
Profile: https://github.com/bytedance

GitHub Events

Total

Watch event: 1
Public event: 1
Push event: 1
Fork event: 1

Last Year

Watch event: 1
Public event: 1
Push event: 1
Fork event: 1

Dependencies

requirements.txt pypi

accelerate ==1.9.0
audioread ==3.0.1
beartype ==0.21.0
huggingface-hub ==0.34.1
jieba ==0.42.1
jiwer ==4.0.0
librosa ==0.11.0
numba ==0.60.0
numpy ==1.26.4
packaging ==24.1
pillow ==10.2.0
prompt_toolkit ==3.0.48
protobuf ==3.20.3
requests ==2.32.3
sacrebleu ==2.5.1
safetensors ==0.5.3
six ==1.16.0
snowballstemmer ==2.2.0
soundfile ==0.13.1
tokenizers ==0.21.1
torch ==2.1.0
torchaudio ==2.1.0
torchvision ==0.16.0
tox ==3.28.0
tqdm ==4.67.1
transformers ==4.51.3
typing_extensions ==4.12.2
