https://github.com/cyberagentailab/diverse-mbr
Code of "Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding" 2024
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (codemeta.json file found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 7.3%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: CyberAgentAILab
- License: MIT
- Language: Python
- Default Branch: master
- Homepage: https://aclanthology.org/2024.findings-acl.503/
- Size: 141 KB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Diverse Minimum Bayes Risk Decoding
This repository contains the code for the experiments in Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding.
The code was tested on Ubuntu 20.04 with Python 3.8 and CUDA 11.0 (Docker image nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04). The code is provided mostly as-is, with little refactoring.
Installation
git clone git@github.com:CyberAgentAILab/diverse-mbr
cd diverse-mbr
pip install -r requirements.txt
Usage
The code runs in two steps.
1. sample.sh samples candidate outputs from the model.
2. run_mbr.sh computes the MBR output from the sampled candidates.
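The MBR step can be illustrated with a minimal sketch (this is not the repository's implementation): given candidates sampled from the model, MBR selects the one with the highest expected utility, treating the other samples as pseudo-references. A toy unigram-F1 utility stands in here for the BLEU/BERTScore-style metrics used in practice.

```python
def unigram_f1(hyp, ref):
    """Toy utility: unigram F1 overlap between two token lists."""
    hs, rs = set(hyp), set(ref)
    if not hs or not rs:
        return 0.0
    overlap = len(hs & rs)
    p, r = overlap / len(hs), overlap / len(rs)
    return 2 * p * r / (p + r) if p + r else 0.0

def mbr_decode(candidates, utility=unigram_f1):
    """Return the candidate with the highest mean utility
    against all samples (pseudo-references)."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)
    return max(candidates, key=expected_utility)

samples = [
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
    "the cat is on the mat".split(),
    "dogs run fast".split(),
]
best = mbr_decode(samples)  # the candidate closest to the consensus
```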
Sampling candidates
./experiments/sample.sh -d [DATASET] -s [NUMBER OF SAMPLES]
Computing Diverse MBR and KMBR
./experiments/run_mbr.sh -d [DATASET] -s [NUMBER OF SAMPLES] -a [ALGORITHM]
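The diversity-aware variants select a set of outputs rather than a single one. As a rough illustration of the idea (a greedy sketch under a quality-minus-redundancy objective, not the repository's algorithm), one can pick k candidates that score high on expected utility while penalising similarity to outputs already selected:

```python
def unigram_f1(hyp, ref):
    """Toy utility: unigram F1 overlap between two token lists."""
    hs, rs = set(hyp), set(ref)
    if not hs or not rs:
        return 0.0
    overlap = len(hs & rs)
    p, r = overlap / len(hs), overlap / len(rs)
    return 2 * p * r / (p + r) if p + r else 0.0

def greedy_diverse_mbr(candidates, k, utility=unigram_f1, diversity_weight=1.0):
    """Greedily pick k candidates, trading off expected utility (quality)
    against similarity to already-selected outputs (diversity)."""
    n = len(candidates)
    exp_util = [sum(utility(h, r) for r in candidates) / n for h in candidates]
    selected, remaining = [], list(range(n))
    while remaining and len(selected) < k:
        def gain(i):
            penalty = sum(utility(candidates[i], candidates[j]) for j in selected)
            return exp_util[i] - diversity_weight * penalty
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

samples = [
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
    "the cat is on the mat".split(),
    "dogs run fast".split(),
]
picked = greedy_diverse_mbr(samples, k=2)
```

With the toy data above, the second pick skips the near-duplicate paraphrases in favour of the dissimilar candidate, which is the behaviour the diversity penalty is meant to induce.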
Example on WMT'19 En-De
Use sacrebleu to prepare the benchmark dataset.
mkdir -p ./dataset/wmt19-text
sacrebleu -t wmt19 -l en-de --echo src > ./dataset/wmt19-text/wmt19.en-de.en
sacrebleu -t wmt19 -l en-de --echo ref > ./dataset/wmt19-text/wmt19.en-de.de
Sample candidates on WMT'19 En-De
./experiments/sample.sh -d wmt19.en-de
Computing Diverse MBR and K-Medoid MBR on WMT'19 En-De
./experiments/run_mbr.sh -d wmt19.en-de -m wmt19-en-de -a diverse
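One common way to quantify the diversity of a selected output set is distinct-n, the fraction of unique n-grams across the outputs (the paper's exact evaluation metrics may differ; this is only an illustrative sketch):

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a set of outputs;
    higher values indicate more diverse outputs."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, two identical outputs score 0.5 under distinct-2, while two outputs sharing no bigrams score 1.0.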
Reference
Bibtex:
@inproceedings{jinnai-etal-2024-generating,
title = "Generating Diverse and High-Quality Texts by Minimum {B}ayes Risk Decoding",
author = "Jinnai, Yuu and
Honda, Ukyo and
Morimura, Tetsuro and
Zhang, Peinan",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.503",
pages = "8494--8525",
}
Contact
For any questions, feel free to raise an issue or contact me at jinnai_yu@cyberagent.co.jp.
Acknowledgements
The MS COCO dataset is licensed under a Creative Commons Attribution 4.0 License.
Owner
- Name: CyberAgent AI Lab
- Login: CyberAgentAILab
- Kind: organization
- Location: Japan
- Website: https://cyberagent.ai/ailab/
- Twitter: cyberagent_ai
- Repositories: 7
- Profile: https://github.com/CyberAgentAILab
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Dependencies
- absl-py *
- accelerate *
- bert_score ==0.3.13
- bitsandbytes ==0.40.2
- datasets *
- einops *
- evaluate *
- google-cloud-storage *
- nltk ==3.8.1
- peft ==0.7.1
- py7zr *
- rouge-score ==0.1.2
- sacremoses ==0.0.53
- scikit-learn-extra ==0.3.0
- sortedcontainers *
- subword-nmt ==0.3.8
- torchmetrics ==0.10.3
- transformers *
- unbabel-comet *