https://github.com/cyberagentailab/diverse-mbr

Code of "Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding" 2024


README.md

Diverse Minimum Bayes Risk Decoding

This repository contains the code for the experiments in Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding.

The code is tested on Ubuntu 20.04 with Python 3.8 and CUDA 11.0 (Docker image nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04). It is provided mostly as is, with little refactoring.

Installation

git clone git@github.com:CyberAgentAILab/diverse-mbr
cd diverse-mbr
pip install -r requirements.txt

Usage

The code runs in two steps:

  1. sample.sh samples candidates.
  2. run_mbr.sh computes the MBR candidate from the sampled candidates.
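To make the second step concrete, here is a minimal sketch of plain MBR selection over a set of sampled candidates. It is not the repository's implementation: the function names (`token_f1`, `expected_utilities`, `mbr_select`) are hypothetical, and the unigram-F1 utility is a toy stand-in for whatever utility function the experiments actually use. The samples double as both candidates and pseudo-references.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): unigram F1 between whitespace-tokenized strings."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def expected_utilities(
    samples: List[str], utility: Callable[[str, str], float] = token_f1
) -> List[float]:
    """Monte Carlo estimate of each candidate's expected utility,
    averaging over all samples used as pseudo-references."""
    n = len(samples)
    return [sum(utility(h, r) for r in samples) / n for h in samples]


def mbr_select(
    samples: List[str], utility: Callable[[str, str], float] = token_f1
) -> str:
    """Return the candidate with the highest estimated expected utility."""
    scores = expected_utilities(samples, utility)
    best = max(range(len(samples)), key=scores.__getitem__)
    return samples[best]


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "a cat sat on the mat",
        "the cat is on the mat",
        "dogs bark loudly",
    ]
    print(mbr_select(samples))  # prints "the cat sat on the mat"
```

The selected candidate is the one that agrees most with the rest of the sample set, which is why the outlier sentence is never chosen here.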

Sampling candidates

./experiments/sample.sh -d [DATASET] -s [NUMBER OF SAMPLES]

Computing Diverse MBR and K-Medoid MBR (KMBR)

./experiments/run_mbr.sh -d [DATASET] -s [NUMBER OF SAMPLES] -a [ALGORITHM]
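As an illustration of what selecting a diverse set of outputs can look like, the sketch below greedily picks k candidates, trading estimated expected utility against similarity to the candidates already chosen. This is an assumption made for illustration and not necessarily the objective or search procedure implemented by run_mbr.sh; `jaccard`, `greedy_diverse_select`, and the weight `lam` are hypothetical names.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of whitespace-token sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def greedy_diverse_select(
    samples: List[str],
    k: int,
    lam: float = 1.0,
    sim: Callable[[str, str], float] = jaccard,
) -> List[str]:
    """Greedily pick up to k samples, scoring each remaining candidate by
    (mean similarity to all samples) - lam * (mean similarity to the chosen set)."""
    n = len(samples)
    # Pairwise similarities, reused for both the quality and the diversity term.
    sims = [[sim(a, b) for b in samples] for a in samples]
    quality = [sum(row) / n for row in sims]
    chosen: List[int] = []

    while len(chosen) < min(k, n):
        def gain(i: int) -> float:
            # No penalty for the first pick; afterwards penalize redundancy.
            penalty = sum(sims[i][j] for j in chosen) / len(chosen) if chosen else 0.0
            return quality[i] - lam * penalty

        remaining = [i for i in range(n) if i not in chosen]
        chosen.append(max(remaining, key=gain))

    return [samples[i] for i in chosen]
```

With `lam=0.0` the sketch reduces to taking the top-k candidates by estimated utility, which tends to return near-duplicates; raising `lam` pushes later picks away from the earlier ones.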

Example on WMT'19 En-De

  1. Use sacrebleu to prepare the benchmark dataset.

mkdir -p ./dataset/wmt19-text
sacrebleu -t wmt19 -l en-de --echo src > ./dataset/wmt19-text/wmt19.en-de.en
sacrebleu -t wmt19 -l en-de --echo ref > ./dataset/wmt19-text/wmt19.en-de.de

  2. Sample candidates on WMT'19 En-De

./experiments/sample.sh -d wmt19.en-de

  3. Compute Diverse MBR and K-Medoid MBR on WMT'19 En-De

./experiments/run_mbr.sh -d wmt19.en-de -m wmt19-en-de -a diverse

Reference

Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, and Peinan Zhang. 2024. Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding. In Findings of the Association for Computational Linguistics ACL 2024, pages 8494–8525, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.

Bibtex:

@inproceedings{jinnai-etal-2024-generating,
    title = "Generating Diverse and High-Quality Texts by Minimum {B}ayes Risk Decoding",
    author = "Jinnai, Yuu and Honda, Ukyo and Morimura, Tetsuro and Zhang, Peinan",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.503",
    pages = "8494--8525",
}

Contact

For any questions, feel free to raise an issue or contact me at jinnai_yu@cyberagent.co.jp.

Acknowledgements

The MS COCO dataset is licensed under a Creative Commons BY 4.0 license.

Owner

  • Name: CyberAgent AI Lab
  • Login: CyberAgentAILab
  • Kind: organization
  • Location: Japan

Dependencies

requirements.txt pypi
  • absl-py *
  • accelerate *
  • bert_score ==0.3.13
  • bitsandbytes ==0.40.2
  • datasets *
  • einops *
  • evaluate *
  • google-cloud-storage *
  • nltk ==3.8.1
  • peft ==0.7.1
  • py7zr *
  • rouge-score ==0.1.2
  • sacremoses ==0.0.53
  • scikit-learn-extra ==0.3.0
  • sortedcontainers *
  • subword-nmt ==0.3.8
  • torchmetrics ==0.10.3
  • transformers *
  • unbabel-comet *