mbrs

A library for minimum Bayes risk (MBR) decoding

https://github.com/naist-nlp/mbrs

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.2%) to scientific vocabulary

Keywords

mbr-decoding natural-language-processing python pytorch

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: naist-nlp
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 283 KB

Statistics

Stars: 45
Watchers: 4
Forks: 7
Open Issues: 0
Releases: 7

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

README.md

mbrs is a library for minimum Bayes risk (MBR) decoding.

Installation

You can install mbrs from PyPI:

    pip install mbrs

For development, install it from source:

    git clone https://github.com/naist-nlp/mbrs.git
    cd mbrs/
    pip install ./

For uv users:

    git clone https://github.com/naist-nlp/mbrs.git
    cd mbrs/
    uv sync

Quick start

mbrs provides two interfaces: a command-line interface (CLI) and a Python API.

Command-line interface

The command-line interface runs MBR decoding from the command line. Before running MBR decoding, you can generate hypothesis sentences with mbrs-generate:

    mbrs-generate \
      sources.txt \
      --output hypotheses.txt \
      --lang_pair en-de \
      --model facebook/m2m100_418M \
      --num_candidates 1024 \
      --sampling eps --epsilon 0.02 \
      --batch_size 8 --sampling_size 8 --fp16 \
      --report_format rounded_outline

Beam search can also be used by replacing --sampling eps --epsilon 0.02 with --beam_size 10.
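To make the `--sampling eps --epsilon 0.02` flags more concrete, here is a minimal sketch of epsilon sampling as it is commonly described: tokens whose probability falls below the threshold are dropped, and a token is drawn from the renormalized remainder. The function name `epsilon_sample` and the fallback to the most likely token are assumptions of this sketch; it does not show how mbrs-generate implements sampling internally.

```python
import random


def epsilon_sample(probs, epsilon=0.02, rng=random):
    """Sample an index from `probs` after dropping entries below `epsilon`."""
    # Keep only the tokens whose probability is at least epsilon.
    kept = [(i, p) for i, p in enumerate(probs) if p >= epsilon]
    if not kept:
        # Fallback (an assumption of this sketch): return the most likely token.
        return max(range(len(probs)), key=lambda i: probs[i])
    indices = [i for i, _ in kept]
    weights = [p for _, p in kept]
    # random.choices treats the weights as relative, so no explicit
    # renormalization is needed.
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution: the last two entries are below epsilon.
probs = [0.60, 0.30, 0.085, 0.01, 0.005]
rng = random.Random(0)
samples = {epsilon_sample(probs, 0.02, rng) for _ in range(1000)}
print(sorted(samples))  # only indices of tokens with p >= 0.02 can appear
```

Truncating the low-probability tail in this way keeps the candidate set varied while avoiding very unlikely tokens.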

Next, MBR decoding and other decoding methods can be executed with mbrs-decode. This example regards the hypothesis set as the pseudo-reference set.

    mbrs-decode \
      hypotheses.txt \
      --num_candidates 1024 \
      --nbest 1 \
      --source sources.txt \
      --references hypotheses.txt \
      --output translations.txt \
      --report report.txt --report_format rounded_outline \
      --decoder mbr \
      --metric comet \
      --metric.model Unbabel/wmt22-comet-da \
      --metric.batch_size 64 --metric.fp16 true
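What "regarding the hypothesis set as the pseudo-reference set" means can be illustrated without any model. The following is a minimal sketch, not the mbrs implementation: every hypothesis is scored against every pseudo-reference, and the hypothesis with the highest mean utility is selected. The toy utility `unigram_f1` and the function `mbr_decode` are hypothetical names introduced here; a real setup would use a metric such as COMET instead.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: F1 over lowercased whitespace tokens (stand-in for a real metric)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(hypotheses, pseudo_references, utility=unigram_f1):
    """Return (index, expected utility) of the hypothesis with the highest mean utility."""
    expected = [
        sum(utility(h, r) for r in pseudo_references) / len(pseudo_references)
        for h in hypotheses
    ]
    best = max(range(len(hypotheses)), key=expected.__getitem__)
    return best, expected[best]


hypotheses = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]
# The hypotheses double as pseudo-references.
best, score = mbr_decode(hypotheses, hypotheses)
print(hypotheses[best], round(score, 3))
```

Note that this requires one utility call per (hypothesis, pseudo-reference) pair, which is why the cost grows quadratically with the number of candidates.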

You can also pass the arguments in a YAML configuration file via the --config_path option. See the docs for details.

Finally, you can evaluate the score with mbrs-score:

    mbrs-score \
      hypotheses.txt \
      --sources sources.txt \
      --references hypotheses.txt \
      --format json \
      --metric bleurt \
      --metric.batch_size 64 --metric.fp16 true

Python API

The following is an example of COMET-MBR via the Python API.

```python
from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR

SOURCE = "ありがとう"
HYPOTHESES = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]

# Setup COMET.
metric_cfg = MetricCOMET.Config(
    model="Unbabel/wmt22-comet-da",
    batch_size=64,
    fp16=True,
)
metric = MetricCOMET(metric_cfg)

# Setup MBR decoding.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)

# Decode by COMET-MBR.
# This example regards the hypotheses themselves as the pseudo-references.
# Args: (hypotheses, pseudo-references, source)
output = decoder.decode(HYPOTHESES, HYPOTHESES, source=SOURCE, nbest=1)

print(f"Selected index: {output.idx}")
print(f"Output sentence: {output.sentence}")
print(f"Expected score: {output.score}")
```

List of implemented methods

Metrics

Currently, the following metrics are supported:

BLEU (Papineni et al., 2002): bleu
TER (Snover et al., 2006): ter
chrF (Popović, 2015): chrf
COMET (Rei et al., 2020): comet
COMETkiwi (Rei et al., 2022): cometkiwi
XCOMET (Guerreiro et al., 2023): xcomet
XCOMET-lite (Larionov et al., 2024): xcomet with --metric.model="myyycroft/XCOMET-lite"
BLEURT (Sellam et al., 2020): bleurt (thanks to @lucadiliello)
MetricX (Juraska et al., 2023; Juraska et al., 2024): metricx
BERTScore (Zhang et al., 2020): bertscore

Decoders

The following decoding methods are implemented:

N-best reranking: rerank
MBR decoding: mbr
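The difference between the two decoders can be sketched as follows. In N-best reranking, as the term is commonly used, each hypothesis receives its own score (for example from a reference-free quality-estimation metric) and the top-scoring ones are returned; no pseudo-references are involved. The function `rerank` and the example scores below are hypothetical and only illustrate the idea, not the mbrs `rerank` decoder itself.

```python
def rerank(hypotheses, scores, nbest=1):
    """Return the `nbest` (hypothesis, score) pairs with the highest scores."""
    if len(hypotheses) != len(scores):
        raise ValueError("hypotheses and scores must have the same length")
    # Sort indices by score, best first; ties keep their original order.
    order = sorted(range(len(hypotheses)), key=lambda i: scores[i], reverse=True)
    return [(hypotheses[i], scores[i]) for i in order[:nbest]]


hypotheses = ["Thanks", "Thank you", "Thank you so much"]
# Made-up per-hypothesis quality scores, for illustration only.
scores = [0.71, 0.84, 0.79]
top2 = rerank(hypotheses, scores, nbest=2)
print(top2)
```

MBR decoding, by contrast, derives each hypothesis's score from its expected utility against a set of (pseudo-)references.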

Specifically, the following methods of MBR decoding are included:

Expectation estimation:
- Monte Carlo estimation (Eikema and Aziz, 2020; Eikema and Aziz, 2022)
- Model-based estimation (Jinnai et al., 2024): --reference_lprobs option
Efficient methods:
- Confidence-based pruning (Cheng and Vlachos, 2023): pruning_mbr
- Reference aggregation (DeNero et al., 2009; Vamvas and Sennrich, 2024): aggregate_mbr
  - N-gram aggregation on BLEU (DeNero et al., 2009)
  - N-gram aggregation on chrF (Vamvas and Sennrich, 2024)
  - Embedding aggregation on COMET (Vamvas and Sennrich, 2024; Deguchi et al., 2024)
- Centroid-based MBR (Deguchi et al., 2024): centroid_mbr
- Probabilistic MBR (Trabelsi et al., 2024): probabilistic_mbr
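As an illustration of why reference aggregation is efficient, the sketch below averages the unigram counts of all references into a single aggregate and scores each hypothesis once against it, so the number of utility computations grows linearly rather than quadratically. This is a simplified toy version of the idea with a unigram F1 utility; the helper names (`ngram_counts`, `aggregate`, `aggregate_mbr`) are hypothetical, and the sketch is not the `aggregate_mbr` implementation in mbrs or the exact formulation in the cited papers.

```python
from collections import Counter


def ngram_counts(sentence, n=1):
    """Count the n-grams of a lowercased, whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def aggregate(references, n=1):
    """Average the n-gram counts of all references into one aggregate reference."""
    total = Counter()
    for ref in references:
        total.update(ngram_counts(ref, n))
    return {gram: count / len(references) for gram, count in total.items()}


def f1_against_aggregate(hypothesis, agg, n=1):
    """F1 between hypothesis n-gram counts and the (fractional) aggregate counts."""
    hyp = ngram_counts(hypothesis, n)
    overlap = sum(min(c, agg.get(g, 0.0)) for g, c in hyp.items())
    hyp_total, ref_total = sum(hyp.values()), sum(agg.values())
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / hyp_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def aggregate_mbr(hypotheses, references):
    agg = aggregate(references)  # one pass over the references
    scores = [f1_against_aggregate(h, agg) for h in hypotheses]  # one score per hypothesis
    best = max(range(len(hypotheses)), key=scores.__getitem__)
    return best, scores


hypotheses = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]
best, scores = aggregate_mbr(hypotheses, hypotheses)
print(hypotheses[best])
```

The aggregate score is an approximation of the pairwise expected utility, so the selected hypothesis can differ from the one exact MBR would choose.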

Selectors

The final output list is selected according to these selectors:

N-best selection: nbest
Diverse selection (Jinnai et al., 2024): diverse
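The contrast between the two selectors can be sketched as follows. N-best selection simply takes the k highest-scoring hypotheses, whereas a diversity-aware selector also penalizes similarity to what has already been chosen. The greedy penalty used in `select_diverse` is an illustrative stand-in in the spirit of maximal marginal relevance, not the method of Jinnai et al. (2024) and not the mbrs `diverse` selector; all function names, the Jaccard similarity, and the example scores are assumptions of this sketch.

```python
def select_nbest(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def jaccard(a, b):
    """Token-set Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_diverse(hypotheses, scores, k, similarity=jaccard, penalty=0.5):
    """Greedy selection: score minus a penalty for similarity to earlier picks."""
    selected = []
    remaining = list(range(len(hypotheses)))
    while remaining and len(selected) < k:
        def adjusted(i):
            if not selected:
                return scores[i]
            max_sim = max(similarity(hypotheses[i], hypotheses[j]) for j in selected)
            return scores[i] - penalty * max_sim
        best = max(remaining, key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return selected


hypotheses = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]
# Made-up expected scores, for illustration only.
scores = [0.20, 0.63, 0.53, 0.47, 0.63]
nbest_idx = select_nbest(scores, 2)
diverse_idx = select_diverse(hypotheses, scores, 2)
print([hypotheses[i] for i in nbest_idx])
print([hypotheses[i] for i in diverse_idx])
```

In this toy example, plain N-best selection returns two near-identical outputs, while the penalized variant trades some score for a different second output.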

Related projects

mbr
- Highly integrated with Hugging Face transformers by customizing the generate() method of the model implementation.
- If you are looking for an MBR decoding library that is fully integrated into transformers, this might be a good choice.
- Our mbrs works standalone; thus, it can be used not only with transformers but also with fairseq or with LLM outputs obtained via an API.

Citation

If you use this software, please cite:

    @inproceedings{deguchi-etal-2024-mbrs,
        title = "mbrs: A Library for Minimum {B}ayes Risk Decoding",
        author = "Deguchi, Hiroyuki and
          Sakai, Yusuke and
          Kamigaito, Hidetaka and
          Watanabe, Taro",
        editor = "Hernandez Farias, Delia Irazu and
          Hope, Tom and
          Li, Manling",
        booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
        month = nov,
        year = "2024",
        address = "Miami, Florida, USA",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.emnlp-demo.37",
        pages = "351--362",
    }

License

This library is mainly developed by Hiroyuki Deguchi and published under the MIT License.

Owner

Name: naist-nlp
Login: naist-nlp
Kind: organization

Repositories: 1
Profile: https://github.com/naist-nlp

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Deguchi
    given-names: Hiroyuki
    orcid: https://orcid.org/0000-0003-2127-6607
  - family-names: Sakai
    given-names: Yusuke
  - family-names: Kamigaito
    given-names: Hidetaka
  - family-names: Watanabe
    given-names: Taro
title: "mbrs: A Library for Minimum Bayes Risk Decoding"
date-released: 2024-06-16
preferred-citation:
  type: misc
  authors:
  - family-names: Deguchi
    given-names: Hiroyuki
    orcid: https://orcid.org/0000-0003-2127-6607
  - family-names: Sakai
    given-names: Yusuke
  - family-names: Kamigaito
    given-names: Hidetaka
  - family-names: Watanabe
    given-names: Taro
  title: "mbrs: A Library for Minimum Bayes Risk Decoding"
  eprint: 2408.04167
  archivePrefix: arXiv
  primaryClass: cs.CL
  url: https://arxiv.org/abs/2408.04167
  month: 8
  year: 2024

GitHub Events

Total

Release event: 4
Watch event: 20
Delete event: 10
Push event: 29
Pull request event: 25
Fork event: 7
Create event: 17

Last Year

Release event: 4
Watch event: 20
Delete event: 10
Push event: 29
Pull request event: 25
Fork event: 7
Create event: 17

Packages

Total packages: 1
Total downloads:
- pypi 154 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 8
Total maintainers: 1

pypi.org: mbrs

A library for minimum Bayes risk (MBR) decoding.

Documentation: https://mbrs.readthedocs.io/
License: mit
Latest release: 0.1.7 (published about 1 year ago)

Versions: 8
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 154 Last month

Rankings

Dependent packages count: 11.0%

Average: 36.4%

Dependent repos count: 61.7%

Maintainers (1)

deguchi

Dependencies

pyproject.toml pypi

mypy ^1.8.0 develop
ptpython ^3.0.25 develop
pytest ^7.4.4 develop
pytest-cov ^4.1.0 develop
ruff ^0.4.4 develop
numpy ^1.26.3
python ^3.10
sacrebleu ^2.4.0
simple-parsing ^0.1.5
tabulate ^0.9.0
torch ^2.1.2
tqdm ^4.66.1
unbabel-comet ^2.2.1

.github/workflows/ci.yaml actions

actions/checkout v4 composite
actions/setup-python v3 composite

.github/workflows/release_pypi.yaml actions

JRubics/poetry-publish v2.0 composite
actions/checkout v4 composite