mbrs

A library for minimum Bayes risk (MBR) decoding

https://github.com/naist-nlp/mbrs

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.2%) to scientific vocabulary

Keywords

mbr-decoding natural-language-processing python pytorch
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: naist-nlp
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 283 KB
Statistics
  • Stars: 45
  • Watchers: 4
  • Forks: 7
  • Open Issues: 0
  • Releases: 7
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

mbrs is a library for minimum Bayes risk (MBR) decoding.

Installation

You can install mbrs from PyPI:

```bash
pip install mbrs
```

Developers can install it from source:

```bash
git clone https://github.com/naist-nlp/mbrs.git
cd mbrs/
pip install ./
```

For uv users:

```bash
git clone https://github.com/naist-nlp/mbrs.git
cd mbrs/
uv sync
```

Quick start

mbrs provides two interfaces: a command-line interface (CLI) and a Python API.

Command-line interface

The command-line interface runs MBR decoding from the shell. Before running MBR decoding, you can generate hypothesis sentences with mbrs-generate:

```bash
mbrs-generate \
  sources.txt \
  --output hypotheses.txt \
  --lang_pair en-de \
  --model facebook/m2m100_418M \
  --num_candidates 1024 \
  --sampling eps --epsilon 0.02 \
  --batch_size 8 --sampling_size 8 --fp16 \
  --report_format rounded_outline
```

Beam search can also be used by replacing --sampling eps --epsilon 0.02 with --beam_size 10.
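The `--sampling eps --epsilon 0.02` options select epsilon sampling. As a rough illustration of the general technique (not the code mbrs or transformers actually runs), epsilon sampling drops every token whose probability falls below the threshold and samples from what remains. The function name `epsilon_sample`, the fallback rule, and the toy distribution are assumptions made for this sketch.

```python
import random


def epsilon_sample(probs, epsilon=0.02, rng=None):
    """Sample a token index after dropping tokens with probability < epsilon."""
    rng = rng or random.Random()
    # Keep only the tokens whose probability reaches the threshold.
    kept = [(i, p) for i, p in enumerate(probs) if p >= epsilon]
    if not kept:
        # Fallback (an assumption of this sketch): keep the most likely token.
        best = max(range(len(probs)), key=probs.__getitem__)
        kept = [(best, probs[best])]
    indices, weights = zip(*kept)
    # random.choices treats the weights as relative, so the surviving
    # probabilities are renormalized implicitly.
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution: the last two tokens fall below epsilon=0.02.
probs = [0.60, 0.25, 0.13, 0.015, 0.005]
samples = {
    epsilon_sample(probs, epsilon=0.02, rng=random.Random(seed))
    for seed in range(200)
}
print(sorted(samples))
```

With this toy distribution, indices 3 and 4 can never be drawn, which is how epsilon sampling keeps the candidate pool diverse without admitting very unlikely tokens.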

Next, MBR decoding and other decoding methods can be executed with mbrs-decode. This example regards the hypothesis set as the pseudo-reference set.

```bash
mbrs-decode \
  hypotheses.txt \
  --num_candidates 1024 \
  --nbest 1 \
  --source sources.txt \
  --references hypotheses.txt \
  --output translations.txt \
  --report report.txt --report_format rounded_outline \
  --decoder mbr \
  --metric comet \
  --metric.model Unbabel/wmt22-comet-da \
  --metric.batch_size 64 --metric.fp16 true
```
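Because `--num_candidates` is passed alongside a single hypotheses file, the sketch below assumes that `hypotheses.txt` stores the candidates for each source sentence on consecutive lines, with exactly `num_candidates` lines per source. That layout is an assumption, so check the reference docs before relying on it. The helper `group_candidates` is hypothetical and only shows how such a flat file would map back to per-source candidate lists.

```python
def group_candidates(lines, num_candidates):
    """Split a flat list of hypotheses into one list per source sentence.

    Assumes (not confirmed by the README) that candidates for the same
    source are stored contiguously, num_candidates per source.
    """
    if num_candidates <= 0:
        raise ValueError("num_candidates must be positive")
    if len(lines) % num_candidates != 0:
        raise ValueError("number of lines is not a multiple of num_candidates")
    return [
        lines[start:start + num_candidates]
        for start in range(0, len(lines), num_candidates)
    ]


# Two source sentences with two candidates each.
flat = ["Thanks", "Thank you", "Hello", "Hi there"]
groups = group_candidates(flat, num_candidates=2)
print(groups)
```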

You can also pass the arguments in a YAML configuration file via the --config_path option. See the docs for details.

Finally, you can evaluate the score with mbrs-score:

```bash
mbrs-score \
  hypotheses.txt \
  --sources sources.txt \
  --references hypotheses.txt \
  --format json \
  --metric bleurt \
  --metric.batch_size 64 --metric.fp16 true
```

Python API

This is an example of COMET-MBR via the Python API.

```python
from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR

SOURCE = "ありがとう"
HYPOTHESES = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]

# Setup COMET.
metric_cfg = MetricCOMET.Config(
    model="Unbabel/wmt22-comet-da",
    batch_size=64,
    fp16=True,
)
metric = MetricCOMET(metric_cfg)

# Setup MBR decoding.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)

# Decode by COMET-MBR.
# This example regards the hypotheses themselves as the pseudo-references.
# Args: (hypotheses, pseudo-references, source)
output = decoder.decode(HYPOTHESES, HYPOTHESES, source=SOURCE, nbest=1)

print(f"Selected index: {output.idx}")
print(f"Output sentence: {output.sentence}")
print(f"Expected score: {output.score}")
```
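To make the selection rule concrete, here is a minimal, dependency-free sketch of what MBR decoding computes: each hypothesis is scored by its average utility against the pseudo-references, and the hypothesis with the highest expected utility is returned. The toy `token_f1` utility stands in for COMET, and both function names are hypothetical. This illustrates the general technique, not the mbrs implementation.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Toy utility: F1 over lowercased whitespace tokens (stand-in for COMET)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(hypotheses, pseudo_references, utility=token_f1):
    """Return (index, sentence, expected utility) of the best hypothesis."""
    # Expected utility: mean utility of each hypothesis over all pseudo-references.
    expected = [
        sum(utility(h, r) for r in pseudo_references) / len(pseudo_references)
        for h in hypotheses
    ]
    # max() returns the first index among ties.
    best = max(range(len(hypotheses)), key=expected.__getitem__)
    return best, hypotheses[best], expected[best]


HYPOTHESES = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]
idx, sentence, score = mbr_decode(HYPOTHESES, HYPOTHESES)
print(f"Selected index: {idx}")
print(f"Output sentence: {sentence}")
print(f"Expected score: {score:.3f}")
```

With this toy utility the sketch selects "Thank you", the hypothesis that agrees most with the rest of the set. A neural metric such as COMET may rank the same hypotheses differently.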

List of implemented methods

Metrics

Currently, the following metrics are supported:

Decoders

The following decoding methods are implemented:

  • N-best reranking: rerank
  • MBR decoding: mbr
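In general, N-best reranking differs from MBR decoding in that each hypothesis is scored on its own, for example by a reference-free quality estimator, rather than against a set of pseudo-references. The following sketch shows that idea with made-up scores. The `rerank` function and the score values are hypothetical and do not reflect the mbrs API.

```python
def rerank(hypotheses, quality_scores, nbest=1):
    """Return the nbest (index, sentence, score) triples, best first."""
    if len(hypotheses) != len(quality_scores):
        raise ValueError("each hypothesis needs exactly one score")
    # Sort indices by score, highest first; sorted() is stable, so ties
    # keep their original order.
    order = sorted(
        range(len(hypotheses)),
        key=lambda i: quality_scores[i],
        reverse=True,
    )
    return [(i, hypotheses[i], quality_scores[i]) for i in order[:nbest]]


hypotheses = ["Thanks", "Thank you", "Thank you so much"]
scores = [0.71, 0.88, 0.80]  # hypothetical per-hypothesis quality scores
top2 = rerank(hypotheses, scores, nbest=2)
print(top2)
```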

Specifically, the following methods of MBR decoding are included:

Selectors

The final output list is selected according to these selectors:

Related projects

  • mbr
    • Tightly integrated with Hugging Face transformers by customizing the generate() method of the model implementation.
    • If you are looking for an MBR decoding library that is fully integrated into transformers, this might be a good choice.
    • mbrs works standalone, so it can be used not only with transformers but also with fairseq or with LLM outputs obtained via an API.

Citation

If you use this software, please cite:

```bibtex
@inproceedings{deguchi-etal-2024-mbrs,
    title = "mbrs: A Library for Minimum {B}ayes Risk Decoding",
    author = "Deguchi, Hiroyuki and
      Sakai, Yusuke and
      Kamigaito, Hidetaka and
      Watanabe, Taro",
    editor = "Hernandez Farias, Delia Irazu and
      Hope, Tom and
      Li, Manling",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-demo.37",
    pages = "351--362",
}
```

License

This library is mainly developed by Hiroyuki Deguchi and published under the MIT license.

Owner

  • Name: naist-nlp
  • Login: naist-nlp
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Deguchi
    given-names: Hiroyuki
    orcid: https://orcid.org/0000-0003-2127-6607
  - family-names: Sakai
    given-names: Yusuke
  - family-names: Kamigaito
    given-names: Hidetaka
  - family-names: Watanabe
    given-names: Taro
title: "mbrs: A Library for Minimum Bayes Risk Decoding"
date-released: 2024-06-16
preferred-citation:
  type: misc
  authors:
  - family-names: Deguchi
    given-names: Hiroyuki
    orcid: https://orcid.org/0000-0003-2127-6607
  - family-names: Sakai
    given-names: Yusuke
  - family-names: Kamigaito
    given-names: Hidetaka
  - family-names: Watanabe
    given-names: Taro
  title: "mbrs: A Library for Minimum Bayes Risk Decoding"
  eprint: 2408.04167
  archivePrefix: arXiv
  primaryClass: cs.CL
  url: https://arxiv.org/abs/2408.04167
  month: 8
  year: 2024

GitHub Events

Total
  • Release event: 4
  • Watch event: 20
  • Delete event: 10
  • Push event: 29
  • Pull request event: 25
  • Fork event: 7
  • Create event: 17
Last Year
  • Release event: 4
  • Watch event: 20
  • Delete event: 10
  • Push event: 29
  • Pull request event: 25
  • Fork event: 7
  • Create event: 17

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 154 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 8
  • Total maintainers: 1
pypi.org: mbrs

A library for minimum Bayes risk (MBR) decoding.

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 154 Last month
Rankings
Dependent packages count: 11.0%
Average: 36.4%
Dependent repos count: 61.7%
Maintainers (1)
Last synced: 7 months ago

Dependencies

pyproject.toml pypi
  • mypy ^1.8.0 develop
  • ptpython ^3.0.25 develop
  • pytest ^7.4.4 develop
  • pytest-cov ^4.1.0 develop
  • ruff ^0.4.4 develop
  • numpy ^1.26.3
  • python ^3.10
  • sacrebleu ^2.4.0
  • simple-parsing ^0.1.5
  • tabulate ^0.9.0
  • torch ^2.1.2
  • tqdm ^4.66.1
  • unbabel-comet ^2.2.1
.github/workflows/ci.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v3 composite
.github/workflows/release_pypi.yaml actions
  • JRubics/poetry-publish v2.0 composite
  • actions/checkout v4 composite