Jury

Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)

https://github.com/obss/jury

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

datasets evaluate evaluation huggingface machine-learning metrics natural-language-processing nlp nlp-evaluation python pytorch transformers
Last synced: 4 months ago

Repository

Comprehensive NLP Evaluation System

Basic Info
  • Host: GitHub
  • Owner: obss
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 291 KB
Statistics
  • Stars: 188
  • Watchers: 3
  • Forks: 19
  • Open Issues: 5
  • Releases: 23
Topics
datasets evaluate evaluation huggingface machine-learning metrics natural-language-processing nlp nlp-evaluation python pytorch transformers
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

Jury

Badges: Python versions · downloads · PyPI version · Latest Release · Open in Colab · Build status · Dependencies · Code style: black · License: MIT · DOI

A comprehensive toolkit for evaluating NLP experiments, offering various automated metrics. Jury provides a smooth and easy-to-use interface. It uses a more advanced version of the evaluate design for the underlying metric computation, so that adding a custom metric is as easy as extending the proper class.

The main advantages that Jury offers are:

  • Easy to use for any NLP project.
  • Unified structure for computation input across all metrics.
  • Calculate many metrics at once.
  • Metric calculations can be handled concurrently to save processing time (see the sketch below).
  • Seamless support for evaluation with multiple predictions and/or multiple references.

To see more, check the official Jury blog post.
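As a quick illustration of the last two advantages, here is a minimal sketch of a multi-reference evaluation; the run_concurrent flag is an assumption about the installed version and can be omitted if your version does not accept it.

```python
from jury import Jury

# run_concurrent is assumed to be supported by the installed Jury version;
# drop the keyword if it is not accepted.
scorer = Jury(metrics=["bleu", "rouge"], run_concurrent=True)

predictions = [["the cat is on the mat"], ["Look! a wonderful day."]]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],  # multiple references per item
    ["Today is a wonderful day"],
]
scores = scorer(predictions=predictions, references=references)
print(scores)
```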

🔥 News

  • (2024.05.29) A Retraction Watch post regarding the retraction of a paper has been published. The plagiarised paper has been retracted.
  • (2023.10.03) The Jury paper is now out on arXiv. Please cite this paper if your work uses Jury and your publication material is submitted to a venue after this date.
  • (2023.07.30) Public notice: You can reach our official Public Notice document, which raises a claim about plagiarism of the work, Jury, presented in this codebase.

Available Metrics

The table below shows the current support status for available metrics.

| Metric            | Jury Support       | HF/evaluate Support |
|-------------------|--------------------|---------------------|
| Accuracy-Numeric  | :heavy_check_mark: | :white_check_mark:  |
| Accuracy-Text     | :heavy_check_mark: | :x:                 |
| Bartscore         | :heavy_check_mark: | :x:                 |
| Bertscore         | :heavy_check_mark: | :white_check_mark:  |
| Bleu              | :heavy_check_mark: | :white_check_mark:  |
| Bleurt            | :heavy_check_mark: | :white_check_mark:  |
| CER               | :heavy_check_mark: | :white_check_mark:  |
| CHRF              | :heavy_check_mark: | :white_check_mark:  |
| COMET             | :heavy_check_mark: | :white_check_mark:  |
| F1-Numeric        | :heavy_check_mark: | :white_check_mark:  |
| F1-Text           | :heavy_check_mark: | :x:                 |
| METEOR            | :heavy_check_mark: | :white_check_mark:  |
| Precision-Numeric | :heavy_check_mark: | :white_check_mark:  |
| Precision-Text    | :heavy_check_mark: | :x:                 |
| Prism             | :heavy_check_mark: | :x:                 |
| Recall-Numeric    | :heavy_check_mark: | :white_check_mark:  |
| Recall-Text       | :heavy_check_mark: | :x:                 |
| ROUGE             | :heavy_check_mark: | :white_check_mark:  |
| SacreBleu         | :heavy_check_mark: | :white_check_mark:  |
| Seqeval           | :heavy_check_mark: | :white_check_mark:  |
| Squad             | :heavy_check_mark: | :white_check_mark:  |
| TER               | :heavy_check_mark: | :white_check_mark:  |
| WER               | :heavy_check_mark: | :white_check_mark:  |
| Other metrics*    | :white_check_mark: | :white_check_mark:  |

* Placeholder for the rest of the metrics available in the evaluate package, apart from those present in the table.

Notes

  • The entry :heavy_check_mark: means that full Jury support is available, i.e. all combinations of input types (single prediction & single reference, single prediction & multiple references, multiple predictions & multiple references) are supported.

  • The entry :white_check_mark: means that the metric is supported by Jury through evaluate, so it can (and should) be used just like an evaluate metric as instructed in the evaluate implementation, although full Jury support for those metrics is not yet available.
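As an illustration of what full Jury support means in practice, the sketch below feeds a fully supported metric the three input combinations listed above (the resulting scores will naturally differ per input):

```python
from jury import Jury

scorer = Jury(metrics=["bleu"])

# single prediction & single reference
scorer(predictions=[["the cat is on the mat"]],
       references=[["the cat sat on the mat"]])

# single prediction & multiple references
scorer(predictions=[["the cat is on the mat"]],
       references=[["the cat sat on the mat", "a cat is sitting on the mat"]])

# multiple predictions & multiple references
scorer(predictions=[["the cat is on the mat", "a cat sits on the mat"]],
       references=[["the cat sat on the mat", "a cat is sitting on the mat"]])
```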

Request for a New Metric

To request a new metric, please open an issue providing the minimum required information. Also, PRs adding support for new metrics are welcome :).

Installation

Through pip,

pip install jury

or build from source,

git clone https://github.com/obss/jury.git
cd jury
python setup.py install

NOTE: Some metrics that depend on the sacrebleu package may malfunction on Windows machines, mainly because of the pywin32 package. For this reason, we pinned the pywin32 version in our setup config for Windows platforms. However, if pywin32 causes trouble in your environment, we strongly recommend installing it with the conda package manager: conda install pywin32.

Usage

API Usage

It takes only two lines of code to evaluate generated outputs.

```python
from jury import Jury

scorer = Jury()
predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"],
    ["Look! a wonderful day."]
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
```

Specify metrics you want to use on instantiation.

```python
scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)
```
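You can also pass pre-built metric objects instead of metric names, which lets you configure each metric individually. A minimal sketch, assuming that objects returned by jury.load_metric (shown further below) are accepted in the metrics list:

```python
import jury
from jury import Jury

# Assumption: Jury accepts metric objects created via jury.load_metric,
# each carrying its own resulting_name and compute_kwargs.
metrics = [
    jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
    jury.load_metric("meteor"),
]
scorer = Jury(metrics=metrics)
scores = scorer(predictions, references)
```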

Use of Metrics standalone

You can directly import metrics from jury.metrics as classes, and then instantiate and use as desired.

```python
from jury.metrics import Bleu

bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)
```

The additional parameters can either be specified on compute()

```python
from jury.metrics import Bleu

bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)
```

or alternatively on instantiation

```python
from jury.metrics import Bleu

bleu = Bleu.construct(compute_kwargs={"max_order": 1})
score = bleu.compute(predictions=predictions, references=references)
```

Note that you can seamlessly access both jury and evaluate metrics through jury.load_metric.

```python
import jury

bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})

# metrics not available in jury but in evaluate
wer = jury.load_metric("competition_math")  # It falls back to the evaluate package with a warning
```

CLI Usage

You can specify a predictions file and a references file path and get the resulting scores. Each line should be paired across both files. You can optionally provide a reduce function and an export path for the results to be written to.

jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max --export /path/to/export.txt
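For illustration, hypothetical file contents would pair line by line as follows (line i of predictions.txt is evaluated against line i of references.txt):

predictions.txt:

```text
the cat is on the mat
Look! a wonderful day.
```

references.txt:

```text
the cat is playing on the mat.
Today is a wonderful day
```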

You can also provide a predictions folder and a references folder to evaluate multiple experiments. In this setup, however, each prediction file and the reference file it is evaluated against must have the same file name; files sharing a common name are paired together.

jury eval --predictions /path/to/predictions_folder --references /path/to/references_folder --reduce_fn max --export /path/to/export.txt
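A hypothetical folder layout satisfying the same-name requirement might look like this:

```text
predictions_folder/
    experiment_1.txt
    experiment_2.txt
references_folder/
    experiment_1.txt
    experiment_2.txt
```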

If you want to specify metrics rather than use the defaults, specify them in a config file (JSON) under the metrics key.

```json
{
    "predictions": "/path/to/predictions.txt",
    "references": "/path/to/references.txt",
    "reduce_fn": "max",
    "metrics": [
        "bleu",
        "meteor"
    ]
}
```

Then, you can call jury eval with the config argument.

jury eval --config path/to/config.json

Custom Metrics

You can use custom metrics by inheriting from jury.metrics.Metric; you can see the metrics currently implemented in Jury under jury/metrics. Jury falls back to the evaluate implementation for metrics that are not yet supported by Jury; you can see the metrics available for evaluate under evaluate/metrics.

Jury itself uses evaluate.Metric as a base class to derive its own base class, jury.metrics.Metric. The interface is similar; however, Jury makes metrics take a unified input type by handling the inputs for each metric, and supports several input types:

  • single prediction & single reference
  • single prediction & multiple reference
  • multiple prediction & multiple reference

For a custom metric, both base classes can be used; however, we strongly recommend using jury.metrics.Metric as it has several advantages, such as supporting computations for the input types above and unifying the input type.

```python
from jury.metrics import MetricForTask


class CustomMetric(MetricForTask):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError
```

For more details, have a look at the base metric implementation, jury.metrics.Metric.
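As a hedged illustration of the skeleton above, the toy exact-match metric below fills in the three methods. MetricForTask stands in for whichever task-specific base class you actually subclass, and the returned dictionary is only a sketch, not the canonical Jury output schema.

```python
from jury.metrics import MetricForTask  # placeholder name for a task-specific base class


class ExactMatch(MetricForTask):
    """Toy metric: fraction of predictions that exactly match a reference."""

    def _compute_single_pred_single_ref(self, predictions, references, reduce_fn=None, **kwargs):
        matches = [int(pred == ref) for pred, ref in zip(predictions, references)]
        return {"exact_match": sum(matches) / len(matches)}

    def _compute_single_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # A prediction counts as correct if it matches any of its references.
        matches = [int(any(pred == ref for ref in refs)) for pred, refs in zip(predictions, references)]
        return {"exact_match": sum(matches) / len(matches)}

    def _compute_multi_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # reduce_fn (e.g. max) aggregates the per-prediction scores of each item.
        item_scores = []
        for preds, refs in zip(predictions, references):
            per_pred = [float(any(pred == ref for ref in refs)) for pred in preds]
            item_scores.append(reduce_fn(per_pred) if reduce_fn else max(per_pred))
        return {"exact_match": sum(item_scores) / len(item_scores)}
```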

Contributing

PRs are welcomed as always :)

Installation

git clone https://github.com/obss/jury.git
cd jury
pip install -e ".[dev]"

Also, you need to install the packages that are only available through a git source separately with the following command. For those curious about why: in short, PyPI does not allow indexing a package that directly depends on non-PyPI packages, for security reasons. The file requirements-dev.txt includes packages that are currently only available through a git source, or PyPI packages that have no recent release or are incompatible with Jury, so they are added as git sources or pinned to specific commits.

pip install -r requirements-dev.txt

Tests

To run the tests, simply run:

python tests/run_tests.py

Code Style

To check code style,

python tests/run_code_style.py check

To format the codebase,

python tests/run_code_style.py format

Citation

If you use this package in your work, please cite it as:

@misc{cavusoglu2023jury,
  title={Jury: A Comprehensive Evaluation Toolkit}, 
  author={Devrim Cavusoglu and Ulas Sert and Secil Sen and Sinan Altinuc},
  year={2023},
  eprint={2310.02040},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.48550/arXiv.2310.02040}
}

Community Interaction

We use the GitHub Issue Tracker to track issues in general. Issues can be bug reports, feature requests or implementation of a new metric type. Please refer to the related issue template for opening new issues.

|                                | Location                      |
|--------------------------------|-------------------------------|
| Bug Report                     | Bug Report Template           |
| New Metric Request             | Request Metric Implementation |
| All other issues and questions | General Issues                |

License

Licensed under the MIT License.

Owner

  • Name: Open Business Software Solutions
  • Login: obss
  • Kind: organization
  • Email: rcm@obss.tech
  • Location: Istanbul

Open Source for Open Business

JOSS Publication

Jury: A Comprehensive Evaluation Toolkit
Published
May 20, 2024
Volume 9, Issue 97, Page 6452
Authors
Devrim Cavusoglu
OBSS AI, Middle East Technical University
Secil Sen
OBSS AI, Bogazici University
Ulas Sert
OBSS AI
Sinan Altinuc
OBSS AI, Middle East Technical University
Editor
Chris Vernon ORCID
Tags
natural-language-generation evaluation metrics natural-language-processing

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 100
  • Total Committers: 9
  • Avg Commits per committer: 11.111
  • Development Distribution Score (DDS): 0.18
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Devrim 4****u 82
fcakyon 3****n 7
Ulaş "Sophylax" Sert S****x 5
cemilcengiz 3****z 1
Zafer Cavdar z****r@y****m 1
Nish n****a@g****m 1
Kenneth Enevoldsen k****n@g****m 1
Ikko Eltociear Ashimine e****r@g****m 1
devrim.cavusoglu d****u@o****r 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 45
  • Total pull requests: 66
  • Average time to close issues: 13 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 9
  • Total pull request authors: 7
  • Average comments per issue: 1.11
  • Average comments per pull request: 0.44
  • Merged pull requests: 62
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 3 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • devrimcavusoglu (35)
  • Axe-- (3)
  • Santhanreddy71 (1)
  • salvadora (1)
  • amit0623 (1)
  • AI-14 (1)
  • fcakyon (1)
  • Sophylax (1)
  • NISH1001 (1)
Pull Request Authors
  • devrimcavusoglu (56)
  • Sophylax (5)
  • KennethEnevoldsen (2)
  • eltociear (1)
  • zafercavdar (1)
  • fcakyon (1)
  • NISH1001 (1)
Top Labels
Issue Labels
enhancement (16) bug (9) prioritized (8) new metric (8) patch (2) documentation (2) help wanted (2) discussion (1)
Pull Request Labels
enhancement (6) bug (5) patch (4) new metric (2) documentation (1) do not merge (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,591 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 23
  • Total maintainers: 1
pypi.org: jury

Evaluation toolkit for neural language generation.

  • Versions: 23
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 1,591 Last month
Rankings
Dependent packages count: 3.2%
Stargazers count: 5.4%
Average: 7.9%
Forks count: 8.8%
Downloads: 10.2%
Dependent repos count: 11.8%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • click ==8.0.4
  • datasets >=2.0.0
  • fire >=0.4.0
  • nltk >=3.6.6,<3.7.1
  • numpy >=1.21.0
  • pandas >=1.1.5
  • rouge-score ==0.0.4
  • sklearn *
  • tqdm *
.github/workflows/ci.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/publish_pypi.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
pyproject.toml pypi
requirements-dev.txt pypi
setup.py pypi