nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

https://github.com/mantisai/nervaluate

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Keywords

evaluation-metrics machine-learning named-entity-recognition natural-language-processing sequence-models
Last synced: 6 months ago

Repository

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

Basic Info
  • Host: GitHub
  • Owner: MantisAI
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 397 KB
Statistics
  • Stars: 183
  • Watchers: 5
  • Forks: 23
  • Open Issues: 18
  • Releases: 0
Topics
evaluation-metrics machine-learning named-entity-recognition natural-language-processing sequence-models
Created about 7 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md


nervaluate

nervaluate is a module for evaluating Named Entity Recognition (NER) models as defined in the SemEval 2013 - 9.1 task.

The evaluation metrics output by nervaluate go beyond a simple token/tag-based schema and consider different scenarios based on whether all the tokens belonging to a named entity were classified correctly, and whether the correct entity type was assigned.

This full problem is described in detail in the original blog post by David Batista, and this package extends the code in the original repository which accompanied the blog post.

The code draws heavily on the papers:

Usage example

pip install nervaluate

One possible input format is lists of NER labels, where each list corresponds to a sentence and each label is a token label. Initialize the Evaluator class with the true labels and predicted labels, and specify the entity types we want to evaluate.

```python
from nervaluate.evaluator import Evaluator

true = [
    ['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG'],  # "The John Smith who works at Google Inc"
    ['O', 'B-LOC', 'B-PER', 'I-PER', 'O', 'O', 'B-DATE'],      # "In Paris Marie Curie lived in 1895"
]

pred = [
    ['O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG'],
    ['O', 'B-LOC', 'I-LOC', 'B-PER', 'O', 'O', 'B-DATE'],
]

evaluator = Evaluator(true, pred, tags=['PER', 'ORG', 'LOC', 'DATE'], loader="list")
```

Print the summary report for the evaluation, which will show the metrics for each entity type and evaluation scenario:

```python
print(evaluator.summary_report())
```

```
Scenario: all

              correct   incorrect     partial      missed    spurious   precision      recall    f1-score

ent_type            5           0           0           0           0        1.00        1.00        1.00
exact               2           3           0           0           0        0.40        0.40        0.40
partial             2           0           3           0           0        0.40        0.40        0.40
strict              2           3           0           0           0        0.40        0.40        0.40
```

or aggregated by entity type under a specific evaluation scenario:

```python
print(evaluator.summary_report(mode='entities'))
```

```
Scenario: strict

              correct   incorrect     partial      missed    spurious   precision      recall    f1-score

DATE                1           0           0           0           0        1.00        1.00        1.00
LOC                 0           1           0           0           0        0.00        0.00        0.00
ORG                 1           0           0           0           0        1.00        1.00        1.00
PER                 0           2           0           0           0        0.00        0.00        0.00
```

Evaluation Scenarios

Token level evaluation for NER is too simplistic

When running machine learning models for NER, it is common to report metrics at the individual token level. This may not be the best approach, as a named entity can be made up of multiple tokens, so a full-entity accuracy would be desirable.

When comparing the gold standard annotations with the output of a NER system, different scenarios might occur:

I. Surface string and entity type match

| Token | Gold  | Prediction |
|-------|-------|------------|
| in    | O     | O          |
| New   | B-LOC | B-LOC      |
| York  | I-LOC | I-LOC      |
| .     | O     | O          |

II. System hypothesized an incorrect entity

| Token    | Gold | Prediction |
|----------|------|------------|
| an       | O    | O          |
| Awful    | O    | B-ORG      |
| Headache | O    | I-ORG      |
| in       | O    | O          |

III. System misses an entity

| Token | Gold  | Prediction |
|-------|-------|------------|
| in    | O     | O          |
| Palo  | B-LOC | O          |
| Alto  | I-LOC | O          |
| ,     | O     | O          |

Based on these three scenarios we have a simple classification evaluation that can be measured in terms of true positives, false positives and false negatives, from which we can subsequently compute precision, recall and F1-score for each named-entity type.

However, this simple schema ignores the possibility of partial matches, or scenarios where the NER system gets the named-entity surface string correct but the type wrong. We might also want to evaluate these scenarios at the full-entity level.

For example:

IV. System identifies the surface string but assigns the wrong entity type

| Token | Gold  | Prediction |
|-------|-------|------------|
| I     | O     | O          |
| live  | O     | O          |
| in    | O     | O          |
| Palo  | B-LOC | B-ORG      |
| Alto  | I-LOC | I-ORG      |
| ,     | O     | O          |

V. System gets the boundaries of the surface string wrong

| Token   | Gold  | Prediction |
|---------|-------|------------|
| Unless  | O     | B-PER      |
| Karl    | B-PER | I-PER      |
| Smith   | I-PER | I-PER      |
| resigns | O     | O          |

VI. System gets the boundaries and entity type wrong

| Token   | Gold  | Prediction |
|---------|-------|------------|
| Unless  | O     | B-ORG      |
| Karl    | B-PER | I-ORG      |
| Smith   | I-PER | I-ORG      |
| resigns | O     | O          |

Defining evaluation metrics

How can we incorporate these scenarios into evaluation metrics? See the original blog post for a full explanation; a summary is included here.

We can define the following five error types to cover these different categories of errors:

| Error type      | Explanation                                                               |
|-----------------|---------------------------------------------------------------------------|
| Correct (COR)   | both are the same                                                         |
| Incorrect (INC) | the output of a system and the golden annotation don't match             |
| Partial (PAR)   | system and the golden annotation are somewhat "similar" but not the same |
| Missing (MIS)   | a golden annotation is not captured by a system                          |
| Spurious (SPU)  | system produces a response which doesn't exist in the golden annotation  |

These five error types can be counted under four different evaluation schemas:

| Evaluation schema | Explanation                                                                        |
|-------------------|------------------------------------------------------------------------------------|
| Strict            | exact boundary surface string match and entity type                               |
| Exact             | exact boundary match over the surface string, regardless of the type              |
| Partial           | partial boundary match over the surface string, regardless of the type            |
| Type              | some overlap between the system tagged entity and the gold annotation is required |

These five error types and four evaluation schemas interact in the following ways:

| Scenario | Gold entity | Gold string    | Pred entity | Pred string         | Type | Partial | Exact | Strict |
|----------|-------------|----------------|-------------|---------------------|------|---------|-------|--------|
| III      | BRAND       | tikosyn        |             |                     | MIS  | MIS     | MIS   | MIS    |
| II       |             |                | BRAND       | healthy             | SPU  | SPU     | SPU   | SPU    |
| V        | DRUG        | warfarin       | DRUG        | of warfarin         | COR  | PAR     | INC   | INC    |
| IV       | DRUG        | propranolol    | BRAND       | propranolol         | INC  | COR     | COR   | INC    |
| I        | DRUG        | phenytoin      | DRUG        | phenytoin           | COR  | COR     | COR   | COR    |
| VI       | GROUP       | contraceptives | DRUG        | oral contraceptives | INC  | PAR     | INC   | INC    |

Then precision, recall and f1-score are calculated for each evaluation schema. In order to do this, two more quantities need to be calculated:

```
POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT)   = COR + INC + PAR + SPU = TP + FP
```

Then we can compute precision, recall and f1-score. Roughly speaking, precision is the percentage of named entities found by the NER system that are correct, and recall is the percentage of named entities in the gold annotations that are retrieved by the NER system.

These are computed in two different ways, depending on whether we want an exact match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

Exact Match (i.e., strict and exact):

```
Precision = COR / ACT = TP / (TP + FP)
Recall    = COR / POS = TP / (TP + FN)
```

Partial Match (i.e., partial and type):

```
Precision = (COR + 0.5 × PAR) / ACT
Recall    = (COR + 0.5 × PAR) / POS
```

Putting it all together:

| Measure   | Type | Partial | Exact | Strict |
|-----------|------|---------|-------|--------|
| Correct   | 3    | 3       | 3     | 2      |
| Incorrect | 2    | 0       | 2     | 3      |
| Partial   | 0    | 2       | 0     | 0      |
| Missed    | 1    | 1       | 1     | 1      |
| Spurious  | 1    | 1       | 1     | 1      |
| Precision | 0.5  | 0.66    | 0.5   | 0.33   |
| Recall    | 0.5  | 0.66    | 0.5   | 0.33   |
| F1        | 0.5  | 0.66    | 0.5   | 0.33   |
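As a sanity check, the precision and recall rows above can be reproduced from the error counts with a few lines of Python. This is an illustrative sketch rather than part of the nervaluate API; the counts are taken directly from the table, with the 0.5 partial credit applied only for the partial and type schemas as defined earlier.

```python
# Illustrative sketch (not the nervaluate API): recompute precision, recall
# and F1 from the error counts COR, INC, PAR, MIS, SPU shown in the table.
def prf(cor, inc, par, mis, spu, partial_credit=False):
    possible = cor + inc + par + mis          # POS = TP + FN
    actual = cor + inc + par + spu            # ACT = TP + FP
    tp = cor + (0.5 * par if partial_credit else 0.0)
    precision = tp / actual if actual else 0.0
    recall = tp / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 'type' column:    COR=3, INC=2, PAR=0, MIS=1, SPU=1 -> (0.5, 0.5, 0.5)
print(prf(3, 2, 0, 1, 1, partial_credit=True))
# 'partial' column: COR=3, INC=0, PAR=2, MIS=1, SPU=1 -> (0.66..., 0.66..., 0.66...)
print(prf(3, 0, 2, 1, 1, partial_credit=True))
# 'exact' column:   COR=3, INC=2, PAR=0, MIS=1, SPU=1 -> (0.5, 0.5, 0.5)
print(prf(3, 2, 0, 1, 1))
# 'strict' column:  COR=2, INC=3, PAR=0, MIS=1, SPU=1 -> (0.33..., 0.33..., 0.33...)
print(prf(2, 3, 0, 1, 1))
```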

Notes:

In scenarios IV and VI the entity types of the true and predicted entities do not match. In both cases we only score against the true entity, not the predicted one. You could argue that the predicted entity should also be scored as spurious, but according to the definition of spurious:

  • Spurious (SPU) : system produces a response which does not exist in the golden annotation;

In this case there exists an annotation, but with a different entity type, so we assume it's only incorrect.

Contributing to the nervaluate package

Extending the package to accept more formats

The Evaluator accepts the following formats:

  • Nested lists containing NER labels
  • CoNLL style tab delimited strings
  • prodi.gy style lists of spans
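For illustration only, the span-style input for the first sentence in the usage example might look like the sketch below. Token-index offsets and the label/start/end keys described below are assumptions; check nervaluate/loaders.py for the exact schema each loader expects.

```python
# Hypothetical span-style input for "The John Smith who works at Google Inc".
# Offsets are assumed to be token indices; verify against nervaluate/loaders.py.
true_spans = [
    [
        {"label": "PER", "start": 1, "end": 2},  # "John Smith"
        {"label": "ORG", "start": 6, "end": 7},  # "Google Inc"
    ],
]
```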

Additional formats can easily be added by creating a new loader class in nervaluate/loaders.py. The loader class should inherit from the DataLoader base class and implement the load method.

The load method should return a list of entity lists, where each entity is represented as a dictionary with label, start, and end keys.

The new loader can then be added to the _setup_loaders method in the Evaluator class, and can be selected with the loader argument when instantiating the Evaluator class.
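As a rough sketch of what such an extension might look like (the import path and the load signature are assumptions; the real base class lives in nervaluate/loaders.py and should be checked there):

```python
# Minimal sketch of a custom loader, not the package's actual code.
# Assumes DataLoader is importable from nervaluate.loaders and that load()
# receives the raw documents; see nervaluate/loaders.py for the real signature.
from nervaluate.loaders import DataLoader


class TupleLoader(DataLoader):
    """Hypothetical loader for documents given as lists of (label, start, end) tuples."""

    def load(self, data):
        # Return one list of entity dicts per document, each with the
        # 'label', 'start' and 'end' keys the Evaluator expects.
        return [
            [{"label": label, "start": start, "end": end} for (label, start, end) in doc]
            for doc in data
        ]
```

Once registered in _setup_loaders, such a loader could then be selected with, for example, loader="tuple" (the name here is purely illustrative).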

Here is a list of formats we intend to include.

General Contributing

Improvements, adding new features and bug fixes are welcome. If you wish to participate in the development of nervaluate please read the guidelines in the CONTRIBUTING.md file.


Give a ⭐️ if this project helped you!

Owner

  • Name: Mantis
  • Login: MantisAI
  • Kind: organization
  • Email: hi@mantisnlp.com
  • Location: Cyprus

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "nervaluate"
date-released: 2025-06-08
url: "https://github.com/mantisnlp/nervaluate"
version: 1.0.0
authors:
- family-names: "Batista"
  given-names: "David"
  orcid: "https://orcid.org/0000-0002-9324-5773"
- family-names: "Upson"
  given-names: "Matthew Antony"
  orcid: "https://orcid.org/0000-0002-1040-8048"



GitHub Events

Total
  • Issues event: 4
  • Watch event: 26
  • Delete event: 14
  • Issue comment event: 17
  • Push event: 77
  • Pull request review comment event: 2
  • Pull request review event: 5
  • Pull request event: 31
  • Fork event: 5
  • Create event: 12
Last Year
  • Issues event: 4
  • Watch event: 26
  • Delete event: 14
  • Issue comment event: 17
  • Push event: 77
  • Pull request review comment event: 2
  • Pull request review event: 5
  • Pull request event: 31
  • Fork event: 5
  • Create event: 12

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 16
  • Average time to close issues: almost 4 years
  • Average time to close pull requests: 6 months
  • Total issue authors: 3
  • Total pull request authors: 6
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.31
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 15
  • Average time to close issues: N/A
  • Average time to close pull requests: 21 days
  • Issue authors: 2
  • Pull request authors: 5
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.27
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • davidsbatista (1)
  • ShaiTAUB (1)
  • TessDejaeghere (1)
  • ain-soph (1)
  • mrshu (1)
  • jamblejoe (1)
  • ivyleavedtoadflax (1)
Pull Request Authors
  • davidsbatista (12)
  • jackboyla (3)
  • adgianv (2)
  • tmills (1)
  • rodrigues-pedro (1)
  • AlrikF (1)
  • dependabot[bot] (1)
  • ivyleavedtoadflax (1)
  • infopz (1)
  • n-drury (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 29,060 last-month
  • Total docker downloads: 31
  • Total dependent packages: 2
  • Total dependent repositories: 20
  • Total versions: 9
  • Total maintainers: 1
pypi.org: nervaluate

NER evaluation considering partial match scoring

  • Versions: 9
  • Dependent Packages: 2
  • Dependent Repositories: 20
  • Downloads: 29,060 Last month
  • Docker Downloads: 31
Rankings
Dependent packages count: 3.2%
Dependent repos count: 3.2%
Average: 3.8%
Docker downloads count: 4.1%
Downloads: 4.6%
Maintainers (1)
Last synced: 7 months ago

Dependencies

requirements_dev.txt pypi
  • codecov *
  • gitchangelog *
  • pytest *
  • pytest-cov *
  • tox *
  • tox-gh-actions *
  • twine *
  • wheel *
.github/workflows/tests.yaml actions
  • actions/checkout v1 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite