https://github.com/ausgerechnet/bratutils
A collection of utilities for manipulating data and calculating inter-annotator agreement in brat annotation files.
Repository
Basic Info
- Host: GitHub
- Owner: ausgerechnet
- License: mit
- Default Branch: master
- Size: 76.2 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of jeanphilippegoldman/bratutils
Created about 5 years ago
· Last pushed over 6 years ago
https://github.com/ausgerechnet/bratutils/blob/master/
bratutils
=========
A collection of utilities for manipulating data and calculating inter-annotator
agreement in brat annotation files.
### Installation
Install as a normal Python package with `pip`.
```bash
$ pip install bratutils
```
### Agreement Definition
Agreement in multi-token annotations is commonly evaluated using [f-score][fsc]
due to various problems with computing the traditional [Krippendorff's alpha][al]
and [Cohen's kappa][ka]. [Hripcsak and Rothschild][hripcsak] prove the validity of
the metric for very large populations, i.e. for unrestricted text annotations.
This library roughly follows the definitions of precision and recall calculation
from the [MUC-7 test scoring][muc]. The basic definitions along with some
additional restrictions are laid out below:
* `CORRECT` - when annotation tags and indices match completely
* `INCORRECT` - when annotation tags do not match, but the indices coincide
* `PARTIAL` - when the annotation tags are the same and the annotations share
  the same end index but have different start indices
* `MISSING` - annotations existing only in the gold standard annotation set
* `SPURIOUS` - annotations existing only in the candidate annotation set
_Note_: the gold standard is considered the collection/document from which the
comparison is invoked, while the supplied parallel annotation is considered
the candidate set.
_*Disclaimer:*_ the current definition of the `PARTIAL` category accommodates
working with syntactic chunks. A different arrangement (e.g. pick largest
contained tag as partial match instead of rightmost) might be more suitable for
other tasks, for example some types of semantic annotation.
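
The categories above can be illustrated with a short, self-contained sketch.
It does not use or reproduce the library's own matching code: the `Ann` tuple,
`classify_pair` and `tally` are hypothetical names introduced only for this
illustration, and the greedy "exact span first, then partial" pairing is an
assumption.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Ann(NamedTuple):
    """Hypothetical stand-in for one annotation: a tag plus character offsets."""
    tag: str
    start: int
    end: int


def classify_pair(gold: Ann, cand: Ann) -> Optional[str]:
    """Label a gold/candidate pair, or return None if the two do not match."""
    if (gold.start, gold.end) == (cand.start, cand.end):
        return "CORRECT" if gold.tag == cand.tag else "INCORRECT"
    if gold.tag == cand.tag and gold.end == cand.end:
        return "PARTIAL"  # same tag, same end index, different start index
    return None


def tally(gold, candidates) -> Counter:
    """Count the five categories (assumed greedy pairing, not the library's)."""
    counts = Counter()
    remaining = list(candidates)
    for g in gold:
        labelled = [(classify_pair(g, c), c) for c in remaining]
        # Prefer a same-span match (CORRECT/INCORRECT) over a PARTIAL one.
        exact = [(l, c) for l, c in labelled if l in ("CORRECT", "INCORRECT")]
        partial = [(l, c) for l, c in labelled if l == "PARTIAL"]
        hits = exact or partial
        if hits:
            label, matched = hits[0]
            counts[label] += 1
            remaining.remove(matched)
        else:
            counts["MISSING"] += 1  # only in the gold standard
    counts["SPURIOUS"] += len(remaining)  # only in the candidate set
    return counts


gold = [Ann("NP", 0, 5), Ann("VP", 6, 10), Ann("NP", 11, 20), Ann("PP", 21, 25)]
cand = [Ann("NP", 0, 5), Ann("NP", 6, 10), Ann("NP", 14, 20), Ann("NP", 30, 35)]
print(tally(gold, cand))
```

In this toy input each of the five categories occurs exactly once.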
### Examples
Simple example:
```python
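from bratutils import agreement as a

# Load two parallel annotations of the same text
doc = a.Document('res/samples/A/data-sample-1.ann')
doc2 = a.Document('res/samples/B/data-sample-1.ann')

# Mark the first document as the gold standard
doc.make_gold()
# Score the second document against the gold standard
statistics = doc2.compare_to_gold(doc)
print(statistics)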
```
Output:
```shell
-------------------MUC-Table--------------------
------------------------------------------------
pos:135
act:134
cor:115
par:5
inc:4
mis:11
spu:10
------------------------------------------------
pre:0.858208955224
rec:0.851851851852
fsc:0.855018587361
------------------------------------------------
und:0.0814814814815
ovg:0.0746268656716
sub:0.0725806451613
------------------------------------------------
bor:119
ibo:15
------------------------------------------------
------------------------------------------------
```
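
The ratios in the table can be recomputed from the raw counts. The sketch below
is inferred from the sample output above rather than taken from the library
source: the function name `muc_scores` is hypothetical, and the formulas are
the ones that happen to reproduce the printed values (in particular, partial
matches receive no credit in precision and recall here). The `bor` and `ibo`
rows are left out because their definition is not stated in this README.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def muc_scores(cor: int, par: int, inc: int, mis: int, spu: int) -> dict:
    """Recompute the table's ratios from counts (formulas are assumptions)."""
    pos = cor + par + inc + mis  # annotations in the gold standard
    act = cor + par + inc + spu  # annotations in the candidate set
    pre = _ratio(cor, act)
    rec = _ratio(cor, pos)
    return {
        "pos": pos,
        "act": act,
        "pre": pre,
        "rec": rec,
        "fsc": _ratio(2 * pre * rec, pre + rec),
        "und": _ratio(mis, pos),                    # undergeneration
        "ovg": _ratio(spu, act),                    # overgeneration
        "sub": _ratio(inc + par, cor + inc + par),  # substitution
    }


# Counts taken from the sample output above.
scores = muc_scores(cor=115, par=5, inc=4, mis=11, spu=10)
for name, value in scores.items():
    print(f"{name}:{value}")
```

With the counts from the sample output, this gives `pos` 135 and `act` 134, and
the ratios agree with the printed table up to rounding.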
[fsc]:
[al]:
[ka]:
[hripcsak]:
[muc]:
Owner
- Name: Philipp Heinrich
- Login: ausgerechnet
- Kind: user
- Location: Erlangen
- Company: @fau-klue
- Website: https://philipp-heinrich.eu
- Repositories: 2
- Profile: https://github.com/ausgerechnet