End-to-End BERT-Based Coreference System

https://github.com/compnet/tibert

Science Score: 26.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Keywords

bert coreference-resolution nlp
Last synced: 5 months ago

Repository

End-to-End BERT-Based Coreference System

Basic Info
  • Host: GitHub
  • Owner: CompNet
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 188 KB
Statistics
  • Stars: 1
  • Watchers: 3
  • Forks: 1
  • Open Issues: 0
  • Releases: 13
Topics
bert coreference-resolution nlp
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
  • Readme
  • License

README.md

Tibert

Tibert is a transformers-compatible reproduction of the model from the paper End-to-end Neural Coreference Resolution, with several modifications.

It can be installed with pip install tibert.

Documentation

Simple Prediction Example

Here is an example of using the simple prediction interface:

```python
from tibert import BertForCoreferenceResolution, predict_coref_simple
from tibert.utils import pprint_coreference_document
from transformers import BertTokenizerFast

model = BertForCoreferenceResolution.from_pretrained(
    "compnet-renard/bert-base-cased-literary-coref"
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

annotated_doc = predict_coref_simple(
    "Sli did not want the earpods. He didn't like them.", model, tokenizer
)

pprint_coreference_document(annotated_doc)
```

results in:

>>> (0 Sli ) did not want the earpods. (0 He ) didn't like them.

Batched Predictions for Performance

A more advanced prediction interface is available:

```python
from transformers import BertTokenizerFast
from tibert import predict_coref, BertForCoreferenceResolution
from tibert.utils import pprint_coreference_document

model = BertForCoreferenceResolution.from_pretrained(
    "compnet-renard/bert-base-cased-literary-coref"
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

documents = [
    "Sli did not want the earpods. He didn't like them.",
    "Princess Liana felt sad, because Zarth Arn was gone. The princess went to sleep.",
]

annotated_docs = predict_coref(documents, model, tokenizer, batch_size=2)

for doc in annotated_docs:
    pprint_coreference_document(doc)
```

results in:

>>> (0 Sli ) did not want the earpods . (0 He ) didn't like them .

>>> (0 Princess Liana ) felt sad , because (1 Zarth Arn ) was gone . (0 The princess) went to sleep .

Using Coreference Chains

The predicted coreference chains can be accessed using the .coref_chains attribute:

```python
annotated_doc = predict_coref_simple(
    "Princess Liana felt sad, because Zarth Arn was gone. The princess went to sleep.",
    model,
    tokenizer,
)
print(annotated_doc.coref_chains)
```

>>>[[Mention(tokens=['The', 'princess'], start_idx=11, end_idx=13), Mention(tokens=['Princess', 'Liana'], start_idx=0, end_idx=2)], [Mention(tokens=['Zarth', 'Arn'], start_idx=6, end_idx=8)]]
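
Each chain is a list of Mention objects. Based only on the attributes visible in the output above (tokens, start_idx, end_idx), the chains can be post-processed directly; the snippet below is a small illustrative sketch rather than a documented API example:

```python
# Illustrative sketch: print the surface form of every mention in each chain.
# Uses only the Mention attributes visible in the output above.
for chain_id, chain in enumerate(annotated_doc.coref_chains):
    surface_forms = [" ".join(mention.tokens) for mention in chain]
    print(f"chain {chain_id}: {surface_forms}")
```

For the document above, this lists "The princess" and "Princess Liana" in one chain and "Zarth Arn" alone in the other.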

Hierarchical Merging

Hierarchical merging reduces RAM usage and computation when performing inference on long documents. To use it, the user provides the text cut into chunks. The model performs prediction chunk by chunk, so the whole document never has to be held in memory at once; hierarchical merging then attempts to merge the chunk-level predictions. This allows scaling to arbitrarily large documents. See Coreference in Long Documents using Hierarchical Entity Merging for more details.

Hierarchical merging can be used as follows:

```python
from tibert import BertForCoreferenceResolution, predict_coref
from tibert.utils import pprint_coreference_document
from transformers import BertTokenizerFast

model = BertForCoreferenceResolution.from_pretrained(
    "compnet-renard/bert-base-cased-literary-coref"
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

chunk1 = "Princess Liana felt sad, because Zarth Arn was gone."
chunk2 = "She went to sleep."

annotated_doc = predict_coref(
    [chunk1, chunk2], model, tokenizer, hierarchical_merging=True
)

pprint_coreference_document(annotated_doc)
```

This results in:

>>>(1 Princess Liana ) felt sad , because (0 Zarth Arn ) was gone . (1 She ) went to sleep .

Even though the mentions Princess Liana and She are not in the same chunk, hierarchical merging still resolves this case correctly.
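
Producing the chunks is left to the user; one simple option is to split a long text into fixed-size groups of sentences before calling predict_coref. The helper below is hypothetical (it is not part of tibert); only the predict_coref call with hierarchical_merging=True comes from the example above:

```python
# Hypothetical chunking helper, not part of tibert: split a long text into
# groups of N sentences, then feed the chunks to hierarchical merging.
import re
from typing import List

def naive_chunks(text: str, sentences_per_chunk: int = 50) -> List[str]:
    # Very rough sentence split on end-of-sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

long_text = open("novel.txt").read()  # placeholder path
annotated_doc = predict_coref(
    naive_chunks(long_text), model, tokenizer, hierarchical_merging=True
)
```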

Training a model

Aside from the tibert.train.train_coref_model function, it is possible to train a model from the command line. Training a model requires installing the sacred library. Here is the most basic example:

```sh
python -m tibert.run_train with\
       dataset_path=/path/to/litbank/repository\
       out_model_dir=/path/to/output/model/directory
```

The following parameters can be set (taken from the config function in ./tibert/run_train.py):

| Parameter                    | Default Value       |
|------------------------------|---------------------|
| batch_size                   | 1                   |
| epochs_nb                    | 30                  |
| dataset_name                 | "litbank"           |
| dataset_path                 | "~/litbank"         |
| mentions_per_tokens          | 0.4                 |
| antecedents_nb               | 350                 |
| max_span_size                | 10                  |
| mention_scorer_hidden_size   | 3000                |
| sents_per_documents_train    | 11                  |
| mention_loss_coeff           | 0.1                 |
| bert_lr                      | 1e-5                |
| task_lr                      | 2e-4                |
| dropout                      | 0.3                 |
| segment_size                 | 128                 |
| encoder                      | "bert-base-cased"   |
| out_model_dir                | "~/tibert/model"    |
| checkpoint                   | None                |
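
Any of these parameters can be overridden on the command line with the same with syntax as above (the values below are arbitrary and only illustrate the mechanism):

```sh
# Illustration only: override a few hyperparameters from the table above.
python -m tibert.run_train with\
       dataset_path=/path/to/litbank/repository\
       out_model_dir=/path/to/output/model/directory\
       batch_size=2\
       epochs_nb=20\
       mention_scorer_hidden_size=2000
```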

Training metrics can be monitored by adding sacred run observers through command-line flags; see the sacred documentation for more details.
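
For instance, assuming sacred's standard -F/--file_storage flag, runs can be logged to a local directory (the paths below are placeholders):

```sh
# Assumes sacred's -F (file storage observer) flag; paths are placeholders.
python -m tibert.run_train with\
       dataset_path=/path/to/litbank/repository\
       out_model_dir=/path/to/output/model/directory\
       -F /path/to/training/runs
```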

Method

We reimplemented the model from Lee et al., 2017 from scratch, but used BERT as the encoder, as in Joshi et al., 2019. We do not use higher-order inference as in Lee et al., 2018, since Xu and Choi, 2020 found that it does not necessarily help.

Singletons

Unfortunately, the framework from Lee et al., 2017 cannot represent singletons. This is because the authors were working on the OntoNotes dataset, where singletons are not annotated. We wanted to work on Litbank, so we had to find a way to represent singletons.

We opted to do as in Xu and Choi, 2021: we consider mentions with a high enough mention score as singletons, even when they belong to no cluster. To force the model to learn proper mention scores, we add an auxiliary loss on the mention score (as in Xu and Choi, 2021). To counter the dataset imbalance between positive and negative mentions, we compute a weighted loss instead of performing sampling.
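
As an illustration of the weighting idea (a generic sketch, not Tibert's actual loss code), the rare positive class can simply be up-weighted in a binary cross-entropy over mention scores:

```python
# Generic sketch of a class-weighted mention loss, not Tibert's implementation.
# mention_scores: raw logits for candidate spans; mention_labels: 1 for gold
# mentions, 0 otherwise. Positives are rare, so they are up-weighted by the
# negative/positive ratio instead of subsampling negatives.
import torch
import torch.nn.functional as F

def weighted_mention_loss(mention_scores: torch.Tensor,
                          mention_labels: torch.Tensor) -> torch.Tensor:
    positives = mention_labels.sum().clamp(min=1)
    negatives = (mention_labels.numel() - positives).clamp(min=1)
    pos_weight = negatives / positives  # up-weight the rare positive class
    return F.binary_cross_entropy_with_logits(
        mention_scores, mention_labels.float(), pos_weight=pos_weight
    )
```

In the full model, such a term would be added to the coreference loss, scaled by a coefficient such as the mention_loss_coeff parameter listed above.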

Additional Features

Several works make use of additional features. For now, only the distance between spans is implemented.
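
A common way to exploit this feature (following Lee et al., 2017, though not necessarily Tibert's exact scheme) is to bucket the span distance and embed the bucket index before feeding it to the pair scorer:

```python
# Generic sketch of a bucketed span-distance feature, not Tibert's exact code.
import torch
import torch.nn as nn

# Assumed bucket boundaries; distances beyond the last boundary share a bucket.
DISTANCE_BUCKETS = torch.tensor([1, 2, 3, 4, 8, 16, 32, 64])
distance_embedding = nn.Embedding(len(DISTANCE_BUCKETS) + 1, 20)

# Distances between each mention and its candidate antecedents.
distances = torch.tensor([0, 1, 5, 40, 200])
bucket_ids = torch.bucketize(distances, DISTANCE_BUCKETS)
features = distance_embedding(bucket_ids)  # shape: (5, 20)
```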

Results

The following table presents the results we obtained on Litbank by training this model. We evaluate on 10% of Litbank documents, each of which consists of ~2000 tokens. The split column indicates whether documents were split into blocks of 512 tokens. The HM column indicates whether hierarchical merging is used.

| Dataset | Base model      | split | HM  | MUC   | B3    | CEAF  | BLANC | LEA   | time (m:s) |
|---------|-----------------|-------|-----|-------|-------|-------|-------|-------|------------|
| Litbank | bert-base-cased | no    | no  | 75.03 | 60.66 | 48.71 | 62.96 | 32.84 | 22:07      |
| Litbank | bert-base-cased | yes   | no  | 73.84 | 49.14 | 47.88 | 48.41 | 27.63 | 16:18      |
| Litbank | bert-base-cased | yes   | yes | 74.54 | 59.30 | 46.98 | 62.69 | 42.46 | 21:13      |

Citation

If you use this software in a research project, you can cite Tibert as follows:

```bibtex
@Misc{tibert,
  author = {Amalvy, A. and Labatut, V. and Dufour, R.},
  title  = {Tibert},
  year   = {2023},
  url    = {https://github.com/CompNet/Tibert},
}
```

Owner

  • Name: Complex Networks
  • Login: CompNet
  • Kind: organization
  • Location: Avignon, France

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Push event: 6
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 1
  • Push event: 6
  • Create event: 1

Dependencies

poetry.lock pypi
  • attrs 23.1.0
  • certifi 2023.5.7
  • charset-normalizer 3.1.0
  • click 8.1.3
  • cmake 3.26.4
  • colorama 0.4.6
  • exceptiongroup 1.1.2
  • filelock 3.12.2
  • fsspec 2023.6.0
  • huggingface-hub 0.17.3
  • hypothesis 6.82.3
  • idna 3.4
  • iniconfig 2.0.0
  • jinja2 3.1.2
  • joblib 1.3.1
  • lit 16.0.6
  • markdown-it-py 3.0.0
  • markupsafe 2.1.3
  • mdurl 0.1.2
  • more-itertools 10.1.0
  • mpmath 1.3.0
  • neleval 3.1.1
  • networkx 2.8.8
  • numpy 1.24.4
  • nvidia-cublas-cu11 11.10.3.66
  • nvidia-cuda-cupti-cu11 11.7.101
  • nvidia-cuda-nvrtc-cu11 11.7.99
  • nvidia-cuda-runtime-cu11 11.7.99
  • nvidia-cudnn-cu11 8.5.0.96
  • nvidia-cufft-cu11 10.9.0.58
  • nvidia-curand-cu11 10.2.10.91
  • nvidia-cusolver-cu11 11.4.0.1
  • nvidia-cusparse-cu11 11.7.4.91
  • nvidia-nccl-cu11 2.14.3
  • nvidia-nvtx-cu11 11.7.91
  • packaging 23.1
  • pluggy 1.2.0
  • pygments 2.15.1
  • pytest 7.4.0
  • pyyaml 6.0
  • regex 2023.6.3
  • requests 2.31.0
  • rich 13.6.0
  • sacremoses 0.0.53
  • safetensors 0.3.1
  • setuptools 68.0.0
  • six 1.16.0
  • sortedcontainers 2.4.0
  • sympy 1.12
  • tokenizers 0.14.1
  • tomli 2.0.1
  • torch 2.0.0
  • tqdm 4.65.0
  • transformers 4.34.0
  • triton 2.0.0
  • typing-extensions 4.7.1
  • urllib3 2.0.3
  • wheel 0.40.0
pyproject.toml pypi
  • more-itertools ^10.1.0
  • neleval ^3.1.1
  • networkx ^2.6.3
  • python ^3.8,<3.11
  • rich ^13.5.3
  • sacremoses ^0.0.53
  • torch >=2.0.0, !=2.0.1
  • tqdm ^4.62.3
  • transformers ^4.32.1