crosslingual-coreference

A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

https://github.com/davidberenstein1957/crosslingual-coreference

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.4%) to scientific vocabulary

Keywords

coreference coreference-resolution hacktoberfest natural-language-processing nlp python spacy

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: davidberenstein1957
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 548 KB

Statistics

Stars: 108
Watchers: 4
Forks: 19
Open Issues: 10
Releases: 12

Topics

coreference coreference-resolution hacktoberfest natural-language-processing nlp python spacy

Created almost 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

Crosslingual Coreference

Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore assumes that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

```python
from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
print(predictor.pipe([text])[0]["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
```
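To make the idea of "resolved text" concrete, here is a minimal sketch that needs no model. It assumes each cluster is a list of inclusive `[start, end]` token index pairs and treats the first mention of each cluster as its head, which is a simplification and not necessarily how the package picks cluster heads. The `resolve` helper and the toy tokens are hypothetical.

```python
from typing import List


def resolve(tokens: List[str], clusters: List[List[List[int]]]) -> str:
    """Replace every later mention in a cluster with the first mention's tokens."""
    # Map the start index of each later mention to (inclusive end, head tokens).
    replacements = {}
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head = tokens[head_start:head_end + 1]
        for start, end in cluster[1:]:
            replacements[start] = (end, head)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)  # substitute the head for the mention
            i = end + 1            # skip the tokens of the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return " ".join(resolved)


# Toy input (assumption): "him" at index 7 refers to "Momofuku Ando" at 0..1.
tokens = ["Momofuku", "Ando", "created", "noodles", ".", "Students", "know", "him", "."]
clusters = [[[0, 1], [7, 7]]]
print(resolve(tokens, clusters))
# Momofuku Ando created noodles . Students know Momofuku Ando .
```

Joining with spaces keeps the sketch short. A real implementation would preserve the original whitespace of each token.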

Models

As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

- The "minilm" model offers the best quality/speed trade-off for both multi-lingual and English texts.
- The "info_xlm" model produces the best quality for multi-lingual texts.
- The AllenNLP "spanbert" model produces the best quality for English texts.
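The recommendations above can be written down as a small lookup. This is only an illustrative sketch: `MODEL_SCORES` restates the scores quoted above, `pick_model` is a hypothetical helper that is not part of the package, and the spellings "info_xlm" and "xlm_roberta" are assumed.

```python
# Scores quoted above for OntoNotes Release 5.0 English data.
MODEL_SCORES = {"spanbert": 83, "info_xlm": 77, "xlm_roberta": 74, "minilm": 74}


def pick_model(multilingual: bool, prefer_speed: bool) -> str:
    """Hypothetical helper mirroring the recommendations listed above."""
    if prefer_speed:
        # Best quality/speed trade-off for any language.
        return "minilm"
    # Best quality: info_xlm for multi-lingual text, spanbert for English.
    return "info_xlm" if multilingual else "spanbert"


model_name = pick_model(multilingual=True, prefer_speed=False)
print(model_name, MODEL_SCORES[model_name])
# info_xlm 77
```

The returned string could then be passed as `model_name` when constructing a `Predictor`.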

Chunking/batching to resolve memory OOM errors

```python
from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
```
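To illustrate the general idea behind chunking with overlap, the sketch below splits a list of sentences into chunks under a character budget and repeats the last few sentences of each chunk at the start of the next one, so mentions near a boundary keep some context. This is an assumption about how such chunking could work, not the package's actual implementation. `chunk_sentences` and the units of its parameters are hypothetical.

```python
from typing import List


def chunk_sentences(
    sentences: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Group sentences into chunks of at most roughly `chunk_size` characters."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        if current and current_len + len(sentence) > chunk_size:
            chunks.append(current)
            # Carry the last `chunk_overlap` sentences into the next chunk.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += len(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = ["aaaa", "bbbb", "cccc", "dddd"]
print(chunk_sentences(sentences, chunk_size=8, chunk_overlap=1))
# [['aaaa', 'bbbb'], ['bbbb', 'cccc'], ['cccc', 'dddd']]
```

Smaller chunks lower peak memory at the cost of more model calls, and coreference links that span more than the overlap can be missed.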

Use spaCy pipeline

```python
import spacy

# `text` is the string defined in the Quickstart above
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
```
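The cluster output is a list of clusters, each holding `[start, end]` token index pairs. The visualization snippet in this README adds 1 to the end index when building a spaCy `Span`, which suggests that the end index is inclusive. Under that assumption, the sketch below maps clusters back to mention text using a plain token list. The toy tokens, the toy cluster, and the `mentions_from_clusters` helper are hypothetical and only meant as an illustration.

```python
from typing import List


def mentions_from_clusters(
    tokens: List[str], clusters: List[List[List[int]]]
) -> List[List[str]]:
    """Return the text of every mention, grouped per cluster."""
    return [
        # `end` is treated as inclusive, hence the +1 in the slice.
        [" ".join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


tokens = [
    "Momofuku", "Ando", "created", "instant", "noodles", ".",
    "He", "lived", "in", "Osaka", ".",
]
clusters = [[[0, 1], [6, 6]]]
print(mentions_from_clusters(tokens, clusters))
# [['Momofuku Ando', 'He']]
```

With a real spaCy `Doc`, the equivalent slice would be `doc[start:end + 1].text`.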

Visualize spacy pipeline

This only works with spacy >= 3.3.

```python
import spacy
from spacy.tokens import Span
from spacy import displacy

# `text` is the string to analyze, e.g. the one defined in the Quickstart
nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref", config={"model_name": "minilm"})
doc = nlp(text)
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
    for span in cluster:
        # cluster end indices are inclusive, Span end indices are exclusive
        spans.append(
            Span(doc, span[0], span[1]+1, str(idx).upper())
        )

doc.spans["custom"] = spans

displacy.render(doc, style="span", options={"spans_key": "custom"})
```

More Examples

Owner

Name: David Berenstein
Login: davidberenstein1957
Kind: user
Location: Madrid
Company: @argilla-io

Website: https://www.linkedin.com/in/david-berenstein-1bab11105/
Repositories: 2
Profile: https://github.com/davidberenstein1957

👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing Developer Advocate @argilla-io

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: David
    given-names: Berenstein
title: "Crosslingual Coreference - a multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy."
version: 0.2.9
date-released: 2022-09-24

GitHub Events

Total

Watch event: 5
Pull request review event: 1
Fork event: 2

Last Year

Watch event: 5
Pull request review event: 1
Fork event: 2

Committers

Last synced: 9 months ago

All Time

Total Commits: 48
Total Committers: 5
Avg Commits per committer: 9.6
Development Distribution Score (DDS): 0.5

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
David Berenstein	d**n@p**m	24
David Berenstein	d**n@g**m	14
Mathijs Boezer	m**r@p**m	7
Daniel Vila Suero	d**l@r**i	2
Martin Kirilov	m**v@g**m	1

Committer Domains (Top 20 + Academic)

pandoraintelligence.com: 2
recogn.ai: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 6
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 6
Total pull request authors: 1
Average comments per issue: 1.17
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

dimitristaufer (1)
joa-spec (1)
osehmathias (1)
rainergo (1)
arslanahmad90 (1)
vaibhava-vylabs (1)

Pull Request Authors

matesaki (2)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

poetry.lock pypi

216 dependencies

pyproject.toml pypi

black ~22.3 develop
flake8 ~4.0 develop
flake8-bugbear ~22.3 develop
flake8-docstrings ~1.6 develop
isort ^5.10 develop
pep8-naming ^~0.12 develop
pre-commit ~2.17 develop
pytest ~7.0 develop
Pillow >9.1
allennlp ~2.8
allennlp-models ~2.8
checklist ^0.0.11
python ^3.7.1
spacy ~3.1

.github/workflows/python-package.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/python-publish.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite
pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
