https://github.com/explosion/spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
○
Academic publication links
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.8%) to scientific vocabulary

Keywords

corenlp data-science machine-learning natural-language-processing nlp spacy spacy-pipeline stanford-corenlp stanford-machine-learning stanford-nlp stanza

Keywords from Contributors

tokenization named-entity-recognition text-classification entity-linking cython spacy-extension jax language-model huggingface transfer-learning

Last synced: 11 months ago · JSON representation

Repository

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Basic Info

Host: GitHub
Owner: explosion
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 67.4 KB

Statistics

Stars: 738
Watchers: 23
Forks: 62
Open Issues: 14
Releases: 17

Topics

corenlp data-science machine-learning natural-language-processing nlp spacy spacy-pipeline stanford-corenlp stanford-machine-learning stanford-nlp stanza

Created over 7 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

spaCy + Stanza (formerly StanfordNLP)

This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models in a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labeled dependency parsing in 68 languages. As of v1.0, Stanza also supports named entity recognition for selected languages.

⚠️ Previous versions of this package were available as spacy-stanfordnlp.

Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanza model:

Statistical tokenization (reflected in the Doc and its tokens)
Lemmatization (token.lemma and token.lemma_)
Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
Morphological analysis (token.morph)
Dependency parsing (token.dep, token.dep_, token.head)
Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
Sentence segmentation (doc.sents)

⌛️ Installation

As of v1.0.0 spacy-stanza is only compatible with spaCy v3.x. To install the most recent version:

```bash
pip install spacy-stanza
```

For spaCy v2, install v0.2.x and refer to the v0.2.x usage documentation:

```bash
pip install "spacy-stanza<0.3.0"
```

Make sure to also download one of the pre-trained Stanza models.

📖 Usage & Examples

⚠️ Important note: This package has been refactored to take advantage of spaCy v3.0. Previous versions that were built for spaCy v2.x worked considerably differently. Please see previous tagged versions of this README for documentation on prior versions.

Use spacy_stanza.load_pipeline() to create an nlp object that you can use to process a text with a Stanza pipeline and create a spaCy Doc object. By default, both the spaCy pipeline and the Stanza pipeline will be initialized with the same lang, e.g. "en":

```python
import stanza
import spacy_stanza

# Download the stanza model if necessary
stanza.download("en")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```

If language data for the given language is available in spaCy, the respective language class can be used as the base for the nlp object – for example, English(). This lets you use spaCy's lexical attributes like is_stop or like_num. The nlp object follows the same API as any other spaCy Language class – so you can visualize the Doc objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.

```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
from spacy import Language

@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc

nlp.add_pipe("custom_component")
doc = nlp("Some text")

# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```

Stanza Pipeline options

Additional options for the Stanza Pipeline can be provided as keyword arguments following the Pipeline API:

Provide the Stanza language as lang. For Stanza languages without spaCy support, use "xx" for the spaCy language setting:

```python
# Initialize a pipeline for Coptic
nlp = spacy_stanza.load_pipeline("xx", lang="cop")
```

Provide Stanza pipeline settings following the Pipeline API:

```python
# Initialize a German pipeline with the `hdt` package
nlp = spacy_stanza.load_pipeline("de", package="hdt")
```

Tokenize with spaCy rather than the statistical tokenizer (only for English):

```python
nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})
```

Provide any additional processor settings as additional keyword arguments:

```python
# Provide pretokenized texts (whitespace tokenization)
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
```

The spaCy config specifies all Pipeline options in the [nlp.tokenizer] block. For example, the config for the last example above, a German pipeline with pretokenized texts:

```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true

[nlp.tokenizer.kwargs]
tokenize_pretokenized = true

[nlp.tokenizer.processors]
```
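
To make the relationship between the keyword arguments and the `[nlp.tokenizer]` block above more concrete, here is a minimal sketch in plain Python. It does not call spacy-stanza at all: `build_tokenizer_config` is a hypothetical helper written for illustration, and the assumption that an empty `lang` falls back to the spaCy language code is inferred from the default behavior described earlier, not taken from the package's source.

```python
from typing import Any, Dict, Optional


def build_tokenizer_config(
    name: str,
    *,
    lang: str = "",
    dir: Optional[str] = None,
    package: str = "default",
    processors: Optional[Dict[str, str]] = None,
    logging_level: Optional[str] = None,
    verbose: Optional[bool] = None,
    use_gpu: bool = True,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: arrange Pipeline options like the [nlp.tokenizer] block."""
    tokenizer = {
        "@tokenizers": "spacy_stanza.PipelineAsTokenizer.v1",
        # Assumption: with no explicit Stanza language, reuse the spaCy language code.
        "lang": lang or name,
        "dir": dir,
        "package": package,
        "logging_level": logging_level,
        "verbose": verbose,
        "use_gpu": use_gpu,
        # Any extra keyword arguments land in the [nlp.tokenizer.kwargs] sub-block.
        "kwargs": dict(kwargs),
        # Per-processor settings land in the [nlp.tokenizer.processors] sub-block.
        "processors": dict(processors or {}),
    }
    return {"nlp": {"tokenizer": tokenizer}}


# Mirrors the German pipeline with pretokenized texts.
config = build_tokenizer_config("de", tokenize_pretokenized=True)
print(config["nlp"]["tokenizer"]["kwargs"])
```

Read this way, named options such as `package` or `use_gpu` sit directly under `[nlp.tokenizer]`, while anything else (here `tokenize_pretokenized`) is collected under `[nlp.tokenizer.kwargs]`.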

Serialization

The full Stanza pipeline configuration is stored in the spaCy pipeline config, so you can save and load the pipeline just like any other nlp pipeline:

```python
import spacy

# Save to a local directory
nlp.to_disk("./stanza-spacy-model")

# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```

Note that this does not save any Stanza model data by default. The Stanza models are very large, so for now, this package expects you to download the models separately with stanza.download() and have them available either in the default model directory or in the path specified under [nlp.tokenizer.dir] in the config.
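
Because the model data lives outside the saved pipeline, it can be useful to check for it before reloading. The sketch below uses only the standard library and rests on two assumptions that you should verify against your Stanza version: that the default model directory is `~/stanza_resources` (overridable through the `STANZA_RESOURCES_DIR` environment variable), and that each language's models sit in a subfolder named after the language code. `find_stanza_model_dir` is a hypothetical helper, not part of spacy-stanza or Stanza.

```python
import os
from pathlib import Path
from typing import Optional


def find_stanza_model_dir(lang: str, model_dir: Optional[str] = None) -> Optional[Path]:
    """Return the per-language model folder if it exists, otherwise None."""
    if model_dir is None:
        # Assumption: Stanza's default resources location.
        model_dir = os.environ.get(
            "STANZA_RESOURCES_DIR", str(Path.home() / "stanza_resources")
        )
    # Assumption: models for a language are stored under <model_dir>/<lang>.
    candidate = Path(model_dir).expanduser() / lang
    return candidate if candidate.is_dir() else None


if find_stanza_model_dir("en") is None:
    print('No Stanza models found for "en"; run stanza.download("en") first.')
```

If you set a custom path under `[nlp.tokenizer.dir]`, pass the same path as `model_dir` so the check looks in the location the pipeline will actually use.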

Adding additional spaCy pipeline components

By default, the spaCy pipeline in the nlp object returned by spacy_stanza.load_pipeline() will be empty, because all stanza attributes are computed and set within the custom tokenizer, StanzaTokenizer. But since it's a regular nlp object, you can add your own components to the pipeline. For example, you could add your own custom text classification component with nlp.add_pipe("textcat", source=source_nlp), or augment the named entities with your own rule-based patterns using the EntityRuler component.

Owner

Name: Explosion
Login: explosion
Kind: organization
Email: contact@explosion.ai
Location: Berlin, Germany

Website: https://explosion.ai
Twitter: explosion_ai
Repositories: 61
Profile: https://github.com/explosion

A software company specializing in developer tools for Artificial Intelligence and Natural Language Processing

GitHub Events

Total

Watch event: 14
Member event: 1
Issue comment event: 3
Fork event: 3

Last Year

Watch event: 14
Member event: 1
Issue comment event: 3
Fork event: 3

Committers

Last synced: about 1 year ago

All Time

Total Commits: 97
Total Committers: 8
Avg Commits per committer: 12.125
Development Distribution Score (DDS): 0.351

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ines Montani	i**s@i**o	63
Adriane Boyd	a**d@g**m	21
Thomas Buhrmann	t**s@g**m	7
Michael K	m****k	2
Yuhao Zhang	y****g	1
Shkarin Sergey	k**y@g**m	1
Matthew Honnibal	h**h@g**m	1
Bram Vanroy	B**y@U**e	1

Committer Domains (Top 20 + Academic)

ugent.be: 1
graphext.com: 1
ines.io: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 65
Total pull requests: 38
Average time to close issues: 6 months
Average time to close pull requests: 14 days
Total issue authors: 54
Total pull request authors: 10
Average comments per issue: 2.45
Average comments per pull request: 0.63
Merged pull requests: 30
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 3
Average comments per issue: 0
Average comments per pull request: 0.67
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

mehmetilker (3)
BramVanroy (3)
ZohaibRamzan (2)
hammad26 (2)
askhogan (2)
TahaMunir1 (2)
buhrmann (2)
AlexeySlvv (2)
chaouiy (2)
Code4SAFrankie (1)
koren-v (1)
vitojph (1)
srsridharan (1)
Joselinejamy (1)
capucincapucine (1)

Pull Request Authors

adrianeboyd (21)
michael-k (8)
buhrmann (4)
nxgeo (2)
phucdev (2)
BramVanroy (2)
omri374 (1)
yuhaozhang (1)
SergeyShk (1)
ines (1)

Top Labels

Issue Labels

usage (5)
bug (2)
duplicate (1)

Pull Request Labels

bug (1)

Packages

Total packages: 1
Total downloads:
- pypi 2,092 last-month
Total docker downloads: 57

Total dependent packages: 3
Total dependent repositories: 30
Total versions: 11
Total maintainers: 2

pypi.org: spacy-stanza

Use the latest Stanza (StanfordNLP) research models directly in spaCy

Homepage: https://explosion.ai
Documentation: https://spacy-stanza.readthedocs.io/
License: MIT
Latest release: 1.0.4
published almost 3 years ago

Versions: 11
Dependent Packages: 3
Dependent Repositories: 30
Downloads: 2,092 Last month
Docker Downloads: 57

Rankings

Dependent packages count: 2.4%

Stargazers count: 2.4%

Dependent repos count: 2.7%

Average: 3.7%

Downloads: 4.3%

Docker downloads count: 4.6%

Forks count: 5.7%

Maintainers (2)

explosion inesmontani

Last synced: 11 months ago

Dependencies

requirements.txt pypi

pytest >=5.2.0
spacy >=3.0.0,<4.0.0
stanza >=1.2.0,<1.5.0

setup.py pypi

spacy >=3.0.0,<4.0.0

.github/workflows/tests.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
