https://github.com/explosion/spacy-stanza
💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.8%) to scientific vocabulary
Repository
Statistics
- Stars: 738
- Watchers: 23
- Forks: 62
- Open Issues: 14
- Releases: 17
Metadata Files
README.md
spaCy + Stanza (formerly StanfordNLP)
This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models in a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labeled dependency parsing in 68 languages. As of v1.0, Stanza also supports named entity recognition for selected languages.
⚠️ Previous versions of this package were available as spacy-stanfordnlp.
Using this wrapper, you'll be able to use the following annotations, computed by
your pretrained Stanza model:
- Statistical tokenization (reflected in the Doc and its tokens)
- Lemmatization (token.lemma and token.lemma_)
- Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
- Morphological analysis (token.morph)
- Dependency parsing (token.dep, token.dep_, token.head)
- Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
- Sentence segmentation (doc.sents)
⌛️ Installation
As of v1.0.0 spacy-stanza is only compatible with spaCy v3.x. To install
the most recent version:
```bash
pip install spacy-stanza
```
For spaCy v2, install v0.2.x and refer to the v0.2.x usage documentation:
```bash
pip install "spacy-stanza<0.3.0"
```
Make sure to also download one of the pre-trained Stanza models.
📖 Usage & Examples
⚠️ Important note: This package has been refactored to take advantage of spaCy v3.0. Previous versions that were built for spaCy v2.x worked considerably differently. Please see previous tagged versions of this README for documentation on prior versions.
Use spacy_stanza.load_pipeline() to create an nlp object that you can use to
process a text with a Stanza pipeline and create a spaCy
Doc object. By default, both the spaCy pipeline
and the Stanza pipeline will be initialized with the same lang, e.g. "en":
```python
import stanza
import spacy_stanza

# Download the stanza model if necessary
stanza.download("en")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
```
If language data for the given language is available in spaCy, the respective
language class can be used as the base for the nlp object – for example,
English(). This lets you use spaCy's lexical attributes like is_stop or
like_num. The nlp object follows the same API as any other spaCy Language
class – so you can visualize the Doc objects with displaCy, add custom
components to the pipeline, use the rule-based matcher and do pretty much
anything else you'd normally do in spaCy.
```python
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
from spacy.language import Language

@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc

nlp.add_pipe("custom_component")
doc = nlp("Some text")

# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
```
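The snippets above do not touch doc.sents or token.morph from the annotation list. As a minimal sketch, the helper below walks a processed Doc sentence by sentence and collects each token's morphological features as a string. The name sentence_morphology is hypothetical and not part of spacy-stanza; the code only assumes the standard spaCy attributes doc.sents, token.text and token.morph.

```python
def sentence_morphology(doc):
    """Return one entry per sentence: (sentence text, [(token text, morph string), ...]).

    Hypothetical helper for illustration; `doc` is a spaCy Doc produced by
    a pipeline such as spacy_stanza.load_pipeline("en").
    """
    rows = []
    for sent in doc.sents:
        # str(token.morph) renders the features as "Feat=Val|Feat=Val"
        features = [(token.text, str(token.morph)) for token in sent]
        rows.append((sent.text, features))
    return rows


# Usage (requires spacy-stanza and a downloaded Stanza model):
# nlp = spacy_stanza.load_pipeline("en")
# for text, features in sentence_morphology(nlp("He was elected in 2008.")):
#     print(text, features)
```

Returning plain tuples keeps the result independent of the Doc, which can be convenient when the annotations are written to a file or a data frame afterwards.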
Stanza Pipeline options
Additional options for the Stanza Pipeline can be provided as keyword arguments
following the Pipeline API:

- Provide the Stanza language as lang. For Stanza languages without spaCy support, use "xx" for the spaCy language setting:

  ```python
  # Initialize a pipeline for Coptic
  nlp = spacy_stanza.load_pipeline("xx", lang="cop")
  ```

- Provide Stanza pipeline settings following the Pipeline API:

  ```python
  # Initialize a German pipeline with the `hdt` package
  nlp = spacy_stanza.load_pipeline("de", package="hdt")
  ```

- Tokenize with spaCy rather than the statistical tokenizer (only for English):

  ```python
  nlp = spacy_stanza.load_pipeline("en", processors={"tokenize": "spacy"})
  ```

- Provide any additional processor settings as additional keyword arguments:

  ```python
  # Provide pretokenized texts (whitespace tokenization)
  nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
  ```
The spaCy config specifies all Pipeline options in the [nlp.tokenizer]
block. For example, the config for the last example above, a German pipeline
with pretokenized texts:
```ini
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true

[nlp.tokenizer.kwargs]
tokenize_pretokenized = true

[nlp.tokenizer.processors]
```
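To check which options ended up in the config, the [nlp.tokenizer] block can be read back from nlp.config, which spaCy exposes as a nested, dict-like object. A small sketch follows; tokenizer_settings is a hypothetical helper name, not part of spacy-stanza.

```python
def tokenizer_settings(nlp):
    """Return a plain-dict copy of the [nlp.tokenizer] config block.

    Hypothetical helper; `nlp` is any spaCy v3 Language object, for example
    the one returned by spacy_stanza.load_pipeline().
    """
    # nlp.config mirrors the sections of the config file as nested mappings
    return dict(nlp.config["nlp"]["tokenizer"])


# Usage (requires spacy-stanza and a downloaded German model):
# nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
# print(tokenizer_settings(nlp)["kwargs"])
```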
Serialization
The full Stanza pipeline configuration is stored in the spaCy pipeline
config, so you can save and load the
pipeline just like any other nlp pipeline:
```python
import spacy

# Save to a local directory
nlp.to_disk("./stanza-spacy-model")

# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")
```
Note that this does not save any Stanza model data by default. The Stanza
models are very large, so for now, this package expects you to download the
models separately with stanza.download() and have them available either in the
default model directory or in the path specified under [nlp.tokenizer.dir] in
the config.
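One way to keep the models in a project-specific location is to pass the same directory to both stanza.download() (its model_dir argument) and spacy_stanza.load_pipeline() (its dir argument, which corresponds to the dir setting in the config). The sketch below assumes both packages are installed; the helper names and the directory name are hypothetical.

```python
from pathlib import Path


def resolve_model_dir(path):
    """Return an absolute path string for the Stanza model directory."""
    # An absolute path means the stored dir setting does not depend on the
    # working directory the pipeline is later reloaded from.
    return str(Path(path).expanduser().resolve())


def load_with_local_models(lang, model_dir):
    """Download Stanza models into model_dir and load a pipeline from there."""
    # Imported here so the path helper above works without the heavy dependencies
    import stanza
    import spacy_stanza

    model_dir = resolve_model_dir(model_dir)
    stanza.download(lang, model_dir=model_dir)
    return spacy_stanza.load_pipeline(lang, dir=model_dir)


# Usage:
# nlp = load_with_local_models("en", "./stanza_models")
# nlp.to_disk("./stanza-spacy-model")
```

The saved pipeline still only records where the models live, so the directory has to exist on any machine that reloads it.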
Adding additional spaCy pipeline components
By default, the spaCy pipeline in the nlp object returned by
spacy_stanza.load_pipeline() will be empty, because all stanza attributes
are computed and set within the custom tokenizer,
StanzaTokenizer. But since it's a regular nlp
object, you can add your own components to the pipeline. For example, you could
add
your own custom text classification component
with nlp.add_pipe("textcat", source=source_nlp), or augment the named entities
with your own rule-based patterns using the
EntityRuler component.
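As a minimal sketch of the second suggestion, the helper below adds spaCy's built-in entity_ruler component and registers a few patterns. The name add_rule_based_entities and the example pattern are hypothetical. With the ruler's default settings, it should only add entities that do not overlap with those already set from the Stanza model.

```python
def add_rule_based_entities(nlp, patterns):
    """Append an entity_ruler to the pipeline and register the given patterns.

    Hypothetical helper; `nlp` is the object returned by
    spacy_stanza.load_pipeline(), `patterns` a list of EntityRuler patterns.
    """
    # "entity_ruler" is the registered factory name of spaCy's EntityRuler
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return ruler


# Example pattern: label the exact string "spaCy" as a PRODUCT
EXAMPLE_PATTERNS = [{"label": "PRODUCT", "pattern": "spaCy"}]

# Usage (requires spacy-stanza and a downloaded Stanza model):
# nlp = spacy_stanza.load_pipeline("en")
# add_rule_based_entities(nlp, EXAMPLE_PATTERNS)
# print(nlp("I use spaCy every day.").ents)
```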
Owner
- Name: Explosion
- Login: explosion
- Kind: organization
- Email: contact@explosion.ai
- Location: Berlin, Germany
- Website: https://explosion.ai
- Twitter: explosion_ai
- Repositories: 61
- Profile: https://github.com/explosion
A software company specializing in developer tools for Artificial Intelligence and Natural Language Processing
GitHub Events
Total
- Watch event: 14
- Member event: 1
- Issue comment event: 3
- Fork event: 3
Last Year
- Watch event: 14
- Member event: 1
- Issue comment event: 3
- Fork event: 3
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ines Montani | i****s@i****o | 63 |
| Adriane Boyd | a****d@g****m | 21 |
| Thomas Buhrmann | t****s@g****m | 7 |
| Michael K | m****k | 2 |
| Yuhao Zhang | y****g | 1 |
| Shkarin Sergey | k****y@g****m | 1 |
| Matthew Honnibal | h****h@g****m | 1 |
| Bram Vanroy | B****y@U****e | 1 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 65
- Total pull requests: 38
- Average time to close issues: 6 months
- Average time to close pull requests: 14 days
- Total issue authors: 54
- Total pull request authors: 10
- Average comments per issue: 2.45
- Average comments per pull request: 0.63
- Merged pull requests: 30
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.67
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mehmetilker (3)
- BramVanroy (3)
- ZohaibRamzan (2)
- hammad26 (2)
- askhogan (2)
- TahaMunir1 (2)
- buhrmann (2)
- AlexeySlvv (2)
- chaouiy (2)
- Code4SAFrankie (1)
- koren-v (1)
- vitojph (1)
- srsridharan (1)
- Joselinejamy (1)
- capucincapucine (1)
Pull Request Authors
- adrianeboyd (21)
- michael-k (8)
- buhrmann (4)
- nxgeo (2)
- phucdev (2)
- BramVanroy (2)
- omri374 (1)
- yuhaozhang (1)
- SergeyShk (1)
- ines (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 2,092 last month
- Total docker downloads: 57
- Total dependent packages: 3
- Total dependent repositories: 30
- Total versions: 11
- Total maintainers: 2
pypi.org: spacy-stanza
- Homepage: https://explosion.ai
- Documentation: https://spacy-stanza.readthedocs.io/
- License: MIT
- Latest release: 1.0.4 (published over 2 years ago)
Dependencies
- pytest >=5.2.0
- spacy >=3.0.0,<4.0.0
- stanza >=1.2.0,<1.5.0
- spacy >=3.0.0,<4.0.0
- actions/checkout v3 composite
- actions/setup-python v4 composite