Recent Releases of spacy
spacy - v3.8.7: Python 3.13 support, Cython 3, centralize registry entries
In order to support Python 3.13, spaCy is now compiled with Cython 3. This changes the way types are handled at runtime: Cython 3 uses the from __future__ import annotations semantics, which store types as strings at runtime. This difference caused problems for components registered within Cython files, as we rely on building Pydantic models from factory function signatures to do validation.
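The effect of string-typed annotations can be seen in plain Python. The sketch below is only an illustration, not spaCy's validation code: `make_component` is a hypothetical factory, and the annotations are written as string literals to mimic how they are stored under the `from __future__ import annotations` semantics.

```python
import typing


# Hypothetical factory function. Writing the annotations as string literals
# mimics how they are stored under `from __future__ import annotations`.
def make_component(name: "str", threshold: "float" = 0.5) -> "dict":
    return {"name": name, "threshold": threshold}


# The raw annotations are plain strings, not type objects, so code that
# inspects the signature directly sees no usable types.
raw = make_component.__annotations__

# Turning them back into real types takes an explicit resolution step.
resolved = typing.get_type_hints(make_component)

print(type(raw["threshold"]).__name__, repr(raw["threshold"]))
print(resolved["threshold"])
```

Any tool that builds a validation model from a signature has to perform a resolution step like `typing.get_type_hints`, and that step needs access to the namespace in which the names were defined.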
To support Python 3.13 we therefore create a new module, spacy.pipeline.factories, which contains the factory function implementations. __getattr__ import shims have been added to the previous locations of these functions to prevent backwards incompatibilities.
As well as moving the factories, the new implementation avoids import-time side-effects, by moving the actual calls to the decorator inside a function, which is executed once when the Language class is initialised.
A matching change has been made to the catalogue registry decorators. A new module spacy.registrations has been created that performs all the catalogue registrations. Moving these registrations away from the functions prevents these decorators from running at import time. This change was not necessary for the Python 3.13 support, but it means we no longer rely on any import-time side-effects, which will allow us to improve spaCy's import time and therefore CLI execution time. The change also makes maintenance easier as it's easier to find the implementations of different registry functions (this may help library users as well).
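The general pattern of deferring registration can be sketched as follows. This is a minimal illustration under stated assumptions, not spaCy's implementation: `REGISTRY`, `register`, `register_all`, `make_tagger` and `Pipeline` are all hypothetical names invented for the example.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a catalogue-style registry; not spaCy's real one.
REGISTRY: Dict[str, Callable] = {}
_registered = False


def register(name: str) -> Callable[[Callable], Callable]:
    """Return a decorator that stores a function under `name`."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


def make_tagger() -> str:
    # A plain function: importing this module registers nothing.
    return "tagger component"


def register_all() -> None:
    """Apply the decorators on demand, and at most once."""
    global _registered
    if _registered:
        return
    register("tagger")(make_tagger)
    _registered = True


class Pipeline:
    def __init__(self) -> None:
        # Registration runs on first initialisation, not at import time.
        register_all()
```

Because the decorator calls live inside `register_all()`, importing the module is cheap and has no side effects; the registry is only populated once a `Pipeline` is actually constructed.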
- Python
Published by github-actions[bot] about 1 year ago
spacy - v3.8.6: Restore wheels, remove Python 3.13 compatibility
Restores support for wheels for ARM platforms, while correctly noting compatibility range.
- Python
Published by github-actions[bot] about 1 year ago
spacy - v3.8.3: Improve memory zone stability
Fix a bug in memory zones that occurred when non-transient strings were added to the StringStore inside a memory zone. This caused the morphological analyser to raise string-not-found errors when it was applied inside a memory zone.
- Python
Published by github-actions[bot] over 1 year ago
spacy - v3.8: Memory management for persistent services, numpy 2.0 support
Optional memory management for persistent services
Support a new context manager method Language.memory_zone(), to allow long-running services to avoid growing memory usage from cached entries in the Vocab or StringStore. Once the memory zone block ends, spaCy will evict Vocab and StringStore entries that were added during the block, freeing up memory. Doc objects created inside a memory zone block should not be accessed outside the block.
The current implementation disables population of the tokenizer cache inside the memory zone, resulting in some performance impact. The performance difference will likely be negligible if you're running a full pipeline, but if you're only running the tokenizer, it'll be much slower. If this is a problem, you can mitigate it by warming the cache first, by processing the first few batches of text without creating a memory zone. Support for memory zones in the tokenizer will be added in a future update.
The Language.memory_zone() context manager also checks for a memory_zone() method on pipeline components, so that components can perform similar memory management if necessary. None of the built-in components currently require this.
If your component needs to add non-transient entries to the StringStore or Vocab, you can pass the allow_transient=False flag to the Vocab.add() or StringStore.add() methods.
Example usage:
```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator

import spacy
import typer
from spacy.util import minibatch


def texts(path: Path) -> Iterator[str]:
    # Yield the "text" field of each JSON line in the input file.
    with path.open("r", encoding="utf8") as file:
        for line in file:
            yield json.loads(line)["text"]


def main(jsonl_path: Path) -> None:
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    batches = minibatch(texts(jsonl_path), 1000)
    for i, batch in enumerate(batches):
        print("Batch", i)
        # Entries added to the Vocab and StringStore inside this block
        # are evicted when the block ends.
        with nlp.memory_zone():
            for doc in nlp.pipe(batch):
                for token in doc:
                    # Store plain strings, not Doc or Token objects,
                    # so nothing from the zone is used after it closes.
                    counts[token.text] += 1
    for word, count in counts.most_common(100):
        print(count, word)


if __name__ == "__main__":
    typer.run(main)
```
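To make the eviction behaviour described above more concrete, here is a toy model of a memory zone in plain Python. It is a sketch under loud assumptions, not spaCy's implementation: `ToyStringStore` is a hypothetical class, and the default value of its `allow_transient` parameter is chosen for illustration only.

```python
from contextlib import contextmanager
from typing import Iterator, Set


class ToyStringStore:
    """Hypothetical string store that tracks entries added inside a zone."""

    def __init__(self) -> None:
        self._permanent: Set[str] = set()
        self._transient: Set[str] = set()
        self._in_zone = False

    def add(self, string: str, allow_transient: bool = True) -> str:
        # Inside a zone, entries are transient unless the caller opts out.
        if self._in_zone and allow_transient:
            self._transient.add(string)
        else:
            self._permanent.add(string)
        return string

    def __contains__(self, string: str) -> bool:
        return string in self._permanent or string in self._transient

    @contextmanager
    def memory_zone(self) -> Iterator["ToyStringStore"]:
        self._in_zone = True
        try:
            yield self
        finally:
            # Evict everything that was added transiently during the block.
            self._transient.clear()
            self._in_zone = False


store = ToyStringStore()
store.add("before")
with store.memory_zone():
    store.add("temporary")
    store.add("kept", allow_transient=False)
```

After the block ends, `"temporary"` is gone while `"before"` and `"kept"` remain, which is why objects that refer to transient entries should not be used outside the zone.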
Numpy v2 compatibility
Numpy 2.0 isn't binary-compatible with numpy v1, so we need to build against one or the other. This release isolates the dependency change and has no other changes, to make things easier if the dependency change causes problems.
This dependency change was previously attempted in version 3.7.6, but dependencies within the v3.7 family of models resulted in some conflicts, and some packages depending on numpy v1 were incompatible with v3.7.6. I've therefore removed the 3.7.6 release and replaced it with this one, which increments the minor version.
Model packages no longer list spacy as a requirement
I've also made a change to the way models are packaged to make it easier to release more quickly. Previously spaCy models specified a versioned requirement on spacy itself. This meant that there was no way to increment the spaCy version and have it work with the existing models, because the models would specify they were only compatible with spacy>=3.7.0,<3.8.0. We have a compatibility table that allows spacy to see which models are compatible, but the models themselves can't know which future versions of spaCy they work with.
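The problem with the versioned pin can be illustrated with a small range check. This is a simplified sketch: `parse` and `satisfies` are hypothetical helpers that only handle plain dotted release numbers, not the full version specifier rules that pip applies.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn a dotted release string such as "3.7.5" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, like the pin spacy>=3.7.0,<3.8.0."""
    return parse(lower) <= parse(version) < parse(upper)


# A model pinned to the v3.7 family accepts any 3.7.x release...
print(satisfies("3.7.5", "3.7.0", "3.8.0"))
# ...but rejects 3.8.0, even if nothing relevant to the model has changed.
print(satisfies("3.8.0", "3.7.0", "3.8.0"))
```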
I've therefore added a flag --require-parent/--no-require-parent to the spacy package CLI, which controls whether the parent package (e.g. spaCy) should be listed as a requirement of the model. --require-parent is the default for v3.8, but this will change to --no-require-parent by default in v4. I've set --no-require-parent for the v3.8 models, so that further changes can be published that don't impact the models, without retraining the models or forcing users to redownload them.
- Python
Published by github-actions[bot] over 1 year ago
spacy - Optional memory management for persistent services
Support a new context manager method Language.memory_zone(), to allow long-running services to avoid growing memory usage from cached entries in the Vocab or StringStore. Once the memory zone block ends, spaCy will evict Vocab and StringStore entries that were added during the block, freeing up memory. Doc objects created inside a memory zone block should not be accessed outside the block.
The current implementation disables population of the tokenizer cache inside the memory zone, resulting in some performance impact. The performance difference will likely be negligible if you're running a full pipeline, but if you're only running the tokenizer, it'll be much slower. If this is a problem, you can mitigate it by warming the cache first, by processing the first few batches of text without creating a memory zone. Support for memory zones in the tokenizer will be added in a future update.
The Language.memory_zone() context manager also checks for a memory_zone() method on pipeline components, so that components can perform similar memory management if necessary. None of the built-in components currently require this.
If your component needs to add non-transient entries to the StringStore or Vocab, you can pass the allow_transient=False flag to the Vocab.add() or StringStore.add() methods.
Example usage:
```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator

import spacy
import typer
from spacy.util import minibatch


def texts(path: Path) -> Iterator[str]:
    # Yield the "text" field of each JSON line in the input file.
    with path.open("r", encoding="utf8") as file:
        for line in file:
            yield json.loads(line)["text"]


def main(jsonl_path: Path) -> None:
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    batches = minibatch(texts(jsonl_path), 1000)
    for i, batch in enumerate(batches):
        print("Batch", i)
        # Entries added to the Vocab inside this block are evicted
        # when the block ends.
        with nlp.vocab.memory_zone():
            for doc in nlp.pipe(batch):
                for token in doc:
                    counts[token.text] += 1
    for word, count in counts.most_common(100):
        print(count, word)


if __name__ == "__main__":
    typer.run(main)
```
- Python
Published by github-actions[bot] over 1 year ago
spacy - v3.7.6: Depend on numpy 2.0
Numpy 2.0 isn't binary-compatible with numpy v1, so we need to build against one or the other. This release isolates the dependency change and has no other changes, to make things easier if the dependency change causes problems.
- Python
Published by github-actions[bot] almost 2 years ago
spacy - v3.7.6a: Test pypi release process
- Python
Published by github-actions[bot] almost 2 years ago
spacy - v3.7.5: Download sanitization, Typer compatibility, and a bugfix for linking gold entities
✨ New features and improvements
- Sanitize direct download for `spacy download` (#13313).
- Convert Cython properties to decorator syntax (#13390).
- Bump Weasel pin to allow v0.4.x (#13409).
- Improvements to the test suite (#13469, #13470).
- Bump Typer pin to allow v0.10.0 and above (#13471).
- Allow `typing-extensions<5.0.0` for Python < 3.8 (#13516).
🔴 Bug fixes
- #13400: Fix `use_gold_ents` behaviour for EntityLinker.
📖 Documentation and examples
- Make the file name for code listings stick to the top (#13379).
- Update the documentation of `MorphAnalysis` (#13433).
- Typo fixes in the documentation (#13466).
👥 Contributors
@danieldk, @honnibal, @ines, @JoeSchiff, @nokados, @Paillat-dev, @rmitsch, @schorfma, @strickvl, @svlandeg, @ynx0
- Python
Published by svlandeg almost 2 years ago
spacy - v3.7.4: New textcat layers and fo/nn language extensions
✨ New features and improvements
- Improve NumPy 2.0 compatibility (#13103).
- Added language extensions for Faroese and Norwegian Nynorsk (#13116).
- Add new `TextCatReduce.v1` layer for text classification (#13181).
- Add new `TextCatParametricAttention.v1` layer for text classification (#13201).
- Use `build` module for creating model packages by default (#13109).
- Add support for code loading to the `benchmark speed` command (#13247).
- Extend lexical attributes for English with more numericals (#13106).
- Warn about reloading dependencies after downloading models (#13081).
🔴 Bug fixes
- #13259, #13304, #13321: Correctness fixes for multiprocessing support in `Language.pipe`.
- #13187: Typing and documentation fixes for `Doc`.
- #13086: Update `Tokenizer.explain` for special cases with whitespace.
- #13068: Fix displaCy span stacking.
- #13149: Add spacy.TextCatBOW.v3 to use the fixed `SparseLinear` layer.
📖 Documentation and examples
- Many improvements and updates to the LLM documentation.
- Update `trf_data` examples and the transformer pipeline design section.
👥 Contributors
@adrianeboyd, @danieldk, @evornov, @honnibal, @ines, @lise-brinck, @ridge-kimani, @rmitsch, @shadeMe, @svlandeg
- Python
Published by danieldk over 2 years ago
spacy - v3.7.2: Fixes for APIs and requirements
✨ New features and improvements
- Update `__all__` fields (#13063).
🔴 Bug fixes
- #13035: Remove Pathy requirement.
- #13053: Restore `spacy.cli.project` API.
- #13057: Support `Any` comparisons for `Token` and `Span`.
📖 Documentation and examples
- Many updates for `spacy-llm` including Azure OpenAI, PaLM, and Mistral support.
- Various documentation corrections.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @rmitsch, @svlandeg
- Python
Published by adrianeboyd over 2 years ago
spacy - v3.7.1: Bug fix for spacy.cli module loading
🔴 Bug fixes
- Revert lazy loading of CLI module for `spacy.info` to fix availability of `spacy.cli` following `import spacy` (#13040).
👥 Contributors
@adrianeboyd, @honnibal, @ines, @svlandeg
- Python
Published by adrianeboyd over 2 years ago
spacy - v3.7.0: Trained pipelines using Curated Transformers and support for Python 3.12
This release drops support for Python 3.6 and adds support for Python 3.12.
✨ New features and improvements
- Add support for Python 3.12 (#12979).
- Use the new library Weasel for spaCy projects functionality (#12769).
  - All `spacy project` commands should run as before, just now they're using Weasel under the hood.
  - ⚠️ Remote storage is not yet supported for Python 3.12. Use Python 3.11 or earlier for remote storage.
- Extend to Thinc v8.2 (#12897).
- Extend `transformers` extra to `spacy-transformers` v1.3 (#13025).
- Support registered vectors (#12492).
- Add `--spans-key` option for CLI evaluation with `spacy benchmark accuracy` (#12981).
- Load the CLI module lazily for `spacy.info` (#12962).
- Add type stubs for `spacy.training.example` (#12801).
- Warn for unsupported pattern keys in dependency matcher (#12928).
- `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback in order to support `spacy-curated-transformers` (#12785).
- Always use `tqdm` with `disable=None` to disable output in non-interactive environments (#12979).
- Language updates:
  - Add left and right pointing angle brackets as punctuation to ancient Greek (#12829).
  - Update example sentences for Turkish (#12895).
- Package setup updates:
  - Update NumPy build constraints for NumPy 1.25+ (#12839). For Python 3.9+, it is no longer necessary to set build constraints while building binary wheels.
  - Refactor Cython profiling in order to disable profiling for Python 3.12 in the package setup, since Cython does not currently support profiling for Python 3.12 (#12979).
📦 Trained pipelines updates
The transformer-based trf pipelines have been updated to use our new Curated Transformers library through the Thinc model wrappers and pipeline component from spaCy Curated Transformers.
⚠️ Backwards incompatibilities
- Drop support for Python 3.6.
- Drop mypy checks for Python 3.7.
- Remove `ray` extra.
- `spacy project` has a few backwards incompatibilities due to the transition to the standalone library Weasel, which is not as tightly coupled to spaCy. Weasel produces warnings when it detects older spaCy-specific settings in your environment or project config.
  - Support for the `spacy_version` configuration key has been dropped.
  - Support for the `check_requirements` configuration key has been dropped due to the deprecation of `pkg_resources`.
  - The `SPACY_CONFIG_OVERRIDES` environment variable is no longer checked. You can set configuration overrides using `WEASEL_CONFIG_OVERRIDES`.
  - Support for `SPACY_PROJECT_USE_GIT_VERSION` environment variable has been dropped.
  - Error codes are now Weasel-specific and do not follow spaCy error codes.
📖 Documentation and examples
- New and updated documentation for large language models and spaCy Curated Transformers.
- Various documentation corrections and updates.
- New additions to the spaCy Universe:
  - Hobbit spaCy: NLP for Middle Earth
  - rolegal: a spaCy Package for Noisy Romanian Legal Document Processing
👥 Contributors
@adrianeboyd, @bdura, @connorbrinton, @danieldk, @davidberenstein1957, @denizcodeyaa, @eltociear, @evornov, @honnibal, @ines, @jmyerston, @koaning, @magdaaniol, @pdhall99, @ringohoffman, @rmitsch, @senisioi, @shadeMe, @svlandeg, @vinbo8, @wjbmattingly
- Python
Published by adrianeboyd over 2 years ago
spacy - v3.6.1: Support for Pydantic v2, find-function CLI and more
✨ New features and improvements
- Allow Pydantic v2 using transitional v1 support (#12888).
- Add `find-function` CLI for finding locations of registered functions (#12757).
- Add extra `spacy[cuda12x]` for `cupy-cuda12x` (#12890).
- Extend tests for `init config` and `train` CLI (#12173).
- Switch from `distutils` to `setuptools`/`sysconfig` (#12853).
🔴 Bug fixes
- #12817: Escape annotated HTML tags in displaCy span renderer.
- #12857: Display model's full base version string in incompatibility warning.
- #12882: Update `<br>` tags in displaCy.
📖 Documentation and examples
👥 Contributors
@adrianeboyd, @afriedman412, @arplusman, @bdura, @connorbrinton, @honnibal, @ines, @it176131, @pmbaumgartner, @rmitsch, @shadeMe, @svlandeg, @thomashacker, @victorialslocum, @x-tabdeveloping
- Python
Published by adrianeboyd almost 3 years ago
spacy - v3.6.0: New span finder component and pipelines for Slovenian
✨ New features and improvements
- NEW: `span_finder` pipeline component to identify overlapping, unlabeled spans (#12507).
- Language updates:
  - Add initial support for Malay (#12602).
  - Update Latin defaults to support noun chunks, update lexical/tokenizer defaults and add example sentences (#12538).
- Add option to return scores separately keyed by component name with `spacy evaluate --per-component`, `Language.evaluate(per_component=True)` and `Scorer.score(per_component=True)` (#12540).
- Support custom token/lexeme attribute for vectors (#12625).
- Support `spancat_singlelabel` in `spacy debug data` CLI (#12749).
- Typing updates for `PhraseMatcher` and `SpanGroup` (#12642, #12714).
🔴 Bug fixes
- #12569: Require that all `SpanGroup` spans come from the current doc.
📦 Trained pipelines updates
We have added new pipelines for Slovenian that use the trainable lemmatizer and floret vectors.
| Package | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- |
| sl_core_news_sm | 96.9 | 82.1 | 62.9 |
| sl_core_news_md | 97.6 | 84.3 | 73.5 |
| sl_core_news_lg | 97.7 | 84.3 | 79.0 |
| sl_core_news_trf | 99.0 | 91.7 | 90.0 |
- 🙏 Special thanks to @orglce for help with the new pipelines!
The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize "get" as a passive auxiliary.
The Danish pipeline da_core_news_trf has been updated to use vesteinn/DanskBERT with performance improvements across the board.
⚠️ Backwards incompatibilities
- `SpanGroup` spans are now required to be from the same doc. When initializing a `SpanGroup`, there is a new check to verify that all added spans refer to the current doc. Without this check, it was possible to run into string store or other errors.
📖 Documentation and examples
- Various documentation corrections and updates.
- New additions to spaCy Universe:
👥 Contributors
@adrianeboyd, @bdura, @danieldk, @davidberenstein1957, @diyclassics, @essenmitsosse, @honnibal, @ines, @isabelizimm, @jmyerston, @kadarakos, @KennethEnevoldsen, @khursani8, @ljvmiranda921, @rmitsch, @shadeMe, @svlandeg, @tomaarsen, @victorialslocum, @vin-ivar, @ZiadAmerr
- Python
Published by adrianeboyd almost 3 years ago
spacy - v3.5.4: Bug fixes for overrides with registered functions and sourced components with listeners
✨ New features and improvements
- Extend Typer support to v0.9 (#12631).
🔴 Bug fixes
- #12701: Fix issues with component names and listeners for sourced components.
- #12623: Support overrides for registered functions in configs.
👥 Contributors
@adrianeboyd, @bdura, @honnibal, @ines, @svlandeg
- Python
Published by adrianeboyd almost 3 years ago
spacy - v3.2.6: Bug fixes for Pydantic and pip
This bug fix release is primarily to address Pydantic incompatibility with typing_extensions>=4.6.0.
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
🔴 Bug fixes
- Add `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- Remove `#egg` from download URLs due to future deprecation in `pip`.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg
- Python
Published by adrianeboyd about 3 years ago
spacy - v3.3.3: Bug fixes for Pydantic and pip
This bug fix release is primarily to address Pydantic incompatibility with typing_extensions>=4.6.0.
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
🔴 Bug fixes
- Add `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- Remove `#egg` from download URLs due to future deprecation in `pip`.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg
- Python
Published by adrianeboyd about 3 years ago
spacy - v3.5.3: Speed improvements, bug fixes and more
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
- Improve speed for child operators (`>+`, `>-`, `>++`, `>--`) for the dependency matcher (#12528).
- Improve loading speed for tokenizers with a large number of exceptions (#12553).
- Support `doc.spans` for displaCy output in `spacy benchmark accuracy` / `spacy evaluate` (#12575).
- Add `MorphAnalysis.get(default=)` argument for user-provided default values similar to `dict` (#12545).
- Only perform vectors checks during initialization if there are sourced components (#12607).
🔴 Bug fixes
- #12567: Remove `#egg` from download URLs due to future deprecation in `pip`.
📖 Documentation and examples
- Various documentation corrections and updates.
- New additions to spaCy Universe:
👥 Contributors
@adrianeboyd, @andyjessen, @bdura, @davidberenstein1957, @diyclassics, @honnibal, @ines, @kadarakos, @KennethEnevoldsen, @ljvmiranda921, @moxley01, @royashcenazi, @svlandeg, @tanloong, @victorialslocum
- Python
Published by adrianeboyd about 3 years ago
spacy - v3.5.2: Pretraining improvements, bug fixes for spans and spancat and more
✨ New features and improvements
- Add support for floret vectors in `spacy pretrain` (#12435).
- Save final model as `model-last.bin` for `spacy pretrain` (#12459).
- Support `Span` input for `displacy.parse_deps` (#12477).
- Extend support to CuPy 12.0 for `cupy` install extras.
🔴 Bug fixes
- #12398: Fix entity linker failure on sentence-crossing entities.
- #12405: Fix sentence indexing bug in `Span.sents`.
- #12469: Fix scores attribute for `spancat_singlelabel`.
- #12484: Fix `Span.sents` when the final sentence is the last token in a `Doc`.
- #12486: Fix pickle for the ngram suggester.
- #12493: Include `Span.kb_id` and `Span.id` strings in `Doc` and `DocBin` serialization.
📖 Documentation and examples
- Various documentation corrections and updates.
- New addition to spaCy Universe:
👥 Contributors
@adrianeboyd, @BLKSerene, @honnibal, @ines, @kadarakos, @prajakta-1527, @rmitsch, @shadeMe, @sloev, @svlandeg, @thomashacker, @willfrey
- Python
Published by adrianeboyd about 3 years ago
spacy - v3.5.1: spancat for multi-class labeling, fixes for textcat+transformers and more
💥 We'd love to hear more about your experience with spaCy! Take our survey here.
✨ New features and improvements
- NEW: `spancat_singlelabel` pipeline component for multi-class and non-overlapping span classification. The `spancat_singlelabel` component predicts at most one label for each suggested span and adds a new setting `allow_overlap` to restrict the output to non-overlapping spans (#11365).
- Extend to mypy v1.0 (#12245).
- Use `transformer` + CNN for efficient GPU `textcat` with `spacy init config` (#11900).
- Support trainable lemmatizer in `spacy debug data` (#11419).
- Add new operators to dependency matcher for left/right immediate child/parent nodes (`>+`, `>-`, `<+`, `<-`) (#12334).
- Add `spacy.PlainTextCorpusReader.v1` for plain text input (#12122).
- Add `alignment_mode` and `span_id` to `Span.char_span()` (#12145, #12196).
- Use string formatting types in logging calls (#12215).
🔴 Bug fixes
- #12017: Improve speed for `top_k>1` in trainable lemmatizer.
- #12048: Make `test_cli_find_threshold()` test more robust.
- #12227: Fix return type of `registry.find()`.
- #12272: Fix speed regression for `Matcher` patterns with extension attributes.
- #12287: Add `grc` to languages with lexeme norms in `spacy-lookups-data`.
- #12320: Make generation of empty `KnowledgeBase` instances configurable.
- #12343: Fix error message for displacy `auto_select_port`.
- #12347: Fix length check for knowledge base in entity linker, add `InMemoryLookupKB.is_empty`.
- #12365: Fix types for `Lexeme.orth` and `Lexeme.lower`.
- #12366: Raise error for non-default vectors with `PretrainVectors`.
- #12368: Partially address pending deprecation of `pkg_resources`.
pkg_resources. - Various improvements and fixes for the test suite (#12148, #12157, #12210, #12303, #12372).
📖 Documentation and examples
- Many website updates to improve accessibility.
- Various documentation corrections and updates.
- New projects:
  - Span labeling datasets
  - Comparing embedding layers in spaCy from the technical report Multi hash embeddings in spaCy
👥 Contributors
@adrianeboyd, @andyjessen, @danieldk, @essenmitsosse, @honnibal, @ines, @itssimon, @kadarakos, @kwhumphreys, @ljvmiranda921, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @shadeMe, @svlandeg, @tanloong, @thomashacker, @victorialslocum
- Python
Published by adrianeboyd about 3 years ago
spacy - v3.5.0: New CLI commands, language updates, bug fixes and much more
✨ New features and improvements
- NEW: New `apply` CLI command to annotate new documents with a trained pipeline (#11376).
- NEW: New `benchmark` CLI command to benchmark pipelines. The new `benchmark speed` subcommand measures the speed of a pipeline, the `benchmark accuracy` subcommand is a new alias for `evaluate` (#11902).
- NEW: New `find-threshold` CLI command to identify an optimal threshold for classification models (#11280).
- NEW: New `FUZZY` `Matcher` operator for fuzzy matches based on Levenshtein edit distance. In addition, the `FUZZY` and `REGEX` operators are now supported in combination with `IN`/`NOT_IN` (#11359).
- Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
- Allow up to `typer` v0.7.x (#11720), `mypy` 0.990 (#11801) and `typing_extensions` v4.4.x (#12036).
- New `spacy.ConsoleLogger.v3` with expanded progress tracking (#11972).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` (#11696 and #11971) and `spacy.textcat_multilabel_scorer.v2` (#11820).
- Improved customizability of the knowledge base used for entity linking, with the default implementation being the new `InMemoryLookupKB` (#11268).
- Optional `before_update` callback that is invoked at the start of each training step (#11739).
- Improve performance of `SpanGroup` (#11380).
- Improve UX around `displacy.serve` when the default port is in use (#11948).
- Patch a security vulnerability in extracting tar files (#11746).
- Add equality definition for vectors (#11806).
- Allow interpolation of variables in directory names in projects (#11235).
- Update default component configs to use the latest `tok2vec` version (#11618).
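For readers unfamiliar with the metric behind the fuzzy matching operator listed above, Levenshtein edit distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The function below is a textbook dynamic-programming sketch with a hypothetical name, not spaCy's own implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between strings a and b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("definitely", "definately"))
```

A fuzzy match can then be described as accepting any token whose distance to the pattern string stays within some allowed number of edits.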
🔴 Bug fixes
- #11382: Fix lookup behavior for the French and Catalan lemmatizers.
- #11385: Ensure that downstream components can train properly on a frozen `tok2vec` or `transformer` layer.
- #11762: Support local file system remotes for projects.
- #11763: Raise an error when unsupported values are used for `textcat`.
- #11834: Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and `vectors`.
- #12009: Fix a few typing issues for `SpanGroup` and `Span` objects.
- #12098: Correctly handle missing annotations in the edit tree lemmatizer.
⚠️ Backwards incompatibilities and model updates
The following changes may require you to update code that is using the relevant functionality:
- An error is now raised when unsupported values are given as input to train a `textcat` or `textcat_multilabel` model - ensure that values are 0.0 or 1.0 as explained in the docs.
- As `KnowledgeBase` is now an abstract class, you should call the constructor of the new `InMemoryLookupKB` instead when you want to use spaCy's default KB implementation. If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to implement its abstract methods, or alternatively inherit from `InMemoryLookupKB` instead.
The following changes may influence the output of your language pipeline or trained models:
- Updates to language defaults:
  - Extended support for Slovenian (#11162).
  - Switch Russian and Ukrainian lemmatizers to `pymorphy3` (#11345, #11811).
  - Support for editorial punctuation in Ancient Greek (#11426).
  - Update to Russian tokenizer exceptions (#11753).
  - Small fix in the list of Dutch stop words (#11997).
- Updates to model defaults:
  - Use the latest `tok2vec` defaults in all components (#11618).
  - Improve the default attributes used for the `textcat` and `textcat_multilabel` components (#11698).
  - Update the default scorer for `textcat` and `textcat_multilabel` to fix a bug related to `threshold` for `textcat` and to make it possible to score multiple `textcat`/`textcat_multilabel` components in a single pipeline with custom scorers. If no custom scorers are used, the `cat_p/r/f` scores will now only reflect the final component's labels and performance (#11696, #11820).
  - Correct the `token_acc` score to report the intended measure (# correct tokens / # predicted tokens, the same as in spaCy v2). The `token_acc` scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The `token_p/r/f` scores should remain unchanged (#12073).
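As a worked example of the ratio quoted above (# correct tokens / # predicted tokens), the sketch below compares tokens by their character offsets. It is an illustration only: the function name is hypothetical, and treating a predicted token as correct when its exact span appears in the gold tokenization is an assumption, not a description of spaCy's scorer.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char) of one token


def token_accuracy(gold: List[Span], predicted: List[Span]) -> float:
    """Return # correct tokens / # predicted tokens."""
    if not predicted:
        return 0.0
    gold_spans = set(gold)
    correct = sum(1 for span in predicted if span in gold_spans)
    return correct / len(predicted)


# Text: "New York is big"
gold_tokens = [(0, 3), (4, 8), (9, 11), (12, 15)]
# The predicted tokenization wrongly merges "New York" into one token.
predicted_tokens = [(0, 8), (9, 11), (12, 15)]

score = token_accuracy(gold_tokens, predicted_tokens)
print(round(score, 3))
```

Here two of the three predicted tokens match a gold token, so the ratio is 2/3.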
The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:
- From v4 onwards, we'll rename the `master` branch to `main`.
📦 Trained pipelines updates
- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and `morphologizer` components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the exact alignment from `tokenizers` for fast tokenizers instead of the heuristic alignment from `spacy-alignments`. For all trained pipelines except `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the `spacy-transformers` changes in the v1.2.0 release notes.
📖 Documentation and examples
- We've ported our website from Gatsby to Next 🥳
- Updated the documentation on supported languages.
- Added a note about experimental M1 GPU support to the installation quickstart.
- Included documentation for the `biluo_to_iob` and `iob_to_biluo` functions.
- Fixed model links in the v3.4 usage documentation.
- Removed "new" tags of functionality from spaCy v2.x.
- Various small additions, spelling and typo fixes.
- spaCy Universe additions:
  - greCy: Providing Ancient Greek models
  - spacy-pythainlp: Add Thai support for spaCy
- New projects:
  - Accelerate NER with Speedster (experimental)
👥 Contributors
@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx
- Python
Published by adrianeboyd over 3 years ago
spacy - v2.3.9: Compatibility with NumPy v1.24+
This release addresses future compatibility with NumPy v1.24+.
🔴 Bug fixes
- #11940: Update for compatibility with NumPy v1.24+ integer conversions.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.0.9: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.1.7: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10573: Remove Click pin following Typer updates.
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.2.5: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10573: Remove Click pin following Typer updates.
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @polm, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.3.2: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #10911, #11194: Improve speed in `precomputable_biaffine` by avoiding concatenation.
- #11276, #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11934: Add strings when initializing from labels in `EditTreeLemmatizer`.
- #11935: Restore missing error messages for beam search.
👥 Contributors
@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.4.4: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11860: Fix `spancat` for docs with zero suggestions.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11934: Add strings when initializing from labels in `EditTreeLemmatizer`.
- #11935: Restore missing error messages for beam search.
👥 Contributors
@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.4.3: Extended Typer support and bug fixes
✨ New features and improvements
- Extend Typer support to v0.7.x (#11720).
🔴 Bug fixes
- #11640: Handle docs with no entities in `EntityLinker`.
- #11688: Restore custom doc extension values in `Doc.to_json()` for attributes set by getters.
- #11706: Remove incorrect warning for `pipeline_package.load()`.
- #11735: Improve `spacy project` requirements checks for unsupported specifiers and requirements lines.
- #11745: Revert modifications to `spacy.load(disable=)` that could enable currently disabled components.
👥 Contributors
@aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.4.2: Latin and Luganda support, Python 3.11 wheels and more
✨ New features and improvements
- NEW: Luganda language support (#10847).
- NEW: Latin language support (#11349).
- NEW: `spacy.ConsoleLogger.v2` optionally saves training logs to JSONL (#11214).
- NEW: New operators for the `DependencyMatcher` to include matching parents or children to the left or the right of the node (#10371).
- Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
- Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
- Support CuPy v11 and add extras for `cuda11x` and `cuda-autodetect` (using `cupy-wheel`) (#11279).
- Support custom attributes for tokens and spans in `Doc.to_json()` and `Doc.from_json()` (#11125).
- Make the `enable` and `disable` options for `spacy.load()` more consistent (#11459).
- Allow a single string argument for `disable`/`enable`/`exclude` for `spacy.load()` (#11406).
- New `--url` flag for `spacy info` to print the direct download URL for a pipeline (#11175).
- Add a check for missing requirements in the `spacy project` CLI (#11226).
- Add a Levenshtein distance function (#11418).
- Improvements to the `spacy debug data` CLI for spancat data (#11504).
- Allow overriding `spacy_version` in `spacy package` metadata (#11552).
- Improve the error message when using the wrong command for `spacy project assets` (#11458).
- Ensure parent directories are created when storing the results of the `spacy pretrain` command (#11210).
- Extend support to newer versions of `natto-py` for the `ko` extra (#11222).
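For readers unfamiliar with the metric behind the new Levenshtein distance function (#11418), the following is a minimal pure-Python sketch of the standard dynamic-programming algorithm. It is an illustration only, not spaCy's implementation, and the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]
```

Only two rows of the edit matrix are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.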
📦 Trained pipelines updates
This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).
Use `spacy download` to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. `spacy download -d en_core_web_sm-3.4.0`. You can check that you are using the new version (v3.4.1) with `spacy validate`:

    NAME             SPACY            VERSION
    en_core_web_md   >=3.4.0,<3.5.0   3.4.1    ✔

🔴 Bug fixes
- #11275: Fix Dutch noun chunks to skip overlapping spans.
- #11276: Fix regex invalid escape sequences.
- #11312: Better handling of unexpected types in `SetPredicate`.
- #11460: Fix config validation failures caused by NVTX pipeline wrappers.
- #11506: Avoid unwanted side effects in `Doc.__init__`.
- #11540: Preserve missing entity annotation in augmenters.
- #11592: Fix issues with DVC commands.
- #11631: Fix initialization for `pymorphy2_lookup` lemmatizer mode for Russian and Ukrainian.
⚠️ Backwards incompatibilities
- If you're using a custom component that does not return a `Doc` type, an error will now be raised (#11424).
- If you're using a dot in a factory name, an error is raised as this is not supported (#11336).
📖 Documentation and examples
- Added documentation for the new experimental coref component.
- Added Ukrainian trained pipelines to the website.
- Added documentation for the `spacy.models_and_pipes_with_nvtx_range.v1` callback.
- Fix English pipeline names in v3.4 release notes.
- Various fixes to the `Example` API documentation.
- Extensions and improvements to the `displacy` docs.
- Fix the example command for `spacy project dvc`.
- Update example code for `spacy-wordnet`.
- Improve API documentation around the `initialize()` function for pipeline components.
- Fix various typos and inconsistencies.
- spaCy universe additions:
- concepCy: A spaCy wrapper for ConceptNet.
- spaCy partial tagger: build a CRF tagger with a partially annotated dataset.
- Zshot: Zero and Few shot named entity & relationships recognition.
👥 Contributors
@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy
- Python
Published by adrianeboyd over 3 years ago
spacy - v2.3.8: Updates for Python 3.10 and 3.11
✨ New features and improvements
- Updates and binary wheels for Python 3.10 and 3.11.
👥 Contributors
@adrianeboyd, @honnibal, @ines
- Python
Published by adrianeboyd over 3 years ago
spacy - v3.4.1: Fix compatibility with CuPy v9.x
🔴 Bug fixes
- Fix issue #11137: Fix compatibility with CuPy v9.x.
📖 Documentation and examples
- spaCy universe additions:
- BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.
- English Interpretation Sentence Pattern: English interpretation for accurate translation from English to Japanese.
👥 Contributors
@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic
- Python
Published by adrianeboyd almost 4 years ago
spacy - v3.4.0: Updated types, speed improvements and pipelines for Croatian
✨ New features and improvements
- Support for mypy 0.950+ and pydantic v1.9 (#10786).
- Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
- Min/max `{n,m}` operator for `Matcher` patterns (#10981).
- Language updates:
- Improve tokenization for Cyrillic combining diacritics (#10837).
- Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
- Improved speed of vector lookups (#10992).
- For the parser, use C `saxpy`/`sgemm` provided by the `Ops` implementation in order to use Accelerate through `thinc-apple-ops` (#10773).
- Improved speed of `Example.get_aligned_parse` and `Example.get_aligned` (#10952).
- Improved speed of `StringStore` lookups (#10938).
- Updated `spacy project clone` to try both `main` and `master` branches by default (#10843).
- Added confidence threshold for named entity linker (#11016).
- Improved handling of Typer optional default values for `init_config_cli` (#10788).
- Added cycle detection in parser projectivization methods (#10877).
- Added counts for NER labels in `debug data` (#10960).
- Support for adding NVTX ranges to `TrainablePipe` components (#10965).
- Support env variable `SPACY_NUM_BUILD_JOBS` to specify the number of build jobs to run in parallel with `pip` (#11073).
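To illustrate what the cycle detection mentioned above (#10877) guards against, here is a small sketch that checks a list of head indices for cycles. The representation is an assumption made for this example (each token stores the index of its head, and a root points to itself), and `has_cycle` is a hypothetical helper, not the function used inside spaCy.

```python
from typing import List


def has_cycle(heads: List[int]) -> bool:
    """Return True if following head links from any token loops forever.

    heads[i] is the index of the head of token i; a root token has
    heads[i] == i. All indices are assumed to be within range.
    """
    for start in range(len(heads)):
        seen = {start}
        node = start
        while heads[node] != node:  # stop once a root is reached
            node = heads[node]
            if node in seen:  # visited twice on one walk: a cycle
                return True
            seen.add(node)
    return False
```

A well-formed dependency tree always reaches a root, so the walk terminates. Malformed input such as `[1, 2, 0]` never reaches one, and the check reports this rather than looping indefinitely.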
📦 Trained pipelines updates
We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.
| Package | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | ---: | ---------: | ----: |
| hr_core_news_sm | 96.6 | 77.5 | 76.1 |
| hr_core_news_md | 97.3 | 80.1 | 81.8 |
| hr_core_news_lg | 97.5 | 80.4 | 83.0 |
🙏 Special thanks to @gtoffoli for help with the new pipelines!
The English pipelines have new word vectors:
| Package | Model Version | TAG | Parser LAS | NER F |
| ----------------------------------------------- | ------------- | ---: | ---------: | ----: |
| en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
| en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
| en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
| en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |
All CNN pipelines have been extended to add whitespace augmentation.
🔴 Bug fixes
- Fix issue #10960: Support hyphens in NER labels.
- Fix issue #10994: Fix horizontal spacing for spans in displaCy.
- Fix issue #11013: Check for any token with a vector in `Doc.has_vector`, distinguish 0-vectors and missing vectors in `similarity` warnings.
- Fix issue #11056: Don't use `get_array_module` in `textcat`.
- Fix issue #11092: Fix vertical alignment for spans in displaCy.
🚀 Notes about upgrading from v3.3
- `Doc.has_vector` now matches `Token.has_vector` and `Span.has_vector`: it returns `True` if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
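The difference between the two checks can be sketched with plain Python data structures. The vector table below is a stand-in dictionary and both function names are hypothetical; the sketch only mirrors the behaviour described in the note, not spaCy's internals.

```python
from typing import Dict, List, Sequence


def vocab_has_vectors(vectors: Dict[str, List[float]]) -> bool:
    # Behaviour described for v3.3 and earlier: only look at the vector table.
    return len(vectors) > 0


def doc_has_vector(tokens: Sequence[str], vectors: Dict[str, List[float]]) -> bool:
    # Behaviour described for v3.4: at least one token must have a vector.
    return any(token in vectors for token in tokens)


vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
out_of_vocab_doc = ["xyzzy", "plugh"]
mixed_doc = ["my", "cat"]
```

With this toy table, `vocab_has_vectors(vectors)` is true for every document, while `doc_has_vector(out_of_vocab_doc, vectors)` is false because none of its tokens has an entry.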
📖 Documentation and examples
- spaCy universe additions:
- Aim-spacy: An Aim-based spaCy experiment tracker.
- Asent: Fast, flexible and transparent sentiment analysis.
- spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
- spacy-report: Generates interactive reports for spaCy models.
👥 Contributors
@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere
- Python
Published by adrianeboyd almost 4 years ago
spacy - v3.3.1: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more
✨ New features and improvements
- Add the SpanRuler component. This component saves a list of matched spans to `Doc.spans[spans_key]`.
- Support for JSON serialization and deserialization of `Doc` objects.
- Add span analysis to `debug data`.
- Allow data assets to be made optional in a spaCy project.
- Prebuilt macOS ARM64 wheels are now available for all spaCy dependencies distributed by @Explosion.
🔴 Bug fixes
- Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted `Doc` objects.
- Fix issue #10685: Fix serialization of `SpanGroup` objects that share the same name within one `SpanGroups` container.
- Fix issue #10718: Remove debug print statements in `walk_head_nodes` to avoid acquiring the GIL.
- Fix issue #10741: Make the `StringStore.__getitem__` return type dependent on its parameter type.
- Fix issue #10734: Support removal of overlapping terms in `PhraseMatcher`.
- Fix issue #10772: Override `SpanGroups.setdefault` to also support `Iterable[SpanGroup]` as the default.
- Fix issue #10817: Ensure that the term `ROOT` is in the glossary.
- Fix issue #10830: Better errors for `Doc.has_annotation` and `Matcher`.
- Fix issue #10864: Avoid pickling `Doc` inputs passed to `Language.pipe()`.
- Fix issue #10898: Fix schemas import in `Doc`.
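A return type that depends on the parameter type, as in the `StringStore.__getitem__` fix above (#10741), is usually expressed with `typing.overload`. The sketch below shows the pattern on a toy class. `ToyStringStore` and its CRC32-based hash are assumptions made for the example and do not reflect how spaCy hashes strings.

```python
import zlib
from typing import Dict, Union, overload


class ToyStringStore:
    """Toy two-way mapping: string -> integer hash, integer hash -> string."""

    def __init__(self) -> None:
        self._by_hash: Dict[int, str] = {}

    def add(self, text: str) -> int:
        key = zlib.crc32(text.encode("utf8"))  # stand-in hash function
        self._by_hash[key] = text
        return key

    @overload
    def __getitem__(self, key: str) -> int: ...

    @overload
    def __getitem__(self, key: int) -> str: ...

    def __getitem__(self, key: Union[str, int]) -> Union[int, str]:
        if isinstance(key, str):
            return zlib.crc32(key.encode("utf8"))
        return self._by_hash[key]


store = ToyStringStore()
apple_hash = store.add("apple")
```

With the two overloads in place, a type checker infers `int` for `store["apple"]` and `str` for `store[apple_hash]` instead of a union of both.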
⚠️ Backward incompatibilities
- Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the `name` attribute. For example, the following pipeline component:

      [components.transformer]
      factory = "transformer"
      name = "custom_transformer_name"

  would be registered erroneously as `custom_transformer_name`. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as `transformer`.
👥 Contributors
@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg
- Python
Published by danieldk almost 4 years ago
spacy - v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
✨ New features and improvements
- Improved speeds for many components, see speed benchmarks for trained pipelines:
- Speed up parser and NER by using constant-time head lookups (#10048).
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
- Speed up parser projectivization functions (#10241).
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
- Improve `Matcher` speed (#10659).
- Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
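A plausible reading of why unnormalized softmax probabilities (#10197) speed up inference without changing predictions: picking the best label only needs the position of the largest score, and softmax preserves the ordering of its inputs. The sketch below demonstrates this with plain Python. The helper names are illustrative and this is not the Thinc layer itself.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


raw_scores = [1.2, -0.3, 3.4, 0.0]
probabilities = softmax(raw_scores)
```

Here `argmax(raw_scores)` and `argmax(probabilities)` select the same label, so the exponentiation and division can be skipped when only the predicted label is needed.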
📦 Trained pipelines
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
| Package | Language | UPOS | Parser LAS | NER F |
| --------------------------------------------------------------- | -------- | ---: | ---------: | ----: |
| fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
| fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
| fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
| ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
| ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
| ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
| sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
| sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
| sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
| ----------------------------------------------- | -------------: | -------------: |
| da_core_news_md | 84.9 | 94.8 |
| de_core_news_md | 73.4 | 97.7 |
| el_core_news_md | 56.5 | 88.9 |
| fi_core_news_md | - | 86.2 |
| it_core_news_md | 86.6 | 97.2 |
| ko_core_news_md | - | 90.0 |
| lt_core_news_md | 71.1 | 84.8 |
| nb_core_news_md | 76.7 | 97.1 |
| nl_core_news_md | 81.5 | 94.0 |
| pl_core_news_md | 87.1 | 93.7 |
| pt_core_news_md | 76.7 | 96.9 |
| ro_core_news_md | 81.8 | 95.5 |
| sv_core_news_md | - | 95.5 |
🔴 Bug fixes
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
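As a rough illustration of what vector deduplication (#10551) can mean, the sketch below collapses identical rows and maps every key to the single surviving row. This is an assumption about the general idea, not a description of what `init vectors` does internally, and `deduplicate_vectors` is a hypothetical helper.

```python
from typing import Dict, List, Sequence, Tuple


def deduplicate_vectors(
    keys: Sequence[str], rows: Sequence[Sequence[float]]
) -> Tuple[List[Tuple[float, ...]], Dict[str, int]]:
    """Keep one copy of each distinct row and map every key to its row index."""
    unique_rows: List[Tuple[float, ...]] = []
    row_index: Dict[Tuple[float, ...], int] = {}
    key_to_row: Dict[str, int] = {}
    for key, row in zip(keys, rows):
        frozen = tuple(row)  # tuples are hashable, lists are not
        if frozen not in row_index:
            row_index[frozen] = len(unique_rows)
            unique_rows.append(frozen)
        key_to_row[key] = row_index[frozen]
    return unique_rows, key_to_row


unique_rows, key_to_row = deduplicate_vectors(
    ["color", "colour", "red"],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
```

Here three keys share two stored rows: `color` and `colour` both point at row 0, which shrinks the table without losing any lookups.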
🚀 Notes about upgrading from v3.2
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
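The effect of comparing all span attributes can be pictured as comparing tuples of (start, end, label, KB ID). The sketch below uses a hypothetical `ToySpan` with string labels to keep the output readable. In spaCy itself labels and KB IDs are stored as integer hashes, so ties would be broken by hash value rather than alphabetically.

```python
from typing import NamedTuple


class ToySpan(NamedTuple):
    """Stand-in for a span; NamedTuple compares field by field, in order."""

    start: int
    end: int
    label: str
    kb_id: str


spans = [
    ToySpan(0, 2, "ORG", "Q2"),
    ToySpan(0, 2, "GPE", "Q1"),
    ToySpan(0, 1, "ORG", "Q3"),
    ToySpan(0, 2, "ORG", "Q1"),
]

# Ties on (start, end) are now broken by label, then by KB ID.
ordered = sorted(spans)
```

Under a comparison that only looked at offsets, the three spans covering tokens 0 to 2 would compare as equal and keep their input order. With all attributes included, their relative order becomes deterministic.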
📖 Documentation and examples
- spaCy universe additions:
- classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
- Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
- Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
- EDS-NLP: spaCy components to extract information from clinical notes written in French.
- HuSpaCy: Industrial-strength Hungarian natural language processing.
- Klayers: spaCy as an AWS Lambda Layer.
- Named Entity Recognition (NER) using spaCy (video).
- Scrubadub: Remove personally identifiable information from text using spaCy.
- spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
- tmtoolkit: Text mining and topic modeling toolkit.
👥 Contributors
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
- Python
Published by adrianeboyd about 4 years ago
spacy - v3.1.6: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
👥 Contributors
@adrianeboyd, @honnibal, @ines
- Python
Published by adrianeboyd about 4 years ago
spacy - v3.2.4: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
👥 Contributors
@adrianeboyd, @honnibal, @ines
- Python
Published by adrianeboyd about 4 years ago
spacy - v3.2.3: Fix Tok2Vec for empty batches
🔴 Bug fixes
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
👥 Contributors
@adrianeboyd, @honnibal, @ines
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more
🔴 Bug fixes
- Fix issue #9593: Use metaclass to subclass errors for easier pickling.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9979: Fix type of `Lexeme.rank`.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
👥 Contributors
@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.0.8: Fix Tok2Vec for empty batches
🔴 Bug fixes
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
👥 Contributors
@adrianeboyd, @danieldk, @honnibal, @ines
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.2.2: Improved NER and parser speeds, bug fixes and more
✨ New features and improvements
- Improved `parser` and `ner` speeds on long documents (see technical details in #10019).
- Support for `spancat` components in `debug data`.
- Support for `ENT_IOB` as a `Matcher` token pattern key.
- Extended and improved types for many classes.
🔴 Bug fixes
- Fix issue #9735: Make floret murmurhash endian-neutral.
- Fix issue #9738: Support string IOB values for `ENT_IOB`.
- Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
- Fix issue #9960: Warn about entities that cross sentence boundaries in `debug data`.
- Fix issue #9979: Fix type for `Lexeme.rank`.
- Fix issue #10026: Check for 0-size assets in `spacy project`.
- Fix issue #10051: Consistently return scalars from similarity methods.
- Fix issue #10052: Fix spaces in `Doc.from_docs()` for empty docs.
- Fix issue #10079: Fix label detection in `debug data` for components with custom names.
- Fix issue #10109: Add types to `Underscore` and `DependencyMatcher` and improve types in `Language`, `Matcher` and `PhraseMatcher`.
- Fix issue #10130: Fix `Tokenizer.explain` when infixes appear as prefixes.
- Fix issue #10143: Use simple suggester in `spancat` initialization.
- Fix issue #10164: Support `IS_SENT_END` in `Doc.has_annotation`.
- Fix issue #10192: Detect invalid package names in `spacy package`.
- Fix issue #10223: Support mixed case in package names.
- Fix issue #10234: Fix type in `PhraseMatcher`.
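The runtime error addressed in #9746 is a general Python pitfall: a dictionary must not grow or shrink while it is being iterated. The sketch below reproduces the failure and shows the common remedy of iterating over a snapshot. Both function names are made up for the example and the code is unrelated to the specific places fixed in spaCy.

```python
from typing import Dict, List


def prune_unsafe(cache: Dict[str, List[int]]) -> None:
    # Deleting while iterating over the live dictionary raises
    # "RuntimeError: dictionary changed size during iteration".
    for key in cache:
        if not cache[key]:
            del cache[key]


def prune_empty(cache: Dict[str, List[int]]) -> Dict[str, List[int]]:
    # list(...) takes a snapshot first, so deletions cannot disturb the loop.
    for key, value in list(cache.items()):
        if not value:
            del cache[key]
    return cache
```

Taking the snapshot costs one extra list of references, which is usually negligible compared with the cost of an intermittent crash.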
📖 Documentation and examples
- Various documentation updates.
- New spaCy version tags in spaCy universe.
- New `Dockerfile` for repeatable website builds and easier local development.
- New additions to spaCy universe:
- Augmenty: a text augmentation library
- Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
- spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
- spacypdfreader: easy PDF to text to spaCy text extraction
- textnets: text analysis with networks
👥 Contributors
@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce size of output docs.
- NEW: `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans spanning over more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
- `eng-spacysentiment`: Sentiment analysis for English.
- Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.2.0: Registered scoring functions, Doc input, floret vectors and more
✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for `French` and `zh-Hans` for `Chinese`.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
- New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
- Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
- Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Universal Dependencies corpora updated to v2.8.
- Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
For more details, see the New in v3.2 usage guide.
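To give a feel for how subwords plus hashed (Bloom-style) embeddings yield a vector for every word, here is a toy sketch. It assumes character n-grams with boundary markers, a small shared table, and several hash seeds per n-gram. The BLAKE2 hash, the table contents and all function names are choices made for this example and differ from the real floret implementation.

```python
import hashlib
from typing import List


def char_ngrams(word: str, minn: int = 3, maxn: int = 4) -> List[str]:
    """Return the marked word itself plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        grams.extend(marked[i : i + n] for i in range(len(marked) - n + 1))
    return grams


def bucket(text: str, seed: int, n_rows: int) -> int:
    """Hash text with a seed into a row index (stand-in hash function)."""
    digest = hashlib.blake2b(
        text.encode("utf8"), digest_size=8, salt=seed.to_bytes(2, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n_rows


def embed(word: str, table: List[List[float]], n_hashes: int = 2) -> List[float]:
    """Average the table rows selected by every (n-gram, seed) pair."""
    total = [0.0] * len(table[0])
    count = 0
    for gram in char_ngrams(word):
        for seed in range(n_hashes):
            row = table[bucket(gram, seed, len(table))]
            total = [t + r for t, r in zip(total, row)]
            count += 1
    return [value / count for value in total]


# A tiny deterministic table: 50 rows of 4 dimensions.
table = [[((r * 31 + c) % 7) / 7 for c in range(4)] for r in range(50)]
```

Because every string hashes to some row, `embed` returns a vector even for words never seen before, and the table size is fixed regardless of vocabulary size.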
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
📖 Documentation and examples
- Demo projects for floret vectors:
- `pipelines/floret_vectors_demo`: basic floret vector training and importing.
- `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
- `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
- Python
Published by adrianeboyd over 4 years ago
spacy - v3.1.4: Python 3.10 wheels and support for AppleOps
✨ New features and improvements
- NEW: Binary wheels for Python 3.10.
- NEW: Improve performance on Apple M1 with `AppleOps`: `pip install spacy[apple]`.
- GPU profiling with `spacy.models_with_nvtx_range.v1`.
- Full `mypy` integration in the CI and many type fixes across the code base.
- Added custom `Protocol` classes in `ty.py` to define behavior of pipeline components.
- Support for entity linking visualization in `displacy`.
- Allow overriding vars in `spacy project assets`.
- Standalone `train` function to run the training from Python scripts just like the `spacy train` CLI.
- Support for `spacy-transformers>=1.1.0` with improved IO.
- Support for `thinc>=8.0.11` with improved gradient clipping.
🔴 Bug fixes
- Fix issue #5507: Improve UX for multiprocessing on GPU.
- Fix issue #9137: Fix serialization for `KnowledgeBase.set_entities`.
- Fix issue #9244: Fix vectors for 0-length spans.
- Fix issue #9247: Improve UX for the `DocBin` constructor.
- Fix issue #9254: Allow unicode in a `spacy project` title.
- Fix issue #9263: Make added patterns consistent in the `DependencyMatcher`.
- Fix issue #9305: Restore tokenization timing during evaluation.
- Fix issue #9335: Sync vocab in vectors and sourced components.
- Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
- Fix issue #9404: Create consistent default `textcat` and `textcat_multilabel` configurations.
- Fix issue #9437: Improve UX around `Doc` object creation.
- Fix issue #9465: Fix minor issues with `convert` CLI.
- Fix issue #9500: Include `.pyi` files in the distributed package.
📖 Documentation and examples
- Various updates to the documentation.
- New additions to the spaCy universe:
- `deplacy`: CUI-based dependency visualizer
- `ipymarkup`: Visualizations for NER and syntax trees
- `PhruzzMatcher`: Find fuzzy matches
- `spacy-huggingface-hub`: Push spaCy pipelines to the Hugging Face Hub
- `spaCyOpenTapioca`: Entity Linking on Wikidata
- `spacy-clausie`: Clause-based information extraction system
- "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
- "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly
👥 Contributors
@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker
- Python
Published by svlandeg over 4 years ago
spacy - v3.1.3: Bug fixes and UX updates
✨ New features and improvements
- The `v3` of `WandbLogger` now supports optional `run_name` and `entity` parameters.
- Improved UX when providing invalid `pos` values for a `Doc` or `Token`.
🔴 Bug fixes
- Fix issue #9001: Pass alignments to `Matcher` callbacks.
- Fix issue #9009: Include component factories in third-party dependencies resolver.
- Fix issue #9012: Correct type of `config` in `create_pipe`.
- Fix issue #9014: Allow `typer` 0.4 to provide support for both Click 7 and Click 8.
- Fix issue #9033: Fix verbs list for French tokenizer exceptions.
- Fix issue #9059: Pass overrides to subcommands in `spacy project` workflows.
- Fix issue #9074: Improve UX around `repo` and `path` arguments in `spacy project`.
- Fix issue #9084: Fix inference of `epoch_resume` in `spacy pretrain`.
- Fix issue #9163: Handle `spacy-legacy` in `spacy package` dependency detection.
- Fix issue #9211: Include only runtime-relevant dependencies in `spacy package`.
📖 Documentation and examples
- Various updates to the documentation.
- A few additions and updates to the spaCy universe.
- Extended the developer documentation with information about the listener pattern, the `StringStore` and the `Vocab`.
👥 Contributors
@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
- Python
Published by svlandeg over 4 years ago
spacy - v3.1.2: Improved spancat component and various bugfixes
✨ New features and improvements
- NEW: Provide scores for the `SpanCategorizer` predictions.
- NEW: Broader compatibility with type checkers thanks to `.pyi` stub files.
- NEW: Auto-detect package dependencies in `spacy package`.
- New `INTERSECTS` operator for the Matcher.
- More debugging info for `spacy project` `push` and `pull` commands.
- Allow passing in a precomputed array for speeding up multiple `Span.as_doc` calls.
- The default `da` transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
🔴 Bug fixes
- Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
- Fix issue #8774: Ensure `debug data` runs correctly with a custom tokenizer.
- Fix issue #8784: Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
- Fix issue #8796: Respect the `no_skip` value for `spacy project run`.
- Fix issue #8810: Make `ConsoleLogger` flush after each logging line.
- Fix issue #8819: Pass `exclude` when serializing the vocab.
- Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
- Fix issue #8970: Fix `allow_overlap` default for span categorizer scoring.
- Fix issue #8982: Add glossary entry for `_SP`.
- Fix issue #9007: Fix span categorizer training on nested entities.
📖 Documentation and examples
- New developer documentation covering spaCy's internals and code conventions.
- Added a documentation section on preparing training data in spaCy's binary format.
- Updated some error/log messages to be more informative.
- Various updates to the documentation.
- A few new additions to the spaCy universe.
👥 Contributors
@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker
- Python
Published by svlandeg almost 5 years ago
spacy - v3.0.7: Bug fixes and base support for Azerbaijani
✨ New features and improvements
- Alpha tokenization support for Azerbaijani.
- Updates for French stop words.
🔴 Bug fixes
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7907: Update `load_lookups` return type and docstring.
- Fix issue #7930: Make `EntityLinker` robust for `nO=None`.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #7992: Fix span offsets for `Matcher(as_spans)` on spans.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns `None` for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8244: Use context manager when reading model file.
- Fix issue #8245: Fix other open calls without context managers.
- Fix issue #8265: Address mypy errors.
- Fix issue #8299: Restrict `pymorphy2` requirement to `pymorphy2` mode in Russian and Ukrainian lemmatizers.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8396: Make `JsonlReader` path optional.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
- Fix issue #8551: Fix duplicate spacy package CLI opts.
👥 Contributors
@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD
- Python
Published by adrianeboyd almost 5 years ago
spacy - v3.1.1: Support for Ancient Greek and various bug fixes
✨ New features and improvements
- Alpha tokenization support for Ancient Greek.
- Implementation of a `noun_chunk` iterator for Dutch.
- Support for `black` & `flake8` as pre-commit hooks.
- New `spacy.ngram_range_suggester.v1` for suggesting a range of n-gram sizes for the `spancat` component.
🔴 Bug fixes
- Fix issue #8638: Fix Azerbaijani initialization.
- Fix issue #8639: Use 0-vector for OOV lexemes.
- Fix issue #8640: Update lexeme ranks for loaded vectors.
- Fix issue #8651: Fix `ru` and `uk` multiprocessing (with `spawn`).
- Fix issue #8663: Preserve existing `meta` information with `spacy package`.
- Fix issue #8718: Ensure that `replace_pipe` takes disabled components into account.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe
- Python
Published by svlandeg almost 5 years ago
spacy - v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more
✨ New features and improvements
- NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
- NEW: Experimental `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- NEW: Use predicted annotations during training via the `[training.annotating_components]` config setting.
- Alpha tokenization support for Azerbaijani.
- Part-of-speech tag-based lemmatizers for Catalan and Italian.
- The TextCatCNN and TextCatBOW architectures are now resizable.
- Support updating the `EntityRecognizer` with known incorrect span annotations.
- Auto-generate a pretty `README.md` based on the meta in `spacy package`.
For more details, see the New in v3.1 usage guide.
📦 New trained pipelines
| Package | Language | UPOS | Parser LAS | NER F |
| ----------------------------------------------------------------- | -------- | ---: | ---------: | -----: |
| ca_core_news_sm | Catalan | 98.2 | 87.4 | 79.8 |
| ca_core_news_md | Catalan | 98.3 | 88.2 | 84.0 |
| ca_core_news_lg | Catalan | 98.5 | 88.4 | 84.2 |
| ca_core_news_trf | Catalan | 98.9 | 93.0 | 91.2 |
| da_core_news_trf | Danish | 98.0 | 85.0 | 82.9 |
⚠️ Upgrading from v3.0
- Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- Use `spacy init fill-config` to update a v3.0 config for v3.1.
- When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in `[initialize.vectors]`.
- Logger warnings have been converted to Python warnings. Use `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.
For more information, see Notes on upgrading from v3.0.
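
Because logger warnings are now regular Python warnings, the standard `warnings` module can filter them by message. The sketch below uses only the standard library; the function `emit_warnings` and the codes `[W000]` and `[W001]` are hypothetical placeholders, not real spaCy warning codes.

```python
import warnings


def emit_warnings() -> None:
    # Hypothetical stand-in for library code that emits two Python warnings.
    warnings.warn("[W000] An expected, noisy warning", UserWarning)
    warnings.warn("[W001] A warning worth seeing", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    # Ignore only warnings whose message starts with "[W000]" (regex match).
    warnings.filterwarnings("ignore", message=r"\[W000\]")
    emit_warnings()

messages = [str(w.message) for w in caught]
print(messages)
```

Only the second warning is recorded, since the `ignore` filter is added last and therefore takes precedence for messages it matches.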
🔴 Bug fixes
- Fix issue #7036: Use a context manager when reading model.
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7799: Ensure `spacy ray` command works.
- Fix issue #7807: Show warning if entity ruler runs without patterns.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7930: Make `EntityLinker` robust for nO=None.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8099: Update Vietnamese tokenizer.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns None for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8265: Address mypy errors.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8388: Don't clobber vectors when loading components from source models.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8441: Add correct types for `Language.pipe` return values.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8559: Fix vectors check for sourced components.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
👥 Contributors
@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD
- Python
Published by adrianeboyd almost 5 years ago
spacy - v2.3.7: Bug fix for download CLI
🔴 Bug fixes
- Fix issue #8286: Fix `spacy download`.
- Python
Published by adrianeboyd almost 5 years ago
spacy - v2.3.6: Bug fixes and base support for Amharic
✨ New features and improvements
- Add base support for Amharic.
- Add noun chunk iterator for Danish.
- Updates to French, Portuguese and Romanian stop words.
🔴 Bug fixes
- Fix issue #6705: Fix deserialization of null `token_match` and `url_match` for the tokenizer.
- Fix issue #6712: Prevent overlapping noun chunks for Spanish.
- Fix issue #6745: Fix minibatch iterator when size iterator is finished.
- Fix issue #6759: Skip 0-length matches in the `Matcher`.
- Fix issue #6771: Support `IS_SENT_START` in the `PhraseMatcher`.
- Fix issue #6772: Fix `Span.text` for empty spans.
- Fix issue #6820: Improve `Doc.char_span` `alignment_mode` handling.
- Fix issue #6857: Remove `--no-cache-dir` when downloading models.
- Fix issue #8115: Fix offsets in `Span.get_lca_matrix`.
👥 Contributors
Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.
- Python
Published by adrianeboyd about 5 years ago
spacy - v3.0.6: assemble CLI, Matcher alignments, training from streamed corpora and many bug fixes
✨ New features and improvements
- New `assemble` CLI command for assembling a pipeline from a config without training.
- Add support for match alignments in the `Matcher` to align matched tokens with matcher patterns.
- Add support for training from streamed corpora.
- Add support for W&B data and model checkpoint logging and versioning in `spacy.WandbLogger.v2`.
- Extend `Scorer.score_spans` to support overlapping and unlabeled spans.
- Update `debug data` for new v3 components.
- Improve language data for Italian.
- Various improvements to error handling and UX.
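
The match alignments feature can be illustrated with a short sketch. It assumes a blank English pipeline and a hypothetical pattern named `A_PLUS_B`; with `with_alignments=True`, each match tuple is expected to carry one pattern-token index per matched document token.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["a", "a", "b"])

matcher = Matcher(nlp.vocab)
# Pattern token 0 matches one or more "a"; pattern token 1 matches "b".
matcher.add("A_PLUS_B", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]])

# Each result is (match_id, start, end, alignments), where alignments
# lists the pattern-token index for every token in doc[start:end].
matches = matcher(doc, with_alignments=True)
for match_id, start, end, alignments in matches:
    print(doc[start:end].text, alignments)
```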
🔴 Bug fixes
- Fix issue #7408: Add `vocab` kwarg to `spacy.load`.
- Fix issue #7419: Exclude user hooks in displacy conversion.
- Fix issue #7421: Update `--code` usage in CLI commands.
- Fix issue #7424: Preserve sent starts on retokenization without parse.
- Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
- Fix issue #7471: Improve warnings related to listening components.
- Fix issue #7488: Fix `upstream` check in pretraining.
- Fix issue #7489: Support `callbacks` entry points.
- Fix issue #7497: Merge `doc.spans` in `Doc.from_docs()`.
- Fix issue #7528: Preserve user data for `DependencyMatcher` on spans.
- Fix issue #7557: Fix `__add__` method for `PRFScore`.
- Fix issue #7574: Fix conversion of custom extension data in `Span.as_doc` and `Doc.from_docs`.
- Fix issue #7620: Fix `replace_listeners` in configs.
- Fix issue #7626: Fix vectors data on GPU.
- Fix issue #7630: Update NEL for entities crossing sentence boundaries.
- Fix issue #7631: Fix parser sourcing in NER converter.
- Fix issue #7642: Fix handling of hyphen string value in config files.
- Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
- Fix issue #7674: Fix handling of unknown tokens in `StaticVectors`.
- Fix issue #7690: Fix pickling of `Lemmatizer`.
- Fix issue #7749: Update `Tokenizer.explain` for special cases in v3.
- Fix issue #7755: Fix config parsing of ints/strings.
- Fix issue #7836: Fix tokenizer cache flushing.
- Fix issue #7847: Fix handling of boolean values in `Example.from_dict` for sent starts.
📖 Documentation and examples
- Add documentation for legacy functions and architectures.
- Add documentation for pretrained pipeline design.
- Add more details about `pipe` and multiprocessing.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!
- Python
Published by adrianeboyd about 5 years ago
spacy - v3.0.5: Bug fix for thinc requirement
🔴 Bug fixes
- Fix related to issue #7075: Update `thinc` requirement for Jupyter notebook GPU warning.
- Python
Published by adrianeboyd about 5 years ago
spacy - v3.0.4: Fix tok2vec pretraining, source disabled components, better UX and bug fixes
✨ New features and improvements
- Allow sourcing disabled components in config.
- Support `Doc.spans` in `Example.from_dict`.
- Improve transformer recommendations in quickstart widget and `init config`.
- Improve language data for Bulgarian.
- Various improvements to error handling and UX.
🔴 Bug fixes
- Fix issue #6952, #7285, #7289: Make `tok2vec` pretraining and `pretrain` command work as expected again.
- Fix issue #7062: Only evaluate named entities for NEL if there is a corresponding gold span.
- Fix issue #7065: Correctly handle sentence boundaries in `Span.sent`.
- Fix issue #7071: Fix `conll` converter option.
- Fix issue #7100: Re-add `n_sents` to entity linker and fix config handling and I/O.
- Fix issue #7122: Fix displaCy output in `evaluate` CLI.
- Fix issue #7127: Fix initialization of `UkrainianLemmatizer`.
- Fix issue #7176: Re-refactor `Sentencizer` to use `Pipe` API.
- Fix issue #7182: Allow `SpanGroup` import from `spacy.tokens`.
- Fix issue #7204: Adjust Cython compilation for setups with custom include paths.
- Fix issue #7222: Correct YAML formatting in quickstart recommendations for `bg` and `bn`.
- Fix issue #7225: Fix `spans` weakref in `Doc.copy`.
- Fix issue #7237: Fix `is_cython_func` for additional imported code.
- Fix issue #7250: Fix patience for identical scores.
- Fix issue #7329: Make `spacy.orth_variants.v1` and `spacy.lower_case.v1` augmenters work as expected.
- Fix issue #7352: Sort `EntityRuler.labels` alphabetically.
📖 Documentation and examples
- Add documentation for `textcat_multilabel` component.
- Extend documentation for `Vocab.get_noun_chunks`.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @MartinoMensio, @SergeyShk, @R1j1t, @palandlom, @dardoria, @Tocic, @clippered, @graue70, @koaning and @jankrepl for the pull requests and contributions!
- Python
Published by ines about 5 years ago
spacy - v3.0.3: Bug fixes for sentence segmentation and config filling
🔴 Bug fixes
- Fix issue #7035, #7056: Fix parser transition bug that could lead to incorrect sentence fragments.
- Fix issue #7055: Preserve sourced components in `init fill-config`.
📖 Documentation and examples
- Update spaCy Universe.
👥 Contributors
Thanks @MartinoMensio for the pull request!
- Python
Published by ines over 5 years ago
spacy - v3.0.2: CLI overrides and env variables in projects, base support for Setswana, PhraseMatcher for spans and bug fixes
✨ New features and improvements
- NEW: Base support for Setswana.
- The `PhraseMatcher` can now also be run on `Span` objects.
- Support CLI overrides and environment variables in `project.yml`: a section `env` defines environment variable names that can be used in commands. The `project run` command now also supports CLI overrides, e.g. `--vars.batch_size 128`.
- Reduce memory load when reading all vectors from file during initialization.
- Update recommended transformers in training quickstart and `init config` CLI.
🔴 Bug fixes
- Fix issue #6826: Ensure the loss value is cast to a float.
- Fix issue #6891: Include `noun_chunks` when pickling `Vocab`.
- Fix issue #6908: Fix expected type for textcat labels.
- Fix issue #6924: Correctly pass `vocab` forward in `spacy.blank`.
- Fix issue #6950: Allow pickling Tok2Vec with listeners.
- Fix issue #6983: Ensure `is_same_func` works correctly for classes in component decorator.
- Fix issue #7019: Correctly handle non-float/int values in `spacy evaluate` printer.
- Fix issue #7029: Fix listener architecture with empty `Doc` in batch.
📖 Documentation and examples
- Improve installation instructions.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @peter-exos, @KoichiYasuoka, @tarskiandhutch, @reneoctavio, @melonwater211, @mapmeld and @Shumie82 for the pull requests and contributions.
- Python
Published by ines over 5 years ago
spacy - v3.0.1: Bug fixes for transformer training
🔴 Bug fixes
- Fix issue #6883: Fix bug in transformer training for `Cannot get dimension 'nO' for model 'transformer': value unset`.
- Python
Published by adrianeboyd over 5 years ago
spacy - v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →
🚀 Quickstart
For the smoothest updating process, we recommend starting with a fresh virtual environment.
```bash
pip install -U spacy
```
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 18+ languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
- Retrained pipelines for all supported languages, plus new core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for the contributions!
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: `SentenceRecognizer`, `Morphologizer`, `Lemmatizer`, `AttributeRuler` and `Transformer`.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- Pre-built and more efficient binary wheels for all trained pipeline packages.
- `DependencyMatcher` for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in `Matcher`.
- New data structure `SpanGroup` for efficiently storing collections of potentially overlapping spans via the `Doc.spans`.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
📺 Video introductions & tutorials
- spaCy v3: State-of-the-art NLP from Prototype to Production
- spaCy v3: Design concepts explained (behind the scenes)
- spaCy v3: Custom trainable relation extraction component
📦 Trained pipelines (58)
To download a trained pipeline, you can use the spacy download command. See the training documentation for details on how to train your own pipelines on your data.
| Name | Language | POS | TAG | LAS | UAS | NER | Sent | Size | |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :---: |
| da_core_news_lg v3.0.0 | Danish | 0.97 | 0.97 | 0.78 | 0.82 | 0.82 | 0.88 | 547 MB | 📖 |
| da_core_news_md v3.0.0 | Danish | 0.96 | 0.96 | 0.78 | 0.82 | 0.81 | 0.86 | 47 MB | 📖 |
| da_core_news_sm v3.0.0 | Danish | 0.95 | 0.95 | 0.76 | 0.81 | 0.72 | 0.86 | 17 MB | 📖 |
| de_core_news_lg v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.85 | 0.95 | 546 MB | 📖 |
| de_core_news_md v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.84 | 0.95 | 47 MB | 📖 |
| de_core_news_sm v3.0.0 | German | 0.98 | 0.97 | 0.90 | 0.92 | 0.82 | 0.94 | 18 MB | 📖 |
| de_dep_news_trf v3.0.0 | German | 0.99 | 0.99 | 0.95 | 0.96 | n/a | 0.98 | 393 MB | 📖 |
| el_core_news_lg v3.0.0 | Greek | 0.97 | 0.94 | 0.85 | 0.88 | 0.80 | 1.00 | 544 MB | 📖 |
| el_core_news_md v3.0.0 | Greek | 0.96 | 0.93 | 0.84 | 0.87 | 0.79 | 1.00 | 42 MB | 📖 |
| el_core_news_sm v3.0.0 | Greek | 0.94 | 0.91 | 0.81 | 0.85 | 0.72 | 1.00 | 12 MB | 📖 |
| en_core_web_lg v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.86 | 0.89 | 742 MB | 📖 |
| en_core_web_md v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.85 | 0.89 | 44 MB | 📖 |
| en_core_web_sm v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.84 | 0.89 | 13 MB | 📖 |
| en_core_web_trf v3.0.0 | English | n/a | 0.98 | 0.94 | 0.95 | 0.90 | 0.89 | 438 MB | 📖 |
| es_core_news_lg v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 547 MB | 📖 |
| es_core_news_md v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 46 MB | 📖 |
| es_core_news_sm v3.0.0 | Spanish | 0.98 | 0.97 | 0.87 | 0.90 | 0.89 | 1.00 | 17 MB | 📖 |
| es_dep_news_trf v3.0.0 | Spanish | 0.99 | 0.98 | 0.93 | 0.95 | n/a | 0.97 | 395 MB | 📖 |
| fr_core_news_lg v3.0.0 | French | 0.98 | 0.95 | 0.86 | 0.90 | 0.82 | 0.88 | 546 MB | 📖 |
| fr_core_news_md v3.0.0 | French | 0.97 | 0.94 | 0.85 | 0.89 | 0.81 | 0.87 | 45 MB | 📖 |
| fr_core_news_sm v3.0.0 | French | 0.96 | 0.93 | 0.84 | 0.88 | 0.79 | 0.85 | 16 MB | 📖 |
| fr_dep_news_trf v3.0.0 | French | 0.99 | 0.96 | 0.92 | 0.94 | n/a | 0.94 | 381 MB | 📖 |
| it_core_news_lg v3.0.0 | Italian | 0.98 | 0.97 | 0.88 | 0.91 | 0.89 | 0.97 | 545 MB | 📖 |
| it_core_news_md v3.0.0 | Italian | 0.97 | 0.97 | 0.88 | 0.91 | 0.87 | 0.97 | 44 MB | 📖 |
| it_core_news_sm v3.0.0 | Italian | 0.97 | 0.97 | 0.86 | 0.90 | 0.85 | 0.97 | 16 MB | 📖 |
| ja_core_news_lg v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.72 | 0.98 | 531 MB | 📖 |
| ja_core_news_md v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.72 | 0.99 | 41 MB | 📖 |
| ja_core_news_sm v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.64 | 0.99 | 12 MB | 📖 |
| lt_core_news_lg v3.0.0 | Lithuanian | 0.96 | 0.89 | 0.68 | 0.75 | 0.80 | 0.82 | 545 MB | 📖 |
| lt_core_news_md v3.0.0 | Lithuanian | 0.95 | 0.86 | 0.67 | 0.74 | 0.79 | 0.83 | 44 MB | 📖 |
| lt_core_news_sm v3.0.0 | Lithuanian | 0.91 | 0.82 | 0.59 | 0.68 | 0.74 | 0.79 | 15 MB | 📖 |
| mk_core_news_lg v3.0.0 | Macedonian | 0.93 | n/a | 0.51 | 0.68 | 0.76 | 0.73 | 312 MB | 📖 |
| mk_core_news_md v3.0.0 | Macedonian | 0.93 | n/a | 0.51 | 0.67 | 0.75 | 0.73 | 44 MB | 📖 |
| mk_core_news_sm v3.0.0 | Macedonian | 0.92 | n/a | 0.47 | 0.62 | 0.70 | 0.62 | 18 MB | 📖 |
| nb_core_news_lg v3.0.0 | Norwegian | 0.97 | 0.97 | 0.87 | 0.89 | 0.85 | 0.94 | 547 MB | 📖 |
| nb_core_news_md v3.0.0 | Norwegian | 0.97 | 0.97 | 0.87 | 0.90 | 0.85 | 0.93 | 44 MB | 📖 |
| nb_core_news_sm v3.0.0 | Norwegian | 0.97 | 0.97 | 0.85 | 0.88 | 0.77 | 0.93 | 15 MB | 📖 |
| nl_core_news_lg v3.0.0 | Dutch | 0.96 | 0.95 | 0.82 | 0.87 | 0.77 | 0.87 | 546 MB | 📖 |
| nl_core_news_md v3.0.0 | Dutch | 0.96 | 0.95 | 0.82 | 0.87 | 0.74 | 0.87 | 45 MB | 📖 |
| nl_core_news_sm v3.0.0 | Dutch | 0.95 | 0.93 | 0.80 | 0.85 | 0.72 | 0.86 | 16 MB | 📖 |
| pl_core_news_lg v3.0.0 | Polish | 0.97 | 0.98 | 0.84 | 0.89 | 0.85 | 0.99 | 584 MB | 📖 |
| pl_core_news_md v3.0.0 | Polish | 0.97 | 0.98 | 0.84 | 0.89 | 0.84 | 0.98 | 84 MB | 📖 |
| pl_core_news_sm v3.0.0 | Polish | 0.95 | 0.98 | 0.79 | 0.86 | 0.80 | 0.98 | 55 MB | 📖 |
| pt_core_news_lg v3.0.0 | Portuguese | 0.97 | 0.90 | 0.86 | 0.90 | 0.91 | 0.95 | 551 MB | 📖 |
| pt_core_news_md v3.0.0 | Portuguese | 0.97 | 0.90 | 0.86 | 0.90 | 0.90 | 0.95 | 49 MB | 📖 |
| pt_core_news_sm v3.0.0 | Portuguese | 0.97 | 0.89 | 0.85 | 0.89 | 0.88 | 0.92 | 21 MB | 📖 |
| ro_core_news_lg v3.0.0 | Romanian | 0.96 | 0.97 | 0.84 | 0.89 | 0.77 | 0.96 | 546 MB | 📖 |
| ro_core_news_md v3.0.0 | Romanian | 0.96 | 0.97 | 0.85 | 0.89 | 0.76 | 0.96 | 44 MB | 📖 |
| ro_core_news_sm v3.0.0 | Romanian | 0.96 | 0.96 | 0.82 | 0.87 | 0.72 | 0.97 | 15 MB | 📖 |
| ru_core_news_lg v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.95 | 1.00 | 491 MB | 📖 |
| ru_core_news_md v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.94 | 1.00 | 41 MB | 📖 |
| ru_core_news_sm v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.95 | 1.00 | 16 MB | 📖 |
| xx_ent_wiki_sm v3.0.0 | MultiLanguage | n/a | n/a | n/a | n/a | 0.82 | n/a | 14 MB | 📖 |
| xx_sent_ud_sm v3.0.0 | MultiLanguage | n/a | n/a | n/a | n/a | n/a | 0.86 | 10 MB | 📖 |
| zh_core_web_lg v3.0.0 | Chinese | n/a | 0.90 | 0.66 | 0.71 | 0.71 | 0.75 | 577 MB | 📖 |
| zh_core_web_md v3.0.0 | Chinese | n/a | 0.90 | 0.65 | 0.70 | 0.70 | 0.76 | 76 MB | 📖 |
| zh_core_web_sm v3.0.0 | Chinese | n/a | 0.90 | 0.64 | 0.70 | 0.69 | 0.75 | 47 MB | 📖 |
| zh_core_web_trf v3.0.0 | Chinese | n/a | 0.92 | 0.73 | 0.77 | 0.75 | 0.65 | 398 MB | 📖 |
💬 TAG: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). POS: Part-of-speech tags (coarse-grained tags, i.e. `Token.pos_`). UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). NER: Named entities (F-score). Sent: Sentence segmentation. Size: Model file size (zipped archive).
⚠️ Backwards incompatibilities
For more info on how to migrate from spaCy v2.x, see the detailed migration guide.
API changes
- Pipeline package symlinks, the `link` command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like `en_core_web_sm` explicitly.
- A pipeline's `meta.json` is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the `config.cfg`, which also includes all settings used to train the pipeline.
- The `train`, `pretrain` and `debug data` commands now only take a `config.cfg`.
- `Language.add_pipe` now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the `@Language.component` or `@Language.factory` decorator.
- The `Language.update`, `Language.evaluate` and `TrainablePipe.update` methods now all take batches of `Example` objects instead of `Doc` and `GoldParse` objects, or raw text and a dictionary of annotations.
- The `begin_training` methods have been renamed to `initialize` and now take a function that returns a sequence of `Example` objects to initialize the model instead of a list of tuples.
- `Matcher.add` and `PhraseMatcher.add` now only accept a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.
- The `Doc` flags like `Doc.is_parsed` or `Doc.is_tagged` have been replaced by `Doc.has_annotation`.
- The `spacy.gold` module has been renamed to `spacy.training`.
- The `PRON_LEMMA` symbol and `-PRON-` as an indicator for pronoun lemmas have been removed.
- The `TAG_MAP` and `MORPH_RULES` in the language data have been replaced by the more flexible `AttributeRuler`.
- The `Lemmatizer` is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
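
Several of the API changes above can be seen together in a short sketch. It assumes a blank English pipeline (no trained package needed); the component name `token_counter` and the pattern name `GREETING` are hypothetical and chosen only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher


# Custom components are registered with a decorator under a string name.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe takes the registered string name, not the function itself.
nlp.add_pipe("token_counter")
doc = nlp("Hello world")

matcher = Matcher(nlp.vocab)
# The second argument is a list of patterns; on_match is a keyword argument.
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GREETING", [pattern], on_match=None)
matches = matcher(doc)

# Doc.has_annotation replaces flags like Doc.is_tagged; a blank pipeline
# sets no fine-grained tags.
is_tagged = doc.has_annotation("TAG")
print(nlp.pipe_names, len(matches), is_tagged)
```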
Removed or renamed API
| Removed | Replacement |
| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
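The renamed BILUO helpers from the table above can be tried out with a short sketch like this one. It assumes spaCy v3 is installed; the sentence and the LOC label are illustrative only.

```python
import spacy
from spacy.training import biluo_tags_to_offsets, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London and Berlin.")

# Character offsets (start, end, label) for the two entities.
entities = [(7, 13, "LOC"), (18, 24, "LOC")]

# Formerly gold.biluo_tags_from_offsets
tags = offsets_to_biluo_tags(doc, entities)

# Formerly gold.offsets_from_biluo_tags
round_trip = biluo_tags_to_offsets(doc, tags)

print(tags)
print(round_trip)
```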
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
| Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
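Two of the replacements in the table above, Doc.retokenize in place of Doc.merge and exclude=["vocab"] in place of vocab=False, might look like this in practice. This is a minimal sketch assuming spaCy v3 and a blank English pipeline; the sample text is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is big")

# Doc.merge / Span.merge are gone: merge spans inside the retokenize context.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

merged = [token.text for token in doc]
print(merged)

# Boolean keyword arguments like vocab=False are replaced by an exclude list.
data = nlp.to_bytes(exclude=["vocab"])
restored = spacy.blank("en").from_bytes(data, exclude=["vocab"])
```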
👥 Contributors
This release is brought to you by @honnibal, @ines, @svlandeg and @adrianeboyd. Thanks to @AMArostegui, @BramVanroy, @Cristianasp, @DeNeutoy, @DuyguA, @Jan-711, @KKsharma99, @KeshavG-lb, @KoichiYasuoka, @MartinoMensio, @Nuccy90, @PluieElectrique, @SamEdwardes, @Stannislav, @abchapman93, @alexcombessie, @alvaroabascar, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @bratao, @bryant1410, @buriy, @chopeen, @danielvasic, @delzac, @dhruvrnaik, @erip, @florijanstamenkovic, @forest1988, @gandersen101, @garethsparks, @graue70, @guadiromero, @hertelm, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jabortell, @jbesomi, @jenojp, @jganseman, @jgutix, @jmargeta, @jumasheff, @kuk, @leyendecker, @lizhe2004, @lorenanda, @mahnerak, @mikeizbicki, @myavrum, @nipunsadvilkar, @oculusrepairo, @ophelielacroix, @rahul1990gupta, @rameshhpathak, @rasyidf, @revuel, @richardliaw, @robertsipek, @snsten, @solarmist, @tamuhey, @thomasbird, @tiangolo, @tilusnet, @timgates42, @vha14, @walterhenry, @wannaphong, @werew, @yosiasz and @zaibacu for the pull requests and contributions!
- Python
Published by ines over 5 years ago
spacy - v3.0.0rc3: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.
⚠️⚠️⚠️ Make sure to retrain your models! ⚠️⚠️⚠️ This release includes changes to the config and model architectures, so if you've trained a custom pipeline with v3.0.0rc1 or v3.0.0rc2, you'll need to retrain it. We recommend using the new spaCy projects system to make it easy to re-run your training process. To auto-fill and update your configs, you can use the init fill-config command.
📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →
🚀 Quickstart
pip install -U spacy-nightly --pre
- Introducing spaCy v3.0 nightly
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 18 languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
- New core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for their contributions!
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in Matcher.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
⚠️ Backwards incompatibilities
For more info on how to migrate from spaCy v2.x, see the detailed migration guide.
API changes
- Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
- A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
- The train, pretrain and debug data commands now only take a config.cfg.
- Language.add_pipe now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
- The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
- The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
- Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
- The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
- The spacy.gold module has been renamed to spacy.training.
- The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
- The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
- The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
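The two component-related changes above (decorating custom components and adding them by string name) can be sketched as follows. This assumes spaCy v3 is installed; the component name token_counter and what it stores in doc.user_data are invented for the example.

```python
import spacy
from spacy.language import Language


# Register a stateless function component under a string name.
# "token_counter" is a hypothetical name chosen for this sketch.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")

# v3 style: pass the registered name, not the function object.
nlp.add_pipe("token_counter")

doc = nlp("Hello world")
print(nlp.pipe_names, doc.user_data["n_tokens"])
```

Components that need settings or state would use @Language.factory instead, which registers a function that builds the component.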
Removed or renamed API
| Removed | Replacement |
| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
| Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
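The table above lists logging (DEBUG) as the replacement for the verbose argument. A minimal sketch using only the standard library is shown below; it assumes that spaCy emits its messages through a logger named "spacy", which is not stated in these notes.

```python
import logging

# Send log records to stderr with a simple format.
logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")

# Assumption: spaCy logs under the standard-library logger called "spacy".
spacy_logger = logging.getLogger("spacy")
spacy_logger.setLevel(logging.DEBUG)

debug_enabled = spacy_logger.isEnabledFor(logging.DEBUG)
print(debug_enabled)
```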
- Python
Published by ines over 5 years ago
spacy - v2.3.5: Bug fixes and simpler source installs
✨ New features and improvements
- Modify blis and numpy build dependencies to simplify source installations.
- Support cupy v8+ in combination with thinc v7.4.5.
🔴 Bug fixes
- Fix issue #6443: Only set NORM on token in retokenizer.
- Fix issue #6453: Add SPACY as a Matcher attribute.
- Fix issue #6512: Add nlp.max_length check to nlp.pipe through nlp.make_doc.
- Fix issue #6515: Add missing .pipe methods to Chinese, Japanese, Korean and Thai tokenizers.
- Fix issue #6518: Fix subsequent pipe detection in EntityRuler.
- Fix issue #6523: Remove non-working --use-chars from train CLI.
👥 Contributors
Thanks to @KoichiYasuoka for the pull requests and contributions.
- Python
Published by adrianeboyd over 5 years ago
spacy - v2.3.4: Fix beam parser API
🔴 Bug fixes
- Fix issue #6446: Restore cleanup_beam method.
📖 Documentation and examples
- Update rule-based matching docs
👥 Contributors
Thanks to @jabortell for the pull requests and contributions.
- Python
Published by adrianeboyd over 5 years ago
spacy - v2.3.3: Alpha support for Macedonian and Sanskrit, updates for many languages and bug fixes
✨ New features and improvements
- NEW: Add alpha support for Macedonian and Sanskrit.
- Update language data for Croatian, Czech, English, Hebrew, Hindi, Indonesian, Swedish, Thai and Turkish.
- Add support for aarch64 and ppc64le on linux with binary packages available on conda-forge.
🔴 Bug fixes
- Fix issue #5610: Make sure sys.argv exists.
- Fix issue #5643: Add ent_id_ to strings serialized with Doc.
- Fix issue #5727: Clarify warning for misaligned BILUO tags.
- Fix issue #5768: Improve tag map initialization and updating.
- Fix issue #5794: Improve warnings around normalization tables.
- Fix issue #5796: Update invalid tag maps.
- Fix issue #5799: Remove hard-coded GPU ID from pretrain.
- Fix issue #5802: Mark Japanese documents as tagged.
- Fix issue #5823: Fix typo in unit tests.
- Fix issue #5838: Fix EntityRenderer to support break lines (after last entity).
- Fix issue #5843: Prefer earlier spans in EntityRuler.
- Fix issue #5849: Allow Doc.char_span to snap to token boundaries.
- Fix issue #5853: Fix span boundary handling in Spanish noun chunks.
- Fix issue #5861: Add Span index boundary checks.
- Fix issue #5904: Fix typos in comments.
- Fix issue #5910: Update default sentencizer characters for Armenian, Greek and Arabic.
- Fix issue #6014: Fix off-by-one error for best iteration calculation.
- Fix issue #6112: Fix overlapping German noun chunks.
- Fix issue #6148: Identify final Matcher pattern node by quantifier.
- Fix issue #6164: Reorder so tag map is replaced only if a custom file is provided.
- Fix issue #6218: Reproducibility for TextCategorizer and Tok2Vec.
- Fix issue #6219: Add re-enabled pipe names back to the meta before serializing.
- Fix issue #6300: Fix on_match callback and exclude empty match lists from results for DependencyMatcher.
- Fix issue #6347: Memory leak issues with beam_parse (requires thinc>=7.4.3).
- Fix issue #6373: Bugfix textcat reproducibility on GPU (requires thinc>=7.4.3).
- Fix issue #6405: Add all vectors to vocab before pruning.
- Fix issue #6413: Use int8_t instead of char in Matcher.
👥 Contributors
Thanks to @abchapman93, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @BramVanroy, @chopeen, @danielvasic, @delzac, @DuyguA, @erip, @florijanstamenkovic, @graue70, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jgutix, @KKsharma99, @leyendecker, @lizhe2004, @MartinoMensio, @nipunsadvilkar, @Nuccy90, @oculusrepairo, @rahul1990gupta, @rasyidf, @robertsipek, @SamEdwardes, @snsten, @solarmist, @Stannislav, @tamuhey, @tilusnet, @vha14, @wannaphong, @zaibacu for the pull requests and contributions.
- Python
Published by adrianeboyd over 5 years ago
spacy - v3.0.0rc1: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more
🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.
🚀 Quickstart
pip install -U spacy-nightly --pre
- Introducing spaCy v3.0 nightly
- New in v3.0: New features, backwards incompatibilities and migration guide.
- Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
- Training Quickstart: Generate a training config for your specific use case.
- Benchmarks: Results and accuracy comparisons.
- Projects & Project Templates: Get started by cloning a project template.
✨ New features and improvements
- Transformer-based pipelines with support for multi-task learning.
- Retrained model families for 16 languages and 52 trained pipelines in total, including 6 transformer-based pipelines.
- New training workflow and config system.
- Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
- spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
- Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
- Parallel training and distributed computing with Ray.
- New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
- New and improved pipeline component API and decorators for custom components.
- Source trained components from other pipelines in your training config.
- DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
- Support for greedy patterns in Matcher.
- Type hints and type-based data validation for custom registered functions.
- Various new methods, attributes and commands.
⚠️ Backwards incompatibilities
For more info on how to migrate from spaCy v2.x, see the detailed migration guide.
API changes
- Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
- A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
- The train, pretrain and debug data commands now only take a config.cfg.
- Language.add_pipe now takes the string name of the component factory instead of the component function.
- Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
- The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
- The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
- Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
- The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
- The spacy.gold module has been renamed to spacy.training.
- The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
- The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
- The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
- Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
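To make the move from GoldParse to Example and from the Doc flags to Doc.has_annotation more concrete, here is a small sketch. It assumes spaCy v3 is installed; the text and the GPE annotation are illustrative, and no model is trained.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# The "predicted" side is an unannotated Doc produced by the tokenizer.
predicted = nlp.make_doc("I like London")

# The gold annotations are given as a dictionary and stored on a reference Doc.
example = Example.from_dict(predicted, {"entities": [(7, 13, "GPE")]})

gold_entities = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold_entities)

# Doc.has_annotation replaces flags such as Doc.is_tagged and Doc.is_parsed.
reference_has_ents = example.reference.has_annotation("ENT_IOB")
predicted_is_tagged = example.predicted.has_annotation("TAG")
print(reference_has_ents, predicted_is_tagged)
```

A list of such Example objects is what Language.update and Language.evaluate expect in v3.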
Removed or renamed API
| Removed | Replacement |
| -------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.
| Removed | Replacement |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
- Python
Published by ines over 5 years ago
spacy - v2.3.2: Improved Korean tokenizer speed, experimental character-based pretraining and bug fixes
✨ New features and improvements
- Improve Korean tokenizer speed.
- Add experimental character-based pretraining.
🔴 Bug fixes
- Fix issue #5728: Fix French lemmatizer.
- Fix issue #5729: Fix lemmatizer for python 2.7.
- Fix issue #5751: Fix meta serialization in train CLI.
👥 Contributors
Thanks to @graue70, @mikeizbicki, @jbesomi, @gandersen101 and @DeNeutoy for the pull requests and contributions.
- Python
Published by adrianeboyd almost 6 years ago
spacy - v2.3.1: Alpha support for Nepali, updated Armenian and Japanese language data and bug fixes
✨ New features and improvements
- NEW: Add alpha support for Nepali.
- Refactor Japanese tokenizer and include additional custom tokenizer features.
- Update Armenian language data.
- Include spacy git commit in package and model meta for reference.
🔴 Bug fixes
- Fix issue #5620: Skip vocab in component config overrides.
- Fix issue #5634: Fix polarity of Token.is_oov and Lexeme.is_oov.
- Fix issue #5643: Add strings and ENT_KB_ID to Doc serialization.
- Fix issue #5648: Disregard special tag _SP in check for new tag map.
- Fix issue #5658: Move lemmatizer is_base_form to language settings.
👥 Contributors
Thanks to @myavrum, @mahnerak, @rameshhpathak, @hiroshi-matsuda-rit, @PluieElectrique, @hertelm and @alvaroabascar for the pull requests and contributions.
- Python
Published by adrianeboyd almost 6 years ago
spacy - v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes
⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
- NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
- NEW: Alpha support for Armenian, Gujarati and Malayalam.
- NEW: Lookup lemmatization for Polish.
- NEW: Allow Matcher to match on both Doc and Span objects.
- NEW: Add Token.is_sent_end property.
- Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
- Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
- Add support for pkuseg alongside jieba for Chinese.
- Switch from fugashi to sudachipy for Japanese.
- Improve punctuation used in sentencizer.
- Switch to new and more consistent alignment method in gold.align.
- Reduce stored lexemes data and move non-derivable features to spacy-lookups-data.
🔴 Bug fixes
- Fix issue #5056: Introduce support for matching Span objects.
- Fix issue #5086: Remove Vectors.from_glove.
- Fix issue #5131: Improve data processing in named entity linking scripts.
- Fix issue #5137: Fix passing of component configuration to component.
- Fix issue #5144: Fix sentence comparison in test util.
- Fix issue #5166: Fix handling of exclusive_classes in textcat ensemble.
- Fix issue #5170: Set rank for new vector in Vocab.set_vector.
- Fix issue #5181: Prevent None values in gold fields.
- Fix issue #5191: Fix GoldParse initialization when the number of tokens has changed.
- Fix issue #5193: Correctly pin cupy-cuda extra dependencies.
- Fix issue #5200: Fix minor bugs in train CLI.
- Fix issue #5216: Modify Vectors.resize to work with cupy.
- Fix issue #5228: Raise error for inplace resize with new vector dimension.
- Fix issue #5230: Fix unittest warnings when saving a model.
- Fix issue #5257: Use inline flags in token_match patterns.
- Fix issue #5278, #5359: Add missing __init__.py files to language data tests.
- Fix issue #5281: Fix comparison predicate handling for !=.
- Fix issue #5287: Normalize TokenC.sent_start values for Matcher.
- Fix issue #5292: Fix typo in option name --n-save_every.
- Fix issue #5303: Use max(uint64) for OOV lexeme rank.
- Fix issue #5311: Fix alignment of cards on landing page.
- Fix issue #5320: Fix most_similar for vectors with unused rows.
- Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
- Fix issue #5356: Fix bug in Span.similarity that could trigger TypeError.
- Fix issue #5361: Fix problems with lower and whitespace in variants.
- Fix issue #5373: Improve exceptions for 'd (would/had) in English.
- Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
- Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
- Fix issue #5429: Modify array type to accommodate OOV_RANK.
- Fix issue #5430: Check that row is within bounds when adding vector.
- Fix issue #5435: Use Token.sent_start for Span.sent.
- Fix issue #5436: Fix ErrorsWithCodes().__class__ return value.
- Fix issue #5450: Disallow merging 0-length spans.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
- If you're training new models, you'll want to install the package spacy-lookups-data, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
- spaCy's custom warnings have been replaced with native Python warnings. Instead of setting SPACY_WARNING_IGNORE, use the warnings filters to manage warnings.
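Managing warnings with the standard warnings filters, as the last item above suggests, could look roughly like this. The sketch uses only the standard library; the noisy function and the [W000] code are placeholders standing in for a spaCy call and one of its warning codes.

```python
import warnings


def noisy():
    # Stand-in for a library call that emits a warning with a code prefix.
    warnings.warn("[W000] hypothetical spaCy-style warning", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    # Show every warning by default inside this block...
    warnings.simplefilter("always")
    # ...but ignore those whose message starts with the given code.
    # Later filters are inserted at the front, so this rule wins.
    warnings.filterwarnings("ignore", message=r"\[W000\]")
    result = noisy()

print(result, len(caught))
```

Matching on the message prefix is one option; filtering by warning category or module works the same way through the other filterwarnings parameters.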
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
- Move bin/wiki_entity_linking scripts for Wikipedia to projects repo.
🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!
📦 Model packages (46)
| Model | Language | Version | Vectors |
| ------------------- | ---------- | ------: | ----: |
| zh_core_web_sm | Chinese | 2.3.0 | 𐄂 |
| zh_core_web_md | Chinese | 2.3.0 | ✓ |
| zh_core_web_lg | Chinese | 2.3.0 | ✓ |
| da_core_news_sm | Danish | 2.3.0 | 𐄂 |
| da_core_news_md | Danish | 2.3.0 | ✓ |
| da_core_news_lg | Danish | 2.3.0 | ✓ |
| nl_core_news_sm | Dutch | 2.3.0 | 𐄂 |
| nl_core_news_md | Dutch | 2.3.0 | ✓ |
| nl_core_news_lg | Dutch | 2.3.0 | ✓ |
| en_core_web_sm | English | 2.3.0 | 𐄂 |
| en_core_web_md | English | 2.3.0 | ✓ |
| en_core_web_lg | English | 2.3.0 | ✓ |
| fr_core_news_sm | French | 2.3.0 | 𐄂 |
| fr_core_news_md | French | 2.3.0 | ✓ |
| fr_core_news_lg | French | 2.3.0 | ✓ |
| de_core_news_sm | German | 2.3.0 | 𐄂 |
| de_core_news_md | German | 2.3.0 | ✓ |
| de_core_news_lg | German | 2.3.0 | ✓ |
| el_core_news_sm | Greek | 2.3.0 | 𐄂 |
| el_core_news_md | Greek | 2.3.0 | ✓ |
| el_core_news_lg | Greek | 2.3.0 | ✓ |
| it_core_news_sm | Italian | 2.3.0 | 𐄂 |
| it_core_news_md | Italian | 2.3.0 | ✓ |
| it_core_news_lg | Italian | 2.3.0 | ✓ |
| ja_core_news_sm | Japanese | 2.3.0 | 𐄂 |
| ja_core_news_md | Japanese | 2.3.0 | ✓ |
| ja_core_news_lg | Japanese | 2.3.0 | ✓ |
| lt_core_news_sm | Lithuanian | 2.3.0 | 𐄂 |
| lt_core_news_md | Lithuanian | 2.3.0 | ✓ |
| lt_core_news_lg | Lithuanian | 2.3.0 | ✓ |
| nb_core_news_sm | Norwegian Bokmål | 2.3.0 | 𐄂 |
| nb_core_news_md | Norwegian Bokmål | 2.3.0 | ✓ |
| nb_core_news_lg | Norwegian Bokmål | 2.3.0 | ✓ |
| pl_core_news_sm | Polish | 2.3.0 | 𐄂 |
| pl_core_news_md | Polish | 2.3.0 | ✓ |
| pl_core_news_lg | Polish | 2.3.0 | ✓ |
| pt_core_news_sm | Portuguese | 2.3.0 | 𐄂 |
| pt_core_news_md | Portuguese | 2.3.0 | ✓ |
| pt_core_news_lg | Portuguese | 2.3.0 | ✓ |
| ro_core_news_sm | Romanian | 2.3.0 | 𐄂 |
| ro_core_news_md | Romanian | 2.3.0 | ✓ |
| ro_core_news_lg | Romanian | 2.3.0 | ✓ |
| es_core_news_sm | Spanish | 2.3.0 | 𐄂 |
| es_core_news_md | Spanish | 2.3.0 | ✓ |
| es_core_news_lg | Spanish | 2.3.0 | ✓ |
| xx_ent_wiki_sm | Multi-language | 2.3.0 | 𐄂 |
👥 Contributors
Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.
🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).
- Python
Published by ines almost 6 years ago
spacy - v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes
✨ New features and improvements
- NEW: Add Span.char_span method.
- NEW: Base language support for Yoruba and Basque.
- NEW: Add --tag-map-path argument to debug-data and train commands.
- NEW: Add add_lemma option to displacy dependency visualizer.
- Add IDX as an attribute available via Doc.to_array.
- Improve speed of adding large number of patterns to EntityRuler.
- Replace python-mecab3 with fugashi for Japanese.
- Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.
🔴 Bug fixes
- Fix issue #3979, #4819, #4871: Add `tok2vec` parameters to `train` command.
- Fix issue #4009: Fix use of pretrained vectors in text classifier.
- Fix issue #4342: Improve CLI training with base model.
- Fix issue #4432: Add destructors for states in `TransitionSystem`.
- Fix issue #4440: Require `HEAD` for `is_parsed` in `Doc.from_array`.
- Fix issue #4615: Update `SHAPE` docs and examples.
- Fix issue #4665: Allow `HEAD` field in CoNLL-U format to be an underscore.
- Fix issue #4673: Ensure correct array module is used when returning a vector via `Vocab`.
- Fix issue #4674: Make `set_entities` in the `KnowledgeBase` more robust.
- Fix issue #4677: Add missing tags to tag maps for `el`, `es` and `pt`.
- Fix issue #4688: Iterate over `lr_edges` until `Doc.sents` are correct.
- Fix issue #4703, #4823: Facilitate large training files.
- Fix issue #4707: Auto-exclude `disabled` when calling `from_disk` during load.
- Fix issue #4717: Fix int value handling in `Matcher`.
- Fix issue #4719: Add message when cli train script throws exception.
- Fix issue #4723: Update `EntityLinker` example.
- Fix issue #4725: Take care of global vectors in multiprocessing.
- Fix issue #4770: Include `Doc.cats` in serialization of `Doc` and `DocBin`.
- Fix issue #4772: Fix bug in `EntityLinker.predict`.
- Fix issue #4777: Fix link to user hooks in documentation.
- Fix issue #4829: Update build dependencies in `pyproject.toml`.
- Fix issue #4830: Warn for punctuation in entities when training with noise.
- Fix issue #4833: Make example scripts work with transformer starter models.
- Fix issue #4849: Fix serialization of `ENT_ID`.
- Fix issue #4862: Fix and improve URL pattern.
- Fix issue #4868: Include `.pyx` and `.pxd` files in the distribution.
- Fix issue #4876: Add friendlier error to entity linking example script.
- Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
- Fix issue #4924: Fix handling of empty docs or golds in `Language.evaluate`.
- Fix issue #4934: Prevent updating component config if the `Model` was already defined.
- Fix issue #4935: Fix `Sentencizer.pipe` for empty `Doc`.
- Fix issue #4961: Remove old docs section links.
- Fix issue #4965: Sync `Span.__eq__` and `Span.__hash__`.
- Fix issue #4975: Adjust `srsly` pin.
- Fix issue #5048: Fix behavior of `get_doc` test utility.
- Fix issue #5073: Normalize `IS_SENT_START` to `SENT_START` for `Matcher`.
- Fix issue #5075: Make it impossible to create invalid heads with `Doc.from_array`.
- Fix issue #5082: Correctly set vector of merged span in `merge_entities`.
- Fix issue #5115: Ensure paths in `Tokenizer.to_disk` and `Tokenizer.from_disk`.
- Fix issue #5117: Clarify behavior of `Doc.is_` flags for empty `Doc`s.
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
👥 Contributors
Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!
- Python
Published by ines about 6 years ago
spacy - v2.2.3: Tokenizer.explain, Korean base support, dependency scores per label and bug fixes
✨ New features and improvements
- NEW: `Tokenizer.explain` method to see which rule or pattern was matched.
  ```python
  tok_exp = nlp.tokenizer.explain("(don't)")
  assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
  assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
  ```
- NEW: Official Python 3.8 wheels for spaCy and its dependencies.
- Base language support for Korean.
- Add `Scorer.las_per_type` (labelled dependency scores per label).
- Rework Chinese language initialization and tokenization.
- Improve language data for Luxembourgish.
🔴 Bug fixes
- Fix issue #4573, #4645: Improve tokenizer usage docs.
- Fix issue #4575: Add error in `debug-data` if no dev docs are available.
- Fix issue #4582: Make `as_tuples=True` in `Language.pipe` work with multiprocessing.
- Fix issue #4590: Correctly call `on_match` in `DependencyMatcher`.
- Fix issue #4593: Build wheels for Python 3.8.
- Fix issue #4604: Fix realloc in `Retokenizer.split`.
- Fix issue #4656: Fix `conllu2json` converter when `-n` > 1.
- Fix issue #4662: Fix `Language.evaluate` for components without `.pipe` method.
- Fix issue #4670: Ensure `EntityRuler` is deserialized correctly from disk.
- Fix issue #4680: Raise error if non-string labels are added to `Tagger` or `TextCategorizer`.
- Fix issue #4691: Make `Vectors.find` return keys in correct order.
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @yash1994, @walterhenry, @prilopes, @f11r, @questoph, @erip, @richardpaulhudson and @GuiGel for the pull requests and contributions.
- Python
Published by ines over 6 years ago
spacy - v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install
✨ New features and improvements
- NEW: Support multiprocessing in `nlp.pipe` via the `n_process` argument (Python 3 only).
- Base language support for Luxembourgish.
- Add noun chunks iterator for Swedish.
- Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
- Repackaged models for Greek and German with improved lookup tables via `spacy-lookups-data`.
- Add warning in `debug-data` for low sentences per doc ratio.
- Improve checks and errors related to ill-formed IOB input in `convert` and `debug-data` CLI.
- Support training dict format as JSONL.
- Make `EntityRuler` ID resolution 2× faster and support `"id"` in patterns to set `Token.ent_id`.
- Improve rendering of named entity spans in `displacy` for RTL languages.
- Update Thinc to ditch `thinc_gpu_ops` for simpler GPU install.
- Support Mish activation in `spacy pretrain`.
- Add forwards-compatible support for new `Language.disable_pipes` API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
  ```diff
  - disabled = nlp.disable_pipes("tagger", "parser")
  + disabled = nlp.disable_pipes(["tagger", "parser"])
  ```
- Add forwards-compatible support for new `Matcher.add` and `PhraseMatcher.add` API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.
  ```diff
  patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
  - matcher.add("GoogleNow", None, *patterns)
  + matcher.add("GoogleNow", patterns)
  - matcher.add("GoogleNow", on_match, *patterns)
  + matcher.add("GoogleNow", patterns, on_match=on_match)
  ```
- Add new and improved tokenization alignment in `gold.align` behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
  ```python
  import spacy.gold
  spacy.gold.USE_NEW_ALIGN = True
  ```
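To illustrate how a single method can accept both the old and the new `add` call styles described above, here is a minimal sketch in plain Python. `PatternStore` is a hypothetical stand-in that only stores patterns and callbacks; it is not spaCy's `Matcher`, and the dispatch logic is an assumption about one way such forwards compatibility could be written.

```python
class PatternStore:
    """Hypothetical stand-in for a matcher: it only records patterns."""

    def __init__(self):
        self.patterns = {}
        self.callbacks = {}

    def add(self, key, patterns, *old_style_patterns, on_match=None):
        # Old style: add(key, on_match_or_None, *patterns).
        # In that case the second positional argument is a callback or None.
        if patterns is None or callable(patterns):
            on_match = patterns
            patterns = list(old_style_patterns)
        elif old_style_patterns:
            # New style takes one list of patterns, so extra positional
            # arguments are most likely a mistake.
            raise TypeError("pass all patterns as a single list")
        self.patterns.setdefault(key, []).extend(patterns)
        self.callbacks[key] = on_match


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

old_api = PatternStore()
old_api.add("GoogleNow", None, *patterns)  # old call style

new_api = PatternStore()
new_api.add("GoogleNow", patterns)  # new call style
```

Both calls leave the same patterns stored under `"GoogleNow"`, which is the property a forwards-compatible signature needs to preserve.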
🔴 Bug fixes
- Fix issue #1303: Support multiprocessing in `nlp.pipe`.
- Fix issue #1745: Ditch `thinc_gpu_ops` for simpler GPU install.
- Fix issue #2411: Update Thinc to fix compilation on cygwin.
- Fix issue #3412: Prevent division by zero in `Vectors.most_similar`.
- Fix issue #3618: Fix memory leak for long-running parsing processes.
- Fix issue #4241: Update Greek lookups in `spacy-lookups-data`.
- Fix issue #4269: Extend unicode character block for Sinhala.
- Fix issue #4362: Improve `URL_PATTERN` and handling in tokenizer.
- Fix issue #4373: Make `PhraseMatcher.vocab` consistent with `Matcher.vocab`.
- Fix issue #4377: Clarify serialization of extension attributes.
- Fix issue #4382: Improve usage of `pkg_resources` and handling of entry points.
- Fix issue #4386: Consider `batch_size` when sorting similar vectors.
- Fix issue #4389: Fix `ner_jsonl2json` converter.
- Fix issue #4397: Ensure `on_match` callback is executed in `PhraseMatcher`.
- Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
- Fix issue #4402: Fix issue with how training data was passed through the pipeline.
- Fix issue #4406: Correct spelling in lemmatizer API docs.
- Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
- Fix issue #4435: Fix `PhraseMatcher.remove` for overlapping patterns.
- Fix issue #4443: Fix bug in `Vectors.most_similar`.
- Fix issue #4452: Fix `gold.docs_to_json` documentation.
- Fix issue #4463: Add missing `cats` to `GoldParse.from_annot_tuples` in `Scorer`.
- Fix issue #4470: Suppress convert output if writing to `stdout`.
- Fix issue #4475: Correct mistake in docs example.
- Fix issue #4485: Update tag maps and docs for English and German.
- Fix issue #4493: Update information in spaCy Universe.
- Fix issue #4496: Improve docs of `PhraseMatcher.add` arguments.
- Fix issue #4506: Ensure `Vectors.most_similar` returns `1.0` for identical vectors.
- Fix issue #4509: Fix `None` iteration error in entity linking script.
- Fix issue #4524: Fix typo in `Parser` sample construction of `GoldParse`.
- Fix issue #4528: Fix serialization of extension attribute values in `DocBin`.
- Fix issue #4529: Ensure `GoldParse` is initialized correctly with misaligned tokens.
- Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.
⚠️ Backwards incompatibilities
- The unused attributes `lemma_rules`, `lemma_index`, `lemma_exc` and `lemma_lookup` of the `Language.Defaults` have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via `nlp.vocab.lookups`.
  ```diff
  - nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
  + lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
  + lemma_lookup["spaCies"] = "spaCy"
  ```
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add more projects to the spaCy Universe.
👥 Contributors
Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.
- Python
Published by ines over 6 years ago
spacy - v2.1.9: Backport memory leak fix
This is a small maintenance update that backports a bug fix for a memory leak that'd occur in long-running parsing processes. It's intended for users who can't or don't yet want to upgrade to spaCy v2.2 (e.g. because it requires retraining all the models). If you're able to upgrade, you should install the latest v2.2 instead of this version.
🔴 Bug fixes
- Fix issue #3618: Fix memory leak for long-running parsing processes.
- Fix issue #4538: Backport memory leak fix to v2.1.x branch.
- Python
Published by ines over 6 years ago
spacy - v2.2.1: Fix DocBin and Dutch model, improve Vectors.most_similar
✨ New features and improvements
- Make `Vectors.most_similar` return the top most similar vectors instead of only one.
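As a rough illustration of what "top most similar" means, the following plain-Python sketch ranks the rows of a small vector table by cosine similarity and returns the best `n`. The function names, the dict-based table and the `n` parameter are assumptions for this example; this is not the `Vectors.most_similar` implementation or signature.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against division by zero for zero vectors.
    return dot / norm if norm else 0.0


def top_n_similar(query, table, n=1):
    """Return the n (key, score) pairs with the highest cosine similarity."""
    scored = [(cosine(query, vector), key) for key, vector in table.items()]
    scored.sort(reverse=True)  # highest score first
    return [(key, score) for score, key in scored[:n]]


# Toy two-dimensional "vectors", made up for the example.
table = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
best_two = top_n_similar([1.0, 0.0], table, n=2)
```

With `n=1` the function behaves like a single-best lookup; larger `n` returns a ranked list.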
🔴 Bug fixes
- Fix issue #4365: Fix tag map in Dutch model.
- Fix issue #4368: Fix initialization of `DocBin` with attributes.
📖 Documentation and examples
- Add API docs for `Vectors.most_similar`.
👥 Contributors
Thanks to @bintay and @svlandeg for the pull requests and contributions.
- Python
Published by ines over 6 years ago
spacy - v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
- NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
- NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
- NEW: `EntityLinker` and `KnowledgeBase` API to train and access entity linking models, plus scripts to train your own Wikidata models.
- NEW: 10× faster `PhraseMatcher` and improved phrase matching algorithm.
- NEW: `DocBin` class to efficiently serialize collections of `Doc` objects.
- NEW: Train text classification models on the command line with `spacy train` and get `textcat` results via the `Scorer`.
- NEW: `debug-data` command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
- NEW: Efficient `Lookups` class using Bloom filters that allows storing, accessing and serializing large dictionaries via `vocab.lookups`.
- Data augmentation in `spacy train` via the `--orth-variant-level` flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
- Add `nlp.pipe_labels` (labels assigned by pipeline components) and include `"labels"` in `nlp.meta`.
- Support `spacy_displacy_colors` entry point to allow packages to add entity colors to `displacy`.
- Allow `template` config option in `displacy` to customize entity HTML template.
- Improve match pattern validation and handling of unsupported attributes.
- Add lookup lemmatization data for Croatian and Serbian.
- Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
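For readers unfamiliar with Bloom filters, the sketch below shows the general idea behind pairing one with a lookup table: the filter can cheaply say "definitely not present" before the table itself is consulted. The classes `BloomFilter` and `FilteredTable`, the bit-array size and the hashing scheme are all assumptions made for this example, not the design of spaCy's `Lookups` class.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting a single hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class FilteredTable:
    """Dict-backed table that consults the Bloom filter before the dict."""

    def __init__(self, data):
        self.data = dict(data)
        self.bloom = BloomFilter()
        for key in self.data:
            self.bloom.add(key)

    def get(self, key, default=None):
        if key not in self.bloom:
            return default  # filter guarantees the key was never added
        return self.data.get(key, default)  # may still be a false positive


table = FilteredTable({"spaCies": "spaCy", "mice": "mouse"})
```

Because a Bloom filter never reports an added key as missing, `table.get("mice")` always reaches the underlying dict, while unknown keys fall back to the default either way.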
🔴 Bug fixes
- Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
- Fix issue #3540: Update lemma and vector information after splitting a token.
- Fix issue #3687: Automatically skip duplicates in `Doc.retokenize`.
- Fix issue #3830: Retrain German model and fix `subtok` errors.
- Fix issue #3850: Allow customizing entity HTML template in displaCy.
- Fix issue #3879, #3951, #4154: Fix bug in `Matcher` retry loop that'd cause problems with `?` operator.
- Fix issue #3917: Raise error for negative token indices in `displacy`.
- Fix issue #3922: Add `PhraseMatcher.remove` method.
- Fix issue #3959, #4133: Make sure both `pos` and `tag` are correctly serialized.
- Fix issue #3972: Ensure `PhraseMatcher` returns multiple matches for identical rules.
- Fix issue #4020: Raise error for overlapping entities in `biluo_tags_from_offsets`.
- Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
- Fix issue #4070: Improve token pattern checking without validation.
- Fix issue #4096: Add checks for cycles in `debug-data`.
- Fix issue #4100: Improve docs on phrase pattern attributes.
- Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
- Fix issue #4104: Make visualized NER examples in docs more clear.
- Fix issue #4107: Automatically set span root attributes on merging.
- Fix issue #4111, #4170: Improve NER/IOB converters.
- Fix issue #4120: Correctly handle `?` operator at the end of pattern.
- Fix issue #4123: Provide more details in cycle error message `E069`.
- Fix issue #4138: Correctly open `.html` files as UTF-8 in `evaluate` command.
- Fix issue #4139: Make emoticon data a raw string.
- Fix issue #4148: Add missing API docs for `force` flag on `set_extension`.
- Fix issue #4155: Correct language code for Serbian.
- Fix issue #4165: Add more attributes to matcher validation schema.
- Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
- Fix issue #4200: Work around `tqdm` bug that'd remove text color from terminal output.
- Fix issue #4229: Fix handling of pre-set entities.
- Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
- Fix issue #4242: Make `.pos`/`.tag` distinction more clear in the docs.
- Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
- Fix issue #4262: Fix handling of spaces in Japanese.
- Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
- Fix issue #4270: Fix `--vectors-loc` documentation.
- Fix issue #4302: Remove duplicate `Parser.tok2vec` property.
- Fix issue #4303: Correctly support `as_tuples` and `return_matches` in `Matcher.pipe`.
- Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
- Fix issue #4308: Fix bug that could cause `PhraseMatcher` with very large lists to miss matches.
- Fix issue #4348: Ensure training doesn't crash with empty batches.
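One of the fixes above (#4020) makes overlapping entity annotations an error rather than something that is silently skipped. If you want to screen your own character-offset annotations before converting them, a check along the following lines may help. It is a standalone sketch: `check_no_overlaps` is a hypothetical helper, not part of spaCy, and it assumes entities are `(start_char, end_char, label)` tuples with an exclusive end offset.

```python
def check_no_overlaps(entities):
    """Return entities sorted by start offset, or raise if any two overlap.

    entities: list of (start_char, end_char, label) tuples, end exclusive.
    """
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    # After sorting by start offset, any overlap shows up between neighbours.
    for previous, current in zip(ordered, ordered[1:]):
        if current[0] < previous[1]:
            raise ValueError(f"Overlapping entities: {previous} and {current}")
    return ordered


clean = check_no_overlaps([(10, 16, "ORG"), (0, 5, "PERSON")])

try:
    check_no_overlaps([(0, 5, "PERSON"), (3, 8, "ORG")])
    overlap_detected = False
except ValueError:
    overlap_detected = True
```

Entities that merely touch, such as `(0, 5)` followed by `(5, 8)`, are treated as valid here because the end offset is exclusive.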
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- The lemmatization tables have been moved to their own package, `spacy-lookups-data`, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. `spacy.blank("en")`), you'll need to explicitly install spaCy plus data via `pip install spacy[lookups]`. The data will be registered automatically via entry points.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of the `Vocab` and serialized with it. This means that serialized objects (`nlp`, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
- The `Lemmatizer` class is now initialized with an instance of `Lookups` containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom `Lemmatizer`, you'll need to update your code.
- If you've been training your own models, you'll need to retrain them with the new version.
- The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
- The `spacy download` command does not set the `--no-deps` pip argument anymore by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, `--no-deps` is added back automatically to prevent spaCy from being downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new `debug-data` command to find problems in your data.
- Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an `ent_iob` value set, it won't be reset to an "unset" state and will always have at least `O` assigned. `list(doc.ents)` now actually keeps the annotations on the token level consistent, instead of resetting `O` to an empty string.
- The default punctuation in the `Sentencizer` has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]` on initialization.
- The `PhraseMatcher` algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results. However, the results should now be fully correct. See #4309 for details on this change.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is now available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used `nlp.meta["sources"]`, you might have to update it.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| ------------------- | ---------- | ------: | ----: | ----: | ----: | ----: | :-: | -----: |
| en_core_web_sm | English | 2.2.0 | 91.61 | 89.71 | 97.03 | 85.07 | 𐄂 | 11 MB |
| en_core_web_md | English | 2.2.0 | 91.65 | 89.77 | 97.14 | 86.10 | ✓ | 91 MB |
| en_core_web_lg | English | 2.2.0 | 91.98 | 90.16 | 97.21 | 86.30 | ✓ | 789 MB |
| de_core_news_sm | German | 2.2.0 | 90.75 | 88.63 | 96.29 | 83.11 | 𐄂 | 14 MB |
| de_core_news_md | German | 2.2.0 | 91.26 | 89.36 | 96.44 | 83.42 | ✓ | 214 MB |
| es_core_news_sm | Spanish | 2.2.0 | 90.20 | 87.05 | 96.79 | 89.45 | 𐄂 | 15 MB |
| es_core_news_md | Spanish | 2.2.0 | 90.89 | 87.94 | 97.03 | 89.86 | ✓ | 74 MB |
| pt_core_news_sm | Portuguese | 2.2.0 | 89.53 | 86.07 | 79.96 | 87.97 | 𐄂 | 20 MB |
| fr_core_news_sm | French | 2.2.0 | 87.27 | 84.28 | 94.38 | 82.77 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.2.0 | 88.82 | 86.07 | 95.15 | 82.82 | ✓ | 84 MB |
| it_core_news_sm | Italian | 2.2.0 | 90.79 | 86.94 | 96.06 | 86.29 | 𐄂 | 13 MB |
| nl_core_news_sm | Dutch | 2.2.0 | 76.79 | 69.53 | 90.10 | 68.79 | 𐄂 | 14 MB |
| el_core_news_sm | Greek | 2.2.0 | 84.40 | 80.98 | 94.41 | 71.88 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.2.0 | 87.96 | 84.88 | 96.38 | 77.59 | ✓ | 126 MB |
| nb_core_news_sm | Norwegian | 2.2.0 | 89.02 | 86.49 | 95.72 | 83.99 | 𐄂 | 12 MB |
| lt_core_news_sm | Lithuanian | 2.2.0 | 59.87 | 48.00 | 74.02 | 76.58 | 𐄂 | 12 MB |
| xx_ent_wiki_sm | Multi | 2.2.0 | - | - | - | 79.88 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Add "label scheme" section to all models in the models directory that lists the labels assigned by the different components.
- Extend the `sources` listed in the `meta.json` of pre-trained models with more details on the training corpora and include more information in the models directory.
- Add more examples of matching regular expressions.
- Add instructions for training an entity linking model.
- Add API docs for new `debug-data`, `EntityLinker`, `KnowledgeBase` and `Lookups`.
- Add new projects to the spaCy Universe.
- Add example for interactive model visualizer with Streamlit.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.
Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.
- Python
Published by ines over 6 years ago
spacy - v2.1.8: Usability improvements and Serbian alpha tokenization
✨ New features and improvements
- NEW: Alpha tokenization support for Serbian
- Improve language data for Urdu.
- Support installing and loading model packages in the same session.
🔴 Bug fixes
- Fix issue #4002: Make `PhraseMatcher` work as expected for `NORM` attribute.
- Fix issue #4063: Improve docs on `Matcher` attributes.
- Fix issue #4068: Make Korean work as expected on Python 2.7.
- Fix issue #4069: Add `validate` option to `EntityRuler`.
- Fix issue #4074: Raise error if annotation dict in simple training style has unexpected keys.
- Fix issue #4081: Fix typo in `pyproject.toml`.
- Fix handling of keyword arguments in `Language.evaluate`.
📖 Documentation and examples
- Improve `Matcher` attribute docs.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @akornilo, @mirfan899, @veer-bains, @seppeljordan, @Pavle992, @svlandeg, @jenojp and @adrianeboyd for the pull requests and contributions.
- Python
Published by ines almost 7 years ago
spacy - v2.1.7: Improved evaluation, better language factories and bug fixes
✨ New features and improvements
- Add `Token.tensor` and `Span.tensor` attributes.
- Support simple training format of `(text, annotations)` instead of only `(doc, gold)` for `nlp.evaluate`.
- Add support for `"lang_factory"` setting in model `meta.json` (see #4031).
- Also support `"requirements"` in `meta.json` to define packages for setup's `install_requires`.
- Improve `Pipe` base class methods and make them less presumptuous.
- Improve Danish and Korean tokenization.
- Improve error messages when deserializing model fails.
🔴 Bug fixes
- Fix issue #3669, #3962: Fix dependency copy in `Span.as_doc` that could cause segfault.
- Fix issue #3968: Fix bug in per-entity scores.
- Fix issue #4000: Improve entity linking API.
- Fix issue #4022: Fix error when Korean text contains special characters.
- Fix issue #4030: Handle edge case when calling `TextCategorizer.predict` with empty `Doc`.
- Fix issue #4045: Correct `Span.sent` docs.
- Fix issue #4048: Fix `init-model` command if there's no vocab.
- Fix issue #4052: Improve per-type scoring of NER.
- Fix issue #4054: Ensure the `lang` of `nlp` and `nlp.vocab` stay consistent.
- Fix bugs in `Token.similarity` and `Span.similarity` when called via hook.
📖 Documentation and examples
- Add documentation for `gold.align` helper.
- Add more explicit section on processing text.
- Improve documentation on disabling pipeline components.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @sorenlind, @pmbaumgartner, @svlandeg, @FallakAsad, @BreakBB, @adrianeboyd, @polm, @b1uec0in, @mdaudali and @ejarkm for the pull requests and contributions.
- Python
Published by ines almost 7 years ago
spacy - v2.1.6: Fix order of symbols that caused tag maps to be out-of-sync
🔴 Bug fixes
- Fix issue #3958: Fix order of symbols that caused tag maps to be out-of-sync.
- Python
Published by ines almost 7 years ago
spacy - v2.1.5: Base support for Marathi and Korean, better pretraining, scores per entity and bug fixes
✨ New features and improvements
- NEW: Base language data for Marathi and Korean (via `mecab-ko`, `mecab-ko-dic` and `natto-py`).
- Improve language data for Lithuanian, Spanish, Kannada, French, Norwegian and Hindi.
- Add evaluation metrics per entity type.
- Add resume logic to `spacy pretrain`.
- Add optional `id` property to EntityRuler patterns.
- Better introspection and IDE autocomplete for custom extension attributes.
- Make `Doc.is_sentenced` always return `True` for single-token docs.
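To make "evaluation metrics per entity type" concrete, here is a small standalone sketch that computes precision, recall and F-score separately for each label. It assumes entities are exact-match `(start, end, label)` tuples and uses a hypothetical function name; it is meant to show the arithmetic, not to reproduce the `Scorer` internals.

```python
from collections import defaultdict


def scores_per_type(gold, predicted):
    """Compute precision, recall and F-score per entity label.

    gold, predicted: iterables of (start, end, label) tuples.
    A prediction counts as correct only if all three fields match.
    """
    gold, predicted = set(gold), set(predicted)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ent in predicted:
        counts[ent[2]]["tp" if ent in gold else "fp"] += 1
    for ent in gold - predicted:
        counts[ent[2]]["fn"] += 1

    results = {}
    for label, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = {"p": p, "r": r, "f": f}
    return results


# Made-up annotations for the example.
gold = [(0, 1, "PERSON"), (3, 4, "ORG"), (6, 7, "ORG")]
predicted = [(0, 1, "PERSON"), (3, 4, "ORG"), (8, 9, "ORG")]
per_type = scores_per_type(gold, predicted)
```

In this toy case `PERSON` is scored perfectly while `ORG` has one hit, one spurious prediction and one miss, which an aggregate score alone would hide.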
🔴 Bug fixes
- Fix issue #3490: Add evaluation metrics per entity type to `Scorer`.
- Fix issue #3526: Serialize `EntityRuler` settings correctly.
- Fix issue #3558: Improve `E024` error message for incorrect `GoldParse`.
- Fix issue #3611: Fix bug when setting `ngram` parameter in text classifier.
- Fix issue #3625: Improve default punctuation rules for Hindi.
- Fix issue #3707: Improve introspection of custom attributes.
- Fix issue #3737: Check if component is callable in `Language.replace_pipe`.
- Fix issue #3743: Fix documentation of `lex_id`.
- Fix issue #3749: Change vector training script to work with latest Gensim.
- Fix issue #3762, #3934: Make `Doc.is_sentenced` default to `True` for single-token `Doc`s.
- Fix issue #3802: Fix typo in docs example.
- Fix issue #3811: Fix type of `--seed` option in `spacy pretrain`.
- Fix issue #3822: Allow passing `PhraseMatcher` arguments to `EntityRuler`.
- Fix issue #3839: Ensure the `Matcher` returns correct match IDs when used with operators.
- Fix issue #3840: Improve error messages in `spacy pretrain`.
- Fix issue #3853: Rename vectors if multiple models are loaded to prevent clashes.
- Fix issue #3859: Update `pretrain` to prevent unintended overwriting of weight files.
- Fix issue #3862: Fix matcher callback example.
- Fix issue #3868: Add `"v.s."` to English tokenizer exceptions.
- Fix issue #3869: Make `Doc.count_by` work as expected.
- Fix issue #3880: Fix unflatten padding in Thinc when last element is empty.
- Fix issue #3882: Exclude `user_data` when copying doc in displaCy.
- Fix issue #3892: Update `Tokenizer` initialization docs.
- Fix issue #3912: Make text classifier raise more friendly errors.
📖 Documentation and examples
- Add documentation for `Scorer`, `Language.evaluate` and `gold.docs_to_json`.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @BreakBB, @ujwal-narayan, @estr4ng7d, @maknotavailable, @ramananbalakrishnan, @nipunsadvilkar, @NirantK, @munozbravo, @intrafindBreno, @Azagh3l, @jarib, @tokestermw, @polm, @skrcode, @kabirkhan, @demongolem, @elbaulp, @clarus, @BramVanroy, @rokasramas, @askhogan, @khellan, @kognate, @cedar101 and @yash1994 for the pull requests and contributions.
- Python
Published by ines almost 7 years ago
spacy - v2.1.4: Training improvements and bug fixes
✨ New features and improvements
- NEW: `util.filter_spans` helper to filter duplicates and overlaps from a list of `Span` objects.
- Improve language data for Thai, Japanese, Indonesian and Dutch.
- Add `--n-save-every` to `spacy pretrain` and rename `--nr-iter` to `--n-iter` for consistency.
- Add `--return-scores` flag to `spacy evaluate` to return a dict.
- Add `--n-early-stopping` option to `spacy train` to define maximum number of iterations without dev accuracy improvements.
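The idea behind filtering duplicates and overlaps from a list of spans can be sketched without spaCy, using plain `(start, end)` token offsets with an exclusive end. The tie-breaking rule used here (prefer the longest span, then the earliest start) is an assumption for illustration, and `filter_spans_sketch` is a hypothetical name rather than the library helper itself.

```python
def filter_spans_sketch(spans):
    """Drop duplicate and overlapping (start, end) spans, keeping longer ones.

    Assumption for this sketch: when spans overlap, the longest span wins,
    and among equally long spans the one starting first wins.
    """
    by_priority = sorted(
        set(spans), key=lambda span: (span[1] - span[0], -span[0]), reverse=True
    )
    kept = []
    seen_tokens = set()
    for start, end in by_priority:
        # A span that is no longer than an already kept span overlaps it
        # only if its first or last token is already covered.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


filtered = filter_spans_sketch([(0, 2), (1, 4), (1, 4), (5, 6)])
```

Here the duplicate `(1, 4)` is collapsed, `(0, 2)` is dropped because it overlaps the longer `(1, 4)`, and `(5, 6)` survives untouched.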
🔴 Bug fixes
- Fix issue #3307: Fix symlink creation to show error on Windows.
- Fix issue #3473: Fix GPU training for text classification.
- Fix issue #3475: Change favicon.
- Fix issue #3482: Add Estonian base support to documentation.
- Fix issue #3484: Ensure lemmatization is always consistent between sessions.
- Fix issue #3521: Add variations of contractions to English stop words.
- Fix issue #3523: Make `spacy convert` correctly default to `json`.
- Fix issue #3525, #3551, #3572: Fix problem that'd cause lemmas to not be lowercase.
- Fix issue #3531: Don't make `"settings"` or `"title"` required in displaCy data.
- Fix issue #3533: Remove non-existent example from docs.
- Fix issue #3546: Make sure path in `GoldParse.__del__` is a string.
- Fix issue #3549: Ensure match pattern error isn't raised on empty errors list.
- Fix issue #3561: Fix `DependencyParser.predict` docs.
- Fix issue #3598: Allow `jupyter=False` to override Jupyter mode in `displacy`.
- Fix issue #3620: Fix bug in `.iob` converter.
- Fix issue #3628: Relax `jsonschema` pin.
- Fix issue #3667: Fix offset bug in loading pre-trained word2vec.
- Fix issue #3679: Update glossary to include missing labels in `spacy.explain`.
- Fix issue #3680: Re-add missing universe README.
- Fix issue #3681: Rewrite information extraction example to use `Doc.retokenize`.
- Fix issue #3692: Fix return value in `Language.update` docs.
- Fix issue #3694: Make `"text"` in `spacy pretrain` optional when `"tokens"` is provided.
- Fix issue #3701: Improve `Token.prob` and `Lexeme.prob` docs.
- Fix issue #3708: Fix error in regex matcher examples.
- Fix issue #3713: Call `rmtree` and `copytree` with strings in `spacy train`.
- Fix issue #3720: Add version tag to `--base-model` argument in `spacy train` docs.
📖 Documentation and examples
- Add free interactive spaCy course.
- Fix various typos and inconsistencies.
- Add new projects to the spaCy universe.
👥 Contributors
Thanks to @svlandeg, @wannaphongcom, @Bharat123rox, @DuyguA, @SamuelLKane, @graus, @HiromuHota, @jeannefukumaru, @ivigamberdiev, @socool, @yvespeirsman, @lemontheme, @Dobita21, @w4nderlust, @pierremonico, @bryant1410, @celikomer, @xssChauhan, @kowaalczyk, @BreakBB, @fizban99, @tokestermw, @bjascob, @pickfire, @yaph, @amitness, @henry860916, @d5555, @BramVanroy, @F0rge1cE, @richardpaulhudson, @ldorigo, @aaronkub and @devforfu for the pull requests and contributions.
- Python
Published by ines about 7 years ago
spacy - v2.1.3: Improve sentencizer and serialization
✨ New features and improvements
- Allow customizing punctuation characters in sentencizer and make it serializable.
- Add new `"bow"` architecture for `TextCategorizer`, to do faster bag-of-words text classification.
🔴 Bug fixes
- Fix issue #3433, #3458: Fix mismatch of classes in parser after serialization.
- Fix issue #3464: Fix training loop in `train_textcat.py` example.
- Fix issue #3468: Make sentencizer set `Token.is_sent_start` correctly.
- Fix bug in the `"ensemble"` `TextClassifier` architecture that prevented the unigram bag-of-words submodel from working properly.
👥 Contributors
Thanks to @chkoar for the pull request!
- Python
Published by ines about 7 years ago
spacy - v2.1.2: Fixes to regex handling on Python 2 and tag map
🔴 Bug fixes
- Fix issue #3356: Fix handling of unicode ranges in regular expressions on Python 2.
- Fix issue #3432: Update `wasabi` to better handle non-UTF-8 terminals.
- Fix issue #3445: Update docs on `label` argument in `Span.__init__`.
- Fix issue #3455: Bring English `tag_map` in line with UD Treebank.
📖 Documentation and examples
- Add `--init-tok2vec` argument to `train_textcat.py` example.
- Fix various typos and inconsistencies.
- Python
Published by ines about 7 years ago
spacy - v2.1.1: Small GPU fixes
✨ New features and improvements
- Raise error if user is running a narrow unicode build.
- Move `ud_train`, `ud_evaluate` and other UD scripts from CLI to `/bin` in repo only.
- Improve accuracy of `spacy pretrain` by implementing cosine loss.
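The release notes do not spell out the exact formulation of the cosine loss. As a rough illustration only, the standard definition (one minus the cosine similarity of a predicted and a target vector) can be sketched in plain Python. The function name `cosine_loss` and the handling of zero vectors are assumptions made for this sketch, not spaCy's code.

```python
import math


def cosine_loss(predicted, target):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        # Assumption for this sketch: a zero vector counts as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_p * norm_t)


# Vectors pointing the same way give a loss near 0; orthogonal ones give 1.
same_direction = cosine_loss([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_loss([1.0, 0.0], [0.0, 1.0])
```

Because the loss depends only on direction, not magnitude, it does not penalise a prediction for having a different vector norm than the target.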
🔴 Bug fixes
- Fix issue #3421: Update docs and raise error for narrow unicode builds.
- Fix issue #3427: Correct mistake in French lemmatizer.
- Fix issue #3431: Make `Doc.vector` and `Doc.vector_norm` work as expected on GPU.
- Fix issue #3437: Fix installation problem on GPU.
- Fix issue #3439, #3446: Don't include UD scripts in `spacy.cli`.
👥 Contributors
Thanks to @mhham and @Bharat123Rox for the pull requests!
- Python
Published by ines about 7 years ago
spacy - v2.1.0: New models, ULMFit/BERT/Elmo-like pretraining, faster tokenization, better Matcher, bug fixes & more
⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
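For readers unfamiliar with the Levenshtein alignment mentioned in the list above, here is a minimal sketch of the underlying edit-distance computation over token sequences. It is a textbook dynamic-programming version with a rolling row, written for illustration; it is not spaCy's optimised implementation, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, keeping only one previous row in memory."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Gold tokens versus a tokenization that merged two of them.
gold = ["New", "York", "is", "big"]
predicted = ["NewYork", "is", "big"]
distance = levenshtein(gold, predicted)
```

The cost of this computation grows with the product of the two sequence lengths, which is why a faster alignment matters for long training documents.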
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
- Add `Vocab.writing_system` (populated via the language data) to expose settings like writing direction.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
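The idea of matching phrases on an attribute other than the exact text, such as the lowercase form, can be illustrated without spaCy. The following is a plain-Python sketch of the concept only; `find_phrase` and its `key` parameter are hypothetical names and say nothing about how `PhraseMatcher` is implemented.

```python
def find_phrase(tokens, phrase, key=str.lower):
    """Return (start, end) token offsets where phrase matches after applying key to both sides."""
    target = [key(t) for t in phrase]
    keyed = [key(t) for t in tokens]
    n = len(target)
    return [(i, i + n) for i in range(len(keyed) - n + 1) if keyed[i:i + n] == target]


tokens = ["I", "like", "NEW", "York", "and", "new", "york"]
# Comparing lowercase forms finds both spellings.
matches = find_phrase(tokens, ["New", "York"])
# Comparing the verbatim text finds neither, since the casing differs.
exact = find_phrase(tokens, ["New", "York"], key=lambda t: t)
```

Swapping the `key` function is the analogue of choosing which token attribute the matcher compares.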
🔴 Bug fixes
- Fix issue #795: Fix behaviour of `Token.conjuncts`.
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2091: Fix `displacy` support for RTL languages.
- Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2740: Add ability to pass additional arguments to pipeline components.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2869: Make `doc[0].is_sent_start == True`.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3036: Support mutable default arguments in extension attributes.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3191: Fix pickling of `Japanese`.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3274: Make `Token.sent` work as expected without the parser.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix issue #3346: Expose Japanese stop words in language class.
- Fix issue #3357: Update displaCy examples in docs to correctly show `Token.pos_`.
- Fix issue #3345: Fix NER when preset entities cross-sentence boundaries.
- Fix issue #3348: Don't use `numpy` directly for similarity.
- Fix issue #3366: Improve converters, training data formats and docs.
- Fix issue #3369: Fix `#egg` fragments in direct downloads.
- Fix issue #3382: Make `Doc.from_array` consistent with `Doc.to_array`.
- Fix issue #3398: Don't set extension attributes in language classes.
- Fix issue #3373: Merge and improve `conllu` converters.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes` now support a single `exclude` argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The `disable` argument on the `Language` serialization methods has been renamed to `exclude` for consistency.
  ```diff
  - nlp.to_disk("/path", disable=["parser", "ner"])
  + nlp.to_disk("/path", exclude=["parser", "ner"])
  - data = nlp.tokenizer.to_bytes(vocab=False)
  + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
  ```
- The `.pos` value for several common English words has changed, due to corrections to long-standing mistakes in the English tag map (see #593, #3311).
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce a `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'`; the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `is_sent_start` attribute of the first token in a `Doc` now correctly defaults to `True`. It previously defaulted to `None`.
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
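To make the newline-delimited JSON format mentioned in the list above more concrete, here is a small standard-library sketch that writes and reads one JSON object per line. The field names (`orth`, `prob`, `cluster`) and values are illustrative assumptions, not a specification of what `spacy init-model` expects; check the CLI documentation for the actual schema.

```python
import json

# Hypothetical lexical entries; field names and values are assumptions for illustration.
entries = [
    {"orth": "the", "prob": -3.5, "cluster": "1010"},
    {"orth": "apple", "prob": -9.2, "cluster": "0111"},
]


def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(entries)
round_trip = from_jsonl(jsonl_text)
```

Because each line is an independent JSON document, a file in this format can be streamed line by line instead of being parsed in one piece.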
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0 | 91.5 | 89.7 | 96.8 | 85.9 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0 | 91.8 | 90.0 | 96.9 | 86.6 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0 | 91.8 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0 | 90.7 | 88.6 | 96.3 | 83.1 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0 | 91.2 | 89.4 | 96.6 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0 | 90.4 | 87.3 | 96.9 | 89.5 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0 | 91.0 | 88.2 | 97.2 | 89.7 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0 | 89.1 | 85.9 | 80.4 | 88.9 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0 | 87.6 | 84.7 | 94.5 | 82.6 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0 | 89.1 | 86.4 | 95.3 | 83.1 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0 | 91.0 | 87.3 | 95.8 | 86.1 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0 | 83.7 | 77.6 | 91.6 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0 | 84.4 | 80.6 | 94.6 | 71.6 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0 | 88.3 | 85.0 | 96.6 | 81.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0 | - | - | - | 81.3 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
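As a reminder of what the UAS and LAS columns measure, the following sketch computes both scores from gold and predicted `(head_index, label)` pairs. It uses the common textbook definitions; the exact evaluation settings behind the table (for example, whether punctuation is excluded) are not stated here, and the function name `attachment_scores` is hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for two lists of (head_index, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * label_hits / total


gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
```

In this toy example three of four heads are correct (UAS 75.0) and two of four tokens have both the correct head and label (LAS 50.0), so LAS can never exceed UAS.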
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2, @adrienball and @Poluglottos for the pull requests and contributions.
- Python
Published by ines about 7 years ago
spacy - v2.1.0a13: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ This nightly release currently doesn't work on Python 2.7 on Windows, due to difficulties compiling our new matrix multiplication dependency `blis` in that environment. We expect this can be corrected in future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
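The cloze task referred to in the list above asks a model to recover tokens that were hidden from its input. The sketch below shows only the data-preparation step, in plain Python: it masks a random subset of tokens and records what was hidden. The function name, the `[MASK]` placeholder and the masking rate are assumptions for illustration; how `spacy pretrain` selects tokens and what it predicts for them is not specified here.

```python
import random


def make_cloze_example(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets), where targets maps masked positions to the original token."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_cloze_example(tokens, mask_rate=0.5)
```

A training objective would then score the model's predictions only at the positions stored in `targets`.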
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
- Add `Vocab.writing_system` (populated via the language data) to expose settings like writing direction.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
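The BILUO scheme behind the `gold.spans_from_biluo_tags` helper listed above marks each token as Beginning, Inside or Last of a multi-token entity, as a Unit (single-token) entity, or as Outside. The sketch below decodes such tags into `(start, end, label)` token offsets in plain Python. It illustrates the tagging scheme only: the function name `biluo_to_spans` is hypothetical, it returns tuples rather than `Span` objects, and it does not validate malformed tag sequences.

```python
def biluo_to_spans(tags):
    """Convert BILUO tags to (start, end, label) token spans; end is exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        # "I" tags simply continue the entity opened by the last "B".
    return spans


tags = ["O", "B-GPE", "L-GPE", "O", "U-ORG", "B-PERSON", "I-PERSON", "L-PERSON"]
spans = biluo_to_spans(tags)
```

The explicit `L` and `U` tags make entity ends unambiguous, so decoding needs only a single left-to-right pass.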
🔴 Bug fixes
- Fix issue #795: Fix behaviour of `Token.conjuncts`.
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2091: Fix `displacy` support for RTL languages.
- Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2740: Add ability to pass additional arguments to pipeline components.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2869: Make `doc[0].is_sent_start == True`.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3036: Support mutable default arguments in extension attributes.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3191: Fix pickling of `Japanese`.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3274: Make `Token.sent` work as expected without the parser.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix issue #3346: Expose Japanese stop words in language class.
- Fix issue #3357: Update displaCy examples in docs to correctly show `Token.pos_`.
- Fix issue #3345: Fix NER when preset entities cross-sentence boundaries.
- Fix issue #3348: Don't use `numpy` directly for similarity.
- Fix issue #3366: Improve converters, training data formats and docs.
- Fix issue #3369: Fix `#egg` fragments in direct downloads.
- Fix issue #3382: Make `Doc.from_array` consistent with `Doc.to_array`.
- Fix issue #3398: Don't set extension attributes in language classes.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes` now support a single `exclude` argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The `disable` argument on the `Language` serialization methods has been renamed to `exclude` for consistency.
  ```diff
  - nlp.to_disk("/path", disable=["parser", "ner"])
  + nlp.to_disk("/path", exclude=["parser", "ner"])
  - data = nlp.tokenizer.to_bytes(vocab=False)
  + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
  ```
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce a `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'`; the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `is_sent_start` attribute of the first token in a `Doc` now correctly defaults to `True`. It previously defaulted to `None`.
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
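The advice in the list above to batch merges can be motivated with a small, spaCy-free sketch. When all spans are expressed against the original tokenization and applied in a single pass, no offsets need to be recomputed between merges. The function `merge_spans` below is a hypothetical illustration of that idea over a plain list of strings, not the retokenizer's implementation.

```python
def merge_spans(tokens, spans):
    """Merge non-overlapping (start, end) spans in one pass; end is exclusive."""
    merged, position = [], 0
    for start, end in sorted(spans):
        if start < position:
            raise ValueError("spans must not overlap")
        merged.extend(tokens[position:start])        # copy tokens before the span
        merged.append(" ".join(tokens[start:end]))   # collapse the span into one token
        position = end
    merged.extend(tokens[position:])                 # copy the remaining tokens
    return merged


tokens = ["I", "live", "in", "New", "York", "City", "near", "Central", "Park"]
# Both spans refer to positions in the original token list.
result = merge_spans(tokens, [(7, 9), (3, 6)])
```

Merging one span at a time would instead shift every later index after each merge, which is both slower and easier to get wrong.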
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.
- Python
Published by ines about 7 years ago
spacy - v2.1.0a12: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
- Add `Vocab.writing_system` (populated via the language data) to expose settings like writing direction.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
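To make the rule-based NER entry above more concrete, here is a minimal sketch of the `EntityRuler` with one phrase pattern and one token pattern. The labels, patterns and example sentence are made up for illustration, and the ruler is called directly on a `Doc` rather than added to the pipeline, since the way components are registered may differ between spaCy versions.

```python
import spacy
from spacy.pipeline import EntityRuler

# A blank English pipeline: tokenizer only, no statistical model required.
nlp = spacy.blank("en")

# Illustrative patterns (assumptions for this sketch): a string pattern
# matches exact text, a list of dicts matches token by token.
ruler = EntityRuler(nlp)
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Explosion"},
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    ]
)

# Apply the ruler directly to a tokenized Doc.
doc = ruler(nlp.make_doc("Explosion opened an office in San Francisco"))
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Because the token pattern matches on `LOWER`, it should also match differently cased spellings such as "san francisco".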
🔴 Bug fixes
- Fix issue #795: Fix behaviour of `Token.conjuncts`.
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2091: Fix `displacy` support for RTL languages.
- Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2740: Add ability to pass additional arguments to pipeline components.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2869: Make `doc[0].is_sent_start == True`.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3036: Support mutable default arguments in extension attributes.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3191: Fix pickling of `Japanese`.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3274: Make `Token.sent` work as expected without the parser.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix issue #3346: Expose Japanese stop words in language class.
- Fix issue #3357: Update displaCy examples in docs to correctly show `Token.pos_`.
- Fix issue #3345: Fix NER when preset entities cross sentence boundaries.
- Fix issue #3348: Don't use `numpy` directly for similarity.
- Fix issue #3366: Improve converters, training data formats and docs.
- Fix issue #3369: Fix `#egg` fragments in direct downloads.
- Fix issue #3382: Make `Doc.from_array` consistent with `Doc.to_array`.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes` now support a single `exclude` argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The `disable` argument on the `Language` serialization methods has been renamed to `exclude` for consistency.
  ```diff
  - nlp.to_disk("/path", disable=["parser", "ner"])
  + nlp.to_disk("/path", exclude=["parser", "ner"])
  - data = nlp.tokenizer.to_bytes(vocab=False)
  + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
  ```
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `is_sent_start` attribute of the first token in a `Doc` now correctly defaults to `True`. It previously defaulted to `None`.
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
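The bulk-merging advice in the list above can be tried out with a blank pipeline. The following is a minimal sketch, assuming an installed spaCy version that provides `Doc.retokenize`; the example sentence and span offsets are made up for illustration. Both merges are queued inside a single `with` block, so the span indices refer to the original tokenization and the merges are applied together when the block exits.

```python
import spacy

# A blank English pipeline (tokenizer only) is enough to show merging.
nlp = spacy.blank("en")
doc = nlp("I live in New York City and work in San Francisco today")

# Queue both merges in one block; indices refer to the original tokens.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6])   # "New York City"
    retokenizer.merge(doc[9:11])  # "San Francisco"

tokens = [token.text for token in doc]
```

After the block exits, `tokens` holds nine entries, with "New York City" and "San Francisco" each represented by a single token.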
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.
- Python
Published by ines about 7 years ago
spacy - v2.1.0a11: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2740: Add ability to pass additional arguments to pipeline components.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2869: Make `doc[0].is_sent_start == True`.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3274: Make `Token.sent` work as expected without the parser.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix issue #3346: Expose Japanese stop words in language class.
- Fix issue #3357: Update displaCy examples in docs to correctly show `Token.pos_`.
- Fix issue #3345: Fix NER when preset entities cross sentence boundaries.
- Fix issue #3348: Don't use `numpy` directly for similarity.
- Fix issue #3366: Improve converters, training data formats and docs.
- Fix issue #3369: Fix `#egg` fragments in direct downloads.
- Fix issue #3382: Make `Doc.from_array` consistent with `Doc.to_array`.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes` now support a single `exclude` argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The `disable` argument on the `Language` serialization methods has been renamed to `exclude` for consistency.
  ```diff
  - nlp.to_disk("/path", disable=["parser", "ner"])
  + nlp.to_disk("/path", exclude=["parser", "ner"])
  - data = nlp.tokenizer.to_bytes(vocab=False)
  + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
  ```
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `is_sent_start` attribute of the first token in a `Doc` now correctly defaults to `True`. It previously defaulted to `None`.
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
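The `exclude` argument described in the list above can be exercised on a blank pipeline. This is a small sketch, assuming an installed spaCy version whose serialization methods accept `exclude`; the variable names are illustrative only.

```python
import spacy

# A blank pipeline keeps the example independent of any downloaded model.
nlp = spacy.blank("en")

# Serialize the tokenizer with and without the shared vocab.
with_vocab = nlp.tokenizer.to_bytes()
without_vocab = nlp.tokenizer.to_bytes(exclude=["vocab"])

# Compare payload sizes to see the effect of excluding a field.
sizes = {"with_vocab": len(with_vocab), "without_vocab": len(without_vocab)}
```

The same keyword is accepted by `to_disk`, `from_disk` and `from_bytes`, so one list of field names can be reused across all four methods.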
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.
- Python
Published by ines about 7 years ago
spacy - v2.1.0a10: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
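The Levenshtein alignment mentioned in the list above compares two token sequences, for example gold-standard tokens against the tokenizer's output. The following is only a textbook sketch of the underlying edit-distance computation in plain Python, not spaCy's optimised implementation; the function name `levenshtein` and the sample tokens are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between two sequences (strings or token lists).

    Textbook dynamic programming that keeps only the previous row,
    so memory grows with len(b) rather than len(a) * len(b).
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: the tokenizer over-segments "York" into two pieces.
gold = ["New", "York", "is", "big"]
pred = ["New", "Yo", "rk", "is", "big"]
distance = levenshtein(gold, pred)
```

Here `distance` counts one substitution and one insertion. An aligner would additionally record which operations were taken, which this sketch leaves out.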
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
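To illustrate the idea behind the `gold.spans_from_biluo_tags` helper listed above, here is a minimal, library-free sketch of turning BILUO tags into entity spans. The function name `biluo_to_spans` is hypothetical, and it returns plain `(start, end, label)` tuples rather than `Span` objects; it is not spaCy's implementation and skips the validation a real helper would need.

```python
def biluo_to_spans(tags):
    """Convert BILUO tags into (start, end, label) triples, end exclusive.

    B = begin, I = inside, L = last, U = unit (single token), O = outside.
    "-" is treated like "O" here (an assumption: it marks a missing value).
    """
    spans = []
    start = None  # index of the currently open "B" tag, if any
    for i, tag in enumerate(tags):
        if tag in ("O", "-"):
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            spans.append((start, i + 1, label))
            start = None
        # "I" tags simply continue the open span
    return spans


tags = ["O", "B-PERSON", "L-PERSON", "O", "U-GPE"]
spans = biluo_to_spans(tags)
```

With the sample tags, the two-token `PERSON` entity covers token indices 1 to 2 and the single-token `GPE` entity covers index 4.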
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2869: Make `doc[0].is_sent_start == True`.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3274: Make `Token.sent` work as expected without the parser.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `is_sent_start` attribute of the first token in a `Doc` now correctly defaults to `True`. It previously defaulted to `None`.
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
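The `--jsonl-loc` change above expects a newline-delimited JSON file with one lexical entry per line. As a rough sketch of that file shape only, the snippet below writes and re-reads such a file with Python's standard library. The field names (`orth`, `prob`, `cluster`) and their values are illustrative assumptions rather than a confirmed schema, and the helpers `write_jsonl` and `read_jsonl` are hypothetical names; check the `init-model` documentation for the keys it actually reads.

```python
import json
import os
import tempfile

# Hypothetical lexical entries: one dict per word (keys are assumptions).
entries = [
    {"orth": "the", "prob": -3.5, "cluster": 12},
    {"orth": "apple", "prob": -11.2, "cluster": 873},
]


def write_jsonl(path, records):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "vocab.jsonl")
write_jsonl(path, entries)
loaded = read_jsonl(path)
```

Because every line is a complete JSON object, the file can be streamed entry by entry instead of being parsed as one large document.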
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a9: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to `TextCategorizer`, and allow setting `exclusive_classes` and `architecture` arguments on initialization.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to `TextCategorizer`.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2329: Correct `TextCategorizer` and `GoldParse` API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in `displacy` NER visualization and correct API docs.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.
  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```
- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.
  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a8: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.
bash
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
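To illustrate what decoding BILUO tags into spans involves, here is a minimal pure-Python sketch. It is not spaCy's implementation of `gold.spans_from_biluo_tags`: the function name `biluo_to_spans` and the `(start, end, label)` tuple output are assumptions made for illustration, whereas the real helper returns `Span` objects. The sketch also does not check that the labels of `B-`, `I-` and `L-` tags agree.

```python
from typing import List, Tuple


def biluo_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BILUO tags to (start, end, label) token spans (end exclusive).

    Hypothetical helper for illustration only, not part of spaCy.
    """
    spans = []
    start = None  # index of the currently open B- tag, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None  # outside any entity
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":  # first token of a multi-token entity
            start = i
        elif prefix == "L" and start is not None:  # last token closes the span
            spans.append((start, i + 1, label))
            start = None
        # "I" tags simply continue the open entity
    return spans


tags = ["O", "B-GPE", "I-GPE", "L-GPE", "O", "U-ORG"]
spans = biluo_to_spans(tags)
print(spans)
```

Each tuple gives token offsets that could be used to slice a `Doc` into spans.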
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The deprecated `Doc.merge` and `Span.merge` methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the `Doc.retokenize` context manager and perform as many merges as possible together in the `with` block.

  ```diff
  - doc[1:5].merge()
  - doc[6:8].merge()
  + with doc.retokenize() as retokenizer:
  +     retokenizer.merge(doc[1:5])
  +     retokenizer.merge(doc[6:8])
  ```

- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The keyword argument `n_threads` on the `.pipe` methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce a `n_process` argument for parallel inference via multiprocessing.)
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'`; the name `'sbd'` is deprecated.

  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```

- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.

  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```

- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.

  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```

- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
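As a rough sketch of what a newline-delimited JSON file with one lexical entry per line could look like, the snippet below writes and reads such a file using only the standard library. The keys `"orth"`, `"prob"` and `"cluster"`, the values and the helper names `write_jsonl` and `read_jsonl` are assumptions for illustration; check the `init-model` documentation for the fields it actually expects.

```python
import json
import os
import tempfile

# Hypothetical lexical entries: the keys and values are assumed for
# illustration and are not taken from the spaCy documentation.
entries = [
    {"orth": "the", "prob": -3.5, "cluster": 2},
    {"orth": "apple", "prob": -9.1, "cluster": 14},
]


def write_jsonl(path, records):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "vocab.jsonl")
write_jsonl(path, entries)
loaded = read_jsonl(path)
```

Because every line is a complete JSON object, the file can be streamed line by line without loading the whole vocabulary into memory.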
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.
While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:
- Usage Guide: Rule-based Matching. How to use the `Matcher`, `PhraseMatcher` and the new `EntityRuler`, and write powerful components to combine statistical models and rules.
- Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
- Usage Guide: Merging and Splitting. How to retokenize a `Doc` using the new `retokenize` context manager and merge spans into single tokens and split single tokens into multiple.
- Universe: Videos and Podcasts
- API: `EntityRuler`
- API: `SentenceSegmenter`
- API: Pipeline functions
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a7: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
bash
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
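For readers unfamiliar with the cloze task mentioned in the pretraining item: some tokens are hidden, and the model is trained to predict them from their context. The sketch below only shows the data preparation step in plain Python. The function name `mask_tokens`, the `"[MASK]"` placeholder and the masking probability are assumptions for illustration and do not describe how `spacy pretrain` is implemented.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    mask: str = "[MASK]",
    seed: int = 0,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (inputs, targets) for a cloze-style objective.

    Hypothetical helper: masked positions get the placeholder in `inputs`
    and the original token in `targets`; other positions have target None.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask)
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    return inputs, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

A model trained on such pairs only receives a loss signal at the positions where `targets` is not `None`.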
Models & Language Data
- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: `gold.spans_from_biluo_tags` helper that returns `Span` objects, e.g. to overwrite the `doc.ents`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
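A short sketch of the `Doc.retokenize` context manager listed above, assuming spaCy is installed. It uses a blank English pipeline so no statistical model needs to be downloaded; the example sentence is made up. Merges are collected inside the `with` block and applied together when the block exits.

```python
import spacy

# A blank pipeline only contains the tokenizer, which is all we need here.
nlp = spacy.blank("en")
doc = nlp("I live in New York City today")

# Register all merges in one block; span indices refer to the original
# tokenization, and the merges are applied in bulk on exit.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])  # "I live"
    retokenizer.merge(doc[3:6])  # "New York City"

tokens = [token.text for token in doc]
print(tokens)
```

Grouping the merges this way avoids recomputing token indices after every individual merge.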
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace `regex` with `re` and speed up tokenization.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2833: Raise better error if `Token` or `Span` are pickled.
- Fix issue #2838: Add `Retokenizer.split` method to split one token into several.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of `nlp` in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix issue #3122: Correct docs of `Token.subtree` and `Span.subtree`.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix `PhraseMatcher` pickling and make `__len__` consistent.
- Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'`; the name `'sbd'` is deprecated.

  ```diff
  - sentence_splitter = nlp.create_pipe('sbd')
  + sentence_splitter = nlp.create_pipe('sentencizer')
  ```

- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.

  ```diff
  - $ spacy train en /output train_data.json dev_data.json --no-parser
  + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
  ```

- The `spacy init-model` command now uses a `--jsonl-loc` argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate `--freqs-loc` and `--clusters-loc`.

  ```diff
  - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
  + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
  ```

- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825 and @grivaz for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a6: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
bash
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
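The Levenshtein alignment item above refers to comparing two tokenizations of the same text. As a rough illustration of the underlying idea, here is the textbook dynamic-programming edit distance over token sequences, keeping only two rows in memory. This is a generic sketch and not spaCy's implementation; the function name `levenshtein` and the example tokens are made up.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty sequence
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


gold = ["New", "York", "-", "based"]
predicted = ["New", "York-based"]
distance = levenshtein(gold, predicted)
print(distance)
```

Memory use grows with the length of one sequence only, but the running time is still proportional to the product of both lengths, which is why long documents make a fast implementation important.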
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: Enhanced pattern API for rule-based `Matcher` (see #1971).
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1537: Make `Span.as_doc` return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1773: Prevent tokenizer exceptions from setting `POS` but not `TAG`.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #1963: Resize `Doc.tensor` when merging spans.
- Fix issue #1971: Update `Matcher` engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2396: Fix `Doc.get_lca_matrix`.
- Fix issue #2464, #3009: Fix behaviour of `Matcher`'s `?` quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #3012: Fix clobber of `Doc.is_tagged` in `Doc.from_array`.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in `Doc.to_array`.
- Fix issue #3093, #3067: Set `vectors.name` correctly when exporting model via CLI.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a6 | 91.5 | 89.6 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a6 | 91.9 | 90.2 | 97.0 | 86.4 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a6 | 92.0 | 90.2 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a6 | 91.6 | 89.6 | 97.2 | 83.3 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a6 | 92.2 | 90.3 | 97.5 | 83.9 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a6 | 90.3 | 87.3 | 97.0 | 89.0 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a6 | 90.9 | 88.1 | 97.2 | 89.3 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a6 | 89.4 | 86.0 | 80.4 | 89.1 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a6 | 87.7 | 84.8 | 94.5 | 82.9 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a6 | 89.1 | 86.5 | 95.1 | 83.4 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a6 | 90.9 | 87.2 | 95.9 | 86.4 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a6 | 83.7 | 77.6 | 91.5 | 87.1 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a6 | 85.0 | 81.5 | 94.8 | 73.1 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a6 | 88.4 | 85.2 | 96.6 | 81.0 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a6 | - | - | - | 81.6 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
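To make the UAS and LAS columns concrete, here is a minimal sketch of how unlabelled and labelled attachment scores are commonly computed. The function name `attachment_scores` and the `(head_index, label)` input format are assumptions for illustration, not spaCy's evaluation code, and the sketch ignores conventions such as excluding punctuation.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Each sequence holds one (head_index, dependency_label) pair per token.
    This is an illustrative helper, not part of spaCy.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head is correct, regardless of the label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label are correct.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

In this toy input, three of four heads match but only two tokens have both head and label right, so LAS is never higher than UAS.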
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin and @moreymat for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a5: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
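As an illustration of the `Doc.retokenize` context manager mentioned in this list, the following is a minimal sketch that merges two tokens into one. It assumes spaCy v2.1 or later is installed; the example words are made up, and the `Doc` is built by hand from a blank `Vocab` so no model download is needed.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc directly from a list of words (no statistical model required).
doc = Doc(Vocab(), words=["New", "York", "is", "big"])

# Merges are collected inside the context manager and applied together
# when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

tokens = [token.text for token in doc]
print(tokens)
```

Collecting merges and applying them on exit means token indices stay valid inside the block, even when several spans are merged.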
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom `Language` subclasses via entry points.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
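For the NER F column, a rough sketch of an entity-level F-score may help. It assumes exact-match scoring over `(start, end, label)` spans, which is a common convention; the helper name `entity_prf` and the sample spans are hypothetical and not taken from spaCy's scorer.

```python
def entity_prf(gold_spans, predicted_spans):
    """Return (precision, recall, F-score) for exact-match entity spans.

    Spans are (start, end, label) tuples. Illustrative helper only.
    """
    gold, pred = set(gold_spans), set(predicted_spans)
    # A prediction counts only if boundaries and label both match.
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = [(0, 2, "ORG"), (5, 6, "GPE"), (8, 9, "PERSON")]
pred = [(0, 2, "ORG"), (5, 6, "LOC")]
precision, recall, f_score = entity_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

Note that the second prediction has the right boundaries but the wrong label, so under exact matching it counts as both a false positive and a missed gold entity.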
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a4: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser, NER and Text Categorizer
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
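The Levenshtein alignment mentioned in this list compares the tokenizer's output with the gold-standard tokens. As a rough illustration of the underlying idea only, the sketch below computes a token-level edit distance with the textbook dynamic-programming recurrence; it is not spaCy's optimised implementation, and the function name and sample tokens are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute).

    Keeps only two rows of the table, so memory is O(len(b)).
    Textbook sketch, not spaCy's implementation.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


gold_tokens = ["I", "can", "not", "go"]
predicted_tokens = ["I", "cannot", "go"]
distance = levenshtein(gold_tokens, predicted_tokens)
print(distance)
```

The plain recurrence takes time proportional to the product of the two lengths, which is why long documents are costly to align without a faster implementation.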
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- Improve loading time of `French` by ~30%.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- NEW: Replace `ujson`, `msgpack`, `msgpack-numpy`, `pickle`, `cloudpickle` and `dill` with our own package `srsly` to centralise dependencies and allow binary wheels.
- NEW: `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in `EntityRuler` component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
- Accept `"TEXT"` as an alternative to `"ORTH"` in `Matcher` patterns.
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
- Use `black` for auto-formatting `.py` source and optimise codebase using `flake8`. You can now run `flake8 spacy` and it should return no errors or warnings. See `CONTRIBUTING.md` for details.
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1748, #1798, #2756, #2934: Make `TextCategorizer` default to a simpler, GPU-friendly model.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2648: Fix `KeyError` in `Vectors.most_similar`.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use `'sentencizer'` as built-in sentence boundary component name.
- Fix issue #2754, #3028: Make `NORM` a `Token` attribute instead of a `Lexeme` attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling `EntityRecognizer.add_label`.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as `B`, `L` or `U`.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #3027: Allow `Span` to take unicode value for `label` argument.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- The `Doc.print_tree` method is now deprecated in favour of a unified `Doc.to_json` method, which outputs data in the same format as the expected JSON training data.
- The built-in rule-based sentence boundary detector is now only called `'sentencizer'` – the name `'sbd'` is deprecated.
```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
- The `spacy train` command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like `--no-parser` to disable components. This is more flexible and also handles custom components out-of-the-box.
```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | 𐄂 | 3 MB |
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.0.18: Alpha support for Catalan and dependency fixes
✨ New features and improvements
- NEW: Alpha tokenization support for Catalan.
- Improve French tokenization.
- Fix `regex` pin to harmonise dependencies with conda.
- Fix `msgpack` pin.
- Update tests for `pytest` 4.0.
🔴 Bug fixes
- Fix issue #2933: Correct mistake in `is_ascii` documentation.
- Fix issue #2976: Fix bug where `Vocab.prune_vectors` did not use `batch_size`.
- Fix issue #2986: Correctly document when `Span.ents` was added.
- Fix issue #2995, #2996: Fix `msgpack` pin.
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @mpuig, @ALSchwalm, @bpben, @svlandeg and @wxv for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.1.0a3: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
⚠️ Due to difficulties linking our new `blis` for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new `spacy pretrain` command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in `spacy train`, using the new `-t2v` argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add `EntityRecognizer.labels` property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast `blis` kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
CLI
- NEW: `pretrain` command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow `PhraseMatcher` to match on token attributes other than `ORTH`, e.g. `LOWER` (for case-insensitive matching) or even `POS` or `TAG`.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
- Improved JSON(L) format for training (see #2928, #2932).
- `Doc.to_json()` method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- Refactor CLI and add `debug-data` command to validate training data (see #2932).
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2782: Make `like_num` work with prefixed numbers.
- Fix serialization of custom tokenizer if not all functions are defined.
- Fix bugs in beam-search training objective.
- Fix problems with model pickling.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a4 | 91.7 | 89.8 | 96.8 | 85.7 | 𐄂 | 12 MB |
| en_core_web_md | English | 2.1.0a4 | 92.0 | 90.1 | 97.0 | 86.2 | ✓ | 93 MB |
| en_core_web_lg | English | 2.1.0a4 | 92.1 | 90.3 | 97.0 | 86.5 | ✓ | 780 MB |
| de_core_news_sm | German | 2.1.0a4 | 91.9 | 89.8 | 97.2 | 83.4 | 𐄂 | 12 MB |
| de_core_news_md | German | 2.1.0a4 | 91.3 | 90.5 | 97.4 | 83.6 | ✓ | 212 MB |
| es_core_news_sm | Spanish | 2.1.0a4 | 90.1 | 87.1 | 96.8 | 89.3 | 𐄂 | 12 MB |
| es_core_news_md | Spanish | 2.1.0a4 | 90.7 | 87.8 | 97.1 | 89.4 | ✓ | 72 MB |
| pt_core_news_sm | Portuguese | 2.1.0a4 | 89.2 | 85.8 | 79.8 | 82.4 | 𐄂 | 14 MB |
| fr_core_news_sm | French | 2.1.0a4 | 87.2 | 84.0 | 94.4 | 67.0 ¹ | 𐄂 | 16 MB |
| fr_core_news_md | French | 2.1.0a4 | 88.8 | 86.0 | 94.9 | 70.0 ¹ | ✓ | 84 MB |
| it_core_news_sm | Italian | 2.1.0a4 | 90.6 | 87.0 | 96.0 | 81.7 | 𐄂 | 12 MB |
| nl_core_news_sm | Dutch | 2.1.0a4 | 83.1 | 77.2 | 91.3 | 87.3 | 𐄂 | 12 MB |
| el_core_news_sm | Greek | 2.1.0a4 | 84.2 | 80.4 | 94.6 | 71.5 | 𐄂 | 12 MB |
| el_core_news_md | Greek | 2.1.0a4 | 87.5 | 84.1 | 96.4 | 78.3 | ✓ | 128 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a4 | - | - | - | 83.2 | 𐄂 | 4 MB |
¹ We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas and @skrcode for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.0.17: Fix NER segfaults and various small issues
✨ New features and improvements
- Make `max_length` of input text inclusive.
- Raise error when setting overlapping entities as `doc.ents`.
- Improve French lemmatization and check if a word is in one of the regular lists specific to each part-of-speech tag.
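The overlap check on `doc.ents` can be pictured with a small standalone sketch. This is not spaCy's implementation: the function name `check_no_overlap` and the `(start, end, label)` tuple layout (token offsets, exclusive end) are assumptions made purely for illustration.

```python
def check_no_overlap(spans):
    """Raise ValueError if any two (start, end, label) spans share a token.

    Offsets are token indices with an exclusive end. This layout is a
    hypothetical one chosen for the sketch, not spaCy's internal format.
    """
    ordered = sorted(spans, key=lambda span: (span[0], span[1]))
    for previous, current in zip(ordered, ordered[1:]):
        # After sorting by start, an overlap exists exactly when a span
        # begins before the preceding span has ended.
        if current[0] < previous[1]:
            raise ValueError(f"Overlapping entities: {previous} and {current}")
    return ordered


entities = check_no_overlap([(3, 5, "ORG"), (0, 2, "PERSON")])
print(entities)
```

Comparing only neighbouring spans is enough here because, once the spans are sorted and non-empty, any overlapping pair implies that two adjacent spans overlap as well.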
🔴 Bug fixes
- Fix issue #1581, #1969, #1986: Fix out-of-bounds access in NER training that'd cause segmentation fault.
- Fix issue #2924: Prevent problem where `displacy` arcs would receive the same IDs in Jupyter notebooks, causing weirdly positioned arc labels.
- Fix issue #2948: Fix problem with symlink creation on Windows.
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Update spaCy Universe with new projects.
- Add example script showing a fix-up rule for whitespace entities like `'\n'`.
👥 Contributors
Thanks to @digest0r, @BramVanroy, @grivaz, @wannaphongcom, @mikelibg, @danielhers, @frascuchon, @mauryaland and @cicorias for the pull requests and contributions.
- Python
Published by ines over 7 years ago
spacy - v2.0.16: Fix msgpack-numpy pin
🔴 Bug fixes
- Fix `msgpack-numpy` pin, which could affect serialization on Python 2.7.
- Python
Published by ines over 7 years ago
spacy - v2.0.15: More wheels and GPU improvements
✨ New features and improvements
- Improve version compatibility to support wheels for all spaCy dependencies maintained by us: `thinc`, `cymem`, `preshed` and `murmurhash`.
- Support GPU installation by specifying `spacy[cuda]`, `spacy[cuda90]`, `spacy[cuda91]`, `spacy[cuda92]` or `spacy[cuda10]`, which will install `cupy` and `thinc_gpu_ops`.
- Add `spacy.prefer_gpu()` and `spacy.require_gpu()` functions.
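A minimal sketch of how the two GPU helpers might be used. The wrapper `pick_device` is a hypothetical name, and the import is guarded only so the snippet also runs on a machine without spaCy installed.

```python
try:
    import spacy  # optional here so the sketch also runs without spaCy
except ImportError:
    spacy = None


def pick_device():
    """Return "gpu" if spaCy could activate a GPU, otherwise "cpu"."""
    if spacy is None:
        return "cpu"
    # prefer_gpu() returns True when a GPU was activated and False when
    # spaCy stays on the CPU; require_gpu() would raise an error instead.
    return "gpu" if spacy.prefer_gpu() else "cpu"


device = pick_device()
print(f"Running on {device}")
```

Calling the helper before loading a model is the usual order, since the model's weights are allocated on whichever device is active at load time.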
📖 Documentation and examples
- Update GPU installation and usage docs.
- Python
Published by ines over 7 years ago
spacy - v2.0.13: Wheels, alpha support for Telugu and Sinhala, rule-based lemmatization for French and Greek, plus various small fixes
✨ New features and improvements
- NEW: Pre-built wheels and up to 10 times faster installation! This release starts the journey towards pre-built wheels for all of spaCy's dependencies. Once that's completed, you won't even need a local compiler anymore to install the library. For more details on our wheels process, see `explosion/wheelwright`.
- NEW: Alpha support for Telugu and Sinhala.
- NEW: Rule-based lemmatization for Greek and French.
- Port over Chinese support (#1210) from v1.x.
- Improve language data for Persian, Greek, Swedish, Bengali, Polish, Portuguese, Indonesian, French, German and Russian.
- Add `Span.ents` property for consistency with `Doc.ents`.
- Add `--verbose` option to `spacy train` to output more details for debugging.
🔴 Bug fixes
- Fix issue #653: Introduce bulk merge function.
- Fix issue #1445, #1917, #2209, #2362, #2371, #2383, #2501, #2743, #2758: Fix Keras examples.
- Fix issue #2261, #2800: Fix bug that could cause a crash with too many entity types.
- Fix issue #2540: Improve French stop words.
- Fix issue #2582, #2640, #2645, #2657, #2705, #2784, #2815, #2841, #2845: Fix typos and inconsistencies in documentation.
- Fix issue #2593: Prevent `numpy` warning.
- Fix issue #2706: Add missing label `FAC` to `spacy.explain` glossary.
- Fix issue #2709: Pass default option when calling `getoption()` in `conftest.py`.
📖 Documentation and examples
- Improve Keras examples.
- Update training examples to use minibatching.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DimaBryuhanov, @kororo, @AndriyMulyar, @katarkor, @giannisdaras, @bphi, @vikaskyadav, @sammous, @EmilStenstrom, @howl-anderson, @ohenrik, @aashishg, @aryaprabhudesai, @steve-prod, @njsmith, @aniruddha-adhikary, @pzelasko, @mbkupfer, @sainathadapa, @tyburam, @grivaz, @filipecaixeta, @aongko, @free-variation, @mauryaland, @pmj642, @keshan, @darindf, @charlax, @phojnacki, @skrcode, @jacopofar, @Cinnamy and @JKhakpour for the pull requests and contributions!
- Python
Published by ines over 7 years ago
spacy - v2.1.0a1: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
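For readers unfamiliar with Levenshtein alignment, the sketch below shows the textbook dynamic-programming edit distance between two tokenizations. It is an illustration of the underlying idea only, not the faster implementation mentioned above; the function name and the example sentences are assumptions.

```python
def levenshtein(source, target):
    """Return the edit distance between two token sequences.

    Uses two rows of the dynamic-programming table, so memory grows with
    the length of `target` only.
    """
    previous = list(range(len(target) + 1))
    for i, src_tok in enumerate(source, start=1):
        current = [i]
        for j, tgt_tok in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_tok != tgt_tok)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Two tokenizations of the same text that disagree on "can't".
predicted = ["I", "can't", "go"]
gold = ["I", "ca", "n't", "go"]
distance = levenshtein(predicted, gold)
print(distance)
```

The running time is proportional to the product of the two sequence lengths, which is why long documents make a naive alignment expensive.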
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
CLI
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
- Add support for multi-task objectives to `train` command.
- Add support for data-augmentation to `train` command.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
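The case-insensitive stop word behaviour can be illustrated with a few lines of plain Python. The tiny `STOP_WORDS` set and the `is_stop` helper are hypothetical stand-ins, not spaCy's language data or its attribute implementation.

```python
# Tiny illustrative list; spaCy ships much larger per-language sets.
STOP_WORDS = {"the", "and", "of"}


def is_stop(text, stop_words=STOP_WORDS):
    """Look the word up by its lowercased form, so casing is ignored."""
    return text.lower() in stop_words


flags = [is_stop(word) for word in ["The", "AND", "spaCy"]]
print(flags)
```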
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix serialization of custom tokenizer if not all functions are defined.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | English | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | English | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | German | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | German | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | Spanish | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | Spanish | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | Portuguese | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | French | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 1 | 𐄂 | 32 MB |
| fr_core_news_md | French | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 1 | ✓ | 100 MB |
| it_core_news_sm | Italian | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | Dutch | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| el_core_news_sm | Greek | 2.1.0a0 | 84.5 | 81.0 | 95.0 | 73.5 | 𐄂 | 27 MB |
| el_core_news_md | Greek | 2.1.0a0 | 87.7 | 84.7 | 96.3 | 80.2 | ✓ | 143 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |
1) We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.
- Python
Published by ines almost 8 years ago
spacy - v2.0.12: Greek, Arabic, Urdu, Tatar, improved language data, better model downloads & various compatibility and bug fixes
We had to release another update to the v2.0.x branch of spaCy to resolve a dependency issue, so we decided to also include and/or backport a bunch of features and fixes that were originally intended for v2.1.0 (see here for the nightly version).
✨ New features and improvements
- NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
- NEW: Mecab-based Japanese tokenization and lemmatization.
- NEW: Add Norwegian rule-based and lookup lemmatization.
- NEW: Add Danish lookup lemmatization based on the Den store danske SprogTeknologiske Ordbase, STO dataset, courtesy of The University of Copenhagen.
- NEW: Romanian lookup lemmatization.
- Improve language data for Polish, Turkish, French, Romanian, Swedish and Japanese.
- Improve case-sensitive lookup lemmatization in German.
- Add `Token.sent` property that returns the sentence `Span` the token is part of.
- Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Add `Doc.is_sentenced` property that returns `True` if sentence boundaries have been applied.
- Allow ignoring warnings by code via the `SPACY_WARNING_IGNORE` environment variable.
- Add `--silent` option to `info` command.
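As a rough picture of how ignoring warnings by code could work, the sketch below reads `SPACY_WARNING_IGNORE` and filters on it. The comma-separated format, the helper names `ignored_codes` and `should_warn`, and the example codes `W006` and `W008` are assumptions for illustration, not a description of spaCy's internals.

```python
import os


def ignored_codes(environ=os.environ):
    """Parse the variable into a set of warning codes.

    Assumes a comma-separated value such as "W006,W008".
    """
    raw = environ.get("SPACY_WARNING_IGNORE", "")
    return {code.strip() for code in raw.split(",") if code.strip()}


def should_warn(code, environ=os.environ):
    """Return False for codes the user asked to silence."""
    return code not in ignored_codes(environ)


# A plain dict stands in for the process environment in this demo.
demo_env = {"SPACY_WARNING_IGNORE": "W006, W008"}
print(sorted(ignored_codes(demo_env)))
print(should_warn("W006", demo_env))
```

In a real session the variable would be set in the shell before Python starts, so that it is already visible when the library is imported.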
🔴 Bug fixes
- Fix issue #1456: Pass additional arguments of `download` command to `pip` and check if model is already installed before downloading it.
- Fix issue #2191: Update `README` section on tests and dependencies.
- Fix issue #2194: Ensure that `Doc.noun_chunks_iterator` isn't `None` before calling it.
- Fix issue #2196: Return data in `cli.info` and add `silent` option.
- Fix issue #2200: Correct typo in `spacy package` command message.
- Fix issue #2210: Fix bug in Spanish noun chunks.
- Fix issue #2211, #2320: Resolve problem in `download` command and use `requests` library again.
- Fix issue #2219: Fix token similarity of single-letter tokens.
- Fix issue #2222, #2223: Fix typos in documentation and docstrings.
- Fix issue #2226: Use correct, non-deprecated merge syntax in `merge_ents`.
- Fix issue #2228: Fix deserialization when using `tensor=False` or `sentiment=False`.
- Fix issue #2238: Correct Swedish lookup lemmatization.
- Fix issue #2242: Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Fix issue #2266: Add `collapse_phrases` option to displaCy visualizer.
- Fix issue #2269: Fix `KeyError` by renaming `SP` to `_SP`.
- Fix issue #2304: Don't require `attrs` argument in `Doc.retokenize` and allow ints/unicode.
- Fix issue #2361: Escape HTML tags in `displacy.render`.
- Fix issue #2376: Improve `Matcher` examples and add section on using pipeline components.
- Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
- Fix issue #2452: Fix bug that would cause `displacy` arrows to only point in one direction.
- Fix issue #2477: Also allow `Span` objects in `displacy.render`.
- Fix issue #2490: Update Thinc's dependencies for Python 3.7 compatibility.
- Fix issue #2495: Fix loading tokenizer with custom prefix search.
- Fix issue #2514: Switch from `msgpack-python` to `msgpack` to hopefully prevent conda from downloading a two-year-old spaCy version when installing with the latest Anaconda distribution.
- Ensure that `Doc.is_tagged` is set correctly when using `Language.pipe`.
- Fix bug in `merge_noun_chunks` factory that would return `None` if `Doc` wasn't parsed.
- Explicitly require `pathlib` backport on Python 2 only.
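To make the IOB to BILUO fix (#2385) concrete, here is a minimal standalone conversion that handles multi-word entities. It is a sketch, not spaCy's converter: the function name `iob_to_biluo` is chosen for illustration, and the input is assumed to be well-formed IOB (every `I-` tag follows a `B-` or `I-` tag with the same label).

```python
def iob_to_biluo(tags):
    """Convert IOB tags such as "B-ORG", "I-ORG", "O" to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        # Split on the first dash only, so labels may contain dashes.
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag starts a multi-token entity or stands alone (U-).
            biluo.append(f"{'B' if continues else 'U'}-{label}")
        else:
            # An I- tag is either inside the entity or its last token (L-).
            biluo.append(f"{'I' if continues else 'L'}-{label}")
    return biluo


converted = iob_to_biluo(["B-ORG", "I-ORG", "I-ORG", "O", "B-GPE"])
print(converted)
```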
📖 Documentation and examples
- NEW: Edit and execute code examples in your browser – all across the documentation!
- NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
- NEW: Experimental rule-based `Matcher` Explorer demo – create token patterns interactively, test them against your text and copy-paste the Python pattern code.
- NEW: Document Cython API.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @mollerhoj, @howl-anderson, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @bellabie, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @ansgar-t, @mpszumowski, @91ns, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @DuyguA, @stefan-it, @Eleni170, @datascouting, @tjkemp, @x-ji, @giannisdaras, @kororo and @katarkor for the pull requests and contributions.
- Python
Published by ines almost 8 years ago
spacy - v2.1.0a0: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
```bash
pip install -U spacy-nightly
```
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
CLI
- NEW: New `ud-train` command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via `spacy download`.
- Pass additional arguments of `download` command to `pip` to customise installation.
- Improve `train` command by letting `GoldCorpus` stream data, instead of loading into memory.
- Improve `init-model` command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the `spacy vocab` command, which is now deprecated.
Other
- NEW: `Doc.retokenize` context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- Add warnings if `.similarity` method is called with empty vectors or without word vectors.
- Improve rule-based `Matcher` and add `return_matches` keyword argument to `Matcher.pipe` to yield `(doc, matches)` tuples instead of only `Doc` objects, and `as_tuples` to add context to the `Doc` objects.
- Make stop words via `Token.is_stop` and `Lexeme.is_stop` case-insensitive.
🚧 Under construction
This section includes new features and improvements that are planned for the stable `v2.1.x` release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based `Matcher` (see #1971).
- Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update `Lexeme` attributes on merge (see #2390).
- `md` and `lg` models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
🔴 Bug fixes
- Fix issue #1487: Add `Doc.retokenize()` context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol `Animacy_inan` and add `Animacy_nhum`.
- Fix issue #1865: Correct licensing of `it_core_news_sm` model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add `relcl` dependency label to symbols.
- Fix issue #2014: Make `Token.pos_` writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix serialization of custom tokenizer if not all functions are defined.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the `Matcher` API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the `Matcher` in `v2.1.x` may produce different results compared to the `Matcher` in `v2.0.x`.
- Also note that some of the model licenses have changed: `it_core_news_sm` is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
| Model | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 1 | 𐄂 | 32 MB |
| fr_core_news_md | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 1 | ✓ | 100 MB |
| it_core_news_sm | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| xx_ent_wiki_sm | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |
1) We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. `Token.tag_`). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA for the pull requests and contributions.
- Python
Published by ines almost 8 years ago
spacy - v2.0.11: Alpha Vietnamese support, fixes to vectors, improved errors and more
📊 Help us improve spaCy and take the User Survey 2018!
✨ New features and improvements
- NEW: Alpha Vietnamese support with tokenization via Pyvi.
- NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified `assert`s have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
- Improve language data for Polish.
- Tidy up dependencies and drop `six`, `html5lib`, `ftfy` and `requests`.
- Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the `beam_update_prob` entry in `nlp.parser.cfg`. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
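The random mixing of greedy and beam updates can be sketched as a coin flip per update. This is an illustration, not the parser's code: `choose_update` is a hypothetical helper, and it assumes `beam_update_prob` is the probability of taking the beam update (at the default of 0.5 the split is the same under either reading).

```python
import random


def choose_update(beam_update_prob, rng):
    """Pick the update type for one training update.

    Assumption for this sketch: beam_update_prob is the probability of
    doing a beam update; otherwise the update falls back to greedy.
    """
    return "beam" if rng.random() < beam_update_prob else "greedy"


rng = random.Random(0)  # seeded so the run is repeatable
choices = [choose_update(0.5, rng) for _ in range(10_000)]
greedy_share = choices.count("greedy") / len(choices)
print(f"greedy share: {greedy_share:.2f}")
```

Greedy updates are cheaper than beam updates, so replacing a share of them is where the efficiency gain described above would come from.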
🔴 Bug fixes
- Fix issue #1554, #1752, #2159: Fix `Token.ent_iob` after `Doc.merge()`, and ensure consistency in `Doc.ents`.
- Fix issue #1660: Fix loading of multiple vector models.
- Fix issue #1967: Allow entity types with dashes.
- Fix issue #2032: Fix accidentally quadratic runtime in `Vocab.set_vector`.
- Fix issue #2050: Correct mistakes in Italian lemmatizer data.
- Fix issue #2073: Make `Token.set_extension` work as expected.
- Fix issue #2100, #2151, #2181: Drop `six` and `html5lib` and prevent dependency conflict with TensorFlow / Keras.
- Fix issue #2101: Improve error message if token text is empty string.
- Fix issue #2121: Fix `Language.to_bytes` and pickling in Thinc.
- Fix issue #2156: Fix hashtag example in `Matcher` docs.
- Fix issue #2177: Don't raise error in `set_extension` if `getter` and `setter` are specified or if `default=None`, and add error if `setter` is specified with no `getter`.
📖 Documentation and examples
- Add example for TensorBoard's standalone embedding projector.
- Improve example for training a new entity type.
- Add formal `CITATION` for assigning a DOI via Zenodo.
👥 Contributors
Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.
- Python
Published by ines about 8 years ago