Recent Releases of spacy

spacy - v3.8.7: Python 3.13 support, Cython 3, centralize registry entries

In order to support Python 3.13, spaCy is now compiled with Cython 3. This changes the way types are handled at runtime: Cython 3 uses the from __future__ import annotations semantics, which stores types as strings at runtime. This difference caused problems for components registered within Cython files, as we rely on building Pydantic models from factory function signatures to do validation.
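The runtime difference can be illustrated with a minimal plain-Python sketch. This is not spaCy's validation code: the factory function make_component and its parameters are hypothetical, and the annotations are quoted by hand to emulate what string-based annotation semantics store at runtime.

```python
import inspect
import typing


# Hypothetical factory function. The annotations are written as strings to
# emulate what `from __future__ import annotations` stores at runtime.
def make_component(name: "str", threshold: "float" = 0.5) -> "dict":
    return {"name": name, "threshold": threshold}


# Reading the signature directly yields plain strings, not type objects.
raw = {
    param.name: param.annotation
    for param in inspect.signature(make_component).parameters.values()
}

# typing.get_type_hints evaluates the strings in the function's module
# namespace and returns the real types.
resolved = typing.get_type_hints(make_component)

print(raw)
print(resolved)
```

Code that builds a validation model from the raw signature sees only strings, so the annotations have to be resolved in a namespace where the referenced names are importable.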

To support Python 3.13, we therefore created a new module, spacy.pipeline.factories, which contains the factory function implementations. __getattr__ import shims have been added at the previous locations of these functions to preserve backwards compatibility.
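As a rough illustration of how a module-level __getattr__ shim works in general, the sketch below builds two throwaway modules in memory. The module names old_home and new_home and the function make_tagger are hypothetical stand-ins, not spaCy's actual modules or factories.

```python
import importlib
import sys
import types


def _make_tagger_impl(name):
    # Hypothetical implementation that now lives in the new module.
    return f"tagger:{name}"


# Hypothetical new home for the implementation.
new_home = types.ModuleType("new_home")
new_home.make_tagger = _make_tagger_impl
sys.modules["new_home"] = new_home

# Hypothetical old module that no longer defines make_tagger itself.
old_home = types.ModuleType("old_home")


def _shim_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "make_tagger":
        return getattr(importlib.import_module("new_home"), name)
    raise AttributeError(f"module 'old_home' has no attribute {name!r}")


old_home.__getattr__ = _shim_getattr
sys.modules["old_home"] = old_home

# Existing import statements that target the old location keep working.
from old_home import make_tagger

print(make_tagger("demo"))
```

Because the lookup is deferred until the name is actually requested, the old module does not need to import the new one eagerly.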

As well as moving the factories, the new implementation avoids import-time side-effects, by moving the actual calls to the decorator inside a function, which is executed once when the Language class is initialised.

A matching change has been made to the catalogue registry decorators. A new module spacy.registrations has been created that performs all the catalogue registrations. Moving these registrations away from the functions prevents these decorators from running at import time. This change was not necessary for the Python 3.13 support, but it means we no longer rely on any import-time side-effects, which will allow us to improve spaCy's import time and therefore CLI execution time. The change also makes maintenance easier as it's easier to find the implementations of different registry functions (this may help library users as well).
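The general pattern of moving registrations out of import time can be sketched as follows. All names here (REGISTRY, register, populate_registry, Pipeline, make_lowercaser) are hypothetical and only illustrate the idea; they are not the contents of spacy.registrations.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a function registry.
REGISTRY: Dict[str, Callable] = {}
_populated = False


def register(name: str, func: Callable) -> None:
    REGISTRY[name] = func


def make_lowercaser() -> Callable[[str], str]:
    # Plain implementation with no decorator, so importing it has no
    # side effects.
    return str.lower


def populate_registry() -> None:
    """Perform all registrations in one place, at most once."""
    global _populated
    if _populated:
        return
    register("lowercaser.v1", make_lowercaser)
    _populated = True


class Pipeline:
    # Hypothetical class that triggers registration when it is initialised.
    def __init__(self) -> None:
        populate_registry()
```

Importing this code leaves the registry empty; it is only filled the first time a Pipeline is created, and later instances reuse the same entries.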

- Python
Published by github-actions[bot] about 1 year ago

spacy - v3.8.6: Restore wheels, remove Python 3.13 compatibility

Restores support for wheels for ARM platforms, while correctly noting compatibility range.

- Python
Published by github-actions[bot] about 1 year ago

spacy - v3.8.3: Improve memory zone stability

Fix a bug in memory zones that occurred when non-transient strings were added to the StringStore inside a memory zone. The bug caused "string not found" errors in the morphological analyser when it was applied inside a memory zone.

- Python
Published by github-actions[bot] over 1 year ago

spacy - v3.8: Memory management for persistent services, numpy 2.0 support

Optional memory management for persistent services

Support a new context manager method Language.memory_zone(), to allow long-running services to avoid growing memory usage from cached entries in the Vocab or StringStore. Once the memory zone block ends, spaCy will evict Vocab and StringStore entries that were added during the block, freeing up memory. Doc objects created inside a memory zone block should not be accessed outside the block.

The current implementation disables population of the tokenizer cache inside the memory zone, resulting in some performance impact. The performance difference will likely be negligible if you're running a full pipeline, but if you're only running the tokenizer, it'll be much slower. If this is a problem, you can mitigate it by warming the cache first, by processing the first few batches of text without creating a memory zone. Support for memory zones in the tokenizer will be added in a future update.
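The warm-up mitigation could look roughly like the sketch below. The helper pipe_with_warmup and its warmup_batches parameter are hypothetical; the only spaCy calls assumed are nlp.memory_zone() and nlp.pipe() as described in these notes.

```python
from contextlib import nullcontext
from typing import Any, Iterable, Iterator, List


def pipe_with_warmup(
    nlp: Any, batches: Iterable[List[str]], warmup_batches: int = 2
) -> Iterator[List[str]]:
    """Yield token texts per batch, using a memory zone only after warm-up."""
    for i, batch in enumerate(batches):
        # The first few batches run outside a memory zone so that the
        # tokenizer cache can be populated.
        zone = nullcontext() if i < warmup_batches else nlp.memory_zone()
        with zone:
            # Extract plain strings inside the block, since Doc objects
            # should not be accessed after the memory zone ends.
            token_texts = [token.text for doc in nlp.pipe(batch) for token in doc]
        yield token_texts
```

How many warm-up batches are useful depends on the data; entries cached during warm-up are not evicted by later memory zones.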

The Language.memory_zone() context manager also checks for a memory_zone() method on pipeline components, so that components can perform similar memory management if necessary. None of the built-in components currently require this.
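The delegation to components can be pictured with a small plain-Python sketch. The classes and the pipeline_memory_zone helper are hypothetical, and the exact method signature spaCy expects from a component's memory_zone() is not given here, so this only illustrates the pattern.

```python
from contextlib import ExitStack, contextmanager
from typing import Iterator, List


class CachingComponent:
    """Hypothetical component that keeps a cache of strings."""

    def __init__(self) -> None:
        self.cache: List[str] = []

    @contextmanager
    def memory_zone(self) -> Iterator[None]:
        start = len(self.cache)
        try:
            yield
        finally:
            # Evict everything that was added while the zone was active.
            del self.cache[start:]


class PlainComponent:
    """Hypothetical component without a memory_zone() method."""


@contextmanager
def pipeline_memory_zone(components: list) -> Iterator[None]:
    # Enter the memory zone of every component that provides one. All of
    # them are exited together when the outer block ends.
    with ExitStack() as stack:
        for component in components:
            if hasattr(component, "memory_zone"):
                stack.enter_context(component.memory_zone())
        yield
```

Components without the method are simply skipped, which matches the statement that none of the built-in components currently need it.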

If your component needs to add non-transient entries to the StringStore or Vocab, you can pass the allow_transient=False flag to the Vocab.add() or StringStore.add() methods.

Example usage:

```python
import spacy
import json
from pathlib import Path
from typing import Iterator
from collections import Counter
import typer
from spacy.util import minibatch


def texts(path: Path) -> Iterator[str]:
    with path.open("r", encoding="utf8") as file:
        for line in file:
            yield json.loads(line)["text"]


def main(jsonl_path: Path) -> None:
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    batches = minibatch(texts(jsonl_path), 1000)
    for i, batch in enumerate(batches):
        print("Batch", i)
        with nlp.memory_zone():
            for doc in nlp.pipe(batch):
                for token in doc:
                    counts[token.text] += 1
    for word, count in counts.most_common(100):
        print(count, word)


if __name__ == "__main__":
    typer.run(main)
```

Numpy v2 compatibility

Numpy 2.0 isn't binary-compatible with numpy v1, so we need to build against one or the other. This release isolates the dependency change and has no other changes, to make things easier if the dependency change causes problems.

This dependency change was previously attempted in version 3.7.6, but dependencies within the v3.7 family of models resulted in some conflicts, and some packages depending on numpy v1 were incompatible with v3.7.6. I've therefore removed the 3.7.6 release and replaced it with this one, which increments the minor version.

Model packages no longer list spacy as a requirement

I've also made a change to the way models are packaged to make it easier to release more quickly. Previously spaCy models specified a versioned requirement on spacy itself. This meant that there was no way to increment the spaCy version and have it work with the existing models, because the models would specify they were only compatible with spacy>=3.7.0,<3.8.0. We have a compatibility table that allows spacy to see which models are compatible, but the models themselves can't know which future versions of spaCy they work with.
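To see why such a pinned range blocks a new minor version, here is a simplified sketch of the range check. The helpers parse_version and in_range are hypothetical and handle only plain numeric versions; real installers use a full version-specifier implementation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    # Simplified parser for plain numeric versions such as "3.7.5".
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# A model pinned to spacy>=3.7.0,<3.8.0 accepts any 3.7.x release but
# rejects 3.8.0, even if nothing relevant to the model has changed.
for candidate in ["3.7.0", "3.7.5", "3.8.0"]:
    print(candidate, in_range(candidate, "3.7.0", "3.8.0"))
```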

I've therefore added a flag --require-parent/--no-require-parent to the spacy package CLI, which controls whether the parent package (e.g. spaCy) should be listed as a requirement of the model. --require-parent is the default for v3.8, but this will change to --no-require-parent by default in v4. I've set --no-require-parent for the v3.8 models, so that further changes can be published that don't impact the models, without retraining the models or forcing users to redownload them.

- Python
Published by github-actions[bot] over 1 year ago

spacy - Optional memory management for persistent services

Support a new context manager method Language.memory_zone(), to allow long-running services to avoid growing memory usage from cached entries in the Vocab or StringStore. Once the memory zone block ends, spaCy will evict Vocab and StringStore entries that were added during the block, freeing up memory. Doc objects created inside a memory zone block should not be accessed outside the block.

The current implementation disables population of the tokenizer cache inside the memory zone, resulting in some performance impact. The performance difference will likely be negligible if you're running a full pipeline, but if you're only running the tokenizer, it'll be much slower. If this is a problem, you can mitigate it by warming the cache first, by processing the first few batches of text without creating a memory zone. Support for memory zones in the tokenizer will be added in a future update.

The Language.memory_zone() context manager also checks for a memory_zone() method on pipeline components, so that components can perform similar memory management if necessary. None of the built-in components currently require this.

If your component needs to add non-transient entries to the StringStore or Vocab, you can pass the allow_transient=False flag to the Vocab.add() or StringStore.add() methods.

Example usage:

```python
import spacy
import json
from pathlib import Path
from typing import Iterator
from collections import Counter
import typer
from spacy.util import minibatch


def texts(path: Path) -> Iterator[str]:
    with path.open("r", encoding="utf8") as file:
        for line in file:
            yield json.loads(line)["text"]


def main(jsonl_path: Path) -> None:
    nlp = spacy.load("en_core_web_sm")
    counts = Counter()
    batches = minibatch(texts(jsonl_path), 1000)
    for i, batch in enumerate(batches):
        print("Batch", i)
        with nlp.vocab.memory_zone():
            for doc in nlp.pipe(batch):
                for token in doc:
                    counts[token.text] += 1
    for word, count in counts.most_common(100):
        print(count, word)


if __name__ == "__main__":
    typer.run(main)
```

- Python
Published by github-actions[bot] over 1 year ago

spacy - v3.7.6: Depend on numpy 2.0

Numpy 2.0 isn't binary-compatible with numpy v1, so we need to build against one or the other. This release isolates the dependency change and has no other changes, to make things easier if the dependency change causes problems.

- Python
Published by github-actions[bot] almost 2 years ago

spacy - v3.7.6a: Test pypi release process

- Python
Published by github-actions[bot] almost 2 years ago

spacy - v3.7.5: Download sanitization, Typer compatibility, and a bugfix for linking gold entities

✨ New features and improvements

  • Sanitize direct download for spacy download (#13313).
  • Convert Cython properties to decorator syntax (#13390).
  • Bump Weasel pin to allow v0.4.x (#13409).
  • Improvements to the test suite (#13469, #13470).
  • Bump Typer pin to allow v0.10.0 and above (#13471).
  • Allow typing-extensions<5.0.0 for Python < 3.8 (#13516).

🔴 Bug fixes

  • #13400: Fix use_gold_ents behaviour for EntityLinker.

📖 Documentation and examples

  • Make the file name for code listings stick to the top (#13379).
  • Update the documentation of MorphAnalysis (#13433).
  • Typo fixes in the documentation (#13466).

👥 Contributors

@danieldk, @honnibal, @ines, @JoeSchiff, @nokados, @Paillat-dev, @rmitsch, @schorfma, @strickvl, @svlandeg, @ynx0

- Python
Published by svlandeg almost 2 years ago

spacy - v3.7.4: New textcat layers and fo/nn language extensions

✨ New features and improvements

  • Improve NumPy 2.0 compatibility (#13103).
  • Added language extensions for Faroese and Norwegian Nynorsk (#13116).
  • Add new TextCatReduce.v1 layer for text classification (#13181).
  • Add new TextCatParametricAttention.v1 layer for text classification (#13201).
  • Use build module for creating model packages by default (#13109).
  • Add support for code loading to the benchmark speed command (#13247).
  • Extend lexical attributes for English with more numericals (#13106).
  • Warn about reloading dependencies after downloading models (#13081).

🔴 Bug fixes

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @danieldk, @evornov, @honnibal, @ines, @lise-brinck, @ridge-kimani, @rmitsch, @shadeMe, @svlandeg

- Python
Published by danieldk over 2 years ago

spacy - v3.7.2: Fixes for APIs and requirements

✨ New features and improvements

  • Update __all__ fields (#13063).

🔴 Bug fixes

  • #13035: Remove Pathy requirement.
  • #13053: Restore spacy.cli.project API.
  • #13057: Support Any comparisons for Token and Span.

📖 Documentation and examples

  • Many updates for spacy-llm including Azure OpenAI, PaLM, and Mistral support.
  • Various documentation corrections.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @rmitsch, @svlandeg

- Python
Published by adrianeboyd over 2 years ago

spacy - v3.7.1: Bug fix for spacy.cli module loading

🔴 Bug fixes

  • Revert lazy loading of CLI module for spacy.info to fix availability of spacy.cli following import spacy (#13040).

👥 Contributors

@adrianeboyd, @honnibal, @ines, @svlandeg

- Python
Published by adrianeboyd over 2 years ago

spacy - v3.7.0: Trained pipelines using Curated Transformers and support for Python 3.12

This release drops support for Python 3.6 and adds support for Python 3.12.

✨ New features and improvements

  • Add support for Python 3.12 (#12979).
  • Use the new library Weasel for spaCy projects functionality (#12769).
    • All spacy project commands should run as before, just now they're using Weasel under the hood.
    • ⚠️ Remote storage is not yet supported for Python 3.12. Use Python 3.11 or earlier for remote storage.
  • Extend to Thinc v8.2 (#12897).
  • Extend transformers extra to spacy-transformers v1.3 (#13025).
  • Support registered vectors (#12492).
  • Add --spans-key option for CLI evaluation with spacy benchmark accuracy (#12981).
  • Load the CLI module lazily for spacy.info (#12962).
  • Add type stubs for spacy.training.example (#12801).
  • Warn for unsupported pattern keys in dependency matcher (#12928).
  • Language.replace_listeners: Pass the replaced listener and the tok2vec pipe to the callback in order to support spacy-curated-transformers (#12785).
  • Always use tqdm with disable=None to disable output in non-interactive environments (#12979).
  • Language updates:
    • Add left and right pointing angle brackets as punctuation to ancient Greek (#12829).
    • Update example sentences for Turkish (#12895).
  • Package setup updates:
    • Update NumPy build constraints for NumPy 1.25+ (#12839). For Python 3.9+, it is no longer necessary to set build constraints while building binary wheels.
    • Refactor Cython profiling in order to disable profiling for Python 3.12 in the package setup, since Cython does not currently support profiling for Python 3.12 (#12979).

📦 Trained pipelines updates

The transformer-based trf pipelines have been updated to use our new Curated Transformers library through the Thinc model wrappers and pipeline component from spaCy Curated Transformers.

⚠️ Backwards incompatibilities

  • Drop support for Python 3.6.
  • Drop mypy checks for Python 3.7.
  • Remove ray extra.
  • spacy project has a few backwards incompatibilities due to the transition to the standalone library Weasel, which is not as tightly coupled to spaCy. Weasel produces warnings when it detects older spaCy-specific settings in your environment or project config.
    • Support for the spacy_version configuration key has been dropped.
    • Support for the check_requirements configuration key has been dropped due to the deprecation of pkg_resources.
    • The SPACY_CONFIG_OVERRIDES environment variable is no longer checked. You can set configuration overrides using WEASEL_CONFIG_OVERRIDES.
    • Support for SPACY_PROJECT_USE_GIT_VERSION environment variable has been dropped.
    • Error codes are now Weasel-specific and do not follow spaCy error codes.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bdura, @connorbrinton, @danieldk, @davidberenstein1957, @denizcodeyaa, @eltociear, @evornov, @honnibal, @ines, @jmyerston, @koaning, @magdaaniol, @pdhall99, @ringohoffman, @rmitsch, @senisioi, @shadeMe, @svlandeg, @vinbo8, @wjbmattingly

- Python
Published by adrianeboyd over 2 years ago

spacy - v3.6.1: Support for Pydantic v2, find-function CLI and more

✨ New features and improvements

  • Allow Pydantic v2 using transitional v1 support (#12888).
  • Add find-function CLI for finding locations of registered functions (#12757).
  • Add extra spacy[cuda12x] for cupy-cuda12x (#12890).
  • Extend tests for init config and train CLI (#12173).
  • Switch from distutils to setuptools/sysconfig (#12853).

🔴 Bug fixes

  • #12817: Escape annotated HTML tags in displaCy span renderer.
  • #12857: Display model's full base version string in incompatibility warning.
  • #12882: Update <br> tags in displaCy.

📖 Documentation and examples

  • Various documentation corrections and updates.
  • New additions to spaCy Universe:

👥 Contributors

@adrianeboyd, @afriedman412, @arplusman, @bdura, @connorbrinton, @honnibal, @ines, @it176131, @pmbaumgartner, @rmitsch, @shadeMe, @svlandeg, @thomashacker, @victorialslocum, @x-tabdeveloping

- Python
Published by adrianeboyd almost 3 years ago

spacy - v3.6.0: New span finder component and pipelines for Slovenian

✨ New features and improvements

  • NEW: span_finder pipeline component to identify overlapping, unlabeled spans (#12507).
  • Language updates:
    • Add initial support for Malay (#12602).
    • Update Latin defaults to support noun chunks, update lexical/tokenizer defaults and add example sentences (#12538).
  • Add option to return scores separately keyed by component name with spacy evaluate --per-component, Language.evaluate(per_component=True) and Scorer.score(per_component=True) (#12540).
  • Support custom token/lexeme attribute for vectors (#12625).
  • Support spancat_singlelabel in spacy debug data CLI (#12749).
  • Typing updates for PhraseMatcher and SpanGroup (#12642, #12714).

🔴 Bug fixes

  • #12569: Require that all SpanGroup spans come from the current doc.

📦 Trained pipelines updates

We have added new pipelines for Slovenian that use the trainable lemmatizer and floret vectors.

| Package | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- |
| sl_core_news_sm | 96.9 | 82.1 | 62.9 |
| sl_core_news_md | 97.6 | 84.3 | 73.5 |
| sl_core_news_lg | 97.7 | 84.3 | 79.0 |
| sl_core_news_trf | 99.0 | 91.7 | 90.0 |

  • 🙏 Special thanks to @orglce for help with the new pipelines!

The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize "get" as a passive auxiliary.

The Danish pipeline da_core_news_trf has been updated to use vesteinn/DanskBERT with performance improvements across the board.

⚠️ Backwards incompatibilities

  • SpanGroup spans are now required to be from the same doc. When initializing a SpanGroup, there is a new check to verify that all added spans refer to the current doc. Without this check, it was possible to run into string store or other errors.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bdura, @danieldk, @davidberenstein1957, @diyclassics, @essenmitsosse, @honnibal, @ines, @isabelizimm, @jmyerston, @kadarakos, @KennethEnevoldsen, @khursani8, @ljvmiranda921, @rmitsch, @shadeMe, @svlandeg, @tomaarsen, @victorialslocum, @vin-ivar, @ZiadAmerr

- Python
Published by adrianeboyd almost 3 years ago

spacy - v3.5.4: Bug fixes for overrides with registered functions and sourced components with listeners

✨ New features and improvements

  • Extend Typer support to v0.9 (#12631).

🔴 Bug fixes

  • #12701: Fix issues with component names and listeners for sourced components.
  • #12623: Support overrides for registered functions in configs.

👥 Contributors

@adrianeboyd, @bdura, @honnibal, @ines, @svlandeg

- Python
Published by adrianeboyd almost 3 years ago

spacy - v3.2.6: Bug fixes for Pydantic and pip

This bug fix release is primarily to address Pydantic incompatibility with typing_extensions>=4.6.0.

✨ New features and improvements

  • Huge speed improvements for spancat, in particular on GPU (~10x-30x faster) (#12577).

🔴 Bug fixes

  • Add typing_extensions requirement due to Pydantic incompatibility with typing_extensions>=4.6.0.
  • Remove #egg from download URLs due to future deprecation in pip.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg

- Python
Published by adrianeboyd about 3 years ago

spacy - v3.3.3: Bug fixes for Pydantic and pip

This bug fix release is primarily to address Pydantic incompatibility with typing_extensions>=4.6.0.

✨ New features and improvements

  • Huge speed improvements for spancat, in particular on GPU (~10x-30x faster) (#12577).

🔴 Bug fixes

  • Add typing_extensions requirement due to Pydantic incompatibility with typing_extensions>=4.6.0.
  • Remove #egg from download URLs due to future deprecation in pip.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @kadarakos, @svlandeg

- Python
Published by adrianeboyd about 3 years ago

spacy - v3.5.3: Speed improvements, bug fixes and more

✨ New features and improvements

  • Huge speed improvements for spancat, in particular on GPU (~10x-30x faster) (#12577).
  • Improve speed for child operators (>+, >-, >++, >--) for the dependency matcher (#12528).
  • Improve loading speed for tokenizers with a large number of exceptions (#12553).
  • Support doc.spans for displaCy output in spacy benchmark accuracy / spacy evaluate (#12575).
  • Add MorphAnalysis.get(default=) argument for user-provided default values similar to dict (#12545).
  • Only perform vectors checks during initialization if there are sourced components (#12607).

🔴 Bug fixes

  • #12567: Remove #egg from download URLs due to future deprecation in pip.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @andyjessen, @bdura, @davidberenstein1957, @diyclassics, @honnibal, @ines, @kadarakos, @KennethEnevoldsen, @ljvmiranda921, @moxley01, @royashcenazi, @svlandeg, @tanloong, @victorialslocum

- Python
Published by adrianeboyd about 3 years ago

spacy - v3.5.2: Pretraining improvements, bug fixes for spans and spancat and more

✨ New features and improvements

  • Add support for floret vectors in spacy pretrain (#12435).
  • Save final model as model-last.bin for spacy pretrain (#12459).
  • Support Span input for displacy.parse_deps (#12477).
  • Extend support to CuPy 12.0 for cupy install extras.

🔴 Bug fixes

  • #12398: Fix entity linker failure on sentence-crossing entities.
  • #12405: Fix sentence indexing bug in Span.sents.
  • #12469: Fix scores attribute for spancat_singlelabel.
  • #12484: Fix Span.sents when the final sentence is the last token in a Doc.
  • #12486: Fix pickle for the ngram suggester.
  • #12493: Include Span.kb_id and Span.id strings in Doc and DocBin serialization.

📖 Documentation and examples

  • Various documentation corrections and updates.
  • New addition to spaCy Universe:

👥 Contributors

@adrianeboyd, @BLKSerene, @honnibal, @ines, @kadarakos, @prajakta-1527, @rmitsch, @shadeMe, @sloev, @svlandeg, @thomashacker, @willfrey

- Python
Published by adrianeboyd about 3 years ago

spacy - v3.5.1: spancat for multi-class labeling, fixes for textcat+transformers and more

✨ New features and improvements

  • NEW: spancat_singlelabel pipeline component for multi-class and non-overlapping span classification. The spancat_singlelabel component predicts at most one label for each suggested span and adds a new setting allow_overlap to restrict the output to non-overlapping spans (#11365).
  • Extend to mypy v1.0 (#12245).
  • Use transformer + CNN for efficient GPU textcat with spacy init config (#11900).
  • Support trainable lemmatizer in spacy debug data (#11419).
  • Add new operators to dependency matcher for left/right immediate child/parent nodes (>+, >-, <+, <-) (#12334).
  • Add spacy.PlainTextCorpusReader.v1 for plain text input (#12122).
  • Add alignment_mode and span_id to Span.char_span() (#12145, #12196).
  • Use string formatting types in logging calls (#12215).

🔴 Bug fixes

  • #12017: Improve speed for top_k>1 in trainable lemmatizer.
  • #12048: Make test_cli_find_threshold() test more robust.
  • #12227: Fix return type of registry.find().
  • #12272: Fix speed regression for Matcher patterns with extension attributes.
  • #12287: Add grc to languages with lexeme norms in spacy-lookups-data.
  • #12320: Make generation of empty KnowledgeBase instances configurable.
  • #12343: Fix error message for displacy auto_select_port.
  • #12347: Fix length check for knowledge base in entity linker, add InMemoryLookupKB.is_empty.
  • #12365: Fix types for Lexeme.orth and Lexeme.lower.
  • #12366: Raise error for non-default vectors with PretrainVectors.
  • #12368: Partially address pending deprecation of pkg_resources.
  • Various improvements and fixes for the test suite (#12148, #12157, #12210, #12303, #12372).

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @andyjessen, @danieldk, @essenmitsosse, @honnibal, @ines, @itssimon, @kadarakos, @kwhumphreys, @ljvmiranda921, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @shadeMe, @svlandeg, @tanloong, @thomashacker, @victorialslocum

- Python
Published by adrianeboyd about 3 years ago

spacy - v3.5.0: New CLI commands, language updates, bug fixes and much more

✨ New features and improvements

  • NEW: New apply CLI command to annotate new documents with a trained pipeline (#11376).
  • NEW: New benchmark CLI command to benchmark pipelines. The new benchmark speed subcommand measures the speed of a pipeline, the benchmark accuracy subcommand is a new alias for evaluate (#11902).
  • NEW: New find-threshold CLI command to identify an optimal threshold for classification models (#11280).
  • NEW: New FUZZY Matcher operator for fuzzy matches based on Levenshtein edit distance. In addition, the FUZZY and REGEX operators are now supported in combination with IN/NOT_IN. (#11359).
  • Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
  • Allow up to typer v0.7.x (#11720), mypy 0.990 (#11801) and typing_extensions v4.4.x (#12036).
  • New spacy.ConsoleLogger.v3 with expanded progress tracking (#11972).
  • Improved scoring behavior for textcat with spacy.textcat_scorer.v2 (#11696 and #11971) and spacy.textcat_multilabel_scorer.v2 (#11820).
  • Improved customizability of the knowledge base used for entity linking, with the default implementation being the new InMemoryLookupKB (#11268).
  • Optional before_update callback that is invoked at the start of each training step (#11739).
  • Improve performance of SpanGroup (#11380).
  • Improve UX around displacy.serve when the default port is in use (#11948).
  • Patch a security vulnerability in extracting tar files (#11746).
  • Add equality definition for vectors (#11806).
  • Allow interpolation of variables in directory names in projects (#11235).
  • Update default component configs to use the latest tok2vec version (#11618).
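For readers unfamiliar with the metric behind the new FUZZY operator, the following is the textbook dynamic-programming computation of Levenshtein edit distance. It is a generic sketch, not spaCy's implementation, and it says nothing about how spaCy chooses the allowed distance for a match.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```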

🔴 Bug fixes

  • #11382: Fix lookup behavior for the French and Catalan lemmatizers.
  • #11385: Ensure that downstream components can train properly on a frozen tok2vec or transformer layer.
  • #11762: Support local file system remotes for projects.
  • #11763: Raise an error when unsupported values are used for textcat.
  • #11834: Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
  • #12009: Fix a few typing issues for SpanGroup and Span objects.
  • #12098: Correctly handle missing annotations in the edit tree lemmatizer.

⚠️ Backwards incompatibilities and model updates

The following changes may require you to update code that is using the relevant functionality:

  • An error is now raised when unsupported values are given as input to train a textcat or textcat_multilabel model - ensure that values are 0.0 or 1.0 as explained in the docs.
  • As KnowledgeBase is now an abstract class, you should call the constructor of the new InMemoryLookupKB instead when you want to use spaCy's default KB implementation. If you've written a custom KB that inherits from KnowledgeBase, you'll need to implement its abstract methods, or alternatively inherit from InMemoryLookupKB instead.

The following changes may influence the output of your language pipeline or trained models:

  • Updates to language defaults:
    • Extended support for Slovenian (#11162).
    • Switch Russian and Ukrainian lemmatizers to pymorphy3 (#11345, #11811).
    • Support for editorial punctuation in Ancient Greek (#11426).
    • Update to Russian tokenizer exceptions (#11753).
    • Small fix in the list of Dutch stop words (#11997).
  • Updates to model defaults:
    • Use the latest tok2vec defaults in all components (#11618).
    • Improve the default attributes used for the textcat and textcat_multilabel components (#11698).
    • Update the default scorer for textcat and textcat_multilabel to fix a bug related to threshold for textcat and to make it possible to score multiple textcat/textcat_multilabel components in a single pipeline with custom scorers. If no custom scorers are used, the cat_p/r/f scores will now only reflect the final component's labels and performance (#11696, #11820).
    • Correct the token_acc score to report the intended measure (# correct tokens / # predicted tokens, the same as in spaCy v2). The token_acc scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The token_p/r/f scores should remain unchanged (#12073).
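The corrected ratio (# correct tokens / # predicted tokens) can be sketched in plain Python as follows. The helpers token_offsets and token_acc are hypothetical and compare tokens by character offsets with whitespace ignored; this is not spaCy's Scorer.

```python
from typing import List, Set, Tuple


def token_offsets(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Character offsets of each token, ignoring whitespace between tokens."""
    offsets = set()
    position = 0
    for token in tokens:
        offsets.add((position, position + len(token)))
        position += len(token)
    return offsets


def token_acc(gold: List[str], predicted: List[str]) -> float:
    """Fraction of predicted tokens whose boundaries match a gold token."""
    gold_spans = token_offsets(gold)
    predicted_spans = token_offsets(predicted)
    if not predicted_spans:
        return 0.0
    return len(gold_spans & predicted_spans) / len(predicted_spans)


# The gold tokenization splits the contraction, the prediction does not,
# so only "stop" counts as a correct predicted token.
print(token_acc(["do", "n't", "stop"], ["don't", "stop"]))
```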

The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:

  • From v4 onwards, we'll rename the master branch to main.

📦 Trained pipelines updates

  • The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
  • The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the spacy-transformers changes in the v1.2.0 release notes.

📖 Documentation and examples

👥 Contributors

@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx

- Python
Published by adrianeboyd over 3 years ago

spacy - v2.3.9: Compatibility with NumPy v1.24+

This release addresses future compatibility with NumPy v1.24+.

🔴 Bug fixes

  • #11940: Update for compatibility with NumPy v1.24+ integer conversions.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.0.9: Bug fixes and future NumPy compatibility

This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

🔴 Bug fixes

  • #11331, #11701: Clean up warnings in spaCy and its test suite.
  • #11845: Don't raise an error in displaCy for unset spans keys.
  • #11864: Add smart_open requirement and update deprecated options.
  • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
  • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
  • #11935: Restore missing error messages for beam search.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.1.7: Bug fixes and future NumPy compatibility

This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

🔴 Bug fixes

  • #10573: Remove Click pin following Typer updates.
  • #11331, #11701: Clean up warnings in spaCy and its test suite.
  • #11845: Don't raise an error in displaCy for unset spans keys.
  • #11860: Fix spancat for docs with zero suggestions.
  • #11864: Add smart_open requirement and update deprecated options.
  • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
  • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
  • #11935: Restore missing error messages for beam search.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.2.5: Bug fixes and future NumPy compatibility

This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

🔴 Bug fixes

  • #10573: Remove Click pin following Typer updates.
  • #11331, #11701: Clean up warnings in spaCy and its test suite.
  • #11845: Don't raise an error in displaCy for unset spans keys.
  • #11860: Fix spancat for docs with zero suggestions.
  • #11864: Add smart_open requirement and update deprecated options.
  • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
  • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
  • #11935: Restore missing error messages for beam search.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @polm, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.3.2: Bug fixes and future NumPy compatibility

This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

🔴 Bug fixes

  • #10911, #11194: Improve speed in precomputable_biaffine by avoiding concatenation.
  • #11276, #11331, #11701: Clean up warnings in spaCy and its test suite.
  • #11845: Don't raise an error in displaCy for unset spans keys.
  • #11860: Fix spancat for docs with zero suggestions.
  • #11864: Add smart_open requirement and update deprecated options.
  • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
  • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
  • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
  • #11935: Restore missing error messages for beam search.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.4.4: Bug fixes and future NumPy compatibility

This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

🔴 Bug fixes

  • #11845: Don't raise an error in displaCy for unset spans keys.
  • #11860: Fix spancat for docs with zero suggestions.
  • #11864: Add smart_open requirement and update deprecated options.
  • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
  • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
  • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
  • #11935: Restore missing error messages for beam search.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.4.3: Extended Typer support and bug fixes

✨ New features and improvements

  • Extend Typer support to v0.7.x (#11720).

🔴 Bug fixes

  • #11640: Handle docs with no entities in EntityLinker.
  • #11688: Restore custom doc extension values in Doc.to_json() for attributes set by getters.
  • #11706: Remove incorrect warning for pipeline_package.load().
  • #11735: Improve spacy project requirements checks for unsupported specifiers and requirements lines.
  • #11745: Revert modifications to spacy.load(disable=) that could enable currently disabled components.

👥 Contributors

@aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.4.2: Latin and Luganda support, Python 3.11 wheels and more

✨ New features and improvements

  • NEW: Luganda language support (#10847).
  • NEW: Latin language support (#11349).
  • NEW: spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
  • NEW: New operators for the DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
  • Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
  • Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
  • Support CuPy v11 and add extras for cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
  • Support custom attributes for tokens and spans in Doc.to_json() and Doc.from_json() (#11125).
  • Make the enable and disable options for spacy.load() more consistent (#11459).
  • Allow a single string argument for disable/enable/exclude for spacy.load() (#11406).
  • New --url flag for spacy info to print the direct download URL for a pipeline (#11175).
  • Add a check for missing requirements in the spacy project CLI (#11226).
  • Add a Levenshtein distance function (#11418).
  • Improvements to the spacy debug data CLI for spancat data (#11504).
  • Allow overriding spacy_version in spacy package metadata (#11552).
  • Improve the error message when using the wrong command for spacy project assets (#11458).
  • Ensure parent directories are created when storing the results of the spacy pretrain command (#11210).
  • Extend support to newer versions of natto-py for the ko extra (#11222).
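
The Levenshtein distance function added in #11418 measures the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that metric, for illustration only: the name `levenshtein_distance` is hypothetical, and this is not spaCy's implementation or its import path.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance between a and b (hypothetical helper, not spaCy's code)."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]
```

Keeping a single row rather than the full table bounds memory by the length of the second string, which is usually enough when only the distance itself is needed.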

📦 Trained pipelines updates

This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).

Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:

NAME             SPACY            VERSION
en_core_web_md   >=3.4.0,<3.5.0   3.4.1     ✔

🔴 Bug fixes

  • #11275: Fix Dutch noun chunks to skip overlapping spans.
  • #11276: Fix regex invalid escape sequences.
  • #11312: Better handling of unexpected types in SetPredicate.
  • #11460: Fix config validation failures caused by NVTX pipeline wrappers.
  • #11506: Avoid unwanted side effects in Doc.__init__.
  • #11540: Preserve missing entity annotation in augmenters.
  • #11592: Fix issues with DVC commands.
  • #11631: Fix initialization for pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.

⚠️ Backwards incompatibilities

  • If you're using a custom component that does not return a Doc type, an error will now be raised (#11424).
  • If you're using a dot in a factory name, an error is raised as this is not supported (#11336).

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman, @yasufumy

- Python
Published by adrianeboyd over 3 years ago

spacy - v2.3.8: Updates for Python 3.10 and 3.11

✨ New features and improvements

  • Updates and binary wheels for Python 3.10 and 3.11.

👥 Contributors

@adrianeboyd, @honnibal, @ines

- Python
Published by adrianeboyd over 3 years ago

spacy - v3.4.1: Fix compatibility with CuPy v9.x

🔴 Bug fixes

  • Fix issue #11137: Fix compatibility with CuPy v9.x.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic

- Python
Published by adrianeboyd almost 4 years ago

spacy - v3.4.0: Updated types, speed improvements and pipelines for Croatian

✨ New features and improvements

  • Support for mypy 0.950+ and pydantic v1.9 (#10786).
  • Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
  • Min/max {n,m} operator for Matcher patterns (#10981).
  • Language updates:
    • Improve tokenization for Cyrillic combining diacritics (#10837).
    • Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
  • Improved speed of vector lookups (#10992).
  • For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
  • Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
  • Improved speed of StringStore lookups (#10938).
  • Updated spacy project clone to try both main and master branches by default (#10843).
  • Added confidence threshold for named entity linker (#11016).
  • Improved handling of Typer optional default values for init_config_cli (#10788).
  • Added cycle detection in parser projectivization methods (#10877).
  • Added counts for NER labels in debug data (#10960).
  • Support for adding NVTX ranges to TrainablePipe components (#10965).
  • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).

📦 Trained pipelines updates

We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

| Package         | UPOS | Parser LAS | NER F |
| --------------- | ---: | ---------: | ----: |
| hr_core_news_sm | 96.6 |       77.5 |  76.1 |
| hr_core_news_md | 97.3 |       80.1 |  81.8 |
| hr_core_news_lg | 97.5 |       80.4 |  83.0 |

🙏 Special thanks to @gtoffoli for help with the new pipelines!

The English pipelines have new word vectors:

| Package        | Model Version |  TAG | Parser LAS | NER F |
| -------------- | ------------- | ---: | ---------: | ----: |
| en_core_web_md | v3.3.0        | 97.3 |       90.1 |  84.6 |
| en_core_web_md | v3.4.0        | 97.2 |       90.3 |  85.5 |
| en_core_web_lg | v3.3.0        | 97.4 |       90.1 |  85.3 |
| en_core_web_lg | v3.4.0        | 97.3 |       90.2 |  85.6 |

All CNN pipelines have been extended to add whitespace augmentation.

🔴 Bug fixes

  • Fix issue #10960: Support hyphens in NER labels.
  • Fix issue #10994: Fix horizontal spacing for spans in displaCy.
  • Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
  • Fix issue #11056: Don't use get_array_module in textcat.
  • Fix issue #11092: Fix vertical alignment for spans in displaCy.

🚀 Notes about upgrading from v3.3

  • Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
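
To make the behaviour change concrete, here is a plain-Python sketch of the two checks described above. `ToyToken` and both functions are hypothetical stand-ins written for this note; they do not use or reproduce spaCy's classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token; not spaCy's Token class."""
    text: str
    has_vector: bool


def doc_has_vector_v33(vocab_has_vectors: bool, tokens: List[ToyToken]) -> bool:
    # Behaviour described for v3.3: only the vocab is consulted.
    return vocab_has_vectors


def doc_has_vector_v34(vocab_has_vectors: bool, tokens: List[ToyToken]) -> bool:
    # Behaviour described for v3.4: at least one token must have a vector.
    return any(token.has_vector for token in tokens)


# A doc made up entirely of out-of-vocabulary tokens, and a mixed one.
oov_doc = [ToyToken("qwxzv", False), ToyToken("zzyqk", False)]
mixed_doc = [ToyToken("apple", True), ToyToken("qwxzv", False)]
```

Under these assumptions, a doc consisting only of out-of-vocabulary tokens is the case that changes: the v3.3-style check reports a vector whenever the vocab has vectors, while the v3.4-style check does not.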

📖 Documentation and examples

  • spaCy universe additions:
    • Aim-spacy: An Aim-based spaCy experiment tracker.
    • Asent: Fast, flexible and transparent sentiment analysis.
    • spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
    • spacy-report: Generates interactive reports for spaCy models.

👥 Contributors

@adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

- Python
Published by adrianeboyd almost 4 years ago

spacy - v3.3.1: New Span Ruler component, JSON (de)serialization of Doc, span analyzer and more

✨ New features and improvements

🔴 Bug fixes

  • Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted Doc objects.
  • Fix issue #10685: Fix serialization of SpanGroup objects that share the same name within one SpanGroups container.
  • Fix issue #10718: Remove debug print statements in walk_head_nodes to avoid acquiring the GIL.
  • Fix issue #10741: Make the StringStore.__getitem__ return type dependent on its parameter type.
  • Fix issue #10734: Support removal of overlapping terms in PhraseMatcher.
  • Fix issue #10772: Override SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
  • Fix issue #10817: Ensure that the term ROOT is in the glossary.
  • Fix issue #10830: Better errors for Doc.has_annotation and Matcher.
  • Fix issue #10864: Avoid pickling Doc inputs passed to Language.pipe().
  • Fix issue #10898: Fix schemas import in Doc.

⚠️ Backward incompatibilities

  • Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the component itself through the name attribute. For example, the following pipeline component:

    [components.transformer]
    factory = "transformer"
    name = "custom_transformer_name"

would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.
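
The rule can be sketched with the standard library alone: the component name comes from the section header, and a conflicting name value is ignored with a warning. `resolve_component_names` is a hypothetical helper written for this note. spaCy parses configs with its own machinery, so treat this as an illustration of the rule rather than the actual code path.

```python
import configparser
import json
import warnings

CONFIG_TEXT = """
[components.transformer]
factory = "transformer"
name = "custom_transformer_name"
"""


def resolve_component_names(config_text: str) -> dict:
    """Map each component section name to its factory, ignoring `name` overrides.

    Hypothetical helper for illustration; spaCy's own config handling differs.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    resolved = {}
    for section in parser.sections():
        if not section.startswith("components."):
            continue
        section_name = section.split(".", 1)[1]
        # Values are written as JSON literals, e.g. "transformer" with quotes.
        factory = json.loads(parser[section]["factory"])
        override = parser[section].get("name")
        if override is not None and json.loads(override) != section_name:
            warnings.warn(
                f"Ignoring name override {override} for component '{section_name}'"
            )
        resolved[section_name] = factory
    return resolved
```

Calling `resolve_component_names(CONFIG_TEXT)` keeps the component under the key `transformer` and emits one warning about the ignored override.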

👥 Contributors

@adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg

- Python
Published by danieldk almost 4 years ago

spacy - v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish

✨ New features and improvements

📦 Trained pipelines

v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

| Package         | Language | UPOS | Parser LAS | NER F |
| --------------- | -------- | ---: | ---------: | ----: |
| fi_core_news_sm | Finnish  | 92.5 |       71.9 |  75.9 |
| fi_core_news_md | Finnish  | 95.9 |       78.6 |  80.6 |
| fi_core_news_lg | Finnish  | 96.2 |       79.4 |  82.4 |
| ko_core_news_sm | Korean   | 86.1 |       65.6 |  71.3 |
| ko_core_news_md | Korean   | 94.7 |       80.9 |  83.1 |
| ko_core_news_lg | Korean   | 94.7 |       81.3 |  85.3 |
| sv_core_news_sm | Swedish  | 95.0 |       75.9 |  74.7 |
| sv_core_news_md | Swedish  | 96.3 |       78.5 |  79.3 |
| sv_core_news_lg | Swedish  | 96.3 |       79.1 |  81.1 |

🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

| Model           | v3.2 Lemma Acc | v3.3 Lemma Acc |
| --------------- | -------------: | -------------: |
| da_core_news_md |           84.9 |           94.8 |
| de_core_news_md |           73.4 |           97.7 |
| el_core_news_md |           56.5 |           88.9 |
| fi_core_news_md |              - |           86.2 |
| it_core_news_md |           86.6 |           97.2 |
| ko_core_news_md |              - |           90.0 |
| lt_core_news_md |           71.1 |           84.8 |
| nb_core_news_md |           76.7 |           97.1 |
| nl_core_news_md |           81.5 |           94.0 |
| pl_core_news_md |           87.1 |           93.7 |
| pt_core_news_md |           76.7 |           96.9 |
| ro_core_news_md |           81.8 |           95.5 |
| sv_core_news_md |              - |           95.5 |

🔴 Bug fixes

  • Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
  • Fix issue #9443: Fix Scorer.score_cats for missing labels.
  • Fix issue #9669: Fix entity linker batching.
  • Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
  • Fix issue #9904: Fix textcat loss scaling.
  • Fix issue #9956: Compare all Span attributes consistently.
  • Fix issue #10073: Add "spans" to the output of doc.to_json.
  • Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
  • Fix issue #10189: Allow Example to align whitespace annotation.
  • Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
  • Fix issue #10324: Fix Tok2Vec for empty batches.
  • Fix issue #10347: Update basic functionality for rehearse.
  • Fix issue #10394: Fix Vectors.n_keys for floret vectors.
  • Fix issue #10400: Use meta in util.load_model_from_config.
  • Fix issue #10451: Fix Example.get_matching_ents.
  • Fix issue #10460: Fix initial special cases for Tokenizer.explain.
  • Fix issue #10521: Stream large assets on download in spaCy projects.
  • Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
  • Fix issue #10551: Add automatic vector deduplication for init vectors.

🚀 Notes about upgrading from v3.2

  • To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
  • Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
  • Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
  • Doc.from_docs now includes Doc.tensor by default and supports exclusions through an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
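
The note on span ordering can be pictured as a tuple comparison over all four attributes. The sketch below uses a hypothetical `ToySpan` dataclass rather than spaCy's Span, and it compares label strings alphabetically, which is an assumption: the tie-breaking order inside spaCy may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ToySpan:
    """Hypothetical stand-in for a span; fields are compared in this order."""
    start: int
    end: int
    label: str = ""
    kb_id: str = ""


spans = [
    ToySpan(0, 2, label="ORG"),
    ToySpan(0, 2, label="GPE"),
    ToySpan(0, 1, label="PERSON"),
]
# Spans with identical offsets are now separated by label (then KB ID).
ordered = sorted(spans)
```

If your code relied on spans with the same offsets comparing as equal, this is the case to check after upgrading.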

📖 Documentation and examples

👥 Contributors

@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

- Python
Published by adrianeboyd about 4 years ago

spacy - v3.1.6: Workaround for Click/Typer issues

🔴 Bug fixes

  • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

- Python
Published by adrianeboyd about 4 years ago

spacy - v3.2.4: Workaround for Click/Typer issues

🔴 Bug fixes

  • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

👥 Contributors

@adrianeboyd, @honnibal, @ines

- Python
Published by adrianeboyd about 4 years ago

spacy - v3.2.3: Fix Tok2Vec for empty batches

🔴 Bug fixes

  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @honnibal, @ines

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more

🔴 Bug fixes

  • Fix issue #9593: Use metaclass to subclass errors for easier pickling.
  • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
  • Fix issue #9979: Fix type of Lexeme.rank.
  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.0.8: Fix Tok2Vec for empty batches

🔴 Bug fixes

  • Fix issue #10324: Fix Tok2Vec for empty batches.

👥 Contributors

@adrianeboyd, @danieldk, @honnibal, @ines

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.2.2: Improved NER and parser speeds, bug fixes and more

✨ New features and improvements

  • Improved parser and ner speeds on long documents (see technical details in #10019).
  • Support for spancat components in debug data.
  • Support for ENT_IOB as a Matcher token pattern key.
  • Extended and improved types for many classes.

🔴 Bug fixes

  • Fix issue #9735: Make floret murmurhash endian-neutral.
  • Fix issue #9738: Support string IOB values for ENT_IOB.
  • Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
  • Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
  • Fix issue #9979: Fix type for Lexeme.rank.
  • Fix issue #10026: Check for 0-size assets in spacy project.
  • Fix issue #10051: Consistently return scalars from similarity methods.
  • Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
  • Fix issue #10079: Fix label detection in debug data for components with custom names.
  • Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
  • Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
  • Fix issue #10143: Use simple suggester in spancat initialization.
  • Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
  • Fix issue #10192: Detect invalid package names in spacy package.
  • Fix issue #10223: Support mixed case in package names.
  • Fix issue #10234: Fix type in PhraseMatcher.
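
For readers unfamiliar with the runtime error mentioned in #9746, the general Python pattern looks like the sketch below. Both functions are hypothetical examples of the failure and a common remedy, not the actual changes made in spaCy.

```python
def drop_empty_values_unsafe(table: dict) -> None:
    # Deleting while iterating over the live dict raises RuntimeError.
    for key, value in table.items():
        if not value:
            del table[key]


def drop_empty_values_safe(table: dict) -> None:
    # Iterate over a snapshot of the items so the dict can be mutated freely.
    for key, value in list(table.items()):
        if not value:
            del table[key]
```

The snapshot costs one extra list, which is usually a reasonable price when a dict may be modified during the loop.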

📖 Documentation and examples

  • Various documentation updates.
  • New spaCy version tags in spaCy universe.
  • New Dockerfile for repeatable website builds and easier local development.
  • New additions to spaCy universe:
    • Augmenty: a text augmentation library
    • Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
    • spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
    • spacypdfreader: easy PDF to text to spaCy text extraction
    • textnets: text analysis with networks

👥 Contributors

@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more

✨ New features and improvements

  • NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce size of output docs.
  • NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
  • Support kb_id for entities in displaCy from Doc input.
  • Add Span.sents property for spans spanning over more than one sentence.
  • Add EntityRuler.remove to remove patterns by id.
  • Make the Tagger neg_prefix configurable.
  • Use Language.pipe in Language.evaluate for more efficient processing.
  • Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.

🔴 Bug fixes

  • Fix issue #9638: Make JsonlCorpus path optional again.
  • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
  • Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
  • Fix issue #9674: Fix language-specific factory handling in package CLI.
  • Fix issue #9694: Convert labels to strings for README in package CLI.
  • Fix issue #9697: Exclude strings from source vector checks.
  • Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
  • Fix issue #9722: Initialize parser from reference parse rather than aligned example.
  • Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.2.0: Registered scoring functions, Doc input, floret vectors and more

✨ New features and improvements

  • NEW: Registered scoring functions for each component in the config.
  • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
  • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
  • New overwrite config setting for entity_linker, morphologizer, tagger, sentencizer and senter.
  • New extend config setting for morphologizer that controls whether existing feature types are preserved.
  • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
  • New package spacy-loggers for additional loggers.
  • New Irish lemmatizer.
  • New Portuguese noun chunks and updated Spanish noun chunks.
  • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
  • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
  • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
  • LIKE_URL attribute includes the tokenizer URL pattern.
  • --n-save-epoch option for spacy pretrain.
  • Trained pipelines:
    • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
    • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
    • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
    • Universal Dependencies corpora updated to v2.8.
    • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
    • English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.
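
As background for the morph_micro_p/r/f scores listed above, micro-averaging pools true positives, false positives and false negatives across all features before computing precision, recall and F-score. The sketch below shows that standard definition in plain Python. The data layout and the `micro_prf` helper are assumptions made for illustration, not the internals of spaCy's Scorer.

```python
from typing import Dict, Set, Tuple

# One set of (token_index, value) pairs per morphological feature.
FeatureAnnotations = Dict[str, Set[Tuple[int, str]]]


def micro_prf(
    gold: FeatureAnnotations, pred: FeatureAnnotations
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-score pooled over all features."""
    tp = fp = fn = 0
    for feat in set(gold) | set(pred):
        gold_items = gold.get(feat, set())
        pred_items = pred.get(feat, set())
        tp += len(gold_items & pred_items)
        fp += len(pred_items - gold_items)
        fn += len(gold_items - pred_items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


gold = {"Number": {(0, "Sing"), (1, "Plur")}, "Tense": {(2, "Past")}}
pred = {"Number": {(0, "Sing"), (1, "Sing")}, "Tense": {(2, "Past")}}
scores = micro_prf(gold, pred)
```

Because counts are pooled, frequent features weigh more heavily in the micro scores than rare ones, which is the main difference from averaging per-feature scores.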

🔴 Bug fixes

  • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
  • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
  • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
  • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

  • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
  • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
  • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

- Python
Published by adrianeboyd over 4 years ago

spacy - v3.1.4: Python 3.10 wheels and support for AppleOps

✨ New features and improvements

  • NEW: Binary wheels for Python 3.10.
  • NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
  • GPU profiling with spacy.models_with_nvtx_range.v1.
  • Full mypy integration in the CI and many type fixes across the code base.
  • Added custom Protocol classes in ty.py to define behavior of pipeline components.
  • Support for entity linking visualization in displacy.
  • Allow overriding vars in spacy project assets.
  • Standalone train function to run the training from Python scripts just like the spacy train CLI.
  • Support for spacy-transformers>=1.1.0 with improved IO.
  • Support for thinc>=8.0.11 with improved gradient clipping.

🔴 Bug fixes

  • Fix issue #5507: Improve UX for multiprocessing on GPU.
  • Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
  • Fix issue #9244: Fix vectors for 0-length spans.
  • Fix issue #9247: Improve UX for the DocBin constructor.
  • Fix issue #9254: Allow unicode in a spacy project title.
  • Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
  • Fix issue #9305: Restore tokenization timing during evaluation.
  • Fix issue #9335: Sync vocab in vectors and sourced components.
  • Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
  • Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
  • Fix issue #9437: Improve UX around Doc object creation.
  • Fix issue #9465: Fix minor issues with convert CLI.
  • Fix issue #9500: Include .pyi files in the distributed package.

📖 Documentation and examples

  • Various updates to the documentation.
  • New additions to the spaCy universe:
    • deplacy: CUI-based dependency visualizer
    • ipymarkup: Visualizations for NER and syntax trees
    • PhruzzMatcher: Find fuzzy matches
    • spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
    • spaCyOpenTapioca: Entity Linking on Wikidata
    • spacy-clausie: Clause-based information extraction system
    • "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
    • "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

👥 Contributors

@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

- Python
Published by svlandeg over 4 years ago

spacy - v3.1.3: Bug fixes and UX updates

✨ New features and improvements

  • v3 of WandbLogger now supports optional run_name and entity parameters.
  • Improved UX when providing invalid pos values for a Doc or Token.

🔴 Bug fixes

  • Fix issue #9001: Pass alignments to Matcher callbacks.
  • Fix issue #9009: Include component factories in third-party dependencies resolver.
  • Fix issue #9012: Correct type of config in create_pipe.
  • Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
  • Fix issue #9033: Fix verbs list for French tokenizer exceptions.
  • Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
  • Fix issue #9074: Improve UX around repo and path arguments in spacy project.
  • Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
  • Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
  • Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

📖 Documentation and examples

  • Various updates to the documentation.
  • A few additions and updates to the spaCy universe.
  • Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

👥 Contributors

@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

- Python
Published by svlandeg over 4 years ago

spacy - v3.1.2: Improved spancat component and various bugfixes

✨ New features and improvements

  • NEW: Provide scores for the SpanCategorizer predictions.
  • NEW: Broader compatibility with type checkers thanks to .pyi stub files.
  • NEW: Auto-detect package dependencies in spacy package.
  • New INTERSECTS operator for the Matcher.
  • More debugging info for spacy project push and pull commands.
  • Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
  • The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
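
The new INTERSECTS operator is a set-style predicate: it holds when the token's values and the pattern's values share at least one element. The sketch below illustrates that semantics on plain Python sets, next to the subset and superset predicates. `check_set_predicate` is a hypothetical helper, and the spellings IS_SUBSET and IS_SUPERSET are assumed here from spaCy's Matcher documentation.

```python
def check_set_predicate(operator: str, token_values: set, pattern_values: set) -> bool:
    """Evaluate a set-style predicate on plain Python sets (illustrative only)."""
    if operator == "IS_SUBSET":
        return token_values <= pattern_values
    if operator == "IS_SUPERSET":
        return token_values >= pattern_values
    if operator == "INTERSECTS":
        # True as soon as the two sets share one element.
        return bool(token_values & pattern_values)
    raise ValueError(f"Unsupported operator: {operator}")


token_morph = {"Number=Plur", "Case=Nom"}
```

For example, `check_set_predicate("INTERSECTS", token_morph, {"Number=Plur", "Gender=Fem"})` holds because one value is shared, even though neither set contains the other.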

🔴 Bug fixes

  • Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
  • Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
  • Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
  • Fix issue #8796: Respect the no_skip value for spacy project run.
  • Fix issue #8810: Make ConsoleLogger flush after each logging line.
  • Fix issue #8819: Pass exclude when serializing the vocab.
  • Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
  • Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
  • Fix issue #8982: Add glossary entry for _SP.
  • Fix issue #9007: Fix span categorizer training on nested entities.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

- Python
Published by svlandeg almost 5 years ago

spacy - v3.0.7: Bug fixes and base support for Azerbaijani

✨ New features and improvements

  • Alpha tokenization support for Azerbaijani.
  • Updates for French stop words.

🔴 Bug fixes

  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7907: Update load_lookups return type and docstring.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8244: Use context manager when reading model file.
  • Fix issue #8245: Fix other open calls without context managers.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8396: Make JsonlReader path optional.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8584: Raise an error for textcat with <2 labels.
  • Fix issue #8551: Fix duplicate spacy package CLI opts.

👥 Contributors

@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

- Python
Published by adrianeboyd almost 5 years ago

spacy - v3.1.1: Support for Ancient Greek and various bug fixes

✨ New features and improvements

  • Alpha tokenization support for Ancient Greek.
  • Implementation of a noun_chunk iterator for Dutch.
  • Support for black & flake8 as pre-commit hooks.
  • New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component.
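
A minimal sketch of how the new suggester could be tried out directly from Python, assuming it is looked up in the misc registry under the name given above and accepts min_size and max_size arguments. The sample sentence is arbitrary; in a real project the suggester would normally be referenced from the spancat section of the training config instead.

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("Natural language processing")

# Look up the registered builder and create a suggester for 1- and 2-grams.
build_suggester = registry.misc.get("spacy.ngram_range_suggester.v1")
suggester = build_suggester(min_size=1, max_size=2)

# The suggester returns a Ragged array of (start, end) token offsets,
# with one length entry per input doc.
candidates = suggester([doc])
print(candidates.lengths)
print(candidates.data)
```

For a three-token doc this yields three unigram and two bigram candidates.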

🔴 Bug fixes

  • Fix issue #8638: Fix Azerbaijani initialization.
  • Fix issue #8639: Use 0-vector for OOV lexemes.
  • Fix issue #8640: Update lexeme ranks for loaded vectors.
  • Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
  • Fix issue #8663: Preserve existing meta information with spacy package.
  • Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

👥 Contributors

@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

- Python
Published by svlandeg almost 5 years ago

spacy - v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more

✨ New features and improvements

For more details, see the New in v3.1 usage guide.

📦 New trained pipelines

| Package | Language | UPOS | Parser LAS | NER F |
| --- | --- | ---: | ---: | ---: |
| ca_core_news_sm | Catalan | 98.2 | 87.4 | 79.8 |
| ca_core_news_md | Catalan | 98.3 | 88.2 | 84.0 |
| ca_core_news_lg | Catalan | 98.5 | 88.4 | 84.2 |
| ca_core_news_trf | Catalan | 98.9 | 93.0 | 91.2 |
| da_core_news_trf | Danish | 98.0 | 85.0 | 82.9 |

⚠️ Upgrading from v3.0

  • Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1, although you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
  • Use spacy init fill-config to update a v3.0 config for v3.1.
  • When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
  • Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.

For more information, see Notes on upgrading from v3.0.
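
Since logger warnings are now regular Python warnings, they can be managed with the standard library. The sketch below uses only warnings.filterwarnings and two made-up warning messages (the codes W998 and W999 are hypothetical, not real spaCy warnings) to show how a message-based filter silences one warning while keeping another. The spacy.errors.filter_warning helper named above is not exercised here.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record everything by default ...
    warnings.simplefilter("always")
    # ... except warnings whose message starts with the (hypothetical) code W999.
    warnings.filterwarnings("ignore", message=r"\[W999\]")

    warnings.warn("[W999] hypothetical warning to silence", UserWarning)
    warnings.warn("[W998] hypothetical warning to keep", UserWarning)

kept = [str(w.message) for w in caught]
print(kept)
```

The message argument is a regular expression matched against the start of the warning text, which is why the square brackets are escaped.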

🔴 Bug fixes

  • Fix issue #7036: Use a context manager when reading model.
  • Fix issue #7629: Fix scoring normalization.
  • Fix issue #7799: Ensure spacy ray command works.
  • Fix issue #7807: Show warning if entity ruler runs without patterns.
  • Fix issue #7886: Fix unknown tokens percentage in debug data.
  • Fix issue #7930: Make EntityLinker robust for nO=None.
  • Fix issue #7925: Skip vector ngram backoff if minn is not set.
  • Fix issue #7973: Fix debug model for transformers.
  • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
  • Fix issue #8004: Handle errors while multiprocessing.
  • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
  • Fix issue #8012: Fix ensemble textcat with listener.
  • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
  • Fix issue #8055: Handle partial entities in Span.as_doc.
  • Fix issue #8062: Make all Span attrs writable.
  • Fix issue #8066: Update debug data for textcat.
  • Fix issue #8069: Custom warning if DocBin is too large.
  • Fix issue #8099: Update Vietnamese tokenizer.
  • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
  • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
  • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
  • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
  • Fix issue #8169: Fix bug from EntityRuler: ent_ids returns None for phrases.
  • Fix issue #8208: Address missing config overrides post load of models.
  • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
  • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
  • Fix issue #8265: Address mypy errors.
  • Fix issue #8335: Raise error if deps not provided with heads in Doc.
  • Fix issue #8368: Preserve whitespace in Span.lemma_.
  • Fix issue #8388: Don't clobber vectors when loading components from source models.
  • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
  • Fix issue #8426: Fix setting empty entities in Example.from_dict.
  • Fix issue #8441: Add correct types for Language.pipe return values.
  • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
  • Fix issue #8559: Fix vectors check for sourced components.
  • Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

- Python
Published by adrianeboyd almost 5 years ago

spacy - v2.3.7: Bug fix for download CLI

🔴 Bug fixes

  • Fix issue #8286: Fix spacy download.

- Python
Published by adrianeboyd almost 5 years ago

spacy - v2.3.6: Bug fixes and base support for Amharic

✨ New features and improvements

  • Add base support for Amharic.
  • Add noun chunk iterator for Danish.
  • Updates to French, Portuguese and Romanian stop words.

🔴 Bug fixes

  • Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
  • Fix issue #6712: Prevent overlapping noun chunks for Spanish.
  • Fix issue #6745: Fix minibatch iterator when size iterator is finished.
  • Fix issue #6759: Skip 0-length matches in the Matcher.
  • Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
  • Fix issue #6772: Fix Span.text for empty spans.
  • Fix issue #6820: Improve Doc.char_span alignment_mode handling.
  • Fix issue #6857: Remove --no-cache-dir when downloading models.
  • Fix issue #8115: Fix offsets in Span.get_lca_matrix.

👥 Contributors

Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

- Python
Published by adrianeboyd about 5 years ago

spacy - v3.0.6: assemble CLI, Matcher alignments, training from streamed corpora and many bug fixes

✨ New features and improvements

  • New assemble CLI command for assembling a pipeline from a config without training.
  • Add support for match alignments in the Matcher to align matched tokens with matcher patterns.
  • Add support for training from streamed corpora.
  • Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
  • Extend Scorer.score_spans to support overlapping and unlabeled spans.
  • Update debug data for new v3 components.
  • Improve language data for Italian.
  • Various improvements to error handling and UX.
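
To illustrate the match alignments mentioned above, here is a small sketch that assumes the Matcher is called with with_alignments=True, in which case each match carries a list mapping every matched token to the index of the pattern entry it satisfied. The text and the pattern name GREETING are arbitrary.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("GREETING", [pattern])

doc = nlp("Hello, world")
# Each result is (match_id, start, end, alignments).
matches = matcher(doc, with_alignments=True)
match_id, start, end, alignments = matches[0]
print(doc[start:end].text, alignments)
```

With operators such as + in a pattern, the same pattern index can appear several times in the alignment list.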

🔴 Bug fixes

  • Fix issue #7408: Add vocab kwarg to spacy.load.
  • Fix issue #7419: Exclude user hooks in displacy conversion.
  • Fix issue #7421: Update --code usage in CLI commands.
  • Fix issue #7424: Preserve sent starts on retokenization without parse.
  • Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
  • Fix issue #7471: Improve warnings related to listening components.
  • Fix issue #7488: Fix upstream check in pretraining.
  • Fix issue #7489: Support callbacks entry points.
  • Fix issue #7497: Merge doc.spans in Doc.from_docs().
  • Fix issue #7528: Preserve user data for DependencyMatcher on spans.
  • Fix issue #7557: Fix __add__ method for PRFScore.
  • Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
  • Fix issue #7620: Fix replace_listeners in configs.
  • Fix issue #7626: Fix vectors data on GPU.
  • Fix issue #7630: Update NEL for entities crossing sentence boundaries.
  • Fix issue #7631: Fix parser sourcing in NER converter.
  • Fix issue #7642: Fix handling of hyphen string value in config files.
  • Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
  • Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
  • Fix issue #7690: Fix pickling of Lemmatizer.
  • Fix issue #7749: Update Tokenizer.explain for special cases in v3.
  • Fix issue #7755: Fix config parsing of ints/strings.
  • Fix issue #7836: Fix tokenizer cache flushing.
  • Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

📖 Documentation and examples

  • Add documentation for legacy functions and architectures.
  • Add documentation for pretrained pipeline design.
  • Add more details about pipe and multiprocessing.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!

- Python
Published by adrianeboyd about 5 years ago

spacy - v3.0.5: Bug fix for thinc requirement

🔴 Bug fixes

  • Fix related to issue #7075: Update thinc requirement for Jupyter notebook GPU warning.

- Python
Published by adrianeboyd about 5 years ago

spacy - v3.0.4: Fix tok2vec pretraining, source disabled components, better UX and bug fixes

✨ New features and improvements

  • Allow sourcing disabled components in config.
  • Support Doc.spans in Example.from_dict.
  • Improve transformer recommendations in quickstart widget and init config.
  • Improve language data for Bulgarian.
  • Various improvements to error handling and UX.
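
As a sketch of the Doc.spans support in Example.from_dict, the snippet below assumes span annotations are passed under a "spans" key as a dictionary of span group names to lists of (start_char, end_char, label) tuples. The group name "sc" and the labels are invented for the example.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
predicted = nlp.make_doc("Welcome to the Bank of China")

# Character offsets: "Bank of China" spans 15-28, "China" spans 23-28.
annotations = {"spans": {"sc": [(15, 28, "ORG"), (23, 28, "GPE")]}}
example = Example.from_dict(predicted, annotations)

spans = [(span.text, span.label_) for span in example.reference.spans["sc"]]
print(spans)
```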

🔴 Bug fixes

  • Fix issue #6952, #7285, #7289: Make tok2vec pretraining and pretrain command work as expected again.
  • Fix issue #7062: Only evaluate named entities for NEL if there is a corresponding gold span.
  • Fix issue #7065: Correctly handle sentence boundaries in Span.sent.
  • Fix issue #7071: Fix conll converter option.
  • Fix issue #7100: Re-add n_sents to entity linker and fix config handling and I/O.
  • Fix issue #7122: Fix displaCy output in evaluate CLI.
  • Fix issue #7127: Fix initialization of UkrainianLemmatizer.
  • Fix issue #7176: Re-refactor Sentencizer to use Pipe API.
  • Fix issue #7182: Allow SpanGroup import from spacy.tokens.
  • Fix issue #7204: Adjust Cython compilation for setups with custom include paths.
  • Fix issue #7222: Correct YAML formatting in quickstart recommendations for bg and bn.
  • Fix issue #7225: Fix spans weakref in Doc.copy.
  • Fix issue #7237: Fix is_cython_func for additional imported code.
  • Fix issue #7250: Fix patience for identical scores.
  • Fix issue #7329: Make spacy.orth_variants.v1 and spacy.lower_case.v1 augmenters work as expected.
  • Fix issue #7352: Sort EntityRuler.labels alphabetically.

📖 Documentation and examples

  • Add documentation for textcat_multilabel component.
  • Extend documentation for Vocab.get_noun_chunks.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @MartinoMensio, @SergeyShk, @R1j1t, @palandlom, @dardoria, @Tocic, @clippered, @graue70, @koaning and @jankrepl for the pull requests and contributions!

- Python
Published by ines about 5 years ago

spacy - v3.0.3: Bug fixes for sentence segmentation and config filling

🔴 Bug fixes

  • Fix issue #7035, #7056: Fix parser transition bug that could lead to incorrect sentence fragments.
  • Fix issue #7055: Preserve sourced components in init fill-config.

📖 Documentation and examples

  • Update spaCy Universe.

👥 Contributors

Thanks @MartinoMensio for the pull request!

- Python
Published by ines over 5 years ago

spacy - v3.0.2: CLI overrides and env variables in projects, base support for Setswana, PhraseMatcher for spans and bug fixes

✨ New features and improvements

  • NEW: Base support for Setswana.
  • The PhraseMatcher can now also be run on Span objects.
  • Support CLI overrides and environment variables in project.yml: a section env defines environment variable names that can be used in commands. The project run command now also supports CLI overrides, e.g. --vars.batch_size 128.
  • Reduce memory load when reading all vectors from file during initialization.
  • Update recommended transformers in training quickstart and init config CLI.
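
A short sketch of running the PhraseMatcher on a Span rather than a whole Doc, as described above. The texts and the CITY label are arbitrary, and as_spans=True is used so the result can be inspected as text without relying on offset conventions.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("CITY", [nlp.make_doc("New York"), nlp.make_doc("Berlin")])

doc = nlp("Berlin is nice but I live in New York now")
# Restrict matching to the second half of the sentence.
span = doc[3:]
found = [match.text for match in matcher(span, as_spans=True)]
print(found)
```

"Berlin" lies outside the span, so only "New York" should be reported.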

🔴 Bug fixes

  • Fix issue #6826: Ensure the loss value is cast to a float.
  • Fix issue #6891: Include noun_chunks when pickling Vocab.
  • Fix issue #6908: Fix expected type for textcat labels.
  • Fix issue #6924: Correctly pass vocab forward in spacy.blank.
  • Fix issue #6950: Allow pickling Tok2Vec with listeners.
  • Fix issue #6983: Ensure is_same_func works correctly for classes in component decorator.
  • Fix issue #7019: Correctly handle non-float/int values in spacy evaluate printer.
  • Fix issue #7029: Fix listener architecture with empty Doc in batch.

📖 Documentation and examples

  • Improve installation instructions.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @peter-exos, @KoichiYasuoka, @tarskiandhutch, @reneoctavio, @melonwater211, @mapmeld and @Shumie82 for the pull requests and contributions.

- Python
Published by ines over 5 years ago

spacy - v3.0.1: Bug fixes for transformer training

🔴 Bug fixes

  • Fix issue #6883: Fix bug in transformer training that raised the error "Cannot get dimension 'nO' for model 'transformer': value unset".

- Python
Published by adrianeboyd over 5 years ago

spacy - v3.0.0: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →

🚀 Quickstart

For the smoothest updating process, we recommend starting with a fresh virtual environment.

pip install -U spacy

✨ New features and improvements

  • Transformer-based pipelines with support for multi-task learning.
  • Retrained model families for 18+ languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
  • Retrained pipelines for all supported languages, plus new core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for the contributions!
  • New training workflow and config system.
  • Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
  • spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
  • Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
  • Parallel training and distributed computing with Ray.
  • New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
  • New and improved pipeline component API and decorators for custom components.
  • Source trained components from other pipelines in your training config.
  • Pre-built and more efficient binary wheels for all trained pipeline packages.
  • DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
  • Support for greedy patterns in Matcher.
  • New data structure SpanGroup for efficiently storing collections of potentially overlapping spans via Doc.spans.
  • Type hints and type-based data validation for custom registered functions.
  • Various new methods, attributes and commands.
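
The SpanGroup structure listed above can be sketched as follows. The group name "orgs" and the sentence are arbitrary; the point is that Doc.spans accepts overlapping spans, which Doc.ents does not.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China")

# Assigning a list of spans creates a SpanGroup under the given key.
# The two spans overlap on the token "China".
doc.spans["orgs"] = [doc[3:6], doc[5:6]]

texts = [span.text for span in doc.spans["orgs"]]
print(texts)
```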

📺 Video introductions & tutorials

  • spaCy v3: State-of-the-art NLP from Prototype to Production
  • spaCy v3: Design concepts explained (behind the scenes)
  • spaCy v3: Custom trainable relation extraction component

📦 Trained pipelines (58)

To download a trained pipeline, you can use the spacy download command. See the training documentation for details on how to train your own pipelines on your data.

| Name | Language | POS | TAG | LAS | UAS | NER | Sent | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| da_core_news_lg v3.0.0 | Danish | 0.97 | 0.97 | 0.78 | 0.82 | 0.82 | 0.88 | 547 MB |
| da_core_news_md v3.0.0 | Danish | 0.96 | 0.96 | 0.78 | 0.82 | 0.81 | 0.86 | 47 MB |
| da_core_news_sm v3.0.0 | Danish | 0.95 | 0.95 | 0.76 | 0.81 | 0.72 | 0.86 | 17 MB |
| de_core_news_lg v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.85 | 0.95 | 546 MB |
| de_core_news_md v3.0.0 | German | 0.98 | 0.98 | 0.91 | 0.93 | 0.84 | 0.95 | 47 MB |
| de_core_news_sm v3.0.0 | German | 0.98 | 0.97 | 0.90 | 0.92 | 0.82 | 0.94 | 18 MB |
| de_dep_news_trf v3.0.0 | German | 0.99 | 0.99 | 0.95 | 0.96 | n/a | 0.98 | 393 MB |
| el_core_news_lg v3.0.0 | Greek | 0.97 | 0.94 | 0.85 | 0.88 | 0.80 | 1.00 | 544 MB |
| el_core_news_md v3.0.0 | Greek | 0.96 | 0.93 | 0.84 | 0.87 | 0.79 | 1.00 | 42 MB |
| el_core_news_sm v3.0.0 | Greek | 0.94 | 0.91 | 0.81 | 0.85 | 0.72 | 1.00 | 12 MB |
| en_core_web_lg v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.86 | 0.89 | 742 MB |
| en_core_web_md v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.85 | 0.89 | 44 MB |
| en_core_web_sm v3.0.0 | English | n/a | 0.97 | 0.90 | 0.92 | 0.84 | 0.89 | 13 MB |
| en_core_web_trf v3.0.0 | English | n/a | 0.98 | 0.94 | 0.95 | 0.90 | 0.89 | 438 MB |
| es_core_news_lg v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 547 MB |
| es_core_news_md v3.0.0 | Spanish | 0.99 | 0.98 | 0.88 | 0.91 | 0.90 | 1.00 | 46 MB |
| es_core_news_sm v3.0.0 | Spanish | 0.98 | 0.97 | 0.87 | 0.90 | 0.89 | 1.00 | 17 MB |
| es_dep_news_trf v3.0.0 | Spanish | 0.99 | 0.98 | 0.93 | 0.95 | n/a | 0.97 | 395 MB |
| fr_core_news_lg v3.0.0 | French | 0.98 | 0.95 | 0.86 | 0.90 | 0.82 | 0.88 | 546 MB |
| fr_core_news_md v3.0.0 | French | 0.97 | 0.94 | 0.85 | 0.89 | 0.81 | 0.87 | 45 MB |
| fr_core_news_sm v3.0.0 | French | 0.96 | 0.93 | 0.84 | 0.88 | 0.79 | 0.85 | 16 MB |
| fr_dep_news_trf v3.0.0 | French | 0.99 | 0.96 | 0.92 | 0.94 | n/a | 0.94 | 381 MB |
| it_core_news_lg v3.0.0 | Italian | 0.98 | 0.97 | 0.88 | 0.91 | 0.89 | 0.97 | 545 MB |
| it_core_news_md v3.0.0 | Italian | 0.97 | 0.97 | 0.88 | 0.91 | 0.87 | 0.97 | 44 MB |
| it_core_news_sm v3.0.0 | Italian | 0.97 | 0.97 | 0.86 | 0.90 | 0.85 | 0.97 | 16 MB |
| ja_core_news_lg v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.72 | 0.98 | 531 MB |
| ja_core_news_md v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.72 | 0.99 | 41 MB |
| ja_core_news_sm v3.0.0 | Japanese | 0.96 | 0.97 | 0.90 | 0.92 | 0.64 | 0.99 | 12 MB |
| lt_core_news_lg v3.0.0 | Lithuanian | 0.96 | 0.89 | 0.68 | 0.75 | 0.80 | 0.82 | 545 MB |
| lt_core_news_md v3.0.0 | Lithuanian | 0.95 | 0.86 | 0.67 | 0.74 | 0.79 | 0.83 | 44 MB |
| lt_core_news_sm v3.0.0 | Lithuanian | 0.91 | 0.82 | 0.59 | 0.68 | 0.74 | 0.79 | 15 MB |
| mk_core_news_lg v3.0.0 | Macedonian | 0.93 | n/a | 0.51 | 0.68 | 0.76 | 0.73 | 312 MB |
| mk_core_news_md v3.0.0 | Macedonian | 0.93 | n/a | 0.51 | 0.67 | 0.75 | 0.73 | 44 MB |
| mk_core_news_sm v3.0.0 | Macedonian | 0.92 | n/a | 0.47 | 0.62 | 0.70 | 0.62 | 18 MB |
| nb_core_news_lg v3.0.0 | Norwegian | 0.97 | 0.97 | 0.87 | 0.89 | 0.85 | 0.94 | 547 MB |
| nb_core_news_md v3.0.0 | Norwegian | 0.97 | 0.97 | 0.87 | 0.90 | 0.85 | 0.93 | 44 MB |
| nb_core_news_sm v3.0.0 | Norwegian | 0.97 | 0.97 | 0.85 | 0.88 | 0.77 | 0.93 | 15 MB |
| nl_core_news_lg v3.0.0 | Dutch | 0.96 | 0.95 | 0.82 | 0.87 | 0.77 | 0.87 | 546 MB |
| nl_core_news_md v3.0.0 | Dutch | 0.96 | 0.95 | 0.82 | 0.87 | 0.74 | 0.87 | 45 MB |
| nl_core_news_sm v3.0.0 | Dutch | 0.95 | 0.93 | 0.80 | 0.85 | 0.72 | 0.86 | 16 MB |
| pl_core_news_lg v3.0.0 | Polish | 0.97 | 0.98 | 0.84 | 0.89 | 0.85 | 0.99 | 584 MB |
| pl_core_news_md v3.0.0 | Polish | 0.97 | 0.98 | 0.84 | 0.89 | 0.84 | 0.98 | 84 MB |
| pl_core_news_sm v3.0.0 | Polish | 0.95 | 0.98 | 0.79 | 0.86 | 0.80 | 0.98 | 55 MB |
| pt_core_news_lg v3.0.0 | Portuguese | 0.97 | 0.90 | 0.86 | 0.90 | 0.91 | 0.95 | 551 MB |
| pt_core_news_md v3.0.0 | Portuguese | 0.97 | 0.90 | 0.86 | 0.90 | 0.90 | 0.95 | 49 MB |
| pt_core_news_sm v3.0.0 | Portuguese | 0.97 | 0.89 | 0.85 | 0.89 | 0.88 | 0.92 | 21 MB |
| ro_core_news_lg v3.0.0 | Romanian | 0.96 | 0.97 | 0.84 | 0.89 | 0.77 | 0.96 | 546 MB |
| ro_core_news_md v3.0.0 | Romanian | 0.96 | 0.97 | 0.85 | 0.89 | 0.76 | 0.96 | 44 MB |
| ro_core_news_sm v3.0.0 | Romanian | 0.96 | 0.96 | 0.82 | 0.87 | 0.72 | 0.97 | 15 MB |
| ru_core_news_lg v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.95 | 1.00 | 491 MB |
| ru_core_news_md v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.94 | 1.00 | 41 MB |
| ru_core_news_sm v3.0.0 | Russian | 0.99 | 0.99 | 0.95 | 0.96 | 0.95 | 1.00 | 16 MB |
| xx_ent_wiki_sm v3.0.0 | MultiLanguage | n/a | n/a | n/a | n/a | 0.82 | n/a | 14 MB |
| xx_sent_ud_sm v3.0.0 | MultiLanguage | n/a | n/a | n/a | n/a | n/a | 0.86 | 10 MB |
| zh_core_web_lg v3.0.0 | Chinese | n/a | 0.90 | 0.66 | 0.71 | 0.71 | 0.75 | 577 MB |
| zh_core_web_md v3.0.0 | Chinese | n/a | 0.90 | 0.65 | 0.70 | 0.70 | 0.76 | 76 MB |
| zh_core_web_sm v3.0.0 | Chinese | n/a | 0.90 | 0.64 | 0.70 | 0.69 | 0.75 | 47 MB |
| zh_core_web_trf v3.0.0 | Chinese | n/a | 0.92 | 0.73 | 0.77 | 0.75 | 0.65 | 398 MB |

💬 Column legend:

  • TAG: Part-of-speech tags (fine-grained tags, i.e. Token.tag_).
  • POS: Part-of-speech tags (coarse-grained tags, i.e. Token.pos_).
  • UAS: Unlabelled dependencies (parser).
  • LAS: Labelled dependencies (parser).
  • NER: Named entities (F-score).
  • Sent: Sentence segmentation.
  • Size: Model file size (zipped archive).

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

  • Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
  • A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
  • The train, pretrain and debug data commands now only take a config.cfg.
  • Language.add_pipe now takes the string name of the component factory instead of the component function.
  • Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
  • The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
  • The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
  • Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
  • The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
  • The spacy.gold module has been renamed to spacy.training.
  • The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
  • The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
  • The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
  • Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
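
A few of the API changes above can be sketched together. The component name token_counter and the GREETING pattern are made up for illustration; the sketch shows a decorated custom component added by its string name, Matcher.add taking a list of patterns, and Doc.has_annotation replacing the old Doc flags.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher


# Custom components are registered with a decorator under a string name.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe takes the registered name, not the function itself.
nlp.add_pipe("token_counter")

matcher = Matcher(nlp.vocab)
# The second argument is a list of patterns; on_match is an optional keyword.
matcher.add("GREETING", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=None)

doc = nlp("Hello world again")
offsets = [(start, end) for _, start, end in matcher(doc)]
# A blank pipeline has no parser, so no dependency annotation is present.
is_parsed = doc.has_annotation("DEP")
print(doc.user_data["n_tokens"], offsets, is_parsed)
```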

Removed or renamed API

| Removed | Replacement |
| --- | --- |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
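
To make two of these replacements concrete, the sketch below builds an Example (the replacement for GoldParse) from a dictionary of annotations and converts character offsets to BILUO tags with the renamed helper in spacy.training. The sentence and the GPE label are arbitrary.

```python
import spacy
from spacy.training import Example, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp.make_doc("I like New York")

# "New York" covers characters 7-15.
entities = [(7, 15, "GPE")]

# Example pairs a predicted doc with reference annotations.
example = Example.from_dict(doc, {"entities": entities})
ents = [(ent.text, ent.label_) for ent in example.reference.ents]

# Formerly gold.biluo_tags_from_offsets.
tags = offsets_to_biluo_tags(doc, entities)
print(ents, tags)
```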

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

| Removed | Replacement |
| --- | --- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
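
Some of these replacements in use, as a small sketch on a blank English pipeline: Doc.retokenize in place of Doc.merge, Language.get_pipe in place of attribute access, Token.text in place of Token.string, and exclude=["vocab"] in place of vocab=False. The sentence is arbitrary.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Components are retrieved by name instead of via attributes like nlp.parser.
sentencizer = nlp.get_pipe("sentencizer")

# Doc.__init__ with a list of words replaces Doc.tokens_from_list.
doc = Doc(nlp.vocab, words=["I", "like", "New", "York"])

# Merging tokens goes through the retokenizer context manager.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4])
merged_text = doc[2].text

# Serialization exclusions are passed as a list of names.
data = nlp.to_bytes(exclude=["vocab"])
print(merged_text, len(doc), type(data).__name__)
```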

👥 Contributors

This release is brought to you by @honnibal, @ines, @svlandeg and @adrianeboyd. Thanks to @AMArostegui, @BramVanroy, @Cristianasp, @DeNeutoy, @DuyguA, @Jan-711, @KKsharma99, @KeshavG-lb, @KoichiYasuoka, @MartinoMensio, @Nuccy90, @PluieElectrique, @SamEdwardes, @Stannislav, @abchapman93, @alexcombessie, @alvaroabascar, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @bratao, @bryant1410, @buriy, @chopeen, @danielvasic, @delzac, @dhruvrnaik, @erip, @florijanstamenkovic, @forest1988, @gandersen101, @garethsparks, @graue70, @guadiromero, @hertelm, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jabortell, @jbesomi, @jenojp, @jganseman, @jgutix, @jmargeta, @jumasheff, @kuk, @leyendecker, @lizhe2004, @lorenanda, @mahnerak, @mikeizbicki, @myavrum, @nipunsadvilkar, @oculusrepairo, @ophelielacroix, @rahul1990gupta, @rameshhpathak, @rasyidf, @revuel, @richardliaw, @robertsipek, @snsten, @solarmist, @tamuhey, @thomasbird, @tiangolo, @tilusnet, @timgates42, @vha14, @walterhenry, @wannaphong, @werew, @yosiasz and @zaibacu for the pull requests and contributions!

- Python
Published by ines over 5 years ago

spacy - v3.0.0rc3: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.

⚠️⚠️⚠️ Make sure to retrain your models! ⚠️⚠️⚠️ This release includes changes to the config and model architectures, so if you've trained a custom pipeline with v3.0.0rc1 or v3.0.0rc2, you'll need to retrain it. We recommend using the new spaCy projects system to make it easy to re-run your training process. To auto-fill and update your configs, you can use the init fill-config command.

📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →

🚀 Quickstart

pip install -U spacy-nightly --pre

✨ New features and improvements

  • Transformer-based pipelines with support for multi-task learning.
  • Retrained model families for 18 languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
  • New core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for their contributions!
  • New training workflow and config system.
  • Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
  • spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
  • Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
  • Parallel training and distributed computing with Ray.
  • New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
  • New and improved pipeline component API and decorators for custom components.
  • Source trained components from other pipelines in your training config.
  • DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
  • Support for greedy patterns in Matcher.
  • Type hints and type-based data validation for custom registered functions.
  • Various new methods, attributes and commands.

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

  • Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
  • A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
  • The train, pretrain and debug data commands now only take a config.cfg.
  • Language.add_pipe now takes the string name of the component factory instead of the component function.
  • Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
  • The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
  • The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
  • Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
  • The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
  • The spacy.gold module has been renamed to spacy.training.
  • The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
  • The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
  • The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
  • Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
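The component-related changes above can be sketched as follows. This is a minimal illustration that assumes spaCy v3 is installed; the component name `token_counter` and the `n_tokens` key are hypothetical names chosen for the example.

```python
import spacy
from spacy.language import Language


# Register a stateless component under a string name.
# "token_counter" is a made-up name used only for this sketch.
@Language.component("token_counter")
def token_counter(doc):
    # Store the token count on the Doc; a component must return the Doc.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# v3 style: pass the registered string name, not the function object.
nlp.add_pipe("token_counter")

doc = nlp("This is a sentence.")
print(nlp.pipe_names, doc.user_data["n_tokens"])
```

Components that need settings or state would use @Language.factory instead, which receives the nlp object and the component name.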

Removed or renamed API

| Removed | Replacement |
| --- | --- |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

| Removed | Replacement |
| --- | --- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |
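A few of these replacements in use, as a rough sketch. It assumes spaCy v3 and a blank English pipeline; the sample text is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")

# Doc.merge / Span.merge were removed: merge spans inside the
# Doc.retokenize context manager instead.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# Token.string / Span.string were removed: read .text instead.
print([token.text for token in doc])

# Boolean keyword flags such as vocab=False were removed:
# list the parts to skip via exclude instead.
data = nlp.to_bytes(exclude=["vocab"])
```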

- Python
Published by ines over 5 years ago

spacy - v2.3.5: Bug fixes and simpler source installs

✨ New features and improvements

  • Modify blis and numpy build dependencies to simplify source installations.
  • Support cupy v8+ in combination with thinc v7.4.5.

🔴 Bug fixes

  • Fix issue #6443: Only set NORM on token in retokenizer.
  • Fix issue #6453: Add SPACY as a Matcher attribute.
  • Fix issue #6512: Add nlp.max_length check to nlp.pipe through nlp.make_doc.
  • Fix issue #6515: Add missing .pipe methods to Chinese, Japanese, Korean and Thai tokenizers.
  • Fix issue #6518: Fix subsequent pipe detection in EntityRuler.
  • Fix issue #6523: Remove non-working --use-chars from train CLI.

👥 Contributors

Thanks to @KoichiYasuoka for the pull requests and contributions.

- Python
Published by adrianeboyd over 5 years ago

spacy - v2.3.4: Fix beam parser API

🔴 Bug fixes

  • Fix issue #6446: Restore cleanup_beam method.

📖 Documentation and examples

  • Update rule-based matching docs.

👥 Contributors

Thanks to @jabortell for the pull requests and contributions.

- Python
Published by adrianeboyd over 5 years ago

spacy - v2.3.3: Alpha support for Macedonian and Sanskrit, updates for many languages and bug fixes

✨ New features and improvements

  • NEW: Add alpha support for Macedonian and Sanskrit.
  • Update language data for Croatian, Czech, English, Hebrew, Hindi, Indonesian, Swedish, Thai and Turkish.
  • Add support for aarch64 and ppc64le on Linux with binary packages available on conda-forge.

🔴 Bug fixes

  • Fix issue #5610: Make sure sys.argv exists.
  • Fix issue #5643: Add ent_id_ to strings serialized with Doc.
  • Fix issue #5727: Clarify warning for misaligned BILUO tags.
  • Fix issue #5768: Improve tag map initialization and updating.
  • Fix issue #5794: Improve warnings around normalization tables.
  • Fix issue #5796: Update invalid tag maps.
  • Fix issue #5799: Remove hard-coded GPU ID from pretrain.
  • Fix issue #5802: Mark Japanese documents as tagged.
  • Fix issue #5823: Fix typo in unit tests.
  • Fix issue #5838: Fix EntityRenderer to support line breaks (after last entity).
  • Fix issue #5843: Prefer earlier spans in EntityRuler.
  • Fix issue #5849: Allow Doc.char_span to snap to token boundaries.
  • Fix issue #5853: Fix span boundary handling in Spanish noun chunks.
  • Fix issue #5861: Add Span index boundary checks.
  • Fix issue #5904: Fix typos in comments.
  • Fix issue #5910: Update default sentencizer characters for Armenian, Greek and Arabic.
  • Fix issue #6014: Fix off-by-one error for best iteration calculation.
  • Fix issue #6112: Fix overlapping German noun chunks.
  • Fix issue #6148: Identify final Matcher pattern node by quantifier.
  • Fix issue #6164: Reorder so tag map is replaced only if a custom file is provided.
  • Fix issue #6218: Reproducibility for TextCategorizer and Tok2Vec.
  • Fix issue #6219: Add re-enabled pipe names back to the meta before serializing.
  • Fix issue #6300: Fix on_match callback and exclude empty match lists from results for DependencyMatcher.
  • Fix issue #6347: Memory leak issues with beam_parse (requires thinc>=7.4.3).
  • Fix issue #6373: Bugfix textcat reproducibility on GPU (requires thinc>=7.4.3).
  • Fix issue #6405: Add all vectors to vocab before pruning.
  • Fix issue #6413: Use int8_t instead of char in Matcher.

👥 Contributors

Thanks to @abchapman93, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @BramVanroy, @chopeen, @danielvasic, @delzac, @DuyguA, @erip, @florijanstamenkovic, @graue70, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jgutix, @KKsharma99, @leyendecker, @lizhe2004, @MartinoMensio, @nipunsadvilkar, @Nuccy90, @oculusrepairo, @rahul1990gupta, @rasyidf, @robertsipek, @SamEdwardes, @snsten, @solarmist, @Stannislav, @tamuhey, @tilusnet, @vha14, @wannaphong, @zaibacu for the pull requests and contributions.

- Python
Published by adrianeboyd over 5 years ago

spacy - v3.0.0rc1: Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more

🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.

🚀 Quickstart

```bash
pip install -U spacy-nightly --pre
```

✨ New features and improvements

  • Transformer-based pipelines with support for multi-task learning.
  • Retrained model families for 16 languages and 52 trained pipelines in total, including 6 transformer-based pipelines.
  • New training workflow and config system.
  • Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
  • spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
  • Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
  • Parallel training and distributed computing with Ray.
  • New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
  • New and improved pipeline component API and decorators for custom components.
  • Source trained components from other pipelines in your training config.
  • DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
  • Support for greedy patterns in Matcher.
  • Type hints and type-based data validation for custom registered functions.
  • Various new methods, attributes and commands.

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

  • Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
  • A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
  • The train, pretrain and debug data commands now only take a config.cfg.
  • Language.add_pipe now takes the string name of the component factory instead of the component function.
  • Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
  • The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
  • The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
  • Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
  • The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
  • The spacy.gold module has been renamed to spacy.training.
  • The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
  • The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
  • The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
  • Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
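The matcher and training-data changes in this list might look like this in practice. This is a sketch that assumes spaCy v3; the pattern name `CITY` and the sample sentence are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("I like London")

# Matcher.add: all patterns go into one list as the second argument,
# and on_match is an optional keyword argument.
matcher = Matcher(nlp.vocab)
matcher.add("CITY", [[{"TEXT": "London"}]], on_match=None)
matches = [
    (nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in matcher(doc)
]
print(matches)

# Example replaces GoldParse: it pairs a predicted Doc with
# reference annotations given as a dictionary.
example = Example.from_dict(doc, {"entities": [(7, 13, "LOC")]})
ents = [(ent.text, ent.label_) for ent in example.reference.ents]
print(ents)

# Doc.has_annotation replaces flags such as Doc.is_tagged.
print(doc.has_annotation("TAG"))
```

Batches of such Example objects are what Language.update and Language.evaluate expect.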

Removed or renamed API

| Removed | Replacement |
| --- | --- |
| Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe |
| Language.begin_training, Pipe.begin_training, ... | Language.initialize, Pipe.initialize, ... |
| Doc.is_tagged, Doc.is_parsed, ... | Doc.has_annotation |
| GoldParse | Example |
| GoldCorpus | Corpus |
| KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk |
| Matcher.pipe, PhraseMatcher.pipe | not needed |
| gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags |
| spacy init-model | spacy init vectors |
| spacy debug-data | spacy debug data |
| spacy profile | spacy debug profile |
| spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated |
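As an illustration of the renamed BILUO helpers, here is a short round trip. It assumes a final spaCy v3 install where both functions are importable from spacy.training; the sentence and the LOC label are arbitrary.

```python
import spacy
from spacy.training import biluo_tags_to_offsets, offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London.")

# Character offsets -> one BILUO tag per token.
tags = offsets_to_biluo_tags(doc, [(7, 13, "LOC")])
print(tags)

# BILUO tags -> character offsets again.
offsets = biluo_tags_to_offsets(doc, tags)
print(offsets)
```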

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been deprecated for a while and many would previously raise errors. Many of them were also mostly internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

| Removed | Replacement |
| --- | --- |
| Doc.tokens_from_list | Doc.__init__ |
| Doc.merge, Span.merge | Doc.retokenize |
| Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text |
| Language.tagger, Language.parser, Language.entity | Language.get_pipe |
| keyword-arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"] |
| n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process |
| verbose argument on Language.evaluate | logging (DEBUG) |
| SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer |

- Python
Published by ines over 5 years ago

spacy - v2.3.2: Improved Korean tokenizer speed, experimental character-based pretraining and bug fixes

✨ New features and improvements

  • Improve Korean tokenizer speed.
  • Add experimental character-based pretraining.

🔴 Bug fixes

  • Fix issue #5728: Fix French lemmatizer.
  • Fix issue #5729: Fix lemmatizer for Python 2.7.
  • Fix issue #5751: Fix meta serialization in train CLI.

👥 Contributors

Thanks to @graue70, @mikeizbicki, @jbesomi, @gandersen101 and @DeNeutoy for the pull requests and contributions.

- Python
Published by adrianeboyd almost 6 years ago

spacy - v2.3.1: Alpha support for Nepali, updated Armenian and Japanese language data and bug fixes

✨ New features and improvements

  • NEW: Add alpha support for Nepali.
  • Refactor Japanese tokenizer and include additional custom tokenizer features.
  • Update Armenian language data.
  • Include spacy git commit in package and model meta for reference.

🔴 Bug fixes

  • Fix issue #5620: Skip vocab in component config overrides.
  • Fix issue #5634: Fix polarity of Token.is_oov and Lexeme.is_oov.
  • Fix issue #5643: Add strings and ENT_KB_ID to Doc serialization.
  • Fix issue #5648: Disregard special tag _SP in check for new tag map.
  • Fix issue #5658: Move lemmatizer is_base_form to language settings.

👥 Contributors

Thanks to @myavrum, @mahnerak, @rameshhpathak, @hiroshi-matsuda-rit, @PluieElectrique, @hertelm and @alvaroabascar for the pull requests and contributions.

- Python
Published by adrianeboyd almost 6 years ago

spacy - v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

  • NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
  • NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
  • NEW: Alpha support for Armenian, Gujarati and Malayalam.
  • NEW: Lookup lemmatization for Polish.
  • NEW: Allow Matcher to match on both Doc and Span objects.
  • NEW: Add Token.is_sent_end property.
  • Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
  • Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
  • Add support for pkuseg alongside jieba for Chinese.
  • Switch from fugashi to sudachipy for Japanese.
  • Improve punctuation used in sentencizer.
  • Switch to new and more consistent alignment method in gold.align.
  • Reduce stored lexemes data and move non-derivable features to spacy-lookups-data.
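Two of the smaller API additions above, matching on a Span and Token.is_sent_end, can be tried out roughly like this. The sketch uses a blank English pipeline and sets the sentence boundary by hand, so no trained model is needed. It passes patterns to Matcher.add as a list, which is an assumption about the installed version (that form is the forwards-compatible one).

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("London calling. London listening.")

matcher = Matcher(nlp.vocab)
matcher.add("CITY", [[{"TEXT": "London"}]])

# The same matcher accepts a whole Doc or a Span
# (here: everything after the first token).
n_doc_matches = len(matcher(doc))
n_span_matches = len(matcher(doc[1:]))
print(n_doc_matches, n_span_matches)

# Mark the second "London" as a sentence start by hand,
# then collect the tokens that end a sentence.
doc[3].is_sent_start = True
sent_ends = [token.text for token in doc if token.is_sent_end]
print(sent_ends)
```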

🔴 Bug fixes

  • Fix issue #5056: Introduce support for matching Span objects.
  • Fix issue #5086: Remove Vectors.from_glove.
  • Fix issue #5131: Improve data processing in named entity linking scripts.
  • Fix issue #5137: Fix passing of component configuration to component.
  • Fix issue #5144: Fix sentence comparison in test util.
  • Fix issue #5166: Fix handling of exclusive_classes in textcat ensemble.
  • Fix issue #5170: Set rank for new vector in Vocab.set_vector.
  • Fix issue #5181: Prevent None values in gold fields.
  • Fix issue #5191: Fix GoldParse initialization when the number of tokens has changed.
  • Fix issue #5193: Correctly pin cupy-cuda extra dependencies.
  • Fix issue #5200: Fix minor bugs in train CLI.
  • Fix issue #5216: Modify Vectors.resize to work with cupy.
  • Fix issue #5228: Raise error for inplace resize with new vector dimension.
  • Fix issue #5230: Fix unittest warnings when saving a model.
  • Fix issue #5257: Use inline flags in token_match patterns.
  • Fix issue #5278, #5359: Add missing __init__.py files to language data tests.
  • Fix issue #5281: Fix comparison predicate handling for !=.
  • Fix issue #5287: Normalize TokenC.sent_start values for Matcher.
  • Fix issue #5292: Fix typo in option name --n-save_every.
  • Fix issue #5303: Use max(uint64) for OOV lexeme rank.
  • Fix issue #5311: Fix alignment of cards on landing page.
  • Fix issue #5320: Fix most_similar for vectors with unused rows.
  • Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
  • Fix issue #5356: Fix bug in Span.similarity that could trigger TypeError.
  • Fix issue #5361: Fix problems with lower and whitespace in variants.
  • Fix issue #5373: Improve exceptions for 'd (would/had) in English.
  • Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
  • Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
  • Fix issue #5429: Modify array type to accommodate OOV_RANK.
  • Fix issue #5430: Check that row is within bounds when adding vector.
  • Fix issue #5435: Use Token.sent_start for Span.sent.
  • Fix issue #5436: Fix ErrorsWithCodes().__class__ return value.
  • Fix issue #5450: Disallow merging 0-length spans.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you're training new models, you'll want to install the package spacy-lookups-data, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
  • Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
  • For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as ADP_DET for French "au", which maps to UPOS ADP based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
  • spaCy's custom warnings have been replaced with native Python warnings. Instead of setting SPACY_WARNING_IGNORE, use the warnings filters to manage warnings.
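The switch to native warnings filters can be sketched with the standard library alone. The warning texts below, their bracketed code prefixes and the UserWarning category are stand-ins chosen for illustration, not output copied from spaCy.

```python
import warnings


def emit_demo_warning():
    # Stand-in for a library warning; the "[W008]" prefix is an assumption.
    warnings.warn("[W008] Demo warning to be silenced.", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only warnings whose message starts with the given code.
    warnings.filterwarnings("ignore", message=r"\[W008\]")
    emit_demo_warning()  # suppressed by the filter above
    warnings.warn("[W999] Demo warning with a made-up code.", UserWarning)

messages = [str(w.message) for w in caught]
print(messages)
```

The message argument is a regular expression matched against the start of the warning text, so filtering by code prefix leaves all other warnings visible.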

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add new projects to the spaCy Universe.
  • Move bin/wiki_entity_linking scripts for Wikipedia to projects repo.

🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!

📦 Model packages (46)

| Model | Language | Version | Vectors |
| --- | --- | ---: | ---: |
| zh_core_web_sm | Chinese | 2.3.0 | 𐄂 |
| zh_core_web_md | Chinese | 2.3.0 | ✓ |
| zh_core_web_lg | Chinese | 2.3.0 | ✓ |
| da_core_news_sm | Danish | 2.3.0 | 𐄂 |
| da_core_news_md | Danish | 2.3.0 | ✓ |
| da_core_news_lg | Danish | 2.3.0 | ✓ |
| nl_core_news_sm | Dutch | 2.3.0 | 𐄂 |
| nl_core_news_md | Dutch | 2.3.0 | ✓ |
| nl_core_news_lg | Dutch | 2.3.0 | ✓ |
| en_core_web_sm | English | 2.3.0 | 𐄂 |
| en_core_web_md | English | 2.3.0 | ✓ |
| en_core_web_lg | English | 2.3.0 | ✓ |
| fr_core_news_sm | French | 2.3.0 | 𐄂 |
| fr_core_news_md | French | 2.3.0 | ✓ |
| fr_core_news_lg | French | 2.3.0 | ✓ |
| de_core_news_sm | German | 2.3.0 | 𐄂 |
| de_core_news_md | German | 2.3.0 | ✓ |
| de_core_news_lg | German | 2.3.0 | ✓ |
| el_core_news_sm | Greek | 2.3.0 | 𐄂 |
| el_core_news_md | Greek | 2.3.0 | ✓ |
| el_core_news_lg | Greek | 2.3.0 | ✓ |
| it_core_news_sm | Italian | 2.3.0 | 𐄂 |
| it_core_news_md | Italian | 2.3.0 | ✓ |
| it_core_news_lg | Italian | 2.3.0 | ✓ |
| ja_core_news_sm | Japanese | 2.3.0 | 𐄂 |
| ja_core_news_md | Japanese | 2.3.0 | ✓ |
| ja_core_news_lg | Japanese | 2.3.0 | ✓ |
| lt_core_news_sm | Lithuanian | 2.3.0 | 𐄂 |
| lt_core_news_md | Lithuanian | 2.3.0 | ✓ |
| lt_core_news_lg | Lithuanian | 2.3.0 | ✓ |
| nb_core_news_sm | Norwegian Bokmål | 2.3.0 | 𐄂 |
| nb_core_news_md | Norwegian Bokmål | 2.3.0 | ✓ |
| nb_core_news_lg | Norwegian Bokmål | 2.3.0 | ✓ |
| pl_core_news_sm | Polish | 2.3.0 | 𐄂 |
| pl_core_news_md | Polish | 2.3.0 | ✓ |
| pl_core_news_lg | Polish | 2.3.0 | ✓ |
| pt_core_news_sm | Portuguese | 2.3.0 | 𐄂 |
| pt_core_news_md | Portuguese | 2.3.0 | ✓ |
| pt_core_news_lg | Portuguese | 2.3.0 | ✓ |
| ro_core_news_sm | Romanian | 2.3.0 | 𐄂 |
| ro_core_news_md | Romanian | 2.3.0 | ✓ |
| ro_core_news_lg | Romanian | 2.3.0 | ✓ |
| es_core_news_sm | Spanish | 2.3.0 | 𐄂 |
| es_core_news_md | Spanish | 2.3.0 | ✓ |
| es_core_news_lg | Spanish | 2.3.0 | ✓ |
| xx_ent_wiki_sm | Multi-language | 2.3.0 | 𐄂 |

👥 Contributors

Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.

🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).

- Python
Published by ines almost 6 years ago

spacy - v2.2.4: Alpha support for Yoruba and Basque, language data improvements and lots of bug fixes

✨ New features and improvements

  • NEW: Add Span.char_span method.
  • NEW: Base language support for Yoruba and Basque.
  • NEW: Add --tag-map-path argument to debug-data and train commands.
  • NEW: Add add_lemma option to displacy dependency visualizer.
  • Add IDX as an attribute available via Doc.to_array.
  • Improve speed of adding large number of patterns to EntityRuler.
  • Replace python-mecab3 with fugashi for Japanese.
  • Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.

🔴 Bug fixes

  • Fix issue #3979, #4819, #4871: Add tok2vec parameters to train command.
  • Fix issue #4009: Fix use of pretrained vectors in text classifier.
  • Fix issue #4342: Improve CLI training with base model.
  • Fix issue #4432: Add destructors for states in TransitionSystem.
  • Fix issue #4440: Require HEAD for is_parsed in Doc.from_array.
  • Fix issue #4615: Update SHAPE docs and examples.
  • Fix issue #4665: Allow HEAD field in CoNLL-U format to be an underscore.
  • Fix issue #4673: Ensure correct array module is used when returning a vector via Vocab.
  • Fix issue #4674: Make set_entities in the KnowledgeBase more robust.
  • Fix issue #4677: Add missing tags to tag maps for el, es and pt.
  • Fix issue #4688: Iterate over lr_edges until Doc.sents are correct.
  • Fix issue #4703, #4823: Facilitate large training files.
  • Fix issue #4707: Auto-exclude disabled when calling from_disk during load.
  • Fix issue #4717: Fix int value handling in Matcher.
  • Fix issue #4719: Add message when cli train script throws exception.
  • Fix issue #4723: Update EntityLinker example.
  • Fix issue #4725: Take care of global vectors in multiprocessing.
  • Fix issue #4770: Include Doc.cats in serialization of Doc and DocBin.
  • Fix issue #4772: Fix bug in EntityLinker.predict.
  • Fix issue #4777: Fix link to user hooks in documentation.
  • Fix issue #4829: Update build dependencies in pyproject.toml.
  • Fix issue #4830: Warn for punctuation in entities when training with noise.
  • Fix issue #4833: Make example scripts work with transformer starter models.
  • Fix issue #4849: Fix serialization of ENT_ID.
  • Fix issue #4862: Fix and improve URL pattern.
  • Fix issue #4868: Include .pyx and .pxd files in the distribution.
  • Fix issue #4876: Add friendlier error to entity linking example script.
  • Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
  • Fix issue #4924: Fix handling of empty docs or golds in Language.evaluate.
  • Fix issue #4934: Prevent updating component config if the Model was already defined.
  • Fix issue #4935: Fix Sentencizer.pipe for empty Doc.
  • Fix issue #4961: Remove old docs section links.
  • Fix issue #4965: Sync Span.__eq__ and Span.__hash__.
  • Fix issue #4975: Adjust srsly pin.
  • Fix issue #5048: Fix behavior of get_doc test utility.
  • Fix issue #5073: Normalize IS_SENT_START to SENT_START for Matcher.
  • Fix issue #5075: Make it impossible to create invalid heads with Doc.from_array.
  • Fix issue #5082: Correctly set vector of merged span in merge_entities.
  • Fix issue #5115: Ensure paths in Tokenizer.to_disk and Tokenizer.from_disk.
  • Fix issue #5117: Clarify behavior of Doc.is_ flags for empty Docs.

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add new projects to the spaCy Universe.

👥 Contributors

Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!

- Python
Published by ines about 6 years ago

spacy - v2.2.3: Tokenizer.explain, Korean base support, dependency scores per label and bug fixes

✨ New features and improvements

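  • NEW: Tokenizer.explain method to see which rule or pattern was matched.
    ```python
    import spacy

    # A blank English pipeline is enough; the tokenizer rules ship with the language data.
    nlp = spacy.blank("en")
    tok_exp = nlp.tokenizer.explain("(don't)")
    assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
    assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
    ```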
  • NEW: Official Python 3.8 wheels for spaCy and its dependencies.
  • Base language support for Korean.
  • Add Scorer.las_per_type (labelled dependency scores per label).
  • Rework Chinese language initialization and tokenization.
  • Improve language data for Luxembourgish.

🔴 Bug fixes

  • Fix issue #4573, #4645: Improve tokenizer usage docs.
  • Fix issue #4575: Add error in debug-data if no dev docs are available.
  • Fix issue #4582: Make as_tuples=True in Language.pipe work with multiprocessing.
  • Fix issue #4590: Correctly call on_match in DependencyMatcher.
  • Fix issue #4593: Build wheels for Python 3.8.
  • Fix issue #4604: Fix realloc in Retokenizer.split.
  • Fix issue #4656: Fix conllu2json converter when -n > 1.
  • Fix issue #4662: Fix Language.evaluate for components without .pipe method.
  • Fix issue #4670: Ensure EntityRuler is deserialized correctly from disk.
  • Fix issue #4680: Raise error if non-string labels are added to Tagger or TextCategorizer.
  • Fix issue #4691: Make Vectors.find return keys in correct order.

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @yash1994, @walterhenry, @prilopes, @f11r, @questoph, @erip, @richardpaulhudson and @GuiGel for the pull requests and contributions.

- Python
Published by ines over 6 years ago

spacy - v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install

✨ New features and improvements

  • NEW: Support multiprocessing in nlp.pipe via the n_process argument (Python 3 only).
  • Base language support for Luxembourgish.
  • Add noun chunks iterator for Swedish.
  • Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
  • Repackaged models for Greek and German with improved lookup tables via spacy-lookups-data.
  • Add warning in debug-data for low sentences per doc ratio.
  • Improve checks and errors related to ill-formed IOB input in convert and debug-data CLI.
  • Support training dict format as JSONL.
  • Make EntityRuler ID resolution 2× faster and support "id" in patterns to set Token.ent_id.
  • Improve rendering of named entity spans in displacy for RTL languages.
  • Update Thinc to ditch thinc_gpu_ops for simpler GPU install.
  • Support Mish activation in spacy pretrain.
  • Add forwards-compatible support for new Language.disable_pipes API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
    ```diff
    - disabled = nlp.disable_pipes("tagger", "parser")
    + disabled = nlp.disable_pipes(["tagger", "parser"])
    ```
  • Add forwards-compatible support for new Matcher.add and PhraseMatcher.add API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
    ```diff
    patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
    - matcher.add("GoogleNow", None, *patterns)
    + matcher.add("GoogleNow", patterns)
    - matcher.add("GoogleNow", on_match, *patterns)
    + matcher.add("GoogleNow", patterns, on_match=on_match)
    ```
  • Add new and improved tokenization alignment in gold.align behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
    ```python
    import spacy.gold
    spacy.gold.USE_NEW_ALIGN = True
    ```
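
As a rough illustration of how a single function can accept both the old and the new call style described above for Language.disable_pipes, here is a minimal sketch. The helper name normalize_names is hypothetical and not part of spaCy, and spaCy's own implementation may differ.

```python
def normalize_names(*names):
    """Accept either normalize_names("a", "b") or normalize_names(["a", "b"]).

    Hypothetical helper for illustration only; not part of spaCy.
    """
    # New style: a single list or tuple holding all component names.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    # Old style: a variable number of string arguments.
    return list(names)


old_style = normalize_names("tagger", "parser")
new_style = normalize_names(["tagger", "parser"])
print(old_style == new_style)
```

Both call styles resolve to the same list of names, which is what lets existing code keep working while new code moves to the list form.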

🔴 Bug fixes

  • Fix issue #1303: Support multiprocessing in nlp.pipe.
  • Fix issue #1745: Ditch thinc_gpu_ops for simpler GPU install.
  • Fix issue #2411: Update Thinc to fix compilation on cygwin.
  • Fix issue #3412: Prevent division by zero in Vectors.most_similar.
  • Fix issue #3618: Fix memory leak for long-running parsing processes.
  • Fix issue #4241: Update Greek lookups in spacy-lookups-data.
  • Fix issue #4269: Extend unicode character block for Sinhala.
  • Fix issue #4362: Improve URL_PATTERN and handling in tokenizer.
  • Fix issue #4373: Make PhraseMatcher.vocab consistent with Matcher.vocab.
  • Fix issue #4377: Clarify serialization of extension attributes.
  • Fix issue #4382: Improve usage of pkg_resources and handling of entry points.
  • Fix issue #4386: Consider batch_size when sorting similar vectors.
  • Fix issue #4389: Fix ner_jsonl2json converter.
  • Fix issue #4397: Ensure on_match callback is executed in PhraseMatcher.
  • Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
  • Fix issue #4402: Fix issue with how training data was passed through the pipeline.
  • Fix issue #4406: Correct spelling in lemmatizer API docs.
  • Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
  • Fix issue #4435: Fix PhraseMatcher.remove for overlapping patterns.
  • Fix issue #4443: Fix bug in Vectors.most_similar.
  • Fix issue #4452: Fix gold.docs_to_json documentation.
  • Fix issue #4463: Add missing cats to GoldParse.from_annot_tuples in Scorer.
  • Fix issue #4470: Suppress convert output if writing to stdout.
  • Fix issue #4475: Correct mistake in docs example.
  • Fix issue #4485: Update tag maps and docs for English and German.
  • Fix issue #4493: Update information in spaCy Universe.
  • Fix issue #4496: Improve docs of PhraseMatcher.add arguments.
  • Fix issue #4506: Ensure Vectors.most_similar returns 1.0 for identical vectors.
  • Fix issue #4509: Fix None iteration error in entity linking script.
  • Fix issue #4524: Fix typo in Parser sample construction of GoldParse.
  • Fix issue #4528: Fix serialization of extension attribute values in DocBin.
  • Fix issue #4529: Ensure GoldParse is initialized correctly with misaligned tokens.
  • Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.
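
The Vectors.most_similar fixes listed above (#3412 and #4506) concern two edge cases of cosine similarity: all-zero vectors and identical vectors. The sketch below shows one way to handle both in plain Python. It is not the code from Vectors.most_similar, and returning 0.0 for a zero vector is an assumption of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against division by zero for all-zero vectors (assumed result: 0.0).
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    # Clamp so rounding error cannot push the result outside [-1, 1].
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))
```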

⚠️ Backwards incompatibilities

  • The unused attributes lemma_rules, lemma_index, lemma_exc and lemma_lookup of the Language.Defaults have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via nlp.vocab.lookups.
    ```diff
    - nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
    + lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
    + lemma_lookup["spaCies"] = "spaCy"
    ```

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Add more projects to the spaCy Universe.

👥 Contributors

Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.

- Python
Published by ines over 6 years ago

spacy - v2.1.9: Backport memory leak fix

This is a small maintenance update that backports a bug fix for a memory leak that'd occur in long-running parsing processes. It's intended for users who can't or don't yet want to upgrade to spaCy v2.2 (e.g. because it requires retraining all the models). If you're able to upgrade, you shouldn't use this version and should install the latest v2.2 instead.

🔴 Bug fixes

  • Fix issue #3618: Fix memory leak for long-running parsing processes.
  • Fix issue #4538: Backport memory leak fix to v2.1.x branch.

- Python
Published by ines over 6 years ago

spacy - v2.2.1: Fix DocBin and Dutch model, improve Vectors.most_similar

✨ New features and improvements

  • Make Vectors.most_similar return the top most similar vectors instead of only one.

🔴 Bug fixes

  • Fix issue #4365: Fix tag map in Dutch model.
  • Fix issue #4368: Fix initialization of DocBin with attributes.

📖 Documentation and examples

👥 Contributors

Thanks to @bintay and @svlandeg for the pull requests and contributions.

- Python
Published by ines over 6 years ago

spacy - v2.2.0: Norwegian & Lithuanian models, better Dutch NER, smaller install, faster matching & more

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

  • NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
  • NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
  • NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
  • NEW: EntityLinker and KnowledgeBase API to train and access entity linking models, plus scripts to train your own Wikidata models.
  • NEW: 10× faster PhraseMatcher and improved phrase matching algorithm.
  • NEW: DocBin class to efficiently serialize collections of Doc objects.
  • NEW: Train text classification models on the command line with spacy train and get textcat results via the Scorer.
  • NEW: debug-data command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
  • NEW: Efficient Lookups class using Bloom filters that allows storing, accessing and serializing large dictionaries via vocab.lookups.
  • Data augmentation in spacy train via the --orth-variant-level flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
  • Add nlp.pipe_labels (labels assigned by pipeline components) and include "labels" in nlp.meta.
  • Support spacy_displacy_colors entry point to allow packages to add entity colors to displacy.
  • Allow template config option in displacy to customize entity HTML template.
  • Improve match pattern validation and handling of unsupported attributes.
  • Add lookup lemmatization data for Croatian and Serbian.
  • Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
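
The Lookups class above is described as using Bloom filters. As a rough sketch of the general technique only, the class below implements a tiny Bloom filter in plain Python. The name TinyBloomFilter, the bit-array size, the number of hashes and the use of SHA-256 are all assumptions made for this example and say nothing about how spaCy implements it.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; hypothetical, not spaCy's implementation."""

    def __init__(self, size=1024, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive n_hashes bit positions by salting the key with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = TinyBloomFilter()
for word in ["spaCies", "mice", "geese"]:
    bloom.add(word)
print("mice" in bloom)  # added keys are never reported as missing
```

The useful property is the cheap negative answer: a key that was never added is almost always rejected without touching the full table, at the cost of occasional false positives.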

🔴 Bug fixes

  • Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
  • Fix issue #3540: Update lemma and vector information after splitting a token.
  • Fix issue #3687: Automatically skip duplicates in Doc.retokenize.
  • Fix issue #3830: Retrain German model and fix subtok errors.
  • Fix issue #3850: Allow customizing entity HTML template in displaCy.
  • Fix issue #3879, #3951, #4154: Fix bug in Matcher retry loop that'd cause problems with ? operator.
  • Fix issue #3917: Raise error for negative token indices in displacy.
  • Fix issue #3922: Add PhraseMatcher.remove method.
  • Fix issue #3959, #4133: Make sure both pos and tag are correctly serialized.
  • Fix issue #3972: Ensure PhraseMatcher returns multiple matches for identical rules.
  • Fix issue #4020: Raise error for overlapping entities in biluo_tags_from_offsets.
  • Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
  • Fix issue #4070: Improve token pattern checking without validation.
  • Fix issue #4096: Add checks for cycles in debug-data.
  • Fix issue #4100: Improve docs on phrase pattern attributes.
  • Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
  • Fix issue #4104: Make visualized NER examples in docs more clear.
  • Fix issue #4107: Automatically set span root attributes on merging.
  • Fix issue #4111, #4170: Improve NER/IOB converters.
  • Fix issue #4120: Correctly handle ? operator at the end of pattern.
  • Fix issue #4123: Provide more details in cycle error message E069.
  • Fix issue #4138: Correctly open .html files as UTF-8 in evaluate command.
  • Fix issue #4139: Make emoticon data a raw string.
  • Fix issue #4148: Add missing API docs for force flag on set_extension.
  • Fix issue #4155: Correct language code for Serbian.
  • Fix issue #4165: Add more attributes to matcher validation schema.
  • Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
  • Fix issue #4200: Work around tqdm bug that'd remove text color from terminal output.
  • Fix issue #4229: Fix handling of pre-set entities.
  • Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
  • Fix issue #4242: Make .pos/.tag distinction more clear in the docs.
  • Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
  • Fix issue #4262: Fix handling of spaces in Japanese.
  • Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
  • Fix issue #4270: Fix --vectors-loc documentation.
  • Fix issue #4302: Remove duplicate Parser.tok2vec property.
  • Fix issue #4303: Correctly support as_tuples and return_matches in Matcher.pipe.
  • Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
  • Fix issue #4308: Fix bug that could cause PhraseMatcher with very large lists to miss matches.
  • Fix issue #4348: Ensure training doesn't crash with empty batches.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • The lemmatization tables have been moved to their own package, spacy-lookups-data, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. spacy.blank("en")), you'll need to explicitly install spaCy plus data via pip install spacy[lookups]. The data will be registered automatically via entry points.
  • Lemmatization tables (rules, exceptions, index and lookups) are now part of the Vocab and serialized with it. This means that serialized objects (nlp, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
  • The Lemmatizer class is now initialized with an instance of Lookups containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom Lemmatizer, you'll need to update your code.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
  • The spacy download command does not set the --no-deps pip argument anymore by default, meaning that model package dependencies (if available) will now be also downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, --no-deps is added back automatically to prevent spaCy from being downloaded and installed again from pip.
  • The built-in biluo_tags_from_offsets converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new debug-data command to find problems in your data.
  • Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an ent_iob value set, it won't be reset to an "unset" state and will always have at least O assigned. list(doc.ents) now actually keeps the annotations on the token level consistent, instead of resetting O to an empty string.
  • The default punctuation in the Sentencizer has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set punct_chars=[".", "!", "?"] on initialization.
  • The PhraseMatcher algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results – however, the results should now be fully correct. See #4309 for details on this change.
  • The Serbian language class (introduced in v2.1.8) incorrectly used the language code rs instead of sr. This has now been fixed, so Serbian is now available via spacy.lang.sr.
  • The "sources" in the meta.json have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used nlp.meta["sources"], you might have to update it.
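
The stricter behaviour of biluo_tags_from_offsets described above means that overlapping entity annotations now raise an error instead of being skipped. To find such conflicts in your own data before conversion, a check along the following lines may help. The function check_no_overlaps is a hypothetical helper, not a spaCy API, and it assumes entities are given as (start, end, label) tuples with end-exclusive character offsets.

```python
def check_no_overlaps(entities):
    """Raise ValueError if any (start, end, label) character spans overlap.

    Hypothetical helper for illustration; offsets are end-exclusive.
    """
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for previous, current in zip(ordered, ordered[1:]):
        # After sorting by start, an overlap exists when the next span
        # begins before the previous one has ended.
        if current[0] < previous[1]:
            raise ValueError(f"Overlapping entities: {previous} and {current}")
    return ordered


clean = check_no_overlaps([(6, 9, "GPE"), (0, 5, "ORG")])
print(clean)
```

Comparing only neighbouring spans is enough here because the list is sorted by start offset first.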

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| ------------------- | ---------- | ------: | ----: | ----: | ----: | ----: | :-: | -----: |
| en_core_web_sm | English | 2.2.0 | 91.61 | 89.71 | 97.03 | 85.07 | 𐄂 | 11 MB |
| en_core_web_md | English | 2.2.0 | 91.65 | 89.77 | 97.14 | 86.10 | ✓ | 91 MB |
| en_core_web_lg | English | 2.2.0 | 91.98 | 90.16 | 97.21 | 86.30 | ✓ | 789 MB |
| de_core_news_sm | German | 2.2.0 | 90.75 | 88.63 | 96.29 | 83.11 | 𐄂 | 14 MB |
| de_core_news_md | German | 2.2.0 | 91.26 | 89.36 | 96.44 | 83.42 | ✓ | 214 MB |
| es_core_news_sm | Spanish | 2.2.0 | 90.20 | 87.05 | 96.79 | 89.45 | 𐄂 | 15 MB |
| es_core_news_md | Spanish | 2.2.0 | 90.89 | 87.94 | 97.03 | 89.86 | ✓ | 74 MB |
| pt_core_news_sm | Portuguese | 2.2.0 | 89.53 | 86.07 | 79.96 | 87.97 | 𐄂 | 20 MB |
| fr_core_news_sm | French | 2.2.0 | 87.27 | 84.28 | 94.38 | 82.77 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.2.0 | 88.82 | 86.07 | 95.15 | 82.82 | ✓ | 84 MB |
| it_core_news_sm | Italian | 2.2.0 | 90.79 | 86.94 | 96.06 | 86.29 | 𐄂 | 13 MB |
| nl_core_news_sm | Dutch | 2.2.0 | 76.79 | 69.53 | 90.10 | 68.79 | 𐄂 | 14 MB |
| el_core_news_sm | Greek | 2.2.0 | 84.40 | 80.98 | 94.41 | 71.88 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.2.0 | 87.96 | 84.88 | 96.38 | 77.59 | ✓ | 126 MB |
| nb_core_news_sm | Norwegian | 2.2.0 | 89.02 | 86.49 | 95.72 | 83.99 | 𐄂 | 12 MB |
| lt_core_news_sm | Lithuanian | 2.2.0 | 59.87 | 48.00 | 74.02 | 76.58 | 𐄂 | 12 MB |
| xx_ent_wiki_sm | Multi | 2.2.0 | - | - | - | 79.88 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

👥 Contributors

Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.

Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.

- Python
Published by ines over 6 years ago

spacy - v2.1.8: Usability improvements and Serbian alpha tokenization

✨ New features and improvements

  • NEW: Alpha tokenization support for Serbian.
  • Improve language data for Urdu.
  • Support installing and loading model packages in the same session.

🔴 Bug fixes

  • Fix issue #4002: Make PhraseMatcher work as expected for NORM attribute.
  • Fix issue #4063: Improve docs on Matcher attributes.
  • Fix issue #4068: Make Korean work as expected on Python 2.7.
  • Fix issue #4069: Add validate option to EntityRuler.
  • Fix issue #4074: Raise error if annotation dict in simple training style has unexpected keys.
  • Fix issue #4081: Fix typo in pyproject.toml.
  • Fix handling of keyword arguments in Language.evaluate.

📖 Documentation and examples

👥 Contributors

Thanks to @akornilo, @mirfan899, @veer-bains, @seppeljordan, @Pavle992, @svlandeg, @jenojp and @adrianeboyd for the pull requests and contributions.

- Python
Published by ines almost 7 years ago

spacy - v2.1.7: Improved evaluation, better language factories and bug fixes

✨ New features and improvements

  • Add Token.tensor and Span.tensor attributes.
  • Support simple training format of (text, annotations) instead of only (doc, gold) for nlp.evaluate.
  • Add support for "lang_factory" setting in model meta.json (see #4031).
  • Also support "requirements" in meta.json to define packages for setup's install_requires.
  • Improve Pipe base class methods and make them less presumptuous.
  • Improve Danish and Korean tokenization.
  • Improve error messages when deserializing model fails.

🔴 Bug fixes

  • Fix issue #3669, #3962: Fix dependency copy in Span.as_doc that could cause segfault.
  • Fix issue #3968: Fix bug in per-entity scores.
  • Fix issue #4000: Improve entity linking API.
  • Fix issue #4022: Fix error when Korean text contains special characters.
  • Fix issue #4030: Handle edge case when calling TextCategorizer.predict with empty Doc.
  • Fix issue #4045: Correct Span.sent docs.
  • Fix issue #4048: Fix init-model command if there's no vocab.
  • Fix issue #4052: Improve per-type scoring of NER.
  • Fix issue #4054: Ensure the lang of nlp and nlp.vocab stay consistent.
  • Fix bugs in Token.similarity and Span.similarity when called via hook.

📖 Documentation and examples

👥 Contributors

Thanks to @sorenlind, @pmbaumgartner, @svlandeg, @FallakAsad, @BreakBB, @adrianeboyd, @polm, @b1uec0in, @mdaudali and @ejarkm for the pull requests and contributions.

- Python
Published by ines almost 7 years ago

spacy - v2.1.6: Fix order of symbols that caused tag maps to be out-of-sync

🔴 Bug fixes

  • Fix issue #3958: Fix order of symbols that caused tag maps to be out-of-sync.

- Python
Published by ines almost 7 years ago

spacy - v2.1.5: Base support for Marathi and Korean, better pretraining, scores per entity and bug fixes

✨ New features and improvements

  • NEW: Base language data for Marathi and Korean (via mecab-ko, mecab-ko-dic and natto-py).
  • Improve language data for Lithuanian, Spanish, Kannada, French, Norwegian and Hindi.
  • Add evaluation metrics per entity type.
  • Add resume logic to spacy pretrain.
  • Add optional id property to EntityRuler patterns.
  • Better introspection and IDE autocomplete for custom extension attributes.
  • Make Doc.is_sentenced always return True for single-token docs.

🔴 Bug fixes

  • Fix issue #3490: Add evaluation metrics per entity type to Scorer.
  • Fix issue #3526: Serialize EntityRuler settings correctly.
  • Fix issue #3558: Improve E024 error message for incorrect GoldParse.
  • Fix issue #3611: Fix bug when setting ngram parameter in text classifier.
  • Fix issue #3625: Improve default punctuation rules for Hindi.
  • Fix issue #3707: Improve introspection of custom attributes.
  • Fix issue #3737: Check if component is callable in Language.replace_pipe.
  • Fix issue #3743: Fix documentation of lex_id.
  • Fix issue #3749: Change vector training script to work with latest Gensim.
  • Fix issue #3762, #3934: Make Doc.is_sentenced default to True for single-token Docs.
  • Fix issue #3802: Fix typo in docs example.
  • Fix issue #3811: Fix type of --seed option in spacy pretrain.
  • Fix issue #3822: Allow passing PhraseMatcher arguments to EntityRuler.
  • Fix issue #3839: Ensure the Matcher returns correct match IDs when used with operators.
  • Fix issue #3840: Improve error messages in spacy pretrain.
  • Fix issue #3853: Rename vectors if multiple models are loaded to prevent clashes.
  • Fix issue #3859: Update pretrain to prevent unintended overwriting of weight files.
  • Fix issue #3862: Fix matcher callback example.
  • Fix issue #3868: Add "v.s." to English tokenizer exceptions.
  • Fix issue #3869: Make Doc.count_by work as expected.
  • Fix issue #3880: Fix unflatten padding in Thinc when last element is empty.
  • Fix issue #3882: Exclude user_data when copying doc in displaCy.
  • Fix issue #3892: Update Tokenizer initialization docs.
  • Fix issue #3912: Make text classifier raise more friendly errors.

📖 Documentation and examples

👥 Contributors

Thanks to @BreakBB, @ujwal-narayan, @estr4ng7d, @maknotavailable, @ramananbalakrishnan, @nipunsadvilkar, @NirantK, @munozbravo, @intrafindBreno, @Azagh3l, @jarib, @tokestermw, @polm, @skrcode, @kabirkhan, @demongolem, @elbaulp, @clarus, @BramVanroy, @rokasramas, @askhogan, @khellan, @kognate, @cedar101 and @yash1994 for the pull requests and contributions.

- Python
Published by ines almost 7 years ago

spacy - v2.1.4: Training improvements and bug fixes

✨ New features and improvements

  • NEW: util.filter_spans helper to filter duplicates and overlaps from a list of Span objects.
  • Improve language data for Thai, Japanese, Indonesian and Dutch.
  • Add --n-save-every to spacy pretrain and rename --nr-iter to --n-iter for consistency.
  • Add --return-scores flag to spacy evaluate to return a dict.
  • Add --n-early-stopping option to spacy train to define maximum number of iterations without dev accuracy improvements.
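
The util.filter_spans helper above is described as filtering duplicates and overlaps from a list of Span objects. The sketch below illustrates that idea on plain (start, end) token-index tuples instead of Span objects. The function name filter_spans_sketch and the tie-breaking rule (longer spans win, then earlier spans) are assumptions of this example and may not match the library exactly.

```python
def filter_spans_sketch(spans):
    """Filter duplicate and overlapping (start, end) token spans.

    Hypothetical stand-in operating on end-exclusive tuples rather than
    Span objects. Longer spans win; ties go to the earlier span.
    """
    # Longest first; for equal length, the smaller start index first.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Keep the span only if none of its tokens is already covered.
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


print(filter_spans_sketch([(0, 3), (1, 2), (0, 3), (4, 6)]))  # [(0, 3), (4, 6)]
```

A filter like this is typically useful before assigning to doc.ents or merging spans, where overlapping spans are not allowed.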

🔴 Bug fixes

  • Fix issue #3307: Fix symlink creation to show error on Windows.
  • Fix issue #3473: Fix GPU training for text classification.
  • Fix issue #3475: Change favicon.
  • Fix issue #3482: Add Estonian base support to documentation.
  • Fix issue #3484: Ensure lemmatization is always consistent between sessions.
  • Fix issue #3521: Add variations of contractions to English stop words.
  • Fix issue #3523: Make spacy convert correctly default to json.
  • Fix issue #3525, #3551, #3572: Fix problem that'd cause lemmas to not be lowercase.
  • Fix issue #3531: Don't make "settings" or "title" required in displaCy data.
  • Fix issue #3533: Remove non-existent example from docs.
  • Fix issue #3546: Make sure path in GoldParse.__del__ is a string.
  • Fix issue #3549: Ensure match pattern error isn't raised on empty errors list.
  • Fix issue #3561: Fix DependencyParser.predict docs.
  • Fix issue #3598: Allow jupyter=False to override Jupyter mode in displacy.
  • Fix issue #3620: Fix bug in .iob converter.
  • Fix issue #3628: Relax jsonschema pin.
  • Fix issue #3667: Fix offset bug in loading pre-trained word2vec.
  • Fix issue #3679: Update glossary to include missing labels in spacy.explain.
  • Fix issue #3680: Re-add missing universe README.
  • Fix issue #3681: Rewrite information extraction example to use Doc.retokenize.
  • Fix issue #3692: Fix return value in Language.update docs.
  • Fix issue #3694: Make "text" in spacy pretrain optional when "tokens" is provided.
  • Fix issue #3701: Improve Token.prob and Lexeme.prob docs.
  • Fix issue #3708: Fix error in regex matcher examples.
  • Fix issue #3713: Call rmtree and copytree with strings in spacy train.
  • Fix issue #3720: Add version tag to --base-model argument in spacy train docs.

📖 Documentation and examples

👥 Contributors

Thanks to @svlandeg, @wannaphongcom, @Bharat123rox, @DuyguA, @SamuelLKane, @graus, @HiromuHota, @jeannefukumaru, @ivigamberdiev, @socool, @yvespeirsman, @lemontheme, @Dobita21, @w4nderlust, @pierremonico, @bryant1410, @celikomer, @xssChauhan, @kowaalczyk, @BreakBB, @fizban99, @tokestermw, @bjascob, @pickfire, @yaph, @amitness, @henry860916, @d5555, @BramVanroy, @F0rge1cE, @richardpaulhudson, @ldorigo, @aaronkub and @devforfu for the pull requests and contributions.

- Python
Published by ines about 7 years ago

spacy - v2.1.3: Improve sentencizer and serialization

✨ New features and improvements

  • Allow customizing punctuation characters in sentencizer and make it serializable.
  • Add new "bow" architecture for TextCategorizer, to do faster bag-of-words text classification.
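
To make the effect of customizable punctuation characters in the sentencizer concrete, here is a minimal rule-based sketch that marks sentence starts in a list of token strings. The function sentence_starts and its default characters are hypothetical; this is not spaCy's sentencizer, only an illustration of how the set of punctuation characters changes the segmentation.

```python
def sentence_starts(tokens, punct_chars=(".", "!", "?")):
    """Return one boolean per token, True where a new sentence begins.

    Hypothetical sketch of a rule-based splitter; not spaCy's sentencizer.
    """
    starts = []
    seen_punct = False
    for i, token in enumerate(tokens):
        is_punct = token in punct_chars
        if i == 0:
            starts.append(True)
        elif seen_punct and not is_punct:
            # First non-punctuation token after sentence-final punctuation.
            starts.append(True)
            seen_punct = False
        else:
            starts.append(False)
        if is_punct:
            seen_punct = True
    return starts


tokens = ["Hello", "world", "!", "How", "are", "you", "?", "Fine", ";", "thanks"]
default = sentence_starts(tokens)
custom = sentence_starts(tokens, punct_chars=(".", "!", "?", ";"))
print(default)
print(custom)
```

With the default characters the final ";" does not end a sentence, while adding it to punct_chars makes "thanks" start a new one.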

🔴 Bug fixes

  • Fix issue #3433, #3458: Fix mismatch of classes in parser after serialization.
  • Fix issue #3464: Fix training loop in train_textcat.py example.
  • Fix issue #3468: Make sentencizer set Token.is_sent_start correctly.
  • Fix bug in the "ensemble" TextClassifier architecture that prevented the unigram bag-of-words submodel from working properly.

👥 Contributors

Thanks to @chkoar for the pull request!

- Python
Published by ines about 7 years ago

spacy - v2.1.2: Fixes to regex handling on Python 2 and tag map

🔴 Bug fixes

  • Fix issue #3356: Fix handling of unicode ranges in regular expressions on Python 2.
  • Fix issue #3432: Update wasabi to better handle non-UTF-8 terminals.
  • Fix issue #3445: Update docs on label argument in Span.__init__.
  • Fix issue #3455: Bring English tag_map in line with UD Treebank.

📖 Documentation and examples

  • Add --init-tok2vec argument to train_textcat.py example.
  • Fix various typos and inconsistencies.

- Python
Published by ines about 7 years ago

spacy - v2.1.1: Small GPU fixes

✨ New features and improvements

  • Raise error if user is running a narrow unicode build.
  • Move ud_train, ud_evaluate and other UD scripts from CLI to /bin in repo only.
  • Improve accuracy of spacy pretrain by implementing cosine loss.

🔴 Bug fixes

  • Fix issue #3421: Update docs and raise error for narrow unicode builds.
  • Fix issue #3427: Correct mistake in French lemmatizer.
  • Fix issue #3431: Make Doc.vector and Doc.vector_norm work as expected on GPU.
  • Fix issue #3437: Fix installation problem on GPU.
  • Fix issue #3439, #3446: Don't include UD scripts in spacy.cli.

👥 Contributors

Thanks to @mhham and @Bharat123Rox for the pull requests!

- Python
Published by ines about 7 years ago

spacy - v2.1.0: New models, ULMFit/BERT/Elmo-like pretraining, faster tokenization, better Matcher, bug fixes & more

⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.
  • Add Vocab.writing_system (populated via the language data) to expose settings like writing direction.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
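
To illustrate the BILUO scheme behind the `gold.spans_from_biluo_tags` helper listed above, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function name `biluo_to_offsets` is hypothetical, it assumes well-formed tags, and it returns `(start, end, label)` token offsets rather than `Span` objects.

```python
from typing import List, Tuple


def biluo_to_offsets(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BILUO tags to (start, end, label) token offsets (end exclusive)."""
    spans = []
    start = None  # index of the currently open B- tag, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":            # single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":          # entity begins
            start = i
        elif prefix == "L" and start is not None:  # entity ends
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":          # outside any entity
            start = None
        # "I" tags continue the open entity, so there is nothing to do.
    return spans


tags = ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]
print(biluo_to_offsets(tags))
```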

🔴 Bug fixes

  • Fix issue #795: Fix behaviour of Token.conjuncts.
  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2091: Fix displacy support for RTL languages.
  • Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
  • Fix issue #2603: Improve handling of missing NER tags.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2740: Add ability to pass additional arguments to pipeline components.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2869: Make doc[0].is_sent_start == True.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3036: Support mutable default arguments in extension attributes.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3191: Fix pickling of Japanese.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3274: Make Token.sent work as expected without the parser.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix issue #3346: Expose Japanese stop words in language class.
  • Fix issue #3357: Update displaCy examples in docs to correctly show Token.pos_.
  • Fix issue #3345: Fix NER when preset entities cross sentence boundaries.
  • Fix issue #3348: Don't use numpy directly for similarity.
  • Fix issue #3366: Improve converters, training data formats and docs.
  • Fix issue #3369: Fix #egg fragments in direct downloads.
  • Fix issue #3382: Make Doc.from_array consistent with Doc.to_array.
  • Fix issue #3398: Don't set extension attributes in language classes.
  • Fix issue #3373: Merge and improve conllu converters.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.
    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • The serialization methods to_disk, from_disk, to_bytes and from_bytes now support a single exclude argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The disable argument on the Language serialization methods has been renamed to exclude for consistency.
    ```diff
    - nlp.to_disk("/path", disable=["parser", "ner"])
    + nlp.to_disk("/path", exclude=["parser", "ner"])
    - data = nlp.tokenizer.to_bytes(vocab=False)
    + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
    ```
  • The .pos value for several common English words has changed, due to corrections to long-standing mistakes in the English tag map (see #593, #3311).
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer'; the name 'sbd' is deprecated.
    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The is_sent_start attribute of the first token in a Doc now correctly defaults to True. It previously defaulted to None.
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
    ```diff
    - $ spacy train en /output traindata.json devdata.json --no-parser
    + $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line, instead of separate --freqs-loc and --clusters-loc arguments.
    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
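
The advice above to batch merges inside one `Doc.retokenize` block can be motivated with a toy model. The sketch below is not spaCy's retokenizer: `merge_spans` is a hypothetical helper working on plain strings. It collects all requested `(start, end)` spans first and then rebuilds the token list in a single pass, rather than rebuilding it once per merge.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Merge non-overlapping (start, end) spans (end exclusive) in one pass."""
    ends = {start: end for start, end in spans}  # span start -> span end
    merged = []
    i = 0
    while i < len(tokens):
        end = ends.get(i, i + 1)  # single token unless a span starts here
        merged.append(" ".join(tokens[i:end]))
        i = end
    return merged


tokens = "I like New York City and San Francisco".split()
print(merge_spans(tokens, [(2, 5), (6, 8)]))
```

Because all spans are expressed against the original token indices, earlier merges never shift the offsets of later ones, which is the same convenience the `with` block gives you.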

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0 | 91.5 | 89.7 | 96.8 | 85.9 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0 | 91.8 | 90.0 | 96.9 | 86.6 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0 | 91.8 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0 | 90.7 | 88.6 | 96.3 | 83.1 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0 | 91.2 | 89.4 | 96.6 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0 | 90.4 | 87.3 | 96.9 | 89.5 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0 | 91.0 | 88.2 | 97.2 | 89.7 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0 | 89.1 | 85.9 | 80.4 | 88.9 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0 | 87.6 | 84.7 | 94.5 | 82.6 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0 | 89.1 | 86.4 | 95.3 | 83.1 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0 | 91.0 | 87.3 | 95.8 | 86.1 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0 | 83.7 | 77.6 | 91.6 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0 | 84.4 | 80.6 | 94.6 | 71.6 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0 | 88.3 | 85.0 | 96.6 | 81.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0 | - | - | - | 81.3 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
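
For readers new to the parser metrics in the legend above, the following sketch computes UAS and LAS under their usual definitions: the share of tokens with the correct head, and the share with both the correct head and label. The helper `attachment_scores` and the `(head, label)` representation are assumptions for illustration; spaCy's own scorer may differ in details such as punctuation handling.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    both_ok = sum(g == p for g, p in zip(gold, pred))        # head and label
    return 100.0 * head_ok / len(gold), 100.0 * both_ok / len(gold)


gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

LAS can never exceed UAS, since every token counted for LAS also has the correct head.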

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2, @adrienball and @Poluglottos for the pull requests and contributions.

- Python
Published by ines about 7 years ago

spacy - v2.1.0a13: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ This nightly release currently doesn't work on Python 2.7 on Windows, due to difficulties compiling our new matrix multiplication dependency blis in that environment. We expect this can be corrected in future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
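
For readers unfamiliar with the cloze task mentioned above: some tokens are hidden and the model is trained to recover them from context. The snippet below is a toy data-preparation sketch only. It says nothing about how `spacy pretrain` is implemented; `make_cloze_example`, the `<mask>` placeholder and the masking rates are assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Tuple


def make_cloze_example(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Hide a random subset of tokens and return (masked tokens, targets)."""
    rng = random.Random(seed)  # local RNG keeps the example reproducible
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = token      # what the model should predict
            masked[i] = "<mask>"    # what the model actually sees
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_cloze_example(tokens, mask_rate=0.3)
print(masked, targets)
```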

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.
  • Add Vocab.writing_system (populated via the language data) to expose settings like writing direction.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
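
Matching on LOWER instead of ORTH, as described for the PhraseMatcher above, amounts to comparing lower-cased token texts. Here is a minimal pure-Python sketch of that idea. It is not the `PhraseMatcher` implementation; `find_phrase` is a hypothetical helper that operates on plain token lists, with the `key` function standing in for the chosen token attribute.

```python
from typing import Callable, List, Tuple


def find_phrase(
    tokens: List[str], phrase: List[str], key: Callable[[str], str] = str.lower
) -> List[Tuple[int, int]]:
    """Return (start, end) offsets where `phrase` occurs, comparing key(token)."""
    if not phrase:
        return []
    wanted = [key(t) for t in phrase]
    seen = [key(t) for t in tokens]
    n = len(wanted)
    return [
        (i, i + n)
        for i in range(len(seen) - n + 1)
        if seen[i:i + n] == wanted
    ]


tokens = "Barack Obama met BARACK OBAMA".split()
# Case-insensitive: compares lower-cased token texts.
print(find_phrase(tokens, ["barack", "obama"]))
# Exact text: the identity key only matches the verbatim spelling.
print(find_phrase(tokens, ["Barack", "Obama"], key=lambda t: t))
```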

🔴 Bug fixes

  • Fix issue #795: Fix behaviour of Token.conjuncts.
  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2091: Fix displacy support for RTL languages.
  • Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
  • Fix issue #2603: Improve handling of missing NER tags.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2740: Add ability to pass additional arguments to pipeline components.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2869: Make doc[0].is_sent_start == True.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3036: Support mutable default arguments in extension attributes.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3191: Fix pickling of Japanese.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3274: Make Token.sent work as expected without the parser.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix issue #3346: Expose Japanese stop words in language class.
  • Fix issue #3357: Update displaCy examples in docs to correctly show Token.pos_.
  • Fix issue #3345: Fix NER when preset entities cross sentence boundaries.
  • Fix issue #3348: Don't use numpy directly for similarity.
  • Fix issue #3366: Improve converters, training data formats and docs.
  • Fix issue #3369: Fix #egg fragments in direct downloads.
  • Fix issue #3382: Make Doc.from_array consistent with Doc.to_array.
  • Fix issue #3398: Don't set extension attributes in language classes.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.
    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • The serialization methods to_disk, from_disk, to_bytes and from_bytes now support a single exclude argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The disable argument on the Language serialization methods has been renamed to exclude for consistency.
    ```diff
    - nlp.to_disk("/path", disable=["parser", "ner"])
    + nlp.to_disk("/path", exclude=["parser", "ner"])
    - data = nlp.tokenizer.to_bytes(vocab=False)
    + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
    ```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer'; the name 'sbd' is deprecated.
    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The is_sent_start attribute of the first token in a Doc now correctly defaults to True. It previously defaulted to None.
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
    ```diff
    - $ spacy train en /output traindata.json devdata.json --no-parser
    + $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line, instead of separate --freqs-loc and --clusters-loc arguments.
    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
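
As a rough sketch of what the move to `--jsonl-loc` described above involves, the snippet below folds separate frequency and cluster tables into one JSON object per line. The field names (`orth`, `freq`, `cluster`) and the helper `to_jsonl` are assumptions for illustration; check the init-model documentation for the attributes spaCy actually expects.

```python
import json
from typing import Dict, List


def to_jsonl(freqs: Dict[str, int], clusters: Dict[str, str]) -> str:
    """Serialise one lexical entry per line as newline-delimited JSON."""
    lines: List[str] = []
    for orth, freq in sorted(freqs.items()):
        entry = {"orth": orth, "freq": freq}
        if orth in clusters:  # cluster IDs are optional in this sketch
            entry["cluster"] = clusters[orth]
        lines.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(lines)


jsonl = to_jsonl({"the": 1000, "cat": 12}, {"the": "0110"})
print(jsonl)
# Reading it back: every line parses as an independent JSON object.
entries = [json.loads(line) for line in jsonl.splitlines()]
```

One object per line means the file can be streamed and extended with further per-word attributes without changing the command-line interface.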

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
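
The NER F column in the legend above is an F-score over predicted entities. A minimal sketch of the usual exact-match F1 computation follows; the helper `ner_f_score` and the `(start, end, label)` tuples are assumptions for illustration, not spaCy's scorer.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def ner_f_score(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Return the F1 score (as a percentage) over exact-match entity spans."""
    true_pos = len(gold & pred)  # entities with matching boundaries and label
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


gold = {(0, 2, "PERSON"), (5, 6, "GPE"), (8, 10, "ORG")}
pred = {(0, 2, "PERSON"), (5, 6, "ORG")}
print(round(ner_f_score(gold, pred), 1))
```

Under this definition an entity with the right boundaries but the wrong label counts as both a false positive and a false negative.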

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.

- Python
Published by ines about 7 years ago

spacy - v2.1.0a12: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
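
The alignment item above refers to Levenshtein (edit-distance) alignment between token sequences. As a rough illustration of the underlying computation, here is a minimal pure-Python sketch that keeps only two rows of the dynamic-programming table. It is not spaCy's implementation, and the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a, b):
    """Return the edit distance between two sequences (strings or token lists)."""
    # Only two rows of the DP table are kept, so memory grows with len(b),
    # not with len(a) * len(b). Running time is still quadratic.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Hypothetical example: the tokenizer split "City" into two pieces.
gold_tokens = ["New", "York", "City"]
predicted_tokens = ["New", "York", "Ci", "ty"]
distance = levenshtein(gold_tokens, predicted_tokens)
print(distance)
```

Because the function only compares items for equality, it works on token lists as well as on plain strings.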

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.
  • Add Vocab.writing_system (populated via the language data) to expose settings like writing direction.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.
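
To illustrate the difference between streaming training data and loading it into memory, the following sketch uses a generator that yields one document at a time. It assumes a hypothetical one-JSON-object-per-line input, which is not necessarily the format the train command reads, and it says nothing about how GoldCorpus is implemented.

```python
import io
import json


def stream_documents(file_like):
    """Yield one parsed JSON document per non-empty line, one at a time."""
    for line in file_like:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open file handle in this hypothetical example.
corpus = io.StringIO('{"text": "First document."}\n{"text": "Second document."}\n')
texts = [doc["text"] for doc in stream_documents(corpus)]
print(texts)
```

Only the current line is held in memory while iterating, so peak memory use does not depend on the size of the corpus.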

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
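
For readers unfamiliar with the BILUO scheme mentioned in the gold.spans_from_biluo_tags item, the sketch below converts a tag sequence into (start, end, label) triples. It is a simplified stand-in written for this note: the real helper returns Span objects, and the function name `spans_from_biluo` and the tuple output are assumptions made for illustration.

```python
def spans_from_biluo(tags):
    """Convert BILUO tags such as "B-ORG" into (start, end, label) triples.

    The end index is exclusive. Malformed sequences are skipped silently.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Unit-length entity: a single token.
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            # Beginning of a multi-token entity.
            start = i
        elif prefix == "L" and start is not None:
            # Last token of a multi-token entity closes the span.
            spans.append((start, i + 1, label))
            start = None
        # "I" tags continue the current span, so nothing is recorded.
    return spans


tags = ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]
spans = spans_from_biluo(tags)
print(spans)
```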

🔴 Bug fixes

  • Fix issue #795: Fix behaviour of Token.conjuncts.
  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2091: Fix displacy support for RTL languages.
  • Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
  • Fix issue #2603: Improve handling of missing NER tags.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2740: Add ability to pass additional arguments to pipeline components.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2869: Make doc[0].is_sent_start == True.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3036: Support mutable default arguments in extension attributes.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3191: Fix pickling of Japanese.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3274: Make Token.sent work as expected without the parser.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix issue #3346: Expose Japanese stop words in language class.
  • Fix issue #3357: Update displaCy examples in docs to correctly show Token.pos_.
  • Fix issue #3345: Fix NER when preset entities cross-sentence boundaries.
  • Fix issue #3348: Don't use numpy directly for similarity.
  • Fix issue #3366: Improve converters, training data formats and docs.
  • Fix issue #3369: Fix #egg fragments in direct downloads.
  • Fix issue #3382: Make Doc.from_array consistent with Doc.to_array.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.

    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • The serialization methods to_disk, from_disk, to_bytes and from_bytes now support a single exclude argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The disable argument on the Language serialization methods has been renamed to exclude for consistency.

    ```diff
    - nlp.to_disk("/path", disable=["parser", "ner"])
    + nlp.to_disk("/path", exclude=["parser", "ner"])
    - data = nlp.tokenizer.to_bytes(vocab=False)
    + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
    ```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The is_sent_start attribute of the first token in a Doc now correctly defaults to True. It previously defaulted to None.
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

    ```diff
    - $ spacy train en /output train_data.json dev_data.json --no-parser
    + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate --freqs-loc and --clusters-loc.

    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
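
One way to see why batching merges helps, as recommended for Doc.retokenize above, is that all spans can be expressed against the original token indices and applied in a single pass. The sketch below does this on a plain list of strings. It is only a conceptual model: `merge_spans` is a hypothetical helper, it assumes non-overlapping spans, and it does not reflect how the retokenizer is implemented.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of `tokens` into one token; `end` is exclusive.

    All spans refer to indices in the original list and must not overlap.
    """
    merged = list(tokens)
    # Apply merges from right to left so the indices of earlier spans stay valid.
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [" ".join(merged[start:end])]
    return merged


tokens = ["I", "like", "New", "York", "and", "San", "Francisco", "."]
result = merge_spans(tokens, [(2, 4), (5, 7)])
print(result)
```

With one-at-a-time merging, every merge shifts the indices of the tokens that follow it, so the caller has to recompute positions between calls. Collecting the spans first avoids that bookkeeping.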

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
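
As a worked example of the UAS and LAS columns, the sketch below scores a four-token parse. It assumes each token is represented as a (head index, dependency label) pair and that every token counts towards the score; actual evaluation scripts may differ, for instance in how punctuation is treated. The function name `attachment_scores` is made up for this illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for parallel lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens whose head is correct, regardless of the label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head and label are both correct.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)
```

Here three of four heads are correct (UAS 75.0), but only two tokens have both the correct head and the correct label (LAS 50.0).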

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.

- Python
Published by ines about 7 years ago

spacy - v2.1.0a11: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
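
The joint segmentation and parsing item above can be pictured as a post-processing step: tokens that the parser marks as fragments are joined with what follows. The sketch below models this on plain lists. The label name `subtok`, the attach-to-the-next-token convention and the helper name `join_subtokens` are all assumptions made for illustration, not a description of the parser's internals.

```python
def join_subtokens(tokens, labels, subtok_label="subtok"):
    """Join each run of tokens labelled `subtok_label` with the token after the run."""
    merged = []
    buffer = ""
    for token, label in zip(tokens, labels):
        buffer += token
        if label != subtok_label:
            # A token with a regular label closes the current word.
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A trailing run with nothing to attach to is kept as its own token.
        merged.append(buffer)
    return merged


# Hypothetical over-segmented input, one character per token.
tokens = ["日", "本", "語", "を", "話", "す"]
labels = ["subtok", "subtok", "nmod", "case", "subtok", "ROOT"]
words = join_subtokens(tokens, labels)
print(words)
```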

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.
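
Checking whether a package is already installed before downloading it can be done with the standard library. The following is a minimal sketch of that idea using `importlib.util.find_spec`; it is not taken from the spacy download implementation, and `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python; the second name is a made-up package.
print(is_installed("json"))
print(is_installed("surely_not_installed_pkg_12345"))
```

A caller could skip the download step whenever the check returns True for the model's package name.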

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
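
Matching phrases on an attribute other than the verbatim text, such as the lowercase form, amounts to normalizing both the document tokens and the phrase tokens with the same function before comparing them. The sketch below shows this on plain token lists. It is a conceptual model only: `find_phrases` and its `key` parameter are invented for this example and are not part of the PhraseMatcher API.

```python
def find_phrases(tokens, phrases, key=str.lower):
    """Return sorted (start, end) spans where a phrase matches after applying `key`."""
    normalized = [key(token) for token in tokens]
    patterns = {tuple(key(token) for token in phrase) for phrase in phrases}
    matches = []
    for start in range(len(normalized)):
        for pattern in patterns:
            end = start + len(pattern)
            # Empty patterns are ignored; slices past the end are too short to match.
            if pattern and tuple(normalized[start:end]) == pattern:
                matches.append((start, end))
    return sorted(matches)


tokens = ["Barack", "obama", "visited", "BERLIN"]
phrases = [["Barack", "Obama"], ["Berlin"]]
matches = find_phrases(tokens, phrases)
print(matches)
```

Passing a different `key`, for example the identity function, gives exact matching on the original text instead.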

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
  • Fix issue #2603: Improve handling of missing NER tags.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2740: Add ability to pass additional arguments to pipeline components.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2869: Make doc[0].is_sent_start == True.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3274: Make Token.sent work as expected without the parser.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix issue #3346: Expose Japanese stop words in language class.
  • Fix issue #3357: Update displaCy examples in docs to correctly show Token.pos_.
  • Fix issue #3345: Fix NER when preset entities cross-sentence boundaries.
  • Fix issue #3348: Don't use numpy directly for similarity.
  • Fix issue #3366: Improve converters, training data formats and docs.
  • Fix issue #3369: Fix #egg fragments in direct downloads.
  • Fix issue #3382: Make Doc.from_array consistent with Doc.to_array.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.

    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • The serialization methods to_disk, from_disk, to_bytes and from_bytes now support a single exclude argument to provide a list of string names to exclude. The docs have been updated to list the available serialization fields for each class. The disable argument on the Language serialization methods has been renamed to exclude for consistency.

    ```diff
    - nlp.to_disk("/path", disable=["parser", "ner"])
    + nlp.to_disk("/path", exclude=["parser", "ner"])
    - data = nlp.tokenizer.to_bytes(vocab=False)
    + data = nlp.tokenizer.to_bytes(exclude=["vocab"])
    ```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The is_sent_start attribute of the first token in a Doc now correctly defaults to True. It previously defaulted to None.
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

    ```diff
    - $ spacy train en /output train_data.json dev_data.json --no-parser
    + $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate --freqs-loc and --clusters-loc.

    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
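
The move from per-field boolean flags to a single exclude list, described in the serialization item above, can be modelled with a small toy class. The sketch below is an illustration of the pattern only: `TinyComponent`, its two fields and the way old flags are translated are all hypothetical, and these notes do not state whether or how the library itself still accepts the old flags.

```python
import json
import warnings


class TinyComponent:
    """A toy object with two serializable fields, "cfg" and "vocab"."""

    def __init__(self, cfg, vocab):
        self.cfg = cfg
        self.vocab = vocab

    def to_bytes(self, exclude=(), **old_flags):
        exclude = set(exclude)
        # Translate old-style boolean flags such as vocab=False into exclude entries.
        for name, keep in old_flags.items():
            warnings.warn(
                f"{name}={keep} is deprecated; use exclude=[...] instead",
                DeprecationWarning,
            )
            if not keep:
                exclude.add(name)
        fields = {"cfg": self.cfg, "vocab": self.vocab}
        data = {key: value for key, value in fields.items() if key not in exclude}
        return json.dumps(data, sort_keys=True).encode("utf8")


component = TinyComponent({"width": 96}, ["apple", "pear"])
new_style = component.to_bytes(exclude=["vocab"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = component.to_bytes(vocab=False)
print(new_style == old_style, len(caught))
```

A single list argument scales to any number of fields without adding a keyword per field, which is presumably part of the appeal of the new signature.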

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
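
The NER F column is an F-score over predicted entities. As a worked example, the sketch below computes precision, recall and F1 under the assumption that an entity counts as correct only if its start, end and label all match exactly. The function name `ner_prf` and the tuple representation of entities are chosen for this illustration, and the values are fractions rather than the percentages shown in the table.

```python
def ner_prf(gold_spans, predicted_spans):
    """Return (precision, recall, F1) for exact-match (start, end, label) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 2, "PERSON"), (5, 6, "GPE"), (8, 10, "ORG")]
predicted = [(0, 2, "PERSON"), (5, 6, "ORG"), (8, 10, "ORG"), (11, 12, "DATE")]
precision, recall, f1 = ner_prf(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this example two of four predictions are correct (precision 0.5) and two of three gold entities are found (recall about 0.667), giving an F1 of 4/7, roughly 0.571.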

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig, @mikelibg, @danielkingai2 and @adrienball for the pull requests and contributions.

- Python
Published by ines about 7 years ago

spacy - v2.1.0a10: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
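
To give a rough idea of what a Levenshtein alignment between two tokenizations involves, here is a minimal pure-Python sketch of the standard dynamic-programming edit distance over token sequences. It is only an illustration: the function name `levenshtein` and the example tokens are assumptions, and this is not spaCy's actual (faster) implementation.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    # previous[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(
                previous[j] + 1,         # delete token_a
                current[j - 1] + 1,      # insert token_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


gold = ["New", "York", "City", "is", "big"]
predicted = ["New", "York-City", "is", "big"]
distance = levenshtein(gold, predicted)
print(distance)
```

Only two rows of the table are kept in memory at a time, so memory use grows with the length of the shorter axis rather than with the product of both lengths.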

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
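
As an illustration of what a helper like gold.spans_from_biluo_tags has to do, the following pure-Python sketch converts a list of BILUO tags into token-offset spans. It is a simplified stand-in under stated assumptions: the name `spans_from_biluo` is hypothetical, it returns plain `(start, end, label)` tuples rather than Span objects, and it does not validate that the labels inside one entity agree.

```python
def spans_from_biluo(tags):
    """Convert BILUO tags to (start, end, label) token spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag in ("O", "-"):
            # Outside any entity (or missing annotation): close nothing, reset.
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Single-token entity.
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            # First token of a multi-token entity.
            start = i
        elif prefix == "L" and start is not None:
            # Last token of a multi-token entity.
            spans.append((start, i + 1, label))
            start = None
        # "I" tags simply continue the currently open entity.
    return spans


tags = ["O", "B-GPE", "I-GPE", "L-GPE", "O", "U-ORG"]
spans = spans_from_biluo(tags)
print(spans)
```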

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2603: Improve handling of missing NER tags.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2869: Make doc[0].is_sent_start == True.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3274: Make Token.sent work as expected without the parser.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.
    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The is_sent_start attribute of the first token in a Doc now correctly defaults to True. It previously defaulted to None.
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
    ```diff
    - $ spacy train en /output traindata.json devdata.json --no-parser
    + $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate --freqs-loc and --clusters-loc.
    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
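
For the Doc.merge migration described above, a small end-to-end sketch of bulk merging might look like this. It assumes a blank English pipeline created with `spacy.blank("en")` and an example sentence of our own choosing; the span indices refer to the tokens of that sentence only.

```python
import spacy

# A blank pipeline only contains the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("New York City is near Jersey City today")

# Collect all merges in one block; the indices refer to the original
# tokenization and the changes are applied together when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:3])  # "New York City"
    retokenizer.merge(doc[5:7])  # "Jersey City"

tokens = [token.text for token in doc]
print(tokens)
```

Because both spans are expressed against the original token indices, there is no need to recompute offsets after the first merge, which is what made repeated Doc.merge calls awkward.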

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
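
For readers unfamiliar with the parser metrics, here is a minimal sketch of how UAS and LAS are commonly computed from aligned gold and predicted parses. The function name `attachment_scores` and the toy annotations are assumptions for illustration; real evaluation scripts often apply extra conventions (for example around punctuation) that are not modelled here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for aligned (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct_heads = 0
    correct_both = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
        if g_head == p_head:
            correct_heads += 1      # counts towards UAS
            if g_label == p_label:
                correct_both += 1   # head and label correct: counts towards LAS
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# One (head index, dependency label) pair per token.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (1, "det"), (1, "obj")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)
```

Since a token only counts towards LAS when its head is also correct, LAS can never exceed UAS, which matches the pattern in the table.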

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a9: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add simpler, GPU-friendly option to TextCategorizer, and allow setting exclusive_classes and architecture arguments on initialization.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
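
The cloze-style setup behind the pretraining can be pictured with a small pure-Python sketch: some tokens are hidden and the model is asked to recover them from context. Everything here is an assumption for illustration (the `mask_tokens` name, the `[MASK]` symbol, the 15% rate and the seeding); it only shows how masked inputs and targets are paired, not how spacy pretrain implements its objective.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask="[MASK]", seed=0):
    """Hide a random subset of tokens; return (masked tokens, {index: original})."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = []
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask)
            targets[i] = token  # what the model should predict at position i
        else:
            masked.append(token)
    return masked, targets


tokens = "the parser now learns to merge over-segmented tokens".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
print(masked)
print(targets)
```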

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
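
The idea behind matching phrases on an attribute other than ORTH, such as LOWER, can be sketched without spaCy at all: both the text and the phrase are mapped through the same key function before comparison. The helper below is a hypothetical pure-Python illustration (`find_phrases` and its `key` parameter are our own names), not the PhraseMatcher API.

```python
def find_phrases(tokens, phrases, key=str.lower):
    """Return (start, end, phrase) for every occurrence; end is exclusive."""
    normalized = [key(token) for token in tokens]
    matches = []
    for phrase in phrases:
        target = [key(token) for token in phrase]
        size = len(target)
        # Slide a window of the phrase length over the normalized tokens.
        for start in range(len(normalized) - size + 1):
            if normalized[start:start + size] == target:
                matches.append((start, start + size, " ".join(phrase)))
    return matches


tokens = ["I", "visited", "NEW", "YORK", "and", "new", "york"]
matches = find_phrases(tokens, [["New", "York"]])
print(matches)
```

Swapping `key` for another function (for example one that returns a coarse tag for each token) gives the same kind of matching on a different attribute.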

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to TextCategorizer.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2329: Correct TextCategorizer and GoldParse API docs.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2390: Support setting lexical attributes during retokenization.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2644: Add table explaining training metrics to docs.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2728: Fix HTML escaping in displacy NER visualization and correct API docs.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix issue #3112: Make sure entity types are added correctly on GPU.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • Due to difficulties linking our new blis for faster platform-independent matrix multiplication, v2.1.x currently doesn't work on Python 2.7 on Windows. We expect this to be corrected in the future.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.
    ```diff
    - doc[1:5].merge()
    - doc[6:8].merge()
    + with doc.retokenize() as retokenizer:
    +     retokenizer.merge(doc[1:5])
    +     retokenizer.merge(doc[6:8])
    ```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.
    ```diff
    - sentence_splitter = nlp.create_pipe('sbd')
    + sentence_splitter = nlp.create_pipe('sentencizer')
    ```
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.
    ```diff
    - $ spacy train en /output traindata.json devdata.json --no-parser
    + $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
    ```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line instead of a separate --freqs-loc and --clusters-loc.
    ```diff
    - $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
    + $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
    ```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
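
To make the "one lexical entry per line" format for --jsonl-loc more concrete, here is a small sketch that writes and re-reads a newline-delimited JSON file using only the standard library. The field names (`orth`, `prob`, `cluster`) and values are assumptions chosen for illustration; check the init-model documentation for the attributes it actually accepts.

```python
import json
import os
import tempfile

# Hypothetical lexical entries: one JSON object per word.
entries = [
    {"orth": "the", "prob": -3.5, "cluster": 5},
    {"orth": "of", "prob": -4.1, "cluster": 5},
    {"orth": "spaCy", "prob": -12.0, "cluster": 42},
]

path = os.path.join(tempfile.mkdtemp(), "vocab.jsonl")

# Write: serialize each entry on its own line.
with open(path, "w", encoding="utf8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Read: parse the file line by line, skipping blank lines.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Because each line is an independent JSON document, such a file can be processed as a stream without holding the whole vocabulary in memory.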

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
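
The NER F column is an F-score over predicted entities. As a rough sketch of the arithmetic, assuming exact-match scoring where an entity counts as correct only if its start, end and label all agree with the gold annotation (the `prf` helper and the toy spans are our own illustration):

```python
def prf(gold_spans, predicted_spans):
    """Return (precision, recall, F-score) for exact-match entity spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Entities as (start token, end token, label); end is exclusive.
gold = [(0, 2, "PERSON"), (5, 6, "ORG"), (8, 9, "GPE")]
predicted = [(0, 2, "PERSON"), (5, 6, "GPE")]
precision, recall, f_score = prf(gold, predicted)
print(precision, recall, f_score)
```

Here one of two predictions is correct (precision 0.5) and one of three gold entities is found (recall 1/3), giving an F-score of 0.4; the table reports such scores multiplied by 100.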

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a8: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use. See here for the updated nightly docs.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
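
The gold.spans_from_biluo_tags helper listed above turns BILUO tags into Span objects. To show what that decoding involves, here is a minimal pure-Python sketch that returns plain (start, end, label) tuples instead of Span objects. The function name biluo_to_spans is hypothetical and the logic is an assumption about the general technique, not spaCy's code:

```python
from typing import List, Tuple


def biluo_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BILUO tags into (start, end, label) token spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # first token of a multi-token entity
            start = i
        elif prefix == "L" and start is not None:  # last token closes the entity
            spans.append((start, i + 1, label))
            start = None
        # "I" tags continue an open entity and need no action.
    return spans


tokens = ["Apple", "opened", "an", "office", "in", "San", "Francisco", "Bay", "Area"]
tags = ["U-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "L-LOC"]
spans = biluo_to_spans(tags)
entities = [(" ".join(tokens[s:e]), label) for s, e, label in spans]
```

Here `spans` holds token offsets, which is the information needed to build spans over a document and assign them to doc.ents.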

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The deprecated Doc.merge and Span.merge methods still work, but you may notice that they now run slower when merging many objects in a row. That's because the merging engine was rewritten to be more reliable and to support more efficient merging in bulk. To take advantage of this, you should rewrite your logic to use the Doc.retokenize context manager and perform as many merges as possible together in the with block.

```diff
- doc[1:5].merge()
- doc[6:8].merge()
+ with doc.retokenize() as retokenizer:
+     retokenizer.merge(doc[1:5])
+     retokenizer.merge(doc[6:8])
```
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce an n_process argument for parallel inference via multiprocessing.)
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line, instead of separate --freqs-loc and --clusters-loc arguments.

```diff
- $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
+ $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
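
For readers unfamiliar with the newline-delimited JSON layout that --jsonl-loc expects, the sketch below writes and re-reads such a file using only the standard library. The field names (orth, prob, cluster) and the values are illustrative assumptions, not the confirmed schema; check the init-model documentation for the attributes it actually reads. An in-memory buffer stands in for vocab.jsonl.

```python
import io
import json

# Hypothetical lexical entries; the field names are illustrative assumptions.
entries = [
    {"orth": "the", "prob": -3.5, "cluster": "1101"},
    {"orth": "spaCy", "prob": -12.0, "cluster": "0110"},
]

# Write one JSON object per line (the buffer stands in for vocab.jsonl).
buffer = io.StringIO()
for entry in entries:
    buffer.write(json.dumps(entry) + "\n")

# Read the entries back, one line at a time, skipping blank lines.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```

Because every line is a complete JSON object, the file can be streamed entry by entry without loading the whole vocabulary into memory.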

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
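
To make the UAS and LAS columns concrete, the sketch below computes both scores for a single sentence using the standard definitions: UAS counts tokens whose predicted head is correct, LAS additionally requires the correct dependency label. The function name and the toy parse are hypothetical, and published evaluations may apply extra conventions (for example around punctuation) that this sketch ignores.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for one parsed sentence."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# "She ate fresh bread": heads are token indices, the root points at itself.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "nmod"), (2, "dobj")]
uas, las = attachment_scores(gold, predicted)
```

In this toy example three of four heads are right but only two arcs also carry the right label, which is why LAS can never exceed UAS.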

📖 Documentation and examples

Although it looks pretty much the same, we've rebuilt the entire documentation using Gatsby and MDX. It's now an even faster progressive web app and allows us to write all content entirely in Markdown, without having to compromise on easy-to-use custom UI components. We're hoping that the Markdown source will make it even easier to contribute to the documentation. For more details, check out the styleguide and source.

While converting the pages to Markdown, we've also fixed a bunch of typos, improved the existing pages and added some new content:

  • Usage Guide: Rule-based Matching. How to use the Matcher, PhraseMatcher and the new EntityRuler, and write powerful components to combine statistical models and rules.
  • Usage Guide: Saving and Loading. Everything you need to know about serialization, and how to save and load pipeline components, package your spaCy models as Python modules and use entry points.
  • Usage Guide: Merging and Splitting. How to retokenize a Doc using the new retokenize context manager and merge spans into single tokens and split single tokens into multiple.
  • Universe: Videos and Podcasts
  • API: EntityRuler
  • API: SentenceSegmenter
  • API: Pipeline functions

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825, @grivaz, @roshni-b, @mpuig and @mikelibg for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a7: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: 2-3 times faster tokenization across all languages at the same accuracy!
  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging and splitting tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • NEW: gold.spans_from_biluo_tags helper that returns Span objects, e.g. to overwrite the doc.ents.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
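
The PhraseMatcher change listed above lets phrases be compared on an attribute such as LOWER instead of the exact text. The pure-Python sketch below illustrates the difference between the two comparison modes on a plain list of token strings. It is a conceptual stand-in only: find_phrases is a hypothetical helper, not part of spaCy, and it uses a simple sliding window rather than the matcher's real algorithm.

```python
from typing import List, Tuple


def find_phrases(tokens: List[str], phrases: List[List[str]],
                 lowercase: bool = True) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets, end exclusive, of every phrase occurrence."""
    def key(token: str) -> str:
        # Compare lowercase forms (like LOWER) or the exact text (like ORTH).
        return token.lower() if lowercase else token

    keyed_tokens = [key(t) for t in tokens]
    matches = []
    for phrase in phrases:
        keyed_phrase = [key(t) for t in phrase]
        width = len(keyed_phrase)
        if width == 0:
            continue
        # Slide a window of the phrase's length over the token sequence.
        for start in range(len(tokens) - width + 1):
            if keyed_tokens[start:start + width] == keyed_phrase:
                matches.append((start, start + width))
    return sorted(matches)


tokens = ["He", "visited", "NEW", "York", "and", "new", "york", "again"]
phrases = [["New", "York"]]
case_insensitive = find_phrases(tokens, phrases, lowercase=True)
case_sensitive = find_phrases(tokens, phrases, lowercase=False)
```

With lowercase comparison both spellings are found, while exact-text comparison finds neither, since no token pair is spelled exactly "New York".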

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1642: Replace regex with re and speed up tokenization.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2833: Raise better error if Token or Span are pickled.
  • Fix issue #2838: Add Retokenizer.split method to split one token into several.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #2901: Fix issue with first call of nlp in Japanese (MeCab).
  • Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix issue #3122: Correct docs of Token.subtree and Span.subtree.
  • Fix issue #3128: Improve error handling in converters.
  • Fix issue #3248: Fix PhraseMatcher pickling and make __len__ consistent.
  • Fix issue #3277: Add en/em dash to tokenizer prefixes and suffixes.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • For better compatibility with the Universal Dependencies data, the lemmatizer now preserves capitalization, e.g. for proper nouns (see #3256).
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```
  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

```diff
- $ spacy train en /output train_data.json dev_data.json --no-parser
+ $ spacy train en /output train_data.json dev_data.json --pipeline tagger,ner
```
  • The spacy init-model command now uses a --jsonl-loc argument to pass in a newline-delimited JSON (JSONL) file containing one lexical entry per line, instead of separate --freqs-loc and --clusters-loc arguments.

```diff
- $ spacy init-model en ./model --freqs-loc ./freqs.txt --clusters-loc ./clusters.txt
+ $ spacy init-model en ./model --jsonl-loc ./vocab.jsonl
```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a7 | 91.6 | 89.7 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a7 | 91.8 | 90.0 | 96.9 | 86.3 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a7 | 91.9 | 90.1 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a7 | 91.7 | 89.5 | 97.3 | 83.4 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a7 | 92.3 | 90.4 | 97.4 | 83.8 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a7 | 90.2 | 87.1 | 97.0 | 89.1 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a7 | 91.2 | 88.4 | 97.2 | 89.4 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a7 | 89.5 | 86.2 | 80.1 | 89.0 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a7 | 87.3 | 84.4 | 94.7 | 83.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a7 | 89.1 | 86.2 | 95.3 | 83.3 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a7 | 91.1 | 87.2 | 96.0 | 86.3 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a7 | 83.9 | 77.6 | 91.5 | 87.0 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a7 | 85.1 | 81.5 | 94.5 | 73.3 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a7 | 88.2 | 85.1 | 96.7 | 78.1 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a7 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin, @moreymat, @mirfan899, @ozcankasal, @willprice, @alvations, @amperinet, @retnuh, @Loghijiaha, @DeNeutoy, @gavrieltal, @boena, @BramVanroy, @pganssle, @foufaster, @adrianeboyd, @maknotavailable, @pierremonico, @lauraBaakman, @juliamakogon, @Gizzio, @Abhijit-2592, @akki2825 and @grivaz for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a6: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
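
The cloze task mentioned above trains a model to recover words that have been hidden from its input. The sketch below shows only the input-corruption step in plain Python. The function name, the masking rate and the "<mask>" placeholder are all assumptions made for illustration; they are not taken from the spacy pretrain implementation, which according to the notes above predicts approximate outputs rather than exact tokens.

```python
import random
from typing import List, Tuple


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[int]]:
    """Replace a random subset of tokens with a placeholder.

    Returns the masked sequence and the positions a model would have to fill in.
    """
    rng = random.Random(seed)
    # Mask at least one token so that there is always something to predict.
    n_masked = max(1, round(len(tokens) * mask_rate)) if tokens else 0
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    for position in positions:
        masked[position] = "<mask>"
    return masked, positions


tokens = "the parser now learns to merge the tokens".split()
masked, positions = mask_tokens(tokens, mask_rate=0.25)
```

A training loop would feed `masked` to the model and compute a loss only at `positions`, comparing the predictions against the original tokens at those indices.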

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Enhanced pattern API for rule-based Matcher (see #1971).
  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1537: Make Span.as_doc return a copy, not a view.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1773: Prevent tokenizer exceptions from setting POS but not TAG.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #1963: Resize Doc.tensor when merging spans.
  • Fix issue #1971: Update Matcher engine to support regex, extension attributes and rich comparison.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2396: Fix Doc.get_lca_matrix.
  • Fix issue #2464, #3009: Fix behaviour of Matcher's ? quantifier.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3012: Fix clobber of Doc.is_tagged in Doc.from_array.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix issue #3064: Allow single string attributes in Doc.to_array.
  • Fix issue #3093, #3067: Set vectors.name correctly when exporting model via CLI.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.
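
The whitespace constraint from #2870 can be pictured as a validity check on the entity recognizer's actions. The following is a minimal sketch in plain Python, not spaCy's actual (Cython) transition system; the function name `is_valid_ner_action` and the single-letter action strings are assumptions made for illustration.

```python
# Illustrative sketch only: the function name and the action encoding are
# hypothetical and do not mirror spaCy's internal implementation.

# BILUO scheme: Begin, In, Last, Unit, Out
BOUNDARY_ACTIONS = {"B", "L", "U"}


def is_valid_ner_action(token_text: str, action: str) -> bool:
    """Return False for actions that would start or end an entity on whitespace."""
    if token_text.isspace() and action in BOUNDARY_ACTIONS:
        return False
    return True


tokens = ["Apple", "\n", "Inc."]
for token in tokens:
    allowed = [a for a in "BILUO" if is_valid_ner_action(token, a)]
    print(repr(token), allowed)
```

Under this rule a whitespace token can still sit inside an entity (I) or outside one (O), but it can never be the first, last or only token of an entity.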

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```

  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

```diff
- $ spacy train en /output traindata.json devdata.json --no-parser
+ $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
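
For scripts that still build the old disable flags, the change to spacy train can be sketched as a small translation step. This helper is hypothetical and not part of spaCy: only --no-parser is named in these notes, so the other two flag names and the default component order `tagger,parser,ner` are assumptions.

```python
# Hypothetical migration helper, not part of spaCy. Only --no-parser appears
# in the notes above; the other flag names and the default order are assumed.
DEFAULT_PIPELINE = ["tagger", "parser", "ner"]

LEGACY_FLAGS = {
    "--no-tagger": "tagger",
    "--no-parser": "parser",
    "--no-entities": "ner",
}


def pipeline_from_legacy_flags(flags):
    """Translate old-style disable flags into a value for --pipeline."""
    disabled = {LEGACY_FLAGS[flag] for flag in flags}
    return ",".join(name for name in DEFAULT_PIPELINE if name not in disabled)


print(pipeline_from_legacy_flags(["--no-parser"]))  # tagger,ner
```

Listing the components that should run, rather than the ones to skip, is what lets the same option cover custom components.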

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a6 | 91.5 | 89.6 | 96.8 | 85.5 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a6 | 91.9 | 90.2 | 97.0 | 86.4 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a6 | 92.0 | 90.2 | 97.0 | 86.6 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a6 | 91.6 | 89.6 | 97.2 | 83.3 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a6 | 92.2 | 90.3 | 97.5 | 83.9 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a6 | 90.3 | 87.3 | 97.0 | 89.0 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a6 | 90.9 | 88.1 | 97.2 | 89.3 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a6 | 89.4 | 86.0 | 80.4 | 89.1 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a6 | 87.7 | 84.8 | 94.5 | 82.9 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a6 | 89.1 | 86.5 | 95.1 | 83.4 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a6 | 90.9 | 87.2 | 95.9 | 86.4 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a6 | 83.7 | 77.6 | 91.5 | 87.1 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a6 | 85.0 | 81.5 | 94.8 | 73.1 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a6 | 88.4 | 85.2 | 96.6 | 81.0 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a6 | - | - | - | 81.6 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
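
To make the two parser metrics concrete, here is a minimal sketch of how unlabelled and labelled attachment scores can be computed from per-token (head, label) pairs. The function name and the toy annotations are made up for illustration, and this is not the evaluation script behind the table; real evaluations may, for example, treat punctuation differently.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Each item is a (head_index, dependency_label) pair for one token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    # UAS only requires the head to be right
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS requires both the head and the label to be right
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Toy annotations for "She ate the apple" (made up for this example)
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "nmod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 100.0 75.0
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS, which matches every row of the table.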

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal, @svlandeg, @jarib, @alvaroabascar, @kbulygin and @moreymat for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a5: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
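
For readers unfamiliar with Levenshtein alignment, the sketch below computes the textbook edit distance between two token sequences while keeping only two rows of the table in memory. It is an illustration of the general technique, not spaCy's implementation, and the function name is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, using two rows of memory."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


gold_tokens = ["New", "York", "is", "big"]
predicted_tokens = ["New York", "is", "big"]
print(levenshtein(gold_tokens, predicted_tokens))  # 2
```

The running time still grows with the product of the two lengths, but memory grows only with the shorter dimension kept in `previous`, which is one common way to make long inputs practical.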

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
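
As a short illustration of the Doc.retokenize context manager listed above, the following sketch merges two tokens into one. It assumes spaCy v2.1 or later is installed and uses a blank English pipeline, so no model download is needed; the example sentence is arbitrary.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model required
nlp = spacy.blank("en")
doc = nlp("New York is big")

# Merges are collected inside the block and applied together when it exits
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([token.text for token in doc])
```

Collecting several merges in one block means the Doc is only restructured once, which is where the efficiency gain over merging spans one at a time comes from.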

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1585: Prevent parser from predicting unseen classes.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1816: Allow custom Language subclasses via entry points.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2779: Fix handling of pre-set entities.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix issue #3048: Raise better errors for uninitialized pipeline components.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```

  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

```diff
- $ spacy train en /output traindata.json devdata.json --no-parser
+ $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
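
The NER F column is an F-score over predicted entities. As a rough sketch of what that means, the snippet below scores exact matches of (start, end, label) triples; the function name and the toy spans are illustrative assumptions, and this is not the script used to produce the table.

```python
def ner_f_score(gold_spans, predicted_spans):
    """F-score over exact (start, end, label) matches."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up spans: one exact match, one wrong label, one spurious entity
gold = [(0, 2, "GPE"), (5, 6, "ORG")]
pred = [(0, 2, "GPE"), (5, 6, "PERSON"), (8, 9, "DATE")]
print(round(100 * ner_f_score(gold, pred), 1))  # 40.0
```

An entity with the right boundaries but the wrong label counts as both a missed gold entity and a spurious prediction, so it lowers precision and recall at once.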

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a4: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Make TextCategorizer default to a simpler, GPU-friendly model.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.
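
In a cloze task, some tokens are hidden and the model is trained to recover them from context. The sketch below shows only the masking step, in plain Python. The mask rate, the "[MASK]" placeholder and the function name are assumptions chosen for illustration and are not taken from spacy pretrain.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask="[MASK]"):
    """Hide a random subset of tokens; return (masked_tokens, targets)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    masked, targets = [], {}
    for index, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask)
            targets[index] = token  # what the model should recover
        else:
            masked.append(token)
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
print(masked)
print(targets)
```

The `targets` mapping is what a training loop would compare the model's predictions against, at the masked positions only.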

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.
  • Improve loading time of French by ~30%.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • NEW: Replace ujson, msgpack, msgpack-numpy, pickle, cloudpickle and dill with our own package srsly to centralise dependencies and allow binary wheels.
  • NEW: Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • NEW: Built-in EntityRuler component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
  • Accept "TEXT" as an alternative to "ORTH" in Matcher patterns.
  • Refactor CLI and add debug-data command to validate training data (see #2932).
  • Use black for auto-formatting .py source and optimise codebase using flake8. You can now run flake8 spacy and it should return no errors or warnings. See CONTRIBUTING.md for details.
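
The idea behind matching phrases on an attribute other than ORTH, such as LOWER, can be sketched without spaCy: compare a derived value of each token instead of its exact text. The function below is a plain-Python illustration with a hypothetical name and a naive scan; it is not how PhraseMatcher is implemented.

```python
def find_phrase(tokens, phrase, key=str.lower):
    """Yield (start, end) offsets where `phrase` matches, comparing key(token)."""
    wanted = [key(token) for token in phrase]
    keyed = [key(token) for token in tokens]
    for start in range(len(keyed) - len(wanted) + 1):
        end = start + len(wanted)
        if keyed[start:end] == wanted:
            yield start, end


tokens = ["I", "like", "MACHINE", "Learning", "a", "lot"]
print(list(find_phrase(tokens, ["machine", "learning"])))  # [(2, 4)]
```

Swapping `key` for a function that returns a part-of-speech tag would give the POS or TAG style of matching mentioned in the list, provided the tokens carry that information.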

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1748, #1798, #2756, #2934: Make TextCategorizer default to a simpler, GPU-friendly model.
  • Fix issue #1782, #2343: Fix training on GPU.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2648: Fix KeyError in Vectors.most_similar.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2693: Only use 'sentencizer' as built-in sentence boundary component name.
  • Fix issue #2754, #3028: Make NORM a Token attribute instead of a Lexeme attribute to allow setting context-specific norms in tokenizer exceptions.
  • Fix issue #2769: Fix issue that'd cause segmentation fault when calling EntityRecognizer.add_label.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as B, L or U.
  • Fix issue #2871: Fix vectors for reserved words.
  • Fix issue #3027: Allow Span to take unicode value for label argument.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • The Doc.print_tree method is now deprecated in favour of a unified Doc.to_json method, which outputs data in the same format as the expected JSON training data.
  • The built-in rule-based sentence boundary detector is now only called 'sentencizer' – the name 'sbd' is deprecated.

```diff
- sentence_splitter = nlp.create_pipe('sbd')
+ sentence_splitter = nlp.create_pipe('sentencizer')
```

  • The spacy train command now lets you specify a comma-separated list of pipeline component names, instead of separate flags like --no-parser to disable components. This is more flexible and also handles custom components out-of-the-box.

```diff
- $ spacy train en /output traindata.json devdata.json --no-parser
+ $ spacy train en /output traindata.json devdata.json --pipeline tagger,ner
```
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a5 | 91.2 | 89.3 | 96.9 | 85.6 | 𐄂 | 10 MB |
| en_core_web_md | English | 2.1.0a5 | 91.4 | 89.5 | 96.9 | 85.9 | ✓ | 90 MB |
| en_core_web_lg | English | 2.1.0a5 | 91.5 | 89.7 | 97.0 | 86.3 | ✓ | 788 MB |
| de_core_news_sm | German | 2.1.0a5 | 91.3 | 89.0 | 97.1 | 82.2 | 𐄂 | 10 MB |
| de_core_news_md | German | 2.1.0a5 | 92.0 | 90.0 | 97.4 | 82.7 | ✓ | 210 MB |
| es_core_news_sm | Spanish | 2.1.0a5 | 89.9 | 86.7 | 96.6 | 87.3 | 𐄂 | 10 MB |
| es_core_news_md | Spanish | 2.1.0a5 | 90.6 | 87.7 | 97.0 | 88.0 | ✓ | 69 MB |
| pt_core_news_sm | Portuguese | 2.1.0a5 | 89.3 | 86.0 | 78.5 | 87.8 | 𐄂 | 12 MB |
| fr_core_news_sm | French | 2.1.0a5 | 87.3 | 84.4 | 94.4 | 81.0 | 𐄂 | 14 MB |
| fr_core_news_md | French | 2.1.0a5 | 88.8 | 86.1 | 94.9 | 82.2 | ✓ | 82 MB |
| it_core_news_sm | Italian | 2.1.0a5 | 90.8 | 87.0 | 95.7 | 84.8 | 𐄂 | 10 MB |
| nl_core_news_sm | Dutch | 2.1.0a5 | 83.7 | 77.4 | 90.9 | 85.4 | 𐄂 | 10 MB |
| el_core_news_sm | Greek | 2.1.0a5 | 85.5 | 81.8 | 94.7 | 75.9 | 𐄂 | 10 MB |
| el_core_news_md | Greek | 2.1.0a5 | 88.5 | 85.2 | 96.8 | 80.01 | ✓ | 126 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a5 | - | - | - | 82.8 | 𐄂 | 3 MB |

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas, @skrcode, @gavrieltal and @svlandeg for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.0.18: Alpha support for Catalan and dependency fixes

✨ New features and improvements

  • NEW: Alpha tokenization support for Catalan.
  • Improve French tokenization.
  • Fix regex pin to harmonise dependencies with conda.
  • Fix msgpack pin.
  • Update tests for pytest 4.0.

🔴 Bug fixes

  • Fix issue #2933: Correct mistake in is_ascii documentation.
  • Fix issue #2976: Fix bug where Vocab.prune_vectors did not use batch_size.
  • Fix issue #2986: Correctly document when Span.ents was added.
  • Fix issue #2995, #2996: Fix msgpack pin.

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @mpuig, @ALSchwalm, @bpben, @svlandeg and @wxv for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.1.0a3: New models, ULMFit/BERT/Elmo-like pretraining, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

⚠️ Due to difficulties linking our new blis for faster platform-independent matrix multiplication, this nightly release currently doesn't work on Python 2.7 on Windows. We expect this problem to be corrected in the future.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new spacy pretrain command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in spacy train, using the new -t2v argument.
  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Add EntityRecognizer.labels property.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v7.0, which defaults to single-thread with fast blis kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.

CLI

  • NEW: pretrain command for ULMFit/BERT/Elmo-like pretraining (see #2931).
  • NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • NEW: Allow PhraseMatcher to match on token attributes other than ORTH, e.g. LOWER (for case-insensitive matching) or even POS or TAG.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
  • Improved JSON(L) format for training (see #2928, #2932).
  • Doc.to_json() method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
  • Refactor CLI and add debug-data command to validate training data (see #2932).

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2482: Fix serialization when parser model is empty.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix issue #2772: Fix bug in sentence starts for non-projective parses.
  • Fix issue #2782: Make like_num work with prefixed numbers.
  • Fix serialization of custom tokenizer if not all functions are defined.
  • Fix bugs in beam-search training objective.
  • Fix problems with model pickling.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a4 | 91.7 | 89.8 | 96.8 | 85.7 | 𐄂 | 12 MB |
| en_core_web_md | English | 2.1.0a4 | 92.0 | 90.1 | 97.0 | 86.2 | ✓ | 93 MB |
| en_core_web_lg | English | 2.1.0a4 | 92.1 | 90.3 | 97.0 | 86.5 | ✓ | 780 MB |
| de_core_news_sm | German | 2.1.0a4 | 91.9 | 89.8 | 97.2 | 83.4 | 𐄂 | 12 MB |
| de_core_news_md | German | 2.1.0a4 | 91.3 | 90.5 | 97.4 | 83.6 | ✓ | 212 MB |
| es_core_news_sm | Spanish | 2.1.0a4 | 90.1 | 87.1 | 96.8 | 89.3 | 𐄂 | 12 MB |
| es_core_news_md | Spanish | 2.1.0a4 | 90.7 | 87.8 | 97.1 | 89.4 | ✓ | 72 MB |
| pt_core_news_sm | Portuguese | 2.1.0a4 | 89.2 | 85.8 | 79.8 | 82.4 | 𐄂 | 14 MB |
| fr_core_news_sm | French | 2.1.0a4 | 87.2 | 84.0 | 94.4 | 67.0 ¹ | 𐄂 | 16 MB |
| fr_core_news_md | French | 2.1.0a4 | 88.8 | 86.0 | 94.9 | 70.0 ¹ | ✓ | 84 MB |
| it_core_news_sm | Italian | 2.1.0a4 | 90.6 | 87.0 | 96.0 | 81.7 | 𐄂 | 12 MB |
| nl_core_news_sm | Dutch | 2.1.0a4 | 83.1 | 77.2 | 91.3 | 87.3 | 𐄂 | 12 MB |
| el_core_news_sm | Greek | 2.1.0a4 | 84.2 | 80.4 | 94.6 | 71.5 | 𐄂 | 12 MB |
| el_core_news_md | Greek | 2.1.0a4 | 87.5 | 84.1 | 96.4 | 78.3 | ✓ | 128 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a4 | - | - | - | 83.2 | 𐄂 | 4 MB |

1) We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
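
To make the UAS and LAS columns concrete, here is a minimal sketch of how the two attachment scores are commonly computed from gold and predicted (head, label) pairs. The function name and the toy sentence are illustrative assumptions, not spaCy's evaluation code, and real evaluations often apply extra conventions (such as skipping punctuation) that are left out here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Each argument is a list of (head_index, dependency_label) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head is correct; the label is ignored.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label are correct.
    labelled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labelled_hits / total


# Toy four-token sentence (hypothetical data).
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (2, "dobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```

In the toy data, three of four heads match (UAS 75.0), but only two tokens have both the right head and the right label (LAS 50.0), which is why LAS is never higher than UAS.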

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos, @louridas and @skrcode for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.0.17: Fix NER segfaults and various small issues

✨ New features and improvements

  • Make max_length of input text inclusive.
  • Raise error when setting overlapping entities as doc.ents.
  • Improve French lemmatization and check if a word is in one of the regular lists specific to each part-of-speech tag.
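
The overlap check mentioned above can be pictured with a small standalone sketch. It assumes entities are given as (start, end) token offsets with an exclusive end; the helper name `find_overlap` is hypothetical and this is not spaCy's internal implementation.

```python
def find_overlap(spans):
    """Return the first pair of overlapping (start, end) spans, or None.

    Spans use token offsets with an exclusive end, like doc[start:end].
    """
    ordered = sorted(spans)
    for previous, current in zip(ordered, ordered[1:]):
        # After sorting by start, an overlap exists when a span begins
        # before the preceding span has ended.
        if current[0] < previous[1]:
            return previous, current
    return None


print(find_overlap([(0, 2), (2, 4)]))  # adjacent spans do not overlap
print(find_overlap([(3, 5), (0, 4)]))  # these two share token 3
```

A setter for entity spans could call a check like this and raise an error instead of silently accepting the conflicting spans.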

🔴 Bug fixes

  • Fix issue #1581, #1969, #1986: Fix out-of-bounds access in NER training that could cause a segmentation fault.
  • Fix issue #2924: Prevent problem where displacy arcs would receive the same IDs in Jupyter notebooks, causing weirdly positioned arc labels.
  • Fix issue #2948: Fix problem with symlink creation on Windows.

📖 Documentation and examples

  • Fix various typos and inconsistencies.
  • Update spaCy Universe with new projects.
  • Add example script showing a fix-up rule for whitespace entities like '\n'.

👥 Contributors

Thanks to @digest0r, @BramVanroy, @grivaz, @wannaphongcom, @mikelibg, @danielhers, @frascuchon, @mauryaland and @cicorias for the pull requests and contributions.

- Python
Published by ines over 7 years ago

spacy - v2.0.16: Fix msgpack-numpy pin

🔴 Bug fixes

  • Fix msgpack-numpy pin, which could affect serialization on Python 2.7.

- Python
Published by ines over 7 years ago

spacy - v2.0.15: More wheels and GPU improvements

✨ New features and improvements

  • Improve version compatibility to support wheels for all spaCy dependencies maintained by us: thinc, cymem, preshed and murmurhash.
  • Support GPU installation by specifying spacy[cuda], spacy[cuda90], spacy[cuda91], spacy[cuda92] or spacy[cuda10], which will install cupy and thinc_gpu_ops.
  • Add spacy.prefer_gpu() and spacy.require_gpu() functions.

📖 Documentation and examples

  • Update GPU installation and usage docs.

- Python
Published by ines over 7 years ago

spacy - v2.0.13: Wheels, alpha support for Telugu and Sinhala, rule-based lemmatization for French and Greek, plus various small fixes

✨ New features and improvements

  • NEW: Pre-built wheels and up to 10 times faster installation! This release starts the journey towards pre-built wheels for all of spaCy's dependencies. Once that's completed, you won't even need a local compiler anymore to install the library. For more details on our wheels process, see explosion/wheelwright.
  • NEW: Alpha support for Telugu and Sinhala.
  • NEW: Rule-based lemmatization for Greek and French.
  • Port over Chinese support (#1210) from v1.x.
  • Improve language data for Persian, Greek, Swedish, Bengali, Polish, Portuguese, Indonesian, French, German and Russian.
  • Add Span.ents property for consistency with Doc.ents.
  • Add --verbose option to spacy train to output more details for debugging.

🔴 Bug fixes

  • Fix issue #653: Introduce bulk merge function.
  • Fix issue #1445, #1917, #2209, #2362, #2371, #2383, #2501, #2743, #2758: Fix Keras examples.
  • Fix issue #2261, #2800: Fix bug that could cause a crash with too many entity types.
  • Fix issue #2540: Improve French stop words.
  • Fix issue #2582, #2640, #2645, #2657, #2705, #2784, #2815, #2841, #2845: Fix typos and inconsistencies in documentation.
  • Fix issue #2593: Prevent numpy warning.
  • Fix issue #2706: Add missing label FAC to spacy.explain glossary.
  • Fix issue #2709: Pass default option when calling getoption() in conftest.py.

📖 Documentation and examples

  • Improve Keras examples.
  • Update training examples to use minibatching.
  • Fix various typos and inconsistencies.
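
For readers unfamiliar with minibatching, the idea is simply to feed the training loop fixed-size chunks of examples rather than one example at a time. spaCy ships its own helper for this (`spacy.util.minibatch`); the standalone function below is only an illustrative sketch of the idea and is not that helper's implementation.

```python
from itertools import islice


def minibatch(items, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical training examples, batched two at a time.
examples = ["doc0", "doc1", "doc2", "doc3", "doc4"]
for batch in minibatch(examples, size=2):
    print(batch)
```

The last batch may be smaller than `size` when the number of examples is not an exact multiple.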

👥 Contributors

Thanks to @DimaBryuhanov, @kororo, @AndriyMulyar, @katarkor, @giannisdaras, @bphi, @vikaskyadav, @sammous, @EmilStenstrom, @howl-anderson, @ohenrik, @aashishg, @aryaprabhudesai, @steve-prod, @njsmith, @aniruddha-adhikary, @pzelasko, @mbkupfer, @sainathadapa, @tyburam, @grivaz, @filipecaixeta, @aongko, @free-variation, @mauryaland, @pmj642, @keshan, @darindf, @charlax, @phojnacki, @skrcode, @jacopofar, @Cinnamy and @JKhakpour for the pull requests and contributions!

- Python
Published by ines over 7 years ago

spacy - v2.1.0a1: New models, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Fix bugs in beam-search training objective.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
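
As background for the Levenshtein alignment mentioned above, the sketch below shows the textbook edit-distance computation between two tokenizations, using two rows of the dynamic-programming table. It is a plain-Python illustration under that assumption, not the faster implementation the release refers to, and it returns only the distance rather than the alignment itself.

```python
def levenshtein(source, target):
    """Edit distance between two sequences (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j items of `target`.
    previous = list(range(len(target) + 1))
    for i, src_item in enumerate(source, start=1):
        current = [i]
        for j, tgt_item in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_item != tgt_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


gold_tokens = ["New", "York", "is", "big"]
predicted_tokens = ["New", "York", "is", "very", "big"]
print(levenshtein(gold_tokens, predicted_tokens))  # 1
```

Keeping only two rows keeps memory proportional to the shorter dimension of the table, although the running time is still the product of the two sequence lengths.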

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.
  • NEW: Statistical models for Greek.

CLI

  • NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
  • Add support for multi-task objectives to train command.
  • Add support for data-augmentation to train command.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
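
The case-insensitive stop word behaviour can be pictured as a lookup on the lowercased form. The snippet below is a minimal sketch with a tiny made-up stop list and a hypothetical `is_stop` helper; it does not reproduce how `Token.is_stop` or `Lexeme.is_stop` are implemented.

```python
STOP_WORDS = {"the", "and", "of"}  # tiny illustrative list, not spaCy's


def is_stop(text, stop_words=STOP_WORDS):
    """Case-insensitive membership test: compare the lowercased form."""
    return text.lower() in stop_words


for word in ["the", "The", "AND", "Matcher"]:
    print(word, is_stop(word))
```

With a case-sensitive lookup, "The" at the start of a sentence would not have been flagged, which is the situation the change addresses.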

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
  • Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | English | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | English | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | English | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | German | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | German | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | Spanish | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | Spanish | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | Portuguese | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | French | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | 𐄂 | 32 MB |
| fr_core_news_md | French | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| it_core_news_sm | Italian | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | Dutch | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| el_core_news_sm | Greek | 2.1.0a0 | 84.5 | 81.0 | 95.0 | 73.5 | 𐄂 | 27 MB |
| el_core_news_md | Greek | 2.1.0a0 | 87.7 | 84.7 | 96.3 | 80.2 | ✓ | 143 MB |
| xx_ent_wiki_sm | Multi | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

¹ We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.

- Python
Published by ines almost 8 years ago

spacy - v2.0.12: Greek, Arabic, Urdu, Tatar, improved language data, better model downloads & various compatibility and bug fixes

We had to release another update to the v2.0.x branch of spaCy to resolve a dependency issue, so we decided to also include and/or backport a bunch of features and fixes that were originally intended for v2.1.0 (see here for the nightly version).

✨ New features and improvements

  • NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
  • NEW: Mecab-based Japanese tokenization and lemmatization.
  • NEW: Add Norwegian rule-based and lookup lemmatization.
  • NEW: Add Danish lookup lemmatization based on the Den store danske SprogTeknologiske Ordbase (STO) dataset, courtesy of The University of Copenhagen.
  • NEW: Romanian lookup lemmatization.
  • Improve language data for Polish, Turkish, French, Romanian, Swedish and Japanese.
  • Improve case-sensitive lookup lemmatization in German.
  • Add Token.sent property that returns the sentence Span the token is part of.
  • Add remove_extension method on Doc, Token and Span.
  • Add Doc.is_sentenced property that returns True if sentence boundaries have been applied.
  • Allow ignoring warnings by code via the SPACY_WARNING_IGNORE environment variable.
  • Add --silent option to info command.
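
To illustrate filtering warnings by code through an environment variable, here is a minimal sketch. It assumes the variable holds a comma-separated list of codes such as `W006,W008`; that format, the example codes and both helper names are assumptions made for illustration, not a description of spaCy's internals.

```python
import os


def ignored_warning_codes(environ=os.environ):
    """Parse a comma-separated list of warning codes (assumed format)."""
    raw = environ.get("SPACY_WARNING_IGNORE", "")
    return {code.strip() for code in raw.split(",") if code.strip()}


def should_warn(code, environ=os.environ):
    """Return False for codes the user asked to silence."""
    return code not in ignored_warning_codes(environ)


# Simulated environment, so the example does not depend on the real one.
fake_env = {"SPACY_WARNING_IGNORE": "W006, W008"}
print(should_warn("W006", fake_env))
print(should_warn("W001", fake_env))
```

Reading the variable at call time, as done here, means a change to the environment takes effect without re-importing anything; a real library might instead read it once at import.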

🔴 Bug fixes

  • Fix issue #1456: Pass additional arguments of download command to pip and check if model is already installed before downloading it.
  • Fix issue #2191: Update README section on tests and dependencies.
  • Fix issue #2194: Ensure that Doc.noun_chunks_iterator isn't None before calling it.
  • Fix issue #2196: Return data in cli.info and add silent option.
  • Fix issue #2200: Correct typo in spacy package command message.
  • Fix issue #2210: Fix bug in Spanish noun chunks.
  • Fix issue #2211, #2320: Resolve problem in download command and use requests library again.
  • Fix issue #2219: Fix token similarity of single-letter tokens.
  • Fix issue #2222, #2223: Fix typos in documentation and docstrings.
  • Fix issue #2226: Use correct, non-deprecated merge syntax in merge_ents.
  • Fix issue #2228: Fix deserialization when using tensor=False or sentiment=False.
  • Fix issue #2238: Correct Swedish lookup lemmatization.
  • Fix issue #2242: Add remove_extension method on Doc, Token and Span.
  • Fix issue #2266: Add collapse_phrases option to displaCy visualizer.
  • Fix issue #2269: Fix KeyError by renaming SP to _SP.
  • Fix issue #2304: Don't require attrs argument in Doc.retokenize and allow ints/unicode.
  • Fix issue #2361: Escape HTML tags in displacy.render.
  • Fix issue #2376: Improve Matcher examples and add section on using pipeline components.
  • Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
  • Fix issue #2452: Fix bug that would cause displacy arrows to only point in one direction.
  • Fix issue #2477: Also allow Span objects in displacy.render.
  • Fix issue #2490: Update Thinc's dependencies for Python 3.7 compatibility.
  • Fix issue #2495: Fix loading tokenizer with custom prefix search.
  • Fix issue #2514: Switch from msgpack-python to msgpack to hopefully prevent conda from downloading a two-year-old spaCy version when installing with the latest Anaconda distribution.
  • Ensure that Doc.is_tagged is set correctly when using Language.pipe.
  • Fix bug in merge_noun_chunks factory that would return None if Doc wasn't parsed.
  • Explicitly require pathlib backport on Python 2 only.
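
For context on the IOB to BILUO fix above, the following sketch shows one way the conversion can handle multi-word entities: a token ends an entity unless the next tag is an `I-` tag with the same label. It assumes well-formed IOB input, and the function is a standalone illustration rather than spaCy's own converter.

```python
def iob_to_biluo(tags):
    """Convert IOB tags such as 'B-ORG', 'I-ORG', 'O' to the BILUO scheme."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # Single-token entities become U(nit), longer ones keep B(egin).
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # Inside tokens become L(ast) when the entity ends here.
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


iob = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
print(iob_to_biluo(iob))
# ['B-ORG', 'I-ORG', 'L-ORG', 'O', 'U-PER']
```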

📖 Documentation and examples

  • NEW: Edit and execute code examples in your browser – all across the documentation!
  • NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
  • NEW: Experimental rule-based Matcher Explorer demo – create token patterns interactively, test them against your text and copy-paste the Python pattern code.
  • NEW: Document Cython API.
  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @mollerhoj, @howl-anderson, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @bellabie, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @ansgar-t, @mpszumowski, @91ns, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @DuyguA, @stefan-it, @Eleni170, @datascouting, @tjkemp, @x-ji, @giannisdaras, @kororo and @katarkor for the pull requests and contributions.

- Python
Published by ines almost 8 years ago

spacy - v2.1.0a0: New models, joint word segmentation and parsing, better Matcher, bug fixes & more

🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.

```bash
pip install -U spacy-nightly
```

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Fix bugs in beam-search training objective.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.

CLI

  • NEW: ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
  • Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | 𐄂 | 32 MB |
| fr_core_news_md | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| it_core_news_sm | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| xx_ent_wiki_sm | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

¹ We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA for the pull requests and contributions.

- Python
Published by ines almost 8 years ago

spacy - v2.0.11: Alpha Vietnamese support, fixes to vectors, improved errors and more

📊 Help us improve spaCy and take the User Survey 2018!


✨ New features and improvements

  • NEW: Alpha Vietnamese support with tokenization via Pyvi.
  • NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified asserts have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
  • Improve language data for Polish.
  • Tidy up dependencies and drop six, html5lib, ftfy and requests.
  • Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the beam_update_prob entry in nlp.parser.cfg. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
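
The random mix of greedy and beam updates described above can be sketched as a per-sentence coin flip. The notes only state the behaviour at the default of 0.5, so treating the value as the probability of a greedy update is an assumption of this sketch, as is the `choose_update` helper; neither is taken from spaCy's parser code.

```python
import random


def choose_update(beam_update_prob=0.5, rng=random):
    """Pick an update type for one sentence.

    Assumption: the value is read as the chance of a greedy update.
    At the default of 0.5 both readings give an even split.
    """
    return "greedy" if rng.random() < beam_update_prob else "beam"


# Seeded generator so repeated runs draw the same sequence.
rng = random.Random(0)
choices = [choose_update(0.5, rng) for _ in range(10_000)]
share = choices.count("greedy") / len(choices)
print(f"greedy share: {share:.2f}")
```

Over many sentences roughly half of the updates take the cheaper greedy path, which is where the efficiency gain comes from.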

🔴 Bug fixes

  • Fix issue #1554, #1752, #2159: Fix Token.ent_iob after Doc.merge(), and ensure consistency in Doc.ents.
  • Fix issue #1660: Fix loading of multiple vector models.
  • Fix issue #1967: Allow entity types with dashes.
  • Fix issue #2032: Fix accidentally quadratic runtime in Vocab.set_vector.
  • Fix issue #2050: Correct mistakes in Italian lemmatizer data.
  • Fix issue #2073: Make Token.set_extension work as expected.
  • Fix issue #2100, #2151, #2181: Drop six and html5lib and prevent dependency conflict with TensorFlow / Keras.
  • Fix issue #2101: Improve error message if token text is empty string.
  • Fix issue #2121: Fix Language.to_bytes and pickling in Thinc.
  • Fix issue #2156: Fix hashtag example in Matcher docs.
  • Fix issue #2177: Don't raise error in set_extension if getter and setter are specified or if default=None, and add error if setter is specified with no getter.

📖 Documentation and examples

👥 Contributors

Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.

- Python
Published by ines about 8 years ago