Recent Releases of https://github.com/bramvanroy/spacy_conll
https://github.com/bramvanroy/spacy_conll - v4.0.0
What's Changed
Two new changes thanks to user @rominf:
- Repackaged the library to bring it up to modern standards, notably relying on a pyproject.toml file and removing support for Python <3.8.
- When dep, pos, tag, or lemma fields are empty, the underscore (`_`) will be used
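The underscore fallback can be sketched in a few lines (the helper names and field layout below are illustrative, not the library's internals):

```python
def conllu_field(value):
    # Render one CoNLL-U field, falling back to the "_" placeholder
    # when the value is empty (as for a missing dep, pos, tag or lemma).
    return value if value else "_"

def format_token_row(fields):
    # Join the fields of a single token into a tab-separated CoNLL-U row.
    return "\t".join(conllu_field(f) for f in fields)

# A token whose tag and lemma fields are empty:
row = format_token_row(["1", "Hello", "", "INTJ", ""])
# row == "1\tHello\t_\tINTJ\t_"
```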
New Contributors
- @rominf made their first contribution in https://github.com/BramVanroy/spacy_conll/pull/32
Full Changelog: https://github.com/BramVanroy/spacy_conll/compare/v3.4.0...v4.0.0
- Python
Published by BramVanroy over 1 year ago
https://github.com/bramvanroy/spacy_conll - Update default field names and allow custom ones
What's Changed
- improve CoNLL-U fields by @BramVanroy in https://github.com/BramVanroy/spacy_conll/pull/25
Full Changelog: https://github.com/BramVanroy/spacy_conll/compare/v3.3.0...v3.4.0
Published by BramVanroy almost 3 years ago
https://github.com/bramvanroy/spacy_conll - Changes to input format of pretokenized text
Since spaCy 3.2.0, spaCy has become stricter about the data that is passed to a pipeline: passing a list of pretokenized tokens (`["This", "is", "a", "pretokenized", "sentence"]`) is no longer accepted. The `is_tokenized` option has therefore been adapted to reflect this. It is still possible to pass a string in which tokens are separated by whitespace, e.g. `"This is a pretokenized sentence"`; this will continue to work for spaCy and Stanza. Support for pretokenized data has been dropped for UDPipe.
Specific changes:
- [conllparser] Breaking change: `is_tokenized` is no longer a valid argument to `ConllParser`.
- [utils/conllparser] Breaking change: when using UDPipe, pretokenized data is no longer supported.
- [utils] Breaking change: `SpacyPretokenizedTokenizer.__call__` no longer supports a list of tokens.
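If your input is already a list of tokens, the practical workaround is to join it on whitespace before handing it to the pipeline. A minimal sketch (the helper name is hypothetical):

```python
def tokens_to_text(tokens):
    # spaCy >= 3.2 no longer accepts a plain list of tokens, so turn the
    # pretokenized input into the whitespace-separated string that spaCy
    # and Stanza pipelines still accept.
    return " ".join(tokens)

text = tokens_to_text(["This", "is", "a", "pretokenized", "sentence"])
# text == "This is a pretokenized sentence"
```

The resulting string can then be fed to a parser initialised with `is_tokenized=True`.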
Published by BramVanroy about 3 years ago
https://github.com/bramvanroy/spacy_conll - Entry points and quality of life improvements
- [conllformatter] Fixed an issue where `SpaceAfter=No` was not added correctly to tokens
- [conllformatter] Added `ConllFormatter` as an entry point, which means that you no longer have to import `spacy_conll` when you want to add the pipe to a parser! spaCy will know where to look for the CoNLL formatter when you use `nlp.add_pipe("conll_formatter")`, without you having to import the component manually
- [conllformatter] The component constructor is now registered via a construction function rather than directly on the class, as recommended by spaCy. The formatter has also been rewritten as a dataclass
- [conllformatter/utils] Moved `merge_dicts_strict` to utils, outside the formatter class
- [conllparser] Made `ConllParser` directly importable from the root of the library, i.e., `from spacy_conll import ConllParser`
- [init_parser] Allow users to exclude pipeline components when using the spaCy parser, with the `exclude_spacy_components` argument
- [init_parser] Fixed an issue where disabling sentence segmentation would not work if your model does not have a parser
- [init_parser] Enabled more options for pre-segmented text when using Stanza. You can now also disable sentence segmentation for Stanza (but still do tokenization) with the `disable_sbd` option
- [utils] Added `SpacyDisableSentenceSegmentation` as an entry-point custom component so that you can use it in your own code, by calling `nlp.add_pipe("disable_sbd", before="parser")`
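As an aside, a "strict" dictionary merge such as the `merge_dicts_strict` mentioned above can be sketched as follows; this is a plausible reconstruction for illustration, not the library's actual implementation:

```python
def merge_dicts_strict(d1, d2):
    # Merge two dicts, refusing to continue when a key occurs in both:
    # a strict merge never silently overwrites a value.
    overlap = d1.keys() & d2.keys()
    if overlap:
        raise KeyError(f"duplicate keys in strict merge: {sorted(overlap)}")
    merged = dict(d1)
    merged.update(d2)
    return merged
```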
Published by BramVanroy almost 4 years ago
https://github.com/bramvanroy/spacy_conll - Fix no_split_on_newline
- [conllparser] Fix: fixed an issue with `no_split_on_newline` in combination with `nlp.pipe`
Published by BramVanroy over 4 years ago
https://github.com/bramvanroy/spacy_conll - Bugfix for ConllParser: do not require stanza and udpipe
- [conllparser] Fix: make sure the parser also runs if stanza and UDPipe are not installed
Published by BramVanroy over 4 years ago
https://github.com/bramvanroy/spacy_conll - Release for spaCy v3
This release makes spacy_conll compatible with spaCy's new v3 release. On top of that, some improvements were made to make the project easier to maintain.
- [general] Breaking change: spaCy v3 required (closes https://github.com/BramVanroy/spacy_conll/issues/8)
- [init_parser] Breaking change: in all cases, `is_tokenized` now disables sentence segmentation
- [init_parser] Breaking change: no more default values for parser or model anywhere. Important to note here is that spaCy no longer works with shorthand codes such as `en`. You have to provide the full model name, e.g. `en_core_web_sm`
- [init_parser] Improvement: models are automatically downloaded for Stanza and UDPipe
- [cli] Reworked the position of the CLI script in the directory structure, as well as its arguments. Run `parse-as-conll -h` for more information.
- [conllparser] Made the `ConllParser` class available as a utility to easily create a wrapper for a spaCy-like parser, which can return the parsed CoNLL output of a given file or text
- [conllparser,cli] Improved the usability of `n_process`: the library will try to figure out whether multiprocessing is available for your platform and, if not, tell you so. Such a priori error messages can be disabled with `ignore_pipe_errors`, both on the command line and in `ConllParser`'s parse methods
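An a priori multiprocessing check of this kind could look roughly like the following; the platform heuristic and the function name are assumptions for illustration, not the library's real logic:

```python
import sys

def check_n_process(n_process, ignore_pipe_errors=False):
    # Fail early when multiprocessing is requested on a platform where it
    # is assumed not to work with nlp.pipe (here, hypothetically, Windows),
    # unless the user opted out of a priori error messages.
    if n_process > 1 and sys.platform == "win32" and not ignore_pipe_errors:
        raise OSError(
            "n_process > 1 does not seem to be supported on this platform; "
            "pass ignore_pipe_errors to skip this check"
        )
    return n_process
```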
Published by BramVanroy over 4 years ago
https://github.com/bramvanroy/spacy_conll - Preparing for v3 release
- Last version to support spaCy v2. New versions will require spaCy v3
- Last version to support `spacy-stanfordnlp`. `spacy-stanza` is still supported
Published by BramVanroy over 4 years ago
https://github.com/bramvanroy/spacy_conll - Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more
Fully reworked version!
- Tested support for both `spacy-stanza` and `spacy-udpipe`! (Not included as dependencies; install them manually)
- Added a useful utility function `init_parser` that can easily initialise a parser together with the custom pipeline component. (See the README or examples)
- Added a `disable_pandas` flag to the formatter class in case you want to disable setting the pandas attribute even when pandas is installed.
- Added custom properties for Tokens as well. So now a Doc, its sentence Spans, as well as its Tokens have custom attributes
- Reworked the datatypes of the output. In version 2.0.0 the data types are as follows:
  - `._.conll`: raw CoNLL format
    - in a `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
    - in a sentence `Span`: a list of its tokens' `._.conll` dictionaries (a list of dictionaries).
    - in a `Doc`: a list of its sentences' `._.conll` lists (a list of lists of dictionaries).
  - `._.conll_str`: string representation of the CoNLL format
    - in a `Token`: a tab-separated representation of the contents of the CoNLL fields, ending with a newline.
    - in a sentence `Span`: the expected CoNLL format, where each row represents a token. When `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the CoNLL format.
    - in a `Doc`: all its sentences' `._.conll_str` combined, separated by new lines.
  - `._.conll_pd`: pandas representation of the CoNLL format
    - in a `Token`: a `Series` representation of this token's CoNLL properties.
    - in a sentence `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column headers.
    - in a `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame` whose index is reset.
- `field_names` has been removed, assuming that you do not need to change the column names of the CoNLL properties
- Removed the `Spacy2ConllParser` class
- Many doc changes, added tests, and a few examples
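The nesting of `._.conll` described above can be illustrated with plain Python data structures (the field values here are invented, not real parser output):

```python
# Token level: one dict of CoNLL fields per token.
token1 = {"ID": 1, "FORM": "Hello", "LEMMA": "hello"}
token2 = {"ID": 2, "FORM": "world", "LEMMA": "world"}

# Sentence Span level: a list of its tokens' dicts.
sent_conll = [token1, token2]

# Doc level: a list of its sentences' lists (list of lists of dicts).
doc_conll = [sent_conll]

# ._.conll_str for a token: its tab-separated fields plus a newline.
token1_str = "\t".join(str(v) for v in token1.values()) + "\n"
# token1_str == "1\tHello\thello\n"
```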
Published by BramVanroy almost 6 years ago
https://github.com/bramvanroy/spacy_conll - Add SpaceAfter=No property
- IMPORTANT: This will be the last release that supports the deprecated `Spacy2ConllParser` class!
- Community addition: add SpaceAfter=No to the Misc field when applicable (https://github.com/BramVanroy/spacy_conll/pull/6). Thanks @KoichiYasuoka!
- Fixed failing tests
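The SpaceAfter=No convention can be sketched as follows; the helper is hypothetical, and it takes the token's trailing whitespace (what spaCy exposes as `Token.whitespace_`) as a plain string:

```python
def misc_for(trailing_whitespace):
    # The Misc column is "SpaceAfter=No" when no whitespace follows the
    # token (e.g. a word immediately followed by punctuation), else "_".
    return "_" if trailing_whitespace else "SpaceAfter=No"

# "Hello," tokenised as ["Hello", ","]: no space after "Hello".
miscs = [misc_for(""), misc_for(" ")]
# miscs == ["SpaceAfter=No", "_"]
```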
Published by BramVanroy almost 6 years ago
https://github.com/bramvanroy/spacy_conll - Documentation spacy-stanfordnlp, custom tagset map
The documentation has been greatly expanded. The most important addition to the README is the mention and explanation of using spacy-stanfordnlp. spacy_conll can be used together with this spaCy wrapper around stanfordnlp. The benefit is that we can use Stanford models, with a spaCy interface. From a user perspective, this means better models, guaranteed Universal Dependencies tagsets, and an easy API through spaCy. (The cost is that Stanford NLP models are significantly slower than spaCy's models.) Small tests for spacy_stanfordnlp have been added.
A new feature is that you can now add a custom tagset map (`conversion_maps`). The idea is that you, as a user, have more control over the output tags: you can, for instance, specify that all deprel tags `nsubj` should be renamed to `subj`. This is useful if your model uses a different tagset than the one you want. See the advanced example in the README for more information.
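Applying such a map can be sketched as follows; the `{field: {old_tag: new_tag}}` layout and the helper name are assumptions for illustration:

```python
def apply_conversion_map(field, value, conversion_maps):
    # Replace a tag with its user-supplied alternative when one is given
    # for this field; otherwise keep the original value unchanged.
    return conversion_maps.get(field, {}).get(value, value)

conversion_maps = {"DEPREL": {"nsubj": "subj"}}
tag = apply_conversion_map("DEPREL", "nsubj", conversion_maps)
# tag == "subj"
```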
This release closes:
- "The dependency relations aren't transformed to universal dependencies" (https://github.com/BramVanroy/spacy_conll/issues/4)
Published by BramVanroy about 6 years ago
https://github.com/bramvanroy/spacy_conll - Add dependencies to setup.py
This small release adds the dependencies to setup.py, solving potential issues (e.g. https://github.com/BramVanroy/spacy_conll/issues/3).
Current dependencies are:
- packaging
- spacy
Published by BramVanroy about 6 years ago
https://github.com/bramvanroy/spacy_conll - spaCy pipeline component, improved command line script with multiprocessing
This small repo has been overhauled so that users can integrate it directly in their spaCy scripts. You can now use it as a spaCy component. Three custom attributes have been added to `Doc._.` and to a `Doc`'s sentences. You can find more information in the README, as well as example usage.
The command line script has been improved as well, now using the pipeline component instead of Spacy2ConllParser. The latter has been deprecated (but is still accessible for now). Multiprocessing via the command line script is now possible, too.
Published by BramVanroy about 6 years ago