Releases | Open Source Science

pyonmttok - Tokenizer 1.37.1

Fixes and improvements

Consider escaped characters as single characters in BPE
Ignore undefined scripts when resolving inherited or common scripts

- C++
Published by guillaumekln almost 3 years ago

pyonmttok - Tokenizer 1.37.0

New features

Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

Fix infinite loop when the text contains an invalid Unicode character
Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
[Python] Update ICU to 72.1

- C++
Published by guillaumekln almost 3 years ago

pyonmttok - Tokenizer 1.36.0

New features

[Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
[Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor

- C++
Published by guillaumekln about 3 years ago

pyonmttok - Tokenizer 1.35.0

New features

[Python] Add pickling support to pyonmttok.Vocab

Fixes and improvements

Update pybind11 to 2.10.1
Update cibuildwheel to 2.11.2

- C++
Published by guillaumekln about 3 years ago

pyonmttok - Tokenizer 1.34.0

Changes

[Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

[Python] Build wheels for Python 3.11

Fixes and improvements

Improve error handling when reading token frequencies in the vocabulary file
[Python] Fix possible crash when pyonmttok is imported before torch
[Python] Update ICU to 71.1
[C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
[C++] Fix CMake warning when compiling the tests

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.33.0

New features

[Python] Build ARM64 wheels for macOS

Fixes and improvements

[CLI] Fix error when the option --segment_alphabet is not set
Fix SentencePiece build warning when compiling with Clang

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.32.0

New features

Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token

Fixes and improvements

Update pybind11 to 2.10.0
Update cxxopts to 3.0.0

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.31.0

New features

Add utilities to build and use vocabularies:
- pyonmttok.Vocab
- pyonmttok.build_vocab_from_tokens
- pyonmttok.build_vocab_from_lines
Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:

    tokens = tokenizer(text)

Fixes and improvements

Update pybind11 to 2.9.1

- C++
Published by guillaumekln almost 4 years ago

pyonmttok - Tokenizer 1.30.1

Fixes and improvements

Fix deprecated language codes in ICU that are incorrectly considered as invalid (e.g. "tl" for Tagalog)

- C++
Published by guillaumekln about 4 years ago

pyonmttok - Tokenizer 1.30.0

New features

[Python] Build wheels for AArch64 Linux

Fixes and improvements

[Python] Update ICU to 70.1

- C++
Published by guillaumekln about 4 years ago

pyonmttok - Tokenizer 1.29.0

Changes

[Python] Drop support for Python 3.5

New features

[Python] Build wheels for Python 3.10
[Python] Add tokenization method Tokenizer.tokenize_batch

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.28.1

Fixes and improvements

Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.28.0

Changes

[C++] Remove the SpaceTokenizer class that is not meant to be public and can be confused with the "space" tokenization mode

New features

Build Python wheels for Windows
Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
Expose option with_separators in Python and CLI to include whitespace characters in the tokenized output
[Python] Add package version information in pyonmttok.__version__

Fixes and improvements

Fix detokenization when option with_separators is enabled

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.27.0

Changes

Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
macOS Python wheels now require macOS >= 10.14

Fixes and improvements

Fix casing resolution when some letters do not have case information
Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
Improve error message when setting invalid segment_alphabet or lang options
Update SentencePiece to 0.1.96
[Python] Improve declaration of functions and classes for better type hints and checks
[Python] Update ICU to 69.1

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.4

Fixes and improvements

Fix regression introduced in last version for preserved tokens that are not segmented by BPE

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.3

Fixes and improvements

Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.2

Fixes and improvements

Fix a divergence with the SentencePiece output when the spacer is detached from the word

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.1

Fixes and improvements

Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without joiner in the vocabulary
Fix compilation with ICU versions older than 60

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.0

New features

Add lang tokenization option to apply language-specific case mappings

Fixes and improvements

Use ICU to convert strings to Unicode values instead of a custom implementation

- C++
Published by guillaumekln almost 5 years ago

pyonmttok - Tokenizer 1.25.0

New features

Add training flag in tokenization methods to disable subword regularization during inference
[Python] Implement __len__ method in the Token class

Fixes and improvements

Raise an error when enabling case_markup with incompatible tokenization modes "space" and "none"
[Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads (the Python GIL is now released)
[Python] Clean up some manual Python <-> C++ type conversions

- C++
Published by guillaumekln almost 5 years ago

pyonmttok - Tokenizer 1.24.0

New features

Add verbose flag in file tokenization APIs to log progress every 100,000 lines
[Python] Add options property to Tokenizer instances
[Python] Add class pyonmttok.SentencePieceTokenizer to help create a tokenizer compatible with SentencePiece

Fixes and improvements

Fix deserialization into Token objects that was sometimes incorrect
Fix Windows compilation
Fix Google Test integration that was sometimes installed as part of make install
[Python] Update pybind11 to 2.6.2
[Python] Update ICU to 66.1
[Python] Compile ICU with optimization flags

- C++
Published by guillaumekln about 5 years ago

pyonmttok - Tokenizer 1.23.0

Changes

Drop Python 2 support

New features

Publish Python wheels for macOS

Fixes and improvements

Improve performance in all tokenization modes (up to 2x faster)
Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
Fix a regression introduced in 1.20 where segment_alphabet_* options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using preserve_segmented_tokens and the word is segmented by both a segment_* option and BPE
Fix incorrect tokenization when using support_prior_joiners and some joiners are within protected sequences

- C++
Published by guillaumekln about 5 years ago

pyonmttok - Tokenizer 1.22.2

Fixes and improvements

Do not require "none" tokenization mode for SentencePiece vocabulary restriction

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.22.1

Fixes and improvements

Fix error when enabling vocabulary restriction with SentencePiece and spacer_annotate is not explicitly set
Fix backward compatibility with Kangxi and Kanbun scripts (see segment_alphabet option)

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.22.0

Changes

[C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a std::shared_ptr to make it outlive the Tokenizer instance.

New features

Add set_random_seed function to make subword regularization reproducible
[Python] Support serialization of Token instances
[C++] Add Options structure to configure tokenization options (Flags can still be used for backward compatibility)

Fixes and improvements

Fix BPE vocabulary restriction when using joiner_new, spacer_annotate, or spacer_new (the previous implementation always assumed joiner_annotate was used)
[Python] Fix spacer argument name in Token constructor
[C++] Fix ambiguous subword encoder ownership by using a std::shared_ptr

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.21.0

New features

Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)

Fixes and improvements

Fix BPE vocabulary restriction when words have a leading or trailing joiner
Raise an error when using a multi-character joiner and support_prior_joiners
[Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
[Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
[Python] Improve compatibility with Python 3.9

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.20.0

Changes

The following changes affect users compiling the project from the source. They ensure users get the best performance and all features by default:
- ICU is now required to improve performance and Unicode support
- SentencePiece is now integrated as a Git submodule and linked statically to the project
- Boost is no longer required, the project now uses cxxopts which is integrated as a Git submodule
- The project is compiled in Release mode by default
- Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

Accept any Unicode script aliases in the segment_alphabet option
Update SentencePiece to 0.1.92
[Python] Improve the capabilities of the Token class:
- Implement the __repr__ method
- Allow setting all attributes in the constructor
- Add a copy constructor
[Python] Add a copy constructor for the Tokenizer class

Fixes and improvements

[Python] Accept None value for segment_alphabet argument

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.19.0

New features

Add BPE dropout (Provilkov et al. 2019)
[Python] Introduce the "Token API": a set of methods that manipulate Token objects instead of serialized strings
[Python] Add unicode_ranges argument to the detokenize_with_ranges method to return ranges over Unicode characters instead of bytes

Fixes and improvements

Include "Half-width kana" in Katakana script detection

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.18.5

Fixes and improvements

Fix possible crash when applying a case insensitive BPE model on Unicode characters

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.18.4

Fixes and improvements

Fix segmentation fault on cli/tokenize exit
Ignore empty tokens during detokenization
When writing to a file, avoid flushing the output stream on each line
Update cli/CMakeLists.txt to mark Boost.ProgramOptions as required

(This is the first release to be created on GitHub. See CHANGELOG.md for the release notes of earlier tags.)

- C++
Published by guillaumekln almost 6 years ago
