Recent Releases of pyonmttok
pyonmttok - Tokenizer 1.37.1
Fixes and improvements
- Consider escaped characters as single characters in BPE
- Ignore undefined scripts when resolving inherited or common scripts
- C++
Published by guillaumekln almost 3 years ago
pyonmttok - Tokenizer 1.37.0
New features
- Add tokenization option `allow_isolated_marks` to allow combining marks to appear isolated in the tokenization output in specific conditions
Fixes and improvements
- Fix infinite loop when the text contains an invalid Unicode character
- Fix segmentation fault when the `BPELearner` does not find any pairs of characters in the tokenized data
- [Python] Update ICU to 72.1
Published by guillaumekln almost 3 years ago
pyonmttok - Tokenizer 1.36.0
New features
- [Python] Add argument `vocabulary` in the `Tokenizer` constructor to set the vocabulary with a list of tokens instead of using a file
- [Python] Add function `pyonmttok.is_valid_language` to check if a language code is valid and can be passed to the `Tokenizer` constructor
Published by guillaumekln about 3 years ago
pyonmttok - Tokenizer 1.35.0
New features
- [Python] Add pickling support to `pyonmttok.Vocab`
Fixes and improvements
- Update pybind11 to 2.10.1
- Update cibuildwheel to 2.11.2
Published by guillaumekln about 3 years ago
pyonmttok - Tokenizer 1.34.0
Changes
- [Python] Wheels are now built under `manylinux2014` and require `pip` >= 19.3 for installation
New features
- [Python] Build wheels for Python 3.11
Fixes and improvements
- Improve error handling when reading token frequencies in the vocabulary file
- [Python] Fix possible crash when `pyonmttok` is imported before `torch`
- [Python] Update ICU to 71.1
- [C++] Fix static compilation with `-DBUILD_SHARED_LIBS=OFF`
- [C++] Fix CMake warning when compiling the tests
Published by guillaumekln over 3 years ago
pyonmttok - Tokenizer 1.33.0
New features
- [Python] Build ARM64 wheels for macOS
Fixes and improvements
- [CLI] Fix error when the option `--segment_alphabet` is not set
- Fix SentencePiece build warning when compiling with Clang
Published by guillaumekln over 3 years ago
pyonmttok - Tokenizer 1.32.0
New features
- Add property `pyonmttok.Vocab.counters` to retrieve the number of occurrences of each token
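To make the idea of per-token occurrence counts concrete, here is a minimal standard-library sketch. It does not use `pyonmttok`: the function `build_vocab_sketch` and its `minimum_frequency` parameter are hypothetical names, and the assumption that counts are kept in a list aligned with token ids is ours, not something the release note states.

```python
from collections import Counter


def build_vocab_sketch(tokens, minimum_frequency=1):
    """Count tokens and return (ids_to_tokens, counts), most frequent first.

    Illustrative only: this is not how pyonmttok.Vocab is implemented.
    """
    counters = Counter(tokens)
    kept = [(tok, n) for tok, n in counters.items() if n >= minimum_frequency]
    # Sort by descending count, then by token, so the order is deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    ids_to_tokens = [tok for tok, _ in kept]
    counts = [n for _, n in kept]
    return ids_to_tokens, counts


ids_to_tokens, counts = build_vocab_sketch("a b a c b a".split())
```

Here `counts[i]` is the number of occurrences of `ids_to_tokens[i]`, which is one plausible reading of what a `counters` property exposes.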
Fixes and improvements
- Update pybind11 to 2.10.0
- Update cxxopts to 3.0.0
Published by guillaumekln over 3 years ago
pyonmttok - Tokenizer 1.31.0
New features
- Add utilities to build and use vocabularies:
  - `pyonmttok.Vocab`
  - `pyonmttok.build_vocab_from_tokens`
  - `pyonmttok.build_vocab_from_lines`
- Define the method `Tokenizer.__call__` to simplify the tokenizer usage when additional features are unused:

```python
tokens = tokenizer(text)
```
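As a rough illustration of why a `__call__` shortcut is convenient, the sketch below uses a hypothetical stand-in class, `SketchTokenizer`, rather than the real `pyonmttok.Tokenizer`. It assumes the full `tokenize` method returns a `(tokens, features)` pair and that the shortcut simply drops the features; whitespace splitting stands in for real tokenization.

```python
class SketchTokenizer:
    """Hypothetical stand-in; not pyonmttok.Tokenizer."""

    def tokenize(self, text):
        # Full method: returns the tokens plus a slot for additional features.
        tokens = text.split()
        features = None
        return tokens, features

    def __call__(self, text):
        # Shortcut: discard the features and return only the tokens.
        tokens, _ = self.tokenize(text)
        return tokens


sketch_tokenizer = SketchTokenizer()
sketch_tokens = sketch_tokenizer("Hello World !")
```

With this shape, callers that never use features avoid unpacking a tuple on every call.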
Fixes and improvements
- Update pybind11 to 2.9.1
Published by guillaumekln almost 4 years ago
pyonmttok - Tokenizer 1.30.1
Fixes and improvements
- Fix deprecated language codes in ICU that are incorrectly considered as invalid (e.g. "tl" for Tagalog)
Published by guillaumekln about 4 years ago
pyonmttok - Tokenizer 1.30.0
New features
- [Python] Build wheels for AArch64 Linux
Fixes and improvements
- [Python] Update ICU to 70.1
Published by guillaumekln about 4 years ago
pyonmttok - Tokenizer 1.29.0
Changes
- [Python] Drop support for Python 3.5
New features
- [Python] Build wheels for Python 3.10
- [Python] Add tokenization method `Tokenizer.tokenize_batch`
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.28.1
Fixes and improvements
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)
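The distinction between a fullwidth percent sign that starts an escape sequence and one that is ordinary text can be sketched as follows. This is not the library's code: it assumes an escape sequence is the fullwidth percent sign (U+FF05) followed by exactly four hexadecimal digits, and `unescape_sketch` is a hypothetical helper name.

```python
import re

# Assumption: an escape sequence is U+FF05 followed by exactly four
# hexadecimal digits (for example U+FF05 + "0020" for a space).
_ESCAPE = re.compile("\uff05([0-9A-Fa-f]{4})")


def unescape_sketch(token):
    """Decode escape sequences and leave any other U+FF05 untouched."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), token)


decoded = unescape_sketch("a\uff050020b")   # escape sequence for a space
literal = unescape_sketch("50\uff05")       # plain fullwidth percent sign
```

Matching the four hex digits explicitly is what keeps a bare fullwidth percent sign, as in the second call, from being misread as the start of an escape.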
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.28.0
Changes
- [C++] Remove the `SpaceTokenizer` class that is not meant to be public and can be confused with the "space" tokenization mode
New features
- Build Python wheels for Windows
- Add option `tokens_delimiter` to configure how tokens are delimited in tokenized files (default is a space)
- Expose option `with_separators` in Python and CLI to include whitespace characters in the tokenized output
- [Python] Add package version information in `pyonmttok.__version__`
Fixes and improvements
- Fix detokenization when option `with_separators` is enabled
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.27.0
Changes
- Linux Python wheels are now compiled with `manylinux2010` and require `pip` >= 19.0 for installation
- macOS Python wheels now require macOS >= 10.14
Fixes and improvements
- Fix casing resolution when some letters do not have case information
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
- Improve error message when setting invalid `segment_alphabet` or `lang` options
- Update SentencePiece to 0.1.96
- [Python] Improve declaration of functions and classes for better type hints and checks
- [Python] Update ICU to 69.1
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.26.4
Fixes and improvements
- Fix regression introduced in last version for preserved tokens that are not segmented by BPE
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.26.3
Fixes and improvements
- Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.26.2
Fixes and improvements
- Fix a divergence with the SentencePiece output when the spacer is detached from the word
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.26.1
Fixes and improvements
- Fix application of the BPE vocabulary when using `preserve_segmented_tokens` and a subword appears without joiner in the vocabulary
- Fix compilation with ICU versions older than 60
Published by guillaumekln over 4 years ago
pyonmttok - Tokenizer 1.26.0
New features
- Add `lang` tokenization option to apply language-specific case mappings
Fixes and improvements
- Use ICU to convert strings to Unicode values instead of a custom implementation
Published by guillaumekln almost 5 years ago
pyonmttok - Tokenizer 1.25.0
New features
- Add `training` flag in tokenization methods to disable subword regularization during inference
- [Python] Implement `__len__` method in the `Token` class
Fixes and improvements
- Raise an error when enabling `case_markup` with incompatible tokenization modes "space" and "none"
- [Python] Improve parallelization when `Tokenizer.tokenize` is called from multiple Python threads (the Python GIL is now released)
- [Python] Clean up some manual Python <-> C++ type conversions
Published by guillaumekln almost 5 years ago
pyonmttok - Tokenizer 1.24.0
New features
- Add `verbose` flag in file tokenization APIs to log progress every 100,000 lines
- [Python] Add `options` property to `Tokenizer` instances
- [Python] Add class `pyonmttok.SentencePieceTokenizer` to help create a tokenizer compatible with SentencePiece
Fixes and improvements
- Fix deserialization into `Token` objects that was sometimes incorrect
- Fix Windows compilation
- Fix Google Test integration that was sometimes installed as part of `make install`
- [Python] Update pybind11 to 2.6.2
- [Python] Update ICU to 66.1
- [Python] Compile ICU with optimization flags
Published by guillaumekln about 5 years ago
pyonmttok - Tokenizer 1.23.0
Changes
- Drop Python 2 support
New features
- Publish Python wheels for macOS
Fixes and improvements
- Improve performance in all tokenization modes (up to 2x faster)
- Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
- Fix a regression introduced in 1.20 where `segment_alphabet_*` options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
- Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using `preserve_segmented_tokens` and the word is segmented by both a `segment_*` option and BPE
- Fix incorrect tokenization when using `support_prior_joiners` and some joiners are within protected sequences
Published by guillaumekln about 5 years ago
pyonmttok - Tokenizer 1.22.2
Fixes and improvements
- Do not require "none" tokenization mode for SentencePiece vocabulary restriction
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.22.1
Fixes and improvements
- Fix error when enabling vocabulary restriction with SentencePiece and `spacer_annotate` is not explicitly set
- Fix backward compatibility with Kangxi and Kanbun scripts (see `segment_alphabet` option)
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.22.0
Changes
- [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a `std::shared_ptr` to make it outlive the `Tokenizer` instance.
New features
- Add `set_random_seed` function to make subword regularization reproducible
- [Python] Support serialization of `Token` instances
- [C++] Add `Options` structure to configure tokenization options (`Flags` can still be used for backward compatibility)
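To show why a fixed seed makes subword regularization reproducible, here is a simplified sketch of dropout-style merging driven by a seeded random generator. It is not the library's algorithm: `apply_merges_with_dropout`, its parameters, and the toy merge table are all hypothetical, and real BPE dropout differs in how merges are scheduled.

```python
import random


def apply_merges_with_dropout(symbols, merges, dropout, rng):
    """Apply ordered merges, skipping each matching pair with probability `dropout`.

    Simplified illustration of randomized subword segmentation.
    """
    symbols = list(symbols)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            is_pair = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            # The random draw happens only when the pair actually matches.
            if is_pair and rng.random() >= dropout:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
# Two generators created with the same seed produce the same segmentation.
first = apply_merges_with_dropout("lower", toy_merges, 0.5, random.Random(1234))
second = apply_merges_with_dropout("lower", toy_merges, 0.5, random.Random(1234))
```

Because every random decision comes from the seeded generator, `first` and `second` are identical, which is the property a seed-setting function provides for debugging and repeatable experiments.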
Fixes and improvements
- Fix BPE vocabulary restriction when using `joiner_new`, `spacer_annotate`, or `spacer_new` (the previous implementation always assumed `joiner_annotate` was used)
- [Python] Fix `spacer` argument name in `Token` constructor
- [C++] Fix ambiguous subword encoder ownership by using a `std::shared_ptr`
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.21.0
New features
- Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
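A vocabulary reader that accepts both tab-separated and space-separated frequencies could look roughly like the sketch below. The function name `parse_vocab_line` is hypothetical, and treating the frequency as a non-negative integer is an assumption about the file format, not something taken from the library.

```python
def parse_vocab_line(line):
    """Return (token, frequency) from one vocabulary line.

    The frequency is None when the line holds only a token.
    """
    line = line.rstrip("\r\n")
    # Try the tab separator first, then fall back to a space.
    for separator in ("\t", " "):
        token, found, count = line.rpartition(separator)
        if found and token and count.isdigit():
            return token, int(count)
    return line, None


with_tab = parse_vocab_line("hello\t42\n")
with_space = parse_vocab_line("hello 42")
token_only = parse_vocab_line("hello")
```

Splitting from the right with `rpartition` keeps tokens that themselves contain a space intact when the frequency is tab-separated.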
Fixes and improvements
- Fix BPE vocabulary restriction when words have a leading or trailing joiner
- Raise an error when using a multi-character joiner and `support_prior_joiner`
- [Python] Implement `__hash__` method of `pyonmttok.Token` objects to be consistent with the `__eq__` implementation
- [Python] Declare `pyonmttok.Tokenizer` arguments (except `mode`) as keyword-only
- [Python] Improve compatibility with Python 3.9
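Two of the Python changes above follow general Python conventions that a small sketch can illustrate: objects that compare equal should hash equal, and a bare `*` in a signature makes the following arguments keyword-only. `SketchToken` is a hypothetical stand-in, not `pyonmttok.Token`, and its three attributes are chosen only for illustration.

```python
class SketchToken:
    """Hypothetical stand-in showing consistent __eq__ and __hash__."""

    def __init__(self, surface="", *, join_left=False, join_right=False):
        # Everything after the bare "*" must be passed by keyword.
        self.surface = surface
        self.join_left = join_left
        self.join_right = join_right

    def _key(self):
        return (self.surface, self.join_left, self.join_right)

    def __eq__(self, other):
        if not isinstance(other, SketchToken):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hash the same fields that __eq__ compares.
        return hash(self._key())


token_a = SketchToken("world", join_left=True)
token_b = SketchToken("world", join_left=True)
unique_tokens = {token_a, token_b}
```

Since equal tokens share a hash, the set holds a single element, and calling `SketchToken("world", True)` raises a `TypeError` because `join_left` cannot be passed positionally.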
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.20.0
Changes
- The following changes affect users compiling the project from the source. They ensure users get the best performance and all features by default:
  - ICU is now required to improve performance and Unicode support
  - SentencePiece is now integrated as a Git submodule and linked statically to the project
  - Boost is no longer required, the project now uses cxxopts which is integrated as a Git submodule
  - The project is compiled in `Release` mode by default
  - Tests are no longer compiled by default (use `-DBUILD_TESTS=ON` to compile the tests)
New features
- Accept any Unicode script aliases in the `segment_alphabet` option
- Update SentencePiece to 0.1.92
- [Python] Improve the capabilities of the `Token` class:
  - Implement the `__repr__` method
  - Allow setting all attributes in the constructor
  - Add a copy constructor
- [Python] Add a copy constructor for the `Tokenizer` class
Fixes and improvements
- [Python] Accept `None` value for `segment_alphabet` argument
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.19.0
New features
- Add BPE dropout (Provilkov et al. 2019)
- [Python] Introduce the "Token API": a set of methods that manipulate `Token` objects instead of serialized strings
- [Python] Add `unicode_ranges` argument to the `detokenize_with_ranges` method to return ranges over Unicode characters instead of bytes
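The difference between ranges over bytes and ranges over Unicode characters matters as soon as the text contains non-ASCII characters. The sketch below converts one into the other using only the standard library. The helper `byte_range_to_char_range` is hypothetical, and it assumes inclusive ranges over UTF-8 bytes that start and end on character boundaries.

```python
def byte_range_to_char_range(text, start, end):
    """Convert an inclusive UTF-8 byte range into an inclusive character range.

    Assumes `start` and `end` fall on character boundaries.
    """
    data = text.encode("utf-8")
    # Number of characters before the range gives the first character index.
    char_start = len(data[:start].decode("utf-8"))
    # Number of characters up to and including the range, minus one.
    char_end = len(data[: end + 1].decode("utf-8")) - 1
    return char_start, char_end


sample = "h\u00e9llo w\u00f6rld"
# The second word occupies bytes 7..12 but characters 6..10.
second_word = byte_range_to_char_range(sample, 7, 12)
```

Slicing `sample` with the character range recovers the word directly, whereas the byte range would have to be applied to the encoded bytes.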
Fixes and improvements
- Include "Half-width kana" in Katakana script detection
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.18.5
Fixes and improvements
- Fix possible crash when applying a case insensitive BPE model on Unicode characters
Published by guillaumekln over 5 years ago
pyonmttok - Tokenizer 1.18.4
Fixes and improvements
- Fix segmentation fault on `cli/tokenize` exit
- Ignore empty tokens during detokenization
- When writing to a file, avoid flushing the output stream on each line
- Update `cli/CMakeLists.txt` to mark Boost.ProgramOptions as required
(This is the first release to be created on GitHub. See the release note of previous tags in CHANGELOG.md.)
Published by guillaumekln almost 6 years ago