Recent Releases of pyonmttok

pyonmttok - Tokenizer 1.37.1

Fixes and improvements

  • Consider escaped characters as single characters in BPE
  • Ignore undefined scripts when resolving inherited or common scripts

- C++
Published by guillaumekln almost 3 years ago

pyonmttok - Tokenizer 1.37.0

New features

  • Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

  • Fix infinite loop when the text contains an invalid Unicode character
  • Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
  • [Python] Update ICU to 72.1

- C++
Published by guillaumekln almost 3 years ago

pyonmttok - Tokenizer 1.36.0

New features

  • [Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
  • [Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor
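A minimal sketch of how the language check might be used before constructing a tokenizer, assuming pyonmttok 1.36.0 or later is installed. The helper name `make_tokenizer` and its fallback policy are hypothetical, and the import is deferred so the snippet still loads where the package is absent.

```python
import importlib.util


def make_tokenizer(lang):
    """Build a tokenizer for `lang`, or a language-agnostic one if the code is rejected."""
    import pyonmttok  # deferred import: only needed when the helper is called

    if pyonmttok.is_valid_language(lang):
        return pyonmttok.Tokenizer("conservative", lang=lang)
    # Hypothetical fallback policy: drop the language-specific behavior.
    return pyonmttok.Tokenizer("conservative")


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    tokenizer = make_tokenizer("en")
    print(tokenizer("Hello World!"))
```

Checking the code first avoids relying on the constructor raising an error for an invalid `lang` value.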

- C++
Published by guillaumekln about 3 years ago

pyonmttok - Tokenizer 1.35.0

New features

  • [Python] Add pickling support to pyonmttok.Vocab
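As a rough illustration of what pickling support allows, the sketch below round-trips a small vocabulary through `pickle`. It assumes pyonmttok 1.35.0 or later is installed; the helper name `roundtrip_vocab` is hypothetical.

```python
import importlib.util
import pickle


def roundtrip_vocab(tokens):
    """Build a vocabulary from tokens and return it with its pickled-and-restored copy."""
    import pyonmttok  # deferred import: only needed when the helper is called

    vocab = pyonmttok.build_vocab_from_tokens(tokens)
    restored = pickle.loads(pickle.dumps(vocab))
    return vocab, restored


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    vocab, restored = roundtrip_vocab(["a", "b", "a"])
    # The restored object should expose the same id-to-token mapping.
    print(restored.ids_to_tokens == vocab.ids_to_tokens)
```

Pickling matters mostly when a vocabulary has to cross a process boundary, for example with `multiprocessing` workers.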

Fixes and improvements

  • Update pybind11 to 2.10.1
  • Update cibuildwheel to 2.11.2

- C++
Published by guillaumekln about 3 years ago

pyonmttok - Tokenizer 1.34.0

Changes

  • [Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

  • [Python] Build wheels for Python 3.11

Fixes and improvements

  • Improve error handling when reading token frequencies in the vocabulary file
  • [Python] Fix possible crash when pyonmttok is imported before torch
  • [Python] Update ICU to 71.1
  • [C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
  • [C++] Fix CMake warning when compiling the tests

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.33.0

New features

  • [Python] Build ARM64 wheels for macOS

Fixes and improvements

  • [CLI] Fix error when the option --segment_alphabet is not set
  • Fix SentencePiece build warning when compiling with Clang

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.32.0

New features

  • Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token

Fixes and improvements

  • Update pybind11 to 2.10.0
  • Update cxxopts to 3.0.0

- C++
Published by guillaumekln over 3 years ago

pyonmttok - Tokenizer 1.31.0

New features

  • Add utilities to build and use vocabularies:
    • pyonmttok.Vocab
    • pyonmttok.build_vocab_from_tokens
    • pyonmttok.build_vocab_from_lines
  • Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:

      tokens = tokenizer(text)
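The vocabulary utilities and the call shorthand can be combined roughly as follows. This is a sketch assuming pyonmttok 1.31.0 or later is installed; the helper name `encode_lines` is hypothetical, and the tokenization options are chosen only for illustration.

```python
import importlib.util


def encode_lines(lines):
    """Tokenize each line, build a vocabulary from the lines, and map tokens to ids."""
    import pyonmttok  # deferred import: only needed when the helper is called

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
    vocab = pyonmttok.build_vocab_from_lines(lines, tokenizer=tokenizer)
    # tokenizer(line) returns only the tokens, unlike tokenize(), which also
    # returns the (possibly empty) word features.
    batch_tokens = [tokenizer(line) for line in lines]
    batch_ids = [vocab(tokens) for tokens in batch_tokens]
    return batch_tokens, batch_ids, vocab


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    batch_tokens, batch_ids, vocab = encode_lines(["Hello World!", "Hello again."])
    print(batch_tokens)
    print(batch_ids)
```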

Fixes and improvements

  • Update pybind11 to 2.9.1

- C++
Published by guillaumekln almost 4 years ago

pyonmttok - Tokenizer 1.30.1

Fixes and improvements

  • Fix deprecated language codes in ICU that are incorrectly considered as invalid (e.g. "tl" for Tagalog)

- C++
Published by guillaumekln about 4 years ago

pyonmttok - Tokenizer 1.30.0

New features

  • [Python] Build wheels for AArch64 Linux

Fixes and improvements

  • [Python] Update ICU to 70.1

- C++
Published by guillaumekln about 4 years ago

pyonmttok - Tokenizer 1.29.0

Changes

  • [Python] Drop support for Python 3.5

New features

  • [Python] Build wheels for Python 3.10
  • [Python] Add tokenization method Tokenizer.tokenize_batch
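A short sketch of batch tokenization, assuming pyonmttok 1.29.0 or later is installed and that `tokenize_batch` returns a pair of tokens and features, as `tokenize` does. The helper name `tokenize_lines` is hypothetical.

```python
import importlib.util


def tokenize_lines(lines):
    """Tokenize a list of lines in one call and return the tokens of each line."""
    import pyonmttok  # deferred import: only needed when the helper is called

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
    # Word features are unused here, so the second element is ignored.
    batch_tokens, _batch_features = tokenizer.tokenize_batch(lines)
    return batch_tokens


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    for tokens in tokenize_lines(["Hello World!", "How are you?"]):
        print(tokens)
```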

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.28.1

Fixes and improvements

  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)
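To illustrate the kind of ambiguity behind this fix, here is a pure-Python sketch of escape handling. It is not the library's implementation. It assumes an escape sequence is a fullwidth percent sign (U+FF05) followed by exactly four hexadecimal digits, and the function name `unescape_token` is hypothetical.

```python
import re

# Assumed escape format: U+FF05 followed by exactly four hexadecimal digits.
_ESCAPE = re.compile("\uFF05([0-9A-Fa-f]{4})")


def unescape_token(token):
    """Decode assumed escape sequences, leaving any other U+FF05 untouched."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), token)


# An escaped space (code point 0020) is decoded.
print(repr(unescape_token("a\uFF050020b")))
# A fullwidth percent sign that is not followed by four hex digits is kept.
print(repr(unescape_token("100\uFF05")))
```

Matching the full pattern, rather than treating every U+FF05 as the start of an escape, is what keeps a literal fullwidth percent sign intact.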

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.28.0

Changes

  • [C++] Remove the SpaceTokenizer class that is not meant to be public and can be confused with the "space" tokenization mode

New features

  • Build Python wheels for Windows
  • Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
  • Expose option with_separators in Python and CLI to include whitespace characters in the tokenized output
  • [Python] Add package version information in pyonmttok.__version__

Fixes and improvements

  • Fix detokenization when option with_separators is enabled

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.27.0

Changes

  • Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
  • macOS Python wheels now require macOS >= 10.14

Fixes and improvements

  • Fix casing resolution when some letters do not have case information
  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
  • Improve error message when setting invalid segment_alphabet or lang options
  • Update SentencePiece to 0.1.96
  • [Python] Improve declaration of functions and classes for better type hints and checks
  • [Python] Update ICU to 69.1

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.4

Fixes and improvements

  • Fix regression introduced in the previous version for preserved tokens that are not segmented by BPE

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.3

Fixes and improvements

  • Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.2

Fixes and improvements

  • Fix a divergence with the SentencePiece output when the spacer is detached from the word

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.1

Fixes and improvements

  • Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without joiner in the vocabulary
  • Fix compilation with ICU versions older than 60

- C++
Published by guillaumekln over 4 years ago

pyonmttok - Tokenizer 1.26.0

New features

  • Add lang tokenization option to apply language-specific case mappings

Fixes and improvements

  • Use ICU to convert strings to Unicode values instead of a custom implementation

- C++
Published by guillaumekln almost 5 years ago

pyonmttok - Tokenizer 1.25.0

New features

  • Add training flag in tokenization methods to disable subword regularization during inference
  • [Python] Implement __len__ method in the Token class

Fixes and improvements

  • Raise an error when enabling case_markup with incompatible tokenization modes "space" and "none"
  • [Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads (the Python GIL is now released)
  • [Python] Clean up some manual Python <-> C++ type conversions
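Since the GIL is released during tokenization, a thread pool is one way to take advantage of the improved parallelization. The sketch below assumes pyonmttok 1.25.0 or later is installed; the helper name `tokenize_in_threads` and the thread count are hypothetical choices.

```python
import importlib.util
from concurrent.futures import ThreadPoolExecutor


def tokenize_in_threads(lines, num_threads=4):
    """Tokenize lines from several Python threads sharing one Tokenizer instance."""
    import pyonmttok  # deferred import: only needed when the helper is called

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

    def _tokens(line):
        tokens, _features = tokenizer.tokenize(line)
        return tokens

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves the input order of the lines.
        return list(pool.map(_tokens, lines))


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    results = tokenize_in_threads(["Hello World!"] * 8)
    print(len(results), results[0])
```

Whether threads actually help depends on line length and workload; for short inputs the thread overhead may dominate.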

- C++
Published by guillaumekln almost 5 years ago

pyonmttok - Tokenizer 1.24.0

New features

  • Add verbose flag in file tokenization APIs to log progress every 100,000 lines
  • [Python] Add options property to Tokenizer instances
  • [Python] Add class pyonmttok.SentencePieceTokenizer to help create a tokenizer compatible with SentencePiece

Fixes and improvements

  • Fix deserialization into Token objects that was sometimes incorrect
  • Fix Windows compilation
  • Fix Google Test integration that was sometimes installed as part of make install
  • [Python] Update pybind11 to 2.6.2
  • [Python] Update ICU to 66.1
  • [Python] Compile ICU with optimization flags

- C++
Published by guillaumekln about 5 years ago

pyonmttok - Tokenizer 1.23.0

Changes

  • Drop Python 2 support

New features

  • Publish Python wheels for macOS

Fixes and improvements

  • Improve performance in all tokenization modes (up to 2x faster)
  • Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
  • Fix a regression introduced in 1.20 where segment_alphabet_* options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
  • Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using preserve_segmented_tokens and the word is segmented by both a segment_* option and BPE
  • Fix incorrect tokenization when using support_prior_joiners and some joiners are within protected sequences

- C++
Published by guillaumekln about 5 years ago

pyonmttok - Tokenizer 1.22.2

Fixes and improvements

  • Do not require "none" tokenization mode for SentencePiece vocabulary restriction

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.22.1

Fixes and improvements

  • Fix error when enabling vocabulary restriction with SentencePiece and spacer_annotate is not explicitly set
  • Fix backward compatibility with Kangxi and Kanbun scripts (see segment_alphabet option)

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.22.0

Changes

  • [C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a std::shared_ptr to make it outlive the Tokenizer instance.

New features

  • Add set_random_seed function to make subword regularization reproducible
  • [Python] Support serialization of Token instances
  • [C++] Add Options structure to configure tokenization options (Flags can still be used for backward compatibility)

Fixes and improvements

  • Fix BPE vocabulary restriction when using joiner_new, spacer_annotate, or spacer_new (the previous implementation always assumed joiner_annotate was used)
  • [Python] Fix spacer argument name in Token constructor
  • [C++] Fix ambiguous subword encoder ownership by using a std::shared_ptr

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.21.0

New features

  • Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)

Fixes and improvements

  • Fix BPE vocabulary restriction when words have a leading or trailing joiner
  • Raise an error when using a multi-character joiner and support_prior_joiners
  • [Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
  • [Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
  • [Python] Improve compatibility with Python 3.9

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.20.0

Changes

  • The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
    • ICU is now required to improve performance and Unicode support
    • SentencePiece is now integrated as a Git submodule and linked statically to the project
    • Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
    • The project is compiled in Release mode by default
    • Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

  • Accept any Unicode script aliases in the segment_alphabet option
  • Update SentencePiece to 0.1.92
  • [Python] Improve the capabilities of the Token class:
    • Implement the __repr__ method
    • Allow setting all attributes in the constructor
    • Add a copy constructor
  • [Python] Add a copy constructor for the Tokenizer class

Fixes and improvements

  • [Python] Accept None value for segment_alphabet argument

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.19.0

New features

  • Add BPE dropout (Provilkov et al. 2019)
  • [Python] Introduce the "Token API": a set of methods that manipulate Token objects instead of serialized strings
  • [Python] Add unicode_ranges argument to the detokenize_with_ranges method to return ranges over Unicode characters instead of bytes
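A brief sketch of the Token API, assuming a pyonmttok version that includes it (1.19.0 or later) and that `detokenize` accepts a list of Token objects. The helper name `inspect_tokens` is hypothetical.

```python
import importlib.util


def inspect_tokens(text):
    """Tokenize into Token objects, then return their surfaces and the detokenized text."""
    import pyonmttok  # deferred import: only needed when the helper is called

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
    # With as_token_objects=True the result is a list of Token objects
    # instead of serialized strings.
    tokens = tokenizer.tokenize(text, as_token_objects=True)
    surfaces = [token.surface for token in tokens]
    return surfaces, tokenizer.detokenize(tokens)


# Run the demo only when the package is available.
if importlib.util.find_spec("pyonmttok") is not None:
    surfaces, restored = inspect_tokens("Hello World!")
    print(surfaces)
    print(restored)
```

Working on Token objects means joiner and casing information is carried as attributes rather than being encoded in the token string.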

Fixes and improvements

  • Include "Half-width kana" in Katakana script detection

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.18.5

Fixes and improvements

  • Fix possible crash when applying a case insensitive BPE model on Unicode characters

- C++
Published by guillaumekln over 5 years ago

pyonmttok - Tokenizer 1.18.4

Fixes and improvements

  • Fix segmentation fault on cli/tokenize exit
  • Ignore empty tokens during detokenization
  • When writing to a file, avoid flushing the output stream on each line
  • Update cli/CMakeLists.txt to mark Boost.ProgramOptions as required

(This is the first release to be created on GitHub. See the release notes for previous tags in CHANGELOG.md.)

- C++
Published by guillaumekln almost 6 years ago