Recent Releases of tokenizers
tokenizers - v0.22.0
What's Changed
- Bump on-headers and compression in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in https://github.com/huggingface/tokenizers/pull/1827
- Implement `from_bytes` and `read_bytes` Methods in WordPiece Tokenizer for WebAssembly Compatibility by @sondalex in https://github.com/huggingface/tokenizers/pull/1758
- fix: use AHashMap to fix compile error by @b00f in https://github.com/huggingface/tokenizers/pull/1840
- New stream by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1856
- [docs] Add more decoders by @pcuenca in https://github.com/huggingface/tokenizers/pull/1849
- Fix missing parenthesis in `EncodingVisualizer.calculate_label_colors` by @Liam-DeVoe in https://github.com/huggingface/tokenizers/pull/1853
- Update quicktour.mdx re: Issue #1625 by @WilliamPLaCroix in https://github.com/huggingface/tokenizers/pull/1846
- remove stray comment by @sanderland in https://github.com/huggingface/tokenizers/pull/1831
- Fix typo in README by @aisk in https://github.com/huggingface/tokenizers/pull/1808
- RUSTSEC-2024-0436 - replace paste with pastey by @nystromjd in https://github.com/huggingface/tokenizers/pull/1834
- Tokenizer: Add native async bindings, via py03-async-runtimes. by @michaelfeil in https://github.com/huggingface/tokenizers/pull/1843
New Contributors
- @b00f made their first contribution in https://github.com/huggingface/tokenizers/pull/1840
- @pcuenca made their first contribution in https://github.com/huggingface/tokenizers/pull/1849
- @Liam-DeVoe made their first contribution in https://github.com/huggingface/tokenizers/pull/1853
- @WilliamPLaCroix made their first contribution in https://github.com/huggingface/tokenizers/pull/1846
- @sanderland made their first contribution in https://github.com/huggingface/tokenizers/pull/1831
- @aisk made their first contribution in https://github.com/huggingface/tokenizers/pull/1808
- @nystromjd made their first contribution in https://github.com/huggingface/tokenizers/pull/1834
- @michaelfeil made their first contribution in https://github.com/huggingface/tokenizers/pull/1843
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.22.0rc0
- Rust
Published by ArthurZucker 6 months ago
tokenizers - v0.21.4
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.21.4
No change, the 0.21.3 release failed, this is just a re-release.
https://github.com/huggingface/tokenizers/releases/tag/v0.21.3
- Rust
Published by Narsil 7 months ago
tokenizers - v0.21.3
What's Changed
- Clippy fixes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1818
- Fixed an introduced backward breaking change in our Rust APIs.
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.2...v0.21.3
- Rust
Published by Narsil 8 months ago
tokenizers - v0.21.2
What's Changed
This release is focused on performance optimizations, enabling broader Python no-GIL support, and fixing some onig issues!
- Update the release builds following 0.21.1. by @Narsil in https://github.com/huggingface/tokenizers/pull/1746
- replace lazy_static with stabilized std::sync::LazyLock in 1.80 by @sftse in https://github.com/huggingface/tokenizers/pull/1739
- Fix no-onig no-wasm builds by @414owen in https://github.com/huggingface/tokenizers/pull/1772
- Fix typos in strings and comments by @co63oc in https://github.com/huggingface/tokenizers/pull/1770
- Fix type notation of merges in BPE Python binding by @Coqueue in https://github.com/huggingface/tokenizers/pull/1766
- Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1762
- Fix data path in test_continuing_prefix_trainer_mismatch by @GaetanLepage in https://github.com/huggingface/tokenizers/pull/1747
- clippy by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1781
- Update pyo3 and rust-numpy depends for no-gil/free-threading compat by @Qubitium in https://github.com/huggingface/tokenizers/pull/1774
- Use ApiBuilder::from_env() in from_pretrained function by @BenLocal in https://github.com/huggingface/tokenizers/pull/1737
- Upgrade onig, to get it compiling with GCC 15 by @414owen in https://github.com/huggingface/tokenizers/pull/1771
- Itertools upgrade by @sftse in https://github.com/huggingface/tokenizers/pull/1756
- Bump webpack-dev-server from 4.10.0 to 5.2.1 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1792
- Bump brace-expansion from 1.1.11 to 1.1.12 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1796
- Fix features blending into a paragraph by @bionicles in https://github.com/huggingface/tokenizers/pull/1798
- Adding throughput to benches to have a more consistent measure across by @Narsil in https://github.com/huggingface/tokenizers/pull/1800
- Upgrading dependencies. by @Narsil in https://github.com/huggingface/tokenizers/pull/1801
- [docs] Whitespace by @stevhliu in https://github.com/huggingface/tokenizers/pull/1785
- Hotfixing the stub. by @Narsil in https://github.com/huggingface/tokenizers/pull/1802
- Bpe clones by @sftse in https://github.com/huggingface/tokenizers/pull/1707
- Fixed Length Pre-Tokenizer by @jonvet in https://github.com/huggingface/tokenizers/pull/1713
- Consolidated optimization ahash dary compact str by @Narsil in https://github.com/huggingface/tokenizers/pull/1799
- 🚨 breaking: Fix training with special tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1617
New Contributors
- @414owen made their first contribution in https://github.com/huggingface/tokenizers/pull/1772
- @co63oc made their first contribution in https://github.com/huggingface/tokenizers/pull/1770
- @Coqueue made their first contribution in https://github.com/huggingface/tokenizers/pull/1766
- @GaetanLepage made their first contribution in https://github.com/huggingface/tokenizers/pull/1747
- @Qubitium made their first contribution in https://github.com/huggingface/tokenizers/pull/1774
- @BenLocal made their first contribution in https://github.com/huggingface/tokenizers/pull/1737
- @bionicles made their first contribution in https://github.com/huggingface/tokenizers/pull/1798
- @stevhliu made their first contribution in https://github.com/huggingface/tokenizers/pull/1785
- @jonvet made their first contribution in https://github.com/huggingface/tokenizers/pull/1713
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.1...v0.21.2rc0
- Rust
Published by ArthurZucker 8 months ago
tokenizers - v0.21.1
What's Changed
- Update dev version and pyproject.toml by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1693
- Add feature flag hint to README.md, fixes #1633 by @sftse in https://github.com/huggingface/tokenizers/pull/1709
- Upgrade to PyO3 0.23 by @Narsil in https://github.com/huggingface/tokenizers/pull/1708
- Fixing the README. by @Narsil in https://github.com/huggingface/tokenizers/pull/1714
- Fix typo in Split docstrings by @Dylan-Harden3 in https://github.com/huggingface/tokenizers/pull/1701
- Fix typos by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1715
- Update documentation of Rust feature by @sondalex in https://github.com/huggingface/tokenizers/pull/1711
- Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in https://github.com/huggingface/tokenizers/pull/1699
- Fixing the stream by removing the read_index altogether. by @Narsil in https://github.com/huggingface/tokenizers/pull/1716
- Fixing NormalizedString append when normalized is empty. by @Narsil in https://github.com/huggingface/tokenizers/pull/1717
- 🚨 Support updating template processors by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1652. Removed in this release to temporarily keep backward compatibility.
- Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in https://github.com/huggingface/tokenizers/pull/1724
- Add rustls-tls feature by @torymur in https://github.com/huggingface/tokenizers/pull/1732
New Contributors
- @Dylan-Harden3 made their first contribution in https://github.com/huggingface/tokenizers/pull/1701
- @sondalex made their first contribution in https://github.com/huggingface/tokenizers/pull/1711
- @n0gu-furiosa made their first contribution in https://github.com/huggingface/tokenizers/pull/1699
- @earlytobed made their first contribution in https://github.com/huggingface/tokenizers/pull/1724
- @torymur made their first contribution in https://github.com/huggingface/tokenizers/pull/1732
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1
- Rust
Published by Narsil 12 months ago
tokenizers - v0.21.1rc0
What's Changed
- Update dev version and pyproject.toml by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1693
- Add feature flag hint to README.md, fixes #1633 by @sftse in https://github.com/huggingface/tokenizers/pull/1709
- Upgrade to PyO3 0.23 by @Narsil in https://github.com/huggingface/tokenizers/pull/1708
- Fixing the README. by @Narsil in https://github.com/huggingface/tokenizers/pull/1714
- Fix typo in Split docstrings by @Dylan-Harden3 in https://github.com/huggingface/tokenizers/pull/1701
- Fix typos by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1715
- Update documentation of Rust feature by @sondalex in https://github.com/huggingface/tokenizers/pull/1711
- Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in https://github.com/huggingface/tokenizers/pull/1699
- Fixing the stream by removing the read_index altogether. by @Narsil in https://github.com/huggingface/tokenizers/pull/1716
- Fixing NormalizedString append when normalized is empty. by @Narsil in https://github.com/huggingface/tokenizers/pull/1717
- 🚨 Support updating template processors by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1652
- Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in https://github.com/huggingface/tokenizers/pull/1724
- Add rustls-tls feature by @torymur in https://github.com/huggingface/tokenizers/pull/1732
New Contributors
- @Dylan-Harden3 made their first contribution in https://github.com/huggingface/tokenizers/pull/1701
- @sondalex made their first contribution in https://github.com/huggingface/tokenizers/pull/1711
- @n0gu-furiosa made their first contribution in https://github.com/huggingface/tokenizers/pull/1699
- @earlytobed made their first contribution in https://github.com/huggingface/tokenizers/pull/1724
- @torymur made their first contribution in https://github.com/huggingface/tokenizers/pull/1732
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1rc0
- Rust
Published by Narsil 12 months ago
tokenizers - v0.20.4-rc0
What's Changed
- More cache options. by @Narsil in https://github.com/huggingface/tokenizers/pull/1675
- Disable caching for long strings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1676
- Testing ABI3 wheels to reduce number of wheels by @Narsil in https://github.com/huggingface/tokenizers/pull/1674
- Adding an API for decode streaming. by @Narsil in https://github.com/huggingface/tokenizers/pull/1677
- Decode stream python by @Narsil in https://github.com/huggingface/tokenizers/pull/1678
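Decode streaming matters when token ids arrive one at a time and a single character can span several tokens. The sketch below illustrates that problem with the standard library only; it is not how the library implements its API. The `stream_decode` helper and the byte-token input are hypothetical, and incremental UTF-8 decoding stands in for whatever the real decoder does.

```python
import codecs
from typing import Iterable, Iterator


def stream_decode(byte_tokens: Iterable[bytes]) -> Iterator[str]:
    """Yield text as byte-level tokens arrive, holding back incomplete characters.

    Hypothetical helper for illustration; not part of the tokenizers API.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for token in byte_tokens:
        chunk = decoder.decode(token)  # returns "" while a character is incomplete
        if chunk:
            yield chunk
    # Flush whatever is still buffered once the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes in UTF-8 and is split across two tokens here.
tokens = [b"caf", b"\xc3", b"\xa9", b" ok"]
for piece in stream_decode(tokens):
    print(repr(piece))
```

Decoding each token on its own would turn the lone `b"\xc3"` into a replacement character, which is the kind of artifact a streaming decoder has to avoid.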
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.3...v0.20.4-rc0
- Rust
Published by Narsil over 1 year ago
tokenizers - v0.20.3
What's Changed
There was a breaking change in 0.20.3 for tuple inputs of encode_batch!
- fix pylist by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1673
- [MINOR:TYPO] Fix docstrings by @cakiki in https://github.com/huggingface/tokenizers/pull/1653
New Contributors
- @cakiki made their first contribution in https://github.com/huggingface/tokenizers/pull/1653
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.2...v0.20.3
- Rust
Published by ArthurZucker over 1 year ago
tokenizers - v0.20.2
Release v0.20.2
Thanks a MILE to @diliop we now have support for python 3.13! 🥳
What's Changed
- Bump cookie and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1648
- Fix off-by-one error in tokenizer::normalizer::Range::len by @rlanday in https://github.com/huggingface/tokenizers/pull/1638
- Arg name correction: auth_token -> token by @rravenel in https://github.com/huggingface/tokenizers/pull/1621
- Unsound call of `set_var` by @sftse in https://github.com/huggingface/tokenizers/pull/1664
- Add safety comments by @Manishearth in https://github.com/huggingface/tokenizers/pull/1651
- Bump actions/checkout to v4 by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1667
- PyO3 0.22 by @diliop in https://github.com/huggingface/tokenizers/pull/1665
- Bump actions versions by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1669
New Contributors
- @rlanday made their first contribution in https://github.com/huggingface/tokenizers/pull/1638
- @rravenel made their first contribution in https://github.com/huggingface/tokenizers/pull/1621
- @sftse made their first contribution in https://github.com/huggingface/tokenizers/pull/1664
- @Manishearth made their first contribution in https://github.com/huggingface/tokenizers/pull/1651
- @tinyboxvk made their first contribution in https://github.com/huggingface/tokenizers/pull/1667
- @diliop made their first contribution in https://github.com/huggingface/tokenizers/pull/1665
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.1...v0.20.2
- Rust
Published by ArthurZucker over 1 year ago
tokenizers - Release v0.20.1
What's Changed
The most awaited offset issue with Llama is fixed 🥳
- Update README.md by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1608
- fix benchmark file link by @152334H in https://github.com/huggingface/tokenizers/pull/1610
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in https://github.com/huggingface/tokenizers/pull/1626
- [`ignore_merges`] Fix offsets by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1640
- Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1629
- Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1630
- Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1631
- Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1641
- Fix documentation build by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1642
- style: simplify string formatting for readability by @hamirmahal in https://github.com/huggingface/tokenizers/pull/1632
New Contributors
- @152334H made their first contribution in https://github.com/huggingface/tokenizers/pull/1610
- @hamirmahal made their first contribution in https://github.com/huggingface/tokenizers/pull/1632
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.0...v0.20.1
- Rust
Published by ArthurZucker over 1 year ago
tokenizers - Release v0.20.0: faster encode, better python support
Release v0.20.0
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement!
A few minor changes (mostly #1587) gave faster encoding on Llama3, measured on a g6 instance on AWS with https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py.
Python API
We shipped better deserialization errors in general, and support for `__str__` and `__repr__` on all objects. This makes debugging a lot easier:
```python3
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
```
The `pre_tokenizers.Sequence` and `normalizers.Sequence` objects are also more accessible now: they can be indexed, and the attributes of their items can be updated:
```python
from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]
norm[1].lowercase = False
```
What's Changed
- remove enforcement of non special when adding tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1521
- [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in https://github.com/huggingface/tokenizers/pull/1513
- Make `USED_PARALLELISM` atomic by @nathaniel-daniel in https://github.com/huggingface/tokenizers/pull/1532
- Fixing for clippy 1.78 by @Narsil in https://github.com/huggingface/tokenizers/pull/1548
- feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/tokenizers/pull/1551
- Switch from `cached_download` to `hf_hub_download` in tests by @Wauplin in https://github.com/huggingface/tokenizers/pull/1547
- Fix "dictionnary" typo by @nprisbrey in https://github.com/huggingface/tokenizers/pull/1511
- make sure we don't warn on empty tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1554
- Enable `dropout = 0.0` as an equivalent to `none` in BPE by @mcognetta in https://github.com/huggingface/tokenizers/pull/1550
- Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1569
- Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1555
- Fix clippy + feature test management. by @Narsil in https://github.com/huggingface/tokenizers/pull/1580
- Bump spm_precompiled to 0.1.3 by @MikeIvanichev in https://github.com/huggingface/tokenizers/pull/1571
- Add benchmark vs tiktoken by @Narsil in https://github.com/huggingface/tokenizers/pull/1582
- Fixing the benchmark. by @Narsil in https://github.com/huggingface/tokenizers/pull/1583
- Tiny improvement by @Narsil in https://github.com/huggingface/tokenizers/pull/1585
- Enable fancy regex by @Narsil in https://github.com/huggingface/tokenizers/pull/1586
- Fixing release CI strict (taken from safetensors). by @Narsil in https://github.com/huggingface/tokenizers/pull/1593
- Adding some serialization testing around the wrapper. by @Narsil in https://github.com/huggingface/tokenizers/pull/1594
- Add-legacy-tests by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1597
- Adding a few tests for decoder deserialization. by @Narsil in https://github.com/huggingface/tokenizers/pull/1598
- Better serialization error by @Narsil in https://github.com/huggingface/tokenizers/pull/1595
- Add test normalizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1600
- Improve decoder deserialization by @Narsil in https://github.com/huggingface/tokenizers/pull/1599
- Using serde (serde_pyo3) to get __str__ and __repr__ easily. by @Narsil in https://github.com/huggingface/tokenizers/pull/1588
- Merges cannot handle tokens containing spaces. by @Narsil in https://github.com/huggingface/tokenizers/pull/909
- Fix doc about split by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1591
- Support `None` to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1590
- Fix strip python type by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1602
- Tests + Deserialization improvement for normalizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1604
- add deserialize for pre tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1603
- Perf improvement 16% by removing offsets. by @Narsil in https://github.com/huggingface/tokenizers/pull/1587
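The `dropout = 0.0` entry in the list above concerns BPE-dropout, where each applicable merge is skipped with some probability during encoding. Below is a minimal pure-Python sketch of that idea, assuming a toy merge table; `bpe_merge` and `MERGES` are hypothetical names for illustration and do not mirror the library's Rust implementation. It shows why a dropout of `0.0` can be treated the same as no dropout at all.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Toy merge table, highest priority first (an assumption for this sketch).
MERGES: List[Tuple[str, str]] = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_merge(
    word: str,
    merges: Sequence[Tuple[str, str]],
    dropout: Optional[float] = None,
    seed: int = 0,
) -> List[str]:
    """Greedily apply ranked merges, skipping each candidate with probability `dropout`."""
    rng = random.Random(seed)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = []
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair not in ranks:
                continue
            # None and 0.0 are both falsy, so neither ever drops a merge.
            if dropout and rng.random() < dropout:
                continue
            candidates.append((ranks[pair], i))
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


print(bpe_merge("lower", MERGES, dropout=None))
print(bpe_merge("lower", MERGES, dropout=0.0))
print(bpe_merge("lower", MERGES, dropout=1.0))
```

With `dropout=1.0` every merge is dropped and the word stays split into single characters, while `None` and `0.0` produce identical segmentations.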
New Contributors
- @nathaniel-daniel made their first contribution in https://github.com/huggingface/tokenizers/pull/1532
- @nprisbrey made their first contribution in https://github.com/huggingface/tokenizers/pull/1511
- @mcognetta made their first contribution in https://github.com/huggingface/tokenizers/pull/1550
- @MikeIvanichev made their first contribution in https://github.com/huggingface/tokenizers/pull/1571
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.1...v0.20.0rc1
- Rust
Published by ArthurZucker over 1 year ago
tokenizers - v0.19.1
What's Changed
- add serialization for `ignore_merges` by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1504
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1
- Rust
Published by ArthurZucker almost 2 years ago
tokenizers - v0.19.0
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
- [`remove black`] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
- PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498
- Fixing doc. by @Narsil in https://github.com/huggingface/tokenizers/pull/1499
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0
- Rust
Published by Narsil almost 2 years ago
tokenizers - v0.19.0rc0
Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177
What's Changed
- chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
- [`remove black`] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
- Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
- Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
- 🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
- Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
- PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
- Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
- Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0
- Rust
Published by Narsil almost 2 years ago
tokenizers - v0.15.2
What's Changed
Big shoutout to @rlrs for the fast replace normalizers PR. This boosts the performance of the tokenizers:
- chore: Update dependencies to latest supported versions by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1441
- Convert word counts to u64 by @stephenroller in https://github.com/huggingface/tokenizers/pull/1433
- Efficient Replace normalizer by @rlrs in https://github.com/huggingface/tokenizers/pull/1413
New Contributors
- @bryantbiggs made their first contribution in https://github.com/huggingface/tokenizers/pull/1441
- @stephenroller made their first contribution in https://github.com/huggingface/tokenizers/pull/1433
- @rlrs made their first contribution in https://github.com/huggingface/tokenizers/pull/1413
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1
- Rust
Published by ArthurZucker about 2 years ago
tokenizers - v0.15.1
What's Changed
- update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
- Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
- Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
- Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
- Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
- Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
- pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386
- Encode special tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1437
- Update release for python3.12 windows by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1438
New Contributors
- @steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1
- Rust
Published by ArthurZucker about 2 years ago
tokenizers - v0.15.1.rc0
What's Changed
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
- Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
- Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
- Move to maturin, mimicking the move for `safetensors`. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
- Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
- Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
- update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
- Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
- Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355
- fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
- fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
- Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
- Allow `huggingface_hub<1.0` by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
- update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
- Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
- Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
- Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
- Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
- Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
- pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386
New Contributors
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
- @eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341
- @tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
- @rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
- @mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
- @Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385
- @steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0
- Rust
Published by Narsil about 2 years ago
tokenizers - v0.15.0
What's Changed
- fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
- fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
- Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
- Allow `huggingface_hub<1.0` by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
New Contributors
- @tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
- @rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
- @mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
- @Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0
- Rust
Published by ArthurZucker over 2 years ago
tokenizers - v0.14.1
What's Changed
- Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
- Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
- Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
- Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
- Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
- fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
- Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
- Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
- Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
- fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
- Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
- Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
- [doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
- Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
- Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
- revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
- Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
- import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
- Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
- feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
- Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
- Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
- Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
- CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
- 0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
- Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
- Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
- Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
- Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
- Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
- Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
- update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
- Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
- Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355
New Contributors
- @csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
- @chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
- @sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
- @boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
- @hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
- @bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
- @kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
- @SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
- @jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
- @eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.1
- Rust
Published by Narsil over 2 years ago
tokenizers - v0.14.1rc1
What's Changed
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
- Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
- Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
- Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
- Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
- update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
- Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
New Contributors
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
- @eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.1rc1
- Rust
Published by Narsil over 2 years ago
tokenizers - v0.14.0
⚠️ Reworks the release pipeline. Other breaking changes ⚠️ :
- #1335, AddedToken is reworked, `is_special_token` renamed to `special` for consistency
- feature http is now OFF by default, and depends on hf-hub instead of `cached_path` (updated cache directory, better sync implementation)
- Removed SSL link on the python package, calling `huggingface_hub` directly instead.
- New dependency: `huggingface_hub` (while we deprecate `Tokenizer.from_pretrained(...)` in favor of `Tokenizer.from_file(huggingface_hub.hf_hub_download(MODEL_ID, "tokenizer.json"))`)
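The last breaking change above suggests a two-step loading pattern: download `tokenizer.json` from the Hub, then load it from the local path. A minimal Python sketch of that pattern follows, assuming the `huggingface_hub` and `tokenizers` packages are installed. The helper name `load_tokenizer_from_hub` and its `download`/`load` parameters are hypothetical, introduced here only so the two steps can be swapped out or exercised offline; they are not part of either library.

```python
from typing import Any, Callable, Optional


def load_tokenizer_from_hub(
    model_id: str,
    filename: str = "tokenizer.json",
    download: Optional[Callable[[str, str], str]] = None,
    load: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Fetch a tokenizer file from the Hub, then load it from the local path.

    Hypothetical helper: `download` and `load` default to the real library
    calls named in the release note, but can be replaced for offline use.
    """
    if download is None:
        # Imported lazily so the helper can be defined without the package.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    if load is None:
        from tokenizers import Tokenizer

        load = Tokenizer.from_file
    # Step 1: resolve the file to a local (cached) path.
    local_path = download(model_id, filename)
    # Step 2: build the tokenizer from that file.
    return load(local_path)
```

With both packages installed and network access available, `load_tokenizer_from_hub(MODEL_ID)` would stand in for the `Tokenizer.from_pretrained(MODEL_ID)` call, where `MODEL_ID` is whatever Hub repository id you already use.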
What's Changed
- Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
- Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
- Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
- Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
- Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
- fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
- Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
- Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
- Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
- fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
- Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
- Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
- [doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
- Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
- Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
- revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
- Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
- import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
- Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
- feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
- Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
- Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
- Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
- CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
- 0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
- Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
- Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
- Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
- Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
- Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
- Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
New Contributors
- @csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
- @chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
- @sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
- @boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
- @hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
- @bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
- @kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
- @SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
- @jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.0
- Rust
Published by ArthurZucker over 2 years ago
tokenizers - v0.14.0.rc1
Reworks the release pipeline. Other breaking changes are mostly related to https://github.com/huggingface/tokenizers/pull/1335, where AddedToken is reworked
What's Changed
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
- Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
- Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
- Move to maturing mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
- Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
- Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
New Contributors
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.0.rc1
- Rust
Published by ArthurZucker over 2 years ago
tokenizers - v0.13.4.rc3
Mostly checking the new release scripts actually work.
What's Changed
- pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
- Add `expect()` for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
- Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
New Contributors
- @mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.13.4.rc3
- Rust
Published by Narsil over 2 years ago
tokenizers - v0.13.4.rc2
What's Changed
- Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc1...v0.13.4.rc2
- Rust
Published by Narsil over 2 years ago
tokenizers - Python v0.13.4.rc1
What's Changed
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
- Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
- Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
- Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
- fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
- Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
- Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
- [doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
- Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
- Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
- revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
- Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
- import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
- Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
- feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
- Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
- Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
- Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
- CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
- 0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
New Contributors
- @sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
- @boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
- @hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
- @bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
- @kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
- @SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
- @jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4-rc2...v0.13.4.rc1
- Rust
Published by Narsil over 2 years ago
tokenizers - https://github.com/huggingface/tokenizers/releases/tag/v0.13.4-rc2
- Rust
Published by github-actions[bot] almost 3 years ago
tokenizers - https://github.com/huggingface/tokenizers/releases/tag/v0.13.4-rc1
- Rust
Published by github-actions[bot] almost 3 years ago
tokenizers - Node v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
- Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
- Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
- Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
- Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
- Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
- Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
- Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
- Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
- Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
- Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
- Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
- Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
- Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
- Making `Tokenizer` clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
- Prevent using `from_pretrained` on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
- Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
- Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
- Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
- pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
- Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
- Adding ByteFallback support for `tokenizers`. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
- Faster `datasets` train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
- Adding `Replace` to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
- Creating `normalizers.Prepend` (To be used instead of `Metaspace`). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
- Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
- Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
- Add `content` to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
New Contributors
- @ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
- @SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
- @hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
- @fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
- @mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
- @lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192
Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1
What's Changed
- Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
- Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
- Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
- Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
- Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
- Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
- Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
- Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
- Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
- Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
- Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
- Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
- Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
- Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
- Making `Tokenizer` clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
- Prevent using `from_pretrained` on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
- Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
- Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
- Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
- pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
- Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
- Adding ByteFallback support for `tokenizers`. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
- Faster `datasets` train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
- Adding `Replace` to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
- Creating `normalizers.Prepend` (To be used instead of `Metaspace`). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
- Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
- Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
- Add `content` to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
- New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205
New Contributors
- @ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
- @SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
- @hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
- @fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
- @mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
- @lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192
Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1
- Rust
Published by ArthurZucker almost 3 years ago
tokenizers - Rust v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
- Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
- Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
- Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
- Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
- Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
- Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
- Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
- Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
- Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
- Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
- Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
- Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
- Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
- Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
- Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
- Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
- Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
- Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
- pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
- Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
- Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
- Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
- Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
- Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
- Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
- Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
- Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
- New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205
- New release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1207
New Contributors
- @ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
- @SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
- @hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
- @fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
- @mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
- @lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.2...v0.13.3
- Rust
Published by ArthurZucker almost 3 years ago
tokenizers - Python v0.13.3
What's Changed
- Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
- Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
- Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
- Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
- Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
- Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
- Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
- Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
- Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
- Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
- Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
- Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
- Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
- Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
- Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
- Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
- Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
- Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
- Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
- pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
- Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
- Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
- Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
- Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
- Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
- Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
- Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
- Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
- New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205
New Contributors
- @ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
- @SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
- @hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
- @fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
- @mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
- @lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192
Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1
- Rust
Published by ArthurZucker almost 3 years ago
tokenizers - Python v0.13.3rc1
What's Changed
- Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
- Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
- Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
- Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
- Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
- [FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
- Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
- Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
- Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
- Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
- Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
- Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
- Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
- Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
- Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
- Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
- [FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
- Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
- Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
- Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
- Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
- Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
- Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
- Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
- Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
- Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
- Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
- Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
- pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
- Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
- Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
- Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
- Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
- Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
- Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
- Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
- Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
- Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
- Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
- New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205
New Contributors
- @ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
- @SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
- @hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
- @fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
- @mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
- @lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192
Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1
- Rust
Published by Narsil almost 3 years ago
tokenizers - Node 0.13.2
Python 3.11 support (Python only modification)
- Rust
Published by Narsil over 3 years ago
tokenizers - Rust 0.13.2
Python 3.11 support (Python only modification)
- Rust
Published by Narsil over 3 years ago
tokenizers - Python 0.13.2
[0.13.2]
- [#1096] Python 3.11 support
- Rust
Published by Narsil over 3 years ago
tokenizers - Node 0.13.1
[0.13.1]
- [#1072] Fixing Roberta type ids.
- Rust
Published by Narsil over 3 years ago
tokenizers - Rust 0.13.1
[0.13.1]
- [#1072] Fixing Roberta type ids.
- Rust
Published by Narsil over 3 years ago
tokenizers - Python v0.13.1
[0.13.1]
- [#1072] Fixing Roberta type ids.
- Rust
Published by Narsil over 3 years ago
tokenizers - Python v0.13.0
[0.13.0]
- [#956] PyO3 version upgrade
- [#1055] M1 automated builds
- [#1008] Decoder is now a composable trait, but without being backward incompatible
- [#1047, #1051, #1052] Processor is now a composable trait, but without being backward incompatible
Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
- Rust
Published by Narsil over 3 years ago
tokenizers - Node v0.13.0
[0.13.0]
- [#1008] Decoder is now a composable trait, but without being backward incompatible
- [#1047, #1051, #1052] Processor is now a composable trait, but without being backward incompatible
- Rust
Published by Narsil over 3 years ago
tokenizers - Rust v0.13.0
[0.13.0]
- [#1009] unstable_wasm feature to support building on Wasm (it's unstable !)
- [#1008] Decoder is now a composable trait, but without being backward incompatible
- [#1047, #1051, #1052] Processor is now a composable trait, but without being backward incompatible
Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
- Rust
Published by Narsil over 3 years ago
tokenizers - Python v0.12.1
[0.12.1]
- [#938] Reverted breaking change. https://github.com/huggingface/transformers/issues/16520
- Rust
Published by Narsil almost 4 years ago
tokenizers - [YANKED] Node v0.12.0
[0.12.0]
The breaking change was causing more issues upstream in transformers than anticipated:
https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657
The decision was to roll back that breaking change and figure out a different way to make this modification later.
Bump minor version because of a breaking change.
Using 0.12 to match other bindings.
- [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
- [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
- [#961] Added link for Ruby port of tokenizers
- Rust
Published by Narsil almost 4 years ago
tokenizers - [YANKED] Python v0.12.0
[0.12.0]
The breaking change was causing more issues upstream in transformers than anticipated:
https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657
The decision was to roll back that breaking change and figure out a different way to make this modification later.
Bump minor version because of a breaking change.
- [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
- [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
- [#962] Fix tests for python 3.10
- [#961] Added link for Ruby port of tokenizers
- Rust
Published by Narsil almost 4 years ago
tokenizers - [YANKED] Rust v0.12.0
[0.12.0]
Bump minor version because of a breaking change.
The breaking change was causing more issues upstream in transformers than anticipated:
https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657
The decision was to roll back that breaking change and figure out a different way to make this modification later.
- [#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
- [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
- [#961] Added link for Ruby port of tokenizers
- [#960] Feature gate for cli and its clap dependency
- Rust
Published by Narsil almost 4 years ago
tokenizers - Rust v0.11.2
- [#919] Fixing single_word AddedToken. (regression from 0.11.2)
- [#916] Deserializing faster added_tokens by loading them in batch.
- Rust
Published by Narsil almost 4 years ago
tokenizers - Python v0.11.6
- [#919] Fixing single_word AddedToken. (regression from 0.11.2)
- [#916] Deserializing faster added_tokens by loading them in batch.
- Rust
Published by Narsil almost 4 years ago
tokenizers - Python v0.11.5
[#895] Add wheel support for Python 3.10
- Rust
Published by Narsil about 4 years ago
tokenizers - Node v0.8.2
[#884] Fixing bad deserialization following inclusion of a default for Punctuation
- Rust
Published by Narsil about 4 years ago
tokenizers - Python v0.11.4
[#884] Fixing bad deserialization following inclusion of a default for Punctuation
- Rust
Published by Narsil about 4 years ago
tokenizers - Python v0.11.3
- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams
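As an illustration of what a truncation side means, the following plain-Python sketch truncates a list of ids from either end. It is not the library's implementation. The function name `truncate` and the `"left"`/`"right"` string values are assumptions chosen for the example.

```python
def truncate(ids, max_length, side="right"):
    """Keep at most max_length ids, dropping from the given side."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        # Drop the tail, keep the beginning of the sequence.
        return list(ids[:max_length])
    if side == "left":
        # Drop the head, keep the end of the sequence.
        return list(ids[len(ids) - max_length:])
    raise ValueError(f"unknown side: {side!r}")


ids = [101, 7, 8, 9, 102]
kept_right = truncate(ids, 3, side="right")
kept_left = truncate(ids, 3, side="left")
```

Truncating on the left is typically useful when the end of the input matters most, for example the latest turns of a conversation.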
- Rust
Published by Narsil about 4 years ago
tokenizers - Rust v0.11.1
- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams
- Rust
Published by Narsil about 4 years ago
tokenizers - Node v0.8.1
Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).
- Rust
Published by Narsil about 4 years ago
tokenizers - Python v0.11.2
Fixes https://github.com/huggingface/tokenizers/pull/868
- Rust
Published by Narsil about 4 years ago
tokenizers - Python v0.11.1
[#860] Adding TruncationSide to TruncationParams.
- Rust
Published by Narsil about 4 years ago
tokenizers - Python v0.11.0
Fixed
- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links
Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.
Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
- Rust
Published by n1t0 about 4 years ago
tokenizers - Node v0.8.0
BREAKING CHANGES
- Many improvements on the Trainer (#519). The files must now be provided first when calling `tokenizer.train(files, trainer)`.
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)
Fixes
- Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
- Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
- Rust
Published by n1t0 over 4 years ago
tokenizers - Python v0.10.3
Fixed
- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters
Added
- [#693]: Add a CTC Decoder for Wav2Vec models
Removed
- [#714]: Removed support for Python 3.5
- Rust
Published by n1t0 over 4 years ago
tokenizers - Python v0.10.2
Fixed
- [#652]: Fix offsets for `Precompiled` corner case
- [#656]: Fix BPE `continuing_subword_prefix`
- [#674]: Fix `Metaspace` serialization problems
- Rust
Published by n1t0 almost 5 years ago
tokenizers - Python v0.10.1
Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)
- Rust
Published by n1t0 about 5 years ago
tokenizers - Python v0.10.0
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (ie. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.
Fixed
- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that forced reloading the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring
- Rust
Published by n1t0 about 5 years ago
tokenizers - Python v0.10.0rc1
Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (ie. `tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.
Fixed
- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that forced reloading the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring
- Rust
Published by n1t0 about 5 years ago
tokenizers - Python v0.9.4
Fixed
- [#492]: Fix `from_file` on `BertWordPieceTokenizer`
- [#498]: Fix the link to download `sentencepiece_model_pb2.py`
- [#500]: Fix a typo in the docs quicktour
Changed
- [#506]: Improve Encoding mappings for pairs of sequence
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.9.3
Fixed
- [#470]: Fix hanging error when training with custom component
- [#476]: TemplateProcessing serialization is now deterministic
- [#481]: Fix SentencePieceBPETokenizer.from_files
Added
- [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
- [#480]: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.9.2
Fixed
- [#464] Fix a problem with RobertaProcessing being deserialized as BertProcessing
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.9.1
Fixed
- [#459] Fix a problem with deserialization
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.9.0
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizer and PreTokenizer
- [#403]: Add `TemplateProcessing` `PostProcessor`.
- [#420]: Ability to fuse the "unk" token in BPE.
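Fusing the "unk" token (#420) can be pictured as collapsing runs of consecutive unknown tokens into a single one. The sketch below shows that idea in plain Python on a list of token strings. It is an assumption about the observable behavior, not the BPE model's actual code, and `fuse_unk` here is a hypothetical standalone helper.

```python
def fuse_unk(tokens, unk_token="[UNK]"):
    """Collapse each run of consecutive unk tokens into a single unk token."""
    fused = []
    for tok in tokens:
        if tok == unk_token and fused and fused[-1] == unk_token:
            continue  # skip repeats inside a run of unknowns
        fused.append(tok)
    return fused


tokens = ["he", "[UNK]", "[UNK]", "[UNK]", "llo", "[UNK]"]
fused = fuse_unk(tokens)
```

Only adjacent unknowns are merged; unknowns separated by a known token stay separate.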
Changed
- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.9.0.rc1
Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizer and PreTokenizer
- [#403]: Add `TemplateProcessing` `PostProcessor`.
- [#420]: Ability to fuse the "unk" token in BPE.
Changed
- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python 0.8.1
Fixed
- [#333]: Fix deserialization of `AddedToken`, where the content was not restored properly
Changed
- [#329]: Improved warning and behavior when we detect a fork
- [#330]: BertNormalizer now keeps the same behavior as the original implementation when `strip_accents` is not specified.
- Rust
Published by n1t0 over 5 years ago
tokenizers - Python v0.8.0
Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and helps when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
- With the serialization comes the compatibility with `Pickle`! The Tokenizer, all of its components, Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library makes heavy use of the multithreading capacities of our computers to allow very fast tokenization, this led to problems (deadlocks) when used with `multiprocessing`. This version now allows disabling the parallelism, and will warn you if this is necessary.
- And a lot of other improvements and fixes.
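A minimal sketch of disabling the parallelism from Python, using the `TOKENIZERS_PARALLELISM` environment variable introduced in this release (#311). The value `"false"` is assumed here to be an accepted spelling. The variable should be set before the tokenizer is first used, so that forked workers inherit it.

```python
import os

# Disable the library's internal thread pool for this process and for
# any worker processes forked from it afterwards.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the variable in the shell before launching the script has the same effect and avoids ordering issues with imports.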
Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens
Added
- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...). This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with `enable_padding(pad_to_multiple_of=8)` for example.
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment variable. This is especially useful when using `multiprocessing` capabilities, with the `fork` start method, which happens to be the default on Linux systems. Without disabling the parallelism, the process dead-locks while encoding. (Cf [#187] for more information)
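The length arithmetic behind padding to a multiple (#289) can be sketched in plain Python as follows. This illustrates the rounding only and is not the library's padding code. The helper names `padded_length` and `pad_to_multiple` and the default `pad_id=0` are assumptions.

```python
def padded_length(length, multiple):
    """Smallest multiple of `multiple` that is >= length."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((length + multiple - 1) // multiple) * multiple


def pad_to_multiple(ids, multiple, pad_id=0):
    """Right-pad ids with pad_id up to the next multiple of `multiple`."""
    target = padded_length(len(ids), multiple)
    return list(ids) + [pad_id] * (target - len(ids))


ids = [5, 6, 7, 8, 9]
padded = pad_to_multiple(ids, 8)
```

A sequence whose length is already a multiple of 8 is left unchanged, so the overhead is at most 7 padding ids per sequence.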
Changed
- Improved errors generated during truncation: cases where the provided max length is too low are now handled properly.
- [#249] `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized, the argument `is_pretokenized=True` must be specified.
- [#276]: Improve BPE training speeds, by reading files sequentially, but parallelizing the processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option `normalized`, controlling whether a token should be extracted from the normalized version of the input text.
- Rust
Published by n1t0 over 5 years ago
tokenizers - Rust v0.10.1
Fixed
- [#226]: Fix the word indexes when there are special tokens
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Python v0.7.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with a high number of files as it avoids having too many progress bars on screen. Also avoids reading the size of each file before starting to actually read these files, as this process could take really long.
- [#193]: `encode` and `encode_batch` now take a new optional argument, specifying whether we should add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by `encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the normalized one anymore.
- The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using `train(special_tokens=...)` can now be instances of `AddedToken` to provide more control over these tokens.
- [#136]: Updated Pyo3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using constructors.
- [#239]: `CharBPETokenizer` now corresponds to OpenAI GPT BPE implementation by default.
Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token. It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char` or `word` (input space) and `token` (output space).
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with `get_vocab(with_added_tokens: bool)`
- [#136] Models can now be instantiated through object constructors.
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
  - [#156]: when a Unicode character gets split-up in multiple byte-level characters
- Fix a bug where offsets were wrong when there was any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not advised, but that's not the question).
- [#205]: Trim the decoded string in `BPEDecoder` used by `CharBPETokenizer`
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- `BertWordPieceTokenizer` option to `add_special_tokens` must now be given to `encode` or `encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input of `encode` so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. Can be retrieved by calling `normalize(sequence)` on the `Tokenizer`
- Change `Model.from_files` and `Model.empty` to use constructor. The model constructor should take the same arguments as the old methods. (ie `BPE(vocab, merges)` or `BPE()`)
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set `bert_normalizer=False` and `split_on_whitespace_only=True`.
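To show why offsets relative to the original string matter, the sketch below normalizes a string (lowercasing, compatibility decomposition, accent stripping) while recording which original character each normalized character came from, then maps a span found in the normalized text back to the original. It uses only the standard library. The helper names are hypothetical, and the library tracks alignments internally in its own way.

```python
import unicodedata


def normalize_with_alignment(text):
    """Return (normalized, alignment), where alignment[i] is the index in
    `text` of the character that produced normalized[i]."""
    out_chars, alignment = [], []
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFKD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # drop combining accents
            out_chars.append(piece)
            alignment.append(i)
    return "".join(out_chars), alignment


def to_original_offsets(span, alignment):
    """Map a non-empty (start, end) span on the normalized string to the original."""
    start, end = span
    return alignment[start], alignment[end - 1] + 1


# "\ufb01" is the single-character "fi" ligature, "\u00e9" is "e" with acute.
original = "\ufb01ne Caf\u00e9"
normalized, alignment = normalize_with_alignment(original)

start = normalized.index("cafe")
orig_start, orig_end = to_original_offsets((start, start + 4), alignment)
```

Here the normalized text is one character longer than the original, so offsets computed on it would point one character too far if applied to the original directly.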
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Rust v0.10.0
Changed
- [#222]: All Tokenizer's subparts must now be `Send + Sync`
Added
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` & `Model`
Fixed
- [#205]: Trim the decoded string in `BPEDecoder`
- [b770f36]: Fix a bug with added tokens generated IDs
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Rust v0.9.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with a high number of files as it avoids having too many progress bars on screen. Also avoids reading the size of each file before starting to actually read these files, as this process could take really long.
- [#190]: Improved BPE and WordPiece builders
- [#193]: `encode` and `encode_batch` now take a new argument, specifying whether we should add the special tokens
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It is now possible to retrieve it by calling `normalize` on the `Tokenizer`. This brings a reduction of 70% of the memory footprint
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of both strings using both "normalized" or "original" offsets
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the normalized one anymore
- `AddedToken` are now used for both `add_special_tokens` and `add_tokens`. Also, these AddedToken have more options to allow various behaviors.
Added
- [#188]: `impl PostProcessor for ByteLevel`: Handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token
- More alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split-up in multiple byte-level characters
- Fix a bug where offsets were wrong when there was any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not advised, but that's not the question)
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Python v0.6.0
Changes:
- Big improvements in speed for BPE (Both training and tokenization) (#165)
Fixes:
- Some default tokens were missing from `BertWordPieceTokenizer` (cf #160)
- There was a bug in ByteLevel PreTokenizer that caused offsets to be wrong if a char got split up in multiple bytes. (cf #156)
- The `longest_first` truncation strategy had a bug (#174)
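For readers unfamiliar with the strategy name, a longest-first truncation of a pair of sequences can be sketched as repeatedly trimming whichever sequence is currently longer until the pair fits. This plain-Python version is an illustration under that assumption, not the library's implementation. The tie-breaking rule (trim the first sequence when lengths are equal) is also an assumption.

```python
def truncate_longest_first(first, second, max_length):
    """Trim the end of the longer sequence until both fit in max_length."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    first, second = list(first), list(second)
    while len(first) + len(second) > max_length:
        if len(first) >= len(second):
            first.pop()   # first is longer (or tied): drop its last id
        else:
            second.pop()  # second is longer: drop its last id
    return first, second


first = [1, 2, 3, 4, 5, 6]
second = [11, 12, 13]
kept_first, kept_second = truncate_longest_first(first, second, 6)
```

The short sequence is left intact as long as trimming the long one is enough, which is the point of this strategy for question/context style pairs.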
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Rust v0.8.0
Changes:
- Big improvements in speed for BPE (Both training and tokenization) (#165)
Fixes:
- Do not open all files directly while training (#163)
- There was a bug in ByteLevel PreTokenizer that caused offsets to be wrong if a char got split up in multiple bytes. (cf #156)
- The `LongestFirst` truncation strategy had a bug (#174)
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Python v0.5.2
Fixes:
- We introduced a bug related to the saving of the WordPiece model in 0.5.2: The `vocab.txt` file was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Python v0.5.1
Changes:
- `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
- Rust
Published by n1t0 almost 6 years ago
tokenizers - Python v0.5.0
Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers. (Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performances.
Fixes:
- #145: Decoding was buggy on `BertWordPieceTokenizer`.
- #152: Some documentation and examples were still using the old `BPETokenizer`
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.4.2
Fixes:
- Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained. (cf #137)
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.4.1
Fixes:
- Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.4.0
Changes:
- Replaced all `.new()` class methods by a proper `__new__` implementation. (Huge thanks to @ljos with #131)
- Improved typings
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.3.0
Changes:
- BPETokenizer has been renamed to CharBPETokenizer for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (Works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding` that are ready to be processed by a language model, just as the main `Encoding`.
- Provide mapping to the original string offsets using:

      output = tokenizer.encode(...)
      print(output.original_str.offsets(output.offsets[3]))

- Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
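The two additions above (splitting on a delimiter, then mapping each token to an id) can be pictured with this plain-Python sketch. It does not use the library. The helper names, the `[UNK]` fallback token, and the toy vocabulary are assumptions made for the example.

```python
def split_with_offsets(text, delimiter):
    """Split like text.split(delimiter), keeping (start, end) offsets."""
    pieces, start = [], 0
    for part in text.split(delimiter):
        end = start + len(part)
        pieces.append((part, (start, end)))
        start = end + len(delimiter)  # skip over the delimiter itself
    return pieces


def word_level_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the unknown token's id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


text = "the_quick_fox"
pieces = split_with_offsets(text, "_")
tokens = [tok for tok, _ in pieces]

vocab = {"[UNK]": 0, "the": 1, "fox": 2}  # toy vocabulary
ids = word_level_ids(tokens, vocab)
```

Each offset pair slices the original text back to its token, which is what makes the offsets usable for highlighting or label alignment.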
Bug fixes:
- Fix a bug with IndexableString
- Fix a bug with truncation
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.2.1
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.2.0
In this release, we fixed some inconsistencies between the BPETokenizer and the original python version of this tokenizer. If you created your own vocabulary using this Tokenizer, you will need to either train a new one, or use a modified version, where you set the PreTokenizer back to Whitespace (instead of WhitespaceSplit).
- Rust
Published by n1t0 about 6 years ago
tokenizers - Python v0.1.1
- Fix a bug where special tokens get split while encoding
- Rust
Published by n1t0 about 6 years ago