What's Changed

Bump on-headers and compression in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in https://github.com/huggingface/tokenizers/pull/1827
Implement from_bytes and read_bytes Methods in WordPiece Tokenizer for WebAssembly Compatibility by @sondalex in https://github.com/huggingface/tokenizers/pull/1758
fix: use AHashMap to fix compile error by @b00f in https://github.com/huggingface/tokenizers/pull/1840
New stream by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1856
[docs] Add more decoders by @pcuenca in https://github.com/huggingface/tokenizers/pull/1849
Fix missing parenthesis in EncodingVisualizer.calculate_label_colors by @Liam-DeVoe in https://github.com/huggingface/tokenizers/pull/1853
Update quicktour.mdx re: Issue #1625 by @WilliamPLaCroix in https://github.com/huggingface/tokenizers/pull/1846
remove stray comment by @sanderland in https://github.com/huggingface/tokenizers/pull/1831
Fix typo in README by @aisk in https://github.com/huggingface/tokenizers/pull/1808
RUSTSEC-2024-0436 - replace paste with pastey by @nystromjd in https://github.com/huggingface/tokenizers/pull/1834
Tokenizer: Add native async bindings, via pyo3-async-runtimes. by @michaelfeil in https://github.com/huggingface/tokenizers/pull/1843

New Contributors

@b00f made their first contribution in https://github.com/huggingface/tokenizers/pull/1840
@pcuenca made their first contribution in https://github.com/huggingface/tokenizers/pull/1849
@Liam-DeVoe made their first contribution in https://github.com/huggingface/tokenizers/pull/1853
@WilliamPLaCroix made their first contribution in https://github.com/huggingface/tokenizers/pull/1846
@sanderland made their first contribution in https://github.com/huggingface/tokenizers/pull/1831
@aisk made their first contribution in https://github.com/huggingface/tokenizers/pull/1808
@nystromjd made their first contribution in https://github.com/huggingface/tokenizers/pull/1834
@michaelfeil made their first contribution in https://github.com/huggingface/tokenizers/pull/1843

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.22.0rc0

- Rust
Published by ArthurZucker 6 months ago

tokenizers - v0.21.4

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.21.4

No changes: the 0.21.3 release failed, so this is just a re-release.

https://github.com/huggingface/tokenizers/releases/tag/v0.21.3

- Rust
Published by Narsil 7 months ago

What's Changed

Clippy fixes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1818
Fixed a backward-incompatible change that had been introduced in our Rust APIs.

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.2...v0.21.3

- Rust
Published by Narsil 8 months ago

What's Changed

This release is focused on performance optimizations, enabling broader Python no-GIL support, and fixing some onig issues!

Update the release builds following 0.21.1. by @Narsil in https://github.com/huggingface/tokenizers/pull/1746
replace lazy_static with stabilized std::sync::LazyLock in 1.80 by @sftse in https://github.com/huggingface/tokenizers/pull/1739
Fix no-onig no-wasm builds by @414owen in https://github.com/huggingface/tokenizers/pull/1772
Fix typos in strings and comments by @co63oc in https://github.com/huggingface/tokenizers/pull/1770
Fix type notation of merges in BPE Python binding by @Coqueue in https://github.com/huggingface/tokenizers/pull/1766
Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1762
Fix data path in test_continuing_prefix_trainer_mismatch by @GaetanLepage in https://github.com/huggingface/tokenizers/pull/1747
clippy by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1781
Update pyo3 and rust-numpy depends for no-gil/free-threading compat by @Qubitium in https://github.com/huggingface/tokenizers/pull/1774
Use ApiBuilder::from_env() in from_pretrained function by @BenLocal in https://github.com/huggingface/tokenizers/pull/1737
Upgrade onig, to get it compiling with GCC 15 by @414owen in https://github.com/huggingface/tokenizers/pull/1771
Itertools upgrade by @sftse in https://github.com/huggingface/tokenizers/pull/1756
Bump webpack-dev-server from 4.10.0 to 5.2.1 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1792
Bump brace-expansion from 1.1.11 to 1.1.12 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1796
Fix features blending into a paragraph by @bionicles in https://github.com/huggingface/tokenizers/pull/1798
Adding throughput to benches to have a more consistent measure across by @Narsil in https://github.com/huggingface/tokenizers/pull/1800
Upgrading dependencies. by @Narsil in https://github.com/huggingface/tokenizers/pull/1801
[docs] Whitespace by @stevhliu in https://github.com/huggingface/tokenizers/pull/1785
Hotfixing the stub. by @Narsil in https://github.com/huggingface/tokenizers/pull/1802
Bpe clones by @sftse in https://github.com/huggingface/tokenizers/pull/1707
Fixed Length Pre-Tokenizer by @jonvet in https://github.com/huggingface/tokenizers/pull/1713
Consolidated optimization ahash dary compact str by @Narsil in https://github.com/huggingface/tokenizers/pull/1799
🚨 breaking: Fix training with special tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1617

New Contributors

@414owen made their first contribution in https://github.com/huggingface/tokenizers/pull/1772
@co63oc made their first contribution in https://github.com/huggingface/tokenizers/pull/1770
@Coqueue made their first contribution in https://github.com/huggingface/tokenizers/pull/1766
@GaetanLepage made their first contribution in https://github.com/huggingface/tokenizers/pull/1747
@Qubitium made their first contribution in https://github.com/huggingface/tokenizers/pull/1774
@BenLocal made their first contribution in https://github.com/huggingface/tokenizers/pull/1737
@bionicles made their first contribution in https://github.com/huggingface/tokenizers/pull/1798
@stevhliu made their first contribution in https://github.com/huggingface/tokenizers/pull/1785
@jonvet made their first contribution in https://github.com/huggingface/tokenizers/pull/1713

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.1...v0.21.2rc0

- Rust
Published by ArthurZucker 8 months ago

What's Changed

Update dev version and pyproject.toml by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1693
Add feature flag hint to README.md, fixes #1633 by @sftse in https://github.com/huggingface/tokenizers/pull/1709
Upgrade to PyO3 0.23 by @Narsil in https://github.com/huggingface/tokenizers/pull/1708
Fixing the README. by @Narsil in https://github.com/huggingface/tokenizers/pull/1714
Fix typo in Split docstrings by @Dylan-Harden3 in https://github.com/huggingface/tokenizers/pull/1701
Fix typos by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1715
Update documentation of Rust feature by @sondalex in https://github.com/huggingface/tokenizers/pull/1711
Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in https://github.com/huggingface/tokenizers/pull/1699
Fixing the stream by removing the read_index altogether. by @Narsil in https://github.com/huggingface/tokenizers/pull/1716
Fixing NormalizedString append when normalized is empty. by @Narsil in https://github.com/huggingface/tokenizers/pull/1717
🚨 Support updating template processors by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1652. Removed in this release to keep backward compatibility temporarily.
Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in https://github.com/huggingface/tokenizers/pull/1724
Add rustls-tls feature by @torymur in https://github.com/huggingface/tokenizers/pull/1732

New Contributors

@Dylan-Harden3 made their first contribution in https://github.com/huggingface/tokenizers/pull/1701
@sondalex made their first contribution in https://github.com/huggingface/tokenizers/pull/1711
@n0gu-furiosa made their first contribution in https://github.com/huggingface/tokenizers/pull/1699
@earlytobed made their first contribution in https://github.com/huggingface/tokenizers/pull/1724
@torymur made their first contribution in https://github.com/huggingface/tokenizers/pull/1732

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1

- Rust
Published by Narsil 12 months ago

What's Changed

Update dev version and pyproject.toml by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1693
Add feature flag hint to README.md, fixes #1633 by @sftse in https://github.com/huggingface/tokenizers/pull/1709
Upgrade to PyO3 0.23 by @Narsil in https://github.com/huggingface/tokenizers/pull/1708
Fixing the README. by @Narsil in https://github.com/huggingface/tokenizers/pull/1714
Fix typo in Split docstrings by @Dylan-Harden3 in https://github.com/huggingface/tokenizers/pull/1701
Fix typos by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1715
Update documentation of Rust feature by @sondalex in https://github.com/huggingface/tokenizers/pull/1711
Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in https://github.com/huggingface/tokenizers/pull/1699
Fixing the stream by removing the read_index altogether. by @Narsil in https://github.com/huggingface/tokenizers/pull/1716
Fixing NormalizedString append when normalized is empty. by @Narsil in https://github.com/huggingface/tokenizers/pull/1717
🚨 Support updating template processors by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1652
Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in https://github.com/huggingface/tokenizers/pull/1724
Add rustls-tls feature by @torymur in https://github.com/huggingface/tokenizers/pull/1732

New Contributors

@Dylan-Harden3 made their first contribution in https://github.com/huggingface/tokenizers/pull/1701
@sondalex made their first contribution in https://github.com/huggingface/tokenizers/pull/1711
@n0gu-furiosa made their first contribution in https://github.com/huggingface/tokenizers/pull/1699
@earlytobed made their first contribution in https://github.com/huggingface/tokenizers/pull/1724
@torymur made their first contribution in https://github.com/huggingface/tokenizers/pull/1732

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1rc0

- Rust
Published by Narsil 12 months ago

What's Changed

More cache options. by @Narsil in https://github.com/huggingface/tokenizers/pull/1675
Disable caching for long strings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1676
Testing ABI3 wheels to reduce number of wheels by @Narsil in https://github.com/huggingface/tokenizers/pull/1674
Adding an API for decode streaming. by @Narsil in https://github.com/huggingface/tokenizers/pull/1677
Decode stream python by @Narsil in https://github.com/huggingface/tokenizers/pull/1678
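
The decode streaming entries above concern emitting text incrementally as token ids arrive. The following is a minimal, standard-library-only sketch of that general idea, not the library's implementation: `ToyDecodeStream` and its byte-level toy vocabulary are hypothetical names invented for illustration. It re-decodes the ids seen so far, returns only the newly completed text, and holds output back while the tail is an incomplete UTF-8 sequence.

```python
from typing import Dict, List, Optional


class ToyDecodeStream:
    """Hypothetical illustration of incremental decoding (not the tokenizers API)."""

    def __init__(self, vocab: Dict[int, bytes]) -> None:
        self.vocab = vocab          # toy mapping: token id -> raw bytes
        self.ids: List[int] = []    # every id received so far
        self.emitted = ""           # text already handed to the caller

    def _decode(self) -> str:
        # Decode all ids at once; incomplete byte sequences become U+FFFD.
        raw = b"".join(self.vocab[i] for i in self.ids)
        return raw.decode("utf-8", errors="replace")

    def step(self, token_id: int) -> Optional[str]:
        """Feed one id; return the new text, or None if nothing is ready yet."""
        self.ids.append(token_id)
        text = self._decode()
        # Hold back while the tail is still an incomplete character.
        if len(text) <= len(self.emitted) or text.endswith("\ufffd"):
            return None
        new_text = text[len(self.emitted):]
        self.emitted = text
        return new_text


# "é" is split across two tokens, so the first half yields None.
stream = ToyDecodeStream({0: b"Hi ", 1: b"\xc3", 2: b"\xa9", 3: b"!"})
chunks = [stream.step(i) for i in (0, 1, 2, 3)]
print(chunks)
```

Re-decoding everything on each step is quadratic in the sequence length; a real implementation would bound the retained ids, which this sketch leaves out for brevity.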

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.3...v0.20.4-rc0

- Rust
Published by Narsil over 1 year ago

What's Changed

There was a breaking change in 0.20.3 for tuple inputs of encode_batch!

fix pylist by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1673
[MINOR:TYPO] Fix docstrings by @cakiki in https://github.com/huggingface/tokenizers/pull/1653

New Contributors

@cakiki made their first contribution in https://github.com/huggingface/tokenizers/pull/1653

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.2...v0.20.3

- Rust
Published by ArthurZucker over 1 year ago

tokenizers - v0.20.2

Release v0.20.2

Thanks a MILE to @diliop, we now have support for Python 3.13! 🥳

What's Changed

Bump cookie and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1648
Fix off-by-one error in tokenizer::normalizer::Range::len by @rlanday in https://github.com/huggingface/tokenizers/pull/1638
Arg name correction: auth_token -> token by @rravenel in https://github.com/huggingface/tokenizers/pull/1621
Unsound call of set_var by @sftse in https://github.com/huggingface/tokenizers/pull/1664
Add safety comments by @Manishearth in https://github.com/huggingface/tokenizers/pull/1651
Bump actions/checkout to v4 by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1667
PyO3 0.22 by @diliop in https://github.com/huggingface/tokenizers/pull/1665
Bump actions versions by @tinyboxvk in https://github.com/huggingface/tokenizers/pull/1669

New Contributors

@rlanday made their first contribution in https://github.com/huggingface/tokenizers/pull/1638
@rravenel made their first contribution in https://github.com/huggingface/tokenizers/pull/1621
@sftse made their first contribution in https://github.com/huggingface/tokenizers/pull/1664
@Manishearth made their first contribution in https://github.com/huggingface/tokenizers/pull/1651
@tinyboxvk made their first contribution in https://github.com/huggingface/tokenizers/pull/1667
@diliop made their first contribution in https://github.com/huggingface/tokenizers/pull/1665

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.1...v0.20.2

- Rust
Published by ArthurZucker over 1 year ago

tokenizers - Release v0.20.1

What's Changed

The most awaited offset issue with Llama is fixed 🥳

Update README.md by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1608
fix benchmark file link by @152334H in https://github.com/huggingface/tokenizers/pull/1610
Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in https://github.com/huggingface/tokenizers/pull/1626
[ignore_merges] Fix offsets by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1640
Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1629
Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1630
Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1631
Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1641
Fix documentation build by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1642
style: simplify string formatting for readability by @hamirmahal in https://github.com/huggingface/tokenizers/pull/1632

New Contributors

@152334H made their first contribution in https://github.com/huggingface/tokenizers/pull/1610
@hamirmahal made their first contribution in https://github.com/huggingface/tokenizers/pull/1632

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.0...v0.20.1

- Rust
Published by ArthurZucker over 1 year ago

tokenizers - Release v0.20.0: faster encode, better python support

Release v0.20.0

This release is focused on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement on our side! With a few minor changes (mostly #1587), we measured the gains on Llama3 running on a g6 instance on AWS, using https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py.

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier:

```python3
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))

>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
```

The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:

```python
from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]
norm[1].lowercase = False
```

What's Changed

remove enforcement of non special when adding tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1521
[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in https://github.com/huggingface/tokenizers/pull/1513
Make USED_PARALLELISM atomic by @nathaniel-daniel in https://github.com/huggingface/tokenizers/pull/1532
Fixing for clippy 1.78 by @Narsil in https://github.com/huggingface/tokenizers/pull/1548
feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/tokenizers/pull/1551
Switch from cached_download to hf_hub_download in tests by @Wauplin in https://github.com/huggingface/tokenizers/pull/1547
Fix "dictionnary" typo by @nprisbrey in https://github.com/huggingface/tokenizers/pull/1511
make sure we don't warn on empty tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1554
Enable dropout = 0.0 as an equivalent to none in BPE by @mcognetta in https://github.com/huggingface/tokenizers/pull/1550
Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1569
Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1555
Fix clippy + feature test management. by @Narsil in https://github.com/huggingface/tokenizers/pull/1580
Bump spm_precompiled to 0.1.3 by @MikeIvanichev in https://github.com/huggingface/tokenizers/pull/1571
Add benchmark vs tiktoken by @Narsil in https://github.com/huggingface/tokenizers/pull/1582
Fixing the benchmark. by @Narsil in https://github.com/huggingface/tokenizers/pull/1583
Tiny improvement by @Narsil in https://github.com/huggingface/tokenizers/pull/1585
Enable fancy regex by @Narsil in https://github.com/huggingface/tokenizers/pull/1586
Fixing release CI strict (taken from safetensors). by @Narsil in https://github.com/huggingface/tokenizers/pull/1593
Adding some serialization testing around the wrapper. by @Narsil in https://github.com/huggingface/tokenizers/pull/1594
Add-legacy-tests by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1597
Adding a few tests for decoder deserialization. by @Narsil in https://github.com/huggingface/tokenizers/pull/1598
Better serialization error by @Narsil in https://github.com/huggingface/tokenizers/pull/1595
Add test normalizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1600
Improve decoder deserialization by @Narsil in https://github.com/huggingface/tokenizers/pull/1599
Using serde (serde_pyo3) to get __str__ and __repr__ easily. by @Narsil in https://github.com/huggingface/tokenizers/pull/1588
Merges cannot handle tokens containing spaces. by @Narsil in https://github.com/huggingface/tokenizers/pull/909
Fix doc about split by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1591
Support None to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1590
Fix strip python type by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1602
Tests + Deserialization improvement for normalizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1604
add deserialize for pre tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1603
Perf improvement 16% by removing offsets. by @Narsil in https://github.com/huggingface/tokenizers/pull/1587

New Contributors

@nathaniel-daniel made their first contribution in https://github.com/huggingface/tokenizers/pull/1532
@nprisbrey made their first contribution in https://github.com/huggingface/tokenizers/pull/1511
@mcognetta made their first contribution in https://github.com/huggingface/tokenizers/pull/1550
@MikeIvanichev made their first contribution in https://github.com/huggingface/tokenizers/pull/1571

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.1...v0.20.0rc1

- Rust
Published by ArthurZucker over 1 year ago

tokenizers - v0.19.1

What's Changed

add serialization for ignore_merges by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1504

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1

- Rust
Published by ArthurZucker almost 2 years ago

tokenizers - v0.19.0

What's Changed

chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped, everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498
Fixing doc. by @Narsil in https://github.com/huggingface/tokenizers/pull/1499
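
The breaking change above replaces a boolean prefix-space flag with a prepend_scheme enum. As a rough illustration of what such an enum can express that a boolean cannot, here is a standard-library-only sketch. The function `apply_prepend_scheme`, its `is_first_section` parameter, and the assumption that the enum takes the values "always", "first", and "never" are illustrative choices made here, not a copy of the library's code.

```python
REPLACEMENT = "\u2581"  # the "▁" marker used by SentencePiece-style tokenizers


def apply_prepend_scheme(text: str, scheme: str, is_first_section: bool = True) -> str:
    """Hypothetical model of a Metaspace-style prepend scheme.

    "always": prefix every section with the marker
    "first":  prefix only the first section of the input
    "never":  never add a prefix
    """
    if scheme not in ("always", "first", "never"):
        raise ValueError(f"unknown prepend scheme: {scheme!r}")
    out = text.replace(" ", REPLACEMENT)
    should_prepend = scheme == "always" or (scheme == "first" and is_first_section)
    # Avoid doubling the marker if the text already starts with it.
    if should_prepend and not out.startswith(REPLACEMENT):
        out = REPLACEMENT + out
    return out


for scheme in ("always", "first", "never"):
    # A section that follows a special token is not the first section.
    print(scheme, apply_prepend_scheme("hello world", scheme, is_first_section=False))
```

Under these assumptions, the old boolean maps onto "always" (true) and "never" (false), while "first" covers the case where only the start of the input should receive the marker.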

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0

- Rust
Published by Narsil almost 2 years ago

tokenizers - v0.19.0rc0

Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177

What's Changed

chore: Remove CLI - this was originally intended for local development by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1442
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1456
Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
🚨🚨 BREAKING CHANGE 🚨🚨: (add_prefix_space dropped, everything is using prepend_scheme enum instead) Refactor metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1476
Add more support for tiktoken based tokenizers by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1493
PyO3 0.21. by @Narsil in https://github.com/huggingface/tokenizers/pull/1494
Remove 3.13 (potential undefined behavior.) by @Narsil in https://github.com/huggingface/tokenizers/pull/1497
Bumping all versions 3 times (ty transformers :) ) by @Narsil in https://github.com/huggingface/tokenizers/pull/1498

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0

- Rust
Published by Narsil almost 2 years ago

tokenizers - v0.15.2

What's Changed

Big shoutout to @rlrs for the fast Replace normalizer PR, which boosts the performance of the tokenizers.

chore: Update dependencies to latest supported versions by @bryantbiggs in https://github.com/huggingface/tokenizers/pull/1441
Convert word counts to u64 by @stephenroller in https://github.com/huggingface/tokenizers/pull/1433
Efficient Replace normalizer by @rlrs in https://github.com/huggingface/tokenizers/pull/1413

New Contributors

@bryantbiggs made their first contribution in https://github.com/huggingface/tokenizers/pull/1441
@stephenroller made their first contribution in https://github.com/huggingface/tokenizers/pull/1433
@rlrs made their first contribution in https://github.com/huggingface/tokenizers/pull/1413

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1

- Rust
Published by ArthurZucker about 2 years ago

tokenizers - v0.15.1

What's Changed

update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
Derive Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386
Encode special tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1437
Update release for python3.12 windows by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1438

New Contributors

@steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1

- Rust
Published by ArthurZucker about 2 years ago

tokenizers - v0.15.1.rc0

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scripts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturin mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355
fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
Allow huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
update to version = "0.15.1-dev0" by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1390
Derive Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Stale bot. by @Narsil in https://github.com/huggingface/tokenizers/pull/1404
Fix doc links in readme by @Pierrci in https://github.com/huggingface/tokenizers/pull/1367
Faster HF dataset iteration in docs by @mariosasko in https://github.com/huggingface/tokenizers/pull/1414
Add quick doc to byte_level.rs by @steventrouble in https://github.com/huggingface/tokenizers/pull/1420
Fix make bench. by @Narsil in https://github.com/huggingface/tokenizers/pull/1428
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1430
pyo3: update to 0.20 by @mikelui in https://github.com/huggingface/tokenizers/pull/1386

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341
@tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
@rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
@mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
@Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385
@steventrouble made their first contribution in https://github.com/huggingface/tokenizers/pull/1420

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0

- Rust
Published by Narsil about 2 years ago

tokenizers -

What's Changed

fix a clerical error in the comment by @tiandiweizun in https://github.com/huggingface/tokenizers/pull/1356
fix: remove useless token by @rtrompier in https://github.com/huggingface/tokenizers/pull/1371
Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1370
Allow hf_hub 0.18 by @mariosasko in https://github.com/huggingface/tokenizers/pull/1383
Allow huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357

New Contributors

@tiandiweizun made their first contribution in https://github.com/huggingface/tokenizers/pull/1356
@rtrompier made their first contribution in https://github.com/huggingface/tokenizers/pull/1371
@mariosasko made their first contribution in https://github.com/huggingface/tokenizers/pull/1383
@Wauplin made their first contribution in https://github.com/huggingface/tokenizers/pull/1385

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0

- Rust
Published by ArthurZucker over 2 years ago

tokenizers - v0.14.1

What's Changed

Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
Makes decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scritpts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353
Preparing release. by @Narsil in https://github.com/huggingface/tokenizers/pull/1355

New Contributors

@csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
@chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
@sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
@boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
@hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
@bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
@kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
@SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
@jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.1

- Rust
Published by Narsil over 2 years ago

tokenizers - v0.14.1rc1

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scritpts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335
update package version for dev by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1339
Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1341
Let's allow hf_hub < 1.0 by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1344
Fixing the progressbar. by @Narsil in https://github.com/huggingface/tokenizers/pull/1353

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322
@eaplatanios made their first contribution in https://github.com/huggingface/tokenizers/pull/1341

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.1rc1

- Rust
Published by Narsil over 2 years ago

tokenizers - v0.14.0

⚠️ Reworks the release pipeline. Other breaking changes ⚠️:

- #1335: AddedToken is reworked; is_special_token is renamed to special for consistency.
- The http feature is now OFF by default and depends on hf-hub instead of cached_path (updated cache directory, better sync implementation).
- Removed the SSL link on the Python package; it now calls huggingface_hub directly instead.
- New dependency: huggingface_hub (while we deprecate Tokenizer.from_pretrained(...) in favor of Tokenizer.from_file(huggingface_hub.hf_hub_download(MODEL_ID, "tokenizer.json"))).
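
The replacement for Tokenizer.from_pretrained(...) mentioned above could look roughly like the sketch below. It is a minimal illustration, not the project's official migration recipe: the helper name load_tokenizer and its download/loader parameters are hypothetical and exist only so the two steps can be swapped out offline. The defaults are huggingface_hub.hf_hub_download and Tokenizer.from_file, as named in the note above.

```python
from typing import Any, Callable, Optional


def load_tokenizer(
    model_id: str,
    filename: str = "tokenizer.json",
    download: Optional[Callable[[str, str], str]] = None,
    loader: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Fetch `filename` from the Hub repo `model_id` and load it as a tokenizer.

    `load_tokenizer`, `download` and `loader` are hypothetical names used for
    this sketch; they are not part of the tokenizers API.
    """
    if download is None:
        # Imported lazily so the helper can be exercised without network access.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    if loader is None:
        from tokenizers import Tokenizer

        loader = Tokenizer.from_file

    # Step 1: resolve the file to a local path (downloads it or reuses the cache).
    local_path = download(model_id, filename)
    # Step 2: build the tokenizer from the local JSON file.
    return loader(local_path)
```

With both packages installed, load_tokenizer("bert-base-uncased") would download tokenizer.json for that repository and return a Tokenizer. The model id is only an example and assumes the repository ships a tokenizer.json file.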

What's Changed

Fix conda release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1211
Fix node release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1212
Printing warning to stderr. by @Narsil in https://github.com/huggingface/tokenizers/pull/1222
Fixing padding_left sequence_ids. by @Narsil in https://github.com/huggingface/tokenizers/pull/1233
Use LTO for release and benchmark builds by @csko in https://github.com/huggingface/tokenizers/pull/1157
fix unigram.rs test_sample() by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1244
implement a simple max_sentencepiece_length into BPE by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1228
Makes decode and decode_batch work on borrowed content. by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1251
Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319
Release all at once for simplicity. by @Narsil in https://github.com/huggingface/tokenizers/pull/1320
Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321
pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scritpts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335

New Contributors

@csko made their first contribution in https://github.com/huggingface/tokenizers/pull/1157
@chris-ha458 made their first contribution in https://github.com/huggingface/tokenizers/pull/1244
@sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
@boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
@hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
@bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
@kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
@SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
@jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317
@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.3...v0.14.0

- Rust
Published by ArthurZucker over 2 years ago

tokenizers - v0.14.0.rc1

Reworks the release pipeline. Other breaking changes are mostly related to https://github.com/huggingface/tokenizers/pull/1335, where AddedToken is reworked.

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scritpts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328
Reduce number of different revisions by 1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1329
Python 38 arm by @Narsil in https://github.com/huggingface/tokenizers/pull/1330
Move to maturing mimicking move for safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
Updating the docs with the new command. by @Narsil in https://github.com/huggingface/tokenizers/pull/1333
Update added tokens by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1335

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.14.0.rc1

- Rust
Published by ArthurZucker over 2 years ago

tokenizers - v0.13.4.rc3

Mostly checking the new release scripts actually work.

What's Changed

pyo3: update to 0.19 by @mikelui in https://github.com/huggingface/tokenizers/pull/1322
Add expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
Re-using scritpts from safetensors. by @Narsil in https://github.com/huggingface/tokenizers/pull/1328

New Contributors

@mikelui made their first contribution in https://github.com/huggingface/tokenizers/pull/1322

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.13.4.rc3

- Rust
Published by Narsil over 2 years ago

tokenizers - v0.13.4.rc2

What's Changed

Fix stride condition. by @Narsil in https://github.com/huggingface/tokenizers/pull/1321

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc1...v0.13.4.rc2

- Rust
Published by Narsil over 2 years ago

tokenizers - Python v0.13.4.rc1

What's Changed

Update all GH Actions with dependency on actions/checkout by @mfuntowicz in https://github.com/huggingface/tokenizers/pull/1256
Parallelize unigram trainer by @mishig25 in https://github.com/huggingface/tokenizers/pull/976
Update unigram/trainer.rs by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1257
Fixing broken link. by @Narsil in https://github.com/huggingface/tokenizers/pull/1268
fix documentation regarding regex by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1264
Update Cargo.toml by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1266
Update README.md - Broken link by @sbhavani in https://github.com/huggingface/tokenizers/pull/1272
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/tokenizers/pull/1273
Improve error for truncation with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1275
Add unigram bytefallback by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1217
revise type specification by @hiroshi-matsuda-rit in https://github.com/huggingface/tokenizers/pull/1289
Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1291
Update path name: master -> main by @bact in https://github.com/huggingface/tokenizers/pull/1292
import Tuple from typing by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1295
Fixing clippy warnings on 1.71. by @Narsil in https://github.com/huggingface/tokenizers/pull/1296
Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1299
feat: Added CITATION.cff. by @SamuelLarkin in https://github.com/huggingface/tokenizers/pull/1302
Single warning for holes. by @Narsil in https://github.com/huggingface/tokenizers/pull/1303
Give error when initializing tokenizer with too high stride by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1306
Handle when precompiled charsmap is empty by @kellymarchisio in https://github.com/huggingface/tokenizers/pull/1308
Derive clone for TrainerWrapper by @jonatanklosko in https://github.com/huggingface/tokenizers/pull/1317
CD backports by @chris-ha458 in https://github.com/huggingface/tokenizers/pull/1318
0.13.4.rc1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1319

New Contributors

@sbhavani made their first contribution in https://github.com/huggingface/tokenizers/pull/1272
@boyleconnor made their first contribution in https://github.com/huggingface/tokenizers/pull/1275
@hiroshi-matsuda-rit made their first contribution in https://github.com/huggingface/tokenizers/pull/1289
@bact made their first contribution in https://github.com/huggingface/tokenizers/pull/1292
@kellymarchisio made their first contribution in https://github.com/huggingface/tokenizers/pull/1295
@SamuelLarkin made their first contribution in https://github.com/huggingface/tokenizers/pull/1302
@jonatanklosko made their first contribution in https://github.com/huggingface/tokenizers/pull/1317

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4-rc2...v0.13.4.rc1

- Rust
Published by Narsil over 2 years ago

tokenizers - https://github.com/huggingface/tokenizers/releases/tag/v0.13.4-rc2

- Rust
Published by github-actions[bot] almost 3 years ago

tokenizers - https://github.com/huggingface/tokenizers/releases/tag/v0.13.4-rc1

- Rust
Published by github-actions[bot] almost 3 years ago

tokenizers - Node v0.13.3

What's Changed

Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205

New Contributors

@ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
@SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
@hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
@fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
@mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
@lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192

Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1

- Rust
Published by ArthurZucker almost 3 years ago

tokenizers - Rust v0.13.3

What's Changed

Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205
New release by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1207

New Contributors

@ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
@SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
@hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
@fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
@mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
@lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.2...v0.13.3
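
The ByteFallback entry (#1183) is easier to picture with a small example. The sketch below is plain Python that illustrates the idea only, not the library's implementation. It assumes byte tokens are spelled `<0xNN>`, and the name `byte_fallback_decode` is hypothetical. How invalid byte sequences are reported is a choice made for this sketch.

```python
import re

# Assumed spelling of a byte token, e.g. "<0xE4>" (an assumption of this sketch).
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def byte_fallback_decode(tokens):
    """Turn runs of byte tokens back into text; pass other tokens through."""
    out = []
    pending = bytearray()  # bytes collected from consecutive byte tokens

    def flush():
        if pending:
            # Invalid UTF-8 becomes U+FFFD here instead of raising an error.
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(token)
    flush()
    return out
```

For instance, `byte_fallback_decode(["caf", "<0xC3>", "<0xA9>"])` returns `["caf", "é"]`, because the two byte tokens together form the UTF-8 encoding of "é".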

- Rust
Published by ArthurZucker almost 3 years ago

tokenizers - Python v0.13.3

What's Changed

Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205

New Contributors

@ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
@SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
@hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
@fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
@mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
@lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192

Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1

- Rust
Published by ArthurZucker almost 3 years ago

tokenizers - Python v0.13.3rc1

What's Changed

Update pr docs actions by @mishig25 in https://github.com/huggingface/tokenizers/pull/1101
Adding rust audit. by @Narsil in https://github.com/huggingface/tokenizers/pull/1099
Revert "Update pr docs actions" by @mishig25 in https://github.com/huggingface/tokenizers/pull/1107
Bump loader-utils from 1.4.0 to 1.4.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1108
Include license file in Rust crate by @ankane in https://github.com/huggingface/tokenizers/pull/1115
Bump decode-uri-component from 0.2.0 to 0.2.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1116
[FIX] In SentencePieceBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1120
Fixing conda ssl location by @Narsil in https://github.com/huggingface/tokenizers/pull/1124
Adding stale bot ? by @Narsil in https://github.com/huggingface/tokenizers/pull/1123
Bump minimatch from 3.0.4 to 3.1.2 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1126
Bump decode-uri-component from 0.2.0 to 0.2.2 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1125
Bump cached-path from 0.5 to 0.6 by @hvaara in https://github.com/huggingface/tokenizers/pull/1127
Wrap rustdoc html entity in code block by @hvaara in https://github.com/huggingface/tokenizers/pull/1130
Fix broken links in docs by @hvaara in https://github.com/huggingface/tokenizers/pull/1133
Bump derive_builder from 0.9 to 0.12 by @hvaara in https://github.com/huggingface/tokenizers/pull/1129
Ignore Cargo.lock for subfolders by @hvaara in https://github.com/huggingface/tokenizers/pull/1131
Fix one char super tiny typo by @fzyzcjy in https://github.com/huggingface/tokenizers/pull/1137
[FIX] In CharBPETokenizer, when Vocab or merges is None, unk_token cannot be used. by @SeongBeomLEE in https://github.com/huggingface/tokenizers/pull/1136
Bump json5, copy-webpack-plugin, webpack and webpack-cli in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1139
Bump json5 from 2.2.0 to 2.2.3 in /bindings/node by @dependabot in https://github.com/huggingface/tokenizers/pull/1140
Add missing build targets by @Narsil in https://github.com/huggingface/tokenizers/pull/1145
Adding python 3.8 for M1 by @Narsil in https://github.com/huggingface/tokenizers/pull/1147
Made dirs optional by @ankane in https://github.com/huggingface/tokenizers/pull/1148
Update info on environment variable for threading by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1150
Making Tokenizer clone. by @Narsil in https://github.com/huggingface/tokenizers/pull/1152
Prevent using from_pretrained on invalid ids (better error message). by @Narsil in https://github.com/huggingface/tokenizers/pull/1153
Improved version. by @Narsil in https://github.com/huggingface/tokenizers/pull/1154
Update model.rs by @thomasw21 in https://github.com/huggingface/tokenizers/pull/1166
Using clippy 1.67 by @Narsil in https://github.com/huggingface/tokenizers/pull/1167
pyo3 v0.18 migration by @mert-kurttutan in https://github.com/huggingface/tokenizers/pull/1173
Bump webpack from 5.75.0 to 5.76.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in https://github.com/huggingface/tokenizers/pull/1181
Fixing infinite loop in UnigramTrainer. by @Narsil in https://github.com/huggingface/tokenizers/pull/1182
Bump dirs from 3.0 to 4.0 by @hvaara in https://github.com/huggingface/tokenizers/pull/1142
Adding ByteFallback support for tokenizers. by @Narsil in https://github.com/huggingface/tokenizers/pull/1183
Faster datasets train example by @lhoestq in https://github.com/huggingface/tokenizers/pull/1192
Adding Replace to decoder (to undo the Replace Normalizer for Metaspace split). by @Narsil in https://github.com/huggingface/tokenizers/pull/1195
Creating normalizers.Prepend (To be used instead of Metaspace). by @Narsil in https://github.com/huggingface/tokenizers/pull/1194
Adding 2 new decoders: by @Narsil in https://github.com/huggingface/tokenizers/pull/1196
Fixing decoder strip because of char boundaries. by @Narsil in https://github.com/huggingface/tokenizers/pull/1197
Add content to Strip decoder to allow decoding mid tokens. by @Narsil in https://github.com/huggingface/tokenizers/pull/1199
New version 0.13.3 by @Narsil in https://github.com/huggingface/tokenizers/pull/1205

New Contributors

@ankane made their first contribution in https://github.com/huggingface/tokenizers/pull/1115
@SeongBeomLEE made their first contribution in https://github.com/huggingface/tokenizers/pull/1120
@hvaara made their first contribution in https://github.com/huggingface/tokenizers/pull/1127
@fzyzcjy made their first contribution in https://github.com/huggingface/tokenizers/pull/1137
@mert-kurttutan made their first contribution in https://github.com/huggingface/tokenizers/pull/1150
@lhoestq made their first contribution in https://github.com/huggingface/tokenizers/pull/1192

Full Changelog: https://github.com/huggingface/tokenizers/compare/node-v0.13.2...python-v0.13.3rc1

- Rust
Published by Narsil almost 3 years ago

tokenizers - Node 0.13.2

Python 3.11 support (Python only modification)

- Rust
Published by Narsil over 3 years ago

tokenizers - Rust 0.13.2

Python 3.11 support (Python only modification)

- Rust
Published by Narsil over 3 years ago

tokenizers - Python 0.13.2

[0.13.2]

[#1096] Python 3.11 support

- Rust
Published by Narsil over 3 years ago

tokenizers - Node 0.13.1

[0.13.1]

[#1072] Fixing Roberta type ids.

- Rust
Published by Narsil over 3 years ago

tokenizers - Rust 0.13.1

[0.13.1]

[#1072] Fixing Roberta type ids.

- Rust
Published by Narsil over 3 years ago

tokenizers - Python v0.13.1

[0.13.1]

[#1072] Fixing Roberta type ids.

- Rust
Published by Narsil over 3 years ago

tokenizers - Python v0.13.0

[0.13.0]

[#956] PyO3 version upgrade
[#1055] M1 automated builds
[#1008] Decoder is now a composable trait, while remaining backward compatible
[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
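
One way to picture "composable" for decoders: each step maps a list of token strings to another list of token strings, and only the end of the chain joins them into text. The sketch below is plain Python with hypothetical names (`replace_step`, `strip_left_step`, `decode_chain`). It does not call the library and is not its actual trait definition.

```python
from functools import reduce


def replace_step(pattern, content):
    """Build a step that replaces `pattern` with `content` in every token."""
    return lambda tokens: [t.replace(pattern, content) for t in tokens]


def strip_left_step(char):
    """Build a step that removes one leading `char` from the first token only."""
    def step(tokens):
        if tokens and tokens[0].startswith(char):
            return [tokens[0][len(char):]] + tokens[1:]
        return tokens
    return step


def decode_chain(steps, tokens):
    """Run the steps in order, then join the resulting pieces into a string."""
    pieces = reduce(lambda acc, step: step(acc), steps, list(tokens))
    return "".join(pieces)
```

With `steps = [replace_step("▁", " "), strip_left_step(" ")]`, calling `decode_chain(steps, ["▁Hello", "▁world"])` returns `"Hello world"`. Because every step has the same list-in, list-out shape, steps can be reordered or combined freely.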

- Rust
Published by Narsil over 3 years ago

tokenizers - Node v0.13.0

[0.13.0]

[#1008] Decoder is now a composable trait, while remaining backward compatible
[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

- Rust
Published by Narsil over 3 years ago

tokenizers - Rust v0.13.0

[0.13.0]

[#1009] unstable_wasm feature to support building on Wasm (it's unstable !)
[#1008] Decoder is now a composable trait, while remaining backward compatible
[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.

- Rust
Published by Narsil over 3 years ago

tokenizers - Python v0.12.1

[0.12.1]

[#938] Reverted breaking change. https://github.com/huggingface/transformers/issues/16520

- Rust
Published by Narsil almost 4 years ago

tokenizers - [YANKED] Node v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change. Using 0.12 to match other bindings.

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
[#961] Added link for Ruby port of tokenizers

- Rust
Published by Narsil almost 4 years ago

tokenizers - [YANKED] Python v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change.

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
[#962] Fix tests for python 3.10
[#961] Added link for Ruby port of tokenizers

- Rust
Published by Narsil almost 4 years ago

tokenizers - [YANKED] Rust v0.12.0

[0.12.0]

Bump minor version because of a breaking change.

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.
[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.
[#961] Added link for Ruby port of tokenizers
[#960] Feature gate for cli and its clap dependency

- Rust
Published by Narsil almost 4 years ago

tokenizers - Rust v0.11.2

[#919] Fixing single_word AddedToken. (regression from 0.11.2)
[#916] Deserializing faster added_tokens by loading them in batch.

- Rust
Published by Narsil almost 4 years ago

tokenizers - Node v0.8.3

- Rust
Published by Narsil almost 4 years ago

tokenizers - Python v0.11.6

[#919] Fixing single_word AddedToken. (regression from 0.11.2)
[#916] Deserializing faster added_tokens by loading them in batch.

- Rust
Published by Narsil almost 4 years ago

tokenizers - Python v0.11.5

[#895] Add wheel support for Python 3.10

- Rust
Published by Narsil about 4 years ago

tokenizers - Node v0.8.2

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

- Rust
Published by Narsil about 4 years ago

tokenizers - Python v0.11.4

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

- Rust
Published by Narsil about 4 years ago

tokenizers - Python v0.11.3

[#882] Fixing Punctuation deserialize without argument.
[#868] Fixing missing direction in TruncationParams
[#860] Adding TruncationSide to TruncationParams
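
The TruncationSide option from #860 decides which end of an over-long sequence is dropped. Below is a small plain-Python sketch of the two behaviours. The helper `truncate` and its parameter names are hypothetical and not part of the library.

```python
def truncate(ids, max_length, direction="right"):
    """Keep at most `max_length` ids, dropping from the chosen side."""
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if max_length <= 0:
        return []
    if len(ids) <= max_length:
        return list(ids)
    # "right" drops the tail (keeps the start); "left" drops the head.
    if direction == "right":
        return list(ids[:max_length])
    return list(ids[-max_length:])
```

For example, `truncate([1, 2, 3, 4, 5], 3)` returns `[1, 2, 3]`, while `truncate([1, 2, 3, 4, 5], 3, direction="left")` returns `[3, 4, 5]`. Left truncation is useful when the end of the input matters most, for instance the latest turns of a conversation.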

- Rust
Published by Narsil about 4 years ago

tokenizers - Rust v0.11.1

[#882] Fixing Punctuation deserialize without argument.
[#868] Fixing missing direction in TruncationParams
[#860] Adding TruncationSide to TruncationParams

- Rust
Published by Narsil about 4 years ago

tokenizers - Node v0.8.1

Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).

- Rust
Published by Narsil about 4 years ago

tokenizers - Python v0.11.2

Fixes https://github.com/huggingface/tokenizers/pull/868

- Rust
Published by Narsil about 4 years ago

tokenizers - Python v0.11.1

[#860] Adding TruncationSide to TruncationParams.

- Rust
Published by Narsil about 4 years ago

tokenizers - Python v0.11.0

Fixed

[#585] Conda version should now work on old CentOS
[#844] Fixing interaction between is_pretokenized and trim_offsets.
[#851] Doc links

Added

[#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
[#845]: Documentation for Decoders.

Changed

[#850]: Added a feature gate to enable disabling http features
[#718]: Fix WordLevel tokenizer determinism during training
[#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
[#770]: Improved documentation for UnigramTrainer
[#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
[#793]: Saving a pretty JSON file by default when saving a tokenizer
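The difference between compact and pretty JSON output can be shown with the standard library alone. The `config` dictionary below is a made-up stand-in, not the real tokenizer.json schema.

```python
import json

# Hypothetical, heavily simplified stand-in for a serialized tokenizer.
config = {
    "version": "1.0",
    "model": {"type": "WordLevel", "vocab": {"[UNK]": 0, "hello": 1}},
}

# Compact form: a single line, smallest on disk, hard to read or diff.
compact = json.dumps(config, separators=(",", ":"))

# Pretty form: indented, one key per line, friendlier to review and diff.
pretty = json.dumps(config, indent=2)

# Both forms carry exactly the same data.
same_data = json.loads(compact) == json.loads(pretty)
```

The pretty form costs some extra bytes but makes changes between two saved files much easier to inspect.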

- Rust
Published by n1t0 about 4 years ago

tokenizers - Node v0.8.0

BREAKING CHANGES

Many improvements on the Trainer (#519). The files must now be provided first when calling tokenizer.train(files, trainer).

Features

Adding the TemplateProcessing
Add WordLevel and Unigram models (#490)
Add nmtNormalizer and precompiledNormalizer normalizers (#490)
Add templateProcessing post-processor (#490)
Add digitsPreTokenizer pre-tokenizer (#490)
Add support for mapping to sequences (#506)
Add splitPreTokenizer pre-tokenizer (#542)
Add behavior option to the punctuationPreTokenizer (#657)
Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

- Rust
Published by n1t0 over 4 years ago

tokenizers - Python v0.10.3

Fixed

[#686]: Fix SPM conversion process for whitespace deduplication
[#707]: Fix stripping strings containing Unicode characters

Added

[#693]: Add a CTC Decoder for Wav2Vec2 models
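For readers unfamiliar with CTC output, the general decoding idea can be sketched as follows: collapse consecutive repeats, drop the padding (blank) token, then turn the word delimiter into spaces. The `ctc_decode` function and its default token strings are assumptions made for this example, not the library's decoder.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter="|"):
    """Simplified CTC decoding; hypothetical helper, not the tokenizers API."""
    # 1. Collapse runs of identical consecutive tokens into one.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Remove the padding/blank token, which only separates repeats.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Join characters and turn the word delimiter into spaces.
    return "".join(kept).replace(word_delimiter, " ").strip()


frames = ["h", "h", "<pad>", "e", "l", "l", "<pad>", "l", "o", "|", "|", "h", "i"]
decoded = ctc_decode(frames)
```

The padding token between the two `l` runs is what allows a genuine double letter to survive the collapse step.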

Removed

[#714]: Removed support for Python 3.5

- Rust
Published by n1t0 over 4 years ago

tokenizers - Python v0.10.2

Fixed

[#652]: Fix offsets for Precompiled corner case
[#656]: Fix BPE continuing_subword_prefix
[#674]: Fix Metaspace serialization problems

- Rust
Published by n1t0 almost 5 years ago

tokenizers - Python v0.10.1

Fixed

[#616]: Fix SentencePiece tokenizers conversion
[#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
[#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut
[#620]: Fix serialization/deserialization for overlapping models
[#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

- Rust
Published by n1t0 about 5 years ago

tokenizers - Python v0.10.0

Added

[#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
[#519]: Add a WordLevelTrainer used to train a WordLevel model
[#533]: Add support for conda builds
[#542]: Add Split pre-tokenizer to easily split using a pattern
[#544]: Ability to train from memory. This also improves the integration with datasets
[#590]: Add getters/setters for components on BaseTokenizer
[#574]: Add fuse_unk option to SentencePieceBPETokenizer

Changed

[#509]: Automatically stubbing the .pyi files
[#519]: Each Model can return its associated Trainer with get_trainer()
[#530]: The various attributes on each component can now be read and set (e.g. tokenizer.model.dropout = 0.1)
[#538]: The API Reference has been improved and is now up-to-date.

Fixed

[#519]: During training, the Model is now trained in-place. This fixes several bugs that forced reloading the Model after training.
[#539]: Fix BaseTokenizer enable_truncation docstring

- Rust
Published by n1t0 about 5 years ago

tokenizers - Python v0.10.0rc1

Added

[#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
[#519]: Add a WordLevelTrainer used to train a WordLevel model
[#533]: Add support for conda builds
[#542]: Add Split pre-tokenizer to easily split using a pattern
[#544]: Ability to train from memory. This also improves the integration with datasets

Changed

[#509]: Automatically stubbing the .pyi files
[#519]: Each Model can return its associated Trainer with get_trainer()
[#530]: The various attributes on each component can now be read and set (e.g. tokenizer.model.dropout = 0.1)
[#538]: The API Reference has been improved and is now up-to-date.

Fixed

[#519]: During training, the Model is now trained in-place. This fixes several bugs that forced reloading the Model after training.
[#539]: Fix BaseTokenizer enable_truncation docstring

- Rust
Published by n1t0 about 5 years ago

tokenizers - Python v0.9.4

Fixed

[#492]: Fix from_file on BertWordPieceTokenizer
[#498]: Fix the link to download sentencepiece_model_pb2.py
[#500]: Fix a typo in the docs quicktour

Changed

[#506]: Improve Encoding mappings for pairs of sequences

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.9.3

Fixed

[#470]: Fix hanging error when training with custom component
[#476]: TemplateProcessing serialization is now deterministic
[#481]: Fix SentencePieceBPETokenizer.from_files

Added

[#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
[#480]: Unigram now accepts an initial_alphabet and handles special_tokens correctly

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.9.2

Fixed

[#464] Fix a problem with RobertaProcessing being deserialized as BertProcessing

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.9.1

Fixed

[#459] Fix a problem with deserialization

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.9.0

Fixed

[#362]: Fix training deadlock with Python components.
[#363]: Fix a crash when calling .train with some non-existent files
[#355]: Remove a lot of possible crashes
[#389]: Improve truncation (crash and consistency)

Added

[#379]: Add the ability to call encode/encode_batch with numpy arrays
[#292]: Support for the Unigram algorithm
[#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
[#403]: Add TemplateProcessing PostProcessor.
[#420]: Ability to fuse the "unk" token in BPE.
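Fusing the unknown token means that a run of consecutive unknown tokens is emitted as a single one. A minimal sketch of that idea, using a hypothetical `fuse_unk` helper that is not the library's code:

```python
def fuse_unk(tokens, unk_token="<unk>"):
    """Merge each run of consecutive unknown tokens into a single one.

    Hypothetical helper for illustration; not part of the tokenizers API.
    """
    fused = []
    for token in tokens:
        # Skip an unknown token that directly follows another unknown token.
        if token == unk_token and fused and fused[-1] == unk_token:
            continue
        fused.append(token)
    return fused


tokens = ["a", "<unk>", "<unk>", "<unk>", "b", "<unk>"]
fused = fuse_unk(tokens)
```

Unknown tokens separated by a known token stay separate, since only adjacent ones are merged.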

Changed

[#360]: Lots of improvements related to words/alignment tracking
[#426]: Improvements on error messages thanks to PyO3 0.12

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.9.0.rc1

Fixed

[#362]: Fix training deadlock with Python components.
[#363]: Fix a crash when calling .train with some non-existent files
[#355]: Remove a lot of possible crashes
[#389]: Improve truncation (crash and consistency)

Added

[#379]: Add the ability to call encode/encode_batch with numpy arrays
[#292]: Support for the Unigram algorithm
[#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
[#403]: Add TemplateProcessing PostProcessor.
[#420]: Ability to fuse the "unk" token in BPE.

Changed

[#360]: Lots of improvements related to words/alignment tracking
[#426]: Improvements on error messages thanks to PyO3 0.12

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python 0.8.1

Fixed

[#333]: Fix deserialization of AddedToken, where the content was not restored properly

Changed

[#329]: Improved warning and behavior when we detect a fork
[#330]: BertNormalizer now keeps the same behavior as the original implementation when strip_accents is not specified.

- Rust
Published by n1t0 over 5 years ago

tokenizers - Python v0.8.0

Highlights of this release

We can now encode both pre-tokenized inputs and raw strings. This is especially useful when processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and helps when applying labels to each word.
Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
With the serialization comes the compatibility with Pickle! The Tokenizer, all of its components, Encodings, everything can be pickled!
Training a tokenizer is now even faster (up to 5-10x) than before!
Compatibility with multiprocessing, even when using the fork start method. Since this library makes heavy use of the multithreading capacities of our computers to allow very fast tokenization, this led to problems (deadlocks) when used with multiprocessing. This version now allows disabling the parallelism, and will warn you if this is necessary.
And a lot of other improvements, and fixes.
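As an illustration of the multiprocessing point, parallelism is controlled through the TOKENIZERS_PARALLELISM environment variable (see #311 in the Added list of this release). A minimal sketch follows, assuming the variable is set before any tokenizer is used and before worker processes are forked; the helper name is hypothetical.

```python
import os


def disable_tokenizers_parallelism():
    """Turn off the library's internal thread pool for this process.

    Hypothetical convenience wrapper; only the environment variable itself
    comes from the release notes. Child processes created afterwards
    inherit the setting.
    """
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    return os.environ["TOKENIZERS_PARALLELISM"]


# Call this early, before encoding anything and before forking workers.
value = disable_tokenizers_parallelism()
```

Setting the variable in the shell that launches the program has the same effect and avoids any ordering concerns inside the script.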

Fixed

[#286]: Fix various crashes when training a BPE model
[#309]: Fixed a few bugs related to additional vocabulary/tokens

Added

[#272]: Serialization of the Tokenizer and all the parts (PreTokenizer, Normalizer, ...). This adds some methods to easily save/load an entire tokenizer (from_str, from_file).
[#273]: Tokenizer and its parts are now picklable
[#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure activation of the Tensor Cores, while ensuring padding to a multiple of 8. Use with enable_padding(pad_to_multiple_of=8) for example.
[#298]: Ability to get the currently set truncation/padding params
[#311]: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment variable. This is especially useful when using multiprocessing capabilities with the fork start method, which happens to be the default on Linux systems. Without disabling the parallelism, the process deadlocks while encoding. (Cf [#187] for more information)
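The padding-to-a-multiple behavior from #289 comes down to rounding the sequence length up to the next multiple. The arithmetic can be sketched as below; `padded_length` and `pad_ids` are hypothetical helpers, and a pad id of 0 is an assumption.

```python
def padded_length(length, multiple):
    """Round length up to the nearest multiple (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Integer ceiling division, then scale back up.
    return ((length + multiple - 1) // multiple) * multiple


def pad_ids(ids, multiple, pad_id=0):
    """Right-pad ids with pad_id until the length is a multiple."""
    target = padded_length(len(ids), multiple)
    return list(ids) + [pad_id] * (target - len(ids))


padded = pad_ids([5, 6, 7], 8)
```

A sequence whose length is already a multiple is left unchanged, so the option never adds a full extra block of padding.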

Changed

Improved errors generated during truncation: cases where the provided max length is too low are now handled properly.
[#249] encode and encode_batch now accept pre-tokenized inputs. When the input is pre-tokenized, the argument is_pretokenized=True must be specified.
[#276]: Improve BPE training speeds, by reading files sequentially, but parallelizing the processing of each file
[#280]: Use onig for byte-level pre-tokenization to remove all the differences with the original implementation from GPT-2
[#309]: Improved the management of the additional vocabulary. This introduces an option normalized, controlling whether a token should be extracted from the normalized version of the input text.

- Rust
Published by n1t0 over 5 years ago

tokenizers - Rust v0.10.1

Fixed

[#226]: Fix the word indexes when there are special tokens

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Python v0.7.0

Changed

Only one progress bar while reading files during training. This is better for use-cases with a high number of files as it avoids having too many progress bars on screen. Also avoids reading the size of each file before starting to actually read these files, as this process could take really long.
[#193]: encode and encode_batch now take a new optional argument, specifying whether we should add the special tokens. This is activated by default.
[#197]: original_str and normalized_str have been removed from the Encoding returned by encode and encode_batch. This brings a reduction of 70% of the memory footprint.
[#197]: The offsets provided on Encoding are now relative to the original string, and not the normalized one anymore.
The added token given to add_special_tokens or add_tokens on a Tokenizer, or while using train(special_tokens=...) can now be instances of AddedToken to provide more control over these tokens.
[#136]: Updated Pyo3 version
[#136]: Static methods Model.from_files and Model.empty are removed in favor of using constructors.
[#239]: CharBPETokenizer now corresponds to OpenAI GPT BPE implementation by default.
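To see why offsets relative to the original string (#197) matter, consider a normalizer that changes the string length. The sketch below is a simplified stand-in, not the library's NormalizedString: it lowercases and strips accents while recording, for each normalized character, the index of the original character it came from. All names are hypothetical.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, tracking where each char came from."""
    chars, alignment = [], []
    for index, char in enumerate(text):
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # drop accents; they produce no normalized char
            for lowered in piece.lower():
                chars.append(lowered)
                alignment.append(index)  # index into the original text
    return "".join(chars), alignment


def to_original_offsets(span, alignment):
    """Convert a non-empty (start, end) span from normalized to original."""
    start, end = span
    if start >= end:
        raise ValueError("span must be non-empty")
    return alignment[start], alignment[end - 1] + 1


text = "He\u0301llo"  # 'e' followed by a combining acute accent
normalized, alignment = normalize_with_alignment(text)
start = normalized.index("llo")
original_span = to_original_offsets((start, start + 3), alignment)
```

Here the span of "llo" is (2, 5) in the normalized string but (3, 6) in the original, which is the kind of conversion callers previously had to perform themselves.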

Added

[#188]: ByteLevel is also a PostProcessor now and handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token. It has been added to ByteLevelBPETokenizer but it is off by default (trim_offsets=False).
[#236]: RobertaProcessing also handles trimming the offsets.
[#234]: New alignment mappings on the Encoding. Provide methods to easily convert between char or word (input space) and token (output space).
post_process can be called on the Tokenizer
[#208]: Ability to retrieve the vocabulary from the Tokenizer with get_vocab(with_added_tokens: bool)
[#136] Models can now be instantiated through object constructors.
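The offset trimming described in #188 and #236 can be pictured as shrinking a token's span until it no longer starts or ends on whitespace. A minimal sketch under that assumption, with a hypothetical `trim_offsets` function that is not the library's post-processor:

```python
def trim_offsets(text, start, end):
    """Shrink the span [start, end) so it excludes surrounding whitespace.

    Hypothetical helper for illustration; not part of the tokenizers API.
    """
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Hello world"
# A byte-level BPE token such as " world" covers the leading space: (5, 11).
trimmed = trim_offsets(text, 5, 11)
```

After trimming, slicing the original text with the offsets yields the visible word only, which is usually what downstream labeling code expects.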

Fixed

[#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
- when add_prefix_space=True
- [#156]: when a Unicode character gets split-up in multiple byte-level characters
Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
[#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not advised, but that's not the question).
[#205]: Trim the decoded string in BPEDecoder used by CharBPETokenizer

How to migrate

Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant. If you are using ByteLevelBPETokenizer, this option is disabled by default (trim_offsets=False).
The add_special_tokens option of BertWordPieceTokenizer must now be passed to encode or encode_batch.
Access to the original_str on the Encoding has been removed. The original string is the input of encode so it didn't make sense to keep it here.
No need to call original_str.offsets(offsets[N]) to convert offsets to the original string. They are now relative to the original string by default.
Access to the normalized_str on the Encoding has been removed. Can be retrieved by calling normalize(sequence) on the Tokenizer
Change Model.from_files and Model.empty to use the constructors. The model constructor should take the same arguments as the old methods (i.e. BPE(vocab, merges) or BPE()).
If you were using the CharBPETokenizer and want to keep the same behavior as before, set bert_normalizer=False and split_on_whitespace_only=True.

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Rust v0.10.0

Changed

[#222]: All Tokenizer's subparts must now be Send + Sync

Added

[#208]: Ability to retrieve the vocabulary from the Tokenizer & Model

Fixed

[#205]: Trim the decoded string in BPEDecoder
[b770f36]: Fix a bug with added tokens generated IDs

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Rust v0.9.0

Changed

Only one progress bar while reading files during training. This is better for use-cases with a high number of files as it avoids having too many progress bars on screen. Also avoids reading the size of each file before starting to actually read these files, as this process could take really long.
[#190]: Improved BPE and WordPiece builders
[#193]: encode and encode_batch now take a new argument, specifying whether we should add the special tokens
[#197]: The NormalizedString has been removed from the Encoding. It is now possible to retrieve it by calling normalize on the Tokenizer. This brings a reduction of 70% of the memory footprint
[#197]: The NormalizedString API has been improved. It is now possible to retrieve parts of both strings using both "normalized" or "original" offsets
[#197]: The offsets provided on Encoding are now relative to the original string, and not the normalized one anymore
AddedToken are now used for both add_special_tokens and add_tokens. Also, these AddedToken have more options to allow various behaviors.

Added

[#188]: impl PostProcessor for ByteLevel: Handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token
More alignment mappings on the Encoding.
post_process can be called on the Tokenizer

Fixed

[#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
- when add_prefix_space is activated
- [#156]: when a Unicode character gets split-up in multiple byte-level characters
Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
[#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even if not advised, but that's not the question)

How to migrate

Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant.

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Python v0.6.0

Changes:

Big improvements in speed for BPE (Both training and tokenization) (#165)

Fixes:

Some default tokens were missing from BertWordPieceTokenizer (cf #160)
There was a bug in ByteLevel PreTokenizer that caused offsets to be wrong if a char got split up in multiple bytes. (cf #156)
The longest_first truncation strategy had a bug (#174)

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Rust v0.8.0

Changes:

Big improvements in speed for BPE (Both training and tokenization) (#165)

Fixes:

Do not open all files directly while training (#163)
There was a bug in ByteLevel PreTokenizer that caused offsets to be wrong if a char got split up in multiple bytes. (cf #156)
The LongestFirst truncation strategy had a bug (#174)

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Python v0.5.2

Fixes:

We introduced a bug related to the saving of the WordPiece model in 0.5.1: The vocab.txt file was named vocab.json. This is now fixed.
The WordLevel model was also saving its vocabulary in the wrong format.

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Python v0.5.1

Changes:

name argument is now optional when saving a Model's vocabulary. When the name is not specified, the files get a more generic naming, like vocab.json or merges.txt.

- Rust
Published by n1t0 almost 6 years ago

tokenizers - Python v0.5.0