Recent Releases of datasets
datasets - 4.0.0
New Features
- Add `IterableDataset.push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7595

```python
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```
- Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```
- New `Column` object
- Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
- Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

- Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

- Requires `torch>=2.7.0` and FFmpeg >= 4
- Not available for Windows yet but it is coming soon - in the meantime please use `datasets<4.0`
- Load audio data with `AudioDecoder`:

```python
audio = dataset[0]["audio"]

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

- Load video data with `VideoDecoder`:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
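To illustrate what "load one cell without bringing the full column in memory" means for the new `Column` object, here is a minimal pure-Python sketch. `LazyColumn` is a hypothetical stand-in written for this note, not the `datasets.Column` implementation; it only shows the access pattern of resolving a single row on demand.

```python
class LazyColumn:
    """Hypothetical lazy view over one field of a list of row dicts."""

    def __init__(self, rows, name):
        self._rows = rows  # kept by reference, nothing is copied
        self._name = name

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        # Only the requested row is touched; no full column is built.
        return self._rows[index][self._name]

    def __iter__(self):
        # Values are produced one at a time while iterating.
        for row in self._rows:
            yield row[self._name]


rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
column = LazyColumn(rows, "text")

first_text = column[0]    # same value as rows[0]["text"]
all_texts = list(column)  # materializes the column only when asked
```

The design point is that indexing and iteration are separate paths, so a single lookup never pays for the whole column.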
Breaking changes
- Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
- `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- torchcodec replaces soundfile for audio decoding
- torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
- Introduction of the `List` type

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

- `Sequence` was a legacy type from TensorFlow Datasets that converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature

```python
from datasets import List, Sequence, Value

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
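The legacy "list of dicts to dict of lists" conversion mentioned for `Sequence` can be pictured with a small pure-Python sketch. The helper name `list_of_dicts_to_dict_of_lists` is made up for this illustration and is not part of the `datasets` API; it only shows the shape change that the old type applied.

```python
def list_of_dicts_to_dict_of_lists(records):
    """Transpose [{"a": 1}, {"a": 2}] into {"a": [1, 2]}."""
    columns = {}
    for record in records:
        for key, value in record.items():
            # Create the column on first sight, then append in row order.
            columns.setdefault(key, []).append(value)
    return columns


answers = [{"text": "Paris", "start": 12}, {"text": "France", "start": 40}]
transposed = list_of_dicts_to_dict_of_lists(answers)
```

With `List`, a list of dicts keeps its row-wise shape, which is why the two types are not interchangeable for nested data.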
Other improvements and bug fixes
- Refactor `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
- fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
- load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
- Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
- Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
- Add missing property on `RepeatExamplesIterable` by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
- Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
- Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
- fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
- No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
- Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
- remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
- Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
- Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
- minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
- fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
- better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
- update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
- fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
- Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
- fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
- Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
- Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
- Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
- Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
- Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
- Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672
New Contributors
- @TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
- @francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
- @emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
- @SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
- @ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
- @ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
- @TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
- @HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659
Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0
Published by lhoestq 8 months ago
datasets - 3.6.0
Dataset Features
- Enable xet in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/7552
- Faster downloads/uploads with Xet storage
- more info: https://github.com/huggingface/datasets/issues/7526
Other improvements and bug fixes
- Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7544
- Avoid global umask for setting file mode. by @ryan-clancy in https://github.com/huggingface/datasets/pull/7547
- Rebatch arrow iterables before formatted iterable by @lhoestq in https://github.com/huggingface/datasets/pull/7553
- Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in https://github.com/huggingface/datasets/pull/7532
- fix regression by @lhoestq in https://github.com/huggingface/datasets/pull/7558
- fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in https://github.com/huggingface/datasets/pull/7521
- Remove `aiohttp` from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294
New Contributors
- @ryan-clancy made their first contribution in https://github.com/huggingface/datasets/pull/7547
- @Harry-Yang0518 made their first contribution in https://github.com/huggingface/datasets/pull/7532
- @giraffacarp made their first contribution in https://github.com/huggingface/datasets/pull/7521
- @akx made their first contribution in https://github.com/huggingface/datasets/pull/7294
Full Changelog: https://github.com/huggingface/datasets/compare/3.5.1...3.6.0
Published by lhoestq 10 months ago
datasets - 3.5.1
Bug fixes
- support pyarrow 20 by @lhoestq in https://github.com/huggingface/datasets/pull/7540
- Fix pyarrow error `TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'`
- Write pdf in map by @lhoestq in https://github.com/huggingface/datasets/pull/7487
Other improvements
- update fsspec 2025.3.0 by @peteski22 in https://github.com/huggingface/datasets/pull/7478
- Support underscore int read instruction by @lhoestq in https://github.com/huggingface/datasets/pull/7488
- Support skip_trying_type by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7483
- pdf docs fixes by @lhoestq in https://github.com/huggingface/datasets/pull/7519
- Remove conditions for Python < 3.9 by @cyyever in https://github.com/huggingface/datasets/pull/7474
- mention av in video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7523
- correct use with polars example by @SiQube in https://github.com/huggingface/datasets/pull/7524
- chore: fix typos by @afuetterer in https://github.com/huggingface/datasets/pull/7436
New Contributors
- @peteski22 made their first contribution in https://github.com/huggingface/datasets/pull/7478
- @yoshitomo-matsubara made their first contribution in https://github.com/huggingface/datasets/pull/7483
- @SiQube made their first contribution in https://github.com/huggingface/datasets/pull/7524
- @afuetterer made their first contribution in https://github.com/huggingface/datasets/pull/7436
Full Changelog: https://github.com/huggingface/datasets/compare/3.5.0...3.5.1
Published by lhoestq 10 months ago
datasets - 3.5.0
Datasets Features
- Introduce PDF support (#7318) by @yabramuvdi in https://github.com/huggingface/datasets/pull/7325
```python
from datasets import load_dataset, Pdf

repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
dataset = load_dataset(repo, split="train")
dataset[0]["pdf"]
dataset[0]["pdf"].pages[0].extract_text()
# ...
```
What's Changed
- Fix local pdf loading by @lhoestq in https://github.com/huggingface/datasets/pull/7466
- Minor fix for metadata files in extension counter by @lhoestq in https://github.com/huggingface/datasets/pull/7464
- Prioritize json by @lhoestq in https://github.com/huggingface/datasets/pull/7476
New Contributors
- @yabramuvdi made their first contribution in https://github.com/huggingface/datasets/pull/7325
Full Changelog: https://github.com/huggingface/datasets/compare/3.4.1...3.5.0
Published by lhoestq 11 months ago
datasets - 3.4.0
Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
- /!\ Breaking change: we replaced `decord` with `torchvision` to read videos, since `decord` is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The `Video` type is still marked as experimental in this version

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```
- faster streaming for image/audio/video folder from Hugging Face
- support for `metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files
- Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450
- even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:

```python
dataset = dataset.decode(num_threads=num_threads)
```

- Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
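As a rough picture of why decoding with several threads can speed up streaming, the sketch below decodes a batch of payloads with a thread pool while keeping the input order. It uses only the standard library; `decode` and `decode_all` are hypothetical names for this note and do not reflect how `IterableDataset.decode` is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def decode(raw: bytes) -> str:
    # Stand-in for decoding an image/audio/video payload.
    return raw.decode("utf-8").upper()


def decode_all(payloads, num_threads=4):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(decode, payloads))


decoded = decode_all([b"a", b"b", b"c"], num_threads=2)
```

Threads help most when decoding releases the GIL or waits on I/O, which is typically the case for media decoders and remote reads.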
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in https://github.com/huggingface/datasets/pull/7426
- Use pyupgrade --py39-plus by @cyyever in https://github.com/huggingface/datasets/pull/7428
- Refactor `string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
- Fix small bugs with async map by @lhoestq in https://github.com/huggingface/datasets/pull/7445
- Fix resuming after `ds.set_epoch(new_epoch)` by @lhoestq in https://github.com/huggingface/datasets/pull/7451
- minor docs changes by @lhoestq in https://github.com/huggingface/datasets/pull/7452
New Contributors
- @stephantul made their first contribution in https://github.com/huggingface/datasets/pull/7426
- @cyyever made their first contribution in https://github.com/huggingface/datasets/pull/7428
- @jp1924 made their first contribution in https://github.com/huggingface/datasets/pull/7368
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.2...3.4.0
Published by lhoestq 12 months ago
datasets - 3.3.2
Bug fixes
- Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in https://github.com/huggingface/datasets/pull/7411
- Gracefully cancel async tasks by @lhoestq in https://github.com/huggingface/datasets/pull/7414
Other general improvements
- Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in https://github.com/huggingface/datasets/pull/7407
- Fix a typo in arrow_dataset.py by @jingedawang in https://github.com/huggingface/datasets/pull/7402
New Contributors
- @dakinggg made their first contribution in https://github.com/huggingface/datasets/pull/7411
- @ibarrien made their first contribution in https://github.com/huggingface/datasets/pull/7407
- @jingedawang made their first contribution in https://github.com/huggingface/datasets/pull/7402
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.1...3.3.2
Published by lhoestq about 1 year ago
datasets - 3.3.0
Dataset Features
- Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
- Especially useful to download content like images or call inference APIs
```python
prompt = "Answer the following question: {question}. You should think step by step."

# query_model is a user-defined coroutine (not shown here)
async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))

ds = ds.map(ask_llm)
```

- Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198

```python
ds = ds.repeat(10)
```

- Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
- Add support for "pandas" and "polars" formats in IterableDatasets
- This enables optimized data processing using pandas or polars functions with zero-copy, e.g.

```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```
- Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
- IterableDatasets with "numpy" format are now much faster
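To show why async functions suit I/O-bound work such as downloads or inference calls, here is a standard-library sketch that awaits several examples concurrently. `fake_query_model` and `map_concurrently` are hypothetical helpers for illustration, and the concurrency limit is an assumption; this is not how `map()` schedules coroutines internally.

```python
import asyncio


async def fake_query_model(prompt):
    # Stand-in for a network call to an inference API.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def ask_llm(example):
    answer = await fake_query_model(example["question"])
    return {"answer": answer}


async def map_concurrently(examples, function, max_concurrency=8):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(example):
        async with semaphore:  # cap the number of in-flight calls
            return {**example, **(await function(example))}

    # gather preserves input order even if calls finish out of order.
    return await asyncio.gather(*(run_one(e) for e in examples))


examples = [{"question": "2 + 2?"}, {"question": "Capital of France?"}]
results = asyncio.run(map_concurrently(examples, ask_llm))
```

Because the waits overlap, total time is close to the slowest call rather than the sum of all calls.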
What's Changed
- don't import soundfile in tests by @lhoestq in https://github.com/huggingface/datasets/pull/7340
- minor video docs on how to install by @lhoestq in https://github.com/huggingface/datasets/pull/7341
- Fix typo in arrow_dataset by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7328
- remove filecheck to enable symlinks by @fschlatt in https://github.com/huggingface/datasets/pull/7133
- Webdataset special columns in last position by @lhoestq in https://github.com/huggingface/datasets/pull/7349
- Bump hfh to 0.24 to fix ci by @lhoestq in https://github.com/huggingface/datasets/pull/7350
- fsspec 2024.12.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7352
- changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in https://github.com/huggingface/datasets/pull/7353
- Catch OSError for arrow by @lhoestq in https://github.com/huggingface/datasets/pull/7348
- Remove .h5 from imagefolder extensions by @lhoestq in https://github.com/huggingface/datasets/pull/7374
- Add Pandas, PyArrow and Polars docs by @lhoestq in https://github.com/huggingface/datasets/pull/7382
- Optimized sequence encoding for scalars by @lukasgd in https://github.com/huggingface/datasets/pull/7393
- Update docs by @lhoestq in https://github.com/huggingface/datasets/pull/7395
- Update README.md by @lhoestq in https://github.com/huggingface/datasets/pull/7396
- Release: 3.3.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7398
New Contributors
- @AndreaFrancis made their first contribution in https://github.com/huggingface/datasets/pull/7328
- @vttrifonov made their first contribution in https://github.com/huggingface/datasets/pull/7353
- @lukasgd made their first contribution in https://github.com/huggingface/datasets/pull/7393
Full Changelog: https://github.com/huggingface/datasets/compare/3.2.0...3.3.0
Published by lhoestq about 1 year ago
datasets - 3.2.0
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in https://github.com/huggingface/datasets/pull/7309
- Up to +100% streaming speed
- Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
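Predicate pushdown, as described for the Parquet filters, can be sketched in plain Python: each row group carries min/max statistics, and a group whose range cannot satisfy the predicate is skipped without reading its rows. The `RowGroup` class and `read_with_pushdown` function are hypothetical and only model the idea; real Parquet readers take these statistics from file metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_date: str
    max_date: str
    rows: list


def read_with_pushdown(row_groups, min_date):
    """Return rows with date >= min_date and the number of groups scanned."""
    scanned = 0
    kept = []
    for group in row_groups:
        if group.max_date < min_date:
            continue  # statistics alone rule this group out
        scanned += 1
        kept.extend(row for row in group.rows if row["date"] >= min_date)
    return kept, scanned


groups = [
    RowGroup("2021-01-01", "2021-12-31", [{"date": "2021-06-01"}]),
    RowGroup("2022-07-01", "2023-03-01", [{"date": "2022-08-01"}, {"date": "2023-02-01"}]),
    RowGroup("2023-04-01", "2023-12-31", [{"date": "2023-05-01"}]),
]
rows, scanned = read_with_pushdown(groups, "2023")
```

In this toy data the first group is never scanned, which is the saving that matters when each group is a remote download.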
Other improvements and bug fixes
- fix conda release workflow by @lhoestq in https://github.com/huggingface/datasets/pull/7272
- Add link to video dataset by @NielsRogge in https://github.com/huggingface/datasets/pull/7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in https://github.com/huggingface/datasets/pull/7273
- support for custom feature encoding/decoding by @alex-hh in https://github.com/huggingface/datasets/pull/7284
- update load_dataset docstring by @lhoestq in https://github.com/huggingface/datasets/pull/7301
- Let server decide default repo visibility by @Wauplin in https://github.com/huggingface/datasets/pull/7302
- fix: update elasticsearch version by @ruidazeng in https://github.com/huggingface/datasets/pull/7300
- Fix typing in iterable_dataset.py by @lhoestq in https://github.com/huggingface/datasets/pull/7304
- Updated inconsistent output in documentation examples for `ClassLabel` by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in https://github.com/huggingface/datasets/pull/7316
- Release: 3.2.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7317
New Contributors
- @ruidazeng made their first contribution in https://github.com/huggingface/datasets/pull/7300
- @sergiopaniego made their first contribution in https://github.com/huggingface/datasets/pull/7293
Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0
Published by lhoestq about 1 year ago
datasets - 3.1.0
Dataset Features
- Video support by @lhoestq in https://github.com/huggingface/datasets/pull/7230
```python
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
```

- Add IterableDataset.shard() by @lhoestq in https://github.com/huggingface/datasets/pull/7252

```python
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
```

- Basic XML support by @lhoestq in https://github.com/huggingface/datasets/pull/7250
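The shard counts in the `IterableDataset.shard()` example follow from simple division: 2360 source shards split 8 ways leave 295 per part, and splitting 2360 ways leaves 1. The sketch below reproduces that arithmetic with a contiguous assignment. The function `shard_indices` is hypothetical, and whether the library assigns shards contiguously or in a strided pattern is an assumption here.

```python
def shard_indices(total_shards, num_shards, index):
    """Contiguous split of source shard ids across num_shards parts."""
    base, extra = divmod(total_shards, num_shards)
    # The first `extra` parts receive one additional shard each.
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return list(range(start, end))


assigned = shard_indices(total_shards=2360, num_shards=8, index=0)
single = shard_indices(total_shards=2360, num_shards=2360, index=0)
```

Here `len(assigned)` is 295 and `len(single)` is 1, matching the `num_shards` values printed in the release note.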
What's Changed
- (Super tiny doc update) Mention to_polars by @fzyzcjy in https://github.com/huggingface/datasets/pull/7232
- [MINOR:TYPO] Update arrow_dataset.py by @cakiki in https://github.com/huggingface/datasets/pull/7236
- Missing video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7251
- fix decord import by @lhoestq in https://github.com/huggingface/datasets/pull/7255
- fix ci for pyarrow 18 by @lhoestq in https://github.com/huggingface/datasets/pull/7257
- Retry all requests timeouts by @lhoestq in https://github.com/huggingface/datasets/pull/7256
- Always set non-null writer batch size by @lhoestq in https://github.com/huggingface/datasets/pull/7258
- Don't embed videos by @lhoestq in https://github.com/huggingface/datasets/pull/7259
- Allow video with disabled decoding without decord by @lhoestq in https://github.com/huggingface/datasets/pull/7262
- Small addition to video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7263
- fix docs relative links by @lhoestq in https://github.com/huggingface/datasets/pull/7264
- Disallow video push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7265
New Contributors
- @fzyzcjy made their first contribution in https://github.com/huggingface/datasets/pull/7232
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.2...3.1.0
Published by lhoestq over 1 year ago
datasets - 3.0.2
Main bug fixes
- fix unbatched arrow map for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7204
- Support features in metadata configs by @albertvillanova in https://github.com/huggingface/datasets/pull/7182
- Preserve features in iterable dataset.filter by @alex-hh in https://github.com/huggingface/datasets/pull/7209
- Pin dill<0.3.9 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7184
- this should also fix cache issues
What's Changed
- Fix release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/7177
- Pin multiprocess<0.70.1 to align with dill<0.3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/7188
- with_format docstring by @lhoestq in https://github.com/huggingface/datasets/pull/7203
- fix ci benchmark by @lhoestq in https://github.com/huggingface/datasets/pull/7205
- Fix the environment variable for huggingface cache by @torotoki in https://github.com/huggingface/datasets/pull/7200
- Support Python 3.11 by @albertvillanova in https://github.com/huggingface/datasets/pull/7179
- bump fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/7219
- Fix typo in image dataset docs by @albertvillanova in https://github.com/huggingface/datasets/pull/7231
- No need for dataset_info by @lhoestq in https://github.com/huggingface/datasets/pull/7234
- use huggingface_hub offline mode by @lhoestq in https://github.com/huggingface/datasets/pull/7244
New Contributors
- @alex-hh made their first contribution in https://github.com/huggingface/datasets/pull/7204
- @torotoki made their first contribution in https://github.com/huggingface/datasets/pull/7200
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.1...3.0.2
Published by lhoestq over 1 year ago
datasets - 3.0.1
What's Changed
- Modify add_column() to optionally accept a FeatureType as param by @varadhbhatnagar in https://github.com/huggingface/datasets/pull/7143
- Align filename prefix splitting with WebDataset library by @albertvillanova in https://github.com/huggingface/datasets/pull/7151
- Support ndjson data files by @albertvillanova in https://github.com/huggingface/datasets/pull/7154
- Support JSON lines with missing struct fields by @albertvillanova in https://github.com/huggingface/datasets/pull/7160
- Support JSON lines with empty struct by @albertvillanova in https://github.com/huggingface/datasets/pull/7162
- fix increase_load_count by @lhoestq in https://github.com/huggingface/datasets/pull/7165
- fix docstring code example for distributed shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/7166
- Support JSON lines with missing columns by @albertvillanova in https://github.com/huggingface/datasets/pull/7170
- Add torchdata as a regular test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/7172
New Contributors
- @varadhbhatnagar made their first contribution in https://github.com/huggingface/datasets/pull/7143
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.0...3.0.1
Published by albertvillanova over 1 year ago
datasets - 3.0.0
Dataset Features
Use Polars functions in `.map()`
- Allow Polars as valid output type by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
- Example:

```python
>>> import polars as pl
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
>>> cols = [pl.col("content").str.len_bytes().alias("length")]
>>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
>>> ds_with_length[:5]
shape: (5, 5)
┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
│ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
│ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
│ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
│ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyfuladventure       ┆ 180    │
│ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peacefulnarrative     ┆ 138    │
│ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentleadventure       ┆ 135    │
│ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarmingstory     ┆ 112    │
│ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratorynarrative  ┆ 111    │
└─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
```
Support NumPy 2
- Allow numpy-2.1 and test it without audio extra by @albertvillanova in https://github.com/huggingface/datasets/pull/7118
Cache Changes
- Use `huggingface_hub` cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
- use the `huggingface_hub` cache for files downloaded from HF, by default at `~/.cache/huggingface/hub`
- cached datasets (Arrow files) will still be reloaded from the `datasets` cache, by default at `~/.cache/huggingface/datasets`
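The two default cache locations can be resolved with a short standard-library sketch. It assumes the `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables act as overrides and it ignores other settings that may affect the real defaults (for example `XDG_CACHE_HOME`); `default_cache_dirs` is a hypothetical helper, not a `datasets` function.

```python
import os


def default_cache_dirs(env=None):
    """Resolve hub and datasets cache directories from an env mapping."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    # Each cache can be overridden on its own; otherwise it lives under HF_HOME.
    hub_cache = env.get("HF_HUB_CACHE", os.path.join(hf_home, "hub"))
    datasets_cache = env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))
    return {
        "hub": os.path.expanduser(hub_cache),
        "datasets": os.path.expanduser(datasets_cache),
    }


dirs = default_cache_dirs(env={})
custom = default_cache_dirs(env={"HF_HOME": "/data/hf"})
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.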
Breaking changes
- Remove deprecated code by @albertvillanova in https://github.com/huggingface/datasets/pull/6996
- removed deprecated arguments like `use_auth_token`, `fs` or `ignore_verifications`
- Remove beam by @albertvillanova in https://github.com/huggingface/datasets/pull/6987
- removed deprecated apache beam datasets support
- Remove metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/6983
- remove deprecated `load_metric`, please use the `evaluate` library instead
- Remove tasks by @albertvillanova in https://github.com/huggingface/datasets/pull/6999
- remove deprecated `task` argument in `load_dataset()`, `.prepare_for_task()` method, `datasets.tasks` module
General improvements and bug fixes
- Improved the tutorial by adding a link for loading datasets by @AmboThom in https://github.com/huggingface/datasets/pull/7042
- Automatically create `cache_dir` from `cache_file_name` by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
- remove more script docs by @lhoestq in https://github.com/huggingface/datasets/pull/7104
- Fix args of feature docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/7103
- Temporarily pin numpy<2.1 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7114
- Fix ConnectionError for gated datasets and unauthenticated users by @albertvillanova in https://github.com/huggingface/datasets/pull/7110
- Install transformers with numpy-2 CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7119
- don't mention the script if trust_remote_code=False by @severo in https://github.com/huggingface/datasets/pull/7120
- Fix typed examples iterable state dict by @lhoestq in https://github.com/huggingface/datasets/pull/7121
- Rename LargeList.dtype to LargeList.feature by @albertvillanova in https://github.com/huggingface/datasets/pull/7106
- Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport by @albertvillanova in https://github.com/huggingface/datasets/pull/7125
- Disable implicit token in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7126
- Test get_dataset_config_info with non-existing/gated/private dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/7124
- fix streaming from arrow files by @fschlatt in https://github.com/huggingface/datasets/pull/7083
New Contributors
- @AmboThom made their first contribution in https://github.com/huggingface/datasets/pull/7042
- @fschlatt made their first contribution in https://github.com/huggingface/datasets/pull/7083
Full Changelog: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0
Published by albertvillanova over 1 year ago
datasets - 2.21.0
Features
- Support pyarrow large_list by @albertvillanova in https://github.com/huggingface/datasets/pull/7019
- Support Polars round trip:

```python
import polars as pl
from datasets import Dataset

df1 = pl.from_dict({"col1": [[1, 2], [3, 4]]})
df2 = Dataset.from_polars(df1).to_polars()
assert df1.equals(df2)
```
What's Changed
- Use `HF_HUB_OFFLINE` instead of `HF_DATASETS_OFFLINE` by @Wauplin in https://github.com/huggingface/datasets/pull/6968
- packaging: Remove useless dependencies by @daskol in https://github.com/huggingface/datasets/pull/6971
- Fix resuming arrow format by @lhoestq in https://github.com/huggingface/datasets/pull/6964
- Fix webdataset pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6972
- Set temporary numpy upper version < 2.0.0 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6975
- Fix regression for pandas < 2.0.0 in JSON loader by @albertvillanova in https://github.com/huggingface/datasets/pull/6978
- Ensure compatibility with numpy 2.0.0 by @KennethEnevoldsen in https://github.com/huggingface/datasets/pull/6976
- Remove underlines between badges by @novialriptide in https://github.com/huggingface/datasets/pull/6966
- Update docs on trust_remote_code defaults to False by @albertvillanova in https://github.com/huggingface/datasets/pull/6981
- Improve skip take shuffling and distributed by @lhoestq in https://github.com/huggingface/datasets/pull/6965
- Fix tests using `hf-internal-testing/librispeech_asr_dummy` by @albertvillanova in https://github.com/huggingface/datasets/pull/6998
- Fix dump of bfloat16 torch tensor by @lhoestq in https://github.com/huggingface/datasets/pull/7002
- minor fix for bfloat16 by @lhoestq in https://github.com/huggingface/datasets/pull/7003
- Fix incorrect rank value in data splitting by @yzhangcs in https://github.com/huggingface/datasets/pull/6994
- less script docs by @lhoestq in https://github.com/huggingface/datasets/pull/6993
- Fix CI by temporarily pinning ruff < 0.5.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/7007
- Support ruff 0.5.0 in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7009
- Fix WebDatasets KeyError for user-defined Features when a field is missing in an example by @ProGamerGov in https://github.com/huggingface/datasets/pull/7004
- [Streaming] retry on requests errors by @lhoestq in https://github.com/huggingface/datasets/pull/6963
- Re-enable raising error from huggingface-hub FutureWarning in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7011
- Skip faiss tests on Windows to avoid running CI for 360 minutes by @albertvillanova in https://github.com/huggingface/datasets/pull/7014
- Support fsspec 2024.6.1 by @albertvillanova in https://github.com/huggingface/datasets/pull/7017
- Persist IterableDataset epoch in workers by @lhoestq in https://github.com/huggingface/datasets/pull/6710
- Fix casting list array to fixed size list by @albertvillanova in https://github.com/huggingface/datasets/pull/7021
- Remove dead code for pyarrow < 15.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/7023
- Fix `check_library_imports` by @lhoestq in https://github.com/huggingface/datasets/pull/7026
- Missing line from previous pr by @lhoestq in https://github.com/huggingface/datasets/pull/7027
- Fix ci by @lhoestq in https://github.com/huggingface/datasets/pull/7028
- Add decorator as explicit test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/7043
- Mark tests that require librosa by @albertvillanova in https://github.com/huggingface/datasets/pull/7044
- Unblock NumPy 2.0 by @NeilGirdhar in https://github.com/huggingface/datasets/pull/6991
- Fix tensorflow min version depending on Python version by @albertvillanova in https://github.com/huggingface/datasets/pull/7045
- Support librosa and numpy 2.0 for Python 3.10 by @albertvillanova in https://github.com/huggingface/datasets/pull/7046
- add checkpoint and resume title in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7050
- Update load_hub.mdx by @severo in https://github.com/huggingface/datasets/pull/7057
- Add batching to IterableDataset by @lappemic in https://github.com/huggingface/datasets/pull/7054
- Avoid calling http_head for non-HTTP URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/7062
- Fix `load_dataset` for `data_files` with protocols other than HF by @matstrand in https://github.com/huggingface/datasets/pull/6862
- Add batch method to Dataset class by @lappemic in https://github.com/huggingface/datasets/pull/7064
- Fix doc generation when NamedSplit is used as parameter default value by @albertvillanova in https://github.com/huggingface/datasets/pull/7036
- Fix CI by temporarily marking `test_convert_to_parquet` as expected to fail by @albertvillanova in https://github.com/huggingface/datasets/pull/7074
- add split argument to Generator by @piercus in https://github.com/huggingface/datasets/pull/7015
- Update required soxr version from pre-release to release by @albertvillanova in https://github.com/huggingface/datasets/pull/7075
- Fix CI `test_convert_to_parquet` by @albertvillanova in https://github.com/huggingface/datasets/pull/7078
- Fix `_prepare_single_hop_path_and_storage_options` by @albertvillanova in https://github.com/huggingface/datasets/pull/7068
- Set `load_from_disk` path type as PathLike by @albertvillanova in https://github.com/huggingface/datasets/pull/7081
- Fix `push_to_hub` by not calling `create_branch` if branch exists by @albertvillanova in https://github.com/huggingface/datasets/pull/7069
- feat: support non streamable arrow file binary format by @kmehant in https://github.com/huggingface/datasets/pull/7025
- Support HTTP authentication in non-streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/7082
- chore: fix typos in docs by @hattizai in https://github.com/huggingface/datasets/pull/7034
- Fix CI for metrics by @albertvillanova in https://github.com/huggingface/datasets/commit/83e5c05fd38a4a37b5e6d5d7c0cfa73d76f1b220
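The two batching entries above (batching for `IterableDataset` and a `batch` method on `Dataset`) group consecutive examples into batches. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name `batch_examples` is hypothetical, and the column-oriented output (one dict of lists per batch) and the `batch_size` / `drop_last_batch` parameter names are assumptions about how the feature behaves.

```python
from itertools import islice


def batch_examples(examples, batch_size, drop_last_batch=False):
    """Hypothetical helper: group row dicts into dict-of-lists batches."""
    iterator = iter(examples)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        if drop_last_batch and len(rows) < batch_size:
            # Assumed behavior: an incomplete final batch is discarded.
            return
        # Transpose a list of row dicts into a single dict of lists.
        yield {key: [row[key] for row in rows] for key in rows[0]}


rows = [{"id": i, "text": f"t{i}"} for i in range(5)]
batches = list(batch_examples(rows, batch_size=2))
full_batches = list(batch_examples(rows, batch_size=2, drop_last_batch=True))
```

Because the helper consumes any iterable lazily, the same sketch covers both the in-memory and the streaming case.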
New Contributors
- @novialriptide made their first contribution in https://github.com/huggingface/datasets/pull/6966
- @yzhangcs made their first contribution in https://github.com/huggingface/datasets/pull/6994
- @ProGamerGov made their first contribution in https://github.com/huggingface/datasets/pull/7004
- @NeilGirdhar made their first contribution in https://github.com/huggingface/datasets/pull/6991
- @matstrand made their first contribution in https://github.com/huggingface/datasets/pull/6862
- @lappemic made their first contribution in https://github.com/huggingface/datasets/pull/7054
- @piercus made their first contribution in https://github.com/huggingface/datasets/pull/7015
- @kmehant made their first contribution in https://github.com/huggingface/datasets/pull/7025
- @hattizai made their first contribution in https://github.com/huggingface/datasets/pull/7034
Full Changelog: https://github.com/huggingface/datasets/compare/2.20.0...2.21.0
Published by albertvillanova over 1 year ago
datasets - 2.20.0
Important
- Remove default `trust_remote_code=True` by @lhoestq in https://github.com/huggingface/datasets/pull/6954
  - datasets with a python loading script now require passing `trust_remote_code=True` to be used
Datasets features
- [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6658
  - checkpoint and resume an iterable dataset (e.g. when streaming):
    ```python
    from datasets import Dataset

    iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
    for idx, example in enumerate(iterable_dataset):
        print(example)
        if idx == 2:
            # Save the position in the stream after the third example
            state_dict = iterable_dataset.state_dict()
            print("checkpoint")
            break
    # Restore the saved position and continue from there
    iterable_dataset.load_state_dict(state_dict)
    print("restart from checkpoint")
    for example in iterable_dataset:
        print(example)
    ```
    Returns:

        {'a': 0}
        {'a': 1}
        {'a': 2}
        checkpoint
        restart from checkpoint
        {'a': 3}
        {'a': 4}
        {'a': 5}
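To make the checkpoint/resume contract concrete without installing anything, here is a toy sketch of the same `state_dict()` / `load_state_dict()` pattern in plain Python. The class `ResumableRange` and the `next_index` key are hypothetical; the real `IterableDataset` state has its own structure, which this sketch does not try to reproduce.

```python
class ResumableRange:
    """Toy iterable with checkpointing; not the library's implementation."""

    def __init__(self, n):
        self.n = n
        self._next_index = 0  # index of the next example to yield

    def __iter__(self):
        # Iteration starts wherever the last loaded state left off.
        while self._next_index < self.n:
            example = {"a": self._next_index}
            self._next_index += 1
            yield example

    def state_dict(self):
        # The state is small and serializable, so it can be saved alongside a model.
        return {"next_index": self._next_index}

    def load_state_dict(self, state):
        self._next_index = state["next_index"]


source = ResumableRange(6)
seen = []
for idx, example in enumerate(source):
    seen.append(example)
    if idx == 2:
        checkpoint = source.state_dict()
        break

# A fresh object (e.g. after a restart) resumes from the checkpoint.
resumed = ResumableRange(6)
resumed.load_state_dict(checkpoint)
rest = list(resumed)
```

The point of the design is that only the position is stored, never the data already consumed, so resuming does not require replaying the stream from the start.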
General improvements and bug fixes
- Add docs about the CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/6831
- Remove token arg from CLI examples by @albertvillanova in https://github.com/huggingface/datasets/pull/6839
- Allow deleting a subset/config from a no-script dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6820
- Fix line-endings in tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/6857
- Fix CI by temporarily pinning huggingface-hub < 0.23.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6861
- Fix dataset name for community Hub script-datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6855
- Update tqdm >= 4.66.3 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6870
- Fix download for dict of dicts of URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/6871
- Set dev version by @albertvillanova in https://github.com/huggingface/datasets/pull/6873
- Shorten long logs by @lhoestq in https://github.com/huggingface/datasets/pull/6875
- Support jax 0.4.27 in CI tests by @albertvillanova in https://github.com/huggingface/datasets/pull/6885
- Close gzipped files properly by @lhoestq in https://github.com/huggingface/datasets/pull/6893
- Make CLI `convert_to_parquet` not raise error if no rights to create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6902
- Fix YAML error in README files appearing on GitHub by @albertvillanova in https://github.com/huggingface/datasets/pull/6898
- Document that to_json defaults to JSON Lines by @albertvillanova in https://github.com/huggingface/datasets/pull/6895
- Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6883
- Create function to convert to parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6878
- Update features.py to avoid bfloat16 unsupported error by @skaulintel in https://github.com/huggingface/datasets/pull/6607
- Fix decoding multi part extension by @lhoestq in https://github.com/huggingface/datasets/pull/6904
- Use pandas ujson in JSON loader to improve performance by @albertvillanova in https://github.com/huggingface/datasets/pull/6874
- Update requests >=2.32.1 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6909
- Fix wrong type hints in data_files by @albertvillanova in https://github.com/huggingface/datasets/pull/6910
- Remove dead code for non-dict data_files from packaged modules by @albertvillanova in https://github.com/huggingface/datasets/pull/6911
- Support fsspec 2024.5.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6921
- Remove torchaudio remnants from code by @albertvillanova in https://github.com/huggingface/datasets/pull/6922
- [WebDataset] Add `.pth` support for torch tensors by @lhoestq in https://github.com/huggingface/datasets/pull/6920
- Unpin hfh by @lhoestq in https://github.com/huggingface/datasets/pull/6876
- Preserve JSON column order and support list of strings field by @albertvillanova in https://github.com/huggingface/datasets/pull/6914
- [WebDataset] Support compressed files by @lhoestq in https://github.com/huggingface/datasets/pull/6931
- update ci user by @lhoestq in https://github.com/huggingface/datasets/pull/6933
- Revert ci user by @lhoestq in https://github.com/huggingface/datasets/pull/6934
- Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing `data_dir`/`data_files` in no-code Hub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6925
- Set dev version by @albertvillanova in https://github.com/huggingface/datasets/pull/6944
- Update yanked version of minimum requests requirement by @albertvillanova in https://github.com/huggingface/datasets/pull/6945
- Re-enable import sorting disabled by flake8:noqa directive when using ruff linter by @albertvillanova in https://github.com/huggingface/datasets/pull/6946
- Update dataset_dict.py by @Arunprakash-A in https://github.com/huggingface/datasets/pull/6932
- Update process.mdx: Code Listings Fixes by @FadyMorris in https://github.com/huggingface/datasets/pull/6928
- Fix small typo by @marcenacp in https://github.com/huggingface/datasets/pull/6955
- update docs on N-dim arrays by @lhoestq in https://github.com/huggingface/datasets/pull/6956
- Fix typos in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6957
- Validate config name and data_files in packaged modules by @albertvillanova in https://github.com/huggingface/datasets/pull/6915
- Add support for categorical/dictionary types by @EthanSteinberg in https://github.com/huggingface/datasets/pull/6892
- feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/datasets/pull/6960
- Better error handling in `dataset_module_factory` by @Wauplin in https://github.com/huggingface/datasets/pull/6959
- Move info_utils errors to exceptions module by @albertvillanova in https://github.com/huggingface/datasets/pull/6952
- fix(ci): remove unnecessary permissions by @McPatate in https://github.com/huggingface/datasets/pull/6962
New Contributors
- @skaulintel made their first contribution in https://github.com/huggingface/datasets/pull/6607
- @Arunprakash-A made their first contribution in https://github.com/huggingface/datasets/pull/6932
- @FadyMorris made their first contribution in https://github.com/huggingface/datasets/pull/6928
- @marcenacp made their first contribution in https://github.com/huggingface/datasets/pull/6955
- @EthanSteinberg made their first contribution in https://github.com/huggingface/datasets/pull/6892
- @McPatate made their first contribution in https://github.com/huggingface/datasets/pull/6960
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0
Published by albertvillanova over 1 year ago
datasets - 2.19.2
Bug fixes
- Make CLI `convert_to_parquet` not raise error if no rights to create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6902
- Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6883
- Update requests >=2.32.1 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6909
- Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing `data_dir`/`data_files` in no-code Hub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6925
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.1...2.19.2
Published by albertvillanova over 1 year ago
datasets - 2.19.0
Dataset Features
- Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531
  - convert to a Polars dataframe using `.to_polars()`:
    ```python
    import polars as pl
    from datasets import load_dataset

    ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
    ds.to_polars() \
        .group_by("topic") \
        .agg(pl.len(), pl.first()) \
        .sort("len", descending=True)
    ```
  - Use Polars formatting to return Polars objects when accessing a dataset:
    ```python
    ds = ds.with_format("polars")
    ds[:10].group_by("kind").len()
    ```
- Add `fsspec` support for `to_json`, `to_csv`, and `to_parquet` by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
  - Save on HF in any file format:
    ```python
    ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
    ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
    ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
    ```
- Add `mode` parameter to `Image` feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
  - Set images to be read in a certain mode like "RGB":
    ```python
    from datasets import Image

    dataset = dataset.cast_column("image", Image(mode="RGB"))
    ```
- Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
  - run command to open a PR in script-based dataset to convert it to Parquet: `datasets-cli convert_to_parquet <dataset_id>`
- Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
  - same as IterableDataset.take and IterableDataset.skip
    ```python
    ds = ds.take(10)  # take only the first 10 examples
    ```
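As a rough illustration of what `take` and `skip` mean, the sketch below expresses both with `itertools.islice` over a plain list of rows. The standalone functions `take` and `skip` are hypothetical stand-ins written for this example; they only mirror the semantics (first `n` examples versus everything after the first `n`) and say nothing about how the library implements the methods.

```python
from itertools import islice


def take(examples, n):
    """Yield only the first n examples."""
    return islice(examples, n)


def skip(examples, n):
    """Yield everything after the first n examples."""
    return islice(examples, n, None)


rows = [{"id": i} for i in range(5)]
first_two = list(take(rows, 2))
after_two = list(skip(rows, 2))
```

Used together with the same `n`, the two calls partition the data, which is a common way to carve a small held-out set off the front of a dataset.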
General improvements and bug fixes
- Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in https://github.com/huggingface/datasets/pull/6713
- fix CastError pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6712
- Expand no-code dataset info with datasets-server info by @mariosasko in https://github.com/huggingface/datasets/pull/6714
- Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in https://github.com/huggingface/datasets/pull/6715
- Fix concurrent script loading with force_redownload by @lhoestq in https://github.com/huggingface/datasets/pull/6718
- `get_dataset_default_config_name` docstring by @lhoestq in https://github.com/huggingface/datasets/pull/6723
- Deprecate Beam API and download from HF GCS bucket by @mariosasko in https://github.com/huggingface/datasets/pull/6474
- Deprecate Pandas builder by @mariosasko in https://github.com/huggingface/datasets/pull/6730
- Using a registry instead of calling globals for fetching feature types by @psmyth94 in https://github.com/huggingface/datasets/pull/6727
- Update torch_formatter.py by @VarunNSrivastava in https://github.com/huggingface/datasets/pull/6402
- Improve default patterns resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6704
- Transpose images with EXIF Orientation tag by @mariosasko in https://github.com/huggingface/datasets/pull/6739
- Fix missing `download_config` in `get_data_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6742
- Allow null values in dict columns by @mariosasko in https://github.com/huggingface/datasets/pull/6743
- Fix fsspec tqdm callback by @lhoestq in https://github.com/huggingface/datasets/pull/6749
- chore(deps): bump fsspec by @shcheklein in https://github.com/huggingface/datasets/pull/6747
- Fix offline mode with single config by @lhoestq in https://github.com/huggingface/datasets/pull/6741
- Remove deprecated code by @Wauplin in https://github.com/huggingface/datasets/pull/6761
- fixing the issue 6755(small typo) by @JINO-ROHIT in https://github.com/huggingface/datasets/pull/6767
- `remove_columns`/`rename_columns` doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
- Fix CI by @mariosasko in https://github.com/huggingface/datasets/pull/6780
- rename datasets-server to dataset-viewer by @severo in https://github.com/huggingface/datasets/pull/6785
- Install dependencies with `uv` in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
- Fix cache conflict in `_check_legacy_cache2` by @lhoestq in https://github.com/huggingface/datasets/pull/6792
- Fix typo in docs (upload CLI) by @Wauplin in https://github.com/huggingface/datasets/pull/6802
- fix `DatasetBuilder._split_generators` incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
- #6791 Improve type checking around FAISS by @Dref360 in https://github.com/huggingface/datasets/pull/6803
- Fix --repo-type order in cli upload docs by @lhoestq in https://github.com/huggingface/datasets/pull/6804
- Fix `hf-internal-testing/dataset_with_script` commit SHA in CI test by @albertvillanova in https://github.com/huggingface/datasets/pull/6806
- Fix cache path to snakecase for `CachedDatasetModuleFactory` and `Cache` by @izhx in https://github.com/huggingface/datasets/pull/6754
- Multithreaded downloads by @lhoestq in https://github.com/huggingface/datasets/pull/6794
- Remove `os.path.relpath` in `resolve_patterns` by @mariosasko in https://github.com/huggingface/datasets/pull/6815
- Extract data on the fly in packaged builders by @mariosasko in https://github.com/huggingface/datasets/pull/6784
- add `allow_primitive_to_str` and `allow_decimal_to_str` instead of `allow_number_to_str` by @Modexus in https://github.com/huggingface/datasets/pull/6811
- Support indexable objects in `Dataset.__getitem__` by @mariosasko in https://github.com/huggingface/datasets/pull/6817
- Make `convert_to_parquet` CLI command create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6809
- Fix parquet export infos by @lhoestq in https://github.com/huggingface/datasets/pull/6822
New Contributors
- @VarunNSrivastava made their first contribution in https://github.com/huggingface/datasets/pull/6402
- @shcheklein made their first contribution in https://github.com/huggingface/datasets/pull/6747
- @JINO-ROHIT made their first contribution in https://github.com/huggingface/datasets/pull/6767
- @JonasLoos made their first contribution in https://github.com/huggingface/datasets/pull/6799
- @izhx made their first contribution in https://github.com/huggingface/datasets/pull/6754
- @Modexus made their first contribution in https://github.com/huggingface/datasets/pull/6811
Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0
Published by albertvillanova almost 2 years ago
datasets - 2.18.0
Dataset features
- Make JSON builder support an array of strings by @albertvillanova in https://github.com/huggingface/datasets/pull/6696
- Base parquet batch_size on parquet row group size by @lhoestq in https://github.com/huggingface/datasets/pull/6701
  - Faster cold start for streaming
- Change default compression argument for JsonDatasetWriter by @Rexhaif in https://github.com/huggingface/datasets/pull/6659
- Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in https://github.com/huggingface/datasets/pull/6660
- fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in https://github.com/huggingface/datasets/pull/6687
  - Support latest fsspec up to 2024.2.0
General improvements and bug fixes
- Fix for Incorrect `ex_iterable` used with multi `num_worker` by @kq-chen in https://github.com/huggingface/datasets/pull/6582
  - Previously, using PyTorch DDP and `num_workers` could lead to incorrect shard assignments to workers and cause errors
- Fix imagefolder dataset url by @mariosasko in https://github.com/huggingface/datasets/pull/6683
- Improve error message for gated datasets on load by @lewtun in https://github.com/huggingface/datasets/pull/6684
- Updated Quickstart Notebook link by @Codeblockz in https://github.com/huggingface/datasets/pull/6685
- Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in https://github.com/huggingface/datasets/pull/6693
- Faster `xlistdir` by @mariosasko in https://github.com/huggingface/datasets/pull/6698
- Update GitHub Actions to Node 20 by @albertvillanova in https://github.com/huggingface/datasets/pull/6682
- Update release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/6681
- Pass through information about location of cache directory. by @stridge-cruxml in https://github.com/huggingface/datasets/pull/6677
- Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in https://github.com/huggingface/datasets/pull/6665
- Update ruff by @lhoestq in https://github.com/huggingface/datasets/pull/6706
- Silence ruff deprecation messages by @mariosasko in https://github.com/huggingface/datasets/pull/6707
- fix: show correct package name to install biopython by @BioGeek in https://github.com/huggingface/datasets/pull/6662
- Fix `data_files` when passing `data_dir` by @lhoestq in https://github.com/huggingface/datasets/pull/6705
- Release: 2.18.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6708
New Contributors
- @Codeblockz made their first contribution in https://github.com/huggingface/datasets/pull/6685
- @gzbfgjf2 made their first contribution in https://github.com/huggingface/datasets/pull/6693
- @stridge-cruxml made their first contribution in https://github.com/huggingface/datasets/pull/6677
- @pmrowla made their first contribution in https://github.com/huggingface/datasets/pull/6687
- @BioGeek made their first contribution in https://github.com/huggingface/datasets/pull/6662
- @Rexhaif made their first contribution in https://github.com/huggingface/datasets/pull/6659
- @mohalisad made their first contribution in https://github.com/huggingface/datasets/pull/6660
- @kq-chen made their first contribution in https://github.com/huggingface/datasets/pull/6582
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0
Published by lhoestq almost 2 years ago
datasets - 2.17.1
Bug Fixes
- Revert the changes in `arrow_writer.py` from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
- Remove deprecated verbose parameter from CSV builder by @albertvillanova in https://github.com/huggingface/datasets/pull/6672
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1
Published by albertvillanova about 2 years ago
datasets - 2.17.0
What's Changed
- Fix parallel downloads for datasets without scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6551
- Fix imagefolder with one image by @lhoestq in https://github.com/huggingface/datasets/pull/6556
- Fix tests based on datasets that used to have scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6574
- remove eli5 test by @lhoestq in https://github.com/huggingface/datasets/pull/6583
- [IterableDataset] Fix `drop_last_batch` in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
- [WebDataset] Audio support and bug fixes by @lhoestq in https://github.com/huggingface/datasets/pull/6573
- Support standalone yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6557
- Drop redundant None guard. by @xkszltl in https://github.com/huggingface/datasets/pull/6596
- fix os.listdir return name is empty string by @d710055071 in https://github.com/huggingface/datasets/pull/6581
- Fix CI: pyarrow 15, pandas 2.2 and sqlachemy by @lhoestq in https://github.com/huggingface/datasets/pull/6617
- Dedicated RNG object for fingerprinting by @mariosasko in https://github.com/huggingface/datasets/pull/6606
- Add concurrent loading of shards to `datasets.load_from_disk` by @kkoutini in https://github.com/huggingface/datasets/pull/6464
- Migrate from `setup.cfg` to `pyproject.toml` by @mariosasko in https://github.com/huggingface/datasets/pull/6619
- keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in https://github.com/huggingface/datasets/pull/6586
- Read GeoParquet files using parquet reader by @weiji14 in https://github.com/huggingface/datasets/pull/6508
- Use schema metadata only if it matches features by @lhoestq in https://github.com/huggingface/datasets/pull/6616
- Raise error on bad split name by @lhoestq in https://github.com/huggingface/datasets/pull/6626
- Disable `tqdm` bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
- Add `with_rank` param to `Dataset.filter` by @mariosasko in https://github.com/huggingface/datasets/pull/6608
- Bump max range of dill to 0.3.8 by @ringohoffman in https://github.com/huggingface/datasets/pull/6630
- Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/6631
- Faster webdataset streaming by @lhoestq in https://github.com/huggingface/datasets/pull/6578
- Multi gpu docs by @lhoestq in https://github.com/huggingface/datasets/pull/6550
- dataset viewer requires no-script by @severo in https://github.com/huggingface/datasets/pull/6633
- Make split slicing consistent with list slicing by @mariosasko in https://github.com/huggingface/datasets/pull/5891
- Do not use Parquet exports if revision is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/6555
- Make CLI test support multi-processing by @albertvillanova in https://github.com/huggingface/datasets/pull/6628
- Support `data_dir` parameter in `push_to_hub` by @albertvillanova in https://github.com/huggingface/datasets/pull/6634
- Support `push_to_hub` without org/user to default to logged-in user by @albertvillanova in https://github.com/huggingface/datasets/pull/6629
- Fix reload cache with data dir by @lhoestq in https://github.com/huggingface/datasets/pull/6632
- Fix array cast/embed with null values by @mariosasko in https://github.com/huggingface/datasets/pull/6283
- Faster column validation and reordering by @psmyth94 in https://github.com/huggingface/datasets/pull/6636
- Better multi-gpu example by @lhoestq in https://github.com/huggingface/datasets/pull/6646
- Fix missing info when loading some datasets from Parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6635
- Minor multi gpu doc improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6649
- Document usage of hfh cli instead of git by @lhoestq in https://github.com/huggingface/datasets/pull/6648
- Allow concatenation of datasets with mixed structs by @Dref360 in https://github.com/huggingface/datasets/pull/6587
New Contributors
- @xkszltl made their first contribution in https://github.com/huggingface/datasets/pull/6596
- @kkoutini made their first contribution in https://github.com/huggingface/datasets/pull/6464
- @JochenSiegWork made their first contribution in https://github.com/huggingface/datasets/pull/6586
- @weiji14 made their first contribution in https://github.com/huggingface/datasets/pull/6508
- @ringohoffman made their first contribution in https://github.com/huggingface/datasets/pull/6630
- @psmyth94 made their first contribution in https://github.com/huggingface/datasets/pull/6636
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0
Published by albertvillanova about 2 years ago
datasets - 2.16.1
Bug fixes
- Fix dl_manager.extract returning FileNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6543
  - Fix bug causing FileNotFoundError when passing a relative directory as `cache_dir` to `load_dataset`
- Fix custom configs from script by @lhoestq in https://github.com/huggingface/datasets/pull/6544
  - Fix bug where loading a dataset with a loading script using custom arguments would fail
  - e.g. `load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")`
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1
Published by lhoestq about 2 years ago
datasets - 2.16.0
Security features
- Add `trust_remote_code` argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
  - Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at `https://hf.co/datasets/<repo_id>`. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument `trust_remote_code=True`.
  - Passing `trust_remote_code=True` will be mandatory to load these datasets from the next major release of `datasets`.
  - Using the environment variable `HF_DATASETS_TRUST_REMOTE_CODE=0` you can already disable custom code by default without waiting for the next release of `datasets`
- Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
  - This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
  - You can see a dataset's Parquet export at `https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet`
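The Parquet export URL above contains `refs%2Fconvert%2Fparquet`, which is the percent-encoded form of `refs/convert/parquet`. A small sketch of building that URL for a given repository follows; the helper name `parquet_export_url` and the repository id `username/my_dataset` are hypothetical, and only the URL pattern itself comes from the note above.

```python
from urllib.parse import quote


def parquet_export_url(repo_id: str) -> str:
    """Hypothetical helper: URL of the Parquet export page for a dataset repo."""
    # The slashes in the ref name must be percent-encoded so they are not
    # read as path separators; safe="" makes quote() encode "/" as well.
    revision = quote("refs/convert/parquet", safe="")
    return f"https://hf.co/datasets/{repo_id}/tree/{revision}"


url = parquet_export_url("username/my_dataset")
```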
Features
- Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
- Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
- Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
  - This speeds up the `load_dataset` step that lists the data files of big repositories (up to x100) but requires `huggingface_hub` 0.20 or newer
  - Fix `load_dataset` that used to reload data from cache even if the dataset was updated on Hugging Face
  - Reload a dataset from your cache even if you don't have an internet connection
  - New cache directory scheme for no-script datasets: `~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha`
  - Backward compatibility: cached datasets from `datasets` 2.15 (using the old scheme) are still reloaded from cache
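To show how the components of the new cache directory scheme fit together, here is a minimal sketch that assembles a path of that shape. The function name `no_script_cache_dir` and all argument values are hypothetical, and the library may normalize names differently; the sketch only mirrors the pattern quoted above, including the triple underscore between user name and dataset name.

```python
from pathlib import Path


def no_script_cache_dir(repo_id, config_name, version, commit_sha,
                        cache_root="~/.cache/huggingface/datasets"):
    """Hypothetical helper mirroring the documented directory layout."""
    # "username/dataset_name" becomes "username___dataset_name".
    repo_dir = repo_id.replace("/", "___")
    return Path(cache_root).expanduser() / repo_dir / config_name / version / commit_sha


path = no_script_cache_dir(
    "username/dataset_name", "default", "0.0.0", "abc123", cache_root="/tmp/cache"
)
```

Including the commit SHA in the path is what lets an updated dataset land in a new directory instead of silently reusing stale cached data.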
General improvements and bug fixes
- Remove unused argument in `_get_data_files_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6343
- Set `usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
- Use `ruff` for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
- Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in https://github.com/huggingface/datasets/pull/6431
- Fix multi gpu map example by @lhoestq in https://github.com/huggingface/datasets/pull/6415
- Better `tqdm` wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
- Remove `Table.__getstate__` and `Table.__setstate__` by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
- Use `filelock` package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
- Fix metadata file resolution when inferred pattern is `**` by @mariosasko in https://github.com/huggingface/datasets/pull/6449
- Update hub-docs reference by @mishig25 in https://github.com/huggingface/datasets/pull/6453
- Refactor `dill` logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
- Don't require trust_remote_code in inspect_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6456
- [docs] troubleshooting guide by @MKhalusova in https://github.com/huggingface/datasets/pull/6424
- Missing DatasetNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6462
- Disable benchmarks in PRs by @lhoestq in https://github.com/huggingface/datasets/pull/6463
- More robust temporary directory deletion by @mariosasko in https://github.com/huggingface/datasets/pull/6426
- Fix shard retry mechanism in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6461
- Use auth to get parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6468
- Remove delete doc CI by @lhoestq in https://github.com/huggingface/datasets/pull/6471
- Fix CI quality by @albertvillanova in https://github.com/huggingface/datasets/pull/6473
- Fix PermissionError on Windows CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6477
- More robust preupload retry mechanism by @mariosasko in https://github.com/huggingface/datasets/pull/6479
- Add IterableDataset `__repr__` by @lhoestq in https://github.com/huggingface/datasets/pull/6480
- Fix max lock length on unix by @lhoestq in https://github.com/huggingface/datasets/pull/6482
- Fix ArrayXD YAML conversion by @mariosasko in https://github.com/huggingface/datasets/pull/6168
- Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6486
- Fix deprecation warning when building conda package by @albertvillanova in https://github.com/huggingface/datasets/pull/6425
- Make push_to_hub return CommitInfo by @albertvillanova in https://github.com/huggingface/datasets/pull/6492
- docs: add reference Git over SSH by @severo in https://github.com/huggingface/datasets/pull/6499
- Fallback on dataset script if user wants to load default config by @lhoestq in https://github.com/huggingface/datasets/pull/6498
- Don't expand_info in HF glob by @lhoestq in https://github.com/huggingface/datasets/pull/6469
- Fix streaming xnli by @lhoestq in https://github.com/huggingface/datasets/pull/6503
- Pickle support for `torch.Generator` objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
- Enable setting config as default when push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6500
- Better cast error when generating dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6509
- Replace `list_files_info` with `list_repo_tree` in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6510
- Remove deprecated HfFolder by @lhoestq in https://github.com/huggingface/datasets/pull/6512
- Support huggingface-hub pre-releases by @albertvillanova in https://github.com/huggingface/datasets/pull/6516
- Support push_to_hub canonical datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6519
- Support commit_description parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6520
- fix get_metadata_patterns function args error by @d710055071 in https://github.com/huggingface/datasets/pull/6518
- Fix metrics dead link by @qgallouedec in https://github.com/huggingface/datasets/pull/6491
- fix tests by @lhoestq in https://github.com/huggingface/datasets/pull/6523
- Cache backward compatibility with 2.15.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6514
- Preserve order of configs and splits when using Parquet exports by @albertvillanova in https://github.com/huggingface/datasets/pull/6526
New Contributors
- @LZHgrla made their first contribution in https://github.com/huggingface/datasets/pull/6444
- @d710055071 made their first contribution in https://github.com/huggingface/datasets/pull/6518
Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0
Published by lhoestq about 2 years ago
datasets - 2.15.0
What's Changed
- Fix typo in Audio dataset documentation by @prassanna-ravishankar in https://github.com/huggingface/datasets/pull/6222
- Add push_to_hub with multiple configs docs by @lhoestq in https://github.com/huggingface/datasets/pull/6226
- Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in https://github.com/huggingface/datasets/pull/6228
- Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6233
- Don't skip hidden files in `dl_manager.iter_files` when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
- Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6223
- Remove unused global variables in `audio.py` by @mariosasko in https://github.com/huggingface/datasets/pull/6241
- Improve error message for missing function parameters by @suavemint in https://github.com/huggingface/datasets/pull/6232
- Fix cast from fixed size list to variable size list by @mariosasko in https://github.com/huggingface/datasets/pull/6243
- Update create_dataset.mdx by @EswarDivi in https://github.com/huggingface/datasets/pull/6247
- [DOCS] Fix typo: Elasticsearch by @leemthompo in https://github.com/huggingface/datasets/pull/6258
- Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in https://github.com/huggingface/datasets/pull/6251
- Temporarily pin tensorflow < 2.14.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6264
- Fix CI 404 errors by @albertvillanova in https://github.com/huggingface/datasets/pull/6262
- Remove `apache_beam` import in `BeamBasedBuilder._save_info` by @mariosasko in https://github.com/huggingface/datasets/pull/6265
- Improve documentation of dataset.from_generator by @hartmans in https://github.com/huggingface/datasets/pull/6281
- Fix parquet columns argument in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/6295
- Doc readme improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6298
- Unpin `tensorflow` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
- Unpin `jax` maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
- Fix ArrayXD cast by @mariosasko in https://github.com/huggingface/datasets/pull/6297
- Reduce the number of commits in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6269
- Fix typo in code example in docs by @bryant1410 in https://github.com/huggingface/datasets/pull/6307
- Update README.md by @smty2018 in https://github.com/huggingface/datasets/pull/6304
- Deterministic set hash by @lhoestq in https://github.com/huggingface/datasets/pull/6318
- docs: resolving namespace conflict, refactored variable by @smty2018 in https://github.com/huggingface/datasets/pull/6312
- Fix typos by @python273 in https://github.com/huggingface/datasets/pull/6321
- Fix commit message formatting in multi-commit uploads by @qgallouedec in https://github.com/huggingface/datasets/pull/6313
- Temporarily pin fsspec < 2023.10.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6331
- Unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/6336
- Fix use_dataset.mdx by @angel-luis in https://github.com/huggingface/datasets/pull/6351
- Add `fsspec` version to the `datasets-cli env` command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
- Expanduser in save_to_disk() by @Unknown3141592 in https://github.com/huggingface/datasets/pull/6098
- Fix time measuring snippet in docs by @mariosasko in https://github.com/huggingface/datasets/pull/6367
- Temporarily pin pyarrow < 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6375
- Fix typo in `Dataset.map` docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
- Avoid redundant warning when encoding NumPy array as `Image` by @mariosasko in https://github.com/huggingface/datasets/pull/6379
- Replace deprecated license_file in setup.cfg by @albertvillanova in https://github.com/huggingface/datasets/pull/6332
- Minor release step improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6339
- Fix dependency conflict within CI build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/6411
- Remove redundant condition in builders by @albertvillanova in https://github.com/huggingface/datasets/pull/6398
- Handle future deprecation argument by @winglian in https://github.com/huggingface/datasets/pull/6390
- Remove token value from warnings by @mariosasko in https://github.com/huggingface/datasets/pull/6418
- Rename audio_classificiation.py to audio_classification.py by @carlthome in https://github.com/huggingface/datasets/pull/6416
- Add pyarrow-hotfix to release docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6421
- Simplify filesystem logic by @mariosasko in https://github.com/huggingface/datasets/pull/6362
- Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/6423
New Contributors
- @prassanna-ravishankar made their first contribution in https://github.com/huggingface/datasets/pull/6222
- @NinoRisteski made their first contribution in https://github.com/huggingface/datasets/pull/6233
- @suavemint made their first contribution in https://github.com/huggingface/datasets/pull/6232
- @EswarDivi made their first contribution in https://github.com/huggingface/datasets/pull/6247
- @leemthompo made their first contribution in https://github.com/huggingface/datasets/pull/6258
- @hartmans made their first contribution in https://github.com/huggingface/datasets/pull/6281
- @smty2018 made their first contribution in https://github.com/huggingface/datasets/pull/6304
- @python273 made their first contribution in https://github.com/huggingface/datasets/pull/6321
- @angel-luis made their first contribution in https://github.com/huggingface/datasets/pull/6351
- @Unknown3141592 made their first contribution in https://github.com/huggingface/datasets/pull/6098
- @winglian made their first contribution in https://github.com/huggingface/datasets/pull/6390
- @carlthome made their first contribution in https://github.com/huggingface/datasets/pull/6416
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0
Published by albertvillanova over 2 years ago
datasets - 2.14.7
Bug Fixes
- Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in https://github.com/huggingface/datasets/pull/6346
- Fix python formatting for complex types in format_table by @mariosasko in https://github.com/huggingface/datasets/pull/6368
- Support pyarrow 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6378
- Do not try to download from HF GCS for generator by @yundai424 in https://github.com/huggingface/datasets/pull/6372
- Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in https://github.com/huggingface/datasets/pull/6404
New Contributors
- @cwallenwein made their first contribution in https://github.com/huggingface/datasets/pull/6346
- @yundai424 made their first contribution in https://github.com/huggingface/datasets/pull/6372
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7
Published by albertvillanova over 2 years ago
datasets - 2.14.6
What's Changed
- Ignore dataset_info.json in data files resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6224
- Check builder cls default config name in inspect by @lhoestq in https://github.com/huggingface/datasets/pull/6253
- Add support for fsspec>=2023.9.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6244
- Create DefunctDatasetError by @albertvillanova in https://github.com/huggingface/datasets/pull/6286
- Fix get_data_patterns for directories with the word data twice by @albertvillanova in https://github.com/huggingface/datasets/pull/6309
- Fix loading Hub datasets with CSV metadata file by @albertvillanova in https://github.com/huggingface/datasets/pull/6316
- datasets.filesystems: fix is_remote_filesystems by @ap-- in https://github.com/huggingface/datasets/pull/6334
- Pin upper version of fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/6337
- Fix regex get_data_files formatting for base paths by @ZachNagengast in https://github.com/huggingface/datasets/pull/6322
New Contributors
- @ap-- made their first contribution in https://github.com/huggingface/datasets/pull/6334
- @ZachNagengast made their first contribution in https://github.com/huggingface/datasets/pull/6322
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6
Published by lhoestq over 2 years ago
datasets - 2.14.5
Bug fixes
- Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6091
- Minor fix in `iter_files` for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
- Use yaml instead of get data patterns when possible by @lhoestq in https://github.com/huggingface/datasets/pull/6154
- Fix Parquet loading with `columns` by @mariosasko in https://github.com/huggingface/datasets/pull/6160
- Fix: Missing a MetadataConfigs init when the repo has a `datasets_info.json` but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
- PyArrow 13 CI fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6175
- Don't alter input in Features.from_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6189
- Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/6165
- Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6192
- Temporarily pin pandas < 2.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6200
- Preserve split order in DataFilesDict by @albertvillanova in https://github.com/huggingface/datasets/pull/6198
- Add missing `revision` argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
- Temporarily pin fsspec < 2023.9.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6210
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6208
- Fix empty splitinfo json by @lhoestq in https://github.com/huggingface/datasets/pull/6211
- Fix to_json ValueError and remove pandas pin by @albertvillanova in https://github.com/huggingface/datasets/pull/6201
- Fix checking patterns to infer packaged builder by @polinaeterna in https://github.com/huggingface/datasets/pull/6215
- Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/6218
Other improvements
- Deprecate `Dataset.export` by @mariosasko in https://github.com/huggingface/datasets/pull/6081
- Deprecate `download_custom` by @mariosasko in https://github.com/huggingface/datasets/pull/6093
- Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in https://github.com/huggingface/datasets/pull/6138
- Remove unused allowed_extensions param by @albertvillanova in https://github.com/huggingface/datasets/pull/6135
- Export to_iterable_dataset to document by @npuichigo in https://github.com/huggingface/datasets/pull/6145
- [Docs] Add description of `select_columns` to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
- Ignore parallel warning in map_nested by @lhoestq in https://github.com/huggingface/datasets/pull/6148
- [docs] Complete `to_iterable_dataset` by @stevhliu in https://github.com/huggingface/datasets/pull/6158
- Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in https://github.com/huggingface/datasets/pull/6155
- Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in https://github.com/huggingface/datasets/pull/6171
- Document BUILDER_CONFIG_CLASS by @lhoestq in https://github.com/huggingface/datasets/pull/6166
- Fix import in `image_load` doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
- Use object detection images from `huggingface/documentation-images` by @mariosasko in https://github.com/huggingface/datasets/pull/6177
- Use `hf-internal-testing` repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180
New Contributors
- @npuichigo made their first contribution in https://github.com/huggingface/datasets/pull/6145
- @unifyh made their first contribution in https://github.com/huggingface/datasets/pull/6119
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5
Published by albertvillanova over 2 years ago
datasets - 2.14.3
Bug fixes
- Fix error when loading from GCP bucket by @albertvillanova in https://github.com/huggingface/datasets/pull/6105
- Fix deprecation of use_auth_token in file_utils by @albertvillanova in https://github.com/huggingface/datasets/pull/6107
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3
Published by albertvillanova over 2 years ago
datasets - 2.14.2
Bug fixes
- Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in https://github.com/huggingface/datasets/pull/6094
- Fix deprecation of errors in TextConfig by @albertvillanova in https://github.com/huggingface/datasets/pull/6095
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2
Published by albertvillanova over 2 years ago
datasets - 2.14.1
Bug fixes
- fix tqdm lock by @lhoestq in https://github.com/huggingface/datasets/pull/6067
- fix tqdm lock deletion by @lhoestq in https://github.com/huggingface/datasets/pull/6068
- Fix fsspec storage_options from load_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6072
- No gzip encoding from github by @lhoestq in https://github.com/huggingface/datasets/pull/6076
Other improvements
- Fix `Overview.ipynb` & detach Jupyter Notebooks from `datasets` repository by @alvarobartt in https://github.com/huggingface/datasets/pull/5902
- Fix Quickstart notebook link by @mariosasko in https://github.com/huggingface/datasets/pull/6070
- Remove README link to deprecated Colab notebook by @mariosasko in https://github.com/huggingface/datasets/pull/6080
- Misc doc improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6074
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1
Published by lhoestq over 2 years ago
datasets - 2.14.0
Important: caching
- Datasets downloaded and cached using `datasets>=2.14.0` may not be reloaded from cache using older versions of `datasets` (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in https://github.com/huggingface/datasets/pull/5331.
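The renaming described in the last point can be pictured with a small sketch. `default_config_name` is a hypothetical helper for illustration only; it restates the two naming conventions from the note above and is not how `datasets` computes the name internally.

```python
def default_config_name(repo_id: str, legacy: bool) -> str:
    """Hypothetical illustration of the old and new default configuration names."""
    if legacy:
        # Before 2.14.0: derived from the repository id, e.g. "username--dataset_name".
        return repo_id.replace("/", "--")
    # From 2.14.0 on: always "default".
    return "default"


old_name = default_config_name("username/dataset_name", legacy=True)
new_name = default_config_name("username/dataset_name", legacy=False)
print(old_name, new_name)
```

Because the configuration name changed, a cache entry written under the new name is presumably not found by an older version looking for the old one.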
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
- Configure your dataset using YAML at the top of your dataset card (docs here)
- Choose which file goes into which split
```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data.csv
  - split: test
    path: holdout.csv
---
```
- Define multiple dataset configurations
```yaml
---
configs:
- config_name: main_data
  data_files: main_data.csv
- config_name: additional_data
  data_files: additional_data.csv
---
```
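If the dataset card is generated by a script, the YAML layouts shown above could be rendered from a plain dictionary. The sketch below is one possible way to do that with the standard library only; `render_front_matter` is a hypothetical helper, not part of `datasets`, and it handles only the two shapes used in this section (a single path, or a mapping of split to path).

```python
def render_front_matter(configs):
    """Hypothetical helper: render a `configs:` YAML block for a dataset card."""
    lines = ["---", "configs:"]
    for config_name, data_files in configs.items():
        lines.append(f"- config_name: {config_name}")
        if isinstance(data_files, str):
            # One file (or pattern) for the whole configuration.
            lines.append(f"  data_files: {data_files}")
        else:
            # One entry per split.
            lines.append("  data_files:")
            for split, path in data_files.items():
                lines.append(f"  - split: {split}")
                lines.append(f"    path: {path}")
    lines.append("---")
    return "\n".join(lines)


front_matter = render_front_matter(
    {"default": {"train": "data.csv", "test": "holdout.csv"}}
)
print(front_matter)
```

File names containing YAML special characters would need quoting, which this sketch does not attempt.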
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
- `push_to_hub()` additional dataset configurations
```python
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
```
- Support returning dataframe in map transform by @mariosasko in https://github.com/huggingface/datasets/pull/5995
What's Changed
- Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in https://github.com/huggingface/datasets/pull/5974
- Fix select_columns columns order by @lhoestq in https://github.com/huggingface/datasets/pull/5994
- Replace metadata utils with `huggingface_hub`'s RepoCard API by @mariosasko in https://github.com/huggingface/datasets/pull/5949
- Pin `joblib` to avoid `joblibspark` test failures by @mariosasko in https://github.com/huggingface/datasets/pull/6000
- Align `column_names` type check with type hint in `sort` by @mariosasko in https://github.com/huggingface/datasets/pull/6001
- Deprecate `use_auth_token` in favor of `token` by @mariosasko in https://github.com/huggingface/datasets/pull/5996
- Drop Python 3.7 support by @mariosasko in https://github.com/huggingface/datasets/pull/6005
- Misc improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in https://github.com/huggingface/datasets/pull/5986
- Fix cast for dictionaries with no keys by @mariosasko in https://github.com/huggingface/datasets/pull/6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in https://github.com/huggingface/datasets/pull/5976
- Deprecate task api by @mariosasko in https://github.com/huggingface/datasets/pull/5865
- Add metadata ui screenshot in docs by @lhoestq in https://github.com/huggingface/datasets/pull/6015
- Fix `ClassLabel` min max check for `None` values by @mariosasko in https://github.com/huggingface/datasets/pull/6023
- [docs] Update return statement of index search by @stevhliu in https://github.com/huggingface/datasets/pull/6021
- Improve logging by @mariosasko in https://github.com/huggingface/datasets/pull/6019
- Fix style with ruff 0.0.278 by @lhoestq in https://github.com/huggingface/datasets/pull/6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in https://github.com/huggingface/datasets/pull/6024
- Delete `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in https://github.com/huggingface/datasets/pull/6027
- [docs] Fix link by @stevhliu in https://github.com/huggingface/datasets/pull/6029
- fixed typo in comment by @NightMachinery in https://github.com/huggingface/datasets/pull/6030
- Fix legacy_dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/6040
- Flatten repository_structure docs on yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6041
- Use new hffs by @lhoestq in https://github.com/huggingface/datasets/pull/6028
- Bump dev version by @lhoestq in https://github.com/huggingface/datasets/pull/6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in https://github.com/huggingface/datasets/pull/6044
- Remove `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in https://github.com/huggingface/datasets/pull/6052
- Dill 3.7 support by @mariosasko in https://github.com/huggingface/datasets/pull/6061
- Improve `Dataset.from_list` docstring by @mariosasko in https://github.com/huggingface/datasets/pull/6062
- Check if column names match in Parquet loader only when config `features` are specified by @mariosasko in https://github.com/huggingface/datasets/pull/6045
- Release: 2.14.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6063
New Contributors
- @mathewjacob1002 made their first contribution in https://github.com/huggingface/datasets/pull/5986
- @pappacena made their first contribution in https://github.com/huggingface/datasets/pull/5976
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0
Published by lhoestq over 2 years ago
datasets - 2.13.1
General improvements and bug fixes
- Fix JSON generation in benchmarks CI by @mariosasko in https://github.com/huggingface/datasets/pull/5966
- Always return list in `list_datasets` by @mariosasko in https://github.com/huggingface/datasets/pull/5964
- Add `encoding` and `errors` params to JSON loader by @mariosasko in https://github.com/huggingface/datasets/pull/5969
- Filter unsupported extensions by @lhoestq in https://github.com/huggingface/datasets/pull/5972
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1
Published by lhoestq over 2 years ago
datasets - 2.13.0
Dataset Features
- Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770
- Stream the data from your Spark DataFrame directly to your training pipeline
```python
from datasets import IterableDataset
from torch.utils.data import DataLoader

ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...
```
- IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
- IterableDataset Arrow formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5821
- Iterable torch formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5852
```python
from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
```
- Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893
```python
from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")
```
- Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944
```python
from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
```
Experimental
- Add parallel module using joblib for Spark by @es94129 in https://github.com/huggingface/datasets/pull/5924
General improvements and bug fixes
- Preserve `stopping_strategy` of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
- Fix incomplete docstring for `BuilderConfig` by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
- [docs] Custom decoding transforms by @stevhliu in https://github.com/huggingface/datasets/pull/5836
- Add `accelerate` as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
- Add `date_format` param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
- [docs] Redirects, migrated from nginx by @julien-c in https://github.com/huggingface/datasets/pull/5853
- Fix infer module for uppercase extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5872
- Minor tqdm optim by @lhoestq in https://github.com/huggingface/datasets/pull/5860
- Always set nullable fields in the writer by @lhoestq in https://github.com/huggingface/datasets/pull/5835
- Add `fn_kwargs` to `map` and `filter` of `IterableDataset` and `IterableDatasetDict` by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
- Better error message when combining dataset dicts instead of datasets by @lhoestq in https://github.com/huggingface/datasets/pull/5861
- Force overwrite existing filesystem protocol by @baskrahmer in https://github.com/huggingface/datasets/pull/5894
- Support working_dir in from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5826
- Raise TypeError when indexing a dataset with bool by @albertvillanova in https://github.com/huggingface/datasets/pull/5859
- Fix minor typo in docs loading.mdx by @albertvillanova in https://github.com/huggingface/datasets/pull/5900
- Fix `FixedSizeListArray` casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
- Unpin responses by @mariosasko in https://github.com/huggingface/datasets/pull/5916
- Validate name parameter in make_file_instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/5904
- Raise error in `DatasetBuilder.as_dataset` when `file_format` is not `"arrow"` by @mariosasko in https://github.com/huggingface/datasets/pull/5915
- Refactor extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5917
- Use more efficient and idiomatic way to construct list. by @ttsugriy in https://github.com/huggingface/datasets/pull/5909
- Add `flatten_indices` to `DatasetDict` by @maximxlss in https://github.com/huggingface/datasets/pull/5907
- Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in https://github.com/huggingface/datasets/pull/5920
- Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in https://github.com/huggingface/datasets/pull/5901
- Fix streaming parquet with image feature in schema by @lhoestq in https://github.com/huggingface/datasets/pull/5921
- canonicalize data dir in config ID hash by @kylrth in https://github.com/huggingface/datasets/pull/5899
- Fix link to quickstart docs in README.md by @mariosasko in https://github.com/huggingface/datasets/pull/5928
- Fix string-encoding, make `batch_size` optional, and minor improvements in `Dataset.to_tf_dataset` by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
- Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5863
- [doc build] Use secrets by @mishig25 in https://github.com/huggingface/datasets/pull/5932
- Fix `to_numpy` when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933
- Better row group size in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/5935
- Avoid parallel redownload in cache by @albertvillanova in https://github.com/huggingface/datasets/pull/5937
- Better filenotfound for gated by @lhoestq in https://github.com/huggingface/datasets/pull/5954
- Make get_from_cache use custom temp filename that is locked by @albertvillanova in https://github.com/huggingface/datasets/pull/5938
- Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/5956
- Add Arrow builder docs by @lhoestq in https://github.com/huggingface/datasets/pull/5952
- Fix sequence of array support for most dtype by @qgallouedec in https://github.com/huggingface/datasets/pull/5948
New Contributors
- @Laurent2916 made their first contribution in https://github.com/huggingface/datasets/pull/5824
- @yuukicammy made their first contribution in https://github.com/huggingface/datasets/pull/5810
- @baskrahmer made their first contribution in https://github.com/huggingface/datasets/pull/5894
- @ttsugriy made their first contribution in https://github.com/huggingface/datasets/pull/5909
- @maximxlss made their first contribution in https://github.com/huggingface/datasets/pull/5907
- @mariusz-jachimowicz-83 made their first contribution in https://github.com/huggingface/datasets/pull/5893
- @kylrth made their first contribution in https://github.com/huggingface/datasets/pull/5899
- @qgallouedec made their first contribution in https://github.com/huggingface/datasets/pull/5933
- @es94129 made their first contribution in https://github.com/huggingface/datasets/pull/5924
Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0
- Python
Published by lhoestq over 2 years ago
datasets - 2.12.0
Datasets Features
- Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701
- Get a Dataset from a Spark DataFrame (docs):
```python
from datasets import Dataset

# df is an existing Spark DataFrame
ds = Dataset.from_spark(df)
```
- Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689
- Stream data from Wikipedia:
```python
from datasets import load_dataset

ds = load_dataset("wikipedia", "20220301.de", streaming=True)
next(iter(ds["train"]))
# {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
```
- Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735
- Use interleaved datasets in a distributed setup or with a DataLoader
```python
from datasets import load_dataset, interleave_datasets
from torch.utils.data import DataLoader

wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
c4 = load_dataset("c4", "en", split="train", streaming=True)
merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
dataloader = DataLoader(merged, num_workers=4)
```
- Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
- Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
- Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
- Allow converting the variable-shaped ArrayND to Pandas
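To make the interleaving item above more concrete, here is a minimal pure-Python sketch of what sampling from several sources with probabilities and an "all exhausted" stopping rule can look like. It is an illustration of the idea only: the function `interleave_all_exhausted` is hypothetical, it assumes non-empty sources, and the sampling order will not match what `interleave_datasets` produces for the same seed.

```python
import random


def interleave_all_exhausted(sources, probabilities, seed):
    """Sample from `sources` until each one has been fully seen at least once."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    exhausted = [False] * len(sources)
    merged = []
    while True:
        # Pick which source to draw from, according to the probabilities.
        i = rng.choices(range(len(sources)), weights=probabilities)[0]
        try:
            item = next(iterators[i])
        except StopIteration:
            # Source i ran out: remember that, then restart it (oversampling).
            exhausted[i] = True
            if all(exhausted):
                break
            iterators[i] = iter(sources[i])
            item = next(iterators[i])
        merged.append(item)
    return merged


wiki = ["w0", "w1", "w2"]
c4 = ["c0", "c1", "c2", "c3", "c4"]
merged = interleave_all_exhausted([wiki, c4], probabilities=[0.1, 0.9], seed=42)
```

Because the rarely sampled source decides when the loop ends, items from the frequently sampled source are repeated many times, which is the oversampling behaviour this stopping rule implies.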
General improvements and bug fixes
- Fix a description error for interleave_datasets. by @QizhiPei in https://github.com/huggingface/datasets/pull/5680
- [docs] Split pattern search order by @stevhliu in https://github.com/huggingface/datasets/pull/5693
- Raise an error on missing distributed seed by @lhoestq in https://github.com/huggingface/datasets/pull/5697
- Fix xnumpy_load for .npz files by @albertvillanova in https://github.com/huggingface/datasets/pull/5714
- Temporarily pin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5731
- Unpin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5733
- Fix CI warnings by @albertvillanova in https://github.com/huggingface/datasets/pull/5741
- Fix CI mock filesystem fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/5740
- Fix link in docs by @bbbxyz in https://github.com/huggingface/datasets/pull/5746
- fix typo: "mow" -> "now" by @csris in https://github.com/huggingface/datasets/pull/5763
- [docs] Compress data files by @stevhliu in https://github.com/huggingface/datasets/pull/5691
- Fix style by @lhoestq in https://github.com/huggingface/datasets/pull/5774
- Minor tqdm fixes by @mariosasko in https://github.com/huggingface/datasets/pull/5754
- Fixes #5757 by @eli-osherovich in https://github.com/huggingface/datasets/pull/5758
- Fix JSON builder when missing keys in first row by @albertvillanova in https://github.com/huggingface/datasets/pull/5772
- Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in https://github.com/huggingface/datasets/pull/5742
- Prepare tests for hfh 0.14 by @Wauplin in https://github.com/huggingface/datasets/pull/5788
- Call fs.makedirs in save_to_disk by @lhoestq in https://github.com/huggingface/datasets/pull/5779
- Allow to run CI on push to ci-branch by @albertvillanova in https://github.com/huggingface/datasets/pull/5790
- Fix nondeterministic sharded data split order by @albertvillanova in https://github.com/huggingface/datasets/pull/5729
- Raise subprocesses traceback when interrupting by @lhoestq in https://github.com/huggingface/datasets/pull/5784
- Fix spark imports by @lhoestq in https://github.com/huggingface/datasets/pull/5795
- Change downloaded file permission based on umask by @albertvillanova in https://github.com/huggingface/datasets/pull/5800
- Fix inferring module for unsupported data files by @albertvillanova in https://github.com/huggingface/datasets/pull/5787
- Reorder default data splits to have validation before test by @albertvillanova in https://github.com/huggingface/datasets/pull/5718
- Validate non-empty data_files by @albertvillanova in https://github.com/huggingface/datasets/pull/5802
- Spark docs by @lhoestq in https://github.com/huggingface/datasets/pull/5796
- Release: 2.12.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5803
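The "Change downloaded file permission based on umask" entry above refers to a standard POSIX idea: a newly created regular file gets mode `0o666` with the bits of the process umask cleared. The sketch below shows one way to apply that to an existing file. The helper name `apply_umask_permissions` is hypothetical, and the snippet assumes a POSIX system.

```python
import os
import stat
import tempfile


def apply_umask_permissions(path):
    """Give `path` the mode a freshly created regular file would get."""
    # os.umask() sets a new mask and returns the old one, so set a
    # throwaway value and restore the original immediately.
    umask = os.umask(0o666)
    os.umask(umask)
    mode = 0o666 & ~umask
    os.chmod(path, mode)
    return mode


# mkstemp() creates files readable by the owner only, regardless of umask.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
mode = apply_umask_permissions(tmp_path)
actual = stat.S_IMODE(os.stat(tmp_path).st_mode)
os.remove(tmp_path)
```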
New Contributors
- @QizhiPei made their first contribution in https://github.com/huggingface/datasets/pull/5680
- @bbbxyz made their first contribution in https://github.com/huggingface/datasets/pull/5746
- @csris made their first contribution in https://github.com/huggingface/datasets/pull/5763
- @eli-osherovich made their first contribution in https://github.com/huggingface/datasets/pull/5758
- @maddiedawson made their first contribution in https://github.com/huggingface/datasets/pull/5701
Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0
- Python
Published by lhoestq almost 3 years ago
datasets - 2.11.0
Important
- Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in https://github.com/huggingface/datasets/pull/5573
- this removes the need for a PyTorch dependency to decode audio files
- this was possible with soundfile 0.12 which bundles libsndfile binaries at a recent version with MP3 support
- Deprecated `batch_size` on `Dataset.to_dict()`
Datasets Features
- Add writer_batch_size for ArrowBasedBuilder by @lhoestq in https://github.com/huggingface/datasets/pull/5565
- allows specifying the row group / record batch size when you `download_and_prepare()` a dataset
- Experimental support of cloud storage in `load_dataset()`:
- Support cloud storage in load_dataset via fsspec by @dwyatte in https://github.com/huggingface/datasets/pull/5580
- Pass down storage options by @dwyatte in https://github.com/huggingface/datasets/pull/5673
- Support PyArrow arrays as column values in `from_dict` by @mariosasko in https://github.com/huggingface/datasets/pull/5643
- Allow direct cast from binary to Audio/Image by @mariosasko in https://github.com/huggingface/datasets/pull/5644
- Add column_names to IterableDataset by @patrickloeber in https://github.com/huggingface/datasets/pull/5582
- pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5569
- add Dataset.to_list by @kyoto7250 in https://github.com/huggingface/datasets/pull/5611
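`Dataset.to_list` from the list above returns rows rather than columns. As a rough picture of the difference between a column-oriented dictionary and a row-oriented list, the following sketch converts one into the other in plain Python. The function `columns_to_rows` is a hypothetical illustration and not the library's implementation.

```python
def columns_to_rows(columns):
    """Turn {"col": [v0, v1, ...]} into [{"col": v0}, {"col": v1}, ...]."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) walks the columns in lockstep, yielding one row at a time.
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


columns = {"text": ["a", "b"], "label": [0, 1]}
rows = columns_to_rows(columns)
```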
General improvements and bug fixes
- Update csv.py by @XDoubleU in https://github.com/huggingface/datasets/pull/5562
- Remove instructions for `ffmpeg` system package installation on Colab by @polinaeterna in https://github.com/huggingface/datasets/pull/5558
- Apply ruff flake8-comprehension checks by @Skylion007 in https://github.com/huggingface/datasets/pull/5549
- Fix `datasets.load_from_disk`, `DatasetDict.load_from_disk` and `Dataset.load_from_disk` by @alvarobartt in https://github.com/huggingface/datasets/pull/5529
- Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/5561
- Add `huggingface_hub` version to env cli command by @mariosasko in https://github.com/huggingface/datasets/pull/5578
- Do not write index by default when exporting a dataset by @mariosasko in https://github.com/huggingface/datasets/pull/5583
- Flatten dataset on the fly in `save_to_disk` by @mariosasko in https://github.com/huggingface/datasets/pull/5588
- Fix `sort` with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
- Fix docstring example by @stevhliu in https://github.com/huggingface/datasets/pull/5592
- Fix push_to_hub with no dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/5598
- Don't compute checksums if not necessary in `datasets-cli test` by @lhoestq in https://github.com/huggingface/datasets/pull/5603
- Update README logo by @gary149 in https://github.com/huggingface/datasets/pull/5605
- Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5617
- Fix archive fs test by @lhoestq in https://github.com/huggingface/datasets/pull/5614
- unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/5619
- Bump pyarrow to 8.0.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5620
- Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in https://github.com/huggingface/datasets/pull/5623
- Fix outdated `verification_mode` values by @polinaeterna in https://github.com/huggingface/datasets/pull/5607
- Adding Oracle Cloud to docs by @ahosler in https://github.com/huggingface/datasets/pull/5621
- Fix CI: ignore C901 ("some_func" is too complex) in `ruff` by @polinaeterna in https://github.com/huggingface/datasets/pull/5636
- add kwargs to index search by @SaulLu in https://github.com/huggingface/datasets/pull/5628
- Less zip false positives by @lhoestq in https://github.com/huggingface/datasets/pull/5640
- Allow self as key in `Features` by @mariosasko in https://github.com/huggingface/datasets/pull/5646
- Bump hfh to 0.11.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5642
- Support streaming datasets with numpy.load by @albertvillanova in https://github.com/huggingface/datasets/pull/5626
- Fix unnecessary dict comprehension by @albertvillanova in https://github.com/huggingface/datasets/pull/5662
- Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5664
- Copy features by @lhoestq in https://github.com/huggingface/datasets/pull/5652
- Improve features decoding in to_iterable_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/5655
- Fix `fsspec.open` when using an HTTP proxy by @bryant1410 in https://github.com/huggingface/datasets/pull/5656
- Jax requires jaxlib by @lhoestq in https://github.com/huggingface/datasets/pull/5667
- docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in https://github.com/huggingface/datasets/pull/5658
- Allow loading/saving of FAISS index using fsspec by @Dref360 in https://github.com/huggingface/datasets/pull/5526
- Fix verification_mode when ignore_verifications is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/5683
- Release: 2.11.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5684
New Contributors
- @XDoubleU made their first contribution in https://github.com/huggingface/datasets/pull/5562
- @Skylion007 made their first contribution in https://github.com/huggingface/datasets/pull/5549
- @Hubert-Bonisseur made their first contribution in https://github.com/huggingface/datasets/pull/5569
- @ahosler made their first contribution in https://github.com/huggingface/datasets/pull/5621
- @patrickloeber made their first contribution in https://github.com/huggingface/datasets/pull/5582
- @SaulLu made their first contribution in https://github.com/huggingface/datasets/pull/5628
- @connor-henderson made their first contribution in https://github.com/huggingface/datasets/pull/5658
- @kyoto7250 made their first contribution in https://github.com/huggingface/datasets/pull/5611
Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.11.0
- Python
Published by lhoestq almost 3 years ago
datasets - 2.10.1
What's Changed
- Fix sort with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
- Fix `IndexError` when doing `ds.filter(...).sort(...)` or `ds.select(...).sort(...)`
Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.10.1
- Python
Published by lhoestq almost 3 years ago
datasets - 2.10.0
Important
- Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in https://github.com/huggingface/datasets/pull/5542
- Big improvements on the speed of `.flatten_indices()` (x2) + `save/load_from_disk` (x100) on selected/shuffled datasets
- Skip dataset verifications by default by @mariosasko in https://github.com/huggingface/datasets/pull/5303
- introduces multiple `verification_mode` values you can pass to `load_dataset()`
- the new default verification steps are much faster (no need to compute expensive checksums)
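For readers unfamiliar with why selected or shuffled datasets have a separate "flatten" step, the toy class below sketches the general idea of an indices mapping: selections are stored as a list of positions over the original rows, and flattening materializes the rows in view order. `IndexedView` is a hypothetical stand-in written for this note and says nothing about how `datasets` stores Arrow tables internally.

```python
class IndexedView:
    """Toy stand-in for a dataset whose rows are reached through an indices mapping."""

    def __init__(self, rows, indices=None):
        self.rows = rows
        self.indices = indices  # None means "rows are already in view order"

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, i):
        return self.rows[i if self.indices is None else self.indices[i]]

    def select(self, positions):
        # Compose with the existing mapping instead of copying any rows.
        base = range(len(self.rows)) if self.indices is None else self.indices
        return IndexedView(self.rows, [base[p] for p in positions])

    def flatten_indices(self):
        # Materialize the rows in view order and drop the mapping.
        return IndexedView([self[i] for i in range(len(self))])


view = IndexedView(["a", "b", "c", "d"]).select([3, 1, 0]).select([2, 0])
flat = view.flatten_indices()
```

Here `view` still points at the four original rows through a two-element mapping, while `flat` holds only the two selected rows and needs no mapping.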
Datasets features
- Single TQDM bar in multi-proc map by @mariosasko in https://github.com/huggingface/datasets/pull/5455
- No more stacked TQDM bars when calling `.map()` in multiprocessing
- Map-style Dataset to IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5410
- introduces `.to_iterable_dataset()` to get an `IterableDataset` from a `Dataset`
- see all the advantages of `IterableDataset` in the documentation about the differences between Dataset and IterableDataset
- Select columns of Dataset or DatasetDict by @daskol in https://github.com/huggingface/datasets/pull/5480
- introduces `.select_columns()` to return a dataset only containing the requested columns
- Added functionality: sort datasets by multiple keys by @MichlF in https://github.com/huggingface/datasets/pull/5502
- introduces `ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])`
- Add JAX device selection when formatting by @alvarobartt in https://github.com/huggingface/datasets/pull/5547
- introduces `ds = ds.with_format("jax", device=device)`
- Reload features from Parquet metadata by @MFreidank in https://github.com/huggingface/datasets/pull/5516
- Speed up batched PyTorch DataLoader by @lhoestq in https://github.com/huggingface/datasets/pull/5512
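The multi-key sort item above takes one direction per key. The following plain-Python sketch shows the ordering such a call describes (first key most significant, each key ascending or descending on its own). The function `sort_rows` is a hypothetical illustration using stable sorting and is not how the library sorts Arrow data.

```python
def sort_rows(rows, keys, reverse):
    """Sort dict rows by several keys, each with its own direction."""
    if len(keys) != len(reverse):
        raise ValueError("keys and reverse must have the same length")
    ordered = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a lexicographic multi-key order.
    for key, descending in reversed(list(zip(keys, reverse))):
        ordered.sort(key=lambda row: row[key], reverse=descending)
    return ordered


rows = [
    {"col_1": 1, "col_2": "b"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 2, "col_2": "a"},
    {"col_1": 1, "col_2": "a"},
]
# col_1 descending, ties broken by col_2 ascending.
result = sort_rows(rows, ["col_1", "col_2"], reverse=[True, False])
```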
Documentation
- Add section in tutorial for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
- Tutorial for creating a dataset by @stevhliu in https://github.com/huggingface/datasets/pull/5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
- Add JAX-formatting documentation by @alvarobartt in https://github.com/huggingface/datasets/pull/5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax
General improvements and bug fixes
- Pin sqlalchemy by @lhoestq in https://github.com/huggingface/datasets/pull/5476
- Update dataset card creation by @stevhliu in https://github.com/huggingface/datasets/pull/5470
- Add num_test_batches option by @amyeroberts in https://github.com/huggingface/datasets/pull/5471
- Tip for recomputing metadata by @stevhliu in https://github.com/huggingface/datasets/pull/5478
- Disable aiohttp requoting of redirection URL by @albertvillanova in https://github.com/huggingface/datasets/pull/5459
- [MINOR] Typo by @cakiki in https://github.com/huggingface/datasets/pull/5491
- Pin dill lower version by @albertvillanova in https://github.com/huggingface/datasets/pull/5489
- Improved error message for gated/private repos by @osanseviero in https://github.com/huggingface/datasets/pull/5497
- Update docs for `nyu_depth_v2` dataset by @awsaf49 in https://github.com/huggingface/datasets/pull/5484
- don't zero copy timestamps by @dwyatte in https://github.com/huggingface/datasets/pull/5504
- Remove unused `load_from_cache_file` arg from `Dataset.shard()` docstring by @polinaeterna in https://github.com/huggingface/datasets/pull/5493
- Do not add index column by default when exporting to CSV by @albertvillanova in https://github.com/huggingface/datasets/pull/5490
- Fix bug when casting empty array to class labels by @marioga in https://github.com/huggingface/datasets/pull/5521
- Fix benchmarks CI - pin protobuf by @lhoestq in https://github.com/huggingface/datasets/pull/5527
- Remove py.typed by @mariosasko in https://github.com/huggingface/datasets/pull/5518
- Add missing license in `NumpyFormatter` by @alvarobartt in https://github.com/huggingface/datasets/pull/5530
- Unify `load_from_cache_file` type and logic by @HallerPatrick in https://github.com/huggingface/datasets/pull/5515
- Format code with `ruff` by @mariosasko in https://github.com/huggingface/datasets/pull/5519
- Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in https://github.com/huggingface/datasets/pull/5522
- Resolve four broken refs in the docs by @tomaarsen in https://github.com/huggingface/datasets/pull/5550
- Use default audio resampling type by @lhoestq in https://github.com/huggingface/datasets/pull/5556
- resampy is no longer needed to resample audio data
- improved message error row formatting by @Plutone11011 in https://github.com/huggingface/datasets/pull/5553
- Make tiktoken tokenizers hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5552
- Suggest scikit-learn instead of sklearn by @osbm in https://github.com/huggingface/datasets/pull/5551
- Add filter desc by @lhoestq in https://github.com/huggingface/datasets/pull/5557
- Fix map suffix_template by @lhoestq in https://github.com/huggingface/datasets/pull/5559
- Ensure last tqdm update in map by @mariosasko in https://github.com/huggingface/datasets/pull/5560
New Contributors
- @amyeroberts made their first contribution in https://github.com/huggingface/datasets/pull/5471
- @awsaf49 made their first contribution in https://github.com/huggingface/datasets/pull/5484
- @dwyatte made their first contribution in https://github.com/huggingface/datasets/pull/5504
- @marioga made their first contribution in https://github.com/huggingface/datasets/pull/5521
- @MFreidank made their first contribution in https://github.com/huggingface/datasets/pull/5516
- @daskol made their first contribution in https://github.com/huggingface/datasets/pull/5480
- @Plutone11011 made their first contribution in https://github.com/huggingface/datasets/pull/5553
- @osbm made their first contribution in https://github.com/huggingface/datasets/pull/5551
- @MichlF made their first contribution in https://github.com/huggingface/datasets/pull/5502
Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...2.10.0
- Python
Published by lhoestq about 3 years ago
datasets - 2.9.0
Datasets Features
- Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
- Pass `num_workers=` to `.to_tf_dataset()` to make your dataset faster with multiprocessing
- Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
- Split your dataset for each node for distributed training
- It supports both `Dataset` and `IterableDataset` (e.g. in streaming mode)
- See the documentation for more details
```python
import os
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```
- Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
- Tqdm progress bar for `to_parquet` by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
- ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
- Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365
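As background for the distributed support item above, the sketch below shows two simple ways a dataset can be divided among `world_size` nodes: a contiguous chunk per rank for data with a known length, and a "keep one example out of every `world_size`" rule for a stream. Both helpers (`chunk_for_rank`, `stream_for_rank`) are hypothetical illustrations, and the exact chunk boundaries and shard handling used by `split_dataset_by_node` may differ.

```python
from itertools import islice


def chunk_for_rank(rows, rank, world_size):
    """Contiguous chunk of a list for one rank (map-style case)."""
    base, extra = divmod(len(rows), world_size)
    # The first `extra` ranks receive one additional row each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return rows[start:stop]


def stream_for_rank(stream, rank, world_size):
    """Keep one example out of every `world_size` from an iterable."""
    return islice(stream, rank, None, world_size)


rows = list(range(10))
chunks = [chunk_for_rank(rows, rank, 3) for rank in range(3)]
strided = [list(stream_for_rank(iter(rows), rank, 3)) for rank in range(3)]
```

In both cases every example is assigned to exactly one rank, which is the property distributed training relies on.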
Documentation
- Depth estimation dataset guide by @sayakpaul in https://github.com/huggingface/datasets/pull/5379
- see https://huggingface.co/docs/datasets/main/en/depth_estimation
- Imagefolder docs: mention support of CSV and ZIP by @lhoestq in https://github.com/huggingface/datasets/pull/5463
- see https://huggingface.co/docs/datasets/main/en/image_load#imagefolder
- Update docs of S3 filesystem with async aiobotocore by @maheshpec in https://github.com/huggingface/datasets/pull/5411
- see https://huggingface.co/docs/datasets/main/en/filesystems#amazon-s3
General improvements and bug fixes
- Raise error if ClassLabel names is not python list by @freddyheppell in https://github.com/huggingface/datasets/pull/5359
- Temporarily pin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5395
- Unpin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5397
- Replace one letter import in docs by @MKhalusova in https://github.com/huggingface/datasets/pull/5403
- Fix Colab notebook link by @albertvillanova in https://github.com/huggingface/datasets/pull/5392
- Fix `fs.open` resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
- Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in https://github.com/huggingface/datasets/pull/5409
- Fix streaming pandas.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/5372
- ci: 🎡 remove two obsolete issue templates by @severo in https://github.com/huggingface/datasets/pull/5420
- Handle 0-dim tensors in `cast_to_python_objects` by @mariosasko in https://github.com/huggingface/datasets/pull/5384
- Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5429
- Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in https://github.com/huggingface/datasets/pull/5432
- Revert container image pin in CI benchmarks by @0x2b3bfa0 in https://github.com/huggingface/datasets/pull/5436
- Finish deprecating the fs argument by @dconathan in https://github.com/huggingface/datasets/pull/5393
- Update actions/checkout in CD Conda release by @albertvillanova in https://github.com/huggingface/datasets/pull/5438
- Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/5416
- Fix documentation about batch samplers by @thomasw21 in https://github.com/huggingface/datasets/pull/5440
- Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5447
- Support fsspec 2023.1.0 in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/5449
- Update share tutorial by @stevhliu in https://github.com/huggingface/datasets/pull/5443
- Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in https://github.com/huggingface/datasets/pull/5452
- Fix base directory while extracting insecure TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/5453
- Fix link in `load_dataset` docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
- Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in https://github.com/huggingface/datasets/pull/5460
- Concatenate on axis=1 with misaligned blocks by @lhoestq in https://github.com/huggingface/datasets/pull/5462
- Raise from disconnect error in xopen by @lhoestq in https://github.com/huggingface/datasets/pull/5382
- remove pathlib.Path with URIs by @jonny-cyberhaven in https://github.com/huggingface/datasets/pull/5466
- Remove deprecated `shard_size` arg from `.push_to_hub()` by @polinaeterna in https://github.com/huggingface/datasets/pull/5469
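The TAR extraction fix listed above concerns archive members whose paths would land outside the destination directory. A common defensive check, sketched here with the standard library, resolves the target path and verifies it stays under the base directory. The helper `is_within_directory` is hypothetical and is not copied from the `datasets` extractor.

```python
import os


def is_within_directory(base_dir, member_name):
    """Return True if extracting `member_name` under `base_dir` stays inside it."""
    base = os.path.realpath(base_dir)
    # os.path.join discards `base` when member_name is absolute, and
    # realpath collapses any ".." components, so escapes become visible.
    target = os.path.realpath(os.path.join(base, member_name))
    return os.path.commonpath([base, target]) == base


safe = is_within_directory("extract_dir", "data/file.txt")
parent_escape = is_within_directory("extract_dir", "../outside.txt")
absolute_escape = is_within_directory("extract_dir", "/etc/passwd")
```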
New Contributors
- @freddyheppell made their first contribution in https://github.com/huggingface/datasets/pull/5359
- @MKhalusova made their first contribution in https://github.com/huggingface/datasets/pull/5403
- @tkukurin made their first contribution in https://github.com/huggingface/datasets/pull/5358
- @0x2b3bfa0 made their first contribution in https://github.com/huggingface/datasets/pull/5436
- @maheshpec made their first contribution in https://github.com/huggingface/datasets/pull/5411
- @dconathan made their first contribution in https://github.com/huggingface/datasets/pull/5393
- @zanussbaum made their first contribution in https://github.com/huggingface/datasets/pull/5456
- @jonny-cyberhaven made their first contribution in https://github.com/huggingface/datasets/pull/5466
Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0
- Python
Published by lhoestq about 3 years ago
datasets - 2.8.0
Important
- Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
- From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to the Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of `datasets` are not able to reload datasets pushed with this new model, so we encourage everyone to update.
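The change described above replaces integer ids with strings in the label name mapping (e.g. 0 -> "0"). The sketch below illustrates that conversion and the reverse step on a plain dictionary. The helper names are hypothetical, and the real YAML model contains more fields than this mapping.

```python
def stringify_label_ids(names_by_id):
    """Convert {0: "neg", 1: "pos"} into {"0": "neg", "1": "pos"}."""
    return {str(label_id): name for label_id, name in names_by_id.items()}


def label_names_in_order(names_by_str_id):
    """Recover the ordered list of label names from string ids."""
    # Sort numerically so that "10" comes after "2".
    return [names_by_str_id[key] for key in sorted(names_by_str_id, key=int)]


old_mapping = {0: "neg", 1: "pos"}
new_mapping = stringify_label_ids(old_mapping)
names = label_names_in_order(new_mapping)
```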
Datasets Features
- Fix methods using `IterableDataset.map` that lead to `features=None` by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
- Datasets in streaming mode now update their `features` after column renaming or removal
- Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239
- Use multiprocessing to load multiple files in parallel
- Add `features` param to `IterableDataset.map` by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
- Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268
- Pass `num_shards` or `max_shard_size` to `ds.save_to_disk()` or `ds.push_to_hub()`
- Pass `num_proc` to use multiprocessing.
- Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252
- Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```
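For the sharded saving item above, `num_shards` and `max_shard_size` are two ways of expressing the same decision. The sketch below shows one plausible way to derive a shard count from a total size and a size limit such as "500MB". The helpers `parse_size` and `shards_needed` are hypothetical, support only decimal KB/MB/GB suffixes, and the library's own parsing and rounding may differ.

```python
import math

# Decimal units only; an assumption made for this illustration.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size):
    """Convert an int or a string like '500MB' to a number of bytes."""
    if isinstance(size, int):
        return size
    for suffix, factor in UNITS.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size: {size!r}")


def shards_needed(dataset_nbytes, max_shard_size):
    """Smallest shard count that keeps every shard under the size limit."""
    return max(1, math.ceil(dataset_nbytes / parse_size(max_shard_size)))


large = shards_needed(1_200_000_000, "500MB")
small = shards_needed(10, "1GB")
```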
Docs
- Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248
General improvements and bug fixes
- typo by @WrRan in https://github.com/huggingface/datasets/pull/5253
- typo by @WrRan in https://github.com/huggingface/datasets/pull/5254
- remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257
- fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256
- Fix `max_shard_size` docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
- Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266
- Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250
- Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279
- Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210
- Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285
- Use correct dataset type in `from_generator` docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
- Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294
- Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297
- Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299
- Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065
- Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211
- Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310
- Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313
- Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319
- [Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320
- Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321
- This was affecting datasets like `wikipedia` or `natural_questions`
- Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328
- Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318
- fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333
- Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329
- Close stream in `ArrowWriter.finalize` before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
- Use same `num_proc` for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
- Set `IterableDataset.map` param `batch_size` typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
- fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234
- Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340
- Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334
- Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341
- Support `topdown` parameter in `xwalk` by @mariosasko in https://github.com/huggingface/datasets/pull/5308
- Improve `use_auth_token` docstring and deprecate `use_auth_token` in `download_and_prepare` by @mariosasko in https://github.com/huggingface/datasets/pull/5302
- Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350
- Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349
- Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344
- Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355
- Raise error for `.tar` archives in the same way as for `.tar.gz` and `.tgz` in `_get_extraction_protocol` by @polinaeterna in https://github.com/huggingface/datasets/pull/5322
- Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356
- ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366
- Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373
- Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375
New Contributors
- @WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253
- @eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318
- @vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234
- @Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373
Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0
Published by lhoestq about 3 years ago
datasets - 2.7.0
Dataset Features
- Multiprocessed dataset builder by @TevenLeScao in https://github.com/huggingface/datasets/pull/5107
- Load big datasets faster than before using multiprocessing:
```python
from datasets import load_dataset
ds = load_dataset("imagenet-1k", num_proc=4)
```
- Make torch.Tensor and spacy models cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/5191
- Function passed to `map` or `filter` that uses tensors or pipelines can now be cached
- Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in https://github.com/huggingface/datasets/pull/5192
- TextConfig: added "errors" by @NightMachinery in https://github.com/huggingface/datasets/pull/5155
Audio setup
- Add ffmpeg4 installation instructions in warnings by @polinaeterna in https://github.com/huggingface/datasets/pull/5167
Docs
- Update create image dataset docs by @stevhliu in https://github.com/huggingface/datasets/pull/5177
- add: segmentation guide. by @sayakpaul in https://github.com/huggingface/datasets/pull/5188
- Reword E2E training and inference tips in the vision guides by @sayakpaul in https://github.com/huggingface/datasets/pull/5217
- Add SQL guide by @stevhliu in https://github.com/huggingface/datasets/pull/5223
General improvements and bug fixes
- Add `pyproject.toml` for `black` by @mariosasko in https://github.com/huggingface/datasets/pull/5125
- Fix `tqdm` zip bug by @david1542 in https://github.com/huggingface/datasets/pull/5120
- Install tensorflow-macos dependency conditionally by @albertvillanova in https://github.com/huggingface/datasets/pull/5124
- [TYPO] Update new_dataset_script.py by @cakiki in https://github.com/huggingface/datasets/pull/5119
- Avoid extra cast in `class_encode_column` by @mariosasko in https://github.com/huggingface/datasets/pull/5130
- Use yaml for issue templates + revamp by @mariosasko in https://github.com/huggingface/datasets/pull/5116
- Update docs once dataset scripts transferred to the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5136
- Delete duplicate issue template file by @albertvillanova in https://github.com/huggingface/datasets/pull/5146
- Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in https://github.com/huggingface/datasets/pull/5142
- Raise ImportError instead of OSError by @ayushthe1 in https://github.com/huggingface/datasets/pull/5141
- Fix CI require beam by @albertvillanova in https://github.com/huggingface/datasets/pull/5168
- Make iter_files deterministic by @albertvillanova in https://github.com/huggingface/datasets/pull/5149
- Add PB and TB in convert_file_size_to_int by @lhoestq in https://github.com/huggingface/datasets/pull/5171
- Reduce default max `writer_batch_size` by @mariosasko in https://github.com/huggingface/datasets/pull/5163
- Support dill 0.3.6 by @albertvillanova in https://github.com/huggingface/datasets/pull/5166
- Make filename matching more robust by @riccardobucco in https://github.com/huggingface/datasets/pull/5128
- Preserve None in list type cast in PyArrow 10 by @mariosasko in https://github.com/huggingface/datasets/pull/5174
- Raise ffmpeg warnings only once by @polinaeterna in https://github.com/huggingface/datasets/pull/5173
- Add "ipykernel" to list of `co_filename`s to remove by @gpucce in https://github.com/huggingface/datasets/pull/5169
- chore: add notebook links to img cls and obj det. by @sayakpaul in https://github.com/huggingface/datasets/pull/5187
- Fix docs about dataset_info in YAML by @albertvillanova in https://github.com/huggingface/datasets/pull/5194
- fsspec lock reset in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5159
- Add note about the name of a dataset script by @polinaeterna in https://github.com/huggingface/datasets/pull/5198
- Deprecate dummy data generation command by @mariosasko in https://github.com/huggingface/datasets/pull/5199
- Do not sort splits in dataset info by @polinaeterna in https://github.com/huggingface/datasets/pull/5201
- Add missing `DownloadConfig.use_auth_token` value by @alvarobartt in https://github.com/huggingface/datasets/pull/5205
- Update canonical links to Hub links by @stevhliu in https://github.com/huggingface/datasets/pull/5203
- Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in https://github.com/huggingface/datasets/pull/5208
- Update github pr docs actions by @mishig25 in https://github.com/huggingface/datasets/pull/5214
- Use hfh hf_hub_url function by @albertvillanova in https://github.com/huggingface/datasets/pull/5196
- Pin `typer` version in tests to <0.5 to fix Windows CI by @polinaeterna in https://github.com/huggingface/datasets/pull/5235
- Fix shards in IterableDataset.from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/5233
- Fix class name of symbolic link by @riccardobucco in https://github.com/huggingface/datasets/pull/5126
- Make `Version` hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5238
- Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in https://github.com/huggingface/datasets/pull/5236
- Encode path only for old versions of hfh by @lhoestq in https://github.com/huggingface/datasets/pull/5237
- Fix CI require_beam maximum compatible dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/5212
- Support hfh rc version by @lhoestq in https://github.com/huggingface/datasets/pull/5241
- Cleaner error tracebacks for dataset script errors by @mariosasko in https://github.com/huggingface/datasets/pull/5240
New Contributors
- @david1542 made their first contribution in https://github.com/huggingface/datasets/pull/5120
- @ayushthe1 made their first contribution in https://github.com/huggingface/datasets/pull/5142
- @gpucce made their first contribution in https://github.com/huggingface/datasets/pull/5169
- @sayakpaul made their first contribution in https://github.com/huggingface/datasets/pull/5187
- @NightMachinery made their first contribution in https://github.com/huggingface/datasets/pull/5155
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.7.0
Published by albertvillanova over 3 years ago
datasets - 2.6.1
Bug fixes
- Fix filter indices when batched by @albertvillanova in https://github.com/huggingface/datasets/pull/5113
- fixed a bug where `filter` could return examples with the wrong indices
- Fix iter_batches by @lhoestq in https://github.com/huggingface/datasets/pull/5115
- fixed a bug where `map` with `batched=True` could return a dataset with fewer examples
- Fix a typo in arrow_dataset.py by @yangky11 in https://github.com/huggingface/datasets/pull/5108
New Contributors
- @yangky11 made their first contribution in https://github.com/huggingface/datasets/pull/5108
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.0...2.6.1
Published by lhoestq over 3 years ago
datasets - 2.6.0
Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in https://github.com/huggingface/datasets/pull/4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
Datasets features
- Add ability to read-write to SQL databases. by @Dref360 in https://github.com/huggingface/datasets/pull/4928
- Read from sqlite file:
```python
from datasets import Dataset
dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```
- Allow connection objects in `from_sql` + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091
```python
from datasets import Dataset
from sqlite3 import connect
con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```
- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in https://github.com/huggingface/datasets/pull/5072
- return numpy/torch/tf/jax tensors with:
```python
from datasets import load_dataset
# split="train" so that ds is a Dataset (indexable by row) rather than a DatasetDict
ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]
```
- Added `IterableDataset.from_generator` by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052
- Fast dataset iter by @mariosasko in https://github.com/huggingface/datasets/pull/5030
- speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in https://github.com/huggingface/datasets/pull/4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add `kwargs` to `Dataset.from_generator` by @mariosasko in https://github.com/huggingface/datasets/pull/5049
- Support `converters` in `CsvBuilder` by @mariosasko in https://github.com/huggingface/datasets/pull/5057
- Restore saved format state in `load_from_disk` by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073
Dataset changes
- Update: hendrycks_test - support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/5041
- Update: swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5019
- Update swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in https://github.com/huggingface/datasets/pull/5022
- Fix: sbu_captions - fix URLs by @donglixp in https://github.com/huggingface/datasets/pull/5020
- Fix: xcsr - fix string features by @albertvillanova in https://github.com/huggingface/datasets/pull/5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in https://github.com/huggingface/datasets/pull/5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in https://github.com/huggingface/datasets/pull/5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in https://github.com/huggingface/datasets/pull/3715
Dataset cards
- Add description to hellaswag dataset by @julien-c in https://github.com/huggingface/datasets/pull/4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5010
- Update languages in aeslc dataset card by @apergo-ai in https://github.com/huggingface/datasets/pull/3357
- Update license to bookcorpus dataset card by @meg-huggingface in https://github.com/huggingface/datasets/pull/3526
- Update paper link in medmcqa dataset card by @monk1337 in https://github.com/huggingface/datasets/pull/4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in https://github.com/huggingface/datasets/pull/5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5054
General improvements and bug fixes
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/5003
- Add some note about running the transformers ci before a release by @lhoestq in https://github.com/huggingface/datasets/pull/5007
- Remove license tag file and validation by @albertvillanova in https://github.com/huggingface/datasets/pull/5004
- Re-apply input columns change by @mariosasko in https://github.com/huggingface/datasets/pull/5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in https://github.com/huggingface/datasets/pull/5026
- Fix typo in error message by @severo in https://github.com/huggingface/datasets/pull/5027
- Fix import in `ClassLabel` docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029
- Remove redundant code from some dataset module factories by @albertvillanova in https://github.com/huggingface/datasets/pull/5033
- Fix typos in load docstrings and comments by @albertvillanova in https://github.com/huggingface/datasets/pull/5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in https://github.com/huggingface/datasets/pull/4985
- Fix tar extraction vuln by @lhoestq in https://github.com/huggingface/datasets/pull/5016
- Support hfh 0.10 implicit auth by @lhoestq in https://github.com/huggingface/datasets/pull/5031
- Fix `flatten_indices` with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in https://github.com/huggingface/datasets/pull/5037
- Revert task removal in folder-based builders by @mariosasko in https://github.com/huggingface/datasets/pull/5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in https://github.com/huggingface/datasets/pull/5055
- Fix typo by @stevhliu in https://github.com/huggingface/datasets/pull/5059
- Fix CI hfh token warning by @albertvillanova in https://github.com/huggingface/datasets/pull/5062
- Mark CI tests as xfail when 502 error by @albertvillanova in https://github.com/huggingface/datasets/pull/5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in https://github.com/huggingface/datasets/pull/5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5067
- Fix header level in Audio docs by @stevhliu in https://github.com/huggingface/datasets/pull/5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in https://github.com/huggingface/datasets/pull/5071
- Support streaming gzip.open by @albertvillanova in https://github.com/huggingface/datasets/pull/5066
- adding keep in memory by @Mustapha-AJEGHRIR in https://github.com/huggingface/datasets/pull/5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in https://github.com/huggingface/datasets/pull/5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in https://github.com/huggingface/datasets/pull/5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5064
- Fix filter with empty indices by @Mouhanedg56 in https://github.com/huggingface/datasets/pull/5087
- Fix tutorial (#5093) by @riccardobucco in https://github.com/huggingface/datasets/pull/5095
- Use HTML relative paths for tiles in the docs by @lewtun in https://github.com/huggingface/datasets/pull/5092
- Fix loading how to guide (#5102) by @riccardobucco in https://github.com/huggingface/datasets/pull/5104
- url encode hub url (#5099) by @riccardobucco in https://github.com/huggingface/datasets/pull/5103
- Free the "hf" filesystem protocol for `hffs` by @lhoestq in https://github.com/huggingface/datasets/pull/5101
- Fix task template reload from dict by @lhoestq in https://github.com/huggingface/datasets/pull/5106
New Contributors
- @Wauplin made their first contribution in https://github.com/huggingface/datasets/pull/5026
- @donglixp made their first contribution in https://github.com/huggingface/datasets/pull/5020
- @Timothyxxx made their first contribution in https://github.com/huggingface/datasets/pull/3715
- @hamid-vakilzadeh made their first contribution in https://github.com/huggingface/datasets/pull/5052
- @Mustapha-AJEGHRIR made their first contribution in https://github.com/huggingface/datasets/pull/5082
- @galbwe made their first contribution in https://github.com/huggingface/datasets/pull/5079
- @rahulXs made their first contribution in https://github.com/huggingface/datasets/pull/5076
- @Mouhanedg56 made their first contribution in https://github.com/huggingface/datasets/pull/5087
- @riccardobucco made their first contribution in https://github.com/huggingface/datasets/pull/5095
- @asofiaoliveira made their first contribution in https://github.com/huggingface/datasets/pull/5073
Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0
Published by lhoestq over 3 years ago
datasets - 2.5.0
Important
- Drop Python 3.6 support by @mariosasko in https://github.com/huggingface/datasets/pull/4460
- Deprecate metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4739
- Metrics are now deprecated and have been moved to evaluate:
```python
# pip install evaluate
import evaluate
metric = evaluate.load("accuracy")
```
- Load GitHub datasets from Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4059
- datasets with no namespace like "squad" were loaded from this GitHub repository, now they're loaded from https://huggingface.co/datasets
- Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
- the latest torchaudio release (0.12) now requires ffmpeg (version 4) to read MP3 files; for now, please downgrade to torchaudio<0.12 or use librosa
- Use HTTP requests to access data and metadata through the Datasets REST API (docs here)
Datasets features
No-code loaders
- Add AudioFolder packaged loader by @polinaeterna in https://github.com/huggingface/datasets/pull/4530
- Add support for CSV metadata files to ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/4837
- Add support for parsing JSON files in array form by @mariosasko in https://github.com/huggingface/datasets/pull/4997
Dataset methods
- add `Dataset.from_list` by @sanderland in https://github.com/huggingface/datasets/pull/4890
- Add `Dataset.from_generator` by @mariosasko in https://github.com/huggingface/datasets/pull/4957
- Add oversampling strategies to interleave datasets by @ylacombe in https://github.com/huggingface/datasets/pull/4831
- Preserve non-`input_columns` in `Dataset.map` if `input_columns` are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971
- Add `fn_kwargs` param to `IterableDataset.map` by @mariosasko in https://github.com/huggingface/datasets/pull/4975
- More rigorous shape inference in to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4763
Parquet support
- Download and prepare as Parquet for cloud storage by @lhoestq in https://github.com/huggingface/datasets/pull/4724
- Shard parquet in download_and_prepare by @lhoestq in https://github.com/huggingface/datasets/pull/4747
- Embed image/audio data in dl_and_prepare parquet by @lhoestq in https://github.com/huggingface/datasets/pull/4987
Datasets changes
- Update: natural questions - Add long answer candidates by @seirasto in https://github.com/huggingface/datasets/pull/4368
- Update: opus_paracrawl - update version by @albertvillanova in https://github.com/huggingface/datasets/pull/4816
- Update: ReCoRD - Include entity positions as feature by @richarddwang in https://github.com/huggingface/datasets/pull/4479
- Update: swda - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4914
- Update: Enwik8 - update broken link and information by @mtanghu in https://github.com/huggingface/datasets/pull/4
- Update: compguesswhat - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4968
- Update: nli_tr - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4970
- Update: IndicGLUE - update download links by @sumanthd17 in https://github.com/huggingface/datasets/pull/4978
- Update: iwslt2017 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4992
- Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4788
- Fix: mkqa - Update data URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4823
- Fix: exams - fix bug and checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/4853
- Fix: trec - use fine classes by @albertvillanova in https://github.com/huggingface/datasets/pull/4801
- Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in https://github.com/huggingface/datasets/pull/4871
- Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4904
- Fix: compguesswhat - fix data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4959
- Fix: vivos - fix data URL and metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4969
- Fix: MBPP - Add splits by @cwarny in https://github.com/huggingface/datasets/pull/4943
Dataset cards
- Add `language_bcp47` tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753
- Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in https://github.com/huggingface/datasets/pull/4701
- Remove "unkown" language tags by @lhoestq in https://github.com/huggingface/datasets/pull/4754
- Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in https://github.com/huggingface/datasets/pull/4712
- Added dataset information in clinic oos dataset card by @Arnav-Ladkat in https://github.com/huggingface/datasets/pull/4751
- Fix opus_gnome dataset card by @gojiteji in https://github.com/huggingface/datasets/pull/4806
- Complete the mlqa dataset card by @eldhoittangeorge in https://github.com/huggingface/datasets/pull/4809
- Fix loading example in opus dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4813
- Add missing language tags to resources by @albertvillanova in https://github.com/huggingface/datasets/pull/4819
- Fix titles in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4824
- Fix language tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4826
- Add license metadata to pg19 by @julien-c in https://github.com/huggingface/datasets/pull/4827
- Fix task tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4830
- Fix tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4832
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4833
- Fix documentation card of recipe_nlg dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4834
- Fix documentation card of ethos dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4835
- Update documentation card of miam dataset by @PierreColombo in https://github.com/huggingface/datasets/pull/4846
- Update stackexchange license by @cakiki in https://github.com/huggingface/datasets/pull/4842
- Update ted_talks_iwslt license to include ND by @cakiki in https://github.com/huggingface/datasets/pull/4841
- Fix documentation card of adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4838
- Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/4867
- Fix license tag and Source Data section in billsum dataset card by @kashif in https://github.com/huggingface/datasets/pull/4851
- Fix documentation card of covid_qa_castorini dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4877
- Fix Citation Information section in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4879
- Fix documentation card of math_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4884
- Added names of less-studied languages by @BenjaminGalliot in https://github.com/huggingface/datasets/pull/4880
- Fix language tags resource file by @albertvillanova in https://github.com/huggingface/datasets/pull/4882
- Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4892
- Add citation information to makhzan dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4894
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4891
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4896
- Re-add code and und language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4899
- Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/4887
- Update GLUE evaluation metadata by @lewtun in https://github.com/huggingface/datasets/pull/4909
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4908
- Add license and citation information to cosmos_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4913
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4921
- Add cc-by-nc-2.0 to list of licenses by @albertvillanova in https://github.com/huggingface/datasets/pull/4930
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4931
- Add Papers with Code ID to scifact dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4941
- Fix license information in qasc dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4951
- Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4940
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4979
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4991
Documentation
- Update map docs by @stevhliu in https://github.com/huggingface/datasets/pull/4743
- Add image classification processing guide by @stevhliu in https://github.com/huggingface/datasets/pull/4748
- Fix train_test_split docs by @NielsRogge in https://github.com/huggingface/datasets/pull/4821
- Update local loading script docs by @stevhliu in https://github.com/huggingface/datasets/pull/4778
- Docs for creating a loading script for image datasets by @stevhliu in https://github.com/huggingface/datasets/pull/4783
- Docs for creating an audio dataset by @stevhliu in https://github.com/huggingface/datasets/pull/4872
General improvements and bug fixes
- Use CI unit/integration tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4738
- Fix multiprocessing in map_nested by @albertvillanova in https://github.com/huggingface/datasets/pull/4740
- Add 2.4.0 version added to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4767
- Update CI badge by @mariosasko in https://github.com/huggingface/datasets/pull/4764
- Fix version in map_nested docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4765
- fix typo by @xwwwwww in https://github.com/huggingface/datasets/pull/4770
- Unpin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4768
- Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4780
- Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in https://github.com/huggingface/datasets/pull/4777
- Remove dummy data generation docs by @stevhliu in https://github.com/huggingface/datasets/pull/4771
- Require torchaudio<0.12.0 in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/4785
- Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/4812
- Fix typo in streaming docs by @flozi00 in https://github.com/huggingface/datasets/pull/4843
- Fix test of _get_extraction_protocol for TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/4850
- Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
- Mark CI tests as xfail if Hub HTTP error by @albertvillanova in https://github.com/huggingface/datasets/pull/4845
- [Windows] Fix Access Denied when using os.rename() by @DougTrajano in https://github.com/huggingface/datasets/pull/4825
- [docs] Some tiny doc tweaks by @julien-c in https://github.com/huggingface/datasets/pull/4874
- Document loading from relative path by @stevhliu in https://github.com/huggingface/datasets/pull/4773
- Fix CI reporting by @albertvillanova in https://github.com/huggingface/datasets/pull/4903
- Add 'val' to VALIDATION_KEYWORDS. by @akt42 in https://github.com/huggingface/datasets/pull/4844
- Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in https://github.com/huggingface/datasets/pull/4901
- feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in https://github.com/huggingface/datasets/pull/4919
- Fixes a typo in loading documentation by @sighingnow in https://github.com/huggingface/datasets/pull/4929
- Remove main branch rename notice by @lhoestq in https://github.com/huggingface/datasets/pull/4938
- Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4939
- Remove deprecated identical_ok by @lhoestq in https://github.com/huggingface/datasets/pull/4937
- Pin TensorFlow temporarily by @albertvillanova in https://github.com/huggingface/datasets/pull/4954
- Fix minor typo in error message for missing imports by @mariosasko in https://github.com/huggingface/datasets/pull/4948
- Fix TF tests for 2.10 by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4956
- fix BLEU metric card by @antoniolanza1996 in https://github.com/huggingface/datasets/pull/4927
- Update doc upload_dataset.mdx by @mishig25 in https://github.com/huggingface/datasets/pull/4789
- Improve features resolution in streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4762
- Fix label renaming and add a battery of tests by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4781
- Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in https://github.com/huggingface/datasets/pull/4967
- Introduce regex check when pushing as well by @LysandreJik in https://github.com/huggingface/datasets/pull/4946
- [doc] Fix broken snippet that had too many quotes by @tomaarsen in https://github.com/huggingface/datasets/pull/4986
- Fix map batched with torch output by @lhoestq in https://github.com/huggingface/datasets/pull/4972
- fix: avoid casting tuples after Dataset.map by @szmoro in https://github.com/huggingface/datasets/pull/4993
- decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
- Don't add a tag on the Hub on release by @lhoestq in https://github.com/huggingface/datasets/pull/4998
- Add EmptyDatasetError by @lhoestq in https://github.com/huggingface/datasets/pull/4999
New Contributors
- @seirasto made their first contribution in https://github.com/huggingface/datasets/pull/4368
- @sbroadhurst-hf made their first contribution in https://github.com/huggingface/datasets/pull/4712
- @nawarhalabi made their first contribution in https://github.com/huggingface/datasets/pull/4701
- @Arnav-Ladkat made their first contribution in https://github.com/huggingface/datasets/pull/4751
- @xwwwwww made their first contribution in https://github.com/huggingface/datasets/pull/4770
- @gojiteji made their first contribution in https://github.com/huggingface/datasets/pull/4806
- @eldhoittangeorge made their first contribution in https://github.com/huggingface/datasets/pull/4809
- @flozi00 made their first contribution in https://github.com/huggingface/datasets/pull/4843
- @fl-lo made their first contribution in https://github.com/huggingface/datasets/pull/4869
- @BenjaminGalliot made their first contribution in https://github.com/huggingface/datasets/pull/4880
- @DougTrajano made their first contribution in https://github.com/huggingface/datasets/pull/4825
- @ylacombe made their first contribution in https://github.com/huggingface/datasets/pull/4831
- @osanseviero made their first contribution in https://github.com/huggingface/datasets/pull/4887
- @akt42 made their first contribution in https://github.com/huggingface/datasets/pull/4844
- @sanderland made their first contribution in https://github.com/huggingface/datasets/pull/4890
- @sighingnow made their first contribution in https://github.com/huggingface/datasets/pull/4929
- @mtanghu made their first contribution in https://github.com/huggingface/datasets/pull/4950
- @antoniolanza1996 made their first contribution in https://github.com/huggingface/datasets/pull/4927
- @apohllo made their first contribution in https://github.com/huggingface/datasets/pull/4967
- @cwarny made their first contribution in https://github.com/huggingface/datasets/pull/4943
- @tomaarsen made their first contribution in https://github.com/huggingface/datasets/pull/4986
- @szmoro made their first contribution in https://github.com/huggingface/datasets/pull/4993
Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0
Published by lhoestq over 3 years ago
datasets - 2.4.0
Dataset Features
- Add `concatenate_datasets` for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500
- Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in https://github.com/huggingface/datasets/pull/4625
- Support using PCM audio files (#4323) by @YooSungHyun in https://github.com/huggingface/datasets/pull/4409
- [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in https://github.com/huggingface/datasets/pull/4633
- Support extract 7-zip compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4672
- Support extract lz4 compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4700
- Support `metadata.jsonl` from parent directories in `imagefolder` by @mariosasko in https://github.com/huggingface/datasets/pull/4576
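The file-disambiguation entry above says split names are matched only when they sit between separators. A minimal sketch of that idea follows; the keyword table, the separator set, and the `infer_split` helper are assumptions made for illustration, not the library's actual patterns.

```python
import re

# Hypothetical keyword table for this sketch; the library's own list may differ.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev", "val"],
    "test": ["test", "testing"],
}

# Assumed separators: any non-alphanumeric character, or a string boundary.
_BEFORE = r"(?:^|[^a-z0-9])"
_AFTER = r"(?:$|[^a-z0-9])"


def infer_split(path: str):
    """Return the split whose keyword appears between separators, else None."""
    name = path.lower()
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            if re.search(_BEFORE + re.escape(keyword) + _AFTER, name):
                return split
    return None


print(infer_split("data/train.csv"))
print(infer_split("contest_results.csv"))
```

Requiring separators on both sides is what keeps a name such as `contest_results.csv` from being mistaken for a test split.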
Dataset changes
- Update: allocine - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4563
- Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4585
- Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4586
- Update: financial_phrasebank - Host data on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4598
- Update: cfq - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4579
- Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4588
- Update: bookcorpus - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4564
- Update: fever - Refactor and add metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4503
- Update: mlsum - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4574
- Fix: catsvsdogs - Update download url and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/4523
- Fix: conll2003 - fix empty example by @lhoestq in https://github.com/huggingface/datasets/pull/4662
- Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in https://github.com/huggingface/datasets/pull/4554
- Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in https://github.com/huggingface/datasets/pull/4706
- Fix: crd3 - fix splits that were containing the same data by @lhoestq in https://github.com/huggingface/datasets/pull/4705
Dataset Cards
- Add action names in schema_guided_dstc8 dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/4559
- Add evaluation data to acronym_identification by @lewtun in https://github.com/huggingface/datasets/pull/4561
- Update WinoBias README by @sashavor in https://github.com/huggingface/datasets/pull/4631
- Support "tags" yaml tag by @lhoestq in https://github.com/huggingface/datasets/pull/4716
- Fix POS tags by @lhoestq in https://github.com/huggingface/datasets/pull/4715
- AESLC dataset: Add summarization tags by @hobson in https://github.com/huggingface/datasets/pull/4517
Documentation
- Update docs around audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4440
- Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in https://github.com/huggingface/datasets/pull/4513
- Remove multiple config section by @stevhliu in https://github.com/huggingface/datasets/pull/4600
- Create new sections for audio and vision in guides by @stevhliu in https://github.com/huggingface/datasets/pull/4519
- Document installation of sox OS dependency for audio by @albertvillanova in https://github.com/huggingface/datasets/pull/4713
General improvements and bug fixes
- Add regression test for `ArrowWriter.write_batch` when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510
- Support all negative values in ClassLabel by @lhoestq in https://github.com/huggingface/datasets/pull/4511
- Add uppercased versions of image file extensions for automatic module inference by @mariosasko in https://github.com/huggingface/datasets/pull/4515
- Patch tests for hfh v0.8.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/4518
- Replace deprecated logging.warn with logging.warning by @hugovk in https://github.com/huggingface/datasets/pull/4539
- [CI] Fix upstream hub test url by @lhoestq in https://github.com/huggingface/datasets/pull/4543
- Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4541
- [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in https://github.com/huggingface/datasets/pull/4546
- Tell users to upload on the hub directly by @lhoestq in https://github.com/huggingface/datasets/pull/4552
- Add `batch_size` parameter when calling `add_faiss_index` and `add_faiss_index_from_external_arrays` by @alvarobartt in https://github.com/huggingface/datasets/pull/4535
- Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4545
- Properly raise FileNotFound even if the dataset is private by @lhoestq in https://github.com/huggingface/datasets/pull/4536
- Fix hashing for python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/4516
- [CI] Fix some warnings by @lhoestq in https://github.com/huggingface/datasets/pull/4547
- Validate new_fingerprint passed by user by @lhoestq in https://github.com/huggingface/datasets/pull/4587
- Update CI Windows orb by @albertvillanova in https://github.com/huggingface/datasets/pull/4604
- Perform hidden file check on relative data file path by @mariosasko in https://github.com/huggingface/datasets/pull/4551
- Align more metadata with other repo types (models,spaces) by @julien-c in https://github.com/huggingface/datasets/pull/4607
- Align/fix license metadata info by @julien-c in https://github.com/huggingface/datasets/pull/4613
- Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in https://github.com/huggingface/datasets/pull/4611
- Add authentication tip to `load_dataset` by @mariosasko in https://github.com/huggingface/datasets/pull/4577
- Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4553
- fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in https://github.com/huggingface/datasets/pull/4630
- Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in https://github.com/huggingface/datasets/pull/4608
- Rename master to main by @lhoestq in https://github.com/huggingface/datasets/pull/4643
- Set HF_SCRIPTS_VERSION to main by @lhoestq in https://github.com/huggingface/datasets/pull/4645
- [Minor fix] Typo correction by @cakiki in https://github.com/huggingface/datasets/pull/4644
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in https://github.com/huggingface/datasets/pull/4627
- Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4590
- Fix time type `_arrow_to_datasets_dtype` conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628
- Fix resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in https://github.com/huggingface/datasets/pull/4660
- Replace `assertEqual` with `assertTupleEqual` in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496
- Fix `embed_storage` on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615
- Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in https://github.com/huggingface/datasets/pull/4512
- Transfer CI to GitHub Actions by @albertvillanova in https://github.com/huggingface/datasets/pull/4659
- Fix mock fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/4685
- Trigger CI also on push to main by @albertvillanova in https://github.com/huggingface/datasets/pull/4687
- Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in https://github.com/huggingface/datasets/pull/4622
- Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4688
- Test extractors for all compression formats by @albertvillanova in https://github.com/huggingface/datasets/pull/4689
- Refactor base extractors by @albertvillanova in https://github.com/huggingface/datasets/pull/4690
- Update create dataset card docs by @stevhliu in https://github.com/huggingface/datasets/pull/4683
- Add text decorators by @stevhliu in https://github.com/huggingface/datasets/pull/4663
- Skip tests only for lz4/zstd params if not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4704
- Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in https://github.com/huggingface/datasets/pull/4614
- Docs: Fix same-page hashlinks by @mishig25 in https://github.com/huggingface/datasets/pull/4722
- Fix broken link to the Hub by @stevhliu in https://github.com/huggingface/datasets/pull/4726
- Refactor conftest fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/4723
- Add object detection processing tutorial by @nateraw in https://github.com/huggingface/datasets/pull/4710
- Fix require torchaudio and refactor test requirements by @albertvillanova in https://github.com/huggingface/datasets/pull/4708
- docs: ✏️ fix TranslationVariableLanguages example by @severo in https://github.com/huggingface/datasets/pull/4731
- Pin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4735
- Fix named split sorting and remove unnecessary casting by @albertvillanova in https://github.com/huggingface/datasets/pull/4714
- Make cast in `from_pandas` more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703
- Make Extractor accept Path as input by @albertvillanova in https://github.com/huggingface/datasets/pull/4718
- Refactor Hub tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4729
- Fix to dict conversion of `DatasetInfo`/`Features` by @mariosasko in https://github.com/huggingface/datasets/pull/4741
New Contributors
- @hugovk made their first contribution in https://github.com/huggingface/datasets/pull/4539
- @VijayKalmath made their first contribution in https://github.com/huggingface/datasets/pull/4545
- @gugarosa made their first contribution in https://github.com/huggingface/datasets/pull/4630
- @benlipkin made their first contribution in https://github.com/huggingface/datasets/pull/4627
- @YooSungHyun made their first contribution in https://github.com/huggingface/datasets/pull/4409
- @hobson made their first contribution in https://github.com/huggingface/datasets/pull/4517
- @khushmeeet made their first contribution in https://github.com/huggingface/datasets/pull/4554
- @dtuit made their first contribution in https://github.com/huggingface/datasets/pull/4614
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0
Published by lhoestq over 3 years ago
datasets - 2.3.2
Bug fixes
- Fix double dots in data files by @lhoestq in https://github.com/huggingface/datasets/pull/4505
  - fix a bug when `/../` is passed to `data_files` causing FileNotFoundError
- fix ETT m1/m2 test/val dataset by @kashif in https://github.com/huggingface/datasets/pull/4499
- Corrected broken links in doc by @clefourrier in https://github.com/huggingface/datasets/pull/4501
New Contributors
- @clefourrier made their first contribution in https://github.com/huggingface/datasets/pull/4501
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.1...2.3.2
Published by lhoestq over 3 years ago
datasets - 2.3.1
Bug fixes
- Fix patching module that doesn't exist by @lhoestq in https://github.com/huggingface/datasets/pull/4495
  - fix bug when importing the lib when scipy is not installed
- Re-add download_manager module in utils by @lhoestq in https://github.com/huggingface/datasets/pull/4497
  - fix moved imports of `DownloadConfig`, `DownloadMode`, `DownloadManager`
- Support streaming UDHR dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4487
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.0...2.3.1
Published by lhoestq over 3 years ago
datasets - 2.3.0
Datasets Changes
- New: ImageNet-Sketch by @nateraw in https://github.com/huggingface/datasets/pull/4301
- New: Biwi Kinect Head Pose by @dnaveenr in https://github.com/huggingface/datasets/pull/3903
- New: enwik8 by @HallerPatrick in https://github.com/huggingface/datasets/pull/4321
- New: LCCC dataset by @silverriver in https://github.com/huggingface/datasets/pull/4416
- New: TruthfulQA by @jon-tow in https://github.com/huggingface/datasets/pull/4159
- New: BIG-bench by @andersjohanandreassen in https://github.com/huggingface/datasets/pull/4125
- New: QuickDraw by @mariosasko in https://github.com/huggingface/datasets/pull/3592
- New: SST-2 by @albertvillanova in https://github.com/huggingface/datasets/pull/4473
- Update: imagenet-1k - remove manual download by @mariosasko in https://github.com/huggingface/datasets/pull/4299
  - ImageNet can now be loaded in Python with `load_dataset` without requiring a manual download!
  - It also supports streaming mode with `load_dataset("imagenet-1k", streaming=True)`
- Update: spider - Remove Google Drive URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4410
- Update: blended_skill_talk - add missing columns by @mariosasko in https://github.com/huggingface/datasets/pull/4437
- Update: multi-news - Use newer version with fixes by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4451
- Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4455
- Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4459
- Update: udhr - update metadata by @leondz in https://github.com/huggingface/datasets/pull/4362
- Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4469
- Update: PASS - update dataset version by @mariosasko in https://github.com/huggingface/datasets/pull/4488
- Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in https://github.com/huggingface/datasets/pull/4389
- Fix: GEM - fix URL for totto config by @albertvillanova in https://github.com/huggingface/datasets/pull/4396
- Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in https://github.com/huggingface/datasets/pull/4424
- Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in https://github.com/huggingface/datasets/pull/4425
- Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in https://github.com/huggingface/datasets/pull/4436
- Fix: iwslt2017 by @lhoestq in https://github.com/huggingface/datasets/pull/4481
Dataset Features
- to_tf_dataset rewrite by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4170
  - see more in the documentation
- Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4375
  - see more in the documentation
- Added stratify option to `train_test_split` by @nandwalritik in https://github.com/huggingface/datasets/pull/4322
- Re-add support for Apache Beam functionality by @albertvillanova in https://github.com/huggingface/datasets/pull/4328
- Resume `push_to_hub`: skip identical files in `push_to_hub` instead of overwriting by @mariosasko in https://github.com/huggingface/datasets/pull/4402
- Support nested/complex feature types as `features` in packaged loaders by @mariosasko in https://github.com/huggingface/datasets/pull/4364
- Optimize contiguous shard and select by @lhoestq in https://github.com/huggingface/datasets/pull/4466
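To illustrate what the stratify option in the list above is meant to guarantee, here is a pure-Python sketch of a stratified split. It is not the library's implementation: the `stratified_split` helper and its rounding rule are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) keeping per-class proportions."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumed rounding rule: each class sends round(n * test_size) examples to test.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["pos"] * 8 + ["neg"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
print(len(train_idx), len(test_idx))
```

In the library itself the option is, to the best of our knowledge, exposed as `stratify_by_column` on `Dataset.train_test_split` and expects a `ClassLabel` column; check the documentation for the exact signature.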
Dataset Cards
- Minor fixes/improvements in `scene_parse_150` card by @mariosasko in https://github.com/huggingface/datasets/pull/4447
- Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in https://github.com/huggingface/datasets/pull/4378
- Fix example in opus_ubuntu, Add license info by @leondz in https://github.com/huggingface/datasets/pull/4360
- Update README.md of fquad by @lhoestq in https://github.com/huggingface/datasets/pull/4450
Documentation
- Add API code examples for loading methods by @stevhliu in https://github.com/huggingface/datasets/pull/4300
- Add API code examples for remaining main classes by @stevhliu in https://github.com/huggingface/datasets/pull/4292
- Generalize tutorials for audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4468
- [Docs] How to use with PyTorch page by @lhoestq in https://github.com/huggingface/datasets/pull/4474
- First draft of the docs for TF + Datasets by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4457
Other improvements and bug fixes
- Update CI deprecated legacy image by @albertvillanova in https://github.com/huggingface/datasets/pull/4393
- remove int documentation from logging docs by @lvwerra in https://github.com/huggingface/datasets/pull/4392
- Fix docstring in DatasetDict::shuffle by @felixdivo in https://github.com/huggingface/datasets/pull/4344
- Fix Version equality by @albertvillanova in https://github.com/huggingface/datasets/pull/4359
- Set builder name from module instead of class by @albertvillanova in https://github.com/huggingface/datasets/pull/4388
- Test dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4385
- Refactor download by @albertvillanova in https://github.com/huggingface/datasets/pull/4384
- Fix dependency on dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/4397
- Support remote cache_dir by @albertvillanova in https://github.com/huggingface/datasets/pull/4347
- Update imagenet gate by @lhoestq in https://github.com/huggingface/datasets/pull/4408
- Fix dataset builder default version by @albertvillanova in https://github.com/huggingface/datasets/pull/4356
- Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in https://github.com/huggingface/datasets/pull/4403
- Rename DatasetBuilder config_name by @albertvillanova in https://github.com/huggingface/datasets/pull/4414
- Fix metadata validation by @albertvillanova in https://github.com/huggingface/datasets/pull/4390
- Add HF.co for PRs/Issues for specific datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4427
- Fix type hint and documentation for `new_fingerprint` by @fxmarty in https://github.com/huggingface/datasets/pull/4326
- Skip hidden files/directories in data files resolution and `iter_files` by @mariosasko in https://github.com/huggingface/datasets/pull/4412
- Fix docstring of inspect_dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4438
- Fix builder docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4432
- Fix kwargs in docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4444
- Fix missing args in docstring of load_dataset_builder by @albertvillanova in https://github.com/huggingface/datasets/pull/4445
- Add missing kwargs to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4446
- Add extractor for bzip2-compressed files by @asivokon in https://github.com/huggingface/datasets/pull/4421
- Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in https://github.com/huggingface/datasets/pull/4434
- Update `dataset_infos.json` with new split info in `Dataset.push_to_hub` to avoid verification error by @mariosasko in https://github.com/huggingface/datasets/pull/4415
- Update builder docstring for deprecated/added arguments by @albertvillanova in https://github.com/huggingface/datasets/pull/4429
- Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/4464
- Fix script fetching and local path handling in `inspect_dataset` and `inspect_metric` by @mariosasko in https://github.com/huggingface/datasets/pull/4433
- Fix bigbench config names by @lhoestq in https://github.com/huggingface/datasets/pull/4465
- Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in https://github.com/huggingface/datasets/pull/4472
- Reorder returned validation/test splits in script template by @albertvillanova in https://github.com/huggingface/datasets/pull/4470
- Better ImportError message when a dataset script dependency is missing by @lhoestq in https://github.com/huggingface/datasets/pull/4484
- Fix cast to null by @lhoestq in https://github.com/huggingface/datasets/pull/4485
- Update `_format_columns` in `remove_columns` by @alvarobartt in https://github.com/huggingface/datasets/pull/4411
- Fix wrong map parameter name in cache docs by @h4iku in https://github.com/huggingface/datasets/pull/4293
- Pin the revision in imagenet download links by @lhoestq in https://github.com/huggingface/datasets/pull/4492
- Refactor column mappings for question answering datasets by @lewtun in https://github.com/huggingface/datasets/pull/4391
New Contributors
- @leondz made their first contribution in https://github.com/huggingface/datasets/pull/4378
- @felixdivo made their first contribution in https://github.com/huggingface/datasets/pull/4344
- @nandwalritik made their first contribution in https://github.com/huggingface/datasets/pull/4322
- @fxmarty made their first contribution in https://github.com/huggingface/datasets/pull/4326
- @HallerPatrick made their first contribution in https://github.com/huggingface/datasets/pull/4321
- @silverriver made their first contribution in https://github.com/huggingface/datasets/pull/4416
- @asivokon made their first contribution in https://github.com/huggingface/datasets/pull/4421
- @andersjohanandreassen made their first contribution in https://github.com/huggingface/datasets/pull/4125
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.2...2.3.0
Published by lhoestq over 3 years ago
datasets - 2.2.2
Datasets fixes
- Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4377
- Fix: CC-Aligned - fix invalid url by @juntang-zhuang in https://github.com/huggingface/datasets/pull/4231
- Fix: multi_news - don't strip proceeding hyphen by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4353
Bug fixes
- Support lists of multi-dimensional numpy arrays by @albertvillanova in https://github.com/huggingface/datasets/pull/4194
- Check if dataset features match before push in `DatasetDict.push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/4372
- Pin dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4380
  - dill 0.3.5 has some issues in `transformers` - pinning the version to `<0.3.5` for now
Dataset Cards
- Adding eval metadata for ade v2 by @sashavor in https://github.com/huggingface/datasets/pull/4319
- Adding eval metadata for AG News by @sashavor in https://github.com/huggingface/datasets/pull/4329
- Adding eval metadata to Allociné dataset by @sashavor in https://github.com/huggingface/datasets/pull/4330
- Adding eval metadata to Amazon Polarity by @sashavor in https://github.com/huggingface/datasets/pull/4331
- Adding eval metadata for arabic speech corpus by @sashavor in https://github.com/huggingface/datasets/pull/4332
- Adding eval metadata for Banking 77 by @sashavor in https://github.com/huggingface/datasets/pull/4333
- Eval metadata Batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in https://github.com/huggingface/datasets/pull/4338
- Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in https://github.com/huggingface/datasets/pull/4337
- Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in https://github.com/huggingface/datasets/pull/4335
- Eval metadata batch 2 : Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in https://github.com/huggingface/datasets/pull/4336
Docs
- Add API code examples for Builder classes by @stevhliu in https://github.com/huggingface/datasets/pull/4313
- Add redirect to dataset script in the repo structure page by @lhoestq in https://github.com/huggingface/datasets/pull/4369
Other improvements and bug fixes
- Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4342
- Fix never ending GH Action to build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/4345
- Fix warning in upload_file by @albertvillanova in https://github.com/huggingface/datasets/pull/4355
- Fix warning in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4357
- Remove config names as yaml keys by @lhoestq in https://github.com/huggingface/datasets/pull/4367
- Add missing language tags for udhr dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4371
- Remove links in docs to old dataset viewer by @mariosasko in https://github.com/huggingface/datasets/pull/4373
New Contributors
- @JohnGiorgi made their first contribution in https://github.com/huggingface/datasets/pull/4353
- @juntang-zhuang made their first contribution in https://github.com/huggingface/datasets/pull/4231
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.1...2.2.2
Published by lhoestq almost 4 years ago
datasets - 2.2.1
Datasets bug fixes
- Fix cnn_dailymail (dm stories were ignored) by @lhoestq in https://github.com/huggingface/datasets/pull/4317
  - `datasets` 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset
General improvements and bug fixes
- Fix: Add missing comma by @mrm8488 in https://github.com/huggingface/datasets/pull/4303
- Catch pull error when mirroring by @lhoestq in https://github.com/huggingface/datasets/pull/4314
- Remove unused multiprocessing args from test CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/4308
- Fix CLI run_beam namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/4315
- Support passing config_kwargs to CLI run_beam by @albertvillanova in https://github.com/huggingface/datasets/pull/4316
- Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in https://github.com/huggingface/datasets/pull/4318
New Contributors
- @mrm8488 made their first contribution in https://github.com/huggingface/datasets/pull/4303
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.0...2.2.1
Published by lhoestq almost 4 years ago
datasets - 2.2.0
Dataset Changes
- New: ImageNet by @apsdehal in https://github.com/huggingface/datasets/pull/4178
  - Manual download only for now
- New: Google Conceptual Captions by @abhishekkrthakur in https://github.com/huggingface/datasets/pull/1459
- New: Conceptual 12M by @thomasw21 in https://github.com/huggingface/datasets/pull/4162
- New: Visual Genome by @thomasw21 in https://github.com/huggingface/datasets/pull/4161
- New: RVL-CDIP by @dnaveenr in https://github.com/huggingface/datasets/pull/4050
- New: Text-based NP Enrichment (TNE) by @yanaiela in https://github.com/huggingface/datasets/pull/4153
- New: TextVQA by @apsdehal in https://github.com/huggingface/datasets/pull/3967
- New: ETT time series dataset by @kashif in https://github.com/huggingface/datasets/pull/4213
- Update: assin2 - update metadata by @lhoestq in https://github.com/huggingface/datasets/pull/4172
- Update: Librispeech - Add 'all' config by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4184
- Update: XGLUE - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4249
- Update: crd3 - group all the turns in one example by @shanyas10 in https://github.com/huggingface/datasets/pull/4240
- Update: pubmed_qa - Remove google drive URL by @lhoestq in https://github.com/huggingface/datasets/pull/4255
- Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4254
- Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in https://github.com/huggingface/datasets/pull/4267
- Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4236
- Update: openbookqa - Add missing features for additional config by @albertvillanova in https://github.com/huggingface/datasets/pull/4278
- Update: commonsense_qa - Add missing features by @albertvillanova in https://github.com/huggingface/datasets/pull/4280
- Fix: Common Voice - Make sure bytes are correctly deleted if `path` exists by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4212
- Fix: openbookqa - fix bug in choices labels by @manandey in https://github.com/huggingface/datasets/pull/4259
- Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4270
Dataset Features
- Add support for metadata files to
imagefolderby @mariosasko in https://github.com/huggingface/datasets/pull/4069- load a folder of images and metadata stored in
metadata.jsonl, more info in the documentation on how to load an image dataset
- load a folder of images and metadata stored in
- Infer splits from the
data_dirparameter when loading datasets without script by @polinaeterna in https://github.com/huggingface/datasets/pull/4144- splits are inferred from the directory and file names, see more info in the documentation on how to structure your repository
- Enable label alignment for token classification datasets by @lewtun in https://github.com/huggingface/datasets/pull/4277
- Add `drop_last_batch` to `IterableDataset.map` by @mariosasko in https://github.com/huggingface/datasets/pull/4215
- Load dataset with TSV files by @albertvillanova in https://github.com/huggingface/datasets/pull/4246
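
As a rough illustration of the `metadata.jsonl` layout mentioned above, the sketch below writes one JSON object per line, each linking an image to extra columns through a `file_name` key. The folder name, the image file names, and the `caption` column are made up for the example. The loading helper assumes `datasets>=2.2.0` with Pillow installed; it is only defined here, not called.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(folder: Path, rows: list) -> Path:
    """Write one JSON object per line to <folder>/metadata.jsonl."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def load_image_folder(folder: Path):
    """Load the folder with the packaged `imagefolder` builder (not run here)."""
    from datasets import load_dataset  # assumes datasets>=2.2.0 and Pillow

    return load_dataset("imagefolder", data_dir=str(folder))


# Hypothetical rows: `file_name` links each line to an image in the same folder;
# `caption` is an arbitrary extra column chosen for this example.
rows = [
    {"file_name": "0001.png", "caption": "a cat on a sofa"},
    {"file_name": "0002.png", "caption": "a dog in a park"},
]
metadata_path = write_metadata(Path(tempfile.mkdtemp()) / "train", rows)
```

Once the images referenced by `file_name` sit next to the metadata file, calling `load_image_folder` on that folder should expose the extra columns alongside the `image` column.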
Dataset Cards
- Autoeval config by @nrajani in https://github.com/huggingface/datasets/pull/4234
  - Add `train-eval-index` metadata to automate evaluation on your datasets based on their tasks
- Adding license information for Openbookcorpus by @meg-huggingface in https://github.com/huggingface/datasets/pull/3525
- Make code for image downloading from image urls cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/4218
- Fix description links in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4222
- Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in https://github.com/huggingface/datasets/pull/4262
- Remove a copy-paste sentence in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4281
- Update LexGLUE README.md by @iliaschalkidis in https://github.com/huggingface/datasets/pull/4285
- Leaderboard info added for TNE by @yanaiela in https://github.com/huggingface/datasets/pull/4273
- Add Lahnda language tag by @mariosasko in https://github.com/huggingface/datasets/pull/4286
- Add license and point of contact to big_patent dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4269
- Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4266
Metrics Changes
- Perplexity Speedup by @emibaylor in https://github.com/huggingface/datasets/pull/4108
- Add AUC ROC Metric by @emibaylor in https://github.com/huggingface/datasets/pull/4158
- Small fixes in ROC AUC docs by @wschella in https://github.com/huggingface/datasets/pull/4239
- Fix/start token mask issue and update documentation by @TristanThrush in https://github.com/huggingface/datasets/pull/4258
- Add pearsonr mc, update functionality to match the original docs by @emibaylor in https://github.com/huggingface/datasets/pull/4226
Metric Cards
- Metric card for the XTREME-S dataset by @sashavor in https://github.com/huggingface/datasets/pull/4251
- Creating metric card for MAE by @sashavor in https://github.com/huggingface/datasets/pull/4252
- Create metric cards for mean IOU by @sashavor in https://github.com/huggingface/datasets/pull/4253
- Create metric card for Mahalanobis Distance by @sashavor in https://github.com/huggingface/datasets/pull/4257
- Create metric card for MSE by @sashavor in https://github.com/huggingface/datasets/pull/4256
- Fix exact match by @emibaylor in https://github.com/huggingface/datasets/pull/4166
- Fix google bleu typos, examples by @emibaylor in https://github.com/huggingface/datasets/pull/4165
- Add f1 metric card, update docstring in py file by @emibaylor in https://github.com/huggingface/datasets/pull/4227
- Add Recall Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4204
- Matthews Correlation Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4110
- Add Precision Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4203
- Add Accuracy Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4223
- Add Spearmanr Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4109
- Metric card template by @emibaylor in https://github.com/huggingface/datasets/pull/3915
Documentation
- Document save_to_disk and push_to_hub on images and audio files by @lhoestq in https://github.com/huggingface/datasets/pull/4193
- Add to docs how to load from local script by @albertvillanova in https://github.com/huggingface/datasets/pull/4200
- Add code examples to API docs by @stevhliu in https://github.com/huggingface/datasets/pull/4168
- Add code examples for DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/4245
- Add API code examples for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/4274
- Add packaged builder configs to the documentation by @lhoestq in https://github.com/huggingface/datasets/pull/4307
- [Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in https://github.com/huggingface/datasets/pull/4311
General improvements and bug fixes
- Generate tasks.json taxonomy from `huggingface_hub` by @julien-c in https://github.com/huggingface/datasets/pull/4154
- Fix when map function modifies input in-place by @thomasw21 in https://github.com/huggingface/datasets/pull/4174
- Support streaming cnn_dailymail dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4188
- Don't duplicate data when encoding audio or image by @lhoestq in https://github.com/huggingface/datasets/pull/4187
- Fix outdated docstring about default dataset config by @lhoestq in https://github.com/huggingface/datasets/pull/4186
- Deprecate `shard_size` in `push_to_hub` in favor of `max_shard_size` by @mariosasko in https://github.com/huggingface/datasets/pull/4190
- Fix some type annotation in doc by @thomasw21 in https://github.com/huggingface/datasets/pull/4202
- Update GH template for dataset viewer issues by @albertvillanova in https://github.com/huggingface/datasets/pull/4201
- Update auth when mirroring datasets on the hub by @lhoestq in https://github.com/huggingface/datasets/pull/4242
- Rename imagenet2012 -> imagenet-1k by @lhoestq in https://github.com/huggingface/datasets/pull/4263
- Skip checksum computation in Imagefolder by default by @mariosasko in https://github.com/huggingface/datasets/pull/4214
- Fix `convert_file_size_to_int` for kilobits and megabits by @mariosasko in https://github.com/huggingface/datasets/pull/4205
- Fix typo in logging docs by @stevhliu in https://github.com/huggingface/datasets/pull/4272
- Bump PyArrow Version to 6 by @dnaveenr in https://github.com/huggingface/datasets/pull/4250
- task id update by @nrajani in https://github.com/huggingface/datasets/pull/4244
- Avoid recursion error in map if example is returned as dict value by @mariosasko in https://github.com/huggingface/datasets/pull/4216
- Update minimal PyArrow version warning by @mariosasko in https://github.com/huggingface/datasets/pull/4279
- [Minor edit] Fix typo in class name by @cakiki in https://github.com/huggingface/datasets/pull/4207
- Stream private zipped images by @lhoestq in https://github.com/huggingface/datasets/pull/4173
- Fix filesystem docstring by @stevhliu in https://github.com/huggingface/datasets/pull/4283
- Document how to use FAISS index for special operations by @albertvillanova in https://github.com/huggingface/datasets/pull/4189
- Contributing MedMCQA dataset by @monk1337 in https://github.com/huggingface/datasets/pull/4064
- Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in https://github.com/huggingface/datasets/pull/4282
- Fix missing lz4 dependency for tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4295
- Altered faiss installation comment by @vishalsrao in https://github.com/huggingface/datasets/pull/4220
- Fix CLI run_beam save_infos by @albertvillanova in https://github.com/huggingface/datasets/pull/4294
- Add missing `faiss` import to fix https://github.com/huggingface/datasets/issues/4287 by @alvarobartt in https://github.com/huggingface/datasets/pull/4288
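
Two entries above deal with size strings: `max_shard_size` in `push_to_hub`, and the distinction between bits and bytes in size parsing. The sketch below shows one plausible way such strings can be turned into a byte count. The helper name `parse_size` is hypothetical, and the rule that a lowercase trailing `b` means bits is an assumption of this example, not a description of the library's code.

```python
def parse_size(size) -> int:
    """Convert sizes such as "500MB", "1GiB" or "8Kb" to a number of bytes."""
    if isinstance(size, int):
        return size
    binary = {"KIB": 2**10, "MIB": 2**20, "GIB": 2**30}
    decimal = {"KB": 10**3, "MB": 10**6, "GB": 10**9}
    upper = size.upper()
    # Binary suffixes are checked first so "1MiB" is not mistaken for "MB".
    for suffix, factor in binary.items():
        if upper.endswith(suffix):
            return int(size[: -len(suffix)]) * factor
    for suffix, factor in decimal.items():
        if upper.endswith(suffix):
            n_units = int(size[: -len(suffix)]) * factor
            # Assumption: a lowercase trailing "b" denotes bits, so divide by 8.
            return n_units // 8 if size.endswith("b") else n_units
    raise ValueError(f"Unrecognized size: {size!r}")
```

With this reading, a call such as `ds.push_to_hub(..., max_shard_size="500MB")` would target shards of roughly `parse_size("500MB")` bytes each.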
New Contributors
- @shanyas10 made their first contribution in https://github.com/huggingface/datasets/pull/4240
- @apsdehal made their first contribution in https://github.com/huggingface/datasets/pull/4178
- @wschella made their first contribution in https://github.com/huggingface/datasets/pull/4239
- @TristanThrush made their first contribution in https://github.com/huggingface/datasets/pull/4258
- @yanaiela made their first contribution in https://github.com/huggingface/datasets/pull/4153
- @mo6zes made their first contribution in https://github.com/huggingface/datasets/pull/4262
- @nrajani made their first contribution in https://github.com/huggingface/datasets/pull/4244
- @sanchit-gandhi made their first contribution in https://github.com/huggingface/datasets/pull/4266
- @cakiki made their first contribution in https://github.com/huggingface/datasets/pull/4207
- @monk1337 made their first contribution in https://github.com/huggingface/datasets/pull/4064
- @alvarobartt made their first contribution in https://github.com/huggingface/datasets/pull/4288
Full Changelog: https://github.com/huggingface/datasets/compare/2.1.0...2.2.0
Published by lhoestq almost 4 years ago
datasets - 2.1.0
Datasets Changes
- New: initial monash time series forecasting by @kashif in https://github.com/huggingface/datasets/pull/3743
- New: Roman Urdu Hate Speech dataset by @bp-high in https://github.com/huggingface/datasets/pull/3972
- New: Adversarial GLUE by @jxmorris12 in https://github.com/huggingface/datasets/pull/3849
- New: MetaShift by @dnaveenr in https://github.com/huggingface/datasets/pull/3900
- New: GSM8K by @jon-tow in https://github.com/huggingface/datasets/pull/4103
- New: SBU Captions Photo by @thomasw21 in https://github.com/huggingface/datasets/pull/4130
- Deprecated: Multilingual Librispeech - deprecate dataset in favor of `facebook/multilingual_librispeech` by @polinaeterna in https://github.com/huggingface/datasets/pull/4060
- Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in https://github.com/huggingface/datasets/pull/4145
- Update: Wikipedia by @albertvillanova in https://github.com/huggingface/datasets/pull/3821 and https://github.com/huggingface/datasets/pull/3989
- Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4002
- Update: daily_dialog - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4008
- Update: id_clickbait - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4014
- Update: blimp - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4016
- Update: scan - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4017
- Update: yelp_review_full - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4018
- Update: yelp_polarity - Support streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4019
- Update: amazon_polarity - Replace data URL by @lhoestq in https://github.com/huggingface/datasets/pull/4020
- Update: dbpedia_14 - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4022
- Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in https://github.com/huggingface/datasets/pull/4026
- Update: yahoo_answers_topics - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4023
- Update: ASSIN 2 dataset - replace broken Google Drive URLs by links on GitHub by @ruanchaves in https://github.com/huggingface/datasets/pull/4004
- Update: xcopa - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4039
- Update: medical_dialog - Add configs with processed data by @albertvillanova in https://github.com/huggingface/datasets/pull/4127
- Update: xtreme - Support streaming for udpos config by @albertvillanova in https://github.com/huggingface/datasets/pull/4131
- Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4132
- Update: xtreme - Support streaming for PAN-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4135
- Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in https://github.com/huggingface/datasets/pull/4030
- Update: HANS - Support streaming by @mariosasko in https://github.com/huggingface/datasets/pull/4155
- Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4033
- Fix: xcopa - fix null checksum by @albertvillanova in https://github.com/huggingface/datasets/pull/4034
- Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in https://github.com/huggingface/datasets/pull/4092
Dataset Cards
- Updated annotations for nli_tr dataset by @e-budur in https://github.com/huggingface/datasets/pull/4058
- Add missing label for emotion description by @lijiazheng99 in https://github.com/huggingface/datasets/pull/4151
- Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in https://github.com/huggingface/datasets/pull/3955
- Improve RedCaps dataset card by @mariosasko in https://github.com/huggingface/datasets/pull/4100
- Fix duplicate key in multi_news by @lhoestq in https://github.com/huggingface/datasets/pull/4164
Datasets Tags and Search on the Hugging Face Hub
- Tasks alignment with models by @lhoestq in https://github.com/huggingface/datasets/pull/4066
- Update datasets task tags to align tags with models by @lhoestq in https://github.com/huggingface/datasets/pull/4067
Metrics Changes
- Xtreme-S Metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3799
- Fix xtreme s metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3957
- Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in https://github.com/huggingface/datasets/pull/3938
- Add exact match metric by @emibaylor in https://github.com/huggingface/datasets/pull/3899
- Fix comet metric by @lhoestq in https://github.com/huggingface/datasets/pull/3945
- Add zero_division argument to precision and recall metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4035
- Support float data types in pearsonr/spearmanr metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4054
- Remove GLEU metric by @emibaylor in https://github.com/huggingface/datasets/pull/3949
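
To make the `zero_division` entry above concrete: precision is undefined when no positive predictions are made, and the argument chooses what to return in that case. The function below is a standalone sketch with hypothetical names, not the metric's implementation.

```python
def precision(predictions, references, positive=1, zero_division=0.0):
    """Binary precision; returns `zero_division` when nothing is predicted positive."""
    true_pos = sum(
        p == positive and r == positive for p, r in zip(predictions, references)
    )
    pred_pos = sum(p == positive for p in predictions)
    if pred_pos == 0:
        # No positive predictions: the ratio would be 0 / 0.
        return zero_division
    return true_pos / pred_pos


# No positive predictions, so the fallback value is returned.
fallback = precision([0, 0, 0], [1, 0, 1], zero_division=1.0)
# Two positive predictions, one of them correct.
regular = precision([1, 1, 0], [1, 0, 0])
```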
Metric Cards
- Perplexity Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3905
- Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3917
- Create README.md for CER metric by @sashavor in https://github.com/huggingface/datasets/pull/3911
- Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3944
- Update README.md by @sashavor in https://github.com/huggingface/datasets/pull/3933
- Create SARI metric card by @sashavor in https://github.com/huggingface/datasets/pull/3932
- Create MAUVE metric card by @sashavor in https://github.com/huggingface/datasets/pull/3934
- Create CoVAL metric card by @sashavor in https://github.com/huggingface/datasets/pull/3940
- Google BLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3948
- Create metric card for BERTScore by @sashavor in https://github.com/huggingface/datasets/pull/3966
- Rename wer to cer by @pmgautam in https://github.com/huggingface/datasets/pull/4012
- Create metric card for XNLI by @sashavor in https://github.com/huggingface/datasets/pull/4046
- Create metric card for the Code Eval metric by @sashavor in https://github.com/huggingface/datasets/pull/4049
- Add TER metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3981
- BLEU metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3947
- Create metric card for CUAD by @sashavor in https://github.com/huggingface/datasets/pull/4043
- Create metric card for METEOR by @sashavor in https://github.com/huggingface/datasets/pull/4065
- Create a metric card for Competition MATH by @sashavor in https://github.com/huggingface/datasets/pull/4073
- Create metric card for seqeval by @sashavor in https://github.com/huggingface/datasets/pull/4070
- Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3930
- Create metric card for Frugal Score by @sashavor in https://github.com/huggingface/datasets/pull/4089
- Updating FrugalScore metric card by @sashavor in https://github.com/huggingface/datasets/pull/4097
- Proposing WikiSplit metric card by @sashavor in https://github.com/huggingface/datasets/pull/4098
- Fix formatting in BLEU metric card by @mariosasko in https://github.com/huggingface/datasets/pull/4157
Documentation
- Doc maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3926
- [Doc] Don't use v for version tags on GitHub by @sgugger in https://github.com/huggingface/datasets/pull/3943
- Use templates for doc-building jobs by @sgugger in https://github.com/huggingface/datasets/pull/3914
- Add align_labels_with_mapping docs by @stevhliu in https://github.com/huggingface/datasets/pull/3931
- Add tip on how to speed up loading with ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/3980
- Fix main_classes docs index by @lhoestq in https://github.com/huggingface/datasets/pull/3925
- More consistent references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3988
- Docs maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3999
- Add ROUGE Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4076
- Add chrF(++) Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4082
- Add SacreBLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4083
General improvements and bug fixes
- Fix flatten of complex feature types by @mariosasko in https://github.com/huggingface/datasets/pull/3723
- Fix flatten of Sequence feature type by @lhoestq in https://github.com/huggingface/datasets/pull/3962
- Exclude Google Drive tests of the CI by @lhoestq in https://github.com/huggingface/datasets/pull/3982
- Close `PIL.Image` file handler in `Image.decode_example` by @mariosasko in https://github.com/huggingface/datasets/pull/3995
- Fix Faiss custom_index device by @albertvillanova in https://github.com/huggingface/datasets/pull/3987
- Fix None issue with Sequence of dict by @lhoestq in https://github.com/huggingface/datasets/pull/4010
- Update main readme by @lhoestq in https://github.com/huggingface/datasets/pull/3927
- Fix `map` remove_columns on empty dataset by @lhoestq in https://github.com/huggingface/datasets/pull/4021
- Fix Audio.encode_example() when writing an array by @polinaeterna in https://github.com/huggingface/datasets/pull/3998
- Use audio feature in ASR task template by @lhoestq in https://github.com/huggingface/datasets/pull/4006
- Improve out of bounds error message by @lhoestq in https://github.com/huggingface/datasets/pull/4068
- Increase max retries for GitHub metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4063
- Fix CLI dummy data generation by @albertvillanova in https://github.com/huggingface/datasets/pull/4045
- Fix docs on audio feature installation by @albertvillanova in https://github.com/huggingface/datasets/pull/4028
- Add installation instructions to image_process doc by @mariosasko in https://github.com/huggingface/datasets/pull/4072
- Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in https://github.com/huggingface/datasets/pull/4078
- Increase max retries for GitHub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4079
- Close parquet writer properly in `push_to_hub` by @lhoestq in https://github.com/huggingface/datasets/pull/4081
- Fix typo in rename_column error message by @hunterlang in https://github.com/huggingface/datasets/pull/4095
- Fix BeamWriter output Parquet file by @albertvillanova in https://github.com/huggingface/datasets/pull/4087
- Remove unused legacy Beam utils by @albertvillanova in https://github.com/huggingface/datasets/pull/4088
- Hotfix failing CI tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/4119
- Update security policy by @albertvillanova in https://github.com/huggingface/datasets/pull/4111
- Avoid writing empty license files by @albertvillanova in https://github.com/huggingface/datasets/pull/4090
- Support huggingface_hub 0.5 by @lhoestq in https://github.com/huggingface/datasets/pull/4106
- Pretty print dataset info files by @mariosasko in https://github.com/huggingface/datasets/pull/4116
- Add single dataset citations for TweetEval by @gchhablani in https://github.com/huggingface/datasets/pull/4137
- Adjust path to datasets tutorial in How-To by @NimaBoscarino in https://github.com/huggingface/datasets/pull/4147
- Applied index-filters on scores in search.py. by @vishalsrao in https://github.com/huggingface/datasets/pull/3971
- More robust `cast_to_python_objects` in `TypedSequence` by @mariosasko in https://github.com/huggingface/datasets/pull/4128
- Sync Features dictionaries by @mariosasko in https://github.com/huggingface/datasets/pull/3997
- Avoid rate limit in update hub repositories by @lhoestq in https://github.com/huggingface/datasets/pull/4167
New Contributors
- @bp-high made their first contribution in https://github.com/huggingface/datasets/pull/3972
- @ruanchaves made their first contribution in https://github.com/huggingface/datasets/pull/4004
- @pmgautam made their first contribution in https://github.com/huggingface/datasets/pull/4012
- @hunterlang made their first contribution in https://github.com/huggingface/datasets/pull/4095
- @trentonstrong made their first contribution in https://github.com/huggingface/datasets/pull/4092
- @NimaBoscarino made their first contribution in https://github.com/huggingface/datasets/pull/4147
- @jon-tow made their first contribution in https://github.com/huggingface/datasets/pull/4103
- @lijiazheng99 made their first contribution in https://github.com/huggingface/datasets/pull/4151
- @Datta0 made their first contribution in https://github.com/huggingface/datasets/pull/3955
- @vishalsrao made their first contribution in https://github.com/huggingface/datasets/pull/3971
Full Changelog: https://github.com/huggingface/datasets/compare/2.0.0...2.1.0
Published by lhoestq almost 4 years ago
datasets - 2.0.0
🤗 Datasets 2.0.0
We're happy to announce that our new documentation is available at hf.co/docs/datasets !
Dataset Features
- Load a folder of images using the `imagefolder` dataset loader:
  - Add imagefolder dataset by @nateraw in https://github.com/huggingface/datasets/pull/2830
  - Faster ImageFolder + add option to drop labels by @mariosasko in https://github.com/huggingface/datasets/pull/3887
- Push your image and audio datasets on the Hugging Face Hub with `push_to_hub`:
  - Add support for `Audio` and `Image` feature in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/3685
- New processing methods for streaming datasets:
  - Add `IterableDataset.filter` by @lhoestq in https://github.com/huggingface/datasets/pull/3826
  - Manipulate columns on `IterableDataset` (rename columns, cast, etc.) by @lhoestq in https://github.com/huggingface/datasets/pull/3862
  - Add the new methods to IterableDatasetDict by @lhoestq in https://github.com/huggingface/datasets/pull/3923
- And more:
  - Add more compression types for `to_json` by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3551
  - Multi-GPU support for `FaissIndex` by @rentruewang in https://github.com/huggingface/datasets/pull/3721
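
The streaming methods listed above are applied lazily, as examples are iterated. The sketch below imitates that idea with plain Python generators; `lazy_filter` and `lazy_rename` are hypothetical stand-ins written for this example, not the library's implementation.

```python
from typing import Callable, Iterable, Iterator


def lazy_filter(examples: Iterable[dict], keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield only the examples for which `keep` returns True, one at a time."""
    for example in examples:
        if keep(example):
            yield example


def lazy_rename(examples: Iterable[dict], old: str, new: str) -> Iterator[dict]:
    """Yield copies of the examples with column `old` renamed to `new`."""
    for example in examples:
        example = dict(example)  # copy so the source example is left untouched
        example[new] = example.pop(old)
        yield example


# A tiny in-memory "stream"; nothing is processed until the pipeline is iterated.
stream = iter([{"text": "short"}, {"text": "a much longer sentence"}, {"text": "ok"}])
pipeline = lazy_rename(
    lazy_filter(stream, lambda ex: len(ex["text"]) > 2), "text", "sentence"
)
first = next(pipeline)  # only now is the first example pulled through
```

On a dataset loaded with `streaming=True`, the corresponding calls would be along the lines of `ds.filter(...)` and `ds.rename_column(...)`.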
Breaking changes
- API changes for `map` and `shuffle` for datasets loaded in streaming mode:
  - Align `map` when streaming: update instead of overwrite + add missing parameters by @lhoestq in https://github.com/huggingface/datasets/pull/3801
  - Align `IterableDataset.shuffle` with `Dataset.shuffle` by @lhoestq in https://github.com/huggingface/datasets/pull/3842
- Rename GenerateMode to DownloadMode by @albertvillanova in https://github.com/huggingface/datasets/pull/3759
- Remove deprecated methods/params (preparation for v2.0) by @mariosasko in https://github.com/huggingface/datasets/pull/3803
- Remove deprecated `remove_columns` param in `filter` by @mariosasko in https://github.com/huggingface/datasets/pull/3827
- Module namespace cleanup for v2.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3875
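
One reading of "update instead of overwrite" for streaming `map` is sketched below with plain generators: previously the function's output replaced the example, whereas now it is merged into it, as with a non-streaming `Dataset.map`. The helper names are hypothetical and the code is an illustration of the semantics, not the library's implementation.

```python
def map_overwrite(examples, function):
    """Overwrite semantics: the function's output replaces the example."""
    for example in examples:
        yield function(example)


def map_update(examples, function):
    """Update semantics: the function's output is merged into a copy of the example."""
    for example in examples:
        yield {**example, **function(example)}


def add_length(example):
    # Returns only the new column, leaving existing columns to the caller.
    return {"length": len(example["text"])}


data = [{"text": "hello"}]
overwritten = list(map_overwrite(data, add_length))
updated = list(map_update(data, add_length))
```

Under update semantics the `text` column survives, so code that relied on `map` dropping untouched columns in streaming mode would need an explicit `remove_columns`.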
Dataset Changes
- New: CFPB Consumer Complaints by @kayvane1 in https://github.com/huggingface/datasets/pull/3617
- New: told-br (brazilian hate speech) by @JAugusto97 in https://github.com/huggingface/datasets/pull/3683
- New: electricity load diagram by @kashif in https://github.com/huggingface/datasets/pull/3722
- New: MIT Scene Parsing Benchmark by @mariosasko in https://github.com/huggingface/datasets/pull/3607
- New: ElkarHizketak v1.0 by @antxa in https://github.com/huggingface/datasets/pull/3780
- New: wikitablequestions by @SivilTaram in https://github.com/huggingface/datasets/pull/3870
- New: ontonotes_conll by @richarddwang in https://github.com/huggingface/datasets/pull/3853
- Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3616
- Update: Common voice - add validated partition by @shalymin-amzn in https://github.com/huggingface/datasets/pull/3669
- Update: Common Voice - add local paths to audio files by @lhoestq in https://github.com/huggingface/datasets/pull/3736
- Update: Common Voice - simplify code by @lhoestq in https://github.com/huggingface/datasets/pull/3817
- Update: Natural Questions - add dev-only configuration by @albertvillanova in https://github.com/huggingface/datasets/pull/3699
- Update: pubmed - update data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3692
- Update: pubmed - make the dataset streamable by @abhi-mosaic in https://github.com/huggingface/datasets/pull/3740
- Update: RedCaps - make the dataset streamable by @mariosasko in https://github.com/huggingface/datasets/pull/3737
- Update: cats_vs_dogs - update metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3752
- Update: newsroom - update manual download url by @albertvillanova in https://github.com/huggingface/datasets/pull/3779
- Update: xcopa - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3810
- Update: cats_vs_dogs size by @mariosasko in https://github.com/huggingface/datasets/pull/3878
- Fix: sem_eval_2018_task_1 - fix download location by @maxpel in https://github.com/huggingface/datasets/pull/3643
- Fix: newsqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3696
- Fix: The Pile datasets - fix host urls by @albertvillanova in https://github.com/huggingface/datasets/pull/3627
- Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3718
- Fix: NewsQA - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3734
- Fix: head_qa - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3766
- Fix: msr_sqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3771
- Fix: reddit_tifu - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3774
- Fix: wiki_lingua - fix spanish data file url by @albertvillanova in https://github.com/huggingface/datasets/pull/3806
- Fix: beans - fix data urls by @mariosasko in https://github.com/huggingface/datasets/pull/3890
- Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3921
- Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3922
Dataset cards
- Add code example in wikipedia card by @lhoestq in https://github.com/huggingface/datasets/pull/3678
- Fix Multi-News dataset metadata and card by @albertvillanova in https://github.com/huggingface/datasets/pull/3731
- Reddit dataset card additions by @anna-kay in https://github.com/huggingface/datasets/pull/3781
- Update gigaword card and info by @mariosasko in https://github.com/huggingface/datasets/pull/3775
- Reddit dataset card contribution by @anna-kay in https://github.com/huggingface/datasets/pull/3797
Metric Changes
- New: FrugalScore by @moussaKam in https://github.com/huggingface/datasets/pull/3674
- New: Mahalanobis distance by @JoaoLages in https://github.com/huggingface/datasets/pull/3794
- New: mIoU by @NielsRogge in https://github.com/huggingface/datasets/pull/3745
- New: MSE and MAE - V2 by @dnaveenr in https://github.com/huggingface/datasets/pull/3874
- Fix: METEOR - fix bug due to nltk version by @albertvillanova in https://github.com/huggingface/datasets/pull/3884
Metric cards
- Add perplexity to metrics by @emibaylor in https://github.com/huggingface/datasets/pull/3757
- Create SQuAD metric README.md by @sashavor in https://github.com/huggingface/datasets/pull/3873
- SQuAD v2 metric: create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3879
- Update README.md for SQuAD v2 metric by @sashavor in https://github.com/huggingface/datasets/pull/3908
- Update README.md for SQuAD metric by @sashavor in https://github.com/huggingface/datasets/pull/3907
- Create README.md for WER metric by @sashavor in https://github.com/huggingface/datasets/pull/3898
- Create README.md for GLUE by @sashavor in https://github.com/huggingface/datasets/pull/3916
New documentation
- Update docs to new frontend/UI by @mishig25 in https://github.com/huggingface/datasets/pull/3690
- Image process doc by @stevhliu in https://github.com/huggingface/datasets/pull/3882
General improvements and bug fixes
- Better TQDM output by @mariosasko in https://github.com/huggingface/datasets/pull/3654
- Prioritize `module.builder_kwargs` over defaults in `TestCommand` by @lvwerra in https://github.com/huggingface/datasets/pull/3672
- Extend support for streaming datasets that use os.path.relpath by @albertvillanova in https://github.com/huggingface/datasets/pull/3623
- Add Fon language tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3620
- Remove unnecessary 'r' arg in by @bryant1410 in https://github.com/huggingface/datasets/pull/3661
- Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in https://github.com/huggingface/datasets/pull/3680
- Upgrade black to version ~=22.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/3691
- Fix streaming for servers not supporting HTTP range requests by @albertvillanova in https://github.com/huggingface/datasets/pull/3689
- Pin ElasticSearch by @lhoestq in https://github.com/huggingface/datasets/pull/3701
- Raise informative error when loading a save_to_disk dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3705
- Fix ClassLabel to/from dict when passed names_file by @albertvillanova in https://github.com/huggingface/datasets/pull/3695
- Fix CI code quality issue by @albertvillanova in https://github.com/huggingface/datasets/pull/3710
- Check if indices values in `Dataset.select` are within bounds by @mariosasko in https://github.com/huggingface/datasets/pull/3719
- Pin pandas to avoid bug in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3725
- Use config pandas version in CSV dataset builder by @albertvillanova in https://github.com/huggingface/datasets/pull/3726
- Set base path to hub url for canonical datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3709
- Fix ValueError message formatting in int2str by @akulchik in https://github.com/huggingface/datasets/pull/3742
- Patch all module attributes in its namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/3727
- Fix typo in train split name by @albertvillanova in https://github.com/huggingface/datasets/pull/3751
- feat: 🎸 generate info if dataset_infos.json does not exist by @severo in https://github.com/huggingface/datasets/pull/3670
- Support streaming in size estimation function in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/3732
- Expose method and fix param by @severo in https://github.com/huggingface/datasets/pull/3767
- Fix HfFileSystem docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3768
- process .opus files (for Multilingual Spoken Words) by @polinaeterna in https://github.com/huggingface/datasets/pull/3666
- Fix: dataset name is stored in keys by @thomasw21 in https://github.com/huggingface/datasets/pull/3772
- Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/3746
- Start removing canonical datasets logic by @lhoestq in https://github.com/huggingface/datasets/pull/3777
- Support passing str to iter_files by @albertvillanova in https://github.com/huggingface/datasets/pull/3783
- Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in https://github.com/huggingface/datasets/pull/3787
- Skip checksum computation if `ignore_verifications` is `True` by @mariosasko in https://github.com/huggingface/datasets/pull/3796
- Fix error message in CSV loader for newer Pandas versions by @mariosasko in https://github.com/huggingface/datasets/pull/3798
- Add `data_dir` to `data_files` resolution and misc improvements to HfFileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/3791
- Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in https://github.com/huggingface/datasets/pull/3782
- Handle Nones in PyArrow struct by @mariosasko in https://github.com/huggingface/datasets/pull/3814
- Fix iter_archive getting reset by @lhoestq in https://github.com/huggingface/datasets/pull/3815
- Added computer vision tasks by @merveenoyan in https://github.com/huggingface/datasets/pull/3800
- Fix typo in doc build yml by @mishig25 in https://github.com/huggingface/datasets/pull/3819
- Allow not specifying feature cols other than `predictions`/`references` in `Metric.compute` by @mariosasko in https://github.com/huggingface/datasets/pull/3824
- Logo float left by @mishig25 in https://github.com/huggingface/datasets/pull/3836
- Pin responses to fix CI for Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/3840
- Fix dead dataset scripts creation link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3834
- Remove decode: true for image feature in head_qa by @craffel in https://github.com/huggingface/datasets/pull/3805
- Update faiss device docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3846
- Update index.mdx margins by @gary149 in https://github.com/huggingface/datasets/pull/3858
- Fix `push_to_hub` with null images by @lhoestq in https://github.com/huggingface/datasets/pull/3856
- Redundant add dataset information and dead link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3852
- Update image dataset tags by @mariosasko in https://github.com/huggingface/datasets/pull/3864
- Bring back imgs so that forks don't get broken by @mishig25 in https://github.com/huggingface/datasets/pull/3866
- Small typos in How-to-train tutorial. by @lkhphuc in https://github.com/huggingface/datasets/pull/3833
- Small doc fixes by @mishig25 in https://github.com/huggingface/datasets/pull/3860
- add pandas to env command by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3871
- Ignore duplicate keys if `ignore_verifications=True` by @mariosasko in https://github.com/huggingface/datasets/pull/3868
- Update code blocks by @lhoestq in https://github.com/huggingface/datasets/pull/3863
- Fix `download_mode` in `dataset_module_factory` by @albertvillanova in https://github.com/huggingface/datasets/pull/3876
- Fix some shuffle docs by @lhoestq in https://github.com/huggingface/datasets/pull/3885
- Fix race condition in doc build by @lhoestq in https://github.com/huggingface/datasets/pull/3891
- Add default branch for doc building by @sgugger in https://github.com/huggingface/datasets/pull/3893
- [docs] make dummy data creation optional by @lhoestq in https://github.com/huggingface/datasets/pull/3894
- Fix code examples indentation by @lhoestq in https://github.com/huggingface/datasets/pull/3895
- Align tqdm control/cache control with Transformers by @mariosasko in https://github.com/huggingface/datasets/pull/3897
- Fix CLI test checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/3892
- Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in https://github.com/huggingface/datasets/pull/3843
- Change the framework switches to the new syntax by @sgugger in https://github.com/huggingface/datasets/pull/3880
New Contributors
- @kayvane1 made their first contribution in https://github.com/huggingface/datasets/pull/3617
- @JAugusto97 made their first contribution in https://github.com/huggingface/datasets/pull/3683
- @shalymin-amzn made their first contribution in https://github.com/huggingface/datasets/pull/3669
- @kashif made their first contribution in https://github.com/huggingface/datasets/pull/3722
- @akulchik made their first contribution in https://github.com/huggingface/datasets/pull/3742
- @abhi-mosaic made their first contribution in https://github.com/huggingface/datasets/pull/3740
- @emibaylor made their first contribution in https://github.com/huggingface/datasets/pull/3757
- @anna-kay made their first contribution in https://github.com/huggingface/datasets/pull/3781
- @JoaoLages made their first contribution in https://github.com/huggingface/datasets/pull/3794
- @mishig25 made their first contribution in https://github.com/huggingface/datasets/pull/3690
- @antxa made their first contribution in https://github.com/huggingface/datasets/pull/3780
- @dnaveenr made their first contribution in https://github.com/huggingface/datasets/pull/3834
- @lkhphuc made their first contribution in https://github.com/huggingface/datasets/pull/3833
- @rentruewang made their first contribution in https://github.com/huggingface/datasets/pull/3721
- @gary149 made their first contribution in https://github.com/huggingface/datasets/pull/3858
- @NielsRogge made their first contribution in https://github.com/huggingface/datasets/pull/3745
- @sashavor made their first contribution in https://github.com/huggingface/datasets/pull/3873
- @SivilTaram made their first contribution in https://github.com/huggingface/datasets/pull/3870
- Document cases for github datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3924
- Fix text loader to split only on universal newlines by @albertvillanova in https://github.com/huggingface/datasets/pull/3910
- Retry HfApi call inside `push_to_hub` when 504 error by @albertvillanova in https://github.com/huggingface/datasets/pull/3886
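The text loader fix listed above is about which characters count as line boundaries. As a minimal sketch (this is not the loader's actual code, and `split_universal_newlines` is a hypothetical helper), the snippet below contrasts Python's `str.splitlines()`, which also splits on characters such as `\u2028`, with splitting on universal newlines only (LF, CRLF and CR):

```python
import re
from typing import List


def split_universal_newlines(text: str) -> List[str]:
    """Split only on LF, CRLF and CR (hypothetical helper, for illustration)."""
    lines = re.split(r"\r\n|\r|\n", text)
    # A trailing newline produces one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


sample = "first\u2028still first\nsecond\r\nthird\n"

# str.splitlines() treats U+2028 (LINE SEPARATOR) as a line boundary too.
print(sample.splitlines())

# The helper keeps U+2028 inside the first line.
print(split_universal_newlines(sample))
```

With the second approach, a record that contains an unusual separator character stays on a single line instead of being silently cut in two.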
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...0.0.0
Published by lhoestq almost 4 years ago
datasets - 1.18.4
Bug fixes
- Prioritize `module.builder_kwargs` over defaults in `TestCommand` #3672 (@lvwerra)
- Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)
- Upgrade black to version ~=22.0 #3691 (@LysandreJik)
- Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)
- Pin ElasticSearch #3701 (@lhoestq)
- Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)
- Fix CI code quality issue #3710 (@albertvillanova)
- Check if indices values in `Dataset.select` are within bounds #3719 (@mariosasko)
- Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)
- Use config pandas version in CSV dataset builder #3726 (@albertvillanova)
- Fix dataset mirroring (@lhoestq)
- Fix ValueError message formatting in int2str #3742 (@akulchik)
- Patch all module attributes in its namespace #3727 (@albertvillanova)
- Fix HfFileSystem docstring #3768 (@lhoestq)
- Fix: dataset name is stored in keys #3772 (@thomasw21)
- Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)
- Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)
- Pin responses to fix CI for Windows #3840 (@albertvillanova)
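To illustrate the `Dataset.select` bounds check listed above, here is a minimal sketch of validating indices before using them. The helper name `check_indices_in_bounds` is hypothetical and the library's own check may be implemented differently; the sketch assumes negative indices count from the end, as in ordinary Python indexing.

```python
def check_indices_in_bounds(indices, size):
    """Raise IndexError if any index falls outside [-size, size)."""
    for i in indices:
        if i >= size or i < -size:
            raise IndexError(f"Index {i} out of range for dataset of size {size}.")


# Valid indices (including a negative one) pass silently.
check_indices_in_bounds([0, 2, -1], size=3)

# An out-of-range index is reported up front instead of failing later.
try:
    check_indices_in_bounds([0, 3], size=3)
except IndexError as err:
    print(err)
```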
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...1.18.4
Published by albertvillanova almost 4 years ago
datasets - 1.18.3
Bug fixes
- Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in https://github.com/huggingface/datasets/pull/3665
- Extend dataset builder for streaming in `get_dataset_split_names` by @mariosasko in https://github.com/huggingface/datasets/pull/3657
Dataset changes
- New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in https://github.com/huggingface/datasets/pull/3605
- New: British Library books dataset by @davanstrien in https://github.com/huggingface/datasets/pull/3603
- Fix: wiki_bio - Update link by @jxmorris12 in https://github.com/huggingface/datasets/pull/3651
Other improvements
- sp. Columbia => Colombia by @serapio in https://github.com/huggingface/datasets/pull/3652
- Run pyupgrade for Python 3.6+ by @bryant1410 in https://github.com/huggingface/datasets/pull/3560
New Contributors
- @serapio made their first contribution in https://github.com/huggingface/datasets/pull/3652
- @mirzakhalov made their first contribution in https://github.com/huggingface/datasets/pull/3605
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.2...1.18.3
Published by lhoestq about 4 years ago
datasets - 1.18.2
Bug fixes
- Fix streaming datasets that are not reset correctly by @lhoestq in https://github.com/huggingface/datasets/pull/3646
- Fix numpy rngs when shuffling with seed=None by @mariosasko in https://github.com/huggingface/datasets/pull/3641
- Fix dataset slicing with negative bounds when indices mapping is not `None` by @mariosasko in https://github.com/huggingface/datasets/pull/3642
- Fix `add_column` on datasets with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/3647
Other improvements
- Update index.rst by @VioletteLepercq in https://github.com/huggingface/datasets/pull/3636
- Fix Windows CI: bump python to 3.7 by @lhoestq in https://github.com/huggingface/datasets/pull/3648
New Contributors
- @VioletteLepercq made their first contribution in https://github.com/huggingface/datasets/pull/3636
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.1...1.18.2
Published by lhoestq about 4 years ago
datasets - 1.18.1
Improvements
- Make decoding of Audio and Image feature optional by @mariosasko in https://github.com/huggingface/datasets/pull/3430
Bug fixes
- Fix `prepare_for_task()` by @mariosasko in https://github.com/huggingface/datasets/pull/3614
- Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/3619
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.0...1.18.1
Published by lhoestq about 4 years ago
datasets - 1.18.0
Datasets Changes
- New: VCTK
  - Add VCTK dataset by @jaketae in https://github.com/huggingface/datasets/pull/3351
  - Fix VCTK encoding by @lhoestq in https://github.com/huggingface/datasets/pull/3493
  - Docs: Add VCTK dataset description by @jaketae in https://github.com/huggingface/datasets/pull/3500
- New: CPPE-5 dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3517
- New: RedCaps dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3424
- New: WIDER FACE dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3413
- New: SVHN dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3535
- New: BNL newspapers by @davanstrien in https://github.com/huggingface/datasets/pull/3397
- New: PASS dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3576
- New: Text2log Dataset by @apergo-ai in https://github.com/huggingface/datasets/pull/3579
- Update: beans, `cats_vs_dogs` - Use `iter_files` instead of `str(Path(...)` in image dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3477
- Update: PIB - update version and make it streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3496
- Update: `code_x_glue_tt_text_to_text`, compguesswhat - Remove print statements in datasets by @mariosasko in https://github.com/huggingface/datasets/pull/3546
- Update: MuchoCine - add missing tasks by @mariosasko in https://github.com/huggingface/datasets/pull/3571
- Fix: Tashkeela - fix to yield stripped text by @albertvillanova in https://github.com/huggingface/datasets/pull/3471
- Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in https://github.com/huggingface/datasets/pull/3516
- Fix: CC100 - use HTTPS for the data source URL by @aajanki in https://github.com/huggingface/datasets/pull/3519
- Fix: vision datasets - Fix bug in `ImageClassifcation` task template by @mariosasko in https://github.com/huggingface/datasets/pull/3557
- Fix: tweet_qa - fix `DuplicatedKeysError` and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/3559
- Fix: mC4 - fix multiple language downloading by @polinaeterna in https://github.com/huggingface/datasets/pull/3594
- Fix: CoNLL2003:
  - Use old url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3600
  - Update url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3602
  - Add conll2003 licensing by @lhoestq in https://github.com/huggingface/datasets/pull/3601
Datasets Features
- [Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in https://github.com/huggingface/datasets/pull/3591
- [Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in https://github.com/huggingface/datasets/pull/3575
- Allows DatasetDict.filter to have batching option by @thomasw21 in https://github.com/huggingface/datasets/pull/3506
- Add desc parameter to filter by @mariosasko in https://github.com/huggingface/datasets/pull/3513
- Add `gzip` for `to_json` by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3492
- Allow multiple task templates of the same type by @mariosasko in https://github.com/huggingface/datasets/pull/3562
- Add parameter `preserve_index` to `from_pandas` by @Sorrow321 in https://github.com/huggingface/datasets/pull/3565
- Dataset Streaming:
  - Fix `str(Path(...))` conversion in streaming on Linux by @mariosasko in https://github.com/huggingface/datasets/pull/3472
  - Extend support for streaming datasets that use ET.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/3476
  - Extend support for streaming datasets that use os.walk by @albertvillanova in https://github.com/huggingface/datasets/pull/3478
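As a rough illustration of what a gzip-compressed JSON Lines export looks like on disk, the sketch below uses only the Python standard library. It deliberately does not call `to_json`, whose exact parameters are not shown in these notes; the file name and sample rows are made up for the example.

```python
import gzip
import json
import os
import tempfile

rows = [{"id": 0, "text": "hello"}, {"id": 1, "text": "world"}]

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "rows.jsonl.gz")

# Write one JSON object per line into a gzip-compressed text file.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm the round trip.
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]

print(restored == rows)
```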
Metrics Changes
- Add Mauve metric by @jthickstun in https://github.com/huggingface/datasets/pull/3573
Dataset cards
- update `pretty_name` for first 200 datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3498
- update `pretty_name` for all the other datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3536
- pib: Update pib dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/3501
- `arabic_speech_corpus`: Adding link to license. by @meg-huggingface in https://github.com/huggingface/datasets/pull/3524
- Covost2: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3528
- librispeech_asr: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3529
- vivos: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3530
- audio datasets: Audio datacard update - first pass by @meg-huggingface in https://github.com/huggingface/datasets/pull/3520
- common_language: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3527
- `wiki_dpr`: Update `wiki_dpr` README.md by @lhoestq in https://github.com/huggingface/datasets/pull/3534
- qa4mre: Fix qa4mre tags by @lhoestq in https://github.com/huggingface/datasets/pull/3574
- HellaSwag: Update HellaSwag README.md by @borgr in https://github.com/huggingface/datasets/pull/3588
- ANLI: Update ANLI README.md by @borgr in https://github.com/huggingface/datasets/pull/3590
- tweet_eval: Update README.md by @borgr in https://github.com/huggingface/datasets/pull/3593
Documentation
- Fix rendering of docs by @albertvillanova in https://github.com/huggingface/datasets/pull/3470
- Fix `to_tf_dataset` references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3514
- added PII statements and license links to data cards by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3537
- Readme usage update by @meg-huggingface in https://github.com/huggingface/datasets/pull/3538
- Update the CC-100 dataset card by @aajanki in https://github.com/huggingface/datasets/pull/3542
- Research wording for nc licenses by @meg-huggingface in https://github.com/huggingface/datasets/pull/3539
- Added links to licensing and PII message in vctk dataset by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3523
- Give clearer instructions to add the YAML tags by @albertvillanova in https://github.com/huggingface/datasets/pull/3532
General improvements and bug fixes
- Fix overriding of filesystem info by @albertvillanova in https://github.com/huggingface/datasets/pull/3481
- Update `ADD_NEW_DATASET.md` by @apergo-ai in https://github.com/huggingface/datasets/pull/3487
- Fix weird spacing in ManualDownloadError message by @bryant1410 in https://github.com/huggingface/datasets/pull/3486
- Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3494
- Remove unused phony rule from Makefile by @bryant1410 in https://github.com/huggingface/datasets/pull/3483
- fix: 🐛 pass token when retrieving the split names by @severo in https://github.com/huggingface/datasets/pull/3545
- Pin torchmetrics to fix the COMET test by @lhoestq in https://github.com/huggingface/datasets/pull/3589
- Preserve encoding/decoding with features in `Iterable.map` call by @mariosasko in https://github.com/huggingface/datasets/pull/3556
New Contributors
- @apergo-ai made their first contribution in https://github.com/huggingface/datasets/pull/3487
- @bryant1410 made their first contribution in https://github.com/huggingface/datasets/pull/3486
- @meg-huggingface made their first contribution in https://github.com/huggingface/datasets/pull/3527
- @aajanki made their first contribution in https://github.com/huggingface/datasets/pull/3519
- @Sorrow321 made their first contribution in https://github.com/huggingface/datasets/pull/3565
- @jthickstun made their first contribution in https://github.com/huggingface/datasets/pull/3573
- @borgr made their first contribution in https://github.com/huggingface/datasets/pull/3588
Full Changelog: https://github.com/huggingface/datasets/compare/1.17.0...1.18.0
Published by lhoestq about 4 years ago
datasets - 1.17.0
Dataset Changes
- New: The Pile
  - Add The Pile dataset and PubMed Central subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3287
  - Add The Pile Free Law subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3359
  - Add The Pile USPTO subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3360
  - Add The Pile subsets by @albertvillanova in https://github.com/huggingface/datasets/pull/3378
  - Add The Pile Enron Emails subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3427
- New: British Library Books Genre by @davanstrien in https://github.com/huggingface/datasets/pull/3312
- New: Americas NLI by @fdschmidt93 in https://github.com/huggingface/datasets/pull/3371
- New: Speech commands by @polinaeterna in https://github.com/huggingface/datasets/pull/3335
- New: eli5_category by @jingshenSN2 in https://github.com/huggingface/datasets/pull/3420
- New: OneStopQa by @scaperex in https://github.com/huggingface/datasets/pull/3436
- Update: LABR - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3352
- Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in https://github.com/huggingface/datasets/pull/3376
- Update: beans, `cats_vs_dogs`, cifar10, cifar100, `fashion_mnist`, mnist, `head_qa`: use the new Image feature type + streaming support by @mariosasko in https://github.com/huggingface/datasets/pull/3362
- Update: CC100- add Georgian data by @AnzorGozalishvili in https://github.com/huggingface/datasets/pull/3383
- Update: `disaster_response_messages` - update download urls (+ add validation split) by @mariosasko in https://github.com/huggingface/datasets/pull/3426
- Update: swahili_news - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3463
- Fix: WikiAuto, Jeopardy, `definite_pronoun_resolution` - fix URLs by @LashaO in https://github.com/huggingface/datasets/pull/3266
- Fix: QED - fix type of bridge field by @mariosasko in https://github.com/huggingface/datasets/pull/3417
- Fix: ASSET - fix dataset data URLs by @tianjianjiang in https://github.com/huggingface/datasets/pull/3342
Dataset Features
- Add Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/3163
- `to_tf_dataset()` refactor by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3356
- More robust `None` handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195
- Add `cast_column` to `IterableDataset` by @mariosasko in https://github.com/huggingface/datasets/pull/3439
- Support streaming zipped dataset repo by passing only repo name by @albertvillanova in https://github.com/huggingface/datasets/pull/3375
- Extend support for streaming datasets that use pd.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/3355
- Extend iter_archive to support file object input by @albertvillanova in https://github.com/huggingface/datasets/pull/3443
- Extend text to support yielding lines, paragraphs or documents by @albertvillanova in https://github.com/huggingface/datasets/pull/3442
- Push dataset_infos.json to Hub to preserve feature types by @lhoestq in https://github.com/huggingface/datasets/pull/3467
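The text loader item above mentions yielding lines, paragraphs or documents. A minimal sketch of what those three granularities mean is shown below. `split_text` and its `mode` argument are hypothetical names for illustration, not the loader's API, and paragraphs are assumed to be separated by one or more blank lines.

```python
import re


def split_text(text, mode="line"):
    """Split text by line, by paragraph, or keep it as one document (illustrative)."""
    if mode == "line":
        return text.splitlines()
    if mode == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        return [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    if mode == "document":
        return [text]
    raise ValueError(f"Unknown mode: {mode!r}")


sample = "Line one.\nLine two.\n\nSecond paragraph.\n"

for mode in ("line", "paragraph", "document"):
    print(mode, len(split_text(sample, mode)))
```

The same input therefore yields a different number of examples depending on the chosen granularity.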
Dataset cards
- Change TriviaQA license (#3313) by @avinashsai in https://github.com/huggingface/datasets/pull/3330
- Add missing tags to XTREME by @mariosasko in https://github.com/huggingface/datasets/pull/3322
- Remove duplicate name from dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3354
- Fix typos in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3386
- Fix duplicated tag in wikicorpus dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/3458
Dataset Tasks
- Create Language Modeling task by @albertvillanova in https://github.com/huggingface/datasets/pull/3387
Metric Changes
- BLEURT: Match key names to correspond with filename by @jaehlee in https://github.com/huggingface/datasets/pull/3348
- Fix links in metrics description by @albertvillanova in https://github.com/huggingface/datasets/pull/3461
- Fix METEOR missing NLTK's omw-1.4 by @lhoestq in https://github.com/huggingface/datasets/pull/3469
Docs
- Add ArrayXD docs by @stevhliu in https://github.com/huggingface/datasets/pull/3344
- Document a training loop for streaming dataset by @lhoestq in https://github.com/huggingface/datasets/pull/3370
- Fix formatting in IterableDataset.map docs by @mariosasko in https://github.com/huggingface/datasets/pull/3395
- Correctly indent builder config in dataset script docs by @mariosasko in https://github.com/huggingface/datasets/pull/3432
- Update BLEURT hyperlink by @lewtun in https://github.com/huggingface/datasets/pull/3437
Additional improvements and bug fixes
- Quick fix error formatting by @NouamaneTazi in https://github.com/huggingface/datasets/pull/3328
- Fix error message and add extension fallback by @mariosasko in https://github.com/huggingface/datasets/pull/3332
- Avoid content-encoding issue while streaming datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/3350
- Fix JSON ClassLabel casting for integers by @lhoestq in https://github.com/huggingface/datasets/pull/3340
- Better error message when download fails by @lhoestq in https://github.com/huggingface/datasets/pull/3343
- Fix dict source_datasets tagset validator by @albertvillanova in https://github.com/huggingface/datasets/pull/3368
- Fix typo in other-structured-to-text task tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3367
- Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in https://github.com/huggingface/datasets/pull/3296
- Fix flaky test of the temporary directory used by `load_from_disk` by @lhoestq in https://github.com/huggingface/datasets/pull/3388
- More robust first elem check in encode/cast example by @mariosasko in https://github.com/huggingface/datasets/pull/3402
- Fix module inference for archive with a directory by @albertvillanova in https://github.com/huggingface/datasets/pull/3406
- Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in https://github.com/huggingface/datasets/pull/3410
- Pass new_fingerprint in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/3409
- Fix flaky test again for s3 serialization by @lhoestq in https://github.com/huggingface/datasets/pull/3412
- Skip None encoding (line deleted by accident in #3195) by @mariosasko in https://github.com/huggingface/datasets/pull/3414
- Clean squad dummy data by @lhoestq in https://github.com/huggingface/datasets/pull/3428
- #3337 Add typing overloads to `Dataset.__getitem__` for mypy by @Dref360 in https://github.com/huggingface/datasets/pull/3382
- Make cast cacheable (again) on Windows by @mariosasko in https://github.com/huggingface/datasets/pull/3429
- Use max number of data files to infer module by @albertvillanova in https://github.com/huggingface/datasets/pull/3407
- Fix iter_archive generator by @albertvillanova in https://github.com/huggingface/datasets/pull/3454
- [Staging] Update dataset repos automatically on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3451
- Update supported versions of Python in setup.py by @mariosasko in https://github.com/huggingface/datasets/pull/3438
- raise exception instead of using assertions. by @manisnesan in https://github.com/huggingface/datasets/pull/3349
New Contributors
- @avinashsai made their first contribution in https://github.com/huggingface/datasets/pull/3330
- @NouamaneTazi made their first contribution in https://github.com/huggingface/datasets/pull/3328
- @davanstrien made their first contribution in https://github.com/huggingface/datasets/pull/3312
- @francisco-perez-sorrosal made their first contribution in https://github.com/huggingface/datasets/pull/3296
- @LashaO made their first contribution in https://github.com/huggingface/datasets/pull/3266
- @fdschmidt93 made their first contribution in https://github.com/huggingface/datasets/pull/3371
- @polinaeterna made their first contribution in https://github.com/huggingface/datasets/pull/3335
- @AnzorGozalishvili made their first contribution in https://github.com/huggingface/datasets/pull/3383
- @tianjianjiang made their first contribution in https://github.com/huggingface/datasets/pull/3342
- @jingshenSN2 made their first contribution in https://github.com/huggingface/datasets/pull/3420
- @scaperex made their first contribution in https://github.com/huggingface/datasets/pull/3436
Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0
Published by lhoestq about 4 years ago
datasets - 1.16.0
Datasets Changes
- New: riddle_sense by @ziyiwu9494 in https://github.com/huggingface/datasets/pull/3161
- New: Multi-Lingual LibriSpeech by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3198
- New: XCSR by @yangxqiao in https://github.com/huggingface/datasets/pull/3074
- New: CMU Hinglish DoG by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3149
- New: Multidoc2dial by @sivasankalpp in https://github.com/huggingface/datasets/pull/3205
- New: IndoNLI by @afaji in https://github.com/huggingface/datasets/pull/3307
- Update: DaNE - updated URL for download by @MalteHB in https://github.com/huggingface/datasets/pull/3203
- Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in https://github.com/huggingface/datasets/pull/3254
- Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in https://github.com/huggingface/datasets/pull/3225
- Update: KILT - update metadata JSON by @albertvillanova in https://github.com/huggingface/datasets/pull/3276
- Update: Covost 2 - update download instructions by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3281
- Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in https://github.com/huggingface/datasets/pull/3290
- Fix: tuple_ie - fix download url by @mariosasko in https://github.com/huggingface/datasets/pull/3213
- Fix: `id_newspapers_2018` - fix streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3249
- Fix: bookcorpusopen - fix RAM usage by @lhoestq in https://github.com/huggingface/datasets/pull/3280
- Fix: Scielo - fix ConnectionError by @mariosasko in https://github.com/huggingface/datasets/pull/3260
- Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in https://github.com/huggingface/datasets/pull/3321
Datasets Features
- Push to hub capabilities for `Dataset` and `DatasetDict` by @LysandreJik in https://github.com/huggingface/datasets/pull/3098:
  - upload your dataset to the Hugging Face Hub with the `push_to_hub()` method!
  - See documentation here
- 200+ datasets now support streaming:
  - Stream TAR-based dataset using iter_archive by @lhoestq in https://github.com/huggingface/datasets/pull/3110
  - Stream from Google Drive and other hosts by @lhoestq in https://github.com/huggingface/datasets/pull/3248
  - Support Audio feature in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3133
  - Support Audio feature for TAR archives in sequential access by @albertvillanova in https://github.com/huggingface/datasets/pull/3129
- Resolve data_files by split name automatically by @lhoestq in https://github.com/huggingface/datasets/pull/3221
  - It takes into account the file names to know which file goes into which split
  - See documentation here
- Filter method for batched=True by @thomasw21 in https://github.com/huggingface/datasets/pull/3244
- Adding `with_rank` arg to pass process rank to `map` by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314
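The automatic split resolution described above relies on file names. The sketch below shows one way such inference could work. It is an illustration under stated assumptions, not the library's implementation: the keyword table, the `infer_splits` helper, and the rule that unmatched files fall back to `train` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical keyword table; the library's real patterns may differ.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "val", "dev"],
    "test": ["test", "testing"],
}


def infer_splits(file_names):
    """Group file names by the split keyword found in their path tokens."""
    splits = defaultdict(list)
    for name in file_names:
        # Tokenize on anything that is not a lowercase letter or digit.
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        for split, keywords in SPLIT_KEYWORDS.items():
            if any(keyword in tokens for keyword in keywords):
                splits[split].append(name)
                break
        else:
            # Assumption: files without a recognizable keyword go to "train".
            splits["train"].append(name)
    return dict(splits)


files = ["data/train.csv", "data/test.csv", "data/dev-0001.json"]
print(infer_splits(files))
```

Matching on whole tokens rather than substrings avoids assigning a file such as `contest.csv` to the test split.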
Dataset Cards
- Add full tagset to conll2003 README by @BramVanroy in https://github.com/huggingface/datasets/pull/3230
- Fix some contact information formats by @lhoestq in https://github.com/huggingface/datasets/pull/3274
- Add wikipedia tags by @lhoestq in https://github.com/huggingface/datasets/pull/3301
- Updating details of IRC disentanglement data by @jkkummerfeld in https://github.com/huggingface/datasets/pull/3259
Metrics Changes
- New: OpenAI's pass@k code evaluation metric by @lvwerra in https://github.com/huggingface/datasets/pull/2916
- Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in https://github.com/huggingface/datasets/pull/3235
- Update: CER - update to support latest release by @mariosasko in https://github.com/huggingface/datasets/pull/3252
- Update: WER - update to the documentation by @wooters in https://github.com/huggingface/datasets/pull/3278
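For readers unfamiliar with pass@k, the sketch below computes the commonly used unbiased estimator for a single problem: with `n` generated samples of which `c` pass the unit tests, pass@k is `1 - C(n - c, k) / C(n, k)`. This is a standalone illustration rather than the code of the metric added in the PR above, and the function name `pass_at_k` is chosen for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples are incorrect, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of them pass the tests.
for k in (1, 5, 10):
    print(k, pass_at_k(n=10, c=3, k=k))
```

In practice the per-problem estimates are averaged over all problems in the benchmark.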
Documentation
- Add docs for `to_tf_dataset` by @stevhliu in https://github.com/huggingface/datasets/pull/3175
- Small updates to `to_tf_dataset` documentation by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3215
- Update link to Datasets Tagging app in Spaces by @albertvillanova in https://github.com/huggingface/datasets/pull/3194
- Improve repository structure docs by @lhoestq in https://github.com/huggingface/datasets/pull/3233
- Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3241
- Add docs for audio processing by @stevhliu in https://github.com/huggingface/datasets/pull/3222
- Add push_to_hub docs by @lhoestq in https://github.com/huggingface/datasets/pull/3319
Additional improvements and bug fixes
- Catch token invalid error in CI by @lhoestq in https://github.com/huggingface/datasets/pull/3200
- Pin keras version until TF fixes its release by @albertvillanova in https://github.com/huggingface/datasets/pull/3208
- Fix disable_nullable default value to False by @lhoestq in https://github.com/huggingface/datasets/pull/3211
- Fix code quality in riddle_sense dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3218
- Better error msg if `len(predictions)` doesn't match `len(references)` in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160
- Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3121
- Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in https://github.com/huggingface/datasets/pull/3216
- Group tests in multiprocessing workers by test file by @albertvillanova in https://github.com/huggingface/datasets/pull/3231
- Fix load_from_disk temporary directory by @lhoestq in https://github.com/huggingface/datasets/pull/3245
- [tiny] fix typo in stream docs by @nollied in https://github.com/huggingface/datasets/pull/3246
- Avoid PyArrow type optimization if it fails by @mariosasko in https://github.com/huggingface/datasets/pull/3234
- Remove redundant isort module placement by @mariosasko in https://github.com/huggingface/datasets/pull/3243
- asserts replaced by exception for text classification task with test. by @manisnesan in https://github.com/huggingface/datasets/pull/3256
- Add os.listdir for streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3270
- asserts replaced with exception for image classification task, csv, json by @manisnesan in https://github.com/huggingface/datasets/pull/3262
- Force data files extraction if download_mode='force_redownload' by @mariosasko in https://github.com/huggingface/datasets/pull/3275
- Minor Typo Fix - Precision to Recall by @SebastinSanty in https://github.com/huggingface/datasets/pull/3279
- Decode audio from remote by @lhoestq in https://github.com/huggingface/datasets/pull/3271
- Fix build_docs CI by @lhoestq in https://github.com/huggingface/datasets/pull/3286
- Allow datasets with indices table when concatenating along axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3288
- f-string formatting by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3277
- Unpin markdown for build_docs now that it's fixed by @lhoestq in https://github.com/huggingface/datasets/pull/3289
- Pin version exclusion for Markdown by @albertvillanova in https://github.com/huggingface/datasets/pull/3293
- Use f-strings in the dataset scripts by @Carlosbogo in https://github.com/huggingface/datasets/pull/3291
- fix old_val typo in f-string by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3302
- asserts replaced with exception for `fingerprint.py`, `search.py`, `arrow_writer.py` and `metric.py` by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305
- fix: files counted twice in inferred structure by @borisdayma in https://github.com/huggingface/datasets/pull/3309
- Finish transition to PyArrow 3.0.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3318
- Removing query params for dynamic URL caching by @anton-l in https://github.com/huggingface/datasets/pull/3315
Citation
- Update BibTeX entry by @albertvillanova in https://github.com/huggingface/datasets/pull/3223
- Fix paper BibTeX citation with proceedings reference by @albertvillanova in https://github.com/huggingface/datasets/pull/3226
- Add CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3228
- Fix URL in CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3229
Deprecations
- Deprecate prepare_module by @albertvillanova in https://github.com/huggingface/datasets/pull/3166
Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0
Published by lhoestq about 4 years ago
datasets - 1.15.0
Dataset Changes
- Update: JNLBA - add tags names by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3092
- Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in https://github.com/huggingface/datasets/pull/3125 and https://github.com/huggingface/datasets/pull/3176
- Update: RONEC - update to v2 by @dumitrescustefan in https://github.com/huggingface/datasets/pull/3184
- Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in https://github.com/huggingface/datasets/pull/3136
- Fix: HLGD - fix label mapping by @VictorSanh in https://github.com/huggingface/datasets/pull/3180
Dataset Features
- Allow dynamic first dimension for ArrayXD by @rpowalski in https://github.com/huggingface/datasets/pull/2891
- add multi-proc in `to_csv` by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/2896
- QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in https://github.com/huggingface/datasets/pull/3196
Dataset Cards
- Fill in dataset card for NCBI disease dataset by @edugp in https://github.com/huggingface/datasets/pull/3115
Metrics Changes
- New: metric for the MATH dataset (competition_math). by @hacobe in https://github.com/huggingface/datasets/pull/3020
- New: Google BLEU (aka GLEU) metric by @slowwavesleep in https://github.com/huggingface/datasets/pull/3108
- New: TER by @BramVanroy in https://github.com/huggingface/datasets/pull/3153
- New: ChrF(++) by @BramVanroy in https://github.com/huggingface/datasets/pull/3187
General improvements and bug fixes
- Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3120
- Fixes to `to_tf_dataset` by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3085
- Add security policy to the project by @albertvillanova in https://github.com/huggingface/datasets/pull/2958
- Update doc links to point to new docs by @mariosasko in https://github.com/huggingface/datasets/pull/3116
- Fix caching bugs by @mariosasko in https://github.com/huggingface/datasets/pull/3141
- Fix numpy deprecation warning for ragged tensors by @lhoestq in https://github.com/huggingface/datasets/pull/3137
- Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in https://github.com/huggingface/datasets/pull/3157
- Fix some typos in the documentation by @h4iku in https://github.com/huggingface/datasets/pull/3152
- Fix string encoding for Value type by @lhoestq in https://github.com/huggingface/datasets/pull/3158
- Fix CLI test to ignore verifications when saving infos by @albertvillanova in https://github.com/huggingface/datasets/pull/3147
- Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in https://github.com/huggingface/datasets/pull/3159
- Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in https://github.com/huggingface/datasets/pull/3173
- Asserts replaced by exceptions (huggingface#3171) by @joseporiolayats in https://github.com/huggingface/datasets/pull/3174
- Preserve ordering in `zip_dict` by @mariosasko in https://github.com/huggingface/datasets/pull/3170
- Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in https://github.com/huggingface/datasets/pull/3182
- Re-add faiss to windows testing suite by @BramVanroy in https://github.com/huggingface/datasets/pull/3151
- Add missing docstring to DownloadConfig by @mariosasko in https://github.com/huggingface/datasets/pull/3183
- More efficient nested features encoding by @eladsegal in https://github.com/huggingface/datasets/pull/3124
- Fix optimized encoding for arrays by @lhoestq in https://github.com/huggingface/datasets/pull/3197
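The string memoization fix listed above (pull/3182) rests on a general Python behavior that the following sketch reproduces with the standard `pickle` module. It does not use the library's own hashing code; it only shows, under CPython, that identity-based memoization can serialize two equal values differently.

```python
import pickle

# Two equal strings built at runtime are distinct objects in CPython.
a = "".join(["data", "sets"])
b = "".join(["data", "sets"])

same_object = [a, a]    # second item is the very same object
equal_objects = [a, b]  # second item is equal but a different object

# pickle memoizes by object identity, so the first list stores the string
# once plus a back-reference, while the second list stores it twice.
pickled_same = pickle.dumps(same_object)
pickled_equal = pickle.dumps(equal_objects)
```

The two lists compare equal, yet their serialized bytes differ, which is why a hash computed from such bytes would not be stable.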
Published by lhoestq over 4 years ago
datasets - 1.14.0
Dataset changes
- Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
- Update: SUPERB - use Audio features #3101 (@anton-l)
- Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)
Dataset features
- Add iter_archive #3066 (@lhoestq)
General improvements and bug fixes
- Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
- Fix project description in PyPI #3103 (@albertvillanova)
- Align tqdm control with cache control #3031 (@mariosasko)
- Add paper BibTeX citation #3107 (@albertvillanova)
Published by albertvillanova over 4 years ago
datasets - 1.13.3
Dataset changes
- Update: Adapt all audio datasets #3081 (@patrickvonplaten)
Bug fixes
- Update BibTeX entry #3090 (@albertvillanova)
- Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
- Fix Audio feature mp3 resampling #3096 (@albertvillanova)
Published by albertvillanova over 4 years ago
datasets - 1.13.0
Dataset changes
- New: CaSiNo #2867 (@kushalchawla)
- New: Mostly Basic Python Problems #2893 (@lvwerra)
- New: OpenAI's HumanEval #2897 (@lvwerra)
- New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
- New: SEDE #2942 (@Hazoom)
- New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
- New: AMI #2853 (@cahya-wirawan)
- New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
- New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
- New: KanHope #2985 (@adeepH)
- New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
- New: SwedMedNER #2940 (@bwang482)
- New: SberQuAD #3039 (@Alenush)
- New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
- New: Greek Legal Code #2966 (@christospi)
- New: Story Cloze Test #3067 (@zaidalyafeai)
- Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
- Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
- Update: TriviaQA - add web and wiki config #2949 (@shirte)
- Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
- Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
- Update: Biosses - fix column names #3054 (@bwang482)
- Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
- Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
- Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
- Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
- Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
- Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)
Metric changes
- Update: meteor - update from nltk update #2946 (@lhoestq)
- Update: accuracy,f1,glue,indic-glue,pearsonr,precision,recall-super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
- Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
- Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)
Dataset features
- Use with TensorFlow:
  - Adding `to_tf_dataset` method #2731 #2931 #2951 #2974 (@Rocketknight1)
- Better support for ZIP files:
  - Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
  - Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
- Streaming improvements:
  - Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
  - Add `remove_columns` to `IterableDataset` #3030 (@cccntu)
  - All the above ZIP features also work in streaming mode
- New utilities:
  - Add `get_dataset_split_names()` to get a dataset config's split names #2906 (@severo)
- Replace script_version with revision #2933 (@albertvillanova)
  - The `script_version` parameter in `load_dataset` is now deprecated, in favor of `revision`
- Experimental - Create Audio feature type #2324 (@albertvillanova):
  - It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
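The `script_version` deprecation listed above follows a common pattern: the old keyword is still accepted, emits a warning, and is forwarded to `revision`. The stub below is a hypothetical illustration of that pattern only; `load_dataset_sketch`, its sentinel default, and the warning text are assumptions, not the library's code.

```python
import warnings


def load_dataset_sketch(path, revision=None, script_version="deprecated"):
    """Hypothetical loader stub that forwards a deprecated keyword."""
    if script_version != "deprecated":
        # The caller used the old keyword: warn, then treat it as `revision`.
        warnings.warn(
            "'script_version' is deprecated, use 'revision' instead.",
            FutureWarning,
        )
        revision = script_version
    return {"path": path, "revision": revision}


# Capture the warning so the old-style call can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = load_dataset_sketch("my_dataset", script_version="1.0.0")

new_style = load_dataset_sketch("my_dataset", revision="1.0.0")
```

Both calls resolve to the same revision; only the old-style call produces a warning.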
Dataset cards
- Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)
Documentation
- Add tutorial for no-code dataset upload #2925 (@stevhliu)
General improvements and bug fixes
- Fix filter leaking #3019 (@lhoestq)
  - calling `filter` several times in a row was not returning the right results in 1.12.0 and 1.12.1
- Update BibTeX entry #2928 (@albertvillanova)
- Fix exception chaining #2911 (@albertvillanova)
- Add regression test for null Sequence #2929 (@albertvillanova)
- Don't use old, incompatible cache for the new `filter` #2947 (@lhoestq)
- Fix fn kwargs in filter #2950 (@lhoestq)
- Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
- Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
- Fix missing conda deps #2952 (@lhoestq)
- Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
- Support pandas 1.3 new `read_csv` parameters #2960 (@SBrandeis)
- Fix CI doc build #2961 (@albertvillanova)
- Run tests in parallel #2954 (@albertvillanova)
- Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
- Take namespace into account in caching #2938 (@lhoestq)
- Make Dataset.map accept list of np.array #2990 (@albertvillanova)
- Fix loading compressed CSV without streaming #2994 (@albertvillanova)
- Fix json loader when conversion not implemented #3000 (@lhoestq)
- Remove all query parameters when extracting protocol #2996 (@albertvillanova)
- Correct a typo #3007 (@Yann21)
- Fix Windows test suite #3025 (@albertvillanova)
- Remove unused parameter in xdirname #3017 (@albertvillanova)
- Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
- Fix typo #3023 (@qqaatw)
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
- Use cache folder for lockfile #2887 (@Dref360)
- Fix streaming: catch Timeout error #3050 (@borisdayma)
- Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
- Fix task reloading from cache #3059 (@lhoestq)
- Fix test command after refac #3065 (@lhoestq)
- Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
- Update summary on PyPi beyond NLP #3062 (@thomwolf)
- Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
- feat: increase streaming retry config #3068 (@borisdayma)
- Fix pathlib patches for streaming #3072 (@lhoestq)
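One fix in the list above (#2936) notes that `nan != nan`, which matters whenever values are compared for equality after a conversion. The sketch below shows the general pitfall in plain Python; the helper `same_values` is a hypothetical name and the code is not taken from the library.

```python
import math

values = [1.0, math.nan, 3.0]
round_tripped = [float(v) for v in values]

# Element-wise equality fails on the nan entry, although nothing changed.
naive_equal = all(x == y for x, y in zip(values, round_tripped))


def same_values(xs, ys):
    """Compare element-wise, treating two nans in the same position as equal."""
    return all(
        x == y or (math.isnan(x) and math.isnan(y)) for x, y in zip(xs, ys)
    )


nan_aware_equal = same_values(values, round_tripped)
```

A check based on plain equality would wrongly report a change for any float data containing nan, so such checks need to skip floats or compare nan-aware.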
Breaking changes:
- Due to the big refactoring at #2986, the `prepare_module` function doesn't support the `return_resolved_file_path` and `return_associated_base_path` parameters. As an alternative, you may use the `dataset_module_factory` instead.
Published by lhoestq over 4 years ago
datasets - 1.12.1
Bug fixes
- Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
- Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
- Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
  - this fixes the `ArrowInvalid: Can only convert 1-dimensional array values` errors
Published by lhoestq over 4 years ago
datasets - 1.12.0
New documentation
- New documentation structure #2718 (@stevhliu):
  - New: Tutorials
  - New: How-to guides
  - New: Conceptual guides
  - Update: Reference
See the new documentation here !
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- add multi-proc in `to_json` #2747 (@bhavitvyamalik)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update `column_names` showed as `:func:` in exploring.rst #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
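The query-string fix listed above (#2856) removes a URL's query string only when it is exactly `?dl=1`. A minimal sketch of that rule with the standard library follows; the function name `strip_dl_query` and the example URLs are assumptions for illustration, not the library's code.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_dl_query(url):
    """Drop the query string only when it is exactly 'dl=1'."""
    parts = urlsplit(url)
    if parts.query == "dl=1":
        parts = parts._replace(query="")
    return urlunsplit(parts)


cleaned = strip_dl_query("https://example.com/archive.zip?dl=1")
untouched = strip_dl_query("https://example.com/data.csv?version=2")
```

Any other query string is preserved, since it may be required to fetch the right file.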
Published by lhoestq over 4 years ago
datasets - 1.11.0
Datasets Changes
- New: Add Russian SuperGLUE #2668 (@slowwavesleep)
- New: Add Disfl-QA #2473 (@bhavitvyamalik)
- New: Add TimeDial #2476 (@bhavitvyamalik)
- Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
- Fix: Update WikiANN data URL #2710 (@albertvillanova)
- Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
- Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)
General improvements and bug fixes
- fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
- Update BibTeX entry #2706 (@albertvillanova)
- Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
- Add support for disable_progress_bar on Windows #2696 (@mariosasko)
- Ignore empty batch when writing #2698 (@pcuenca)
- Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
- fix: 🐛 fix two typos #2720 (@severo)
- Docs details #2690 (@severo)
- Deal with the bad check in test_load.py #2721 (@mariosasko)
- Pass use_auth_token to request_etags #2725 (@albertvillanova)
- Typo fix `tokenize_exemple` #2726 (@shabie)
- Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
- Add missing parquet known extension #2733 (@lhoestq)
Published by albertvillanova over 4 years ago
datasets - 1.10.0
Datasets Features
- Support remote data files #2616 (@albertvillanova)
  This allows passing URLs of remote data files to any dataset loader:
  ```python
  from datasets import load_dataset
  load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file, ...]})
  ```
  This works for all these dataset loaders:
  - text
  - csv
  - json
  - parquet
  - pandas
- Streaming from remote text/json/csv/parquet/pandas files:
  When you pass URLs to a dataset loader, you can enable streaming mode with `streaming=True`. Main contributions:
  - Streaming for the Pandas loader #2636 (@lhoestq)
  - Streaming for the CSV loader #2635 (@lhoestq)
  - Streaming for the Json loader #2608 (@albertvillanova) #2638 (@lhoestq)
- Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
- Delete extracted files when loading dataset #2631 (@albertvillanova)
Datasets Changes
- Fix: C4 - fix expected files list #2682 (@lhoestq)
- Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
- Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
- Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
- Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
- Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
- Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)
Dataset Tasks
- Add speech processing tasks #2620 (@lewtun)
- Update ASR tags #2633 (@lewtun)
- Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
- Add ASR task for SUPERB #2619 (@lewtun)
- add image-classification task template #2632 (@nateraw)
Metrics Changes
- New: wiki_split #2623 (@bhadreshpsavani)
- Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
- Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)
General improvements and bug fixes
- Fix BibTeX entry #2594 (@albertvillanova)
- Fix test_is_small_dataset #2588 (@albertvillanova)
- Remove import of transformers #2602 (@albertvillanova)
- Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
- Fix `filter` with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
- Remove redundant prepare_module #2597 (@albertvillanova)
- Create ExtractManager #2295 (@albertvillanova)
- Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
- Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
- Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
- Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
- Use correct logger in metrics.py #2626 (@mariosasko)
- Minor fix tests with Windows paths #2627 (@albertvillanova)
- Use ETag of remote data files #2628 (@albertvillanova)
- More consistent naming #2611 (@mariosasko)
- Refactor patching to specific submodule #2639 (@albertvillanova)
- Fix docstrings #2640 (@albertvillanova)
- Fix anchor in README #2647 (@mariosasko)
- Fix logging docstring #2652 (@mariosasko)
- Allow dataset config kwargs to be None #2659 (@lhoestq)
- Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
- Use tqdm from tqdm_utils #2667 (@mariosasko)
- Increase json reader block_size automatically #2676 (@lhoestq)
- Parallelize ETag requests #2675 (@lhoestq)
- Fix bad config ids that name cache directories #2686 (@lhoestq)
- Minor documentation fix #2687 (@slowwavesleep)
Dataset Cards
- Add missing WikiANN language tags #2610 (@albertvillanova)
- feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)
Docs
- Update processing.rst with other export formats #2599 (@TevenLeScao)
Published by lhoestq over 4 years ago
datasets -
Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@jerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: codesearchnet - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq)
  - Fast download and process your data on-the-fly when iterating over your dataset
  - Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
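Dataset streaming, listed above, processes examples on the fly while iterating instead of materializing the whole dataset first. The sketch below illustrates that idea with plain Python generators; it does not use the library, and the in-memory file and helper names `stream_rows` and `lazy_map` are assumptions chosen so the example runs offline.

```python
import csv
import io


def stream_rows(text_file):
    """Yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(text_file):
        yield row


def lazy_map(rows, function):
    """Apply function to each row only when that row is requested."""
    for row in rows:
        yield function(row)


# In-memory stand-in for a remote file (an assumption to stay offline).
remote_file = io.StringIO("text,label\nhello,0\nworld,1\n")
rows = lazy_map(
    stream_rows(remote_file),
    lambda r: {**r, "length": len(r["text"])},
)

# Nothing is read or processed until a row is requested.
first = next(rows)
```

Because each stage is a generator, memory use stays proportional to one row rather than to the dataset size.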
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during `Dataset.map` #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad `n>1M` size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- no s at load_datasets #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)
Published by lhoestq over 4 years ago
datasets - 1.8.0
Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add `desc` parameter in `map` for `DatasetDict` object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
  - `Dataset.cast` can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
  - we used to have the datasets IN_MEMORY_MAX_SIZE set to 250MB
  - we changed this to zero: by default datasets are loaded from the disk with memory mapping and not copied in memory
  - users can still set `keep_in_memory=True` when loading a dataset to load it in memory
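The breaking change above can be summarized as a small decision rule. The function below is a hypothetical sketch, not the library's code: the name `should_load_in_memory`, its parameters, and the exact comparison are assumptions based only on the description in these notes.

```python
def should_load_in_memory(dataset_size, max_in_memory_size=0, keep_in_memory=None):
    """Hypothetical rule: decide whether a dataset is copied into memory."""
    if keep_in_memory is not None:
        # An explicit user choice always wins.
        return keep_in_memory
    # A max size of 0 disables automatic in-memory loading.
    return bool(max_in_memory_size) and dataset_size < max_in_memory_size


MB = 2**20
old_default = should_load_in_memory(100 * MB, max_in_memory_size=250 * MB)
new_default = should_load_in_memory(100 * MB)
forced = should_load_in_memory(100 * MB, keep_in_memory=True)
```

With the former 250MB threshold a 100MB dataset would be copied into memory; with the new default of zero it stays memory-mapped unless `keep_in_memory=True` is passed.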
Datasets Cards
- adds license information for DailyDialog. #2419 (@aditya2211)
- add english language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- add utf-8 while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)
Experimental and work in progress: Format a dataset for specific tasks
- Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
- Insert task templates for text classification #2389 (@lewtun)
- Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
- Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)
Published by lhoestq over 4 years ago
datasets - 1.7.0
Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - the DART file checksums in GEM #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indicglue - Fix number of classes in indicglue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add rename_columns method #2312 (@SBrandeis)
- Add `desc` to `tqdm` in `Dataset.map()` #2374 (@bhavitvyamalik)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
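The "CER above 1" note can look surprising, so here is a minimal sketch of why it happens. It assumes the usual definition of character error rate (edit distance divided by reference length); `edit_distance` and `cer` are illustrative helpers written for this note, not the metric's actual implementation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, assuming a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


perfect = cer("abc", "abc")      # no edits needed
too_long = cer("ab", "abxyz")    # 3 insertions against 2 reference chars
```

Because insertions are counted but the denominator is the reference length only, a hypothesis much longer than its reference pushes the ratio past 1.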
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add `key` type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)
Published by lhoestq over 4 years ago
datasets - 1.6.2
Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.
Breaking change:
- when using Dataset.map with the input_columns parameter, the resulting dataset will only have the columns from input_columns and the columns added by the map functions. The other columns are discarded.
Published by lhoestq almost 5 years ago
datasets - 1.6.0
Dataset changes
- New: MOROCO #2002 (@MihaelaGaman)
- New: CBT dataset #2044 (@gchhablani)
- New: MDD Dataset #2051 (@gchhablani)
- New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
- New: bAbI QA tasks #2053 (@gchhablani)
- New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
- New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
- New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
- New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
- New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
- New: banking77 #2140 (@dkajtoch)
- New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
- New: CUAD dataset #2219 (@bhavitvyamalik)
- Update: Gem V1.1 + new challenge sets #2142 #2186 (@yjernite)
- Update: Wikiann - added spans field #2141 (@rabeehk)
- Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
- Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
- Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
- Fix: wiki_auto - fix link #2171 (@mounicam)
- Fix: wino_bias - use right splits #1930 (@JieyuZhao)
- Fix: lc_quad - update download checksum #2213 (@mariosasko)
- Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
- Fix: xnli - fix tuple key #2233 (@NikhilBartwal)
Dataset features
- Allow stateful function in dataset.map #1960 (@mariosasko)
- MIAM dataset - new citation details #2101 (@eusip)
- [Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
- Allow pickling of big in-memory tables #2150 (@lhoestq)
- updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
- Fast table queries with interpolation search #2122 (@lhoestq)
- Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
- Implementation of class_encode_column #2184 #2227 (@SBrandeis)
- Add support for axis in concatenate datasets #2151 (@albertvillanova)
- Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
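The "Fast table queries with interpolation search" entry above names the technique without showing it. The sketch below illustrates interpolation search over sorted cumulative row offsets, to find which chunk holds a given row. The `offsets` layout and the function name are assumptions made for illustration; this is not the library's code.

```python
from typing import List


def find_chunk(offsets: List[int], row: int) -> int:
    """Return k such that offsets[k] <= row < offsets[k + 1].

    `offsets` is assumed sorted, e.g. cumulative row counts per chunk.
    """
    i, j = 0, len(offsets) - 1
    while i < j and offsets[i] <= row < offsets[j]:
        # Guess the position proportionally instead of halving the range.
        k = i + (j - i) * (row - offsets[i]) // (offsets[j] - offsets[i])
        if offsets[k] <= row < offsets[k + 1]:
            return k
        if offsets[k] <= row:
            i = k + 1
        else:
            j = k
    raise IndexError(f"row {row} is out of bounds")


# Three chunks holding rows [0, 100), [100, 250) and [250, 400).
offsets = [0, 100, 250, 400]
first = find_chunk(offsets, 0)
middle = find_chunk(offsets, 100)
last = find_chunk(offsets, 399)
```

When chunks have similar sizes, the proportional guess usually lands on the right chunk within a step or two, which is the appeal over plain binary search.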
Metrics changes
- New: CER metric #2138 (@chutaklee)
- Update: WER - Compute metric iteratively #2111 (@albertvillanova)
- Update: seqeval - configurable options to `seqeval` metric #2204 (@marrodion)
Dataset cards
- REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
- Winobias: fix split infos #2152 (@JieyuZhao)
- all: Fix size categories in YAML Tags #2074 (@gchhablani)
- LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
- Swda: Update README.md #2235 (@PierreColombo)
General improvements and bug fixes
- Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
- Fix max_wait_time in requests #2085 (@lhoestq)
- Fix copy snippet in docs #2091 (@mariosasko)
- Fix deprecated warning message and docstring #2100 (@albertvillanova)
- Move Dataset.to_csv to csv module #2102 (@albertvillanova)
- Fix: Allows a feature to be named "_type" #2093 (@dcfidalgo)
- copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
- Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
- Implement Dataset as context manager #2113 (@albertvillanova)
- Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
- Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
- Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
- add social thumbnail #2177 (@philschmid)
- Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
- Fix typo in huggingface hub #2192 (@LysandreJik)
- Update metadata if dataset features are modified #2087 (@mariosasko)
- fix missing indices_files in load_from_disk #2197 (@lhoestq)
- Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
- Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
- Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
- Remove Python2 leftovers #2208 (@mariosasko)
- Revert breaking change in cache_files property #2217 (@lhoestq)
- Set test cache config #2223 (@albertvillanova)
- Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
- Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
- Preserve split type when reloading dataset #2168 (@mariosasko)
Docs
- make documentation more clear to use different cloud storage #2127 (@philschmid)
- Render docstring return type as inline #2147 (@albertvillanova)
- Add table classes to the documentation #2155 (@lhoestq)
- Pin docutils for better doc #2174 (@sgugger)
- Fix docstrings issues #2081 (@albertvillanova)
- Add code of conduct to the project #2209 (@albertvillanova)
- Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
- Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)
Published by lhoestq almost 5 years ago
datasets - 1.5.0
Datasets changes
- New: Europarl Bilingual #1874 (@lucadiliello)
- New: Stanford Sentiment Treebank #1961 (@patpizio)
- New: RO-STS #1978 (@lorinczb)
- New: newspop #1871 (@frankier)
- New: FashionMNIST #1999 (@gchhablani)
- New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
- New: Cryptonite #2013 (@theo-m)
- New: RoSent #2011 (@gchhablani)
- New: PersiNLU reading-comprehension #2028 (@danyaljj)
- New: conllpp #1991 (@ZihanWangKi)
- New: LaRoSeDa #2004 (@MihaelaGaman)
- Update: unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
- Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
- Update: Md gender - card update #2018 (@mcmillanmajora)
- Update: XQuAD - add Romanian #2023 (@M-Salti)
- Update: DROP - all answers #1980 (@KaijuML)
- Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
- Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
- Fix: Doc2dial update datainfos and dataloaders #2041 (@songfeng)
- Fix: ZEST - update download link #2057 (@matt-peters)
- Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)
Datasets Features
- Implement Dataset from CSV #1946 (@albertvillanova)
- Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
- Implement Dataset from text #2030 (@albertvillanova)
- Optimize int precision for tokenization #1985 (@albertvillanova)
- This saves 75%+ of space when tokenizing a dataset
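To see where a figure like 75% can come from, here is a back-of-the-envelope sketch using only the standard library. It assumes token ids were stored as 64-bit integers and are narrowed to 16-bit ones. The release note does not say which widths the library picks, so the type codes below are illustrative only.

```python
from array import array

# Hypothetical token ids; all of them fit in a signed 16-bit integer.
token_ids = [101, 2023, 2003, 1037, 7099, 102]

wide = array("q", token_ids)    # 'q': signed 64-bit integers
narrow = array("h", token_ids)  # 'h': signed 16-bit integers

wide_bytes = wide.itemsize * len(wide)
narrow_bytes = narrow.itemsize * len(narrow)
saving = 1 - narrow_bytes / wide_bytes
```

The stored values are identical in both arrays; only the per-element width changes, and so does the size of the resulting table.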
General Bug fixes and improvements
- Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
- feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
- Fix various typos/grammar in the docs #2008 (@mariosasko)
- Update format columns in Dataset.rename_columns #2027 (@mariosasko)
- Replace print with logging in dataset scripts #2019 (@mariosasko)
- Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
- Not all languages have 2 digit codes. #2016 (@asiddhant)
- Fix arrow memory checks issue in tests #2042 (@lhoestq)
- Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
- Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
- Fix text-classification tags #2049 (@gchhablani)
- Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
- Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
- Add and fix docstring for NamedSplit #2069 (@albertvillanova)
- Bump huggingface_hub version #2077 (@SBrandeis)
- Fix docstring issues #2072 (@albertvillanova)
Published by lhoestq almost 5 years ago
datasets -
Fix an issue #1981 with WMT downloads #1982 (@albertvillanova)
Published by lhoestq almost 5 years ago
datasets - 1.4.0
Datasets Changes
- New: iapp_wiki_qa_squad #1873 (@cstorm125)
- New: Financial PhraseBank #1866 (@frankier)
- New: CoVoST2 #1935 (@patil-suraj)
- New: TIMIT #1903 (@vrindaprabhu)
- New: Mlama (multilingual lama) #1931 (@pdufter)
- New: FewRel #1823 (@gchhablani)
- New: CCAligned Multilingual Dataset #1815 (@gchhablani)
- New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
- Update: WMT - use mirror links #1912 for better download speed (@lhoestq)
- Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
- Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
- Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
- Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
- Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
- Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
- Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)
Datasets Features
- Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
- Add to_csv for Dataset #1887 (@SBrandeis)
- Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
- Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
- Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
- Add cross-platform support for datasets-cli #1951 (@mariosasko)
Metrics Changes
- New: sari metric #1875 (@ddhruvkr)
Offline loading
- Handle timeouts #1952 (@lhoestq)
- Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)
General improvements and bugfixes
- Replace flatten_nested #1879 (@albertvillanova)
- add missing info on how to add large files #1885 (@stas00)
- Docs for adding new column on formatted dataset #1888 (@lhoestq)
- Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
- Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
- Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
- Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
- Standardizing datasets dtypes #1921 (@justin-yan)
- Remove unused py_utils objects #1916 (@albertvillanova)
- Fix save_to_disk with relative path #1923 (@lhoestq)
- Updating old cards #1928 (@mcmillanmajora)
- Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
- Fix builder config creation with data_dir #1932 (@lhoestq)
- Disallow ClassLabel with no names #1938 (@lhoestq)
- Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
- Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
- typos + grammar #1955 (@stas00)
- Fix unused arguments #1962 (@mariosasko)
- Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)
Published by lhoestq almost 5 years ago
datasets -
Dataset Features
- On-the-fly data transforms (#1795)
- ADD S3 support for downloading and uploading processed datasets (#1723)
- Allow loading dataset in-memory (#1792)
- Support future datasets (#1813)
- Enable/disable caching (#1703)
- Offline dataset loading (#1726)
Datasets Hub Features
- Loading from the Datasets Hub (#1860) This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library. Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information in the dataset sharing section of the documentation
Dataset Changes
- New: LJ Speech (#1878)
- New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
- New: cord 19 (#1850)
- New: Tweet Eval Dataset (#1829)
- New: CIFAR-100 Dataset (#1812)
- New: SICK (#1804)
- New: BBC Hindi NLI Dataset (#1158)
- New: Freebase QA Dataset (#1814)
- New: Arabic sarcasm (#1798)
- New: Semantic Scholar Open Research Corpus (#1606)
- New: DuoRC Dataset (#1800)
- New: Aggregated dataset for the GEM benchmark (#1807)
- New: CC-News dataset of English language articles (#1323)
- New: irc disentangle (#1586)
- New: Narrative QA Manual (#1778)
- New: Universal Morphologies (#1174)
- New: SILICONE (#1761)
- New: Librispeech ASR (#1767)
- New: OSCAR (#1694, #1868, #1833)
- New: CANER Corpus (#1684)
- New: Arabic Speech Corpus (#1852)
- New: id_liputan6 (#1740)
- New: Structured Argument Extraction for Korean dataset (#1748)
- New: TurkCorpus (#1732)
- New: Hatexplain Dataset (#1716)
- New: adversarialQA (#1714)
- Update: Doc2dial - reading comprehension update to latest version (#1816)
- Update: OPUS Open Subtitles - add with metadata information (#1865)
- Update: SWDA - use all metadata features (#1799)
- Update: SWDA - add metadata and correct splits (#1749)
- Update: CommonGen - update citation information (#1787)
- Update: SciFact - update URL (#1780)
- Update: BrWaC - update features name (#1736)
- Update: TLC - update urls to be github links (#1737)
- Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
- Fix: multi_woz_v22 - fix checksums (#1880)
- Fix: limit - fix url (#1861)
- Fix: WebNLG - fix test set + more fields (#1739)
- Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
- Fix: reuters - add missing "brief" entries (#1744)
- Fix: thainer: empty token bug (#1734)
- Fix: lst20: empty token bug (#1734)
Metrics Changes
- New: Word Error Metric (#1847)
- New: COMET (#1577, #1753)
- Fix: bert_score - set version dependency (#1851)
Metric Docs
- Add metrics usage examples and tests (#1820)
CLI Changes
- [BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI
Bug fixes
- fix writing GPU Faiss index (#1862)
- update pyarrow import warning (#1782)
- Ignore definition line number of functions for caching (#1779)
- update saving and loading methods for faiss index to accept path-like objects (#1663)
- Print error message with filename when malformed CSV (#1826)
- Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
Refactoring
- Refactoring: Create config module (#1848)
- Use a config id in the cache directory names for custom configs (#1754)
Logging
- Enable logging propagation and remove logging handler (#1845)
Published by lhoestq about 5 years ago
datasets - 1.2.1
New Features
- Fast start up (#1690): Importing `datasets` is now significantly faster.
Datasets Changes
- New: MNIST (#1730)
- New: Korean intonation-aided intention identification dataset (#1715)
- New: Switchboard Dialog Act Corpus (#1678)
- Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
- Update: Scientific papers - Mirror datasets zip (#1721)
- Update: Update DBRD dataset card and download URL (#1699)
- Fix: Thainer - fix ner_tag bugs (#1695)
- Fix: reuters21578 - metadata parsing errors (#1693)
- Fix: ade_corpus_v2 - fix config names (#1689)
- Fix: DaNE - fix last example (#1688)
Datasets tagging
- rename "part-of-speech-tagging" tag in some dataset cards (#1645)
Bug Fixes
- Fix column list comparison in transmit format (#1719)
- Fix windows path scheme in cached path (#1711)
Docs
- Add information about caching and verifications in "Load a Dataset" docs (#1705)
Moreover many dataset cards of datasets added during the sprint were updated ! Thanks to all the contributors :)
Published by lhoestq about 5 years ago
datasets -
Intermediate release before v2.0.0
Includes all the datasets added during the datasets sprint of December 2020 (currently over 610 datasets).
Published by lhoestq about 5 years ago
datasets - 1.1.3
Datasets changes
- New: NLI-Tr (#787)
- New: Amazon Reviews (#791)(#844)(#845)(#799)
- New: ASNQ - answer sentence selection (#780)
- New: OpenBookCorpus (#856)
- New: ASLG-PC12 - sign language translation (#731)
- New: Quail - question answering dataset (#747)
- Update: SNLI: Created dataset card snli.md (#663)
- Update: csv - Use pandas reader in csv (#857)
- Better memory management
- Breaking: the previous `read_options`, `parse_options` and `convert_options` are replaced with plain parameters like `pandas.read_csv`
- Update: conll2000, conll2003, germeval14, wnut17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
- Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
- Update: XNLI - Add XNLI train set (#781)
- Update: XSUM - Use full released xsum dataset (#754)
- Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
- Update: CLUE - add OCNLI, a new CLUE dataset (#742)
- Fix: KOR-NLI - Fix csv reader (#855)
- Fix: Discofuse - fix discofuse urls (#793)
- Fix: Emotion - fix description (#745)
- Fix: TREC - update urls (#740)
Metrics changes
- New: accuracy, precision, recall and F1 metrics (#825)
- Fix: squad_v2 (#840)
- Fix: seqeval (#810)(#738)
- Fix: Rouge - fix description (#774)
- Fix: GLUE - fix description (#734)
- Fix: BertScore - fix custom baseline (#763)
Command line tools
- add clear_cache parameter in the test command (#863)
Dependencies
- Integrate file_lock inside the lib for better logging control (#859)
Dataset features
- Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
- pretty print dataset objects (#725)
- allow custom split names in text dataset (#776)
Tests
- All configs is a slow test now
Bug fixes
- Make save function use deterministic global vars order (#819)
- fix type hints pickling in python 3.6 (#818)
- fix metric deletion when attributes are missing (#782)
- Fix custom builder caching (#770)
- Fix metric with cache dir (#772)
- Fix train_test_split output format (#719)
Published by lhoestq over 5 years ago
datasets -
Dataset changes
- Fix: text - use python read instead of pandas reader (#715):
- fix delimiter/overflow issues
- better memory handling
Bug fixes
- Fix dataset configuration creation using `data_files` per split using NamedSplit (#706)
- Fix permission issue on windows - don't use tqdm 4.50.0 (#718)
Published by lhoestq over 5 years ago
datasets - 1.1.0: Windows support, Better Multiprocessing, New Datasets
Windows support
- Add Windows support (#644):
- add tests and CI for Windows
- fix numerous windows specific issues
- The library now fully supports Windows
Dataset changes
- New: HotpotQA (#703)
- New: OpenWebText (#660)
- New: Winogrande - add debiased subset (#655)
- Update: XNLI - update download link (#695)
- Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
- Update: csv - add features parameter to CSV (#685)
- Fix: GAP - fix wrong computation of boolean features (#680)
- Fix: C4 - fix manual instruction function (#681)
Metric changes
- Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
- Fix: SQuAD - fix kwargs description (#670)
Dataset Features
- Use multiprocess from pathos for multiprocessing (#656):
- allow lambda functions in multiprocessed map
- allow local functions in multiprocessed map
- and more! As long as functions are compatible with `dill`
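Lambdas and local functions needed special support because the standard-library `pickle` serializes functions by reference (module plus name). The short check below shows the limitation using only the standard library. `can_pickle` is a helper written for this note, and `dill` itself is deliberately not imported.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A built-in function is pickled by reference (module + name), so this works.
builtin_ok = can_pickle(len)

# A lambda has no importable name, so standard pickle rejects it.
lambda_ok = can_pickle(lambda x: x + 1)
```

A serializer that captures the function body rather than its name, as `dill` does, avoids this restriction when functions are sent to worker processes.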
Bug fixes
- Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
- Datasets: fix cast with unordered features - fix column order issue in cast (#684)
- Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
- Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
- Datasets: fix empty dictionaries after multiprocessing - keep new columns in transmit format (#659)
- Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
- Metrics: fix compute metric with empty input - pass metric features to the reader (#654)
Documentation
- Elasticsearch integration documentation (#696)
Tests
- Use GitHub instead of AWS in remote dataset tests (#694)
Published by lhoestq over 5 years ago
datasets -
Dataset changes:
- New: CoNLL-2003 (#613)
- New: ConLL-2000 (#634)
- New: MATINF (ACL 2020) (#637)
- New: Polyglot-NER (#641)
- Update: GLUE - update GLUE urls (now hosted on FB) (#626)
- Update: GLUE/qqp - update download checksum (#639)
- Update: MLQA - feature names update (#627)
- Update: LinCE - update feature names - Consistent ner features (#636)
- Update: WNUT 17: update feature names - Consistent ner features (#642)
- Update: XTREME/PAN-X - update feature names - Consistent ner features (#636)
- Update: RACE - update dataset checksum + add new configurations (#540)
- Fix: text - fix delimiter (#631)
- Fix: Wiki DPR - fix download error in wiki_dpr (f38a871)
Logging:
- Set level to warning (previously info) (#635)
Bug fixes:
- make shuffle compatible with temp_seed (#640)
- don't use take on dataset table (offset overflow error) (#645)
- handle connection error when downloading from HF google storage (#652)
Published by lhoestq over 5 years ago
datasets -
Fix: - add multiprocessing to dataset dict (#612)
Published by lhoestq over 5 years ago
datasets - 1.0.0 Release: New name, Speed-ups, Multimodal, Serialization
Package Changes
- Rename: nlp -> datasets
Update now with `pip install datasets`
Dataset Features
- Keep the dataset format after dataset transforms (#607)
- Pickle support (#536)
- Save and load datasets to/from disk (#571)
- Multiprocessing in `map` and `filter` (#552)
- Multi-dimensional arrays support for multi-modal datasets (#533, #363)
- Speed up Tokenization by optimizing casting to python objects (#523)
- Speed up shuffle/shard/select methods - use indices mappings (#513)
- Add `input_column` parameter in `map` and `filter` (#475)
- Speed up download and processing (#563)
- Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
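The "use indices mappings" speed-up listed above can be pictured as follows: instead of rewriting rows, a transform only produces a new list of row positions over the same underlying data. `IndexedView` is a hypothetical class written for this note, not the library's `Dataset`, and the strided sharding is an assumption.

```python
import random


class IndexedView:
    """Rows stay in place; transforms only build a new list of row positions."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared between views, never copied or reordered
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]

    def select(self, positions):
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shard(self, num_shards, index):
        # Strided sharding: every num_shards-th row, starting at `index`.
        return IndexedView(self._rows, self._indices[index::num_shards])

    def shuffle(self, seed):
        permuted = self._indices[:]
        random.Random(seed).shuffle(permuted)
        return IndexedView(self._rows, permuted)


rows = ["a", "b", "c", "d", "e"]
view = IndexedView(rows)
picked = view.select([4, 0])
shard0 = view.shard(num_shards=2, index=0)
shuffled = view.shuffle(seed=42)
```

Each transform costs time proportional to the number of indices rather than to the size of the stored rows, and transforms compose by remapping indices.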
Dataset Changes
- New: IWSLT 2017 (#470)
- New: CommonGen Dataset (#578)
- New: CLUE Benchmark (11 datasets) (#572)
- New: the KILT knowledge source and tasks (#559)
- New: DailyDialog (#556)
- New: DoQA dataset (ACL 2020) (#473)
- New: reuters21578 (#570)
- New: HANS (#551)
- New: MLSUM (#529)
- New: Guardian authorship (#452)
- New: web_questions (#401)
- New: MS MARCO (#364)
- Update: Germeval14 - update download url (#594)
- Update: LinCE - update download url (#550)
- Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
- Update: Rotten Tomatoes - update download url (#484)
- Update: Wiki DPR - Use HNSW faiss index (#500)
- Update: Text - Speed up using multi-threaded PyArrow loading (#548)
- Fix: GLUE, PAWS-X - skip header (#497)
[Breaking] Update Dataset and DatasetDict API (#459)
- Rename the flatten, drop and dictionary_encode_column methods to flatten_, drop_ and dictionary_encode_column_ to indicate that these methods have in-place effects
- Remove the dataset.columns property and dataset.nbytes
- Add a few more properties and methods to DatasetDict
Metric Features
- Disallow the use of positional arguments to avoid predictions vs references mistakes (#466)
- Allow to directly feed numpy/pytorch/tensorflow/pandas objects in metrics (#466)
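The first item above relies on a plain Python feature: a bare `*` in the signature makes every following parameter keyword-only. The sketch below shows the idea with a hypothetical `compute_accuracy` function, which is not the library's `Metric.compute`.

```python
def compute_accuracy(*, predictions, references):
    """Fraction of predictions equal to their reference (keyword-only arguments)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


score = compute_accuracy(predictions=[1, 0, 1], references=[1, 1, 1])

# Passing the lists positionally is rejected, so they cannot be swapped silently.
try:
    compute_accuracy([1, 0, 1], [1, 1, 1])
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Accuracy happens to be symmetric, but for asymmetric metrics a swapped pair of arguments would silently give a wrong score, which is the mistake this change guards against.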
Metric Changes
- New: METEOR metric (#479)
- Fix: Sacrebleu - fix inputs format (#520)
Loading script Features
- Pin the version of the scripts (reproducibility) (#603, #584)
- Specify default `script_version` with the env variable `HF_SCRIPTS_VERSION` (#584)
- Save scripts in a modules cache directory that can be controlled with `HF_MODULES_CACHE` (#574)
Caching
- Better support for tokenizers when caching `map` results (#601)
- Faster caching for text dataset (#573, #502)
- Use dataset fingerprints, updated after each transform (#536)
- Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
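As a rough picture of "fingerprints, updated after each transform", the sketch below derives each new fingerprint from the previous one plus a description of the transform, so the same chain of transforms always leads to the same cache key. `update_fingerprint`, the JSON encoding and the 16-character truncation are all assumptions made for illustration; the library's actual hashing scheme is not described in these notes.

```python
import hashlib
import json


def update_fingerprint(fingerprint: str, transform_name: str, params: dict) -> str:
    """Derive a new fingerprint from the previous one and the transform applied."""
    payload = json.dumps([fingerprint, transform_name, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


initial = "0" * 16  # hypothetical fingerprint of a freshly loaded dataset
after_map = update_fingerprint(initial, "map", {"function": "tokenize", "batched": True})
same_again = update_fingerprint(initial, "map", {"function": "tokenize", "batched": True})
other_args = update_fingerprint(initial, "map", {"function": "tokenize", "batched": False})
after_shuffle = update_fingerprint(after_map, "shuffle", {"seed": 42})
```

A cache keyed on such a value can be reused when the same transform is applied to the same data, and is bypassed when any argument or earlier step changes.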
Documentation
- Metrics documentation (#579)
Miscellaneous
- Add centralized logging - Bump-up cache loads to warnings (#538)
Bug fixes
- Datasets: [Breaking] fixed typo in "formated_as" method: rename formated to formatted (#516)
- Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
- Datasets: fixed `select` method for pyarrow < 1.0.0 (#585)
- Datasets: fixed elasticsearch result ids returning as strings (#487)
- Datasets: fixed config used for slow test on real dataset (#527)
- Datasets: fixed tensorflow-formatted datasets outputs by using ragged tensor by default (#530)
- Datasets: fixed batched map for formatted dataset (#515)
- Datasets: fixed encodings issues on Windows - apply utf-8 encoding to all datasets (#481)
- Datasets: fixed dataset.map for function without outputs (#506)
- Datasets: fixed bad type in overflow check (#496)
- Datasets: fixed dataset info save - don't use beam fs to save info for local cache dir (#498)
- Datasets: fixed arrays outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
- Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)
Published by lhoestq over 5 years ago
datasets - 0.4.0
Datasets Features
- add from_pandas and from_dict
- add shard method
- add rename/remove/cast columns methods
- faster select method
- add concatenate datasets
- add support for taking samples using numpy arrays
- add export to TFRecords
- add features parameter when loading from text/json/pandas/csv or when using the map transform
- add support for nested features for json
- add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
- add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, in order to be able to query passages for Open Domain QA for example
- add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return examples ids
- add save_faiss_index/load_faiss_index to save/load a serialized faiss index
Datasets changes
- new: PG19
- new: ANLI
- new: WikiSQL
- new: qa_zre
- new: MWSC
- new: AG news
- new: SQuADShifts
- new: doc red
- new: Wiki DPR
- new: fever
- new: hyperpartisan news detection
- new: pandas
- new: text
- new: emotion
- new: quora
- new: BioMRC
- new: web questions
- new: search QA
- new: LinCE
- new: TREC
- new: Style Change Detection
- new: 20newsgroup
- new: social bias frames
- new: Emo
- new: web of science
- new: sogou news
- new: crd3
- update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
- update: xtreme - add PAWS-X.es
- update: xsum - manual download is no longer required.
- new processed: Natural Questions
Metrics Features
- add seed parameter for metrics that do sampling, like rouge
- better installation messages
Metrics changes
- new: bleurt
- update seqeval: fix entities extraction (more info here)
Bug fixes
- fix bug in map and select that was causing memory issues
- fix pyarrow version check
- fix text/json/pandas/csv caching when loading different files in a row
- fix metrics caching when they have different config names
- fix cache that was not discarded when there's a KeyboardInterrupt during .map
- fix sacrebleu tokenizer's parameter
- fix docstrings of metrics when multiple instances are created
More Tests
- add tests for features handling in dataset transforms
- add tests for dataset builders
- add tests for metrics loading
Backward compatibility
- because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json
Published by lhoestq over 5 years ago
datasets -
New methods to transform a dataset:
- dataset.shuffle: create a shuffled dataset
- dataset.train_test_split: create a train and a test split (similar to sklearn)
- dataset.sort: create a dataset sorted according to a certain column
- dataset.select: create a dataset with rows selected following the given list of indices
Other features:
- Better instructions for datasets that require manual download
> Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised
- Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)
Datasets:
- New: cose v1.0
- New: rotten_tomatoes
- New: german and italian wikipedia
New docs:
- documentation about splitting a dataset
Bug fixes:
- fix metric.compute that couldn't write on file
- fix squad_v2 imports
Published by lhoestq over 5 years ago