Recent Releases of datasets

datasets - 4.0.0

New Features

Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595

```python
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```

Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

New Column object
- Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
- Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```

Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

- Requires torch>=2.7.0 and FFmpeg >= 4
- Not available for Windows yet but coming soon; in the meantime, please use datasets<4.0
Load audio data with AudioDecoder:

```python
audio = dataset[0]["audio"]  # an AudioDecoder object
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```

Breaking changes

Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
- trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- torchcodec replaces soundfile for audio decoding
- torchcodec replaces decord for video decoding
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
- Introduction of the List type

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

Sequence was a legacy type from TensorFlow Datasets that converted a list of dicts into a dict of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:

```python
from datasets import Sequence, Value

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
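The "list of dicts to dict of lists" conversion mentioned above can be pictured in plain Python. This is only an illustration of the data layout, not the library's code: the helper name `list_of_dicts_to_dict_of_lists` is hypothetical, and the sketch assumes every row has the same keys.

```python
from typing import Any


def list_of_dicts_to_dict_of_lists(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Transpose row-oriented records into column-oriented lists (hypothetical helper)."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for key, value in row.items():
            # Each key becomes a column that collects one value per row.
            columns.setdefault(key, []).append(value)
    return columns


rows = [{"texts": "a"}, {"texts": "b"}]
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)  # {'texts': ['a', 'b']}
```

The result has the same shape as the second `Sequence({...})` case: a dict whose values are lists.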

Other improvements and bug fixes

Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
[MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
Add missing property on RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672

New Contributors

@TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
@francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
@emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
@SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
@ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
@ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
@TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
@HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659

Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0

- Python
Published by lhoestq about 1 year ago

datasets - 3.6.0

Dataset Features

Enable xet in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/7552
- Faster downloads/uploads with Xet storage
- more info: https://github.com/huggingface/datasets/issues/7526

Other improvements and bug fixes

Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7544
Avoid global umask for setting file mode. by @ryan-clancy in https://github.com/huggingface/datasets/pull/7547
Rebatch arrow iterables before formatted iterable by @lhoestq in https://github.com/huggingface/datasets/pull/7553
Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in https://github.com/huggingface/datasets/pull/7532
fix regression by @lhoestq in https://github.com/huggingface/datasets/pull/7558
fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in https://github.com/huggingface/datasets/pull/7521
Remove aiohttp from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294

New Contributors

@ryan-clancy made their first contribution in https://github.com/huggingface/datasets/pull/7547
@Harry-Yang0518 made their first contribution in https://github.com/huggingface/datasets/pull/7532
@giraffacarp made their first contribution in https://github.com/huggingface/datasets/pull/7521
@akx made their first contribution in https://github.com/huggingface/datasets/pull/7294

Full Changelog: https://github.com/huggingface/datasets/compare/3.5.1...3.6.0

- Python
Published by lhoestq about 1 year ago

datasets - 3.5.1

Bug fixes

support pyarrow 20 by @lhoestq in https://github.com/huggingface/datasets/pull/7540
- Fix pyarrow error TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
Write pdf in map by @lhoestq in https://github.com/huggingface/datasets/pull/7487

Other improvements

update fsspec 2025.3.0 by @peteski22 in https://github.com/huggingface/datasets/pull/7478
Support underscore int read instruction by @lhoestq in https://github.com/huggingface/datasets/pull/7488
Support skip_trying_type by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7483
pdf docs fixes by @lhoestq in https://github.com/huggingface/datasets/pull/7519
Remove conditions for Python < 3.9 by @cyyever in https://github.com/huggingface/datasets/pull/7474
mention av in video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7523
correct use with polars example by @SiQube in https://github.com/huggingface/datasets/pull/7524
chore: fix typos by @afuetterer in https://github.com/huggingface/datasets/pull/7436

New Contributors

@peteski22 made their first contribution in https://github.com/huggingface/datasets/pull/7478
@yoshitomo-matsubara made their first contribution in https://github.com/huggingface/datasets/pull/7483
@SiQube made their first contribution in https://github.com/huggingface/datasets/pull/7524
@afuetterer made their first contribution in https://github.com/huggingface/datasets/pull/7436

Full Changelog: https://github.com/huggingface/datasets/compare/3.5.0...3.5.1

- Python
Published by lhoestq about 1 year ago

datasets - 3.5.0

Datasets Features

Introduce PDF support (#7318) by @yabramuvdi in https://github.com/huggingface/datasets/pull/7325

```python
from datasets import load_dataset, Pdf

repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
dataset = load_dataset(repo, split="train")
dataset[0]["pdf"]
dataset[0]["pdf"].pages[0].extract_text()
...
```

What's Changed

Fix local pdf loading by @lhoestq in https://github.com/huggingface/datasets/pull/7466
Minor fix for metadata files in extension counter by @lhoestq in https://github.com/huggingface/datasets/pull/7464
Prioritize json by @lhoestq in https://github.com/huggingface/datasets/pull/7476

New Contributors

@yabramuvdi made their first contribution in https://github.com/huggingface/datasets/pull/7325

Full Changelog: https://github.com/huggingface/datasets/compare/3.4.1...3.5.0

- Python
Published by lhoestq over 1 year ago

datasets - 3.4.1

Bug Fixes

Fix data_files filtering by @lhoestq in https://github.com/huggingface/datasets/pull/7459

Full Changelog: https://github.com/huggingface/datasets/compare/3.4.0...3.4.1

- Python
Published by lhoestq over 1 year ago

datasets - 3.4.0

Dataset Features

Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
- /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```

- faster streaming for image/audio/video folder from Hugging Face
- support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450
- even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:

```python
dataset = dataset.decode(num_threads=num_threads)
```

Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368

General improvements and bug fixes

fix: None default with bool type on load creates typing error by @stephantul in https://github.com/huggingface/datasets/pull/7426
Use pyupgrade --py39-plus by @cyyever in https://github.com/huggingface/datasets/pull/7428
Refactor string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
Fix small bugs with async map by @lhoestq in https://github.com/huggingface/datasets/pull/7445
Fix resuming after ds.set_epoch(new_epoch) by @lhoestq in https://github.com/huggingface/datasets/pull/7451
minor docs changes by @lhoestq in https://github.com/huggingface/datasets/pull/7452

New Contributors

@stephantul made their first contribution in https://github.com/huggingface/datasets/pull/7426
@cyyever made their first contribution in https://github.com/huggingface/datasets/pull/7428
@jp1924 made their first contribution in https://github.com/huggingface/datasets/pull/7368

Full Changelog: https://github.com/huggingface/datasets/compare/3.3.2...3.4.0

- Python
Published by lhoestq over 1 year ago

datasets - 3.3.2

Bug fixes

Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in https://github.com/huggingface/datasets/pull/7411
Gracefully cancel async tasks by @lhoestq in https://github.com/huggingface/datasets/pull/7414

Other general improvements

Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in https://github.com/huggingface/datasets/pull/7407
Fix a typo in arrow_dataset.py by @jingedawang in https://github.com/huggingface/datasets/pull/7402

New Contributors

@dakinggg made their first contribution in https://github.com/huggingface/datasets/pull/7411
@ibarrien made their first contribution in https://github.com/huggingface/datasets/pull/7407
@jingedawang made their first contribution in https://github.com/huggingface/datasets/pull/7402

Full Changelog: https://github.com/huggingface/datasets/compare/3.3.1...3.3.2

- Python
Published by lhoestq over 1 year ago

datasets - 3.3.1

Bug fixes

Fix filter speed regression by @lhoestq in https://github.com/huggingface/datasets/pull/7408

Full Changelog: https://github.com/huggingface/datasets/compare/3.3.0...3.3.1

- Python
Published by lhoestq over 1 year ago

datasets - 3.3.0

Dataset Features

Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
- Especially useful to download content like images or call inference APIs

```python
prompt = "Answer the following question: {question}. You should think step by step."
async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))
ds = ds.map(ask_llm)
```

Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198

```python
ds = ds.repeat(10)
```

Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
- Add support for "pandas" and "polars" formats in IterableDatasets
- This enables optimized data processing using pandas or polars functions with zero-copy, e.g.

```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```

Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
- IterableDatasets with "numpy" format are now much faster
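The async `map()` example above leaves `query_model` undefined. The stdlib-only sketch below shows why async functions suit I/O-bound work such as calling an inference API: while one call waits, others can proceed. Here `query_model` is a hypothetical stub that sleeps instead of making a request, and `asyncio.gather` stands in for whatever scheduling `map()` performs internally, which these notes do not describe.

```python
import asyncio


async def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a network call to an inference API."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"answer to: {prompt}"


async def ask_llm(example: dict) -> dict:
    prompt = "Answer the following question: {question}. You should think step by step."
    answer = await query_model(prompt.format(question=example["question"]))
    return {"answer": answer}


async def run_all(examples: list[dict]) -> list[dict]:
    # Awaiting the calls together lets their waiting time overlap
    # instead of paying the latency once per example.
    return await asyncio.gather(*(ask_llm(example) for example in examples))


examples = [{"question": "What is 2 + 2?"}, {"question": "Name a prime number"}]
results = asyncio.run(run_all(examples))
print(results[0]["answer"])
```

`asyncio.gather` returns results in the order the coroutines were passed, so each answer lines up with its input example.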

What's Changed

don't import soundfile in tests by @lhoestq in https://github.com/huggingface/datasets/pull/7340
minor video docs on how to install by @lhoestq in https://github.com/huggingface/datasets/pull/7341
Fix typo in arrow_dataset by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7328
remove filecheck to enable symlinks by @fschlatt in https://github.com/huggingface/datasets/pull/7133
Webdataset special columns in last position by @lhoestq in https://github.com/huggingface/datasets/pull/7349
Bump hfh to 0.24 to fix ci by @lhoestq in https://github.com/huggingface/datasets/pull/7350
fsspec 2024.12.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7352
changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in https://github.com/huggingface/datasets/pull/7353
Catch OSError for arrow by @lhoestq in https://github.com/huggingface/datasets/pull/7348
Remove .h5 from imagefolder extensions by @lhoestq in https://github.com/huggingface/datasets/pull/7374
Add Pandas, PyArrow and Polars docs by @lhoestq in https://github.com/huggingface/datasets/pull/7382
Optimized sequence encoding for scalars by @lukasgd in https://github.com/huggingface/datasets/pull/7393
Update docs by @lhoestq in https://github.com/huggingface/datasets/pull/7395
Update README.md by @lhoestq in https://github.com/huggingface/datasets/pull/7396
Release: 3.3.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7398

New Contributors

@AndreaFrancis made their first contribution in https://github.com/huggingface/datasets/pull/7328
@vttrifonov made their first contribution in https://github.com/huggingface/datasets/pull/7353
@lukasgd made their first contribution in https://github.com/huggingface/datasets/pull/7393

Full Changelog: https://github.com/huggingface/datasets/compare/3.2.0...3.3.0

- Python
Published by lhoestq over 1 year ago

datasets - 3.2.0

Dataset Features

Faster parquet streaming + filters with predicate pushdown by @lhoestq in https://github.com/huggingface/datasets/pull/7309
- Up to +100% streaming speed
- Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.

```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
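To make the idea of predicate pushdown concrete, here is a conceptual sketch in plain Python. It is not how datasets or pyarrow implement it: `RowGroup` and `scan` are hypothetical names, and the per-group min/max values are a simplified stand-in for Parquet row-group statistics. A group whose maximum date is below the threshold cannot contain a match, so it is never read.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical, simplified stand-in for a Parquet row group and its statistics."""
    min_date: str
    max_date: str
    rows: list[dict]


def scan(row_groups: list[RowGroup], min_date: str) -> tuple[list[dict], int]:
    """Return rows with date >= min_date and the number of row groups actually read."""
    kept, groups_read = [], 0
    for group in row_groups:
        # Pushdown step: the statistics alone prove no row can match, so skip the read.
        if group.max_date < min_date:
            continue
        groups_read += 1
        kept.extend(row for row in group.rows if row["date"] >= min_date)
    return kept, groups_read


row_groups = [
    RowGroup("2021-01-01", "2021-12-31", [{"date": "2021-06-01"}]),
    RowGroup("2022-07-01", "2023-03-31", [{"date": "2022-09-15"}, {"date": "2023-02-01"}]),
    RowGroup("2023-04-01", "2023-12-31", [{"date": "2023-08-20"}]),
]
rows, groups_read = scan(row_groups, min_date="2023")
print(len(rows), groups_read)  # 2 2
```

Only two of the three groups are read; the first is skipped from its statistics alone, which is where the download savings come from.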

Other improvements and bug fixes

fix conda release workflow by @lhoestq in https://github.com/huggingface/datasets/pull/7272
Add link to video dataset by @NielsRogge in https://github.com/huggingface/datasets/pull/7277
Raise error for incorrect JSON serialization by @varadhbhatnagar in https://github.com/huggingface/datasets/pull/7273
support for custom feature encoding/decoding by @alex-hh in https://github.com/huggingface/datasets/pull/7284
update load_dataset docstring by @lhoestq in https://github.com/huggingface/datasets/pull/7301
Let server decide default repo visibility by @Wauplin in https://github.com/huggingface/datasets/pull/7302
fix: update elasticsearch version by @ruidazeng in https://github.com/huggingface/datasets/pull/7300
Fix typing in iterable_dataset.py by @lhoestq in https://github.com/huggingface/datasets/pull/7304
Updated inconsistent output in documentation examples for ClassLabel by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
More docs to from_dict to mention that the result lives in RAM by @lhoestq in https://github.com/huggingface/datasets/pull/7316
Release: 3.2.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7317

New Contributors

@ruidazeng made their first contribution in https://github.com/huggingface/datasets/pull/7300
@sergiopaniego made their first contribution in https://github.com/huggingface/datasets/pull/7293

Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0

- Python
Published by lhoestq over 1 year ago

datasets - 3.1.0

Dataset Features

Video support by @lhoestq in https://github.com/huggingface/datasets/pull/7230

```python
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
```

Add IterableDataset.shard() by @lhoestq in https://github.com/huggingface/datasets/pull/7252

```python
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
```

Basic XML support by @lhoestq in https://github.com/huggingface/datasets/pull/7250
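The shard counts in the IterableDataset.shard() example above follow from simple division: 2360 source shards split 8 ways gives 295 each. The sketch below reproduces that arithmetic with a hypothetical helper, `shard_ids`. The contiguous assignment it uses is an assumption made for illustration; the notes above do not say how the library assigns source shards.

```python
def shard_ids(total_shards: int, num_shards: int, index: int) -> list[int]:
    """Contiguous split of source shard ids (hypothetical helper, illustration only)."""
    base, extra = divmod(total_shards, num_shards)
    # The first `extra` splits each take one additional shard.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return list(range(start, stop))


first = shard_ids(2360, num_shards=8, index=0)
print(len(first))  # 295
print(len(shard_ids(2360, num_shards=2360, index=0)))  # 1
```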

What's Changed

(Super tiny doc update) Mention to_polars by @fzyzcjy in https://github.com/huggingface/datasets/pull/7232
[MINOR:TYPO] Update arrow_dataset.py by @cakiki in https://github.com/huggingface/datasets/pull/7236
Missing video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7251
fix decord import by @lhoestq in https://github.com/huggingface/datasets/pull/7255
fix ci for pyarrow 18 by @lhoestq in https://github.com/huggingface/datasets/pull/7257
Retry all requests timeouts by @lhoestq in https://github.com/huggingface/datasets/pull/7256
Always set non-null writer batch size by @lhoestq in https://github.com/huggingface/datasets/pull/7258
Don't embed videos by @lhoestq in https://github.com/huggingface/datasets/pull/7259
Allow video with disabled decoding without decord by @lhoestq in https://github.com/huggingface/datasets/pull/7262
Small addition to video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7263
fix docs relative links by @lhoestq in https://github.com/huggingface/datasets/pull/7264
Disallow video push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7265

New Contributors

@fzyzcjy made their first contribution in https://github.com/huggingface/datasets/pull/7232

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.2...3.1.0

- Python
Published by lhoestq over 1 year ago

datasets - 3.0.2

Main bug fixes

fix unbatched arrow map for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7204
Support features in metadata configs by @albertvillanova in https://github.com/huggingface/datasets/pull/7182
Preserve features in iterable dataset.filter by @alex-hh in https://github.com/huggingface/datasets/pull/7209
Pin dill<0.3.9 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7184
- this should also fix cache issues

What's Changed

Fix release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/7177
Pin multiprocess<0.70.1 to align with dill<0.3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/7188
with_format docstring by @lhoestq in https://github.com/huggingface/datasets/pull/7203
fix ci benchmark by @lhoestq in https://github.com/huggingface/datasets/pull/7205
Fix the environment variable for huggingface cache by @torotoki in https://github.com/huggingface/datasets/pull/7200
Support Python 3.11 by @albertvillanova in https://github.com/huggingface/datasets/pull/7179
bump fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/7219
Fix typo in image dataset docs by @albertvillanova in https://github.com/huggingface/datasets/pull/7231
No need for dataset_info by @lhoestq in https://github.com/huggingface/datasets/pull/7234
use huggingface_hub offline mode by @lhoestq in https://github.com/huggingface/datasets/pull/7244

New Contributors

@alex-hh made their first contribution in https://github.com/huggingface/datasets/pull/7204
@torotoki made their first contribution in https://github.com/huggingface/datasets/pull/7200

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.1...3.0.2

- Python
Published by lhoestq over 1 year ago

datasets - 3.0.1

What's Changed

Modify add_column() to optionally accept a FeatureType as param by @varadhbhatnagar in https://github.com/huggingface/datasets/pull/7143
Align filename prefix splitting with WebDataset library by @albertvillanova in https://github.com/huggingface/datasets/pull/7151
Support ndjson data files by @albertvillanova in https://github.com/huggingface/datasets/pull/7154
Support JSON lines with missing struct fields by @albertvillanova in https://github.com/huggingface/datasets/pull/7160
Support JSON lines with empty struct by @albertvillanova in https://github.com/huggingface/datasets/pull/7162
fix increase_load_count by @lhoestq in https://github.com/huggingface/datasets/pull/7165
fix docstring code example for distributed shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/7166
Support JSON lines with missing columns by @albertvillanova in https://github.com/huggingface/datasets/pull/7170
Add torchdata as a regular test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/7172

New Contributors

@varadhbhatnagar made their first contribution in https://github.com/huggingface/datasets/pull/7143

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.0...3.0.1

- Python
Published by albertvillanova almost 2 years ago

datasets - 3.0.0

Dataset Features

Use Polars functions in .map()
- Allow Polars as valid output type by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
- Example:
```python
>>> import polars as pl
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
>>> cols = [pl.col("content").str.len_bytes().alias("length")]
>>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
>>> ds_with_length[:5]
shape: (5, 5)
┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
│ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
│ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
│ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
│ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
│ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
│ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
│ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
│ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
└─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
```

Support NumPy 2
- Allow numpy-2.1 and test it without audio extra by @albertvillanova in https://github.com/huggingface/datasets/pull/7118

Cache Changes

Use huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
- use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
- cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets
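A rough sketch of how the two default locations above relate to each other. It assumes the `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables override the defaults in the way shown; that precedence is a simplification for illustration, and `default_cache_dirs` is a hypothetical helper, not a library function.

```python
import os
from pathlib import Path
from typing import Mapping


def default_cache_dirs(env: Mapping[str, str]) -> dict[str, Path]:
    """Resolve the hub and datasets cache directories from an environment mapping."""
    # Assumed precedence: a specific variable wins, else fall back to HF_HOME, else ~/.cache/huggingface.
    hf_home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    return {
        "hub": Path(env.get("HF_HUB_CACHE", str(hf_home / "hub"))).expanduser(),
        "datasets": Path(env.get("HF_DATASETS_CACHE", str(hf_home / "datasets"))).expanduser(),
    }


dirs = default_cache_dirs(os.environ)
print(dirs["hub"], dirs["datasets"])
```

With no variables set, the two paths resolve to the defaults listed above: downloaded files under `hub`, processed Arrow files under `datasets`.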

Breaking changes

Remove deprecated code by @albertvillanova in https://github.com/huggingface/datasets/pull/6996
- removed deprecated arguments like use_auth_token, fs or ignore_verifications
Remove beam by @albertvillanova in https://github.com/huggingface/datasets/pull/6987
- removed deprecated apache beam datasets support
Remove metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/6983
- remove deprecated load_metric, please use the evaluate library instead
Remove tasks by @albertvillanova in https://github.com/huggingface/datasets/pull/6999
- remove deprecated task argument in load_dataset(), .prepare_for_task() method, datasets.tasks module

General improvements and bug fixes

Improved the tutorial by adding a link for loading datasets by @AmboThom in https://github.com/huggingface/datasets/pull/7042
Automatically create cache_dir from cache_file_name by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
remove more script docs by @lhoestq in https://github.com/huggingface/datasets/pull/7104
Fix args of feature docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/7103
Temporarily pin numpy<2.1 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7114
Fix ConnectionError for gated datasets and unauthenticated users by @albertvillanova in https://github.com/huggingface/datasets/pull/7110
Install transformers with numpy-2 CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7119
don't mention the script if trust_remote_code=False by @severo in https://github.com/huggingface/datasets/pull/7120
Fix typed examples iterable state dict by @lhoestq in https://github.com/huggingface/datasets/pull/7121
Rename LargeList.dtype to LargeList.feature by @albertvillanova in https://github.com/huggingface/datasets/pull/7106
Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport by @albertvillanova in https://github.com/huggingface/datasets/pull/7125
Disable implicit token in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7126
Test get_dataset_config_info with non-existing/gated/private dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/7124
fix streaming from arrow files by @fschlatt in https://github.com/huggingface/datasets/pull/7083

New Contributors

@AmboThom made their first contribution in https://github.com/huggingface/datasets/pull/7042
@fschlatt made their first contribution in https://github.com/huggingface/datasets/pull/7083

Full Changelog: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0

- Python
Published by albertvillanova almost 2 years ago

datasets - 2.21.0

Features

Support pyarrow large_list by @albertvillanova in https://github.com/huggingface/datasets/pull/7019
- Support Polars round trip:

```python
import polars as pl
from datasets import Dataset

df1 = pl.from_dict({"col1": [[1, 2], [3, 4]]})
df2 = Dataset.from_polars(df1).to_polars()
assert df1.equals(df2)
```

What's Changed

Use HF_HUB_OFFLINE instead of HF_DATASETS_OFFLINE by @Wauplin in https://github.com/huggingface/datasets/pull/6968
packaging: Remove useless dependencies by @daskol in https://github.com/huggingface/datasets/pull/6971
Fix resuming arrow format by @lhoestq in https://github.com/huggingface/datasets/pull/6964
Fix webdataset pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6972
Set temporary numpy upper version < 2.0.0 to fix CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6975
Fix regression for pandas < 2.0.0 in JSON loader by @albertvillanova in https://github.com/huggingface/datasets/pull/6978
Ensure compatibility with numpy 2.0.0 by @KennethEnevoldsen in https://github.com/huggingface/datasets/pull/6976
Remove underlines between badges by @novialriptide in https://github.com/huggingface/datasets/pull/6966
Update docs on trust_remote_code defaults to False by @albertvillanova in https://github.com/huggingface/datasets/pull/6981
Improve skip take shuffling and distributed by @lhoestq in https://github.com/huggingface/datasets/pull/6965
Fix tests using hf-internal-testing/librispeech_asr_dummy by @albertvillanova in https://github.com/huggingface/datasets/pull/6998
Fix dump of bfloat16 torch tensor by @lhoestq in https://github.com/huggingface/datasets/pull/7002
minor fix for bfloat16 by @lhoestq in https://github.com/huggingface/datasets/pull/7003
Fix incorrect rank value in data splitting by @yzhangcs in https://github.com/huggingface/datasets/pull/6994
less script docs by @lhoestq in https://github.com/huggingface/datasets/pull/6993
Fix CI by temporarily pinning ruff < 0.5.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/7007
Support ruff 0.5.0 in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7009
Fix WebDatasets KeyError for user-defined Features when a field is missing in an example by @ProGamerGov in https://github.com/huggingface/datasets/pull/7004
[Streaming] retry on requests errors by @lhoestq in https://github.com/huggingface/datasets/pull/6963
Re-enable raising error from huggingface-hub FutureWarning in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/7011
Skip faiss tests on Windows to avoid running CI for 360 minutes by @albertvillanova in https://github.com/huggingface/datasets/pull/7014
Support fsspec 2024.6.1 by @albertvillanova in https://github.com/huggingface/datasets/pull/7017
Persist IterableDataset epoch in workers by @lhoestq in https://github.com/huggingface/datasets/pull/6710
Fix casting list array to fixed size list by @albertvillanova in https://github.com/huggingface/datasets/pull/7021
Remove dead code for pyarrow < 15.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/7023
Fix check_library_imports by @lhoestq in https://github.com/huggingface/datasets/pull/7026
Missing line from previous pr by @lhoestq in https://github.com/huggingface/datasets/pull/7027
Fix ci by @lhoestq in https://github.com/huggingface/datasets/pull/7028
Add decorator as explicit test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/7043
Mark tests that require librosa by @albertvillanova in https://github.com/huggingface/datasets/pull/7044
Unblock NumPy 2.0 by @NeilGirdhar in https://github.com/huggingface/datasets/pull/6991
Fix tensorflow min version depending on Python version by @albertvillanova in https://github.com/huggingface/datasets/pull/7045
Support librosa and numpy 2.0 for Python 3.10 by @albertvillanova in https://github.com/huggingface/datasets/pull/7046
add checkpoint and resume title in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7050
Update load_hub.mdx by @severo in https://github.com/huggingface/datasets/pull/7057
Add batching to IterableDataset by @lappemic in https://github.com/huggingface/datasets/pull/7054
Avoid calling http_head for non-HTTP URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/7062
Fix load_dataset for data_files with protocols other than HF by @matstrand in https://github.com/huggingface/datasets/pull/6862
Add batch method to Dataset class by @lappemic in https://github.com/huggingface/datasets/pull/7064
Fix doc generation when NamedSplit is used as parameter default value by @albertvillanova in https://github.com/huggingface/datasets/pull/7036
Fix CI by temporarily marking test_convert_to_parquet as expected to fail by @albertvillanova in https://github.com/huggingface/datasets/pull/7074
add split argument to Generator by @piercus in https://github.com/huggingface/datasets/pull/7015
Update required soxr version from pre-release to release by @albertvillanova in https://github.com/huggingface/datasets/pull/7075
Fix CI test_convert_to_parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/7078
Fix prepare_single_hop_path_and_storage_options by @albertvillanova in https://github.com/huggingface/datasets/pull/7068
Set load_from_disk path type as PathLike by @albertvillanova in https://github.com/huggingface/datasets/pull/7081
Fix push_to_hub by not calling create_branch if branch exists by @albertvillanova in https://github.com/huggingface/datasets/pull/7069
feat: support non streamable arrow file binary format by @kmehant in https://github.com/huggingface/datasets/pull/7025
Support HTTP authentication in non-streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/7082
chore: fix typos in docs by @hattizai in https://github.com/huggingface/datasets/pull/7034
Fix CI for metrics by @albertvillanova in https://github.com/huggingface/datasets/commit/83e5c05fd38a4a37b5e6d5d7c0cfa73d76f1b220

New Contributors

@novialriptide made their first contribution in https://github.com/huggingface/datasets/pull/6966
@yzhangcs made their first contribution in https://github.com/huggingface/datasets/pull/6994
@ProGamerGov made their first contribution in https://github.com/huggingface/datasets/pull/7004
@NeilGirdhar made their first contribution in https://github.com/huggingface/datasets/pull/6991
@matstrand made their first contribution in https://github.com/huggingface/datasets/pull/6862
@lappemic made their first contribution in https://github.com/huggingface/datasets/pull/7054
@piercus made their first contribution in https://github.com/huggingface/datasets/pull/7015
@kmehant made their first contribution in https://github.com/huggingface/datasets/pull/7025
@hattizai made their first contribution in https://github.com/huggingface/datasets/pull/7034

Full Changelog: https://github.com/huggingface/datasets/compare/2.20.0...2.21.0

- Python
Published by albertvillanova almost 2 years ago

datasets - 2.20.0

Important

Remove default trust_remote_code=True by @lhoestq in https://github.com/huggingface/datasets/pull/6954
- datasets with a python loading script now require passing trust_remote_code=True to be used

Datasets features

[Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6658
- checkpoint and resume an iterable dataset (e.g. when streaming):
```python
from datasets import Dataset

iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)

# Iterate, then save the iteration state after the third example
for idx, example in enumerate(iterable_dataset):
    print(example)
    if idx == 2:
        state_dict = iterable_dataset.state_dict()
        print("checkpoint")
        break

# Restore the saved state and resume from where iteration stopped
iterable_dataset.load_state_dict(state_dict)
print("restart from checkpoint")
for example in iterable_dataset:
    print(example)
```

Returns:

    {'a': 0}
    {'a': 1}
    {'a': 2}
    checkpoint
    restart from checkpoint
    {'a': 3}
    {'a': 4}
    {'a': 5}

General improvements and bug fixes

Add docs about the CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/6831
Remove token arg from CLI examples by @albertvillanova in https://github.com/huggingface/datasets/pull/6839
Allow deleting a subset/config from a no-script dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6820
Fix line-endings in tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/6857
Fix CI by temporarily pinning huggingface-hub < 0.23.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6861
Fix dataset name for community Hub script-datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6855
Update tqdm >= 4.66.3 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6870
Fix download for dict of dicts of URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/6871
Set dev version by @albertvillanova in https://github.com/huggingface/datasets/pull/6873
Shorten long logs by @lhoestq in https://github.com/huggingface/datasets/pull/6875
Support jax 0.4.27 in CI tests by @albertvillanova in https://github.com/huggingface/datasets/pull/6885
Close gzipped files properly by @lhoestq in https://github.com/huggingface/datasets/pull/6893
Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6902
Fix YAML error in README files appearing on GitHub by @albertvillanova in https://github.com/huggingface/datasets/pull/6898
Document that to_json defaults to JSON Lines by @albertvillanova in https://github.com/huggingface/datasets/pull/6895
Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6883
Create function to convert to parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6878
Update features.py to avoid bfloat16 unsupported error by @skaulintel in https://github.com/huggingface/datasets/pull/6607
Fix decoding multi part extension by @lhoestq in https://github.com/huggingface/datasets/pull/6904
Use pandas ujson in JSON loader to improve performance by @albertvillanova in https://github.com/huggingface/datasets/pull/6874
Update requests >=2.32.1 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6909
Fix wrong type hints in data_files by @albertvillanova in https://github.com/huggingface/datasets/pull/6910
Remove dead code for non-dict data_files from packaged modules by @albertvillanova in https://github.com/huggingface/datasets/pull/6911
Support fsspec 2024.5.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6921
Remove torchaudio remnants from code by @albertvillanova in https://github.com/huggingface/datasets/pull/6922
[WebDataset] Add .pth support for torch tensors by @lhoestq in https://github.com/huggingface/datasets/pull/6920
Unpin hfh by @lhoestq in https://github.com/huggingface/datasets/pull/6876
Preserve JSON column order and support list of strings field by @albertvillanova in https://github.com/huggingface/datasets/pull/6914
[WebDataset] Support compressed files by @lhoestq in https://github.com/huggingface/datasets/pull/6931
update ci user by @lhoestq in https://github.com/huggingface/datasets/pull/6933
Revert ci user by @lhoestq in https://github.com/huggingface/datasets/pull/6934
Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6925
Set dev version by @albertvillanova in https://github.com/huggingface/datasets/pull/6944
Update yanked version of minimum requests requirement by @albertvillanova in https://github.com/huggingface/datasets/pull/6945
Re-enable import sorting disabled by flake8:noqa directive when using ruff linter by @albertvillanova in https://github.com/huggingface/datasets/pull/6946
Update dataset_dict.py by @Arunprakash-A in https://github.com/huggingface/datasets/pull/6932
Update process.mdx: Code Listings Fixes by @FadyMorris in https://github.com/huggingface/datasets/pull/6928
Fix small typo by @marcenacp in https://github.com/huggingface/datasets/pull/6955
update docs on N-dim arrays by @lhoestq in https://github.com/huggingface/datasets/pull/6956
Fix typos in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6957
Validate config name and data_files in packaged modules by @albertvillanova in https://github.com/huggingface/datasets/pull/6915
Add support for categorical/dictionary types by @EthanSteinberg in https://github.com/huggingface/datasets/pull/6892
feat(ci): add trufflehog secrets detection by @McPatate in https://github.com/huggingface/datasets/pull/6960
Better error handling in dataset_module_factory by @Wauplin in https://github.com/huggingface/datasets/pull/6959
Move info_utils errors to exceptions module by @albertvillanova in https://github.com/huggingface/datasets/pull/6952
fix(ci): remove unnecessary permissions by @McPatate in https://github.com/huggingface/datasets/pull/6962

New Contributors

@skaulintel made their first contribution in https://github.com/huggingface/datasets/pull/6607
@Arunprakash-A made their first contribution in https://github.com/huggingface/datasets/pull/6932
@FadyMorris made their first contribution in https://github.com/huggingface/datasets/pull/6928
@marcenacp made their first contribution in https://github.com/huggingface/datasets/pull/6955
@EthanSteinberg made their first contribution in https://github.com/huggingface/datasets/pull/6892
@McPatate made their first contribution in https://github.com/huggingface/datasets/pull/6960

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0

- Python
Published by albertvillanova about 2 years ago

datasets - 2.19.2

Bug fixes

Make CLI convert_to_parquet not raise error if no rights to create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6902
Require Pillow >= 9.4.0 to avoid AttributeError when loading image dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6883
Update requests >=2.32.1 to fix vulnerability by @albertvillanova in https://github.com/huggingface/datasets/pull/6909
Fix NonMatchingSplitsSizesError/ExpectedMoreSplits when passing data_dir/data_files in no-code Hub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6925

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.1...2.19.2

- Python
Published by albertvillanova about 2 years ago

datasets - 2.19.1

Bug fixes

Fix download for dict of dicts of URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/6871

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.19.1

- Python
Published by albertvillanova about 2 years ago

datasets - 2.19.0

Dataset Features

Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531
- Convert to a Polars dataframe using .to_polars():

```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)
```
- Use Polars formatting to return Polars objects when accessing a dataset:

```python
ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```
Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
- Save on HF in any file format:

```python
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```
Add mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
- Set images to be read in a certain mode like "RGB":

```python
from datasets import Image

dataset = dataset.cast_column("image", Image(mode="RGB"))
```
Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
- Run this command to open a PR on a script-based dataset that converts it to Parquet:

```
datasets-cli convert_to_parquet <dataset_id>
```
Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
- Same as IterableDataset.take and IterableDataset.skip:

```python
ds = ds.take(10)  # take only the first 10 examples
```

General improvements and bug fixes

Bump huggingface-hub lower version to 0.21.2 by @albertvillanova in https://github.com/huggingface/datasets/pull/6713
fix CastError pickling by @lhoestq in https://github.com/huggingface/datasets/pull/6712
Expand no-code dataset info with datasets-server info by @mariosasko in https://github.com/huggingface/datasets/pull/6714
Fix sliced ConcatenationTable pickling with mixed schemas vertically by @lhoestq in https://github.com/huggingface/datasets/pull/6715
Fix concurrent script loading with force_redownload by @lhoestq in https://github.com/huggingface/datasets/pull/6718
get_dataset_default_config_name docstring by @lhoestq in https://github.com/huggingface/datasets/pull/6723
Deprecate Beam API and download from HF GCS bucket by @mariosasko in https://github.com/huggingface/datasets/pull/6474
Deprecate Pandas builder by @mariosasko in https://github.com/huggingface/datasets/pull/6730
Using a registry instead of calling globals for fetching feature types by @psmyth94 in https://github.com/huggingface/datasets/pull/6727
Update torch_formatter.py by @VarunNSrivastava in https://github.com/huggingface/datasets/pull/6402
Improve default patterns resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6704
Transpose images with EXIF Orientation tag by @mariosasko in https://github.com/huggingface/datasets/pull/6739
Fix missing download_config in get_data_patterns by @lhoestq in https://github.com/huggingface/datasets/pull/6742
Allow null values in dict columns by @mariosasko in https://github.com/huggingface/datasets/pull/6743
Fix fsspec tqdm callback by @lhoestq in https://github.com/huggingface/datasets/pull/6749
chore(deps): bump fsspec by @shcheklein in https://github.com/huggingface/datasets/pull/6747
Fix offline mode with single config by @lhoestq in https://github.com/huggingface/datasets/pull/6741
Remove deprecated code by @Wauplin in https://github.com/huggingface/datasets/pull/6761
fixing the issue 6755(small typo) by @JINO-ROHIT in https://github.com/huggingface/datasets/pull/6767
remove_columns/rename_columns doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
Fix CI by @mariosasko in https://github.com/huggingface/datasets/pull/6780
rename datasets-server to dataset-viewer by @severo in https://github.com/huggingface/datasets/pull/6785
Install dependencies with uv in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
Fix cache conflict in _check_legacy_cache2 by @lhoestq in https://github.com/huggingface/datasets/pull/6792
Fix typo in docs (upload CLI) by @Wauplin in https://github.com/huggingface/datasets/pull/6802
fix DatasetBuilder._split_generators incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
#6791 Improve type checking around FAISS by @Dref360 in https://github.com/huggingface/datasets/pull/6803
Fix --repo-type order in cli upload docs by @lhoestq in https://github.com/huggingface/datasets/pull/6804
Fix hf-internal-testing/dataset_with_script commit SHA in CI test by @albertvillanova in https://github.com/huggingface/datasets/pull/6806
Fix cache path to snakecase for CachedDatasetModuleFactory and Cache by @izhx in https://github.com/huggingface/datasets/pull/6754
Multithreaded downloads by @lhoestq in https://github.com/huggingface/datasets/pull/6794
Remove os.path.relpath in resolve_patterns by @mariosasko in https://github.com/huggingface/datasets/pull/6815
Extract data on the fly in packaged builders by @mariosasko in https://github.com/huggingface/datasets/pull/6784
add allow_primitive_to_str and allow_decimal_to_str instead of allow_number_to_str by @Modexus in https://github.com/huggingface/datasets/pull/6811
Support indexable objects in Dataset.__getitem__ by @mariosasko in https://github.com/huggingface/datasets/pull/6817
Make convert_to_parquet CLI command create script branch by @albertvillanova in https://github.com/huggingface/datasets/pull/6809
Fix parquet export infos by @lhoestq in https://github.com/huggingface/datasets/pull/6822

New Contributors

@VarunNSrivastava made their first contribution in https://github.com/huggingface/datasets/pull/6402
@shcheklein made their first contribution in https://github.com/huggingface/datasets/pull/6747
@JINO-ROHIT made their first contribution in https://github.com/huggingface/datasets/pull/6767
@JonasLoos made their first contribution in https://github.com/huggingface/datasets/pull/6799
@izhx made their first contribution in https://github.com/huggingface/datasets/pull/6754
@Modexus made their first contribution in https://github.com/huggingface/datasets/pull/6811

Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0

- Python
Published by albertvillanova about 2 years ago

datasets - 2.18.0

Dataset features

Make JSON builder support an array of strings by @albertvillanova in https://github.com/huggingface/datasets/pull/6696
Base parquet batch_size on parquet row group size by @lhoestq in https://github.com/huggingface/datasets/pull/6701
- Faster cold start for streaming
Change default compression argument for JsonDatasetWriter by @Rexhaif in https://github.com/huggingface/datasets/pull/6659
Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in https://github.com/huggingface/datasets/pull/6660
fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in https://github.com/huggingface/datasets/pull/6687
- Support latest fsspec up to 2024.2.0

General improvements and bug fixes

Fix for Incorrect ex_iterable used with multi num_worker by @kq-chen in https://github.com/huggingface/datasets/pull/6582
- Previously, using PyTorch DDP together with num_workers could assign shards to the wrong workers and cause errors
Fix imagefolder dataset url by @mariosasko in https://github.com/huggingface/datasets/pull/6683
Improve error message for gated datasets on load by @lewtun in https://github.com/huggingface/datasets/pull/6684
Updated Quickstart Notebook link by @Codeblockz in https://github.com/huggingface/datasets/pull/6685
Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in https://github.com/huggingface/datasets/pull/6693
Faster xlistdir by @mariosasko in https://github.com/huggingface/datasets/pull/6698
Update GitHub Actions to Node 20 by @albertvillanova in https://github.com/huggingface/datasets/pull/6682
Update release instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/6681
Pass through information about location of cache directory. by @stridge-cruxml in https://github.com/huggingface/datasets/pull/6677
Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in https://github.com/huggingface/datasets/pull/6665
Update ruff by @lhoestq in https://github.com/huggingface/datasets/pull/6706
Silence ruff deprecation messages by @mariosasko in https://github.com/huggingface/datasets/pull/6707
fix: show correct package name to install biopython by @BioGeek in https://github.com/huggingface/datasets/pull/6662
Fix data_files when passing data_dir by @lhoestq in https://github.com/huggingface/datasets/pull/6705
Release: 2.18.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6708

New Contributors

@Codeblockz made their first contribution in https://github.com/huggingface/datasets/pull/6685
@gzbfgjf2 made their first contribution in https://github.com/huggingface/datasets/pull/6693
@stridge-cruxml made their first contribution in https://github.com/huggingface/datasets/pull/6677
@pmrowla made their first contribution in https://github.com/huggingface/datasets/pull/6687
@BioGeek made their first contribution in https://github.com/huggingface/datasets/pull/6662
@Rexhaif made their first contribution in https://github.com/huggingface/datasets/pull/6659
@mohalisad made their first contribution in https://github.com/huggingface/datasets/pull/6660
@kq-chen made their first contribution in https://github.com/huggingface/datasets/pull/6582

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0

- Python
Published by lhoestq over 2 years ago

datasets - 2.17.1

Bug Fixes

Revert the changes in arrow_writer.py from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
Remove deprecated verbose parameter from CSV builder by @albertvillanova in https://github.com/huggingface/datasets/pull/6672

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1

- Python
Published by albertvillanova over 2 years ago

datasets - 2.17.0

What's Changed

Fix parallel downloads for datasets without scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6551
Fix imagefolder with one image by @lhoestq in https://github.com/huggingface/datasets/pull/6556
Fix tests based on datasets that used to have scripts by @lhoestq in https://github.com/huggingface/datasets/pull/6574
remove eli5 test by @lhoestq in https://github.com/huggingface/datasets/pull/6583
[IterableDataset] Fix drop_last_batch in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
[WebDataset] Audio support and bug fixes by @lhoestq in https://github.com/huggingface/datasets/pull/6573
Support standalone yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6557
Drop redundant None guard. by @xkszltl in https://github.com/huggingface/datasets/pull/6596
fix os.listdir return name is empty string by @d710055071 in https://github.com/huggingface/datasets/pull/6581
Fix CI: pyarrow 15, pandas 2.2 and sqlalchemy by @lhoestq in https://github.com/huggingface/datasets/pull/6617
Dedicated RNG object for fingerprinting by @mariosasko in https://github.com/huggingface/datasets/pull/6606
Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in https://github.com/huggingface/datasets/pull/6464
Migrate from setup.cfg to pyproject.toml by @mariosasko in https://github.com/huggingface/datasets/pull/6619
keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in https://github.com/huggingface/datasets/pull/6586
Read GeoParquet files using parquet reader by @weiji14 in https://github.com/huggingface/datasets/pull/6508
Use schema metadata only if it matches features by @lhoestq in https://github.com/huggingface/datasets/pull/6616
Raise error on bad split name by @lhoestq in https://github.com/huggingface/datasets/pull/6626
Disable tqdm bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
Add with_rank param to Dataset.filter by @mariosasko in https://github.com/huggingface/datasets/pull/6608
Bump max range of dill to 0.3.8 by @ringohoffman in https://github.com/huggingface/datasets/pull/6630
Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/6631
Faster webdataset streaming by @lhoestq in https://github.com/huggingface/datasets/pull/6578
Multi gpu docs by @lhoestq in https://github.com/huggingface/datasets/pull/6550
dataset viewer requires no-script by @severo in https://github.com/huggingface/datasets/pull/6633
Make split slicing consistent with list slicing by @mariosasko in https://github.com/huggingface/datasets/pull/5891
Do not use Parquet exports if revision is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/6555
Make CLI test support multi-processing by @albertvillanova in https://github.com/huggingface/datasets/pull/6628
Support data_dir parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6634
Support push_to_hub without org/user to default to logged-in user by @albertvillanova in https://github.com/huggingface/datasets/pull/6629
Fix reload cache with data dir by @lhoestq in https://github.com/huggingface/datasets/pull/6632
Fix array cast/embed with null values by @mariosasko in https://github.com/huggingface/datasets/pull/6283
Faster column validation and reordering by @psmyth94 in https://github.com/huggingface/datasets/pull/6636
Better multi-gpu example by @lhoestq in https://github.com/huggingface/datasets/pull/6646
Fix missing info when loading some datasets from Parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6635
Minor multi gpu doc improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6649
Document usage of hfh cli instead of git by @lhoestq in https://github.com/huggingface/datasets/pull/6648
Allow concatenation of datasets with mixed structs by @Dref360 in https://github.com/huggingface/datasets/pull/6587

New Contributors

@xkszltl made their first contribution in https://github.com/huggingface/datasets/pull/6596
@kkoutini made their first contribution in https://github.com/huggingface/datasets/pull/6464
@JochenSiegWork made their first contribution in https://github.com/huggingface/datasets/pull/6586
@weiji14 made their first contribution in https://github.com/huggingface/datasets/pull/6508
@ringohoffman made their first contribution in https://github.com/huggingface/datasets/pull/6630
@psmyth94 made their first contribution in https://github.com/huggingface/datasets/pull/6636

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0

- Python
Published by albertvillanova over 2 years ago

datasets - 2.16.1

Bug fixes

Fix dl_manager.extract returning FileNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6543
- Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
Fix custom configs from script by @lhoestq in https://github.com/huggingface/datasets/pull/6544
- Fix bug when loading a dataset with a loading script using custom arguments would fail
- e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1

- Python
Published by lhoestq over 2 years ago

datasets - 2.16.0

Security features

Add trust_remote_code argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
- Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument trust_remote_code=True.
- Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
- Using the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 you can already disable custom code by default without waiting for the next release of datasets
Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
- This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
- You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet

Features

Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
- This speeds up the load_dataset step that lists the data files of big repositories (up to 100x faster) but requires huggingface_hub 0.20 or newer
- Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
- Reload a dataset from your cache even if you don't have internet connection
- New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
- Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
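
To make the new cache directory scheme concrete, here is a minimal sketch that assembles a path of that shape from its components. The helper `sketch_cache_dir` is hypothetical and not part of the datasets API, the argument values are placeholders, and the library itself may normalize names differently.

```python
from pathlib import PurePosixPath

# Hypothetical helper (not part of the datasets API): assembles a path that
# follows the directory scheme quoted in the release notes above.
def sketch_cache_dir(repo_id, config_name, version, commit_sha,
                     cache_root="~/.cache/huggingface/datasets"):
    # "username/dataset_name" becomes "username___dataset_name"
    flat_name = repo_id.replace("/", "___")
    return PurePosixPath(cache_root) / flat_name / config_name / version / commit_sha

# Placeholder values, for illustration only
example = sketch_cache_dir("username/dataset_name", "default", "0.0.0", "abc123")
```

Because the commit SHA is the last path component in this scheme, each revision of a dataset gets its own directory.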

General improvements and bug fixes

Remove unused argument in _get_data_files_patterns by @lhoestq in https://github.com/huggingface/datasets/pull/6343
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
Use ruff for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in https://github.com/huggingface/datasets/pull/6431
Fix multi gpu map example by @lhoestq in https://github.com/huggingface/datasets/pull/6415
Better tqdm wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
Use filelock package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
Fix metadata file resolution when inferred pattern is ** by @mariosasko in https://github.com/huggingface/datasets/pull/6449
Update hub-docs reference by @mishig25 in https://github.com/huggingface/datasets/pull/6453
Refactor dill logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
Don't require trust_remote_code in inspect_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6456
[docs] troubleshooting guide by @MKhalusova in https://github.com/huggingface/datasets/pull/6424
Missing DatasetNotFoundError by @lhoestq in https://github.com/huggingface/datasets/pull/6462
Disable benchmarks in PRs by @lhoestq in https://github.com/huggingface/datasets/pull/6463
More robust temporary directory deletion by @mariosasko in https://github.com/huggingface/datasets/pull/6426
Fix shard retry mechanism in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6461
Use auth to get parquet export by @lhoestq in https://github.com/huggingface/datasets/pull/6468
Remove delete doc CI by @lhoestq in https://github.com/huggingface/datasets/pull/6471
Fix CI quality by @albertvillanova in https://github.com/huggingface/datasets/pull/6473
Fix PermissionError on Windows CI by @albertvillanova in https://github.com/huggingface/datasets/pull/6477
More robust preupload retry mechanism by @mariosasko in https://github.com/huggingface/datasets/pull/6479
Add IterableDataset __repr__ by @lhoestq in https://github.com/huggingface/datasets/pull/6480
Fix max lock length on unix by @lhoestq in https://github.com/huggingface/datasets/pull/6482
Fix ArrayXD YAML conversion by @mariosasko in https://github.com/huggingface/datasets/pull/6168
Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/6486
Fix deprecation warning when building conda package by @albertvillanova in https://github.com/huggingface/datasets/pull/6425
Make push_to_hub return CommitInfo by @albertvillanova in https://github.com/huggingface/datasets/pull/6492
docs: add reference Git over SSH by @severo in https://github.com/huggingface/datasets/pull/6499
Fallback on dataset script if user wants to load default config by @lhoestq in https://github.com/huggingface/datasets/pull/6498
Don't expand_info in HF glob by @lhoestq in https://github.com/huggingface/datasets/pull/6469
Fix streaming xnli by @lhoestq in https://github.com/huggingface/datasets/pull/6503
Pickle support for torch.Generator objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
Enable setting config as default when push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6500
Better cast error when generating dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6509
Replace list_files_info with list_repo_tree in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6510
Remove deprecated HfFolder by @lhoestq in https://github.com/huggingface/datasets/pull/6512
Support huggingface-hub pre-releases by @albertvillanova in https://github.com/huggingface/datasets/pull/6516
Support push_to_hub canonical datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6519
Support commit_description parameter in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/6520
fix get_metadata_patterns function args error by @d710055071 in https://github.com/huggingface/datasets/pull/6518
Fix metrics dead link by @qgallouedec in https://github.com/huggingface/datasets/pull/6491
fix tests by @lhoestq in https://github.com/huggingface/datasets/pull/6523
Cache backward compatibility with 2.15.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6514
Preserve order of configs and splits when using Parquet exports by @albertvillanova in https://github.com/huggingface/datasets/pull/6526

New Contributors

@LZHgrla made their first contribution in https://github.com/huggingface/datasets/pull/6444
@d710055071 made their first contribution in https://github.com/huggingface/datasets/pull/6518

Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0

- Python
Published by lhoestq over 2 years ago

datasets - 2.15.0

What's Changed

Fix typo in Audio dataset documentation by @prassanna-ravishankar in https://github.com/huggingface/datasets/pull/6222
Add push_to_hub with multiple configs docs by @lhoestq in https://github.com/huggingface/datasets/pull/6226
Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in https://github.com/huggingface/datasets/pull/6228
Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6233
Don't skip hidden files in dl_manager.iter_files when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
Update README.md by @NinoRisteski in https://github.com/huggingface/datasets/pull/6223
Remove unused global variables in audio.py by @mariosasko in https://github.com/huggingface/datasets/pull/6241
Improve error message for missing function parameters by @suavemint in https://github.com/huggingface/datasets/pull/6232
Fix cast from fixed size list to variable size list by @mariosasko in https://github.com/huggingface/datasets/pull/6243
Update create_dataset.mdx by @EswarDivi in https://github.com/huggingface/datasets/pull/6247
[DOCS] Fix typo: Elasticsearch by @leemthompo in https://github.com/huggingface/datasets/pull/6258
Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in https://github.com/huggingface/datasets/pull/6251
Temporarily pin tensorflow < 2.14.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6264
Fix CI 404 errors by @albertvillanova in https://github.com/huggingface/datasets/pull/6262
Remove apache_beam import in BeamBasedBuilder._save_info by @mariosasko in https://github.com/huggingface/datasets/pull/6265
Improve documentation of dataset.from_generator by @hartmans in https://github.com/huggingface/datasets/pull/6281
Fix parquet columns argument in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/6295
Doc readme improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6298
Unpin tensorflow maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
Unpin jax maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
Fix ArrayXD cast by @mariosasko in https://github.com/huggingface/datasets/pull/6297
Reduce the number of commits in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6269
Fix typo in code example in docs by @bryant1410 in https://github.com/huggingface/datasets/pull/6307
Update README.md by @smty2018 in https://github.com/huggingface/datasets/pull/6304
Deterministic set hash by @lhoestq in https://github.com/huggingface/datasets/pull/6318
docs: resolving namespace conflict, refactored variable by @smty2018 in https://github.com/huggingface/datasets/pull/6312
Fix typos by @python273 in https://github.com/huggingface/datasets/pull/6321
Fix commit message formatting in multi-commit uploads by @qgallouedec in https://github.com/huggingface/datasets/pull/6313
Temporarily pin fsspec < 2023.10.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6331
Unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/6336
Fix use_dataset.mdx by @angel-luis in https://github.com/huggingface/datasets/pull/6351
Add fsspec version to the datasets-cli env command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
Expanduser in save_to_disk() by @Unknown3141592 in https://github.com/huggingface/datasets/pull/6098
Fix time measuring snippet in docs by @mariosasko in https://github.com/huggingface/datasets/pull/6367
Temporarily pin pyarrow < 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6375
Fix typo in Dataset.map docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
Avoid redundant warning when encoding NumPy array as Image by @mariosasko in https://github.com/huggingface/datasets/pull/6379
Replace deprecated license_file in setup.cfg by @albertvillanova in https://github.com/huggingface/datasets/pull/6332
Minor release step improvement by @lhoestq in https://github.com/huggingface/datasets/pull/6339
Fix dependency conflict within CI build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/6411
Remove redundant condition in builders by @albertvillanova in https://github.com/huggingface/datasets/pull/6398
Handle future deprecation argument by @winglian in https://github.com/huggingface/datasets/pull/6390
Remove token value from warnings by @mariosasko in https://github.com/huggingface/datasets/pull/6418
Rename audio_classificiation.py to audio_classification.py by @carlthome in https://github.com/huggingface/datasets/pull/6416
Add pyarrow-hotfix to release docs by @albertvillanova in https://github.com/huggingface/datasets/pull/6421
Simplify filesystem logic by @mariosasko in https://github.com/huggingface/datasets/pull/6362
Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/6423

New Contributors

@prassanna-ravishankar made their first contribution in https://github.com/huggingface/datasets/pull/6222
@NinoRisteski made their first contribution in https://github.com/huggingface/datasets/pull/6233
@suavemint made their first contribution in https://github.com/huggingface/datasets/pull/6232
@EswarDivi made their first contribution in https://github.com/huggingface/datasets/pull/6247
@leemthompo made their first contribution in https://github.com/huggingface/datasets/pull/6258
@hartmans made their first contribution in https://github.com/huggingface/datasets/pull/6281
@smty2018 made their first contribution in https://github.com/huggingface/datasets/pull/6304
@python273 made their first contribution in https://github.com/huggingface/datasets/pull/6321
@angel-luis made their first contribution in https://github.com/huggingface/datasets/pull/6351
@Unknown3141592 made their first contribution in https://github.com/huggingface/datasets/pull/6098
@winglian made their first contribution in https://github.com/huggingface/datasets/pull/6390
@carlthome made their first contribution in https://github.com/huggingface/datasets/pull/6416

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0

- Python
Published by albertvillanova over 2 years ago

datasets - 2.14.7

Bug Fixes

Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in https://github.com/huggingface/datasets/pull/6346
Fix python formatting for complex types in format_table by @mariosasko in https://github.com/huggingface/datasets/pull/6368
Support pyarrow 14.0.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6378
Do not try to download from HF GCS for generator by @yundai424 in https://github.com/huggingface/datasets/pull/6372
Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in https://github.com/huggingface/datasets/pull/6404

New Contributors

@cwallenwein made their first contribution in https://github.com/huggingface/datasets/pull/6346
@yundai424 made their first contribution in https://github.com/huggingface/datasets/pull/6372

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7

- Python
Published by albertvillanova over 2 years ago

datasets - 2.14.6

What's Changed

Ignore dataset_info.json in data files resolution by @mariosasko in https://github.com/huggingface/datasets/pull/6224
Check builder cls default config name in inspect by @lhoestq in https://github.com/huggingface/datasets/pull/6253
Add support for fsspec>=2023.9.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6244
Create DefunctDatasetError by @albertvillanova in https://github.com/huggingface/datasets/pull/6286
Fix get_data_patterns for directories with the word data twice by @albertvillanova in https://github.com/huggingface/datasets/pull/6309
Fix loading Hub datasets with CSV metadata file by @albertvillanova in https://github.com/huggingface/datasets/pull/6316
datasets.filesystems: fix is_remote_filesystems by @ap-- in https://github.com/huggingface/datasets/pull/6334
Pin upper version of fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/6337
Fix regex get_data_files formatting for base paths by @ZachNagengast in https://github.com/huggingface/datasets/pull/6322

New Contributors

@ap-- made their first contribution in https://github.com/huggingface/datasets/pull/6334
@ZachNagengast made their first contribution in https://github.com/huggingface/datasets/pull/6322

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6

- Python
Published by lhoestq over 2 years ago

datasets - 2.14.5

Bug fixes

Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6091
Minor fix in iter_files for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
Use yaml instead of get data patterns when possible by @lhoestq in https://github.com/huggingface/datasets/pull/6154
Fix Parquet loading with columns by @mariosasko in https://github.com/huggingface/datasets/pull/6160
Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
PyArrow 13 CI fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6175
Don't alter input in Features.from_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6189
Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/6165
Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in https://github.com/huggingface/datasets/pull/6192
Temporarily pin pandas < 2.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6200
Preserve split order in DataFilesDict by @albertvillanova in https://github.com/huggingface/datasets/pull/6198
Add missing revision argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
Temporarily pin fsspec < 2023.9.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/6210
Do not filter out .zip extensions from no-script datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6208
Fix empty splitinfo json by @lhoestq in https://github.com/huggingface/datasets/pull/6211
Fix to_json ValueError and remove pandas pin by @albertvillanova in https://github.com/huggingface/datasets/pull/6201
Fix checking patterns to infer packaged builder by @polinaeterna in https://github.com/huggingface/datasets/pull/6215
Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/6218

Other improvements

Deprecate Dataset.export by @mariosasko in https://github.com/huggingface/datasets/pull/6081
Deprecate download_custom by @mariosasko in https://github.com/huggingface/datasets/pull/6093
Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in https://github.com/huggingface/datasets/pull/6138
Remove unused allowed_extensions param by @albertvillanova in https://github.com/huggingface/datasets/pull/6135
Export to_iterable_dataset to document by @npuichigo in https://github.com/huggingface/datasets/pull/6145
[Docs] Add description of select_columns to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
Ignore parallel warning in map_nested by @lhoestq in https://github.com/huggingface/datasets/pull/6148
[docs] Complete to_iterable_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/6158
Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in https://github.com/huggingface/datasets/pull/6155
Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in https://github.com/huggingface/datasets/pull/6171
Document BUILDER_CONFIG_CLASS by @lhoestq in https://github.com/huggingface/datasets/pull/6166
Fix import in image_load doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
Use object detection images from huggingface/documentation-images by @mariosasko in https://github.com/huggingface/datasets/pull/6177
Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180

New Contributors

@npuichigo made their first contribution in https://github.com/huggingface/datasets/pull/6145
@unifyh made their first contribution in https://github.com/huggingface/datasets/pull/6119

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5

- Python
Published by albertvillanova almost 3 years ago

datasets - 2.13.2

Bug fixes

Do not filter out .zip extensions from no-script datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/6208

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2

- Python
Published by albertvillanova almost 3 years ago

datasets - 2.14.4

What's Changed

Fix authentication issues by @albertvillanova in https://github.com/huggingface/datasets/pull/6127

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4

- Python
Published by albertvillanova almost 3 years ago

datasets - 2.14.3

Bug fixes

Fix error when loading from GCP bucket by @albertvillanova in https://github.com/huggingface/datasets/pull/6105
Fix deprecation of use_auth_token in file_utils by @albertvillanova in https://github.com/huggingface/datasets/pull/6107

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3

- Python
Published by albertvillanova almost 3 years ago

datasets - 2.14.2

Bug fixes

Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in https://github.com/huggingface/datasets/pull/6094
Fix deprecation of errors in TextConfig by @albertvillanova in https://github.com/huggingface/datasets/pull/6095

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2

- Python
Published by albertvillanova almost 3 years ago

datasets - 2.14.1

Bug fixes

fix tqdm lock by @lhoestq in https://github.com/huggingface/datasets/pull/6067
fix tqdm lock deletion by @lhoestq in https://github.com/huggingface/datasets/pull/6068
Fix fsspec storage_options from load_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/6072
No gzip encoding from github by @lhoestq in https://github.com/huggingface/datasets/pull/6076

Other improvements

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in https://github.com/huggingface/datasets/pull/5902
Fix Quickstart notebook link by @mariosasko in https://github.com/huggingface/datasets/pull/6070
Remove README link to deprecated Colab notebook by @mariosasko in https://github.com/huggingface/datasets/pull/6080
Misc doc improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6074

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1

- Python
Published by lhoestq almost 3 years ago

datasets - 2.14.0

Important: caching

Datasets downloaded and cached using datasets>=2.14.0 may not be reloadable from cache with older versions of datasets, and would therefore be re-downloaded.
Datasets that were already cached are still supported.
This affects datasets on Hugging Face without dataset scripts, e.g. those made of pure parquet, csv, jsonl, etc. files.
This is because the default configuration name for those datasets was fixed (from "username--dataset_name" to "default") in https://github.com/huggingface/datasets/pull/5331.

Dataset Configuration

Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
- Configure your dataset using YAML at the top of your dataset card (docs here)
- Choose which file goes into which split

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data.csv
  - split: test
    path: holdout.csv
---
```

- Define multiple dataset configurations

```yaml
---
configs:
- config_name: main_data
  data_files: main_data.csv
- config_name: additional_data
  data_files: additional_data.csv
---
```

Dataset Features

Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
- push_to_hub() additional dataset configurations

```python
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
```

Support returning dataframe in map transform by @mariosasko in https://github.com/huggingface/datasets/pull/5995

What's Changed

Deprecate errors param in favor of encoding_errors in text builder by @mariosasko in https://github.com/huggingface/datasets/pull/5974
Fix select_columns columns order by @lhoestq in https://github.com/huggingface/datasets/pull/5994
Replace metadata utils with huggingface_hub's RepoCard API by @mariosasko in https://github.com/huggingface/datasets/pull/5949
Pin joblib to avoid joblibspark test failures by @mariosasko in https://github.com/huggingface/datasets/pull/6000
Align column_names type check with type hint in sort by @mariosasko in https://github.com/huggingface/datasets/pull/6001
Deprecate use_auth_token in favor of token by @mariosasko in https://github.com/huggingface/datasets/pull/5996
Drop Python 3.7 support by @mariosasko in https://github.com/huggingface/datasets/pull/6005
Misc improvements by @mariosasko in https://github.com/huggingface/datasets/pull/6004
Make IterableDataset.from_spark more efficient by @mathewjacob1002 in https://github.com/huggingface/datasets/pull/5986
Fix cast for dictionaries with no keys by @mariosasko in https://github.com/huggingface/datasets/pull/6009
Avoid stuck map operation when subprocesses crashes by @pappacena in https://github.com/huggingface/datasets/pull/5976
Deprecate task api by @mariosasko in https://github.com/huggingface/datasets/pull/5865
Add metadata ui screenshot in docs by @lhoestq in https://github.com/huggingface/datasets/pull/6015
Fix ClassLabel min max check for None values by @mariosasko in https://github.com/huggingface/datasets/pull/6023
[docs] Update return statement of index search by @stevhliu in https://github.com/huggingface/datasets/pull/6021
Improve logging by @mariosasko in https://github.com/huggingface/datasets/pull/6019
Fix style with ruff 0.0.278 by @lhoestq in https://github.com/huggingface/datasets/pull/6026
Don't reference self in Spark._validate_cache_dir by @maddiedawson in https://github.com/huggingface/datasets/pull/6024
Delete task_templates in IterableDataset when they are no longer valid by @mariosasko in https://github.com/huggingface/datasets/pull/6027
[docs] Fix link by @stevhliu in https://github.com/huggingface/datasets/pull/6029
fixed typo in comment by @NightMachinery in https://github.com/huggingface/datasets/pull/6030
Fix legacy_dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/6040
Flatten repository_structure docs on yaml by @lhoestq in https://github.com/huggingface/datasets/pull/6041
Use new hffs by @lhoestq in https://github.com/huggingface/datasets/pull/6028
Bump dev version by @lhoestq in https://github.com/huggingface/datasets/pull/6047
Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/6042
Rename "pattern" to "path" in YAML data_files configs by @lhoestq in https://github.com/huggingface/datasets/pull/6044
Remove HfFileSystem and deprecate S3FileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/6052
Dill 3.7 support by @mariosasko in https://github.com/huggingface/datasets/pull/6061
Improve Dataset.from_list docstring by @mariosasko in https://github.com/huggingface/datasets/pull/6062
Check if column names match in Parquet loader only when config features are specified by @mariosasko in https://github.com/huggingface/datasets/pull/6045
Release: 2.14.0 by @lhoestq in https://github.com/huggingface/datasets/pull/6063

New Contributors

@mathewjacob1002 made their first contribution in https://github.com/huggingface/datasets/pull/5986
@pappacena made their first contribution in https://github.com/huggingface/datasets/pull/5976

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0

- Python
Published by lhoestq almost 3 years ago

datasets - 2.13.1

General improvements and bug fixes

Fix JSON generation in benchmarks CI by @mariosasko in https://github.com/huggingface/datasets/pull/5966
Always return list in list_datasets by @mariosasko in https://github.com/huggingface/datasets/pull/5964
Add encoding and errors params to JSON loader by @mariosasko in https://github.com/huggingface/datasets/pull/5969
Filter unsupported extensions by @lhoestq in https://github.com/huggingface/datasets/pull/5972

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1

- Python
Published by lhoestq about 3 years ago

datasets - 2.13.0

Dataset Features

Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770
- Stream the data from your Spark DataFrame directly to your training pipeline

```python
from datasets import IterableDataset
from torch.utils.data import DataLoader

ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...
```

IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
- IterableDataset Arrow formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5821
- Iterable torch formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5852

```python
from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
```

Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

```python
from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")
```

Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

```python
from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
```

Experimental

Add parallel module using joblib for Spark by @es94129 in https://github.com/huggingface/datasets/pull/5924

General improvements and bug fixes

Preserve stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
Fix incomplete docstring for BuilderConfig by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
[docs] Custom decoding transforms by @stevhliu in https://github.com/huggingface/datasets/pull/5836
Add accelerate as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
Add date_format param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
[docs] Redirects, migrated from nginx by @julien-c in https://github.com/huggingface/datasets/pull/5853
Fix infer module for uppercase extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5872
Minor tqdm optim by @lhoestq in https://github.com/huggingface/datasets/pull/5860
Always set nullable fields in the writer by @lhoestq in https://github.com/huggingface/datasets/pull/5835
Add fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
Better error message when combining dataset dicts instead of datasets by @lhoestq in https://github.com/huggingface/datasets/pull/5861
Force overwrite existing filesystem protocol by @baskrahmer in https://github.com/huggingface/datasets/pull/5894
Support working_dir in from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5826
Raise TypeError when indexing a dataset with bool by @albertvillanova in https://github.com/huggingface/datasets/pull/5859
Fix minor typo in docs loading.mdx by @albertvillanova in https://github.com/huggingface/datasets/pull/5900
Fix FixedSizeListArray casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
Unpin responses by @mariosasko in https://github.com/huggingface/datasets/pull/5916
Validate name parameter in make_file_instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/5904
Raise error in DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in https://github.com/huggingface/datasets/pull/5915
Refactor extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5917
Use more efficient and idiomatic way to construct list. by @ttsugriy in https://github.com/huggingface/datasets/pull/5909
Add flatten_indices to DatasetDict by @maximxlss in https://github.com/huggingface/datasets/pull/5907
Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in https://github.com/huggingface/datasets/pull/5920
Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in https://github.com/huggingface/datasets/pull/5901
Fix streaming parquet with image feature in schema by @lhoestq in https://github.com/huggingface/datasets/pull/5921
canonicalize data dir in config ID hash by @kylrth in https://github.com/huggingface/datasets/pull/5899
Fix link to quickstart docs in README.md by @mariosasko in https://github.com/huggingface/datasets/pull/5928
Fix string-encoding, make batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5863
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/datasets/pull/5932
Fix to_numpy when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933
Better row group size in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/5935
Avoid parallel redownload in cache by @albertvillanova in https://github.com/huggingface/datasets/pull/5937
Better filenotfound for gated by @lhoestq in https://github.com/huggingface/datasets/pull/5954
Make get_from_cache use custom temp filename that is locked by @albertvillanova in https://github.com/huggingface/datasets/pull/5938
Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/5956
Add Arrow builder docs by @lhoestq in https://github.com/huggingface/datasets/pull/5952
Fix sequence of array support for most dtype by @qgallouedec in https://github.com/huggingface/datasets/pull/5948

New Contributors

@Laurent2916 made their first contribution in https://github.com/huggingface/datasets/pull/5824
@yuukicammy made their first contribution in https://github.com/huggingface/datasets/pull/5810
@baskrahmer made their first contribution in https://github.com/huggingface/datasets/pull/5894
@ttsugriy made their first contribution in https://github.com/huggingface/datasets/pull/5909
@maximxlss made their first contribution in https://github.com/huggingface/datasets/pull/5907
@mariusz-jachimowicz-83 made their first contribution in https://github.com/huggingface/datasets/pull/5893
@kylrth made their first contribution in https://github.com/huggingface/datasets/pull/5899
@qgallouedec made their first contribution in https://github.com/huggingface/datasets/pull/5933
@es94129 made their first contribution in https://github.com/huggingface/datasets/pull/5924

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...zef

- Python
Published by lhoestq about 3 years ago

datasets - 2.12.0

Datasets Features

Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701
- Get a Dataset from a Spark DataFrame (docs):

```python
from datasets import Dataset
ds = Dataset.from_spark(df)
```

Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689
- Stream data from Wikipedia:

```python
from datasets import load_dataset
ds = load_dataset("wikipedia", "20220301.de", streaming=True)
next(iter(ds["train"]))
# {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
```

Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735
- Use interleaved datasets in a distributed setup or with a DataLoader

```python
from datasets import load_dataset, interleave_datasets
from torch.utils.data import DataLoader
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
c4 = load_dataset("c4", "en", split="train", streaming=True)
merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
dataloader = DataLoader(merged, num_workers=4)
```

Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
- Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
- Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
- Allow converting the variable-shaped ArrayND to Pandas

General improvements and bug fixes

Fix a description error for interleave_datasets. by @QizhiPei in https://github.com/huggingface/datasets/pull/5680
[docs] Split pattern search order by @stevhliu in https://github.com/huggingface/datasets/pull/5693
Raise an error on missing distributed seed by @lhoestq in https://github.com/huggingface/datasets/pull/5697
Fix xnumpy_load for .npz files by @albertvillanova in https://github.com/huggingface/datasets/pull/5714
Temporarily pin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5731
Unpin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5733
Fix CI warnings by @albertvillanova in https://github.com/huggingface/datasets/pull/5741
Fix CI mock filesystem fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/5740
Fix link in docs by @bbbxyz in https://github.com/huggingface/datasets/pull/5746
fix typo: "mow" -> "now" by @csris in https://github.com/huggingface/datasets/pull/5763
[docs] Compress data files by @stevhliu in https://github.com/huggingface/datasets/pull/5691
Fix style by @lhoestq in https://github.com/huggingface/datasets/pull/5774
Minor tqdm fixes by @mariosasko in https://github.com/huggingface/datasets/pull/5754
Fixes #5757 by @eli-osherovich in https://github.com/huggingface/datasets/pull/5758
Fix JSON builder when missing keys in first row by @albertvillanova in https://github.com/huggingface/datasets/pull/5772
Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in https://github.com/huggingface/datasets/pull/5742
Prepare tests for hfh 0.14 by @Wauplin in https://github.com/huggingface/datasets/pull/5788
Call fs.makedirs in save_to_disk by @lhoestq in https://github.com/huggingface/datasets/pull/5779
Allow to run CI on push to ci-branch by @albertvillanova in https://github.com/huggingface/datasets/pull/5790
Fix nondeterministic sharded data split order by @albertvillanova in https://github.com/huggingface/datasets/pull/5729
Raise subprocesses traceback when interrupting by @lhoestq in https://github.com/huggingface/datasets/pull/5784
Fix spark imports by @lhoestq in https://github.com/huggingface/datasets/pull/5795
Change downloaded file permission based on umask by @albertvillanova in https://github.com/huggingface/datasets/pull/5800
Fix inferring module for unsupported data files by @albertvillanova in https://github.com/huggingface/datasets/pull/5787
Reorder default data splits to have validation before test by @albertvillanova in https://github.com/huggingface/datasets/pull/5718
Validate non-empty data_files by @albertvillanova in https://github.com/huggingface/datasets/pull/5802
Spark docs by @lhoestq in https://github.com/huggingface/datasets/pull/5796
Release: 2.12.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5803

New Contributors

@QizhiPei made their first contribution in https://github.com/huggingface/datasets/pull/5680
@bbbxyz made their first contribution in https://github.com/huggingface/datasets/pull/5746
@csris made their first contribution in https://github.com/huggingface/datasets/pull/5763
@eli-osherovich made their first contribution in https://github.com/huggingface/datasets/pull/5758
@maddiedawson made their first contribution in https://github.com/huggingface/datasets/pull/5701

Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0

- Python
Published by lhoestq about 3 years ago

datasets - 2.11.0

Important

Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in https://github.com/huggingface/datasets/pull/5573
- this removes the dependency on PyTorch for decoding audio files
- this is possible because soundfile 0.12 bundles a recent version of the libsndfile binaries with MP3 support
Deprecated batch_size on Dataset.to_dict()

Datasets Features

Add writer_batch_size for ArrowBasedBuilder by @lhoestq in https://github.com/huggingface/datasets/pull/5565
- lets you specify the row group / record batch size when you download_and_prepare() a dataset
Experimental support of cloud storage in load_dataset():
- Support cloud storage in load_dataset via fsspec by @dwyatte in https://github.com/huggingface/datasets/pull/5580
- Pass down storage options by @dwyatte in https://github.com/huggingface/datasets/pull/5673
Support PyArrow arrays as column values in from_dict by @mariosasko in https://github.com/huggingface/datasets/pull/5643
Allow direct cast from binary to Audio/Image by @mariosasko in https://github.com/huggingface/datasets/pull/5644
Add column_names to IterableDataset by @patrickloeber in https://github.com/huggingface/datasets/pull/5582
pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5569
add Dataset.to_list by @kyoto7250 in https://github.com/huggingface/datasets/pull/5611

General improvements and bug fixes

Update csv.py by @XDoubleU in https://github.com/huggingface/datasets/pull/5562
Remove instructions for ffmpeg system package installation on Colab by @polinaeterna in https://github.com/huggingface/datasets/pull/5558
Apply ruff flake8-comprehension checks by @Skylion007 in https://github.com/huggingface/datasets/pull/5549
Fix datasets.load_from_disk, DatasetDict.load_from_disk and Dataset.load_from_disk by @alvarobartt in https://github.com/huggingface/datasets/pull/5529
Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/5561
Add huggingface_hub version to env cli command by @mariosasko in https://github.com/huggingface/datasets/pull/5578
Do not write index by default when exporting a dataset by @mariosasko in https://github.com/huggingface/datasets/pull/5583
Flatten dataset on the fly in save_to_disk by @mariosasko in https://github.com/huggingface/datasets/pull/5588
Fix sort with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
Fix docstring example by @stevhliu in https://github.com/huggingface/datasets/pull/5592
Fix push_to_hub with no dataset_infos by @lhoestq in https://github.com/huggingface/datasets/pull/5598
Don't compute checksums if not necessary in datasets-cli test by @lhoestq in https://github.com/huggingface/datasets/pull/5603
Update README logo by @gary149 in https://github.com/huggingface/datasets/pull/5605
Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5617
Fix archive fs test by @lhoestq in https://github.com/huggingface/datasets/pull/5614
unpin fsspec by @lhoestq in https://github.com/huggingface/datasets/pull/5619
Bump pyarrow to 8.0.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5620
Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in https://github.com/huggingface/datasets/pull/5623
Fix outdated verification_mode values by @polinaeterna in https://github.com/huggingface/datasets/pull/5607
Adding Oracle Cloud to docs by @ahosler in https://github.com/huggingface/datasets/pull/5621
Fix CI: ignore C901 ("some_func" is too complex) in ruff by @polinaeterna in https://github.com/huggingface/datasets/pull/5636
add kwargs to index search by @SaulLu in https://github.com/huggingface/datasets/pull/5628
Less zip false positives by @lhoestq in https://github.com/huggingface/datasets/pull/5640
Allow self as key in Features by @mariosasko in https://github.com/huggingface/datasets/pull/5646
Bump hfh to 0.11.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5642
Support streaming datasets with numpy.load by @albertvillanova in https://github.com/huggingface/datasets/pull/5626
Fix unnecessary dict comprehension by @albertvillanova in https://github.com/huggingface/datasets/pull/5662
Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5664
Copy features by @lhoestq in https://github.com/huggingface/datasets/pull/5652
Improve features decoding in to_iterable_dataset by @lhoestq in https://github.com/huggingface/datasets/pull/5655
Fix fsspec.open when using an HTTP proxy by @bryant1410 in https://github.com/huggingface/datasets/pull/5656
Jax requires jaxlib by @lhoestq in https://github.com/huggingface/datasets/pull/5667
docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in https://github.com/huggingface/datasets/pull/5658
Allow loading/saving of FAISS index using fsspec by @Dref360 in https://github.com/huggingface/datasets/pull/5526
Fix verification_mode when ignore_verifications is passed by @albertvillanova in https://github.com/huggingface/datasets/pull/5683
Release: 2.11.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5684

New Contributors

@XDoubleU made their first contribution in https://github.com/huggingface/datasets/pull/5562
@Skylion007 made their first contribution in https://github.com/huggingface/datasets/pull/5549
@Hubert-Bonisseur made their first contribution in https://github.com/huggingface/datasets/pull/5569
@ahosler made their first contribution in https://github.com/huggingface/datasets/pull/5621
@patrickloeber made their first contribution in https://github.com/huggingface/datasets/pull/5582
@SaulLu made their first contribution in https://github.com/huggingface/datasets/pull/5628
@connor-henderson made their first contribution in https://github.com/huggingface/datasets/pull/5658
@kyoto7250 made their first contribution in https://github.com/huggingface/datasets/pull/5611

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.11.0

- Python
Published by lhoestq over 3 years ago

datasets - 2.10.1

What's Changed

Fix sort with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
- Fix IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.10.1

- Python
Published by lhoestq over 3 years ago

datasets - 2.10.0

Important

Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in https://github.com/huggingface/datasets/pull/5542
- Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
Skip dataset verifications by default by @mariosasko in https://github.com/huggingface/datasets/pull/5303
- introduces multiple verification_mode values you can pass to load_dataset()
- the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

Single TQDM bar in multi-proc map by @mariosasko in https://github.com/huggingface/datasets/pull/5455
- No more stacked TQDM bars when calling .map() in multiprocessing
Map-style Dataset to IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5410
- introduces .to_iterable_dataset() to get an IterableDataset from a Dataset
- see all the advantages of IterableDataset in the documentation about the differences between Dataset and IterableDataset
Select columns of Dataset or DatasetDict by @daskol in https://github.com/huggingface/datasets/pull/5480
- introduces .select_columns() to return a dataset only containing the requested columns
Added functionality: sort datasets by multiple keys by @MichlF in https://github.com/huggingface/datasets/pull/5502
- introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
Add JAX device selection when formatting by @alvarobartt in https://github.com/huggingface/datasets/pull/5547
- introduces ds = ds.with_format("jax", device=device)
Reload features from Parquet metadata by @MFreidank in https://github.com/huggingface/datasets/pull/5516
Speed up batched PyTorch DataLoader by @lhoestq in https://github.com/huggingface/datasets/pull/5512

Documentation

Add section in tutorial for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
Tutorial for creating a dataset by @stevhliu in https://github.com/huggingface/datasets/pull/5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
Add JAX-formatting documentation by @alvarobartt in https://github.com/huggingface/datasets/pull/5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax

General improvements and bug fixes

Pin sqlalchemy by @lhoestq in https://github.com/huggingface/datasets/pull/5476
Update dataset card creation by @stevhliu in https://github.com/huggingface/datasets/pull/5470
Add num_test_batches option by @amyeroberts in https://github.com/huggingface/datasets/pull/5471
Tip for recomputing metadata by @stevhliu in https://github.com/huggingface/datasets/pull/5478
Disable aiohttp requoting of redirection URL by @albertvillanova in https://github.com/huggingface/datasets/pull/5459
[MINOR] Typo by @cakiki in https://github.com/huggingface/datasets/pull/5491
Pin dill lower version by @albertvillanova in https://github.com/huggingface/datasets/pull/5489
Improved error message for gated/private repos by @osanseviero in https://github.com/huggingface/datasets/pull/5497
Update docs for nyu_depth_v2 dataset by @awsaf49 in https://github.com/huggingface/datasets/pull/5484
don't zero copy timestamps by @dwyatte in https://github.com/huggingface/datasets/pull/5504
Remove unused load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in https://github.com/huggingface/datasets/pull/5493
Do not add index column by default when exporting to CSV by @albertvillanova in https://github.com/huggingface/datasets/pull/5490
Fix bug when casting empty array to class labels by @marioga in https://github.com/huggingface/datasets/pull/5521
Fix benchmarks CI - pin protobuf by @lhoestq in https://github.com/huggingface/datasets/pull/5527
Remove py.typed by @mariosasko in https://github.com/huggingface/datasets/pull/5518
Add missing license in NumpyFormatter by @alvarobartt in https://github.com/huggingface/datasets/pull/5530
Unify load_from_cache_file type and logic by @HallerPatrick in https://github.com/huggingface/datasets/pull/5515
Format code with ruff by @mariosasko in https://github.com/huggingface/datasets/pull/5519
Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in https://github.com/huggingface/datasets/pull/5522
Resolve four broken refs in the docs by @tomaarsen in https://github.com/huggingface/datasets/pull/5550
Use default audio resampling type by @lhoestq in https://github.com/huggingface/datasets/pull/5556
- resampy is no longer needed to resample audio data
improved message error row formatting by @Plutone11011 in https://github.com/huggingface/datasets/pull/5553
Make tiktoken tokenizers hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5552
Suggest scikit-learn instead of sklearn by @osbm in https://github.com/huggingface/datasets/pull/5551
Add filter desc by @lhoestq in https://github.com/huggingface/datasets/pull/5557
Fix map suffix_template by @lhoestq in https://github.com/huggingface/datasets/pull/5559
Ensure last tqdm update in map by @mariosasko in https://github.com/huggingface/datasets/pull/5560

New Contributors

@amyeroberts made their first contribution in https://github.com/huggingface/datasets/pull/5471
@awsaf49 made their first contribution in https://github.com/huggingface/datasets/pull/5484
@dwyatte made their first contribution in https://github.com/huggingface/datasets/pull/5504
@marioga made their first contribution in https://github.com/huggingface/datasets/pull/5521
@MFreidank made their first contribution in https://github.com/huggingface/datasets/pull/5516
@daskol made their first contribution in https://github.com/huggingface/datasets/pull/5480
@Plutone11011 made their first contribution in https://github.com/huggingface/datasets/pull/5553
@osbm made their first contribution in https://github.com/huggingface/datasets/pull/5551
@MichlF made their first contribution in https://github.com/huggingface/datasets/pull/5502

Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...2.10.0

- Python
Published by lhoestq over 3 years ago

datasets - 2.9.0

Datasets Features

Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
- Pass num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing
Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
- Split your dataset for each node for distributed training
- It supports both Dataset and IterableDataset (e.g. in streaming mode)
- See the documentation for more details

```python
import os
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
Tqdm progress bar for to_parquet by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365

Documentation

Depth estimation dataset guide by @sayakpaul in https://github.com/huggingface/datasets/pull/5379
- see https://huggingface.co/docs/datasets/main/en/depth_estimation
Imagefolder docs: mention support of CSV and ZIP by @lhoestq in https://github.com/huggingface/datasets/pull/5463
- see https://huggingface.co/docs/datasets/main/en/image_load#imagefolder
Update docs of S3 filesystem with async aiobotocore by @maheshpec in https://github.com/huggingface/datasets/pull/5411
- see https://huggingface.co/docs/datasets/main/en/filesystems#amazon-s3

General improvements and bug fixes

Raise error if ClassLabel names is not python list by @freddyheppell in https://github.com/huggingface/datasets/pull/5359
Temporarily pin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5395
Unpin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5397
Replace one letter import in docs by @MKhalusova in https://github.com/huggingface/datasets/pull/5403
Fix Colab notebook link by @albertvillanova in https://github.com/huggingface/datasets/pull/5392
Fix fs.open resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in https://github.com/huggingface/datasets/pull/5409
Fix streaming pandas.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/5372
ci: 🎡 remove two obsolete issue templates by @severo in https://github.com/huggingface/datasets/pull/5420
Handle 0-dim tensors in cast_to_python_objects by @mariosasko in https://github.com/huggingface/datasets/pull/5384
Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5429
Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in https://github.com/huggingface/datasets/pull/5432
Revert container image pin in CI benchmarks by @0x2b3bfa0 in https://github.com/huggingface/datasets/pull/5436
Finish deprecating the fs argument by @dconathan in https://github.com/huggingface/datasets/pull/5393
Update actions/checkout in CD Conda release by @albertvillanova in https://github.com/huggingface/datasets/pull/5438
Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/5416
Fix documentation about batch samplers by @thomasw21 in https://github.com/huggingface/datasets/pull/5440
Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5447
Support fsspec 2023.1.0 in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/5449
Update share tutorial by @stevhliu in https://github.com/huggingface/datasets/pull/5443
Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in https://github.com/huggingface/datasets/pull/5452
Fix base directory while extracting insecure TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/5453
Fix link in load_dataset docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in https://github.com/huggingface/datasets/pull/5460
Concatenate on axis=1 with misaligned blocks by @lhoestq in https://github.com/huggingface/datasets/pull/5462
Raise from disconnect error in xopen by @lhoestq in https://github.com/huggingface/datasets/pull/5382
remove pathlib.Path with URIs by @jonny-cyberhaven in https://github.com/huggingface/datasets/pull/5466
Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in https://github.com/huggingface/datasets/pull/5469

New Contributors

@freddyheppell made their first contribution in https://github.com/huggingface/datasets/pull/5359
@MKhalusova made their first contribution in https://github.com/huggingface/datasets/pull/5403
@tkukurin made their first contribution in https://github.com/huggingface/datasets/pull/5358
@0x2b3bfa0 made their first contribution in https://github.com/huggingface/datasets/pull/5436
@maheshpec made their first contribution in https://github.com/huggingface/datasets/pull/5411
@dconathan made their first contribution in https://github.com/huggingface/datasets/pull/5393
@zanussbaum made their first contribution in https://github.com/huggingface/datasets/pull/5456
@jonny-cyberhaven made their first contribution in https://github.com/huggingface/datasets/pull/5466

Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0

- Python
Published by lhoestq over 3 years ago

datasets - 2.8.0

Important

Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
- From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to the Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
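
The key change itself is small. As a rough illustration in plain Python (the label names are hypothetical, and the actual YAML layout written by datasets is not reproduced here), integer ids become their string form:

```python
# Hypothetical label mapping keyed by integer ids, as in the old model.
old_names = {0: "negative", 1: "positive"}

# The new model keys the same mapping by the string form of each id.
new_names = {str(label_id): name for label_id, name in old_names.items()}

print(new_names)  # {'0': 'negative', '1': 'positive'}
```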

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
- Datasets in streaming mode now update their features after column renaming or removal
Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239
- Use multiprocessing to load multiple files in parallel
Add features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252
Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

Docs

Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248

General improvements and bug fixes

typo by @WrRan in https://github.com/huggingface/datasets/pull/5253
typo by @WrRan in https://github.com/huggingface/datasets/pull/5254
remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257
fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256
Fix max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266
Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250
Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279
Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210
Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285
Use correct dataset type in from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294
Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297
Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299
Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065
Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211
Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310
Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313
Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319
[Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320
Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321
- This was affecting datasets like wikipedia or natural_questions
Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328
Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318
fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333
Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329
Close stream in ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
Use same num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
Set IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234
Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340
Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334
Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341
Support topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350
Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349
Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344
Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355
Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322
Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356
ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366
Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373
Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375

New Contributors

@WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253
@eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318
@vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234
@Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0

- Python
Published by lhoestq over 3 years ago

datasets - 2.6.2

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2

- Python
Published by albertvillanova over 3 years ago

datasets - 2.7.1

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1

- Python
Published by albertvillanova over 3 years ago

datasets - 2.7.0

Dataset Features

Multiprocessed dataset builder by @TevenLeScao in https://github.com/huggingface/datasets/pull/5107
- Load big datasets faster than before using multiprocessing:

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k", num_proc=4)
```
Make torch.Tensor and spacy models cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/5191
- Function passed to map or filter that uses tensors or pipelines can now be cached
Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in https://github.com/huggingface/datasets/pull/5192
TextConfig: added "errors" by @NightMachinery in https://github.com/huggingface/datasets/pull/5155
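
The `errors` option added to TextConfig above concerns how undecodable bytes in text files are handled. The following is a minimal sketch, assuming the option mirrors the standard `errors` argument of Python's own decoding functions (an assumption, not something these notes confirm). It uses only the standard library and made-up sample bytes:

```python
# Sample bytes that are valid Latin-1 but not valid UTF-8 (illustrative data only).
raw = "caf\u00e9".encode("latin-1")

# "strict" (Python's default) raises on the first undecodable byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad byte; "ignore" silently drops it.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed, repr(replaced), repr(ignored))
```

Choosing "replace" or "ignore" trades silent data loss for the ability to load files with mixed or unknown encodings, so "strict" is usually the safer default.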

Audio setup

Add ffmpeg4 installation instructions in warnings by @polinaeterna in https://github.com/huggingface/datasets/pull/5167

Docs

Update create image dataset docs by @stevhliu in https://github.com/huggingface/datasets/pull/5177
add: segmentation guide. by @sayakpaul in https://github.com/huggingface/datasets/pull/5188
Reword E2E training and inference tips in the vision guides by @sayakpaul in https://github.com/huggingface/datasets/pull/5217
Add SQL guide by @stevhliu in https://github.com/huggingface/datasets/pull/5223

General improvements and bug fixes

Add pyproject.toml for black by @mariosasko in https://github.com/huggingface/datasets/pull/5125
Fix tqdm zip bug by @david1542 in https://github.com/huggingface/datasets/pull/5120
Install tensorflow-macos dependency conditionally by @albertvillanova in https://github.com/huggingface/datasets/pull/5124
[TYPO] Update new_dataset_script.py by @cakiki in https://github.com/huggingface/datasets/pull/5119
Avoid extra cast in class_encode_column by @mariosasko in https://github.com/huggingface/datasets/pull/5130
Use yaml for issue templates + revamp by @mariosasko in https://github.com/huggingface/datasets/pull/5116
Update docs once dataset scripts transferred to the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5136
Delete duplicate issue template file by @albertvillanova in https://github.com/huggingface/datasets/pull/5146
Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in https://github.com/huggingface/datasets/pull/5142
Raise ImportError instead of OSError by @ayushthe1 in https://github.com/huggingface/datasets/pull/5141
Fix CI require beam by @albertvillanova in https://github.com/huggingface/datasets/pull/5168
Make iter_files deterministic by @albertvillanova in https://github.com/huggingface/datasets/pull/5149
Add PB and TB in convert_file_size_to_int by @lhoestq in https://github.com/huggingface/datasets/pull/5171
Reduce default max writer_batch_size by @mariosasko in https://github.com/huggingface/datasets/pull/5163
Support dill 0.3.6 by @albertvillanova in https://github.com/huggingface/datasets/pull/5166
Make filename matching more robust by @riccardobucco in https://github.com/huggingface/datasets/pull/5128
Preserve None in list type cast in PyArrow 10 by @mariosasko in https://github.com/huggingface/datasets/pull/5174
Raise ffmpeg warnings only once by @polinaeterna in https://github.com/huggingface/datasets/pull/5173
Add "ipykernel" to list of co_filenames to remove by @gpucce in https://github.com/huggingface/datasets/pull/5169
chore: add notebook links to img cls and obj det. by @sayakpaul in https://github.com/huggingface/datasets/pull/5187
Fix docs about dataset_info in YAML by @albertvillanova in https://github.com/huggingface/datasets/pull/5194
fsspec lock reset in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5159
Add note about the name of a dataset script by @polinaeterna in https://github.com/huggingface/datasets/pull/5198
Deprecate dummy data generation command by @mariosasko in https://github.com/huggingface/datasets/pull/5199
Do not sort splits in dataset info by @polinaeterna in https://github.com/huggingface/datasets/pull/5201
Add missing DownloadConfig.use_auth_token value by @alvarobartt in https://github.com/huggingface/datasets/pull/5205
Update canonical links to Hub links by @stevhliu in https://github.com/huggingface/datasets/pull/5203
Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in https://github.com/huggingface/datasets/pull/5208
Update github pr docs actions by @mishig25 in https://github.com/huggingface/datasets/pull/5214
Use hfh hf_hub_url function by @albertvillanova in https://github.com/huggingface/datasets/pull/5196
Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in https://github.com/huggingface/datasets/pull/5235
Fix shards in IterableDataset.from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/5233
Fix class name of symbolic link by @riccardobucco in https://github.com/huggingface/datasets/pull/5126
Make Version hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5238
Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in https://github.com/huggingface/datasets/pull/5236
Encode path only for old versions of hfh by @lhoestq in https://github.com/huggingface/datasets/pull/5237
Fix CI require_beam maximum compatible dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/5212
Support hfh rc version by @lhoestq in https://github.com/huggingface/datasets/pull/5241
Cleaner error tracebacks for dataset script errors by @mariosasko in https://github.com/huggingface/datasets/pull/5240
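
One entry in the list above adds PB and TB units to the library's size-string conversion. As a rough illustration of what such a conversion involves, here is a standalone sketch. The `parse_size` helper and its unit table are hypothetical and are not the library's implementation; the sketch assumes decimal units (KB, MB, ...) are powers of 1000 and binary units (KiB, MiB, ...) are powers of 1024.

```python
import re

# Hypothetical unit table: decimal units use powers of 10, binary units powers of 2.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
}


def parse_size(size):
    """Convert a size such as "5GB" or "2PiB" to a number of bytes (illustrative helper)."""
    if isinstance(size, int):
        return size  # already a byte count
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit in size: {size!r}")
    return int(number) * _UNITS[unit]


print(parse_size("1TB"), parse_size("2PiB"), parse_size("500 mb"))
```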

New Contributors

@david1542 made their first contribution in https://github.com/huggingface/datasets/pull/5120
@ayushthe1 made their first contribution in https://github.com/huggingface/datasets/pull/5142
@gpucce made their first contribution in https://github.com/huggingface/datasets/pull/5169
@sayakpaul made their first contribution in https://github.com/huggingface/datasets/pull/5187
@NightMachinery made their first contribution in https://github.com/huggingface/datasets/pull/5155

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.7.0

- Python
Published by albertvillanova over 3 years ago

datasets - 2.6.1

Bug fixes

Fix filter indices when batched by @albertvillanova in https://github.com/huggingface/datasets/pull/5113
- fixed a bug where filter could return examples with the wrong indices
Fix iter_batches by @lhoestq in https://github.com/huggingface/datasets/pull/5115
- fixed a bug where map with batched=True could return a dataset with fewer examples
Fix a typo in arrow_dataset.py by @yangky11 in https://github.com/huggingface/datasets/pull/5108
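
To see why indices are easy to get wrong in batched filtering, consider the standalone sketch below. It is not the library's code: `batched_filter_indices` is a hypothetical helper showing that a predicate applied per batch yields batch-local positions, which must be shifted by the batch's starting offset to obtain dataset-level indices.

```python
def batched_filter_indices(values, predicate, batch_size):
    """Return the global indices of the values kept by a batch-level predicate."""
    kept = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        mask = predicate(batch)  # one boolean per element of this batch
        # Positions in `mask` are local to the batch: add the batch offset.
        kept.extend(start + i for i, keep in enumerate(mask) if keep)
    return kept


values = list(range(10))
indices = batched_filter_indices(values, lambda b: [x % 3 == 0 for x in b], batch_size=4)
print(indices)
```

Dropping the `start +` offset would return positions relative to each batch, which is the kind of mismatch the first fix in this release describes.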

New Contributors

@yangky11 made their first contribution in https://github.com/huggingface/datasets/pull/5108

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.0...2.6.1

- Python
Published by lhoestq over 3 years ago

datasets - 2.6.0

Important

[GH->HF] Remove all dataset scripts from github by @lhoestq in https://github.com/huggingface/datasets/pull/4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

Add ability to read-write to SQL databases. by @Dref360 in https://github.com/huggingface/datasets/pull/4928
- Read from sqlite file:

```python
from datasets import Dataset

dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```

- Allow connection objects in from_sql + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091

```python
from datasets import Dataset
from sqlite3 import connect

con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```
Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in https://github.com/huggingface/datasets/pull/5072
- return numpy/torch/tf/jax tensors with

```python
from datasets import load_dataset

# Select a split so that integer indexing returns a single example
ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]
```
Added IterableDataset.from_generator by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052
Fast dataset iter by @mariosasko in https://github.com/huggingface/datasets/pull/5030
- speed up by a factor of 2 using the Arrow Table reader
Dataset infos in yaml by @lhoestq in https://github.com/huggingface/datasets/pull/4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
Add kwargs to Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/5049
Support converters in CsvBuilder by @mariosasko in https://github.com/huggingface/datasets/pull/5057
Restore saved format state in load_from_disk by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073
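
For trying out the SQL support described above without an existing database, the connection and query can be prepared with the standard library alone. This is a minimal sketch: the table contents are made-up sample data, and the final `Dataset.from_sql` call is left as a comment because it requires the datasets library to be installed.

```python
import sqlite3

# Build a throwaway in-memory database (sample data for illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_table (text TEXT)")
rows = [("short",), ("x" * 150,), ("y" * 101,)]
con.executemany("INSERT INTO data_table (text) VALUES (?)", rows)
con.commit()

# The same kind of query shown in the release note, checked directly with sqlite3.
query = "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"
long_texts = [row[0] for row in con.execute(query)]
print(len(long_texts))

# With datasets installed, the query and connection can then be passed on:
# dataset = Dataset.from_sql(query, con)
```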

Dataset changes

Update: hendrycks_test - support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/5041
Update: swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5019
- Update swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5042
Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in https://github.com/huggingface/datasets/pull/5022
Fix: sbu_captions - fix URLs by @donglixp in https://github.com/huggingface/datasets/pull/5020
Fix: xcsr - fix string features by @albertvillanova in https://github.com/huggingface/datasets/pull/5024
Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/5040
Fix: cats_vs_dogs - fix number of samples by @lhoestq in https://github.com/huggingface/datasets/pull/5047
Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in https://github.com/huggingface/datasets/pull/5048
Fix: msr_sqa - fix dataset generation by @Timothyxxx in https://github.com/huggingface/datasets/pull/3715

Dataset cards

Add description to hellaswag dataset by @julien-c in https://github.com/huggingface/datasets/pull/4810
Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5010
Update languages in aeslc dataset card by @apergo-ai in https://github.com/huggingface/datasets/pull/3357
Update license to bookcorpus dataset card by @meg-huggingface in https://github.com/huggingface/datasets/pull/3526
Update paper link in medmcqa dataset card by @monk1337 in https://github.com/huggingface/datasets/pull/4290
Add oversampling strategy iterable datasets interleave by @ylacombe in https://github.com/huggingface/datasets/pull/5036
Fix license/citation information of squadshifts dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5054

General improvements and bug fixes

Fix missing use_auth_token in streaming docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/5003
Add some note about running the transformers ci before a release by @lhoestq in https://github.com/huggingface/datasets/pull/5007
Remove license tag file and validation by @albertvillanova in https://github.com/huggingface/datasets/pull/5004
Re-apply input columns change by @mariosasko in https://github.com/huggingface/datasets/pull/5008
patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in https://github.com/huggingface/datasets/pull/5026
Fix typo in error message by @severo in https://github.com/huggingface/datasets/pull/5027
Fix import in ClassLabel docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029
Remove redundant code from some dataset module factories by @albertvillanova in https://github.com/huggingface/datasets/pull/5033
Fix typos in load docstrings and comments by @albertvillanova in https://github.com/huggingface/datasets/pull/5035
Prefer split patterns from directories over split patterns from filenames by @polinaeterna in https://github.com/huggingface/datasets/pull/4985
Fix tar extraction vuln by @lhoestq in https://github.com/huggingface/datasets/pull/5016
Support hfh 0.10 implicit auth by @lhoestq in https://github.com/huggingface/datasets/pull/5031
Fix flatten_indices with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043
Improve CI performance speed of PackagedDatasetTest by @albertvillanova in https://github.com/huggingface/datasets/pull/5037
Revert task removal in folder-based builders by @mariosasko in https://github.com/huggingface/datasets/pull/5051
Fix backward compatibility for dataset_infos.json by @lhoestq in https://github.com/huggingface/datasets/pull/5055
Fix typo by @stevhliu in https://github.com/huggingface/datasets/pull/5059
Fix CI hfh token warning by @albertvillanova in https://github.com/huggingface/datasets/pull/5062
Mark CI tests as xfail when 502 error by @albertvillanova in https://github.com/huggingface/datasets/pull/5058
Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in https://github.com/huggingface/datasets/pull/5077
Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5067
Fix header level in Audio docs by @stevhliu in https://github.com/huggingface/datasets/pull/5078
Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in https://github.com/huggingface/datasets/pull/5071
Support streaming gzip.open by @albertvillanova in https://github.com/huggingface/datasets/pull/5066
adding keep in memory by @Mustapha-AJEGHRIR in https://github.com/huggingface/datasets/pull/5082
refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in https://github.com/huggingface/datasets/pull/5079
fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in https://github.com/huggingface/datasets/pull/5076
Align signature of list_repo_files with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5063
Align signature of create/delete_repo with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5064
Fix filter with empty indices by @Mouhanedg56 in https://github.com/huggingface/datasets/pull/5087
Fix tutorial (#5093) by @riccardobucco in https://github.com/huggingface/datasets/pull/5095
Use HTML relative paths for tiles in the docs by @lewtun in https://github.com/huggingface/datasets/pull/5092
Fix loading how to guide (#5102) by @riccardobucco in https://github.com/huggingface/datasets/pull/5104
url encode hub url (#5099) by @riccardobucco in https://github.com/huggingface/datasets/pull/5103
Free the "hf" filesystem protocol for hffs by @lhoestq in https://github.com/huggingface/datasets/pull/5101
Fix task template reload from dict by @lhoestq in https://github.com/huggingface/datasets/pull/5106
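
The list above includes a fix for a tar extraction vulnerability. These notes do not describe the fix itself, so the following is only a sketch of one common defence, assuming the problem is of the usual path-traversal kind (archive members such as `../evil.txt` escaping the destination directory). The `safe_members` helper is hypothetical and is not the library's code.

```python
import io
import os
import tarfile
import tempfile


def safe_members(tar, dest_dir):
    """Yield only regular files and directories that resolve inside dest_dir."""
    root = os.path.realpath(dest_dir)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        inside = os.path.commonpath([root, target]) == root
        # Links are skipped as well, since they can point outside the root.
        if inside and (member.isfile() or member.isdir()):
            yield member


# Build a small in-memory archive with one harmless and one malicious entry.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name in ("ok.txt", "../evil.txt"):
        payload = b"hello"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tempfile.TemporaryDirectory() as dest, tarfile.open(fileobj=buffer) as tar:
    allowed = [m.name for m in safe_members(tar, dest)]
    tar.extractall(dest, members=safe_members(tar, dest))
    extracted = sorted(os.listdir(dest))

print(allowed, extracted)
```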

New Contributors

@Wauplin made their first contribution in https://github.com/huggingface/datasets/pull/5026
@donglixp made their first contribution in https://github.com/huggingface/datasets/pull/5020
@Timothyxxx made their first contribution in https://github.com/huggingface/datasets/pull/3715
@hamid-vakilzadeh made their first contribution in https://github.com/huggingface/datasets/pull/5052
@Mustapha-AJEGHRIR made their first contribution in https://github.com/huggingface/datasets/pull/5082
@galbwe made their first contribution in https://github.com/huggingface/datasets/pull/5079
@rahulXs made their first contribution in https://github.com/huggingface/datasets/pull/5076
@Mouhanedg56 made their first contribution in https://github.com/huggingface/datasets/pull/5087
@riccardobucco made their first contribution in https://github.com/huggingface/datasets/pull/5095
@asofiaoliveira made their first contribution in https://github.com/huggingface/datasets/pull/5073

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0

- Python
Published by lhoestq over 3 years ago

datasets - 2.5.2

Bug fixes

Revert task removal in folder-based builders (#5051)
Support hfh 0.10 implicit auth (#5031)

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.5.2

- Python
Published by lhoestq almost 4 years ago

datasets - 2.5.1

Bug fixes

Revert input_columns change by @lhoestq in https://github.com/huggingface/datasets/pull/5006

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.0...2.5.1

- Python
Published by lhoestq almost 4 years ago

datasets - 2.5.0

Important

Drop Python 3.6 support by @mariosasko in https://github.com/huggingface/datasets/pull/4460
Deprecate metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4739
- Metrics are now deprecated and have been moved to evaluate:

```python
# Install first with: pip install evaluate
import evaluate

metric = evaluate.load("accuracy")
```
Load GitHub datasets from Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4059
- datasets with no namespace like "squad" were loaded from this GitHub repository, now they're loaded from https://huggingface.co/datasets
Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
- latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
Use HTTP requests to access data and metadata through the Datasets REST API

Datasets features

No-code loaders

Add AudioFolder packaged loader by @polinaeterna in https://github.com/huggingface/datasets/pull/4530
Add support for CSV metadata files to ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/4837
Add support for parsing JSON files in array form by @mariosasko in https://github.com/huggingface/datasets/pull/4997

Dataset methods

add Dataset.from_list by @sanderland in https://github.com/huggingface/datasets/pull/4890
Add Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/4957
Add oversampling strategies to interleave datasets by @ylacombe in https://github.com/huggingface/datasets/pull/4831
Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971
Add fn_kwargs param to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4975
More rigorous shape inference in to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4763
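
The oversampling strategies for interleaving listed above can be pictured with plain lists. The sketch below is a simplified model, not the library's implementation: the `interleave` helper is hypothetical, covers only round-robin interleaving without sampling probabilities, and assumes the two strategy names `"first_exhausted"` (stop once the shortest source runs out) and `"all_exhausted"` (keep cycling shorter sources until the longest has been fully seen).

```python
def interleave(sources, stopping_strategy="first_exhausted"):
    """Round-robin over in-memory lists under one of two stopping strategies."""
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"Unknown stopping strategy: {stopping_strategy!r}")
    lengths = [len(s) for s in sources]
    if not sources or 0 in lengths:
        return []
    # Undersampling stops at the shortest source; oversampling runs as long as
    # the longest one and wraps around the shorter sources with a modulo.
    rounds = min(lengths) if stopping_strategy == "first_exhausted" else max(lengths)
    return [s[r % len(s)] for r in range(rounds) for s in sources]


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
undersampled = interleave([d1, d2, d3])
oversampled = interleave([d1, d2, d3], stopping_strategy="all_exhausted")
print(undersampled)
print(oversampled)
```

With oversampling, examples from shorter sources are repeated, which keeps every example of the longest source at the cost of duplicates.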

Parquet support

Download and prepare as Parquet for cloud storage by @lhoestq in https://github.com/huggingface/datasets/pull/4724
Shard parquet in download_and_prepare by @lhoestq in https://github.com/huggingface/datasets/pull/4747
Embed image/audio data in dl_and_prepare parquet by @lhoestq in https://github.com/huggingface/datasets/pull/4987

Datasets changes

Update: natural questions - Add long answer candidates by @seirasto in https://github.com/huggingface/datasets/pull/4368
Update: opus_paracrawl - update version by @albertvillanova in https://github.com/huggingface/datasets/pull/4816
Update: ReCoRD - Include entity positions as feature by @richarddwang in https://github.com/huggingface/datasets/pull/4479
Update: swda - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4914
Update: Enwik8 - update broken link and information by @mtanghu in https://github.com/huggingface/datasets/pull/4
Update: compguesswhat - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4968
Update: nli_tr - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4970
Update: IndicGLUE - update download links by @sumanthd17 in https://github.com/huggingface/datasets/pull/4978
Update: iwslt2017 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4992
Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4788
Fix: mkqa - Update data URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4823
Fix: exams - fix bug and checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/4853
Fix: trec - use fine classes by @albertvillanova in https://github.com/huggingface/datasets/pull/4801
Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in https://github.com/huggingface/datasets/pull/4871
Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4904
Fix: compguesswhat - fix data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4959
Fix: vivos - fix data URL and metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4969
Fix: MBPP - Add splits by @cwarny in https://github.com/huggingface/datasets/pull/4943

Dataset cards

Add language_bcp47 tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753
Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in https://github.com/huggingface/datasets/pull/4701
Remove "unkown" language tags by @lhoestq in https://github.com/huggingface/datasets/pull/4754
Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in https://github.com/huggingface/datasets/pull/4712
Added dataset information in clinic oos dataset card by @Arnav-Ladkat in https://github.com/huggingface/datasets/pull/4751
Fix opus_gnome dataset card by @gojiteji in https://github.com/huggingface/datasets/pull/4806
Complete the mlqa dataset card by @eldhoittangeorge in https://github.com/huggingface/datasets/pull/4809
Fix loading example in opus dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4813
Add missing language tags to resources by @albertvillanova in https://github.com/huggingface/datasets/pull/4819
Fix titles in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4824
Fix language tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4826
Add license metadata to pg19 by @julien-c in https://github.com/huggingface/datasets/pull/4827
Fix task tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4830
Fix tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4832
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4833
Fix documentation card of recipe_nlg dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4834
Fix documentation card of ethos dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4835
Update documentation card of miam dataset by @PierreColombo in https://github.com/huggingface/datasets/pull/4846
Update stackexchange license by @cakiki in https://github.com/huggingface/datasets/pull/4842
Update ted_talks_iwslt license to include ND by @cakiki in https://github.com/huggingface/datasets/pull/4841
Fix documentation card of adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4838
Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/4867
Fix license tag and Source Data section in billsum dataset card by @kashif in https://github.com/huggingface/datasets/pull/4851
Fix documentation card of covid_qa_castorini dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4877
Fix Citation Information section in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4879
Fix documentation card of math_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4884
Added names of less-studied languages by @BenjaminGalliot in https://github.com/huggingface/datasets/pull/4880
Fix language tags resource file by @albertvillanova in https://github.com/huggingface/datasets/pull/4882
Add citation to rosts and rosts_parallel datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4892
Add citation information to makhzan dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4894
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4891
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4896
Re-add code and und language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4899
Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/4887
Update GLUE evaluation metadata by @lewtun in https://github.com/huggingface/datasets/pull/4909
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4908
Add license and citation information to cosmos_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4913
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4921
Add cc-by-nc-2.0 to list of licenses by @albertvillanova in https://github.com/huggingface/datasets/pull/4930
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4931
Add Papers with Code ID to scifact dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4941
Fix license information in qasc dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4951
Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4940
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4979
Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4991

Documentation

Update map docs by @stevhliu in https://github.com/huggingface/datasets/pull/4743
Add image classification processing guide by @stevhliu in https://github.com/huggingface/datasets/pull/4748
Fix train_test_split docs by @NielsRogge in https://github.com/huggingface/datasets/pull/4821
Update local loading script docs by @stevhliu in https://github.com/huggingface/datasets/pull/4778
Docs for creating a loading script for image datasets by @stevhliu in https://github.com/huggingface/datasets/pull/4783
Docs for creating an audio dataset by @stevhliu in https://github.com/huggingface/datasets/pull/4872

General improvements and bug fixes

Use CI unit/integration tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4738
Fix multiprocessing in map_nested by @albertvillanova in https://github.com/huggingface/datasets/pull/4740
Add 2.4.0 version added to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4767
Update CI badge by @mariosasko in https://github.com/huggingface/datasets/pull/4764
Fix version in map_nested docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4765
fix typo by @xwwwwww in https://github.com/huggingface/datasets/pull/4770
Unpin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4768
Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4780
Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in https://github.com/huggingface/datasets/pull/4777
Remove dummy data generation docs by @stevhliu in https://github.com/huggingface/datasets/pull/4771
Require torchaudio<0.12.0 in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/4785
Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/4812
Fix typo in streaming docs by @flozi00 in https://github.com/huggingface/datasets/pull/4843
Fix test of _get_extraction_protocol for TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/4850
Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
Mark CI tests as xfail if Hub HTTP error by @albertvillanova in https://github.com/huggingface/datasets/pull/4845
[Windows] Fix Access Denied when using os.rename() by @DougTrajano in https://github.com/huggingface/datasets/pull/4825
[docs] Some tiny doc tweaks by @julien-c in https://github.com/huggingface/datasets/pull/4874
Document loading from relative path by @stevhliu in https://github.com/huggingface/datasets/pull/4773
Fix CI reporting by @albertvillanova in https://github.com/huggingface/datasets/pull/4903
Add 'val' to VALIDATION_KEYWORDS. by @akt42 in https://github.com/huggingface/datasets/pull/4844
Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in https://github.com/huggingface/datasets/pull/4901
feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in https://github.com/huggingface/datasets/pull/4919
Fixes a typo in loading documentation by @sighingnow in https://github.com/huggingface/datasets/pull/4929
Remove main branch rename notice by @lhoestq in https://github.com/huggingface/datasets/pull/4938
Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4939
Remove deprecated identical_ok by @lhoestq in https://github.com/huggingface/datasets/pull/4937
Pin TensorFlow temporarily by @albertvillanova in https://github.com/huggingface/datasets/pull/4954
Fix minor typo in error message for missing imports by @mariosasko in https://github.com/huggingface/datasets/pull/4948
Fix TF tests for 2.10 by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4956
fix BLEU metric card by @antoniolanza1996 in https://github.com/huggingface/datasets/pull/4927
Update doc upload_dataset.mdx by @mishig25 in https://github.com/huggingface/datasets/pull/4789
Improve features resolution in streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4762
Fix label renaming and add a battery of tests by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4781
Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in https://github.com/huggingface/datasets/pull/4967
Introduce regex check when pushing as well by @LysandreJik in https://github.com/huggingface/datasets/pull/4946
[doc] Fix broken snippet that had too many quotes by @tomaarsen in https://github.com/huggingface/datasets/pull/4986
Fix map batched with torch output by @lhoestq in https://github.com/huggingface/datasets/pull/4972
fix: avoid casting tuples after Dataset.map by @szmoro in https://github.com/huggingface/datasets/pull/4993
decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
Don't add a tag on the Hub on release by @lhoestq in https://github.com/huggingface/datasets/pull/4998
Add EmptyDatasetError by @lhoestq in https://github.com/huggingface/datasets/pull/4999
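
The `'val'` keyword added to VALIDATION_KEYWORDS in the list above relates to inferring splits from file names. As a hedged illustration of how keyword-based split detection can work, the sketch below matches a keyword only when it is delimited by non-letter characters. The `SPLIT_KEYWORDS` table and `infer_split` helper are hypothetical, and the keyword lists are abbreviated; they do not reproduce the library's patterns.

```python
import re

# Hypothetical, abbreviated keyword table (not the library's full list).
SPLIT_KEYWORDS = {
    "train": ["train"],
    "validation": ["validation", "valid", "dev", "val"],
    "test": ["test"],
}


def infer_split(path):
    """Guess a split name from a file path, or return None if nothing matches."""
    name = path.lower()
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            # Require a non-letter (or string boundary) on both sides so that
            # "val" does not match inside words such as "interval".
            if re.search(rf"(?:^|[^a-z]){keyword}(?:$|[^a-z])", name):
                return split
    return None


print(infer_split("data/val-00000.csv"))
print(infer_split("my_dataset_train.jsonl"))
print(infer_split("interval.csv"))
```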

New Contributors

@seirasto made their first contribution in https://github.com/huggingface/datasets/pull/4368
@sbroadhurst-hf made their first contribution in https://github.com/huggingface/datasets/pull/4712
@nawarhalabi made their first contribution in https://github.com/huggingface/datasets/pull/4701
@Arnav-Ladkat made their first contribution in https://github.com/huggingface/datasets/pull/4751
@xwwwwww made their first contribution in https://github.com/huggingface/datasets/pull/4770
@gojiteji made their first contribution in https://github.com/huggingface/datasets/pull/4806
@eldhoittangeorge made their first contribution in https://github.com/huggingface/datasets/pull/4809
@flozi00 made their first contribution in https://github.com/huggingface/datasets/pull/4843
@fl-lo made their first contribution in https://github.com/huggingface/datasets/pull/4869
@BenjaminGalliot made their first contribution in https://github.com/huggingface/datasets/pull/4880
@DougTrajano made their first contribution in https://github.com/huggingface/datasets/pull/4825
@ylacombe made their first contribution in https://github.com/huggingface/datasets/pull/4831
@osanseviero made their first contribution in https://github.com/huggingface/datasets/pull/4887
@akt42 made their first contribution in https://github.com/huggingface/datasets/pull/4844
@sanderland made their first contribution in https://github.com/huggingface/datasets/pull/4890
@sighingnow made their first contribution in https://github.com/huggingface/datasets/pull/4929
@mtanghu made their first contribution in https://github.com/huggingface/datasets/pull/4950
@antoniolanza1996 made their first contribution in https://github.com/huggingface/datasets/pull/4927
@apohllo made their first contribution in https://github.com/huggingface/datasets/pull/4967
@cwarny made their first contribution in https://github.com/huggingface/datasets/pull/4943
@tomaarsen made their first contribution in https://github.com/huggingface/datasets/pull/4986
@szmoro made their first contribution in https://github.com/huggingface/datasets/pull/4993

Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0

- Python
Published by lhoestq almost 4 years ago

datasets - 2.4.0

Dataset Features

Add concatenate_datasets for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500
Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in https://github.com/huggingface/datasets/pull/4625
Support using PCM audio files (#4323) by @YooSungHyun in https://github.com/huggingface/datasets/pull/4409
[data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in https://github.com/huggingface/datasets/pull/4633
Support extract 7-zip compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4672
Support extract lz4 compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4700
Support metadata.jsonl from parent directories in imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4576

Dataset changes

Update: allocine - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4563
Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4585
Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4586
Update: financial_phrasebank - Host data on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4598
Update: cfq - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4579
Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4588
Update: bookcorpus - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4564
Update: fever - Refactor and add metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4503
Update: mlsum - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4574
Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/4523
Fix: conll2003 - fix empty example by @lhoestq in https://github.com/huggingface/datasets/pull/4662
Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in https://github.com/huggingface/datasets/pull/4554
Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in https://github.com/huggingface/datasets/pull/4706
Fix: crd3 - fix splits that were containing the same data by @lhoestq in https://github.com/huggingface/datasets/pull/4705

Dataset Cards

Add action names in schema_guided_dstc8 dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/4559
Add evaluation data to acronym_identification by @lewtun in https://github.com/huggingface/datasets/pull/4561
Update WinoBias README by @sashavor in https://github.com/huggingface/datasets/pull/4631
Support "tags" yaml tag by @lhoestq in https://github.com/huggingface/datasets/pull/4716
Fix POS tags by @lhoestq in https://github.com/huggingface/datasets/pull/4715
AESLC dataset: Add summarization tags by @hobson in https://github.com/huggingface/datasets/pull/4517

Documentation

Update docs around audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4440
Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in https://github.com/huggingface/datasets/pull/4513
Remove multiple config section by @stevhliu in https://github.com/huggingface/datasets/pull/4600
Create new sections for audio and vision in guides by @stevhliu in https://github.com/huggingface/datasets/pull/4519
Document installation of sox OS dependency for audio by @albertvillanova in https://github.com/huggingface/datasets/pull/4713

General improvements and bug fixes

Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510
Support all negative values in ClassLabel by @lhoestq in https://github.com/huggingface/datasets/pull/4511
Add uppercased versions of image file extensions for automatic module inference by @mariosasko in https://github.com/huggingface/datasets/pull/4515
Patch tests for hfh v0.8.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/4518
Replace deprecated logging.warn with logging.warning by @hugovk in https://github.com/huggingface/datasets/pull/4539
[CI] Fix upstream hub test url by @lhoestq in https://github.com/huggingface/datasets/pull/4543
Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4541
[CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in https://github.com/huggingface/datasets/pull/4546
Tell users to upload on the hub directly by @lhoestq in https://github.com/huggingface/datasets/pull/4552
Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in https://github.com/huggingface/datasets/pull/4535
Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4545
Properly raise FileNotFound even if the dataset is private by @lhoestq in https://github.com/huggingface/datasets/pull/4536
Fix hashing for python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/4516
[CI] Fix some warnings by @lhoestq in https://github.com/huggingface/datasets/pull/4547
Validate new_fingerprint passed by user by @lhoestq in https://github.com/huggingface/datasets/pull/4587
Update CI Windows orb by @albertvillanova in https://github.com/huggingface/datasets/pull/4604
Perform hidden file check on relative data file path by @mariosasko in https://github.com/huggingface/datasets/pull/4551
Align more metadata with other repo types (models,spaces) by @julien-c in https://github.com/huggingface/datasets/pull/4607
Align/fix license metadata info by @julien-c in https://github.com/huggingface/datasets/pull/4613
Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in https://github.com/huggingface/datasets/pull/4611
Add authentication tip to load_dataset by @mariosasko in https://github.com/huggingface/datasets/pull/4577
Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4553
fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in https://github.com/huggingface/datasets/pull/4630
Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in https://github.com/huggingface/datasets/pull/4608
Rename master to main by @lhoestq in https://github.com/huggingface/datasets/pull/4643
Set HF_SCRIPTS_VERSION to main by @lhoestq in https://github.com/huggingface/datasets/pull/4645
[Minor fix] Typo correction by @cakiki in https://github.com/huggingface/datasets/pull/4644
fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in https://github.com/huggingface/datasets/pull/4627
Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4590
Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628
Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in https://github.com/huggingface/datasets/pull/4660
Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496
Fix embed_storage on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615
Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in https://github.com/huggingface/datasets/pull/4512
Transfer CI to GitHub Actions by @albertvillanova in https://github.com/huggingface/datasets/pull/4659
Fix mock fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/4685
Trigger CI also on push to main by @albertvillanova in https://github.com/huggingface/datasets/pull/4687
Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in https://github.com/huggingface/datasets/pull/4622
Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4688
Test extractors for all compression formats by @albertvillanova in https://github.com/huggingface/datasets/pull/4689
Refactor base extractors by @albertvillanova in https://github.com/huggingface/datasets/pull/4690
Update create dataset card docs by @stevhliu in https://github.com/huggingface/datasets/pull/4683
Add text decorators by @stevhliu in https://github.com/huggingface/datasets/pull/4663
Skip tests only for lz4/zstd params if not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4704
Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in https://github.com/huggingface/datasets/pull/4614
Docs: Fix same-page hashlinks by @mishig25 in https://github.com/huggingface/datasets/pull/4722
Fix broken link to the Hub by @stevhliu in https://github.com/huggingface/datasets/pull/4726
Refactor conftest fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/4723
Add object detection processing tutorial by @nateraw in https://github.com/huggingface/datasets/pull/4710
Fix require torchaudio and refactor test requirements by @albertvillanova in https://github.com/huggingface/datasets/pull/4708
docs: ✏️ fix TranslationVariableLanguages example by @severo in https://github.com/huggingface/datasets/pull/4731
Pin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4735
Fix named split sorting and remove unnecessary casting by @albertvillanova in https://github.com/huggingface/datasets/pull/4714
Make cast in from_pandas more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703
Make Extractor accept Path as input by @albertvillanova in https://github.com/huggingface/datasets/pull/4718
Refactor Hub tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4729
Fix to dict conversion of DatasetInfo/Features by @mariosasko in https://github.com/huggingface/datasets/pull/4741

New Contributors

@hugovk made their first contribution in https://github.com/huggingface/datasets/pull/4539
@VijayKalmath made their first contribution in https://github.com/huggingface/datasets/pull/4545
@gugarosa made their first contribution in https://github.com/huggingface/datasets/pull/4630
@benlipkin made their first contribution in https://github.com/huggingface/datasets/pull/4627
@YooSungHyun made their first contribution in https://github.com/huggingface/datasets/pull/4409
@hobson made their first contribution in https://github.com/huggingface/datasets/pull/4517
@khushmeeet made their first contribution in https://github.com/huggingface/datasets/pull/4554
@dtuit made their first contribution in https://github.com/huggingface/datasets/pull/4614

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0

- Python
Published by lhoestq almost 4 years ago

datasets - 2.3.2

Bug fixes

Fix double dots in data files by @lhoestq in https://github.com/huggingface/datasets/pull/4505
- fix a bug when /../ is passed to data_files causing FileNotFoundError
fix ETT m1/m2 test/val dataset by @kashif in https://github.com/huggingface/datasets/pull/4499
Corrected broken links in doc by @clefourrier in https://github.com/huggingface/datasets/pull/4501
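
To make the double-dots case above concrete, here is a small sketch of a `data_files` value that contains `/../` and the location it is meant to resolve to. The path is hypothetical, and the standard-library normalization only illustrates the intended target; it is not a reproduction of the fix in the library.

```python
import posixpath

# Hypothetical data_files value containing a "/../" segment.
pattern = "data/raw/../processed/train.csv"

# Collapsing the ".." segment shows which file the pattern is meant to reach.
normalized = posixpath.normpath(pattern)
print(normalized)
```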

New Contributors

@clefourrier made their first contribution in https://github.com/huggingface/datasets/pull/4501

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.1...2.3.2

- Python
Published by lhoestq about 4 years ago

datasets - 2.3.1

Bug fixes

Fix patching module that doesn't exist by @lhoestq in https://github.com/huggingface/datasets/pull/4495
- fix bug when importing the lib when scipy is not installed
Re-add download_manager module in utils by @lhoestq in https://github.com/huggingface/datasets/pull/4497
- fix moved imports of DownloadConfig, DownloadMode, DownloadManager
Support streaming UDHR dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4487

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.0...2.3.1

- Python
Published by lhoestq about 4 years ago

datasets - 2.3.0

Datasets Changes

New: ImageNet-Sketch by @nateraw in https://github.com/huggingface/datasets/pull/4301
New: Biwi Kinect Head Pose by @dnaveenr in https://github.com/huggingface/datasets/pull/3903
New: enwik8 by @HallerPatrick in https://github.com/huggingface/datasets/pull/4321
New: LCCC dataset by @silverriver in https://github.com/huggingface/datasets/pull/4416
New: TruthfulQA by @jon-tow in https://github.com/huggingface/datasets/pull/4159
New: BIG-bench by @andersjohanandreassen in https://github.com/huggingface/datasets/pull/4125
New: QuickDraw by @mariosasko in https://github.com/huggingface/datasets/pull/3592
New: SST-2 by @albertvillanova in https://github.com/huggingface/datasets/pull/4473
Update: imagenet-1k - remove manual download by @mariosasko in https://github.com/huggingface/datasets/pull/4299
- ImageNet can now be loaded in python with load_dataset without requiring a manual download !
- It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
Update: spider - Remove Google Drive URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4410
Update: blended_skill_talk - add missing columns by @mariosasko in https://github.com/huggingface/datasets/pull/4437
Update: multi-news - Use newer version with fixes by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4451
Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4455
Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4459
Update: udhr - update metadata by @leondz in https://github.com/huggingface/datasets/pull/4362
Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4469
Update: PASS - update dataset version by @mariosasko in https://github.com/huggingface/datasets/pull/4488
Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in https://github.com/huggingface/datasets/pull/4389
Fix: GEM - fix URL for totto config by @albertvillanova in https://github.com/huggingface/datasets/pull/4396
Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in https://github.com/huggingface/datasets/pull/4424
Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in https://github.com/huggingface/datasets/pull/4425
Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in https://github.com/huggingface/datasets/pull/4436
Fix: iwslt2017 by @lhoestq in https://github.com/huggingface/datasets/pull/4481

Dataset Features

to_tf_dataset rewrite by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4170
- see more in the documentation
Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4375
- see more in the documentation
Added stratify option to train_test_split by @nandwalritik in https://github.com/huggingface/datasets/pull/4322
Re-add support for Apache Beam functionality by @albertvillanova in https://github.com/huggingface/datasets/pull/4328
Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in https://github.com/huggingface/datasets/pull/4402
Support nested/complex feature types as features in packaged loaders by @mariosasko in https://github.com/huggingface/datasets/pull/4364
Optimize contiguous shard and select by @lhoestq in https://github.com/huggingface/datasets/pull/4466
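
As an illustration of what the stratify option in the list above is for, the sketch below splits row indices so that each label keeps roughly the same share in both splits. It is plain Python written for this note, not the library's implementation; the helper name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) with per-label proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every label for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 8 rows of class "a" and 4 rows of class "b".
labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
```

In the library itself the equivalent call should look like `ds.train_test_split(test_size=0.25, stratify_by_column="label")`, with `label` being a `ClassLabel` column; see the linked PR for the exact behavior.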

Dataset Cards

Minor fixes/improvements in scene_parse_150 card by @mariosasko in https://github.com/huggingface/datasets/pull/4447
Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in https://github.com/huggingface/datasets/pull/4378
Fix example in opus_ubuntu, Add license info by @leondz in https://github.com/huggingface/datasets/pull/4360
Update README.md of fquad by @lhoestq in https://github.com/huggingface/datasets/pull/4450

Documentation

Add API code examples for loading methods by @stevhliu in https://github.com/huggingface/datasets/pull/4300
Add API code examples for remaining main classes by @stevhliu in https://github.com/huggingface/datasets/pull/4292
Generalize tutorials for audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4468
[Docs] How to use with PyTorch page by @lhoestq in https://github.com/huggingface/datasets/pull/4474
First draft of the docs for TF + Datasets by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4457

Other improvements and bug fixes

Update CI deprecated legacy image by @albertvillanova in https://github.com/huggingface/datasets/pull/4393
remove int documentation from logging docs by @lvwerra in https://github.com/huggingface/datasets/pull/4392
Fix docstring in DatasetDict::shuffle by @felixdivo in https://github.com/huggingface/datasets/pull/4344
Fix Version equality by @albertvillanova in https://github.com/huggingface/datasets/pull/4359
Set builder name from module instead of class by @albertvillanova in https://github.com/huggingface/datasets/pull/4388
Test dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4385
Refactor download by @albertvillanova in https://github.com/huggingface/datasets/pull/4384
Fix dependency on dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/4397
Support remote cache_dir by @albertvillanova in https://github.com/huggingface/datasets/pull/4347
Update imagenet gate by @lhoestq in https://github.com/huggingface/datasets/pull/4408
Fix dataset builder default version by @albertvillanova in https://github.com/huggingface/datasets/pull/4356
Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in https://github.com/huggingface/datasets/pull/4403
Rename DatasetBuilder config_name by @albertvillanova in https://github.com/huggingface/datasets/pull/4414
Fix metadata validation by @albertvillanova in https://github.com/huggingface/datasets/pull/4390
Add HF.co for PRs/Issues for specific datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4427
Fix type hint and documentation for new_fingerprint by @fxmarty in https://github.com/huggingface/datasets/pull/4326
Skip hidden files/directories in data files resolution and iter_files by @mariosasko in https://github.com/huggingface/datasets/pull/4412
Fix docstring of inspect_dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4438
Fix builder docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4432
Fix kwargs in docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4444
Fix missing args in docstring of load_dataset_builder by @albertvillanova in https://github.com/huggingface/datasets/pull/4445
Add missing kwargs to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4446
Add extractor for bzip2-compressed files by @asivokon in https://github.com/huggingface/datasets/pull/4421
Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in https://github.com/huggingface/datasets/pull/4434
Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in https://github.com/huggingface/datasets/pull/4415
Update builder docstring for deprecated/added arguments by @albertvillanova in https://github.com/huggingface/datasets/pull/4429
Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/4464
Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in https://github.com/huggingface/datasets/pull/4433
Fix bigbench config names by @lhoestq in https://github.com/huggingface/datasets/pull/4465
Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in https://github.com/huggingface/datasets/pull/4472
Reorder returned validation/test splits in script template by @albertvillanova in https://github.com/huggingface/datasets/pull/4470
Better ImportError message when a dataset script dependency is missing by @lhoestq in https://github.com/huggingface/datasets/pull/4484
Fix cast to null by @lhoestq in https://github.com/huggingface/datasets/pull/4485
Update _format_columns in remove_columns by @alvarobartt in https://github.com/huggingface/datasets/pull/4411
Fix wrong map parameter name in cache docs by @h4iku in https://github.com/huggingface/datasets/pull/4293
Pin the revision in imagenet download links by @lhoestq in https://github.com/huggingface/datasets/pull/4492
Refactor column mappings for question answering datasets by @lewtun in https://github.com/huggingface/datasets/pull/4391

New Contributors

@leondz made their first contribution in https://github.com/huggingface/datasets/pull/4378
@felixdivo made their first contribution in https://github.com/huggingface/datasets/pull/4344
@nandwalritik made their first contribution in https://github.com/huggingface/datasets/pull/4322
@fxmarty made their first contribution in https://github.com/huggingface/datasets/pull/4326
@HallerPatrick made their first contribution in https://github.com/huggingface/datasets/pull/4321
@silverriver made their first contribution in https://github.com/huggingface/datasets/pull/4416
@asivokon made their first contribution in https://github.com/huggingface/datasets/pull/4421
@andersjohanandreassen made their first contribution in https://github.com/huggingface/datasets/pull/4125

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.2...2.3.0

- Python
Published by lhoestq about 4 years ago

datasets - 2.2.2

Datasets fixes

Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4377
Fix: CC-Aligned - fix invalid url by @juntang-zhuang in https://github.com/huggingface/datasets/pull/4231
Fix: multi_news - don't strip proceeding hyphen by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4353

Bug fixes

Support lists of multi-dimensional numpy arrays by @albertvillanova in https://github.com/huggingface/datasets/pull/4194
Check if dataset features match before push in DatasetDict.push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/4372
Pin dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4380
- dill 0.3.5 has some issues in transformers - pinning the version to <0.3.5 for now

Dataset Cards

Adding eval metadata for ade v2 by @sashavor in https://github.com/huggingface/datasets/pull/4319
Adding eval metadata for AG News by @sashavor in https://github.com/huggingface/datasets/pull/4329
Adding eval metadata to Allociné dataset by @sashavor in https://github.com/huggingface/datasets/pull/4330
Adding eval metadata to Amazon Polarity by @sashavor in https://github.com/huggingface/datasets/pull/4331
Adding eval metadata for arabic speech corpus by @sashavor in https://github.com/huggingface/datasets/pull/4332
Adding eval metadata for Banking 77 by @sashavor in https://github.com/huggingface/datasets/pull/4333
Eval metadata Batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in https://github.com/huggingface/datasets/pull/4338
Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in https://github.com/huggingface/datasets/pull/4337
Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in https://github.com/huggingface/datasets/pull/4335
Eval metadata batch 2 : Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in https://github.com/huggingface/datasets/pull/4336

Docs

Add API code examples for Builder classes by @stevhliu in https://github.com/huggingface/datasets/pull/4313
Add redirect to dataset script in the repo structure page by @lhoestq in https://github.com/huggingface/datasets/pull/4369

Other improvements and bug fixes

Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4342
Fix never ending GH Action to build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/4345
Fix warning in upload_file by @albertvillanova in https://github.com/huggingface/datasets/pull/4355
Fix warning in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4357
Remove config names as yaml keys by @lhoestq in https://github.com/huggingface/datasets/pull/4367
Add missing language tags for udhr dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4371
Remove links in docs to old dataset viewer by @mariosasko in https://github.com/huggingface/datasets/pull/4373

New Contributors

@JohnGiorgi made their first contribution in https://github.com/huggingface/datasets/pull/4353
@juntang-zhuang made their first contribution in https://github.com/huggingface/datasets/pull/4231

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.1...2.2.2

- Python
Published by lhoestq about 4 years ago

datasets - 2.2.1

Datasets bug fixes

Fix cnn_dailymail (dm stories were ignored) by @lhoestq in https://github.com/huggingface/datasets/pull/4317
- datasets 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset

General improvements and bug fixes

Fix: Add missing comma by @mrm8488 in https://github.com/huggingface/datasets/pull/4303
Catch pull error when mirroring by @lhoestq in https://github.com/huggingface/datasets/pull/4314
Remove unused multiprocessing args from test CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/4308
Fix CLI run_beam namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/4315
Support passing config_kwargs to CLI run_beam by @albertvillanova in https://github.com/huggingface/datasets/pull/4316
Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in https://github.com/huggingface/datasets/pull/4318

New Contributors

@mrm8488 made their first contribution in https://github.com/huggingface/datasets/pull/4303

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.0...2.2.1

- Python
Published by lhoestq about 4 years ago

datasets - 2.2.0

Dataset Changes

New: ImageNet by @apsdehal in https://github.com/huggingface/datasets/pull/4178
- Manual download only for now
New: Google Conceptual Captions by @abhishekkrthakur in https://github.com/huggingface/datasets/pull/1459
New: Conceptual 12M by @thomasw21 in https://github.com/huggingface/datasets/pull/4162
New: Visual Genome by @thomasw21 in https://github.com/huggingface/datasets/pull/4161
New: RVL-CDIP by @dnaveenr in https://github.com/huggingface/datasets/pull/4050
New: Text-based NP Enrichment (TNE) by @yanaiela in https://github.com/huggingface/datasets/pull/4153
New: TextVQA by @apsdehal in https://github.com/huggingface/datasets/pull/3967
New: ETT time series dataset by @kashif in https://github.com/huggingface/datasets/pull/4213
Update: assin2 - update metadata by @lhoestq in https://github.com/huggingface/datasets/pull/4172
Update: Librispeech - Add 'all' config by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4184
Update: XGLUE - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4249
Update: crd3 - group all the turns in one example by @shanyas10 in https://github.com/huggingface/datasets/pull/4240
Update: pubmed_qa - Remove google drive URL by @lhoestq in https://github.com/huggingface/datasets/pull/4255
Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4254
Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in https://github.com/huggingface/datasets/pull/4267
Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4236
Update: openbookqa - Add missing features for additional config by @albertvillanova in https://github.com/huggingface/datasets/pull/4278
Update: commonsense_qa - Add missing features by @albertvillanova in https://github.com/huggingface/datasets/pull/4280
Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4212
Fix: openbookqa - fix bug in choices labels by @manandey in https://github.com/huggingface/datasets/pull/4259
Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4270

Dataset Features

Add support for metadata files to imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4069
- load a folder of images and metadata stored in metadata.jsonl, more info in the documentation on how to load an image dataset
Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in https://github.com/huggingface/datasets/pull/4144
- splits are inferred from the directory and file names, see more info in the documentation on how to structure your repository
Enable label alignment for token classification datasets by @lewtun in https://github.com/huggingface/datasets/pull/4277
Add drop_last_batch to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4215
Load dataset with TSV files by @albertvillanova in https://github.com/huggingface/datasets/pull/4246
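
The metadata support for imagefolder listed above relies on a `metadata.jsonl` file stored alongside the images. A minimal sketch of writing such a file follows, assuming each line is a JSON object whose `file_name` field points at an image in the same folder. The helper name, file names and captions are hypothetical and only there for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_image_metadata(folder, rows):
    """Write one JSON object per line to <folder>/metadata.jsonl and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    metadata_path = folder / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return metadata_path


# Hypothetical image file names and captions.
rows = [
    {"file_name": "0001.png", "caption": "a red square"},
    {"file_name": "0002.png", "caption": "a blue circle"},
]
data_dir = tempfile.mkdtemp()
metadata_path = write_image_metadata(data_dir, rows)

# Once the image files sit next to metadata.jsonl, the folder could be loaded with:
#   from datasets import load_dataset
#   ds = load_dataset("imagefolder", data_dir=data_dir)
```

The extra fields (here `caption`) are expected to show up as additional columns next to the image column; the documentation referenced in the entry has the authoritative details.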

Dataset Cards

Autoeval config by @nrajani in https://github.com/huggingface/datasets/pull/4234
- Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
Adding license information for Openbookcorpus by @meg-huggingface in https://github.com/huggingface/datasets/pull/3525
Make code for image downloading from image urls cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/4218
Fix description links in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4222
Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in https://github.com/huggingface/datasets/pull/4262
Remove a copy-paste sentence in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4281
Update LexGLUE README.md by @iliaschalkidis in https://github.com/huggingface/datasets/pull/4285
leaderboard info added for TNE by @yanaiela in https://github.com/huggingface/datasets/pull/4273
Add Lahnda language tag by @mariosasko in https://github.com/huggingface/datasets/pull/4286
Add license and point of contact to big_patent dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4269
Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4266

Metrics Changes

Perplexity Speedup by @emibaylor in https://github.com/huggingface/datasets/pull/4108
Add AUC ROC Metric by @emibaylor in https://github.com/huggingface/datasets/pull/4158
Small fixes in ROC AUC docs by @wschella in https://github.com/huggingface/datasets/pull/4239
Fix/start token mask issue and update documentation by @TristanThrush in https://github.com/huggingface/datasets/pull/4258
Add pearsonr mc, update functionality to match the original docs by @emibaylor in https://github.com/huggingface/datasets/pull/4226

Metric Cards

Metric card for the XTREME-S dataset by @sashavor in https://github.com/huggingface/datasets/pull/4251
Creating metric card for MAE by @sashavor in https://github.com/huggingface/datasets/pull/4252
Create metric cards for mean IOU by @sashavor in https://github.com/huggingface/datasets/pull/4253
Create metric card for Mahalanobis Distance by @sashavor in https://github.com/huggingface/datasets/pull/4257
Create metric card for MSE by @sashavor in https://github.com/huggingface/datasets/pull/4256
Fix exact match by @emibaylor in https://github.com/huggingface/datasets/pull/4166
Fix google bleu typos, examples by @emibaylor in https://github.com/huggingface/datasets/pull/4165
Add f1 metric card, update docstring in py file by @emibaylor in https://github.com/huggingface/datasets/pull/4227
Add Recall Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4204
Matthews Correlation Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4110
Add Precision Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4203
Add Accuracy Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4223
Add Spearmanr Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4109
Metric card template by @emibaylor in https://github.com/huggingface/datasets/pull/3915

Documentation

Document save_to_disk and push_to_hub on images and audio files by @lhoestq in https://github.com/huggingface/datasets/pull/4193
Add to docs how to load from local script by @albertvillanova in https://github.com/huggingface/datasets/pull/4200
Add code examples to API docs by @stevhliu in https://github.com/huggingface/datasets/pull/4168
Add code examples for DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/4245
Add API code examples for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/4274
Add packaged builder configs to the documentation by @lhoestq in https://github.com/huggingface/datasets/pull/4307
[Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in https://github.com/huggingface/datasets/pull/4311

General improvements and bug fixes

Generate tasks.json taxonomy from huggingface_hub by @julien-c in https://github.com/huggingface/datasets/pull/4154
Fix when map function modifies input in-place by @thomasw21 in https://github.com/huggingface/datasets/pull/4174
Support streaming cnn_dailymail dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4188
Don't duplicate data when encoding audio or image by @lhoestq in https://github.com/huggingface/datasets/pull/4187
Fix outdated docstring about default dataset config by @lhoestq in https://github.com/huggingface/datasets/pull/4186
Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in https://github.com/huggingface/datasets/pull/4190
Fix some type annotation in doc by @thomasw21 in https://github.com/huggingface/datasets/pull/4202
Update GH template for dataset viewer issues by @albertvillanova in https://github.com/huggingface/datasets/pull/4201
Update auth when mirroring datasets on the hub by @lhoestq in https://github.com/huggingface/datasets/pull/4242
Rename imagenet2012 -> imagenet-1k by @lhoestq in https://github.com/huggingface/datasets/pull/4263
Skip checksum computation in Imagefolder by default by @mariosasko in https://github.com/huggingface/datasets/pull/4214
Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in https://github.com/huggingface/datasets/pull/4205
Fix typo in logging docs by @stevhliu in https://github.com/huggingface/datasets/pull/4272
Bump PyArrow Version to 6 by @dnaveenr in https://github.com/huggingface/datasets/pull/4250
task id update by @nrajani in https://github.com/huggingface/datasets/pull/4244
Avoid recursion error in map if example is returned as dict value by @mariosasko in https://github.com/huggingface/datasets/pull/4216
Update minimal PyArrow version warning by @mariosasko in https://github.com/huggingface/datasets/pull/4279
[Minor edit] Fix typo in class name by @cakiki in https://github.com/huggingface/datasets/pull/4207
Stream private zipped images by @lhoestq in https://github.com/huggingface/datasets/pull/4173
Fix filesystem docstring by @stevhliu in https://github.com/huggingface/datasets/pull/4283
Document how to use FAISS index for special operations by @albertvillanova in https://github.com/huggingface/datasets/pull/4189
Contributing MedMCQA dataset by @monk1337 in https://github.com/huggingface/datasets/pull/4064
Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in https://github.com/huggingface/datasets/pull/4282
Fix missing lz4 dependency for tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4295
Altered faiss installation comment by @vishalsrao in https://github.com/huggingface/datasets/pull/4220
Fix CLI run_beam save_infos by @albertvillanova in https://github.com/huggingface/datasets/pull/4294
Add missing faiss import to fix https://github.com/huggingface/datasets/issues/4287 by @alvarobartt in https://github.com/huggingface/datasets/pull/4288
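
Two entries in this list concern size strings: `shard_size` is deprecated in favor of `max_shard_size` in `push_to_hub`, and `convert_file_size_to_int` was fixed for kilobits and megabits. The sketch below shows how such strings are commonly converted to bytes. It is not the library's implementation: `parse_size` is a hypothetical helper, and the unit conventions (uppercase `B` for bytes, lowercase `b` for bits, decimal `MB` versus binary `MiB`) are assumptions.

```python
def parse_size(size):
    """Convert a size such as "500MB" or "8Mb" to a number of bytes."""
    if isinstance(size, int):
        return size
    # Binary suffixes are checked first so "GIB" is not mistaken for "GB".
    units = [
        ("GIB", 2**30), ("MIB", 2**20), ("KIB", 2**10),
        ("GB", 10**9), ("MB", 10**6), ("KB", 10**3),
    ]
    for suffix, factor in units:
        if size.upper().endswith(suffix):
            value = int(size[: -len(suffix)]) * factor
            # A lowercase trailing "b" denotes bits, so divide by 8 to get bytes.
            return value // 8 if size.endswith("b") else value
    raise ValueError(f"Unrecognized size: {size!r}")


shard_bytes = parse_size("500MB")  # decimal megabytes
bits_as_bytes = parse_size("8Mb")  # eight megabits expressed in bytes
```

A string such as `"500MB"` is the kind of value one would presumably pass as `max_shard_size`.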

New Contributors

@shanyas10 made their first contribution in https://github.com/huggingface/datasets/pull/4240
@apsdehal made their first contribution in https://github.com/huggingface/datasets/pull/4178
@wschella made their first contribution in https://github.com/huggingface/datasets/pull/4239
@TristanThrush made their first contribution in https://github.com/huggingface/datasets/pull/4258
@yanaiela made their first contribution in https://github.com/huggingface/datasets/pull/4153
@mo6zes made their first contribution in https://github.com/huggingface/datasets/pull/4262
@nrajani made their first contribution in https://github.com/huggingface/datasets/pull/4244
@sanchit-gandhi made their first contribution in https://github.com/huggingface/datasets/pull/4266
@cakiki made their first contribution in https://github.com/huggingface/datasets/pull/4207
@monk1337 made their first contribution in https://github.com/huggingface/datasets/pull/4064
@alvarobartt made their first contribution in https://github.com/huggingface/datasets/pull/4288

Full Changelog: https://github.com/huggingface/datasets/compare/2.1.0...2.2.0

- Python
Published by lhoestq about 4 years ago

datasets - 2.1.0

Datasets Changes

New: initial monash time series forecasting by @kashif in https://github.com/huggingface/datasets/pull/3743
New: Roman Urdu Hate Speech dataset by @bp-high in https://github.com/huggingface/datasets/pull/3972
New: Adversarial GLUE by @jxmorris12 in https://github.com/huggingface/datasets/pull/3849
New: MetaShift by @dnaveenr in https://github.com/huggingface/datasets/pull/3900
New: GSM8K by @jon-tow in https://github.com/huggingface/datasets/pull/4103
New: SBU Captions Photo by @thomasw21 in https://github.com/huggingface/datasets/pull/4130
Deprecated: Multilingual Librispeech - deprecate dataset in favor of facebook/multilingual_librispeech by @polinaeterna in https://github.com/huggingface/datasets/pull/4060
Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in https://github.com/huggingface/datasets/pull/4145
Update: Wikipedia by @albertvillanova in https://github.com/huggingface/datasets/pull/3821 and https://github.com/huggingface/datasets/pull/3989
Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4002
Update: daily_dialog - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4008
Update: id_clickbait - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4014
Update: blimp - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4016
Update: scan - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4017
Update: yelp_review_full - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4018
Update: yelp_polarity - Support streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4019
Update: amazon_polarity - Replace data URL by @lhoestq in https://github.com/huggingface/datasets/pull/4020
Update: dbpedia_14 - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4022
Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in https://github.com/huggingface/datasets/pull/4026
Update: yahoo_answers_topics - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4023
Update: ASSIN 2 dataset - replace broken Google Drive URLS by links on github by @ruanchaves in https://github.com/huggingface/datasets/pull/4004
Update: xcopa - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4039
Update: medical_dialog - Add configs with processed data by @albertvillanova in https://github.com/huggingface/datasets/pull/4127
Update: xtreme - Support streaming for udpos config by @albertvillanova in https://github.com/huggingface/datasets/pull/4131
Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4132
Update: xtreme - Support streaming for PAN-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4135
Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in https://github.com/huggingface/datasets/pull/4030
Update: HANS - Support streaming by @mariosasko in https://github.com/huggingface/datasets/pull/4155
Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4033
Fix: xcopa - fix null checksum by @albertvillanova in https://github.com/huggingface/datasets/pull/4034
Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in https://github.com/huggingface/datasets/pull/4092

Dataset Cards

Updated annotations for nli_tr dataset by @e-budur in https://github.com/huggingface/datasets/pull/4058
Add missing label for emotion description by @lijiazheng99 in https://github.com/huggingface/datasets/pull/4151
Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in https://github.com/huggingface/datasets/pull/3955
Improve RedCaps dataset card by @mariosasko in https://github.com/huggingface/datasets/pull/4100
Fix duplicate key in multi_news by @lhoestq in https://github.com/huggingface/datasets/pull/4164

Datasets Tags and Search on the Hugging Face Hub

Tasks alignment with models by @lhoestq in https://github.com/huggingface/datasets/pull/4066
Update datasets task tags to align tags with models by @lhoestq in https://github.com/huggingface/datasets/pull/4067

Metrics Changes

Xtreme-S Metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3799
Fix xtreme s metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3957
Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in https://github.com/huggingface/datasets/pull/3938
Add exact match metric by @emibaylor in https://github.com/huggingface/datasets/pull/3899
Fix comet metric by @lhoestq in https://github.com/huggingface/datasets/pull/3945
Add zero_division argument to precision and recall metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4035
Support float data types in pearsonr/spearmanr metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4054
Remove GLEU metric by @emibaylor in https://github.com/huggingface/datasets/pull/3949
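
To illustrate what a `zero_division` argument is for, here is a minimal pure-Python sketch of binary precision. Precision is undefined when nothing is predicted positive, and the argument sets the value returned in that case. This is an illustration under assumptions, not the metric's actual code; the function name, signature, and default are hypothetical.

```python
def precision(predictions, references, positive=1, zero_division=0.0):
    """Binary precision: true positives / predicted positives."""
    true_positives = sum(
        p == positive and r == positive for p, r in zip(predictions, references)
    )
    predicted_positives = sum(p == positive for p in predictions)
    if predicted_positives == 0:
        # No positive predictions: the ratio is 0/0, so return the fallback value.
        return zero_division
    return true_positives / predicted_positives


half = precision([1, 1], [1, 0])  # one of two positive predictions is correct
fallback = precision([0, 0], [1, 0], zero_division=1.0)  # no positive predictions
```

Recall has the same edge case when the references contain no positive labels.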

Metric Cards

Perplexity Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3905
Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3917
Create README.md for CER metric by @sashavor in https://github.com/huggingface/datasets/pull/3911
Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3944
Update README.md by @sashavor in https://github.com/huggingface/datasets/pull/3933
Create SARI metric card by @sashavor in https://github.com/huggingface/datasets/pull/3932
Create MAUVE metric card by @sashavor in https://github.com/huggingface/datasets/pull/3934
Create CoVAL metric card by @sashavor in https://github.com/huggingface/datasets/pull/3940
Google BLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3948
Create metric card for BERTScore by @sashavor in https://github.com/huggingface/datasets/pull/3966
Rename wer to cer by @pmgautam in https://github.com/huggingface/datasets/pull/4012
Create metric card for XNLI by @sashavor in https://github.com/huggingface/datasets/pull/4046
Create metric card for the Code Eval metric by @sashavor in https://github.com/huggingface/datasets/pull/4049
Add TER metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3981
BLEU metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3947
Create metric card for CUAD by @sashavor in https://github.com/huggingface/datasets/pull/4043
Create metric card for METEOR by @sashavor in https://github.com/huggingface/datasets/pull/4065
Create a metric card for Competition MATH by @sashavor in https://github.com/huggingface/datasets/pull/4073
Create metric card for seqeval by @sashavor in https://github.com/huggingface/datasets/pull/4070
Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3930
Create metric card for Frugal Score by @sashavor in https://github.com/huggingface/datasets/pull/4089
Updating FrugalScore metric card by @sashavor in https://github.com/huggingface/datasets/pull/4097
Proposing WikiSplit metric card by @sashavor in https://github.com/huggingface/datasets/pull/4098
Fix formatting in BLEU metric card by @mariosasko in https://github.com/huggingface/datasets/pull/4157

Documentation

Doc maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3926
[Doc] Don't use v for version tags on GitHub by @sgugger in https://github.com/huggingface/datasets/pull/3943
Use templates for doc-building jobs by @sgugger in https://github.com/huggingface/datasets/pull/3914
Add align_labels_with_mapping docs by @stevhliu in https://github.com/huggingface/datasets/pull/3931
Add tip on how to speed up loading with ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/3980
Fix main_classes docs index by @lhoestq in https://github.com/huggingface/datasets/pull/3925
More consistent references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3988
Docs maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3999
Add ROUGE Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4076
Add chrF(++) Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4082
Add SacreBLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4083

General improvements and bug fixes

Fix flatten of complex feature types by @mariosasko in https://github.com/huggingface/datasets/pull/3723
Fix flatten of Sequence feature type by @lhoestq in https://github.com/huggingface/datasets/pull/3962
Exclude Google Drive tests of the CI by @lhoestq in https://github.com/huggingface/datasets/pull/3982
Close PIL.Image file handler in Image.decode_example by @mariosasko in https://github.com/huggingface/datasets/pull/3995
Fix Faiss custom_index device by @albertvillanova in https://github.com/huggingface/datasets/pull/3987
Fix None issue with Sequence of dict by @lhoestq in https://github.com/huggingface/datasets/pull/4010
Update main readme by @lhoestq in https://github.com/huggingface/datasets/pull/3927
Fix map remove_columns on empty dataset by @lhoestq in https://github.com/huggingface/datasets/pull/4021
Fix Audio.encode_example() when writing an array by @polinaeterna in https://github.com/huggingface/datasets/pull/3998
Use audio feature in ASR task template by @lhoestq in https://github.com/huggingface/datasets/pull/4006
Improve out of bounds error message by @lhoestq in https://github.com/huggingface/datasets/pull/4068
Increase max retries for GitHub metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4063
Fix CLI dummy data generation by @albertvillanova in https://github.com/huggingface/datasets/pull/4045
Fix docs on audio feature installation by @albertvillanova in https://github.com/huggingface/datasets/pull/4028
Add installation instructions to image_process doc by @mariosasko in https://github.com/huggingface/datasets/pull/4072
Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in https://github.com/huggingface/datasets/pull/4078
Increase max retries for GitHub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4079
Close parquet writer properly in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/4081
fix typo in rename_column error message by @hunterlang in https://github.com/huggingface/datasets/pull/4095
Fix BeamWriter output Parquet file by @albertvillanova in https://github.com/huggingface/datasets/pull/4087
Remove unused legacy Beam utils by @albertvillanova in https://github.com/huggingface/datasets/pull/4088
Hotfix failing CI tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/4119
Update security policy by @albertvillanova in https://github.com/huggingface/datasets/pull/4111
Avoid writing empty license files by @albertvillanova in https://github.com/huggingface/datasets/pull/4090
Support huggingface_hub 0.5 by @lhoestq in https://github.com/huggingface/datasets/pull/4106
Pretty print dataset info files by @mariosasko in https://github.com/huggingface/datasets/pull/4116
Add single dataset citations for TweetEval by @gchhablani in https://github.com/huggingface/datasets/pull/4137
Adjust path to datasets tutorial in How-To by @NimaBoscarino in https://github.com/huggingface/datasets/pull/4147
Applied index-filters on scores in search.py. by @vishalsrao in https://github.com/huggingface/datasets/pull/3971
More robust cast_to_python_objects in TypedSequence by @mariosasko in https://github.com/huggingface/datasets/pull/4128
Sync Features dictionaries by @mariosasko in https://github.com/huggingface/datasets/pull/3997
Avoid rate limit in update hub repositories by @lhoestq in https://github.com/huggingface/datasets/pull/4167

New Contributors

@bp-high made their first contribution in https://github.com/huggingface/datasets/pull/3972
@ruanchaves made their first contribution in https://github.com/huggingface/datasets/pull/4004
@pmgautam made their first contribution in https://github.com/huggingface/datasets/pull/4012
@hunterlang made their first contribution in https://github.com/huggingface/datasets/pull/4095
@trentonstrong made their first contribution in https://github.com/huggingface/datasets/pull/4092
@NimaBoscarino made their first contribution in https://github.com/huggingface/datasets/pull/4147
@jon-tow made their first contribution in https://github.com/huggingface/datasets/pull/4103
@lijiazheng99 made their first contribution in https://github.com/huggingface/datasets/pull/4151
@Datta0 made their first contribution in https://github.com/huggingface/datasets/pull/3955
@vishalsrao made their first contribution in https://github.com/huggingface/datasets/pull/3971

Full Changelog: https://github.com/huggingface/datasets/compare/2.0.0...2.1.0

- Python
Published by lhoestq about 4 years ago

datasets - 2.0.0

🤗 Datasets 2.0.0

We're happy to announce that our new documentation is available at hf.co/docs/datasets !

Dataset Features

Load a folder of images using the imagefolder dataset loader:
- Add imagefolder dataset by @nateraw in https://github.com/huggingface/datasets/pull/2830
- Faster ImageFolder + add option to drop labels by @mariosasko in https://github.com/huggingface/datasets/pull/3887
Push your image and audio datasets on the Hugging Face Hub with push_to_hub:
- Add support for Audio and Image feature in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3685
New processing methods for streaming datasets:
- Add IterableDataset.filter by @lhoestq in https://github.com/huggingface/datasets/pull/3826
- Manipulate columns on IterableDataset (rename columns, cast, etc.) by @lhoestq in https://github.com/huggingface/datasets/pull/3862
- Add the new methods to IterableDatasetDict by @lhoestq in https://github.com/huggingface/datasets/pull/3923
And more:
- Add more compression types for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3551
- Multi-GPU support for FaissIndex by @rentruewang in https://github.com/huggingface/datasets/pull/3721

Breaking changes

API changes for map and shuffle for datasets loaded in streaming mode:
- Align map when streaming: update instead of overwrite + add missing parameters by @lhoestq in https://github.com/huggingface/datasets/pull/3801
- Align IterableDataset.shuffle with Dataset.shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/3842
Rename GenerateMode to DownloadMode by @albertvillanova in https://github.com/huggingface/datasets/pull/3759
Remove deprecated methods/params (preparation for v2.0) by @mariosasko in https://github.com/huggingface/datasets/pull/3803
Remove deprecated remove_columns param in filter by @mariosasko in https://github.com/huggingface/datasets/pull/3827
Module namespace cleanup for v2.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3875
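
The "update instead of overwrite" change to `map` in streaming mode can be pictured with plain dictionaries. The sketch below only illustrates the two semantics as the PR title describes them. It is not library code, and `map_overwrite`, `map_update`, and `add_length` are hypothetical names.

```python
def add_length(example):
    # Returns only the new column, not the full example.
    return {"length": len(example["text"])}


def map_overwrite(example, function):
    # Overwrite semantics: the function's output replaces the example.
    return function(example)


def map_update(example, function):
    # Update semantics: the output is merged into the example,
    # so existing columns are kept unless the function redefines them.
    return {**example, **function(example)}


example = {"text": "hello"}
overwritten = map_overwrite(example, add_length)  # {"length": 5}
updated = map_update(example, add_length)  # {"text": "hello", "length": 5}
```

Under update semantics, a mapped function that used to return the full example to preserve existing columns can return just the new or changed ones. Columns to drop would need to be removed explicitly, for example via `remove_columns`.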

Dataset Changes

New: CFPB Consumer Complaints by @kayvane1 in https://github.com/huggingface/datasets/pull/3617
New: told-br (brazilian hate speech) by @JAugusto97 in https://github.com/huggingface/datasets/pull/3683
New: electricity load diagram by @kashif in https://github.com/huggingface/datasets/pull/3722
New: MIT Scene Parsing Benchmark by @mariosasko in https://github.com/huggingface/datasets/pull/3607
New: ElkarHizketak v1.0 by @antxa in https://github.com/huggingface/datasets/pull/3780
New: wikitablequestions by @SivilTaram in https://github.com/huggingface/datasets/pull/3870
New: ontonotes_conll by @richarddwang in https://github.com/huggingface/datasets/pull/3853
Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3616
Update: Common voice - add validated partition by @shalymin-amzn in https://github.com/huggingface/datasets/pull/3669
Update: Common Voice - add local paths to audio files by @lhoestq in https://github.com/huggingface/datasets/pull/3736
Update: Common Voice - simplify code by @lhoestq in https://github.com/huggingface/datasets/pull/3817
Update: Natural Questions - add dev-only configuration by @albertvillanova in https://github.com/huggingface/datasets/pull/3699
Update: pubmed - update data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3692
Update: pubmed - make the dataset streamable by @abhi-mosaic in https://github.com/huggingface/datasets/pull/3740
Update: RedCaps - make the dataset streamable by @mariosasko in https://github.com/huggingface/datasets/pull/3737
Update: cats_vs_dogs - update metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3752
Update: newsroom - update manual download url by @albertvillanova in https://github.com/huggingface/datasets/pull/3779
Update: xcopa - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3810
Update: cats_vs_dogs size by @mariosasko in https://github.com/huggingface/datasets/pull/3878
Fix: sem_eval_2018_task_1 - fix download location by @maxpel in https://github.com/huggingface/datasets/pull/3643
Fix: newsqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3696
Fix: The Pile datasets - fix host urls by @albertvillanova in https://github.com/huggingface/datasets/pull/3627
Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3718
Fix: NewsQA - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3734
Fix: head_qa - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3766
Fix: msr_sqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3771
Fix: reddit_tifu - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3774
Fix: wiki_lingua - fix spanish data file url by @albertvillanova in https://github.com/huggingface/datasets/pull/3806
Fix: beans - fix data urls by @mariosasko in https://github.com/huggingface/datasets/pull/3890
Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3921
Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3922

Dataset cards

Add code example in wikipedia card by @lhoestq in https://github.com/huggingface/datasets/pull/3678
Fix Multi-News dataset metadata and card by @albertvillanova in https://github.com/huggingface/datasets/pull/3731
Reddit dataset card additions by @anna-kay in https://github.com/huggingface/datasets/pull/3781
Update gigaword card and info by @mariosasko in https://github.com/huggingface/datasets/pull/3775
Reddit dataset card contribution by @anna-kay in https://github.com/huggingface/datasets/pull/3797

Metric Changes

New: FrugalScore by @moussaKam in https://github.com/huggingface/datasets/pull/3674
New: Mahalanobis distance by @JoaoLages in https://github.com/huggingface/datasets/pull/3794
New: mIoU by @NielsRogge in https://github.com/huggingface/datasets/pull/3745
New: MSE and MAE - V2 by @dnaveenr in https://github.com/huggingface/datasets/pull/3874
Fix: METEOR - fix bug due to nltk version by @albertvillanova in https://github.com/huggingface/datasets/pull/3884

Metric cards

Add perplexity to metrics by @emibaylor in https://github.com/huggingface/datasets/pull/3757
Create SQuAD metric README.md by @sashavor in https://github.com/huggingface/datasets/pull/3873
SQuAD v2 metric: create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3879
Update README.md for SQuAD v2 metric by @sashavor in https://github.com/huggingface/datasets/pull/3908
Update README.md for SQuAD metric by @sashavor in https://github.com/huggingface/datasets/pull/3907
Create README.md for WER metric by @sashavor in https://github.com/huggingface/datasets/pull/3898
Create README.md for GLUE by @sashavor in https://github.com/huggingface/datasets/pull/3916

New documentation

Update docs to new frontend/UI by @mishig25 in https://github.com/huggingface/datasets/pull/3690
Image process doc by @stevhliu in https://github.com/huggingface/datasets/pull/3882

General improvements and bug fixes

Better TQDM output by @mariosasko in https://github.com/huggingface/datasets/pull/3654
Prioritize module.builder_kwargs over defaults in TestCommand by @lvwerra in https://github.com/huggingface/datasets/pull/3672
Extend support for streaming datasets that use os.path.relpath by @albertvillanova in https://github.com/huggingface/datasets/pull/3623
Add Fon language tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3620
Remove unnecessary 'r' arg in by @bryant1410 in https://github.com/huggingface/datasets/pull/3661
Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in https://github.com/huggingface/datasets/pull/3680
Upgrade black to version ~=22.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/3691
Fix streaming for servers not supporting HTTP range requests by @albertvillanova in https://github.com/huggingface/datasets/pull/3689
Pin ElasticSearch by @lhoestq in https://github.com/huggingface/datasets/pull/3701
Raise informative error when loading a save_to_disk dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3705
Fix ClassLabel to/from dict when passed names_file by @albertvillanova in https://github.com/huggingface/datasets/pull/3695
Fix CI code quality issue by @albertvillanova in https://github.com/huggingface/datasets/pull/3710
Check if indices values in Dataset.select are within bounds by @mariosasko in https://github.com/huggingface/datasets/pull/3719
Pin pandas to avoid bug in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3725
Use config pandas version in CSV dataset builder by @albertvillanova in https://github.com/huggingface/datasets/pull/3726
Set base path to hub url for canonical datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3709
Fix ValueError message formatting in int2str by @akulchik in https://github.com/huggingface/datasets/pull/3742
Patch all module attributes in its namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/3727
Fix typo in train split name by @albertvillanova in https://github.com/huggingface/datasets/pull/3751
feat: 🎸 generate info if dataset_infos.json does not exist by @severo in https://github.com/huggingface/datasets/pull/3670
Support streaming in size estimation function in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3732
Expose method and fix param by @severo in https://github.com/huggingface/datasets/pull/3767
Fix HfFileSystem docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3768
process .opus files (for Multilingual Spoken Words) by @polinaeterna in https://github.com/huggingface/datasets/pull/3666
Fix: dataset name is stored in keys by @thomasw21 in https://github.com/huggingface/datasets/pull/3772
Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/3746
Start removing canonical datasets logic by @lhoestq in https://github.com/huggingface/datasets/pull/3777
Support passing str to iter_files by @albertvillanova in https://github.com/huggingface/datasets/pull/3783
Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in https://github.com/huggingface/datasets/pull/3787
Skip checksum computation if ignore_verifications is True by @mariosasko in https://github.com/huggingface/datasets/pull/3796
Fix error message in CSV loader for newer Pandas versions by @mariosasko in https://github.com/huggingface/datasets/pull/3798
Add data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/3791
Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in https://github.com/huggingface/datasets/pull/3782
Handle Nones in PyArrow struct by @mariosasko in https://github.com/huggingface/datasets/pull/3814
Fix iter_archive getting reset by @lhoestq in https://github.com/huggingface/datasets/pull/3815
Added computer vision tasks by @merveenoyan in https://github.com/huggingface/datasets/pull/3800
Fix typo in doc build yml by @mishig25 in https://github.com/huggingface/datasets/pull/3819
Allow not specifying feature cols other than predictions/references in Metric.compute by @mariosasko in https://github.com/huggingface/datasets/pull/3824
Logo float left by @mishig25 in https://github.com/huggingface/datasets/pull/3836
Pin responses to fix CI for Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/3840
Fix dead dataset scripts creation link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3834
Remove decode: true for image feature in head_qa by @craffel in https://github.com/huggingface/datasets/pull/3805
Update faiss device docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3846
Update index.mdx margins by @gary149 in https://github.com/huggingface/datasets/pull/3858
Fix push_to_hub with null images by @lhoestq in https://github.com/huggingface/datasets/pull/3856
Redundant add dataset information and dead link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3852
Update image dataset tags by @mariosasko in https://github.com/huggingface/datasets/pull/3864
Bring back imgs so that forks don't get broken by @mishig25 in https://github.com/huggingface/datasets/pull/3866
Small typos in How-to-train tutorial. by @lkhphuc in https://github.com/huggingface/datasets/pull/3833
Small doc fixes by @mishig25 in https://github.com/huggingface/datasets/pull/3860
add pandas to env command by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3871
Ignore duplicate keys if ignore_verifications=True by @mariosasko in https://github.com/huggingface/datasets/pull/3868
Update code blocks by @lhoestq in https://github.com/huggingface/datasets/pull/3863
Fix download_mode in dataset_module_factory by @albertvillanova in https://github.com/huggingface/datasets/pull/3876
Fix some shuffle docs by @lhoestq in https://github.com/huggingface/datasets/pull/3885
Fix race condition in doc build by @lhoestq in https://github.com/huggingface/datasets/pull/3891
Add default branch for doc building by @sgugger in https://github.com/huggingface/datasets/pull/3893
[docs] make dummy data creation optional by @lhoestq in https://github.com/huggingface/datasets/pull/3894
Fix code examples indentation by @lhoestq in https://github.com/huggingface/datasets/pull/3895
Align tqdm control/cache control with Transformers by @mariosasko in https://github.com/huggingface/datasets/pull/3897
Fix CLI test checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/3892
Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in https://github.com/huggingface/datasets/pull/3843
Change the framework switches to the new syntax by @sgugger in https://github.com/huggingface/datasets/pull/3880

New Contributors

@kayvane1 made their first contribution in https://github.com/huggingface/datasets/pull/3617
@JAugusto97 made their first contribution in https://github.com/huggingface/datasets/pull/3683
@shalymin-amzn made their first contribution in https://github.com/huggingface/datasets/pull/3669
@kashif made their first contribution in https://github.com/huggingface/datasets/pull/3722
@akulchik made their first contribution in https://github.com/huggingface/datasets/pull/3742
@abhi-mosaic made their first contribution in https://github.com/huggingface/datasets/pull/3740
@emibaylor made their first contribution in https://github.com/huggingface/datasets/pull/3757
@anna-kay made their first contribution in https://github.com/huggingface/datasets/pull/3781
@JoaoLages made their first contribution in https://github.com/huggingface/datasets/pull/3794
@mishig25 made their first contribution in https://github.com/huggingface/datasets/pull/3690
@antxa made their first contribution in https://github.com/huggingface/datasets/pull/3780
@dnaveenr made their first contribution in https://github.com/huggingface/datasets/pull/3834
@lkhphuc made their first contribution in https://github.com/huggingface/datasets/pull/3833
@rentruewang made their first contribution in https://github.com/huggingface/datasets/pull/3721
@gary149 made their first contribution in https://github.com/huggingface/datasets/pull/3858
@NielsRogge made their first contribution in https://github.com/huggingface/datasets/pull/3745
@sashavor made their first contribution in https://github.com/huggingface/datasets/pull/3873
@SivilTaram made their first contribution in https://github.com/huggingface/datasets/pull/3870
Document cases for github datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3924
Fix text loader to split only on universal newlines by @albertvillanova in https://github.com/huggingface/datasets/pull/3910
Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in https://github.com/huggingface/datasets/pull/3886

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...2.0.0

- Python
Published by lhoestq over 4 years ago

datasets - 1.18.4

Bug fixes

Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)
Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)
Upgrade black to version ~=22.0 #3691 (@LysandreJik)
Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)
Pin ElasticSearch #3701 (@lhoestq)
Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)
Fix CI code quality issue #3710 (@albertvillanova)
Check if indices values in Dataset.select are within bounds #3719 (@mariosasko)
Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)
Use config pandas version in CSV dataset builder #3726 (@albertvillanova)
Fix dataset mirroring (@lhoestq)
Fix ValueError message formatting in int2str #3742 (@akulchik)
Patch all module attributes in its namespace #3727 (@albertvillanova)
Fix HfFileSystem docstring #3768 (@lhoestq)
Fix: dataset name is stored in keys #3772 (@thomasw21)
Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)
Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)
Pin responses to fix CI for Windows #3840 (@albertvillanova)

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...1.18.4

- Python
Published by albertvillanova over 4 years ago

datasets - 1.18.3

Bug fixes

Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in https://github.com/huggingface/datasets/pull/3665
Extend dataset builder for streaming in get_dataset_split_names by @mariosasko in https://github.com/huggingface/datasets/pull/3657

Dataset changes

New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in https://github.com/huggingface/datasets/pull/3605
New: British Library books dataset by @davanstrien in https://github.com/huggingface/datasets/pull/3603
Fix: wiki_bio - Update link by @jxmorris12 in https://github.com/huggingface/datasets/pull/3651

Other improvements

sp. Columbia => Colombia by @serapio in https://github.com/huggingface/datasets/pull/3652
Run pyupgrade for Python 3.6+ by @bryant1410 in https://github.com/huggingface/datasets/pull/3560

New Contributors

@serapio made their first contribution in https://github.com/huggingface/datasets/pull/3652
@mirzakhalov made their first contribution in https://github.com/huggingface/datasets/pull/3605

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.2...1.18.3

- Python
Published by lhoestq over 4 years ago

datasets - 1.18.2

Bug fixes

Fix streaming datasets that are not reset correctly by @lhoestq in https://github.com/huggingface/datasets/pull/3646
Fix numpy rngs when shuffling with seed=None by @mariosasko in https://github.com/huggingface/datasets/pull/3641
Fix dataset slicing with negative bounds when indices mapping is not None by @mariosasko in https://github.com/huggingface/datasets/pull/3642
Fix add_column on datasets with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/3647

Other improvements

Update index.rst by @VioletteLepercq in https://github.com/huggingface/datasets/pull/3636
Fix Windows CI: bump python to 3.7 by @lhoestq in https://github.com/huggingface/datasets/pull/3648

New Contributors

@VioletteLepercq made their first contribution in https://github.com/huggingface/datasets/pull/3636

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.1...1.18.2

- Python
Published by lhoestq over 4 years ago

datasets - 1.18.1

Improvements

Make decoding of Audio and Image feature optional by @mariosasko in https://github.com/huggingface/datasets/pull/3430

Bug fixes

Fix prepare_for_task() by @mariosasko in https://github.com/huggingface/datasets/pull/3614
Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/3619

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.0...1.18.1

- Python
Published by lhoestq over 4 years ago

datasets - 1.18.0

Datasets Changes

New: VCTK
- Add VCTK dataset by @jaketae in https://github.com/huggingface/datasets/pull/3351
- Fix VCTK encoding by @lhoestq in https://github.com/huggingface/datasets/pull/3493
- Docs: Add VCTK dataset description by @jaketae in https://github.com/huggingface/datasets/pull/3500
New: CPPE-5 dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3517
New: RedCaps dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3424
New: WIDER FACE dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3413
New: SVHN dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3535
New: BNL newspapers by @davanstrien in https://github.com/huggingface/datasets/pull/3397
New: PASS dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3576
New: Text2log Dataset by @apergo-ai in https://github.com/huggingface/datasets/pull/3579
Update: beans, cats_vs_dogs - Use iter_files instead of str(Path(...)) in image dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3477
Update: PIB - update version and make it streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3496
Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in https://github.com/huggingface/datasets/pull/3546
Update: MuchoCine - add missing tasks by @mariosasko in https://github.com/huggingface/datasets/pull/3571
Fix: Tashkeela - fix to yield stripped text by @albertvillanova in https://github.com/huggingface/datasets/pull/3471
Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in https://github.com/huggingface/datasets/pull/3516
Fix: CC100 - use HTTPS for the data source URL by @aajanki in https://github.com/huggingface/datasets/pull/3519
Fix: vision datasets - Fix bug in ImageClassification task template by @mariosasko in https://github.com/huggingface/datasets/pull/3557
Fix: tweet_qa - fix DuplicatedKeysError and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/3559
Fix: mC4 - fix multiple language downloading by @polinaeterna in https://github.com/huggingface/datasets/pull/3594
Fix: CoNLL2003:
- Use old url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3600
- Update url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3602
- Add conll2003 licensing by @lhoestq in https://github.com/huggingface/datasets/pull/3601

Datasets Features

[Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in https://github.com/huggingface/datasets/pull/3591
[Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in https://github.com/huggingface/datasets/pull/3575
Allows DatasetDict.filter to have batching option by @thomasw21 in https://github.com/huggingface/datasets/pull/3506
Add desc parameter to filter by @mariosasko in https://github.com/huggingface/datasets/pull/3513
Add gzip for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3492
Allow multiple task templates of the same type by @mariosasko in https://github.com/huggingface/datasets/pull/3562
Add parameter preserve_index to from_pandas by @Sorrow321 in https://github.com/huggingface/datasets/pull/3565
Dataset Streaming:
- Fix str(Path(...)) conversion in streaming on Linux by @mariosasko in https://github.com/huggingface/datasets/pull/3472
- Extend support for streaming datasets that use ET.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/3476
- Extend support for streaming datasets that use os.walk by @albertvillanova in https://github.com/huggingface/datasets/pull/3478

Metrics Changes

Add Mauve metric by @jthickstun in https://github.com/huggingface/datasets/pull/3573

Dataset cards

update pretty_name for first 200 datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3498
update pretty_name for all the other datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3536
pib: Update pib dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/3501
arabicspeechcorpus: Adding link to license. by @meg-huggingface in https://github.com/huggingface/datasets/pull/3524
Covost2: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3528
librispeech_asr: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3529
vivos: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3530
audio datasets: Audio datacard update - first pass by @meg-huggingface in https://github.com/huggingface/datasets/pull/3520
common_language: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3527
wikidpr: Update wikidpr README.md by @lhoestq in https://github.com/huggingface/datasets/pull/3534
qa4mre: Fix qa4mre tags by @lhoestq in https://github.com/huggingface/datasets/pull/3574
HellaSwag: Update HellaSwag README.md by @borgr in https://github.com/huggingface/datasets/pull/3588
ANLI: Update ANLI README.md by @borgr in https://github.com/huggingface/datasets/pull/3590
tweet_eval: Update README.md by @borgr in https://github.com/huggingface/datasets/pull/3593

Documentation

Fix rendering of docs by @albertvillanova in https://github.com/huggingface/datasets/pull/3470
Fix to_tf_dataset references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3514
added PII statements and license links to data cards by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3537
Readme usage update by @meg-huggingface in https://github.com/huggingface/datasets/pull/3538
Update the CC-100 dataset card by @aajanki in https://github.com/huggingface/datasets/pull/3542
Research wording for nc licenses by @meg-huggingface in https://github.com/huggingface/datasets/pull/3539
Added links to licensing and PII message in vctk dataset by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3523
Give clearer instructions to add the YAML tags by @albertvillanova in https://github.com/huggingface/datasets/pull/3532

General improvements and bug fixes

Fix overriding of filesystem info by @albertvillanova in https://github.com/huggingface/datasets/pull/3481
Update ADD_NEW_DATASET.md by @apergo-ai in https://github.com/huggingface/datasets/pull/3487
Fix weird spacing in ManualDownloadError message by @bryant1410 in https://github.com/huggingface/datasets/pull/3486
Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3494
Remove unused phony rule from Makefile by @bryant1410 in https://github.com/huggingface/datasets/pull/3483
fix: 🐛 pass token when retrieving the split names by @severo in https://github.com/huggingface/datasets/pull/3545
Pin torchmetrics to fix the COMET test by @lhoestq in https://github.com/huggingface/datasets/pull/3589
Preserve encoding/decoding with features in Iterable.map call by @mariosasko in https://github.com/huggingface/datasets/pull/3556

New Contributors

@apergo-ai made their first contribution in https://github.com/huggingface/datasets/pull/3487
@bryant1410 made their first contribution in https://github.com/huggingface/datasets/pull/3486
@meg-huggingface made their first contribution in https://github.com/huggingface/datasets/pull/3527
@aajanki made their first contribution in https://github.com/huggingface/datasets/pull/3519
@Sorrow321 made their first contribution in https://github.com/huggingface/datasets/pull/3565
@jthickstun made their first contribution in https://github.com/huggingface/datasets/pull/3573
@borgr made their first contribution in https://github.com/huggingface/datasets/pull/3588

Full Changelog: https://github.com/huggingface/datasets/compare/1.17.0...1.18.0

- Python
Published by lhoestq over 4 years ago

datasets - 1.17.0

Dataset Changes

New: The Pile
- Add The Pile dataset and PubMed Central subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3287
- Add The Pile Free Law subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3359
- Add The Pile USPTO subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3360
- Add The Pile subsets by @albertvillanova in https://github.com/huggingface/datasets/pull/3378
- Add The Pile Enron Emails subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3427
New: British Library Books Genre by @davanstrien in https://github.com/huggingface/datasets/pull/3312
New: Americas NLI by @fdschmidt93 in https://github.com/huggingface/datasets/pull/3371
New: Speech commands by @polinaeterna in https://github.com/huggingface/datasets/pull/3335
New: eli5_category by @jingshenSN2 in https://github.com/huggingface/datasets/pull/3420
New: OneStopQa by @scaperex in https://github.com/huggingface/datasets/pull/3436
Update: LABR - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3352
Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in https://github.com/huggingface/datasets/pull/3376
Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in https://github.com/huggingface/datasets/pull/3362
Update: CC100 - add Georgian data by @AnzorGozalishvili in https://github.com/huggingface/datasets/pull/3383
Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in https://github.com/huggingface/datasets/pull/3426
Update: swahili_news - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3463
Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in https://github.com/huggingface/datasets/pull/3266
Fix: QED - fix type of bridge field by @mariosasko in https://github.com/huggingface/datasets/pull/3417
Fix: ASSET - fix dataset data URLs by @tianjianjiang in https://github.com/huggingface/datasets/pull/3342

Dataset Features

Add Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/3163
to_tf_dataset() refactor by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3356
More robust None handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195
Add cast_column to IterableDataset by @mariosasko in https://github.com/huggingface/datasets/pull/3439
Support streaming zipped dataset repo by passing only repo name by @albertvillanova in https://github.com/huggingface/datasets/pull/3375
Extend support for streaming datasets that use pd.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/3355
Extend iter_archive to support file object input by @albertvillanova in https://github.com/huggingface/datasets/pull/3443
Extend text to support yielding lines, paragraphs or documents by @albertvillanova in https://github.com/huggingface/datasets/pull/3442
Push dataset_infos.json to Hub to preserve feature types by @lhoestq in https://github.com/huggingface/datasets/pull/3467

Dataset cards

Change TriviaQA license (#3313) by @avinashsai in https://github.com/huggingface/datasets/pull/3330
Add missing tags to XTREME by @mariosasko in https://github.com/huggingface/datasets/pull/3322
Remove duplicate name from dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3354
Fix typos in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3386
Fix duplicated tag in wikicorpus dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/3458

Dataset Tasks

Create Language Modeling task by @albertvillanova in https://github.com/huggingface/datasets/pull/3387

Metric Changes

BLEURT: Match key names to correspond with filename by @jaehlee in https://github.com/huggingface/datasets/pull/3348
Fix links in metrics description by @albertvillanova in https://github.com/huggingface/datasets/pull/3461
Fix METEOR missing NLTK's omw-1.4 by @lhoestq in https://github.com/huggingface/datasets/pull/3469

Docs

Add ArrayXD docs by @stevhliu in https://github.com/huggingface/datasets/pull/3344
Document a training loop for streaming dataset by @lhoestq in https://github.com/huggingface/datasets/pull/3370
Fix formatting in IterableDataset.map docs by @mariosasko in https://github.com/huggingface/datasets/pull/3395
Correctly indent builder config in dataset script docs by @mariosasko in https://github.com/huggingface/datasets/pull/3432
Update BLEURT hyperlink by @lewtun in https://github.com/huggingface/datasets/pull/3437

Additional improvements and bug fixes

Quick fix error formatting by @NouamaneTazi in https://github.com/huggingface/datasets/pull/3328
Fix error message and add extension fallback by @mariosasko in https://github.com/huggingface/datasets/pull/3332
Avoid content-encoding issue while streaming datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/3350
Fix JSON ClassLabel casting for integers by @lhoestq in https://github.com/huggingface/datasets/pull/3340
Better error message when download fails by @lhoestq in https://github.com/huggingface/datasets/pull/3343
Fix dict source_datasets tagset validator by @albertvillanova in https://github.com/huggingface/datasets/pull/3368
Fix typo in other-structured-to-text task tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3367
Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in https://github.com/huggingface/datasets/pull/3296
Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in https://github.com/huggingface/datasets/pull/3388
More robust first elem check in encode/cast example by @mariosasko in https://github.com/huggingface/datasets/pull/3402
Fix module inference for archive with a directory by @albertvillanova in https://github.com/huggingface/datasets/pull/3406
Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in https://github.com/huggingface/datasets/pull/3410
Pass new_fingerprint in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/3409
Fix flaky test again for s3 serialization by @lhoestq in https://github.com/huggingface/datasets/pull/3412
Skip None encoding (line deleted by accident in #3195) by @mariosasko in https://github.com/huggingface/datasets/pull/3414
Clean squad dummy data by @lhoestq in https://github.com/huggingface/datasets/pull/3428
#3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in https://github.com/huggingface/datasets/pull/3382
Make cast cacheable (again) on Windows by @mariosasko in https://github.com/huggingface/datasets/pull/3429
Use max number of data files to infer module by @albertvillanova in https://github.com/huggingface/datasets/pull/3407
Fix iter_archive generator by @albertvillanova in https://github.com/huggingface/datasets/pull/3454
[Staging] Update dataset repos automatically on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3451
Update supported versions of Python in setup.py by @mariosasko in https://github.com/huggingface/datasets/pull/3438
raise exception instead of using assertions. by @manisnesan in https://github.com/huggingface/datasets/pull/3349

New Contributors

@avinashsai made their first contribution in https://github.com/huggingface/datasets/pull/3330
@NouamaneTazi made their first contribution in https://github.com/huggingface/datasets/pull/3328
@davanstrien made their first contribution in https://github.com/huggingface/datasets/pull/3312
@francisco-perez-sorrosal made their first contribution in https://github.com/huggingface/datasets/pull/3296
@LashaO made their first contribution in https://github.com/huggingface/datasets/pull/3266
@fdschmidt93 made their first contribution in https://github.com/huggingface/datasets/pull/3371
@polinaeterna made their first contribution in https://github.com/huggingface/datasets/pull/3335
@AnzorGozalishvili made their first contribution in https://github.com/huggingface/datasets/pull/3383
@tianjianjiang made their first contribution in https://github.com/huggingface/datasets/pull/3342
@jingshenSN2 made their first contribution in https://github.com/huggingface/datasets/pull/3420
@scaperex made their first contribution in https://github.com/huggingface/datasets/pull/3436

Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0

- Python
Published by lhoestq over 4 years ago

datasets - 1.16.1

Bug fixes

Fix import datasets on python 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/3326
Fix wrongly converted assert by @eliasws in https://github.com/huggingface/datasets/pull/3323

- Python
Published by lhoestq over 4 years ago

datasets - 1.16.0

Datasets Changes

New: riddle_sense by @ziyiwu9494 in https://github.com/huggingface/datasets/pull/3161
New: Multi-Lingual LibriSpeech by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3198
New: XCSR by @yangxqiao in https://github.com/huggingface/datasets/pull/3074
New: CMU Hinglish DoG by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3149
New: Multidoc2dial by @sivasankalpp in https://github.com/huggingface/datasets/pull/3205
New: IndoNLI by @afaji in https://github.com/huggingface/datasets/pull/3307
Update: DaNE - updated URL for download by @MalteHB in https://github.com/huggingface/datasets/pull/3203
Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in https://github.com/huggingface/datasets/pull/3254
Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in https://github.com/huggingface/datasets/pull/3225
Update: KILT - update metadata JSON by @albertvillanova in https://github.com/huggingface/datasets/pull/3276
Update: Covost 2 - update download instructions by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3281
Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in https://github.com/huggingface/datasets/pull/3290
Fix: tuple_ie - fix download url by @mariosasko in https://github.com/huggingface/datasets/pull/3213
Fix: id_newspapers_2018 - fix streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3249
Fix: bookcorpusopen - fix RAM usage by @lhoestq in https://github.com/huggingface/datasets/pull/3280
Fix: Scielo - fix ConnectionError by @mariosasko in https://github.com/huggingface/datasets/pull/3260
Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in https://github.com/huggingface/datasets/pull/3321

Datasets Features

Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in https://github.com/huggingface/datasets/pull/3098:
- upload your dataset to the Hugging Face Hub with the push_to_hub() method !
- See documentation here
200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in https://github.com/huggingface/datasets/pull/3110
- Stream from Google Drive and other hosts by @lhoestq in https://github.com/huggingface/datasets/pull/3248
- Support Audio feature in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in https://github.com/huggingface/datasets/pull/3129
Resolve data_files by split name automatically by @lhoestq in https://github.com/huggingface/datasets/pull/3221
- It takes into account the file names to know which file goes into which split
- See documentation here
Filter method for batched=True by @thomasw21 in https://github.com/huggingface/datasets/pull/3244
Adding with_rank arg to pass process rank to map by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314

Dataset Cards

Add full tagset to conll2003 README by @BramVanroy in https://github.com/huggingface/datasets/pull/3230
Fix some contact information formats by @lhoestq in https://github.com/huggingface/datasets/pull/3274
Add wikipedia tags by @lhoestq in https://github.com/huggingface/datasets/pull/3301
Updating details of IRC disentanglement data by @jkkummerfeld in https://github.com/huggingface/datasets/pull/3259

Metrics Changes

New: OpenAI's pass@k code evaluation metric by @lvwerra in https://github.com/huggingface/datasets/pull/2916
Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in https://github.com/huggingface/datasets/pull/3235
Update: CER - update to support latest release by @mariosasko in https://github.com/huggingface/datasets/pull/3252
Update: WER - update to the documentation by @wooters in https://github.com/huggingface/datasets/pull/3278

Documentation

Add docs for to_tf_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/3175
Small updates to to_tf_dataset documentation by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3215
Update link to Datasets Tagging app in Spaces by @albertvillanova in https://github.com/huggingface/datasets/pull/3194
Improve repository structure docs by @lhoestq in https://github.com/huggingface/datasets/pull/3233
Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3241
Add docs for audio processing by @stevhliu in https://github.com/huggingface/datasets/pull/3222
Add push_to_hub docs by @lhoestq in https://github.com/huggingface/datasets/pull/3319

Additional improvements and bug fixes

Catch token invalid error in CI by @lhoestq in https://github.com/huggingface/datasets/pull/3200
Pin keras version until TF fixes its release by @albertvillanova in https://github.com/huggingface/datasets/pull/3208
Fix disable_nullable default value to False by @lhoestq in https://github.com/huggingface/datasets/pull/3211
Fix code quality in riddle_sense dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3218
Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160
Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3121
Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in https://github.com/huggingface/datasets/pull/3216
Group tests in multiprocessing workers by test file by @albertvillanova in https://github.com/huggingface/datasets/pull/3231
Fix load_from_disk temporary directory by @lhoestq in https://github.com/huggingface/datasets/pull/3245
[tiny] fix typo in stream docs by @nollied in https://github.com/huggingface/datasets/pull/3246
Avoid PyArrow type optimization if it fails by @mariosasko in https://github.com/huggingface/datasets/pull/3234
Remove redundant isort module placement by @mariosasko in https://github.com/huggingface/datasets/pull/3243
asserts replaced by exception for text classification task with test. by @manisnesan in https://github.com/huggingface/datasets/pull/3256
Add os.listdir for streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3270
asserts replaced with exception for image classification task, csv, json by @manisnesan in https://github.com/huggingface/datasets/pull/3262
Force data files extraction if download_mode='force_redownload' by @mariosasko in https://github.com/huggingface/datasets/pull/3275
Minor Typo Fix - Precision to Recall by @SebastinSanty in https://github.com/huggingface/datasets/pull/3279
Decode audio from remote by @lhoestq in https://github.com/huggingface/datasets/pull/3271
Fix build_docs CI by @lhoestq in https://github.com/huggingface/datasets/pull/3286
Allow datasets with indices table when concatenating along axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3288
f-string formatting by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3277
Unpin markdown for build_docs now that it's fixed by @lhoestq in https://github.com/huggingface/datasets/pull/3289
Pin version exclusion for Markdown by @albertvillanova in https://github.com/huggingface/datasets/pull/3293
Use f-strings in the dataset scripts by @Carlosbogo in https://github.com/huggingface/datasets/pull/3291
fix old_val typo in f-string by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3302
asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305
fix: files counted twice in inferred structure by @borisdayma in https://github.com/huggingface/datasets/pull/3309
Finish transition to PyArrow 3.0.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3318
Removing query params for dynamic URL caching by @anton-l in https://github.com/huggingface/datasets/pull/3315

Citation

Update BibTeX entry by @albertvillanova in https://github.com/huggingface/datasets/pull/3223
Fix paper BibTeX citation with proceedings reference by @albertvillanova in https://github.com/huggingface/datasets/pull/3226
Add CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3228
Fix URL in CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3229

Deprecations

Deprecate prepare_module by @albertvillanova in https://github.com/huggingface/datasets/pull/3166

Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0

- Python
Published by lhoestq over 4 years ago

datasets - 1.15.1

Dependencies

Bump huggingface_hub to 0.1.0 by @lhoestq in https://github.com/huggingface/datasets/pull/3199

- Python
Published by lhoestq over 4 years ago

datasets - 1.15.0

Dataset Changes

Update: JNLBA - add tags names by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3092
Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in https://github.com/huggingface/datasets/pull/3125 and https://github.com/huggingface/datasets/pull/3176
Update: RONEC - update to v2 by @dumitrescustefan in https://github.com/huggingface/datasets/pull/3184
Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in https://github.com/huggingface/datasets/pull/3136
Fix: HLGD - fix label mapping by @VictorSanh in https://github.com/huggingface/datasets/pull/3180

Dataset Features

Allow dynamic first dimension for ArrayXD by @rpowalski in https://github.com/huggingface/datasets/pull/2891
add multi-proc in to_csv by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/2896
QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in https://github.com/huggingface/datasets/pull/3196

Dataset Cards

Fill in dataset card for NCBI disease dataset by @edugp in https://github.com/huggingface/datasets/pull/3115

Metrics Changes

New: metric for the MATH dataset (competition_math). by @hacobe in https://github.com/huggingface/datasets/pull/3020
New: Google BLEU (aka GLEU) metric by @slowwavesleep in https://github.com/huggingface/datasets/pull/3108
New: TER by @BramVanroy in https://github.com/huggingface/datasets/pull/3153
New: ChrF(++) by @BramVanroy in https://github.com/huggingface/datasets/pull/3187

General improvements and bug fixes

Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3120
Fixes to to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3085
Add security policy to the project by @albertvillanova in https://github.com/huggingface/datasets/pull/2958
Update doc links to point to new docs by @mariosasko in https://github.com/huggingface/datasets/pull/3116
Fix caching bugs by @mariosasko in https://github.com/huggingface/datasets/pull/3141
Fix numpy deprecation warning for ragged tensors by @lhoestq in https://github.com/huggingface/datasets/pull/3137
Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in https://github.com/huggingface/datasets/pull/3157
Fix some typos in the documentation by @h4iku in https://github.com/huggingface/datasets/pull/3152
Fix string encoding for Value type by @lhoestq in https://github.com/huggingface/datasets/pull/3158
Fix CLI test to ignore verifications when saving infos by @albertvillanova in https://github.com/huggingface/datasets/pull/3147
Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in https://github.com/huggingface/datasets/pull/3159
Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in https://github.com/huggingface/datasets/pull/3173
Asserts replaced by exceptions (huggingface#3171) by @joseporiolayats in https://github.com/huggingface/datasets/pull/3174
Preserve ordering in zip_dict by @mariosasko in https://github.com/huggingface/datasets/pull/3170
Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in https://github.com/huggingface/datasets/pull/3182
Re-add faiss to windows testing suite by @BramVanroy in https://github.com/huggingface/datasets/pull/3151
Add missing docstring to DownloadConfig by @mariosasko in https://github.com/huggingface/datasets/pull/3183
More efficient nested features encoding by @eladsegal in https://github.com/huggingface/datasets/pull/3124
Fix optimized encoding for arrays by @lhoestq in https://github.com/huggingface/datasets/pull/3197
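
The string-hashing fix above (#3182) rests on a CPython detail: two equal strings can be distinct objects, and pickle-style memoization tracks objects by identity. The sketch below only illustrates that effect with the standard library; it is not the library's hasher, and `content_hash` is a hypothetical helper.

```python
import hashlib
import pickle


def content_hash(value: str) -> str:
    """Hash a string by its content, never by object identity (hypothetical helper)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Build two equal strings at runtime so they are usually separate objects in CPython.
a = "".join(["data", "sets"])
b = "".join(["data", "sets"])

# The two lists compare equal, but pickle memoizes by object identity:
# [a, a] can reuse a memo reference for the second item, while [a, b]
# has to serialize the string twice, so the byte streams typically differ.
same_object = pickle.dumps([a, a])
distinct_objects = pickle.dumps([a, b])

print([a, a] == [a, b])
print(len(same_object), len(distinct_objects))
print(content_hash(a) == content_hash(b))
```

Hashing by content, as `content_hash` does, gives the same result regardless of how many string objects hold the same text.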

- Python
Published by lhoestq over 4 years ago

datasets - 1.14.0

Dataset changes

Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
Update: SUPERB - use Audio features #3101 (@anton-l)
Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

Add iter_archive #3066 (@lhoestq)

General improvements and bug fixes

Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
Fix project description in PyPI #3103 (@albertvillanova)
Align tqdm control with cache control #3031 (@mariosasko)
Add paper BibTeX citation #3107 (@albertvillanova)

- Python
Published by albertvillanova over 4 years ago

datasets - 1.13.3

Dataset changes

Update: Adapt all audio datasets #3081 (@patrickvonplaten)

Bug fixes

Update BibTeX entry #3090 (@albertvillanova)
Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
Fix Audio feature mp3 resampling #3096 (@albertvillanova)

- Python
Published by albertvillanova over 4 years ago

datasets - 1.13.2

Bug fixes

Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
Remove _resampler from Audio fields #3086 (@albertvillanova)

- Python
Published by albertvillanova over 4 years ago

datasets - 1.13.1

Bug fixes

Fix loading a metric with internal import #3077 (@albertvillanova)

- Python
Published by albertvillanova over 4 years ago

datasets - 1.13.0

Dataset changes

New: CaSiNo #2867 (@kushalchawla)
New: Mostly Basic Python Problems #2893 (@lvwerra)
New: OpenAI's HumanEval #2897 (@lvwerra)
New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
New: SEDE #2942 (@Hazoom)
New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
New: AMI #2853 (@cahya-wirawan)
New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
New: KanHope #2985 (@adeepH)
New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
New: SwedMedNER #2940 (@bwang482)
New: SberQuAD #3039 (@Alenush)
New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
New: Greek Legal Code #2966 (@christospi)
New: Story Cloze Test #3067 (@zaidalyafeai)
Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
Update: TriviaQA - add web and wiki config #2949 (@shirte)
Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
Update: Biosses - fix column names #3054 (@bwang482)
Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

Update: meteor - update from nltk update #2946 (@lhoestq)
Update: accuracy,f1,glue,indic-glue,pearsonr,precision,recall-super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)

Dataset features

Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
New utilities:
- Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
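
The `script_version` to `revision` rename above follows a common deprecation pattern: keep accepting the old keyword, warn, and forward its value. The sketch below shows that pattern in isolation; `resolve_revision` and the sentinel are hypothetical names, not the library's code.

```python
import warnings

_DEPRECATED = "deprecated"  # sentinel meaning "the old argument was not passed"


def resolve_revision(revision=None, script_version=_DEPRECATED):
    """Return the revision to use, honoring the deprecated script_version keyword."""
    if script_version != _DEPRECATED:
        warnings.warn(
            "'script_version' is deprecated, use 'revision' instead.",
            FutureWarning,
        )
        revision = script_version
    return revision


# New-style call: no warning is emitted.
print(resolve_revision(revision="main"))
```

Callers that still pass `script_version` keep working but see a `FutureWarning` pointing them at `revision`.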

Dataset cards

Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

Fix filter leaking #3019 (@lhoestq)
- calling filter several times in a row was not returning the right results in 1.12.0 and 1.12.1
Update BibTeX entry #2928 (@albertvillanova)
Fix exception chaining #2911 (@albertvillanova)
Add regression test for null Sequence #2929 (@albertvillanova)
Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
Fix fn kwargs in filter #2950 (@lhoestq)
Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
Fix missing conda deps #2952 (@lhoestq)
Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
Fix CI doc build #2961 (@albertvillanova)
Run tests in parallel #2954 (@albertvillanova)
Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
Take namespace into account in caching #2938 (@lhoestq)
Make Dataset.map accept list of np.array #2990 (@albertvillanova)
Fix loading compressed CSV without streaming #2994 (@albertvillanova)
Fix json loader when conversion not implemented #3000 (@lhoestq)
Remove all query parameters when extracting protocol #2996 (@albertvillanova)
Correct a typo #3007 (@Yann21)
Fix Windows test suite #3025 (@albertvillanova)
Remove unused parameter in xdirname #3017 (@albertvillanova)
Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
Fix typo #3023 (@qqaatw)
Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
Use cache folder for lockfile #2887 (@Dref360)
Fix streaming: catch Timeout error #3050 (@borisdayma)
Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
Fix task reloading from cache #3059 (@lhoestq)
Fix test command after refac #3065 (@lhoestq)
Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
Update summary on PyPi beyond NLP #3062 (@thomwolf)
Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
feat: increase streaming retry config #3068 (@borisdayma)
Fix pathlib patches for streaming #3072 (@lhoestq)
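
The NaN entry above (#2936) relies on an IEEE 754 rule: NaN never compares equal to itself, so an equality check on float data containing NaN reports a mismatch even when nothing changed. A minimal standard-library sketch of the pitfall follows; `same_values` is a hypothetical helper, not the library's check.

```python
import math


def same_values(xs, ys):
    """Compare two float sequences, treating a NaN pair as equal (hypothetical helper)."""
    if len(xs) != len(ys):
        return False
    return all(
        x == y or (math.isnan(x) and math.isnan(y))
        for x, y in zip(xs, ys)
    )


nan = float("nan")
values = [1.0, nan]
copy = [1.0, float("nan")]  # a separate NaN object, as after a round trip

print(nan == nan)                 # NaN is never equal to itself
print(values == copy)             # so a plain equality check fails
print(same_values(values, copy))  # a NaN-aware comparison succeeds
```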

Breaking changes:

Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, you may use dataset_module_factory.

- Python
Published by lhoestq over 4 years ago

datasets - 1.12.1

Bug fixes

Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
- this fixes the ArrowInvalid: Can only convert 1-dimensional array values errors

- Python
Published by lhoestq almost 5 years ago

datasets - 1.12.0

New documentation

New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference

See the new documentation here !

Datasets changes

New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
New: The Pile books3 #2801 (@richarddwang)
New: The Pile stack exchange #2803 (@richarddwang)
New: The Pile openwebtext2 #2802 (@richarddwang)
New: Food-101 #2804 (@nateraw)
New: Beans #2809 (@nateraw)
New: cedr #2796 (@naumov-al)
New: cats_vs_dogs #2807 (@nateraw)
New: MultiEURLEX #2865 (@iliaschalkidis)
New: BIOSSES #2881 (@bwang482)
Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
Update: SUPERB - Add SD task #2661 (@albertvillanova)
Update: SUPERB - Add KS task #2783 (@anton-l)
Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
Update: Openwebtext - update size #2857 (@lhoestq)
Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
Fix: linnaeus - fix url #2852 (@lhoestq)
Fix: ToTTo - fix data URL #2864 (@albertvillanova)
Fix: wikicorpus - fix keys #2844 (@lhoestq)
Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)

Datasets features

Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
add multi-proc in to_json #2747 (@bhavitvyamalik)
Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)

Dataset streaming - better support for compression:

Fix streaming zip files #2798 (@albertvillanova)
Support streaming tar files #2800 (@albertvillanova)
Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
Add url prefix convention for many compression formats #2822 (@lhoestq)
Support streaming datasets that use pathlib #2874 (@albertvillanova)
Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)

Metrics changes

Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)

Dataset cards

Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
Update ELI5 README.md #2848 (@odellus)

General improvements and bug fixes

Update release instructions #2740 (@albertvillanova)
Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
Allow PyArrow from source #2769 (@patrickvonplaten)
fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
Fix typo in test_dataset_common #2790 (@nateraw)
Fix type hint for data_files #2793 (@albertvillanova)
Bump tqdm version #2814 (@mariosasko)
Use packaging to handle versions #2777 (@albertvillanova)
Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
Rename The Pile subsets #2817 (@lhoestq)
Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
Fix extraction protocol inference from urls with params #2843 (@lhoestq)
Fix caching when moving script #2854 (@lhoestq)
Fix windows CI CondaError #2855 (@lhoestq)
fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
Update column_names shown as :func: in exploring.rst #2851 (@ClementRomac)
Fix s3fs version in CI #2858 (@lhoestq)
Fix three typos in two files for documentation #2870 (@leny-mi)
Move checks from _map_single to map #2660 (@mariosasko)
fix regex to accept negative timezone #2847 (@jadermcs)
Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
Fix null sequence encoding #2900 (@lhoestq)
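
The query-string entry above (#2856) describes a narrow rule: drop the query only when it is exactly `?dl=1`, and leave every other URL untouched. A sketch of that rule using the standard library is shown below; `strip_dl_query` is a hypothetical helper and the URLs are placeholders, so this is an illustration rather than the library's implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_dl_query(url: str) -> str:
    """Remove the query string only when it is exactly 'dl=1' (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.query == "dl=1":
        parts = parts._replace(query="")
    return urlunsplit(parts)


print(strip_dl_query("https://example.com/data.zip?dl=1"))
print(strip_dl_query("https://example.com/data.zip?version=2"))
```

Keeping other query strings intact matters when the parameters select which file or version the server returns.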

- Python
Published by lhoestq almost 5 years ago

datasets - 1.11.0

Datasets Changes

New: Add Russian SuperGLUE #2668 (@slowwavesleep)
New: Add Disfl-QA #2473 (@bhavitvyamalik)
New: Add TimeDial #2476 (@bhavitvyamalik)
Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
Fix: Update WikiANN data URL #2710 (@albertvillanova)
Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)

General improvements and bug fixes

fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
Update BibTeX entry #2706 (@albertvillanova)
Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
Add support for disable_progress_bar on Windows #2696 (@mariosasko)
Ignore empty batch when writing #2698 (@pcuenca)
Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
fix: 🐛 fix two typos #2720 (@severo)
Docs details #2690 (@severo)
Deal with the bad check in test_load.py #2721 (@mariosasko)
Pass use_auth_token to request_etags #2725 (@albertvillanova)
Typo fix tokenize_exemple #2726 (@shabie)
Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
Add missing parquet known extension #2733 (@lhoestq)

- Python
Published by albertvillanova almost 5 years ago

datasets - 1.10.2

The error message to tell which dataset config name to load was not displayed:
- Fix pick default config name message #2704 (@lhoestq)

Docstrings:
- Fix download_mode docstrings #2701 (@albertvillanova)

- Python
Published by lhoestq almost 5 years ago

datasets - 1.10.1

Fix minimum tqdm version and import on Colab #2697 (@nateraw)
Fix OSCAR Esperanto #2693 (@lhoestq)

- Python
Published by lhoestq almost 5 years ago

datasets - 1.10.0

Datasets Features

Support remote data files #2616 (@albertvillanova)
This allows passing URLs of remote data files to any dataset loader, e.g. load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]}).
This works for all these dataset loaders:
- text
- csv
- json
- parquet
- pandas
Streaming from remote text/json/csv/parquet/pandas files: When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True. Main contributions:
- Streaming for the Pandas loader #2636 (@lhoestq)
- Streaming for the CSV loader #2635 (@lhoestq)
- Streaming for the Json loader #2608 (@albertvillanova) #2638 (@lhoestq)
Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
Delete extracted files when loading dataset #2631 (@albertvillanova)
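
The remote data files and streaming features above combine into a single `load_dataset` call. The sketch below builds the keyword arguments separately so they can be inspected without network access; the helper names and the example.com URLs are placeholders, and only the `load_dataset("csv", data_files=..., streaming=True)` call shape comes from the notes above.

```python
def remote_csv_kwargs(urls, streaming=False):
    """Build keyword arguments for load_dataset("csv", ...) from remote URLs."""
    kwargs = {"data_files": {"train": list(urls)}}
    if streaming:
        # Iterate over the remote files on the fly instead of downloading them first.
        kwargs["streaming"] = True
    return kwargs


def load_remote_csv(urls, streaming=False):
    """Load CSV files from URLs (needs the datasets package and network access)."""
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    return load_dataset("csv", **remote_csv_kwargs(urls, streaming=streaming))


# Placeholder URLs; replace them with real CSV locations before calling load_remote_csv.
urls = ["https://example.com/part-1.csv", "https://example.com/part-2.csv"]
print(remote_csv_kwargs(urls, streaming=True))
```

With real URLs, `load_remote_csv(urls, streaming=True)` would return a streaming dataset whose "train" split can be iterated with a plain `for` loop.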

Datasets Changes

Fix: C4 - fix expected files list #2682 (@lhoestq)
Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)

Dataset Tasks

Add speech processing tasks #2620 (@lewtun)
Update ASR tags #2633 (@lewtun)
Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
Add ASR task for SUPERB #2619 (@lewtun)
add image-classification task template #2632 (@nateraw)

Metrics Changes

New: wiki_split #2623 (@bhadreshpsavani)
Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)

General improvements and bug fixes

Fix BibTeX entry #2594 (@albertvillanova)
Fix test_is_small_dataset #2588 (@albertvillanova)
Remove import of transformers #2602 (@albertvillanova)
Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
Remove redundant prepare_module #2597 (@albertvillanova)
Create ExtractManager #2295 (@albertvillanova)
Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
Use correct logger in metrics.py #2626 (@mariosasko)
Minor fix tests with Windows paths #2627 (@albertvillanova)
Use ETag of remote data files #2628 (@albertvillanova)
More consistent naming #2611 (@mariosasko)
Refactor patching to specific submodule #2639 (@albertvillanova)
Fix docstrings #2640 (@albertvillanova)
Fix anchor in README #2647 (@mariosasko)
Fix logging docstring #2652 (@mariosasko)
Allow dataset config kwargs to be None #2659 (@lhoestq)
Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
Use tqdm from tqdm_utils #2667 (@mariosasko)
Increase json reader block_size automatically #2676 (@lhoestq)
Parallelize ETag requests #2675 (@lhoestq)
Fix bad config ids that name cache directories #2686 (@lhoestq)
Minor documentation fix #2687 (@slowwavesleep)

Dataset Cards

Add missing WikiANN language tags #2610 (@albertvillanova)
feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)

Docs

Update processing.rst with other export formats #2599 (@TevenLeScao)

- Python
Published by lhoestq almost 5 years ago

datasets -

Datasets Changes

New: C4 #2575 #2592 (@lhoestq)
New: mC4 #2576 (@lhoestq)
New: MasakhaNER #2465 (@dadelani)
New: Eduge #2492 (@enod)
Update: xor_tydi_qa - update version #2455 (@cccntu)
Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
Update: udpos - change features structure #2466 (@jerryIsHere)
Update: WebNLG - update checksums #2558 (@lhoestq)
Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
Fix: proto_qa - fix download link #2463 (@mariosasko)
Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
Fix: codesearchnet - fix keys #2555 (@lhoestq)
Fix: discofuse - fix link cc #2541 (@VictorSanh)
Fix: fever - fix keys #2557 (@lhoestq)

Datasets Features

Dataset Streaming #2375 #2582 (@lhoestq)
- Fast download and process your data on-the-fly when iterating over your dataset
- Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
JAX integration #2502 (@lhoestq)
Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
Set configurable downloaded datasets path #2488 (@albertvillanova)
Set configurable extracted datasets path #2487 (@albertvillanova)
Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
Add interleave_datasets for map-style datasets #2568 (@lhoestq)
Add load_dataset_builder #2500 (@mariosasko)
Support Zstandard compressed files #2578 (@albertvillanova)

Task templates

Add task templates for tydiqa and xquad #2518 (@lewtun)
Insert text classification template for Emotion dataset #2521 (@lewtun)
Add summarization template #2529 (@lewtun)
Add task template for automatic speech recognition #2533 (@lewtun)
Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
Allow latest pyarrow version #2490 (@albertvillanova)
Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
Add Zenodo metadata file with license #2501 (@albertvillanova)
add tensorflow-macos support #2493 (@slayerjain)
Keep original features order #2453 (@albertvillanova)
Add course banner #2506 (@sgugger)
Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
Improve performance of pandas arrow extractor #2519 (@albertvillanova)
Fix fingerprint when moving cache dir #2509 (@lhoestq)
Replace bad n>1M size tag #2527 (@lhoestq)
Fix dev version #2531 (@lhoestq)
Sync with transformers disabling NOTSET #2534 (@albertvillanova)
Fix logging levels #2544 (@albertvillanova)
Add support for Split.ALL #2259 (@mariosasko)
Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
Make numpy arrow extractor faster #2505 (@lhoestq)
fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
Add ASR task and new languages to resources #2567 (@lewtun)
Filter expected warning log from transformers #2571 (@albertvillanova)
Fix BibTeX entry #2579 (@albertvillanova)
Fix Counter import #2580 (@albertvillanova)
Add aiohttp to tests extras require #2587 (@albertvillanova)
Add language tags #2590 (@lewtun)
Support pandas 1.3.0 read_csv #2593 (@lhoestq)
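
The `Dataset.map` fix above (#2566) concerns the case where more worker processes are requested than there are rows, which would otherwise leave some workers with nothing to do. One way to handle it, sketched here as an assumption rather than the library's actual code, is to cap the worker count at the number of rows before splitting the work; both helpers are hypothetical.

```python
def effective_num_proc(num_rows: int, num_proc: int) -> int:
    """Cap the worker count at the number of rows, with a minimum of one."""
    return max(1, min(num_proc, num_rows))


def shard_ranges(num_rows: int, num_proc: int):
    """Split row indices into contiguous (start, stop) ranges, one per worker."""
    workers = effective_num_proc(num_rows, num_proc)
    base, extra = divmod(num_rows, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` workers take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(shard_ranges(3, 8))
print(shard_ranges(10, 4))
```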

Dataset cards

Updated Dataset Description #2420 (@binny-mathew)
Update DatasetMetadata and ReadMe #2436 (@gchhablani)
CRD3 dataset card #2515 (@wilsonyhlee)
Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)

Docs

no s at load_datasets #2479 (@julien-c)
Fix docs custom stable version #2477 (@albertvillanova)
Improve Features docs #2535 (@albertvillanova)
Update README.md #2414 (@cryoff)
Fix FileSystems documentation #2551 (@connor-mccarthy)
Minor fix in loading metrics docs #2562 (@albertvillanova)
Minor fix docs format for bertscore #2570 (@albertvillanova)
Add streaming in load a dataset docs #2574 (@lhoestq)

- Python
Published by lhoestq about 5 years ago

datasets - 1.8.0

Datasets Changes

New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
New: KLUE benchmark #2416 (@jungwhank)
New: HendrycksTest #2370 (@andyzoujm)
Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
Fix: flores - fix download link #2448 (@mariosasko)

Datasets Features

Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
Support sliced list arrays in cast #2461 (@lhoestq)
- Dataset.cast can now change the feature types of Sequence fields
Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
- we used to have the datasets IN_MEMORY_MAX_SIZE set to 250MB
- we changed this to zero: by default datasets are loaded from the disk with memory mapping and not copied in memory
- users can still set keep_in_memory=True when loading a dataset to load it in memory
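
The breaking change above comes down to a small decision rule: with a maximum size of zero, nothing is copied into memory unless `keep_in_memory=True` is passed. The sketch below restates that rule in plain Python; `should_keep_in_memory` is a hypothetical function written for illustration, and reading 250MB as 250 * 1024**2 bytes is an assumption.

```python
MB = 1024**2  # assumption: interpret "250MB" as 250 * 1024**2 bytes


def should_keep_in_memory(dataset_size, max_in_memory_size=0, keep_in_memory=None):
    """Decide whether a dataset is copied into memory or memory-mapped from disk."""
    if keep_in_memory is not None:
        # An explicit user choice always wins.
        return keep_in_memory
    if dataset_size and max_in_memory_size:
        return dataset_size < max_in_memory_size
    # A maximum size of zero disables the automatic in-memory copy.
    return False


print(should_keep_in_memory(100 * MB, max_in_memory_size=250 * MB))  # previous default
print(should_keep_in_memory(100 * MB))                               # new default
print(should_keep_in_memory(100 * MB, keep_in_memory=True))          # explicit opt-in
```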

Datasets Cards

adds license information for DailyDialog. #2419 (@aditya2211)
add english language tags for ~100 datasets #2442 (@VictorSanh)
Add copyright info to MLSUM dataset #2427 (@PhilipMay)
Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)

General improvements and bug fixes

Add DOI badge to README #2411 (@albertvillanova)
Make datasets PEP-561 compliant #2417 (@SBrandeis)
Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
Fix CI six installation on linux #2432 (@lhoestq)
Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
add utf-8 while reading README #2418 (@bhavitvyamalik)
Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
Rename config and environment variable for in memory max size #2454 (@albertvillanova)
Add version-specific BibTeX #2430 (@albertvillanova)
Fix cross-reference typos in documentation #2456 (@albertvillanova)
Better error message when using the wrong load_from_disk #2437 (@lhoestq)

Experimental and work in progress: Format a dataset for specific tasks

Update text classification template labels in DatasetInfo __post_init__ #2392 (@lewtun)
Insert task templates for text classification #2389 (@lewtun)
Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)

- Python
Published by lhoestq about 5 years ago

datasets - 1.7.0

Dataset Changes

New: NLU evaluation data #2238 (@dkajtoch)
New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
New: Bbaw egyptian #2290 (@phiwi)
New: GooAQ #2260 (@bhavitvyamalik)
New: SubjQA #2302 (@lewtun)
New: Ascent KB #2341, #2349 (@phongnt570)
New: HLGD #2325 (@tingofurro)
New: Qasper #2346 (@cceyda)
New: ConvQuestions benchmark #2372 (@PhilippChr)
Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
Update: GEM - the DART file checksums in GEM #2334 (@yjernite)
Update: web_science - fixed download link #2338 (@bhavitvyamalik)
Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
Update: conll2003 - correct labels #2369 (@philschmid)
Update: offenseval_dravidian - update citations #2385 (@adeepH)
Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
Fix: indicglue - Fix number of classes in indicglue sna.bn dataset #2397 (@albertvillanova)
Fix: head_qa - Fix keys #2408 (@lhoestq)

Dataset Features

Implement Dataset add_item #1870 (@albertvillanova)
Implement Dataset add_column #2145 (@albertvillanova)
Implement Dataset to JSON #2248, #2352 (@albertvillanova)
Add rename_columnS method #2312 (@SBrandeis)
add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)

Metric Changes

New: CUAD metrics #2273 (@bhavitvyamalik)
New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
Update: CER - Docs, CER above 1 #2342 (@borisdayma)

General improvements and bug fixes

Update black #2265 (@lhoestq)
Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
Fix query table with iterable #2269 (@lhoestq)
Perform minor refactoring: use config #2253 (@albertvillanova)
Update format, fingerprint and indices after add_item #2254 (@lhoestq)
Always update metadata in arrow schema #2274 (@lhoestq)
Make tests run faster #2266 (@lhoestq)
Fix metadata validation with config names #2286 (@lhoestq)
Fixed typo seperate->separate #2292 (@laksh9950)
Allow collaborators to self-assign issues #2289 (@albertvillanova)
Mapping in the distributed setting #2298 (@TevenLeScao)
Fix conda release #2309 (@lhoestq)
Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
Set default name in init_dynamic_modules #2320 (@albertvillanova)
Fix duplicate keys #2333 (@lhoestq)
Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
Metadata validation #2107 (@theo-m)
Add Validation For README #2121 (@gchhablani)
Fix overflow issue in interpolation search #2336 (@mariosasko)
Datasets cli improvements #2315 (@mariosasko)
Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
More consistent copy logic #2340 (@mariosasko)
Update README validation rules #2353 (@gchhablani)
normalized TOCs and titles in data cards #2355 (@yjernite)
simplify faiss index save #2351 (@Guitaricet)
Allow "other-X" in licenses #2368 (@gchhablani)
Improve ReadInstruction logic and update docs #2261 (@mariosasko)
Disallow duplicate keys in yaml tags #2379 (@lhoestq)
maintain YAML structure reading from README #2380 (@bhavitvyamalik)
add dataset card title #2381 (@bhavitvyamalik)
Add tests for dataset cards #2348 (@gchhablani)
Improve example in rounding docs #2383 (@mariosasko)
Paperswithcode dataset mapping #2404 (@julien-c)
Free datasets with cache file in temp dir on exit #2403 (@mariosasko)

Experimental and work in progress: Format a dataset for specific tasks

Task formatting for text classification & question answering #2255 (@SBrandeis)
Add check for task templates on dataset load #2390 (@lewtun)
Add args description to DatasetInfo #2384 (@lewtun)
Improve task api code quality #2376 (@mariosasko)

- Python
Published by lhoestq about 5 years ago

datasets - 1.6.2

Fix memory issue: don't copy record batches in memory during a table deepcopy #2291 (@lhoestq). This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.

Breaking change: when using Dataset.map with the input_columns parameter, the resulting dataset only has the columns from input_columns and the columns added by the map function. The other columns are discarded.

- Python
Published by lhoestq about 5 years ago

datasets - 1.6.1

Fix memory issue in multiprocessing: Don't pickle table index #2264 (@lhoestq)

- Python
Published by lhoestq about 5 years ago

datasets - 1.6.0

Dataset changes

New: MOROCO #2002 (@MihaelaGaman)
New: CBT dataset #2044 (@gchhablani)
New: MDD Dataset #2051 (@gchhablani)
New: Multilingual dIalogAct benchMark (miam) #2047 (@eusip)
New: bAbI QA tasks #2053 (@gchhablani)
New: machine translated multilingual STS benchmark dataset #2090 (@PhilipMay)
New: EURLEX legal NLP dataset #2114 (@iliaschalkidis)
New: ECtHR legal NLP dataset #2114 (@iliaschalkidis)
New: EU-REG-IR legal NLP dataset #2114 (@iliaschalkidis)
New: NorNE dataset for Norwegian POS and NER #2154 (@versae)
New: banking77 #2140 (@dkajtoch)
New: OpenSLR #2173 #2215 #2221 (@cahya-wirawan)
New: CUAD dataset #2219 (@bhavitvyamalik)
Update: Gem V1.1 + new challenge sets #2142 #2186 (@yjernite)
Update: Wikiann - added spans field #2141 (@rabeehk)
Update: XTREME - Add tel to xtreme tatoeba #2180 (@lhoestq)
Update: GLUE MRPC - added real label to test set #2216 (@philschmid)
Fix: MultiWoz22 - fix dialogue action slot name and value #2136 (@adamlin120)
Fix: wikiauto - fix link #2171 (@mounicam)
Fix: wino_bias - use right splits #1930 (@JieyuZhao)
Fix: lc_quad - update download checksum #2213 (@mariosasko)
Fix: newsgroup - fix one instance of 'train' to 'test' #2225 (@alexwdong)
Fix: xnli - fix tuple key #2233 (@NikhilBartwal)

Dataset features

Allow stateful function in dataset.map #1960 (@mariosasko)
MIAM dataset - new citation details #2101 (@eusip)
[Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset #2025 (@lhoestq)
Allow pickling of big in-memory tables #2150 (@lhoestq)
updated user permissions based on umask #2086 #2157 (@bhavitvyamalik)
Fast table queries with interpolation search #2122 (@lhoestq)
Concat only unique fields in DatasetInfo.from_merge #2163 (@mariosasko)
Implementation of class_encode_column #2184 #2227 (@SBrandeis)
Add support for axis in concatenate datasets #2151 (@albertvillanova)
Set default in-memory value depending on the dataset size #2182 (@albertvillanova)
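
To illustrate the "Fast table queries with interpolation search" item above, here is a minimal sketch of the idea, assuming a table stored as several chunks whose cumulative row counts are known. The function name `find_batch` and the `offsets` layout are hypothetical and are not taken from the library.

```python
def find_batch(offsets, row):
    """Return j such that offsets[j] <= row < offsets[j + 1].

    offsets is a strictly increasing list of cumulative row counts
    that starts at 0 and ends at the total number of rows.
    """
    if not 0 <= row < offsets[-1]:
        raise IndexError(f"row {row} is out of range")
    lo, hi = 0, len(offsets) - 1
    while hi - lo > 1:
        # Guess the chunk by assuming rows are spread evenly between lo and hi.
        guess = lo + (row - offsets[lo]) * (hi - lo) // (offsets[hi] - offsets[lo])
        guess = min(max(guess, lo), hi - 1)
        if offsets[guess] <= row < offsets[guess + 1]:
            return guess
        if row < offsets[guess]:
            hi = guess
        else:
            lo = guess + 1
    return lo


# Four chunks holding 10, 15, 15 and 60 rows.
offsets = [0, 10, 25, 40, 100]
print([find_batch(offsets, r) for r in (0, 9, 10, 39, 99)])  # [0, 0, 1, 2, 3]
```

When chunks have similar sizes the first guess is usually right, which is why this tends to beat a plain binary search for row lookups.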

Metrics changes

New: CER metric #2138 (@chutaklee)
Update: WER - Compute metric iteratively #2111 (@albertvillanova)
Update: seqeval - configurable options to seqeval metric #2204 (@marrodion)

Dataset cards

REFreSD: Updated card using information from data statement and datasheet #2082 (@mcmillanmajora)
Winobias: fix split infos #2152 (@JieyuZhao)
all: Fix size categories in YAML Tags #2074 (@gchhablani)
LinCE: Updating citation information on LinCE readme #2205 (@gaguilar)
Swda: Update README.md #2235 (@PierreColombo)

General improvements and bug fixes

Refactorize Metric.compute signature to force keyword arguments only #2079 (@albertvillanova)
Fix max_wait_time in requests #2085 (@lhoestq)
Fix copy snippet in docs #2091 (@mariosasko)
Fix deprecated warning message and docstring #2100 (@albertvillanova)
Move Dataset.to_csv to csv module #2102 (@albertvillanova)
Fix: Allows a feature to be named "_type" #2093 (@dcfidalgo)
copy.deepcopy os.environ instead of copy #2119 (@NihalHarish)
Replace legacy torch.Tensor constructor with torch.tensor #2126 (@mariosasko)
Implement Dataset as context manager #2113 (@albertvillanova)
Fix missing infos from concurrent dataset loading #2137 (@lhoestq)
Pin fsspec lower than 0.9.0 #2172 (@lhoestq)
Replace assertTrue(isinstance with assertIsInstance in tests #2164 (@mariosasko)
add social thumbnail #2177 (@philschmid)
Fix s3fs tests for py36 and py37+ #2183 (@lhoestq)
Fix typo in huggingface hub #2192 (@LysandreJik)
Update metadata if dataset features are modified #2087 (@mariosasko)
fix missing indices_files in load_from_disk #2197 (@lhoestq)
Fix backward compatibility in Dataset.load_from_disk #2199 (@albertvillanova)
Fix ArrowWriter overwriting features in ArrowBasedBuilder #2201 (@lhoestq)
Fix incorrect assertion in builder.py #2110 (@dreamgonfly)
Remove Python2 leftovers #2208 (@mariosasko)
Revert breaking change in cache_files property #2217 (@lhoestq)
Set test cache config #2223 (@albertvillanova)
Fix map when removing columns on a formatted dataset #2231 (@lhoestq)
Refactorize tests to use Dataset as context manager #2191 (@albertvillanova)
Preserve split type when reloading dataset #2168 (@mariosasko)

Docs

make documentation more clear to use different cloud storage #2127 (@philschmid)
Render docstring return type as inline #2147 (@albertvillanova)
Add table classes to the documentation #2155 (@lhoestq)
Pin docutils for better doc #2174 (@sgugger)
Fix docstrings issues #2081 (@albertvillanova)
Add code of conduct to the project #2209 (@albertvillanova)
Add classes GenerateMode, DownloadConfig and Version to the documentation #2202 (@albertvillanova)
Fix bash snippet formatting in ADD_NEW_DATASET.md #2234 (@mariosasko)

- Python
Published by lhoestq about 5 years ago

datasets - 1.5.0

Datasets changes

New: Europarl Bilingual #1874 (@lucadiliello)
New: Stanford Sentiment Treebank #1961 (@patpizio)
New: RO-STS #1978 (@lorinczb)
New: newspop #1871 (@frankier)
New: FashionMNIST #1999 (@gchhablani)
New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
New: Cryptonite #2013 (@theo-m)
New: RoSent #2011 (@gchhablani)
New: PersiNLU reading-comprehension #2028 (@danyaljj)
New: conllpp #1991 (@ZihanWangKi)
New: LaRoSeDa #2004 (@MihaelaGaman)
Update: unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
Update: Md gender - card update #2018 (@mcmillanmajora)
Update: XQuAD - add Romanian #2023 (@M-Salti)
Update: DROP - all answers #1980 (@KaijuML)
Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
Fix: Doc2dial update datainfos and dataloaders #2041 (@songfeng)
Fix: ZEST - update download link #2057 (@matt-peters)
Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)

Datasets Features

Implement Dataset from CSV #1946 (@albertvillanova)
Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
Implement Dataset from text #2030 (@albertvillanova)
Optimize int precision for tokenization #1985 (@albertvillanova)
- This saves 75%+ of space when tokenizing a dataset
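
As a rough sanity check on that figure, the sketch below compares per-token storage for a few typical tokenizer output columns. The column names and the chosen integer widths (64-bit before, 32-bit ids and 8-bit masks after) are assumptions made for illustration; the release note does not state them.

```python
import struct

# Hypothetical tokenizer output columns, one integer per token each.
# Baseline: every column stored as a 64-bit integer.
baseline = {"input_ids": "q", "token_type_ids": "q", "attention_mask": "q"}
# Assumed optimized layout: 32-bit ids, 8-bit masks.
optimized = {"input_ids": "i", "token_type_ids": "b", "attention_mask": "b"}


def bytes_per_token(layout):
    # "<" selects standard sizes, so the result is platform independent.
    return sum(struct.calcsize("<" + code) for code in layout.values())


before = bytes_per_token(baseline)   # 3 * 8 = 24 bytes
after = bytes_per_token(optimized)   # 4 + 1 + 1 = 6 bytes
saving = 1 - after / before
print(f"{before} -> {after} bytes per token ({saving:.0%} saved)")
```

Under these assumptions the saving is exactly 75%; adding more mask-like columns pushes it higher.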

General Bug fixes and improvements

Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
Fix various typos/grammar in the docs #2008 (@mariosasko)
Update format columns in Dataset.rename_columns #2027 (@mariosasko)
Replace print with logging in dataset scripts #2019 (@mariosasko)
Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
Not all languages have 2 digit codes. #2016 (@asiddhant)
Fix arrow memory checks issue in tests #2042 (@lhoestq)
Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
Fix text-classification tags #2049 (@gchhablani)
Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
Add and fix docstring for NamedSplit #2069 (@albertvillanova)
Bump huggingface_hub version #2077 (@SBrandeis)
Fix docstring issues #2072 (@albertvillanova)

- Python
Published by lhoestq over 5 years ago

datasets -

Fix an issue #1981 with WMT downloads #1982 (@albertvillanova)

- Python
Published by lhoestq over 5 years ago

datasets - 1.4.0

Datasets Changes

New: iapp_wiki_qa_squad #1873 (@cstorm125)
New: Financial PhraseBank #1866 (@frankier)
New: CoVoST2 #1935 (@patil-suraj)
New: TIMIT #1903 (@vrindaprabhu)
New: Mlama (multilingual lama) #1931 (@pdufter)
New: FewRel #1823 (@gchhablani)
New: CCAligned Multilingual Dataset #1815 (@gchhablani)
New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
Update: WMT - use mirror links #1912 for better download speed (@lhoestq)
Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)

Datasets Features

Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
Add to_csv for Dataset #1887 (@SBrandeis)
Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
Add cross-platform support for datasets-cli #1951 (@mariosasko)

Metrics Changes

New: sari metric #1875 (@ddhruvkr)

Offline loading

Handle timeouts #1952 (@lhoestq)
Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)

General improvements and bugfixes

Replace flatten_nested #1879 (@albertvillanova)
add missing info on how to add large files #1885 (@stas00)
Docs for adding new column on formatted dataset #1888 (@lhoestq)
Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
Standardizing datasets dtypes #1921 (@justin-yan)
Remove unused py_utils objects #1916 (@albertvillanova)
Fix save_to_disk with relative path #1923 (@lhoestq)
Updating old cards #1928 (@mcmillanmajora)
Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
Fix builder config creation with data_dir #1932 (@lhoestq)
Disallow ClassLabel with no names #1938 (@lhoestq)
Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
typos + grammar #1955 (@stas00)
Fix unused arguments #1962 (@mariosasko)
Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)

- Python
Published by lhoestq over 5 years ago

datasets -

Dataset Features

On-the-fly data transforms (#1795)
ADD S3 support for downloading and uploading processed datasets (#1723)
Allow loading dataset in-memory (#1792)
Support future datasets (#1813)
Enable/disable caching (#1703)
Offline dataset loading (#1726)

Datasets Hub Features

Loading from the Datasets Hub (#1860): this allows users to create their own dataset repositories in the Datasets Hub and then load them using the library. Repositories can be created on the website (https://huggingface.co/new-dataset) or using the huggingface-cli. More information is available in the dataset sharing section of the documentation.

Dataset Changes

New: LJ Speech (#1878)
New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
New: cord 19 (#1850)
New: Tweet Eval Dataset (#1829)
New: CIFAR-100 Dataset (#1812)
New: SICK (#1804)
New: BBC Hindi NLI Dataset (#1158)
New: Freebase QA Dataset (#1814)
New: Arabic sarcasm (#1798)
New: Semantic Scholar Open Research Corpus (#1606)
New: DuoRC Dataset (#1800)
New: Aggregated dataset for the GEM benchmark (#1807)
New: CC-News dataset of English language articles (#1323)
New: irc disentangle (#1586)
New: Narrative QA Manual (#1778)
New: Universal Morphologies (#1174)
New: SILICONE (#1761)
New: Librispeech ASR (#1767)
New: OSCAR (#1694, #1868, #1833)
New: CANER Corpus (#1684)
New: Arabic Speech Corpus (#1852)
New: id_liputan6 (#1740)
New: Structured Argument Extraction for Korean dataset (#1748)
New: TurkCorpus (#1732)
New: Hatexplain Dataset (#1716)
New: adversarialQA (#1714)
Update: Doc2dial - reading comprehension update to latest version (#1816)
Update: OPUS Open Subtitles - add with metadata information (#1865)
Update: SWDA - use all metadata features (#1799)
Update: SWDA - add metadata and correct splits (#1749)
Update: CommonGen - update citation information (#1787)
Update: SciFact - update URL (#1780)
Update: BrWaC - update features name (#1736)
Update: TLC - update urls to be github links (#1737)
Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
Fix: multi_woz_v22 - fix checksums (#1880)
Fix: limit - fix url (#1861)
Fix: WebNLG - fix test set + more fields (#1739)
Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
Fix: reuters - add missing "brief" entries (#1744)
Fix: thainer: empty token bug (#1734)
Fix: lst20: empty token bug (#1734)

Metrics Changes

New: Word Error Metric (#1847)
New: COMET (#1577, #1753)
Fix: bert_score - set version dependency (#1851)

Metric Docs

Add metrics usage examples and tests (#1820)

CLI Changes

[BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI

Bug fixes

fix writing GPU Faiss index (#1862)
update pyarrow import warning (#1782)
Ignore definition line number of functions for caching (#1779)
update saving and loading methods for faiss index to accept path-like objects (#1663)
Print error message with filename when malformed CSV (#1826)
Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)

Refactoring

Refactoring: Create config module (#1848)
Use a config id in the cache directory names for custom configs (#1754)

Logging

Enable logging propagation and remove logging handler (#1845)

- Python
Published by lhoestq over 5 years ago

datasets - 1.2.1

New Features

Fast start up (#1690): Importing datasets is now significantly faster.

Datasets Changes

New: MNIST (#1730)
New: Korean intonation-aided intention identification dataset (#1715)
New: Switchboard Dialog Act Corpus (#1678)
Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
Update: Scientific papers - Mirror datasets zip (#1721)
Update: Update DBRD dataset card and download URL (#1699)
Fix: Thainer - fix ner_tag bugs (#1695)
Fix: reuters21578 - metadata parsing errors (#1693)
Fix: ade_corpus_v2 - fix config names (#1689)
Fix: DaNE - fix last example (#1688)

Datasets tagging

rename "part-of-speech-tagging" tag in some dataset cards (#1645)

Bug Fixes

Fix column list comparison in transmit format (#1719)
Fix windows path scheme in cached path (#1711)

Docs

Add information about caching and verifications in "Load a Dataset" docs (#1705)

Moreover, many dataset cards of datasets added during the sprint were updated! Thanks to all the contributors :)

- Python
Published by lhoestq over 5 years ago

datasets -

Intermediate release before v2.0.0. Includes all the datasets added during the datasets sprint of December 2020 (currently over 610 datasets).

- Python
Published by lhoestq over 5 years ago

datasets - 1.1.3

Datasets changes

New: NLI-Tr (#787)
New: Amazon Reviews (#791)(#844)(#845)(#799)
New: ASNQ - answer sentence selection (#780)
New: OpenBookCorpus (#856)
New: ASLG-PC12 - sign language translation (#731)
New: Quail - question answering dataset (#747)
Update: SNLI: Created dataset card snli.md (#663)
Update: csv - Use pandas reader in csv (#857)
- Better memory management
- Breaking: the previous read_options, parse_options and convert_options are replaced with plain parameters like pandas.read_csv
Update: conll2000, conll2003, germeval14, wnut17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
- Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
Update: XNLI - Add XNLI train set (#781)
Update: XSUM - Use full released xsum dataset (#754)
Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
Update: CLUE - add OCNLI, a new CLUE dataset (#742)
Fix: KOR-NLI - Fix csv reader (#855)
Fix: Discofuse - fix discofuse urls (#793)
Fix: Emotion - fix description (#745)
Fix: TREC - update urls (#740)

Metrics changes

New: accuracy, precision, recall and F1 metrics (#825)
Fix: squad_v2 (#840)
Fix: seqeval (#810)(#738)
Fix: Rouge - fix description (#774)
Fix: GLUE - fix description (#734)
Fix: BertScore - fix custom baseline (#763)

Command line tools

add clear_cache parameter in the test command (#863)

Dependencies

Integrate file_lock inside the lib for better logging control (#859)

Dataset features

Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
pretty print dataset objects (#725)
allow custom split names in text dataset (#776)

Tests

All configs is a slow test now

Bug fixes

Make save function use deterministic global vars order (#819)
fix type hints pickling in python 3.6 (#818)
fix metric deletion when attributes are missing (#782)
Fix custom builder caching (#770)
Fix metric with cache dir (#772)
Fix train_test_split output format (#719)

- Python
Published by lhoestq over 5 years ago

datasets -

Dataset changes

Fix: text - use python read instead of pandas reader (#715):
- fix delimiter/overflow issues
- better memory handling

Bug fixes

Fix dataset configuration creation using data_files per splits using NamedSplit (#706)
Fix permission issue on windows - don't use tqdm 4.50.0 (#718)

- Python
Published by lhoestq almost 6 years ago

datasets - 1.1.0: Windows support, Better Multiprocessing, New Datasets

Windows support

Add Windows support (#644):
- add tests and CI for Windows
- fix numerous windows specific issues
- The library now fully supports Windows

Dataset changes

New: HotpotQA (#703)
New: OpenWebText (#660)
New: Winogrande - add debiased subset (#655)
Update: XNLI - update download link (#695)
Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
Update: csv - add features parameter to CSV (#685)
Fix: GAP - fix wrong computation of boolean features (#680)
Fix: C4 - fix manual instruction function (#681)

Metric changes

Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
Fix: SQuAD - fix kwargs description (#670)

Dataset Features

Use multiprocess from pathos for multiprocessing (#656):
- allow lambda functions in multiprocessed map
- allow local functions in multiprocessed map
- and more ! As long as functions are compatible with dill
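
The motivation can be seen with the standard library alone: the default pickle module serializes functions by reference (module plus name), so lambdas and local functions cannot be sent to worker processes. The sketch below only demonstrates that limitation; the helper names are hypothetical, and it does not use multiprocess or dill themselves.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_function():
    def add_one(x):  # defined inside another function, so not importable by name
        return x + 1
    return add_one


print(can_pickle(len))                    # built-in, pickled by reference
print(can_pickle(lambda x: x + 1))        # lambda
print(can_pickle(make_local_function()))  # local function
```

With the standard pickle only the first call succeeds. A serializer such as dill, which can capture the function body itself, removes this restriction.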

Bug fixes

Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
Datasets: fix cast with unordered features - fix column order issue in cast (#684)
Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
Datasets: fix empty dictionaries after multiprocessing - keep new columns in transmit format (#659)
Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
Metrics: fix compute metric with empty input - pass metric features to the reader (#654)

Documentation

Elasticsearch integration documentation (#696)

Tests

Use GitHub instead of AWS in remote dataset tests (#694)

- Python
Published by lhoestq almost 6 years ago

datasets -

Dataset changes:

New: CoNLL-2003 (#613)
New: CoNLL-2000 (#634)
New: MATINF (ACL 2020) (#637)
New: Polyglot-NER (#641)
Update: GLUE - update GLUE urls (now hosted on FB) (#626)
Update: GLUE/qqp - update download checksum (#639)
Update: MLQA - feature names update (#627)
Update: LinCE - update feature names - Consistent ner features (#636)
Update: WNUT 17: update feature names - Consistent ner features (#642)
Update: XTREME/PAN-X - update feature names - Consistent ner features (#636)
Update: RACE - update dataset checksum + add new configurations (#540)
Fix: text - fix delimiter (#631)
Fix: Wiki DPR - fix download error in wiki_dpr (f38a871)

Logging:

Set level to warning (previously info) (#635)

Bug fixes:

make shuffle compatible with temp_seed (#640)
don't use take on dataset table (offset overflow error) (#645)
handle connection error when downloading from HF google storage (#652)

- Python
Published by lhoestq almost 6 years ago

datasets -

Fix: - add multiprocessing to dataset dict (#612)

- Python
Published by lhoestq almost 6 years ago

datasets - 1.0.0 Release: New name, Speed-ups, Multimodal, Serialization

Package Changes

Rename: nlp -> datasets

Update now with pip install datasets

Dataset Features

Keep the dataset format after dataset transforms (#607)
Pickle support (#536)
Save and load datasets to/from disk (#571)
Multiprocessing in map and filter (#552)
Multi-dimensional arrays support for multi-modal datasets (#533, #363)
Speed up Tokenization by optimizing casting to python objects (#523)
Speed up shuffle/shard/select methods - use indices mappings (#513)
Add input_column parameter in map and filter (#475)
Speed up download and processing (#563)
Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
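
The "use indices mappings" speed-up for shuffle/shard/select listed above can be pictured as follows: rather than rewriting the rows, each transform only produces a new list of row indices over the same underlying data. This is a toy sketch under that assumption; the class `IndexedView` is hypothetical and is not the library's implementation.

```python
import random


class IndexedView:
    """Toy dataset view: rows stay in place, only a list of indices changes."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared, never copied or reordered
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]

    def select(self, positions):
        # Compose the new positions with the existing mapping.
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shuffle(self, seed):
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        return self.select(order)

    def shard(self, num_shards, index):
        return self.select(range(index, len(self), num_shards))


rows = ["a", "b", "c", "d", "e", "f"]
view = IndexedView(rows).shuffle(seed=0).shard(num_shards=2, index=0)
print(len(view), [view[i] for i in range(len(view))])
```

Each transform costs time proportional to the number of indices, regardless of how large the individual rows are.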

Dataset Changes

New: IWSLT 2017 (#470)
New: CommonGen Dataset (#578)
New: CLUE Benchmark (11 datasets) (#572)
New: the KILT knowledge source and tasks (#559)
New: DailyDialog (#556)
New: DoQA dataset (ACL 2020) (#473)
New: reuters21578 (#570)
New: HANS (#551)
New: MLSUM (#529)
New: Guardian authorship (#452)
New: web_questions (#401)
New: MS MARCO (#364)
Update: Germeval14 - update download url (#594)
Update: LinCE - update download url (#550)
Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
Update: Rotten Tomatoes - update download url (#484)
Update: Wiki DPR - Use HNSW faiss index (#500)
Update: Text - Speed up using multi-threaded PyArrow loading (#548)
Fix: GLUE, PAWS-X - skip header (#497)

[Breaking] Update Dataset and DatasetDict API (#459)

Rename the flatten, drop and dictionary_encode_column methods to flatten_, drop_ and dictionary_encode_column_ to indicate that these methods have in-place effects
Remove the dataset.columns property and dataset.nbytes
Add a few more properties and methods to DatasetDict

Metric Features

Disallow the use of positional arguments to avoid predictions vs references mistakes (#466)
Allow to directly feed numpy/pytorch/tensorflow/pandas objects in metrics (#466)
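
To see why positional arguments are risky for metrics, consider an asymmetric metric such as precision: swapping the two inputs silently yields a different number. The sketch below uses Python's keyword-only parameters to rule that out. The function `precision` is a hypothetical toy metric, not the library's code.

```python
def precision(*, predictions, references):
    """Toy binary precision: true positives / predicted positives."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    predicted = sum(1 for p in predictions if p == 1)
    true_pos = sum(1 for p, r in zip(predictions, references) if p == 1 and r == 1)
    return true_pos / predicted if predicted else 0.0


preds = [1, 1, 0, 0]
refs = [1, 0, 1, 1]

correct = precision(predictions=preds, references=refs)  # 1 / 2
swapped = precision(predictions=refs, references=preds)  # 1 / 3
print(correct, swapped)

try:
    precision(preds, refs)  # positional call is rejected
except TypeError as err:
    print("rejected:", err)
```

The bare `*` in the signature makes every following parameter keyword-only, so the caller has to name which list is which.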

Metric Changes

New: METEOR metric (#479)
Fix: Sacrebleu - fix inputs format (#520)

Loading script Features

Pin the version of the scripts (reproducibility) (#603, #584)
Specify default script_version with the env variable HF_SCRIPTS_VERSION (#584)
Save scripts in a modules cache directory that can be controlled with HF_MODULES_CACHE (#574)

Caching

Better support for tokenizers when caching map results (#601)
Faster caching for text dataset (#573, #502)
Use dataset fingerprints, updated after each transform (#536)
Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
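
The idea behind fingerprints that are updated after each transform can be sketched as a hash chain: the new fingerprint depends on the previous one plus the transform and its arguments, so identical pipelines map to the same cache key. This is only an illustration; it uses hashlib and json as stand-ins, the names are hypothetical, and the library's actual hashing scheme is not described in these notes.

```python
import hashlib
import json


def update_fingerprint(previous, transform_name, transform_args):
    """Derive a new fingerprint from the previous one and the transform applied."""
    payload = json.dumps(
        {"previous": previous, "transform": transform_name, "args": transform_args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


initial = hashlib.sha256(b"my_dataset/train").hexdigest()[:16]
shuffled = update_fingerprint(initial, "shuffle", {"seed": 42})
shuffled_again = update_fingerprint(initial, "shuffle", {"seed": 42})
other_seed = update_fingerprint(initial, "shuffle", {"seed": 7})

# Same starting state and same transform give the same fingerprint,
# so a cached result can be looked up under that key and reused.
cache_hit = shuffled == shuffled_again
cache_miss = shuffled != other_seed
print(cache_hit, cache_miss)  # True True
```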

Documentation

Metrics documentation (#579)

Miscellaneous

Add centralized logging - Bump-up cache loads to warnings (#538)

Bug fixes

Datasets: [Breaking] fixed typo in "formated_as" method: rename formated to formatted (#516)
Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
Datasets: fixed select method for pyarrow < 1.0.0 (#585)
Datasets: fixed elasticsearch result ids returning as strings (#487)
Datasets: fixed config used for slow test on real dataset (#527)
Datasets: fixed tensorflow-formatted datasets outputs by using ragged tensor by default (#530)
Datasets: fixed batched map for formatted dataset (#515)
Datasets: fixed encodings issues on Windows - apply utf-8 encoding to all datasets (#481)
Datasets: fixed dataset.map for function without outputs (#506)
Datasets: fixed bad type in overflow check (#496)
Datasets: fixed dataset info save - don't use beam fs to save info for local cache dir (#498)
Datasets: fixed arrays outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)

- Python
Published by lhoestq almost 6 years ago

datasets - 0.4.0

Datasets Features

add from_pandas and from_dict
add shard method
add rename/remove/cast columns methods
faster select method
add concatenate datasets
add support for taking samples using numpy arrays
add export to TFRecords
add features parameter when loading from text/json/pandas/csv or when using the map transform
add support for nested features for json
add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
add support for post processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a faiss index to the dataset, in order to be able to query passages for Open Domain QA for example
add indexing using FAISS or ElasticSearch:
- add add_faiss_index and add_elasticsearch_index methods
- add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
- add search and search_batch to query the index and return examples ids
- add save_faiss_index/load_faiss_index to save/load a serialized faiss index

Datasets changes

new: PG19
new: ANLI
new: WikiSQL
new: qa_zre
new: MWSC
new: AG news
new: SQuADShifts
new: doc red
new: Wiki DPR
new: fever
new: hyperpartisan news detection
new: pandas
new: text
new: emotion
new: quora
new: BioMRC
new: web questions
new: search QA
new: LinCE
new: TREC
new: Style Change Detection
new: 20newsgroup
new: social bias frames
new: Emo
new: web of science
new: sogou news
new: crd3
update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
update: xtreme - add PAWS-X.es
update: xsum - manual download is no longer required.
new processed: Natural Questions

Metrics Features

add seed parameter for metrics that do sampling, like rouge
better installation messages

Metrics changes

new: bleurt
update: seqeval - fix entities extraction

Bug fixes

fix bug in map and select that was causing memory issues
fix pyarrow version check
fix text/json/pandas/csv caching when loading different files in a row
fix metrics caching when they have different config names
fix cache that was not discarded when there's a KeyboardInterrupt during .map
fix sacrebleu tokenizer's parameter
fix docstrings of metrics when multiple instances are created

More Tests

add tests for features handling in dataset transforms
add tests for dataset builders
add tests for metrics loading

Backward compatibility

because there are changes in the dataset_info.json file format, old versions of the lib (<0.4.0) won't be able to load datasets with a post processing field in dataset_info.json

- Python
Published by lhoestq almost 6 years ago

datasets -

New methods to transform a dataset:
- dataset.shuffle: create a shuffled dataset
- dataset.train_test_split: create a train and a test split (similar to sklearn)
- dataset.sort: create a dataset sorted according to a certain column
- dataset.select: create a dataset with rows selected following the given list of indices

Other features:
- Better instructions for datasets that require manual download
  Important: if you load datasets that require manual downloads with an older version of nlp, instructions won't be shown and an error will be raised
- Better access to dataset information (for instance dataset.features['label'] or dataset.dataset_size)

Datasets:
- New: cos_e v1.0
- New: rotten_tomatoes
- New: German and Italian Wikipedia

New docs:
- documentation about splitting a dataset

Bug fixes:
- fix metric.compute that couldn't write on file
- fix squad_v2 imports

- Python
Published by lhoestq about 6 years ago