datasets

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

https://github.com/huggingface/datasets

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ✓
    Academic publication links
    Links to: arxiv.org, zenodo.org
  • ✓
    Committers with academic emails
    29 of 601 committers (4.8%) from academic institutions
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

ai artificial-intelligence computer-vision dataset-hub datasets deep-learning llm machine-learning natural-language-processing nlp numpy pandas pytorch speech tensorflow

Keywords from Contributors

transformer jax cryptocurrency cryptography language-model named-entity-recognition langchain text-classification anthropic gemini
Last synced: 6 months ago

Repository

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Basic Info
Statistics
  • Stars: 20,569
  • Watchers: 283
  • Forks: 2,921
  • Open Issues: 963
  • Releases: 103
Topics
ai artificial-intelligence computer-vision dataset-hub datasets deep-learning llm machine-learning natural-language-processing nlp numpy pandas pytorch speech tensorflow
Created almost 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Security Authors Zenodo

README.md

Hugging Face Datasets Library


🤗 Datasets is a lightweight library providing two main features:

  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("rajpurkar/squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX).
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like processed_dataset = dataset.map(process_example), efficiently prepare the dataset for inspection and ML model evaluation and training.

🎓 Documentation 🔎 Find a dataset in the Hub 🌟 Share a dataset on the Hub

🤗 Datasets is designed to let the community easily add and share new datasets.

🤗 Datasets has many additional interesting features:

  • Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
  • Smart caching: never wait for your data to be processed several times (see the sketch after this list).
  • Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
  • Built-in interoperability with NumPy, PyTorch, TensorFlow 2, JAX, Pandas, Polars and more.
  • Native support for audio, image and video data.
  • Enable streaming mode to save disk space and start iterating over the dataset immediately.
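
A minimal sketch of the memory-mapping and caching behavior described above, using the same rajpurkar/squad dataset shown in the Usage section below (cache_files and the fingerprint-based map cache are the library's mechanisms referenced here):

```python
from datasets import load_dataset

# The dataset is backed by memory-mapped Arrow files on disk,
# so it is never loaded into RAM all at once.
squad = load_dataset("rajpurkar/squad", split="train")
print(squad.cache_files)  # paths to the on-disk Arrow cache files

# map() results are cached by fingerprint: re-running the same
# transform reloads the cached Arrow file instead of recomputing.
with_length = squad.map(lambda x: {"length": len(x["context"])})
with_length = squad.map(lambda x: {"length": len(x["context"])})  # served from cache
```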

🤗 Datasets originated as a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library.

Installation

With pip

🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

```bash
pip install datasets
```

With conda

🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation
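
As a quick sanity check that the installation worked (a minimal example, assuming network access to the Hugging Face Hub), one can load and print the first example of a public dataset:

```python
from datasets import load_dataset

# Downloads (and caches) the SQuAD training split, then prints one example.
print(load_dataset("rajpurkar/squad", split="train")[0])
```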

Installation to use with Machine Learning & Data frameworks

If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (3.14+), you should also install PyTorch, TensorFlow or JAX. 🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.

For more details on using the library with these frameworks, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart
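
As a rough sketch of this interoperability (assuming PyTorch and Pandas are installed; with_format and to_pandas are the conversion helpers used here):

```python
from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="train")

# View the same memory-mapped data as PyTorch tensors where applicable...
torch_squad = squad.with_format("torch")

# ...or materialize the split as a Pandas DataFrame for analysis.
df = squad.to_pandas()
print(df.columns)
```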

Usage

🤗 Datasets is made to be very simple to use - the API is centered around a single function, datasets.load_dataset(dataset_name, **kwargs), that instantiates a dataset.

This library can be used for text/image/audio/etc. datasets. Here is a quick example to load a text dataset:

```python
from datasets import load_dataset

# Print all the available datasets
from huggingface_hub import list_datasets
print([dataset.id for dataset in list_datasets()])

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('rajpurkar/squad')
print(squad_dataset['train'][0])

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```

If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

```python
# If you want to use the dataset immediately and efficiently stream
# the data as you iterate over the dataset
image_dataset = load_dataset('timm/imagenet-1k-wds', streaming=True)
for example in image_dataset["train"]:
    break
```

For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart and the specific pages on:

  • Loading a dataset: https://huggingface.co/docs/datasets/loading
  • What's in a Dataset: https://huggingface.co/docs/datasets/access
  • Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process
    • Processing audio data: https://huggingface.co/docs/datasets/audio_process
    • Processing image data: https://huggingface.co/docs/datasets/image_process
    • Processing text data: https://huggingface.co/docs/datasets/nlp_process
  • Streaming a dataset: https://huggingface.co/docs/datasets/stream
  • etc.

Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.

You can find:
  • how to upload a dataset to the Hub using your web browser or Python (see the sketch below), and also
  • how to upload it using Git.
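
A minimal sketch of the Python route, assuming you are logged in to the Hub (e.g. via huggingface-cli login); the repository name "username/my_dataset" is a placeholder:

```python
from datasets import Dataset

# Build a tiny in-memory dataset for illustration.
my_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Upload it to the Hub under your namespace;
# "username/my_dataset" is a placeholder repository name.
my_dataset.push_to_hub("username/my_dataset")
```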

Disclaimers

You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the revision of the repositories they use.
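
For example, a pinned load might look like the following sketch; revision accepts a tag, branch, or commit sha, and the value shown here is a placeholder (a full commit sha gives the strongest reproducibility guarantee):

```python
from datasets import load_dataset

# Pin the dataset repository to a specific git revision; "main" is a
# placeholder - use a tag or commit sha to truly freeze the data.
squad = load_dataset("rajpurkar/squad", revision="main")
```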

If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!

BibTeX

If you want to cite our 🤗 Datasets library, you can use our paper:

```bibtex
@inproceedings{lhoest-etal-2021-datasets,
    title = "Datasets: A Community Library for Natural Language Processing",
    author = "Lhoest, Quentin and Villanova del Moral, Albert and Jernite, Yacine and Thakur, Abhishek and von Platen, Patrick and Patil, Suraj and Chaumond, Julien and Drame, Mariama and Plu, Julien and Tunstall, Lewis and Davison, Joe and {\v{S}}a{\v{s}}ko, Mario and Chhablani, Gunjan and Malik, Bhavitvya and Brandeis, Simon and Le Scao, Teven and Sanh, Victor and Xu, Canwen and Patry, Nicolas and McMillan-Major, Angelina and Schmid, Philipp and Gugger, Sylvain and Delangue, Cl{\'e}ment and Matussi{\`e}re, Th{\'e}o and Debut, Lysandre and Bekman, Stas and Cistac, Pierric and Goehringer, Thibault and Mustar, Victor and Lagunas, Fran{\c{c}}ois and Rush, Alexander and Wolf, Thomas",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.21",
    pages = "175--184",
    abstract = "The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.",
    eprint = {2109.02846},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
}
```

If you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this list.

Owner

  • Name: Hugging Face
  • Login: huggingface
  • Kind: organization
  • Location: NYC + Paris

The AI community building the future.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "huggingface/datasets"
authors:
- family-names: Lhoest
  given-names: Quentin
- family-names: Villanova del Moral
  given-names: Albert
  orcid: "https://orcid.org/0000-0003-1727-1045"
- family-names: von Platen
  given-names: Patrick
- family-names: Wolf
  given-names: Thomas
- family-names: Šaško
  given-names: Mario
- family-names: Jernite
  given-names: Yacine
- family-names: Thakur
  given-names: Abhishek
- family-names: Tunstall
  given-names: Lewis
- family-names: Patil
  given-names: Suraj
- family-names: Drame
  given-names: Mariama
- family-names: Chaumond
  given-names: Julien
- family-names: Plu
  given-names: Julien
- family-names: Davison
  given-names: Joe
- family-names: Brandeis
  given-names: Simon
- family-names: Sanh
  given-names: Victor
- family-names: Le Scao
  given-names: Teven
- family-names: Xu
  given-names: Canwen
- family-names: Patry
  given-names: Nicolas
- family-names: Liu
  given-names: Steven
- family-names: McMillan-Major
  given-names: Angelina
- family-names: Schmid
  given-names: Philipp
- family-names: Gugger
  given-names: Sylvain
- family-names: Raw
  given-names: Nathan
- family-names: Lesage
  given-names: Sylvain
- family-names: Lozhkov
  given-names: Anton
- family-names: Carrigan
  given-names: Matthew
- family-names: Matussière
  given-names: Théo
- family-names: von Werra
  given-names: Leandro
- family-names: Debut
  given-names: Lysandre
- family-names: Bekman
  given-names: Stas
- family-names: Delangue
  given-names: Clément
doi: 10.5281/zenodo.4817768
repository-code: "https://github.com/huggingface/datasets"
license: Apache-2.0
preferred-citation:
  type: conference-paper
  title: "Datasets: A Community Library for Natural Language Processing"
  authors:
  - family-names: Lhoest
    given-names: Quentin
  - family-names: Villanova del Moral
    given-names: Albert
    orcid: "https://orcid.org/0000-0003-1727-1045"
  - family-names: von Platen
    given-names: Patrick
  - family-names: Wolf
    given-names: Thomas
  - family-names: Šaško
    given-names: Mario
  - family-names: Jernite
    given-names: Yacine
  - family-names: Thakur
    given-names: Abhishek
  - family-names: Tunstall
    given-names: Lewis
  - family-names: Patil
    given-names: Suraj
  - family-names: Drame
    given-names: Mariama
  - family-names: Chaumond
    given-names: Julien
  - family-names: Plu
    given-names: Julien
  - family-names: Davison
    given-names: Joe
  - family-names: Brandeis
    given-names: Simon
  - family-names: Sanh
    given-names: Victor
  - family-names: Le Scao
    given-names: Teven
  - family-names: Xu
    given-names: Canwen
  - family-names: Patry
    given-names: Nicolas
  - family-names: Liu
    given-names: Steven
  - family-names: McMillan-Major
    given-names: Angelina
  - family-names: Schmid
    given-names: Philipp
  - family-names: Gugger
    given-names: Sylvain
  - family-names: Raw
    given-names: Nathan
  - family-names: Lesage
    given-names: Sylvain
  - family-names: Lozhkov
    given-names: Anton
  - family-names: Carrigan
    given-names: Matthew
  - family-names: Matussière
    given-names: Théo
  - family-names: von Werra
    given-names: Leandro
  - family-names: Debut
    given-names: Lysandre
  - family-names: Bekman
    given-names: Stas
  - family-names: Delangue
    given-names: Clément
  collection-title: "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations"
  collection-type: proceedings
  month: 11
  year: 2021
  publisher:
    name: "Association for Computational Linguistics"
  url: "https://aclanthology.org/2021.emnlp-demo.21"
  start: 175
  end: 184
  identifiers:
    - type: other
      value: "arXiv:2109.02846"
      description: "The arXiv preprint of the paper"

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 4,020
  • Total Committers: 601
  • Avg Commits per committer: 6.689
  • Development Distribution Score (DDS): 0.747
Past Year
  • Commits: 240
  • Committers: 52
  • Avg Commits per committer: 4.615
  • Development Distribution Score (DDS): 0.563
Top Committers
Name Email Commits
Quentin Lhoest 4****q 1,017
Albert Villanova del Moral 8****a 698
Mario Šaško m****7@g****m 314
Patrick von Platen p****n@g****m 128
Thomas Wolf t****f 87
Steven Liu 5****u 62
Yacine Jernite y****e 48
abhishek thakur a****r 41
Sasha Luccioni l****s@m****c 40
lewtun l****l@g****m 38
Bhavitvya Malik b****k@g****m 34
Julien Chaumond j****n@h****o 32
Mariama Drame m****a@d****d 32
Suraj Patil s****5@g****m 30
Polina Kazakova p****a@h****o 29
mariamabarham 3****m 26
emibaylor 2****r 22
Steven s****u@g****m 21
Julien Plu p****n@g****m 20
Gunjan Chhablani c****n@g****m 20
Sylvain Lesage s****e@h****o 18
Charin c****b@g****m 17
Victor SANH v****h@g****m 15
Teven t****o@g****m 15
Simon Brandeis 3****s 15
Matt R****1 15
Joe Davison j****n@g****m 15
Cahya Wirawan c****n@g****m 14
Jonatas Grosman j****n@g****m 13
Thomas Wang 2****1 13
and 571 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1,104
  • Total pull requests: 1,312
  • Average time to close issues: 3 months
  • Average time to close pull requests: 29 days
  • Total issue authors: 813
  • Total pull request authors: 210
  • Average comments per issue: 2.98
  • Average comments per pull request: 2.35
  • Merged pull requests: 888
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 284
  • Pull requests: 460
  • Average time to close issues: 14 days
  • Average time to close pull requests: 8 days
  • Issue authors: 241
  • Pull request authors: 81
  • Average comments per issue: 1.1
  • Average comments per pull request: 1.09
  • Merged pull requests: 283
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • albertvillanova (90)
  • lhoestq (22)
  • severo (16)
  • alex-hh (12)
  • kopyl (9)
  • jonathanasdf (7)
  • mariosasko (7)
  • andysingal (6)
  • d710055071 (6)
  • sanchit-gandhi (6)
  • yuvalkirstain (6)
  • npuichigo (6)
  • stas00 (5)
  • rangehow (4)
  • BramVanroy (4)
Pull Request Authors
  • lhoestq (414)
  • albertvillanova (269)
  • mariosasko (108)
  • ArjunJagdale (31)
  • alex-hh (20)
  • Wauplin (10)
  • maddiedawson (9)
  • severo (9)
  • Harry-Yang0518 (8)
  • lewtun (8)
  • cakiki (8)
  • klamike (8)
  • ringohoffman (8)
  • cyyever (8)
  • Modexus (7)
Top Labels
Issue Labels
enhancement (221) bug (102) maintenance (13) dataset request (12) documentation (9) good second issue (8) good first issue (8) duplicate (8) generic discussion (7) streaming (6) dataset-viewer (5) dataset bug (5) question (3) vision (2) arrow (1) metric bug (1) dataset contribution (1) help wanted (1) speech (1)
Pull Request Labels
dataset contribution (16) maintenance (3) transfer-to-evaluate (1) Dataset discussion (1)

Packages

  • Total packages: 5
  • Total downloads:
    • pypi 26,802,388 last-month
  • Total docker downloads: 39,467,733
  • Total dependent packages: 951
    (may contain duplicates)
  • Total dependent repositories: 15,020
    (may contain duplicates)
  • Total versions: 145
  • Total maintainers: 6
pypi.org: datasets

HuggingFace community-driven open-source library of datasets

  • Versions: 100
  • Dependent Packages: 931
  • Dependent Repositories: 14,962
  • Downloads: 26,802,388 Last month
  • Docker Downloads: 39,467,733
Rankings
Dependent packages count: 0.0%
Dependent repos count: 0.1%
Downloads: 0.1%
Stargazers count: 0.1%
Average: 0.2%
Forks count: 0.3%
Docker downloads count: 0.7%
Last synced: 6 months ago
conda-forge.org: datasets

Datasets is a lightweight library providing one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets provided on the HuggingFace Datasets Hub. Datasets are ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX). Datasets also provides an API for simple, fast, and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text.

  • Versions: 34
  • Dependent Packages: 13
  • Dependent Repositories: 29
Rankings
Stargazers count: 2.1%
Forks count: 2.7%
Average: 4.1%
Dependent packages count: 4.8%
Dependent repos count: 6.9%
Last synced: 6 months ago
spack.io: py-datasets

Datasets is a lightweight library providing two main features: one-line dataloaders for many public datasets and efficient data pre-processing.

  • Versions: 4
  • Dependent Packages: 3
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Stargazers count: 0.6%
Forks count: 1.9%
Average: 7.6%
Dependent packages count: 28.1%
Maintainers (2)
Last synced: 6 months ago
pypi.org: fdatasets

HuggingFace/Datasets is an open library of NLP datasets.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 0
Rankings
Stargazers count: 0.1%
Forks count: 0.4%
Dependent packages count: 4.8%
Dependent repos count: 6.3%
Average: 12.6%
Downloads: 51.4%
Last synced: about 1 year ago
anaconda.org: datasets

Datasets is a lightweight library providing two main features:
  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX).
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

  • Versions: 6
  • Dependent Packages: 4
  • Dependent Repositories: 29
Rankings
Stargazers count: 6.0%
Forks count: 7.3%
Dependent packages count: 11.1%
Average: 13.4%
Dependent repos count: 29.2%
Last synced: 6 months ago

Dependencies

additional-tests-requirements.txt pypi
  • unbabel-comet >=1.0.0
.github/workflows/benchmarks.yaml actions
  • actions/checkout v2 composite
.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/release-conda.yml actions
  • actions/checkout v1 composite
  • conda-incubator/setup-miniconda v2 composite