datasets

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

https://github.com/huggingface/datasets

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ✓
    Academic publication links
    Links to: arxiv.org, zenodo.org
  • ✓
    Committers with academic emails
    29 of 601 committers (4.8%) from academic institutions
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

ai artificial-intelligence computer-vision dataset-hub datasets deep-learning llm machine-learning natural-language-processing nlp numpy pandas pytorch speech tensorflow

Keywords from Contributors

transformer jax cryptocurrency cryptography language-model named-entity-recognition langchain text-classification anthropic gemini
Last synced: 6 months ago

Repository

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

Basic Info
Statistics
  • Stars: 20,569
  • Watchers: 283
  • Forks: 2,921
  • Open Issues: 963
  • Releases: 103
Topics
ai artificial-intelligence computer-vision dataset-hub datasets deep-learning llm machine-learning natural-language-processing nlp numpy pandas pytorch speech tensorflow
Created almost 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Security Authors Zenodo

README.md

Hugging Face Datasets Library


🤗 Datasets is a lightweight library providing two main features:

  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("rajpurkar/squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX).
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like processed_dataset = dataset.map(process_example), efficiently prepare the dataset for inspection and ML model evaluation and training.

🎓 Documentation 🔎 Find a dataset in the Hub 🌟 Share a dataset on the Hub

🤗 Datasets is designed to let the community easily add and share new datasets.

🤗 Datasets has many additional interesting features:

  • Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
  • Smart caching: never wait for your data to be processed several times (see the sketch after this list).
  • Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
  • Built-in interoperability with NumPy, PyTorch, TensorFlow 2, JAX, Pandas, Polars and more.
  • Native support for audio, image and video data.
  • Enable streaming mode to save disk space and start iterating over the dataset immediately.
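
A minimal sketch of the memory-mapping and caching behavior described above, using the same rajpurkar/squad dataset shown in the Usage section below (cache_files and the fingerprint-based map cache are the library's mechanisms referenced here):

```python
from datasets import load_dataset

# The dataset is backed by memory-mapped Arrow files on disk,
# so it is never loaded into RAM all at once.
squad = load_dataset("rajpurkar/squad", split="train")
print(squad.cache_files)  # paths to the on-disk Arrow cache files

# map() results are cached by fingerprint: re-running the same
# transform reloads the cached Arrow file instead of recomputing.
with_length = squad.map(lambda x: {"length": len(x["context"])})
with_length = squad.map(lambda x: {"length": len(x["context"])})  # served from cache
```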

🤗 Datasets originated as a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library.

Installation

With pip

🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

```bash
pip install datasets
```

With conda

🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation
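
As a quick sanity check that the installation worked (a minimal example, assuming network access to the Hugging Face Hub), one can load and print the first example of a public dataset:

```python
from datasets import load_dataset

# Downloads (and caches) the SQuAD training split, then prints one example.
print(load_dataset("rajpurkar/squad", split="train")[0])
```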

Installation to use with Machine Learning & Data frameworks

If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (3.14+), you should also install PyTorch, TensorFlow or JAX. 🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.

For more details on using the library with these frameworks, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart
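
As a rough sketch of this interoperability (assuming PyTorch and Pandas are installed; with_format and to_pandas are the conversion helpers used here):

```python
from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="train")

# View the same memory-mapped data as PyTorch tensors where applicable...
torch_squad = squad.with_format("torch")

# ...or materialize the split as a Pandas DataFrame for analysis.
df = squad.to_pandas()
print(df.columns)
```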

Usage

🤗 Datasets is made to be very simple to use - the API is centered around a single function, datasets.load_dataset(dataset_name, **kwargs), that instantiates a dataset.

This library can be used for text/image/audio/etc. datasets. Here is a quick example to load a text dataset:

```python
from datasets import load_dataset

# Print all the available datasets
from huggingface_hub import list_datasets
print([dataset.id for dataset in list_datasets()])

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('rajpurkar/squad')
print(squad_dataset['train'][0])

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```

If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

```python
# If you want to use the dataset immediately and efficiently stream
# the data as you iterate over the dataset
image_dataset = load_dataset('timm/imagenet-1k-wds', streaming=True)
for example in image_dataset["train"]:
    break
```

For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart and the specific pages on:

  • Loading a dataset: https://huggingface.co/docs/datasets/loading
  • What's in a Dataset: https://huggingface.co/docs/datasets/access
  • Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process
    • Processing audio data: https://huggingface.co/docs/datasets/audio_process
    • Processing image data: https://huggingface.co/docs/datasets/image_process
    • Processing text data: https://huggingface.co/docs/datasets/nlp_process
  • Streaming a dataset: https://huggingface.co/docs/datasets/stream
  • etc.

Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.

You can find:
  • how to upload a dataset to the Hub using your web browser or Python (see the sketch below), and also
  • how to upload it using Git.
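
A minimal sketch of the Python route, assuming you are logged in to the Hub (e.g. via huggingface-cli login); the repository name "username/my_dataset" is a placeholder:

```python
from datasets import Dataset

# Build a tiny in-memory dataset for illustration.
my_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Upload it to the Hub under your namespace;
# "username/my_dataset" is a placeholder repository name.
my_dataset.push_to_hub("username/my_dataset")
```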

Disclaimers

You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the revision of the repositories they use.
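
For example, a pinned load might look like the following sketch; revision accepts a tag, branch, or commit sha, and the value shown here is a placeholder (a full commit sha gives the strongest reproducibility guarantee):

```python
from datasets import load_dataset

# Pin the dataset repository to a specific git revision; "main" is a
# placeholder - use a tag or commit sha to truly freeze the data.
squad = load_dataset("rajpurkar/squad", revision="main")
```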

If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!

BibTeX

If you want to cite our 🤗 Datasets library, you can use our paper:

```bibtex
@inproceedings{lhoest-etal-2021-datasets,
    title = "Datasets: A Community Library for Natural Language Processing",
    author = "Lhoest, Quentin and Villanova del Moral, Albert and Jernite, Yacine and Thakur, Abhishek and von Platen, Patrick and Patil, Suraj and Chaumond, Julien and Drame, Mariama and Plu, Julien and Tunstall, Lewis and Davison, Joe and {\v{S}}a{\v{s}}ko, Mario and Chhablani, Gunjan and Malik, Bhavitvya and Brandeis, Simon and Le Scao, Teven and Sanh, Victor and Xu, Canwen and Patry, Nicolas and McMillan-Major, Angelina and Schmid, Philipp and Gugger, Sylvain and Delangue, Cl{\'e}ment and Matussi{\`e}re, Th{\'e}o and Debut, Lysandre and Bekman, Stas and Cistac, Pierric and Goehringer, Thibault and Mustar, Victor and Lagunas, Fran{\c{c}}ois and Rush, Alexander and Wolf, Thomas",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.21",
    pages = "175--184",
    abstract = "The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.",
    eprint = {2109.02846},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
}
```

If you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this list.

Owner

  • Name: Hugging Face
  • Login: huggingface
  • Kind: organization
  • Location: NYC + Paris

The AI community building the future.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "huggingface/datasets"
authors:
- family-names: Lhoest
  given-names: Quentin
- family-names: Villanova del Moral
  given-names: Albert
  orcid: "https://orcid.org/0000-0003-1727-1045"
- family-names: von Platen
  given-names: Patrick
- family-names: Wolf
  given-names: Thomas
- family-names: Šaško
  given-names: Mario
- family-names: Jernite
  given-names: Yacine
- family-names: Thakur
  given-names: Abhishek
- family-names: Tunstall
  given-names: Lewis
- family-names: Patil
  given-names: Suraj
- family-names: Drame
  given-names: Mariama
- family-names: Chaumond
  given-names: Julien
- family-names: Plu
  given-names: Julien
- family-names: Davison
  given-names: Joe
- family-names: Brandeis
  given-names: Simon
- family-names: Sanh
  given-names: Victor
- family-names: Le Scao
  given-names: Teven
- family-names: Xu
  given-names: Canwen
- family-names: Patry
  given-names: Nicolas
- family-names: Liu
  given-names: Steven
- family-names: McMillan-Major
  given-names: Angelina
- family-names: Schmid
  given-names: Philipp
- family-names: Gugger
  given-names: Sylvain
- family-names: Raw
  given-names: Nathan
- family-names: Lesage
  given-names: Sylvain
- family-names: Lozhkov
  given-names: Anton
- family-names: Carrigan
  given-names: Matthew
- family-names: Matussière
  given-names: Théo
- family-names: von Werra
  given-names: Leandro
- family-names: Debut
  given-names: Lysandre
- family-names: Bekman
  given-names: Stas
- family-names: Delangue
  given-names: Clément
doi: 10.5281/zenodo.4817768
repository-code: "https://github.com/huggingface/datasets"
license: Apache-2.0
preferred-citation:
  type: conference-paper
  title: "Datasets: A Community Library for Natural Language Processing"
  authors:
  - family-names: Lhoest
    given-names: Quentin
  - family-names: Villanova del Moral
    given-names: Albert
    orcid: "https://orcid.org/0000-0003-1727-1045"
  - family-names: von Platen
    given-names: Patrick
  - family-names: Wolf
    given-names: Thomas
  - family-names: Šaško
    given-names: Mario
  - family-names: Jernite
    given-names: Yacine
  - family-names: Thakur
    given-names: Abhishek
  - family-names: Tunstall
    given-names: Lewis
  - family-names: Patil
    given-names: Suraj
  - family-names: Drame
    given-names: Mariama
  - family-names: Chaumond
    given-names: Julien
  - family-names: Plu
    given-names: Julien
  - family-names: Davison
    given-names: Joe
  - family-names: Brandeis
    given-names: Simon
  - family-names: Sanh
    given-names: Victor
  - family-names: Le Scao
    given-names: Teven
  - family-names: Xu
    given-names: Canwen
  - family-names: Patry
    given-names: Nicolas
  - family-names: Liu
    given-names: Steven
  - family-names: McMillan-Major
    given-names: Angelina
  - family-names: Schmid
    given-names: Philipp
  - family-names: Gugger
    given-names: Sylvain
  - family-names: Raw
    given-names: Nathan
  - family-names: Lesage
    given-names: Sylvain
  - family-names: Lozhkov
    given-names: Anton
  - family-names: Carrigan
    given-names: Matthew
  - family-names: Matussière
    given-names: Théo
  - family-names: von Werra
    given-names: Leandro
  - family-names: Debut
    given-names: Lysandre
  - family-names: Bekman
    given-names: Stas
  - family-names: Delangue
    given-names: Clément
  collection-title: "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations"
  collection-type: proceedings
  month: 11
  year: 2021
  publisher:
    name: "Association for Computational Linguistics"
  url: "https://aclanthology.org/2021.emnlp-demo.21"
  start: 175
  end: 184
  identifiers:
    - type: other
      value: "arXiv:2109.02846"
      description: "The arXiv preprint of the paper"

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 4,020
  • Total Committers: 601
  • Avg Commits per committer: 6.689
  • Development Distribution Score (DDS): 0.747
Past Year
  • Commits: 240
  • Committers: 52
  • Avg Commits per committer: 4.615
  • Development Distribution Score (DDS): 0.563
Top Committers
Name Email Commits
Quentin Lhoest 4****q 1,017
Albert Villanova del Moral 8****a 698
Mario Šaško m****7@g****m 314
Patrick von Platen p****n@g****m 128
Thomas Wolf t****f 87
Steven Liu 5****u 62
Yacine Jernite y****e 48
abhishek thakur a****r 41
Sasha Luccioni l****s@m****c 40
lewtun l****l@g****m 38
Bhavitvya Malik b****k@g****m 34
Julien Chaumond j****n@h****o 32
Mariama Drame m****a@d****d 32
Suraj Patil s****5@g****m 30
Polina Kazakova p****a@h****o 29
mariamabarham 3****m 26
emibaylor 2****r 22
Steven s****u@g****m 21
Julien Plu p****n@g****m 20
Gunjan Chhablani c****n@g****m 20
Sylvain Lesage s****e@h****o 18
Charin c****b@g****m 17
Victor SANH v****h@g****m 15
Teven t****o@g****m 15
Simon Brandeis 3****s 15
Matt R****1 15
Joe Davison j****n@g****m 15
Cahya Wirawan c****n@g****m 14
Jonatas Grosman j****n@g****m 13
Thomas Wang 2****1 13
and 571 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1,104
  • Total pull requests: 1,312
  • Average time to close issues: 3 months
  • Average time to close pull requests: 29 days
  • Total issue authors: 813
  • Total pull request authors: 210
  • Average comments per issue: 2.98
  • Average comments per pull request: 2.35
  • Merged pull requests: 888
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 284
  • Pull requests: 460
  • Average time to close issues: 14 days
  • Average time to close pull requests: 8 days
  • Issue authors: 241
  • Pull request authors: 81
  • Average comments per issue: 1.1
  • Average comments per pull request: 1.09
  • Merged pull requests: 283
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • albertvillanova (90)
  • lhoestq (22)
  • severo (16)
  • alex-hh (12)
  • kopyl (9)
  • jonathanasdf (7)
  • mariosasko (7)
  • andysingal (6)
  • d710055071 (6)
  • sanchit-gandhi (6)
  • yuvalkirstain (6)
  • npuichigo (6)
  • stas00 (5)
  • rangehow (4)
  • BramVanroy (4)
Pull Request Authors
  • lhoestq (414)
  • albertvillanova (269)
  • mariosasko (108)
  • ArjunJagdale (31)
  • alex-hh (20)
  • Wauplin (10)
  • maddiedawson (9)
  • severo (9)
  • Harry-Yang0518 (8)
  • lewtun (8)
  • cakiki (8)
  • klamike (8)
  • ringohoffman (8)
  • cyyever (8)
  • Modexus (7)
Top Labels
Issue Labels
enhancement (221) bug (102) maintenance (13) dataset request (12) documentation (9) good second issue (8) good first issue (8) duplicate (8) generic discussion (7) streaming (6) dataset-viewer (5) dataset bug (5) question (3) vision (2) arrow (1) metric bug (1) dataset contribution (1) help wanted (1) speech (1)
Pull Request Labels
dataset contribution (16) maintenance (3) transfer-to-evaluate (1) Dataset discussion (1)

Packages

  • Total packages: 5
  • Total downloads:
    • pypi 26,802,388 last-month
  • Total docker downloads: 39,467,733
  • Total dependent packages: 951
    (may contain duplicates)
  • Total dependent repositories: 15,020
    (may contain duplicates)
  • Total versions: 145
  • Total maintainers: 6
pypi.org: datasets

HuggingFace community-driven open-source library of datasets

  • Versions: 100
  • Dependent Packages: 931
  • Dependent Repositories: 14,962
  • Downloads: 26,802,388 Last month
  • Docker Downloads: 39,467,733
Rankings
Dependent packages count: 0.0%
Dependent repos count: 0.1%
Downloads: 0.1%
Stargazers count: 0.1%
Average: 0.2%
Forks count: 0.3%
Docker downloads count: 0.7%
Last synced: 6 months ago
conda-forge.org: datasets

Datasets is a lightweight library providing one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets provided on the HuggingFace Datasets Hub. Datasets are ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX). Datasets also provides an API for simple, fast, and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text.

  • Versions: 34
  • Dependent Packages: 13
  • Dependent Repositories: 29
Rankings
Stargazers count: 2.1%
Forks count: 2.7%
Average: 4.1%
Dependent packages count: 4.8%
Dependent repos count: 6.9%
Last synced: 6 months ago
spack.io: py-datasets

Datasets is a lightweight library providing two main features: one-line dataloaders for many public datasets and efficient data pre-processing.

  • Versions: 4
  • Dependent Packages: 3
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Stargazers count: 0.6%
Forks count: 1.9%
Average: 7.6%
Dependent packages count: 28.1%
Maintainers (2)
Last synced: 6 months ago
pypi.org: fdatasets

HuggingFace/Datasets is an open library of NLP datasets.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 0
Rankings
Stargazers count: 0.1%
Forks count: 0.4%
Dependent packages count: 4.8%
Dependent repos count: 6.3%
Average: 12.6%
Downloads: 51.4%
Last synced: about 1 year ago
anaconda.org: datasets

Datasets is a lightweight library providing two main features:
  • one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX).
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

  • Versions: 6
  • Dependent Packages: 4
  • Dependent Repositories: 29
Rankings
Stargazers count: 6.0%
Forks count: 7.3%
Dependent packages count: 11.1%
Average: 13.4%
Dependent repos count: 29.2%
Last synced: 6 months ago

Dependencies

additional-tests-requirements.txt pypi
  • unbabel-comet >=1.0.0
.github/workflows/benchmarks.yaml actions
  • actions/checkout v2 composite
.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/release-conda.yml actions
  • actions/checkout v1 composite
  • conda-incubator/setup-miniconda v2 composite