dcbench

A benchmark of data-centric tasks from across the machine learning lifecycle.

https://github.com/data-centric-ai/dcbench

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary

Keywords

data-science machine-learning

Repository

Basic Info
  • Host: GitHub
  • Owner: data-centric-ai
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://www.datacentricai.cc/
  • Size: 626 KB
Statistics
  • Stars: 72
  • Watchers: 4
  • Forks: 9
  • Open Issues: 3
  • Releases: 1
Created almost 5 years ago · Last pushed about 4 years ago
Metadata Files
Readme Contributing License Citation

README.md

![GitHub Workflow Status](https://img.shields.io/github/workflow/status/data-centric-ai/dcbench/CI) ![GitHub](https://img.shields.io/github/license/data-centric-ai/dcbench) [![Documentation Status](https://readthedocs.org/projects/dcbench/badge/?version=latest)](https://dcbench.readthedocs.io/en/latest/?badge=latest) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dcbench)](https://pypi.org/project/dcbench/) [![codecov](https://codecov.io/gh/data-centric-ai/dcbench/branch/main/graph/badge.svg?token=MOLQYUSYQU)](https://codecov.io/gh/data-centric-ai/dcbench)

A benchmark of data-centric tasks from across the machine learning lifecycle.

[**Getting Started**](#%EF%B8%8F-quickstart) | [**What is dcbench?**](#-what-is-dcbench) | [**Docs**](https://dcbench.readthedocs.io/en/latest/index.html) | [**Contributing**](CONTRIBUTING.md) | [**Website**](https://www.datacentricai.cc/) | [**About**](#%EF%B8%8F-about)

⚡️ Quickstart

```bash
pip install dcbench
```

Optional: some parts of dcbench rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so with, for example, `pip install dcbench[dev]`. See setup.py for a full list of optional dependencies.

Installing from dev: `pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/dcbench@main"`

Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:

```python
import dcbench
dcbench.tasks
```

To learn more, follow the walkthrough in the docs.
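If the import fails in your environment, a small guard like the one below can confirm whether the package is visible to the current interpreter before you start exploring. This is a minimal sketch: the helper name `dcbench_installed` is hypothetical and not part of dcbench, and the only dcbench attribute it touches is `dcbench.tasks` from the snippet above.

```python
import importlib.util


def dcbench_installed() -> bool:
    """Return True if the dcbench package can be found by this interpreter.

    Hypothetical helper for illustration; it is not part of dcbench.
    """
    return importlib.util.find_spec("dcbench") is not None


if dcbench_installed():
    import dcbench

    # Same attribute as in the quickstart snippet.
    print(dcbench.tasks)
else:
    print("dcbench is not installed; run: pip install dcbench")
```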

💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning, such as feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data, not training models. dcbench supports a growing list of them.

dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data cleaning task. However, we think it important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse, but related, tasks. In dcbench all of these tasks are structured in a similar manner and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
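To make the idea of a shared structure concrete, here is a minimal, self-contained sketch of how tasks with different inputs and outputs could sit behind one evaluation loop. It is not dcbench's actual API: the `Problem` and `Task` classes, the `compare` helper, and the toy cleaning task are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Problem:
    """Hypothetical problem instance: an id plus named input artifacts."""
    problem_id: str
    artifacts: Dict[str, Any]


@dataclass
class Task:
    """Hypothetical task: a set of problems and a task-specific scorer."""
    task_id: str
    problems: Dict[str, Problem]
    score: Callable[[Problem, Any], float]

    def evaluate(self, method: Callable[[Problem], Any]) -> Dict[str, float]:
        # Run the method on every problem and score each solution.
        return {pid: self.score(p, method(p)) for pid, p in self.problems.items()}


def compare(task: Task, methods: Dict[str, Callable[[Problem], Any]]) -> Dict[str, float]:
    """Average each method's per-problem scores on one task."""
    results = {}
    for name, method in methods.items():
        scores = task.evaluate(method)
        results[name] = sum(scores.values()) / len(scores)
    return results


# Toy cleaning task (made-up data): fill in missing values.
def cleaning_score(problem: Problem, solution: list) -> float:
    clean = problem.artifacts["clean"]
    matches = sum(1 for a, b in zip(solution, clean) if a == b)
    return matches / len(clean)


def fill_zero(problem: Problem) -> list:
    return [0.0 if v is None else v for v in problem.artifacts["values"]]


def fill_mean(problem: Problem) -> list:
    known = [v for v in problem.artifacts["values"] if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in problem.artifacts["values"]]


toy_task = Task(
    task_id="toy_cleaning",
    problems={
        "p1": Problem("p1", {"values": [1.0, None, 3.0], "clean": [1.0, 2.0, 3.0]}),
        "p2": Problem("p2", {"values": [None, 4.0], "clean": [4.0, 4.0]}),
    },
    score=cleaning_score,
)

results = compare(toy_task, {"fill_zero": fill_zero, "fill_mean": fill_mean})
print(results)
```

In this sketch, only the scorer and the artifacts are task-specific, so adding a second task with different inputs and outputs would reuse `evaluate` and `compare` unchanged.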

✉️ About

dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!

Owner

  • Name: data-centric-ai
  • Login: data-centric-ai
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this benchmark, please cite it as below."
authors:
- family-names: Eyuboglu
  given-names: Sabri
  orcid: "https://orcid.org/0000-0002-8412-0266"
- family-names: Karlaš
  given-names: Bojan
- family-names: Zhang
  given-names: Ce
- family-names: Ré
  given-names: Christopher
- family-names: Zou
  given-names: James
title: "dcbench"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2021-11-29
url: "https://github.com/data-centric-ai/dcbench"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite
  • google-github-actions/auth v0 composite
  • google-github-actions/setup-gcloud v0.2.0 composite
Pipfile pypi
  • dcbench * develop
  • ipython * develop
  • twine * develop
  • dcbench *
Pipfile.lock pypi
  • 144 dependencies
docs/requirements.txt pypi
  • furo *
  • ipython *
  • nbsphinx *
  • recommonmark *
  • sphinx-rtd-theme *
  • sphinx_autodoc_typehints *
  • toml *
pyproject.toml pypi
setup.py pypi