ballet

☀️🦶 A lightweight framework for collaborative, open-source feature engineering

https://github.com/ballet/ballet

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

collaborative-data-science feature-engineering
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ballet
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage: https://ballet.github.io
  • Size: 17.4 MB
Statistics
  • Stars: 33
  • Watchers: 4
  • Forks: 6
  • Open Issues: 20
  • Releases: 0
Topics
collaborative-data-science feature-engineering
Created almost 8 years ago · Last pushed over 4 years ago
Metadata Files
Readme Changelog Contributing License

README.md

ballet

A lightweight framework for collaborative, open-source data science projects through feature engineering.

  • Free software: MIT license
  • Documentation: https://ballet.github.io/ballet
  • Repo: https://github.com/ballet/ballet
  • Project homepage: https://ballet.github.io

Overview

While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.

Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.
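The "test and merge" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions and does not use Ballet's API: `Patch`, `passes_checks`, `merge_contributions`, and `compose` are hypothetical names, and the quality gate shown here is a deliberately simplified stand-in for real validation.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types for illustration only (not part of Ballet).
Row = Dict[str, Optional[float]]
Patch = Callable[[Row], Optional[float]]  # one patch computes one value per row


def passes_checks(patch: Patch, sample: List[Row]) -> bool:
    """Simplified quality gate: the patch must run on every sample row
    without raising and must never return a missing value."""
    try:
        return all(patch(row) is not None for row in sample)
    except Exception:
        return False


def merge_contributions(proposed: Dict[str, Patch], sample: List[Row]) -> Dict[str, Patch]:
    """Keep only the proposed patches that pass the quality gate."""
    return {name: p for name, p in proposed.items() if passes_checks(p, sample)}


def compose(accepted: Dict[str, Patch]) -> Callable[[Row], Dict[str, Optional[float]]]:
    """Combine the accepted patches into a single callable product."""
    def pipeline(row: Row) -> Dict[str, Optional[float]]:
        return {name: patch(row) for name, patch in accepted.items()}
    return pipeline


# Two patches written independently; the second fails on rows with a missing age.
sample_rows: List[Row] = [{"age": 30.0, "hours": 40.0}, {"age": None, "hours": 20.0}]
proposed_patches: Dict[str, Patch] = {
    "hours_doubled": lambda r: r["hours"] * 2,
    "age_plus_one": lambda r: r["age"] + 1,
}
accepted_patches = merge_contributions(proposed_patches, sample_rows)
product = compose(accepted_patches)
```

In this sketch, `age_plus_one` is rejected because it raises on the row with a missing `age`, so only `hours_doubled` ends up in the composed product. Each patch is checked in isolation, which is what lets contributors work in parallel.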

We have deployed Ballet for feature engineering collaborations on tabular survey datasets of public interest. For example, predict-census-income is a large real-world collaborative project to engineer features from raw individual survey responses to the U.S. Census American Community Survey (ACS) and predict personal income. The resulting project is one of the largest data science collaborations on GitHub, and it outperforms state-of-the-art tabular AutoML systems and independent data science experts.

The Ballet framework

Ballet includes several different pieces for enabling collaborative data science.

  • The Ballet framework core is developed in this repository and includes:
    • the feature definition abstraction, a tuple of input variables and transformer steps (ballet.feature)
    • the feature engineering pipeline abstraction, a data flow graph over feature functions (ballet.pipeline)
    • the transformer step abstraction and a library of transformer steps that can be used in feature engineering (ballet.transformer, ballet.eng)
    • a comprehensive feature validation library that includes test suites and statistical methods for validating the machine learning performance and software quality of proposed feature definitions (ballet.validation)
    • functionality for programmatically collecting submitted feature definitions from file systems (ballet.contrib)
    • a project template for individual Ballet projects that can be automatically updated with upstream template improvements (ballet/templates/project_template, ballet.update)
    • a command line tool for maintaining and developing Ballet projects (ballet.cli)
    • an interface to interact with Ballet projects following the project template (ballet.project)
    • an interactive client for users during development (ballet.client)
  • Assemblé: A development environment for Ballet collaborations on top of Jupyter Lab
  • Ballet Bot: A bot to help manage Ballet projects on GitHub
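The first two abstractions in the list above, a feature definition as a tuple of input variables and transformer steps, and a pipeline that combines many such definitions, can be illustrated with a small standalone sketch. The names `FeatureSketch`, `fill_missing_with_zero`, `sum_columns`, and `build_feature_matrix` are hypothetical and are not the classes or functions found in ballet.feature, ballet.pipeline, or ballet.eng; the sketch uses plain Python lists rather than data frames to stay self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical types for illustration only (not Ballet's own).
Column = List[Optional[float]]
Table = Dict[str, Column]
Step = Callable[[List[Column]], List[Column]]


@dataclass(frozen=True)
class FeatureSketch:
    """A feature definition: named input columns plus ordered transformer steps."""
    name: str
    inputs: Sequence[str]
    steps: Sequence[Step]

    def apply(self, table: Table) -> List[Column]:
        # Copy the selected input columns so the raw table is never mutated.
        cols = [list(table[c]) for c in self.inputs]
        for step in self.steps:
            cols = step(cols)
        return cols


def fill_missing_with_zero(cols: List[Column]) -> List[Column]:
    """Transformer step: replace missing values with 0.0."""
    return [[0.0 if v is None else v for v in col] for col in cols]


def sum_columns(cols: List[Column]) -> List[Column]:
    """Transformer step: collapse several columns into one row-wise sum."""
    return [[sum(vals) for vals in zip(*cols)]]


def build_feature_matrix(features: Sequence[FeatureSketch], table: Table) -> Table:
    """Pipeline: apply every feature to the raw table and concatenate the outputs."""
    matrix: Table = {}
    for feature in features:
        for i, col in enumerate(feature.apply(table)):
            matrix[f"{feature.name}_{i}"] = col
    return matrix


raw = {"wages": [10.0, None, 5.0], "tips": [1.0, 2.0, None]}
total_income = FeatureSketch(
    name="total_income",
    inputs=["wages", "tips"],
    steps=[fill_missing_with_zero, sum_columns],
)
feature_matrix = build_feature_matrix([total_income], raw)
```

Because each feature declares its own inputs and steps, features from different contributors do not depend on one another, and the pipeline only has to concatenate their outputs.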

Next steps

Learn more about Ballet

Are you a data owner or project maintainer who wants to organize a collaboration?

👉 Check out the Ballet Maintainer Guide

Are you a data scientist or enthusiast who wants to join a collaboration?

👉 Check out the Ballet Contributor Guide

Do you want to learn about how Ballet enables Better Feature Engineering™️?

👉 Check out the Feature Engineering Guide

You can also read our research paper about the Ballet framework and our case study analysis, which appeared at ACM CSCW 2021:

👉 Enabling Collaborative Data Science Development with the Ballet Framework

Join a Ballet collaboration

The Ballet GitHub organization hosts several ongoing Ballet collaborations.

Citing Ballet

If you use Ballet in your work, please consider citing the following paper:

@article{smith2021enabling,
  author = {Smith, Micah J. and Cito, J{\"u}rgen and Lu, Kelvin and Veeramachaneni, Kalyan},
  title = "Enabling Collaborative Data Science Development with the {Ballet} Framework",
  year = "2021",
  month = "October",
  volume = "5",
  pages = "1--39",
  doi = "10.1145/3479575",
  journal = "Proceedings of the {ACM} on Human-Computer Interaction",
  publisher = "{ACM}",
  language = "en",
  number = "CSCW2"
}

Owner

  • Name: ballet
  • Login: ballet
  • Kind: organization
  • Email: ballet@mit.edu
  • Location: United States of America

Collaborative data science development frameworks from the Data To AI Lab at MIT

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 1,180
  • Total Committers: 2
  • Avg Commits per committer: 590.0
  • Development Distribution Score (DDS): 0.097
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name         Email          Commits
Micah Smith  m****h@g****m  1,066
Kelvin Lu    k****u@m****u  114
Committer Domains (Top 20 + Academic)
mit.edu: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 62
  • Total pull requests: 30
  • Average time to close issues: 3 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 0.79
  • Average comments per pull request: 0.43
  • Merged pull requests: 28
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • micahjsmith (54)
  • kelvin-lu (6)
  • gsheni (1)
  • eleveyuan (1)
Pull Request Authors
  • micahjsmith (20)
  • kelvin-lu (10)
Top Labels
Issue Labels
  • enhancement (22)
  • bug (7)
  • help wanted (2)
  • good first issue (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 376 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 50
  • Total maintainers: 2
pypi.org: ballet

Core functionality for lightweight, collaborative data science projects

  • Versions: 50
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 376 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 11.3%
Dependent repos count: 11.6%
Average: 12.6%
Forks count: 14.2%
Downloads: 15.9%
Maintainers (2)
Last synced: 6 months ago

Dependencies

ballet/templates/project_template/{{cookiecutter.project_slug}}/binder/requirements.txt pypi
  • ballet-assemble *
  • jupyterlab *
ballet/templates/project_template/{{cookiecutter.project_slug}}/requirements-notebook.txt pypi
  • matplotlib *
  • seaborn *