pydata-wrangler

Wrangle messy numerical, image, and text data into consistent, well-organized formats

https://github.com/contextlab/data-wrangler

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 4 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

data data-analysis data-science data-wrangling hugging-face image-data machine-learning nlp numpy pandas polars python scikit-learn

Keywords from Contributors

environment-management google-colab install ipython package-management pip reproducibility
Last synced: 4 months ago

Repository

Wrangle messy numerical, image, and text data into consistent, well-organized formats

Basic Info
  • Host: GitHub
  • Owner: ContextLab
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 2.34 MB
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 2
  • Open Issues: 4
  • Releases: 12
Topics
data data-analysis data-science data-wrangling hugging-face image-data machine-learning nlp numpy pandas polars python scikit-learn
Created over 4 years ago · Last pushed 7 months ago
Metadata Files
Readme · Changelog · Contributing · License · Citation · Authors

README.rst

Overview
================

|build-status|  |docs|  |doi|

Datasets come in all shapes and sizes, and are often *messy*:

  - Observations come in different formats
  - There are missing values
  - Labels are missing and/or aren't consistent
  - Datasets need to be wrangled 🐄 🐑 🚜

The main goal of ``data-wrangler`` is to turn messy data into clean(er) data, defined as either a ``DataFrame`` or a
list of ``DataFrame`` objects.  The package provides code for easily wrangling data from a variety of formats into
``DataFrame`` objects, manipulating ``DataFrame`` objects in useful ways (that can be tricky to implement, but that
apply to many analysis scenarios), and decorating Python functions to make them more flexible and/or easier to write.

🚀 **New**: ``data-wrangler`` now supports **high-performance Polars DataFrames** alongside pandas, delivering 2-100x speedups
for large datasets. To switch, simply add ``backend='polars'`` to any operation!

The ``data-wrangler`` package supports a variety of datatypes, with a special emphasis on text data: it provides a
simple API for interacting with natural language processing tools and datasets from ``scikit-learn`` and
``hugging-face`` (via ``sentence-transformers``).  The package is designed to provide sensible defaults, but also
implements convenient ways of deeply customizing how different datatypes are wrangled.

For more information, including a formal API and tutorials, check out https://data-wrangler.readthedocs.io

Quick start
================

Install ``data-wrangler`` using:

.. code-block:: console

    $ pip install pydata-wrangler

Some quick natural language processing examples::

    import datawrangler as dw

    # load in sample text
    text_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/home_on_the_range.txt'
    text = dw.io.load(text_url)

    # embed text using scikit-learn's implementation of Latent Dirichlet Allocation, trained on a curated subset of
    # Wikipedia, called the 'minipedia' corpus.  Return the fitted model so that it can be applied to new text.
    # NEW: Simplified API - just pass model names as strings or lists!
    lda_embeddings, lda_fit = dw.wrangle(
        text,
        text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation'],
                     'corpus': 'minipedia'},
        return_model=True)

    # apply the minipedia-trained LDA model to new text
    new_text = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood?'
    new_embeddings = dw.wrangle(new_text, text_kwargs={'model': lda_fit})

    # embed text using sentence-transformers pre-trained model
    # NEW: Simplified API - just pass the model name as a string!
    sentence_embeddings = dw.wrangle(text, text_kwargs={'model': 'all-mpnet-base-v2'})

High-performance Polars backend examples::

    import numpy as np
    
    # Array processing with dramatic speedups
    large_array = np.random.rand(50000, 20)
    
    # Traditional pandas backend
    pandas_df = dw.wrangle(large_array, backend='pandas')
    
    # High-performance Polars backend (2-100x faster!)
    polars_df = dw.wrangle(large_array, backend='polars')
    
    # Set global backend preference
    from datawrangler.core.configurator import set_dataframe_backend
    set_dataframe_backend('polars')  # All operations now use Polars
    
    # Text processing also benefits from Polars
    fast_text_embeddings = dw.wrangle(text, backend='polars')
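
Because Polars is an optional dependency, a defensive pattern (an assumption about your environment, not part of the
documented API) is to pick the backend at runtime and fall back to pandas when Polars isn't installed::

    try:
        import polars  # noqa: F401  # optional high-performance backend
        backend = 'polars'
    except ImportError:
        backend = 'pandas'

    # then pass it along, e.g.: df = dw.wrangle(large_array, backend=backend)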

The ``data-wrangler`` package also provides powerful decorators that can modify existing functions to support new
datatypes.  Just write your function as though its inputs are guaranteed to be pandas ``DataFrame`` objects, and decorate it with
``datawrangler.decorate.funnel`` to enable support for other datatypes without any new code::

  image_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/wrangler.jpg'
  image = dw.io.load(image_url)

  # define your function and decorate it with "funnel"
  @dw.decorate.funnel
  def binarize(x):
    return x > np.mean(x.values)

  binarized_image = binarize(image)  # rgb channels will be horizontally concatenated to create a 2D DataFrame


Supported data formats
----------------------

One package can't accommodate every foreseeable format or input source, but ``data-wrangler`` provides a framework for adding support for new datatypes in a straightforward way.  Essentially, adding support for a new data type entails writing two functions:

  - An ``is_`` function, which should return ``True`` if an object is compatible with the given datatype (or format), and ``False`` otherwise
  - A ``wrangle_`` function, which should take in an object of the given type or format and return a ``pandas`` or ``Polars`` ``DataFrame`` with numerical entries
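
As a hedged sketch, here is what that pair of functions might look like for a hypothetical dict-of-lists format (the
function names are illustrative; the package's actual registration mechanism may differ)::

    import pandas as pd

    def is_dict_of_lists(obj):
        # True if obj is a dict whose values are all lists
        return isinstance(obj, dict) and all(isinstance(v, list) for v in obj.values())

    def wrangle_dict_of_lists(obj):
        # turn the dict into a DataFrame with one column per key
        return pd.DataFrame(obj)

    data = {'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}
    assert is_dict_of_lists(data)
    df = wrangle_dict_of_lists(data)  # 3 rows x 2 columns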

Currently supported datatypes are limited to:

  - ``array``-like objects (including images)
  - ``DataFrame``-like or ``Series``-like objects (pandas and Polars)
  - text data (embedded using natural language processing models)
  - lists of mixtures of the above

**Backend Support**: All operations support both ``pandas`` (default) and ``Polars`` (high-performance) backends. Choose the backend that best fits your performance requirements and workflow preferences.

Missing observations (e.g., NaN values or empty strings) may be filled in using imputation and/or interpolation.
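
For illustration, comparable gap-filling can be sketched directly with ``pandas`` (these are pandas' own methods, not
``data-wrangler``'s wrangling options)::

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 2.0, 4.0]})

    interpolated = df.interpolate(limit_direction='both')  # linear interpolation, filling edges
    imputed = df.fillna(df.mean())                         # column-mean imputation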

.. |build-status| image:: https://github.com/ContextLab/data-wrangler/actions/workflows/ci.yaml/badge.svg
    :alt: build status
    :target: https://github.com/ContextLab/data-wrangler

.. |docs| image:: https://readthedocs.org/projects/data-wrangler/badge/
    :alt: docs status
    :target: https://data-wrangler.readthedocs.io/

.. |doi| image:: https://zenodo.org/badge/DOI/10.5281/zenodo.5123310.svg
    :alt: DOI
    :target: https://doi.org/10.5281/zenodo.5123310

Owner

  • Name: Contextual Dynamics Laboratory
  • Login: ContextLab
  • Kind: organization
  • Email: contextualdynamics@gmail.com
  • Location: Hanover, NH

Contextual Dynamics Laboratory at Dartmouth College

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Manning"
  given-names: "Jeremy"
  orcid: "https://orcid.org/0000-0001-7613-4732"
title: "DataWrangler: wrangler your messy data into consistent well-organized formats"
version: 0.2.2
doi: 10.5281/zenodo.5123310
date-released: 2022-07-25
url: "https://github.com/ContextLab/data-wrangler"

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Push event: 10
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 1
  • Push event: 10
  • Create event: 2

Committers

Last synced: almost 2 years ago

All Time
  • Total Commits: 144
  • Total Committers: 4
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.292
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • jeremymanning — j****g@g****m — 102 commits
  • Jeremy Manning — j****g@d****u — 22 commits
  • Jeremy Manning — j****g — 15 commits
  • paxtonfitzpatrick — P****k@D****u — 5 commits
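
The Development Distribution Score above is consistent with the common definition DDS = 1 − (top committer's commits ÷
total commits); a quick check, assuming that definition::

    top_committer, total = 102, 144
    dds = round(1 - top_committer / total, 3)  # 0.292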

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 13
  • Total pull requests: 16
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 26 minutes
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 1.15
  • Average comments per pull request: 0.06
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jeremymanning (12)
  • paxtonfitzpatrick (1)
Pull Request Authors
  • jeremymanning (15)
  • paxtonfitzpatrick (1)
Top Labels
Issue Labels
  • bug (3)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 58 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 12
  • Total maintainers: 1
pypi.org: pydata-wrangler

Wrangle messy data into DataFrames (pandas or Polars), with a special focus on text data and natural language processing

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 58 Last month
Rankings
  • Dependent packages count: 10.1%
  • Stargazers count: 17.8%
  • Average: 18.4%
  • Forks count: 19.2%
  • Dependent repos count: 21.6%
  • Downloads: 23.2%
Maintainers (1)
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • Pillow ==9.2.0
  • Sphinx ==4.0.2
  • configparser ==5.2.0
  • datasets ==2.3.2
  • dill ==0.3.5.1
  • flair ==0.11.3
  • importlib-metadata ==3.10.1
  • jupyter-client ==7.3.4
  • jupyter-core ==4.11.1
  • jupyterlab-pygments ==0.2.2
  • konoha ==4.6.5
  • matplotlib ==3.5.1
  • nbclient ==0.6.6
  • nbconvert ==6.5.0
  • nbformat ==5.4.0
  • nbsphinx ==0.8.9
  • numpy ==1.22.3
  • pandas ==1.4.3
  • pydata-sphinx-theme ==0.9.0
  • requests ==2.28.1
  • scikit-learn ==1.1.1
  • scipy ==1.7.3
  • sentence-transformers ==2.2.2
  • sentencepiece ==0.1.95
  • setuptools ==61.2.0
  • six ==1.16.0
  • tokenizers ==0.12.1
  • torch ==1.12.0
  • torchvision ==0.13.0
  • tqdm ==4.64.0
  • transformers ==4.20.1
requirements.txt pypi
  • Pillow *
  • configparser *
  • dill *
  • importlib-metadata *
  • matplotlib *
  • numpy *
  • pandas *
  • requests *
  • scikit-learn *
  • scipy *
  • setuptools *
  • six *
  • tqdm *
requirements_dev.txt pypi
  • Sphinx >=1.8.5 development
  • nbsphinx * development
  • pydata-sphinx-theme * development
  • pygments * development
  • pytest * development
requirements_hf.txt pypi
  • datasets *
  • flair *
  • konoha *
  • pytorch-pretrained-bert *
  • pytorch-transformers *
  • sentence-transformers *
  • sentencepiece *
  • tokenizers *
  • torch *
  • torchvision *
  • transformers *