pydata-wrangler

Wrangle messy numerical, image, and text data into consistent, well-organized formats

https://github.com/contextlab/data-wrangler

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 4 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.7%) to scientific vocabulary

Keywords

data data-analysis data-science data-wrangling hugging-face image-data machine-learning nlp numpy pandas polars python scikit-learn

Keywords from Contributors

environment-management google-colab install ipython package-management pip reproducibility
Last synced: 4 months ago

Repository

Wrangle messy numerical, image, and text data into consistent, well-organized formats

Basic Info
  • Host: GitHub
  • Owner: ContextLab
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 2.34 MB
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 2
  • Open Issues: 4
  • Releases: 12
Topics
data data-analysis data-science data-wrangling hugging-face image-data machine-learning nlp numpy pandas polars python scikit-learn
Created over 4 years ago · Last pushed 7 months ago
Metadata Files
Readme · Changelog · Contributing · License · Citation · Authors

README.rst

Overview
================

|build-status|  |docs|  |doi|

Datasets come in all shapes and sizes, and are often *messy*:

  - Observations come in different formats
  - There are missing values
  - Labels are missing and/or aren't consistent
  - Datasets need to be wrangled 🐄 🐑 🚜

The main goal of ``data-wrangler`` is to turn messy data into clean(er) data, defined as either a ``DataFrame`` or a
list of ``DataFrame`` objects.  The package provides code for easily wrangling data from a variety of formats into
``DataFrame`` objects, manipulating ``DataFrame`` objects in useful ways (that can be tricky to implement, but that
apply to many analysis scenarios), and decorating Python functions to make them more flexible and/or easier to write.

🚀 **New**: ``data-wrangler`` now supports **high-performance Polars DataFrames** alongside pandas, delivering 2-100x speedups
for large datasets. To switch, simply add ``backend='polars'`` to any operation!

The ``data-wrangler`` package supports a variety of datatypes, with a special emphasis on text data: it provides a
simple API for interacting with natural language processing tools and datasets from ``scikit-learn`` and
``hugging-face`` (via ``sentence-transformers``).  The package is designed to provide sensible defaults, but also
implements convenient ways of deeply customizing how different datatypes are wrangled.

For more information, including a formal API and tutorials, check out https://data-wrangler.readthedocs.io

Quick start
================

Install ``data-wrangler`` using:

.. code-block:: console

    $ pip install pydata-wrangler

Some quick natural language processing examples::

    import datawrangler as dw

    # load in sample text
    text_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/home_on_the_range.txt'
    text = dw.io.load(text_url)

    # embed text using scikit-learn's implementation of Latent Dirichlet Allocation, trained on a curated subset of
    # Wikipedia, called the 'minipedia' corpus.  Return the fitted model so that it can be applied to new text.
    # NEW: Simplified API - just pass model names as strings or lists!
    lda_embeddings, lda_fit = dw.wrangle(
        text,
        text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation'],
                     'corpus': 'minipedia'},
        return_model=True)

    # apply the minipedia-trained LDA model to new text
    new_text = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood?'
    new_embeddings = dw.wrangle(new_text, text_kwargs={'model': lda_fit})

    # embed text using sentence-transformers pre-trained model
    # NEW: Simplified API - just pass the model name as a string!
    sentence_embeddings = dw.wrangle(text, text_kwargs={'model': 'all-mpnet-base-v2'})

High-performance Polars backend examples::

    import numpy as np
    
    # Array processing with dramatic speedups
    large_array = np.random.rand(50000, 20)
    
    # Traditional pandas backend
    pandas_df = dw.wrangle(large_array, backend='pandas')
    
    # High-performance Polars backend (2-100x faster!)
    polars_df = dw.wrangle(large_array, backend='polars')
    
    # Set global backend preference
    from datawrangler.core.configurator import set_dataframe_backend
    set_dataframe_backend('polars')  # All operations now use Polars
    
    # Text processing also benefits from Polars
    fast_text_embeddings = dw.wrangle(text, backend='polars')
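
Because Polars is an optional dependency, a defensive pattern (an assumption about your environment, not part of the
documented API) is to pick the backend at runtime and fall back to pandas when Polars isn't installed::

    try:
        import polars  # noqa: F401  # optional high-performance backend
        backend = 'polars'
    except ImportError:
        backend = 'pandas'

    # then pass it along, e.g.: df = dw.wrangle(large_array, backend=backend)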

The ``data-wrangler`` package also provides powerful decorators that can modify existing functions to support new
datatypes.  Just write your function as though its inputs are guaranteed to be pandas ``DataFrame`` objects, and decorate it with
``datawrangler.decorate.funnel`` to enable support for other datatypes without any new code::

  image_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/wrangler.jpg'
  image = dw.io.load(image_url)

  # define your function and decorate it with "funnel"
  @dw.decorate.funnel
  def binarize(x):
    return x > np.mean(x.values)

  binarized_image = binarize(image)  # rgb channels will be horizontally concatenated to create a 2D DataFrame


Supported data formats
----------------------

One package can't accommodate every foreseeable format or input source, but ``data-wrangler`` provides a framework for adding support for new datatypes in a straightforward way.  Essentially, adding support for a new data type entails writing two functions:

  - An ``is_`` function, which should return ``True`` if an object is compatible with the given datatype (or format), and ``False`` otherwise
  - A ``wrangle_`` function, which should take in an object of the given type or format and return a ``pandas`` or ``Polars`` ``DataFrame`` with numerical entries
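
As a hedged sketch, here is what that pair of functions might look like for a hypothetical dict-of-lists format (the
function names are illustrative; the package's actual registration mechanism may differ)::

    import pandas as pd

    def is_dict_of_lists(obj):
        # True if obj is a dict whose values are all lists
        return isinstance(obj, dict) and all(isinstance(v, list) for v in obj.values())

    def wrangle_dict_of_lists(obj):
        # turn the dict into a DataFrame with one column per key
        return pd.DataFrame(obj)

    data = {'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}
    assert is_dict_of_lists(data)
    df = wrangle_dict_of_lists(data)  # 3 rows x 2 columns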

Currently supported datatypes are limited to:

  - ``array``-like objects (including images)
  - ``DataFrame``-like or ``Series``-like objects (pandas and Polars)
  - text data (embedded using natural language processing models)
  - lists of mixtures of the above

**Backend Support**: All operations support both ``pandas`` (default) and ``Polars`` (high-performance) backends. Choose the backend that best fits your performance requirements and workflow preferences.

Missing observations (e.g., NaN values or empty strings) may be filled in using imputation and/or interpolation.
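
For illustration, comparable gap-filling can be sketched directly with ``pandas`` (these are pandas' own methods, not
``data-wrangler``'s wrangling options)::

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 2.0, 4.0]})

    interpolated = df.interpolate(limit_direction='both')  # linear interpolation, filling edges
    imputed = df.fillna(df.mean())                         # column-mean imputation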

.. |build-status| image:: https://github.com/ContextLab/data-wrangler/actions/workflows/ci.yaml/badge.svg
    :alt: build status
    :target: https://github.com/ContextLab/data-wrangler

.. |docs| image:: https://readthedocs.org/projects/data-wrangler/badge/
    :alt: docs status
    :target: https://data-wrangler.readthedocs.io/

.. |doi| image:: https://zenodo.org/badge/DOI/10.5281/zenodo.5123310.svg
    :alt: DOI
    :target: https://doi.org/10.5281/zenodo.5123310

Owner

  • Name: Contextual Dynamics Laboratory
  • Login: ContextLab
  • Kind: organization
  • Email: contextualdynamics@gmail.com
  • Location: Hanover, NH

Contextual Dynamics Laboratory at Dartmouth College

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Manning"
  given-names: "Jeremy"
  orcid: "https://orcid.org/0000-0001-7613-4732"
title: "DataWrangler: wrangler your messy data into consistent well-organized formats"
version: 0.2.2
doi: 10.5281/zenodo.5123310
date-released: 2022-07-25
url: "https://github.com/ContextLab/data-wrangler"

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Push event: 10
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 1
  • Push event: 10
  • Create event: 2

Committers

Last synced: almost 2 years ago

All Time
  • Total Commits: 144
  • Total Committers: 4
  • Avg Commits per committer: 36.0
  • Development Distribution Score (DDS): 0.292
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • jeremymanning — j****g@g****m — 102 commits
  • Jeremy Manning — j****g@d****u — 22 commits
  • Jeremy Manning — j****g — 15 commits
  • paxtonfitzpatrick — P****k@D****u — 5 commits
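
The Development Distribution Score above is consistent with the common definition DDS = 1 − (top committer's commits ÷
total commits); a quick check, assuming that definition::

    top_committer, total = 102, 144
    dds = round(1 - top_committer / total, 3)  # 0.292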

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 13
  • Total pull requests: 16
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 26 minutes
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 1.15
  • Average comments per pull request: 0.06
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jeremymanning (12)
  • paxtonfitzpatrick (1)
Pull Request Authors
  • jeremymanning (15)
  • paxtonfitzpatrick (1)
Top Labels
Issue Labels
  • bug (3)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 58 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 12
  • Total maintainers: 1
pypi.org: pydata-wrangler

Wrangle messy data into DataFrames (pandas or Polars), with a special focus on text data and natural language processing

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 58 Last month
Rankings
  • Dependent packages count: 10.1%
  • Stargazers count: 17.8%
  • Average: 18.4%
  • Forks count: 19.2%
  • Dependent repos count: 21.6%
  • Downloads: 23.2%
Maintainers (1)
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • Pillow ==9.2.0
  • Sphinx ==4.0.2
  • configparser ==5.2.0
  • datasets ==2.3.2
  • dill ==0.3.5.1
  • flair ==0.11.3
  • importlib-metadata ==3.10.1
  • jupyter-client ==7.3.4
  • jupyter-core ==4.11.1
  • jupyterlab-pygments ==0.2.2
  • konoha ==4.6.5
  • matplotlib ==3.5.1
  • nbclient ==0.6.6
  • nbconvert ==6.5.0
  • nbformat ==5.4.0
  • nbsphinx ==0.8.9
  • numpy ==1.22.3
  • pandas ==1.4.3
  • pydata-sphinx-theme ==0.9.0
  • requests ==2.28.1
  • scikit-learn ==1.1.1
  • scipy ==1.7.3
  • sentence-transformers ==2.2.2
  • sentencepiece ==0.1.95
  • setuptools ==61.2.0
  • six ==1.16.0
  • tokenizers ==0.12.1
  • torch ==1.12.0
  • torchvision ==0.13.0
  • tqdm ==4.64.0
  • transformers ==4.20.1
requirements.txt pypi
  • Pillow *
  • configparser *
  • dill *
  • importlib-metadata *
  • matplotlib *
  • numpy *
  • pandas *
  • requests *
  • scikit-learn *
  • scipy *
  • setuptools *
  • six *
  • tqdm *
requirements_dev.txt pypi
  • Sphinx >=1.8.5 development
  • nbsphinx * development
  • pydata-sphinx-theme * development
  • pygments * development
  • pytest * development
requirements_hf.txt pypi
  • datasets *
  • flair *
  • konoha *
  • pytorch-pretrained-bert *
  • pytorch-transformers *
  • sentence-transformers *
  • sentencepiece *
  • tokenizers *
  • torch *
  • torchvision *
  • transformers *