pydata-wrangler
Wrangle messy numerical, image, and text data into consistent well-organized formats
Science Score: 77.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: 3 found in README
- ✓ Academic publication links: zenodo.org
- ✓ Committers with academic emails: 2 of 4 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.7%) to scientific vocabulary
Keywords
data
data-analysis
data-science
data-wrangling
hugging-face
image-data
machine-learning
nlp
numpy
pandas
polars
python
scikit-learn
Keywords from Contributors
environment-management
google-colab
install
ipython
package-management
pip
reproducibility
Last synced: 4 months ago
Repository
Statistics
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 4
- Releases: 12
- Created: over 4 years ago
- Last pushed: 7 months ago
Metadata Files
Readme
Changelog
Contributing
License
Citation
Authors
README.rst
Overview
================
|build-status| |docs| |doi|
Datasets come in all shapes and sizes, and are often *messy*:
- Observations come in different formats
- There are missing values
- Labels are missing and/or aren't consistent
- Datasets need to be wrangled 🐄 🐑 🚜
The main goal of ``data-wrangler`` is to turn messy data into clean(er) data, defined as either a ``DataFrame`` or a
list of ``DataFrame`` objects. The package provides code for easily wrangling data from a variety of formats into
``DataFrame`` objects, manipulating ``DataFrame`` objects in useful ways (that can be tricky to implement, but that
apply to many analysis scenarios), and decorating Python functions to make them more flexible and/or easier to write.
🚀 **New**: ``data-wrangler`` now supports **high-performance Polars DataFrames** alongside pandas, delivering 2-100x speedups
for large datasets with minimal code changes: simply add ``backend='polars'`` to any operation.
The ``data-wrangler`` package supports a variety of datatypes. There is a special emphasis on text data, whereby
``data-wrangler`` provides a simple API for interacting with natural language processing tools and datasets provided by
``scikit-learn`` and ``hugging-face`` (via sentence-transformers). The package is designed to provide sensible defaults, but also
implements convenient ways of deeply customizing how different datatypes are wrangled.
For more information, including a formal API and tutorials, check out https://data-wrangler.readthedocs.io
Quick start
================
Install ``data-wrangler`` from PyPI, where it is published as ``pydata-wrangler`` (the import name is ``datawrangler``):

.. code-block:: console

    $ pip install pydata-wrangler

Some quick natural language processing examples::

    import datawrangler as dw

    # load in sample text
    text_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/home_on_the_range.txt'
    text = dw.io.load(text_url)

    # embed text using scikit-learn's implementation of Latent Dirichlet Allocation, trained on a curated subset of
    # Wikipedia, called the 'minipedia' corpus. Return the fitted model so that it can be applied to new text.
    # NEW: Simplified API - just pass model names as strings or lists!
    lda_embeddings, lda_fit = dw.wrangle(
        text,
        text_kwargs={'model': ['CountVectorizer', 'LatentDirichletAllocation'], 'corpus': 'minipedia'},
        return_model=True)

    # apply the minipedia-trained LDA model to new text
    new_text = 'how much wood could a wood chuck chuck if a wood chuck could chuck wood?'
    new_embeddings = dw.wrangle(new_text, text_kwargs={'model': lda_fit})

    # embed text using sentence-transformers pre-trained model
    # NEW: Simplified API - just pass the model name as a string!
    sentence_embeddings = dw.wrangle(text, text_kwargs={'model': 'all-mpnet-base-v2'})

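For readers unfamiliar with the ``CountVectorizer`` and ``LatentDirichletAllocation`` pairing named above, the following is a minimal sketch of that two-step pipeline written directly against scikit-learn. It is not ``data-wrangler``'s internal code: the tiny ``corpus`` list, the choice of two topics, and the fixed ``random_state`` are assumptions made only for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus (an assumption; data-wrangler's 'minipedia' corpus is far larger)
corpus = [
    "oh give me a home where the buffalo roam",
    "where the deer and the antelope play",
    "where seldom is heard a discouraging word",
    "and the skies are not cloudy all day",
]

# Step 1: turn each document into a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Step 2: fit a topic model on the counts; each document becomes a
# distribution over topics (2 topics chosen arbitrarily here)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Apply the already-fitted vectorizer and topic model to unseen text
new_topics = lda.transform(vectorizer.transform(["a home where the deer play"]))
```

Each row of ``doc_topics`` and ``new_topics`` is a topic distribution for one document, which is the kind of numerical embedding the examples above return as a ``DataFrame``.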
High-performance Polars backend examples::

    import numpy as np
    import datawrangler as dw

    # Array processing with dramatic speedups
    large_array = np.random.rand(50000, 20)

    # Traditional pandas backend
    pandas_df = dw.wrangle(large_array, backend='pandas')

    # High-performance Polars backend (2-100x faster!)
    polars_df = dw.wrangle(large_array, backend='polars')

    # Set global backend preference
    from datawrangler.core.configurator import set_dataframe_backend
    set_dataframe_backend('polars')  # All operations now use Polars

    # Text processing also benefits from Polars
    # ("text" is the sample text loaded in the natural language processing example)
    fast_text_embeddings = dw.wrangle(text, backend='polars')

The ``data-wrangler`` package also provides powerful decorators that can modify existing functions to support new
datatypes. Just write your function as though its inputs are guaranteed to be pandas DataFrames, and decorate it with
``datawrangler.decorate.funnel`` to enable support for other datatypes without any new code::

    import numpy as np
    import datawrangler as dw

    image_url = 'https://raw.githubusercontent.com/ContextLab/data-wrangler/main/tests/resources/wrangler.jpg'
    image = dw.io.load(image_url)

    # define your function and decorate it with "funnel"
    @dw.decorate.funnel
    def binarize(x):
        return x > np.mean(x.values)

    binarized_image = binarize(image)  # rgb channels will be horizontally concatenated to create a 2D DataFrame

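To make the idea behind a funnel-style decorator concrete, here is a minimal sketch that coerces a function's first argument into a pandas ``DataFrame`` before calling it. The decorator name ``coerce_first_arg_to_dataframe`` is hypothetical, and the coercion rule (wrap anything array-like with NumPy) is an assumption; the real ``funnel`` decorator handles many more datatypes and options.

```python
import functools

import numpy as np
import pandas as pd


def coerce_first_arg_to_dataframe(func):
    """Hypothetical stand-in for a funnel-style decorator (not data-wrangler's code)."""
    @functools.wraps(func)
    def wrapper(x, *args, **kwargs):
        # Leave DataFrames alone; wrap anything array-like in a 2D DataFrame
        if not isinstance(x, pd.DataFrame):
            x = pd.DataFrame(np.atleast_2d(np.asarray(x)))
        return func(x, *args, **kwargs)
    return wrapper


@coerce_first_arg_to_dataframe
def binarize(x):
    # Written as though x is always a DataFrame
    return x > np.mean(x.values)


# A nested list is accepted even though binarize only knows about DataFrames
result = binarize([[0.1, 0.9], [0.8, 0.2]])
```

The decorated function stays short because all type handling lives in the wrapper, which is the design benefit the paragraph above describes.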
Supported data formats
----------------------
One package can't accommodate every foreseeable format or input source, but ``data-wrangler`` provides a framework for adding support for new datatypes in a straightforward way. Essentially, adding support for a new data type entails writing two functions:
- An ``is_`` function, which should return ``True`` if an object is compatible with the given datatype (or format), and ``False`` otherwise
- A ``wrangle_`` function, which should take in an object of the given type or format and return a ``pandas`` or ``Polars`` ``DataFrame`` with numerical entries
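The two-function pattern above can be sketched for a made-up datatype: a dictionary mapping column names to equal-length lists. Both function names and the datatype itself are hypothetical, and how such functions would be registered with ``data-wrangler`` is not shown here because the README does not describe it.

```python
import pandas as pd


def is_dict_of_lists(obj):
    """Return True if obj looks like {column_name: [values, ...]} with equal-length columns."""
    if not isinstance(obj, dict) or len(obj) == 0:
        return False
    columns = list(obj.values())
    if not all(isinstance(col, (list, tuple)) for col in columns):
        return False
    # every column must hold the same number of observations
    return len({len(col) for col in columns}) == 1


def wrangle_dict_of_lists(obj):
    """Convert a compatible dict into a DataFrame with numerical entries."""
    if not is_dict_of_lists(obj):
        raise TypeError("object is not a dict of equal-length lists")
    df = pd.DataFrame(obj)
    # coerce every column to numbers; entries that cannot be parsed become NaN
    return df.apply(pd.to_numeric, errors="coerce")


data = {"height": [1.5, 1.7, 1.6], "weight": ["60", "72", "n/a"]}
df = wrangle_dict_of_lists(data)
```

Keeping the compatibility check separate from the conversion lets a dispatcher try each ``is_`` function in turn and call only the matching ``wrangle_`` function.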
Currently supported datatypes are limited to:
- ``array``-like objects (including images)
- ``DataFrame``-like or ``Series``-like objects (pandas and Polars)
- text data (text is embedded using natural language processing models)
or lists of mixtures of the above.
**Backend Support**: All operations support both ``pandas`` (default) and ``Polars`` (high-performance) backends. Choose the backend that best fits your performance requirements and workflow preferences.
Missing observations (e.g., nans, empty strings, etc.) may be filled in using imputation and/or interpolation.
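As a rough illustration of the difference between the two strategies, the sketch below fills gaps in a small ``DataFrame`` using plain pandas. This is not ``data-wrangler``'s own interface for missing data, which the README does not show; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up)
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Interpolation: estimate each gap from its neighbors along the rows
interpolated = df.interpolate(method="linear")

# Imputation: replace each gap with a summary statistic (here, the column mean)
imputed = df.fillna(df.mean())
```

Interpolation suits ordered observations such as time series, while mean imputation ignores row order and only uses each column's overall distribution.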
.. |build-status| image:: https://github.com/ContextLab/data-wrangler/actions/workflows/ci.yaml/badge.svg
    :alt: build status
    :target: https://github.com/ContextLab/data-wrangler

.. |docs| image:: https://readthedocs.org/projects/data-wrangler/badge/
    :alt: docs status
    :target: https://data-wrangler.readthedocs.io/

.. |doi| image:: https://zenodo.org/badge/DOI/10.5281/zenodo.5123310.svg
    :target: https://doi.org/10.5281/zenodo.5123310

Owner
- Name: Contextual Dynamics Laboratory
- Login: ContextLab
- Kind: organization
- Email: contextualdynamics@gmail.com
- Location: Hanover, NH
- Website: http://www.context-lab.com
- Repositories: 35
- Profile: https://github.com/ContextLab
Contextual Dynamics Laboratory at Dartmouth College
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Manning"
    given-names: "Jeremy"
    orcid: "https://orcid.org/0000-0001-7613-4732"
title: "DataWrangler: wrangler your messy data into consistent well-organized formats"
version: 0.2.2
doi: 10.5281/zenodo.5123310
date-released: 2022-07-25
url: "https://github.com/ContextLab/data-wrangler"
GitHub Events
Total
- Release event: 1
- Watch event: 1
- Push event: 10
- Create event: 2
Last Year
- Release event: 1
- Watch event: 1
- Push event: 10
- Create event: 2
Committers
Last synced: almost 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| jeremymanning | j****g@g****m | 102 |
| Jeremy Manning | j****g@d****u | 22 |
| Jeremy Manning | j****g | 15 |
| paxtonfitzpatrick | P****k@D****u | 5 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 13
- Total pull requests: 16
- Average time to close issues: about 1 month
- Average time to close pull requests: 26 minutes
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 1.15
- Average comments per pull request: 0.06
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jeremymanning (12)
- paxtonfitzpatrick (1)
Pull Request Authors
- jeremymanning (15)
- paxtonfitzpatrick (1)
Top Labels
Issue Labels
bug (3)
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 58 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 12
- Total maintainers: 1
pypi.org: pydata-wrangler
Wrangle messy data into DataFrames (pandas or Polars), with a special focus on text data and natural language processing
- Homepage: https://github.com/ContextLab/data-wrangler
- Documentation: https://pydata-wrangler.readthedocs.io/
- License: MIT license
- Latest release: 0.4.0 (published 7 months ago)
Rankings
Dependent packages count: 10.1%
Stargazers count: 17.8%
Average: 18.4%
Forks count: 19.2%
Dependent repos count: 21.6%
Downloads: 23.2%
Maintainers (1)
Last synced: 4 months ago
Dependencies
docs/requirements.txt
pypi
- Pillow ==9.2.0
- Sphinx ==4.0.2
- configparser ==5.2.0
- datasets ==2.3.2
- dill ==0.3.5.1
- flair ==0.11.3
- importlib-metadata ==3.10.1
- jupyter-client ==7.3.4
- jupyter-core ==4.11.1
- jupyterlab-pygments ==0.2.2
- konoha ==4.6.5
- matplotlib ==3.5.1
- nbclient ==0.6.6
- nbconvert ==6.5.0
- nbformat ==5.4.0
- nbsphinx ==0.8.9
- numpy ==1.22.3
- pandas ==1.4.3
- pydata-sphinx-theme ==0.9.0
- requests ==2.28.1
- scikit-learn ==1.1.1
- scipy ==1.7.3
- sentence-transformers ==2.2.2
- sentencepiece ==0.1.95
- setuptools ==61.2.0
- six ==1.16.0
- tokenizers ==0.12.1
- torch ==1.12.0
- torchvision ==0.13.0
- tqdm ==4.64.0
- transformers ==4.20.1
requirements.txt
pypi
- Pillow *
- configparser *
- dill *
- importlib-metadata *
- matplotlib *
- numpy *
- pandas *
- requests *
- scikit-learn *
- scipy *
- setuptools *
- six *
- tqdm *
requirements_dev.txt
pypi
- Sphinx >=1.8.5 development
- nbsphinx * development
- pydata-sphinx-theme * development
- pygments * development
- pytest * development
requirements_hf.txt
pypi
- datasets *
- flair *
- konoha *
- pytorch-pretrained-bert *
- pytorch-transformers *
- sentence-transformers *
- sentencepiece *
- tokenizers *
- torch *
- torchvision *
- transformers *