datawig

Imputation of missing values in tables.

https://github.com/awslabs/datawig

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 14 committers (7.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Keywords

imputation missing-value-handling
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: JavaScript
  • Default Branch: master
  • Size: 6.51 MB
Statistics
  • Stars: 490
  • Watchers: 22
  • Forks: 70
  • Open Issues: 25
  • Releases: 0
Topics
imputation missing-value-handling
Created over 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

DataWig - Imputation for Tables

DataWig learns Machine Learning models to impute missing values in tables.

See our user-guide and extended documentation here.

Installation

CPU

pip3 install datawig

GPU

If you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU bindings. Depending on your version of CUDA, you can do this by running the following:

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
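
The mapping from a CUDA version to its requirements file can be sketched in a few lines of Python. The helper name `gpu_requirements_url` and the constant names are hypothetical and not part of DataWig. The version list and URL pattern are the ones given in the text above.

```python
# Version codes listed in the installation text above.
SUPPORTED_CUDA_VERSIONS = ("75", "80", "90", "91")

BASE_URL = "https://raw.githubusercontent.com/awslabs/datawig/master/requirements"


def gpu_requirements_url(cuda_version: str) -> str:
    """Return the URL of the GPU requirements file for a CUDA version (hypothetical helper)."""
    code = str(cuda_version).replace(".", "")  # accept "9.0" as well as "90"
    if code not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(
            f"unsupported CUDA version {cuda_version!r}; expected one of {SUPPORTED_CUDA_VERSIONS}"
        )
    return f"{BASE_URL}/requirements.gpu-cu{code}.txt"


print(gpu_requirements_url("9.0"))
```

Rejecting unknown versions early avoids a failed download of a requirements file that does not exist.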

Running DataWig

The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:

| Product Type | Description          | Size | Color |
|--------------|----------------------|------|-------|
| Shoe         | Ideal for Running    | 12UK | Black |
| SDCards      | Best SDCard ever ... | 8GB  | Blue  |
| Dress        | This yellow dress    | M    | ?     |
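
As a minimal sketch, the table above could be built as a pandas DataFrame like this. It assumes that the `?` in the Color column stands for a missing value, represented here as `None`. The variable names are illustrative only.

```python
import pandas as pd

# The example table from above; None marks the missing Color value (an assumption).
df = pd.DataFrame(
    {
        "Product Type": ["Shoe", "SDCards", "Dress"],
        "Description": ["Ideal for Running", "Best SDCard ever ...", "This yellow dress"],
        "Size": ["12UK", "8GB", "M"],
        "Color": ["Black", "Blue", None],
    }
)

# Columns with at least one missing value are the candidates for imputation.
columns_with_missing = df.columns[df.isna().any()].tolist()
print(columns_with_missing)
```

Listing the columns with missing values first makes it easier to decide which column to treat as the output and which to use as inputs.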

Quickstart Example

For most use cases, the SimpleImputer class is the best starting point. For convenience there is the function SimpleImputer.complete that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:

```python
import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()

# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
```
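
The masking step can be checked in isolation, without DataWig installed. The sketch below uses a stand-in numeric table with placeholder column names and a seeded NumPy generator, both assumptions made for reproducibility. It shows that comparing uniform random numbers against .9 hides roughly 10% of the cells.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Stand-in numeric table; the column names are placeholders for this sketch.
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

# The comparison is True for roughly 10% of cells; mask() replaces those cells with NaN.
drop = rng.random(df.shape) > 0.9
df_with_missing = df.mask(drop)

missing_fraction = df_with_missing.isna().to_numpy().mean()
print(f"fraction of missing cells: {missing_fraction:.3f}")
```

Because cells are dropped independently of their values, this simulates values that are missing completely at random.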

You can also impute values in specific columns only (called output_column below) using values in other columns (called input_columns below). DataWig currently supports imputation of categorical columns and numeric columns.

Imputation of categorical columns

```python
import datawig

df = datawig.utils.generate_df_string(
    num_samples=200,
    data_column_name='sentences',
    label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'],  # column(s) containing information about the column we want to impute
    output_column='label',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```

Imputation of numerical columns

```python
import datawig

df = datawig.utils.generate_df_numeric(
    num_samples=200,
    data_column_name='x',
    label_column_name='y')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'],  # column(s) containing information about the column we want to impute
    output_column='y',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```
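
Once predictions are available for rows whose true values are known, the imputation quality can be scored. The sketch below assumes the predictions sit in a column named `y_imputed` next to the true `y` column. That column name is an assumption to verify against the DataWig documentation. The helper `mean_squared_error` and the toy values are made up for illustration.

```python
import pandas as pd


def mean_squared_error(df: pd.DataFrame, true_col: str, imputed_col: str) -> float:
    """Mean squared error between a ground-truth column and its imputed counterpart."""
    errors = df[true_col] - df[imputed_col]
    return float((errors ** 2).mean())


# Toy stand-in for the output of imputer.predict(df_test); the values are made up
# and the 'y_imputed' column name is an assumption.
imputed = pd.DataFrame(
    {"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 4.0], "y_imputed": [0.5, 1.0, 3.0]}
)

mse = mean_squared_error(imputed, "y", "y_imputed")
print(mse)
```

Scoring on a held-out split, as produced by the random split in the example above, gives a fairer picture than scoring on the training rows.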

For more control over the model types and preprocessing steps, the Imputer class lets you specify all relevant model features and parameters directly.

For details on usage, refer to the provided examples.

Acknowledgments

Thanks to David Greenberg for the package name.

Building documentation

git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html

Executing Tests

Clone the repository from git and set up virtualenv in the root dir of the package:

python3 -m venv venv

Install the package from local sources:

./venv/bin/pip install -e .

Run tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Updating PyPI distribution

Before updating, increment the version in setup.py.

```
git clone git@github.com:awslabs/datawig.git
cd datawig

# build local distribution for current version
python setup.py sdist

# upload to PyPI
twine upload --skip-existing dist/*
```

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 11
  • Fork event: 1
Last Year
  • Watch event: 11
  • Fork event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 131
  • Total Committers: 14
  • Avg Commits per committer: 9.357
  • Development Distribution Score (DDS): 0.809
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Philipp Schmidt p****d@a****m 25
Philipp Schmidt t****d 24
Felix Biessmann f****n@b****e 19
Tammo Rukat t****t@g****m 16
Prathik Naidu p****n@a****m 15
felixbiessmann f****n 7
James Siri j****i@a****m 7
Prathik Naidu p****3@g****m 5
Andrey Taptunov a****v@g****m 4
Andrey Taptunov t****v@a****m 3
Tammo Rukat t****a@a****m 2
Felix Biessmann b****n@a****m 2
Felix Biessmann f****n@t****e 1
Vinit Ganorkar v****r@g****m 1
Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 57
  • Total pull requests: 48
  • Average time to close issues: 4 months
  • Average time to close pull requests: 15 days
  • Total issue authors: 45
  • Total pull request authors: 11
  • Average comments per issue: 2.63
  • Average comments per pull request: 0.81
  • Merged pull requests: 30
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • TammoR (4)
  • CLWcynthia (3)
  • gautambak (3)
  • zhimin-z (3)
  • tdhd (2)
  • bbb801 (2)
  • felixbiessmann (2)
  • wmlba (1)
  • imadeit (1)
  • imsazzad (1)
  • TaichiLi (1)
  • shoraj (1)
  • shreeratn (1)
  • TizianFusser (1)
  • angeliney (1)
Pull Request Authors
  • TammoR (14)
  • tdhd (13)
  • felixbiessmann (10)
  • dependabot[bot] (5)
  • prathik-naidu (2)
  • DovaX (1)
  • carlosmoralrubio (1)
  • VINIT777 (1)
  • tirkarthi (1)
  • angeliney (1)
  • pado31 (1)
Top Labels
Issue Labels
required for release (1)
Pull Request Labels
dependencies (5)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 123 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 11
  • Total versions: 14
  • Total maintainers: 4
pypi.org: datawig

Imputation for tables with missing values

  • Versions: 14
  • Dependent Packages: 1
  • Dependent Repositories: 11
  • Downloads: 123 Last month
Rankings
Stargazers count: 2.9%
Dependent repos count: 4.4%
Forks count: 5.2%
Average: 6.7%
Dependent packages count: 10.0%
Downloads: 10.9%
Last synced: 6 months ago

Dependencies

experiments/requirements.benchmarks.txt pypi
  • fancyimpute ==0.4.3
  • mxnet >=1.3.0
  • pandas >=0.22.0
  • tqdm ==4.32.2
  • typing >=3.6.6
requirements/requirements.benchmarks.txt pypi
  • fancyimpute ==0.4.3
  • mxnet >=1.3.0
  • pandas >=0.22.0
  • typing >=3.6.6
requirements/requirements.dev.txt pypi
  • pylint * development
  • pytest * development
requirements/requirements.gpu-cu10.txt pypi
  • mxnet-cu100 >=1.3.0
  • numpy >=1.15.0
  • pandas >=0.22.0
  • scikit-learn >=0.20.0
requirements/requirements.gpu-cu75.txt pypi
  • mxnet-cu75 >=1.3.0
  • numpy >=1.15.0
  • pandas >=0.22.0
  • scikit-learn >=0.20.0
requirements/requirements.gpu-cu80.txt pypi
  • mxnet-cu80 >=1.3.0
  • numpy >=1.15.0
  • pandas >=0.22.0
  • scikit-learn >=0.20.0
requirements/requirements.gpu-cu90.txt pypi
  • mxnet-cu90 >=1.3.0
  • numpy >=1.15.0
  • pandas >=0.22.0
  • scikit-learn >=0.20.0
requirements/requirements.gpu-cu91.txt pypi
  • mxnet-cu91 >=1.3.0
  • numpy >=1.15.0
  • pandas >=0.22.0
  • scikit-learn >=0.20.0
requirements/requirements.readthedocs.txt pypi
  • mxnet ==1.4.0
  • pandas ==0.25.0
  • python-dateutil *
  • scikit-learn ==0.22.1
  • sphinx *
  • typing ==3.6.6
requirements/requirements.txt pypi
  • mxnet <=1.7.0
  • pandas ==1.3.5
  • scikit-learn ==1.0.2
setup.py pypi