datawig

Imputation of missing values in tables.

https://github.com/awslabs/datawig

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 14 committers (7.1%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary

Keywords

imputation missing-value-handling

Last synced: 9 months ago

Repository

Imputation of missing values in tables.

Basic Info

Host: GitHub
Owner: awslabs
License: apache-2.0
Language: JavaScript
Default Branch: master
Size: 6.51 MB

Statistics

Stars: 490
Watchers: 22
Forks: 70
Open Issues: 25
Releases: 0

Topics

imputation missing-value-handling

Created almost 8 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Contributing License Code of conduct

DataWig - Imputation for Tables

DataWig learns Machine Learning models to impute missing values in tables.

See our user guide and extended documentation at https://datawig.readthedocs.io/.

Installation

CPU

pip3 install datawig

GPU

To run DataWig on a GPU, make sure your version of Apache MXNet (Incubating) includes the GPU bindings. Depending on your version of CUDA, you can do this by running the following:

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
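To illustrate the file-naming scheme used in the commands above, here is a minimal sketch that maps a CUDA version such as 9.0 to the matching requirements file URL. The names `gpu_requirements_url`, `BASE_URL`, and `SUPPORTED` are hypothetical helpers and not part of DataWig; only the URL pattern and the version codes come from the instructions above.

```python
# Hypothetical helper (not part of DataWig): build the URL of the GPU
# requirements file for a given CUDA version, following the pattern above.
BASE_URL = "https://raw.githubusercontent.com/awslabs/datawig/master/requirements"

# CUDA versions named in the instructions, mapped to their file-name codes.
SUPPORTED = {"7.5": "75", "8.0": "80", "9.0": "90", "9.1": "91"}


def gpu_requirements_url(cuda_version: str) -> str:
    """Return the requirements file URL for a CUDA version like '9.0'."""
    try:
        code = SUPPORTED[cuda_version]
    except KeyError:
        raise ValueError(
            f"unsupported CUDA version {cuda_version!r}; "
            f"expected one of {sorted(SUPPORTED)}"
        ) from None
    return f"{BASE_URL}/requirements.gpu-cu{code}.txt"


if __name__ == "__main__":
    print(gpu_requirements_url("9.0"))
```

Rejecting unknown versions up front avoids a failed download of a file that does not exist.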

Running DataWig

The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:

| Product Type | Description          | Size | Color |
|--------------|----------------------|------|-------|
| Shoe         | Ideal for Running    | 12UK | Black |
| SDCards      | Best SDCard ever ... | 8GB  | Blue  |
| Dress        | This yellow dress    | M    | ?     |
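As a rough sketch, the example table could be built in pandas as shown below. Storing the unknown color (shown as `?`) as `NaN` is an assumption made here for illustration; it is how pandas usually represents a missing value.

```python
import numpy as np
import pandas as pd

# The example table from above; the unknown color ("?") is stored as NaN
# (an assumption of this sketch, matching pandas' usual missing-value marker).
df = pd.DataFrame(
    {
        "Product Type": ["Shoe", "SDCards", "Dress"],
        "Description": ["Ideal for Running", "Best SDCard ever ...", "This yellow dress"],
        "Size": ["12UK", "8GB", "M"],
        "Color": ["Black", "Blue", np.nan],
    }
)

# Count missing values per column to see which columns need imputation.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Here only the `Color` column contains a missing value, so it is the natural candidate for an output column.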

Quickstart Example

For most use cases, the SimpleImputer class is the best starting point. For convenience there is the function SimpleImputer.complete that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:

```python
import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()

# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
```
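The masking step in the quickstart can be looked at in isolation. The sketch below uses only pandas and NumPy on stand-in random data (not DataWig's generator) and checks that roughly 10% of the cells end up missing. The seed, column names, and row count are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Stand-in for the generated data: 1000 rows, two numeric columns.
df = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})

# Boolean array with the same shape as df; True marks a cell to blank out.
# Uniform draws on [0, 1) exceed 0.9 about 10% of the time.
to_mask = rng.random(df.shape) > 0.9

# DataFrame.mask replaces cells where the condition is True with NaN.
df_with_missing = df.mask(to_mask)

missing_fraction = df_with_missing.isna().to_numpy().mean()
print(f"fraction missing: {missing_fraction:.3f}")
```

Because the mask is drawn independently per cell, the missing fraction is only approximately 10% and varies from run to run without a fixed seed.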

You can also impute values in specific columns only (called output_column below) using values in other columns (called input_columns below). DataWig currently supports imputation of categorical columns and numeric columns.

Imputation of categorical columns

```python
import datawig

df = datawig.utils.generate_df_string(
    num_samples=200,
    data_column_name='sentences',
    label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'],  # column(s) containing information about the column we want to impute
    output_column='label',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```

Imputation of numerical columns

```python
import datawig

df = datawig.utils.generate_df_numeric(
    num_samples=200,
    data_column_name='x',
    label_column_name='y')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'],  # column(s) containing information about the column we want to impute
    output_column='y',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```

For more control over the model types and preprocessing steps, the Imputer class lets you specify all relevant model features and parameters directly.

For details on usage, refer to the provided examples.

Acknowledgments

Thanks to David Greenberg for the package name.

Building documentation

git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html

Executing Tests

Clone the repository from git and set up virtualenv in the root dir of the package:

python3 -m venv venv

Install the package from local sources:

./venv/bin/pip install -e .

Run tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Updating PyPi distribution

Before updating, increment the version in setup.py.

```
git clone git@github.com:awslabs/datawig.git
cd datawig

# build local distribution for current version
python setup.py sdist

# upload to PyPi
twine upload --skip-existing dist/*
```

Owner

Name: Amazon Web Services - Labs
Login: awslabs
Kind: organization
Location: Seattle, WA

Website: http://amazon.com/aws/
Repositories: 914
Profile: https://github.com/awslabs

AWS Labs

GitHub Events

Total

Watch event: 11
Fork event: 1

Last Year

Watch event: 11
Fork event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 131
Total Committers: 14
Avg Commits per committer: 9.357
Development Distribution Score (DDS): 0.809

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Philipp Schmidt	p**d@a**m	25
Philipp Schmidt	t****d	24
Felix Biessmann	f**n@b**e	19
Tammo Rukat	t**t@g**m	16
Prathik Naidu	p**n@a**m	15
felixbiessmann	f****n	7
James Siri	j**i@a**m	7
Prathik Naidu	p**3@g**m	5
Andrey Taptunov	a**v@g**m	4
Andrey Taptunov	t**v@a**m	3
Tammo Rukat	t**a@a**m	2
Felix Biessmann	b**n@a**m	2
Felix Biessmann	f**n@t**e	1
Vinit Ganorkar	v**r@g**m	1
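The summary figures above can be reproduced from the commit counts in this table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it is consistent with the reported value.

```python
# Commit counts per committer identity, copied from the table above.
commits = [25, 24, 19, 16, 15, 7, 7, 5, 4, 3, 2, 2, 1, 1]

total_commits = sum(commits)
total_committers = len(commits)
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (commits by the top committer / total commits).
dds = 1 - max(commits) / total_commits

print(total_commits, total_committers, round(avg_commits, 3), round(dds, 3))
```

Note that the table counts committer identities (name and email pairs), so one person can appear more than once.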

Committer Domains (Top 20 + Academic)

amazon.com: 6
tu-berlin.de: 1
beuth-hochschule.de: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 57
Total pull requests: 48
Average time to close issues: 4 months
Average time to close pull requests: 15 days
Total issue authors: 45
Total pull request authors: 11
Average comments per issue: 2.63
Average comments per pull request: 0.81
Merged pull requests: 30
Bot issues: 0
Bot pull requests: 3

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 2.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

TammoR (4)
CLWcynthia (3)
gautambak (3)
zhimin-z (3)
tdhd (2)
bbb801 (2)
felixbiessmann (2)
wmlba (1)
imadeit (1)
imsazzad (1)
TaichiLi (1)
shoraj (1)
shreeratn (1)
TizianFusser (1)
angeliney (1)

Pull Request Authors

TammoR (14)
tdhd (13)
felixbiessmann (10)
dependabot[bot] (5)
prathik-naidu (2)
DovaX (1)
carlosmoralrubio (1)
VINIT777 (1)
tirkarthi (1)
angeliney (1)
pado31 (1)

Top Labels

Issue Labels

required for release (1)

Pull Request Labels

dependencies (5)

Packages

Total packages: 1
Total downloads:
- pypi 123 last-month

Total dependent packages: 1
Total dependent repositories: 11
Total versions: 14
Total maintainers: 4

pypi.org: datawig

Imputation for tables with missing values

Homepage: https://github.com/awslabs/datawig
Documentation: https://datawig.readthedocs.io/
License: Apache License 2.0
Latest release: 0.2.0
published almost 6 years ago

Versions: 14
Dependent Packages: 1
Dependent Repositories: 11
Downloads: 123 Last month

Rankings

Stargazers count: 2.9%

Dependent repos count: 4.4%

Forks count: 5.2%

Average: 6.7%

Dependent packages count: 10.0%

Downloads: 10.9%
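The "Average" figure appears to be the unweighted mean of the five individual rankings. That is an assumption, since the page does not say how it is computed, but the short check below agrees with the reported value.

```python
# Percentile rankings copied from the list above.
rankings = {
    "Stargazers count": 2.9,
    "Dependent repos count": 4.4,
    "Forks count": 5.2,
    "Dependent packages count": 10.0,
    "Downloads": 10.9,
}

# Assumed definition: unweighted arithmetic mean of the individual rankings.
average = sum(rankings.values()) / len(rankings)
print(round(average, 1))
```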

Maintainers (4)

andrey-taptunov felixbiessmann phschmid tammo

Last synced: 10 months ago

Dependencies

experiments/requirements.benchmarks.txt pypi

fancyimpute ==0.4.3
mxnet >=1.3.0
pandas >=0.22.0
tqdm ==4.32.2
typing >=3.6.6

requirements/requirements.benchmarks.txt pypi

fancyimpute ==0.4.3
mxnet >=1.3.0
pandas >=0.22.0
typing >=3.6.6

requirements/requirements.dev.txt pypi

pylint * development
pytest * development

requirements/requirements.gpu-cu10.txt pypi

mxnet-cu100 >=1.3.0
numpy >=1.15.0
pandas >=0.22.0
scikit-learn >=0.20.0

requirements/requirements.gpu-cu75.txt pypi

mxnet-cu75 >=1.3.0
numpy >=1.15.0
pandas >=0.22.0
scikit-learn >=0.20.0

requirements/requirements.gpu-cu80.txt pypi

mxnet-cu80 >=1.3.0
numpy >=1.15.0
pandas >=0.22.0
scikit-learn >=0.20.0

requirements/requirements.gpu-cu90.txt pypi

mxnet-cu90 >=1.3.0
numpy >=1.15.0
pandas >=0.22.0
scikit-learn >=0.20.0

requirements/requirements.gpu-cu91.txt pypi

mxnet-cu91 >=1.3.0
numpy >=1.15.0
pandas >=0.22.0
scikit-learn >=0.20.0

requirements/requirements.readthedocs.txt pypi

mxnet ==1.4.0
pandas ==0.25.0
python-dateutil *
scikit-learn ==0.22.1
sphinx *
typing ==3.6.6

requirements/requirements.txt pypi

mxnet <=1.7.0
pandas ==1.3.5
scikit-learn ==1.0.2

setup.py pypi
