Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 14 committers (7.1%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.8%) to scientific vocabulary
Repository
Imputation of missing values in tables.
Basic Info
- Host: GitHub
- Owner: awslabs
- License: apache-2.0
- Language: JavaScript
- Default Branch: master
- Size: 6.51 MB
Statistics
- Stars: 490
- Watchers: 22
- Forks: 70
- Open Issues: 25
- Releases: 0
Metadata Files
README.md
DataWig - Imputation for Tables
DataWig learns Machine Learning models to impute missing values in tables.
See our user guide and extended documentation at https://datawig.readthedocs.io/.
Installation
CPU
```bash
pip3 install datawig
```
GPU
If you want to run DataWig on a GPU you need to make sure your version of Apache MXNet Incubating contains the GPU bindings. Depending on your version of CUDA, you can do this by running the following:
```bash
wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt
```
where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
Running DataWig
The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:
| Product Type | Description          | Size | Color |
|--------------|----------------------|------|-------|
| Shoe         | Ideal for Running    | 12UK | Black |
| SDCards      | Best SDCard ever ... | 8GB  | Blue  |
| Dress        | This yellow dress    | M    | ?     |
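As a minimal sketch, the example table could be built as a pandas DataFrame like this. Representing the unknown color (shown as `?`) with `None` is an assumption made for illustration; pandas then reports it as a missing value.

```python
import pandas as pd

# Rows mirror the example table; the unknown color is stored as None,
# which pandas treats as a missing value.
df = pd.DataFrame(
    {
        "Product Type": ["Shoe", "SDCards", "Dress"],
        "Description": ["Ideal for Running", "Best SDCard ever ...", "This yellow dress"],
        "Size": ["12UK", "8GB", "M"],
        "Color": ["Black", "Blue", None],
    }
)

# Count missing values per column to see which columns would need imputation.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Here only the `Color` column has a missing entry, so it is the only candidate for imputation.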
Quickstart Example
For most use cases, the SimpleImputer class is the best starting point. For convenience there is the function SimpleImputer.complete that takes a DataFrame and fits an imputation model for each column with missing values, with all other columns as inputs:
```python
import datawig, numpy

# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()

# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
```
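To see what the masking step in the quickstart does, here is a small sketch that uses only pandas and NumPy. The stand-in DataFrame, the fixed seed, and the variable names are assumptions for illustration. `DataFrame.mask` replaces every cell where the condition is `True` with `NaN`, so roughly 10% of the cells end up missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Stand-in data (hypothetical); the quickstart uses DataWig's generated data instead.
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])

# True for roughly 10% of the cells: uniform draws exceed 0.9 about 10% of the time.
missing_mask = rng.random(df.shape) > 0.9

# mask() sets cells to NaN wherever the condition is True.
df_with_missing = df.mask(missing_mask)

missing_fraction = df_with_missing.isna().to_numpy().mean()
print(f"fraction of missing cells: {missing_fraction:.3f}")
```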
You can also impute values in specific columns only (called output_column below) using values in other columns (called input_columns below). DataWig currently supports imputation of categorical columns and numeric columns.
Imputation of categorical columns
```python
import datawig

df = datawig.utils.generate_df_string(num_samples=200,
                                      data_column_name='sentences',
                                      label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'],  # column(s) containing information about the column we want to impute
    output_column='label',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```
Imputation of numerical columns
```python
import datawig

df = datawig.utils.generate_df_numeric(num_samples=200,
                                       data_column_name='x',
                                       label_column_name='y')

df_train, df_test = datawig.utils.random_split(df)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'],  # column(s) containing information about the column we want to impute
    output_column='y',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```
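One way to check imputation quality is to compare the predictions against held-out true values. The sketch below assumes the DataFrame returned by `predict` contains the original column `y` plus a prediction column named `y_imputed`; that column name is an assumption here, so inspect the returned DataFrame's columns before relying on it. A small hand-made frame stands in for the real output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by imputer.predict(df_test).
imputed = pd.DataFrame(
    {
        "x": [0.0, 1.0, 2.0, 3.0],
        "y": [0.0, 1.0, 4.0, 9.0],
        "y_imputed": [0.5, 1.0, 3.0, 9.5],  # assumed name of the prediction column
    }
)

# Per-row difference between imputed and true values.
errors = imputed["y_imputed"] - imputed["y"]

mse = float(np.mean(errors ** 2))      # mean squared error
mae = float(np.mean(np.abs(errors)))   # mean absolute error
print(f"MSE: {mse:.3f}, MAE: {mae:.3f}")
```

For a categorical target, the same pattern applies with an accuracy computed from an equality comparison instead of squared or absolute errors.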
For more control over model types and preprocessing, the Imputer class lets you specify all relevant model features and parameters directly.
For details on usage, refer to the provided examples.
Acknowledgments
Thanks to David Greenberg for the package name.
Building documentation
```bash
git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html
```
Executing Tests
Clone the repository from git and set up virtualenv in the root dir of the package:
```
python3 -m venv venv
```
Install the package from local sources:
```
./venv/bin/pip install -e .
```
Run tests:
```
./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest
```
Updating PyPI distribution
Before updating, increment the version in setup.py.
```
git clone git@github.com:awslabs/datawig.git
cd datawig

# build local distribution for current version
python setup.py sdist

# upload to PyPI
twine upload --skip-existing dist/*
```
Owner
- Name: Amazon Web Services - Labs
- Login: awslabs
- Kind: organization
- Location: Seattle, WA
- Website: http://amazon.com/aws/
- Repositories: 914
- Profile: https://github.com/awslabs
AWS Labs
GitHub Events
Total
- Watch event: 11
- Fork event: 1
Last Year
- Watch event: 11
- Fork event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Philipp Schmidt | p****d@a****m | 25 |
| Philipp Schmidt | t****d | 24 |
| Felix Biessmann | f****n@b****e | 19 |
| Tammo Rukat | t****t@g****m | 16 |
| Prathik Naidu | p****n@a****m | 15 |
| felixbiessmann | f****n | 7 |
| James Siri | j****i@a****m | 7 |
| Prathik Naidu | p****3@g****m | 5 |
| Andrey Taptunov | a****v@g****m | 4 |
| Andrey Taptunov | t****v@a****m | 3 |
| Tammo Rukat | t****a@a****m | 2 |
| Felix Biessmann | b****n@a****m | 2 |
| Felix Biessmann | f****n@t****e | 1 |
| Vinit Ganorkar | v****r@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 57
- Total pull requests: 48
- Average time to close issues: 4 months
- Average time to close pull requests: 15 days
- Total issue authors: 45
- Total pull request authors: 11
- Average comments per issue: 2.63
- Average comments per pull request: 0.81
- Merged pull requests: 30
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- TammoR (4)
- CLWcynthia (3)
- gautambak (3)
- zhimin-z (3)
- tdhd (2)
- bbb801 (2)
- felixbiessmann (2)
- wmlba (1)
- imadeit (1)
- imsazzad (1)
- TaichiLi (1)
- shoraj (1)
- shreeratn (1)
- TizianFusser (1)
- angeliney (1)
Pull Request Authors
- TammoR (14)
- tdhd (13)
- felixbiessmann (10)
- dependabot[bot] (5)
- prathik-naidu (2)
- DovaX (1)
- carlosmoralrubio (1)
- VINIT777 (1)
- tirkarthi (1)
- angeliney (1)
- pado31 (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 123 last month
- Total dependent packages: 1
- Total dependent repositories: 11
- Total versions: 14
- Total maintainers: 4
pypi.org: datawig
Imputation for tables with missing values
- Homepage: https://github.com/awslabs/datawig
- Documentation: https://datawig.readthedocs.io/
- License: Apache License 2.0
- Latest release: 0.2.0 (published over 5 years ago)
Dependencies
- fancyimpute ==0.4.3
- mxnet >=1.3.0
- pandas >=0.22.0
- tqdm ==4.32.2
- typing >=3.6.6
- fancyimpute ==0.4.3
- mxnet >=1.3.0
- pandas >=0.22.0
- typing >=3.6.6
- pylint * development
- pytest * development
- mxnet-cu100 >=1.3.0
- numpy >=1.15.0
- pandas >=0.22.0
- scikit-learn >=0.20.0
- mxnet-cu75 >=1.3.0
- numpy >=1.15.0
- pandas >=0.22.0
- scikit-learn >=0.20.0
- mxnet-cu80 >=1.3.0
- numpy >=1.15.0
- pandas >=0.22.0
- scikit-learn >=0.20.0
- mxnet-cu90 >=1.3.0
- numpy >=1.15.0
- pandas >=0.22.0
- scikit-learn >=0.20.0
- mxnet-cu91 >=1.3.0
- numpy >=1.15.0
- pandas >=0.22.0
- scikit-learn >=0.20.0
- mxnet ==1.4.0
- pandas ==0.25.0
- python-dateutil *
- scikit-learn ==0.22.1
- sphinx *
- typing ==3.6.6
- mxnet <=1.7.0
- pandas ==1.3.5
- scikit-learn ==1.0.2