https://github.com/bdwilliamson/vimpy

Perform inference on algorithm-agnostic variable importance in Python


Science Score: 20.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.5%) to scientific vocabulary

Keywords

machine-learning nonparametric-statistics statistical-inference variable-importance
Last synced: 5 months ago

Repository

Perform inference on algorithm-agnostic variable importance in Python

Basic Info
Statistics
  • Stars: 20
  • Watchers: 3
  • Forks: 5
  • Open Issues: 3
  • Releases: 0
Topics
machine-learning nonparametric-statistics statistical-inference variable-importance
Created almost 9 years ago · Last pushed almost 4 years ago
Metadata Files
Readme · License

Python/vimpy: inference on algorithm-agnostic variable importance

License: MIT

Software author: Brian Williamson

Methodology authors: Brian Williamson, Peter Gilbert, Noah Simon, Marco Carone

Introduction

In predictive modeling applications, it is often of interest to determine the relative contribution of subsets of features in explaining an outcome; this is often called variable importance. It is useful to consider variable importance as a function of the unknown, underlying data-generating mechanism rather than the specific predictive algorithm used to fit the data. This package provides functions that, given fitted values from predictive algorithms, compute nonparametric estimates of variable importance based on \(R^2\), deviance, classification accuracy, and area under the receiver operating characteristic curve, along with asymptotically valid confidence intervals for the true importance.
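
For intuition, the \(R^2\)-based importance of a feature subset s can be viewed as the gain in population \(R^2\) from using the full conditional mean f_0 rather than the reduced conditional mean f_{0,s} that excludes the features in s. The following sketch of the estimand paraphrases the notation of the manuscripts cited below:

\[
\psi_{0,s} = \frac{E\big[\{Y - f_{0,s}(X)\}^2\big] - E\big[\{Y - f_0(X)\}^2\big]}{\mathrm{Var}(Y)}
\]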

For more details, please see the accompanying manuscripts “Nonparametric variable importance assessment using machine learning techniques” by Williamson, Gilbert, Carone, and Simon (Biometrics, 2020) and “A unified approach for inference on algorithm-agnostic variable importance” by Williamson, Gilbert, Simon, and Carone (arXiv, 2020).

Installation

You may install a stable release of vimpy via pip by running `pip install vimpy` from a Terminal window. Alternatively, you may install within a virtualenv environment.

You may install the current dev release of vimpy by downloading this repository directly.

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

Example

This example shows how to use vimpy in a simple setting with simulated data and a single regression function. For more examples and a detailed explanation, please see the vignette for the companion R package, vimp.

## load required libraries
import numpy as np
import vimpy
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

## -------------------------------------------------------------
## problem setup
## -------------------------------------------------------------
## define the conditional mean of Y given X; only features 1-3, 6, 7,
## and 11 (1-based) contribute to the mean
def cond_mean(x):
    f1 = np.where(np.logical_and(-2 <= x[:, 0], x[:, 0] < 2), np.floor(x[:, 0]), 0)
    f2 = np.where(x[:, 1] <= 0, 1, 0)
    f3 = np.where(x[:, 2] > 0, 1, 0)
    f6 = np.absolute(x[:, 5] / 4) ** 3
    f7 = np.absolute(x[:, 6] / 4) ** 5
    f11 = (7. / 3) * np.cos(x[:, 10] / 2)
    return f1 + f2 + f3 + f6 + f7 + f11

## create data
np.random.seed(4747)
n = 100
p = 15
s = 1 # (0-based) index of the feature whose importance is desired
x = np.zeros((n, p))
for i in range(x.shape[1]):
    x[:, i] = np.random.normal(0, 2, n)

y = cond_mean(x) + np.random.normal(0, 1, n)

## -------------------------------------------------------------
## preliminary step: get regression estimators
## -------------------------------------------------------------
## use grid search to get optimal number of trees and learning rate
ntrees = np.arange(100, 3500, 500)
lr = np.arange(.01, .5, .05)

param_grid = [{'n_estimators':ntrees, 'learning_rate':lr}]

## set up cv objects
## note: scikit-learn >= 1.0 names the squared-error loss 'squared_error';
## older versions used 'ls'
cv_full = GridSearchCV(GradientBoostingRegressor(loss = 'squared_error', max_depth = 1), param_grid = param_grid, cv = 5)
cv_small = GridSearchCV(GradientBoostingRegressor(loss = 'squared_error', max_depth = 1), param_grid = param_grid, cv = 5)

## fit the full regression
cv_full.fit(x, y)
full_fit = cv_full.best_estimator_.predict(x)

## fit the reduced regression: regress the fitted values from the full
## regression onto the remaining covariates
x_small = np.delete(x, s, 1) # delete the column(s) in s
cv_small.fit(x_small, full_fit)
small_fit = cv_small.best_estimator_.predict(x_small)

## -------------------------------------------------------------
## get variable importance estimates
## -------------------------------------------------------------
## set up the vimp object
vimp = vimpy.vim(y = y, x = x, s = 1, pred_func = cv_full, measure_type = "r_squared")
## get the point estimate of variable importance
vimp.get_point_est()
## get the influence function estimate
vimp.get_influence_function()
## get a standard error
vimp.get_se()
## get a confidence interval
vimp.get_ci()
## do a hypothesis test, compute p-value
vimp.hypothesis_test(alpha = 0.05, delta = 0)
## display the estimates, etc.
vimp.vimp_
vimp.se_
vimp.ci_
vimp.p_value_
vimp.hyp_test_

## -------------------------------------------------------------
## get variable importance estimates using cross-validation
## -------------------------------------------------------------
## set up the vimp object
vimp_cv = vimpy.cv_vim(y = y, x = x, s = 1, pred_func = cv_full, V = 5, measure_type = "r_squared")
## get the point estimate
vimp_cv.get_point_est()
## get the influence function estimate
vimp_cv.get_influence_function()
## get the standard error
vimp_cv.get_se()
## get a confidence interval
vimp_cv.get_ci()
## do a hypothesis test, compute p-value
vimp_cv.hypothesis_test(alpha = 0.05, delta = 0)
## display estimates, etc.
vimp_cv.vimp_
vimp_cv.se_
vimp_cv.ci_
vimp_cv.p_value_
vimp_cv.hyp_test_
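
Once populated, the attributes above can be inspected directly. A minimal sketch (attribute names as in the example above; the print formatting is illustrative only):

## summarize the cross-validated results
print("Estimated importance:", vimp_cv.vimp_)
print("Standard error:", vimp_cv.se_)
print("95% confidence interval:", vimp_cv.ci_)
print("p-value:", vimp_cv.p_value_)
print("Reject the null at level 0.05?", vimp_cv.hyp_test_)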

Owner

  • Name: Brian Williamson
  • Login: bdwilliamson
  • Kind: user
  • Location: Seattle, Washington USA
  • Company: Kaiser Permanente Washington Health Research Institute

Assistant Investigator at Kaiser Permanente Washington Health Research Institute. Interested in inference in high-dimensional settings.

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 74
  • Total Committers: 4
  • Avg Commits per committer: 18.5
  • Development Distribution Score (DDS): 0.176
Top Committers
Name Email Commits
Brian Williamson b****6@u****u 61
Brian Williamson b****n@k****g 10
Jean Feng j****g@g****m 2
Jenny j****t@g****m 1
Committer Domains (Top 20 + Academic)
  • kp.org: 1
  • uw.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 2
  • Average time to close issues: about 1 hour
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 4.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Anaraquelpengelly (1)
  • shaayaansayed (1)
  • Tim-Re (1)
  • mizano924 (1)
Pull Request Authors
  • JennyLeeStat (1)
  • jjfeng (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 63 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 11
  • Total maintainers: 1
pypi.org: vimpy

vimpy: perform inference on algorithm-agnostic variable importance in python

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 63 last month
Rankings
  • Dependent packages count: 10.0%
  • Stargazers count: 14.2%
  • Forks count: 15.3%
  • Downloads: 18.7%
  • Dependent repos count: 21.7%
  • Average: 16.0%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy *
  • scipy *
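
For reference, a minimal setup.py sketch consistent with the dependency listing above (reconstructed for illustration only, not copied from the repository):

## setup.py (illustrative sketch only)
from setuptools import setup, find_packages

setup(
    name = "vimpy",
    packages = find_packages(),
    install_requires = ["numpy", "scipy"], # unpinned, per the listing above
)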