rfpimp

Code to compute permutation and drop-column importances in Python scikit-learn models

https://github.com/parrt/random-forest-importances

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: springer.com
  • Committers with academic emails
    3 of 16 committers (18.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary
Last synced: 7 months ago

Repository

Code to compute permutation and drop-column importances in Python scikit-learn models

Basic Info
  • Host: GitHub
  • Owner: parrt
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 14.6 MB
Statistics
  • Stars: 619
  • Watchers: 21
  • Forks: 133
  • Open Issues: 10
  • Releases: 5
Created about 8 years ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

Feature importances for scikit-learn machine learning models

By Terence Parr and Kerem Turgutlu. See Explained.ai for more stuff.

The scikit-learn Random Forest feature importance strategy is the mean-decrease-in-impurity (gini importance) mechanism, which is unreliable. To get reliable results, use the permutation importance provided in the rfpimp package in the src dir. Install with:

pip install rfpimp

We include permutation and drop-column importance measures that work with any sklearn model. Yes, rfpimp is an increasingly-ill-suited name, but we still like it.
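Drop-column importance, for reference, retrains the model once per feature with that column removed and measures the resulting drop in validation score. A minimal sketch of the idea with synthetic data and hypothetical names (this is not rfpimp's API, just the underlying recipe):

```python
# Sketch of drop-column importance: retrain without each feature and
# measure how much the validation R^2 drops. Synthetic data for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 3))
y = 4 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=800)  # column 2 is pure noise

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

def fit_score(Xt, Xv):
    rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
    rf.fit(Xt, y_tr)
    return rf.score(Xv, y_va)

baseline = fit_score(X_tr, X_va)
# importance of column c = baseline R^2 minus R^2 with column c removed
drops = [baseline - fit_score(np.delete(X_tr, c, axis=1), np.delete(X_va, c, axis=1))
         for c in range(X_tr.shape[1])]
print(drops)
```

Note the cost: one full retraining per feature, which is why permutation importance is usually the practical choice.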

Description

See Beware Default Random Forest Importances for a deeper discussion of the issues surrounding feature importances in random forests (authored by Terence Parr, Kerem Turgutlu, Christopher Csiszar, and Jeremy Howard).

The mean-decrease-in-impurity importance of a feature is computed by measuring how effective the feature is at reducing uncertainty (classifiers) or variance (regressors) when creating decision trees within random forests. The problem is that this mechanism, while fast, does not always give an accurate picture of importance. Strobl et al. pointed out in Bias in random forest variable importance measures: Illustrations, sources and a solution that “the variable importance measures of Breiman's original random forest method ... are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.”

A more reliable method is permutation importance, which measures the importance of a feature as follows. Record a baseline accuracy (classifier) or R² score (regressor) by passing a validation set or the out-of-bag (OOB) samples through the random forest. Permute the column values of a single predictor feature and then pass all test samples back through the random forest and recompute the accuracy or R². The importance of that feature is the drop from the baseline score caused by permuting the column. The permutation mechanism is much more computationally expensive than the mean-decrease-in-impurity mechanism, but the results are more reliable.
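The recipe above can be sketched with plain scikit-learn and numpy (synthetic data and hypothetical names; rfpimp's `importances()` implements the real thing with a DataFrame interface):

```python
# Minimal sketch of permutation importance: shuffle one column at a time
# and record the drop in validation R^2. Synthetic data for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000)  # column 2 is pure noise

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

baseline = rf.score(X_valid, y_valid)          # baseline R^2 on the validation set
importances = []
for col in range(X_valid.shape[1]):
    saved = X_valid[:, col].copy()
    X_valid[:, col] = rng.permutation(saved)   # shuffle one column in place
    importances.append(baseline - rf.score(X_valid, y_valid))
    X_valid[:, col] = saved                    # restore the column
print(importances)
```

Because only predictions are recomputed, no retraining is needed; the model is fit once and each feature costs one extra pass through the validation set.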

Sample code

See the notebooks directory for things like Collinear features and Plotting feature importances.

Here's some sample Python code that uses the rfpimp package contained in the src directory. The data can be found in rent.csv, which is a subset of the data from Kaggle's Two Sigma Connect: Rental Listing Inquiries competition.

```python
from rfpimp import *
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

df_orig = pd.read_csv("/Users/parrt/github/random-forest-importances/notebooks/data/rent.csv")

df = df_orig.copy()

# attenuate the effect of outliers in price
df['price'] = np.log(df['price'])

df_train, df_test = train_test_split(df, test_size=0.20)

features = ['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('price', axis=1), df_train['price']
X_test, y_test = df_test.drop('price', axis=1), df_test['price']
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test)  # permutation
viz = plot_importances(imp)
viz.view()

df_train, df_test = train_test_split(df_orig, test_size=0.20)
features = ['bathrooms', 'bedrooms', 'price', 'longitude', 'latitude', 'interest_level']
df_train = df_train[features]
df_test = df_test[features]

X_train, y_train = df_train.drop('interest_level', axis=1), df_train['interest_level']
X_test, y_test = df_test.drop('interest_level', axis=1), df_test['interest_level']

# Add a column of random numbers as a baseline for importance
X_train['random'] = np.random.random(size=len(X_train))
X_test['random'] = np.random.random(size=len(X_test))

rf = RandomForestClassifier(n_estimators=100,
                            min_samples_leaf=5,
                            n_jobs=-1,
                            oob_score=True)
rf.fit(X_train, y_train)

imp = importances(rf, X_test, y_test, n_samples=-1)
viz = plot_importances(imp)
viz.view()
```

Feature correlation

See Feature collinearity heatmap. We can compute Spearman's rank correlation matrix for the features.
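As a sketch, pandas can compute the Spearman matrix directly; the DataFrame below is a synthetic stand-in for the rent data:

```python
# Sketch: Spearman's rank correlation matrix with pandas.
# Synthetic columns standing in for the rent dataset's features.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"bedrooms": rng.integers(0, 5, size=200)})
df["bathrooms"] = df["bedrooms"] + rng.integers(0, 2, size=200)   # correlated with bedrooms
df["price"] = 1000 * df["bedrooms"] + rng.normal(0, 50, size=200)  # monotone in bedrooms

corr = df.corr(method="spearman")  # rank-based, so it captures monotonic relationships
print(corr.round(2))
```

Spearman's correlation is preferable to Pearson's here because it detects any monotonic relationship, not just linear ones.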

Feature dependencies

The features we use in machine learning are rarely completely independent, which makes interpreting feature importance tricky. We could compute correlation coefficients, but those identify only linear relationships. A way to at least identify whether a feature, x, is dependent on other features is to train a model using x as the dependent variable and all other features as independent variables. Because random forests give us an easy out-of-bag (OOB) error estimate, the feature dependence functions rely on random forest models. The model's R² score indicates how easy it is to predict feature x using the other features: the higher the score, the more dependent feature x is.
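The procedure just described can be sketched directly with scikit-learn (synthetic data and hypothetical column names; rfpimp packages this pattern in its feature-dependence functions):

```python
# Sketch of the feature-dependence idea: predict each feature from the
# others with a random forest and use the OOB R^2 as the dependence score.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.normal(size=500)})
X["b"] = X["a"] * 2 + rng.normal(scale=0.1, size=500)  # b depends strongly on a
X["c"] = rng.normal(size=500)                          # c is independent

scores = {}
for col in X.columns:
    rf = RandomForestRegressor(n_estimators=50, oob_score=True, n_jobs=-1, random_state=0)
    rf.fit(X.drop(columns=col), X[col])
    scores[col] = rf.oob_score_  # OOB R^2: high means col is predictable from the rest
print(scores)
```

A dependent feature (like `b` above) scores near 1, while an independent one (like `c`) scores near or below 0.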

You can also get a feature dependence matrix / heatmap: a non-symmetric data frame in which each row gives the importance of every other variable for predicting the row's variable as the model target.

Owner

  • Name: Terence Parr
  • Login: parrt
  • Kind: user
  • Location: San Francisco

Tech lead at Google, ex-Professor of computer/data science, active contributor to open-source projects supporting developers. Creator of ANTLR parser generator.

GitHub Events

Total
  • Watch event: 22
  • Issue comment event: 7
  • Push event: 1
  • Pull request event: 2
  • Fork event: 3
Last Year
  • Watch event: 22
  • Issue comment event: 7
  • Push event: 1
  • Pull request event: 2
  • Fork event: 3

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 227
  • Total Committers: 16
  • Avg Commits per committer: 14.188
  • Development Distribution Score (DDS): 0.211
Past Year
  • Commits: 2
  • Committers: 2
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
parrt p****t@c****u 179
keremturgutlu k****u@g****m 13
chrispaulca c****r@u****u 6
Baldino M****o@l****m 5
Taylor Pellerin t****n@d****u 4
Matheus Couto m****o@g****m 4
Eugene Scherba e****a@g****m 4
Feras f****g@g****m 3
GilesStrong g****g@o****m 2
Yusuke Sakamoto y****t 1
ValterH v****k@g****m 1
Sagi Bazinin s****6@g****m 1
Rohan Bhandari r****1@g****m 1
Marco Tamassia t****o@g****m 1
Chris G c****k@g****m 1
dugy y****y@u****a 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 39
  • Total pull requests: 21
  • Average time to close issues: 3 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 35
  • Total pull request authors: 15
  • Average comments per issue: 2.62
  • Average comments per pull request: 2.43
  • Merged pull requests: 17
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: 3 days
  • Average time to close pull requests: 1 day
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 3.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • feribg (3)
  • Zylatis (2)
  • Yanjiayork (2)
  • marcotama (1)
  • Zianor (1)
  • xoelop (1)
  • p2327 (1)
  • TNFCFA (1)
  • miaomaggie (1)
  • mkhan037 (1)
  • diego-mazon (1)
  • Gunnvant (1)
  • Joprou (1)
  • vivekruhela (1)
  • liefficient (1)
Pull Request Authors
  • escherba (3)
  • ValterH (2)
  • cgilpatrick (2)
  • RohanBhandari (2)
  • tjpell (2)
  • parrt (2)
  • feribg (2)
  • mkbldn (1)
  • marcotama (1)
  • GilesStrong (1)
  • Sharpen6 (1)
  • matheusccouto (1)
  • yskmt (1)
  • femtomatic (1)
Top Labels
Issue Labels
enhancement (7) question (6) compatibility (4) lack of activity (2) duplicate (2) can't reproduce (1) bug (1) portability (1)
Pull Request Labels
enhancement (13) compatibility (4) bug (2)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 12,171 last-month
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 26
    (may contain duplicates)
  • Total versions: 27
  • Total maintainers: 1
pypi.org: rfpimp

Permutation and drop-column importance for scikit-learn random forests and other models

  • Versions: 22
  • Dependent Packages: 1
  • Dependent Repositories: 26
  • Downloads: 12,171 Last month
Rankings
Downloads: 1.7%
Stargazers count: 2.6%
Dependent repos count: 2.9%
Average: 2.9%
Dependent packages count: 3.2%
Forks count: 4.2%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: rfpimp
  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 14.9%
Stargazers count: 15.6%
Average: 28.9%
Dependent repos count: 34.0%
Dependent packages count: 51.2%
Last synced: 7 months ago

Dependencies

src/setup.py pypi
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *