iRF
iRF: extracting interactions from random forests - Published in JOSS (2018)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
- ✓ Academic publication links: Links to arxiv.org
- ✓ Committers with academic emails: 2 of 8 committers (25.0%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
iterative Random Forests (iRF): iteratively grows weighted random forests, finds interaction among features
Basic Info
- Host: GitHub
- Owner: sumbose
- License: gpl-3.0
- Language: R
- Default Branch: master
- Size: 6.15 MB
Statistics
- Stars: 48
- Watchers: 4
- Forks: 16
- Open Issues: 3
- Releases: 1
Metadata Files
README.md
iterative Random Forests (iRF)
The R package iRF implements iterative Random Forests, a method for
iteratively growing an ensemble of weighted decision trees and detecting
high-order feature interactions by analyzing feature usage on decision paths.
This version uses source code from the R package randomForest by Andy Liaw
and Matthew Wiener and the original Fortran code by Leo Breiman and Adele
Cutler.
To download and install the package, use devtools:
```r
library(devtools)
devtools::install_github("karlkumbier/iRF2.0")
```
Alternatively, the package can be installed by downloading this repository and
running the following command from a shell:
```sh
R CMD INSTALL iRF2.0
```
You can subsequently load the package with the usual R commands:
```r
library(iRF)
```
OSX users may need to install gfortran to compile. This can be done with the following shell commands:
```sh
curl -OL http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
```
Workflow Overview
Here is a brief description of the algorithm implemented in this package. It assumes the default behavior and is simplified, but should be enough to give you a general idea of what is happening under the hood.
- Input a numeric feature matrix `x` and a response vector `y`.
- Iteratively train `n.iter` random forests by doing...
  - Populate the weight vector `mtry.select.prob = rep(1, ncol(x))`, which indicates the probability that each feature is chosen when training the random forests.
  - Train a random forest with `x` and `y`, and save it for later use.
  - Update `mtry.select.prob` with the Gini importance of each feature, so that the more prediction accuracy a feature provides, the more likely it is to be selected in the next iteration.
  - Repeat this routine `n.iter` times.
- Find the random forest from the iteration with the highest OOB accuracy, a.k.a. `rand.forest`.
- Run Generalized RIT on `rand.forest` by calling `gRIT`, which does...
  - Construct `read.forest` from `rand.forest` by calling `readForest`, which does...
    - Construct `read.forest$tree.info`, a data frame where each row corresponds to a leaf node in `rand.forest`, and each column records some metadata about that leaf. This is mostly used to construct the following two matrices.
    - Construct `read.forest$node.feature`, a numeric sparse matrix where each row corresponds to a leaf node in `rand.forest`, and each column records the split point of (the first appearance of) all features on the path to that leaf.
    - Construct `read.forest$node.obs`, a boolean sparse matrix where each row corresponds to an observation, and each column records whether that observation falls in a certain leaf in `rand.forest`. This means `rowSums(node.obs)` should be equal to `rep(ntree, nrow(x))`, where `ntree` is the number of trees in each forest.
  - Subset `read.forest`, keeping only leaves whose prediction is `rit.param$class.id` (for classification), or is over a threshold `rit.param$class.cut` (for regression).
  - Run Random Intersection Tree on `read.forest$node.feature`, with weight being the precision of each leaf times its size, i.e. the number of observations that fall into that leaf. For the RIT algorithm, each row/leaf node/decision path is considered as an observation. A total of `rit.param$ntree` RITs are grown, and the union of the intersections recovered by these RITs is aggregated and stored in `ints.eval` for further inspection.
  - Calculate importance metrics for the interactions in `ints.eval` across leaf nodes of `rand.forest`.
- Run outer layer bootstrap stability analysis on `ints.eval` by calling `stabilityScore`, which does...
  - Generate `n.bootstrap` bootstrap samples, a.k.a. `bs.sample`, and for each sample...
    - Fit random forests on a sample.
    - Extract significant interactions on the fitted forests by calling `gRIT`.
  - Summarize interaction importance metrics across bootstrap samples.
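The Random Intersection Tree step in the workflow above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the function names `rit_tree` and `run_rit`, the toy leaf-by-feature matrix, and the leaf-size weights are all hypothetical, and the sketch treats any non-zero entry as "this feature appears on the path to this leaf".

```r
# Toy Random Intersection Trees on a leaf-by-feature matrix.
# rit_tree(), run_rit() and the data below are hypothetical names for
# illustration only; they are not functions or objects from the iRF package.

# Grow one intersection tree; return the feature sets surviving at `depth`.
rit_tree <- function(m, weights, depth = 3, nchild = 2) {
  draw <- function() {
    # Sample one row (leaf node) with probability proportional to its weight
    i <- sample.int(nrow(m), size = 1, prob = weights)
    unname(which(m[i, ] != 0))
  }
  nodes <- list(draw())                  # root: active features of one sampled row
  for (d in seq_len(depth - 1)) {
    children <- list()
    for (s in nodes) {
      for (k in seq_len(nchild)) {
        # Each child intersects its parent's set with a freshly sampled row
        child <- intersect(s, draw())
        if (length(child) > 0) children[[length(children) + 1]] <- child
      }
    }
    nodes <- children
    if (length(nodes) == 0) break        # every path in this tree died out
  }
  nodes
}

# Grow `ntree` trees and return the union of surviving intersections as keys.
run_rit <- function(m, weights = rep(1, nrow(m)), ntree = 100,
                    depth = 3, nchild = 2) {
  sets <- list()
  for (t in seq_len(ntree)) {
    sets <- c(sets, rit_tree(m, weights, depth, nchild))
  }
  keys <- vapply(sets, function(s) paste(sort(s), collapse = "_"), character(1))
  unique(keys)
}

set.seed(1)
n_leaf <- 200
p <- 10
# Background features are active in roughly 20% of leaves
m <- matrix(rbinom(n_leaf * p, 1, 0.2), nrow = n_leaf, ncol = p)
m[, c(2, 5)] <- 1                        # features 2 and 5 appear on every path
leaf_size <- sample(5:50, n_leaf, replace = TRUE)  # stand-in for precision * size

found <- run_rit(m, weights = leaf_size, ntree = 100, depth = 3, nchild = 2)
print(head(found))
```

Because features 2 and 5 are active in every row of the toy matrix, every surviving intersection contains both of them, while background features are filtered out quickly as more rows are intersected.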
Iterative reweighting assigns weights proportional to the predictive power of a feature. As a result, component features of a significant intersection are given more weight, and thus tend to appear earlier in the decision path. By keeping parts of high-order intersections in the path, we essentially reduce the order of these intersections. Note, however, that iterative reweighting doesn't seem to improve the accuracy of prediction.
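To make the effect of reweighting concrete, the following base R sketch simulates how often each feature is drawn as a split candidate under uniform weights versus importance-based weights. It rests on assumptions: the `importance` vector is made up rather than taken from a fitted forest, the helper `candidate_freq` is hypothetical, and weighted sampling without replacement via `sample.int` only stands in for whatever the package's compiled code does internally.

```r
# Toy simulation of weighted candidate-feature sampling.
# `importance` is a hypothetical stand-in for Gini importance, and
# candidate_freq() is an illustrative helper, not part of the iRF package.
set.seed(42)
p <- 6           # number of features
mtry <- 2        # candidate features drawn at each split
n_split <- 5000  # number of simulated splits

# Fraction of simulated splits in which each feature is a candidate
candidate_freq <- function(prob) {
  draws <- replicate(
    n_split,
    sample.int(p, size = mtry, replace = FALSE, prob = prob)
  )
  tabulate(as.vector(draws), nbins = p) / n_split
}

mtry.select.prob <- rep(1, p)          # first iteration: uniform weights
freq_uniform <- candidate_freq(mtry.select.prob)

importance <- c(8, 1, 1, 6, 1, 1)      # hypothetical: features 1 and 4 dominate
mtry.select.prob <- importance         # next iteration: weights follow importance
freq_weighted <- candidate_freq(mtry.select.prob)

print(round(rbind(uniform = freq_uniform, weighted = freq_weighted), 2))
```

Under uniform weights each feature is a candidate in about `mtry / p` of the splits. After reweighting, the high-importance features are offered far more often, which is why they tend to be split on earlier in the decision paths.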
See "Iterative random forests to discover predictive and stable high-order interactions" and "Refining interaction search through signed iterative Random Forests" for a much more in-depth description, but note that this code base has evolved since their publication.
Owner
- Name: Sumanta Basu
- Login: sumbose
- Kind: user
- Location: Ithaca, NY
- Company: Cornell University
- Website: faculty.bscb.cornell.edu/~basu/
- Repositories: 2
- Profile: https://github.com/sumbose
JOSS Publication
iRF: extracting interactions from random forests
Authors
Denotes equal contribution, Department of Biological Statistics and Computational Biology, Cornell University, Department of Statistical Science, Cornell University
Denotes equal contribution, Statistics Department, University of California, Berkeley
Statistics Department, University of California, Berkeley, Centre for Computational Biology, School of Biosciences, University of Birmingham, Molecular Ecosystems Biology Department, Biosciences Area, Lawrence Berkeley National Laboratory
Statistics Department, University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Tags
Random Forests, Interpretable machine learning
GitHub Events
Total
- Watch event: 5
Last Year
- Watch event: 5
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Karl Kumbier | k****r@b****u | 32 |
| Karl Kumbier | k****l@M****l | 14 |
| sumbose | s****e@g****m | 9 |
| Karl Kumbier | k****l@M****l | 9 |
| Karl Kumbier | k****l@M****s | 4 |
| Karl Kumbier | k****l@c****u | 3 |
| Karl Kumbier | k****r@k****t | 2 |
| Karl Kumbier | k****r@K****l | 2 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 17
- Total pull requests: 9
- Average time to close issues: 8 days
- Average time to close pull requests: about 2 hours
- Total issue authors: 8
- Total pull request authors: 1
- Average comments per issue: 1.12
- Average comments per pull request: 0.0
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- nanxstats (9)
- warnes (2)
- YidanZ65 (1)
- onesmallchen (1)
- sherff100 (1)
- jzyuan (1)
- rlbarter (1)
- PietJones (1)
Pull Request Authors
- karlkumbier (9)
Dependencies
- R >= 3.1.2 depends
- AUC * imports
- Matrix * imports
- RColorBrewer * imports
- Rcpp * imports
- data.table * imports
- doParallel * imports
- doRNG * imports
- dplyr * imports
- fastmatch * imports
- foreach * imports
- memoise * imports
- methods * imports
- ranger * imports
- rgl * imports
- stringr * imports
- MASS * suggests
- covr * suggests
- testthat >= 2.1.0 suggests
