iRF

iRF: extracting interactions from random forests - Published in JOSS (2018)

https://github.com/sumbose/irf

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    2 of 8 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

iterative Random Forests (iRF): iteratively grows weighted random forests, finds interactions among features

Basic Info
  • Host: GitHub
  • Owner: sumbose
  • License: gpl-3.0
  • Language: R
  • Default Branch: master
  • Size: 6.15 MB
Statistics
  • Stars: 48
  • Watchers: 4
  • Forks: 16
  • Open Issues: 3
  • Releases: 1
Created over 9 years ago · Last pushed almost 5 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

iterative Random Forests (iRF)

The R package iRF implements iterative Random Forests, a method for iteratively growing ensembles of weighted decision trees and detecting high-order feature interactions by analyzing feature usage on decision paths. This version uses source code from the R package randomForest by Andy Liaw and Matthew Wiener and the original Fortran code by Leo Breiman and Adele Cutler.

To download and install the package, use devtools:

```r
library(devtools)
devtools::install_github("karlkumbier/iRF2.0")
```

Alternatively, the package can be installed by downloading this repository and using the command:

```sh
R CMD INSTALL iRF2.0
```

You can subsequently load the package with the usual R command:

```r
library(iRF)
```
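After loading, a single call typically runs the whole pipeline described below. The following is a hedged sketch only: the top-level function iRF() and the argument names n.iter and n.bootstrap match the parameters mentioned in this README's workflow description, but train_x and train_y are placeholder data, and the exact signature should be checked with ?iRF.

```r
library(iRF)

# Hedged usage sketch -- train_x / train_y are placeholders, and the
# argument names follow this README's workflow description; see ?iRF
# for the authoritative signature.
fit <- iRF(
  x = as.matrix(train_x),  # numeric feature matrix
  y = train_y,             # response vector
  n.iter = 5,              # number of iteratively reweighted forests
  n.bootstrap = 30         # outer-layer bootstrap replicates
)
```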

OSX users may need to install gfortran to compile. This can be done with the following commands:

```sh
curl -OL http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
```

Workflow Overview

Here is a brief description of the algorithm implemented in this package. It assumes the default behavior and is greatly simplified, but should be enough to give you a general idea of what is happening under the hood.

  1. Input a numeric feature matrix x and a response vector y.
  2. Iteratively train n.iter random forests by doing...
    1. Initialize the weight vector mtry.select.prob = rep(1, ncol(x)), which indicates the probability that each feature will be chosen when training the random forests.
    2. Train a random forest with x and y, and save it for later use.
    3. Update mtry.select.prob with the Gini importance of each feature, so that the more a feature contributes to prediction accuracy, the more likely it is to be selected in the next iteration.
    4. Repeat this routine n.iter times.
  3. Select the random forest from the iteration with the highest out-of-bag (OOB) accuracy; call it rand.forest.
  4. Run Generalized RIT on rand.forest by calling gRIT, which does...
    1. Construct read.forest from rand.forest by calling readForest, which does...
      1. Construct read.forest$tree.info, a data frame where each row corresponds to a leaf node in rand.forest, and each column records some metadata about that leaf. This is mostly used to construct the following two matrices.
      2. Construct read.forest$node.feature, a numeric sparse matrix where each row corresponds to a leaf node in rand.forest, and each column corresponds to a feature, recording the split point of the first appearance of that feature on the path to that leaf.
      3. Construct read.forest$node.obs, a boolean sparse matrix where each row corresponds to an observation, and each column corresponds to a leaf in rand.forest, recording whether that observation falls into that leaf. This means rowSums(node.obs) should equal rep(ntree, nrow(x)), where ntree is the number of trees in each forest.
    2. Subset read.forest, keeping only leaves whose prediction is rit.param$class.id (for classification), or is over a threshold rit.param$class.cut (for regression).
    3. Run Random Intersection Trees (RIT) on read.forest$node.feature, weighting each leaf by its precision times its size, i.e. the number of observations that fall into it. For the RIT algorithm, each row/leaf node/decision path is treated as an observation. A total of rit.param$ntree RITs are grown, and the union of intersections recovered by these RITs is aggregated and stored in ints.eval for further inspection.
    4. Calculate importance metrics for the interactions in ints.eval across leaf nodes of rand.forest.
  5. Run an outer-layer bootstrap stability analysis on ints.eval by calling stabilityScore, which does...
    1. Generate n.bootstrap bootstrap samples, a.k.a. bs.sample, and for each sample...
      1. Fit random forests on the sample.
      2. Extract significant interactions from the fitted forests by calling gRIT.
    2. Summarize interaction importance metrics across bootstrap samples.
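The iterative reweighting loop in step 2 can be sketched in a few lines of base R. This is a toy illustration, not the package's implementation: a stand-in importance measure (the absolute correlation of each feature with y) replaces the Gini importance that would be read off a fitted weighted forest, and no trees are actually grown.

```r
# Toy sketch of the iterative reweighting loop (step 2 above).
set.seed(1)
n <- 200; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(n)  # only features 1 and 2 are informative

n.iter <- 3
mtry.select.prob <- rep(1, ncol(x))    # start from uniform weights

for (iter in seq_len(n.iter)) {
  # The real algorithm trains a weighted random forest here, drawing split
  # candidates according to mtry.select.prob, then reads off each feature's
  # Gini importance.  Absolute correlation with y is a crude stand-in.
  importance <- abs(as.numeric(cor(x, y)))
  mtry.select.prob <- importance / sum(importance)
}

round(mtry.select.prob, 2)  # informative features carry the largest weights
```

With real forests, features that keep proving useful accumulate weight across iterations, which is what pushes interaction partners toward the top of decision paths.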

Iterative reweighting assigns weights proportional to the predictive power of each feature. As a result, the component features of a significant interaction are given more weight and thus tend to appear earlier in decision paths. By keeping parts of high-order interactions together in the path, we essentially reduce the order of these interactions. Note, however, that iterative reweighting does not appear to improve prediction accuracy.

See Iterative random forests to discover predictive and stable high-order interactions and Refining interaction search through signed iterative Random Forests for a much more in-depth description, but note that this code base has evolved since their publication.

Owner

  • Name: Sumanta Basu
  • Login: sumbose
  • Kind: user
  • Location: Ithaca, NY
  • Company: Cornell University

JOSS Publication

iRF: extracting interactions from random forests
Published
December 05, 2018
Volume 3, Issue 32, Page 1077
Authors
Sumanta Basu
Denotes equal contribution, Department of Biological Statistics and Computational Biology, Cornell University, Department of Statistical Science, Cornell University
Karl Kumbier
Denotes equal contribution, Statistics Department, University of California, Berkeley
James B. Brown
Statistics Department, University of California, Berkeley, Centre for Computational Biology, School of Biosciences, University of Birmingham, Molecular Ecosystems Biology Department, Biosciences Area, Lawrence Berkeley National Laboratory
Bin Yu
Statistics Department, University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Editor
Ariel Rokem ORCID
Tags
Random Forests Interpretable machine learning

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 75
  • Total Committers: 8
  • Avg Commits per committer: 9.375
  • Development Distribution Score (DDS): 0.573
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Karl Kumbier k****r@b****u 32
Karl Kumbier k****l@M****l 14
sumbose s****e@g****m 9
Karl Kumbier k****l@M****l 9
Karl Kumbier k****l@M****s 4
Karl Kumbier k****l@c****u 3
Karl Kumbier k****r@k****t 2
Karl Kumbier k****r@K****l 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 9
  • Average time to close issues: 8 days
  • Average time to close pull requests: about 2 hours
  • Total issue authors: 8
  • Total pull request authors: 1
  • Average comments per issue: 1.12
  • Average comments per pull request: 0.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • nanxstats (9)
  • warnes (2)
  • YidanZ65 (1)
  • onesmallchen (1)
  • sherff100 (1)
  • jzyuan (1)
  • rlbarter (1)
  • PietJones (1)
Pull Request Authors
  • karlkumbier (9)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

DESCRIPTION cran
  • R >= 3.1.2 depends
  • AUC * imports
  • Matrix * imports
  • RColorBrewer * imports
  • Rcpp * imports
  • data.table * imports
  • doParallel * imports
  • doRNG * imports
  • dplyr * imports
  • fastmatch * imports
  • foreach * imports
  • memoise * imports
  • methods * imports
  • ranger * imports
  • rgl * imports
  • stringr * imports
  • MASS * suggests
  • covr * suggests
  • testthat >= 2.1.0 suggests