aorsf
aorsf: An R package for supervised learning using the oblique random survival forest - Published in JOSS (2022)
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to joss.theoj.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (21.4%) to scientific vocabulary
Keywords
data-science
oblique
random-forest
rstats
survival
Repository
Accelerated Oblique Random Survival Forests
Basic Info
- Host: GitHub
- Owner: ropensci
- License: other
- Language: R
- Default Branch: main
- Homepage: https://docs.ropensci.org/aorsf
- Size: 114 MB
Statistics
- Stars: 59
- Watchers: 3
- Forks: 10
- Open Issues: 10
- Releases: 5
Created over 4 years ago
· Last pushed 11 months ago
Metadata Files
Readme
Changelog
Contributing
License
Codemeta
Zenodo
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300,
  warning = FALSE,
  message = FALSE
)
```
# aorsf
[Project Status: Active](https://www.repostatus.org/#active)
[Codecov test coverage](https://app.codecov.io/gh/bcjaeger/aorsf?branch=master)
[R-CMD-check](https://github.com/ropensci/aorsf/actions/)
[rOpenSci software peer review](https://github.com/ropensci/software-review/issues/532/)
[CRAN status](https://CRAN.R-project.org/package=aorsf)
[DOI](https://zenodo.org/doi/10.5281/zenodo.7116854)
Fit, interpret, and make predictions with oblique random forests (RFs).
## Why aorsf?
- Fast and versatile tools for oblique RFs.^1^
- Accurate predictions.^2^
- Intuitive design with a formula-based interface.
- Extensive input checks and informative error messages.
- Compatible with `tidymodels` and `mlr3`.
## Installation
You can install `aorsf` from CRAN using
``` r
install.packages("aorsf")
```
You can install the development version of aorsf from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("ropensci/aorsf")
```
## Get started
```{r}
library(aorsf)
library(tidyverse)
```
`aorsf` fits several types of oblique RFs with the `orsf()` function, including classification, regression, and survival RFs.
```{r, child='Rmd/orsf-fit-intro.Rmd'}
```
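The child document `Rmd/orsf-fit-intro.Rmd` is not reproduced here, yet later chunks refer to `penguin_fit`, `bill_fit`, and `pbc_fit`. The following is a minimal sketch of how such fits might be created, assuming the `penguins_orsf` and `pbc_orsf` datasets that ship with `aorsf`. The small `n_tree` value and the exact formulas are illustrative assumptions and may differ from what the child document uses.

```r
library(aorsf)

# Oblique classification RF: a factor outcome (assumed formula)
penguin_fit <- orsf(data = penguins_orsf,
                    formula = species ~ .,
                    n_tree = 5)

# Oblique regression RF: a continuous outcome (assumed formula)
bill_fit <- orsf(data = penguins_orsf,
                 formula = bill_length_mm ~ .,
                 n_tree = 5)

# Oblique survival RF: a right-censored time-to-event outcome,
# excluding the identifier column from the predictors (assumed formula)
pbc_fit <- orsf(data = pbc_orsf,
                formula = Surv(time, status) ~ . - id,
                n_tree = 5)
```

Only the outcome on the left-hand side of the formula changes between the three calls; `orsf()` picks the type of forest from the outcome.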
## What does "oblique" mean?
Decision trees are grown by splitting a set of training data into non-overlapping subsets, with the goal of having more similarity within the new subsets than between them. When subsets are created with a single predictor, the decision tree is *axis-based* because the subset boundaries are perpendicular to the axis of the predictor. When linear combinations (i.e., a weighted sum) of variables are used instead of a single variable, the tree is *oblique* because the boundaries are neither parallel nor perpendicular to the axis.
**Figure**: Decision trees for classification with axis-based splitting (left) and oblique splitting (right). Cases are orange squares; controls are purple circles. Both trees partition the predictor space defined by variables X1 and X2, but the oblique splits do a better job of separating the two classes.
```{r fig_oblique_v_axis, out.width='100%', echo = FALSE}
knitr::include_graphics('man/figures/tree_axis_v_oblique.png')
```
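To make the distinction concrete, here is a small base R sketch (not `aorsf` code) that compares one axis-based split rule with one oblique split rule on simulated data. The data, the hand-picked weights, and all object names are assumptions made for illustration; a real oblique tree would estimate the weights from the data.

```r
# Toy data: two predictors, and a class boundary along the line x1 + x2 = 0
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.integer(x1 + x2 > 0)

# Axis-based rule: threshold a single predictor
axis_split <- x1 > 0

# Oblique rule: threshold a weighted sum of the predictors
# (weights are chosen by hand here, purely for illustration)
weights <- c(1, 1)
oblique_split <- as.numeric(cbind(x1, x2) %*% weights) > 0

# Proportion of observations each rule places on the correct side
acc_axis    <- mean(axis_split == (y == 1))
acc_oblique <- mean(oblique_split == (y == 1))
c(axis = acc_axis, oblique = acc_oblique)
```

Because the true boundary is diagonal, a single threshold on `x1` cannot separate the classes perfectly, whereas a single threshold on the weighted sum can.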
So, how does this difference translate to real data, and how does it impact random forests comprising hundreds of axis-based or oblique trees? We will demonstrate this using the `penguins_orsf` data.^3^ The helper function below is used to draw several plots:
```{r}
plot_decision_surface <- function(predictions, title, grid){
  # this is not a general function for plotting
  # decision surfaces. It just helps to minimize
  # copying and pasting of code.
  # keep the class with the highest predicted probability
  # at each point of the grid
  class_preds <- bind_cols(grid, predictions) %>%
    pivot_longer(cols = c(Adelie,
                          Chinstrap,
                          Gentoo)) %>%
    group_by(flipper_length_mm, bill_length_mm) %>%
    arrange(desc(value)) %>%
    slice(1)
  cols <- c("darkorange", "purple", "cyan4")
  ggplot(class_preds, aes(bill_length_mm, flipper_length_mm)) +
    geom_contour_filled(aes(z = value, fill = name),
                        alpha = .25) +
    geom_point(data = penguins_orsf,
               aes(color = species, shape = species),
               alpha = 0.5) +
    scale_color_manual(values = cols) +
    scale_fill_manual(values = cols) +
    labs(x = "Bill length, mm",
         y = "Flipper length, mm") +
    theme_minimal() +
    scale_x_continuous(expand = c(0,0)) +
    scale_y_continuous(expand = c(0,0)) +
    theme(panel.grid = element_blank(),
          panel.border = element_rect(fill = NA),
          # 'none' hides the legend
          legend.position = 'none') +
    labs(title = title)
}
```
We also use a grid of points for plotting decision surfaces:
```{r}
grid <- expand_grid(
  flipper_length_mm = seq(min(penguins_orsf$flipper_length_mm),
                          max(penguins_orsf$flipper_length_mm),
                          length.out = 200),
  bill_length_mm = seq(min(penguins_orsf$bill_length_mm),
                       max(penguins_orsf$bill_length_mm),
                       length.out = 200)
)
```
We use `orsf` with `mtry=1` to fit axis-based trees:
```{r}
fit_axis_tree <- penguins_orsf %>%
  orsf(species ~ bill_length_mm + flipper_length_mm,
       n_tree = 1,
       mtry = 1,
       tree_seeds = 106760)
```
Next we use `orsf_update` to copy and modify the original model. Setting `mtry = 2` instead of `mtry = 1` gives an oblique tree, and setting `n_tree = 500` instead of 1 turns a single tree into a forest:
```{r}
fit_axis_forest <- fit_axis_tree %>%
  orsf_update(n_tree = 500)
fit_oblique_tree <- fit_axis_tree %>%
  orsf_update(mtry = 2)
fit_oblique_forest <- fit_oblique_tree %>%
  orsf_update(n_tree = 500)
```
And now we have all we need to visualize decision surfaces using predictions from these four fits:
```{r}
preds <- list(fit_axis_tree,
              fit_axis_forest,
              fit_oblique_tree,
              fit_oblique_forest) %>%
  map(predict, new_data = grid, pred_type = 'prob')
titles <- c("Axis-based tree",
            "Axis-based forest",
            "Oblique tree",
            "Oblique forest")
plots <- map2(preds, titles, plot_decision_surface, grid = grid)
```
**Figure**: Axis-based and oblique decision surfaces from a single tree and an ensemble of 500 trees. Axis-based trees have boundaries perpendicular to predictor axes, whereas oblique trees can have boundaries that are neither parallel nor perpendicular to predictor axes. Axis-based forests tend to have 'step-function' decision boundaries, while oblique forests tend to have smooth decision boundaries.
```{r, echo=FALSE}
cowplot::plot_grid(plotlist = plots)
```
## Variable importance
The importance of individual predictor variables can be estimated in three ways using `aorsf`, and each method can be used with any type of oblique RF. Variable importance functions always return a named numeric vector, with predictor names as the names and estimated importance as the values:
- **negation**^2^: `r aorsf:::roxy_vi_describe('negate')`
```{r}
orsf_vi_negate(pbc_fit)
```
- **permutation**: `r aorsf:::roxy_vi_describe('permute')`
```{r}
orsf_vi_permute(penguin_fit)
```
- **analysis of variance (ANOVA)**^4^: `r aorsf:::roxy_vi_describe('anova')`
```{r}
orsf_vi_anova(bill_fit)
```
You can supply your own R function to estimate out-of-bag error (see the [oob vignette](https://docs.ropensci.org/aorsf/articles/oobag.html)) or to estimate out-of-bag variable importance (see the [orsf_vi examples](https://docs.ropensci.org/aorsf/reference/orsf_vi.html#examples)).
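The general idea behind permutation importance can be sketched in base R. This is not how `aorsf` implements it (the package works with out-of-bag data from a forest). The sketch uses a linear model, simulated data, and R-squared on the training data as simplifying assumptions, and every object name in it is hypothetical.

```r
set.seed(42)
n   <- 300
dat <- data.frame(relevant = rnorm(n), noise = rnorm(n))
dat$y <- 2 * dat$relevant + rnorm(n, sd = 0.5)

fit <- lm(y ~ relevant + noise, data = dat)

# Prediction accuracy, measured here as squared correlation (R-squared)
r_squared <- function(model, data) {
  cor(predict(model, newdata = data), data$y)^2
}

baseline <- r_squared(fit, dat)

# Importance of a predictor: the loss in accuracy after shuffling its values
permute_importance <- function(variable) {
  shuffled <- dat
  shuffled[[variable]] <- sample(shuffled[[variable]])
  baseline - r_squared(fit, shuffled)
}

# A named vector of scores, one per predictor
vi <- sapply(c("relevant", "noise"), permute_importance)
sort(vi, decreasing = TRUE)
```

Shuffling a predictor that drives the outcome should cost much more accuracy than shuffling an unrelated one, which is what the ordering of the scores reflects.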
## Partial dependence (PD)
`r aorsf:::roxy_pd_explain()`. You can use specific values for a predictor to compute PD or let `aorsf` pick reasonable values for you if you use `pred_spec_auto()`:
```{r}
# pick your own values
orsf_pd_oob(bill_fit, pred_spec = list(species = c("Adelie", "Gentoo")))
# let aorsf pick reasonable values for you:
orsf_pd_oob(bill_fit, pred_spec = pred_spec_auto(bill_depth_mm, island))
```
The summary function, `orsf_summarize_uni()`, computes PD for as many variables as you ask it to, using sensible values.
```{r}
orsf_summarize_uni(pbc_fit, n_variables = 2)
```
For more on PD, see the [vignette](https://docs.ropensci.org/aorsf/articles/pd.html).
## Individual conditional expectations (ICE)
`r aorsf:::roxy_ice_explain()`
For more on ICE, see the [vignette](https://docs.ropensci.org/aorsf/articles/pd.html#individual-conditional-expectations-ice).
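PD and ICE are closely related, and a short base R sketch can show how. It is not `aorsf` code: it assumes a linear model fit to simulated data and uses training-data predictions rather than out-of-bag predictions. All object names are hypothetical.

```r
set.seed(7)
n   <- 100
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 3 * dat$x1 + dat$x2^2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + I(x2^2), data = dat)

# Values of x1 at which the model is evaluated
x1_values <- c(0.25, 0.50, 0.75)

# ICE: for each value, set x1 to that value for every observation and
# predict. Rows are observations; columns are the values of x1.
ice <- sapply(x1_values, function(value) {
  modified <- dat
  modified$x1 <- value
  predict(fit, newdata = modified)
})
colnames(ice) <- x1_values

# PD: the average of the ICE predictions at each value of x1
pd <- colMeans(ice)
pd
```

Each row of `ice` traces how one observation's prediction changes as `x1` varies, and `pd` summarizes those traces with a single averaged curve.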
## Interaction scores
The `orsf_vint()` function computes a score for each possible interaction in a model based on PD, using the method described in Greenwell et al., 2018.^5^ It can be slow for larger datasets, but substantial speedups are possible by using multi-threading and by restricting the search to a smaller set of predictors.
```{r}
preds_interaction <- c("albumin", "protime", "bili", "spiders", "trt")
# While it is tempting to speed up `orsf_vint()` by growing a smaller
# number of trees, results may become unstable with this shortcut.
pbc_interactions <- pbc_fit %>%
  orsf_update(n_tree = 500, tree_seeds = 329) %>%
  orsf_vint(n_thread = 0, predictors = preds_interaction)
pbc_interactions
```
What do the values in `score` mean? Each score measures how much the PD of one variable changes across values of the other. The standard deviation of PD for one variable is computed at each fixed value of the other variable, the standard deviation of those values is taken, and the result is averaged over both orderings of the pair. Scores should be interpreted relative to one another, i.e., a higher scoring interaction is more likely to reflect a real interaction between two variables than a lower scoring one.
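The "standard deviation of the standard deviation" idea can be sketched in base R on a toy PD surface. This is a simplified reading of the description above, not the internals of `orsf_vint()`. The function name `interaction_score` and the two toy surfaces are assumptions made for illustration.

```r
# Score a two-way PD surface stored as a matrix:
# rows index values of variable A, columns index values of variable B.
interaction_score <- function(pd_grid) {
  # Spread of PD across A at each fixed value of B, then the spread of those
  sd_a_given_b <- sd(apply(pd_grid, 2, sd))
  # The same calculation with the roles of A and B swapped
  sd_b_given_a <- sd(apply(pd_grid, 1, sd))
  # Average over both orderings of the pair
  mean(c(sd_a_given_b, sd_b_given_a))
}

a <- seq(0, 1, length.out = 5)
b <- seq(0, 1, length.out = 5)

# No interaction: the effect of A is the same at every value of B
additive <- outer(a, b, "+")
# Interaction: the effect of A grows with B
interacting <- outer(a, b, "*")

score_additive    <- interaction_score(additive)
score_interacting <- interaction_score(interacting)
c(additive = score_additive, interacting = score_interacting)
```

When the two variables act additively, the spread of PD in one variable does not depend on the other, so the score is essentially zero; a product term makes that spread vary and the score rises.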
Do these interaction scores make sense? Let's test the top scoring and lowest scoring interactions using `coxph()`.
```{r}
library(survival)
# the top scoring interaction should get a lower p-value
anova(coxph(Surv(time, status) ~ protime * albumin, data = pbc_orsf))
# the bottom scoring interaction should get a higher p-value
anova(coxph(Surv(time, status) ~ spiders * trt, data = pbc_orsf))
```
Note: this is exploratory and not a true null hypothesis test. Why? Because we used the same data both to generate and to test the null hypothesis. Testing these interactions with `coxph` is less an exercise in statistical inference than a demonstration that the interaction scores from `orsf_vint()` are consistent with tests from other models.
## Comparison to existing software
For survival analysis, comparisons between `aorsf` and existing software are presented in our [JCGS paper](https://doi.org/10.1080/10618600.2023.2231048). The paper:
- describes `aorsf` in detail with a summary of the procedures used in the tree fitting algorithm.
- runs a general benchmark comparing `aorsf` with `obliqueRSF` and several other learners.
- reports prediction accuracy and computational efficiency of all learners.
- runs a simulation study comparing variable importance techniques with oblique survival RFs, axis-based survival RFs, and boosted trees.
- reports the probability that each variable importance technique will rank a relevant variable with higher importance than an irrelevant variable.
## References
1. `r aorsf:::cite("jaeger_2019")`
1. `r aorsf:::cite("jaeger_2022")`
1. `r aorsf:::cite("penguins_2020")`
1. `r aorsf:::cite("menze_2011")`
1. `r aorsf:::cite("greenwell_2018")`
## Funding
The developers of `aorsf` received financial support from the Center for Biomedical Informatics, Wake Forest University School of Medicine. We also received support from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001420.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Owner
- Name: rOpenSci
- Login: ropensci
- Kind: organization
- Email: info@ropensci.org
- Location: Berkeley, CA
- Website: https://ropensci.org/
- Twitter: rOpenSci
- Repositories: 307
- Profile: https://github.com/ropensci
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"identifier": "aorsf",
"description": "Fit, interpret, and make predictions with oblique random survival forests. Oblique decision trees are notoriously slow compared to their axis based counterparts, but 'aorsf' runs as fast or faster than axis-based decision tree algorithms for right-censored time-to-event outcomes.",
"name": "aorsf: Accelerated Oblique Random Survival Forests",
"relatedLink": "https://bcjaeger.github.io/aorsf",
"codeRepository": "https://github.com/bcjaeger/aorsf",
"issueTracker": "https://github.com/bcjaeger/aorsf/issues",
"license": "https://spdx.org/licenses/MIT",
"version": "0.0.1",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
"url": "https://r-project.org"
},
"runtimePlatform": "R version 4.1.3 (2022-03-10)",
"author": [
{
"@type": "Person",
"givenName": "Byron",
"familyName": "Jaeger",
"email": "bjaeger@wakehealth.edu",
"@id": "https://orcid.org/0000-0001-7399-2299"
}
],
"contributor": [
{
"@type": "Person",
"givenName": "Nicholas",
"familyName": "Pajewski"
},
{
"@type": "Person",
"givenName": "Sawyer",
"familyName": "Welden",
"email": "swelden@wakehealth.edu"
}
],
"maintainer": [
{
"@type": "Person",
"givenName": "Byron",
"familyName": "Jaeger",
"email": "bjaeger@wakehealth.edu",
"@id": "https://orcid.org/0000-0001-7399-2299"
}
],
"softwareSuggestions": [
{
"@type": "SoftwareApplication",
"identifier": "survival",
"name": "survival",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=survival"
},
{
"@type": "SoftwareApplication",
"identifier": "survivalROC",
"name": "survivalROC",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=survivalROC"
},
{
"@type": "SoftwareApplication",
"identifier": "ggplot2",
"name": "ggplot2",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=ggplot2"
},
{
"@type": "SoftwareApplication",
"identifier": "testthat",
"name": "testthat",
"version": ">= 3.0.0",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=testthat"
},
{
"@type": "SoftwareApplication",
"identifier": "knitr",
"name": "knitr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=knitr"
},
{
"@type": "SoftwareApplication",
"identifier": "rmarkdown",
"name": "rmarkdown",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=rmarkdown"
},
{
"@type": "SoftwareApplication",
"identifier": "glmnet",
"name": "glmnet",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=glmnet"
},
{
"@type": "SoftwareApplication",
"identifier": "covr",
"name": "covr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=covr"
},
{
"@type": "SoftwareApplication",
"identifier": "units",
"name": "units",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=units"
}
],
"softwareRequirements": {
"1": {
"@type": "SoftwareApplication",
"identifier": "Rcpp",
"name": "Rcpp",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=Rcpp"
},
"2": {
"@type": "SoftwareApplication",
"identifier": "data.table",
"name": "data.table",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=data.table"
},
"3": {
"@type": "SoftwareApplication",
"identifier": "R",
"name": "R",
"version": ">= 3.6"
},
"SystemRequirements": null
},
"fileSize": "9141.536KB",
"citation": [
{
"@type": "ScholarlyArticle",
"datePublished": "2019",
"author": [
{
"@type": "Person",
"givenName": [
"Byron",
"C."
],
"familyName": "Jaeger"
},
{
"@type": "Person",
"givenName": [
"D.",
"Leann"
],
"familyName": "Long"
},
{
"@type": "Person",
"givenName": [
"Dustin",
"M."
],
"familyName": "Long"
},
{
"@type": "Person",
"givenName": "Mario",
"familyName": "Sims"
},
{
"@type": "Person",
"givenName": [
"Jeff",
"M."
],
"familyName": "Szychowski"
},
{
"@type": "Person",
"givenName": "Yuan-I",
"familyName": "Min"
},
{
"@type": "Person",
"givenName": [
"Leslie",
"A."
],
"familyName": "Mcclure"
},
{
"@type": "Person",
"givenName": "George",
"familyName": "Howard"
},
{
"@type": "Person",
"givenName": "Noah",
"familyName": "Simon"
}
],
"name": "Oblique Random Survival Forests",
"url": "https://doi.org/10.1214/19-AOAS1261",
"pagination": "1847--1883",
"isPartOf": {
"@type": "PublicationIssue",
"issueNumber": "3",
"datePublished": "2019",
"isPartOf": {
"@type": [
"PublicationVolume",
"Periodical"
],
"volumeNumber": "13",
"name": "Annals of Applied Statistics"
}
}
}
],
"releaseNotes": "https://github.com/bcjaeger/aorsf/blob/master/NEWS.md",
"readme": "https://github.com/bcjaeger/aorsf/blob/master/README.md",
"contIntegration": [
"https://app.codecov.io/gh/bcjaeger/aorsf?branch=master",
"https://github.com/bcjaeger/aorsf/actions",
"https://github.com/bcjaeger/aorsf/actions?query=workflow%3Apkgcheck"
],
"developmentStatus": "https://www.repostatus.org/#wip",
"review": {
"@type": "Review",
"url": "https://github.com/ropensci/software-review/issues/532",
"provider": "https://ropensci.org"
},
"keywords": [
"r",
"rstats",
"data-science"
]
}
GitHub Events
Total
- Issues event: 10
- Watch event: 6
- Delete event: 1
- Issue comment event: 16
- Push event: 16
- Pull request review event: 1
- Pull request event: 9
- Fork event: 1
- Create event: 4
Last Year
- Issues event: 10
- Watch event: 6
- Delete event: 1
- Issue comment event: 16
- Push event: 16
- Pull request review event: 1
- Pull request event: 9
- Fork event: 1
- Create event: 4
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| bcjaeger | b****r@g****m | 754 |
| Sawyer Welden | s****n@g****m | 2 |
| Ciaran Evans | l****e@g****m | 2 |
| Jeroen Ooms | j****s@g****m | 1 |
| Ikko Eltociear Ashimine | e****r@g****m | 1 |
| Emily Riederer | e****r@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 49
- Total pull requests: 34
- Average time to close issues: about 2 months
- Average time to close pull requests: 5 days
- Total issue authors: 14
- Total pull request authors: 7
- Average comments per issue: 2.49
- Average comments per pull request: 0.47
- Merged pull requests: 31
- Bot issues: 7
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 9
- Average time to close issues: 14 days
- Average time to close pull requests: 20 days
- Issue authors: 7
- Pull request authors: 2
- Average comments per issue: 1.88
- Average comments per pull request: 0.22
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bcjaeger (24)
- github-actions[bot] (7)
- bblodfon (3)
- DustinMLong (2)
- hfrick (2)
- dillintj (1)
- instantkaffee (1)
- mattwarkentin (1)
- emilyriederer (1)
- ggrothendieck (1)
- AbubakerSuliman (1)
- alkat19 (1)
- mpadge (1)
- cmululu (1)
Pull Request Authors
- bcjaeger (35)
- sawyerWeld (2)
- eltociear (2)
- jeroen (2)
- emilyriederer (2)
- maelle (1)
- ciaran-evans (1)
Top Labels
Issue Labels
enhancement (3)
upkeep (3)
help wanted (1)
good first issue (1)
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 2,054 last month (CRAN)
- Total dependent packages: 2
- Total dependent repositories: 2
- Total versions: 13
- Total maintainers: 1
cran.r-project.org: aorsf
Accelerated Oblique Random Forests
- Homepage: https://github.com/ropensci/aorsf
- Documentation: http://cran.r-project.org/web/packages/aorsf/aorsf.pdf
- License: MIT + file LICENSE
- Latest release: 0.1.5 (published over 1 year ago)
Rankings
Stargazers count: 11.6%
Forks count: 14.4%
Average: 18.6%
Downloads: 19.4%
Dependent repos count: 19.6%
Dependent packages count: 28.0%
Maintainers (1)
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.6 depends
- Rcpp * imports
- data.table * imports
- utils * imports
- covr * suggests
- ggplot2 * suggests
- glmnet * suggests
- knitr * suggests
- rmarkdown * suggests
- survival * suggests
- survivalROC * suggests
- testthat >= 3.0.0 suggests
- tibble * suggests
- units * suggests
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/draft-pdf.yaml
actions
- actions/checkout v2 composite
- actions/upload-artifact v1 composite
- openjournals/openjournals-draft-action master composite
.github/workflows/pkgdown.yaml
actions
- JamesIves/github-pages-deploy-action v4.4.1 composite
- actions/checkout v3 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml
actions
- actions/checkout v2 composite
- r-lib/actions/pr-fetch v1 composite
- r-lib/actions/pr-push v1 composite
- r-lib/actions/setup-r v1 composite
- r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/test-coverage.yaml
actions
- actions/checkout v3 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite