robustHD

robustHD: An R package for robust regression with high-dimensional data - Published in JOSS (2021)

Scientific Fields

Economics Social Sciences - 40% confidence

Last synced: 6 months ago · JSON representation

Repository

Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression.

Basic Info

Host: GitHub
Owner: aalfons
License: gpl-3.0
Language: R
Default Branch: main
Homepage:
Size: 14 MB

Statistics

Stars: 11
Watchers: 2
Forks: 6
Open Issues: 5
Releases: 1

Created almost 14 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---
# robustHD: Robust Methods for High-Dimensional Data

```{r setup, include=FALSE}
knitr::opts_chunk$set(highlight = FALSE, fig.path = "./inst/doc/paper/figure_")
```

[![CRAN](https://www.R-pkg.org/badges/version/robustHD)](https://CRAN.R-project.org/package=robustHD) [![DOI](https://joss.theoj.org/papers/10.21105/joss.03786/status.svg)](https://doi.org/10.21105/joss.03786)


To cite package `robustHD` in publications, please use:

A. Alfons (2021). `robustHD`: An `R` package for robust regression with high-dimensional data. *Journal of Open Source Software*, 6(67), 3786. DOI [10.21105/joss.03786](https://doi.org/10.21105/joss.03786).


## Summary

In regression analysis with high-dimensional data, variable selection is an important step to (i) overcome computational problems, (ii) improve prediction performance by variance reduction, and (iii) increase interpretability of the resulting models due to the smaller number of variables.  However, robust methods are necessary to prevent outlying data points from distorting the results.  The add-on package `robustHD` for the statistical computing environment `R` provides functionality for robust linear regression and model selection with high-dimensional data.  More specifically, the implemented functionality includes robust least angle regression ([Khan et al., 2007](https://doi.org/10.1198/016214507000000950)), robust groupwise least angle regression ([Alfons et al., 2016](https://doi.org/10.1016/j.csda.2015.02.007)), as well as sparse least trimmed squares regression ([Alfons et al., 2013](https://doi.org/10.1214/12-AOAS575)). The latter can be seen as a trimmed version of the popular lasso regression estimator ([Tibshirani, 1996](https://doi.org/10.1111/j.2517-6161.1996.tb02080.x)).  Selecting the optimal model can be done via cross-validation or an information criterion, and various plots are available to illustrate model selection and to evaluate the final model estimates.  Furthermore, the package includes functionality for pre-processing such as robust standardization and winsorization.  Finally, `robustHD` follows a clear object-oriented design and takes advantage of `C++` code and parallel computing to reduce computing time.


## Main functionality

 * `sparseLTS()`: Sparse least trimmed squares regression.
 
 * `rlars()`: Robust least angle regression.
 
 * `grplars()` and `rgrplars()`: (Robust) groupwise least angle regression.
 
 * `tslars()` and `rtslars()`: (Robust) least angle regression for time series data.
 
 * `corHuber()`: Robust correlation based on winsorization.
 
 * `winsorize()`: Winsorization of the data.
 
 * `robStandardize()`: Robust standardization of the data with given functions for computing center and scale. By default, the median and MAD are used.


## Installation

Package `robustHD` is on CRAN (The Comprehensive R Archive Network), hence the latest release can be easily installed from the `R` command line via

```
install.packages("robustHD")
```


## Building from source

To install the latest (possibly unstable) development version from GitHub, you can pull this repository and install it from the `R` command line via

```
install.packages("devtools")
devtools::install_github("aalfons/robustHD")
```

If you already have package `devtools` installed, you can skip the first line.  Moreover, package `robustHD` contains `C++` code that needs to be compiled, so you may need to download and install the [necessary tools for MacOS](https://cran.r-project.org/bin/macosx/tools/) or the [necessary tools for Windows](https://cran.r-project.org/bin/windows/Rtools/).


# Example: Sparse least trimmed squares regression

The well-known [NCI-60 cancer cell panel](https://discover.nci.nih.gov/cellminer/) is used to illustrate the functionality for sparse least trimmed squares regression. The protein expressions for a specific protein are selected as the response variable, and the gene expressions of the 100 genes that have the highest (robustly estimated) correlations with the response variable are screened as candidate predictors.

```{r, message=FALSE}
# load package and data
library("robustHD")
data("nci60")  # contains matrices 'protein' and 'gene'

# define response variable
y <- protein[, 92]
# screen most correlated predictor variables
correlations <- apply(gene, 2, corHuber, y)
keep <- partialOrder(abs(correlations), 100, decreasing = TRUE)
X <- gene[, keep]
```

Sparse least trimmed squares is a regularized estimator of the linear regression model, whose results depend on a non-negative regularization parameter [see [Alfons et al., 2013](https://doi.org/10.1214/12-AOAS575)]. In general, a larger value of this regularization parameter yields more regression coefficients being set to zero, which can be seen as a form of variable selection.

For convenience, `sparseLTS()` can internally estimate the smallest value of the regularization parameter that sets all coefficients to zero.  With `mode = "fraction"`, the values supplied via the argument `lambda` are then taken as fractions of this estimated value (i.e., they are multiplied with the internally estimated value).  In this example, the optimal value of the the regularization parameter is selected by estimating the prediction error (`crit = "PE"`) via 5-fold cross-validation with one replication (`splits = foldControl(K = 5, R = 1)`). The default prediction loss function is the root trimmed mean squared prediction error.  Finally, the seed of the random number generator is supplied for reproducibility.

```{r}
# fit sparse least trimmed squares regression and print results
lambda <- seq(0.01, 0.5, length.out = 10)
fit <- sparseLTS(X, y, lambda = lambda, mode = "fraction", crit = "PE",
                 splits = foldControl(K = 5, R = 1), seed = 20210507)
fit
```

Among other information, the output prints the results of the final model fit, which here consists of 17 genes with non-zero coefficients.

When selecting the optimal model fit by estimating the prediction error, the final model estimate on the full data is computed only with the optimal value of the regularization parameter instead of the full grid.  For visual inspection of the results, function `critPlot()` plots the values of the optimality criterion (in this example, the root trimmed mean squared error) against the values of the regularization parameter.  Moreover, function `diagnosticPlot()` allows to produce various diagnostic plots for the optimal model fit.

```{r, include=FALSE}
# load additional package
library("gridExtra")

# create optimality criterion plot
p1 <- critPlot(fit) +
  labs(title = "Optimality criterion plot")

# create diagnostic plot of optimal model fit
p2 <- diagnosticPlot(fit, which = "rdiag", id.n = 0) +
  labs(title = "Regression diagnostic plot") +
  theme(legend.position = "top", legend.title = element_blank())
```

```{r sparseLTS, echo=FALSE, dev="svglite", fig.width=6.5, fig.height=3.5, fig.align="center", out.width="67%"}
grid.arrange(p1, p2, nrow = 1)
```

Examples of the optimality criterion plot (*left*) and the regression diagnostic plot (*right*) for output of function `sparseLTS()`.


# Example: Robust groupwise least angle regression

Robust least angle regression ([Khan et al., 2007](https://doi.org/10.1198/016214507000000950)) and robust groupwise least angle regression ([Alfons et al., 2016](https://doi.org/10.1016/j.csda.2015.02.007)) follow a hybrid model selection strategy: first obtain a sequence of important candidate predictors, then fit submodels along that sequence via robust regressions.  Here, data on cars featured in the popular television show *Top Gear* are used to illustrate this functionality.

The response variable is fuel consumption in miles per gallon (MPG), with all remaining variables used as candidate predictors.  Information on the car model is first removed from the data set, and the car price is log-transformed.  In addition, only observations with complete information are used in this illustrative example.

```{r, message=FALSE}
# load package and data
library("robustHD")
data("TopGear")

# keep complete observations and remove information on car model
keep <- complete.cases(TopGear)
TopGear <- TopGear[keep, -(1:3)]
# log-transform price
TopGear$Price <- log(TopGear$Price)
```

As the *Top Gear* data set contains several categorical variables, robust groupwise least angle regression is used.  Through the formula interface, function `rgrplars()` by default takes each categorical variable (`factor`) as a group of dummy variables while all remaining variables are taken individually.  However, the group assignment can be defined by the user through argument `assign`.  The maximum number of candidate predictor groups to be sequenced is determined by argument `sMax`.    Furthermore, with `crit = "BIC"`, the optimal submodel along the sequence is selected via the Bayesian information criterion (BIC).  Note that each submodel along the sequence is fitted using a robust regression estimator with a non-deterministic algorithm, hence the seed of the random number generator is supplied for reproducibility.

```{r}
# fit robust groupwise least angle regression and print results
fit <- rgrplars(MPG ~ ., data = TopGear, sMax = 15, 
                crit = "BIC", seed = 20210507)
fit
```

The output prints information on the sequence of predictor groups, as well as the results of the final model fit.  Here, 9 predictor groups consisting of 10 individual covariates are selected into the final model.

When the optimal model fit is selected via BIC, each submodel along the sequence is estimated on the full data set. In this case, a plot of the coefficient path along the sequence can be produced via the function `coefPlot()`.  Functions `critPlot()` and `diagnosticPlot()` are again available to produce similar plots as in the previous example.

```{r, include=FALSE}
# load additional package
library("gridExtra")

# create coefficient plot of sequence of model fits
p1 <- coefPlot(fit) +
  labs(title = "Coefficient plot") +
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.16)))

# create optimality criterion plot
p2 <- critPlot(fit) +
  labs(title = "Optimality criterion plot")

# create diagnostic plot of optimal model fit
p3 <- diagnosticPlot(fit, covArgs = list(alpha = 0.8),
                     which = "rdiag", id.n = 0) +
  labs(title = "Regression diagnostic plot") +
  theme(legend.position = "top", legend.title = element_blank())
```

```{r rgrplars, echo=FALSE, dev="svglite", fig.width=9.75, fig.height=3.5, fig.align="center", out.width="100%"}
grid.arrange(p1, p2, p3, nrow = 1)
```

Examples of the coefficient plot (*left*), the optimality criterion plot (*center*), and the regression diagnostic plot (*right*) for output of function `rgrplars()`.


## Community guidelines

### Report issues and request features

If you experience any bugs or issues or if you have any suggestions for additional features, please submit an issue via the [*Issues*](https://github.com/aalfons/robustHD/issues) tab of this repository.  Please have a look at existing issues first to see if your problem or feature request has already been discussed.

### Contribute to the package

If you want to contribute to the package, you can fork this repository and create a pull request after implementing the desired functionality.

### Ask for help

If you need help using the package, or if you are interested in collaborations related to this project, please get in touch with the [package maintainer](https://personal.eur.nl/alfons/).


## References

Alfons, A., Croux, C. and Gelper, S. (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. DOI [10.1214/12-AOAS575](https://doi.org/10.1214/12-AOAS575).

Alfons, A., Croux, C. and Gelper, S. (2016) Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93, 421–435. DOI [10.1016/j.csda.2015.02.007](https://doi.org/10.1016/j.csda.2015.02.007).

Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. DOI [10.1198/016214507000000950](https://doi.org/10.1198/016214507000000950).

Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288. DOI [10.1111/j.2517-6161.1996.tb02080.x](https://doi.org/10.1111/j.2517-6161.1996.tb02080.x).

Owner

Name: Andreas Alfons
Login: aalfons
Kind: user
Location: Rotterdam, Netherlands
Company: Erasmus University Rotterdam

Website: https://personal.eur.nl/alfons/
Repositories: 11
Profile: https://github.com/aalfons

JOSS Publication

robustHD: An R package for robust regression with high-dimensional data

Published

November 03, 2021

DOI

10.21105/joss.03786

Volume 6, Issue 67, Page 3786

Authors

Andreas Alfons

Erasmus School of Economics, Erasmus University Rotterdam, Netherlands

Editor

Mikkel Meyer Andersen

Papers & Mentions

Total mentions: 3

Dysbiosis, gut barrier dysfunction and inflammation in dementia: a pilot study

DOI: 10.1186/s12877-020-01644-2
OpenAlex ID: https://openalex.org/W3043167826
Published: July 2020

Last synced: 5 months ago

Editorial, special issue on “Advances in Robust Statistics”

DOI: 10.1007/s40300-021-00213-w
OpenAlex ID: https://openalex.org/W3176526171
Published: June 2021

Last synced: 5 months ago

Cardiac surgery does not lead to loss of oscillatory components in circulatory signals

DOI: 10.14814/phy2.14423
OpenAlex ID: https://openalex.org/W3022823514
Published: May 2020

Last synced: 5 months ago

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 242
Total Committers: 5
Avg Commits per committer: 48.4
Development Distribution Score (DDS): 0.331

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Andreas Alfons	a**s@e**l	162
Andreas Alfons	a**s@e**e	76
Andreas Alfons	a**s@e**l	2
Dirk Eddelbuettel	e**d@d**g	1
Andreas Alfons	a**s@k**e	1

Committer Domains (Top 20 + Academic)

kuleuven.be: 1 debian.org: 1 econ.kuleuven.be: 1 ese.eur.nl: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 38
Total pull requests: 1
Average time to close issues: over 1 year
Average time to close pull requests: 3 days
Total issue authors: 6
Total pull request authors: 1
Average comments per issue: 0.95
Average comments per pull request: 5.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

aalfons (32)
voellerer (2)
Shicheng-Guo (1)
firthunands (1)
valentint (1)
lulukang (1)

Pull Request Authors

eddelbuettel (1)

Top Labels

Issue Labels

enhancement (16) bug (8) question (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 1,831 last-month
Total docker downloads: 43,390

Total dependent packages: 8
Total dependent repositories: 6
Total versions: 19
Total maintainers: 1

cran.r-project.org: robustHD

Robust Methods for High-Dimensional Data

Homepage: https://github.com/aalfons/robustHD
Documentation: http://cran.r-project.org/web/packages/robustHD/robustHD.pdf
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Latest release: 0.8.1
published over 1 year ago

Versions: 19
Dependent Packages: 8
Dependent Repositories: 6
Downloads: 1,831 Last month
Docker Downloads: 43,390

Rankings

Docker downloads count: 0.6%

Dependent packages count: 6.1%

Downloads: 8.5%

Average: 9.3%

Forks count: 9.6%

Dependent repos count: 11.9%

Stargazers count: 19.3%

Maintainers (1)

alfons@ese.eur.nl

Last synced: 6 months ago

robustHD

Science Score: 93.0%

Scientific Fields

Repository

Basic Info

Statistics

Metadata Files

README.Rmd

Owner

JOSS Publication

robustHD: An R package for robust regression with high-dimensional data

Authors

Editor

Tags

Papers & Mentions

Dysbiosis, gut barrier dysfunction and inflammation in dementia: a pilot study

Editorial, special issue on “Advances in Robust Statistics”

Cardiac surgery does not lead to loss of oscillatory components in circulatory signals

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

cran.r-project.org: robustHD

Rankings

Maintainers (1)

Dependencies