collinear

R package to manage multicollinearity in modeling data frames.

https://github.com/blasbenito/collinear

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.4%) to scientific vocabulary

Keywords

machine-learning multicollinearity r-package statistics
Last synced: 10 months ago · JSON representation

Repository

R package to manage multicollinearity in modeling data frames.

Basic Info
Statistics
  • Stars: 12
  • Watchers: 1
  • Forks: 1
  • Open Issues: 3
  • Releases: 3
Topics
machine-learning multicollinearity r-package statistics
Created almost 3 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  eval = TRUE,
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
# options(tibble.print_min = 5, tibble.print_max = 5)
```


# `collinear` \n Seamless Multicollinearity Management 





[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10039489.svg)](https://doi.org/10.5281/zenodo.10039489)
[![CRAN status](https://www.r-pkg.org/badges/version/collinear)](https://cran.r-project.org/package=collinear)
[![CRAN\_Download\_Badge](http://cranlogs.r-pkg.org/badges/grand-total/collinear)](https://CRAN.R-project.org/package=collinear)
[![R-CMD-check](https://github.com/BlasBenito/collinear/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/BlasBenito/collinear/actions/workflows/R-CMD-check.yaml)



## Warning

Version 2.0.0 of `collinear` includes changes that may disrupt existing workflows, and results from previous versions may not be reproducible due to enhancements in the automated selection algorithms. Please refer to the Changelog for details.

## Summary

[Multicollinearity hinders the interpretability](https://www.blasbenito.com/post/multicollinearity-model-interpretability/) of linear and machine learning models.

The `collinear` package combines four methods for easy management of multicollinearity in modelling data frames with numeric and categorical variables:

- **Target Encoding**: Transforms categorical predictors to numeric using a numeric response as reference.
- **Preference Order**: Ranks predictors by their association with a response variable to preserve important ones in multicollinearity filtering.
- **Pairwise Correlation Filtering**: Automated multicollinearity filtering of numeric and categorical predictors based on pairwise correlations.
- **Variance Inflation Factor Filtering**: Automated multicollinearity filtering of numeric predictors based on Variance Inflation Factors.

These methods are combined in the function `collinear()`, which serves as single entry point for most of the functionalities in the package. The article [How It Works](https://blasbenito.github.io/collinear/articles/how_it_works.html) explains how `collinear()` works in detail.

## Citation

If you find this package useful, please cite it as:

*Blas M. Benito (2024). collinear: R Package for Seamless Multicollinearity Management. Version 2.0.0. doi: 10.5281/zenodo.10039489*

## Main Improvements in Version 2.0.0

1. **Expanded Functionality**: Functions `collinear()` and `preference_order()` support both categorical and numeric responses and predictors, and can handle several responses at once.
2. **Robust Selection Algorithms**: Enhanced selection in `vif_select()` and `cor_select()`.
3. **Enhanced Functionality to Rank Predictors**: New functions to compute association between response and predictors covering most use-cases, and automated function selection depending on data features.
4. **Simplified Target Encoding**: Streamlined and parallelized for better efficiency, and new default is "loo" (leave-one-out).
5. **Parallelization and Progress Bars**: Utilizes `future` and `progressr` for enhanced performance and user experience.


## Install

The package `collinear` can be installed from CRAN.

```{r, eval = FALSE}
install.packages("collinear")
```


The development version can be installed from GitHub.

```{r, eval = FALSE}
remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "development"
  )
```


Previous versions are in the “archive_xxx” branches of the GitHub repository.

```{r, eval = FALSE}
remotes::install_github(
  repo = "blasbenito/collinear", 
  ref = "archive_v1.1.1"
  )
```


```{r packages, message = FALSE, warning = FALSE, include = FALSE}
library(collinear)
library(future)
library(parallelly)
```


## Getting Started

The function `collinear()` provides all tools required for a fully fledged multicollinearity filtering workflow. The code below shows a small example workflow.

```{r}
#parallelization setup
future::plan(
  future::multisession,
  workers = parallelly::availableCores() - 1
  )

#progress bar (does not work in Rmarkdown)
#progressr::handlers(global = TRUE)

#example data frame
df <- collinear::vi[1:5000, ]

#there are many NA cases in this data frame
sum(is.na(df))
```

```{r}
#numeric and categorical predictors
predictors <- collinear::vi_predictors

collinear::identify_predictors(
  df = df,
  predictors = predictors
)
```

```{r}
#multicollinearity filtering
selection <- collinear::collinear(
  df = df,
  response = c(
    "vi_numeric",    #numeric response
    "vi_categorical" #categorical response
    ),
  predictors = predictors,
  max_cor = 0.75,
  max_vif = 5,
  quiet = TRUE
)
```

The output is a named list of vectors with selected predictor names when more than one response is provided, and a character vector otherwise.  

```{r}
selection
```

The output of `collinear()` can be easily converted into model formulas.

```{r}
formulas <- collinear::model_formula(
  predictors = selection
)

formulas
```

These formulas can be used to fit models right away.

```{r, eval = FALSE}
#linear model
m_vi_numeric <- stats::glm(
  formula = formulas[["vi_numeric"]], 
  data = df,
  na.action = na.omit
  )

#random forest model
m_vi_categorical <- ranger::ranger(
  formula = formulas[["vi_categorical"]],
  data = na.omit(df)
)
```

## Getting help

If you encounter bugs or issues with the documentation, please [file a issue on GitHub](https://github.com/BlasBenito/collinear/issues).

Owner

  • Name: Blas Benito
  • Login: BlasBenito
  • Kind: user
  • Location: Somewhere in the beach

PhD in Quantitative Ecology, Master in Geographic Information Systems, R developer, data scientist and data engineer at @BiomeMakers.

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 4
  • Watch event: 8
  • Issue comment event: 3
  • Push event: 76
  • Pull request event: 4
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 4
  • Watch event: 8
  • Issue comment event: 3
  • Push event: 76
  • Pull request event: 4

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 5
  • Total pull requests: 14
  • Average time to close issues: 27 days
  • Average time to close pull requests: about 2 months
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.36
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 8
  • Average time to close issues: 22 days
  • Average time to close pull requests: 19 days
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 1.75
  • Average comments per pull request: 0.63
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • AMBarbosa (2)
  • BlasBenito (1)
  • marineReg (1)
  • RockEco (1)
Pull Request Authors
  • AMBarbosa (10)
  • olivroy (4)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 486 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: collinear

Automated Multicollinearity Management

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 486 Last month
Rankings
Forks count: 28.0%
Dependent packages count: 29.0%
Stargazers count: 34.7%
Dependent repos count: 37.1%
Average: 43.1%
Downloads: 86.8%
Maintainers (1)
Last synced: 10 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 4.0 depends
  • dplyr * imports
  • rlang * imports
  • tibble * imports
  • tidyr * imports
  • roxyglobals * suggests
  • testthat >= 3.0.0 suggests