corrp

corrp: An R package for multiple correlation-like analysis and clustering in mixed data - Published in JOSS (2025)

https://github.com/meantrix/corrp

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: sciencedirect.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

acca clustering-algorithm compute-correlations correlation correlation-calculations correlation-matrix dataframe mixed-types parallel pearson-correlation r statistical-tests uncertainty-coefficient

Scientific Fields

Engineering Computer Science - 60% confidence
Economics Social Sciences - 40% confidence
Last synced: 4 months ago

Repository

Compute multiple types of correlation analysis (Pearson correlation, R^2 coefficient of linear regression, Cramér's V measure of association, Distance Correlation, the Maximal Information Coefficient, Uncertainty Coefficient and Predictive Power Score) in large data frames with mixed column classes (integer, numeric, factor and character) using a parallel backend.

Basic Info
Statistics
  • Stars: 8
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 4
Created almost 6 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License

README.md

corrp


Correlation-like analysis provides an important statistical measure that describes the size and direction of an association between variables. However, there are few R packages that can efficiently perform this type of analysis on large datasets with mixed data types. The corrp package provides a full suite of solutions for computing various correlation-like measures, such as Pearson correlation, Distance Correlation, Maximal Information Coefficient (MIC), Predictive Power Score (PPS), Cramér's V, and the Uncertainty Coefficient. These methods support the analysis of data frames with mixed classes (integer, numeric, factor, and character).

Additionally, it offers a C++ implementation of the Average Correlation Clustering Algorithm (ACCA), which was originally developed for genetic studies using Pearson correlation as a similarity measure. In general, ACCA is an unsupervised clustering method, as it identifies patterns in the data without requiring predefined labels. Moreover, it requires the K parameter to be defined, similar to k-means. One of its main differences compared to other clustering methods is that it operates based on correlations rather than traditional distance metrics, such as Euclidean or Mahalanobis distance.

In this package, the ACCA algorithm has been extended to work directly with correlation matrices derived from different association methods, depending on the data types and user preferences. Furthermore, the package is designed for parallel processing in R, making it highly efficient for large datasets.

Details

The corrp package, developed by the Meantrix team and originally based on the cor2 function by Srikanth KS (talegari), gives R users a way to run correlation analysis on large data.frames, tibbles or data.tables through an R parallel backend and C++ functions.

The data.frame may have columns of these four classes: integer, numeric, factor and character. A character column is treated as a categorical variable.
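
As a rough illustration of the class rule above, the following base R sketch labels each column as numeric or categorical. The helper name `column_kind` is hypothetical and is not part of corrp; it only mirrors the rule as stated here.

```r
# Hypothetical helper (not part of corrp): label a column by the class rule above.
column_kind <- function(x) {
  if (is.numeric(x)) {
    "numeric"       # covers both integer and double columns
  } else if (is.factor(x) || is.character(x)) {
    "categorical"   # character columns are treated as categorical
  } else {
    NA_character_   # any other class falls outside the four supported ones
  }
}

df <- data.frame(
  count = c(1L, 2L, 3L),
  size  = c(0.5, 1.5, 2.5),
  group = factor(c("a", "b", "a")),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

kinds <- vapply(df, column_kind, character(1))
print(kinds)
```

The pair type (numeric/numeric, numeric/categorical, categorical/categorical) then follows from the kinds of the two columns being compared.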

In this package the correlation is computed automatically according to the following options:

integer/numeric pair:

integer/numeric - factor/categorical pair:

factor/categorical pair:

All statistical tests are controlled by the p.value parameter. If a statistical test does not obtain a significance greater/less than p.value, the variable isig will be FALSE.

If any errors occur during operations the association measure (infer.value) will be NA.
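
A minimal sketch of this behaviour for a single numeric pair, using base R's `stats::cor.test`. This is not corrp's internal code: the less-than comparison against the threshold and the `tryCatch` wrapper are assumptions made for illustration, and the variable names simply echo the output columns.

```r
p.value <- 0.05  # significance threshold, named after the package parameter

# Assumption: an error during the test is caught and mapped to NA.
res <- tryCatch(
  stats::cor.test(iris$Sepal.Length, iris$Petal.Length, alternative = "greater"),
  error = function(e) NULL
)

# Association measure and significance flag for this pair.
infer.value <- if (is.null(res)) NA_real_ else unname(res$estimate)
isig        <- if (is.null(res)) NA else res$p.value < p.value

print(c(infer.value = infer.value, isig = isig))
```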

The result data and index will have N^2 rows, where N is the number of variables of the input data.
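
To see where the N^2 figure comes from, the base R sketch below enumerates every ordered pair of variables in iris, including each variable paired with itself. It illustrates the counting only and makes no claim about how corrp builds its index.

```r
# Enumerate every ordered (varx, vary) pair, including each variable with itself.
vars  <- names(iris)
pairs <- expand.grid(vary = vars, varx = vars, stringsAsFactors = FALSE)
pairs <- pairs[, c("varx", "vary")]

n <- length(vars)
print(nrow(pairs) == n^2)  # 5 variables give 25 rows
head(pairs)
```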

By default, the statistical significance test for the PPS algorithm is not calculated, as it is prohibitively expensive for medium to large datasets. In this case isig is NA. You can enable the test by setting ptest = TRUE in pps.args.

All the *.args arguments can modify the parameters (p.value, comp, alternative, num.s, rk, ptest) for the method named in their prefix.
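
A hedged sketch of enabling the PPS significance test. It assumes that each *.args argument takes a named list, so that `pps.args = list(ptest = TRUE)` is the intended form; check `?corrp::corrp` before relying on it. The remaining arguments are the ones used in the usage examples of this README. The call is guarded so the snippet still runs when corrp is not installed.

```r
# Assumption: *.args arguments are named lists of per-method parameters.
pps_args <- list(ptest = TRUE)

# Run only when corrp is available; the PPS test can be slow.
if (requireNamespace("corrp", quietly = TRUE)) {
  res <- corrp::corrp(
    iris,
    cor.nn = "mic", cor.nc = "pps", cor.cc = "uncoef",
    pps.args = pps_args,
    parallel = FALSE, verbose = FALSE
  )
  print(head(res$data))
}
```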

Installation

Before you begin, ensure you have met the following requirement(s):

  • You have R >= 3.6.2 installed.

Install the development version from GitHub:

```r
library('remotes')
remotes::install_github("meantrix/corrp@main")
```

Basic Usage

The corrp package provides seven main functions for correlation calculations, clustering and basic data manipulation: corrp, corr_fun, corr_matrix, corr_rm, acca, sil_acca and best_acca.

corrp

Next, we calculate the correlations for the iris data set using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs and the Uncertainty Coefficient for categorical pairs.

```r
# corrp on iris using parallel processing
results <- corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)

# a sequential example with different correlation pair types
results_2 <- corrp::corrp(palmerpenguins::penguins, cor.nn = 'pps', cor.nc = 'lm', cor.cc = 'cramersV', parallel = FALSE, verbose = FALSE)

head(results$data)
# infer                            infer.value  stat         stat.value  isig  msg  varx          vary
# Maximal Information Coefficient  0.9994870    P-value      0.0000000   TRUE       Sepal.Length  Sepal.Length
# Maximal Information Coefficient  0.2770503    P-value      0.0000000   TRUE       Sepal.Length  Sepal.Width
# Maximal Information Coefficient  0.7682996    P-value      0.0000000   TRUE       Sepal.Length  Petal.Length
# Maximal Information Coefficient  0.6683281    P-value      0.0000000   TRUE       Sepal.Length  Petal.Width
# Predictive Power Score           0.5591864    F1_weighted  0.7028029   NA         Sepal.Length  Species
# Maximal Information Coefficient  0.2770503    P-value      0.0000000   TRUE       Sepal.Width   Sepal.Length

head(results_2$data)
# infer         infer.value  stat     stat.value     isig  msg  varx     vary
# Cramer's V    1.0000000    P-value  4.997501e-04   TRUE       species  species
# Cramer's V    0.6598431    P-value  4.997501e-04   TRUE       species  island
# Linear Model  0.8413139    P-value  2.694614e-91   TRUE       species  bill_length_mm
# Linear Model  0.8244751    P-value  1.507658e-84   TRUE       species  bill_depth_mm
# Linear Model  0.8821728    P-value  1.351710e-111  TRUE       species  flipper_length_mm
# Linear Model  0.8183349    P-value  2.892368e-82   TRUE       species  body_mass_g
```

corr_matrix

Using the previous result, we can create a correlation matrix as follows:

```r
m <- corrp::corr_matrix(results, col = 'infer.value', isig = FALSE)
m
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# Sepal.Length    0.9994870   0.2770503    0.7682996   0.6683281 0.4075487
# Sepal.Width     0.2770503   0.9967831    0.4391362   0.4354146 0.2012876
# Petal.Length    0.7682996   0.4391362    1.0000000   0.9182958 0.7904907
# Petal.Width     0.6683281   0.4354146    0.9182958   0.9995144 0.7561113
# Species         0.5591864   0.3134401    0.9167580   0.9398532 0.9999758
# attr(,"class")
# [1] "cmatrix" "matrix"
```
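
The base R sketch below shows one way a long table of pairwise values can be reshaped into such a matrix. It is not corr_matrix itself. The four values are copied from the iris outputs in this README, and the orientation (rows indexed by vary, columns by varx) is inferred from those printed outputs rather than from the package source. It also shows why the matrix need not be symmetric: a directional measure such as the Predictive Power Score gives a different value for each direction of a pair.

```r
# Long-format pairs for two variables; values taken from the outputs in this README.
long <- data.frame(
  varx = c("Sepal.Length", "Sepal.Length", "Species", "Species"),
  vary = c("Sepal.Length", "Species", "Sepal.Length", "Species"),
  infer.value = c(0.9994870, 0.5591864, 0.4075487, 0.9999758),
  stringsAsFactors = FALSE
)

vars <- unique(long$varx)
mat  <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))

# A two-column character matrix indexes cells by (row name, column name).
mat[cbind(long$vary, long$varx)] <- long$infer.value
print(mat)
```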

We can use the corrplot::corrplot function to plot the correlation matrix.

```r
corrplot::corrplot(m)
```

Figure: Correlation Matrix Plot

Now we can cluster the data set variables using ACCA and the correlation matrix. For example, consider two clusters (k = 2):

```r
acca.res <- corrp::acca(m, 2)
acca.res
# $cluster1
# [1] "Species"      "Sepal.Length" "Petal.Width"
#
# $cluster2
# [1] "Petal.Length" "Sepal.Width"
#
# attr(,"class")
# [1] "acca_list" "list"
```

We can also calculate the average silhouette width for the clustering acca.res:

```r
corrp::sil_acca(acca.res, m)
# [1] -0.02831006
# attr(,"class")
# [1] "corrpstat"
# attr(,"statistic")
# [1] "Silhouette"
```

Observations with a large average silhouette width (close to 1) are very well clustered.
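
For intuition, here is a small base R sketch of an average silhouette width computed from an association matrix. It is not sil_acca's implementation: the helper `avg_silhouette` is hypothetical, the dissimilarity 1 - |m| is an assumption chosen for illustration, and the toy matrix is made up. It expects a symmetric matrix and at least two clusters.

```r
# Hypothetical helper (not corrp's implementation): average silhouette width
# from a symmetric association matrix, using 1 - |m| as the dissimilarity.
avg_silhouette <- function(m, clusters) {
  d <- 1 - abs(m)
  labels <- rep(names(clusters), lengths(clusters))
  names(labels) <- unlist(clusters, use.names = FALSE)
  widths <- vapply(names(labels), function(i) {
    own <- setdiff(names(labels)[labels == labels[[i]]], i)
    if (length(own) == 0) return(0)  # singleton clusters get width 0
    a <- mean(d[i, own])             # mean dissimilarity within own cluster
    b <- min(vapply(setdiff(unique(labels), labels[[i]]), function(k) {
      mean(d[i, names(labels)[labels == k]])
    }, numeric(1)))                  # mean dissimilarity to the nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Toy matrix: two tight groups (a, b) and (c, d) with weak cross-association.
vars <- c("a", "b", "c", "d")
m_toy <- matrix(0.1, 4, 4, dimnames = list(vars, vars))
diag(m_toy) <- 1
m_toy["a", "b"] <- m_toy["b", "a"] <- 0.9
m_toy["c", "d"] <- m_toy["d", "c"] <- 0.8

clusters <- list(cluster1 = c("a", "b"), cluster2 = c("c", "d"))
sil <- avg_silhouette(m_toy, clusters)
print(sil)  # about 0.83 for this toy matrix
```

A value near 1, as in this toy case, indicates well-separated groups, whereas a value near or below 0 suggests the variables are not clearly separated by the chosen clusters.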

Contributing to corrp

To contribute to corrp, follow these steps:

  1. Fork this repository.
  2. Create a branch: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the original branch: git push origin corrp/<location>
  5. Create the pull request.

Alternatively see the GitHub documentation on creating a pull request.

Bug Reports

If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.

License

This project uses the following license: GPL3 License.

Owner

  • Name: MEANTRIX
  • Login: meantrix
  • Kind: organization
  • Email: contato@meantrix.com
  • Location: Florianópolis, Brazil

We are a company specialized in Artificial Intelligence (AI) and data analysis software.

JOSS Publication

corrp: An R package for multiple correlation-like analysis and clustering in mixed data
Published
May 27, 2025
Volume 10, Issue 109, Page 7319
Authors
Igor Dornelles Schoeller Siciliani
Meantrix, Brazil, Universidade Federal de Santa Catarina, Brazil
Paulo Henrique dos Santos
Meantrix, Brazil, Universidade Federal de Santa Catarina, Brazil
Editor
Julia Romanowska
Tags
correlation clustering mixed data ACCA

GitHub Events

Total
  • Create event: 7
  • Release event: 1
  • Issues event: 45
  • Watch event: 6
  • Delete event: 4
  • Issue comment event: 27
  • Push event: 120
  • Pull request review comment event: 6
  • Pull request review event: 8
  • Pull request event: 26
  • Fork event: 3
Last Year
  • Create event: 7
  • Release event: 1
  • Issues event: 45
  • Watch event: 6
  • Delete event: 4
  • Issue comment event: 27
  • Push event: 120
  • Pull request review comment event: 6
  • Pull request review event: 8
  • Pull request event: 26
  • Fork event: 3

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 20
  • Total pull requests: 15
  • Average time to close issues: 4 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 7
  • Total pull request authors: 6
  • Average comments per issue: 0.65
  • Average comments per pull request: 0.53
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 19
  • Pull requests: 15
  • Average time to close issues: 3 months
  • Average time to close pull requests: about 1 month
  • Issue authors: 6
  • Pull request authors: 6
  • Average comments per issue: 0.68
  • Average comments per pull request: 0.53
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • malcolmbarrett (12)
  • devSJR (6)
  • moylettsinead (3)
  • PHS-Meantrix (2)
  • 13479776 (1)
  • anspiess (1)
  • jromanowska (1)
Pull Request Authors
  • PHS-Meantrix (8)
  • malcolmbarrett (2)
  • moylettsinead (2)
  • igor-siciliani (2)
  • devSJR (1)
  • jromanowska (1)
Top Labels
Issue Labels
documentation (5) :beetle: bug (1) question (1)
Pull Request Labels
done (6) documentation (2)

Dependencies

DESCRIPTION cran
  • R >= 3.6.0 depends
  • Rcpp >= 1.0.4.6 depends
  • DescTools >= 0.99.40 imports
  • RcppArmadillo * imports
  • caret >= 6.0 imports
  • checkmate >= 2.0.0 imports
  • energy >= 1.7 imports
  • lsr >= 0.5 imports
  • minerva >= 1.5.8 imports
  • parallel * imports
  • ppsr >= 0.0.2 imports
  • stats * imports
  • testthat * suggests