corrp

corrp: An R package for multiple correlation-like analysis and clustering in mixed data - Published in JOSS (2025)

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 1 DOI reference(s) in JOSS metadata
✓
Academic publication links
Links to: sciencedirect.com
○
Academic email domains
○
Institutional organization owner
✓
JOSS paper metadata
Published in Journal of Open Source Software

Keywords

acca clustering-algorithm compute-correlations correlation correlation-calculations correlation-matrix dataframe mixed-types parallel pearson-correlation r statistical-tests uncertainty-coefficient

Scientific Fields

Engineering Computer Science - 60% confidence

Economics Social Sciences - 40% confidence

Last synced: 6 months ago · JSON representation

Repository

Compute multiple types of correlation analysis (Pearson correlation, R^2 coefficient of linear regression, Cramér's V measure of association, Distance Correlation, the Maximal Information Coefficient, Uncertainty coefficient and Predictive Power Score) in large data frames with mixed column classes (integer, numeric, factor and character) using a parallel backend.

Basic Info

Host: GitHub
Owner: meantrix
License: gpl-3.0
Language: R
Default Branch: main
Homepage: https://meantrix.github.io/corrp/
Size: 18.7 MB

Statistics

Stars: 8
Watchers: 2
Forks: 3
Open Issues: 0
Releases: 4

Topics

acca clustering-algorithm compute-correlations correlation correlation-calculations correlation-matrix dataframe mixed-types parallel pearson-correlation r statistical-tests uncertainty-coefficient

Created about 6 years ago · Last pushed 9 months ago

Metadata Files

Readme Changelog Contributing License

README.md

corrp

Correlation-like analysis provides an important statistical measure that describes the size and direction of an association between variables. However, there are few R packages that can efficiently perform this type of analysis on large datasets with mixed data types. The corrp package provides a full suite of solutions for computing various correlation-like measures, such as Pearson correlation, Distance Correlation, Maximal Information Coefficient (MIC), Predictive Power Score (PPS), Cramér's V, and the Uncertainty Coefficient. These methods support the analysis of data frames with mixed classes (integer, numeric, factor, and character).

Additionally, it offers a C++ implementation of the Average Correlation Clustering Algorithm (ACCA), which was originally developed for genetic studies using Pearson correlation as a similarity measure. ACCA is an unsupervised clustering method: it identifies patterns in the data without requiring predefined labels. Like k-means, it requires the K parameter to be defined. One of its main differences from other clustering methods is that it operates on correlations rather than traditional distance metrics, such as Euclidean or Mahalanobis distance.

In this package, the ACCA algorithm has been extended to work directly with correlation matrices derived from different association methods, depending on the data types and user preferences. Furthermore, the package is designed for parallel processing in R, making it highly efficient for large datasets.
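To make the idea of clustering on correlations rather than distances more concrete, the sketch below performs a single assignment step in base R: each variable is assigned to the cluster with which it has the highest average absolute correlation. This is a simplified illustration and not the package's C++ implementation; the helper `assign_by_avg_cor` and the starting partition are hypothetical.

```r
# Simplified illustration of one correlation-based assignment step.
# NOT the corrp implementation; `assign_by_avg_cor` is a hypothetical helper.
assign_by_avg_cor <- function(m, clusters) {
  vars <- rownames(m)
  vapply(vars, function(v) {
    avg <- vapply(clusters, function(members) {
      others <- setdiff(members, v)        # ignore the self-correlation
      if (length(others) == 0) return(0)
      mean(abs(m[v, others]))
    }, numeric(1))
    names(avg)[which.max(avg)]             # cluster with highest average
  }, character(1))
}

m <- cor(iris[, 1:4])                      # Pearson correlation as similarity
start <- list(c1 = c("Sepal.Length", "Sepal.Width"),
              c2 = c("Petal.Length", "Petal.Width"))
assignment <- assign_by_avg_cor(m, start)
print(assignment)
```

Because the function only needs a square matrix with row and column names, the same step would work with a matrix built from any other association measure.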

Details

The corrp package, developed by the Meantrix team and originally based on the cor2 function by Srikanth KS (talegari), gives R users a way to run correlation analysis on large data.frames, tibbles or data.tables through an R parallel backend and C++ functions.

The data.frame may have columns of these four classes: integer, numeric, factor and character. A character column is treated as a categorical variable.
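As a rough sketch of that rule, the following base R snippet labels each column of a data frame as numeric or categorical. The helper `column_kind` is hypothetical and only illustrates the classification described above; it is not a corrp function.

```r
# Label each column as "numeric" or "categorical" following the rule above.
# `column_kind` is a hypothetical helper, not part of corrp.
column_kind <- function(df) {
  vapply(df, function(col) {
    if (is.numeric(col)) {
      "numeric"                            # covers integer and double
    } else if (is.factor(col) || is.character(col)) {
      "categorical"                        # character is treated as categorical
    } else {
      "unsupported"
    }
  }, character(1))
}

df <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5),
                 c = factor(c("x", "y", "x")),
                 d = c("u", "v", "u"), stringsAsFactors = FALSE)
kinds <- column_kind(df)
print(kinds)
```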

In this package, the correlation is computed automatically according to the following options:

integer/numeric pair:

integer/numeric - factor/categorical pair:

correlation coefficient or square root of the R^2 coefficient of linear regression;
Predictive Power Score.

factor/categorical pair:

Cramér's V, a measure of association between two nominal variables;
Uncertainty coefficient;
Predictive Power Score.
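Two of the measures listed above can be reproduced in a few lines of base R, which may help when checking results by hand. The sketch below computes the square root of R^2 for a numeric/factor pair and Cramér's V for a factor pair. The function names `sqrt_r2` and `cramers_v` are hypothetical, and the package itself may compute these values differently (for example, through its dependencies).

```r
# Square root of R^2 from a linear regression of a numeric variable on a factor.
sqrt_r2 <- function(y, g) {
  fit <- stats::lm(y ~ g)
  sqrt(summary(fit)$r.squared)
}

# Cramér's V from the chi-squared statistic of a contingency table
# (no continuity correction).
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(stats::chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  unname(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}

r_nc <- sqrt_r2(iris$Petal.Length, iris$Species)
v_cc <- cramers_v(iris$Species, iris$Species)
print(c(sqrt_r2 = r_nc, cramers_v = v_cc))
```

A factor compared with itself gives a Cramér's V of 1, which is a convenient sanity check for the formula.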

All statistical tests are controlled by the confidence level set through the p.value parameter. If a statistical test does not obtain a significance greater/less than p.value, the value of the variable isig will be FALSE.
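The relationship between a test's p-value and the significance flag can be sketched with a plain Pearson test in base R. The variable `p_value` mirrors the p.value parameter, and the "less than" comparison used here is an assumption for illustration rather than a statement of how corrp derives isig internally.

```r
# Sketch of a significance flag for a numeric pair using a Pearson test.
# Assumption: the pair counts as significant when the test p-value is
# below the chosen threshold.
p_value <- 0.05
test <- stats::cor.test(iris$Sepal.Length, iris$Petal.Length,
                        method = "pearson", alternative = "two.sided")
infer_value <- unname(test$estimate)       # the association measure
isig <- test$p.value < p_value             # TRUE when significant
print(list(infer.value = infer_value, stat.value = test$p.value, isig = isig))
```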

If any errors occur during operations the association measure (infer.value) will be NA.

The result `data` and `index` will have N^2 rows, where N is the number of variables of the input data.
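The N^2 row count follows from evaluating every ordered pair of variables, including each variable paired with itself. A minimal base R sketch of that enumeration, using the `varx` and `vary` column names that appear in the package output:

```r
# Every ordered pair of variables, including each variable with itself.
vars <- names(iris)                        # N = 5 variables
pairs <- expand.grid(vary = vars, varx = vars, stringsAsFactors = FALSE)
pairs <- pairs[, c("varx", "vary")]
print(nrow(pairs))                         # equals length(vars)^2
print(head(pairs))
```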

By default, the statistical significance test for the PPS algorithm is not calculated, as it is prohibitively expensive for medium to large datasets; in this case isig is NA. You can enable it by setting ptest = TRUE in pps.args.
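A call that turns the PPS significance test on might look like the sketch below. It assumes that pps.args accepts a named list, which this README does not state explicitly, so check the function documentation before relying on it. The call is guarded so that it only runs when corrp is installed.

```r
# Assumption: pps.args takes a named list of options for the PPS method.
pps_args <- list(ptest = TRUE)

# Only run when corrp is available; the permutation test can be slow.
if (requireNamespace("corrp", quietly = TRUE)) {
  res <- corrp::corrp(iris, cor.nn = "pps", cor.nc = "pps", cor.cc = "cramersV",
                      pps.args = pps_args, parallel = FALSE, verbose = FALSE)
  print(head(res$data))
}
```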

All the *.args arguments can modify the parameters (p.value, comp, alternative, num.s, rk, ptest) for the method named in their prefix.

Installation

Before you begin, ensure you have met the following requirement(s):

You have R >= 3.6.2 installed.

Install the development version from GitHub using the remotes package: `remotes::install_github("meantrix/corrp@main")`.

Basic Usage

The corrp package provides seven main functions for correlation calculations, clustering and basic data manipulation: corrp, corr_fun, corr_matrix, corr_rm, acca, sil_acca and best_acca.

corrp

Next, we calculate the correlations for the iris data set using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs and the Uncertainty coefficient for categorical pairs.

```r
# corrp on iris using parallel processing
results <- corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)

# a sequential example with different correlation pair types
results_2 <- corrp::corrp(palmerpenguins::penguins, cor.nn = 'pps', cor.nc = 'lm', cor.cc = 'cramersV', parallel = FALSE, verbose = FALSE)

head(results$data)
#                           infer infer.value        stat stat.value isig msg         varx         vary
# Maximal Information Coefficient   0.9994870     P-value  0.0000000 TRUE     Sepal.Length Sepal.Length
# Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE     Sepal.Length  Sepal.Width
# Maximal Information Coefficient   0.7682996     P-value  0.0000000 TRUE     Sepal.Length Petal.Length
# Maximal Information Coefficient   0.6683281     P-value  0.0000000 TRUE     Sepal.Length  Petal.Width
#          Predictive Power Score   0.5591864 F1_weighted  0.7028029   NA     Sepal.Length      Species
# Maximal Information Coefficient   0.2770503     P-value  0.0000000 TRUE      Sepal.Width Sepal.Length

head(results_2$data)
#        infer infer.value    stat    stat.value isig msg    varx              vary
#   Cramer's V   1.0000000 P-value  4.997501e-04 TRUE     species           species
#   Cramer's V   0.6598431 P-value  4.997501e-04 TRUE     species            island
# Linear Model   0.8413139 P-value  2.694614e-91 TRUE     species    bill_length_mm
# Linear Model   0.8244751 P-value  1.507658e-84 TRUE     species     bill_depth_mm
# Linear Model   0.8821728 P-value 1.351710e-111 TRUE     species flipper_length_mm
# Linear Model   0.8183349 P-value  2.892368e-82 TRUE     species       body_mass_g
```

corr_matrix

Using the previous result, we can create a correlation matrix as follows:

```r
m <- corrp::corr_matrix(results, col = 'infer.value', isig = FALSE)
m
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# Sepal.Length    0.9994870   0.2770503    0.7682996   0.6683281 0.4075487
# Sepal.Width     0.2770503   0.9967831    0.4391362   0.4354146 0.2012876
# Petal.Length    0.7682996   0.4391362    1.0000000   0.9182958 0.7904907
# Petal.Width     0.6683281   0.4354146    0.9182958   0.9995144 0.7561113
# Species         0.5591864   0.3134401    0.9167580   0.9398532 0.9999758
# attr(,"class")
# [1] "cmatrix" "matrix"
```
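Conceptually, this step pivots the long-format result (one row per variable pair) into a square matrix. The base R sketch below shows that reshaping on a toy table whose values are made up purely for illustration. Placing `vary` on the rows and `varx` on the columns is inferred from the output shown above, not from the package source.

```r
# Toy long-format result; the values are invented for illustration only.
long <- data.frame(
  varx = c("a", "a", "b", "b"),
  vary = c("a", "b", "a", "b"),
  infer.value = c(1.0, 0.3, 0.4, 1.0),
  stringsAsFactors = FALSE
)

vars <- unique(long$varx)
mat <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))
# Fill cell [vary, varx] by indexing with a two-column character matrix.
mat[cbind(long$vary, long$varx)] <- long$infer.value
print(mat)
```

The resulting matrix need not be symmetric, since measures such as the Predictive Power Score depend on which variable is the predictor.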

We can use the corrplot::corrplot function to plot the correlation matrix, for example `corrplot::corrplot(m)`.

[Figure: Correlation Matrix Plot]

Now, we can cluster the data set variables using ACCA and the correlation matrix. By way of example, consider 2 clusters (k = 2):

```r
acca.res <- corrp::acca(m, 2)
acca.res
# $cluster1
# [1] "Species"      "Sepal.Length" "Petal.Width"
#
# $cluster2
# [1] "Petal.Length" "Sepal.Width"
#
# attr(,"class")
# [1] "acca_list" "list"
```

We can also calculate the average silhouette width for the clustering acca.res:

```r
corrp::sil_acca(acca.res, m)
# [1] -0.02831006
# attr(,"class")
# [1] "corrpstat"
# attr(,"statistic")
# [1] "Silhouette"
```

Observations with a large average silhouette width (close to 1) are very well clustered.
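For readers who want to see what an average silhouette width measures, here is a minimal base R sketch for a clustering of variables. It assumes the dissimilarity between two variables is 1 minus their absolute correlation; sil_acca may use a different definition, so the numbers are not expected to match the package output. The helper `avg_silhouette` and the example partition are hypothetical.

```r
# Average silhouette width for a clustering of variables.
# Assumption: dissimilarity = 1 - |correlation|; corrp may define it differently.
avg_silhouette <- function(m, clusters) {
  d <- 1 - abs(m)
  label <- rep(names(clusters), lengths(clusters))
  names(label) <- unlist(clusters)
  s <- vapply(names(label), function(v) {
    own <- setdiff(names(label)[label == label[[v]]], v)
    if (length(own) == 0) return(0)        # singleton cluster convention
    a <- mean(d[v, own])                   # mean dissimilarity within cluster
    b <- min(vapply(setdiff(names(clusters), label[[v]]), function(k) {
      mean(d[v, clusters[[k]]])            # mean dissimilarity to other cluster
    }, numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

m <- cor(iris[, 1:4])
clusters <- list(cluster1 = c("Petal.Length", "Petal.Width", "Sepal.Length"),
                 cluster2 = "Sepal.Width")
sil <- avg_silhouette(m, clusters)
print(sil)
```

In this toy partition the three strongly correlated variables share a cluster, so the width comes out well above zero, whereas a value near or below zero (as in the package example above) suggests the chosen clusters do not separate the variables well.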

Contributing to corrp

To contribute to corrp, follow these steps:

Fork this repository.
Create a branch: git checkout -b <branch_name>.
Make your changes and commit them: git commit -m '<commit_message>'
Push to the original branch: git push origin corrp/<location>
Create the pull request.

Alternatively see the GitHub documentation on creating a pull request.

Bug Reports

If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.

License

This project uses the following license: GPL-3 License.

Owner

Name: MEANTRIX
Login: meantrix
Kind: organization
Email: contato@meantrix.com
Location: Florianópolis, Brazil

Website: https://www.meantrix.com
Repositories: 4
Profile: https://github.com/meantrix

We are a company specialized in Artificial Intelligence (AI) and data analysis software.

JOSS Publication

corrp: An R package for multiple correlation-like analysis and clustering in mixed data

Published

May 27, 2025

DOI

10.21105/joss.07319

Volume 10, Issue 109, Page 7319

Authors

Igor Dornelles Schoeller Siciliani

Meantrix, Brazil, Universidade Federal de Santa Catarina, Brazil

Paulo Henrique dos Santos

Meantrix, Brazil, Universidade Federal de Santa Catarina, Brazil

Editor

Julia Romanowska

GitHub Events

Total

Create event: 7
Release event: 1
Issues event: 45
Watch event: 6
Delete event: 4
Issue comment event: 27
Push event: 120
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 26
Fork event: 3

Last Year

Create event: 7
Release event: 1
Issues event: 45
Watch event: 6
Delete event: 4
Issue comment event: 27
Push event: 120
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 26
Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 20
Total pull requests: 15
Average time to close issues: 4 months
Average time to close pull requests: about 1 month
Total issue authors: 7
Total pull request authors: 6
Average comments per issue: 0.65
Average comments per pull request: 0.53
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 19
Pull requests: 15
Average time to close issues: 3 months
Average time to close pull requests: about 1 month
Issue authors: 6
Pull request authors: 6
Average comments per issue: 0.68
Average comments per pull request: 0.53
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

malcolmbarrett (12)
devSJR (6)
moylettsinead (3)
PHS-Meantrix (2)
13479776 (1)
anspiess (1)
jromanowska (1)

Pull Request Authors

PHS-Meantrix (8)
malcolmbarrett (2)
moylettsinead (2)
igor-siciliani (2)
devSJR (1)
jromanowska (1)

Top Labels

Issue Labels

documentation (5) :beetle: bug (1) question (1)

Pull Request Labels

done (6) documentation (2)

Dependencies

DESCRIPTION cran

R >= 3.6.0 depends
Rcpp >= 1.0.4.6 depends
DescTools >= 0.99.40 imports
RcppArmadillo * imports
caret >= 6.0 imports
checkmate >= 2.0.0 imports
energy >= 1.7 imports
lsr >= 0.5 imports
minerva >= 1.5.8 imports
parallel * imports
ppsr >= 0.0.2 imports
stats * imports
testthat * suggests
