corrp
corrp: An R package for multiple correlation-like analysis and clustering in mixed data - Published in JOSS (2025)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 1 DOI reference(s) in JOSS metadata -
✓Academic publication links
Links to: sciencedirect.com -
○Academic email domains
-
○Institutional organization owner
-
✓JOSS paper metadata
Published in Journal of Open Source Software
Keywords
Scientific Fields
Repository
Compute multiple types of correlations analysis (Pearson correlation, R^2 coefficient of linear regression, Cramer's V measure of association, Distance Correlation,The Maximal Information Coefficient, Uncertainty coefficient and Predictive Power Score) in large dataframes with mixed columns classes(integer, numeric, factor and character) in parallel backend.
Basic Info
- Host: GitHub
- Owner: meantrix
- License: gpl-3.0
- Language: R
- Default Branch: main
- Homepage: https://meantrix.github.io/corrp/
- Size: 18.7 MB
Statistics
- Stars: 8
- Watchers: 2
- Forks: 3
- Open Issues: 0
- Releases: 4
Topics
Metadata Files
README.md
corrp
Correlation-like analysis provides an important statistical measure that describes the size and direction of an association between variables. However, there are few R packages that can efficiently perform this type of analysis on large datasets with mixed data types. The corrp package provides a full suite of solutions for computing various correlation-like measures, such as Pearson correlation, Distance Correlation, Maximal Information Coefficient (MIC), Predictive Power Score (PPS), Cramér's V, and the Uncertainty Coefficient. These methods support the analysis of data frames with mixed classes (integer, numeric, factor, and character).
Additionally, it offers a C++ implementation of the Average Correlation Clustering Algorithm (ACCA) ACCA, which was originally developed for genetic studies using Pearson correlation as a similarity measure. In general, ACCA is an unsupervised clustering method, as it identifies patterns in the data without requiring predefined labels. Moreover, it requires the K parameter to be defined, similar to k-means. One of its main differences compared to other clustering methods is that it operates based on correlations rather than traditional distance metrics, such as Euclidean or Mahalanobis distance.
In this package, the ACCA algorithm has been extended to work directly with correlation matrices derived from different association methods, depending on the data types and user preferences. Furthermore, the package is designed for parallel processing in R, making it highly efficient for large datasets.
Details
The corrp package under development by Meantrix team and original based on Srikanth KS (talegari) cor2 function can provide to R users a way to work with correlation analysis among large data.frames, tibbles or data.tables through a R parallel backend and C++ functions.
The data.frame is allowed to have columns of these four classes: integer, numeric, factor and character. The character column is considered as categorical variable.
In this new package the correlation is automatically computed according to the follow options:
integer/numeric pair:
- Pearson correlation test;
- Distance Correlation;
- Maximal Information Coefficient;
- Predictive Power Score.
integer/numeric - factor/categorical pair:
- correlation coefficient or squared root of R^2 coefficient of linear regression;
- Predictive Power Score.
factor/categorical pair:
- cramersV a measure of association between two nominal .;
- Uncertainty coefficient.
- Predictive Power Score.
Also, All statistical tests are controlled by the confidence interval of p.value parameter. If the statistical tests do not obtain a significance greater/less than p.value the value of variable isig will be FALSE.
If any errors occur during operations the association measure (infer.value) will be NA.
' The result data and index will have \eqn{N^2} rows, where N is the number of variables of the input data.
By default, the statistical significance test for the PPS algorithm is not calculated, as it is prohibitively expensive for medium to large datasets. In this case isig is NA, you can enable it by setting ptest = TRUE in pps.args.
All the *.args can modify the parameters (p.value, comp, alternative, num.s, rk, ptest) for the respective method on it's prefix.
Installation
Before you begin, ensure you have met the following requirement(s):
- You have
R >= 3.6.2installed.
Install the development version from GitHub:
r
library('remotes')
remotes::install_github("meantrix/corrp@main")
Basic Usage
corrp package provides seven main functions for correlation calculations, clustering and basic data manipulation: corrp,
corr_fun, corr_matrix, corr_rm, acca , sil_acca and best_acca.
corrp Next, we calculate the correlations for the data set iris using: Maximal Information Coefficient for numeric pair, the Power Predictive Score algorithm for numeric/categorical pair and Uncertainty coefficient for categorical pair.
```r
coorp with using iris using parallel processing
results <- corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps',cor.cc = 'uncoef', n.cores = 2, verbose = FALSE)
an sequential example with different correlation pair types
results_2 <- corrp::corrp(palmerpenguins::penguins, cor.nn = 'pps', cor.nc = 'lm', cor.cc = 'cramersV', parallel = FALSE, verbose = FALSE)
head(results$data)
infer infer.value stat stat.value isig msg varx vary
Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE Sepal.Length Sepal.Length
Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Length Sepal.Width
Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE Sepal.Length Petal.Length
Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE Sepal.Length Petal.Width
Predictive Power Score 0.5591864 F1_weighted 0.7028029 NA Sepal.Length Species
Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Width Sepal.Length
head(results_2$data)
infer infer.value stat stat.value isig msg varx vary
Cramer's V 1.0000000 P-value 4.997501e-04 TRUE species species
Cramer's V 0.6598431 P-value 4.997501e-04 TRUE species island
Linear Model 0.8413139 P-value 2.694614e-91 TRUE species billlengthmm
Linear Model 0.8244751 P-value 1.507658e-84 TRUE species billdepthmm
Linear Model 0.8821728 P-value 1.351710e-111 TRUE species flipperlengthmm
Linear Model 0.8183349 P-value 2.892368e-82 TRUE species bodymassg
```
corr_matrix Using the previous result we can create a correlation matrix as follows:
```r m <- corrp::corr_matrix(results, col = 'infer.value', isig = FALSE) m
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Sepal.Length 0.9994870 0.2770503 0.7682996 0.6683281 0.4075487
Sepal.Width 0.2770503 0.9967831 0.4391362 0.4354146 0.2012876
Petal.Length 0.7682996 0.4391362 1.0000000 0.9182958 0.7904907
Petal.Width 0.6683281 0.4354146 0.9182958 0.9995144 0.7561113
Species 0.5591864 0.3134401 0.9167580 0.9398532 0.9999758
attr(,"class")
[1] "cmatrix" "matrix"
```
We can use the corrplot::corrplot function to plot the correlation matrix.
r
corrplot::corrplot(m)
Now, we can clustering the data set variables through ACCA and the correlation matrix.
By way of example, consider 2 clusters k = 2:
```r acca.res <- corrp::acca(m, 2) acca.res
$cluster1
[1] "Species" "Sepal.Length" "Petal.Width"
$cluster2
[1] "Petal.Length" "Sepal.Width"
attr(,"class")
[1] "acca_list" "list"
```
Also,we can calculate The average silhouette width to the cluster acca.res:
```r corrp::sil_acca(acca.res, m)
[1] -0.02831006
attr(,"class")
[1] "corrpstat"
attr(,"statistic")
[1] "Silhouette"
``` Observations with a large average silhouette width (almost 1) are very well clustered.
Contributing to corrp
To contribute to corrp, follow these steps:
- Fork this repository.
- Create a branch:
git checkout -b <branch_name>. - Make your changes and commit them:
git commit -m '<commit_message>' - Push to the original branch:
git push origin corrp/<location> - Create the pull request.
Alternatively see the GitHub documentation on creating a pull request.
Bug Reports
If you have detected a bug (or want to ask for a new feature), please file an issue with a minimal reproducible example on GitHub.
License
This project uses the following license: GLP3 License.
Owner
- Name: MEANTRIX
- Login: meantrix
- Kind: organization
- Email: contato@meantrix.com
- Location: Florianópolis , Brazil
- Website: https://www.meantrix.com
- Repositories: 4
- Profile: https://github.com/meantrix
We are a company specialized in Artificial Intelligence (AI) and data analysis software.
JOSS Publication
corrp: An R package for multiple correlation-like analysis and clustering in mixed data
Authors
Tags
correlation clustering mixed data ACCAGitHub Events
Total
- Create event: 7
- Release event: 1
- Issues event: 45
- Watch event: 6
- Delete event: 4
- Issue comment event: 27
- Push event: 120
- Pull request review comment event: 6
- Pull request review event: 8
- Pull request event: 26
- Fork event: 3
Last Year
- Create event: 7
- Release event: 1
- Issues event: 45
- Watch event: 6
- Delete event: 4
- Issue comment event: 27
- Push event: 120
- Pull request review comment event: 6
- Pull request review event: 8
- Pull request event: 26
- Fork event: 3
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 20
- Total pull requests: 15
- Average time to close issues: 4 months
- Average time to close pull requests: about 1 month
- Total issue authors: 7
- Total pull request authors: 6
- Average comments per issue: 0.65
- Average comments per pull request: 0.53
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 19
- Pull requests: 15
- Average time to close issues: 3 months
- Average time to close pull requests: about 1 month
- Issue authors: 6
- Pull request authors: 6
- Average comments per issue: 0.68
- Average comments per pull request: 0.53
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- malcolmbarrett (12)
- devSJR (6)
- moylettsinead (3)
- PHS-Meantrix (2)
- 13479776 (1)
- anspiess (1)
- jromanowska (1)
Pull Request Authors
- PHS-Meantrix (8)
- malcolmbarrett (2)
- moylettsinead (2)
- igor-siciliani (2)
- devSJR (1)
- jromanowska (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- R >= 3.6.0 depends
- Rcpp >= 1.0.4.6 depends
- DescTools >= 0.99.40 imports
- RcppArmadillo * imports
- caret >= 6.0 imports
- checkmate >= 2.0.0 imports
- energy >= 1.7 imports
- lsr >= 0.5 imports
- minerva >= 1.5.8 imports
- parallel * imports
- ppsr >= 0.0.2 imports
- stats * imports
- testthat * suggests

