multid

Multivariate Difference between Two Groups

https://github.com/vjilmari/multid

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 25 DOI reference(s) in README
✓
Academic publication links
Links to: zenodo.org
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.6%) to scientific vocabulary

Last synced: 7 months ago · JSON representation

Repository

Multivariate Difference between Two Groups

Basic Info

Host: GitHub
Owner: vjilmari
License: gpl-3.0
Language: R
Default Branch: main
Size: 387 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 11

Created about 5 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# multid



[![CRAN version](https://www.r-pkg.org/badges/version-ago/multid)](https://www.r-pkg.org/badges/version-ago/multid)

[![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/multid)](https://cranlogs.r-pkg.org/badges/grand-total/multid)

[![DOI](https://zenodo.org/badge/323575627.svg)](https://zenodo.org/doi/10.5281/zenodo.10669345)



*multid* provides tools for regularized measurement of multivariate differences between two groups (e.g., sex differences). Regularization via logistic regression variants enables inclusion of large number of correlated variables in the multivariate set while providing k-fold cross-validation and regularization to avoid overfitting (**D_regularized** -function).

See fully reproducible exemplary [vignette on multivariate sex differences in personality with regularized regression](https://CRAN.R-project.org/package=multid/vignettes/multivariate_sex_differences_in_personality.html), or the examples presented below.

Predictive approach as implemented with regularized methods also allows for examination of group-membership probabilities and their distributions across individuals. In the context of statistical predictions of sex, these distributions are an updated variant to gender-typicality distributions used in gender diagnosticity methodology [(Lippa & Connelly, 1990)](https://doi.org/10.1037/0022-3514.59.5.1051).

Studies in which these methods have been used:

1.  [Lönnqvist, J. E., & Ilmarinen, V. J. (2021). Using a continuous measure of genderedness to assess sex differences in the attitudes of the political elite. *Political Behavior, 43*, 1779–1800.](https://doi.org/10.1007/s11109-021-09681-2)

2.  [Ilmarinen, V. J., Vainikainen, M. P., & Lönnqvist, J. E. (2023). Is there a g-factor of genderedness? Using a continuous measure of genderedness to assess sex differences in personality, values, cognitive ability, school grades, and educational track. *European Journal of Personality, 37*, 313-337.](https://doi.org/10.1177/08902070221088155)

3.  [Ilmarinen, V. J. & Lönnqvist, J. E. (2024). Deconstructing the Gender-Equality Paradox. *Journal of Personality and Social Psychology, 127*, 217-237.](https://doi.org/10.1037/pspp0000508)

4.  [Leikas, S., Ilmarinen, V. J., Vainikainen, M. P., & Lönnqvist, J. E. (2024). “Male-typicality Disadvantage” in Educational Outcomes Is Reflected in Personal Values, but Not in Personality Traits. *Collabra: Psychology, 10*, 118840.](https://doi.org/10.1525/collabra.118840)

5.  [Sortheix, F. M., Ilmarinen, V. J., Mannerström, R., & Salmela-Aro, K. (2025). Gender and values in 20 years of the European Social Survey: Are gender-typical values linked to parenthood? *European Journal of Personality*.](https://doi.org/10.1177/08902070251332098)

*multid* also includes a function for testing several hypotheses that are typically compressed to correlation between predictor (x) and an algebraic difference score (y1-y2) by deconstructing this difference score correlation. Deconstructing difference score correlations can be applied with structural path models (**ddsc_sem**) and multi-level models (**ddsc_ml**). 

In addition, *multid* includes various helper functions:

-   Calculation of variance partition coefficient (i.e., intraclass correlation, ICC) with **vpc_at** -function at different levels of lower-level predictors in two-level model including random slope fitted with lmer ([Goldstein et al., 2002](https://doi.org/10.1207/S15328031US0104_02))

-   Calculation of coefficient of variance variation and standardized variance heterogeneity, either with manual input of estimates (**cvv_manual**) or directly from data (**cvv**) ([Ruscio & Roche, 2012](https://doi.org/10.1027/1614-2241/a000034))

-   Calculation of reliability of difference score variable that is a difference between two mean values (e.g., difference between men and women across countries) by using ICC2 reliability estimates ([Bliese, 2000](https://psycnet.apa.org/record/2000-16936-008)) as inputs in the equation for difference score reliability ([Johns, 1981](https://doi.org/10.1016/0030-5073(81)90033-7)). Can be calculated from long format data file or from lmer-fitted two-level model with **reliability_dms** -function

-   Computing quantile correlation coefficient(s) with **qcc** - function defined as the geometric mean of two quantile regression slopes — that of X on Y and that of Y on X ([Choi & Shin, 2022](https://doi.org/10.1007/s00362-021-01268-7))

## Installation

You can install the released version of multid from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("multid")
```

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("vjilmari/multid")
```

## Examples

### Single sample with two groups

This example shows how to measure standardized multivariate (both Sepal and Petal dimensions, four variables in total) distance between setosa and versicolor Species in iris dataset.

```{r example1}
library(multid)
set.seed(91237)

D.iris<-
  D_regularized(
  data = iris[iris$Species == "setosa" | iris$Species == "versicolor", ],
  mv.vars = c(
    "Sepal.Length", "Sepal.Width",
    "Petal.Length", "Petal.Width"
  ),
  group.var = "Species",
  group.values = c("setosa", "versicolor")
)

round(D.iris$D,2)

# Use different partitions of data for regularization and estimation
D.iris_out<-
  D_regularized(
  data = iris[iris$Species == "setosa" |
    iris$Species == "versicolor", ],
  mv.vars = c(
    "Sepal.Length", "Sepal.Width",
    "Petal.Length", "Petal.Width"
  ),
  group.var = "Species",
  group.values = c("setosa", "versicolor"),
  size = 35,
  out = TRUE,
  pred.prob = TRUE,
  prob.cutoffs = seq(0,1,0.25)
  
)

# print group differences (D)
round(D.iris_out$D,2)

# print table of predicted probabilities
D.iris_out$P.table
```

### Multiple samples with two groups in each

This example first generates artificial multi-group data which are then used as separate data folds in the regularization procedure following separate predictions made for each fold.

```{r example2}
# generate data for 10 groups
set.seed(34246)
n1 <- 100
n2 <- 10
d <-
  data.frame(
    sex = sample(c("male", "female"), n1 * n2, replace = TRUE),
    fold = sample(x = LETTERS[1:n2], size = n1 * n2, replace = TRUE),
    x1 = rnorm(n1 * n2),
    x2 = rnorm(n1 * n2),
    x3 = rnorm(n1 * n2)
  )
#'
# Fit and predict with same data
round(D_regularized(
  data = d,
  mv.vars = c("x1", "x2", "x3"),
  group.var = "sex",
  group.values = c("female", "male"),
  fold.var = "fold",
  fold = TRUE,
  rename.output = TRUE
)$D,2)
#'

# Different partitions for regularization and estimation for each data fold.

# Request probabilities of correct classification (pcc) and 
# area under the receiver operating characteristics (auc) for the output.

round(D_regularized(
  data = d,
  mv.vars = c("x1", "x2", "x3"),
  group.var = "sex",
  group.values = c("female", "male"),
  fold.var = "fold",
  size = 17,
  out = TRUE,
  fold = TRUE,
  rename.output = TRUE,
  pcc = TRUE,
  auc = TRUE
)$D,2)


```

### Comparison of Mahalanobis' D and Regularized D when Difference in Population Exists

This example compares a measure of standardized distance between group centroids (Mahalanobis' D) and a regularized variant provided in the multid-package in small-sample scenario when the distance between group centroids in the population is D = 1.

```{r}
set.seed(8327482)
# generate data from sixteen correlated (r = .20) variables each with d = .50 difference 
#(equals to Mahalanobis' D = 1)
k=16
r=0.2
d=0.5
n=200

# population correlation matrix
cor_mat<-matrix(ncol=k,nrow=k,rep(r,k*k))
diag(cor_mat)<-1

# population difference vector
d_vector<-rep(d,k)

# population Mahalanobis' D is exactly 1

sqrt(t(d_vector) %*% solve(cor_mat) %*% d_vector)

# generate data
library(MASS)

male.dat<-
  data.frame(sex="male",
             mvrnorm(n = n/2,
                     mu = 0.5*d_vector,
                     Sigma = cor_mat,empirical = F))

female.dat<-
  data.frame(sex="female",
             mvrnorm(n = n/2,
                     mu = -0.5*d_vector,
                     Sigma = cor_mat,empirical = F))

dat<-rbind(male.dat,female.dat)

# sample Mahalanobis' D

# obtain mean differences

d_vector_sample<-rep(NA,k)

for (i in 1:k){
  d_vector_sample[i]<-mean(male.dat[,i+1]-female.dat[,i+1])
  
}

# sample pooled covariance matrix (use mean, because equal sample sizes)

cov_mat_sample<-
  (cov(male.dat[,2:17])+cov(female.dat[,2:17]))/2

# calculate sample Mahalanobis' D
sqrt(t(d_vector_sample) %*% solve(cov_mat_sample) %*% d_vector_sample)

# calculate elastic net D

D.ela<-
  D_regularized(data=dat,
              mv.vars=paste0("X",1:k),
              group.var = "sex",
              group.values = c("male","female"))

round(D.ela$D,2)

# use separate data for regularization and estimation

D.ela_out<-D_regularized(data=dat,
              mv.vars=paste0("X",1:k),
              group.var = "sex",
              group.values = c("male","female"),
              out=T,size = 50,pcc = T, auc=T,pred.prob = T)

round(D.ela_out$D,2)

# Table of predicted probabilites
D.ela_out$P.table

```

### Comparison of Mahalanobis' D and Regularized D when Sex Difference in Population Does Not Exist

This example compares a measure of standardized distance between group centroids (Mahalanobis' D) and a regularized variant provided in the multid-package in small-sample scenario when the group centroids in the population is are at the same location, D = 0. In this sample, Mahalanobis' D is measured at D = 0.5, elastic net D with same data used for regularization and estimation at D = 0.35, whereas elastic net D with independent estimation data shows D = 0.

```{r}
set.seed(8327482)
# generate data from sixteen correlated (r = .20) variables each with d = .00 difference 
# (equals to Mahalanobis' D = 0)
k=16
r=0.2
d=0.0
n=200

# population correlation matrix
cor_mat<-matrix(ncol=k,nrow=k,rep(r,k*k))
diag(cor_mat)<-1

# population difference vector
d_vector<-rep(d,k)

# population Mahalanobis' D is exactly 1

sqrt(t(d_vector) %*% solve(cor_mat) %*% d_vector)

# generate data

male.dat<-
  data.frame(sex="male",
             mvrnorm(n = n/2,
                     mu = 0.5*d_vector,
                     Sigma = cor_mat,empirical = F))

female.dat<-
  data.frame(sex="female",
             mvrnorm(n = n/2,
                     mu = -0.5*d_vector,
                     Sigma = cor_mat,empirical = F))

dat<-rbind(male.dat,female.dat)

# sample Mahalanobis' D

# obtain mean differences

d_vector_sample<-rep(NA,k)

for (i in 1:k){
  d_vector_sample[i]<-mean(male.dat[,i+1]-female.dat[,i+1])
  
}

# sample pooled covariance matrix (use mean, because equal sample sizes)

cov_mat_sample<-
  (cov(male.dat[,2:17])+cov(female.dat[,2:17]))/2

# calculate sample Mahalanobis' D
sqrt(t(d_vector_sample) %*% solve(cov_mat_sample) %*% d_vector_sample)

# calculate elastic net D

D.ela.zero<-
  D_regularized(data=dat,
              mv.vars=paste0("X",1:k),
              group.var = "sex",
              group.values = c("male","female"))

round(D.ela.zero$D,2)

# use separate data for regularization and estimation

D.ela.zero_out<-
  D_regularized(data=dat,
              mv.vars=paste0("X",1:k),
              group.var = "sex",
              group.values = c("male","female"),
              out=T,size = 50,pcc = T, auc=T,pred.prob = T)

round(D.ela.zero_out$D,2)

# Table of predicted probabilites
D.ela.zero_out$P.table

```

### Distribution overlap

This example shows how the degree of overlap between the predicted values across the two groups can be visualized and estimated.

For parametric variants, see [Del Giudice (2022)](https://marcodgdotnet.files.wordpress.com/2022/10/delgiudice_2022_measuring_sex_differences-similarities_chapter.pdf).

For non-parametric variants, see [Pastore (2018)](https://doi.org/10.21105/joss.01023) and [Pastore & Calcagnì (2019)](https://doi.org/10.3389/fpsyg.2019.01089).

```{r message=FALSE, warning=FALSE}
# Use predicted values from elastic net D (out) when difference in population exists-

library(ggplot2)

ggplot(D.ela_out$pred.dat,
       aes(x=pred,fill=group))+
  geom_density(alpha=0.5)+
  xlab("Predicted log odds of being male (FM-score)")

# parametric overlap 

## Proportion of overlap relative to a single distribution (OVL)

## obtain D first
(D<-unname(D.ela_out$D[,"D"]))

(OVL<-2*pnorm((-D/2)))

## Proportion of overlap relative to the joint distribution

(OVL2<-OVL/(2-OVL))

# non-parametric overlap

library(overlapping)

np.overlap<-
  overlap(x = list(D.ela_out$pred.dat[
  D.ela_out$pred.dat$group=="male","pred"],
  D.ela_out$pred.dat[
  D.ela_out$pred.dat$group=="female","pred"]),
  plot=T)

# this corresponds to Proportion of overlap relative to the joint distribution (OVL2)
(np.OVL2<-unname(np.overlap$OV))

# from which Proportion of overlap relative to a single distribution (OVL) is approximated at
(np.OVL<-(2*np.OVL2)/(1+np.OVL2))

# compare overlaps

round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)

```

### Predicting Difference Scores

```{r message=FALSE, warning=FALSE}
# sem example
set.seed(342356)
d <- data.frame(
 var1 = rnorm(50),
 var2 = rnorm(50),
 x = rnorm(50)
)
round(ddsc_sem(
   data = d, y1 = "var1", y2 = "var2",
   x = "x", 
 )$results,3)

# multilevel example

set.seed(95332)
n1 <- 10 # groups
n2 <- 10 # observations per group

dat <- data.frame(
  group = rep(c(LETTERS[1:n1]), each = n2),
  w = sample(c(-0.5, 0.5), n1 * n2, replace = TRUE),
  x = rep(sample(1:5, n1, replace = TRUE), each = n2),
  y = sample(1:5, n1 * n2, replace = TRUE)
)
library(lmerTest)
fit <- lmerTest::lmer(y ~ x * w + (w | group),
  data = dat
)

round(ddsc_ml(fit,
               predictor = "x",
               moderator = "w",
               moderator_values = c(0.5, -0.5))$results, 3)

```

Owner

Name: Ville Ilmarinen
Login: vjilmari
Kind: user
Location: Helsinki, Finland
Company: University of Helsinki

Twitter: vilmarine
Repositories: 3
Profile: https://github.com/vjilmari

GitHub Events

Total

Push event: 7

Last Year

Push event: 7

Committers

Last synced: over 2 years ago

All Time

Total Commits: 240
Total Committers: 1
Avg Commits per committer: 240.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 32
Committers: 1
Avg Commits per committer: 32.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
vjilmari	v**n@g**m	240

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: about 9 hours
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

vjilmari (2)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 289 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 12
Total maintainers: 1

cran.r-project.org: multid

Multivariate Difference Between Two Groups

Documentation: http://cran.r-project.org/web/packages/multid/multid.pdf
License: GPL-3
Latest release: 1.0.1
published 7 months ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 289 Last month

Rankings

Forks count: 28.8%

Dependent packages count: 29.8%

Stargazers count: 35.2%

Dependent repos count: 35.5%

Average: 35.6%

Downloads: 48.5%

Maintainers (1)

vj.ilmarinen@gmail.com

Last synced: 7 months ago

Dependencies

DESCRIPTION cran

dplyr >= 1.0.7 imports
emmeans >= 1.6.3 imports
glmnet >= 4.1.2 imports
lavaan >= 0.6.9 imports
lme4 >= 1.1.27.1 imports
pROC >= 1.18.0 imports
quantreg >= 5.88 imports
stats >= 4.0.2 imports
knitr >= 1.39 suggests
overlapping >= 1.7 suggests
rio >= 0.5.29 suggests
rmarkdown >= 2.14 suggests

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

multid

Science Score: 49.0%

Repository

Basic Info

Statistics

Metadata Files

README.Rmd

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

cran.r-project.org: multid

Rankings

Maintainers (1)

Dependencies