pedmod
R package with quasi-Monte Carlo methods to estimate mixed models commonly used for random effect structures from pedigrees.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 10 DOI reference(s) in README -
○Academic publication links
-
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (6.0%) to scientific vocabulary
Keywords
Repository
R package with quasi-Monte Carlo methods to estimate mixed models commonly used for random effect structures from pedigrees.
Statistics
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
pedmod: Pedigree Models
The pedmod package provides functions to estimate models for pedigree data. Particularly, the package provides functions to estimate mixed models of the form:
where
is the binary outcome of interest for individual
in family/cluster
,
is the individual’s known covariates,
is the standard normal distribution’s CDF, and
implies a binomial distribution such if
then the density of
is:
A different and equivalent way of writing the model is as:
where
is the
dimensional identity matrix which comes from the
unshared/individual specific random effect. This effect is always
included. The models are commonly known as liability threshold models or
mixed probit models.
The
s are known scale/correlation matrices where each of the
’th types correspond to a type of effect. An arbitrary number of
such matrices can be passed to include e.g. a genetic effect, a maternal
effect, a paternal, an effect of a shared adult environment etc.
Usually, these matrices are correlation matrices as this simplifies
later interpretation and we will assume that all the matrices are
correlation matrices. A typical example is that
is two times the kinship matrix in which case we call:
the heritability. That is, the proportion of the variance attributable
to the the
’th effect which in this case is the direct genetic effect. The
scale parameters, the
s, may be the primary interest in an analysis. The scale
in the model cannot be identified. That is, an equivalent model is:
for any . A common option other than
is to set
. This has the effect
that
is the proportion of variance attributable to the
’th effect (assuming all
matrices are correlation matrices). Moreover,
is the proportion of variance attributable to the individual
specific effect.
The parameterizations used in the package are which we call the direct parameterizations and
which we call the standardized
parameterizations. The latter have the advantage that it is easier to
interpret as the scale parameters are the proportion of variance
attributable to each effect (assuming that only correlation matrices are
used) and the
are often very close the estimate from a GLM
(that is, the model without the other random effects) when the
covariates are unrelated to random effects that are added to the model.
The latter makes it easy to find starting values.
For the above reason, two parameterization are used. For the direct
parameterization where , we work directly with
, and we use
. For the standardized parameterization
where
, we work with
,
, and
This package provides randomized quasi-Monte Carlo methods to
approximate the log marginal likelihood for these types of models with
an arbitrary number scale matrices,
, and the derivatives with respect to
(that is, we work with
) or
.
In some cases, it may be hypothesized that some individuals are less
effected by e.g. their genes than others. A model to incorporate such
effects is implemented in the pedigree_ll_terms_loadings function. See
the Individual Specific Loadings
section for details and examples.
We have re-written the Fortran code by Genz and Bretz (2002) in C++,
made it easy to extend from a log marginal likelihood approximation to
other approximations such as the derivatives, and added less precise but
faster approximations of the
and
. Our own experience suggests that using the latter has a
small effect on the precision of the result but can yield substantial
reduction in computation times for moderate sized families/clusters.
The approximation by Genz and Bretz (2002) have already been used to estimate these types of models (Pawitan et al. 2004). However, not having the gradients may slow down estimation substantially. Moreover, our implementation supports computation in parallel which is a major advantage given the availability of multi-core processors.
Since the implementation is easy to extend, possible extensions are:
- Survival times using mixed generalized survival models (Liu, Pawitan, and Clements 2017) with a similar random effect structure as the model shown above. This way, one avoids dichotomizing outcomes and can account for censoring.
- Generalized linear mixed model with binary, binomial, ordinal, or multinomial outcomes with a probit link. The method we use here may be beneficial if the number of random effects per cluster is not much smaller then the number observations in each cluster. This is used for imputation in the mdgc package.
Installation
The package can be installed from GitHub by calling:
r
remotes::install_github("boennecd/pedmod", build_vignettes = TRUE)
The package can also be installed from CRAN by calling:
r
install.packages("pedmod")
The code benefits from being build with automatic vectorization so
having e.g. -O3 in the CXX14FLAGS flags in your Makevars file may
be useful.
Example
We start with a simple example only with a direct genetic effect. We have one type of family which consists of two couples which are related through one of the parents being siblings. The family is shown below.
``` r
create the family we will use
fam <- data.frame(id = 1:10, sex = rep(1:2, 5L), father = c(NA, NA, 1L, NA, 1L, NA, 3L, 3L, 5L, 5L), mother = c(NA, NA, 2L, NA, 2L, NA, 4L, 4L, 6L, 6L))
plot the pedigree
library(kinship2) ped <- with(fam, pedigree(id = id, dadid = father, momid = mother, sex = sex)) plot(ped) ```

We set the scale matrix to be two times the kinship matrix to model the direct genetic effect. Each individual also has a standard normally distributed covariate and a binary covariate. Thus, we can simulate a data set with a function like:
``` r
simulates a data set.
Args:
n_fams: number of families.
beta: the fixed effect coefficients.
sig_sq: the scale parameter.
simdat <- function(nfams, beta = c(-3, 1, 2), sigsq = 3){ # setup before the simulations Cmat <- 2 * kinship(ped) nobs <- NROW(fam) Sig <- diag(nobs) + sigsq * Cmat Sig_chol <- chol(Sig)
# simulate the data
out <- replicate(
nfams, {
# simulate covariates
X <- cbind((Intercept) = 1, Continuous = rnorm(nobs),
Binary = runif(n_obs) > .5)
# assign the linear predictor + noise
eta <- drop(X %*% beta) + drop(rnorm(n_obs) %*% Sig_chol)
# return the list in the format needed for the package
list(y = as.numeric(eta > 0), X = X, scale_mats = list(Cmat))
}, simplify = FALSE)
# add attributes with the true values and return attributes(out) <- list(beta = beta, sigsq = sigsq) out } ```
The model is
where
is two times the kinship matrix and
and
are observed covariates. We can now estimate the model with a
simulated data set as follows:
``` r
simulate a data set
set.seed(27107390) dat <- simdat(nfams = 400L)
perform the optimization. We start with finding the starting values
library(pedmod) llterms <- pedigreellterms(dat, maxthreads = 4L) system.time(start <- pedmodstart(ptr = llterms, data = dat, n_threads = 4L))
> user system elapsed
> 14.813 0.003 3.723
log likelihood without the random effects and at the starting values
start$logLiknorng
> [1] -1690
start$logLik_est # this is unreliably/imprecise
> [1] -1619
estimate the model
system.time( optout <- pedmodopt( ptr = llterms, par = start$par, abseps = 0, useaprx = TRUE, nthreads = 4L, maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
> user system elapsed
> 42.16 0.00 10.57
```
The results of the estimation are shown below:
``` r
parameter estimates versus the truth
rbind(optout = head(optout$par, -1), optoutquick = head(start $par, -1), truth = attr(dat, "beta"))
> (Intercept) Continuous Binary
> opt_out -2.872 0.9689 1.878
> optoutquick -2.844 0.9860 1.857
> truth -3.000 1.0000 2.000
c(optout = exp(tail(optout$par, 1)), optoutquick = exp(tail(start $par, 1)), truth = attr(dat, "sig_sq"))
> optout optout_quick truth
> 2.908 2.812 3.000
log marginal likelihoods
print(start $logLik_est, digits = 8) # this is unreliably/imprecise
> [1] -1618.5064
print(-opt_out$value , digits = 8)
> [1] -1618.4045
```
We emphasize that we set the rel_eps parameter to 1e-3 above which
perhaps is fine for this size of a data set but may not be fine for
larger data sets for the following reason. Suppose that we have families/clusters and suppose that we estimate the log
likelihood term for each family with a variance of
. This implies that the variance of the log likelihood for all
the families is
. Thus, the precision we require for each family’s log
likelihood term needs to be proportional to
if we want a fixed number of precise digits for
the log likelihood for all number of families. The latter is important
e.g. for the profile likelihood curve we compute later and also for the
line search used by some optimization methods. Thus, one may need to
reduce
rel_eps and increase maxvls when there are many families.
We can construct standard errors by computing the Hessian using the
eval_pedigree_hess function as shown below. Like eval_pedigree_grad,
the eval_pedigree_hess functions takes in the log of the scale
parameters but the Hessian is computed on the scale of the scale
parameters.
``` r set.seed(1) system.time(hess <- evalpedigreehess( ptr = llterms, par = optout$par, maxvls = 25000L, minvls = 5000L, abseps = 0, releps = 1e-4, doreorder = TRUE, useaprx = FALSE, n_threads = 4L))
> user system elapsed
> 7.799 0.000 1.980
the gradient is quite small
sqrt(sum(attr(hess, "grad")^2))
> [1] 0.02917
show parameter estimates along with standard errors
rbind(Estimates = opt_out$par, SE = sqrt(diag(attr(hess, "vcov"))))
> (Intercept) Continuous Binary
> Estimates -2.8718 0.9689 1.878 1.0673
> SE 0.3427 0.1203 0.236 0.3211
rbind(Estimates = c(head(optout$par, -1), exp(tail(optout$par, 1))), SE = sqrt(diag(attr(hess, "vcov_org"))))
> (Intercept) Continuous Binary
> Estimates -2.8718 0.9689 1.8783 2.9075
> SE 0.3422 0.1202 0.2358 0.9323
```
We may want to report estimates with the proportion of variances and the standardized fixed effects coefficients which we show later. This can be done by applying the delta method. An example is given below.
``` r
computes the standardized coefficients and proportion of variances. The
covariance matrix is computed using the delta method.
Args:
par: the parameter estimates.
n_scales: the number of scale parameters.
hess: the output from evalpedigreehess
stdpropestimates <- function(par, nscales, hess = NULL){ # transform the parameter estimates npar <- length(par) nfixef <- npar - nscales idxscale <- seqlen(nscales) + nfixef par[idxscale] <- exp(par[idxscale]) totalvar <- 1 + sum(par[idx_scale])
denom <- sqrt(totalvar) ddenom <- -1/(2 * denom * totalvar) parout <- c(par[-idxscale] / denom, par[idxscale] / total_var)
if(!is.null(hess)){ # compute the Jacobian from par to parout jac <- matrix(0, npar, npar) nfixed <- npar - nscales for(i in seqlen(nfixed)){ jac[i, i] <- 1 / denom jac[i, idxscale] <- par[i] * ddenom }
for(i in seq_len(n_scales)){
jac[idx_scale[i], idx_scale ] <- -par[idx_scale[i]] / total_var^2
jac[idx_scale[i], idx_scale[i]] <-
jac[idx_scale[i], idx_scale[i]] + 1 / total_var
}
# compute the Hessian using the delta method
vcov_var <- tcrossprod(jac %*% attr(hess, "vcov_org"), jac)
} else vcov_var <- NULL
list(par = parout, vcovvar = vcov_var) }
show the transformed estimates along with standard errors
stdprop <- stdpropestimates(optout$par, nscales = 1L, hess = hess) rbind( Truth = stdpropestimates( c(attr(dat, "beta"), log(attr(dat, "sigsq"))), 1)$par, Estimates = stdprop$par, SE = sqrt(diag(stdprop$vcov_var)))
> (Intercept) Continuous Binary
> Truth -1.5000 0.50000 1.00000 0.75000
> Estimates -1.4528 0.49014 0.95018 0.74408
> SE 0.0476 0.02613 0.05042 0.06106
```
Minimax Tilting
The minimax tilting method suggested by Botev (2017) is also implemented. The method is more numerically stable when the marginal likelihood terms are small (for instance with large clusters) or for certain problems. However, there is some overhead in the implementation of the method as underflow becomes an issue. This requires more care which increases the computation time.
We estimate the model below with the minimax tilting using the
use_tilting argument.
``` r
perform the optimization. We start with finding the starting values
set.seed(60941821) system.time( starttilt <- pedmodstart( ptr = llterms, data = dat, nthreads = 4L, usetilting = TRUE, useaprx = FALSE))
> user system elapsed
> 21.943 0.000 5.503
estimate the model
system.time( optouttilt <- pedmodopt( ptr = llterms, par = starttilt$par, abseps = 0, useaprx = FALSE, nthreads = 4L, usetilting = TRUE, maxvls = 25000L, releps = 1e-3, minvls = 5000L))
> user system elapsed
> 163.367 0.233 41.043
```
The results of the estimation are shown below:
``` r
parameter estimates versus the truth
rbind(optouttilt = head(optouttilt$par, -1), optout = head(optout$par , -1), truth = attr(dat, "beta"))
> (Intercept) Continuous Binary
> optouttilt -2.874 0.9694 1.879
> opt_out -2.872 0.9689 1.878
> truth -3.000 1.0000 2.000
c(optouttilt = exp(tail(optouttilt$par, 1)), optout = exp(tail(optout$par, 1)), truth = attr(dat, "sig_sq"))
> optouttilt opt_out truth
> 2.912 2.908 3.000
log marginal likelihoods
print(start $logLik_est, digits = 8) # this is unreliably/imprecise
> [1] -1618.5064
print(starttilt $logLikest, digits = 8) # this is unreliably/imprecise
> [1] -1618.5602
print(-opt_out $value , digits = 8)
> [1] -1618.4045
print(-optouttilt$value , digits = 8)
> [1] -1618.4067
```
Different Optimizer
As the gradient is an approximation, some nonlinear optimizer may give
better results than others. We illustrate this below by using the
nlminb function.
``` r
create a wrapper function
nlminb_wrapper <- function( par, fn, gr = NULL, control = list(eval.max = 1000L, iter.max = 1000L), ...){ out <- nlminb( start = par, objective = fn, gradient = gr, control = control, ...) within(out, { counts <- evaluations value <- objective }) }
estimate the model
system.time( optouttiltnlminb <- pedmodopt( ptr = llterms, par = starttilt$par, abseps = 0, useaprx = FALSE, nthreads = 4L, usetilting = TRUE, maxvls = 25000L, releps = 1e-3, minvls = 5000L, optfunc = nlminb_wrapper))
> user system elapsed
> 579.41 0.02 145.41
```
The results of the estimation are shown below:
``` r
parameter estimates versus the truth
rbind(optouttiltnlminb = head(optouttiltnlminb$par, -1), optouttilt = head(optouttilt$par, -1), optout = head(optout$par , -1), truth = attr(dat, "beta"))
> (Intercept) Continuous Binary
> optouttilt_nlminb -2.860 0.9649 1.870
> optouttilt -2.874 0.9694 1.879
> opt_out -2.872 0.9689 1.878
> truth -3.000 1.0000 2.000
c(optouttiltnlminb = exp(tail(optouttiltnlminb$par, 1)), optouttilt = exp(tail(optouttilt$par, 1)), optout = exp(tail(optout$par, 1)), truth = attr(dat, "sig_sq"))
> optouttiltnlminb optouttilt optout truth
> 2.874 2.912 2.908 3.000
log marginal likelihoods
print(-opt_out $value, digits = 8)
> [1] -1618.4045
print(-optouttilt $value, digits = 8)
> [1] -1618.4067
print(-optouttilt_nlminb$value, digits = 8)
> [1] -1618.408
```
Alternative Parameterization
As an alternative to the direct parameterization we use above, we can also use the standardized parameterization. Below are some illustrations which you may skip.
``` r
transform the parameters and check that we get the same likelihood
stdpar <- directtostandardized(optout$par, nscales = 1L) stdpar # the standardized parameterization
> (Intercept) Continuous Binary
> -1.4528 0.4901 0.9502 1.0673
opt_out$par # the direct parameterization
> (Intercept) Continuous Binary
> -2.8718 0.9689 1.8783 1.0673
we can map back as follows
parback <- standardizedtodirect(stdpar, nscales = 1L) all.equal(optout$par, par_back, check.attributes = FALSE)
> [1] TRUE
the proportion of variance of each effect
attr(par_back, "variance proportions")
> Residual
> 0.2559 0.7441
the proportion match
exp(tail(optout$par, 1)) / (exp(tail(optout$par, 1)) + 1)
>
> 0.7441
compute the likelihood with either parameterization
set.seed(1L) evalpedigreell(ptr = llterms, par = optout$par, maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abs_eps = 0)
> [1] -1618
> attr(,"n_fails")
> [1] 10
> attr(,"std")
> [1] 0.004053
set.seed(1L) evalpedigreell(ptr = llterms, par = stdpar , maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abs_eps = 0, standardized = TRUE)
> [1] -1618
> attr(,"n_fails")
> [1] 10
> attr(,"std")
> [1] 0.004053
we can also get the same gradient with an application of the chain rule
jac <- attr( standardizedtodirect(stdpar, nscales = 1L, jacobian = TRUE), "jacobian")
set.seed(1L) g1 <- evalpedigreegrad(ptr = llterms, par = optout$par, maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abseps = 0) set.seed(1L) g2 <- evalpedigreegrad(ptr = llterms, par = stdpar, maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abseps = 0, standardized = TRUE) all.equal(drop(g1 %*% jac), g2, check.attributes = FALSE)
> [1] TRUE
```
The model can also be estimated with the standardized parameterization:
``` r
perform the optimization. We start with finding the starting values
system.time(startstd <- pedmodstart( ptr = llterms, data = dat, nthreads = 4L, standardized = TRUE))
> user system elapsed
> 6.249 0.000 1.570
the starting values are close
standardizedtodirect(startstd$par, nscales = 1L)
> (Intercept) Continuous Binary
> -2.8435 0.9858 1.8566 1.0332
> attr(,"variance proportions")
> Residual
> 0.2625 0.7375
start$par
> (Intercept) Continuous Binary
> -2.844 0.986 1.857 1.034
this may have required different number of gradient and function evaluations
start_std$opt$counts
> function gradient
> 31 31
start $opt$counts
> function gradient
> 48 48
estimate the model
system.time( optoutstd <- pedmodopt( ptr = llterms, par = startstd$par, abseps = 0, useaprx = TRUE, nthreads = 4L, standardized = TRUE, maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
> user system elapsed
> 31.347 0.000 7.845
we get the same
standardizedtodirect(optoutstd$par, n_scales = 1L)
> (Intercept) Continuous Binary
> -2.8708 0.9691 1.8772 1.0674
> attr(,"variance proportions")
> Residual
> 0.2559 0.7441
opt_out$par
> (Intercept) Continuous Binary
> -2.8718 0.9689 1.8783 1.0673
this may have required different number of gradient and function evaluations
optoutstd$counts
> function gradient
> 15 10
opt_out $counts
> function gradient
> 31 12
```
Stochastic Quasi-Newton Method
The package includes a stochastic quasi-Newton method which can be used
to estimate the model. This may be useful for larger data sets or in
situations where pedmod_opt “get stuck” near a maximum. The reason for
the latter is presumably that pedmod_opt (by default) uses the BFGS
method which does not assume any noise in the gradient or the function.
We give an example below of how to use the stochastic quasi-Newton
method provided through the pedmod_sqn function.
``` r
fit the model with the stochastic quasi-Newton method
set.seed(46712994) system.time( sqnout <- pedmodsqn( ptr = llterms, par = start$par, abseps = 0, useaprx = TRUE, nthreads = 4L, releps = 1e-3, stepfactor = .1, maxvls = 25000L, minvls = 1000L, nit = 400L, ngradsteps = 10L, ngrad = 100L, n_hess = 400L))
> user system elapsed
> 339.779 0.004 84.991
show the log marginal likelihood
llwrapper <- function(x) evalpedigreell( ptr = llterms, x, maxvls = 50000L, minvls = 1000L, abseps = 0, releps = 1e-4, nthreads = 4L) print(llwrapper(sqn_out$par), digits = 8)
> [1] -1618.4635
> attr(,"n_fails")
> [1] 151
> attr(,"std")
> [1] 0.00073468344
print(llwrapper(optout$par), digits = 8)
> [1] -1618.4063
> attr(,"n_fails")
> [1] 169
> attr(,"std")
> [1] 0.00073978509
compare the parameters
rbind(optim = optout$par, sqn = sqnout$par)
> (Intercept) Continuous Binary
> optim -2.872 0.9689 1.878 1.067
> sqn -2.841 0.9734 1.865 1.039
plot the marginal log likelihood versus the iteration number
lls <- apply(sqnout$omegas, 2L, llwrapper) par(mar = c(5, 5, 1, 1)) plot(lls, ylab = "Log marginal likelihood", bty = "l", pch = 16, xlab = "Hessian updates") lines(smooth.spline(seq_along(lls), lls)) grid() ```

``` r
perhaps we could have used fewer samples in each iteration
set.seed(46712994) system.time( sqnoutfew <- pedmodsqn( ptr = llterms, par = start$par, abseps = 0, useaprx = TRUE, nthreads = 4L, releps = 1e-3, stepfactor = .1, maxvls = 25000L, minvls = 1000L, ngradsteps = 20L, # we take more iterations nit = 2000L, # but use fewer samples in each iteration ngrad = 20L, nhess = 100L))
> user system elapsed
> 334.146 0.008 83.575
compute the marginal log likelihood and compare the parameter estimates
print(llwrapper(sqnout_few$par), digits = 8)
> [1] -1618.4489
> attr(,"n_fails")
> [1] 156
> attr(,"std")
> [1] 0.00074678963
rbind(optim = optout $par,
sqn = sqnout $par,
sqn (few) = sqnoutfew$par)
> (Intercept) Continuous Binary
> optim -2.872 0.9689 1.878 1.067
> sqn -2.841 0.9734 1.865 1.039
> sqn (few) -2.845 0.9533 1.877 1.035
```
Profile Likelihood Curve
We can compute a profile likelihood curve like this:
``` r
assign the scale parameter at which we will evaluate the profile likelihood
rg <- range(exp(tail(optout$par, 1) / 2) * c(.5, 2), sqrt(attr(dat, "sigsq")) * c(.9, 1.1)) sigs <- seq(rg[1], rg[2], length.out = 10) sigs <- sort(c(sigs, exp(tail(opt_out$par, 1) / 2)))
compute the profile likelihood
llterms <- pedigreellterms(dat, maxthreads = 4L) plcurveres <- lapply(sigs, function(sig){ # set the parameters to pass beta <- start$betanorng sigsqlog <- 2 * log(sig) beta_scaled <- beta * sqrt(1 + sig^2)
# optimize like before but using the fix argument optoutquick <- pedmodopt( ptr = llterms, par = c(betascaled, sigsqlog), maxvls = 1000L, abseps = 0, releps = 1e-2, minvls = 100L, useaprx = TRUE, nthreads = 4L, fix = length(beta) + 1L) optout <- pedmodopt( ptr = llterms, par = c(optoutquick$par, sigsqlog), abseps = 0, useaprx = TRUE, nthreads = 4L, fix = length(beta) + 1L, # we changed these parameters maxvls = 25000L, releps = 1e-3, minvls = 5000L)
# report to console and return message(sprintf("\nLog likelihood %.5f (%.5f). Estimated parameters:", -optout$value, -optoutquick$value)) message(paste0(capture.output(print( c(optout$par, Scale = sig))), collapse = "\n"))
list(optoutquick = optoutquick, optout = optout) }) ```
We can construct an approximate 95% confidence interval using an
estimated cubic smoothing spline for the profile likelihood (more sigs
points may be needed to get a good estimate of the smoothing spline):
``` r
get the critical values
alpha <- .05 crit_val <- qchisq(1 - alpha, 1)
fit the cubic smoothing spline
pls <- -sapply(plcurveres, function(x) x$optout$value) smoothest <- smooth.spline(sigs, pls)
check that we have values within the bounds
maxml <- -optout$value lldiffs <- 2 * (maxml - pls) stopifnot(any(head(lldiffs, length(lldiffs) / 2) > critval), any(tail(lldiffs, length(lldiffs) / 2) > critval))
find the values
maxpar <- tail(optout$par, 1) lb <- uniroot(function(x) 2 * (maxml - predict(smoothest, x)$y) - critval, c(min(sigs) , exp(maxpar / 2)))$root ub <- uniroot(function(x) 2 * (maxml - predict(smoothest, x)$y) - critval, c(exp(maxpar / 2), max(sigs)))$root
the confidence interval
c(lb, ub)
> [1] 1.260 2.528
c(lb, ub)^2 # on the variance scale
> [1] 1.587 6.393
```
A caveat is that issues with the
approximation may arise on the boundary of the scale
parameter (
; e.g. see
https://stats.stackexchange.com/a/4894/81865). Notice that the above
may fail if the estimated profile likelihood is not smooth e.g. because
of convergence issues. We can plot the profile likelihood and highlight
the critical value as follows:
r
par(mar = c(5, 5, 1, 1))
plot(sigs, pls, bty = "l",
pch = 16, xlab = expression(sigma), ylab = "Profile likelihood")
grid()
lines(predict(smooth_est, seq(min(sigs), max(sigs), length.out = 100)))
abline(v = exp(tail(opt_out$par, 1) / 2), lty = 2) # the estimate
abline(v = sqrt(attr(dat, "sig_sq")), lty = 3) # the true value
abline(v = lb, lty = 3) # mark the lower bound
abline(v = ub, lty = 3) # mark the upper bound
abline(h = max_ml - crit_val / 2, lty = 3) # mark the critical value

The pedmod_profile function is a convenience function to do like
above. An example of using the pedmod_profile function is provided
below:
``` r
find the profile likelihood based confidence interval
profres <- pedmodprofile( ptr = llterms, par = optout$par, delta = .5, maxvls = 10000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 4L, useaprx = TRUE, n_threads = 4L, verbose = TRUE)
> The estimate of the standard error of the log likelihood is 0.00264089. Preferably this should be below 0.001
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -1619.7619 at 0.567300
> LogLike: -1619.7602 at 0.567300
> LogLike: -1624.4396 at 0.067300
> LogLike: -1624.4340 at 0.067300
> LogLike: -1620.8744 at 0.406401. Lb, target, ub: -1620.8744, -1620.3315, -1619.7602
> LogLike: -1620.8691 at 0.406401. Lb, target, ub: -1620.8691, -1620.3315, -1619.7602
> LogLike: -1620.3400 at 0.477029. Lb, target, ub: -1620.3400, -1620.3315, -1619.7602
> LogLike: -1620.3377 at 0.477029. Lb, target, ub: -1620.3377, -1620.3315, -1619.7602
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -1619.3169 at 1.567300
> LogLike: -1619.3037 at 1.567300
> LogLike: -1621.2055 at 2.067300
> LogLike: -1621.1781 at 2.067300
> LogLike: -1620.2901 at 1.838266. Lb, target, ub: -1621.1781, -1620.3315, -1620.2901
> LogLike: -1620.2681 at 1.838266. Lb, target, ub: -1621.1781, -1620.3315, -1620.2681
> LogLike: -1620.4497 at 1.878606. Lb, target, ub: -1620.4497, -1620.3315, -1620.2681
> LogLike: -1620.4236 at 1.878606. Lb, target, ub: -1620.4236, -1620.3315, -1620.2681
> LogLike: -1618.4107 at 1.067300
the confidence interval for the scale parameter
exp(prof_res$confs)
> 2.50 pct. 97.50 pct.
> 1.613 6.390
compare with Wald based confidence intervals on the log scale
Waldconf <- tail(optout$par, 1) + c(-1, 1) * qnorm(.975) *
sqrt(tail(diag(attr(hess, "vcov")), 1))
rbind(Wald = Waldconf, Profile likelihood = profres$confs)
> 2.50 pct. 97.50 pct.
> Wald 0.4380 1.697
> Profile likelihood 0.4779 1.855
plot the estimated profile likelihood curve and check that everything looks
fine
sigs <- exp(profres$xs / 2) pls <- profres$plogLik par(mar = c(5, 5, 1, 1)) plot(log(sigs), pls, bty = "l", pch = 16, xlab = expression(log(sigma)), ylab = "Profile likelihood") grid() smoothest <- smooth.spline(log(sigs), pls) lines(predict(smoothest, log(seq(min(sigs), max(sigs), length.out = 100)))) abline(v = exp(tail(optout$par, 1) / 2), lty = 2) # the estimate abline(v = sqrt(attr(dat, "sigsq")), lty = 3) # the true value abline(h = max(pls) - qchisq(.95, 1) / 2, lty = 3) # mark the critical value
abline(v = Waldconf / 2, lty = 4) # Wald abline(v = profres$confs / 2, lty = 3) # Profile likelihood ```

``` r
we can do the same for the slope of the binary covariates
profres <- pedmodprofile( ptr = llterms, par = optout$par, delta = .5, maxvls = 10000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 3L, useaprx = TRUE, n_threads = 4L, verbose = TRUE)
> The estimate of the standard error of the log likelihood is 0.00264089. Preferably this should be below 0.001
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -1622.3662 at 1.378256
> LogLike: -1622.3591 at 1.378256
> LogLike: -1618.4107 at 1.878256
> LogLike: -1619.2925 at 1.606492. Lb, target, ub: -1622.3591, -1620.3315, -1619.2925
> LogLike: -1619.2884 at 1.606492. Lb, target, ub: -1622.3591, -1620.3315, -1619.2884
> LogLike: -1620.4820 at 1.490233. Lb, target, ub: -1620.4820, -1620.3315, -1619.2884
> LogLike: -1620.4792 at 1.490233. Lb, target, ub: -1620.4792, -1620.3315, -1619.2884
> LogLike: -1620.1981 at 1.512517. Lb, target, ub: -1620.4792, -1620.3315, -1620.1981
> LogLike: -1620.1979 at 1.512517. Lb, target, ub: -1620.4792, -1620.3315, -1620.1979
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -1619.6178 at 2.378256
> LogLike: -1619.5991 at 2.378256
> LogLike: -1621.3787 at 2.878256
> LogLike: -1621.3504 at 2.878256
> LogLike: -1620.5401 at 2.634567. Lb, target, ub: -1620.5401, -1620.3315, -1619.5991
> LogLike: -1620.5161 at 2.634567. Lb, target, ub: -1620.5161, -1620.3315, -1619.5991
> LogLike: -1620.2801 at 2.561444. Lb, target, ub: -1620.5161, -1620.3315, -1620.2801
> LogLike: -1620.2571 at 2.561444. Lb, target, ub: -1620.5161, -1620.3315, -1620.2571
> LogLike: -1618.4107 at 1.878256
the confidence interval for the slope of the binary covariate
prof_res$confs
> 2.50 pct. 97.50 pct.
> 1.502 2.582
compare w/ Wald
Waldconf <- optout$par[3] + c(-1, 1) * qnorm(.975) *
sqrt(diag(attr(hess, "vcov"))[3])
rbind(Wald = Waldconf, Profile likelihood = profres$confs)
> 2.50 pct. 97.50 pct.
> Wald 1.416 2.341
> Profile likelihood 1.502 2.582
```
``` r
plot the estimated profile likelihood curve and check that everything looks
fine
binslope <- profres$xs pls <- profres$plogLik par(mar = c(5, 5, 1, 1)) plot(binslope, pls, bty = "l", pch = 16, xlab = expression(beta[2]), ylab = "Profile likelihood") grid() lines(spline(binslope, pls, n = 100)) abline(v = optout$par[3], lty = 2) # the estimate abline(v = attr(dat, "beta")[3], lty = 3) # the true value abline(h = max(pls) - qchisq(.95, 1) / 2, lty = 3) # mark the critical value ```

We only ran the above with one seed. We can draw the curve with using different seeds to check if this does not change the estimates. We will likely need to use more samples if the result depends on the seed.
``` r
compute the profile likelihood using different seeds
plcurveres <- lapply(1:5, function(seed) pedmodprofile( ptr = llterms, par = optout$par, delta = .5, maxvls = 10000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 4L, useaprx = TRUE, nthreads = 4L, seed = seed)) ```
We show the estimated profile likelihood based confidence intervals below:
``` r
the profile likelihood based confidence intervals
print(exp(t(sapply(plcurveres, [[, "confs"))), digits = 8)
> 2.50 pct. 97.50 pct.
> [1,] 1.6127142 6.3902930
> [2,] 1.6111401 6.4102724
> [3,] 1.6124553 6.3921109
> [4,] 1.6122517 6.3889698
> [5,] 1.6122009 6.4139313
```
Randomized Quasi-Monte Carlo
There are two randomized quasi-Monte Carlo methods which are implemented in the package: randomized Korobov rules as in the implementation by Genz and Bretz (2002) and scrambled Sobol sequences. The former is used by default. The questions is which method to use. As an example, we will increase the number of samples with either methods and see how this effects the error for the gradient of the log likelihood from the first couple of families. We do this below:
``` r
create a simple function which computes the gradient. We set the convergence
threshold values low such that all the samples will be used
gr <- function(maxvls, method, par = start$par, minvls = 500L) evalpedigreegrad(ptr = llterms, par = par, maxvls = maxvls, abseps = 0, releps = 1e-12, indices = 0:9, minvls = minvls, method = method, nthreads = 4L)
compute the estimator for either method using an increasing number of samples
n_samp <- 1000 * 2^(0:9) # the sample sizes we will use seeds <- 1:40 # the seeds we will use
res <- sapply(setNames(nsamp, nsamp), function(maxvls){ sapply(c(Korobov = 0, Sobol = 1), function(method){ # estimate the gradient ests <- sapply(seeds, function(s){ set.seed(s) gr(maxvls = maxvls, minvls = maxvls, method = method) })
# return the mean of the estimators and the standard deviation
rbind(mean = rowMeans(ests),
sd = apply(ests, 1L, sd))
}, simplify = "array") }, simplify = "array")
set the names of the dimensions
dimnames(res) <- list( metric = dimnames(res)[[1L]], parameter = names(optout$par), method = dimnames(res)[[3L]], samples = nsamp)
they seem to converge to the same estimate as expected
print(t(res["mean", , "Korobov", ]), digits = 6)
> parameter
> samples (Intercept) Continuous Binary
> 1000 -0.542977 3.07220 -1.64744 -0.903425
> 2000 -0.545156 3.07124 -1.64875 -0.904159
> 4000 -0.545396 3.07055 -1.64847 -0.903605
> 8000 -0.545606 3.07174 -1.64928 -0.902010
> 16000 -0.545329 3.07147 -1.64913 -0.903307
> 32000 -0.545353 3.07142 -1.64903 -0.903075
> 64000 -0.545338 3.07154 -1.64908 -0.902849
> 128000 -0.545369 3.07151 -1.64908 -0.902838
> 256000 -0.545366 3.07148 -1.64908 -0.902898
> 512000 -0.545370 3.07149 -1.64910 -0.902875
print(t(res["mean", , "Sobol" , ]), digits = 6)
> parameter
> samples (Intercept) Continuous Binary
> 1000 -0.545443 3.07244 -1.64925 -0.909546
> 2000 -0.544713 3.07247 -1.64857 -0.907893
> 4000 -0.545858 3.07177 -1.64887 -0.903273
> 8000 -0.545198 3.07091 -1.64901 -0.903082
> 16000 -0.545413 3.07152 -1.64880 -0.902484
> 32000 -0.545362 3.07154 -1.64900 -0.902564
> 64000 -0.545370 3.07142 -1.64907 -0.902848
> 128000 -0.545363 3.07144 -1.64906 -0.902843
> 256000 -0.545373 3.07149 -1.64907 -0.902815
> 512000 -0.545372 3.07149 -1.64909 -0.902861
get a best estimator of the gradient by combining the two
preciseest <- rowMeans(res["mean", , , length(nsamp)])
the standard deviation of the result scaled by the absolute value of the
estimated gradient to get the number of significant digits
round(t(res["sd", , "Korobov", ] / abs(precise_est)), 6)
> parameter
> samples (Intercept) Continuous Binary
> 1000 0.020412 0.008023 0.006444 0.026864
> 2000 0.003959 0.001780 0.001473 0.007806
> 4000 0.004619 0.002070 0.001824 0.008830
> 8000 0.001635 0.000607 0.000610 0.003488
> 16000 0.000653 0.000251 0.000256 0.001580
> 32000 0.000389 0.000155 0.000168 0.001423
> 64000 0.000235 0.000103 0.000094 0.000637
> 128000 0.000075 0.000028 0.000025 0.000217
> 256000 0.000046 0.000022 0.000024 0.000162
> 512000 0.000091 0.000041 0.000033 0.000286
round(t(res["sd", , "Sobol" , ] / abs(precise_est)), 6)
> parameter
> samples (Intercept) Continuous Binary
> 1000 0.019472 0.008728 0.007275 0.033470
> 2000 0.011401 0.004239 0.004862 0.020085
> 4000 0.006189 0.002074 0.002653 0.013707
> 8000 0.003146 0.001051 0.001301 0.005197
> 16000 0.001674 0.000675 0.000741 0.003351
> 32000 0.000834 0.000346 0.000284 0.001169
> 64000 0.000352 0.000175 0.000173 0.000862
> 128000 0.000193 0.000083 0.000076 0.000398
> 256000 0.000099 0.000051 0.000049 0.000203
> 512000 0.000047 0.000020 0.000017 0.000132
```
``` r
look at a log-log regression to check convergence rate. We expect a rate
between 0.5, O(sqrt(n)) rate, and 1, O(n) rate, which can be seen from minus
the slopes below
coef(lm(t(log(res["sd", , "Korobov", ])) ~ log(n_samp)))
> (Intercept) Continuous Binary
> (Intercept) 1.404 2.1636 1.2910 1.4868
> log(n_samp) -0.934 -0.9249 -0.9073 -0.8022
coef(lm(t(log(res["sd", , "Sobol", ])) ~ log(n_samp)))
> (Intercept) Continuous Binary
> (Intercept) 2.3743 2.8575 2.551 3.0372
> log(n_samp) -0.9797 -0.9437 -0.975 -0.9277
plot the two standard deviation estimates
par(mar = c(5, 5, 1, 1)) matplot(nsamp, t(res["sd", , "Korobov", ]), log = "xy", ylab = "L2 error", type = "p", pch = c(0:2, 5L), col = "black", bty = "l", xlab = "Number of samples", ylim = range(res["sd", , , ])) matlines(nsamp, t(res["sd", , "Korobov", ]), col = "black", lty = 2)
add the points from Sobol method
matplot(nsamp, t(res["sd", , "Sobol", ]), type = "p", pch = 15:18, col = "darkgray", add = TRUE) matlines(nsamp, t(res["sd", , "Sobol", ]), col = "darkgray", lty = 3) ```

The above seems to suggest that the randomized Korobov rules are
preferable and that both method achieve close to a rate for some small
. Notice that we have to set
minvls equal to maxvls to
achieve the rate with randomized Korobov rules.
We can also consider the convergence rate for the log likelihood. This time, we also consider the error using the minimax tilted version suggested by Botev (2017). We also show how the error can be reduced by using fewer randomized qausi-Monte Carlo sequences at the cost of the precision of the error estimate:
``` r
create a simple function which computes the log likelihood. We set the
convergence threshold values low such that all the samples will be used
fn <- function(maxvls, method, par = start$par, ptr = llterms, minvls = 500L, usetilting) evalpedigreell(ptr = ptr, par = par, maxvls = maxvls, abseps = 0, releps = 1e-12, indices = 0:9, minvls = minvls, method = method, nthreads = 4L, usetilting = use_tilting)
compute the estimator for either method using an increasing number of samples
res <- sapply(setNames(nsamp, nsamp), function(maxvls){
sapply(c(W/ tilting = TRUE, W/o tilting = FALSE), function(usetilting){
sapply(c(Korobov = 0, Sobol = 1), function(method){
# estimate the gradient
ests <- sapply(seeds, function(s){
set.seed(s)
fn(maxvls = maxvls, minvls = maxvls, method = method,
usetilting = use_tilting)
})
# return the mean of the estimators and the standard deviation
c(mean = mean(ests), sd = sd(ests))
}, simplify = "array")
}, simplify = "array") }, simplify = "array")
compute the errors with fewer randomized quasi-Monte Carlo sequences
lltermsfewsequences <- pedigreellterms(dat, maxthreads = 4L,
nsequences = 1L)
resfewseqs <- sapply(setNames(nsamp, nsamp), function(maxvls){
sapply(c(W/ tilting = TRUE, W/o tilting = FALSE), function(usetilting){
sapply(c(Korobov = 0, Sobol = 1), function(method){
# estimate the gradient
ests <- sapply(seeds, function(s){
set.seed(s)
fn(maxvls = maxvls, minvls = maxvls, method = method,
ptr = lltermsfewsequences, usetilting = use_tilting)
})
# return the mean of the estimators and the standard deviation
c(mean = mean(ests), sd = sd(ests))
}, simplify = "array")
}, simplify = "array") }, simplify = "array") ```
``` r
the standard deviation of the result scaled by the absolute value of the
estimated log likelihood to get the number of significant digits. Notice that
we scale up the figures by 1000!
preciseest <- mean(res["mean", , , length(nsamp)]) round(1000 * res["sd", "Korobov", , ] / abs(precise_est), 6)
> 1000 2000 4000 8000 16000 32000 64000
> W/ tilting 0.05252 0.008957 0.01149 0.004568 0.001833 0.001027 0.000574
> W/o tilting 0.06358 0.011445 0.01383 0.004873 0.002329 0.000949 0.000855
> 128000 256000 512000
> W/ tilting 0.000190 0.000102 0.000123
> W/o tilting 0.000219 0.000160 0.000245
round(1000 * res["sd", "Sobol" , , ] / abs(precise_est), 6)
> 1000 2000 4000 8000 16000 32000 64000
> W/ tilting 0.03416 0.02132 0.01507 0.006704 0.003523 0.002073 0.001081
> W/o tilting 0.10916 0.04650 0.02482 0.011072 0.006090 0.003202 0.001260
> 128000 256000 512000
> W/ tilting 0.000479 0.000238 0.000101
> W/o tilting 0.000630 0.000336 0.000170
with fewer sequences
round(1000 * resfewseqs["sd", "Korobov", , ] / abs(precise_est), 6)
> 1000 2000 4000 8000 16000 32000 64000
> W/ tilting 0.01134 0.005193 0.002952 0.001582 0.000506 0.000354 0.000412
> W/o tilting 0.01390 0.004954 0.003055 0.002181 0.000653 0.000439 0.000625
> 128000 256000 512000
> W/ tilting 0.000223 4.8e-05 5.1e-05
> W/o tilting 0.000190 5.1e-05 5.3e-05
round(1000 * resfewseqs["sd", "Sobol" , , ] / abs(precise_est), 6)
> 1000 2000 4000 8000 16000 32000 64000
> W/ tilting 0.01701 0.008483 0.005322 0.002887 0.001269 0.000766 0.000323
> W/o tilting 0.03370 0.016389 0.007411 0.004601 0.001951 0.000921 0.000505
> 128000 256000 512000
> W/ tilting 0.000130 0.000066 3.0e-05
> W/o tilting 0.000208 0.000109 6.3e-05
look at log-log regressions
apply(res["sd", , , ], 1:2, function(sds) coef(lm(log(sds) ~ log(n_samp))))
> , , W/ tilting
>
> Korobov Sobol
> (Intercept) 0.1604 0.1002
> log(n_samp) -0.9890 -0.9366
>
> , , W/o tilting
>
> Korobov Sobol
> (Intercept) -0.1441 1.593
> log(n_samp) -0.9335 -1.033
plot the standard deviation estimates. Dashed lines are with fewer sequences
par(mar = c(5, 5, 1, 1)) create_plot <- function(results, ylim){ sds <- matrix(results["sd", , , ], ncol = dim(results)[4]) dimnames(sds) <- list(do.call(outer, c(dimnames(results)[2:3], list(FUN = paste))), NULL)
lty <- c(1, 1, 2, 2) col <- rep(c("black", "darkgray"), 2) matplot(nsamp, t(sds), log = "xy", ylab = "L2 error", lty = lty, type = "l", bty = "l", xlab = "Number of samples", col = col, ylim = ylim) matplot(nsamp, t(sds), pch = c(1, 16), col = col, add = TRUE) legend("bottomleft", bty = "n", lty = lty, col = col, legend = rownames(sds)) grid() }
with more sequences
ylimplot <- range(res["sd", , , ], resfewseqs["sd", , , ]) createplot(res, ylim = ylim_plot) ```

``` r
with one sequence
createplot(resfewseqs, ylim = ylimplot) ```

Again the randomized Korobov rules seems preferable. In general, a
strategy can be to use only one randomized quasi-Monte Carlo sequence as
above and set minvls and maxvls to the desired number of samples.
This will though imply that the method cannot stop early if it is easy
to approximate the log likelihood and its derivative. We fit the model
again below as example of using the scrambled Sobol sequences:
``` r
estimate the model using Sobol sequences
system.time( optoutsobol <- pedmodopt( ptr = llterms, par = start$par, abseps = 0, useaprx = TRUE, nthreads = 4L, maxvls = 25000L, releps = 1e-3, minvls = 5000L, method = 1L))
> user system elapsed
> 47.35 0.00 11.88
compare the result. We start with the log likelihood
print(-optoutsobol$value, digits = 8)
> [1] -1618.4027
print(-opt_out $value, digits = 8)
> [1] -1618.4045
the parameters
rbind(Korobov = optout $par, Sobol = optout_sobol$par)
> (Intercept) Continuous Binary
> Korobov -2.872 0.9689 1.878 1.067
> Sobol -2.874 0.9692 1.880 1.068
number of used function and gradient evaluations
opt_out$counts
> function gradient
> 31 12
optoutsobol$counts
> function gradient
> 12 10
```
Simulation Study
We make a small simulation study below where we are interested in the estimation time, bias and coverage of Wald type confidence intervals.
``` r
the seeds we will use
seeds <- c(36451989L, 18774630L, 76585289L, 31898455L, 55733878L, 99681114L, 37725150L, 99188448L, 66989159L, 20673587L, 47985954L, 42571905L, 53089211L, 18457743L, 96049437L, 70222325L, 86393368L, 45380572L, 81116968L, 48291155L, 89755299L, 69891073L, 1846862L, 15263013L, 37537710L, 25194071L, 14471551L, 38278606L, 55596031L, 5436537L, 75008107L, 83382936L, 50689482L, 71708788L, 52258337L, 23423931L, 61069524L, 24452554L, 32406673L, 14900280L, 24818537L, 59733700L, 82407492L, 95500692L, 62528680L, 88728797L, 9891891L, 36354594L, 69630736L, 41755287L)
run the simulation study
sim_study <- lapply(seeds, function(s){ set.seed(s)
# only run the result if it has not been computed f <- file.path("cache", "simstudysimple", paste0("simple-", s, ".RDS")) if(!file.exists(f)){ # simulate the data dat <- simdat(nfams = 400L)
# get the starting values
library(pedmod)
do_fit <- function(standardized){
ll_terms <- pedigree_ll_terms(dat, max_threads = 4L)
ti_start <- system.time(start <- pedmod_start(
ptr = ll_terms, data = dat, n_threads = 4L,
standardized = standardized))
start$time <- ti_start
ti_fit <- system.time(
opt_out <- pedmod_opt(
ptr = ll_terms, par = start$par, abs_eps = 0, use_aprx = TRUE,
n_threads = 4L,
maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L,
standardized = standardized))
opt_out$time <- ti_fit
if(standardized){
start$par <- standardized_to_direct(start$par, 1L)
opt_out$par <- standardized_to_direct(opt_out$par, 1L)
}
if(!standardized){
hess_time <- system.time(
hess <- eval_pedigree_hess(
ptr = ll_terms, par = opt_out$par, maxvls = 25000L,
abs_eps = 0, minvls = 5000L, use_aprx = TRUE,
rel_eps = 1e-4, n_threads = 4L))
attr(hess, "time") <- hess_time
} else
hess <- NULL
list(start = start, opt_out = opt_out, hess = hess,
ll_no_rng = start$logLik_no_rng)
}
fit_direct <- do_fit(standardized = FALSE)
fit_std <- do_fit(standardized = TRUE)
saveRDS(list(fit_direct = fit_direct, fit_std = fit_std), f)
}
# report to console and return out <- readRDS(f) message(paste0(capture.output(out$fitdirect$optout$par), collapse = "\n")) message(paste0(capture.output(out$fitstd $optout$par), collapse = "\n"))
par <- out$fitdirect$optout$par SEs <- sqrt(diag(attr(out$fit_direct$hess, "vcov")))
message(paste0(capture.output(rbind( Estimate = par, SE = SEs)), collapse = "\n")) message(sprintf( "Time %12.1f, %12.1f. Max ll: %12.4f, %12.4f\n", with(out$fitdirect, start$time["elapsed"] + optout$time["elapsed"]), with(out$fitstd , start$time["elapsed"] + optout$time["elapsed"]), -out$fitdirect$optout$value, -out$fitstd $optout$value))
out })
gather the estimates
betaest <- sapply(simstudy, function(x) cbind(Direct = head(x$fitdirect$optout$par, 3), Standardized = head(x$fitstd $optout$par, 3)), simplify = "array") sigmaest <- sapply(simstudy, function(x) cbind(Direct = exp(tail(x$fitdirect$optout$par, 1) / 2), Standardized = exp(tail(x$fitstd $optout$par, 1) / 2)), simplify = "array")
compute the errors
tmp <- simdat(2L) errbeta <- betaest - attr(tmp, "beta") errsigma <- sigmaest - sqrt(attr(tmp, "sigsq")) dimnames(errsigma)[[1L]] <- "std genetic" err <- abind::abind(errbeta, err_sigma, along = 1)
get the bias estimates and the standard errors
bias <- apply(err, 1:2, mean) nsims <- dim(err)[[3]] SE <- apply(err , 1:2, sd) / sqrt(nsims) bias
> Direct Standardized
> (Intercept) -0.06529 -0.06527
> Continuous 0.02801 0.02803
> Binary 0.03706 0.03692
> std genetic 0.05591 0.05602
SE
> Direct Standardized
> (Intercept) 0.05073 0.05029
> Continuous 0.01714 0.01704
> Binary 0.03364 0.03332
> std genetic 0.03904 0.03872
make a box plot
bvals <- expand.grid(rownames(err), strtrim(colnames(err), 1)) boxdat <- data.frame(Error = c(err), Parameter = rep(bvals$Var1, nsims), Method = rep(b_vals$Var2, dim(err)[[3]])) par(mar = c(7, 5, 1, 1))
S is for the standardized and D is for the direct parameterization
boxplot(Error ~ Method + Parameter, box_dat, ylab = "Error", las = 2, xlab = "") abline(h = 0, lty = 2) grid() ```

``` r
get the average computation times
timevals <- sapply(simstudy, function(x) { . <- function(z){ keep <- c("opt_out", "start") out <- setNames(sapply(z[keep], function(z) z$time["elapsed"]), keep) c(out, total = sum(out)) }
rbind(Direct = .(x$fitdirect), Standardized = .(x$fitstd)) }, simplify = "array") apply(time_vals, 1:2, mean)
> opt_out start total
> Direct 5.990 1.750 7.74
> Standardized 7.644 1.706 9.35
apply(time_vals, 1:2, sd)
> opt_out start total
> Direct 3.448 1.0666 3.65
> Standardized 2.415 0.8325 2.33
apply(time_vals, 1:2, quantile)
> , , opt_out
>
> Direct Standardized
> 0% 2.660 4.013
> 25% 3.862 5.177
> 50% 4.179 7.904
> 75% 8.012 9.456
> 100% 21.358 12.279
>
> , , start
>
> Direct Standardized
> 0% 0.696 0.695
> 25% 1.156 1.248
> 50% 1.319 1.388
> 75% 2.064 1.957
> 100% 5.862 5.882
>
> , , total
>
> Direct Standardized
> 0% 3.861 5.389
> 25% 5.219 7.394
> 50% 6.547 9.439
> 75% 9.388 11.072
> 100% 24.547 14.219
get the standardized errors
erssds <- sapply(simstudy, function(x){ par <- x$fitdirect$optout$par err <- par - c(attr(tmp, "beta"), log(attr(tmp, "sigsq"))) SEs <- sqrt(diag(attr(x$fitdirect$hess, "vcov"))) err / SEs })
rowMeans(abs(ers_sds) < qnorm(.95)) # 90% coverage
> (Intercept) Continuous Binary
> 0.96 0.98 0.96 0.94
rowMeans(abs(ers_sds) < qnorm(.975)) # 95% coverage
> (Intercept) Continuous Binary
> 0.98 0.98 0.98 0.96
rowMeans(abs(ers_sds) < qnorm(.995)) # 99% coverage
> (Intercept) Continuous Binary
> 1.00 0.98 0.98 1.00
stats for the computation time of the Hessian
hesstime <- sapply( simstudy, function(x) attr(x$fitdirect$hess, "time")["elapsed"]) mean(hesstime)
> [1] 1.03
quantile(hess_time, probs = seq(0, 1, .1))
> 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
> 0.9810 0.9938 1.0062 1.0250 1.0306 1.0320 1.0364 1.0400 1.0422 1.0471 1.1410
compute the coverage on the standardized scale with the proportion of
variances
erssds <- sapply(simstudy, function(x){ parnvcov <- stdpropestimates( x$fitdirect$optout$par, 1L, x$fitdirect$hess) truth <- stdpropestimates(c(attr(tmp, "beta"), log(attr(tmp, "sigsq"))), 1) (parnvcov$par - truth$par) / sqrt(diag(parnvcov$vcov_var)) })
rowMeans(abs(ers_sds) < qnorm(.95)) # 90% coverage
> (Intercept) Continuous Binary
> 0.92 0.88 0.92 0.94
rowMeans(abs(ers_sds) < qnorm(.975)) # 95% coverage
> (Intercept) Continuous Binary
> 0.96 0.94 0.92 0.96
rowMeans(abs(ers_sds) < qnorm(.995)) # 99% coverage
> (Intercept) Continuous Binary
> 1.00 1.00 0.98 0.98
```
Example: Adding Child Environment Effects
As an extension, we can add a child environment effect. The new scale
matrix, the
’s, can be written as:
``` r Cenv <- diag(1, NROW(fam)) Cenv[c(3, 5), c(3, 5)] <- 1 Cenv[c(7:8 ), c(7:8 )] <- 1 Cenv[c(9:10), c(9:10)] <- 1
Matrix::Matrix(C_env, sparse = TRUE)
> 10 x 10 sparse Matrix of class "dsCMatrix"
>
> [1,] 1 . . . . . . . . .
> [2,] . 1 . . . . . . . .
> [3,] . . 1 . 1 . . . . .
> [4,] . . . 1 . . . . . .
> [5,] . . 1 . 1 . . . . .
> [6,] . . . . . 1 . . . .
> [7,] . . . . . . 1 1 . .
> [8,] . . . . . . 1 1 . .
> [9,] . . . . . . . . 1 1
> [10,] . . . . . . . . 1 1
```
We assign the new simulation function below but this time we include only binary covariates:
``` r
simulates a data set.
Args:
n_fams: number of families.
beta: the fixed effect coefficients.
sig_sq: the scale parameters.
simdat <- function(nfams, beta = c(-3, 4), sigsq = c(2, 1)){ # setup before the simulations Cmat <- 2 * kinship(ped) nobs <- NROW(fam) Sig <- diag(nobs) + sigsq[1] * Cmat + sigsq[2] * Cenv Sig_chol <- chol(Sig)
# simulate the data
out <- replicate(
nfams, {
# simulate covariates
X <- cbind((Intercept) = 1, Binary = runif(nobs) > .9)
# assign the linear predictor + noise
eta <- drop(X %*% beta) + drop(rnorm(n_obs) %*% Sig_chol)
# return the list in the format needed for the package
list(y = as.numeric(eta > 0), X = X, scale_mats = list(
Genetic = Cmat, Environment = C_env))
}, simplify = FALSE)
# add attributes with the true values and return attributes(out) <- list(beta = beta, sigsq = sigsq) out } ```
The model is
where
is two times the kinship matrix,
is singular matrix for the environment effect, and
is an observed covariate. In this case, we exploit that some
of log marginal likelihood terms are identical. That is, some of the
combinations of pedigrees, covariates, and outcomes match. Therefor, we
can use the
cluster_weights arguments to reduce the computation time
as shown below:
``` r
simulate a data set
set.seed(27107390) dat <- simdat(nfams = 1000L)
compute the log marginal likelihood by not using that some of the log marginal
likelihood terms are identical
betatrue <- attr(dat, "beta") sigsqtrue <- attr(dat, "sigsq")
library(pedmod) lltermswoweights <- pedigreellterms(dat, maxthreads = 4L) system.time(llres <- evalpedigreell( lltermswoweights, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, n_threads = 4))
> user system elapsed
> 0.598 0.000 0.151
system.time(gradres <- evalpedigreegrad( lltermswoweights, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, n_threads = 4))
> user system elapsed
> 15.038 0.000 3.818
find the duplicated combinations of pedigrees, covariates, and outcomes. One
likely needs to change this code if the pedigrees are not identical but are
identical if they are permuted. In this case, the code below will miss
identical log marginal likelihood terms
datunqiue <- dat[!duplicated(dat)] attributes(datunqiue) <- attributes(dat) length(dat_unqiue) # number of unique terms
> [1] 420
get the weights. This can be written in a much more efficient way
cweights <- sapply(datunqiue, function(x) sum(sapply(dat, identical, y = x)))
get the C++ object and show that the computation time is reduced
llterms <- pedigreellterms(datunqiue, max_threads = 4L)
system.time(llresfast <- evalpedigreell( llterms, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = cweights))
> user system elapsed
> 0.251 0.000 0.064
system.time(gradresfast <- evalpedigreegrad( llterms, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = cweights))
> user system elapsed
> 6.337 0.000 1.657
show that we get the same (up to a Monte Carlo error)
print(c(redundant = llres, fast = llres_fast), digits = 6)
> redundant fast
> -2696.62 -2696.63
rbind(redundant = gradres, fast = gradres_fast)
> [,1] [,2] [,3] [,4]
> redundant -12.03 5.148 -13.48 -8.580
> fast -12.05 5.155 -13.56 -8.665
rm(dat) # will not need this anymore
note that the variance is greater for the weighted version
llests <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( lltermswoweights, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4) }) llestsfast <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( llterms, c(betatrue, log(sigsqtrue)), maxvls = 10000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = c_weights) })
the estimates are comparable
c(Without weights = mean(llests), With weights = mean(llests_fast))
> Without weights With weights
> -2697 -2697
the standard deviation is different
c(Without weights = sd(llests), With weights = sd(llests_fast))
> Without weights With weights
> 0.003629 0.020053
we can mitigate this by using the vls_scales argument which though is a bit
slower
llestsfastvlsscales <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( llterms, c(betatrue, log(sigsqtrue)), maxvls = 10000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = cweights, vlsscales = sqrt(cweights)) })
the estimates are comparable
c(Without weights = mean(llests), With weights = mean(llestsfast),
`With weights and vlsscales` = mean(llestsfastvlsscales))
> Without weights With weights
> -2697 -2697
> With weights and vls_scales
> -2697
the standard deviation is different
c(Without weights = sd(llests), With weights = sd(llestsfast),
`With weights and vlsscales` = sd(llestsfastvlsscales))
> Without weights With weights
> 0.003629 0.020053
> With weights and vls_scales
> 0.004966
it is still faster
system.time(llresfast <- evalpedigreell( llterms, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = cweights, vlsscales = sqrt(cweights)))
> user system elapsed
> 0.384 0.000 0.130
system.time(gradresfast <- evalpedigreegrad( llterms, c(betatrue, log(sigsqtrue)), maxvls = 100000L, abseps = 0, releps = 1e-3, minvls = 2500L, useaprx = TRUE, nthreads = 4, clusterweights = cweights, vlsscales = sqrt(cweights)))
> user system elapsed
> 7.095 0.000 1.863
find the starting values
system.time(start <- pedmodstart( ptr = llterms, data = datunqiue, clusterweights = cweights, vlsscales = sqrt(c_weights)))
> user system elapsed
> 11.95 0.00 11.95
optimize
system.time( optoutquick <- pedmodopt( ptr = llterms, par = start$par, abseps = 0, useaprx = TRUE, nthreads = 4L, clusterweights = cweights, maxvls = 5000L, releps = 1e-2, minvls = 500L, vlsscales = sqrt(cweights)))
> user system elapsed
> 6.642 0.000 1.778
system.time( optout <- pedmodopt( ptr = llterms, par = optoutquick$par, abseps = 0, useaprx = TRUE, nthreads = 4L, clusterweights = cweights, vlsscales = sqrt(cweights), # we changed these parameters maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
> user system elapsed
> 22.948 0.000 7.186
```
The results are shown below:
``` r
parameter estimates versus the truth
rbind(optout = head(optout$par, -2), optoutquick = head(start $par, -2), truth = attr(dat_unqiue, "beta"))
> (Intercept) Binary
> opt_out -2.927 3.918
> optoutquick -2.930 3.915
> truth -3.000 4.000
rbind(optout = exp(tail(optout$par, 2)), optoutquick = exp(tail(start $par, 2)), truth = attr(datunqiue, "sigsq"))
>
> opt_out 1.869 0.8448
> optoutquick 1.861 0.8709
> truth 2.000 1.0000
log marginal likelihoods
print( start $logLik_est, digits = 8) # this is unreliably/imprecise
> [1] -2696.138
print(-opt_out$value , digits = 8)
> [1] -2696.1135
```
We compute the Hessian like before to get the standard errors.
``` r set.seed(1) system.time(hess <- evalpedigreehess( ptr = llterms, par = optout$par, maxvls = 25000L, minvls = 5000L, abseps = 0, releps = 1e-4, doreorder = TRUE, useaprx = FALSE, nthreads = 4L, clusterweights = cweights, vlsscales = sqrt(c_weights)))
> user system elapsed
> 10.958 0.003 4.007
the gradient is quite small
sqrt(sum(attr(hess, "grad")^2))
> [1] 0.1967
show parameter estimates along with standard errors
rbind(Estimates = opt_out$par, SE = sqrt(diag(attr(hess, "vcov"))))
> (Intercept) Binary
> Estimates -2.927 3.9183 0.6252 -0.1687
> SE 0.308 0.4176 0.2944 0.3723
rbind(Estimates = c(head(optout$par, -2), exp(tail(optout$par, 2))), SE = sqrt(diag(attr(hess, "vcov_org"))))
> (Intercept) Binary
> Estimates -2.9268 3.9183 1.8685 0.8448
> SE 0.3102 0.4205 0.5547 0.3155
```
Again, we can look at the estimates with the standardized fixed effects coefficients and the proportion of variances.
``` r
show the transformed estimates along with standard errors
stdprop <- stdpropestimates(optout$par, nscales = 2L, hess = hess) rbind( Truth = stdpropestimates( c(attr(datunqiue, "beta"), log(attr(datunqiue, "sigsq"))), 2)$par, Estimates = stdprop$par, SE = sqrt(diag(stdprop$vcov_var)))
> (Intercept) Binary
> Truth -1.50000 2.00000 0.50000 0.25000
> Estimates -1.51886 2.03337 0.50320 0.22750
> SE 0.02395 0.04633 0.05794 0.05164
```
Motivation of Different Number of Samples
We use the cluster_weights argument above to exploit that some of the
log marginal likelihood terms are identical. Specifically, let
be the
th distinct log marginal likelihood term and
be the model parameters, then we use that the log
marginal likelihood is
The unweighted version is the left hand side and the weighted version is
the right hand side. The two have different variances. Our
quasi-Monte-Carlo method has (almost) a variance for each which is
with
being the number of samples we use for each
. Thus, the variance of the unweighted version is
which is
However, the variance of the weighted version is
which is
Though, we can get a similar variance by using
samples for term
. The variance then becomes
but we do so using only
samples rather than
Alternative Parameterization
As before, we can also work with the standardized parameterization.
``` r
transform the parameters and check that we get the same likelihood
stdpar <- directtostandardized(optout$par, nscales = 2L) stdpar # the standardized parameterization
> (Intercept) Binary
> -1.5189 2.0334 0.6252 -0.1687
opt_out$par # the direct parameterization
> (Intercept) Binary
> -2.9268 3.9183 0.6252 -0.1687
we can map back as follows
parback <- standardizedtodirect(stdpar, nscales = 2L) all.equal(optout$par, par_back, check.attributes = FALSE)
> [1] TRUE
the proportion of variance of each effect
attr(par_back, "variance proportions")
> Residual
> 0.2693 0.5032 0.2275
the proportions match
totalvar <- sum(exp(tail(optout$par, 2))) + 1 exp(tail(optout$par, 2)) / totalvar
>
> 0.5032 0.2275
compute the likelihood with either parameterization
set.seed(1L) evalpedigreell(ptr = llterms, par = optout$par, maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abseps = 0, clusterweights = cweights, vlsscales = sqrt(c_weights))
> [1] -2696
> attr(,"n_fails")
> [1] 2
> attr(,"std")
> [1] 0.008579
set.seed(1L) evalpedigreell(ptr = llterms, par = stdpar , maxvls = 10000L, minvls = 1000L, releps = 1e-3, useaprx = TRUE, abseps = 0, clusterweights = cweights, vlsscales = sqrt(c_weights), standardized = TRUE)
> [1] -2696
> attr(,"n_fails")
> [1] 2
> attr(,"std")
> [1] 0.008579
we can also get the same gradient with an application of the chain rule
jac <- attr( standardizedtodirect(stdpar, nscales = 2L, jacobian = TRUE), "jacobian")
set.seed(1L)
g1 <- evalpedigreegrad(ptr = llterms, par = optout$par, maxvls = 10000L,
minvls = 1000L, releps = 1e-3, useaprx = TRUE,
abseps = 0, clusterweights = cweights,
vlsscales = sqrt(cweights))
set.seed(1L)
g2 <- evalpedigreegrad(ptr = llterms, par = stdpar, maxvls = 10000L,
minvls = 1000L, releps = 1e-3, useaprx = TRUE,
abseps = 0, standardized = TRUE,
clusterweights = cweights,
vlsscales = sqrt(cweights))
all.equal(drop(g1 %*% jac), g2, check.attributes = FALSE)
> [1] TRUE
```
The model can also be estimated with the the standardized parameterization:
``` r
perform the optimization. We start with finding the starting values
system.time(startstd <- pedmodstart( ptr = llterms, data = datunqiue, clusterweights = cweights, vlsscales = sqrt(cweights), standardized = TRUE))
> user system elapsed
> 11.85 0.00 11.85
are the starting values similar?
standardizedtodirect(startstd$par, nscales = 2L)
> (Intercept) Binary
> -2.9305 3.9146 0.6211 -0.1382
> attr(,"variance proportions")
> Residual
> 0.2680 0.4987 0.2334
start$par
> (Intercept) Binary
> -2.9305 3.9146 0.6211 -0.1382
this may have required different number of gradient and function evaluations
start_std$opt$counts
> function gradient
> 63 63
start $opt$counts
> function gradient
> 62 62
estimate the model
system.time( optoutquickstd <- pedmodopt( ptr = llterms, par = startstd$par, abseps = 0, useaprx = TRUE, nthreads = 4L, clusterweights = cweights, standardized = TRUE, maxvls = 5000L, releps = 1e-2, minvls = 500L, vlsscales = sqrt(cweights)))
> user system elapsed
> 7.847 0.000 2.089
system.time( optoutstd <- pedmodopt( ptr = llterms, par = optoutquickstd$par, abseps = 0, useaprx = TRUE, nthreads = 4L, clusterweights = cweights, standardized = TRUE, vlsscales = sqrt(cweights), # we changed these parameters maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
> user system elapsed
> 5.544 0.000 1.766
we get the same
standardizedtodirect(optoutstd$par, n_scales = 2L)
> (Intercept) Binary
> -2.9069 3.8915 0.6070 -0.1888
> attr(,"variance proportions")
> Residual
> 0.2730 0.5009 0.2260
opt_out$par
> (Intercept) Binary
> -2.9268 3.9183 0.6252 -0.1687
this may have required different number of gradient and function evaluations
optoutquick_std$counts
> function gradient
> 20 12
optoutquick $counts
> function gradient
> 12 9
optoutstd$counts
> function gradient
> 4 1
opt_out $counts
> function gradient
> 9 6
```
Profile Likelihood Curve
We can make a 2D profile likelihood curve as follows:
``` r
get the values at which we evaluate the profile likelihood
rg <- Map(function(est, truth) range(exp(est / 2) * c(.8, 1.25), truth), est = tail(optout$par, 2), truth = sqrt(attr(datunqiue, "sig_sq")))
sigvals1 <- seq(rg[[1]][1], rg[[1]][2], length.out = 5) sigvals2 <- seq(rg[[2]][1], rg[[2]][2], length.out = 5) sigs <- expand.grid(sigma1 = sigvals1, sigma2 = sigvals2)
function to compute the profile likelihood.
Args:
fix: indices of parameters to fix.
fix_val: values of the fixed parameters.
sig_start: starting values for the scale parameters.
llterms <- pedigreellterms(datunqiue, maxthreads = 4L) plcurvefunc <- function(fix, fixval, sigstart = exp(tail(optout$par, 2) / 2)){ # get the fixed indices of the fixed parameters beta = start$betanorng isfixbeta <- fix <= length(beta) fixbeta <- fix[isfixbeta] isfixsigs <- fix > length(beta) fixsigs <- fix[isfixsigs]
# set the parameters to pass sig <- sigstart if(length(fixsigs) > 0) sig[fixsigs - length(beta)] <- fixval[isfixsigs]
# re-scale beta and setup the sigma argument to pass sigsqlog <- 2 * log(sig) beta_scaled <- beta * sqrt(1 + sum(sig^2))
# setup the parameter vector fixpar <- c(betascaled, sigsqlog) if(length(fixbeta) > 0) fixpar[fixbeta] <- fixval[isfixbeta]
# optimize like before but using the fix argument optoutquick <- pedmodopt( ptr = llterms, par = fixpar, maxvls = 5000L, abseps = 0, releps = 1e-2, minvls = 500L, useaprx = TRUE, nthreads = 4L, fix = fix, clusterweights = cweights, vlsscales = sqrt(c_weights))
# notice that pedmodopt only returns a subset of the parameters. These are # the parameters that have been optimized over parnew <- fixpar parnew[-fix] <- optoutquick$par optout <- pedmodopt( ptr = llterms, par = parnew, abseps = 0, useaprx = TRUE, nthreads = 4L, fix = fix, clusterweights = cweights, vlsscales = sqrt(cweights), # we changed these parameters maxvls = 25000L, releps = 1e-3, minvls = 5000L)
# report to console and return
message(sprintf("\nLog likelihood %.5f (%.5f). Estimated parameters:",
-optout$value, -optoutquick$value))
message(paste0(capture.output(print(
c(non-fixed = optout$par, fixed = fix_par[fix]))), collapse = "\n"))
list(optoutquick = optoutquick, optout = optout) }
compute the profile likelihood
plcurveres <- Map( function(sig1, sig2) plcurvefunc(fix = 0:1 + length(optout$par) - 1L, fixval = c(sig1, sig2)), sig1 = sigs$sigma1, sig2 = sigs$sigma2) ```
r
par(mfcol = c(2, 2), mar = c(1, 1, 1, 1))
pls <- -sapply(pl_curve_res, function(x) x$opt_out$value)
for(i in 1:3 - 1L)
persp(sig_vals1, sig_vals2, matrix(pls, length(sig_vals1)),
xlab = "\nGenetic", ylab = "\nEnvironment",
zlab = "\n\nProfile likelihood", theta = 65 + i * 90,
ticktype = "detailed")

We may just be interested in creating two profile likelihood curves for each of the scale parameters. This can be done as follows:
``` r
first we compute data for the two profile likelihood curves staring with the
curve for the additive genetic effect
plgenetic <- pedmodprofile( ptr = llterms, par = optout$par, delta = .4, maxvls = 20000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 3L, useaprx = TRUE, nthreads = 4L, verbose = TRUE, clusterweights = cweights, vlsscales = sqrt(c_weights))
> The estimate of the standard error of the log likelihood is 0.00485355. Preferably this should be below 0.001
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -2697.0252 at 0.225154
> LogLike: -2696.9823 at 0.225154
> LogLike: -2699.9665 at -0.174846
> LogLike: -2699.9145 at -0.174846
> LogLike: -2698.2151 at 0.035090. Lb, target, ub: -2698.2151, -2698.0360, -2696.9823
> LogLike: -2698.1092 at 0.035090. Lb, target, ub: -2698.1092, -2698.0360, -2696.9823
> LogLike: -2697.9842 at 0.065088. Lb, target, ub: -2698.1092, -2698.0360, -2697.9842
> LogLike: -2697.8985 at 0.065088. Lb, target, ub: -2698.1092, -2698.0360, -2697.8985
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -2696.9671 at 1.025154
> LogLike: -2696.8870 at 1.025154
> LogLike: -2698.6619 at 1.425154
> LogLike: -2698.5592 at 1.425154
> LogLike: -2698.0013 at 1.283410. Lb, target, ub: -2698.5592, -2698.0360, -2698.0013
> LogLike: -2697.9083 at 1.283410. Lb, target, ub: -2698.5592, -2698.0360, -2697.9083
> LogLike: -2698.1910 at 1.325152. Lb, target, ub: -2698.1910, -2698.0360, -2697.9083
> LogLike: -2698.0953 at 1.325152. Lb, target, ub: -2698.0953, -2698.0360, -2697.9083
> LogLike: -2696.1152 at 0.625154
exp(pl_genetic$confs) # the confidence interval
> 2.50 pct. 97.50 pct.
> 1.046 3.714
compare with the Wald type
Wald <-
optout$par[3] + c(-1, 1) * qnorm(.975) * sqrt(diag(attr(hess, "vcov"))[3])
rbind(Wald = Wald, Profile likelihood = plgenetic$confs)
> 2.50 pct. 97.50 pct.
> Wald 0.04808 1.202
> Profile likelihood 0.04533 1.312
then we compute the curve for the environment effect
plenv <- pedmodprofile( ptr = llterms, par = optout$par, delta = .6, maxvls = 20000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 4L, useaprx = TRUE, nthreads = 4L, verbose = TRUE, clusterweights = cweights, vlsscales = sqrt(c_weights))
> The estimate of the standard error of the log likelihood is 0.00485355. Preferably this should be below 0.001
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -2697.1299 at -0.768676
> LogLike: -2697.0734 at -0.768676
> LogLike: -2699.2342 at -1.368676
> LogLike: -2699.1717 at -1.368676
> LogLike: -2698.0807 at -1.055866. Lb, target, ub: -2698.0807, -2698.0360, -2697.0734
> LogLike: -2698.0269 at -1.055866. Lb, target, ub: -2699.1717, -2698.0360, -2698.0269
> LogLike: -2698.2209 at -1.092913. Lb, target, ub: -2698.2209, -2698.0360, -2698.0269
> LogLike: -2698.1591 at -1.092913. Lb, target, ub: -2698.1591, -2698.0360, -2698.0269
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -2697.4796 at 0.431324
> LogLike: -2697.3383 at 0.431324
> LogLike: -2700.2691 at 1.031324
> LogLike: -2700.1408 at 1.031324
> LogLike: -2698.4963 at 0.678879. Lb, target, ub: -2698.4963, -2698.0360, -2697.3383
> LogLike: -2698.3980 at 0.678879. Lb, target, ub: -2698.3980, -2698.0360, -2697.3383
> LogLike: -2698.0725 at 0.587669. Lb, target, ub: -2698.0725, -2698.0360, -2697.3383
> LogLike: -2697.9798 at 0.587669. Lb, target, ub: -2698.3980, -2698.0360, -2697.9798
> LogLike: -2696.1152 at -0.168676
exp(pl_env$confs) # the confidence interval
> 2.50 pct. 97.50 pct.
> 0.347 1.823
compare with the Wald type
Wald <-
optout$par[4] + c(-1, 1) * qnorm(.975) * sqrt(diag(attr(hess, "vcov"))[4])
rbind(Wald = Wald, Profile likelihood = plenv$confs)
> 2.50 pct. 97.50 pct.
> Wald -0.8983 0.5610
> Profile likelihood -1.0584 0.6003
```
We plot the two profile likelihood curves below:
``` r doplot <- function(obj, xlab, estimate, trans = function(x) exp(x / 2), maxdiff = 8, add = FALSE, col = "black"){ xs <- trans(obj$xs) pls <- obj$plogLik keep <- pls > max(pls) - max_diff xs <- xs[keep] pls <- pls[keep] if(add) points(xs, pls, pch = 16, col = col) else { plot(xs, pls, bty = "l", pch = 16, xlab = xlab, ylab = "Profile likelihood", col = col) grid() abline(v = estimate, lty = 2, col = col) # the estimate # mark the critical value abline(h = max(pls) - qchisq(.95, 1) / 2, lty = 3, col = col) }
lines(spline(xs, pls, n = 100L), col = col) }
par(mar = c(5, 5, 1, 1)) doplot(plgenetic, expression(sigma[G]), exp(opt_out$par[3] / 2)) ```

r
do_plot(pl_env, expression(sigma[E]), exp(opt_out$par[4] / 2))

Profile Likelihood Curve: Proportion of Variance
Suppose that we want a profile likelihood curve for the proportion of
variance explained by each random effect. If then we can use the profile likelihood curve for
as the proportion of variance for the first effect when
is a monotone transformation of this parameter only and thus we
can use the scale invariance of the likelihood ratio. However, this is
not true for more effects,
. To see this, notice that proportion of variance is given by
Let be the log likelihood.
Then the profile likelihood in the proportion of variance explained by
the
th effect is
As these proportions are often the interest of the analysis, the
pedmod_profile_prop function is implemented to produce profile
likelihood based confidence intervals for . We provide an example of using
pedmod_profile_prop below.
``` r
confidence interval for the proportion of variance for the genetic effect
plgeneticprop <- pedmodprofileprop( ptr = llterms, par = optout$par, maxvls = 20000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 1L, useaprx = TRUE, nthreads = 4L, verbose = TRUE, clusterweights = cweights, vlsscales = sqrt(c_weights))
> The estimate of the standard error of the log likelihood is 0.00485355. Preferably this should be below 0.001
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -2746.1578 at 0.990000
> LogLike: -2746.2084 at 0.990000
> LogLike: -2696.1152 at 0.503198
> LogLike: -2696.9007 at 0.573879. Lb, target, ub: -2746.2084, -2698.0360, -2696.9007
> LogLike: -2696.9005 at 0.573879. Lb, target, ub: -2746.2084, -2698.0360, -2696.9005
> LogLike: -2699.2078 at 0.643801. Lb, target, ub: -2699.2078, -2698.0360, -2696.9005
> LogLike: -2699.2179 at 0.643801. Lb, target, ub: -2699.2179, -2698.0360, -2696.9005
> LogLike: -2698.0856 at 0.615037. Lb, target, ub: -2698.0856, -2698.0360, -2696.9005
> LogLike: -2698.0797 at 0.615037. Lb, target, ub: -2698.0797, -2698.0360, -2696.9005
> LogLike: -2697.8859 at 0.609329. Lb, target, ub: -2698.0797, -2698.0360, -2697.8859
> LogLike: -2697.8830 at 0.609329. Lb, target, ub: -2698.0797, -2698.0360, -2697.8830
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -2730.9045 at 0.010000
> LogLike: -2730.9120 at 0.010000
> LogLike: -2696.1152 at 0.503198
> LogLike: -2696.9370 at 0.424199. Lb, target, ub: -2730.9120, -2698.0360, -2696.9370
> LogLike: -2696.9449 at 0.424199. Lb, target, ub: -2730.9120, -2698.0360, -2696.9449
> LogLike: -2699.5217 at 0.345715. Lb, target, ub: -2699.5217, -2698.0360, -2696.9449
> LogLike: -2699.5253 at 0.345715. Lb, target, ub: -2699.5253, -2698.0360, -2696.9449
> LogLike: -2698.1268 at 0.381905. Lb, target, ub: -2698.1268, -2698.0360, -2696.9449
> LogLike: -2698.1258 at 0.381905. Lb, target, ub: -2698.1258, -2698.0360, -2696.9449
> LogLike: -2697.8799 at 0.388948. Lb, target, ub: -2698.1258, -2698.0360, -2697.8799
> LogLike: -2697.8953 at 0.388948. Lb, target, ub: -2698.1258, -2698.0360, -2697.8953
> LogLike: -2696.1152 at 0.503198
plgeneticprop$confs # the confidence interval
> 2.50 pct. 97.50 pct.
> 0.3846 0.6138
confidence interval for the proportion of variance for the environment
effect
plenvprop <- pedmodprofileprop( ptr = llterms, par = optout$par, maxvls = 20000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 2L, useaprx = TRUE, nthreads = 4L, verbose = TRUE, clusterweights = cweights, vlsscales = sqrt(c_weights))
> The estimate of the standard error of the log likelihood is 0.00485355. Preferably this should be below 0.001
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -3063.0661 at 0.990000
> LogLike: -3045.1027 at 0.990000
> LogLike: -2696.1152 at 0.227501
> LogLike: -2697.5521 at 0.315953. Lb, target, ub: -3045.1027, -2698.0360, -2697.5521
> LogLike: -2697.5598 at 0.315953. Lb, target, ub: -3045.1027, -2698.0360, -2697.5598
> LogLike: -2701.2481 at 0.393171. Lb, target, ub: -2701.2481, -2698.0360, -2697.5598
> LogLike: -2701.2467 at 0.393171. Lb, target, ub: -2701.2467, -2698.0360, -2697.5598
> LogLike: -2698.4675 at 0.340565. Lb, target, ub: -2698.4675, -2698.0360, -2697.5598
> LogLike: -2698.4842 at 0.340565. Lb, target, ub: -2698.4842, -2698.0360, -2697.5598
> LogLike: -2698.0145 at 0.329342. Lb, target, ub: -2698.4842, -2698.0360, -2698.0145
> LogLike: -2698.0311 at 0.329342. Lb, target, ub: -2698.4842, -2698.0360, -2698.0311
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -2704.2329 at 0.010000
> LogLike: -2704.2423 at 0.010000
> LogLike: -2696.1152 at 0.227501
> LogLike: -2696.9430 at 0.157642. Lb, target, ub: -2704.2423, -2698.0360, -2696.9430
> LogLike: -2696.9591 at 0.157642. Lb, target, ub: -2704.2423, -2698.0360, -2696.9591
> LogLike: -2698.8929 at 0.100249. Lb, target, ub: -2698.8929, -2698.0360, -2696.9591
> LogLike: -2698.9071 at 0.100249. Lb, target, ub: -2698.9071, -2698.0360, -2696.9591
> LogLike: -2697.9997 at 0.122869. Lb, target, ub: -2698.9071, -2698.0360, -2697.9997
> LogLike: -2698.0064 at 0.122869. Lb, target, ub: -2698.9071, -2698.0360, -2698.0064
> LogLike: -2698.1153 at 0.119594. Lb, target, ub: -2698.1153, -2698.0360, -2698.0064
> LogLike: -2698.1260 at 0.119594. Lb, target, ub: -2698.1260, -2698.0360, -2698.0064
> LogLike: -2696.1152 at 0.227501
plenvprop$confs # the confidence interval
> 2.50 pct. 97.50 pct.
> 0.1220 0.3295
```
A wrong approach is to use the confidence interval for
to attempt to construct a confidence interval for
. To see that this is wrong, let
Now, suppose that exists a function
such that
Then it follows that
Thus, if one uses the profile likelihood curve of
to attempt to construct a confidence interval for
then the result is anti-conservative. This is illustrated below
where the black curves are the proper profile likelihoods and the gray
curves are the invalid/attempted profile likelihood curves.
``` r
using the right approach
estimate <- exp(tail(optout$par, 2)) estimate <- estimate / (1 + sum(estimate)) par(mar = c(5, 5, 1, 1)) doplot(plgeneticprop, expression(h[G]), estimate[1], identity)
create curve using the wrong approach
dumpl <- plgenetic dumpl$xs <- sapply(dumpl$data, function(x) { scales <- exp(c(x$x, tail(x$optim$par, 1))) scales[1] / (1 + sum(scales)) }) doplot(dumpl, expression(h[G]), estimate[1], identity, col = "gray40", add = TRUE) ```

``` r
do the same for the environment effect
doplot(plenv_prop, expression(h[E]), estimate[2], identity)
dumpl <- plenv dumpl$xs <- sapply(dumpl$data, function(x) { scales <- exp(c(x$x, tail(x$optim$par, 1))) scales[1] / (1 + sum(scales)) }) doplot(dumpl, expression(h[E]), estimate[2], identity, col = "gray40", add = TRUE) ```

It is also possible to pass starting bounds to pedmod_profile_prop as
shown below.
``` r
confidence interval for the proportion of variance for the genetic effect
plgeneticpropbounds <- pedmodprofileprop( ptr = llterms, par = optout$par, maxvls = 20000L, minvls = 1000L, alpha = .05, abseps = 0, releps = 1e-4, whichprof = 1L, useaprx = TRUE, nthreads = 4L, verbose = TRUE, clusterweights = cweights, vlsscales = sqrt(cweights), bound = c(.3, .65))
> The estimate of the standard error of the log likelihood is 0.00485355. Preferably this should be below 0.001
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -2699.4910 at 0.650000
> LogLike: -2699.5002 at 0.650000
> LogLike: -2696.1152 at 0.503198
> LogLike: -2696.9779 at 0.577242. Lb, target, ub: -2699.5002, -2698.0360, -2696.9779
> LogLike: -2696.9763 at 0.577242. Lb, target, ub: -2699.5002, -2698.0360, -2696.9763
> LogLike: -2698.0296 at 0.613455. Lb, target, ub: -2699.5002, -2698.0360, -2698.0296
> LogLike: -2698.0244 at 0.613455. Lb, target, ub: -2699.5002, -2698.0360, -2698.0244
> LogLike: -2698.1872 at 0.617816. Lb, target, ub: -2698.1872, -2698.0360, -2698.0244
> LogLike: -2698.1831 at 0.617816. Lb, target, ub: -2698.1831, -2698.0360, -2698.0244
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -2701.8072 at 0.300000
> LogLike: -2701.8211 at 0.300000
> LogLike: -2696.1152 at 0.503198
> LogLike: -2697.0364 at 0.419820. Lb, target, ub: -2701.8211, -2698.0360, -2697.0364
> LogLike: -2697.0466 at 0.419820. Lb, target, ub: -2701.8211, -2698.0360, -2697.0466
> LogLike: -2698.6668 at 0.366663. Lb, target, ub: -2698.6668, -2698.0360, -2697.0466
> LogLike: -2698.6685 at 0.366663. Lb, target, ub: -2698.6685, -2698.0360, -2697.0466
> LogLike: -2697.9751 at 0.385995. Lb, target, ub: -2698.6685, -2698.0360, -2697.9751
> LogLike: -2697.9907 at 0.385995. Lb, target, ub: -2698.6685, -2698.0360, -2697.9907
> LogLike: -2698.0933 at 0.382563. Lb, target, ub: -2698.0933, -2698.0360, -2697.9907
> LogLike: -2698.1021 at 0.382563. Lb, target, ub: -2698.1021, -2698.0360, -2697.9907
> LogLike: -2696.1152 at 0.503198
compare the result
plgeneticprop_bounds$confs
> 2.50 pct. 97.50 pct.
> 0.3846 0.6138
plgeneticprop$confs
> 2.50 pct. 97.50 pct.
> 0.3846 0.6138
```
Simulation Study
We make a small simulation study below where we are interested in the estimation time, bias and coverage of Wald type confidence intervals.
``` r
the seeds we will use
seeds <- c(36451989L, 18774630L, 76585289L, 31898455L, 55733878L, 99681114L, 37725150L, 99188448L, 66989159L, 20673587L, 47985954L, 42571905L, 53089211L, 18457743L, 96049437L, 70222325L, 86393368L, 45380572L, 81116968L, 48291155L, 89755299L, 69891073L, 1846862L, 15263013L, 37537710L, 25194071L, 14471551L, 38278606L, 55596031L, 5436537L, 75008107L, 83382936L, 50689482L, 71708788L, 52258337L, 23423931L, 61069524L, 24452554L, 32406673L, 14900280L, 24818537L, 59733700L, 82407492L, 95500692L, 62528680L, 88728797L, 9891891L, 36354594L, 69630736L, 41755287L)
run the simulation study
sim_study <- lapply(seeds, function(s){ set.seed(s)
# only run the result if it has not been computed f <- file.path("cache", "simstudysimplewenv", paste0("simple-w-env-", s, ".RDS")) if(!file.exists(f)){ # simulate the data dat <- simdat(nfams = 1000L)
# get the weighted data set
dat_unqiue <- dat[!duplicated(dat)]
attributes(dat_unqiue) <- attributes(dat)
c_weights <- sapply(dat_unqiue, function(x)
sum(sapply(dat, identical, y = x)))
rm(dat)
# get the starting values
library(pedmod)
do_fit <- function(standardized){
ll_terms <- pedigree_ll_terms(dat_unqiue, max_threads = 4L)
ti_start <- system.time(start <- pedmod_start(
ptr = ll_terms, data = dat_unqiue, n_threads = 4L,
cluster_weights = c_weights, standardized = standardized,
vls_scales = sqrt(c_weights)))
start$time <- ti_start
# fit the model
ti_quick <- system.time(
opt_out_quick <- pedmod_opt(
ptr = ll_terms, par = start$par, maxvls = 5000L, abs_eps = 0,
rel_eps = 1e-2, minvls = 500L, use_aprx = TRUE, n_threads = 4L,
cluster_weights = c_weights, standardized = standardized,
vls_scales = sqrt(c_weights)))
opt_out_quick$time <- ti_quick
ti_slow <- system.time(
opt_out <- pedmod_opt(
ptr = ll_terms, par = opt_out_quick$par, abs_eps = 0, use_aprx = TRUE,
n_threads = 4L, cluster_weights = c_weights,
standardized = standardized, vls_scales = sqrt(c_weights),
# we changed these parameters
maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
opt_out$time <- ti_slow
if(standardized){
start$par <- standardized_to_direct(start$par , 2L)
opt_out$par <- standardized_to_direct(opt_out$par , 2L)
opt_out_quick$par <- standardized_to_direct(opt_out_quick$par, 2L)
}
if(!standardized){
hess_time <- system.time(
hess <- eval_pedigree_hess(
ptr = ll_terms, par = opt_out$par, maxvls = 25000L,
abs_eps = 0, minvls = 5000L, use_aprx = TRUE,
rel_eps = 1e-4, n_threads = 4L, cluster_weights = c_weights,
vls_scales = sqrt(c_weights)))
attr(hess, "time") <- hess_time
} else
hess <- NULL
list(start = start, opt_out = opt_out, opt_out_quick = opt_out_quick,
ll_no_rng = start$logLik_no_rng, hess = hess)
}
fit_direct <- do_fit(standardized = FALSE)
fit_std <- do_fit(standardized = TRUE)
saveRDS(list(fit_direct = fit_direct, fit_std = fit_std), f)
}
# report to console and return out <- readRDS(f) message(paste0(capture.output(out$fitdirect$optout$par), collapse = "\n")) message(paste0(capture.output(out$fitstd $optout$par), collapse = "\n"))
par <- out$fitdirect$optout$par SEs <- sqrt(diag(attr(out$fit_direct$hess, "vcov")))
message(paste0(capture.output(rbind( Estimate = par, SE = SEs)), collapse = "\n")) message(sprintf( "Time %12.1f, %12.1f. Max ll: %12.4f, %12.4f\n", with(out$fitdirect, start$time["elapsed"] + optout$time["elapsed"] + optoutquick$time["elapsed"]), with(out$fitstd , start$time["elapsed"] + optout$time["elapsed"] + optoutquick$time["elapsed"]), -out$fitdirect$optout$value, -out$fitstd $optout$value))
out })
gather the estimates
betaest <- sapply(simstudy, function(x) cbind(Direct = head(x$fitdirect$optout$par, 2), Standardized = head(x$fitstd $optout$par, 2)), simplify = "array") sigmaest <- sapply(simstudy, function(x) cbind(Direct = exp(tail(x$fitdirect$optout$par, 2) / 2), Standardized = exp(tail(x$fitstd $optout$par, 2) / 2)), simplify = "array")
compute the errors
tmp <- simdat(2L) errbeta <- betaest - attr(tmp, "beta") errsigma <- sigmaest - sqrt(attr(tmp, "sigsq")) dimnames(errsigma)[[1L]] <- c("std genetic", "std env.") err <- abind::abind(errbeta, err_sigma, along = 1)
get the bias estimates and the standard errors
bias <- apply(err, 1:2, mean) nsims <- dim(err)[[3]] SE <- apply(err , 1:2, sd) / sqrt(nsims) bias
> Direct Standardized
> (Intercept) -0.06465 -0.06606
> Binary 0.09380 0.09577
> std genetic 0.02787 0.02880
> std env. 0.03434 0.03474
SE
> Direct Standardized
> (Intercept) 0.06443 0.06491
> Binary 0.08831 0.08894
> std genetic 0.04078 0.04093
> std env. 0.02998 0.03028
make a box plot
bvals <- expand.grid(rownames(err), strtrim(colnames(err), 1)) boxdat <- data.frame(Error = c(err), Parameter = rep(bvals$Var1, nsims), Method = rep(b_vals$Var2, dim(err)[[3]])) par(mar = c(7, 5, 1, 1))
S is for the standardized and D is for the direct parameterization
boxplot(Error ~ Method + Parameter, box_dat, ylab = "Error", las = 2, xlab = "") abline(h = 0, lty = 2) grid() ```

``` r
get the average computation times
timevals <- sapply(simstudy, function(x) { . <- function(z){ keep <- c("opt_out", "start") out <- setNames(sapply(z[keep], function(z) z$time["elapsed"]), keep) c(out, total = sum(out)) }
rbind(Direct = .(x$fitdirect), Standardized = .(x$fitstd)) }, simplify = "array") apply(time_vals, 1:2, mean)
> opt_out start total
> Direct 12.260 3.937 16.20
> Standardized 9.645 3.487 13.13
apply(time_vals, 1:2, sd)
> opt_out start total
> Direct 6.980 2.370 7.450
> Standardized 5.259 1.535 5.503
apply(time_vals, 1:2, quantile)
> , , opt_out
>
> Direct Standardized
> 0% 1.314 1.342
> 25% 7.869 6.268
> 50% 11.912 10.788
> 75% 16.981 13.259
> 100% 33.904 17.819
>
> , , start
>
> Direct Standardized
> 0% 1.699 1.512
> 25% 2.438 2.479
> 50% 2.890 3.109
> 75% 4.834 3.835
> 100% 13.843 8.596
>
> , , total
>
> Direct Standardized
> 0% 3.729 3.505
> 25% 10.699 8.844
> 50% 16.901 14.721
> 75% 21.136 17.993
> 100% 37.195 21.201
get the standardized errors
erssds <- sapply(simstudy, function(x){ par <- x$fitdirect$optout$par err <- par - c(attr(tmp, "beta"), log(attr(tmp, "sigsq"))) SEs <- sqrt(diag(solve(-x$fitdirect$hess)))
err / SEs })
rowMeans(abs(ers_sds) < qnorm(.95)) # 90% coverage
> (Intercept) Binary
> 0.86 0.88 0.90 0.94
rowMeans(abs(ers_sds) < qnorm(.975)) # 95% coverage
> (Intercept) Binary
> 0.94 0.90 0.92 0.96
rowMeans(abs(ers_sds) < qnorm(.995)) # 99% coverage
> (Intercept) Binary
> 1.00 1.00 1.00 0.98
stats for the computation time of the Hessian
hesstime <- sapply( simstudy, function(x) attr(x$fitdirect$hess, "time")["elapsed"]) mean(hesstime)
> [1] 2.098
quantile(hess_time, probs = seq(0, 1, .1))
> 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
> 1.872 1.976 2.013 2.041 2.066 2.084 2.100 2.136 2.187 2.223 2.531
compute the coverage on the standardized scale with the proportion of
variances
erssds <- sapply(simstudy, function(x){ parnvcov <- stdpropestimates( x$fitdirect$optout$par, 2L, x$fitdirect$hess) truth <- stdpropestimates(c(attr(tmp, "beta"), log(attr(tmp, "sigsq"))), 2) (parnvcov$par - truth$par) / sqrt(diag(parnvcov$vcov_var)) })
rowMeans(abs(ers_sds) < qnorm(.95)) # 90% coverage
> (Intercept) Binary
> 0.96 0.94 0.88 0.92
rowMeans(abs(ers_sds) < qnorm(.975)) # 95% coverage
> (Intercept) Binary
> 1.00 0.98 0.96 0.98
rowMeans(abs(ers_sds) < qnorm(.995)) # 99% coverage
> (Intercept) Binary
> 1.00 0.98 1.00 1.00
```
Individual Specific Loadings
The models have used till now are in this form
for known fixed effects covariates and scale matrices
. The
is the
’th effect on individual
in cluster
. For instance, this could be the genetic effect or an environmental
effect.
We may consider the case where all individuals load differently on each of the random effects. A model to incorporate such effects is
where the are known covariates. If all the scale matrices are
correlation matrices, then this implies that the proportion of variance
attributable to the
’th effect for individual
in cluster
is
rather than
The model can equivalent be written as
where
returns a diagonal matrix. This form is useful
for simulations.
As en example, we extend our previous simulation to
where is a vector containing an intercept, an indicator for
whether the individual is a male, and a covariate between minus one and
one. We will let the heritability for males be larger than for females
but the environmental effect will be the same given the second
covariate.
We assign the new simulation function below:
``` r
the covariates for the scale parameters, Z
vcovcovs <- cbind(intercept = rep(1, 10), ismale = rep(1:0, 5), cov = seq(-1, 1, length.out = 10)) vcov_covs
> intercept is_male cov
> [1,] 1 1 -1.0000
> [2,] 1 0 -0.7778
> [3,] 1 1 -0.5556
> [4,] 1 0 -0.3333
> [5,] 1 1 -0.1111
> [6,] 1 0 0.1111
> [7,] 1 1 0.3333
> [8,] 1 0 0.5556
> [9,] 1 1 0.7778
> [10,] 1 0 1.0000
set the parameters we will use
beta <- c(-2, 4) thetas <- matrix(c(-0.394228680182135, 1.12739721457885, 1, -0.50580045583924, 0.64964149206513, -1), 3)
we can compute the individual specific proportion of variances as follows
scales <- exp(vcov_covs %*% thetas) cbind(scales^2, 1) / rowSums(cbind(scales^2, 1))
> [,1] [,2] [,3]
> [1,] 0.05127 0.86131 0.08742
> [2,] 0.03404 0.61120 0.35477
> [3,] 0.22025 0.62536 0.15440
> [4,] 0.12019 0.36478 0.51503
> [5,] 0.56559 0.27142 0.16300
> [6,] 0.30538 0.15664 0.53797
> [7,] 0.83362 0.06761 0.09877
> [8,] 0.55221 0.04787 0.39992
> [9,] 0.94125 0.01290 0.04585
> [10,] 0.76197 0.01116 0.22687
the heritability differs between males and females but the environmental
effect is the same given the second covariate as shown below
vcovcovstmp <- vcovcovs vcovcovstmp[, 3] <- 0 scales <- exp(vcovcovs_tmp %*% thetas) cbind(scales^2, 1) / rowSums(cbind(scales^2, 1))
> [,1] [,2] [,3]
> [1,] 0.65 0.2 0.15
> [2,] 0.25 0.2 0.55
> [3,] 0.65 0.2 0.15
> [4,] 0.25 0.2 0.55
> [5,] 0.65 0.2 0.15
> [6,] 0.25 0.2 0.55
> [7,] 0.65 0.2 0.15
> [8,] 0.25 0.2 0.55
> [9,] 0.65 0.2 0.15
> [10,] 0.25 0.2 0.55
simulates a data set.
Args:
n_fams: number of families.
beta: the fixed effect coefficients.
thetas: the coefficients for the scale parameters.
simdat <- function(nfams, beta, thetas){ # setup before the simulations Cmat <- 2 * kinship(ped) n_obs <- NROW(fam)
scales <- exp(vcovcovs %*% thetas) Sig <- diag(nobs) + diag(scales[, 1]) %% Cmat %% diag(scales[, 1]) + diag(scales[, 2]) %% C_env %% diag(scales[, 2]) Sig_chol <- chol(Sig)
# simulate the data
out <- replicate(
nfams, {
# simulate covariates
X <- cbind((Intercept) = 1, Binary = runif(nobs) > .9)
# assign the linear predictor + noise
eta <- drop(X %*% beta) + drop(rnorm(n_obs) %*% Sig_chol)
# return the list in the format needed for the package. We also have to
# pass the covariates for the scale parameters
list(y = as.numeric(eta > 0), X = X, Z = vcov_covs, scale_mats = list(
Genetic = Cmat, Environment = C_env))
}, simplify = FALSE)
# add attributes with the true values and return attributes(out) <- list(beta = beta, thetas = thetas) out } ```
A data set is sampled below and the model is estimated.
``` r
simulate a data set
set.seed(72466753) dat <- simdat(nfams = 1000L, beta = beta, thetas = thetas)
evaluate the log marginal likelihood at the true parameters
library(pedmod) lltermswoweights <- pedigreelltermsloadings(dat, max_threads = 4L)
logLiktruth <- evalpedigreell( lltermswoweights, c(beta, thetas), maxvls = 25000L, minvls = 1000L, abseps = 0, releps = 1e-3, n_threads = 4L)
remove the duplicated terms and use weights. This can be done more efficiently
and may not catch all duplicates
datunqiue <- dat[!duplicated(dat)] length(datunqiue) # number of unique terms
> [1] 633
get the weights. This can be written in a much more efficient way
cweights <- sapply(datunqiue, function(x) sum(sapply(dat, identical, y = x)))
evaluate log likelihood again and show that we got the same
llterms <- pedigreelltermsloadings(datunqiue, maxthreads = 4L)
logLiktruthweighted <- evalpedigreell( llterms, c(beta, thetas), maxvls = 25000L, minvls = 1000L, abseps = 0, releps = 1e-3, nthreads = 4L, clusterweights = cweights)
print(logLiktruthweighted, digits = 8)
> [1] -4373.3542
> attr(,"n_fails")
> [1] 0
> attr(,"std")
> [1] 0.019336072
print(logLik_truth, digits = 8)
> [1] -4373.3585
> attr(,"n_fails")
> [1] 0
> attr(,"std")
> [1] 0.0064573353
note that the variance is greater for the weighted version
llests <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( lltermswoweights, c(beta, thetas), maxvls = 10000L, minvls = 1000L, abseps = 0, releps = 1e-3, nthreads = 4L) }) llestsfast <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( llterms, c(beta, thetas), maxvls = 10000L, minvls = 1000L, abseps = 0, releps = 1e-3, nthreads = 4L, clusterweights = c_weights) })
the estimates are comparable
c(Without weights = mean(llests), With weights = mean(llests_fast))
> Without weights With weights
> -4373 -4373
the standard deviation is different
c(Without weights = sd(llests), With weights = sd(llests_fast))
> Without weights With weights
> 0.009941 0.032046
we can mitigate this by using the vls_scales argument which though is a bit
slower
llestsfastvlsscales <- sapply(1:50, function(seed){ set.seed(seed) evalpedigreell( llterms, c(beta, thetas), maxvls = 10000L, minvls = 1000L, abseps = 0, releps = 1e-3, nthreads = 4L, clusterweights = cweights, vlsscales = sqrt(cweights)) })
the estimates are comparable
c(Without weights = mean(llests), With weights = mean(llestsfast),
`With weights and vlsscales` = mean(llestsfastvlsscales))
> Without weights With weights
> -4373 -4373
> With weights and vls_scales
> -4373
the standard deviation is different
c(Without weights = sd(llests), With weights = sd(llestsfast),
`With weights and vlsscales` = sd(llestsfastvlsscales))
> Without weights With weights
> 0.009941 0.032046
> With weights and vls_scales
> 0.010431
get the starting values
system.time(start <- pedmodstartloadings( llterms, data = datunqiue, clusterweights = cweights))
> user system elapsed
> 0.01 0.00 0.01
find the maximum likelihood estimator
system.time( optres <- pedmodopt( llterms, par = start$par, maxvls = 25000L, minvls = 5000L, abseps = 0, releps = 1e-3, nthreads = 4L, useaprx = TRUE, clusterweights = cweights, vlsscales = sqrt(c_weights)))
> user system elapsed
> 437.1 0.0 133.5
```
We compare the maximum likelihood estimator with the true values below.
``` r
the fixed effects
rbind(Truth = beta, Start = head(start$par, 2), Estimate = head(opt_res$par, 2))
> (Intercept) Binary
> Truth -2.000 4.000
> Start -1.102 2.184
> Estimate -2.105 4.282
the scale coefficients
array(c(thetas, tail(start$par, -2), tail(opt_res$par, -2)), dim = c(dim(thetas), 3L), dimnames = list(NULL, NULL, c("Truth", "Start", "Estimate")))
> , , Truth
>
> [,1] [,2]
> [1,] -0.3942 -0.5058
> [2,] 1.1274 0.6496
> [3,] 1.0000 -1.0000
>
> , , Start
>
> [,1] [,2]
> [1,] -6.931e-01 -6.931e-01
> [2,] 1.801e-15 1.801e-15
> [3,] -2.701e-15 -2.701e-15
>
> , , Estimate
>
> [,1] [,2]
> [1,] -0.305 -0.4726
> [2,] 1.095 0.5476
> [3,] 1.020 -1.1924
compare the proportion of variance for the individual. First the estimates
thetasest <- matrix(tail(optres$par, -2), NCOL(vcovcovs)) scales <- exp(vcovcovs %*% thetas_est) cbind(scales^2, 1) / rowSums(cbind(scales^2, 1))
> [,1] [,2] [,3]
> [1,] 0.04432 0.885480 0.07020
> [2,] 0.03093 0.690871 0.27820
> [3,] 0.22545 0.630321 0.14423
> [4,] 0.12890 0.402875 0.46822
> [5,] 0.60621 0.237164 0.15663
> [6,] 0.34431 0.150583 0.50511
> [7,] 0.86274 0.047231 0.09003
> [8,] 0.60471 0.037007 0.35828
> [9,] 0.95256 0.007297 0.04014
> [10,] 0.80138 0.006863 0.19176
then the true proportions
scales <- exp(vcov_covs %*% thetas) cbind(scales^2, 1) / rowSums(cbind(scales^2, 1))
> [,1] [,2] [,3]
> [1,] 0.05127 0.86131 0.08742
> [2,] 0.03404 0.61120 0.35477
> [3,] 0.22025 0.62536 0.15440
> [4,] 0.12019 0.36478 0.51503
> [5,] 0.56559 0.27142 0.16300
> [6,] 0.30538 0.15664 0.53797
> [7,] 0.83362 0.06761 0.09877
> [8,] 0.55221 0.04787 0.39992
> [9,] 0.94125 0.01290 0.04585
> [10,] 0.76197 0.01116 0.22687
the log likelihood at the true parameters and at the estimate
print(logLiktruthweighted, digits = 8)
> [1] -4373.3542
> attr(,"n_fails")
> [1] 0
> attr(,"std")
> [1] 0.019336072
print(-opt_res$value, digits = 8)
> [1] -4370.6815
```
Profile Likelihood
We can construct a profile likelihood for the parameters like before. For instance, we can look at the scale parameter for the heritability shift for the males with the following code.
``` r system.time( plcurve <- pedmodprofile( llterms, par = optres$par, maxvls = 25000L, minvls = 5000L, abseps = 0, releps = 1e-3, nthreads = 4L, useaprx = TRUE, clusterweights = cweights, vlsscales = sqrt(cweights), delta = .2, verbose = TRUE, which_prof = 4L))
> The estimate of the standard error of the log likelihood is 0.00189664. Preferably this should be below 0.001
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -4374.9109 at 0.894952
> LogLike: -4373.9116 at 0.894952
> LogLike: -4370.6815 at 1.094952
> LogLike: -4371.7804 at 0.989239. Lb, target, ub: -4373.9116, -4372.6022, -4371.7804
> LogLike: -4371.4939 at 0.989239. Lb, target, ub: -4373.9116, -4372.6022, -4371.4939
> LogLike: -4372.8734 at 0.937691. Lb, target, ub: -4372.8734, -4372.6022, -4371.4939
> LogLike: -4372.5926 at 0.937691. Lb, target, ub: -4373.9116, -4372.6022, -4372.5926
> LogLike: -4372.9987 at 0.932610. Lb, target, ub: -4372.9987, -4372.6022, -4372.5926
> LogLike: -4372.7283 at 0.932610. Lb, target, ub: -4372.7283, -4372.6022, -4372.5926
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -4373.2335 at 1.294952
> LogLike: -4373.1099 at 1.294952
> LogLike: -4370.6815 at 1.094952
> LogLike: -4371.9157 at 1.235563. Lb, target, ub: -4373.1099, -4372.6022, -4371.9157
> LogLike: -4371.9529 at 1.235563. Lb, target, ub: -4373.1099, -4372.6022, -4371.9529
> LogLike: -4372.5866 at 1.268475. Lb, target, ub: -4373.1099, -4372.6022, -4372.5866
> LogLike: -4372.5552 at 1.268475. Lb, target, ub: -4373.1099, -4372.6022, -4372.5552
> LogLike: -4370.6815 at 1.094952
> user system elapsed
> 1729.173 0.072 433.445
```
The confidence interval is shown below along with a plot of the profile likelihood curve.
``` r pl_curve$confs # the confidence interval
> 2.50 pct. 97.50 pct.
> 0.9373 1.2708
plot the profile likelihood curve
local({ maxdiff <- 4 xs <- plcurve$xs pls <- plcurve$plogLik keep <- pls > max(pls) - maxdiff xs <- xs[keep] pls <- pls[keep]
par(mar = c(5, 5, 1, 1))
plot(xs, pls, bty = "l", pch = 16, xlab = expression(theta[2]),
ylab = "Profile likelihood")
grid()
abline(v = opt_res$par[4], lty = 2) # the estimate
# mark the critical value
abline(h = max(pls) - qchisq(.95, 1) / 2, lty = 3)
lines(spline(xs, pls, n = 100L)) }) ```

Some of the quantities of interest are nonlinear functions of the
parameters, however. For instance, we may be interested in the
difference in the proportion of variance for males at cov = 0. We can
construct a profile likelihood based confidence interval for this
difference but this requires an optimizer that supports nonlinear
equality constraints. The pedmod_profile_nleq function is created for
this purpose and an example of using it to compute the aforementioned
difference is shown below.
``` r
computes the difference between the male and females heritability at
cov = 0
heq <- function(par){ theta <- matrix(tail(par, 6), 3) scs <- matrix(c(1, 1, 0, 1, 0, 0), 2) %*% theta scs <- exp(scs) propgenetic <- scs[, 1]^2 / (1 + rowSums(scs^2)) diff(propgenetic) } heq(opt_res$par)
> [1] 0.4107
construct the profile likelihood curve
system.time( plcurvenleq <- pedmodprofilenleq( llterms, par = optres$par, maxvls = 5000L, minvls = 1000L, minvlsstart = 500L, maxvlsstart = 500L, abseps = 0, releps = 1e-3, nthreads = 4L, useaprx = TRUE, clusterweights = cweights, vlsscales = sqrt(cweights), delta = .2, verbose = TRUE, heq = heq, heq_bounds = c(-1, 1)))
> The estimate of the standard error of the log likelihood is 0.00814429. Preferably this should be below 0.001
>
> Finding the upper limit of the profile likelihood curve
> LogLike: -4385.8000 at 0.610651
> LogLike: -4385.8021 at 0.610651
> LogLike: -4370.6861 at 0.410651
> LogLike: -4371.3423 at 0.455463. Lb, target, ub: -4385.8021, -4372.6068, -4371.3423
> LogLike: -4371.3155 at 0.455463. Lb, target, ub: -4385.8021, -4372.6068, -4371.3155
> LogLike: -4374.2589 at 0.513399. Lb, target, ub: -4374.2589, -4372.6068, -4371.3155
> LogLike: -4374.2560 at 0.513399. Lb, target, ub: -4374.2560, -4372.6068, -4371.3155
> LogLike: -4372.7176 at 0.488937. Lb, target, ub: -4372.7176, -4372.6068, -4371.3155
> LogLike: -4372.6888 at 0.488937. Lb, target, ub: -4372.6888, -4372.6068, -4371.3155
> LogLike: -4372.4552 at 0.483824. Lb, target, ub: -4372.6888, -4372.6068, -4372.4552
> LogLike: -4372.4248 at 0.483824. Lb, target, ub: -4372.6888, -4372.6068, -4372.4248
>
> Finding the lower limit of the profile likelihood curve
> LogLike: -4379.4649 at 0.210651
> LogLike: -4379.4469 at 0.210651
> LogLike: -4370.6861 at 0.410651
> LogLike: -4371.6444 at 0.349452. Lb, target, ub: -4379.4469, -4372.6068, -4371.6444
> LogLike: -4371.6290 at 0.349452. Lb, target, ub: -4379.4469, -4372.6068, -4371.6290
> LogLike: -4373.5286 at 0.303085. Lb, target, ub: -4373.5286, -4372.6068, -4371.6290
> LogLike: -4373.5118 at 0.303085. Lb, target, ub: -4373.5118, -4372.6068, -4371.6290
> LogLike: -4372.6199 at 0.323028. Lb, target, ub: -4372.6199, -4372.6068, -4371.6290
> LogLike: -4372.5890 at 0.323028. Lb, target, ub: -4373.5118, -4372.6068, -4372.5890
> LogLike: -4372.7209 at 0.320426. Lb, target, ub: -4372.7209, -4372.6068, -4372.5890
> LogLike: -4372.6897 at 0.320426. Lb, target, ub: -4372.6897, -4372.6068, -4372.5890
> LogLike: -4370.6861 at 0.410651
> user system elapsed
> 2675.382 0.069 698.971
```
The confidence interval is shown below along with a plot of the profile likelihood curve.
``` r plcurvenleq$confs # the confidence interval
> 2.50 pct. 97.50 pct.
> 0.3226 0.4874
plot the profile likelihood curve
local({ maxdiff <- 4 xs <- plcurvenleq$xs pls <- plcurvenleq$plogLik keep <- pls > max(pls) - maxdiff xs <- xs[keep] pls <- pls[keep]
par(mar = c(5, 5, 1, 1))
plot(xs, pls, bty = "l", pch = 16, ylab = "Profile likelihood",
xlab = "Heritability difference at cov = 0")
grid()
abline(v = opt_res$par[4], lty = 2) # the estimate
# mark the critical value
abline(h = max(pls) - qchisq(.95, 1) / 2, lty = 3)
lines(spline(xs, pls, n = 100L)) }) ```

Simulation Study
We make a small simulation study below where we are interested in the estimation time and bias.
``` r
the seeds we will use
seeds <- c(36451989L, 18774630L, 76585289L, 31898455L, 55733878L, 99681114L, 37725150L, 99188448L, 66989159L, 20673587L, 47985954L, 42571905L, 53089211L, 18457743L, 96049437L, 70222325L, 86393368L, 45380572L, 81116968L, 48291155L, 89755299L, 69891073L, 1846862L, 15263013L, 37537710L, 25194071L, 14471551L, 38278606L, 55596031L, 5436537L, 75008107L, 83382936L, 50689482L, 71708788L, 52258337L, 23423931L, 61069524L, 24452554L, 32406673L, 14900280L, 24818537L, 59733700L, 82407492L, 95500692L, 62528680L, 88728797L, 9891891L, 36354594L, 69630736L, 41755287L)
run the simulation study
sim_study <- lapply(seeds, function(s){ set.seed(s)
# only run the result if it has not been computed f <- file.path("cache", "simstudyloadings", paste0("loadings-", s, ".RDS")) if(!file.exists(f)){ # simulate the data dat <- simdat(nfams = 1000L, beta = beta, thetas = thetas)
# get the weighted data set
dat_unqiue <- dat[!duplicated(dat)]
attributes(dat_unqiue) <- attributes(dat)
c_weights <- sapply(dat_unqiue, function(x)
sum(sapply(dat, identical, y = x)))
rm(dat)
# get the starting values
library(pedmod)
ll_terms <- pedigree_ll_terms_loadings(dat_unqiue, max_threads = 4L)
# fit the model
ti_start <- system.time(start <- pedmod_start_loadings(
ptr = ll_terms, data = dat_unqiue, cluster_weights = c_weights))
start$time <- ti_start
ti_quick <- system.time(
opt_out_quick <- pedmod_opt(
ptr = ll_terms, par = start$par, maxvls = 5000L, abs_eps = 0,
rel_eps = 1e-2, minvls = 500L, use_aprx = TRUE, n_threads = 4L,
cluster_weights = c_weights, vls_scales = sqrt(c_weights)))
opt_out_quick$time <- ti_quick
ti_slow <- system.time(
opt_out <- pedmod_opt(
ptr = ll_terms, par = opt_out_quick$par, abs_eps = 0, use_aprx = TRUE,
n_threads = 4L, cluster_weights = c_weights,
vls_scales = sqrt(c_weights),
# we changed these parameters
maxvls = 25000L, rel_eps = 1e-3, minvls = 5000L))
opt_out$time <- ti_slow
saveRDS(list(start = start, opt_out_quick = opt_out_quick,
opt_out = opt_out), f)
}
# report to console and return out <- readRDS(f) message(paste0(capture.output( rbind(Estimate = out$opt_out$par, Truth = c(beta, thetas))), collapse = "\n"))
message(sprintf( "Time %12.1f. Max ll: %12.4f\n", with(out, start$time["elapsed"] + optout$time["elapsed"] + optoutquick$time["elapsed"]), -out$optout$value))
out })
compute the bias estimates
estimates <- sapply(simstudy, function(x) x$optout$par) rownames(estimates) <- c("(Intercept)", "Binary", paste0("Genetic", 1:3), paste0("Env", 1:3))
err <- estimates - c(beta, thetas) rbind(Bias = rowMeans(err), SE = apply(err, 1, sd) / sqrt(NCOL(err)))
> (Intercept) Binary Genetic1 Genetic2 Genetic3 Env1 Env2
> Bias 2.726e-05 0.009754 -0.01328 0.01445 -0.007225 -0.03772 -0.01051
> SE 2.156e-02 0.044836 0.02209 0.01420 0.012170 0.02982 0.01423
> Env3
> Bias -0.05552
> SE 0.02245
make a box plot
par(mar = c(7, 5, 1, 1))
S is for the standardized and D is for the direct parameterization
boxplot(t(err), ylab = "Error", las = 2) abline(h = 0, lty = 2) grid() ```

``` r
summary stats for the computation time
comptimes <- sapply(
simstudy, function(x) sapply(x, [[, "time")["elapsed", ])
summary(t(comp_times))
> start optoutquick opt_out
> Min. :0.0080 Min. :18.2 Min. :41.0
> 1st Qu.:0.0090 1st Qu.:22.0 1st Qu.:50.3
> Median :0.0100 Median :24.3 Median :52.7
> Mean :0.0103 Mean :25.4 Mean :55.4
> 3rd Qu.:0.0120 3rd Qu.:26.1 3rd Qu.:61.2
> Max. :0.0130 Max. :56.4 Max. :83.0
summary(colSums(comp_times))
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 62.1 74.3 76.6 80.8 87.5 122.7
```
More Complicated Example
We consider a more complicated example in this section and use some of the lower level functions in the package as an example. We start by sourcing a file to get a function to simulate a data set with a maternal effect and a genetic effect like in Mahjani et al. (2020):
``` r
source the file to get the simulation function
source(system.file("gen-pedigree-data.R", package = "pedmod"))
simulate a data set
set.seed(68167102) dat <- simpedigreedata(n_families = 1000L)
distribution of family sizes
par(mar = c(5, 4, 1, 1)) plot(table(sapply(dat$sim_data, function(x) length(x$y))), xlab = "Family size", ylab = "Number of families", bty = "l") ```

``` r
total number of observations
sum(sapply(dat$sim_data, function(x) length(x$y)))
> [1] 49734
average event rate
mean(unlist(sapply(dat$sim_data, [[, "y")))
> [1] 0.2386
```
As Mahjani et al. (2020), we have data families linked by three generations but we only have data for the last generation. We illustrate this with the first family in the simulated data set:
``` r
this is the full family
library(kinship2) fam1 <- dat$sim_data[[1L]] plot(fam1$pedAll) ```

``` r
here is the C matrix for the genetic effect
rev_img <- function(x, ...) image(x[, NROW(x):1], ...) cl <- colorRampPalette(c("Red", "White", "Blue"))(101)
par(mar = c(2, 2, 1, 1)) revimg(fam1$relmat, xaxt = "n", yaxt = "n", col = cl, zlim = c(-1, 1)) ```

``` r
the first part of the matrix is given below
with(fam1, Matrix::Matrix(relmat[seqlen(min(10, NROW(relmat))), seqlen(min(10, NROW(rel_mat)))], sparse = TRUE))
> 10 x 10 sparse Matrix of class "dsCMatrix"
>
> 9 1.000 0.500 0.125 0.125 0.125 . . 0.125 0.125 .
> 10 0.500 1.000 0.125 0.125 0.125 . . 0.125 0.125 .
> 15 0.125 0.125 1.000 0.500 0.500 0.125 0.125 . . .
> 16 0.125 0.125 0.500 1.000 0.500 0.125 0.125 . . .
> 17 0.125 0.125 0.500 0.500 1.000 0.125 0.125 . . .
> 21 . . 0.125 0.125 0.125 1.000 0.500 . . .
> 22 . . 0.125 0.125 0.125 0.500 1.000 . . .
> 28 0.125 0.125 . . . . . 1.000 0.500 0.125
> 29 0.125 0.125 . . . . . 0.500 1.000 0.125
> 36 . . . . . . . 0.125 0.125 1.000
here is the C matrix for the maternal effect
revimg(fam1$metmat, xaxt = "n", yaxt = "n", col = cl, zlim = c(-1, 1)) ```

``` r
the first part of the matrix is given below
with(fam1, Matrix::Matrix(metmat[seqlen(min(10, NROW(metmat))), seqlen(min(10, NROW(met_mat)))], sparse = TRUE))
> 10 x 10 sparse Matrix of class "dsCMatrix"
>
> 9 1 1 . . . . . . . .
> 10 1 1 . . . . . . . .
> 15 . . 1 1 1 . . . . .
> 16 . . 1 1 1 . . . . .
> 17 . . 1 1 1 . . . . .
> 21 . . . . . 1 1 . . .
> 22 . . . . . 1 1 . . .
> 28 . . . . . . . 1.0 1.0 0.5
> 29 . . . . . . . 1.0 1.0 0.5
> 36 . . . . . . . 0.5 0.5 1.0
each simulated family has such two matrices in addition to a design matrix
for the fixed effects, X, and a vector with outcomes, y
str(fam1[c("X", "y")])
> List of 2
> $ X: num [1:52, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
> ..- attr(*, "dimnames")=List of 2
> .. ..$ : NULL
> .. ..$ : chr [1:3] "(Intercept)" "X1" ""
> $ y: Named logi [1:52] FALSE TRUE TRUE TRUE FALSE FALSE ...
> ..- attr(*, "names")= chr [1:52] "9" "10" "15" "16" ...
```
Then we perform the model estimation:
``` r
the true parameters are
dat$beta
> (Intercept) X1 X2
> -1.0 0.3 0.2
dat$sc # the sigmas squared
> Genetic Maternal
> 0.50 0.33
prepare the data to pass to the functions in the package
datarg <- lapply(dat$simdata, function(x){ # we need the following for each family: # y: the zero-one outcomes. # X: the design matrix for the fixed effects. # scalemats: list with the scale matrices for each type of effect. list(y = as.numeric(x$y), X = x$X, scalemats = list(x$relmat, x$metmat)) })
create a C++ object
library(pedmod) llterms <- pedigreellterms(datarg, max_threads = 4L)
get the starting values. This is very fast
y <- unlist(lapply(datarg, [[, "y"))
X <- do.call(rbind, lapply(datarg, [[, "X"))
start_fit <- glm.fit(X, y, family = binomial("probit"))
log likelihood at the starting values without random effects
-sum(start_fit$deviance) / 2
> [1] -26480
(beta <- start_fit$coefficients) # starting values for fixed effects
> (Intercept) X1
> -0.7342 0.2234 0.1349
start at moderate sized scale parameters
sc <- rep(log(.2), 2)
check log likelihood at the starting values. First we assign a function
to approximate the log likelihood and the gradient
fn <- function(par, seed = 1L, releps = 1e-2, useaprx = TRUE, nthreads = 4L, indices = NULL, maxvls = 25000L, method = 0L, usesparse = FALSE, usetilting = FALSE){ set.seed(seed) -evalpedigreell( ptr = if(usesparse) lltermssparse else llterms, par = par, maxvls = maxvls, abseps = 0, releps = releps, minvls = 1000L, useaprx = useaprx, nthreads = nthreads, indices = indices, method = method, usetilting = usetilting) } gr <- function(par, seed = 1L, releps = 1e-2, useaprx = TRUE, nthreads = 4L, indices = NULL, maxvls = 25000L, method = 0L, usesparse = FALSE, usetilting = FALSE){ set.seed(seed) out <- -evalpedigreegrad( ptr = if(usesparse) lltermssparse else llterms, par = par, maxvls = maxvls, abseps = 0, releps = releps, minvls = 1000L, useaprx = useaprx, nthreads = nthreads, indices = indices, method = method, usetilting = usetilting) structure(c(out), value = -attr(out, "logLik"), nfails = attr(out, "nfails"), std = attr(out, "std")) }
check output at the starting values
system.time(ll <- -fn(c(beta, sc)))
> user system elapsed
> 4.186 0.000 1.060
ll # the log likelihood at the starting values
> [1] -26042
> attr(,"n_fails")
> [1] 0
> attr(,"std")
> [1] 0.05963
system.time(gr_val <- gr(c(beta, sc)))
> user system elapsed
> 43.22 0.00 10.90
gr_val # the gradient at the starting values
> [1] 1894.83 -549.43 -235.73 47.21 -47.84
> attr(,"value")
> [1] 26042
> attr(,"n_fails")
> [1] 715
> attr(,"std")
> [1] 0.01845 0.25149 0.28043 0.20515 0.10778 0.11060
standard deviation of the approximation
sd(sapply(1:25, function(seed) fn(c(beta, sc), seed = seed)))
> [1] 0.09254
we do the same for the gradient elements but only for a subset of the
log marginal likelihood elements
grhats <- sapply( 1:25, function(seed) gr(c(beta, sc), seed = seed, indices = 0:99)) apply(grhats, 1, sd)
> [1] 0.06953 0.11432 0.06340 0.02204 0.02467
the errors are on similar magnitudes
gr(c(beta, sc), indices = 0:99)
> [1] 197.674 -81.013 20.820 5.137 -6.452
> attr(,"value")
> [1] 2602
> attr(,"n_fails")
> [1] 73
> attr(,"std")
> [1] 0.005841 0.076801 0.084451 0.068685 0.032688 0.033749
verify the gradient (may not be exactly equal due to MC error)
rbind(numDeriv = numDeriv::grad(fn, c(beta, sc), indices = 0:10), pedmod = gr(c(beta, sc), indices = 0:10))
> [,1] [,2] [,3] [,4] [,5]
> numDeriv 28.00 -0.298 7.415 1.105 -1.071
> pedmod 27.98 -0.331 7.402 1.113 -1.062
optimize the log likelihood approximation
system.time(opt <- optim(c(beta, sc), fn, gr, method = "BFGS"))
> user system elapsed
> 1602.379 0.016 407.611
```
The output from the optimization is shown below:
``` r print(-opt$value, digits = 8) # the maximum log likelihood
> [1] -25823.021
opt$convergence # check convergence
> [1] 0
compare the estimated fixed effects with the true values
rbind(truth = dat$beta, estimated = head(opt$par, length(dat$beta)))
> (Intercept) X1 X2
> truth -1.000 0.3000 0.2000
> estimated -1.007 0.3059 0.1866
compare estimated scale parameters with the true values
rbind(truth = dat$sc, estimated = exp(tail(opt$par, length(dat$sc))))
> Genetic Maternal
> truth 0.5000 0.3300
> estimated 0.5233 0.3643
```
Computation in Parallel
The method scales reasonably well in the number of threads as shown below:
``r
library(microbenchmark)
microbenchmark(
fn (1 thread)= fn(c(beta, sc), n_threads = 1),
fn (2 threads)= fn(c(beta, sc), n_threads = 2),
fn (4 threads)= fn(c(beta, sc), n_threads = 4),
gr (1 thread)= gr(c(beta, sc), n_threads = 1),
gr (2 threads)= gr(c(beta, sc), n_threads = 2),
gr (4 threads)` = gr(c(beta, sc), n_threads = 4),
times = 1)
> Unit: seconds
> expr min lq mean median uq max neval
> fn (1 thread) 3.881 3.881 3.881 3.881 3.881 3.881 1
> fn (2 threads) 1.970 1.970 1.970 1.970 1.970 1.970 1
> fn (4 threads) 1.140 1.140 1.140 1.140 1.140 1.140 1
> gr (1 thread) 36.177 36.177 36.177 36.177 36.177 36.177 1
> gr (2 threads) 19.129 19.129 19.129 19.129 19.129 19.129 1
> gr (4 threads) 9.435 9.435 9.435 9.435 9.435 9.435 1
```
Using ADAM
We use stochastic gradient descent with the ADAM method (Kingma and Ba 2015) in this section. We define a function below to apply ADAM and use it to estimate the model.
``` r
performs stochastic gradient descent (using ADAM).
Args:
par: starting value.
gr: function to evaluate the log marginal likelihood.
n_clust: number of observation.
n_blocks: number of blocks.
maxit: maximum number of iteration.
seed: seed to use.
epsilon, alpha, beta1, beta2: ADAM parameters.
maxvls: maximum number of samples to draw in each iteration. Thus, it
needs maxit elements.
verbose: print output during the estimation.
...: arguments passed to gr.
adam <- function(par, gr, nclust, nblocks, maxit = 10L, seed = 1L, epsilon = 1e-8, alpha = .001, beta1 = .9, beta2 = .999, maxvls = rep(10000L, maxit), verbose = FALSE, ...){ grpdummy <- (seqlen(nclust) - 1L) %% nblocks npar <- length(par) m <- v <- numeric(npar) funvals <- numeric(maxit) estimates <- matrix(NAreal, npar, maxit) i <- -1L
for(k in 1:maxit){ # sample groups indices <- sample.int(nclust, replace = FALSE) - 1L blocks <- tapply(indices, grpdummy, identity, simplify = FALSE)
for(ii in 1:n_blocks){
i <- i + 1L
idx_b <- (i %% n_blocks) + 1L
m_old <- m
v_old <- v
res <- gr(par, indices = blocks[[idx_b]], maxvls = maxvls[k])
fun_vals[(i %/% n_blocks) + 1L] <-
fun_vals[(i %/% n_blocks) + 1L] + attr(res, "value")
res <- c(res)
m <- beta_1 * m_old + (1 - beta_1) * res
v <- beta_2 * v_old + (1 - beta_2) * res^2
m_hat <- m / (1 - beta_1^(i + 1))
v_hat <- v / (1 - beta_2^(i + 1))
par <- par - alpha * m_hat / (sqrt(v_hat) + epsilon)
}
if(verbose){
cat(sprintf("Ended iteration %4d. Running estimate of the function value is: %14.2f\n",
k, fun_vals[k]))
cat("Parameter estimates are:\n")
cat(capture.output(print(par)), sep = "\n")
cat("\n")
}
estimates[, k] <- par
}
list(par = par, estimates = estimates, funvals = funvals) }
use the function
assign the maximum number of samples we will use
maxit <- 100L minvls <- 250L maxpts <- formals(gr)$maxvls maxpts_use <- exp(seq(log(2 * minvls), log(maxpts), length.out = maxit))
show the maximum number of samples we use
par(mar = c(5, 4, 1, 1)) plot(maxptsuse, pch = 16, xlab = "Iteration number", bty = "l", ylab = "Maximum number of samples", ylim = range(0, maxptsuse)) ```

``` r set.seed(1) system.time( adamres <- adam(c(beta, sc), gr = gr, nclust = length(datarg), nblocks = 10L, alpha = 1e-2, maxit = maxit, verbose = FALSE, maxvls = maxpts_use, minvls = minvls))
> user system elapsed
> 1570.129 0.084 398.476
```
The result is shown below.
``` r print(-fn(adam_res$par), digits = 8) # the maximum log likelihood
> [1] -25823.228
> attr(,"n_fails")
> [1] 0
> attr(,"std")
> [1] 0.066737305
compare the estimated fixed effects with the true values
rbind(truth = dat$beta,
estimated optim = head(opt$par , length(dat$beta)),
estimated ADAM = head(adam_res$par, length(dat$beta)))
> (Intercept) X1 X2
> truth -1.000 0.3000 0.2000
> estimated optim -1.007 0.3059 0.1866
> estimated ADAM -1.006 0.3068 0.1858
compare estimated scale parameters with the true values
rbind(truth = dat$sc,
estimated optim = exp(tail(opt$par , length(dat$sc))),
estimated ADAM = exp(tail(adam_res$par, length(dat$sc))))
> Genetic Maternal
> truth 0.5000 0.3300
> estimated optim 0.5233 0.3643
> estimated ADAM 0.5191 0.3653
could possibly have stopped much earlier maybe. Dashed lines are final
estimates
par(mar = c(5, 4, 1, 1)) matplot(t(adamres$estimates), type = "l", col = "Black", lty = 1, bty = "l", xlab = "Iteration", ylab = "Estimate") for(s in adamres$par) abline(h = s, lty = 2) ```

The Multivariate Normal CDF Approximation
We compare the multivariate normal CDF approximation in this section with the approximation from the mvtnorm package which uses the implementation by Genz and Bretz (2002). The same algorithm is used but the version in this package is re-written in C++ and differs slightly. Moreover, we have implemented an approximation of the standard normal CDF and its inverse which reduces the computation time as we will show below.
We also compare our implementation of the minimax titling method suggested by Botev (2017) with the implementation in the TruncatedNormal package.
``` r
settings for the simulation study
library(mvtnorm) library(pedmod) library(microbenchmark) set.seed(78459126) n <- 5L # number of variables to integrate out rel_eps <- 1e-4 # the relative error to use
run the simulation study
sim_res <- replicate(expr = { # simulate covariance matrix and the upper bound S <- drop(rWishart(1L, 2 * n, diag(n) / 2 / n)) u <- rnorm(n)
# function to use pmvnorm usemvtnorm <- function(releps) mvtnorm::pmvnorm( upper = u, sigma = S, algorithm = GenzBretz( abseps = 0, releps = rel_eps, maxpts = 1e7))
# function to use pmvnorm from TruncatedNormal usetruncnorm <- function(nsample) TruncatedNormal::pmvnorm( sigma = S, lb = rep(-Inf, n), ub = u, type = "qmc", B = nsample)
# function to use this package usemvndst <- function(useaprx = FALSE, method = 0L, usetilting = TRUE) mvndst(lower = rep(-Inf, n), upper = u, mu = rep(0, n), sigma = S, useaprx = useaprx, abseps = 0, releps = releps, maxvls = 1e7, method = method, usetilting = usetilting)
# get a very precise estimate truth <- usemvtnorm(releps / 100)
# computes the error with repeated approximations and compute the time it # takes nrep <- 5L runntime <- function(expr){ expr <- substitute(expr) ti <- getnanotime() res <- replicate(nrep, eval(expr)) ti <- getnanotime() - ti err <- (res - truth) / truth c(SE = sqrt(sum(err^2) / nrep), time = ti / nrep / 1e9) }
mvtnormres <- runntime(usemvtnorm(rel_eps))
nsample <- attr(usemvndst(TRUE, method = 0L, usetilting = TRUE), "nit") TruncatedNormalres <- runntime(usetruncnorm(nsample))
mvndstnoaprxresKorobov <- runntime(usemvndst(FALSE, method = 0L, usetilting = FALSE)) mvndstwaprxresKorobov <- runntime(usemvndst(TRUE , method = 0L, usetilting = FALSE)) mvndstnoaprxresSobol <- runntime(usemvndst(FALSE, method = 1L, usetilting = FALSE)) mvndstwaprxresSobol <- runntime(usemvndst(TRUE , method = 1L, usetilting = FALSE))
mvndstnoaprxresKorobovtilt <- runntime(usemvndst(FALSE, method = 0L, usetilting = TRUE)) mvndstwaprxresKorobovtilt <- runntime(usemvndst(TRUE , method = 0L, usetilting = TRUE)) mvndstnoaprxresSoboltilt <- runntime(usemvndst(FALSE, method = 1L, usetilting = TRUE)) mvndstwaprxresSoboltilt <- runntime(usemvndst(TRUE , method = 1L, usetilting = TRUE))
# return
rbind(mvtnorm = mvtnormres,
TruncatedNormal = TruncatedNormalres,
no aprx; Korobov = mvndstnoaprxresKorobov,
no aprx; Sobol = mvndstnoaprxresSobol,
w/ aprx; Korobov = mvndstwaprxresKorobov,
w/ aprx; Sobol = mvndstwaprxresSobol,
`no aprx; Korobov (tilt)` = mvndst_no_aprx_res_Korobov_tilt,
`no aprx; Sobol (tilt)` = mvndst_no_aprx_res_Sobol_tilt,
`w/ aprx; Korobov (tilt)` = mvndst_w_aprx_res_Korobov_tilt,
`w/ aprx; Sobol (tilt)` = mvndst_w_aprx_res_Sobol_tilt)
}, n = 100, simplify = "array") ```
Box plots of the relative errors are shown below:
``` r rowMeans(sim_res[, "SE", ])
> mvtnorm TruncatedNormal no aprx; Korobov
> 2.800e-05 2.396e-04 3.160e-05
> no aprx; Sobol w/ aprx; Korobov w/ aprx; Sobol
> 3.073e-05 3.042e-05 3.129e-05
> no aprx; Korobov (tilt) no aprx; Sobol (tilt) w/ aprx; Korobov (tilt)
> 3.327e-05 2.803e-05 3.400e-05
> w/ aprx; Sobol (tilt)
> 2.963e-05
par(mar = c(10, 4, 1, 1), bty = "l") boxplot(t(sim_res[, "SE", ]), las = 2) grid() ```

The new implementation is faster when the approximation is used:
``` r rowMeans(sim_res[, "time", ])
> mvtnorm TruncatedNormal no aprx; Korobov
> 0.018016 0.054301 0.012852
> no aprx; Sobol w/ aprx; Korobov w/ aprx; Sobol
> 0.015486 0.004854 0.006240
> no aprx; Korobov (tilt) no aprx; Sobol (tilt) w/ aprx; Korobov (tilt)
> 0.012644 0.011646 0.009643
> w/ aprx; Sobol (tilt)
> 0.008862
par(mar = c(9, 4, 1, 1), bty = "l") boxplot(t(sim_res[, "time", ]), log = "y", las = 2) grid() ```

Next, we compare the methods with the first example from Botev (2017). This is with a low probability case and we would expect the minimax tilted version to perform better. We fix the number of samples with all packages in this example.
``` r
settings for the test like in Botev (2017)
library(mvtnorm) library(pedmod) library(microbenchmark) ds <- c(3, 5, 10, 15, 20, 25) n_sample <- 10000L
run the simulation study
set.seed(15418038) sim_res <- sapply(ds, (d){ S <- solve(diag(1/2, d) + 1/2) l <- rep(1/2, d) u <- rep(1, d)
# function to use pmvnorm usemvtnorm <- function(nsample) mvtnorm::pmvnorm(lower = l, upper = u, sigma = S, algorithm = GenzBretz( abseps = 0, releps = 0, maxpts = n_sample))
# function to use pmvnorm from TruncatedNormal usetruncnorm <- function(nsample) TruncatedNormal::pmvnorm( sigma = S, lb = l, ub = u, type = "qmc", B = nsample)
# function to use this package usemvndst <- function(useaprx = FALSE, method = 0L, usetilting = TRUE) mvndst(lower = l, upper = u, mu = rep(0, d), sigma = S, useaprx = useaprx, abseps = 0, releps = 0, maxvls = nsample, method = method, usetilting = usetilting, minvls = n_sample)
# get a very precise estimate truth <- usetruncnorm(n_sample * 100L)
# computes the error with repeated approximations and compute the time it # takes nrep <- 25L runntime <- function(expr){ expr <- substitute(expr) ti <- getnanotime() res <- replicate(nrep, eval(expr)) ti <- getnanotime() - ti err <- (res - truth) / truth c(SE = sqrt(sum(err^2) / nrep), time = ti / nrep / 1e9) }
mvtnormres <- runntime(usemvtnorm(n_sample))
TruncatedNormalres <- runntime(usetruncnorm(nsample))
mvndstnoaprxresKorobov <- runntime(usemvndst(FALSE, method = 0L, usetilting = FALSE)) mvndstwaprxresKorobov <- runntime(usemvndst(TRUE , method = 0L, usetilting = FALSE)) mvndstnoaprxresSobol <- runntime(usemvndst(FALSE, method = 1L, usetilting = FALSE)) mvndstwaprxresSobol <- runntime(usemvndst(TRUE , method = 1L, usetilting = FALSE))
mvndstnoaprxresKorobovtilt <- runntime(usemvndst(FALSE, method = 0L, usetilting = TRUE)) mvndstwaprxresKorobovtilt <- runntime(usemvndst(TRUE , method = 0L, usetilting = TRUE)) mvndstnoaprxresSoboltilt <- runntime(usemvndst(FALSE, method = 1L, usetilting = TRUE)) mvndstwaprxresSoboltilt <- runntime(usemvndst(TRUE , method = 1L, usetilting = TRUE))
rbind(mvtnorm = mvtnormres,
TruncatedNormal = TruncatedNormalres,
no aprx; Korobov = mvndstnoaprxresKorobov,
no aprx; Sobol = mvndstnoaprxresSobol,
w/ aprx; Korobov = mvndstwaprxresKorobov,
w/ aprx; Sobol = mvndstwaprxresSobol,
`no aprx; Korobov (tilt)` = mvndst_no_aprx_res_Korobov_tilt,
`no aprx; Sobol (tilt)` = mvndst_no_aprx_res_Sobol_tilt,
`w/ aprx; Korobov (tilt)` = mvndst_w_aprx_res_Korobov_tilt,
`w/ aprx; Sobol (tilt)` = mvndst_w_aprx_res_Sobol_tilt)
}, simplify = "array")
dimnames(simres) <- setNames(c(dimnames(simres)[1:2], list(ds)), c("Method", "Metric", "Dimension")) ```
The relative errors plotted against the dimension is shown below:
``` r
the errors for each method and dimension
sim_res[, "SE", ]
> Dimension
> Method 3 5 10 15 20
> mvtnorm 1.005e-06 1.752e-05 7.677e-04 0.0245205 6.705e-01
> TruncatedNormal 3.229e-05 1.517e-04 4.024e-04 0.0009756 1.600e-03
> no aprx; Korobov 2.932e-07 1.997e-06 2.050e-03 0.0201007 2.959e-01
> no aprx; Sobol 5.201e-05 1.559e-04 3.788e-03 0.0625289 5.190e-01
> w/ aprx; Korobov 6.803e-07 3.089e-06 2.247e-03 0.0228891 3.701e-01
> w/ aprx; Sobol 4.538e-05 1.610e-04 3.891e-03 0.0517658 3.766e-01
> no aprx; Korobov (tilt) 2.860e-07 1.434e-06 1.233e-05 0.0000215 5.722e-05
> no aprx; Sobol (tilt) 3.209e-06 1.318e-05 5.806e-05 0.0001141 3.116e-04
> w/ aprx; Korobov (tilt) 4.175e-07 8.242e-06 8.643e-05 0.0002578 NaN
> w/ aprx; Sobol (tilt) 3.118e-06 1.623e-05 1.112e-04 0.0002670 NaN
> Dimension
> Method 25
> mvtnorm 0.5974938
> TruncatedNormal 0.0019317
> no aprx; Korobov 0.6516706
> no aprx; Sobol 0.6469726
> w/ aprx; Korobov 0.5753185
> w/ aprx; Sobol 0.5478786
> no aprx; Korobov (tilt) 0.0001354
> no aprx; Sobol (tilt) 0.0003721
> w/ aprx; Korobov (tilt) NaN
> w/ aprx; Sobol (tilt) NaN
plot the errors
par(mar = c(5, 5, 1, 1), cex = .8) matplot(ds, t(simres[, "SE", ]), type = "p", log = "y", pch = 1:dim(simres)[1], xlab = "Dimension", ylab = "Relative error", col = "black", bty = "l") matlines(ds, t(simres[, "SE", ]), col = "black", lty = 2) legend("bottomright", bty = "n", pch = 1:dim(simres)[1], legend = dimnames(sim_res)[[1]]) grid() ```

A similar plot for the average estimation time is shown below.
``` r
the computation time for each method and dimension
sim_res[, "time", ]
> Dimension
> Method 3 5 10 15 20
> mvtnorm 0.0024594 0.005693 0.018783 0.043671 0.058178
> TruncatedNormal 0.0189829 0.026533 0.047174 0.070437 0.092527
> no aprx; Korobov 0.0036634 0.006398 0.013653 0.021524 0.028756
> no aprx; Sobol 0.0028079 0.004870 0.009911 0.015499 0.021142
> w/ aprx; Korobov 0.0009455 0.001706 0.003707 0.005911 0.008496
> w/ aprx; Sobol 0.0009710 0.001544 0.003210 0.005003 0.007176
> no aprx; Korobov (tilt) 0.0067950 0.011466 0.023547 0.036336 0.048791
> no aprx; Sobol (tilt) 0.0051081 0.008716 0.016716 0.025904 0.035360
> w/ aprx; Korobov (tilt) 0.0058555 0.009854 0.019565 0.034711 0.018599
> w/ aprx; Sobol (tilt) 0.0042097 0.007238 0.014215 0.025702 0.014323
> Dimension
> Method 25
> mvtnorm 0.070511
> TruncatedNormal 0.114410
> no aprx; Korobov 0.035051
> no aprx; Sobol 0.026401
> w/ aprx; Korobov 0.010705
> w/ aprx; Sobol 0.009492
> no aprx; Korobov (tilt) 0.061957
> no aprx; Sobol (tilt) 0.044211
> w/ aprx; Korobov (tilt) 0.023848
> w/ aprx; Sobol (tilt) 0.018139
plot the computation time
par(mar = c(5, 5, 1, 1), cex = .8) matplot(ds, t(simres[, "time", ]), type = "p", log = "y", pch = 1:dim(simres)[1], xlab = "Dimension", ylab = "Time", col = "black", bty = "l") matlines(ds, t(simres[, "time", ]), col = "black", lty = 2) legend("bottomright", bty = "n", pch = 1:dim(simres)[1], legend = dimnames(sim_res)[[1]]) grid() ```

References
Owner
- Login: boennecd
- Kind: user
- Repositories: 15
- Profile: https://github.com/boennecd
GitHub Events
Total
Last Year
Committers
Last synced: over 2 years ago
Top Committers
| Name | Commits | |
|---|---|---|
| Benjamin Christoffersen | b****d@g****m | 158 |
Issues and Pull Requests
Last synced: over 2 years ago
All Time
- Total issues: 3
- Total pull requests: 1
- Average time to close issues: 6 days
- Average time to close pull requests: less than a minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- boennecd (3)
- adriancorrendo (1)
Pull Request Authors
- boennecd (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- cran 203 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 6
- Total maintainers: 1
cran.r-project.org: pedmod
Pedigree Models
- Homepage: https://github.com/boennecd/pedmod
- Documentation: http://cran.r-project.org/web/packages/pedmod/pedmod.pdf
- License: GPL-3
- Status: removed
-
Latest release: 0.2.4
published over 3 years ago
Rankings
Maintainers (1)
Dependencies
- R >= 3.5.0 depends
- Rcpp * imports
- alabama * imports
- R.rsp * suggests
- TruncatedNormal * suggests
- abind * suggests
- igraph * suggests
- kinship2 * suggests
- knitr * suggests
- mvtnorm * suggests
- numDeriv * suggests
- rmarkdown * suggests
- testthat * suggests
- xml2 * suggests
- actions/checkout v2 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite