brglm2

Estimation and inference from generalized linear models using explicit and implicit methods for bias reduction

https://github.com/ikosmidis/brglm2

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 18 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    3 of 5 committers (60.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.8%) to scientific vocabulary

Keywords

adjusted-score-equations algorithms bias-reducing-adjustments bias-reduction estimation glm logistic-regression nominal-responses ordinal-responses r regression regression-algorithms rstats statistics
Last synced: 6 months ago

Repository

Estimation and inference from generalized linear models using explicit and implicit methods for bias reduction

Basic Info
  • Host: GitHub
  • Owner: ikosmidis
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 2.74 MB
Statistics
  • Stars: 31
  • Watchers: 5
  • Forks: 11
  • Open Issues: 3
  • Releases: 13
Topics
adjusted-score-equations algorithms bias-reducing-adjustments bias-reduction estimation glm logistic-regression nominal-responses ordinal-responses r regression regression-algorithms rstats statistics
Created over 9 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```


# brglm2 


[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/brglm2)](https://cran.r-project.org/package=brglm2)
[![R-CMD-check](https://github.com/ikosmidis/brglm2/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ikosmidis/brglm2/actions/workflows/R-CMD-check.yaml)
[![Licence](https://img.shields.io/badge/licence-GPL--3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html)
[![Codecov test coverage](https://codecov.io/gh/ikosmidis/brglm2/graph/badge.svg)](https://app.codecov.io/gh/ikosmidis/brglm2)




[**brglm2**](https://github.com/ikosmidis/brglm2) provides tools for
estimation and inference from generalized linear models using
various methods for bias reduction. **brglm2** supports all
generalized linear models supported in R, and provides methods for
multinomial logistic regression (nominal responses), adjacent category
models (ordinal responses), and negative binomial regression (for
potentially overdispersed count responses).

Reduction of estimation bias is achieved either by solving the
mean-bias reducing adjusted score equations of [Firth
(1993)](https://doi.org/10.1093/biomet/80.1.27) and [Kosmidis & Firth
(2009)](https://doi.org/10.1093/biomet/asp055), by solving the median-bias
reducing adjusted score equations of [Kenne Pagui et al.
(2017)](https://doi.org/10.1093/biomet/asx046), or through the direct
subtraction of an estimate of the bias of the maximum likelihood
estimator from the maximum likelihood estimates, as prescribed in
[Cordeiro and McCullagh
(1991)](https://www.jstor.org/stable/2345592). [Kosmidis et al.
(2020)](https://doi.org/10.1007/s11222-019-09860-6) provide a
unifying framework and algorithms for mean and median bias reduction
in the estimation of generalized linear models.
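
All of these methods are available through a single `type` argument of
the fitting method (see `?brglm_control` for the full list of values). A
minimal sketch, assuming a hypothetical data frame `dat` with a count
response `y` and a covariate `x`:

```r
library("brglm2")
## Hypothetical data: `dat` with count response `y` and covariate `x`
fit_mean   <- glm(y ~ x, family = poisson(), data = dat,
                  method = "brglm_fit", type = "AS_mean")   # mean bias reduction
fit_median <- update(fit_mean, type = "AS_median")          # median bias reduction
fit_corr   <- update(fit_mean, type = "correction")         # explicit bias correction
```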

In the special case of generalized linear models for binomial and
multinomial responses (both ordinal and nominal), the adjusted score
equations return estimates with improved frequentist properties that
are also always finite, even in cases where the maximum likelihood
estimates are infinite (e.g. under complete and quasi-complete
separation). See [Kosmidis & Firth
(2021)](https://doi.org/10.1093/biomet/asaa052) for a proof of the
latter result in the case of mean bias reduction for logistic
regression (and, more generally, for binomial-response models where the
likelihood is penalized by a power of Jeffreys' invariant prior).

For logistic regression, **brglm2** also provides methods for maximum
Diaconis-Ylvisaker prior penalized likelihood (MDYPL) estimation, and
corresponding methods for high-dimensionality corrections of the
aggregate bias of the estimator and the usual statistics used for
inference; see [Sterzinger and Kosmidis,
2024](https://arxiv.org/abs/2311.07419).

The core model fitters are implemented by the functions `brglm_fit()`
(univariate generalized linear models), `mdypl_fit()` (maximum
Diaconis-Ylvisaker prior penalized likelihood for logistic
regression), `brmultinom()` (baseline category logit models for
nominal multinomial responses), `bracl()` (adjacent category logit
models for ordinal multinomial responses), and `brnb()` (negative
binomial regression).
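
As a quick orientation, here is a sketch of what calls to these fitters
look like. The data sets below ship with **brglm2**, but the formulas
are only illustrative, in the spirit of the examples on the respective
help pages:

```r
library("brglm2")
## Univariate GLMs: pass the fitter to the method argument of glm()
data("endometrial", package = "brglm2")
fit_glm <- glm(HG ~ NV + PI + EH, family = binomial("logit"),
               data = endometrial, method = "brglm_fit")
## Baseline category logit model for nominal responses (illustrative formula)
data("alligators", package = "brglm2")
fit_bcl <- brmultinom(foodchoice ~ size + lake, weights = freq, data = alligators)
## Adjacent category logit model for ordinal responses (illustrative formula)
data("stemcell", package = "brglm2")
fit_acl <- bracl(research ~ as.numeric(religion) + gender,
                 weights = frequency, data = stemcell)
```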

## Installation

Install the current version from CRAN:

```r
install.packages("brglm2")
```

or the development version from GitHub:

```r
# install.packages("remotes")
remotes::install_github("ikosmidis/brglm2", ref = "develop")
```

## Examples

### Estimation of binomial-response GLMs with separated data

Below we follow the example of [Heinze and Schemper
(2002)](https://doi.org/10.1002/sim.1047) and fit a logistic
regression model using maximum likelihood (ML) to analyze data from a
study on endometrial cancer (see `?brglm2::endometrial` for details
and references).

```{r, echo = TRUE, eval = TRUE}
library("brglm2")
data("endometrial", package = "brglm2")
modML <- glm(HG ~ NV + PI + EH, family = binomial("logit"), data = endometrial)
summary(modML)
```

The ML estimate of the parameter for `NV` is
actually infinite, as can be quickly verified using the
[**detectseparation**](https://cran.r-project.org/package=detectseparation) R package:
```{r, echo = TRUE, eval = TRUE}
# install.packages("detectseparation")
library("detectseparation")
update(modML, method = detect_separation)
```

The reported, apparently finite, estimate `r
round(coef(summary(modML))["NV", "Estimate"], 3)` for `NV` is merely
due to false convergence of the iterative estimation procedure for
ML. The same is true for the estimated standard error, and, hence, the
value `r round(coef(summary(modML))["NV", "z value"], 3)` of the
$z$-statistic cannot be trusted for inference on the effect size of `NV`.
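
A quick way to see the false convergence (an illustrative check, not
part of the original analysis) is to raise the iteration limit and
tighten the convergence tolerance; the reported estimate and standard
error for `NV` then keep growing with the number of iterations:

```r
sapply(c(25, 50, 100, 200), function(iters) {
    refit <- update(modML, control = glm.control(maxit = iters, epsilon = 1e-14))
    coef(summary(refit))["NV", c("Estimate", "Std. Error")]
})
```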

As mentioned earlier, many of the estimation methods implemented in
**brglm2** not only return estimates with improved frequentist
properties (e.g. asymptotically smaller mean and median bias than what
ML typically delivers), but also estimates and estimated standard
errors that are always finite in binomial (e.g. logistic, probit, and
complementary log-log regression) and multinomial regression models
(e.g. baseline category logit models for nominal responses, and
adjacent category logit models for ordinal responses). For example,
the code chunk below refits the model on the endometrial cancer study
data using mean bias reduction.

```{r, echo = TRUE, eval = TRUE}
summary(update(modML, method = "brglm_fit"))
```

A quick comparison of the output from mean bias reduction to that from
ML reveals a dramatic change in the $z$-statistic for `NV`, now that
estimates and estimated standard errors are finite. In particular, the
evidence against the null hypothesis that `NV` does not contribute to
the model in the presence of the other covariates is now stronger.

See `?brglm_fit` and `?brglm_control` for more examples and the other
estimation methods for generalized linear models, including median
bias reduction and maximum penalized likelihood with Jeffreys' prior
penalty. Also do not forget to take a look at the vignettes
(`vignette(package = "brglm2")`) for details and more case studies.
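
For instance, assuming the `type` argument is passed through to the
fitter as documented in `?brglm_control`, median bias reduction and
maximum Jeffreys'-prior penalized likelihood fits of the same model are
one `update()` away:

```r
## Median bias reduction
summary(update(modML, method = "brglm_fit", type = "AS_median"))
## Maximum penalized likelihood with Jeffreys' prior penalty
summary(update(modML, method = "brglm_fit", type = "MPL_Jeffreys"))
```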

### Improved estimation of the exponential of regression parameters

See, also `?expo` for a method to estimate the exponential of
regression parameters, such as odds ratios from logistic regression
models, while controlling for other covariate information. Estimation
can be performed using maximum likelihood or various estimators with
smaller asymptotic mean and median bias, that are also guaranteed to
be finite, even if the corresponding maximum likelihood estimates are
infinite. For example, `modML` is a logistic regression fit, so the
exponential of each coefficient is an odds ratio while controlling for
other covariates. To estimate those odds ratios using the
`correction*` method for mean bias reduction (see `?expo` for details)
we do

```{r, echo = TRUE, eval = TRUE}
expoRB <- expo(modML, type = "correction*")
expoRB
```

The odds ratio between presence of neovasculation and high histology
grade (`HG`) is estimated to be `r round(coef(expoRB)["NV"], 3)`, while
controlling for `PI` and `EH`. So, for each value of `PI` and `EH`, the
estimated odds of high histology grade are about `r round(coef(expoRB)["NV"], 1)` times higher when neovasculation is present. An approximate 95\% interval for the latter odds ratio is
`r paste0("(", paste(round(expoRB$ci["NV",], 1), collapse = ", "), ")")`,
providing evidence of association between `NV` and `HG` while
controlling for `PI` and `EH`. Note here that the maximum likelihood estimate of the odds ratio is not as useful as the `correction*` estimate, because it is $+\infty$ with an infinite standard error (see the previous section).
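
For comparison, the ML-based odds ratios can be requested with `type =
"ML"`; the resulting estimate for `NV` is infinite, in line with the
separation detected earlier:

```r
expo(modML, type = "ML")
```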

### MDYPL estimation and high-dimensionality corrections

Consider the [Multiple Features
dataset](https://doi.org/10.24432/C5HC70), which consists of digits
(0-9) extracted from a collection of maps from a Dutch public utility.
Two hundred `30 × 48` binary images per digit were available, which
have then been used to extract feature sets. The digits are shown
below using pixel averages in `2 × 3` windows.

```{r, echo = TRUE, eval = TRUE}
data("MultipleFeatures", package = "brglm2")
par(mfrow = c(10, 20), mar = numeric(4) + 0.1)
for (c_digit in 0:9) {
    df <- subset(MultipleFeatures, digit == c_digit)
    df <- as.matrix(df[, paste("pix", 1:240, sep = ".")])
    for (inst in 1:20) {
        m <- matrix(df[inst, ], 15, 16)[, 16:1]
        image(m, col = grey.colors(7, 1, 0), xaxt = "n", yaxt = "n")
    }
}
```

We focus on the setting of [Sterzinger and Kosmidis (2024, Section
8)](https://arxiv.org/abs/2311.07419), where the character
shapes of the digit `7` are explained in terms of 76 Fourier coefficients (`fou`
features), which are computed to be rotationally invariant, and 64
Karhunen-Loève coefficients (`kar` features), using 1000 randomly
selected digits. Depending on the font, the level of noise introduced
during digitization, and the downscaling of the digits to binary
images, difficulties may arise in discriminating instances of the
digit `7` from instances of the digits `1` and `4`. Also, if only
rotation-invariant features, like `fou`, are used, difficulties may
arise in discriminating instances of the digit `7` from instances of the
digit `2`.

The data are perfectly separated for both the model with only `fou`
features and the model with both `fou` and `kar` features, and the
maximized log-likelihood is zero for both models.
```{r, echo = TRUE, eval = TRUE}
## Center the fou.* and kar.* features
vars <- grep("fou|kar", names(MultipleFeatures), value = TRUE)
train_id <- which(MultipleFeatures$training)
MultipleFeatures[train_id, vars] <- scale(MultipleFeatures[train_id, vars], scale = FALSE)
## Set up model formulas
full_fm <- formula(paste("I(digit == 7) ~", paste(vars, collapse = " + ")))
nest_vars <- grep("fou", vars, value = TRUE)
nest_fm <- formula(paste("I(digit == 7) ~", paste(nest_vars, collapse = " + ")))
## Check both models for separation
full_sep <- glm(full_fm, data = MultipleFeatures, family = binomial(), subset = training,
                method = detect_separation)
nest_sep <- update(full_sep, nest_fm)
full_sep$outcome
nest_sep$outcome
```

As a result, the likelihood ratio statistic comparing the two models
will be trivially zero, regardless of any evidence against the hypothesis that
the model with only `fou` features is as good a description of `7` as
the model with both `fou` and `kar` features.
```{r, echo = TRUE, eval = TRUE}
anova(update(nest_sep, method = glm.fit),
      update(full_sep, method = glm.fit))
```

Let's fit the models using maximum Diaconis-Ylvisaker prior penalized
likelihood (MDYPL) with a shrinkage parameter `alpha` in `(0, 1)`, which
always results in finite estimates; see `?brglm2::mdypl_fit`. The
corresponding penalized likelihood ratio test (see `?brglm2::plrtest`
and `?brglm2::summary.mdyplFit`) results again in no evidence against
the hypothesis.

```{r, echo = TRUE, eval = TRUE}
full_m <- update(full_sep, method = mdypl_fit)
nest_m <- update(nest_sep, method = mdypl_fit, alpha = full_m$alpha)
plrtest(nest_m, full_m)
```

Nevertheless, `full_m` involves 141 parameters, which is relatively
large compared to the 1000 available observations. The distribution of
the penalized likelihood ratio statistic may be far from the
asymptotic $\chi^2$ distribution that we expect under usual
asymptotics.
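
A quick computation of the dimension-to-sample-size ratio, i.e. the
$\kappa$ in the proportional asymptotics regime $p/n \to \kappa \in
(0, 1)$, illustrates the point (here $141 / 1000 = 0.141$):

```r
p <- length(coef(full_m))            # 141 parameters, including the intercept
n <- sum(MultipleFeatures$training)  # 1000 training observations
c(p = p, n = n, kappa = p / n)
```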

In stark contrast to the evidence quantified by the previous tests,
the high-dimensionality correction to the penalized likelihood ratio
statistic under proportional asymptotics proposed in [Sterzinger and Kosmidis
(2024)](https://arxiv.org/abs/2311.07419) results in strong evidence
against the model with `fou` features only.

```{r, echo = TRUE, eval = TRUE}
plrtest(nest_m, full_m, hd_correction = TRUE)
```

The estimates can be corrected in terms of aggregate bias using the
`summary()` method. 
```{r, echo = TRUE, eval = TRUE}
summ_full_m <- summary(full_m, hd_correction = TRUE)
```

The correction proceeds by estimating the constant $\mu$ by which the
estimates are divided in order to recover the asymptotic aggregate
unbiasedness of the estimator. The figure below illustrates that the
impact of the correction is to inflate the MDYPL estimates.

```{r, echo = TRUE, eval = FALSE}
rescaled_coefs <- coef(summ_full_m)[-1, ]
acols <- hcl.colors(3, alpha = 0.2)
cols <- hcl.colors(3)
plot(coef(full_m)[-1], rescaled_coefs[, "Estimate"],
     xlim = c(-9, 9), ylim = c(-9, 9),
     xlab = "MDYPL estimates", ylab = "rescaled MDYPL estimates",
     pch = 21,
     bg = acols[grepl("kar", rownames(rescaled_coefs)) + 1],
     col = NULL)
legend(-9, 9, legend = c("fou", "kar"), pt.bg = cols[1:2], col = NA, pch = 21,
       title = "Features")
legend(-5.4, 9, legend = expression(1, 1/hat(mu)), lty = c(2, 1), col = "grey",
       title = "Slope")
abline(0, 1, col = "grey", lty = 2)
abline(0, 1/summ_full_m$se_parameters[1], col = "grey")
```


## Estimation methods 

The workhorse function in **brglm2** is `brglm_fit()` (or equivalently
`brglmFit()` if you like camel case), which, as we did in the example
above, can be passed directly to the `method` argument of the `glm()`
function. `brglm_fit()` implements a quasi [Fisher
scoring](https://en.wikipedia.org/wiki/Scoring_algorithm) procedure,
whose special cases result in a range of explicit and implicit bias
reduction methods for generalized linear models (see `?brglm_fit` for
more details). Bias reduction for multinomial logistic regression (nominal
responses) can be performed using the function `brmultinom()`, and for
adjacent category models (ordinal responses) using the function
`bracl()`. Both `brmultinom()` and `bracl()` rely on `brglm_fit()`.

The classification of bias reduction methods into explicit and
implicit is as given in [Kosmidis
(2014)](https://doi.org/10.1002/wics.1296).

For logistic regression models, in particular, the `mdypl_fit()`
function provides maximum Diaconis-Ylvisaker prior penalized
likelihood estimation, and can again be passed directly to the
`method` argument of the `glm()` function. The `summary()` method for
`mdyplFit` objects then allows for high-dimensionality corrections of
aggregate bias and of the usual $z$-statistics under proportional
asymptotics, and the `plrtest()` method allows for penalized
likelihood ratio tests with and without high-dimensionality corrections;
see [Sterzinger and Kosmidis
(2024)](https://arxiv.org/abs/2311.07419), the example above, and the
help pages of the methods.


## References and resources

**brglm2** was presented by [Ioannis
Kosmidis](https://www.ikosmidis.com) at the useR! 2016 international
conference at Stanford University on 16 June 2016. The presentation
was titled "Reduced-bias inference in generalized linear models".

Motivation, details and discussion on the methods that **brglm2** implements are provided in

> Kosmidis I., Kenne Pagui E. C. and Sartori N. (2020). Mean and median bias reduction in generalized linear models. [*Statistics and Computing*](https://doi.org/10.1007/s11222-019-09860-6), *30*, 43–59.


The [iteration vignette](https://cran.r-project.org/package=brglm2/vignettes/iteration.html) describes the iteration and gives mathematical details for the bias-reducing adjustments to the score functions for generalized linear models.
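
In brief, and only as a pointer to what the vignette derives in full:
rather than solving the maximum likelihood score equations $s(\beta) =
0$, the bias-reducing methods solve adjusted score equations of the
form

$$
s(\beta) + A(\beta) = 0 \, ,
$$

where $A(\beta)$ is a bounded adjustment, different for mean and median
bias reduction, computed from the expected information and higher-order
moments of the log-likelihood derivatives; see the vignette for the
exact expressions and the quasi-Fisher-scoring iteration that solves
the equations.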

Maximum Diaconis-Ylvisaker prior penalized likelihood and high-dimensionality corrections under proportional asymptotics are described in

> Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for $p/n \to \kappa \in (0,1)$ logistic regression. [*arXiv*:2311.07419v2](https://arxiv.org/abs/2311.07419).





## Code of Conduct

Please note that the **brglm2** project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.



Owner

  • Name: Ioannis Kosmidis
  • Login: ikosmidis
  • Kind: user
  • Company: Department of Statistics, University of Warwick & The Alan Turing Institute

Professor of Statistics at #warwickuni & #turinginst; interested in methods for statistical learning and inference, computing and programming (mainly R & Julia)

GitHub Events

Total
  • Create event: 2
  • Release event: 2
  • Issues event: 3
  • Watch event: 3
  • Push event: 38
  • Fork event: 3
Last Year
  • Create event: 2
  • Release event: 2
  • Issues event: 3
  • Watch event: 3
  • Push event: 38
  • Fork event: 3

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 555
  • Total Committers: 5
  • Avg Commits per committer: 111.0
  • Development Distribution Score (DDS): 0.132
Past Year
  • Commits: 62
  • Committers: 2
  • Avg Commits per committer: 31.0
  • Development Distribution Score (DDS): 0.081
Top Committers
  • Ioannis Kosmidis (i****s@u****k): 482 commits
  • Ioannis Kosmidis (i****s@w****k): 57 commits
  • euloge clovis kenne pagui (k****e@s****t): 9 commits
  • Ioannis Kosmidis (y****s@f****l): 5 commits
  • Ioannis Kosmidis (y****s@f****k): 2 commits

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 25
  • Total pull requests: 15
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 21 days
  • Total issue authors: 16
  • Total pull request authors: 3
  • Average comments per issue: 2.52
  • Average comments per pull request: 0.53
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ikosmidis (6)
  • rebeccaltaylor (4)
  • dasiav7 (2)
  • bpuladi (1)
  • michalovadek (1)
  • cccccccH (1)
  • FinBauer (1)
  • tomwenseleers (1)
  • cperk (1)
  • MelanieCNM (1)
  • twest820 (1)
  • jamiahuswalton (1)
  • IlhemB (1)
  • patrickmziet (1)
  • TaylorAndrew (1)
Pull Request Authors
  • ikosmidis (11)
  • eulogepagui (4)
  • StaffanBetner (1)
Top Labels
Issue Labels
  • enhancement (4)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran: 7,804 last month
  • Total docker downloads: 21,952
  • Total dependent packages: 12
    (may contain duplicates)
  • Total dependent repositories: 17
    (may contain duplicates)
  • Total versions: 27
  • Total maintainers: 1
cran.r-project.org: brglm2

Bias Reduction in Generalized Linear Models

  • Versions: 20
  • Dependent Packages: 12
  • Dependent Repositories: 17
  • Downloads: 7,804 last month
  • Docker Downloads: 21,952
Rankings
Dependent packages count: 4.6%
Downloads: 6.8%
Dependent repos count: 6.9%
Forks count: 7.3%
Stargazers count: 9.7%
Average: 10.5%
Docker downloads count: 27.6%
Last synced: 6 months ago
conda-forge.org: r-brglm2
  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Average: 44.3%
Stargazers count: 45.8%
Forks count: 46.0%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 3.3.0 depends
  • MASS * imports
  • Matrix * imports
  • enrichwith * imports
  • graphics * imports
  • nnet * imports
  • numDeriv * imports
  • stats * imports
  • covr * suggests
  • detectseparation * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • testthat * suggests