kcmeans

kcmeans: Conditional Expectation Function Estimation with K-Conditional-Means

https://github.com/thomaswiemann/kcmeans

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (17.1%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

kcmeans: Conditional Expectation Function Estimation with K-Conditional-Means

Basic Info

Host: GitHub
Owner: thomaswiemann
License: gpl-3.0
Language: R
Default Branch: main
Homepage: https://thomaswiemann.com/kcmeans/
Size: 3.92 MB

Statistics

Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme License

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# kcmeans


[![R-CMD-check](https://github.com/thomaswiemann/kcmeans/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/thomaswiemann/kcmeans/actions/workflows/R-CMD-check.yaml)
[![codecov](https://codecov.io/gh/thomaswiemann/kcmeans/graph/badge.svg?token=1U0XDRMKEP)](https://app.codecov.io/gh/thomaswiemann/kcmeans)
[![CodeFactor](https://www.codefactor.io/repository/github/thomaswiemann/kcmeans/badge)](https://www.codefactor.io/repository/github/thomaswiemann/kcmeans)
[![CRAN Version](https://www.r-pkg.org/badges/version/kcmeans)](https://cran.r-project.org/package=kcmeans)


``kcmeans`` is an implementation of the K-Conditional-Means (KCMeans) regression estimator analyzed by Wiemann (2023; [arxiv:2311.17021](https://arxiv.org/abs/2311.17021)) for conditional expectation function estimation using categorical features. The implementation leverages the unconditional KMeans implementation in one dimension using dynamic programming of the [``Ckmeans.1d.dp``](https://CRAN.R-project.org/package=Ckmeans.1d.dp) package.

See the working paper [Optimal Categorical Instrumental Variables](https://arxiv.org/abs/2311.17021) for further discussion of the KCMeans estimator.

## Installation

Install the latest development version from GitHub (requires [devtools](https://github.com/r-lib/devtools) package):

```{r, eval = FALSE}
if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("thomaswiemann/kcmeans", dependencies = TRUE)
```

Install the latest public release from CRAN:
```{r, eval = FALSE}
install.packages("kcmeans")
```

## Usage

To illustrate ``kcmeans``, consider simulating a small dataset with a continuous outcome variable ``y``, two observed predictors -- a categorical variable ``Z`` and a continuous variable ``X`` -- and an (unobserved) Gaussian error. As in Wiemann (2023), the reduced form has an unobserved lower-dimensional representation dependent on the latent categorical variable ``Z0``.
```{r}
# Load package
library(kcmeans)
# Set seed
set.seed(51944)
# Sample parameters
nobs = 800 # sample size
# Sample data
X <- rnorm(nobs)
Z <- sample(1:20, nobs, replace = T)
Z0 <- Z %% 4 # lower-dimensional latent categorical variable
y <- Z0 + X + rnorm(nobs)
```

``kcmeans`` is then computed by combining the categorical feature with the continuous feature. By default, the categorical feature is the first column. Alternatively, the column corresponding to the categorical feature can be set via the ``which_is_cat`` argument. Computation is _very_ quick -- indeed the dynamic programming algorithm of the leveraged ``Ckmeans.1d.dp`` package is polynomial in the number of values taken by the categorical feature ``Z``. See also ``?kcmeans`` for details.

```{r}
system.time({
kcmeans_fit <- kcmeans(y = y, X = cbind(Z, X), K = 4)
})
```

We may now use the ``predict.kcmeans`` method to construct fitted values and/or compute predictions of the lower-dimensional latent categorical feature ``Z0``. See also ``?predict.kcmeans`` for details.
```{r}
# Predicted values for the outcome + R^2
y_hat <- predict(kcmeans_fit, cbind(Z, X))
round(1 - mean((y - y_hat)^2) / mean((y - mean(y))^2), 3)

# Predicted values for the latent categorical feature + missclassification rate
Z0_hat <- predict(kcmeans_fit, cbind(Z, X), clusters = T) - 1
mean((Z0 - Z0_hat)!=0)


```

Finally, it is also straightforward to compute standard errors for the final coefficients, e.g., using ``summary.lm``:

```{r}
# Compute the linear regression object and call summary.lm
lm_fit <- lm(y ~ as.factor(Z0_hat) + X)
summary(lm_fit)
```


## Choice of K via Cross-Validation

Since the cardinality of the support of the underlying low-dimensional latent categorical variable is often unknown, it is useful to consider multiple KCMeans estimators with varying values for K. The below code snippet uses the [``ddml``](https://thomaswiemann.com/ddml/) package to compute the cross-validation mean-square prediction error (MSPE) of three KCMeans estimators (see also ``?ddml::crossval`` for details). 

In addition, the KCMeans MSPEs are compared to the MSPE of three alternative conditional expectation function estimators:

1. Ordinary least squares (see also ``?ddml::ols``)
2. Lasso with cross-validated penalty parameter (see also ``?ddml::mdl_glmnet``)
3. Ridge with cross-validated penalty parameter 


```{r}
# load the ddml package
library(ddml)

# one-hot encoding for ols, lasso, and ridge
Z_indicators <- model.matrix(~ as.factor(Z)) 

# Combine features and create indices
X_all <- cbind(Z, X, Z_indicators)
indx_factor <- 1:2
indx_indicators <- 2:(2 + ncol(Z_indicators))

# Create the learners, assign indicators to ols, lasso, and ridge
learner_list <- list(list(fun = kcmeans,
                          args = list(K = 2),
                          assign_X = indx_factor),
                     list(fun = kcmeans,
                          args = list(K = 4),
                          assign_X = indx_factor),
                     list(fun = kcmeans,
                          args = list(K = 6),
                          assign_X = indx_factor),
                     list(fun = ols,
                          assign_X = indx_indicators),
                     list(fun = mdl_glmnet,
                          assign_X = indx_indicators),
                     list(fun = mdl_glmnet,
                          args = list(alpha = 0),
                          assign_X = indx_indicators))

# Compute the cross-valdiation MSPE
cv_res <- crossval(y = y, X = X_all, 
                   learners = learner_list, 
                   cv_folds = 20, silent = T)

```

The results show that KCMeans with K=4 and K=6 achieve the smallest MSPE among the considered estimators.

```{r}
# Print the results
names(cv_res$mspe) <- c("KCMeans (K=2)", "KCMeans (K=4)", "KCMeans (K=6)",
                        "OLS", "Lasso", "Ridge")
round(cv_res$mspe, 4)

# Which learner is the best?
names(which.min(cv_res$mspe))
```

# References
Wiemann T (2023). "Optimal Categorical Instruments." https://arxiv.org/abs/2311.17021

Owner

Name: Thomas Wiemann
Login: thomaswiemann
Kind: user
Location: Chicago, IL

Website: thomaswiemann.com
Twitter: econwiemann
Repositories: 2
Profile: https://github.com/thomaswiemann

Economics PhD Student

GitHub Events

Total

Last Year

Packages

Total packages: 1
Total downloads:
- cran 202 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 1
Total maintainers: 1

cran.r-project.org: kcmeans

Conditional Expectation Function Estimation with K-Conditional-Means

Homepage: https://github.com/thomaswiemann/kcmeans
Documentation: http://cran.r-project.org/web/packages/kcmeans/kcmeans.pdf
License: GPL (≥ 3)
Latest release: 0.1.0
published over 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 202 Last month

Rankings

Dependent repos count: 23.8%

Forks count: 27.8%

Dependent packages count: 28.6%

Stargazers count: 30.8%

Average: 39.3%

Downloads: 85.7%

Maintainers (1)

wiemann@uchicago.edu

Last synced: 11 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action v4.4.1 composite
actions/checkout v3 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/test-coverage.yaml actions

actions/checkout v3 composite
actions/upload-artifact v3 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

DESCRIPTION cran

R >= 3.6 depends
Ckmeans.1d.dp * imports
MASS * imports
Matrix * imports
stats * imports
covr * suggests
knitr * suggests
rmarkdown * suggests
testthat >= 3.0.0 suggests

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science