PAGFL

Simultaneously identify latent group structures and estimate group-specific coefficients in (time-varying) panel data models

https://github.com/paul-haimerl/pagfl


Keywords

classification panel-data-model time-varying-coefficients
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 3
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# PAGFL 



[![CRAN status](https://www.r-pkg.org/badges/version/PAGFL)](https://cran.r-project.org/package=PAGFL) [![CRAN_Downloads_Badge](https://cranlogs.r-pkg.org/badges/grand-total/PAGFL)](https://cran.r-project.org/package=PAGFL) [![License_GPLv3_Badge](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://www.gnu.org/licenses/gpl-3.0.html) [![R-CMD-check](https://github.com/Paul-Haimerl/PAGFL/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/Paul-Haimerl/PAGFL/actions/workflows/R-CMD-check.yaml) [![codecov](https://codecov.io/gh/Paul-Haimerl/PAGFL/graph/badge.svg?token=22WHU5SU63)](https://app.codecov.io/gh/Paul-Haimerl/PAGFL)



Unobservable group structures are a common challenge in panel data analysis. Disregarding group-level heterogeneity can introduce bias. Conversely, estimating individual coefficients for each cross-sectional unit is inefficient and may lead to high uncertainty. Furthermore, neglecting time-variance in slope coefficients can lead to inconsistent and misleading results, even in large panels.

This package efficiently addresses these issues by implementing the pairwise adaptive group fused Lasso (*PAGFL*) by Mehrabani ([2023](https://doi.org/10.1016/j.jeconom.2022.12.002)) and the *time-varying PAGFL* by Haimerl et al. ([2025](https://doi.org/10.48550/arXiv.2503.23165)). *PAGFL* is a regularizer that identifies latent group structures and estimates group-specific (time-varying) coefficients in a single step. 

The `PAGFL` package makes these powerful procedures easy to use.

## Installation

Always stay up-to-date with the development version (1.1.4) of `PAGFL` from [GitHub](https://github.com/):

```{r message=FALSE, warning=FALSE}
# install.packages("devtools")
devtools::install_github("Paul-Haimerl/PAGFL")
library(PAGFL)
```

The stable version (1.1.3) is available on CRAN:

```r
install.packages("PAGFL")
```

## Data

The `PAGFL` package includes a function that automatically simulates a panel data set with a group structure in the slope coefficients:

```{r}
# Simulate a simple panel with three distinct groups and two exogenous explanatory variables
set.seed(1)
sim <- sim_DGP(N = 20, n_periods = 150, p = 2, n_groups = 3)
data <- sim$data
```

The simulated data follow the linear panel data model

$$y_{it} = \beta_i^{0 \prime} x_{it} + \gamma_i + u_{it}, \quad i = 1, \dots, N, \quad t = 1, \dots, T,$$

where $y_{it}$ is a scalar dependent variable, $x_{it}$ a $p \times 1$ vector of explanatory variables, and $\gamma_i$ an individual fixed effect. The $p$-dimensional vector of slope coefficients $\beta_i^0$ follows the (latent) group structure

$$\beta_{i}^0 = \sum_{k = 1}^K \alpha_k 1 \{i \in G_k \},$$

with $\cup_{k = 1}^K G_k = \{1, \dots, N \}$, and $G_k \cap G_j = \emptyset$ as well as $\| \alpha_k - \alpha_j \| \neq 0$ for any $k \neq j$, $k,j = 1, \dots, K$ (see Mehrabani [2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 2).

`sim_DGP()` also nests, among others, all DGPs employed in the simulation study of Mehrabani ([2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 6).
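
To make the model and the group structure concrete, the following base-R sketch builds a small grouped panel by hand. It is not how `sim_DGP()` works internally; the slope values in `alpha`, the standard normal draws, and the name `toy_panel` are illustrative assumptions.

```r
set.seed(123)
N <- 20           # cross-sectional units
n_periods <- 150  # time periods T
p <- 2            # explanatory variables
K <- 3            # latent groups

# Hypothetical group-specific slopes alpha_k (one row per group)
alpha <- matrix(c(-1.0,  1.0,
                   0.0,  2.0,
                   1.5, -0.5), nrow = K, byrow = TRUE)

# Assign every unit to exactly one group, so the groups partition {1, ..., N}
groups <- sample(rep_len(1:K, N))
beta <- alpha[groups, , drop = FALSE]  # N x p: beta_i = alpha_k if i is in G_k

# Long format: the unit index varies slowest, the time index fastest
i_index <- rep(1:N, each = n_periods)
t_index <- rep(1:n_periods, times = N)
X <- matrix(rnorm(N * n_periods * p), ncol = p)
gamma <- rnorm(N)          # individual fixed effects
u <- rnorm(N * n_periods)  # idiosyncratic errors

# y_it = beta_i' x_it + gamma_i + u_it
y <- rowSums(X * beta[i_index, ]) + gamma[i_index] + u

toy_panel <- data.frame(y = y, a = X[, 1], b = X[, 2], i_index, t_index)
```

Units that share a group share one row of `alpha`, which is the structure the fused penalty is meant to recover.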

## Applying PAGFL

To execute the PAGFL procedure, pass the dependent and independent variables, the number of time periods, and a penalization parameter $\lambda$.

```{r}
estim <- pagfl(y ~ ., data = data, n_periods = 150, lambda = 20, verbose = FALSE)
summary(estim)
```

`pagfl()` returns an object of class `pagfl`, which holds

1.  `model`: A `data.frame` containing the dependent and explanatory variables as well as individual and time indices (if provided).
2.  `coefficients`: A $\hat{K} \times p$ matrix of the post-Lasso group-specific parameter estimates.
3.  `groups`: A `list` containing (i) the total number of groups $\hat{K}$ and (ii) a vector of estimated group memberships $(\hat{g}_1, \dots, \hat{g}_N)$, where $\hat{g}_i = k$ if $i$ is assigned to group $k$.
4.  `residuals`: A vector of residuals of the demeaned model.
5.  `fitted`: A vector of fitted values of the demeaned model.
6.  `args`: A list of additional arguments.
7.  `IC`: A `list` containing (i) the value of the IC, (ii) the employed tuning parameter $\lambda$, and (iii) the mean squared error.
8.  `convergence`: A `list` containing (i) a logical indicating whether convergence was achieved and (ii) the number of executed *ADMM* algorithm iterations.
9.  `call`: The function call.

> [!TIP] 
> `pagfl` objects support a variety of useful generic functions like `summary()`, `fitted()`, `resid()`, `df.residual()`, `formula()`, and `coef()`.

```{r, fig.width = 7, fig.height = 5}
estim_fit <- fitted(estim)
```

> [!IMPORTANT] 
> Selecting a $\lambda$ value a priori can be tricky. For instance, it seems like `lambda = 20` is too high since the number of groups $K$ is underestimated.

We suggest iterating over a comprehensive range of candidate values to trace out the correct model. To specify a suitable grid, create a logarithmic sequence ranging from a value close to 0 to a penalty parameter that induces slope homogeneity (i.e., $\widehat{K} = 1$). The resulting $\lambda$ grid vector can be passed in place of any specific value, and a BIC-type information criterion (IC) selects the best-fitting parameter.

Furthermore, it is possible to supply a `data.frame` with named variables and a formula that selects variables from that `data.frame`, just like in the base `lm()` function. If the explanatory variables are named, these names also appear in the output.

```{r}
colnames(data)[-1] <- c("a", "b")

lambda_set <- exp(log(10) * seq(log10(1e-4), log10(10), length.out = 10))
estim_set <- pagfl(y ~ a + b, data = data, n_periods = 150, lambda = lambda_set, verbose = FALSE)
summary(estim_set)
```
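
The grid used above is log-spaced: the exponents are equally spaced, so neighbouring penalties differ by a constant factor rather than a constant amount. The base-R sketch below shows an equivalent and slightly more direct construction. The names `lambda_min`, `lambda_max`, and `n_lambda` are illustrative, and whether `lambda_max = 10` is large enough to induce a single group depends on the data.

```r
lambda_min <- 1e-4  # close to, but strictly above, zero
lambda_max <- 10    # assumed large enough to yield a single group
n_lambda <- 10

# Equally spaced exponents on the log10 scale
lambda_grid <- 10^seq(log10(lambda_min), log10(lambda_max), length.out = n_lambda)

# The same values as the exp(log(10) * ...) construction
lambda_alt <- exp(log(10) * seq(log10(1e-4), log10(10), length.out = 10))
same_grid <- isTRUE(all.equal(lambda_grid, lambda_alt))

# Constant multiplicative step between neighbouring candidates
ratios <- lambda_grid[-1] / lambda_grid[-n_lambda]
```

A log-spaced grid places most candidates at small penalties, where the estimated number of groups tends to change quickly.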

When, as above, the estimation method is left unspecified, `pagfl()` defaults to penalized Least Squares (*PLS*), `method = "PLS"` (Mehrabani, [2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 2.2). *PLS* is very efficient but requires weakly exogenous regressors. Endogenous predictors can be accounted for by employing a penalized Generalized Method of Moments (*PGMM*) routine in combination with exogenous instruments $Z$.
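
The reason instruments matter can be seen in a deliberately simple cross-sectional sketch. It is not the *PGMM* estimator, only a textbook just-identified instrumental-variable estimator in base R, and all coefficients and names are illustrative assumptions.

```r
set.seed(1)
n <- 5000
beta <- 1.5

z <- rnorm(n)                      # exogenous instrument
e <- rnorm(n)                      # structural error
x <- 0.8 * z + 0.6 * e + rnorm(n)  # regressor correlated with e: endogenous
y <- beta * x + e

# Least squares ignores the correlation between x and e and is biased
beta_ols <- sum(x * y) / sum(x^2)

# The instrument only uses variation in x that is unrelated to e
beta_iv <- sum(z * y) / sum(z * x)

c(ols = beta_ols, iv = beta_iv)
```

In this setup the least-squares estimate drifts away from `beta`, while the instrumented estimate stays close to it.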

Next, simulate a slightly more elaborate panel with endogenous regressors and a dynamic component, and apply *PGMM*. For dynamic panels, we recommend the Jackknife bias correction proposed by Dhaene and Jochmans ([2015](https://doi.org/10.1093/restud/rdv007)).

```{r}
# Generate a panel where the predictors X correlate with the cross-sectional innovation,
# but can be instrumented with q = 3 variables in Z. Furthermore, include GARCH(1,1)
# innovations, an AR lag of the dependent variable, and specific group sizes
sim_endo <- sim_DGP(
  N = 20, n_periods = 200, p = 2, n_groups = 3, group_proportions = c(0.3, 0.3, 0.4),
  error_spec = "GARCH", q = 3, dynamic = TRUE
)
data_endo <- sim_endo$data
Z <- sim_endo$Z

# Note that the method PGMM and the instrument matrix Z need to be passed
estim_endo <- pagfl(
  y ~ ., data = data_endo, n_periods = 200, lambda = 2, method = "PGMM", Z = Z,
  bias_correc = TRUE, max_iter = 50e3, verbose = FALSE
)
summary(estim_endo)
```
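
To illustrate the idea behind the bias correction, the base-R sketch below applies the half-panel jackknife of Dhaene and Jochmans (2015) to a plain AR(1) panel with fixed effects. It is a stand-alone illustration under simplifying assumptions (one homogeneous coefficient, standard normal errors, hypothetical helper `within_ar1()`); the routine behind `bias_correc = TRUE` may differ in detail.

```r
set.seed(42)
N <- 500
n_periods <- 20
rho <- 0.5
burn <- 50

# Simulate y_it = gamma_i + rho * y_i,t-1 + u_it with a burn-in phase
gamma <- rnorm(N)
y <- matrix(0, nrow = N, ncol = burn + n_periods + 1)
for (s in 2:ncol(y)) {
  y[, s] <- gamma + rho * y[, s - 1] + rnorm(N)
}
y <- y[, (burn + 1):ncol(y)]  # N x (n_periods + 1), first column is the initial lag

# Within (fixed-effects) estimator of rho on a block of consecutive periods
within_ar1 <- function(y_block) {
  y_lag <- y_block[, -ncol(y_block), drop = FALSE]
  y_cur <- y_block[, -1, drop = FALSE]
  y_lag <- y_lag - rowMeans(y_lag)  # remove the fixed effect by demeaning
  y_cur <- y_cur - rowMeans(y_cur)
  sum(y_lag * y_cur) / sum(y_lag^2)
}

rho_full <- within_ar1(y)
half <- n_periods / 2
rho_h1 <- within_ar1(y[, 1:(half + 1)])
rho_h2 <- within_ar1(y[, (half + 1):(n_periods + 1)])

# Half-panel jackknife: twice the full-sample estimate minus the half-sample average
rho_jack <- 2 * rho_full - (rho_h1 + rho_h2) / 2

c(within = rho_full, jackknife = rho_jack)
```

The within estimator is biased downwards in short dynamic panels, and the jackknife removes the leading term of that bias.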

> [!TIP] 
> `pagfl()` lets you select a minimum group size, adjust the efficiency vs. accuracy trade-off of the iterative estimation algorithm, and modify a list of further settings. Visit the documentation `?pagfl()` for more information.

## The Time-varying PAGFL

The package also includes the functions `sim_tv_DGP()` and `tv_pagfl()`, which generate and estimate grouped panel data models with smoothly time-varying coefficients $\beta_i (t/T)$. As detailed in Haimerl et al. ([2025](https://doi.org/10.48550/arXiv.2503.23165)), the functional coefficients follow the group structure

$$\beta_{i} (t/T) = \sum_{k = 1}^K \alpha_k (t/T) 1 \{i \in G_k \}.$$ 

The time-varying coefficients are estimated using polynomial B-spline functions, yielding a penalized sieve estimator (*PSE*) (see [Haimerl et al., 2025](https://doi.org/10.48550/arXiv.2503.23165), sec. 2).
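
The sieve idea can be sketched with the `splines` package that ships with R: a smooth coefficient function is approximated by a linear combination of B-spline basis functions evaluated at $t/T$. The degree, the knot positions, and the target function `alpha_true` below are illustrative assumptions, and the basis that `tv_pagfl()` builds internally may differ.

```r
library(splines)  # part of the standard R distribution

n_periods <- 100
u <- (1:n_periods) / n_periods  # rescaled time t/T

# Cubic B-spline basis with three interior knots
degree <- 3
knots <- c(0.25, 0.5, 0.75)
B <- bs(u, knots = knots, degree = degree, intercept = TRUE,
        Boundary.knots = c(0, 1))

# A hypothetical smooth coefficient function and its least-squares spline fit
alpha_true <- 1 + sin(pi * u)
pi_hat <- qr.solve(B, alpha_true)  # spline control points
alpha_fit <- as.vector(B %*% pi_hat)

dim(B)                             # periods x (interior knots + degree + 1)
max(abs(alpha_fit - alpha_true))   # approximation error on the grid
```

Estimating a whole coefficient function thus reduces to estimating a short vector of control points, which is what the penalty then groups across units.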

```{r, fig.width = 9, fig.height = 4}
# Simulate a time-varying panel with a trend and a group pattern
N <- 20
n_periods <- 100
tv_sim <- sim_tv_DGP(N = N, n_periods = n_periods, sd_error = 1, intercept = TRUE, p = 1)
tv_data <- tv_sim$data

tv_estim <- tv_pagfl(y ~ 1, data = tv_data, n_periods = n_periods, lambda = 5, verbose = FALSE)
summary(tv_estim)
```

`tv_pagfl()` returns an object of class `tvpagfl`, which contains

1.  `model`: A `data.frame` containing the dependent and explanatory variables as well as individual and time indices (if provided).
2.  `coefficients`: A `list` holding (i) a $T \times p^{(1)} \times \hat{K}$ array of the post-Lasso group-specific functional coefficients and (ii) a $\hat{K} \times p^{(2)}$ matrix of time-constant parameter estimates (when running a mixed time-varying panel data model).
3.  `groups`: A `list` containing (i) the total number of groups $\hat{K}$ and (ii) a vector of estimated group memberships $(\hat{g}_1, \dots, \hat{g}_N)$, where $\hat{g}_i = k$ if $i$ is assigned to group $k$.
4.  `residuals`: A vector of residuals of the demeaned model.
5.  `fitted`: A vector of fitted values of the demeaned model.
6.  `args`: A list of additional arguments.
7.  `IC`: A `list` containing (i) the value of the IC, (ii) the employed tuning parameter $\lambda$, and (iii) the mean squared error.
8.  `convergence`: A `list` containing (i) a logical indicating whether convergence was achieved and (ii) the number of executed *ADMM* algorithm iterations.
9.  `call`: The function call.

> [!TIP] 
> Again, `tvpagfl` objects support the generic `summary()`, `fitted()`, `resid()`, `df.residual()`, `formula()`, and `coef()` functions.

In empirical settings, unbalanced panel datasets are common; nevertheless, time-varying coefficient functions remain identifiable (see Haimerl et al. [2025](https://doi.org/10.48550/arXiv.2503.23165), Appendix D). The spline functions interpolate missing time periods, and the group structure compensates for missing observations in any individual cross-sectional unit. However, unbalanced datasets require explicit indicator variables that declare the cross-sectional unit and the time period each observation belongs to.

Let's delete roughly 30% of the observations, add indicator variables, and run `tv_pagfl()` again.

```{r, fig.width = 9, fig.height = 4}
# Draw a keep indicator: each observation is retained with probability 0.7
keep_index <- as.logical(rbinom(n = N * n_periods, prob = 0.7, size = 1))
# Construct cross-sectional and time indicator variables
tv_data$i_index <- rep(1:N, each = n_periods)
tv_data$t_index <- rep(1:n_periods, N)
# Drop the observations that were not retained
tv_data <- tv_data[keep_index, ]
# Apply the time-varying PAGFL to an unbalanced panel
tv_estim_unbalanced <- tv_pagfl(y ~ 1, data = tv_data, index = c("i_index", "t_index"), lambda = 5, verbose = FALSE)
summary(tv_estim_unbalanced)
```
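
Before estimating on an unbalanced panel, it can help to check how the deletions are spread over units and periods. The base-R sketch below rebuilds the index grid on its own (without any package data) and summarizes the resulting pattern; the object names are illustrative.

```r
set.seed(7)
N <- 20
n_periods <- 100

# Full (balanced) index grid: unit index varies slowest, time index fastest
panel <- data.frame(
  i_index = rep(1:N, each = n_periods),
  t_index = rep(1:n_periods, times = N)
)

# Keep each observation with probability 0.7, i.e. drop roughly 30%
keep <- as.logical(rbinom(n = nrow(panel), size = 1, prob = 0.7))
panel_unbal <- panel[keep, ]

# Observations per unit and per period after the deletion
obs_per_unit <- table(factor(panel_unbal$i_index, levels = 1:N))
obs_per_period <- table(factor(panel_unbal$t_index, levels = 1:n_periods))

# Share of deleted observations and a simple balance check
share_deleted <- 1 - nrow(panel_unbal) / nrow(panel)
is_balanced <- all(obs_per_unit == n_periods)
```

Because the row positions no longer reveal the unit or the period after deletion, the explicit `i_index` and `t_index` columns are what identify each observation.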

> [!TIP] 
> `tv_pagfl()` offers many more options than shown here. For example, it is possible to adjust the polynomial degree and the number of interior knots in the spline basis system, or estimate a panel data model with a mix of time-varying and time-constant coefficients. See `?tv_pagfl()` for details.

## Observing a Group Structure

It may occur that the group structure is known and does not need to be estimated. In such instances, `grouped_plm()` and `grouped_tv_plm()` make it easy to estimate a (time-varying) linear panel data model given an observed grouping.

```{r}
groups <- sim$groups
estim_oracle <- grouped_plm(y ~ ., data = data, n_periods = 150, groups = groups, method = "PLS")
summary(estim_oracle)
```

Apart from not estimating the group structure, `grouped_plm()` and `grouped_tv_plm()` behave identically to `pagfl()` and `tv_pagfl()`, respectively.
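
With a known grouping and exogenous regressors, the estimation problem reduces to a fixed-effects regression per group. The base-R sketch below shows that logic for a single regressor: demean within each unit, then pool least squares within each group. It is a simplified illustration with hypothetical slopes, not the code path of `grouped_plm()`.

```r
set.seed(99)
N <- 30
n_periods <- 50
groups <- rep(1:3, each = 10)  # observed group memberships
alpha <- c(-1, 0.5, 2)         # hypothetical group-specific slopes
i_index <- rep(1:N, each = n_periods)

x <- rnorm(N * n_periods)
gamma <- rnorm(N)[i_index]     # individual fixed effects
y <- alpha[groups[i_index]] * x + gamma + rnorm(N * n_periods)

# Within transformation: subtract unit-specific time averages to remove gamma_i
x_dm <- x - ave(x, i_index)
y_dm <- y - ave(y, i_index)

# Pooled least squares on the demeaned data, separately for each group
g_obs <- groups[i_index]
alpha_hat <- sapply(1:3, function(k) {
  sel <- g_obs == k
  sum(x_dm[sel] * y_dm[sel]) / sum(x_dm[sel]^2)
})

round(alpha_hat, 2)
```

Pooling within groups uses all observations of a group for each slope, which is the efficiency gain over unit-by-unit estimation.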

## Future Outlook

The package is still under active development. Future versions are planned to include

-   Global coefficients
-   Un-penalized individual coefficients
-   Inference methods

Not an R user? Worry not: an equivalent Python library is in the works.

Feel free to [reach out](mailto:paul.haimerl@econ.au.dk) if you have any suggestions or questions.

## References

Dhaene, G., & Jochmans, K. (2015). Split-panel jackknife estimation of fixed-effect models. *The Review of Economic Studies*, 82(3), 991-1030. DOI: [doi.org/10.1093/restud/rdv007](https://doi.org/10.1093/restud/rdv007)

Haimerl, P., Smeekes, S., & Wilms, I. (2025). Estimation of latent group structures in time-varying panel data models. *arXiv preprint arXiv:2503.23165*. DOI: [doi.org/10.48550/arXiv.2503.23165](https://doi.org/10.48550/arXiv.2503.23165)

Mehrabani, A. (2023). Estimation and identification of latent group structures in panel data. *Journal of Econometrics*, 235(2), 1464-1482. DOI: [doi.org/10.1016/j.jeconom.2022.12.002](https://doi.org/10.1016/j.jeconom.2022.12.002)

Owner

  • Name: Paul Haimerl
  • Login: Paul-Haimerl
  • Kind: user
  • Location: Maastricht

M.Sc. Economic and Financial Research at Maastricht University

GitHub Events

Total
  • Release event: 2
  • Watch event: 3
  • Push event: 28
  • Pull request event: 2
  • Fork event: 2
  • Create event: 2
Last Year
  • Release event: 2
  • Watch event: 3
  • Push event: 28
  • Pull request event: 2
  • Fork event: 2
  • Create event: 2

Issues and Pull Requests


All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 20 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 20 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • eddelbuettel (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 231 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
cran.r-project.org: PAGFL

Joint Estimation of Latent Groups and Group-Specific Coefficients in Panel Data Models

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 231 Last month
Rankings
Dependent packages count: 28.2%
Dependent repos count: 36.2%
Average: 49.7%
Downloads: 84.8%
Maintainers (1)

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • Rcpp * imports
  • pbapply * imports