PAGFL
Simultaneously identify latent group structures and estimate group-specific coefficients in (time-varying) panel data models
Keywords
classification
panel-data-model
time-varying-coefficients
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: Paul-Haimerl
- License: agpl-3.0
- Language: R
- Default Branch: main
- Homepage: https://doi.org/10.48550/arXiv.2503.23165
- Size: 1.98 MB
Statistics
- Stars: 6
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 3
Created about 2 years ago · Last pushed 8 months ago
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# PAGFL
Unobservable group structures are a common challenge in panel data analysis. Disregarding group-level heterogeneity can introduce bias. Conversely, estimating individual coefficients for each cross-sectional unit is inefficient and may lead to high uncertainty. Furthermore, neglecting time-variance in slope coefficients can lead to inconsistent and misleading results, even in large panels.
This package efficiently addresses these issues by implementing the pairwise adaptive group fused Lasso (*PAGFL*) by Mehrabani ([2023](https://doi.org/10.1016/j.jeconom.2022.12.002)) and the *time-varying PAGFL* by Haimerl et al. ([2025](https://doi.org/10.48550/arXiv.2503.23165)). *PAGFL* is a regularizer that identifies latent group structures and estimates group-specific (time-varying) coefficients in a single step.
The `PAGFL` package makes these powerful procedures easy to use.
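As a rough sketch of what "in a single step" means: the criterion in Mehrabani (2023) combines a least-squares fit with a penalty on all pairwise differences between individual coefficient vectors, so that units whose coefficients are fused to a common value form a group. The notation below ($\tilde{y}_{it}$ and $\tilde{x}_{it}$ for demeaned variables, $\dot{\omega}_{ij}$ for adaptive weights built from a preliminary estimate) is introduced here for illustration only, and the exact scaling should be taken from the paper rather than from this sketch.

$$\frac{1}{NT} \sum_{i = 1}^N \sum_{t = 1}^T \left( \tilde{y}_{it} - \beta_i^{\prime} \tilde{x}_{it} \right)^2 + \frac{\lambda}{N} \sum_{i = 1}^{N - 1} \sum_{j = i + 1}^N \dot{\omega}_{ij} \| \beta_i - \beta_j \|$$

A larger $\lambda$ fuses more pairs and therefore yields fewer groups; $\lambda = 0$ leaves every unit with its own coefficients.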
## Installation
Always stay up-to-date with the development version (1.1.4) of `PAGFL` from [GitHub](https://github.com/):
```{r message=FALSE, warning=FALSE}
# install.packages("devtools")
devtools::install_github("Paul-Haimerl/PAGFL")
library(PAGFL)
```
The stable version (1.1.3) is available on CRAN:
```
install.packages("PAGFL")
```
## Data
The `PAGFL` package includes a function that automatically simulates a panel data set with a group structure in the slope coefficients:
```{r}
# Simulate a simple panel with three distinct groups and two exogenous explanatory variables
set.seed(1)
sim <- sim_DGP(N = 20, n_periods = 150, p = 2, n_groups = 3)
data <- sim$data
```
$$y_{it} = \beta_i^{0 \prime} x_{it} + \gamma_i + u_{it}, \quad i = 1, \dots, N, \quad t = 1, \dots, T,$$
where $y_{it}$ is a scalar dependent variable, $x_{it}$ a $p \times 1$ vector of explanatory variables, and $\gamma_i$ an individual fixed effect. The $p$-dimensional vector of slope coefficients $\beta_i^0$ follows the (latent) group structure
$$\beta_{i}^0 = \sum_{k = 1}^K \alpha_k 1 \{i \in G_k \},$$
with $\cup_{k = 1}^K G_k = \{1, \dots, N \}$, and $G_k \cap G_j = \emptyset$ as well as $\| \alpha_k - \alpha_j \| \neq 0$ for any $k \neq j$, $k,j = 1, \dots, K$ (see Mehrabani [2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 2).
`sim_DGP()` also nests, among others, all DGPs employed in the simulation study of Mehrabani ([2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 6).
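To make the group structure concrete, the following base-R sketch builds a small panel by hand according to the model above. It is not the code behind `sim_DGP()`; the group coefficients in `alpha`, the membership vector `g`, and all other names are hypothetical choices made for illustration.

```r
set.seed(1)
N <- 6          # cross-sectional units
n_periods <- 50 # time periods per unit
p <- 2          # number of regressors

# Hypothetical group-specific slopes alpha_k (one row per group, K = 3)
alpha <- rbind(c(1, -1), c(0, 2), c(-2, 0.5))
# Hypothetical group memberships g_i
g <- c(1, 1, 2, 2, 3, 3)
# beta_i = sum_k alpha_k 1{i in G_k}: every unit inherits its group's slopes
beta <- alpha[g, , drop = FALSE]
# Individual fixed effects gamma_i
gamma <- rnorm(N)

# y_it = beta_i' x_it + gamma_i + u_it
panel <- do.call(rbind, lapply(seq_len(N), function(i) {
  x <- matrix(rnorm(n_periods * p), n_periods, p)
  y <- drop(x %*% beta[i, ]) + gamma[i] + rnorm(n_periods)
  data.frame(i = i, t = seq_len(n_periods), y = y, x1 = x[, 1], x2 = x[, 2])
}))
head(panel)
```

Units 1 and 2 share identical slopes, as do units 3 and 4 and units 5 and 6; only the fixed effects and the errors differ within a group.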
## Applying PAGFL
To execute the PAGFL procedure, pass the dependent and independent variables, the number of time periods, and a penalization parameter $\lambda$.
```{r}
estim <- pagfl(y ~ ., data = data, n_periods = 150, lambda = 20, verbose = FALSE)
summary(estim)
```
`pagfl()` returns an object of class `pagfl`, which holds
1. `model`: A `data.frame` containing the dependent and explanatory variables as well as individual and time indices (if provided).
2. `coefficients`: A $\hat{K} \times p$ matrix of the post-Lasso group-specific parameter estimates.
3. `groups`: A `list` containing (i) the total number of groups $\hat{K}$ and (ii) a vector of estimated group memberships $(\hat{g}_1, \dots, \hat{g}_N)$, where $\hat{g}_i = k$ if $i$ is assigned to group $k$.
4. `residuals`: A vector of residuals of the demeaned model.
5. `fitted`: A vector of fitted values of the demeaned model.
6. `args`: A list of additional arguments.
7. `IC`: A `list` containing (i) the value of the IC, (ii) the employed tuning parameter $\lambda$, and (iii) the mean squared error.
8. `convergence`: A `list` containing (i) a logical indicating whether convergence was achieved and (ii) the number of executed *ADMM* algorithm iterations.
9. `call`: The function call.
> [!TIP]
> `pagfl` objects support a variety of useful generic functions like `summary()`, `fitted()`, `resid()`, `df.residual()`, `formula()`, and `coef()`.
```{r, fig.width = 7, fig.height = 5}
estim_fit <- fitted(estim)
```
> [!IMPORTANT]
> Selecting a $\lambda$ value a priori can be tricky. For instance, it seems like `lambda = 20` is too high since the number of groups $K$ is underestimated.
We suggest iterating over a comprehensive range of candidate values to trace out the correct model. To specify a suitable grid, create a logarithmic sequence that starts just above 0 (a logarithmic grid cannot include 0 itself) and ends at a penalty parameter that induces slope homogeneity (i.e., $\widehat{K} = 1$). The resulting $\lambda$ grid vector can be passed in place of any specific value, and a BIC-type information criterion (IC) selects the best-fitting parameter.
Furthermore, it is also possible to supply a `data.frame` with named variables and choose a specific formula that selects the variables in that `data.frame`, just like in the base `lm()` function. If the explanatory variables in the `data.frame` are named, these names also appear in the output.
```{r}
colnames(data)[-1] <- c("a", "b")
lambda_set <- exp(log(10) * seq(log10(1e-4), log10(10), length.out = 10))
estim_set <- pagfl(y ~ a + b, data = data, n_periods = 150, lambda = lambda_set, verbose = FALSE)
summary(estim_set)
```
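The grid expression used above can look opaque at first. The short base-R check below, a sketch that does not call the package, shows that it is simply ten points equally spaced on the log10 scale between `1e-4` and `10`, and that an arguably more readable one-liner produces the same vector. The names `lambda_a`, `lambda_b`, and `ratios` are illustrative.

```r
# Grid exactly as written in the chunk above
lambda_a <- exp(log(10) * seq(log10(1e-4), log10(10), length.out = 10))
# Equivalent formulation: powers of ten with equally spaced exponents
lambda_b <- 10^seq(-4, 1, length.out = 10)
all.equal(lambda_a, lambda_b)

# On a logarithmic grid, the ratio between neighboring values is constant
ratios <- lambda_b[-1] / lambda_b[-length(lambda_b)]
range(ratios)
```

Because the spacing is multiplicative, the grid places most candidates at small penalties, where the estimated number of groups tends to change fastest.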
When, as above, the specific estimation method is left unspecified, `pagfl()` defaults to penalized Least Squares (*PLS*) `method = 'PLS'` (Mehrabani, [2023](https://doi.org/10.1016/j.jeconom.2022.12.002), sec. 2.2). *PLS* is very efficient but requires weakly exogenous regressors. However, even endogenous predictors can be accounted for by employing a penalized Generalized Method of Moments (*PGMM*) routine in combination with exogenous instruments $Z$.
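To illustrate why endogenous predictors call for instruments, here is a minimal base-R sketch with a single cross-section and one instrument. It is deliberately not the *PGMM* routine of the package: it only contrasts ordinary least squares with a simple instrumental-variable estimator in a hypothetical setup where the regressor `x` is correlated with the error `u` and `z` is a valid instrument. All names and parameter values are assumptions.

```r
set.seed(1)
n <- 5000
z <- rnorm(n)                # exogenous instrument
u <- rnorm(n)                # structural error
x <- z + 0.8 * u + rnorm(n)  # regressor correlated with u: endogenous
y <- 1 * x + u               # true slope equals 1

# Least squares ignores the correlation between x and u
beta_ols <- cov(x, y) / var(x)
# The simple IV estimator only uses the variation in x induced by z
beta_iv <- cov(z, y) / cov(z, x)

c(ols = beta_ols, iv = beta_iv)
```

In this design the least-squares slope is pushed away from the true value of 1, whereas the instrumented estimate stays close to it.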
Specify a slightly more elaborate endogenous and dynamic panel data set and apply *PGMM*. When encountering a dynamic panel data set, we recommend using a Jackknife bias correction, as proposed by Dhaene and Jochmans ([2015](https://doi.org/10.1093/restud/rdv007)).
```{r}
# Generate a panel where the predictors X correlate with the cross-sectional innovation,
# but can be instrumented with q = 3 variables in Z. Furthermore, include GARCH(1,1)
# innovations, an AR lag of the dependent variable, and specific group sizes
sim_endo <- sim_DGP(
  N = 20, n_periods = 200, p = 2, n_groups = 3, group_proportions = c(0.3, 0.3, 0.4),
  error_spec = "GARCH", q = 3, dynamic = TRUE
)
data_endo <- sim_endo$data
Z <- sim_endo$Z
# Note that the method PGMM and the instrument matrix Z need to be passed
estim_endo <- pagfl(
  y ~ ., data = data_endo, n_periods = 200, lambda = 2, method = "PGMM", Z = Z,
  bias_correc = TRUE, max_iter = 50e3, verbose = FALSE
)
summary(estim_endo)
```
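The idea behind the split-panel jackknife of Dhaene and Jochmans (2015) can be sketched in a few lines of base R. The example below is an illustration under simplifying assumptions (a homogeneous AR(1) panel with fixed effects, no grouping, no instruments) and is not the package's internal implementation behind `bias_correc`. The helper names `simulate_unit()` and `within_ar1()` are hypothetical.

```r
set.seed(42)
N <- 500      # cross-sectional units
T_len <- 20   # usable time periods per unit
rho <- 0.5    # true autoregressive coefficient

# Simulate y_it = gamma_i + rho * y_i,t-1 + u_it, discarding a burn-in period
simulate_unit <- function(T_len, rho, burn = 50) {
  gamma_i <- rnorm(1)
  y <- numeric(T_len + burn + 1)
  for (t in 2:length(y)) y[t] <- gamma_i + rho * y[t - 1] + rnorm(1)
  tail(y, T_len + 1)
}
Y <- t(replicate(N, simulate_unit(T_len, rho)))  # N x (T_len + 1) matrix

# Within (fixed-effects) estimator of rho using the periods in `cols`
within_ar1 <- function(Y, cols) {
  y_now <- Y[, cols[-1], drop = FALSE]
  y_lag <- Y[, cols[-length(cols)], drop = FALSE]
  y_now <- y_now - rowMeans(y_now)  # remove the fixed effect
  y_lag <- y_lag - rowMeans(y_lag)
  sum(y_lag * y_now) / sum(y_lag^2)
}

rho_fe <- within_ar1(Y, 1:(T_len + 1))                 # full panel
rho_h1 <- within_ar1(Y, 1:(T_len / 2 + 1))             # first half
rho_h2 <- within_ar1(Y, (T_len / 2 + 1):(T_len + 1))   # second half
# Half-panel jackknife: twice the full estimate minus the average half estimate
rho_jk <- 2 * rho_fe - (rho_h1 + rho_h2) / 2

c(within = rho_fe, jackknife = rho_jk)
```

The within estimator is biased downward in short dynamic panels, and the bias is larger in each half panel; the jackknife uses that difference to shift the full-panel estimate upward.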
> [!TIP]
> `pagfl()` lets you select a minimum group size, adjust the efficiency vs. accuracy trade-off of the iterative estimation algorithm, and modify a list of further settings. Visit the documentation `?pagfl()` for more information.
## The Time-varying PAGFL
The package also includes the functions `sim_tv_DGP()` and `tv_pagfl()`, which generate and estimate grouped panel data models with smoothly time-varying coefficients $\beta_i (t/T)$. As detailed in Haimerl et al. ([2025](https://doi.org/10.48550/arXiv.2503.23165)), the functional coefficients follow the group structure
$$\beta_{i} (t/T) = \sum_{k = 1}^K \alpha_k (t/T) 1 \{i \in G_k \}.$$
The time-varying coefficients are estimated using polynomial B-spline functions, yielding a penalized sieve estimator (*PSE*) (see [Haimerl et al., 2025](https://doi.org/10.48550/arXiv.2503.23165), sec. 2).
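The sieve idea can be sketched for a single time series with the `splines` package that ships with R. The snippet is an illustration under stated assumptions, not the estimator implemented in `tv_pagfl()`: there is no penalty and no grouping, and the coefficient function `beta_true`, the basis dimension `df = 7`, and all variable names are hypothetical choices.

```r
library(splines)

set.seed(1)
T_len <- 100
u <- (1:T_len) / T_len            # rescaled time t/T
beta_true <- sin(2 * pi * u)      # hypothetical smooth coefficient function
x <- rnorm(T_len)
y <- beta_true * x + rnorm(T_len, sd = 0.1)

# Cubic B-spline basis evaluated at t/T (7 basis functions, 3 interior knots)
B <- unclass(bs(u, df = 7, degree = 3, intercept = TRUE))

# beta(t/T) is approximated by B(t/T)' pi, hence y_t is regressed on x_t * B(t/T)
X_sieve <- B * x                  # scales row t of the basis by x_t
pi_hat <- coef(lm.fit(X_sieve, y))
beta_hat <- drop(B %*% pi_hat)    # estimated coefficient path

mean((beta_hat - beta_true)^2)
```

The time-varying problem thus turns into a linear regression in the spline coefficients, which is what allows a fused penalty to act on those coefficients across units.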
```{r, fig.width = 9, fig.height = 4}
# Simulate a time-varying panel with a trend and a group pattern
N <- 20
n_periods <- 100
tv_sim <- sim_tv_DGP(N = N, n_periods = n_periods, sd_error = 1, intercept = TRUE, p = 1)
tv_data <- tv_sim$data
tv_estim <- tv_pagfl(y ~ 1, data = tv_data, n_periods = n_periods, lambda = 5, verbose = FALSE)
summary(tv_estim)
```
`tv_pagfl()` returns an object of class `tvpagfl`, which contains
1. `model`: A `data.frame` containing the dependent and explanatory variables as well as individual and time indices (if provided).
2. `coefficients`: A list holding (i) a $T \times p^{(1)} \times \hat{K}$ array of the post-Lasso group-specific functional coefficients and (ii) a $\hat{K} \times p^{(2)}$ matrix of time-constant parameter estimates (when running a mixed time-varying panel data model).
3. `groups`: A `list` containing (i) the total number of groups $\hat{K}$ and (ii) a vector of estimated group memberships $(\hat{g}_1, \dots, \hat{g}_N)$, where $\hat{g}_i = k$ if $i$ is assigned to group $k$.
4. `residuals`: A vector of residuals of the demeaned model.
5. `fitted`: A vector of fitted values of the demeaned model.
6. `args`: A list of additional arguments.
7. `IC`: A `list` containing (i) the value of the IC, (ii) the employed tuning parameter $\lambda$, and (iii) the mean squared error.
8. `convergence`: A `list` containing (i) a logical indicating whether convergence was achieved and (ii) the number of executed *ADMM* algorithm iterations.
9. `call`: The function call.
> [!TIP]
> Again, `tvpagfl` objects support generic `summary()`, `fitted()`, `resid()`, `df.residual()`, `formula()`, and `coef()` functions.
In empirical settings, unbalanced panel datasets are common; nevertheless, time-varying coefficient functions remain identifiable (see Haimerl et al. [2025](https://doi.org/10.48550/arXiv.2503.23165), Appendix D). The spline functions simply interpolate missing time periods, and the group structure compensates for missing observations in any individual cross-sectional unit. However, unbalanced datasets require explicit indicator variables that declare the cross-sectional unit and the time period each observation belongs to.
Let's delete roughly 30% of the observations, add indicator variables, and run `tv_pagfl()` again.
```{r, fig.width = 9, fig.height = 4}
# Draw the observations to keep (each one is retained with probability 0.7)
keep_index <- as.logical(rbinom(n = N * n_periods, prob = 0.7, size = 1))
# Construct cross-sectional and time indicator variables
tv_data$i_index <- rep(1:N, each = n_periods)
tv_data$t_index <- rep(1:n_periods, N)
# Delete the remaining (roughly 30% of) observations
tv_data <- tv_data[keep_index, ]
# Apply the time-varying PAGFL to an unbalanced panel
tv_estim_unbalanced <- tv_pagfl(y ~ 1, data = tv_data, index = c("i_index", "t_index"), lambda = 5, verbose = FALSE)
summary(tv_estim_unbalanced)
```
> [!TIP]
> `tv_pagfl()` offers many more options than shown here. For example, it is possible to adjust the polynomial degree and the number of interior knots in the spline basis system, or to estimate a panel data model with a mix of time-varying and time-constant coefficients. See `?tv_pagfl()` for details.
## Observing a Group Structure
It may occur that the group structure is known and does not need to be estimated. In such instances, `grouped_plm()` and `grouped_tv_plm()` make it easy to estimate a (time-varying) linear panel data model given an observed grouping.
```{r}
groups <- sim$groups
estim_oracle <- grouped_plm(y ~ ., data = data, n_periods = 150, groups = groups, method = "PLS")
summary(estim_oracle)
```
Apart from not estimating the group structure, `grouped_plm()` (`grouped_tv_plm()`) behaves identically to `pagfl()` (`tv_pagfl()`).
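What estimation under a known grouping amounts to can be sketched in base R: remove the fixed effects by demeaning within each unit, then pool all units of a group in one least-squares fit. This illustrates the principle only and makes no claim about how `grouped_plm()` is implemented; the simulated data and every name in the snippet are hypothetical.

```r
set.seed(123)
N <- 9
n_periods <- 100
alpha <- rbind(c(1, -1), c(0, 2), c(-2, 0.5))  # hypothetical group slopes
groups <- rep(1:3, each = 3)                   # known group of each unit
unit <- rep(seq_len(N), each = n_periods)      # unit index of each observation

x <- matrix(rnorm(N * n_periods * 2), ncol = 2)
# y_it = alpha_{g_i}' x_it + gamma_i + u_it
y <- rowSums(x * alpha[groups[unit], ]) + rnorm(N)[unit] + rnorm(N * n_periods)

# Within transformation: subtract unit-specific means to remove gamma_i
demean <- function(v) v - ave(v, unit)
y_dm <- demean(y)
x_dm <- apply(x, 2, demean)

# Pooled least squares per group on the demeaned data
alpha_hat <- t(sapply(1:3, function(k) {
  rows <- groups[unit] == k
  coef(lm.fit(x_dm[rows, , drop = FALSE], y_dm[rows]))
}))
round(alpha_hat, 2)
```

Each row of `alpha_hat` is the estimate for one group and can be compared with the corresponding row of `alpha`.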
## Future Outlook
The package is still under active development. Future versions are planned to include
- Global coefficients
- Un-penalized individual coefficients
- Inference methods
Not an R user? Worry not: an equivalent Python library is in the works.
Feel free to [reach out](mailto:paul.haimerl@econ.au.dk) if you have any suggestions or questions.
## References
Dhaene, G., & Jochmans, K. (2015). Split-panel jackknife estimation of fixed-effect models. *The Review of Economic Studies*, 82(3), 991-1030. DOI: [doi.org/10.1093/restud/rdv007](https://doi.org/10.1093/restud/rdv007)
Haimerl, P., Smeekes, S., & Wilms, I. (2025). Estimation of latent group structures in time-varying panel data models. *arXiv preprint arXiv:2503.23165*. DOI: [doi.org/10.48550/arXiv.2503.23165](https://doi.org/10.48550/arXiv.2503.23165)
Mehrabani, A. (2023). Estimation and identification of latent group structures in panel data. *Journal of Econometrics*, 235(2), 1464-1482. DOI: [doi.org/10.1016/j.jeconom.2022.12.002](https://doi.org/10.1016/j.jeconom.2022.12.002)
Owner
- Name: Paul Haimerl
- Login: Paul-Haimerl
- Kind: user
- Location: Maastricht
- Repositories: 2
- Profile: https://github.com/Paul-Haimerl
M.Sc. Economic and Financial Research at Maastricht University
GitHub Events
Total
- Release event: 2
- Watch event: 3
- Push event: 28
- Pull request event: 2
- Fork event: 2
- Create event: 2
Last Year
- Release event: 2
- Watch event: 3
- Push event: 28
- Pull request event: 2
- Fork event: 2
- Create event: 2
Issues and Pull Requests
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 20 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 20 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- eddelbuettel (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 231 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 6
- Total maintainers: 1
cran.r-project.org: PAGFL
Joint Estimation of Latent Groups and Group-Specific Coefficients in Panel Data Models
- Homepage: https://github.com/Paul-Haimerl/PAGFL
- Documentation: http://cran.r-project.org/web/packages/PAGFL/PAGFL.pdf
- License: AGPL (≥ 3)
- Latest release: 1.1.3 (published about 1 year ago)
Rankings
Dependent packages count: 28.2%
Dependent repos count: 36.2%
Average: 49.7%
Downloads: 84.8%
Maintainers (1)
Dependencies
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION
cran
- Rcpp * imports
- pbapply * imports