pcpr
An R package implementing Principal Component Pursuit for pattern recognition in environmental health.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low (19.9%)
Keywords
cran
dimensionality-reduction
environmental-health
environmental-mixtures
epidemiology
machine-learning
pattern-recognition
public-health
r
statistical-modeling
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: Columbia-PRIME
- License: gpl-3.0
- Language: R
- Default Branch: main
- Homepage: https://columbia-prime.github.io/pcpr/
- Size: 10.7 MB
Statistics
- Stars: 5
- Watchers: 0
- Forks: 1
- Open Issues: 8
- Releases: 1
Created over 5 years ago · Last pushed 11 months ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# pcpr
## Overview
The R package `pcpr` implements Principal Component Pursuit (PCP), a robust
dimensionality reduction technique, for pattern recognition tailored to
environmental health data. The statistical methodology and computational details
are provided in Gibson et al. (2022).
## Installation
You can install the latest official CRAN release of `pcpr` with:
``` r
install.packages("pcpr")
```
The development version of `pcpr` can be installed from GitHub with:
``` r
# install.packages("pak")
pak::pak("Columbia-PRIME/pcpr")
```
`pcpr` can then be loaded and attached in your current R session as usual with:
```{r lib}
library(pcpr)
```
## Getting help
Extensive documentation is available on our pkgdown [website](https://columbia-prime.github.io/pcpr/reference/index.html) and offline within R.
You can see the `pcpr` reference manual in R with:
``` r
help("pcpr")
```
A number of vignettes are available from within R. They can be browsed using:
``` r
browseVignettes("pcpr")
```
We recommend reading the vignettes in the following order:
1. [Theory crash course](https://columbia-prime.github.io/pcpr/articles/theory-crash-course.html), or if directly in R: `vignette("theory-crash-course")`
2. [Quickstart](https://columbia-prime.github.io/pcpr/articles/pcp-quickstart.html), or if directly in R: `vignette("pcp-quickstart")`
3. [Air pollution source apportionment with PCP](https://columbia-prime.github.io/pcpr/articles/pcp-applied.html), or if directly in R: `vignette("pcp-applied")`
Have a bug to report or question to ask? [Open an issue on our GitHub](https://github.com/Columbia-PRIME/pcpr/issues).
## Modeling overview
PCP algorithms model an observed exposure matrix $D$ as the sum of three underlying
ground-truth matrices:

$$D = L_0 + S_0 + Z_0$$

a low-rank matrix $L_0$ encoding consistent patterns of exposure, a sparse
matrix $S_0$ isolating unique or outlying exposure events (that cannot be
explained by the consistent exposure patterns), and dense noise $Z_0$.
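
The decomposition described above can be illustrated with a small base-R simulation. This is only a sketch under assumed settings: the dimensions, the uniform and Gaussian distributions, the outlier magnitude, and all variable names are hypothetical choices for illustration, and it does not use or reproduce `pcpr`'s own `sim_data()`.

```r
set.seed(1)
n <- 100  # observations (assumed)
p <- 10   # chemicals in the mixture (assumed)
r <- 3    # number of underlying exposure patterns (assumed)

# Low-rank component: n x r scores times r x p loadings
U <- matrix(runif(n * r), nrow = n, ncol = r)
V <- matrix(runif(r * p), nrow = r, ncol = p)
L_0 <- U %*% V

# Sparse component: a few outlying entries, zeros everywhere else
S_0 <- matrix(0, nrow = n, ncol = p)
outlier_idxs <- sample(n * p, size = 10)
S_0[outlier_idxs] <- 1

# Dense noise: a small Gaussian perturbation on every entry
Z_0 <- matrix(rnorm(n * p, sd = 0.05), nrow = n, ncol = p)

# Observed matrix: the sum of the three components
D <- L_0 + S_0 + Z_0

qr(L_0)$rank    # numerical rank of the low-rank component
mean(S_0 != 0)  # fraction of nonzero entries in the sparse component
```

In practice only `D` is observed; the goal of PCP is to recover estimates of the low-rank and sparse components from it.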
The models in `pcpr` seek to decompose an observed data matrix `D` into estimated
low-rank and sparse components `L` and `S` for use in downstream environmental health
analyses. The functions in `pcpr` are outfitted with three environmental health
(EH)-specific extensions making `pcpr` particularly powerful for EH research:
1. Missing value functionality
2. Leveraging potential limit of detection (LOD) information
3. Non-negativity constraint on the estimated `L` matrix
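
To make the three data conditions above concrete, here is a minimal base-R sketch of what each one looks like on a toy matrix. The values, the `lod` vector, and every variable name are hypothetical, and the code only illustrates the ideas; it is not how `pcpr` handles them internally. In particular, clipping at zero is just one simple way to obtain a non-negative matrix, and the LOD/sqrt(2) substitution mirrors the convention used in the usage example of this README.

```r
# Toy exposure matrix: 4 observations of 3 chemicals (hypothetical values)
D <- matrix(
  c(0.80, 0.10, 2.0,
    1.20,   NA, 1.5,
    0.05, 0.40,  NA,
    0.90, 0.02, 1.8),
  nrow = 4, byrow = TRUE
)
lod <- c(0.1, 0.05, 0.5)  # hypothetical per-chemical limits of detection

# 1. Missing values: record which entries were actually observed
observed_mask <- !is.na(D)

# 2. LOD information: flag observed entries below their column's LOD,
#    then substitute LOD / sqrt(2) for them
lod_mat <- matrix(lod, nrow = nrow(D), ncol = ncol(D), byrow = TRUE)
below_lod <- observed_mask & (D < lod_mat)
D_sub <- D
D_sub[below_lod] <- lod_mat[below_lod] / sqrt(2)

# 3. Non-negativity: clip a (hypothetical) estimate at zero
L_hat <- matrix(c(0.7, -0.01, 1.9, 1.1), nrow = 2)
L_nonneg <- pmax(L_hat, 0)
```

Exposure concentrations cannot be negative, which is why a non-negative estimate of `L` is easier to interpret in downstream analyses.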
## PCP in environmental health studies
The methods in `pcpr` have already been applied in several environmental health studies, including:
* Tao et al. (2023) apply PCP to investigate the association between source-specific fine particulate matter and myocardial infarction hospitalizations in NYC.
* Wu et al. (2024) employ PCP for exposome profiling of environmental pollutants in seminal plasma, uncovering novel associations with semen parameters.
* Benavides et al. (2024) use PCP to develop a Community Severance Index in NYC, measuring the barrier effect of road infrastructure and traffic in cities.
## Acknowledgements
Please cite use of `pcpr` with:
Chillrud L, Benavides J, Gibson E, Zhang J, Yan J, Wright J, Goldsmith J, Kioumourtzoglou M (2025).
pcpr: Principal Component Pursuit for Environmental Epidemiology. R package version 1.0.0,
https://columbia-prime.github.io/pcpr/, https://github.com/Columbia-PRIME/pcpr.
```
@Manual{,
title = {pcpr: Principal Component Pursuit for Environmental Epidemiology},
author = {Lawrence G. Chillrud and Jaime Benavides and Elizabeth A. Gibson and Junhui Zhang and Jingkai Yan and John N. Wright and Jeff Goldsmith and Marianthi-Anna Kioumourtzoglou},
year = {2025},
note = {R package version 1.0.0, https://github.com/Columbia-PRIME/pcpr},
url = {https://columbia-prime.github.io/pcpr/},
}
```
Please also cite Gibson et al. (2022).
This work was supported by NIEHS PRIME R01 ES028805.
Special thanks to Sophie Calhoun for designing `pcpr`'s logo!
## Usage
```{r usage}
# In the example below, we simulate a simple mixtures model and run PCP,
# comparing its performance to that of PCA. For an in-depth example with
# simulated data, see vignette("pcp-quickstart"). For more realistic
# PCP usage, check out vignette("pcp-applied").
# Simulate an environmental mixture
data <- sim_data(
  n = 100, p = 10, r = 3,
  sparse_nonzero_idxs = seq(1, 1000, 101),
  sigma = 0.05
)
D <- data$D # Observed matrix
L_0 <- data$L # Ground truth low-rank matrix
S_0 <- data$S # Ground truth sparse matrix
Z_0 <- data$Z # Ground truth noise matrix
# Simulate a limit of detection for each chemical in mixture
lod_info <- sim_lod(D, q = 0.1)
D_lod <- lod_info$D_tilde
lod <- lod_info$lod
# Simulate missing observations
corrupted_data <- sim_na(D_lod, perc = 0.05)
D_tilde <- corrupted_data$D_tilde
# Finish simulating LOD by imputing values < LOD with LOD/sqrt(2)
lod_root2 <- matrix(
  lod / sqrt(2),
  nrow = nrow(D_tilde),
  ncol = ncol(D_tilde), byrow = TRUE
)
lod_idxs <- which(lod_info$tilde_mask == 1)
D_tilde[lod_idxs] <- lod_root2[lod_idxs]
# Run grid search to obtain optimal r, eta parameters
# (Not shown here to save space, see vignette("pcp-quickstart")
# for full example which obtains r = 3, eta = 0.224)
r_star <- 3
eta_star <- 0.224
# Run non-convex PCP to estimate L, S from D_tilde
pcp_model <- rrmc(D_tilde, r = r_star, eta = eta_star, LOD = lod)
# Clean up sparse matrix
pcp_model$S <- hard_threshold(pcp_model$S, thresh = 0.4)
# Benchmark with PCA's attempt at recovering L
D_imputed <- impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
L_pca <- proj_rank_r(D_imputed, r = r_star)
# Evaluate PCP against ground truth
data.frame(
  "Obs_rel_err" = norm(L_0 - D_imputed, "F") / norm(L_0, "F"),
  "PCA_L_rel_err" = norm(L_0 - L_pca, "F") / norm(L_0, "F"),
  "PCP_L_rel_err" = norm(L_0 - pcp_model$L, "F") / norm(L_0, "F"),
  "PCP_S_rel_err" = norm(S_0 - pcp_model$S, "F") / norm(S_0, "F"),
  "PCP_L_rank" = matrix_rank(pcp_model$L),
  "PCP_S_sparsity" = sparsity(pcp_model$S)
)
```
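
For readers who want to see the ideas behind the benchmark and evaluation steps without the package, the following base-R sketch shows a rank-r projection via truncated SVD, a simple entrywise threshold, and the relative Frobenius error used in the table above. The helper names `rank_r_approx()`, `threshold_small()`, and `rel_err()` are hypothetical and are not part of `pcpr`; they are assumed to be conceptually similar to, but not necessarily identical to, the package's `proj_rank_r()` and `hard_threshold()`.

```r
# Best rank-r approximation of X in Frobenius norm (truncated SVD)
rank_r_approx <- function(X, r) {
  s <- svd(X)
  k <- seq_len(r)
  s$u[, k, drop = FALSE] %*% diag(s$d[k], nrow = r) %*% t(s$v[, k, drop = FALSE])
}

# Set entries whose magnitude is below `thresh` to zero
threshold_small <- function(X, thresh) {
  X[abs(X) < thresh] <- 0
  X
}

# Relative error of an estimate with respect to a ground-truth matrix
rel_err <- function(truth, est) {
  norm(truth - est, "F") / norm(truth, "F")
}

# Small demonstration on simulated data (all settings are assumptions)
set.seed(42)
L_true <- matrix(runif(100 * 3), 100, 3) %*% matrix(runif(3 * 10), 3, 10)
noisy <- L_true + matrix(rnorm(100 * 10, sd = 0.01), 100, 10)
L_hat <- rank_r_approx(noisy, r = 3)

rel_err(L_true, noisy)  # error of the raw noisy observations
rel_err(L_true, L_hat)  # error after projecting onto rank 3
```

A truncated SVD of this kind handles dense noise reasonably well but has no mechanism for isolating sparse outliers, which is the gap PCP's separate `S` component is meant to fill.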
## References
Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern identification in environmental mixtures." Environmental Health Perspectives 130, no. 11 (2022): 117008.
Tao, Rachel H., Lawrence G. Chillrud, Yanelli Nunez, Sebastian T. Rowland, Amelia K. Boehme, Jingkai Yan, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Applying principal component pursuit to investigate the association between source-specific fine particulate matter and myocardial infarction hospitalizations in New York City." Environmental Epidemiology 7, no. 2 (2023).
Wu, Haotian, Vrinda Kalia, Katherine E. Manz, Lawrence Chillrud, Nathalie Hoffmann Dishon, Gabriela L. Jackson, Christian K. Dye, Raoul Orvieto, Adva Aizer, Hagai Levine, Marianthi-Anna Kioumourtzoglou, Kurt D. Pennell, Andrea A. Baccarelli, and Ronit Machtinger. "Exposome Profiling of Environmental Pollutants in Seminal Plasma and Novel Associations with Semen Parameters." Environmental Science & Technology 58, no. 31 (2024): 13594-13604.
Benavides, Jaime, Sabah Usmani, Vijay Kumar, and Marianthi-Anna Kioumourtzoglou. "Development of a community severance index for urban areas in the United States: A case study in New York City." Environment International 185 (2024): 108526.
Owner
- Name: Columbia PRIME
- Login: Columbia-PRIME
- Kind: organization
- Website: marianthi.github.io/makLAB.github.io
- Repositories: 1
- Profile: https://github.com/Columbia-PRIME
Adapting PCP to environmental health data
GitHub Events
Total
- Create event: 4
- Release event: 1
- Issues event: 44
- Watch event: 6
- Delete event: 2
- Issue comment event: 21
- Push event: 47
- Fork event: 1
Last Year
- Create event: 4
- Release event: 1
- Issues event: 44
- Watch event: 6
- Delete event: 2
- Issue comment event: 21
- Push event: 47
- Fork event: 1
Packages
- Total packages: 1
- Total downloads: 208 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
cran.r-project.org: pcpr
Principal Component Pursuit for Environmental Epidemiology
- Homepage: https://columbia-prime.github.io/pcpr/
- Documentation: http://cran.r-project.org/web/packages/pcpr/pcpr.pdf
- License: GPL (≥ 3)
- Latest release: 1.0.0 (published 11 months ago)
Rankings
Forks count: 21.6%
Stargazers count: 24.9%
Dependent packages count: 26.9%
Dependent repos count: 33.2%
Average: 38.7%
Downloads: 87.0%
Maintainers (1)
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran