pcsstools
Tools for regression using pre-computed summary statistics
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 12 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (19.9%) to scientific vocabulary
Keywords
gwas
r
statistical-genetics
Last synced: 7 months ago
Repository
Tools for regression using pre-computed summary statistics
Basic Info
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 1
Topics
gwas
r
statistical-genetics
Created over 6 years ago · Last pushed almost 2 years ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# pcsstools
[CRAN status](https://CRAN.R-project.org/package=pcsstools)
[R-CMD-check](https://github.com/jackmwolf/pcsstools/actions)
## Overview
pcsstools is an R package to describe various regression models using only pre-computed summary statistics (PCSS) from genome-wide association studies (GWASs) and PCSS repositories such as [GeneAtlas](http://geneatlas.roslin.ed.ac.uk/).
This eliminates the logistical, privacy, and access concerns that accompany the use of individual patient-level data (IPD).
The following figure highlights the information typically needed to perform regression analysis on a set of $m$ phenotypes with $p$ covariates when IPD is available, and the PCSS that are commonly needed to approximate this same model in pcsstools.

Currently, pcsstools supports the linear modeling of complex phenotypes defined via functions of other phenotypes.
Supported functions include:
* linear combinations (e.g. $\phi_1y_1 + \phi_2y_2$)
* products (e.g. $y_1\circ y_2$)
* logical combinations (e.g. $y_1\wedge y_2$ or $y_1\vee y_2$)
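
To make the notation concrete, here is a minimal base R sketch of what each kind of derived phenotype looks like when IPD is available. The simulated vectors (`y1`, `y2`, `b1`, `b2`) and the weights in `phi` are hypothetical and are not part of the package's example data.

```r
# Hypothetical simulated phenotypes; not pcsstools data
set.seed(1)
n <- 100
y1 <- rnorm(n)           # continuous phenotype
y2 <- rnorm(n)           # continuous phenotype
b1 <- rbinom(n, 1, 0.3)  # binary phenotype
b2 <- rbinom(n, 1, 0.5)  # binary phenotype

# Linear combination phi_1 * y1 + phi_2 * y2 with illustrative weights
phi <- c(0.5, 2)
lin_comb <- phi[1] * y1 + phi[2] * y2

# Product: element-wise multiplication of the two phenotypes
prod_y <- y1 * y2

# Logical combinations, coded as 0/1 responses
and_y <- as.numeric(b1 & b2)
or_y  <- as.numeric(b1 | b2)
```

With IPD these responses can be computed directly for every subject; pcsstools targets the setting where only summary statistics for the component phenotypes are available.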
## Installation
You can install pcsstools from CRAN with
``` r
install.packages("pcsstools")
```
### Development Version
You can install the in-development version of pcsstools from [GitHub](https://github.com/) with
``` r
# install.packages("devtools")
devtools::install_github("jackmwolf/pcsstools")
```
## Examples
We will walk through two examples using pcsstools to model combinations of phenotypes using PCSS and then compare our results to those found using IPD.
```{r}
library(pcsstools)
```
### Principal Component Analysis
Let's model the first principal component score of three phenotypes using PCSS.
First, we'll load in some data. We have minor allele counts for three SNPs (`g1`, `g2`, and `g3`), a continuous covariate (`x1`), and three continuous phenotypes (`y1`, `y2`, and `y3`).
```{r}
dat <- pcsstools_example[c("g1", "g2", "g3", "x1", "y1", "y2", "y3")]
head(dat)
```
Next, we need our assumed summary statistics: the means, the full covariance matrix, and the sample size.
```{r}
pcss <- list(
means = colMeans(dat),
covs = cov(dat),
n = nrow(dat)
)
```
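
As background on why these three quantities are enough for a single observed response, the following base R sketch recovers ordinary least squares coefficients from means and covariances alone and checks them against `lm()`. It uses hypothetical simulated data (`sim`) and is only an illustration of the underlying algebra, not a description of how `pcsslm()` is implemented.

```r
# Hypothetical simulated data; not the pcsstools example data
set.seed(42)
n <- 200
sim <- data.frame(g = rbinom(n, 2, 0.3), x = rnorm(n))
sim$y <- 1 + 0.5 * sim$g - 0.25 * sim$x + rnorm(n)

# Summary statistics only
means <- colMeans(sim)
covs  <- cov(sim)

# Slopes solve Cov(X, X) %*% beta = Cov(X, y)
preds  <- c("g", "x")
slopes <- solve(covs[preds, preds], covs[preds, "y"])

# The intercept follows from the means
intercept <- means["y"] - sum(means[preds] * slopes)
beta_ss   <- c(intercept, slopes)

# Same fit using the individual-level data
beta_ipd <- coef(lm(y ~ g + x, data = sim))
all.equal(unname(beta_ss), unname(beta_ipd))
```

Standard errors additionally require the sample size and the variance of the response, which is why `n` and the full covariance matrix are part of the list.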
Then, we can calculate the linear model by using `pcsslm()`.
Our `formula` lists all phenotypes as one sum, joined by `+` operators, and we request the first principal component score by setting `comp = 1`.
We also want to center and standardize `y1`, `y2`, and `y3` before computing principal component scores; we will do so by setting `center = TRUE` and `standardize = TRUE`.
```{r}
model_pcss <- pcsslm(y1 + y2 + y3 ~ g1 + g2 + g3 + x1, pcss = pcss, comp = 1,
center = TRUE, standardize = TRUE)
model_pcss
```
Here's the same model using individual patient data.
```{r}
pc_1 <- prcomp(x = dat[c("y1", "y2", "y3")], center = TRUE, scale. = TRUE)$x[, "PC1"]
model_ipd <- lm(pc_1 ~ g1 + g2 + g3 + x1, data = dat)
summary(model_ipd)
```
In this case, our coefficient estimates are off by a factor of -1; this is because we picked the opposite vector of principal component weights to `prcomp`.
This distinction in sign is arbitrary (see the note in `?prcomp`).
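
The effect of the sign choice can be checked with base R alone. In this sketch (hypothetical simulated data, not the package example), negating the first principal component score negates every coefficient and leaves the standard errors unchanged.

```r
# Hypothetical simulated data
set.seed(7)
n <- 150
g <- rbinom(n, 2, 0.25)
Y <- cbind(
  y1 = rnorm(n) + 0.3 * g,
  y2 = rnorm(n) + 0.2 * g,
  y3 = rnorm(n)
)

# First principal component score and its negation
pc1     <- prcomp(Y, center = TRUE, scale. = TRUE)$x[, "PC1"]
pc1_neg <- -pc1

fit_pos <- lm(pc1 ~ g)
fit_neg <- lm(pc1_neg ~ g)

# Coefficients differ only by a factor of -1
all.equal(unname(coef(fit_pos)), -unname(coef(fit_neg)))

# Standard errors are identical
se_pos <- summary(fit_pos)$coefficients[, "Std. Error"]
se_neg <- summary(fit_neg)$coefficients[, "Std. Error"]
all.equal(se_pos, se_neg)
```

Test statistics therefore change sign but keep their magnitude, so p-values are unaffected by which direction is chosen.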
We can also compare this model to a smaller model using `anova` and find the same results when using both PCSS and IPD.
```{r}
model_pcss_reduced <- update(model_pcss, . ~ . - g1 - g2 - g3)
anova(model_pcss_reduced, model_pcss)
model_ipd_reduced <- update(model_ipd, . ~ . - g1 - g2 - g3)
anova(model_ipd_reduced, model_ipd)
```
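
For readers who want to see what the nested-model comparison computes, this base R sketch reproduces the F statistic and p-value of `anova()` from residual sums of squares and residual degrees of freedom. The data frame `sim` is hypothetical, and the sketch covers the standard `lm` case only; it makes no claim about the internals of the PCSS-based `anova` method.

```r
# Hypothetical simulated data
set.seed(11)
n <- 120
sim <- data.frame(
  g1 = rbinom(n, 2, 0.3),
  g2 = rbinom(n, 2, 0.2),
  x1 = rnorm(n)
)
sim$y <- 0.4 * sim$g1 + 0.3 * sim$x1 + rnorm(n)

full    <- lm(y ~ g1 + g2 + x1, data = sim)
reduced <- update(full, . ~ . - g1 - g2)

# Residual sums of squares and residual degrees of freedom
rss_full    <- sum(residuals(full)^2)
rss_reduced <- sum(residuals(reduced)^2)
df_full     <- df.residual(full)
df_reduced  <- df.residual(reduced)

# F test for dropping g1 and g2
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

# Compare with the built-in table
tab <- anova(reduced, full)
all.equal(f_stat, tab$F[2])
all.equal(p_value, tab$`Pr(>F)`[2])
```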
### Logical Combination
In this example we will approximate a linear model where our response is the logical combination "$y_4$ or $y_5$" ($y_4\vee y_5$).
First we need data with binary phenotypes.
```{r}
dat <- pcsstools_example[c("g1", "g2", "x1", "y4", "y5")]
head(dat)
```
Once again we will organize our assumed PCSS.
In addition to the summary statistics we needed for the previous example, we also need to describe the distribution of each of our predictors through objects of class `predictor`.
(See `?new_predictor`.)
`pcsstools` has shortcut functions to create `predictor` objects for common types of variables, which we will use to create a list of `predictor`s.
```{r}
pcss <- list(
means = colMeans(dat),
covs = cov(dat),
n = nrow(dat),
predictors = list(
g1 = new_predictor_snp(maf = mean(dat$g1) / 2),
g2 = new_predictor_snp(maf = mean(dat$g2) / 2),
x1 = new_predictor_normal(mean = mean(dat$x1), sd = sd(dat$x1))
)
)
class(pcss$predictors[[1]])
```
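
The call `mean(dat$g1) / 2` estimates the minor allele frequency because a minor allele count has mean `2 * maf` when each of the two alleles is an independent draw. The sketch below spells out the genotype distribution this implies under Hardy-Weinberg equilibrium. Whether `new_predictor_snp()` uses exactly this parameterization internally is an assumption here; see `?new_predictor_snp` for the authoritative description.

```r
# Illustrative minor allele frequency (hypothetical value)
maf <- 0.2

# Under Hardy-Weinberg equilibrium the allele count is Binomial(2, maf)
geno_probs <- dbinom(0:2, size = 2, prob = maf)
names(geno_probs) <- 0:2
geno_probs

# Implied mean and variance of the allele count
implied_mean <- sum(0:2 * geno_probs)                       # equals 2 * maf
implied_var  <- sum((0:2)^2 * geno_probs) - implied_mean^2  # equals 2 * maf * (1 - maf)

# Recover the allele frequency from the mean, as in the chunk above
implied_mean / 2
```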
Then we can approximate the linear model using `pcsslm()`.
```{r}
model_pcss <- pcsslm(y4 | y5 ~ g1 + g2 + x1, pcss = pcss)
model_pcss
```
And here's the result we would get using IPD:
```{r}
model_ipd <- lm(y4 | y5 ~ g1 + g2 + x1, data = dat)
summary(model_ipd)
```
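
One way to see how a logical combination relates to the other supported functions is that, for 0/1 phenotypes, "or" and "and" can be written with products and linear combinations. The base R sketch below verifies these Boolean identities on hypothetical simulated phenotypes; how pcsstools approximates the required product terms from PCSS is covered in the references and is not reproduced here.

```r
# Hypothetical binary phenotypes
set.seed(3)
n  <- 50
y4 <- rbinom(n, 1, 0.3)
y5 <- rbinom(n, 1, 0.4)

# Logical definitions, coded as 0/1
or_logical  <- as.numeric(y4 | y5)
and_logical <- as.numeric(y4 & y5)

# Algebraic equivalents
and_algebra <- y4 * y5                   # "and" is a product
or_algebra  <- 1 - (1 - y4) * (1 - y5)   # "or" via complements
or_expanded <- y4 + y5 - y4 * y5         # same expression, expanded

all.equal(or_logical, or_algebra)
all.equal(or_logical, or_expanded)
all.equal(and_logical, and_algebra)
```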
## Future Work
* Support function notation for linear combinations of phenotypes (e.g. `y1 - y2 + 0.5 * y3 ~ 1 + g + x`) instead of requiring a separate vector of weights
* Support formulas using `.` and `-` on the right-hand side (e.g. `y1 ~ .`, `y1 ~ . - x`)
* Write a vignette
## References
The following are the key references for the functions in this package:
* Wolf, J.M., Westra, J., and Tintle, N. (2021). Using summary statistics to
model multiplicative combinations of initially analyzed phenotypes with a
flexible choice of covariates. *Frontiers in Genetics*, 25, 1962.
[https://doi.org/10.3389/fgene.2021.745901](https://doi.org/10.3389/fgene.2021.745901).
* Wolf, J.M., Barnard, M., Xueting, X., Ryder, N., Westra, J., and Tintle, N.
(2020). Computationally efficient, exact, covariate-adjusted genetic principal
component analysis by leveraging individual marker summary statistics from
large biobanks. *Pacific Symposium on Biocomputing*, 25, 719-730.
[https://doi.org/10.1142/9789811215636_0063](https://doi.org/10.1142/9789811215636_0063).
* Gasdaska, A., Friend, D., Chen, R., Westra, J., Zawistowski, M., Lindsey, W.,
and Tintle, N. (2019). Leveraging summary statistics to make inferences about
complex phenotypes in large biobanks. *Pacific Symposium on Biocomputing*, 24,
391-402.
[https://doi.org/10.1142/9789813279827_0036](https://doi.org/10.1142/9789813279827_0036).
Owner
- Name: Jack M Wolf
- Login: jackmwolf
- Kind: user
- Company: University of Minnesota Biostatistics
- Website: jackmwolf.rbind.io
- Twitter: _jackmwolf
- Repositories: 4
- Profile: https://github.com/jackmwolf
he/him \\ Biostatistics PhD Student
Committers
Last synced: about 3 years ago
All Time
- Total Commits: 173
- Total Committers: 3
- Avg Commits per committer: 57.667
- Development Distribution Score (DDS): 0.127
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jack M Wolf | 3****f@u****m | 151 |
| Jack Wolf | = | 17 |
| Jack Wolf | w****5@s****u | 5 |
Committer Domains (Top 20 + Academic)
stolaf.edu: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 6
- Total pull requests: 3
- Average time to close issues: 3 months
- Average time to close pull requests: about 1 hour
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.17
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jackmwolf (5)
Pull Request Authors
- jackmwolf (3)
Packages
- Total packages: 1
- Total downloads:
  - cran: 185 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
cran.r-project.org: pcsstools
Tools for Regression Using Pre-Computed Summary Statistics
- Homepage: https://github.com/jackmwolf/pcsstools/
- Documentation: http://cran.r-project.org/web/packages/pcsstools/pcsstools.pdf
- License: GPL (≥ 3)
- Latest release: 0.1.2 (published over 2 years ago)
Rankings
Stargazers count: 24.2%
Forks count: 28.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Average: 41.3%
Downloads: 88.1%
Maintainers (1)
Last synced: 7 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.5.0 depends
- Rdpack * imports
- gtools * imports
- stats * imports
- knitr * suggests
- rmarkdown * suggests
- spelling * suggests
- testthat * suggests
.github/workflows/check-standard.yaml
actions
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite