saigegds

Scalable Implementation of generalized mixed models using GDS files in Phenome-Wide Association Studies

https://github.com/abbvie-computationalgenomics/saigegds

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 9 DOI reference(s) in README
✓ Academic publication links: Links to nature.com
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.4%) to scientific vocabulary

Keywords

gds gwas mixed-model phewas

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AbbVie-ComputationalGenomics
Language: C++
Default Branch: master
Homepage:
Size: 2.61 MB

Statistics

Stars: 7
Watchers: 6
Forks: 4
Open Issues: 2
Releases: 0

Topics

gds gwas mixed-model phewas

Created almost 7 years ago · Last pushed about 3 years ago

Metadata Files

Readme

SAIGEgds: Scalable Implementation of Generalized mixed models in PheWAS using GDS files

License: GNU General Public License, GPLv3

Features

Scalable implementation of generalized mixed models with support for Genomic Data Structure (GDS) files and a highly optimized C++ implementation. It is designed for single-variant tests in large-scale phenome-wide association studies (PheWAS) with millions of variants and hundreds of thousands of samples (e.g., UK Biobank genotype data), controlling for case-control imbalance and sample structure.

The implementation of SAIGEgds is based on the original SAIGE R package (v0.29.4.4) [Zhou et al. 2018]. It is written in optimized C++ code that takes advantage of the sparse structure of genotypes. All single-precision floating-point calculations in SAIGE are replaced by double-precision calculations in SAIGEgds. SAIGEgds also implements some of the SPAtest functions in C to speed up the calculation of the saddlepoint approximation.
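To make the point about sparse genotypes concrete, here is a minimal base-R sketch. It is not the package's internal C++ code; the simulated dosages, the `resid` vector, and the variable names are assumptions made for illustration. For a rare variant most dosages are zero, so a sum over samples only needs to visit the carriers.

```R
# Illustrative only: simulated data, not SAIGEgds internals.
set.seed(1)
n <- 1000
g <- rbinom(n, 2, 0.01)   # dosages for a rare variant; most entries are 0
resid <- rnorm(n)         # hypothetical null-model residuals

# Dense computation: touches all n samples.
score_dense <- sum(g * resid)

# Sparse computation: keep only the carriers and their dosages.
idx <- which(g != 0L)
val <- g[idx]
score_sparse <- sum(val * resid[idx])

# Both give the same value, but the sparse sum visits far fewer samples.
all.equal(score_dense, score_sparse)
length(idx) / n
```

The saving grows as variants get rarer, which is the common case in biobank-scale data.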

Benchmarks using the UK Biobank White British genotype data (N=430K) with coronary heart disease and simulated cases show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and calculating p-values. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.

Bioconductor:

Release Version: v1.12.1 (http://www.bioconductor.org/packages/SAIGEgds)

Package Maintainer

Xiuwen Zheng

Installation

Requires R (≥ v3.5.0), gdsfmt (≥ v1.20.0), SeqArray (≥ v1.32.0)
Recommend GNU GCC (≥ v6.0), requiring C++11
Bioconductor repository:

```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SAIGEgds")
```

The BiocManager::install() approach may require that you build from source, i.e., make and compilers must be installed on your system; see the R FAQ for your operating system. You may also need to install dependencies manually.

Development version from GitHub (for developers/testers only):

```R
library("devtools")
install_github("AbbVie-ComputationalGenomics/SAIGEgds")
```
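The version requirements listed above can be checked before installing. The sketch below uses only base R and the `utils` package; `check_min_versions()` is a hypothetical helper written for this example, not part of SAIGEgds.

```R
# Minimum versions taken from the requirements listed above.
required <- c(gdsfmt = "1.20.0", SeqArray = "1.32.0")

# check_min_versions() is a hypothetical helper, not part of SAIGEgds.
check_min_versions <- function(required) {
  vapply(names(required), function(pkg) {
    # FALSE if the package is missing or older than the minimum version
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
}

r_ok <- getRversion() >= "3.5.0"
pkg_ok <- check_min_versions(required)
print(r_ok)
print(pkg_ok)
```

A `FALSE` entry means the package is either not installed or too old; installing through BiocManager should normally resolve both cases.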

Package vignette

If the package is installed from the Bioconductor repository or rebuilt from source, users can start R and enter the following to view the documentation:

```R
browseVignettes("SAIGEgds")
```

Examples

```R
library(SeqArray)
library(SAIGEgds)

# open the GDS file for genetic relationship matrix (GRM)
grm_fn <- system.file("extdata", "grm1k_10ksnp.gds", package="SAIGEgds")
(grm_gds <- seqOpen(grm_fn))

# load phenotype
pheno_fn <- system.file("extdata", "pheno.txt.gz", package="SAIGEgds")
pheno <- read.table(pheno_fn, header=TRUE, as.is=TRUE)
head(pheno)
##   sample.id y     yy      x1 x2
## 1        s1 0 4.5542  1.5118  1
## 2        s2 0 3.7941  0.3898  1
## 3        s3 0 5.0411 -0.6212  1
## ...

# fit the null model
glmm <- seqFitNullGLMMSPA(y ~ x1 + x2, pheno, grm_gds, trait.type="binary",
    sample.col="sample.id", num.thread=2)
## SAIGE association analysis:
## Filtering variants:
## [==================================================] 100%, completed, 0s
## Fit the null model: y ~ x1 + x2 + var(GRM)
##     # of samples: 1,000
##     # of variants: 9,976
##     using 2 threads
## ...

# close the file
seqClose(grm_gds)

# open the GDS file for association testing
geno_fn <- system.file("extdata", "assoc_100snp.gds", package="SAIGEgds")
(geno_gds <- seqOpen(geno_fn))
## File: assoc_100snp.gds (10.5K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 1000 LZMA_ra(12.6%), 625B }
## |--+ variant.id   { Int32 100 LZMA_ra(48.5%), 201B } *
## ...

# p-value calculation
assoc <- seqAssocGLMMSPA(geno_gds, glmm, mac=10, parallel=2)
## SAIGE association analysis:
##     # of samples: 1,000
##     # of variants: 100
##     MAF threshold: NaN
##     MAC threshold: 10
##     missing threshold for variants: 0.1
##     p-value threshold for SPA adjustment: 0.05
##     variance ratio for approximation: 0.9391186
##     # of processes: 2
## [==================================================] 100%, completed, 0s
## # of variants after filtering by MAF, MAC and missing thresholds: 38
## Done.

head(assoc)
##   id chr pos rs.id ref alt AF.alt mac  num      beta       SE     pval pval.noadj converged
## 1  4   1   4   rs4   A   C 0.0100  20 1000 -0.074992 0.791685 0.924533   0.924533      TRUE
## 2 12   1  12  rs12   A   C 0.0150  30 1000 -0.091001 0.657140 0.889861   0.889861      TRUE
## 3 14   1  14  rs14   A   C 0.0375  75 1000 -0.075455 0.434152 0.862023   0.862023      TRUE
## ...

# close the file
seqClose(geno_gds)
```
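The `mac` column in the `head(assoc)` output can be related to `AF.alt` and `num` with a short base-R calculation. This is a sketch under stated assumptions: diploid genotypes, `num` counting samples with non-missing genotypes, and a threshold that keeps variants whose minor allele count is at or above `mac`. The helper name `minor_allele_count()` is hypothetical and not part of SAIGEgds.

```R
# Hypothetical helper; assumes diploid genotypes and no missing calls.
minor_allele_count <- function(af_alt, n_samples) {
  maf <- pmin(af_alt, 1 - af_alt)   # minor allele frequency
  round(2 * n_samples * maf)        # two alleles per sample
}

# AF.alt and num values copied from the first three rows of head(assoc)
af_alt <- c(0.0100, 0.0150, 0.0375)
mac <- minor_allele_count(af_alt, n_samples = 1000)
mac

# Which of these variants would pass a threshold of mac=10 (assumed to mean MAC >= 10)
mac >= 10
```

For these three rows the computed counts agree with the `mac` values reported in the table (20, 30 and 75).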

Citations

Zheng X, Davis JW. SAIGEgds -- an efficient statistical tool for large-scale PheWAS with mixed models. Bioinformatics (2020). DOI: 10.1093/bioinformatics/btaa731.

Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei WQ, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, Lee S. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet (2018);50(9):1335-1341. DOI: 10.1038/s41588-018-0184-y.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D. SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics (2017). DOI: 10.1093/bioinformatics/btx145.

Top Committers

Name	Email	Commits
Xiuwen Zheng	x**g@a**m	6
Xiuwen Zheng	5****b	4
A. Jason Grundstad	a**d@a**m	2
Jason Grundstad	j**d@g**m	1
Xiuwen Zheng	z**n@g**m	1
Martin Tzvetanov Grigorov	m**v@a**g	1

Issues and Pull Requests

All Time

Total issues: 9
Total pull requests: 2
Average time to close issues: 2 months
Average time to close pull requests: about 9 hours
Total issue authors: 7
Total pull request authors: 2
Average comments per issue: 2.78
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

sariya (3)
lauraand1705 (1)
martin-g (1)
Richard-Packer (1)
SunYidan2021 (1)
silviaadiz (1)
ldcato (1)

Pull Request Authors

jgrundstad (1)
martin-g (1)

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
Rcpp * depends
SeqArray >= 1.31.8 depends
gdsfmt >= 1.20.0 depends
RcppParallel * imports
SPAtest >= 3.0.0 imports
methods * imports
stats * imports
utils * imports
BiocGenerics * suggests
RUnit * suggests
SNPRelate * suggests
crayon * suggests
ggmanh * suggests
knitr * suggests
markdown * suggests
parallel * suggests
rmarkdown * suggests
