crypticphenotypeanalysisscripts

A collection of scripts used in the cryptic phenotype analysis manuscript.

https://github.com/daverblair/crypticphenotypeanalysisscripts

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: daverblair
License: MIT
Language: Python
Default Branch: master
Size: 403 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created almost 6 years ago · Last pushed about 4 years ago

Metadata Files

Readme License Citation

CrypticPhenotypeAnalysisScripts

This directory contains the scripts used to perform the analyses described in Blair et al. 2022. Note that these scripts are provided as is and have not been tested across multiple architectures. Any required software or Python modules can be installed with pip or conda. Many of the scripts rely on data files that cannot be shared directly because of data use agreements, so dummy paths appear in place of those files. The scripts can, however, be adapted for use with other datasets. The purpose of each script is described briefly below, grouped by the sub-directories used to organize the research.

Sub-Directories

AuxillaryFunctions

Functions called by other scripts that are not part of a distributed module.

FirthRegression.py: Class that performs Firth-corrected logistic regression based on the approach described in PMID: 12758140.
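As a rough illustration of the technique (not the code in FirthRegression.py), the sketch below fits a Firth-penalized logistic regression with an intercept and a single covariate using only the standard library. The function name `firth_logistic`, its arguments, and the toy data are hypothetical. It takes plain Newton steps on the Firth-modified score and omits the step-halving safeguards that a production implementation would typically include.

```python
import math


def firth_logistic(x, y, max_iter=100, tol=1e-8):
    """Firth-penalized logistic regression: intercept plus one covariate.

    Hypothetical helper for illustration only. Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities and logistic variance weights w_i = p_i (1 - p_i)
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]

        # Fisher information I = X^T W X for the design matrix [1, x]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01

        # Explicit inverse of the symmetric 2x2 information matrix
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det

        # Hat-matrix diagonal: h_i = w_i * [1, x_i] I^{-1} [1, x_i]^T
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]

        # Firth-modified score: sum_i (y_i - p_i + h_i (0.5 - p_i)) * x_i
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))

        # Newton step: delta = I^{-1} U*
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Toy, perfectly separated data: ordinary maximum likelihood would diverge here,
# whereas the penalized estimate stays finite.
intercept, slope = firth_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

The penalty term is what keeps the slope finite under complete separation, which is the usual motivation for Firth's correction in rare-event settings.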

GWASPlots.py: Python functions for making the quantile-quantile and Manhattan plots (see Figures 4, 5, and 6).
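For readers unfamiliar with quantile-quantile plots, the following minimal sketch shows one common way to compute the plotted points: sorted observed p-values are paired with the quantiles expected under a uniform null, and both are put on a -log10 scale. The function name `qq_points` and the `(rank - 0.5) / n` plotting position are assumptions for illustration, not necessarily what GWASPlots.py does.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot.

    Hypothetical helper: expected quantiles use the (rank - 0.5) / n convention.
    """
    observed = sorted(p_values)  # smallest p-value first
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n  # uniform-null quantile for this rank
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Toy p-values; each tuple is (x, y) for a scatter plot against the y = x line.
points = qq_points([0.5, 0.01, 0.2])
```

Points that rise above the diagonal at the right-hand end indicate an excess of small p-values relative to the null.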

AuxFuncs.py: A handful of functions used repeatedly by other scripts.

Illustration

This directory contains the script used to generate the simulated data and the panels in Figures 1a and 1b.

vLPI_Illustration.py: This script should reproduce the plots in Figures 1a and 1b.

IllustrativeExample.pth: Pickled simulated dataset for Figure 1.

CPA

This directory contains the scripts used for Cryptic Phenotype Analysis (CPA). The details of this analysis are described in the Supplementary Methods. Supplementary Figure 5 depicts the overall CPA pipeline, and the steps performed by each script are provided in the file name. The final output of this analysis is depicted in Figure 2.

CPAvLPIFitStep1.py: Fits the latent phenotype model to some observed symptom dataset. The model is saved to disk. Type 'python CPAvLPIFitStep1.py -h' for command line arguments.

CPAModelSelectionStep21.py: Compares all the different models fit to a single disease and identifies the one with the lowest perplexity. This is done in the training dataset. Type 'python CPAModelSelection_Step1-2.py -h' for command line arguments.
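The selection rule described above can be sketched in a few lines. The dictionary of perplexities, the model labels, and the function name `select_best_model` are hypothetical; the actual script reads fitted models from disk and takes its inputs from the command line.

```python
def select_best_model(perplexities):
    """Return the label of the model with the lowest training perplexity.

    `perplexities` maps a (hypothetical) model label to its perplexity.
    """
    if not perplexities:
        raise ValueError("no models to compare")
    return min(perplexities, key=perplexities.get)


# Toy values for three candidate fits of the same disease.
best = select_best_model({"rank2_trial1": 12.5, "rank4_trial1": 9.1, "rank4_trial2": 10.0})
```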

CPAConsistencyAnalysisStep22.py: Performs the consistency checks for the different model inference trials. Corresponds to Step 2 of Supplementary Figure 5. Results of this analysis are displayed in Supplementary Figure 3. Type 'python CPAConsistencyAnalysis_Step2.py -h' for command line arguments.

Note that there is no Step 3 script: Step 3 repeats Step 1 with a different set of hyper-parameters or a different set of manually curated symptoms.

CPAEffectiveRankStep4.py: Computes the fraction of variance explained by each component of the top performing latent phenotype model. This information is used to estimate effective rank. Type 'python CPAEffectiveRankStep4.py -h' for command line arguments.
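To make the idea concrete, the sketch below normalizes per-component variances into fractions and then summarizes them with an entropy-based effective rank (the exponential of the Shannon entropy of the fractions). That particular definition is an assumption chosen for illustration; the manuscript's Supplementary Methods define the estimator actually used. Function names and toy values are hypothetical.

```python
import math


def variance_fractions(component_variances):
    """Fraction of total variance attributed to each latent component."""
    total = sum(component_variances)
    return [v / total for v in component_variances]


def entropy_effective_rank(fractions):
    """Assumed definition: exp of the Shannon entropy of the variance fractions."""
    entropy = -sum(f * math.log(f) for f in fractions if f > 0)
    return math.exp(entropy)


# Toy variances: one dominant component and several minor ones.
fractions = variance_fractions([8.0, 1.0, 0.5, 0.5])
rank = entropy_effective_rank(fractions)
```

Under this definition, equal variance across k components gives an effective rank of k, while a single dominant component gives a value close to 1.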

CPAIdentifyCrypticPhenoStep5.py: This script simply identifies the top performing cryptic phenotype for a given model.

CPAValidationStep6_UCSFModel.py: This script computes the increase in case severity among withheld disease cases in the UCSF dataset for the UCSF model.

CPAValidationStep6_UKBBModel.py: This script computes the increase in case severity among withheld disease cases in the UCSF dataset for the UKBB model. When possible, it also computes this information for the UKBB model and dataset. Diagnoses are not available for all diseases in the UKBB dataset; in such cases, the analysis is only performed in the UCSF dataset.

BuildCombinedUCSF_UKBBTable.py: Simple script that concatenates summary tables for model inference results into a single table. It is included for reference so that 'ModelInferenceCombinedResults.pth' can be reconstructed if needed.

CollectResults_FilterFinalDiseases.py: This script performs the filtering described in Blair et al. 2022, which results in the final 10 diseases that replicated in both the UCSF and UKBB datasets.

DatasetModelCompare.py: This script was used to produce the analyses displayed in Figures 2d, 2e, and 2f.

MolecularValidation

MolecularValidation.py: Script that performs the validation analysis for the cryptic phenotypes using exome sequencing data. These results are displayed in Figure 3 and Supplementary Figure 6.

GWAS

plink_gwas.sh: Shell script for performing GWAS using plink2.

calcldakweightings_taggings.sh: Script used to calculate LDAK weights/taggings, which are needed for downstream analyses. See https://dougspeed.com/calculate-taggings/ for details.

ldak_sumher.sh: Script used to compute heritability estimates from GWAS summary statistics.

ldak_bolt.sh: Script used to estimate the LDAK-BOLT model for genomic predictions/PGS inference.

ldak_scores.sh: Script used to impute polygenic scores into the training, validation, and target cohorts. Note, only the validation/target datasets (not used for BOLT model inference) are analyzed in Figures 5/6.
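Conceptually, a polygenic score is a weighted sum of an individual's allele dosages, with the weights taken from a fitted prediction model. The sketch below shows that arithmetic only; it is not a substitute for the LDAK commands in ldak_scores.sh, and the function name, variable names, and toy numbers are hypothetical.

```python
def polygenic_score(dosages, effects):
    """Weighted sum of allele dosages (0 to 2 per variant) and per-variant effects."""
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    return sum(d * b for d, b in zip(dosages, effects))


# Toy example: three variants with made-up effect sizes.
score = polygenic_score([0, 1, 2], [0.1, -0.2, 0.3])
```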

GenerateQQPlots.py: Generates the 5 QQ plots shown in Figure 4.

A1ATD_PostGWASAnalysis.py: Performs all PGS and P/LP-related analyses in the target/validation cohorts for A1ATD. Results are displayed in Figure 5 and Supplementary Figure 7.

AS_PostGWASAnalysis.py: Performs all PGS and P/LP-related analyses in the target/validation cohorts for AS. Results are displayed in Figure 6 and Supplementary Figure 8.

ADPKD_PostGWASAnalysis.py: Performs all PGS and P/LP-related analyses in the target/validation cohorts for ADPKD. Results are displayed in Figure 6 and Supplementary Figure 9.

Owner

Login: daverblair
Kind: user

Repositories: 5
Profile: https://github.com/daverblair

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Blair"
  given-names: "David Randall"
  orcid: "https://orcid.org/0000-0002-7455-6488"
- family-names: "Hoffmann"
  given-names:  "Thomas J"
- family-names: "Shieh"
  given-names:  "Joseph T"
title: "Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis"
version: 0.0.1
doi: 10.5281/zenodo.6468762
date-released: 2022-04-18
url: "https://github.com/daverblair/CrypticPhenotypeAnalysisScripts"
