dpe

Distribution Proportion Estimation

https://github.com/bdevans/dpe

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, zenodo.org
  • Committers with academic emails
    3 of 5 committers (60.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Distribution Proportion Estimation

Basic Info
  • Host: GitHub
  • Owner: bdevans
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Size: 511 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created about 5 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

Distribution Proportion Estimation <!-- omit in toc -->

Methods for estimating the prevalence of cases in a mixture population based on genetic risk scores.

GitHub license Workflow DOI Paper


This repository contains the Python 3 implementation of the proportion estimation algorithms. The following instructions assume that you have a working Python (>=3.6) installation obtained either through Anaconda (recommended) or through other means (which requires pip).

A separate Matlab implementation is provided in the matlab subdirectory with its own instructions.

Table of Contents <!-- omit in toc -->

Installation

  1. Install Python >= 3.6.
  2. Clone this repository and add this folder to your PYTHONPATH: git clone https://github.com/bdevans/DPE.git
  3. Install the requirements using one of the provided requirements files:
    1. If you use a standard Python distribution: pip3 install -r requirements.txt.
    2. Or, if you use Anaconda: conda env create -f environment.yml. Then run source activate dpe to activate the environment.

This should take approximately 1-2 minutes to complete, depending on the speed of your internet connection and the number of dependencies already satisfied.

The examples given were tested with Python 3.8.2 on macOS 11.4 (20F71) running in a Docker container (version 20.10.7, build f0df350). The exact packages installed with conda during testing are given in .devcontainer/frozen_environment.yml.

Running the examples

Three main code files are relevant for review purposes:

  1. scripts/run_examples.py: the main script for applying the proportion estimation algorithms to the example data sets.
  2. dpe/datasets.py: utilities for generating, loading and saving synthetic data sets.
  3. dpe/estimate.py: the main routines for estimating proportions in a mixture distribution.

Once the requirements are installed (and the environment activated if necessary) run the example script with:

python scripts/run_examples.py

The proportion estimates and confidence intervals are then generated, with a plain text summary printed to the console (when verbose > 0) and written to a log file (named after the data set file with a .log extension).

The analysis parameters may be changed by editing the run_examples.py script. It is recommended to keep the total number of mixture bootstraps (n_mix * n_boot) below around 10,000 when using the KDE method, as this may take a long time to finish.

Optionally, the file datasets.py may be edited to change the construction parameters and generate new data sets to analyse.

Expected Output

When running with the parameters N_M = 100 and N_B = 100 on a 2015 15" MacBook Pro (2.8 GHz Intel Core i7) with 8 threads, processing each data set takes around 40 minutes. The majority of this time is spent running the KDE algorithm (the other three methods take approximately 1 minute each). The run time increases considerably with the number of mixtures generated (n_mix) and/or the number of bootstraps generated for each mixture (n_boot). Accordingly, a short simulation for demonstration purposes has been configured in run_examples.py by using fewer bootstraps (N_B = 100) than in the manuscript (N_B = 1000), although this can be edited for longer runs if desired. The output produced for the first data set (p_C = 0.25) is given below for N_M = 100 and N_B = 100 parallelised across 8 threads:

```
Running on example dataset: pC = 0.25
Loading dataset: examplepC025...
Running 100 bootstraps with 8 processors...
Method: 100%|████████████████████████████████████| 4/4 [19:41<00:00, 295.33s/it]

Method       |   Estimated p_C   |   Estimated p_N
======================================================
Excess point |      0.18200      |      0.81800
Excess (µ±σ) | 0.13660 +/- 0.018 | 0.86340 +/- 0.018
Bias         |     -0.04560      |      0.04560
C.I. (95.0%) | 0.19040 , 0.20320 | 0.79680 , 0.80960

Means point  |      0.24279      |      0.75721
Means (µ±σ)  | 0.24264 +/- 0.020 | 0.75736 +/- 0.020
Bias         |     -0.00026      |      0.00026
C.I. (95.0%) | 0.20433 , 0.28278 | 0.71722 , 0.79567

EMD point    |      0.24313      |      0.75687
EMD (µ±σ)    | 0.24399 +/- 0.019 | 0.75601 +/- 0.019
Bias         |      0.00093      |     -0.00093
C.I. (95.0%) | 0.20334 , 0.27881 | 0.72119 , 0.79666

KDE point    |      0.24843      |      0.75157
KDE (µ±σ)    | 0.25001 +/- 0.022 | 0.74999 +/- 0.022
Bias         |      0.00136      |     -0.00136
C.I. (95.0%) | 0.20503 , 0.29097 | 0.70903 , 0.79497

Ground Truth |      0.25000      |      0.75000

Elapsed time = 2503.352 seconds
```

Additionally, a results directory will be created with a subdirectory for each data set processed, containing a csv file with the initial point estimates and bootstrap estimates for each method.

Reproducibility

The results are reproducible by default since a seed (42) is set for the pseudo-random number generator. The seed may be set to any integer in the range [0, 2^32), or to None for a random seed, to explore variations in the results due to stochasticity in sampling.

Running on your own data

The main requirement is to prepare a dictionary (dict) containing the keys R_C, R_N and Mix. The associated values should be (one dimensional) arrays (or lists) of the GRS scores for the "Cases Reference" (R_C) distribution, the "Non-cases Reference" (R_N) distribution and the "Mixture" distribution (Mix) respectively.

Alternatively a csv file may be prepared and loaded with the function datasets.load_dataset(filename) as demonstrated in the run_examples.py script. The csv file should contain the header Group,GRS followed by pairs of code,score values (one per line for each GRS score) where code is 1 for the Cases Reference distribution, 2 for the Non-cases Reference distribution and 3 for the Mixture distribution.
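As an illustration, the scores dictionary and the equivalent csv layout can be prepared as below. The synthetic scores and their distribution parameters here are invented for demonstration; the package's own generators in dpe/datasets.py produce the real example data sets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic GRS scores (means and spreads are invented)
scores = {
    "R_C": rng.normal(0.27, 0.03, size=1000),   # Cases Reference
    "R_N": rng.normal(0.23, 0.03, size=1000),   # Non-cases Reference
    "Mix": rng.normal(0.24, 0.03, size=5000),   # Mixture to analyse
}

# Equivalent csv layout: a "Group,GRS" header, then one "code,score" line per
# GRS score, with codes 1 (Cases Reference), 2 (Non-cases Reference), 3 (Mixture)
lines = ["Group,GRS"]
for code, key in [(1, "R_C"), (2, "R_N"), (3, "Mix")]:
    lines.extend(f"{code},{s}" for s in scores[key])
csv_text = "\n".join(lines)
```

A dictionary in this form can be passed to analyse_mixture() directly, while the csv text, saved to a file, matches the layout that datasets.load_dataset(filename) expects.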

Once the GRS scores have been prepared in a suitable form, they may be passed to the analyse_mixture() function as demonstrated in the run_examples.py script. Further details about this function are given in the following sections.

Methods

Pseudocode

The main algorithm is as follows. Details of the specific methods may be found in the methods section of the accompanying manuscript.

```sh
bins <-- generate_bins({R_C, R_N, Mix}, method='fd')  # Freedman-Diaconis
p^hat <-- {}  # Dictionary (hashtable) of proportion estimates
p^cor <-- {}  # Dictionary (hashtable) of corrected proportion estimates
p^mbe <-- {}  # Dictionary (hashtable) of mixture-bootstrap estimates
CI <-- {}  # Dictionary (hashtable) of confidence intervals (2-tuple)
for meth in ["Excess", "Means", "EMD", "KDE"]:
    p^hat[meth] <-- get_point_estimate({R_C, R_N, Mix}, bins, meth)  # Proportion of cases
    # Calculate confidence intervals around the initial proportion estimates
    p^mbe[meth] <-- []  # Create empty list of estimates for each method
    for m in 1 to N_M:  # Number of Monte Carlo mixtures
        Mix_meth_m <-- get_monte_carlo_mixture({R_C, R_N}, p^hat[meth])
        for b in 1 to N_B:  # Number of bootstraps
            Mix_meth_m_b <-- get_bootstrap({Mix_meth_m}, bins)  # Sample with replacement
            p_meth_m_b <-- get_point_estimate({R_C, R_N, Mix_meth_m_b}, bins, meth)
            p^mbe[meth].append(p_meth_m_b)
        end
    end
    p^cor[meth] <-- correct_estimate(p^hat[meth], p^mbe[meth])  # Use the set of all bootstraps of all mixtures for each method
    CI[meth] <-- get_confidence_intervals(p^cor[meth], p^mbe[meth], alpha=0.05)
end
```

Explanation of the main function

```python
def analyse_mixture(scores, bins='fd', methods='all', n_boot=1000,
                    boot_size=-1, n_mix=0, alpha=0.05, ci_method="bca",
                    correct_bias=False, seed=None, n_jobs=1, verbose=1,
                    true_pC=None, logfile=''):
```

Inputs

  • scores (dict): A required dictionary of the form, {'R_C': array_of_cases_scores, 'R_N': array_of_non-cases_scores, 'Mix': array_of_mixture_scores}.
  • bins (str): A string specifying the binning method: ['auto', 'fd', 'doane', 'scott', 'rice', 'sturges', 'sqrt']. Default: 'fd'. Alternatively, a dictionary: {'width': bin_width, 'min': min_edge, 'max': max_edge, 'edges': array_of_bin_edges, 'centers': array_of_bin_centers, 'n': number_of_bins}.
  • methods (str): A string with the name of the method or 'all' to run all methods (default). Alternatively, a list of method names (strings), ["Excess", "Means", "EMD", "KDE"], or a dictionary of (bool) flags, {'Excess': True, 'Means': True, 'EMD': True, 'KDE': True}.
  • n_boot (int): Number of bootstraps of the mixture to generate. Default: 1000.
  • boot_size (int): The size of each mixture bootstrap. Default is the same size as the mixture.
  • n_mix (int): Number of mixtures to construct based on the initial point estimate. Default: 0.
  • alpha (float): The alpha value for calculating confidence intervals from bootstrap distributions. Default: 0.05.
  • ci_method (str): The name of the method used to calculate the confidence intervals. Default: 'bca'.
  • correct_bias (bool): A boolean flag specifying whether to apply the bootstrap correction method. Default: False.
  • seed (int): An optional value to seed the random number generator (in the range [0, (2^32)-1]) for reproducibility of the sampling used for confidence intervals. Default: None.
  • n_jobs (int): Number of bootstrap jobs to run in parallel. Default: 1. (n_jobs = -1 runs on all CPUs).
  • verbose (int): Integer to control the level of output (0, 1, 2). Set to -1 to turn off all console output except the progress bars.
  • true_pC (float): Optionally pass the true proportion for showing the comparison with estimated proportion(s).
  • logfile (str): Optional filename for the output logs. Default: "proportion_estimates.log".
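For intuition about the bins argument, the dictionary fields listed above can be approximated with NumPy's Freedman-Diaconis rule (the default 'fd'). This is a sketch using invented pooled data, not the package's internal binning routine, whose details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented pooled scores standing in for R_C, R_N and Mix combined
data = np.concatenate([rng.normal(0.27, 0.03, 1000),
                       rng.normal(0.23, 0.03, 1000)])

# Freedman-Diaconis bin edges over the pooled scores
edges = np.histogram_bin_edges(data, bins="fd")

# A dictionary mirroring the documented bins structure
bins = {
    "width": edges[1] - edges[0],
    "min": edges[0],
    "max": edges[-1],
    "edges": edges,
    "centers": (edges[:-1] + edges[1:]) / 2,
    "n": len(edges) - 1,
}
```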

Outputs

(summary, bootstraps) (tuple): A tuple consisting of the following data structures.

  • summary (dict): A nested dictionary with a key for each estimation method within which is a dictionary with the following keys:

    • p_C : the prevalence estimate

    Optionally, if bootstrapping is used:

    • CI : the confidence intervals around the prevalence
    • mean : the mean of the bootstrapped estimates
    • std : the standard deviation of the bootstrap estimates
    • p_cor_C : the corrected prevalence estimate (when correct_bias == True)

  • bootstraps (DataFrame): A pandas dataframe of the proportion estimates. The first row is the point estimate. The remaining n_boot * n_mix rows are the bootstrapped estimates. Each column is the name of the estimation method.

Additionally, the logfile is written to the working directory.
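To show how the returned summary might be consumed, here is a sketch built around the documented structure; the numbers are invented placeholders, not real estimates produced by the package:

```python
# Invented placeholder results following the documented summary structure
summary = {
    "EMD": {"p_C": 0.243, "CI": (0.203, 0.279), "mean": 0.244, "std": 0.019},
    "KDE": {"p_C": 0.248, "CI": (0.205, 0.291), "mean": 0.250, "std": 0.022},
}

report = []
for method, stats in summary.items():
    lo, hi = stats["CI"]  # confidence interval around the prevalence
    report.append(f"{method}: p_C = {stats['p_C']:.3f} [{lo:.3f}, {hi:.3f}]")
print("\n".join(report))
# EMD: p_C = 0.243 [0.203, 0.279]
# KDE: p_C = 0.248 [0.205, 0.291]
```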

Code Snippets

We bootstrap (i.e. sample with replacement) from the available data, systematically varying sample size and mixture proportion. We then apply the following methods to yield estimates of the true proportion and from those, calculate the errors throughout the bootstrap parameter space.

  1. Excess

    ```python
    number_low = len(RM[RM <= population_median])
    number_high = len(RM[RM > population_median])
    p_hat_C = abs(number_high - number_low) / len(RM)
    p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
    ```

  2. Means

    ```python
    mu_C, mu_N = methods["Means"]["mu_C"], methods["Means"]["mu_N"]
    if mu_C > mu_N:  # This should be the case
        p_hat_C = (RM.mean() - mu_N) / (mu_C - mu_N)
    else:
        p_hat_C = (mu_N - RM.mean()) / (mu_N - mu_C)
    p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
    ```

  3. EMD

    ```math
    EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j} d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j}}
    ```

    ```python
    def interpolate_CDF(scores, x_i, min_edge, max_edge):
        """Interpolate the cumulative density function of the scores
        at the points in the array `x_i`.
        """
        x = [min_edge, *sorted(scores), max_edge]
        y = np.linspace(0, 1, num=len(x), endpoint=True)
        (iv, ii) = np.unique(x, return_index=True)
        return np.interp(x_i, iv, y[ii])

    # Interpolate the CDFs at the same points for comparison
    CDF1 = interpolate_CDF(scores["R_C"], bins['centers'], bins['min'], bins['max'])
    CDF2 = interpolate_CDF(scores["R_N"], bins['centers'], bins['min'], bins['max'])
    CDFM = interpolate_CDF(RM, bins['centers'], bins['min'], bins['max'])

    # EMDs computed with interpolated CDFs
    EMD12 = sum(abs(CDF1 - CDF2))
    EMDM1 = sum(abs(CDFM - CDF1))
    EMDM2 = sum(abs(CDFM - CDF2))

    p_hat_C = 0.5 * (1 + (EMDM2 - EMDM1) / EMD12)
    ```

  4. KDE

    1. Convolve a Gaussian kernel (bandwidth set with FD method by default) with each datum for each reference population (R_C and R_N) to produce two distribution templates: kde_R_C and kde_R_N.
    2. Create a new model which is the weighted sum of these two distributions (initialise to equal amplitudes: amp_R_C = amp_R_N = 1): y = amp_R_C * kde_R_C + amp_R_N * kde_R_N.
    3. Convolve the same kernel with the unknown mixture data to form kde_M.
    4. Fit the combined model to the smoothed mixture allowing the amplitude of each reference distribution to vary. Optimise with the least squares algorithm.
    5. p_hat_C = amp_R_C / (amp_R_C + amp_R_N).

    ```python
    def fit_KDE_model(Mix, bins, model, params_mix, kernel, method='leastsq'):
        x = bins["centers"]
        bw = bins['width']
        kde_mix = KernelDensity(kernel=kernel, bandwidth=bw).fit(Mix[:, np.newaxis])
        res_mix = model.fit(np.exp(kde_mix.score_samples(x[:, np.newaxis])),
                            x=x, params=params_mix, method=method)
        amp_R_C = res_mix.params['amp_1'].value
        amp_R_N = res_mix.params['amp_2'].value
        return amp_R_C / (amp_R_C + amp_R_N)
    ```

Note: The population of cases is sometimes referred to as Reference 1 and the population of non-cases as Reference 2, so these notations may be used interchangeably: R_C == Ref1, R_N == Ref2; p_C == p1, p_N == p2.

Summary

|               | Excess | Means | EMD | KDE |
| ------------- | ------ | ----- | --- | --- |
| Advantages    | Simple to calculate. Relies only on summary statistics. | Simple to calculate. Relies only on summary statistics. Works for even highly overlapping distributions. | Computationally cheap. | Accurate. |
| Disadvantages | Less accurate when the distributions significantly overlap. Not recommended for data analysis (included for comparison only). | Less accurate when data is not normally distributed. | Requires a choice of bin size. | Requires a choice of bandwidth and kernel. Computationally expensive. |

Citation

If you use this software, please cite our accompanying publication:

```bibtex
@article{EvansSlowinskiEstimatingDiseasePrevalence2021,
  author  = {Evans, Benjamin D. and S{\l}owi{\'n}ski, Piotr and Hattersley, Andrew T. and Jones, Samuel E. and Sharp, Seth and Kimmitt, Robert A. and Weedon, Michael N. and Oram, Richard A. and Tsaneva-Atanasova, Krasimira and Thomas, Nicholas J.},
  title   = {Estimating disease prevalence in large datasets using genetic risk scores},
  volume  = {12},
  number  = {1},
  pages   = {6441},
  journal = {Nature Communications},
  issn    = {2041-1723},
  year    = {2021},
  month   = nov,
  url     = {https://www.nature.com/articles/s41467-021-26501-7},
  doi     = {10.1038/s41467-021-26501-7},
}
```

Owner

  • Name: Ben Evans
  • Login: bdevans
  • Kind: user
  • Location: Brighton, UK
  • Company: University of Sussex

Computational Neuroscientist (s̶t̶i̶l̶l̶ ̶i̶n̶ ̶b̶e̶t̶a̶) and Python evangelist! Likes include open-science, machine learning and Jack Russell Terriers! 🐶

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite both the article from preferred-citation and the software itself."
authors:
- family-names: "Evans"
  given-names: "Benjamin"
  orcid: "https://orcid.org/0000-0002-1734-6070"
- family-names: "Slowinski"
  given-names: "Piotr"
  orcid: "https://orcid.org/0000-0002-6612-9902"
title: "Distribution Proportion Estimation"
version: 1.0.0
doi: 10.5281/zenodo.5512651
date-released: 2021-09-16
url: "https://github.com/bdevans/DPE"
preferred-citation:
  type: article
  authors:
  - family-names: "Evans"
    given-names: "Benjamin D."
    orcid: "https://orcid.org/0000-0002-1734-6070"
  - family-names: "Slowinski"
    given-names: "Piotr"
    orcid: "https://orcid.org/0000-0002-6612-9902"
  - family-names: "Hattersley"
    given-names: "Andrew T."
  - family-names: "Jones"
    given-names: "Samuel E."
    orcid: "http://orcid.org/0000-0003-0153-922X"
  - family-names: "Sharp"
    given-names: "Seth"
  - family-names: "Kimmitt"
    given-names: "Robert A."
  - family-names: "Weedon"
    given-names: "Michael N."
  - family-names: "Oram"
    given-names: "Richard A."
  - family-names: "Tsaneva-Atanasova"
    given-names: "Krasimira"
    orcid: "http://orcid.org/0000-0002-6294-7051"
  - family-names: "Thomas"
    given-names: "Nicholas J."
  doi: "10.1038/s41467-021-26501-7"
  journal: "Nature Communications"
  month: 11
  start: 6441 # First page number
  end: 6452 # Last page number
  title: "Estimating disease prevalence in large datasets using genetic risk scores"
  issue: 1
  volume: 12
  year: 2021
  abstract: "Clinical classification is essential for estimating disease prevalence but is difficult, often requiring complex investigations. The widespread availability of population level genetic data makes novel genetic stratification techniques a highly attractive alternative. We propose a generalizable mathematical framework for determining disease prevalence within a cohort using genetic risk scores. We compare and evaluate methods based on the means of genetic risk scores’ distributions; the Earth Mover’s Distance between distributions; a linear combination of kernel density estimates of distributions; and an Excess method. We demonstrate the performance of genetic stratification to produce robust prevalence estimates. Specifically, we show that robust estimates of prevalence are still possible even with rarer diseases, smaller cohort sizes and less discriminative genetic risk scores, highlighting the general utility of these approaches. Genetic stratification techniques offer exciting new research tools, enabling unbiased insights into disease prevalence and clinical characteristics unhampered by clinical classification criteria."

GitHub Events

Total
Last Year

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 837
  • Total Committers: 5
  • Avg Commits per committer: 167.4
  • Development Distribution Score (DDS): 0.055
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Ben Evans b****s@e****k 791
SlowinskiPiotr 5****r 29
Nick Thomas n****3@e****k 13
pms210 p****i@e****k 3
Timing Liu l****g@o****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • liutiming (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

environment.yml conda
  • joblib >=0.13
  • lmfit >=0.9
  • matplotlib >=3.3
  • numpy >=1.7
  • pandas
  • pip
  • psutil
  • pytest
  • python 3.8.*
  • scikit-learn
  • scipy
  • seaborn
  • statsmodels
  • tqdm
  • xlrd
requirements.txt pypi
  • joblib >=0.13
  • lmfit >=0.9
  • numpy >=1.7
  • pandas *
  • psutil *
  • scikit-learn *
  • scipy *
  • statsmodels *
  • tqdm *
.github/workflows/python-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.devcontainer/Dockerfile docker
  • mcr.microsoft.com/vscode/devcontainers/miniconda 0-3 build