dpe

Distribution Proportion Estimation

https://github.com/bdevans/dpe

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, zenodo.org
  • Committers with academic emails
    3 of 5 committers (60.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Distribution Proportion Estimation

Basic Info
  • Host: GitHub
  • Owner: bdevans
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Size: 511 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created about 5 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

Distribution Proportion Estimation <!-- omit in toc -->

Methods for estimating the prevalence of cases in a mixture population based on genetic risk scores.

GitHub license Workflow DOI Paper


This repository contains the Python 3 implementation of the proportion estimation algorithms. The following instructions assume that you have a working Python (>=3.6) installation obtained either through Anaconda (recommended) or through other means (which requires pip).

A separate Matlab implementation is provided in the matlab subdirectory with its own instructions.

Table of Contents <!-- omit in toc -->

Installation

  1. Install Python >= 3.6.
  2. Clone this repository and add this folder to your PYTHONPATH: git clone https://github.com/bdevans/DPE.git
  3. Install the requirements using one of the provided requirements files:
    1. If you use a standard Python distribution: pip3 install -r requirements.txt.
    2. Or, if you use Anaconda: conda env create -f environment.yml. Then run source activate dpe to activate the environment.

This should take approximately 1-2 minutes to complete, depending on the speed of your internet connection and the number of dependencies already satisfied.

The examples given were tested with Python 3.8.2 on macOS 11.4 (20F71) running in a Docker container (version 20.10.7, build f0df350). The exact packages installed with conda during testing are given in .devcontainer/frozen_environment.yml.

Running the examples

Three main code files are relevant for review purposes:

  1. scripts/run_examples.py: the main script for applying the proportion estimation algorithms to the example data sets.
  2. dpe/datasets.py: utilities for generating, loading and saving synthetic data sets.
  3. dpe/estimate.py: the main routines for estimating proportions in a mixture distribution.

Once the requirements are installed (and the environment activated if necessary) run the example script with:

python scripts/run_examples.py

The proportion estimates and confidence intervals are then generated, with a plain text summary printed to the console (when verbose > 0) and written to a log file (named after the data set file with a .log extension).

The analysis parameters may be changed by editing the run_examples.py script. It is recommended to keep the total number of mixture bootstraps (n_mix * n_boot) below around 10,000 when using the KDE method, as this may take a long time to finish.

Optionally, the file datasets.py may be edited to change the construction parameters and generate new data sets to analyse.

Expected Output

When running with the parameters N_M = 100 and N_B = 100 on a 2015 15" MacBook Pro (2.8 GHz Intel Core i7) with 8 threads, processing each data set takes around 40 minutes. The majority of this time is spent running the KDE algorithm (the other three methods take approximately 1 minute each). The run time increases considerably with the number of mixtures generated (n_mix) and/or the number of bootstraps generated for each mixture (n_boot). Accordingly, a short simulation for demonstration purposes has been configured in run_examples.py by using fewer bootstraps (N_B = 100) than in the manuscript (N_B = 1000), although this can be edited for longer runs if desired. The output produced for the first data set (p_C = 0.25) is given below for N_M = 100 and N_B = 100 parallelised across 8 threads:

```
Running on example dataset: pC = 0.25
Loading dataset: examplepC025...
Running 100 bootstraps with 8 processors...
Method: 100%|████████████████████████████████████| 4/4 [19:41<00:00, 295.33s/it]

Method       |   Estimated p_C   |   Estimated p_N
======================================================
Excess point |      0.18200      |      0.81800
Excess (µ±σ) | 0.13660 +/- 0.018 | 0.86340 +/- 0.018
Bias         |     -0.04560      |      0.04560
C.I. (95.0%) | 0.19040 , 0.20320 | 0.79680 , 0.80960

Means point  |      0.24279      |      0.75721
Means (µ±σ)  | 0.24264 +/- 0.020 | 0.75736 +/- 0.020
Bias         |     -0.00026      |      0.00026
C.I. (95.0%) | 0.20433 , 0.28278 | 0.71722 , 0.79567

EMD point    |      0.24313      |      0.75687
EMD (µ±σ)    | 0.24399 +/- 0.019 | 0.75601 +/- 0.019
Bias         |      0.00093      |     -0.00093
C.I. (95.0%) | 0.20334 , 0.27881 | 0.72119 , 0.79666

KDE point    |      0.24843      |      0.75157
KDE (µ±σ)    | 0.25001 +/- 0.022 | 0.74999 +/- 0.022
Bias         |      0.00136      |     -0.00136
C.I. (95.0%) | 0.20503 , 0.29097 | 0.70903 , 0.79497

Ground Truth |      0.25000      |      0.75000

Elapsed time = 2503.352 seconds
```

Additionally, a results directory will be created with a subdirectory for each data set processed, containing a csv file with the initial point estimates and bootstrap estimates for each method.

Reproducibility

The results are reproducible by default since a seed (42) is set for the pseudo-random number generator. The seed may be set to any integer in the range [0, 2^32), or to None for a random seed, to explore variations in the results due to stochasticity in sampling.

Running on your own data

The main requirement is to prepare a dictionary (dict) containing the keys R_C, R_N and Mix. The associated values should be (one dimensional) arrays (or lists) of the GRS scores for the "Cases Reference" (R_C) distribution, the "Non-cases Reference" (R_N) distribution and the "Mixture" distribution (Mix) respectively.

Alternatively a csv file may be prepared and loaded with the function datasets.load_dataset(filename) as demonstrated in the run_examples.py script. The csv file should contain the header Group,GRS followed by pairs of code,score values (one per line for each GRS score) where code is 1 for the Cases Reference distribution, 2 for the Non-cases Reference distribution and 3 for the Mixture distribution.
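As an illustration, the scores dictionary and the equivalent csv layout can be prepared as below. The synthetic scores and their distribution parameters here are invented for demonstration; the package's own generators in dpe/datasets.py produce the real example data sets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic GRS scores (means and spreads are invented)
scores = {
    "R_C": rng.normal(0.27, 0.03, size=1000),   # Cases Reference
    "R_N": rng.normal(0.23, 0.03, size=1000),   # Non-cases Reference
    "Mix": rng.normal(0.24, 0.03, size=5000),   # Mixture to analyse
}

# Equivalent csv layout: a "Group,GRS" header, then one "code,score" line per
# GRS score, with codes 1 (Cases Reference), 2 (Non-cases Reference), 3 (Mixture)
lines = ["Group,GRS"]
for code, key in [(1, "R_C"), (2, "R_N"), (3, "Mix")]:
    lines.extend(f"{code},{s}" for s in scores[key])
csv_text = "\n".join(lines)
```

A dictionary in this form can be passed to analyse_mixture() directly, while the csv text, saved to a file, matches the layout that datasets.load_dataset(filename) expects.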

Once the GRS scores have been prepared in a suitable form, they may be passed to the analyse_mixture() function as demonstrated in the run_examples.py script. Further details about this function are given in the following sections.

Methods

Pseudocode

The main algorithm is as follows. Details of the specific methods may be found in the methods section of the accompanying manuscript.

```sh
bins <-- generate_bins({R_C, R_N, Mix}, method='fd')  # Freedman-Diaconis
p^hat <-- {}  # Dictionary (hashtable) of proportion estimates
p^cor <-- {}  # Dictionary (hashtable) of corrected proportion estimates
p^mbe <-- {}  # Dictionary (hashtable) of mixture-bootstrap estimates
CI <-- {}  # Dictionary (hashtable) of confidence intervals (2-tuple)
for meth in ["Excess", "Means", "EMD", "KDE"]:
    p^hat[meth] <-- get_point_estimate({R_C, R_N, Mix}, bins, meth)  # Proportion of cases
    # Calculate confidence intervals around the initial proportion estimates
    p^mbe[meth] <-- []  # Create empty list of estimates for each method
    for m in 1 to N_M:  # Number of Monte Carlo mixtures
        Mix_meth_m <-- get_monte_carlo_mixture({R_C, R_N}, p^hat[meth])
        for b in 1 to N_B:  # Number of bootstraps
            Mix_meth_m_b <-- get_bootstrap({Mix_meth_m}, bins)  # Sample with replacement
            p_meth_m_b <-- get_point_estimate({R_C, R_N, Mix_meth_m_b}, bins, meth)
            p^mbe[meth].append(p_meth_m_b)
        end
    end
    p^cor[meth] <-- correct_estimate(p^hat[meth], p^mbe[meth])  # Use the set of all bootstraps of all mixtures for each method
    CI[meth] <-- get_confidence_intervals(p^cor[meth], p^mbe[meth], alpha=0.05)
end
```

Explanation of the main function

```python
def analyse_mixture(scores, bins='fd', methods='all', n_boot=1000,
                    boot_size=-1, n_mix=0, alpha=0.05, ci_method="bca",
                    correct_bias=False, seed=None, n_jobs=1, verbose=1,
                    true_pC=None, logfile=''):
```

Inputs

  • scores (dict): A required dictionary of the form, {'R_C': array_of_cases_scores, 'R_N': array_of_non-cases_scores, 'Mix': array_of_mixture_scores}.
  • bins (str): A string specifying the binning method: ['auto', 'fd', 'doane', 'scott', 'rice', 'sturges', 'sqrt']. Default: 'fd'. Alternatively, a dictionary: {'width': bin_width, 'min': min_edge, 'max': max_edge, 'edges': array_of_bin_edges, 'centers': array_of_bin_centers, 'n': number_of_bins}.
  • methods (str): A string with the name of the method or 'all' to run all methods (default). Alternatively, a list of method names (strings), ["Excess", "Means", "EMD", "KDE"], or a dictionary of (bool) flags, {'Excess': True, 'Means': True, 'EMD': True, 'KDE': True}.
  • n_boot (int): Number of bootstraps of the mixture to generate. Default: 1000.
  • boot_size (int): The size of each mixture bootstrap. Default is the same size as the mixture.
  • n_mix (int): Number of mixtures to construct based on the initial point estimate. Default: 0.
  • alpha (float): The alpha value for calculating confidence intervals from bootstrap distributions. Default: 0.05.
  • ci_method (str): The name of the method used to calculate the confidence intervals. Default: 'bca'.
  • correct_bias (bool): A boolean flag specifying whether to apply the bootstrap correction method. Default: False.
  • seed (int): An optional value to seed the random number generator (in the range [0, (2^32)-1]) for reproducibility of the sampling used for confidence intervals. Default: None.
  • n_jobs (int): Number of bootstrap jobs to run in parallel. Default: 1. (n_jobs = -1 runs on all CPUs).
  • verbose (int): Integer to control the level of output (0, 1, 2). Set to -1 to turn off all console output except the progress bars.
  • true_pC (float): Optionally pass the true proportion for showing the comparison with estimated proportion(s).
  • logfile (str): Optional filename for the output logs. Default: "proportion_estimates.log".
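For intuition about the bins argument, the dictionary fields listed above can be approximated with NumPy's Freedman-Diaconis rule (the default 'fd'). This is a sketch using invented pooled data, not the package's internal binning routine, whose details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented pooled scores standing in for R_C, R_N and Mix combined
data = np.concatenate([rng.normal(0.27, 0.03, 1000),
                       rng.normal(0.23, 0.03, 1000)])

# Freedman-Diaconis bin edges over the pooled scores
edges = np.histogram_bin_edges(data, bins="fd")

# A dictionary mirroring the documented bins structure
bins = {
    "width": edges[1] - edges[0],
    "min": edges[0],
    "max": edges[-1],
    "edges": edges,
    "centers": (edges[:-1] + edges[1:]) / 2,
    "n": len(edges) - 1,
}
```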

Outputs

(summary, bootstraps) (tuple): A tuple consisting of the following data structures.

  • summary (dict): A nested dictionary with a key for each estimation method within which is a dictionary with the following keys:

    • p_C : the prevalence estimate

    Optionally, if bootstrapping is used:

    • CI : the confidence intervals around the prevalence
    • mean : the mean of the bootstrapped estimates
    • std : the standard deviation of the bootstrap estimates
    • p_cor_C : the corrected prevalence estimate (when correct_bias == True)

  • bootstraps (DataFrame): A pandas dataframe of the proportion estimates. The first row is the point estimate. The remaining n_boot * n_mix rows are the bootstrapped estimates. Each column is the name of the estimation method.

Additionally, the logfile is written to the working directory.
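To show how the returned summary might be consumed, here is a sketch built around the documented structure; the numbers are invented placeholders, not real estimates produced by the package:

```python
# Invented placeholder results following the documented summary structure
summary = {
    "EMD": {"p_C": 0.243, "CI": (0.203, 0.279), "mean": 0.244, "std": 0.019},
    "KDE": {"p_C": 0.248, "CI": (0.205, 0.291), "mean": 0.250, "std": 0.022},
}

report = []
for method, stats in summary.items():
    lo, hi = stats["CI"]  # confidence interval around the prevalence
    report.append(f"{method}: p_C = {stats['p_C']:.3f} [{lo:.3f}, {hi:.3f}]")
print("\n".join(report))
# EMD: p_C = 0.243 [0.203, 0.279]
# KDE: p_C = 0.248 [0.205, 0.291]
```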

Code Snippets

We bootstrap (i.e. sample with replacement) from the available data, systematically varying sample size and mixture proportion. We then apply the following methods to yield estimates of the true proportion and from those, calculate the errors throughout the bootstrap parameter space.

  1. Excess

    ```python
    number_low = len(RM[RM <= population_median])
    number_high = len(RM[RM > population_median])
    p_hat_C = abs(number_high - number_low) / len(RM)
    p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
    ```

  2. Means

    ```python
    mu_C, mu_N = methods["Means"]["mu_C"], methods["Means"]["mu_N"]
    if mu_C > mu_N:  # This should be the case
        p_hat_C = (RM.mean() - mu_N) / (mu_C - mu_N)
    else:
        p_hat_C = (mu_N - RM.mean()) / (mu_N - mu_C)
    p_hat_C = np.clip(p_hat_C, 0.0, 1.0)
    ```

  3. EMD

    ```math
    EMD(P, Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j} d_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{i,j}}
    ```

    ```python
    def interpolate_CDF(scores, x_i, min_edge, max_edge):
        """Interpolate the cumulative density function of the scores
        at the points in the array `x_i`.
        """
        x = [min_edge, *sorted(scores), max_edge]
        y = np.linspace(0, 1, num=len(x), endpoint=True)
        (iv, ii) = np.unique(x, return_index=True)
        return np.interp(x_i, iv, y[ii])

    # Interpolate the CDFs at the same points for comparison
    CDF1 = interpolate_CDF(scores["R_C"], bins['centers'], bins['min'], bins['max'])
    CDF2 = interpolate_CDF(scores["R_N"], bins['centers'], bins['min'], bins['max'])
    CDFM = interpolate_CDF(RM, bins['centers'], bins['min'], bins['max'])

    # EMDs computed with interpolated CDFs
    EMD12 = sum(abs(CDF1 - CDF2))
    EMDM1 = sum(abs(CDFM - CDF1))
    EMDM2 = sum(abs(CDFM - CDF2))

    p_hat_C = 0.5 * (1 + (EMDM2 - EMDM1) / EMD12)
    ```

  4. KDE

    1. Convolve a Gaussian kernel (bandwidth set with FD method by default) with each datum for each reference population (R_C and R_N) to produce two distribution templates: kde_R_C and kde_R_N.
    2. Create a new model which is the weighted sum of these two distributions (initialise to equal amplitudes: amp_R_C = amp_R_N = 1): y = amp_R_C * kde_R_C + amp_R_N * kde_R_N.
    3. Convolve the same kernel with the unknown mixture data to form kde_M.
    4. Fit the combined model to the smoothed mixture allowing the amplitude of each reference distribution to vary. Optimise with the least squares algorithm.
    5. p_hat_C = amp_R_C / (amp_R_C + amp_R_N).

    ```python
    def fit_KDE_model(Mix, bins, model, params_mix, kernel, method='leastsq'):
        x = bins["centers"]
        bw = bins['width']
        kde_mix = KernelDensity(kernel=kernel, bandwidth=bw).fit(Mix[:, np.newaxis])
        res_mix = model.fit(np.exp(kde_mix.score_samples(x[:, np.newaxis])),
                            x=x, params=params_mix, method=method)
        amp_R_C = res_mix.params['amp_1'].value
        amp_R_N = res_mix.params['amp_2'].value
        return amp_R_C / (amp_R_C + amp_R_N)
    ```

Note: The population of cases is sometimes referred to as Reference 1 and the population of non-cases as Reference 2, so these notations may be used interchangeably: R_C == Ref1, R_N == Ref2; p_C == p1, p_N == p2.

Summary

|               | Excess | Means | EMD | KDE |
| ------------- | ------ | ----- | --- | --- |
| Advantages    | Simple to calculate. Relies only on summary statistics. | Simple to calculate. Relies only on summary statistics. Works for even highly overlapping distributions. | Computationally cheap. | Accurate. |
| Disadvantages | Less accurate when the distributions significantly overlap. Not recommended for data analysis (included for comparison only). | Less accurate when data is not normally distributed. | Requires a choice of bin size. | Requires a choice of bandwidth and kernel. Computationally expensive. |

Citation

If you use this software, please cite our accompanying publication:

```bibtex
@article{EvansSlowinskiEstimatingDiseasePrevalence2021,
  author  = {Evans, Benjamin D. and S{\l}owi{\'n}ski, Piotr and Hattersley, Andrew T. and Jones, Samuel E. and Sharp, Seth and Kimmitt, Robert A. and Weedon, Michael N. and Oram, Richard A. and Tsaneva-Atanasova, Krasimira and Thomas, Nicholas J.},
  title   = {Estimating disease prevalence in large datasets using genetic risk scores},
  volume  = {12},
  number  = {1},
  pages   = {6441},
  journal = {Nature Communications},
  issn    = {2041-1723},
  year    = {2021},
  month   = nov,
  url     = {https://www.nature.com/articles/s41467-021-26501-7},
  doi     = {10.1038/s41467-021-26501-7},
}
```

Owner

  • Name: Ben Evans
  • Login: bdevans
  • Kind: user
  • Location: Brighton, UK
  • Company: University of Sussex

Computational Neuroscientist (s̶t̶i̶l̶l̶ ̶i̶n̶ ̶b̶e̶t̶a̶) and Python evangelist! Likes include open-science, machine learning and Jack Russell Terriers! 🐶

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite both the article from preferred-citation and the software itself."
authors:
- family-names: "Evans"
  given-names: "Benjamin"
  orcid: "https://orcid.org/0000-0002-1734-6070"
- family-names: "Slowinski"
  given-names: "Piotr"
  orcid: "https://orcid.org/0000-0002-6612-9902"
title: "Distribution Proportion Estimation"
version: 1.0.0
doi: 10.5281/zenodo.5512651
date-released: 2021-09-16
url: "https://github.com/bdevans/DPE"
preferred-citation:
  type: article
  authors:
  - family-names: "Evans"
    given-names: "Benjamin D."
    orcid: "https://orcid.org/0000-0002-1734-6070"
  - family-names: "Slowinski"
    given-names: "Piotr"
    orcid: "https://orcid.org/0000-0002-6612-9902"
  - family-names: "Hattersley"
    given-names: "Andrew T."
  - family-names: "Jones"
    given-names: "Samuel E."
    orcid: "http://orcid.org/0000-0003-0153-922X"
  - family-names: "Sharp"
    given-names: "Seth"
  - family-names: "Kimmitt"
    given-names: "Robert A."
  - family-names: "Weedon"
    given-names: "Michael N."
  - family-names: "Oram"
    given-names: "Richard A."
  - family-names: "Tsaneva-Atanasova"
    given-names: "Krasimira"
    orcid: "http://orcid.org/0000-0002-6294-7051"
  - family-names: "Thomas"
    given-names: "Nicholas J."
  doi: "10.1038/s41467-021-26501-7"
  journal: "Nature Communications"
  month: 11
  start: 6441 # First page number
  end: 6452 # Last page number
  title: "Estimating disease prevalence in large datasets using genetic risk scores"
  issue: 1
  volume: 12
  year: 2021
  abstract: "Clinical classification is essential for estimating disease prevalence but is difficult, often requiring complex investigations. The widespread availability of population level genetic data makes novel genetic stratification techniques a highly attractive alternative. We propose a generalizable mathematical framework for determining disease prevalence within a cohort using genetic risk scores. We compare and evaluate methods based on the means of genetic risk scores’ distributions; the Earth Mover’s Distance between distributions; a linear combination of kernel density estimates of distributions; and an Excess method. We demonstrate the performance of genetic stratification to produce robust prevalence estimates. Specifically, we show that robust estimates of prevalence are still possible even with rarer diseases, smaller cohort sizes and less discriminative genetic risk scores, highlighting the general utility of these approaches. Genetic stratification techniques offer exciting new research tools, enabling unbiased insights into disease prevalence and clinical characteristics unhampered by clinical classification criteria."

GitHub Events

Total
Last Year

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 837
  • Total Committers: 5
  • Avg Commits per committer: 167.4
  • Development Distribution Score (DDS): 0.055
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Ben Evans b****s@e****k 791
SlowinskiPiotr 5****r 29
Nick Thomas n****3@e****k 13
pms210 p****i@e****k 3
Timing Liu l****g@o****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • liutiming (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

environment.yml conda
  • joblib >=0.13
  • lmfit >=0.9
  • matplotlib >=3.3
  • numpy >=1.7
  • pandas
  • pip
  • psutil
  • pytest
  • python 3.8.*
  • scikit-learn
  • scipy
  • seaborn
  • statsmodels
  • tqdm
  • xlrd
requirements.txt pypi
  • joblib >=0.13
  • lmfit >=0.9
  • numpy >=1.7
  • pandas *
  • psutil *
  • scikit-learn *
  • scipy *
  • statsmodels *
  • tqdm *
.github/workflows/python-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.devcontainer/Dockerfile docker
  • mcr.microsoft.com/vscode/devcontainers/miniconda 0-3 build