PyNM

PyNM: a Lightweight Python implementation of Normative Modeling - Published in JOSS (2022)

https://github.com/ppsp-team/pynm

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 26 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to arxiv.org, biorxiv.org, zenodo.org
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords from Contributors

exoplanets mesh

Scientific Fields

Mathematics Computer Science - 88% confidence

Economics Social Sciences - 63% confidence

Last synced: 6 months ago · JSON representation

Repository

Lightweight Python implementation of Normative Modelling

Basic Info

Host: GitHub
Owner: ppsp-team
License: bsd-3-clause
Language: Jupyter Notebook
Default Branch: master
Homepage:
Size: 64.4 MB

Statistics

Stars: 45
Watchers: 1
Forks: 13
Open Issues: 0
Releases: 2

Created over 6 years ago · Last pushed about 3 years ago

Metadata Files

Readme Contributing License

README.md


PyNM is a lightweight Python implementation of Normative Modeling, designed to be approachable and easy to adopt. The package provides:

- Python API and a command-line interface for wide accessibility
- Automatic dataset splitting and cross-validation
- Five models from various back-ends in a unified interface that cover a broad range of common use cases
   - Centiles
   - LOESS
   - Gaussian Process (GP)
   - Stochastic Variational Gaussian Process (SVGP)
   - Generalized Additive Models of Location Shape and Scale (GAMLSS)
- Solutions for very large datasets and heteroskedastic data
- Integrated plotting and evaluation functions to quickly check the validity of the model fit and results
- Comprehensive and interactive tutorials

The tutorials can be accessed without any local installation via Binder.

For a more advanced implementation, see the Python library PCNtoolkit.

Installation

Note: a functional installation requires Python 3.9.

Minimal Installation (without R)

Use this option if you aren't using the GAMLSS model and don't need to install R.

```bash
$ pip install pynm
```

Installation with R

Use this option if you are using the GAMLSS model. You must first have R (v4.2.2) installed, along with the following packages:
- gamlss
- gamlss.dist
- gamlss.data

Instructions for installing R can be found at r-project. Once R and the gamlss packages are installed, install pynm:

```bash
$ pip install pynm
```

Bleeding-edge Installation

Use this option if you want to be up to date with the most recent changes to PyNM (not necessarily stable). For the options above, replace pip install pynm with:

```bash
$ git clone https://github.com/ppsp-team/PyNM.git
$ cd PyNM
$ pip install .
```

Command Line Usage

```
usage: pynm [-h] --pheno_p PHENO_P --out_p OUT_P --confounds CONFOUNDS
            --score SCORE --group GROUP [--train_sample TRAIN_SAMPLE]
            [--LOESS] [--centiles] [--bin_spacing BIN_SPACING]
            [--bin_width BIN_WIDTH] [--GP] [--gp_method GP_METHOD]
            [--gp_num_epochs GP_NUM_EPOCHS] [--gp_n_inducing GP_N_INDUCING]
            [--gp_batch_size GP_BATCH_SIZE]
            [--gp_length_scale GP_LENGTH_SCALE]
            [--gp_length_scale_bounds [GP_LENGTH_SCALE_BOUNDS [GP_LENGTH_SCALE_BOUNDS ...]]]
            [--gp_nu NU] [--GAMLSS] [--gamlss_mu GAMLSS_MU]
            [--gamlss_sigma GAMLSS_SIGMA] [--gamlss_nu GAMLSS_NU]
            [--gamlss_tau GAMLSS_TAU] [--gamlss_family GAMLSS_FAMILY]

optional arguments:
  -h, --help            show this help message and exit
  --pheno_p PHENO_P     Path to phenotype data. Data must be in a .csv file.
  --out_p OUT_P         Path to output directory.
  --confounds CONFOUNDS
                        List of confounds to use in the GP model. The list
                        must be formatted as a string with commas between
                        confounds, each confound must be a column name from
                        the phenotype .csv file. For GP model all confounds
                        will be used, for LOESS and Centiles models only the
                        first is used. For GAMLSS all confounds are used
                        unless formulas are specified. Categorical values
                        must be denoted by c(var) ('c' must be lower case),
                        e.g. 'c(SEX)' for column name 'SEX'.
  --score SCORE         Response variable for all models. Must be a column
                        title from phenotype .csv file.
  --group GROUP         Column name from the phenotype .csv file that
                        distinguishes probands from controls. The column must
                        be encoded with str labels using 'PROB' for probands
                        and 'CTR' for controls or with int labels using 1 for
                        probands and 0 for controls.
  --train_sample TRAIN_SAMPLE
                        Which method to use for a training sample, can be a
                        float in (0,1] for a percentage of controls or
                        'manual' to be manually set using a column of the
                        DataFrame labelled 'train_sample'.
  --LOESS               Flag to run LOESS model.
  --centiles            Flag to run Centiles model.
  --bin_spacing BIN_SPACING
                        Distance between bins for LOESS & centiles models.
  --bin_width BIN_WIDTH
                        Width of bins for LOESS & centiles models.
  --GP                  Flag to run Gaussian Process model.
  --gp_method GP_METHOD
                        Method to use for the GP model. Can be set to
                        'auto','approx' or 'exact'. In 'auto' mode, the exact
                        model will be used for datasets smaller than 2000
                        data points. SVGP is used for the approximate model.
                        See documentation for details. Default value is
                        'auto'.
  --gp_num_epochs GP_NUM_EPOCHS
                        Number of training epochs for SVGP model. See
                        documentation for details. Default value is 20.
  --gp_n_inducing GP_N_INDUCING
                        Number of inducing points for SVGP model. See
                        documentation for details. Default value is 500.
  --gp_batch_size GP_BATCH_SIZE
                        Batch size for training and predicting from SVGP
                        model. See documentation for details. Default value
                        is 256.
  --gp_length_scale GP_LENGTH_SCALE
                        Length scale of Matern kernel for exact model. See
                        documentation for details. Default value is 1.
  --gp_length_scale_bounds [GP_LENGTH_SCALE_BOUNDS [GP_LENGTH_SCALE_BOUNDS ...]]
                        The lower and upper bound on lengthscale. If set to
                        'fixed', lengthscale cannot be changed during
                        hyperparameter tuning. See documentation for details.
                        Default value is (1e-5,1e5).
  --gp_nu NU            Nu of Matern kernel for exact and SVGP model. See
                        documentation for details. Default value is 2.5.
  --GAMLSS              Flag to run GAMLSS.
  --gamlss_mu GAMLSS_MU
                        Formula for mu (location) parameter of GAMLSS.
                        Default formula for score is sum of confounds with
                        non-categorical columns as smooth functions, e.g.
                        'score ~ ps(age) + sex'.
  --gamlss_sigma GAMLSS_SIGMA
                        Formula for sigma (scale) parameter of GAMLSS.
                        Default formula is '~ 1'.
  --gamlss_nu GAMLSS_NU
                        Formula for nu (skewness) parameter of GAMLSS.
                        Default formula is '~ 1'.
  --gamlss_tau GAMLSS_TAU
                        Formula for tau (kurtosis) parameter of GAMLSS.
                        Default formula is '~ 1'.
  --gamlss_family GAMLSS_FAMILY
                        Family of distributions to use for fitting, default
                        is 'SHASHo2'. See R documentation for GAMLSS package
                        for other available families of distributions.
```
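The confound format described in the help text can be illustrated with a small parser. This is a minimal sketch, not PyNM's own parsing code: `parse_confounds` is a hypothetical helper that splits the comma-separated string and treats entries wrapped in a lower-case `c(...)` as categorical.

```python
import re

# Hypothetical helper (not part of PyNM): matches entries such as 'c(SEX)'.
# Only a lower-case 'c' counts, as the help text requires.
_CATEGORICAL = re.compile(r"^c\(([^()]+)\)$")


def parse_confounds(spec):
    """Return a list of (column_name, is_categorical) pairs."""
    parsed = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        match = _CATEGORICAL.match(item)
        if match:
            parsed.append((match.group(1), True))
        else:
            parsed.append((item, False))
    return parsed


confounds = parse_confounds("age,c(SEX),c(site)")
print(confounds)
```

Each returned column name would still need to be checked against the columns of the phenotype .csv file before fitting.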

API Example

```python
import pandas as pd
from pynm.pynm import PyNM

# Load data
df = pd.read_csv('data.csv')

# Initialize pynm w/ data and confounds
m = PyNM(df, 'score', 'group', confounds=['age', 'c(sex)', 'c(site)'])

# Run models
m.loess_normative_model()
m.centiles_normative_model()
m.gp_normative_model()
m.gamlss_normative_model()

# Collect output
data = m.data
```

Documentation

All functions have standard Python docstrings, which you can view with help(). You can also see the tutorials for documented examples.

Training sample

By default, the models are fit on all the controls in the dataset and prediction is then done on the entire dataset. The residuals (scores of the normative model) are then calculated as the difference between the actual value and predicted value for each subject. This paradigm is not meant for situations in which the residuals will then be used in a prediction setting, since any train/test split stratified by proband/control will have information from the training set leaked into the test data.

In order to avoid contaminating the test set, in a prediction setting it is important to fit the normative model on a subset of the controls and then leave those out. This is implemented in PyNM with the --train_sample flag. It can be set to:
1. A number in (0,1]
   - This is the simplest usage and defines the sample size: PyNM will select a random sample of the controls and use those as a training group. The number is the proportion of controls to use; the default value is 1, which uses the full set of controls.
   - The subjects used in the sample are recorded in the column 'train_sample' of the resulting PyNM.data object. Subjects used in the training sample are encoded as 1s, and the rest as 0s.
2. 'manual'
   - It is also possible to specify exactly which subjects to use as a training group by providing a column in the input data labeled 'train_sample' encoded the same way.
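The paradigm described above can be sketched with pandas and NumPy alone. This is an illustration under stated assumptions, not PyNM's implementation: the data are synthetic, the 0.75 fraction is arbitrary, and a straight-line fit stands in for the normative model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 80 controls (group 0) and 20 probands (group 1).
n = 100
df = pd.DataFrame({
    "age": rng.uniform(5, 60, size=n),
    "group": np.r_[np.zeros(80, dtype=int), np.ones(20, dtype=int)],
})
df["score"] = 2.0 + 0.05 * df["age"] + rng.normal(0, 0.5, size=n)

# Draw a random 75% of the controls as the training group and record
# membership as 1/0 in a 'train_sample' column.
fraction = 0.75
controls = df.index[df["group"] == 0].to_numpy()
n_train = int(round(fraction * len(controls)))
train_idx = rng.choice(controls, size=n_train, replace=False)
df["train_sample"] = 0
df.loc[train_idx, "train_sample"] = 1

# Stand-in normative model: a straight line fitted on the training controls only.
train = df[df["train_sample"] == 1]
slope, intercept = np.polyfit(train["age"], train["score"], deg=1)

# Predict for every subject; the residual is actual minus predicted.
df["predicted"] = intercept + slope * df["age"]
df["residual"] = df["score"] - df["predicted"]

# Downstream prediction work should leave the training controls out.
held_out = df[df["train_sample"] == 0]
print(len(train), len(held_out))
```

Only the `held_out` rows would then be passed to any later train/test split, so no subject used to fit the normative model reappears in the test data.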

Models

Centiles and LOESS Models

Both the Centiles and LOESS models are non-parametric models based on local approximations. They accept only a single explanatory variable; as the command-line help notes, only the first confound listed is used.
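To give a feel for what a local, bin-based approximation looks like, here is a minimal NumPy sketch of centiles computed in overlapping windows. It is not PyNM's code: `binned_centiles` is a hypothetical function, and its `spacing` and `width` arguments are illustrative analogues of the bin spacing and bin width options in the command-line help.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(0, 50, size=500)
score = 10 + 0.2 * age + rng.normal(0, 1, size=500)


def binned_centiles(x, y, spacing, width, percentiles=(5, 50, 95)):
    """Percentiles of y inside overlapping windows of x.

    Window centres are `spacing` apart and each window is `width` wide.
    Returns the centres and an array of shape (n_windows, n_percentiles).
    """
    centres = np.arange(x.min(), x.max() + spacing, spacing)
    rows = []
    for centre in centres:
        in_window = np.abs(x - centre) <= width / 2
        if in_window.any():
            rows.append(np.percentile(y[in_window], percentiles))
        else:
            # No subjects fall in this window: leave the centiles undefined.
            rows.append(np.full(len(percentiles), np.nan))
    return centres, np.vstack(rows)


centres, cents = binned_centiles(age, score, spacing=5, width=10)
print(cents.shape)
```

A subject's score can then be compared against the centiles of the window nearest to that subject's value of the explanatory variable.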

Gaussian Process Model

Gaussian Process Regression (GPR), which underpins the Gaussian Process Model, can accept an arbitrary number of explanatory variables, passed using the confounds option. Note: for GPR to be effective, the data must be homoskedastic. For a full discussion, see Xu et al. (2021), listed in the References.

GPR is very intensive in both memory and time. To provide a scalable method, we've implemented both an exact model for smaller datasets and an approximate method, recommended for datasets over ~1000 subjects. The method can be specified using the method option; it defaults to auto, in which the approximate model is chosen for datasets over 1000.

Exact Model

The exact model implements scikit-learn's Gaussian Process Regressor. The kernel is composed of a constant kernel, a white noise kernel, and a Matern kernel. The Matern kernel has parameters nu and length_scale that can be specified. The parameter nu has special values at 1.5 and 2.5; using other values will significantly increase computation time. See documentation for an overview of both.
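A kernel of this kind can be assembled directly in scikit-learn. The sketch below is an assumption about how the three kernels might be combined (constant times Matern, plus white noise); the text above does not state the exact composition PyNM uses, and the data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(60, 1))  # one explanatory variable
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=60)

# Assumed composition: constant * Matern + white noise.
# nu=2.5 is one of the special values that keeps computation fast.
kernel = (
    ConstantKernel()
    * Matern(length_scale=1.0, length_scale_bounds=(1e-5, 1e5), nu=2.5)
    + WhiteKernel()
)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# Predictive mean and standard deviation for every subject.
mean, std = gpr.predict(X, return_std=True)
residuals = y - mean
print(gpr.kernel_)
```

Printing `gpr.kernel_` shows the hyperparameters after fitting, which is a quick way to check whether the length scale ended up at one of its bounds.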

Approximate Model

The approximate model implements a Stochastic Variational Gaussian Process (SVGP) model using GPyTorch, with a kernel closely matching the one in the exact model. SVGP is a deep learning technique that needs to be trained on minibatches for a set number of epochs; this can be tuned with the parameters batch_size and num_epoch. The model speeds up computation by using a subset of the data as inducing points; this can be controlled with the parameter n_inducing, which defines how many points to use. See documentation for an overview.

GAMLSS

Generalized Additive Models of Location Shape and Scale (GAMLSS) are a flexible modeling framework that can model heteroskedasticity, non-linear effects of variables, and hierarchical structure of the data. The implementation here is a Python wrapper for the R package gamlss; formulas for each parameter must be specified using functions available in the package (see documentation). For a full discussion of using GAMLSS for normative modeling, see Dinga et al., listed in the References.
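The command-line help describes the default formula for mu as the sum of the confounds, with non-categorical columns wrapped in smooth functions, for example 'score ~ ps(age) + sex'. The sketch below builds such a string; `default_mu_formula` is a hypothetical helper written to match that description, not a function from PyNM.

```python
def default_mu_formula(score, confounds):
    """Build a formula such as 'score ~ ps(age) + sex' from ['age', 'c(sex)']."""
    terms = []
    for conf in confounds:
        if conf.startswith("c(") and conf.endswith(")"):
            # Categorical confound: strip the c(...) wrapper, use a plain term.
            terms.append(conf[2:-1])
        else:
            # Continuous confound: wrap in the ps() smoother from gamlss.
            terms.append(f"ps({conf})")
    return f"{score} ~ " + " + ".join(terms)


print(default_mu_formula("score", ["age", "c(sex)"]))  # score ~ ps(age) + sex
```

The other parameters default to the intercept-only formula '~ 1', so a custom formula is only needed when the scale or shape is expected to vary with a confound.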


References

Original papers with Gaussian Processes (GP):
- Marquand et al. Biological Psychiatry 2016 doi:10.1016/j.biopsych.2015.12.023
- Marquand et al. Molecular Psychiatry 2019 doi:10.1038/s41380-019-0441-1

For limitations of Gaussian Processes:
- Xu et al. PLoS ONE 2021, The pitfalls of using Gaussian Process Regression for normative modeling

Example of use of the LOESS approach:
- Lefebvre et al. Front. Neurosci. 2018 doi:10.3389/fnins.2018.00662
- Maruani et al. Front. Psychiatry 2019 doi:10.3389/fpsyt.2019.00011

For the Centiles approach see:
- Bethlehem et al. Communications Biology 2020 doi:10.1038/s42003-020-01212-9
- R implementation here.

For the SVGP model see:
- Hensman et al. https://arxiv.org/pdf/1411.2005.pdf

For GAMLSS see:
- Dinga et al. https://doi.org/10.1101/2021.06.14.448106
- R documentation https://cran.r-project.org/web/packages/gamlss/index.html

How to run tests

To test the code locally, first make sure R and the required packages are installed, then follow the instructions above under Installation: Bleeding-edge Installation. Finally, run:

```bash
$ pip install -r requirements.txt
$ pytest test/test_pynm.py
```

How to report errors

If you spot a bug, check the open issues to see if we're already working on it. If not, open a new issue and we will check it out when we can!

How to contribute

Thank you for considering contributing to our project! Before getting involved, please review our contribution guidelines.

Support

This work is supported by IVADO, FRQS, CFI, MITACS, and Compute Canada.

Owner

Name: PPSP
Login: ppsp-team
Kind: organization
Email: contact@ppsp.team
Location: Montréal, Canada

Website: www.ppsp.team
Repositories: 6
Profile: https://github.com/ppsp-team

The Precision Psychiatry & Social Physiology laboratory

JOSS Publication

PyNM: a Lightweight Python implementation of Normative Modeling

Published

December 08, 2022

DOI

10.21105/joss.04321

Volume 7, Issue 80, Page 4321

Authors

Annabelle Harvey

Centre de Recherche de l’Institut Universitaire de Gériatrie de Montréal, Université de Montréal, QC, Canada, Centre de Recherche du CHU Sainte-Justine, Université de Montréal, QC, Canada

Guillaume Dumas

Centre de Recherche du CHU Sainte-Justine, Université de Montréal, QC, Canada, Mila - Quebec AI Institute, Université de Montréal, QC, Canada

Editor

Dan Foreman-Mackey

GitHub Events

Total

Watch event: 10
Fork event: 1

Last Year

Watch event: 10
Fork event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 319
Total Committers: 6
Avg Commits per committer: 53.167
Development Distribution Score (DDS): 0.232

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
harveyaa	a**v@g**m	245
Guillaume Dumas	d**p@i**u	55
Annabelle Harvey	h**a@f**l	11
dependabot[bot]	4****]	3
Dan Foreman-Mackey	d**m@d**o	3
Guillaume DUMAS	g**s@p**r	2

Committer Domains (Top 20 + Academic)

pasteur.fr: 1 dfm.io: 1 introspection.eu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 22
Total pull requests: 15
Average time to close issues: about 1 month
Average time to close pull requests: about 4 hours
Total issue authors: 8
Total pull request authors: 3
Average comments per issue: 1.41
Average comments per pull request: 0.07
Merged pull requests: 15
Bot issues: 0
Bot pull requests: 3

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

harveyaa (9)
saigerutherford (5)
smkia (3)
zapaishchykova (1)
GalKepler (1)
ruiyangge (1)
SmailDK (1)
Siyuannnnch (1)

Pull Request Authors

harveyaa (9)
dependabot[bot] (3)
dfm (3)

Top Labels

Issue Labels

bug (4) enhancement (4) question (1)

Pull Request Labels

dependencies (3)

Packages

Total packages: 1
Total downloads:
- pypi 46 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 13
Total maintainers: 2

pypi.org: pynm

Python implementation of Normative Modelling with GAMLSS, Gaussian Processes, LOESS & Centiles approaches.

Homepage: https://github.com/ppsp-team/PyNM
Documentation: https://pynm.readthedocs.io/
License: BSD
Latest release: 1.0.1
published about 3 years ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 46 Last month

Rankings

Dependent packages count: 10.1%

Forks count: 11.4%

Stargazers count: 12.6%

Average: 17.2%

Dependent repos count: 21.6%

Downloads: 30.1%

Maintainers (2)

gdumas harveyaa

Last synced: 6 months ago

Dependencies

requirements.txt pypi

gpytorch >=1.4.0
matplotlib *
numpy >=1.21.0
pandas >=1.1.5
rpy2 >=3.1.0
scikit_learn >=0.24.1
scipy >=1.5.3
seaborn *
statsmodels >=0.13.2
torch >=1.8.0
tqdm *

setup.py pypi

gpytorch *
matplotlib *
numpy *
pandas *
rpy2 *
scikit_learn *
scipy *
seaborn *
statsmodels *
torch *
tqdm *

.github/workflows/draft_pdf.yml actions

actions/checkout v2 composite
actions/upload-artifact v1 composite
openjournals/openjournals-draft-action master composite

.binder/environment.yml pypi

pynm *
