PyNM
PyNM: a Lightweight Python implementation of Normative Modeling - Published in JOSS (2022)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 26 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to arxiv.org, biorxiv.org, zenodo.org
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Lightweight Python implementation of Normative Modelling
Statistics
- Stars: 45
- Watchers: 1
- Forks: 13
- Open Issues: 0
- Releases: 2
Metadata Files
README.md

PyNM is a lightweight Python implementation of Normative Modeling, designed to be approachable and easy to adopt. The package provides:
- A Python API and a command-line interface for wide accessibility
- Automatic dataset splitting and cross-validation
- Five models from various back-ends in a unified interface that cover a broad range of common use cases
  - Centiles
  - LOESS
  - Gaussian Process (GP)
  - Stochastic Variational Gaussian Process (SVGP)
  - Generalized Additive Models of Location Shape and Scale (GAMLSS)
- Solutions for very large datasets and heteroskedastic data
- Integrated plotting and evaluation functions to quickly check the validity of the model fit and results
- Comprehensive and interactive tutorials
The tutorials can be accessed without any local installation via Binder.
For a more advanced implementation, see the Python library PCNtoolkit.
Installation
Note: a functional installation requires Python 3.9.
Minimal Installation (without R)
Use this option if you aren't using the GAMLSS model and don't need to install R.
```bash
$ pip install pynm
```
Installation with R
Use this option if you are using the GAMLSS model. You must first have R (v4.2.2) installed, along with the following packages:
- gamlss
- gamlss.dist
- gamlss.data

Instructions for installing R can be found at r-project. Once R and the gamlss packages are installed, install pynm:
```bash
$ pip install pynm
```
Bleeding-edge Installation
Use this option if you want the most recent changes to PyNM (not necessarily stable). For either option above, replace pip install pynm with:
```bash
$ git clone https://github.com/ppsp-team/PyNM.git
$ cd PyNM
$ pip install .
```
Command Line Usage
```
usage: pynm [-h] --pheno_p PHENO_P --out_p OUT_P --confounds CONFOUNDS --score
            SCORE --group GROUP [--train_sample TRAIN_SAMPLE] [--LOESS]
            [--centiles] [--bin_spacing BIN_SPACING] [--bin_width BIN_WIDTH]
            [--GP] [--gp_method GP_METHOD] [--gp_num_epochs GP_NUM_EPOCHS]
            [--gp_n_inducing GP_N_INDUCING] [--gp_batch_size GP_BATCH_SIZE]
            [--gp_length_scale GP_LENGTH_SCALE]
            [--gp_length_scale_bounds [GP_LENGTH_SCALE_BOUNDS [GP_LENGTH_SCALE_BOUNDS ...]]]
            [--gp_nu NU] [--GAMLSS] [--gamlss_mu GAMLSS_MU]
            [--gamlss_sigma GAMLSS_SIGMA] [--gamlss_nu GAMLSS_NU]
            [--gamlss_tau GAMLSS_TAU] [--gamlss_family GAMLSS_FAMILY]

optional arguments:
  -h, --help            show this help message and exit
  --pheno_p PHENO_P     Path to phenotype data. Data must be in a .csv file.
  --out_p OUT_P         Path to output directory.
  --confounds CONFOUNDS
                        List of confounds to use in the GP model. The list
                        must be formatted as a string with commas between
                        confounds, each confound must be a column name from
                        the phenotype .csv file. For GP model all confounds
                        will be used, for LOESS and Centiles models only the
                        first is used. For GAMLSS all confounds are used
                        unless formulas are specified. Categorical values must
                        be denoted by c(var) ('c' must be lower case), e.g.
                        'c(SEX)' for column name 'SEX'.
  --score SCORE         Response variable for all models. Must be a column
                        title from phenotype .csv file.
  --group GROUP         Column name from the phenotype .csv file that
                        distinguishes probands from controls. The column must
                        be encoded with str labels using 'PROB' for probands
                        and 'CTR' for controls or with int labels using 1 for
                        probands and 0 for controls.
  --train_sample TRAIN_SAMPLE
                        Which method to use for a training sample, can be a
                        float in (0,1] for a percentage of controls or
                        'manual' to be manually set using a column of the
                        DataFrame labelled 'train_sample'.
  --LOESS               Flag to run LOESS model.
  --centiles            Flag to run Centiles model.
  --bin_spacing BIN_SPACING
                        Distance between bins for LOESS & centiles models.
  --bin_width BIN_WIDTH
                        Width of bins for LOESS & centiles models.
  --GP                  Flag to run Gaussian Process model.
  --gp_method GP_METHOD
                        Method to use for the GP model. Can be set to
                        'auto','approx' or 'exact'. In 'auto' mode, the exact
                        model will be used for datasets smaller than 2000 data
                        points. SVGP is used for the approximate model. See
                        documentation for details. Default value is 'auto'.
  --gp_num_epochs GP_NUM_EPOCHS
                        Number of training epochs for SVGP model. See
                        documentation for details. Default value is 20.
  --gp_n_inducing GP_N_INDUCING
                        Number of inducing points for SVGP model. See
                        documentation for details. Default value is 500.
  --gp_batch_size GP_BATCH_SIZE
                        Batch size for training and predicting from SVGP
                        model. See documentation for details. Default value is
                        256.
  --gp_length_scale GP_LENGTH_SCALE
                        Length scale of Matern kernel for exact model. See
                        documentation for details. Default value is 1.
  --gp_length_scale_bounds [GP_LENGTH_SCALE_BOUNDS [GP_LENGTH_SCALE_BOUNDS ...]]
                        The lower and upper bound on length scale. If set to
                        'fixed', length scale cannot be changed during
                        hyperparameter tuning. See documentation for details.
                        Default value is (1e-5,1e5).
  --gp_nu NU            Nu of Matern kernel for exact and SVGP model. See
                        documentation for details. Default value is 2.5.
  --GAMLSS              Flag to run GAMLSS.
  --gamlss_mu GAMLSS_MU
                        Formula for mu (location) parameter of GAMLSS. Default
                        formula for score is sum of confounds with non-
                        categorical columns as smooth functions, e.g. 'score ~
                        ps(age) + sex'.
  --gamlss_sigma GAMLSS_SIGMA
                        Formula for sigma (scale) parameter of GAMLSS. Default
                        formula is '~ 1'.
  --gamlss_nu GAMLSS_NU
                        Formula for nu (shape) parameter of GAMLSS. Default
                        formula is '~ 1'.
  --gamlss_tau GAMLSS_TAU
                        Formula for tau (shape) parameter of GAMLSS. Default
                        formula is '~ 1'.
  --gamlss_family GAMLSS_FAMILY
                        Family of distributions to use for fitting, default is
                        'SHASHo2'. See R documentation for GAMLSS package for
                        other available families of distributions.
```
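As a rough illustration of how these options combine, the Python sketch below assembles a pynm command line and prints it without running it. The file path, output directory, and column names (data.csv, pynm_output, age, sex, site) are hypothetical. The flag spellings use underscores, as in the --train_sample flag described under Training sample; confirm the exact names with pynm -h for your installed version.

```python
import shlex

# Hypothetical confound columns; categorical ones are wrapped in c(...).
confounds = ["age", "c(sex)", "c(site)"]

cmd = [
    "pynm",
    "--pheno_p", "data.csv",             # hypothetical input file
    "--out_p", "pynm_output",            # hypothetical output directory
    "--confounds", ",".join(confounds),  # one comma-separated string
    "--score", "score",
    "--group", "group",
    "--train_sample", "0.8",             # fit on 80% of the controls
    "--LOESS",
    "--centiles",
]

# Print a shell-safe version of the command.
# To actually execute it, pass `cmd` to subprocess.run.
print(shlex.join(cmd))
```

Building the argument list first keeps the confounds string in the comma-separated form the help text asks for, and avoids manual shell quoting of the parentheses in c(sex).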
API Example
```python
import pandas as pd
from pynm.pynm import PyNM

# Load data
df = pd.read_csv('data.csv')
# Initialize pynm w/ data and confounds
m = PyNM(df, 'score', 'group', confounds=['age', 'c(sex)', 'c(site)'])
# Run models
m.loess_normative_model()
m.centiles_normative_model()
m.gp_normative_model()
m.gamlss_normative_model()
# Collect output
data = m.data
```
Documentation
All functions have standard Python docstrings that you can view with help(). You can also see the tutorials for documented examples.
Training sample
By default, the models are fit on all the controls in the dataset and prediction is then done on the entire dataset. The residuals (scores of the normative model) are then calculated as the difference between the actual value and predicted value for each subject. This paradigm is not meant for situations in which the residuals will then be used in a prediction setting, since any train/test split stratified by proband/control will have information from the training set leaked into the test data.
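The default paradigm can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: an ordinary least-squares line stands in for PyNM's models, and the toy age, score, and group values are made up.

```python
# Toy data: one confound (age), one score, and a group label per subject.
age = [10, 12, 14, 16, 18, 11, 15]
score = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 0.9]
group = ["CTR", "CTR", "CTR", "CTR", "CTR", "PROB", "PROB"]

# 1. Fit on controls only (a least-squares line, as a stand-in model).
ctr = [i for i, g in enumerate(group) if g == "CTR"]
mean_x = sum(age[i] for i in ctr) / len(ctr)
mean_y = sum(score[i] for i in ctr) / len(ctr)
slope = (
    sum((age[i] - mean_x) * (score[i] - mean_y) for i in ctr)
    / sum((age[i] - mean_x) ** 2 for i in ctr)
)
intercept = mean_y - slope * mean_x

# 2. Predict for every subject, controls and probands alike.
pred = [intercept + slope * a for a in age]

# 3. Residual = actual value minus predicted value.
residuals = [s - p for s, p in zip(score, pred)]
print([round(r, 2) for r in residuals])
```

Because the controls in this toy example lie exactly on a line, their residuals are essentially zero, while the two probands deviate from the control-derived norm.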
To avoid contaminating the test set in a prediction setting, it is important to fit the normative model on a subset of the controls and then leave those subjects out. This is implemented in PyNM with the --train_sample flag, which can be set to:
1. A number in (0,1]
   - This is the simplest usage. The number is the proportion of controls to use: PyNM selects a random sample of the controls of that size and uses it as the training group. The default value is 1, which uses the full set of controls.
   - The subjects used in the sample are recorded in the column 'train_sample' of the resulting PyNM.data object. Subjects used in the training sample are encoded as 1s, and the rest as 0s.
2. 'manual'
   - It is also possible to specify exactly which subjects to use as a training group by providing a column in the input data labeled 'train_sample' encoded the same way.
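The proportion-based option can be pictured as follows. This is a minimal sketch of the idea, not PyNM's internal code: the select_train_sample helper, the seed, and the toy labels are all hypothetical.

```python
import random

def select_train_sample(group, proportion=1.0, seed=0):
    """Return a 0/1 list marking a random proportion of the controls."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    # Controls are labelled 'CTR' (str encoding) or 0 (int encoding).
    controls = [i for i, g in enumerate(group) if g in ("CTR", 0)]
    n_train = int(round(proportion * len(controls)))
    chosen = set(random.Random(seed).sample(controls, n_train))
    # 1 = used to fit the normative model, 0 = everyone else.
    return [1 if i in chosen else 0 for i in range(len(group))]

group = ["CTR"] * 8 + ["PROB"] * 4
train_sample = select_train_sample(group, proportion=0.75)
print(train_sample)
```

A list built this way has the same encoding as a hand-made 'train_sample' column for the 'manual' option: 1 for training subjects and 0 for the rest, with probands never selected.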
Models
Centiles and LOESS Models
Both the Centiles and LOESS models are non-parametric models based on local approximations. They accept only a single explanatory variable (confound), passed using the conf option.
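To give a feel for what a local approximation over bins looks like, here is a deliberately simplified sketch. It is not PyNM's implementation: the binned_estimates helper is hypothetical, and it uses a plain local mean and median in place of a full LOESS fit or a full set of centiles. The bin_spacing and bin_width arguments mirror the options of the same name in the command-line help.

```python
from statistics import mean, median

def binned_estimates(x, y, bin_spacing, bin_width):
    """Summarise y in (possibly overlapping) bins along a single confound x."""
    centers, means, medians = [], [], []
    n_bins = int((max(x) - min(x)) / bin_spacing) + 1
    for k in range(n_bins):
        center = min(x) + k * bin_spacing
        # Keep the points that fall inside the window around this center.
        in_bin = [yi for xi, yi in zip(x, y) if abs(xi - center) <= bin_width / 2]
        if in_bin:
            centers.append(center)
            means.append(mean(in_bin))      # local average within the bin
            medians.append(median(in_bin))  # 50th centile within the bin
    return centers, means, medians

age = [8, 9, 10, 11, 12, 13, 14, 15, 16]
score = [1.0, 1.1, 1.3, 1.2, 1.5, 1.6, 1.8, 1.7, 2.0]
centers, means, medians = binned_estimates(age, score, bin_spacing=2, bin_width=4)
print(centers)
```

With a bin width larger than the bin spacing, neighbouring bins overlap, which smooths the estimated curve at the cost of local detail.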
Gaussian Process Model
Gaussian Process Regression (GPR), which underpins the Gaussian Process Model, can accept an arbitrary number of explanatory variables (confounds), passed using the confounds option. Note: for GPR to be effective, the data must be homoskedastic. For a full discussion see this paper.
GPR is very demanding in both memory and time. To provide a scalable method, we've implemented both an exact model for smaller datasets and an approximate method, recommended for datasets over ~1000 subjects. The method can be specified using the method option. It defaults to auto, in which the approximate model is chosen for datasets over 1000 subjects.
Exact Model
The exact model implements scikit-learn's Gaussian Process Regressor. The kernel is composed of a constant kernel, a white noise kernel, and a Matern kernel. The Matern kernel has parameters nu and length_scale that can be specified. The parameter nu has special values at 1.5 and 2.5; using other values will significantly increase computation time. See documentation for an overview of both.
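A kernel of this kind can be sketched directly with scikit-learn. The example below is an illustration rather than PyNM's code: it assumes the three kernels are summed (PyNM may compose them differently), and the toy data are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Toy data: one confound and a noisy linear response (made up for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=0.2, size=30)

# Assumption: the three kernels are summed; PyNM may combine them differently.
kernel = ConstantKernel() + WhiteKernel() + Matern(length_scale=1.0, nu=2.5)

gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(X, y)

# Predictive mean and standard deviation for each subject.
pred, sigma = gpr.predict(X, return_std=True)
residuals = y - pred
print(residuals[:3])
```

Here nu=2.5 is one of the special values mentioned above, and length_scale sets the starting point for hyperparameter optimisation during fitting.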
Approximate Model
The approximate model implements a Stochastic Variational Gaussian Process (SVGP) model using GPyTorch, with a kernel closely matching the one in the exact model. SVGP is a deep learning technique that needs to be trained on minibatches for a set number of epochs; this can be tuned with the parameters batch_size and num_epoch. The model speeds up computation by using a subset of the data as inducing points; the parameter n_inducing defines how many points to use. See documentation for an overview.
GAMLSS
Generalized Additive Models of Location Shape and Scale (GAMLSS) are a flexible modeling framework that can model heteroskedasticity, non-linear effects of variables, and hierarchical structure of the data. The implementation here is a Python wrapper for the R package gamlss. Formulas for each parameter must be specified using functions available in that package (see documentation). For a full discussion of using GAMLSS for normative modeling see this paper.
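The command-line help describes the default location formula as a sum of confounds, with non-categorical columns wrapped in smooth functions, e.g. 'score ~ ps(age) + sex'. The sketch below shows one way such a formula string could be assembled. The default_mu_formula helper is hypothetical and is not taken from PyNM's source.

```python
def default_mu_formula(score, confounds):
    """Build a location formula: smooth terms for numeric confounds,
    plain terms for categorical ones marked as c(name)."""
    terms = []
    for conf in confounds:
        if conf.startswith("c(") and conf.endswith(")"):
            terms.append(conf[2:-1])     # categorical: use the bare column name
        else:
            terms.append(f"ps({conf})")  # numeric: smooth term via gamlss ps()
    return f"{score} ~ " + " + ".join(terms)

print(default_mu_formula("score", ["age", "c(sex)"]))  # → score ~ ps(age) + sex
```

The other three parameters default to the intercept-only formula '~ 1', so a custom formula is needed only when, for example, the variance is expected to change with a confound.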

References
Original papers with Gaussian Processes (GP):
- Marquand et al. Biological Psychiatry 2016 doi:10.1016/j.biopsych.2015.12.023
- Marquand et al. Molecular Psychiatry 2019 doi:10.1038/s41380-019-0441-1

For limitations of Gaussian Processes:
- Xu et al. PLoS ONE 2021, The pitfalls of using Gaussian Process Regression for normative modeling

Examples of use of the LOESS approach:
- Lefebvre et al. Front. Neurosci. 2018 doi:10.3389/fnins.2018.00662
- Maruani et al. Front. Psychiatry 2019 doi:10.3389/fpsyt.2019.00011

For the Centiles approach see:
- Bethlehem et al. Communications Biology 2020 doi:10.1038/s42003-020-01212-9
- R implementation here.

For the SVGP model see:
- Hensman et al. https://arxiv.org/pdf/1411.2005.pdf

For GAMLSS see:
- Dinga et al. https://doi.org/10.1101/2021.06.14.448106
- R documentation https://cran.r-project.org/web/packages/gamlss/index.html
How to run tests
To test the code locally, first make sure R and the required packages are installed, then follow the instructions above under Installation: Bleeding-edge Installation. Finally, run:
```bash
$ pip install -r requirements.txt
$ pytest test/test_pynm.py
```
How to report errors
If you spot any bugs :beetle:, check out the open issues to see if we're already working on them. If not, open a new issue and we will check it out when we can!
How to contribute
Thank you for considering contributing to our project! Before getting involved, please review our contribution guidelines.
Support
This work is supported by IVADO, FRQS, CFI, MITACS, and Compute Canada.
Owner
- Name: PPSP
- Login: ppsp-team
- Kind: organization
- Email: contact@ppsp.team
- Location: Montréal, Canada
- Website: www.ppsp.team
- Repositories: 6
- Profile: https://github.com/ppsp-team
The Precision Psychiatry & Social Physiology laboratory
JOSS Publication
PyNM: a Lightweight Python implementation of Normative Modeling
Authors
Tags
Normative Modeling, Heterogeneity, Heteroskedasticity, Big Data, Centiles, LOESS, Gaussian Process, Stochastic Variational Gaussian Process, GAMLSS, Computational Psychiatry, Neuroscience
GitHub Events
Total
- Watch event: 10
- Fork event: 1
Last Year
- Watch event: 10
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| harveyaa | a****v@g****m | 245 |
| Guillaume Dumas | d****p@i****u | 55 |
| Annabelle Harvey | h****a@f****l | 11 |
| dependabot[bot] | 4****] | 3 |
| Dan Foreman-Mackey | d****m@d****o | 3 |
| Guillaume DUMAS | g****s@p****r | 2 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 22
- Total pull requests: 15
- Average time to close issues: about 1 month
- Average time to close pull requests: about 4 hours
- Total issue authors: 8
- Total pull request authors: 3
- Average comments per issue: 1.41
- Average comments per pull request: 0.07
- Merged pull requests: 15
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- harveyaa (9)
- saigerutherford (5)
- smkia (3)
- zapaishchykova (1)
- GalKepler (1)
- ruiyangge (1)
- SmailDK (1)
- Siyuannnnch (1)
Pull Request Authors
- harveyaa (9)
- dependabot[bot] (3)
- dfm (3)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 46 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 13
- Total maintainers: 2
pypi.org: pynm
Python implementation of Normative Modelling with GAMLSS, Gaussian Processes, LOESS & Centiles approaches.
- Homepage: https://github.com/ppsp-team/PyNM
- Documentation: https://pynm.readthedocs.io/
- License: BSD
- Latest release: 1.0.1 (published about 3 years ago)
Dependencies
- gpytorch >=1.4.0
- matplotlib *
- numpy >=1.21.0
- pandas >=1.1.5
- rpy2 >=3.1.0
- scikit_learn >=0.24.1
- scipy >=1.5.3
- seaborn *
- statsmodels >=0.13.2
- torch >=1.8.0
- tqdm *
- gpytorch *
- matplotlib *
- numpy *
- pandas *
- rpy2 *
- scikit_learn *
- scipy *
- seaborn *
- statsmodels *
- torch *
- tqdm *
- actions/checkout v2 composite
- actions/upload-artifact v1 composite
- openjournals/openjournals-draft-action master composite
- pynm *
