Prism

Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection - Published in JOSS (2016)

https://github.com/cmadan/prism

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago · JSON representation ·

Repository

Multiple spline regression with regularization, dimensionality reduction, and feature selection

Basic Info
  • Host: GitHub
  • Owner: cMadan
  • License: bsd-2-clause
  • Language: MATLAB
  • Default Branch: master
  • Size: 1.13 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 2
  • Open Issues: 1
  • Releases: 3
Created over 9 years ago · Last pushed over 6 years ago
Metadata Files
Readme License Citation

README.md

DOI

Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection

Prism uses a combination of statistical methods to conduct spline-based multiple regression. Prism conducts this regression using regularization, dimensionality reduction, and feature selection, through a combination of smoothing spline regression, PCA, and RVR/LASSO.

Figure 1: Illustration of Prism regression procedure, first conducting spline regression for each predictor, followed by dimensionality reduction and feature selection (panel A). The logo for Prism is shown in panel B.

Illustration of Prism regression procedure.

Prism has been tested in MATLAB 2015b and requires three first-party toolboxes: (1) Curve Fitting Toolbox; (2) Statistics and Machine Learning Toolbox; (3) Signal Processing Toolbox. Relevance Vector Regression (RVR) requires the SparseBayes V2 toolbox, which can be obtained from http://www.relevancevector.com.

Citing the toolbox

Please cite this paper if you use the toolbox:

Madan, C. R. (2016). Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection. Journal of Open Source Software, 1, 31. doi:10.21105/joss.00031

Usage

See demo.m for an outline of the implementation.

```matlab % Demo of Prism functionality % 20160627 CRM

%% init % import prism addpath(fullfile('..','prism'))

% load demo data load('demo') % demo contains four variables: % % trX and try are our training data % teX and tey are our test data % % tr data has 500 cases, te data has 100 cases % X has 175 predictors

%% Train

% clear opt, in case it has any previous values clear opt % some warnings 'commonly' occur % I don't want disabling these to be the default % but, also are not a problem in most cases % (won't be best model anyway) % so let's disable them with the options opt.disableWarnings = 1;

% force least-squares spline, often reduces over-fitting opt.spline.p = 0;

% options to experiment with: %opt.doSpline = 0; %opt.doPCA = 0; %opt.doRVR = 0;

% fit prism to training data mdl = prismtrain(trX,try,opt); % get performance metrics % is an estimate of best-case of sorts: % poor performance here is a bad sign % but, good performance could represent over-fitting trperf = prismeval(try,mdl)

% results with spline.p=0 % mdl performance (tr_perf) should be approx: % r2 = .77; MdAE = 6.40

%% Test

% apply model from training data to test data % mdl.pred here is yhat fit = prismtest(teX,mdl); % measure model performance, i.e., the whole point of this teperf = prismeval(tey,fit)

% fit performance (te_perf) should be approx: % r2 = .78; MdAE = 7.75

% visualize predictions in test data % Fig A. Data vs. Prediction figure; scatter(tey,fit.pred); xlabel('Data'); ylabel('Prediction'); % Fig B. Data vs. Prediction Error figure; scatter(tey,fit.pred-te_y); xlabel('Data'); ylabel('Prediction Error');

```

Figure 2: Output scatter plots for code in demo.m.

Example output.

function mdl = prism_train(tr_X,tr_y,opt) % Build multiple regression model from training data. % % Inputs: % tr_X = X data to train on % each column is considered as an indepdent predictor % should be size NxM % tr_y = Y data to train on % should be size Nx1 % % opt = set options for multiple regression model % % .doSpline = enables/disables the spline regression % (default: 1) % .doPCA = enable/disable the PCA (default: 1) % .doRVR = use RVR (default: 1), or LASSO (0) % % .spline.p = smoothing parameter to use in spline % regression % if not specified, will be selected by csaps % if is a vector, must be length M, where % each index will be used for the % respective column in X % if is a single value, will use it for all % 0 results in least-squares cubic spline % 1 results in variational/'natural' % cubic spline interpolant % See 'csaps' for further details. % default: [ 10:10:90 100:100:1000 Inf ] % .pc.thresh = Threshold for number of principle % components to keep, based on miniumum % amount of explained variance, as a % percent (default: 95) % .lasso.k = Set 'l' for LASSO k-fold CV (default: 10) % % .disableWarnings = disable common warnings (default: 0) % % % Outputs: % mdl = fitted Prism model given training data % struct with two subfields: % .stats = model parameters for spline, pca, rvr/lasso % .pred = predicted y (i.e., y_hat) % .opt = store options underlying model % % Matlab toolbox dependencies: stats, curvefit, signal % For RVR, also requires SparseBayes V2 (http://www.relevancevector.com). % % 20160621 CRM

function fit = prism_test(te_X,params) % Apply previous model to test data. % % Inputs: % te_X = X data to test with % should be same organization as tr_X % params = output from prism_train % prefers mdl.stats % also accepts the 'mdl' model output itself, % for convienence % Outputs: % fit = output after applying Prism to training data % struct with one subfield: % .pred = predicted y (i.e., y_hat) % % 20160621 CRM

function perf = prism_eval(y,pred) % Calculate basic performance for Prism model. % % Inputs: % y = 'y' from model (i.e., dependent variable) % pred = pred field from prism_train/test % also accepts the 'mdl' model output itself, % for convienence % % Outputs: % perf = struct with three fields: % .r2 = r^2, explained variance % .mae = mean absolute error % .mdae = median absolute error % .rmsd = root mean squared deviation % % 20160619 CRM

License

Matlab code is made available under a BSD license (see LICENSE).

``` Copyright (c) 2016, Christopher R. Madan. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ```

Example implementation

To provide an example of Prism's implementation, see the 'demo' folder. demo.mat was generated using brain morphology data, where each column corresponds to a brain morphology measure. The test and training data are derived from independent MR sites, and the dependent variable is age. The data is from an unpublished study and the pre-processing pipeline is not finalized (e.g., correcting for site effects), so further information of the dataset is not provided at this time.

Benchmark

As a benchmark, demo.m was run using several configurations: (1) RVR (default) or LASSO for the multiple regression; (2) with or without PCA; and (3) with or without spline regression--yielding eight comparison analyses. A 'constant' model, where all ages are simply predicted to be the mean age (within the testing dataset) was also included as an upper-limit of performance (i.e., a good model should do no worse than this).

Here I ran each analyses 100 times and report several measures of performance: R2 and median absolute error (MdAE) in the testing data, average time to compute (all run on the same machine), and stability of error performance (SD of MdAE across the 100 calculations). All analyses were run with least-squares splines (opt.spline.p=0).

| Model variant | R2 | MdAE | Time [s] | SD(MdAE) | | -------------------------- | ----: | -----: | -------: | ----------: | | Spline+PCA+RVR | .777 | 7.74 | 0.55 | 0 | | (Prism default) | | | | | | | | | | | | | | | | | | Spline+PCA+LASSO | .779 | 8.06 | 3.84 | 0.17 | | Spline+RVR | .758 | 8.42 | 1.26 | 0 | | Spline+LASSO | .776 | 8.35 | 232.84 | 0.05 | | PCA+RVR | .601 | 12.31 | 0.12 | 0 | | PCA+LASSO | .713 | 9.66 | 2.54 | 0.11 | | RVR | .703 | 10.57 | 0.12 | 0 | | LASSO | .777 | 8.53 | 231.43 | 0.04 | | | | | | | | | | | | | | Constant [y_hat = Mean(y)] | .000 | 22.28 | | |

The above benchmarks suggest that, given the demo.mat training and test data, spline-based regression leads to substantially better predictions, particularly in the case of the median absolute error (MdAE). When splines are not used, LASSO performs better than RVR, but comes with a much larger computational cost. This computational cost can be largely mitigated through the use of PCA for dimensionality reduction, but even otherwise, PCA is associated with a marked improvement in prediction performance. We also see that given this dataset, regression models using LASSO (as opposed to RVR), are nondeterministic (i.e., there is variability in the performance of the model across each time it is run ), as SD(MdAE)>0.

Owner

  • Name: Christopher Madan
  • Login: cMadan
  • Kind: user
  • Company: University of Nottingham

Assistant Professor in Psychology

JOSS Publication

Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection
Published
July 01, 2016
Volume 1, Issue 3, Page 31
Authors
Christopher R. Madan ORCID
Boston College
Editor
Arfon Smith ORCID
Tags
regression smoothing spline matlab pca relevance vector machine multiple regression lasso

Citation (CITATION)

# Citation details for prism toolbox

## APA

Madan, C. R. (2016). Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection. Journal of Open Source Software, 1, 31. doi:10.21105/joss.00031


## BiBTeX

@Article{Mada2016,
  author    = {Christopher R Madan},
  title     = {Prism: Multiple spline regression with regularization, dimensionality reduction, and feature selection},
  journal   = {Journal of Open Source Software},
  year      = {2016},
  volume    = {1},
  pages     = {31},
  doi       = {10.21105/joss.00031},
}

GitHub Events

Total
Last Year

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 28
  • Total Committers: 2
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.036
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Christopher Madan c****n@g****m 27
Arfon Smith a****n 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • cMadan (1)
Pull Request Authors
  • arfon (2)
Top Labels
Issue Labels
Pull Request Labels