c-lasso - a Python package for constrained sparse and robust regression and classification

Published in JOSS (2021)

https://github.com/leo-simpson/c-lasso

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, biorxiv.org, pubmed.ncbi, ncbi.nlm.nih.gov, springer.com, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Economics Social Sciences - 60% confidence
Mathematics Computer Science - 54% confidence
Biology Life Sciences - 40% confidence
Last synced: 4 months ago

Repository

c-lasso: a Python package for constrained sparse regression and classification

Basic Info
  • Host: GitHub
  • Owner: Leo-Simpson
  • License: MIT
  • Language: HTML
  • Default Branch: master
  • Size: 62.6 MB
Statistics
  • Stars: 32
  • Watchers: 4
  • Forks: 6
  • Open Issues: 0
  • Releases: 10
Created about 6 years ago · Last pushed over 4 years ago
Metadata Files
Readme · License · Code of conduct

README.md


c-lasso

c-lasso: a Python package for constrained sparse regression and classification

c-lasso is a Python package that enables sparse and robust linear regression and classification with linear equality constraints on the model parameters. For detailed info, one can check the documentation.

The forward model is assumed to be:

    y = Xβ + σε,    subject to    Cβ = 0

Here, y and X are given outcome and predictor data. The vector y can be continuous (for regression) or binary (for classification). C is a general constraint matrix. The vector β comprises the unknown coefficients, and σ is an unknown scale.

The package handles several different estimators for inferring β (and σ), including the constrained Lasso, the constrained scaled Lasso, sparse Huber M-estimation with linear equality constraints, and regularized Support Vector Machines. Several different algorithmic strategies, including path and proximal splitting algorithms, are implemented to solve the underlying convex optimization problems.

We also include two model selection strategies for determining the sparsity of the model parameters: k-fold cross-validation and stability selection.
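Both strategies are toggled via attributes of a problem instance before solving. A minimal sketch (attribute names taken from the examples later in this README; StabSel is assumed to be enabled by default, as the basic example below suggests):

```python
from classo import classo_problem

problem = classo_problem(X, y, C)
problem.model_selection.CV = True        # k-fold cross-validation
problem.model_selection.StabSel = True   # stability selection (default strategy)
problem.solve()
```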

This package is intended to fill the gap between popular Python tools such as scikit-learn, which cannot solve sparse constrained problems, and general-purpose optimization solvers, which do not scale well or are inaccurate for the considered problems (see benchmarks). In its current stage, however, c-lasso is not yet compatible with the scikit-learn API; it is a stand-alone tool.

Below we show several use cases of the package, including an application of sparse log-contrast regression to compositional microbiome data.

The code builds on results from several papers, which can be found in the References. We also refer to the accompanying JOSS paper, also available on arXiv.


Installation

c-lasso is available on pip. You can install the package from the shell using

```shell
pip install c-lasso
```

To use the c-lasso package in Python, type

```python
from classo import classo_problem

# Auxiliary functions such as random_data or csv_to_np can be imported as well.
```

The c-lasso package depends on the following Python packages:

  • numpy
  • matplotlib
  • scipy
  • pandas
  • pytest (for tests)

Regression and classification problems

The c-lasso package can solve six different types of estimation problems: four regression-type and two classification-type formulations.

[R1] Standard constrained Lasso regression:

    min_β  || Xβ - y ||^2 + λ ||β||_1    subject to    Cβ = 0

This is the standard Lasso problem with linear equality constraints on the β vector. The objective function combines Least-Squares for model fitting with an l1 penalty for sparsity.

[R2] Constrained sparse Huber regression:

    min_β  h_ρ(Xβ - y) + λ ||β||_1    subject to    Cβ = 0

This regression problem uses the Huber loss h_ρ as objective function for robust model fitting, combined with an l1 penalty and linear equality constraints on the β vector. The parameter ρ defaults to 1.345.

[R3] Constrained scaled Lasso regression:

    min_{β, σ>0}  || Xβ - y ||^2 / σ + (n/2) σ + λ ||β||_1    subject to    Cβ = 0

This formulation jointly estimates the coefficients β and the scale σ (concomitant scale estimation; see References [4,5] for further info). This is the default problem formulation in c-lasso.

[R4] Constrained sparse Huber regression with concomitant scale estimation:

    min_{β, σ>0}  σ Σ_i h_ρ((x_i^T β - y_i) / σ) + (n/2) σ + λ ||β||_1    subject to    Cβ = 0

This formulation combines the robust Huber loss of [R2] with the concomitant scale estimation of [R3] via the perspective of the Huber function (see References [4,5] for further info).

[C1] Constrained sparse classification with Square Hinge loss:

    min_β  Σ_i l(y_i x_i^T β) + λ ||β||_1    subject to    Cβ = 0

where the x_i are the rows of X and l is the Square Hinge loss:

    l(r) = max(1 - r, 0)^2

This formulation is similar to [R1] but adapted for classification tasks using the Square Hinge loss with (constrained) sparse β vector estimation.

[C2] Constrained sparse classification with Huberized Square Hinge loss:

    min_β  Σ_i l_ρ(y_i x_i^T β) + λ ||β||_1    subject to    Cβ = 0

where the x_i are the rows of X and l_ρ is the Huberized Square Hinge loss: it coincides with the Square Hinge loss above a threshold set by ρ and continues linearly, with matching slope, below it.

This formulation is similar to [C1] but uses the Huberized Square Hinge loss for robust classification with (constrained) sparse β vector estimation.
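In c-lasso, the formulation is selected through boolean flags on the problem instance rather than by name. A minimal sketch (the huber and concomitant flags appear in the examples below; the classification flag is an assumption following the same pattern):

```python
from classo import classo_problem

problem = classo_problem(X, y, C)

# [R2]: Huber loss without concomitant scale estimation
problem.formulation.huber = True
problem.formulation.concomitant = False

# [R4]: Huber loss with concomitant scale estimation
problem.formulation.concomitant = True

# [C1]: Square Hinge classification (assumed flag name)
# problem.formulation.classification = True
```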

Getting started

Basic example

We begin with a basic example that shows how to run c-lasso on synthetic data. This example and the next one can be found in the notebook 'Synthetic data Notebook.ipynb'.

The c-lasso package includes the routine random_data that allows you to generate problem instances using normally distributed data.

```python
from classo import random_data

m, d, d_nonzero, k, sigma = 100, 200, 5, 1, 0.5
(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=1)
```

This code snippet generates a problem instance with a sparse β in dimension d = 200 (sparsity d_nonzero = 5). The design matrix X comprises n = 100 samples drawn from an i.i.d. standard normal distribution. The constraint matrix C has dimension k x d. The noise level is σ = 0.5. The input zerosum=True implies that C is the all-ones vector and Cβ = 0. The n-dimensional outcome vector y and the coefficient vector β are then generated to satisfy the given constraints.
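As a quick sanity check, one can verify that the returned ground-truth coefficients (assuming, as the name suggests, that sol holds the true β used to generate y) satisfy the zero-sum constraint:

```python
import numpy as np

# The true coefficient vector must lie in the null space of C.
assert np.allclose(C @ sol, 0)
print("non-zero entries of the true beta:", np.nonzero(sol)[0])
```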

Next we can define a default c-lasso problem instance with the generated data:

```python
problem = classo_problem(X, y, C)
```

You can look at the generated problem instance by typing:

```python
print(problem)
```

This gives you a summary of the form:

```
FORMULATION: R3

MODEL SELECTION COMPUTED:
     Stability selection

STABILITY SELECTION PARAMETERS:
     numerical_method : not specified
     method : first
     B = 50
     q = 10
     percent_nS = 0.5
     threshold = 0.7
     lamin = 0.01
     Nlam = 50
```

As we have not specified any problem, algorithm, or model selection settings, this problem instance represents the default settings for a c-lasso instance:

- The problem is of regression type and uses formulation [R3], i.e. with concomitant scale estimation.
- The default optimization scheme is the path algorithm (see Optimization schemes for further info).
- For model selection, stability selection at a theoretically derived λ value is used (see Reference [4] for details). Stability selection comprises a relatively large number of parameters. For a description of the settings, we refer to the more advanced examples below and the API.
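Any of the parameters in this summary can be overridden on the problem instance before solving, for example (a minimal sketch using the attribute names printed above):

```python
# Resample more subsets and demand a stricter selection frequency.
problem.model_selection.StabSelparameters.B = 100
problem.model_selection.StabSelparameters.threshold = 0.8
```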

You can solve the corresponding c-lasso problem instance using

```python
problem.solve()
```

After completion, the results of the optimization and model selection routines can be visualized using

```python
print(problem.solution)
```

The command shows the running time(s) for the c-lasso problem instance and the variables selected by stability selection:

```
STABILITY SELECTION :
   Selected variables :  7    63    148    164    168
   Running time :  1.546s
```

Here, we only used stability selection as the default model selection strategy. The command also allows you to inspect the computed stability profile for all variables at the theoretical λ:

[Figure: stability selection profile at the theoretical λ]

The refitted β values on the selected support are also displayed in the next plot:

[Figure: refitted β values on the selected support]

Advanced example

In the next example, we show how one can specify different aspects of the problem formulation and model selection strategy.

```python
from classo import classo_problem, random_data

m, d, d_nonzero, k, sigma = 100, 200, 5, 0, 0.5
(X, C, y), sol = random_data(m, d, d_nonzero, k, sigma, zerosum=True, seed=4)

problem = classo_problem(X, y, C)

# Problem formulation: Huber loss, no concomitant scale estimation ([R2])
problem.formulation.huber = True
problem.formulation.concomitant = False

# Model selection: enable all four strategies
problem.model_selection.CV = True
problem.model_selection.LAMfixed = True
problem.model_selection.PATH = True
problem.model_selection.StabSelparameters.method = 'max'
problem.model_selection.CVparameters.seed = 1
problem.model_selection.LAMfixedparameters.rescaled_lam = True
problem.model_selection.LAMfixedparameters.lam = 0.1

problem.solve()
print(problem)
print(problem.solution)
```

Results:

```
FORMULATION: R2

 MODEL SELECTION COMPUTED:  
      Lambda fixed
      Path
      Cross Validation
      Stability selection

 LAMBDA FIXED PARAMETERS: 
      numerical_method = Path-Alg
      rescaled lam : True
      threshold = 0.09
      lam = 0.1
      theoretical_lam = 0.224

 PATH PARAMETERS: 
      numerical_method : Path-Alg
      lamin = 0.001
      Nlam = 80


 CROSS VALIDATION PARAMETERS: 
      numerical_method : Path-Alg
      one-SE method : True
      Nsubset = 5
      lamin = 0.001
      Nlam = 80


 STABILITY SELECTION PARAMETERS: 
      numerical_method : Path-Alg
      method : max
      B = 50
      q = 10
      percent_nS = 0.5
      threshold = 0.7
      lamin = 0.01
      Nlam = 50

 LAMBDA FIXED : 
 Selected variables :  17    59    123    
 Running time :  0.104s

 PATH COMPUTATION : 
 Running time :  0.638s

 CROSS VALIDATION : 
 Selected variables :  16    17    57    59    64    73    74    76    93    115    123    134    137    181    
 Running time :  2.1s

 STABILITY SELECTION : 
 Selected variables :  17    59    76    123    137    
 Running time :  6.062s

```

[Figures: stability selection profile and β estimates, cross-validation β estimates and error curve, β at the fixed λ, and the computed solution path]

Log-contrast regression for microbiome data

In the accompanying notebook we study several microbiome data sets. We showcase two examples below.

BMI prediction using the COMBO dataset

We first consider the COMBO data set and show how to predict Body Mass Index (BMI) from microbial genus abundances and two non-compositional covariates using "filtered_data".

```python
import numpy as np
from classo import csv_to_np, classo_problem, clr

# Load microbiome and covariate data X
X0 = csv_to_np('COMBO_data/complete_data/GeneraCounts.csv', begin=0).astype(float)
X_C = csv_to_np('COMBO_data/CaloriData.csv', begin=0).astype(float)
X_F = csv_to_np('COMBO_data/FatData.csv', begin=0).astype(float)

# Load BMI measurements y
y = csv_to_np('COMBO_data/BMI.csv', begin=0).astype(float)[:, 0]
labels = csv_to_np('COMBO_data/complete_data/GeneraPhylo.csv').astype(str)[:, -1]

# Normalize/transform data
y = y - np.mean(y)                    # BMI data (n = 96)
X_C = X_C - np.mean(X_C, axis=0)      # Covariate data (Calorie)
X_F = X_F - np.mean(X_F, axis=0)      # Covariate data (Fat)
X0 = clr(X0, 1 / 2).T

# Set up design matrix and zero-sum constraints for 45 genera
X = np.concatenate((X0, X_C, X_F, np.ones((len(X0), 1))), axis=1)  # Joint microbiome and covariate data and offset
label = np.concatenate([labels, np.array(['Calorie', 'Fat', 'Bias'])])
C = np.ones((1, len(X[0])))
C[0, -1], C[0, -2], C[0, -3] = 0., 0., 0.

# Set up c-lasso problem
problem = classo_problem(X, y, C, label=label)

# Use stability selection with theoretical lambda [Combettes & Müller, 2020b]
problem.model_selection.StabSelparameters.method = 'lam'
problem.model_selection.StabSelparameters.threshold_label = 0.5

# Use formulation R3
problem.formulation.concomitant = True

problem.solve()
print(problem)
print(problem.solution)

# Use formulation R4
problem.formulation.huber = True
problem.formulation.concomitant = True

problem.solve()
print(problem)
print(problem.solution)
```
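The clr call above applies a centered log-ratio transform to the raw counts. A minimal numpy sketch of that step, assuming classo's clr adds the given pseudocount before taking logs (a hypothetical re-implementation for illustration, not classo's actual code):

```python
import numpy as np

def clr_sketch(counts, pseudocount=0.5):
    # Shift by a pseudocount to avoid log(0), take logs, and center
    # each column (one sample per column) over its features.
    Z = np.log(counts + pseudocount)
    return Z - Z.mean(axis=0)
```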

[Figures: stability profile and β solution for formulation R3, and stability profile and β solution for formulation R4]

pH prediction using the 88 soils dataset

The next microbiome example considers the 88 soils dataset from Lauber et al., 2009.

The task is to predict the pH of the soil from microbial abundance data. A similar analysis is available in Tree-Aggregated Predictive Modeling of Microbiome Data, with Central Park soil data from Ramirez et al.

Code to run this application is available in the accompanying notebook under pH data. Below is a summary of a c-lasso problem instance (using the R3 formulation).

```
FORMULATION: R3

MODEL SELECTION COMPUTED:
     Lambda fixed
     Path
     Stability selection

LAMBDA FIXED PARAMETERS:
     numerical_method = Path-Alg
     rescaled lam : True
     threshold = 0.004
     lam : theoretical
     theoretical_lam = 0.2182

PATH PARAMETERS:
     numerical_method : Path-Alg
     lamin = 0.001
     Nlam = 80

STABILITY SELECTION PARAMETERS:
     numerical_method : Path-Alg
     method : lam
     B = 50
     q = 10
     percent_nS = 0.5
     threshold = 0.7
     lam = theoretical
     theoretical_lam = 0.3085
```

The c-lasso estimation results are summarized below:

```
LAMBDA FIXED :
   Sigma  =  0.198
   Selected variables :  14    18    19    39    43    57    62    85    93    94    104    107
   Running time :  0.008s

PATH COMPUTATION :
   Running time :  0.12s

STABILITY SELECTION :
   Selected variables :  2    12    15
   Running time :  0.287s
```

[Figures: c-lasso results for the 88 soils pH prediction example]

Optimization schemes

The available problem formulations [R1-C2] require different algorithmic strategies for efficiently solving the underlying optimization problem. We have implemented four algorithms (with provable convergence guarantees) that vary in generality and are not necessarily applicable to all problems. For each problem type, c-lasso has a default algorithm setting that proved to be the fastest in our numerical experiments.

Path algorithms (Path-Alg)

This is the default algorithm for non-concomitant problems [R1,R2,C1,C2]. The algorithm uses the fact that the solution path along λ is piecewise-affine (as shown, e.g., in [1]). When Least-Squares is used as objective function, we derive a novel efficient procedure that allows us to also derive the solution for the concomitant problem [R3] along the path with little extra computational overhead.

Projected primal-dual splitting method (P-PDS):

This algorithm is derived from [2] and belongs to the class of proximal splitting algorithms. It extends the classical Forward-Backward (FB) (aka proximal gradient descent) algorithm to handle an additional linear equality constraint via projection. In the absence of a linear constraint, the method reduces to FB. This method can solve problem [R1]. For the Huber problem [R2], P-PDS can solve the mean-shift formulation of the problem (see [6]).

Projection-free primal-dual splitting method (PF-PDS):

This algorithm is a special case of an algorithm proposed in [3] and also belongs to the class of proximal splitting algorithms. The algorithm does not require projection operators, which may be beneficial when C has a more complex structure. In the absence of a linear constraint, the method reduces to the Forward-Backward-Forward scheme. This method can solve problem [R1]. For the Huber problem [R2], PF-PDS can solve the mean-shift formulation of the problem (see [6]).

Douglas-Rachford-type splitting method (DR)

This algorithm is the most general one and can solve all regression problems [R1-R4]. It is based on Douglas-Rachford splitting in a higher-dimensional product space and makes use of the proximity operators of the perspective of the LS objective (see [4,5]). The Huber problem with concomitant scale [R4] is reformulated as a scaled Lasso problem with mean shift (see [6]) and is thus solved in (n + d) dimensions.
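The solver used for each model-selection task can be chosen through the numerical_method attributes that appear in the parameter summaries above; a minimal sketch (which solvers are admissible depends on the chosen formulation):

```python
from classo import classo_problem

problem = classo_problem(X, y, C)

# Use Douglas-Rachford splitting instead of the default path algorithm
# for the fixed-lambda solve and for stability selection.
problem.model_selection.LAMfixedparameters.numerical_method = 'DR'
problem.model_selection.StabSelparameters.numerical_method = 'DR'
problem.solve()
```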

References

Owner

  • Login: Leo-Simpson
  • Kind: user

JOSS Publication

c-lasso - a Python package for constrained sparse and robust regression and classification
Published
January 17, 2021
Volume 6, Issue 57, Page 2844
Authors
Léo Simpson
Technische Universität München
Patrick L. Combettes
Department of Mathematics, North Carolina State University, Raleigh
Christian L. Müller
Center for Computational Mathematics, Flatiron Institute, New York, Institute of Computational Biology, Helmholtz Zentrum München, Department of Statistics, Ludwig-Maximilians-Universität München
Editor
Matthew Sottile
Tags
regression, classification, constrained regression, Lasso, Huber function, Square Hinge SVM, convex optimization, perspective function

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 3
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 3
  • Pull request event: 1
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 906
  • Total Committers: 3
  • Avg Commits per committer: 302.0
  • Development Distribution Score (DDS): 0.268
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Leo-Simpson 5****n 663
Christian L. Müller m****n 242
Leo Simpson s****n@r****t 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 14
  • Total pull requests: 4
  • Average time to close issues: 9 days
  • Average time to close pull requests: 1 minute
  • Total issue authors: 4
  • Total pull request authors: 3
  • Average comments per issue: 3.07
  • Average comments per pull request: 0.25
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: 1 day
  • Average time to close pull requests: 1 minute
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • glemaitre (7)
  • jbytecode (5)
  • viettr (1)
  • bl6594 (1)
Pull Request Authors
  • muellsen (2)
  • adamovanja (2)
  • Leo-Simpson (1)

Dependencies

requirements.txt pypi
  • matplotlib *
  • numpy *
  • pandas *
  • pytest *
  • pytest-cov *
  • scipy *
  • sphinx *
  • sphinx-gallery *
  • sphinx_rtd_theme *
setup.py pypi
  • numpy *