pysr3

pysr3: A Python Package for Sparse Relaxed Regularized Regression - Published in JOSS (2023)

https://github.com/aksholokhov/pysr3

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Economics (Social Sciences) - 40% confidence
Engineering (Computer Science) - 40% confidence
Artificial Intelligence and Machine Learning (Computer Science) - 40% confidence
Last synced: 4 months ago

Repository

SciKit-Learn compatible library for training mixed-effects models.

Basic Info
  • Host: GitHub
  • Owner: aksholokhov
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 13.4 MB
Statistics
  • Stars: 13
  • Watchers: 1
  • Forks: 5
  • Open Issues: 0
  • Releases: 3
Created about 6 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Code of conduct

README.md


Quickstart with pysr3

SR3 is a relaxation method designed for accurate feature selection. It currently supports:

  • Linear Models (L0, LASSO, A-LASSO, CAD, SCAD)
  • Linear Mixed-Effect Models (L0, LASSO, A-LASSO, CAD, SCAD)

Installation

pysr3 can be installed via

```bash
pip install pysr3>=0.3.5
```

```python
from pysr3.__about__ import __version__

print(f"This tutorial was generated using PySR3 v{__version__}\n"
      "You might see slightly different numerical results if you are using a different version of the library.")
```

This tutorial was generated using PySR3 v0.3.5
You might see slightly different numerical results if you are using a different version of the library.

Requirements

Make sure that Python 3.6 or higher is installed. The package has the following dependencies, as listed in requirements.txt:

  • numpy>=1.21.1
  • pandas>=1.3.1
  • scipy>=1.7.1
  • PyYAML>=5.4.1
  • scikit_learn>=0.24.2

Usage

pysr3 models are fully compatible with the scikit-learn API, so you can use them as you would any other sklearn model.
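
For example, a pysr3 estimator can be passed straight to utilities such as `cross_val_score` (a minimal, self-contained sketch; the toy data and the `lam=1.0` value below are illustrative choices rather than part of the tutorial that follows, and it assumes the estimators expose the usual fit/predict/score methods, as full sklearn compatibility implies):

```python
# Illustrative sklearn-interoperability sketch; data and lam value are hypothetical.
import numpy as np
from sklearn.model_selection import cross_val_score

from pysr3.linear.models import LinearL1ModelSR3

rng = np.random.default_rng(0)
a_demo = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = 1.0                                # only the first three features matter
b_demo = a_demo @ true_coef + 0.1 * rng.normal(size=100)

model = LinearL1ModelSR3(lam=1.0)                  # hypothetical regularization strength
# cross_val_score clones and refits the estimator on each split,
# which only works because the estimator follows the sklearn interface.
scores = cross_val_score(model, a_demo, b_demo, cv=3)
print(scores)
```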

Linear Models

A simple example of using SR3-empowered LASSO for feature selection is shown below.

```python
import numpy as np

from pysr3.linear.problems import LinearProblem

# Create a sample dataset
seed = 42
num_objects = 300
num_features = 500
np.random.seed(seed)

# create a vector of true model's coefficients
true_x = np.random.choice(2, size=num_features, p=np.array([0.9, 0.1]))

# create sample data
a = 10 * np.random.randn(num_objects, num_features)
b = a.dot(true_x) + np.random.randn(num_objects)

print(f"The dataset has {a.shape[0]} objects and {a.shape[1]} features; \n"
      f"The vector of true parameters contains {sum(true_x != 0)} non-zero elements out of {num_features}.")
```

The dataset has 300 objects and 500 features; 
The vector of true parameters contains 55 non-zero elements out of 500.

First, let's fit a model with a fixed parameter lambda:

```python
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.metrics import confusion_matrix

lam = 0.1 * np.max(np.abs(a.T.dot(b)))
model = LinearL1ModelSR3(lam=lam, el=1e5)
```

```python
%%timeit
model.fit(a, b)
```

38.6 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

```python
maybe_x = model.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n")
```

The model found 55 out of 55 features correctly, but also chose 5 out of 445 extra irrelevant features. 

Now let's see if we can improve it by adding grid-search:

```python
# Automatic feature selection using an information criterion
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.fixes import loguniform

# Here we use SR3-empowered LASSO, but many other popular regularizers are also available.
# See the glossary of models for more details.
model = LinearL1ModelSR3()

# We will search for the best model over a range of strengths for the regularizer
params = {
    "lam": loguniform(1e-1, 1e2)
}
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=50,
                              # The function below evaluates an information criterion
                              # on the test portion of CV-splits.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y, ic='bic'))

selector.fit(a, b)
maybe_x = selector.best_estimator_.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n"
      f"The best parameter is {selector.best_params_}")
```

The model found 55 out of 55 features correctly, but also chose 1 out of 445 extra irrelevant features. 
The best parameter is {'lam': 0.15055187290939537}

Note that the discovered coefficients will be biased downwards due to L1 regularization.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
indep = list(range(num_features))
ax.plot(indep, maybe_x, label='Discovered Coefficients')
ax.plot(indep, true_x, alpha=0.5, label='True Coefficients')
ax.legend(bbox_to_anchor=(1.05, 1))
plt.show()
```

[Figure: discovered vs. true coefficients for the linear model]

You can get rid of the bias by refitting the model using only features that were selected.
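
One simple way to do that (a plain-NumPy sketch, not part of the pysr3 API; it reuses `a`, `b`, `maybe_x`, and `model.tol_solver` from the cells above) is an ordinary least-squares refit restricted to the selected columns:

```python
# Debiasing sketch: refit OLS on the support found above.
selected = np.abs(maybe_x) > np.sqrt(model.tol_solver)   # same selection rule as above
refit_coef, *_ = np.linalg.lstsq(a[:, selected], b, rcond=None)

debiased_x = np.zeros_like(maybe_x, dtype=float)
debiased_x[selected] = refit_coef                        # unshrunk estimates on the selected support
```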

Linear Mixed-Effects Models

Below we show how to use Linear Mixed-Effects (LME) models for simultaneous selection of fixed and random effects.

```python
from pysr3.lme.models import L1LmeModelSR3
from pysr3.lme.problems import LMEProblem, LMEStratifiedShuffleSplit

# Here we generate a random linear mixed-effects problem.
# To use your own dataset, check LMEProblem.from_dataframe and LMEProblem.from_x_y
problem, true_parameters = LMEProblem.generate(
    groups_sizes=[10] * 8,                  # 8 groups, 10 objects each
    features_labels=["fixed+random"] * 20,  # 20 features, each one having both fixed and random components
    beta=np.array([0, 1] * 10),             # True beta (fixed effects) has every other coefficient active
    gamma=np.array([0, 0, 0, 1] * 5),       # True gamma (variances of random effects) has every fourth coefficient active
    obs_var=0.1,                            # The errors have standard errors of sqrt(0.1) ~= 0.33
    seed=seed                               # random seed, for reproducibility
)

# LMEProblem provides a very convenient representation of the problem.
# See the documentation for more details.
# It can also be converted to a more familiar representation:
x, y, columns_labels = problem.to_x_y()
# columns_labels describe the roles of the columns in x: fixed effect, random effect,
# or both of those, as well as group labels and observation standard deviation.

# You can also convert it to a pandas dataframe if you'd like.
pandas_dataframe = problem.to_dataframe()
```

```python
# We use the SR3-empowered LASSO model, but many other popular models are also available.
# See the glossary of models for more details.
model = L1LmeModelSR3(practical=True)

# We're going to select features by varying the strength of the prior
# and choosing the model that yields the best information criterion
# on the validation set.
params = {
    "lam": loguniform(1e-3, 1e2),
    "ell": loguniform(1e-1, 1e2)
}

# We use the standard functionality of sklearn to perform the grid-search.
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=30,  # number of points from the parameter space to sample
                              # the class below implements CV-splits for LME models
                              cv=LMEStratifiedShuffleSplit(n_splits=2, test_size=0.5,
                                                           random_state=seed,
                                                           columns_labels=columns_labels),
                              # The function below evaluates the information criterion
                              # on the test sets during cross-validation.
                              # We use cAIC from Vaida, but other options (BIC, Muller's IC) are also available.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y,
                                                                                       columns_labels=columns_labels,
                                                                                       ic="vaida_aic"),
                              random_state=seed,
                              n_jobs=20)
selector.fit(x, y, columns_labels=columns_labels)
best_model = selector.best_estimator_

maybe_beta = best_model.coef_["beta"]
maybe_gamma = best_model.coef_["gamma"]

# Since the solver stops within sqrt(tol) from the minimum, we use it as a threshold
# for deciding whether a feature is selected or not.
ftn, ffp, ffn, ftp = confusion_matrix(y_true=true_parameters["beta"],
                                      y_pred=abs(maybe_beta) > np.sqrt(best_model.tol_solver)).ravel()
rtn, rfp, rfn, rtp = confusion_matrix(y_true=true_parameters["gamma"],
                                      y_pred=abs(maybe_gamma) > np.sqrt(best_model.tol_solver)).ravel()

print(
    f"The model found {ftp} out of {ftp + ffn} correct fixed features, and also chose {ffp} out of {ftn + ffp} extra irrelevant fixed features. \n"
    f"It also identified {rtp} out of {rtp + rfn} random effects correctly, and got {rfp} out of {rtn + rfp} non-present random effects. \n"
    f"The best sparsity parameter is {selector.best_params_}")
```

The model found 10 out of 10 correct fixed features, and also chose 0 out of 10 extra irrelevant fixed features. 
It also identified 5 out of 5 random effects correctly, and got 0 out of 15 non-present random effects. 
The best sparsity parameter is {'ell': 0.3972110727381912, 'lam': 0.3725393839578885}

```python
fig, axs = plt.subplots(1, 2, figsize=(9, 3), sharey=True)

indep_beta = list(range(np.size(true_parameters["beta"])))
indep_gamma = list(range(np.size(true_parameters["gamma"])))

axs[0].set_title(r"$\beta$, Fixed Effects")
axs[0].scatter(indep_beta, maybe_beta, label='Discovered')
axs[0].scatter(indep_beta, true_parameters["beta"], alpha=0.5, label='True')

axs[1].set_title(r"$\gamma$, Variances of Random Effects")
axs[1].scatter(indep_gamma, maybe_gamma, label='Discovered')
axs[1].scatter(indep_gamma, true_parameters["gamma"], alpha=0.5, label='True')
axs[1].legend(bbox_to_anchor=(1.55, 1))
plt.show()
```

[Figure: discovered vs. true fixed effects and random-effect variances]


Owner

  • Name: Aleksei Sholokhov
  • Login: aksholokhov
  • Kind: user
  • Location: Seattle
  • Company: University of Washington

Ph.D. Student in Applied Mathematics, Graduate Research Assistant at IHME, UW

JOSS Publication

pysr3: A Python Package for Sparse Relaxed Regularized Regression
Published
April 23, 2023
Volume 8, Issue 84, Page 5155
Authors
Aleksei Sholokhov ORCID
Department of Applied Mathematics, University of Washington
Peng Zheng ORCID
Department of Health Metrics Sciences, University of Washington
Aleksandr Aravkin ORCID
Department of Applied Mathematics, University of Washington, Department of Health Metrics Sciences, University of Washington
Editor
Paul La Plante ORCID
Tags
feature selection linear models mixed-effect models regularization

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 374
  • Total Committers: 3
  • Avg Commits per committer: 124.667
  • Development Distribution Score (DDS): 0.021
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Aleksei Sholokhov a****h@u****u 366
GitHub Action a****n@g****m 6
Aleksandr Aravkin s****n@g****m 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 10
  • Average time to close issues: 23 days
  • Average time to close pull requests: 13 minutes
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 3.5
  • Average comments per pull request: 0.1
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mhu48 (1)
  • blakeaw (1)
Pull Request Authors
  • aksholokhov (10)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 16 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 4
  • Total maintainers: 1
pypi.org: pysr3

Python Library for Sparse Relaxed Regularized Regression.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 16 Last month
Rankings
Dependent packages count: 10.1%
Forks count: 16.8%
Stargazers count: 17.7%
Average: 18.9%
Dependent repos count: 21.6%
Downloads: 28.4%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • PyYAML >=5.4.1
  • ipython *
  • numpy >=1.21.1
  • pandas >=1.3.1
  • scikit_learn >=0.24.2
  • scipy >=1.7.1
.github/workflows/deploy-docs.yml actions
  • JamesIves/github-pages-deploy-action v4 composite
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • r-lib/actions/setup-pandoc v1 composite
.github/workflows/joss_pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
.github/workflows/testing_and_coverage.yml actions
  • actions/checkout master composite
  • actions/setup-python master composite
  • codecov/codecov-action v2 composite
.github/workflows/update-readme.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • ad-m/github-push-action master composite