pysr3

pysr3: A Python Package for Sparse Relaxed Regularized Regression - Published in JOSS (2023)

https://github.com/aksholokhov/pysr3

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Economics (Social Sciences) - 40% confidence
Engineering (Computer Science) - 40% confidence
Artificial Intelligence and Machine Learning (Computer Science) - 40% confidence
Last synced: 4 months ago

Repository

SciKit-Learn compatible library for training mixed-effects models.

Basic Info
  • Host: GitHub
  • Owner: aksholokhov
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 13.4 MB
Statistics
  • Stars: 13
  • Watchers: 1
  • Forks: 5
  • Open Issues: 0
  • Releases: 3
Created about 6 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Code of conduct

README.md


Quickstart with pysr3

SR3 is a relaxation method designed for accurate feature selection. It currently supports:

  • Linear Models (L0, LASSO, A-LASSO, CAD, SCAD)
  • Linear Mixed-Effect Models (L0, LASSO, A-LASSO, CAD, SCAD)

Installation

pysr3 can be installed via

```bash
pip install pysr3>=0.3.5
```

```python
from pysr3.__about__ import __version__

print(f"This tutorial was generated using PySR3 v{__version__}\n"
      "You might see slightly different numerical results if you are using a different version of the library.")
```

This tutorial was generated using PySR3 v0.3.5
You might see slightly different numerical results if you are using a different version of the library.

Requirements

Make sure that Python 3.6 or higher is installed. The package has the following dependencies, as listed in requirements.txt:

  • numpy>=1.21.1
  • pandas>=1.3.1
  • scipy>=1.7.1
  • PyYAML>=5.4.1
  • scikit_learn>=0.24.2

Usage

pysr3 models are fully compatible with the scikit-learn API, so you can use them as you would any other sklearn model.
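
For example, a pysr3 estimator can be passed straight to utilities such as `cross_val_score` (a minimal, self-contained sketch; the toy data and the `lam=1.0` value below are illustrative choices rather than part of the tutorial that follows, and it assumes the estimators expose the usual fit/predict/score methods, as full sklearn compatibility implies):

```python
# Illustrative sklearn-interoperability sketch; data and lam value are hypothetical.
import numpy as np
from sklearn.model_selection import cross_val_score

from pysr3.linear.models import LinearL1ModelSR3

rng = np.random.default_rng(0)
a_demo = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = 1.0                                # only the first three features matter
b_demo = a_demo @ true_coef + 0.1 * rng.normal(size=100)

model = LinearL1ModelSR3(lam=1.0)                  # hypothetical regularization strength
# cross_val_score clones and refits the estimator on each split,
# which only works because the estimator follows the sklearn interface.
scores = cross_val_score(model, a_demo, b_demo, cv=3)
print(scores)
```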

Linear Models

A simple example of using SR3-empowered LASSO for feature selection is shown below.

```python
import numpy as np

from pysr3.linear.problems import LinearProblem

# Create a sample dataset
seed = 42
num_objects = 300
num_features = 500
np.random.seed(seed)

# create a vector of true model's coefficients
true_x = np.random.choice(2, size=num_features, p=np.array([0.9, 0.1]))

# create sample data
a = 10 * np.random.randn(num_objects, num_features)
b = a.dot(true_x) + np.random.randn(num_objects)

print(f"The dataset has {a.shape[0]} objects and {a.shape[1]} features; \n"
      f"The vector of true parameters contains {sum(true_x != 0)} non-zero elements out of {num_features}.")
```

The dataset has 300 objects and 500 features; 
The vector of true parameters contains 55 non-zero elements out of 500.

First, let's fit a model with a fixed parameter lambda:

```python
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.metrics import confusion_matrix

lam = 0.1 * np.max(np.abs(a.T.dot(b)))
model = LinearL1ModelSR3(lam=lam, el=1e5)
```

```python
%%timeit
model.fit(a, b)
```

38.6 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

```python
maybe_x = model.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n")
```

The model found 55 out of 55 features correctly, but also chose 5 out of 445 extra irrelevant features. 

Now let's see if we can improve it by adding grid-search:

```python
# Automatic feature selection using an information criterion
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.fixes import loguniform

# Here we use SR3-empowered LASSO, but many other popular regularizers are also available.
# See the glossary of models for more details.
model = LinearL1ModelSR3()

# We will search for the best model over a range of strengths for the regularizer
params = {
    "lam": loguniform(1e-1, 1e2)
}
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=50,
                              # The function below evaluates an information criterion
                              # on the test portion of CV-splits.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y, ic='bic'))

selector.fit(a, b)
maybe_x = selector.best_estimator_.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n"
      f"The best parameter is {selector.best_params_}")
```

The model found 55 out of 55 features correctly, but also chose 1 out of 445 extra irrelevant features. 
The best parameter is {'lam': 0.15055187290939537}

Note that the discovered coefficients will be biased downwards due to L1 regularization.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
indep = list(range(num_features))
ax.plot(indep, maybe_x, label='Discovered Coefficients')
ax.plot(indep, true_x, alpha=0.5, label='True Coefficients')
ax.legend(bbox_to_anchor=(1.05, 1))
plt.show()
```

[Figure: discovered vs. true coefficients for the linear model]

You can get rid of the bias by refitting the model using only features that were selected.
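
One simple way to do that (a plain-NumPy sketch, not part of the pysr3 API; it reuses `a`, `b`, `maybe_x`, and `model.tol_solver` from the cells above) is an ordinary least-squares refit restricted to the selected columns:

```python
# Debiasing sketch: refit OLS on the support found above.
selected = np.abs(maybe_x) > np.sqrt(model.tol_solver)   # same selection rule as above
refit_coef, *_ = np.linalg.lstsq(a[:, selected], b, rcond=None)

debiased_x = np.zeros_like(maybe_x, dtype=float)
debiased_x[selected] = refit_coef                        # unshrunk estimates on the selected support
```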

Linear Mixed-Effects Models

Below we show how to use Linear Mixed-Effects (LME) models for simultaneous selection of fixed and random effects.

```python
from pysr3.lme.models import L1LmeModelSR3
from pysr3.lme.problems import LMEProblem, LMEStratifiedShuffleSplit

# Here we generate a random linear mixed-effects problem.
# To use your own dataset, check LMEProblem.from_dataframe and LMEProblem.from_x_y
problem, true_parameters = LMEProblem.generate(
    groups_sizes=[10] * 8,                  # 8 groups, 10 objects each
    features_labels=["fixed+random"] * 20,  # 20 features, each one having both fixed and random components
    beta=np.array([0, 1] * 10),             # True beta (fixed effects) has every other coefficient active
    gamma=np.array([0, 0, 0, 1] * 5),       # True gamma (variances of random effects) has every fourth coefficient active
    obs_var=0.1,                            # The errors have standard errors of sqrt(0.1) ~= 0.33
    seed=seed                               # random seed, for reproducibility
)

# LMEProblem provides a very convenient representation of the problem.
# See the documentation for more details.
# It can also be converted to a more familiar representation:
x, y, columns_labels = problem.to_x_y()
# columns_labels describe the roles of the columns in x: fixed effect, random effect,
# or both of those, as well as group labels and observation standard deviation.

# You can also convert it to a pandas dataframe if you'd like.
pandas_dataframe = problem.to_dataframe()
```

```python
# We use the SR3-empowered LASSO model, but many other popular models are also available.
# See the glossary of models for more details.
model = L1LmeModelSR3(practical=True)

# We're going to select features by varying the strength of the prior
# and choosing the model that yields the best information criterion
# on the validation set.
params = {
    "lam": loguniform(1e-3, 1e2),
    "ell": loguniform(1e-1, 1e2)
}

# We use the standard functionality of sklearn to perform the grid-search.
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=30,  # number of points from the parameter space to sample
                              # the class below implements CV-splits for LME models
                              cv=LMEStratifiedShuffleSplit(n_splits=2, test_size=0.5,
                                                           random_state=seed,
                                                           columns_labels=columns_labels),
                              # The function below evaluates the information criterion
                              # on the test sets during cross-validation.
                              # We use cAIC from Vaida, but other options (BIC, Muller's IC) are also available.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y,
                                                                                       columns_labels=columns_labels,
                                                                                       ic="vaida_aic"),
                              random_state=seed,
                              n_jobs=20)
selector.fit(x, y, columns_labels=columns_labels)
best_model = selector.best_estimator_

maybe_beta = best_model.coef_["beta"]
maybe_gamma = best_model.coef_["gamma"]

# Since the solver stops within sqrt(tol) from the minimum, we use it as a threshold
# for deciding whether a feature is selected or not.
ftn, ffp, ffn, ftp = confusion_matrix(y_true=true_parameters["beta"],
                                      y_pred=abs(maybe_beta) > np.sqrt(best_model.tol_solver)).ravel()
rtn, rfp, rfn, rtp = confusion_matrix(y_true=true_parameters["gamma"],
                                      y_pred=abs(maybe_gamma) > np.sqrt(best_model.tol_solver)).ravel()

print(
    f"The model found {ftp} out of {ftp + ffn} correct fixed features, and also chose {ffp} out of {ftn + ffp} extra irrelevant fixed features. \n"
    f"It also identified {rtp} out of {rtp + rfn} random effects correctly, and got {rfp} out of {rtn + rfp} non-present random effects. \n"
    f"The best sparsity parameter is {selector.best_params_}")
```

The model found 10 out of 10 correct fixed features, and also chose 0 out of 10 extra irrelevant fixed features. 
It also identified 5 out of 5 random effects correctly, and got 0 out of 15 non-present random effects. 
The best sparsity parameter is {'ell': 0.3972110727381912, 'lam': 0.3725393839578885}

```python
fig, axs = plt.subplots(1, 2, figsize=(9, 3), sharey=True)

indep_beta = list(range(np.size(true_parameters["beta"])))
indep_gamma = list(range(np.size(true_parameters["gamma"])))

axs[0].set_title(r"$\beta$, Fixed Effects")
axs[0].scatter(indep_beta, maybe_beta, label='Discovered')
axs[0].scatter(indep_beta, true_parameters["beta"], alpha=0.5, label='True')

axs[1].set_title(r"$\gamma$, Variances of Random Effects")
axs[1].scatter(indep_gamma, maybe_gamma, label='Discovered')
axs[1].scatter(indep_gamma, true_parameters["gamma"], alpha=0.5, label='True')
axs[1].legend(bbox_to_anchor=(1.55, 1))
plt.show()
```

[Figure: discovered vs. true fixed effects and random-effect variances]


Owner

  • Name: Aleksei Sholokhov
  • Login: aksholokhov
  • Kind: user
  • Location: Seattle
  • Company: University of Washington

Ph.D. Student in Applied Mathematics, Graduate Research Assistant at IHME, UW

JOSS Publication

pysr3: A Python Package for Sparse Relaxed Regularized Regression
Published
April 23, 2023
Volume 8, Issue 84, Page 5155
Authors
Aleksei Sholokhov ORCID
Department of Applied Mathematics, University of Washington
Peng Zheng ORCID
Department of Health Metrics Sciences, University of Washington
Aleksandr Aravkin ORCID
Department of Applied Mathematics, University of Washington, Department of Health Metrics Sciences, University of Washington
Editor
Paul La Plante ORCID
Tags
feature selection linear models mixed-effect models regularization

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 374
  • Total Committers: 3
  • Avg Commits per committer: 124.667
  • Development Distribution Score (DDS): 0.021
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Aleksei Sholokhov a****h@u****u 366
GitHub Action a****n@g****m 6
Aleksandr Aravkin s****n@g****m 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 10
  • Average time to close issues: 23 days
  • Average time to close pull requests: 13 minutes
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 3.5
  • Average comments per pull request: 0.1
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mhu48 (1)
  • blakeaw (1)
Pull Request Authors
  • aksholokhov (10)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 16 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 4
  • Total maintainers: 1
pypi.org: pysr3

Python Library for Sparse Relaxed Regularized Regression.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 16 Last month
Rankings
Dependent packages count: 10.1%
Forks count: 16.8%
Stargazers count: 17.7%
Average: 18.9%
Dependent repos count: 21.6%
Downloads: 28.4%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • PyYAML >=5.4.1
  • ipython *
  • numpy >=1.21.1
  • pandas >=1.3.1
  • scikit_learn >=0.24.2
  • scipy >=1.7.1
.github/workflows/deploy-docs.yml actions
  • JamesIves/github-pages-deploy-action v4 composite
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • r-lib/actions/setup-pandoc v1 composite
.github/workflows/joss_pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
.github/workflows/testing_and_coverage.yml actions
  • actions/checkout master composite
  • actions/setup-python master composite
  • codecov/codecov-action v2 composite
.github/workflows/update-readme.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • ad-m/github-push-action master composite