pysr3
pysr3: A Python Package for Sparse Relaxed Regularized Regression - Published in JOSS (2023)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org
- ✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Scientific Fields
Repository
SciKit-Learn compatible library for training mixed-effects models.
Basic Info
Statistics
- Stars: 13
- Watchers: 1
- Forks: 5
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
Quickstart with pysr3
SR3 is a relaxation method designed for accurate feature selection. It currently supports:
- Linear Models (L0, LASSO, A-LASSO, CAD, SCAD)
- Linear Mixed-Effect Models (L0, LASSO, A-LASSO, CAD, SCAD)
Installation
pysr3 can be installed via
```bash
pip install "pysr3>=0.3.5"
```
```python
from pysr3.__about__ import __version__

print(f"This tutorial was generated using PySR3 v{__version__}\n"
      "You might see slightly different numerical results if you are using a different version of the library.")
```
This tutorial was generated using PySR3 v0.3.5
You might see slightly different numerical results if you are using a different version of the library.
Requirements
Make sure that Python 3.6 or higher is installed. The package has the following dependencies, as listed in requirements.txt:
- numpy>=1.21.1
- pandas>=1.3.1
- scipy>=1.7.1
- PyYAML>=5.4.1
- scikit_learn>=0.24.2
Usage
pysr3 models are fully compatible with the scikit-learn API, so you can use them as you would use any other sklearn model.
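As a minimal sketch (not part of the original tutorial) of what this compatibility buys you, the snippet below clones and re-parameterizes a pysr3 estimator and drops it into a standard sklearn pipeline; it assumes `LinearL1ModelSR3` exposes the usual estimator interface (`fit`, `get_params`, `set_params`), which the examples further down rely on.

```python
# A minimal sketch, assuming LinearL1ModelSR3 follows the standard sklearn
# estimator interface (fit, get_params, set_params). Not part of the original tutorial.
import numpy as np
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pysr3.linear.models import LinearL1ModelSR3

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=100)

# Estimators can be cloned and re-parameterized like any sklearn model...
lasso_sr3 = clone(LinearL1ModelSR3()).set_params(lam=1.0)

# ...and dropped into standard sklearn pipelines.
pipe = Pipeline([("scale", StandardScaler()), ("select", lasso_sr3)])
pipe.fit(X, y)
print(pipe.named_steps["select"].coef_["x"])
```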
Linear Models
A simple example of using SR3-empowered LASSO for feature selection is shown below.
```python
import numpy as np
from pysr3.linear.problems import LinearProblem

# Create a sample dataset
seed = 42
num_objects = 300
num_features = 500
np.random.seed(seed)

# create a vector of the true model's coefficients
true_x = np.random.choice(2, size=num_features, p=np.array([0.9, 0.1]))

# create sample data
a = 10 * np.random.randn(num_objects, num_features)
b = a.dot(true_x) + np.random.randn(num_objects)

print(f"The dataset has {a.shape[0]} objects and {a.shape[1]} features; \n"
      f"The vector of true parameters contains {sum(true_x != 0)} non-zero elements out of {num_features}.")
```
The dataset has 300 objects and 500 features;
The vector of true parameters contains 55 non-zero elements out of 500.
First, let's fit a model with a fixed parameter lambda:
```python
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.metrics import confusion_matrix

lam = 0.1 * np.max(np.abs(a.T.dot(b)))
model = LinearL1ModelSR3(lam=lam, el=1e5)
```
```python
%%timeit
model.fit(a, b)
```
38.6 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```python
maybe_x = model.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n")
```
The model found 55 out of 55 features correctly, but also chose 5 out of 445 extra irrelevant features.
Now let's see if we can improve it by adding grid-search:
```python
# Automatic feature selection using an information criterion
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.fixes import loguniform

# Here we use SR3-empowered LASSO, but many other popular regularizers are also available.
# See the glossary of models for more details.
model = LinearL1ModelSR3()

# We will search for the best model over a range of strengths for the regularizer
params = {
    "lam": loguniform(1e-1, 1e2)
}
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=50,
                              # The function below evaluates an information criterion
                              # on the test portion of CV-splits.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y, ic='bic'))

selector.fit(a, b)
maybe_x = selector.best_estimator_.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()

print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n"
      f"The best parameter is {selector.best_params_}")
```
The model found 55 out of 55 features correctly, but also chose 1 out of 445 extra irrelevant features.
The best parameter is {'lam': 0.15055187290939537}
Note that the discovered coefficients will be biased downwards due to L1 regularization.
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
indep = list(range(num_features))
ax.plot(indep, maybe_x, label='Discovered Coefficients')
ax.plot(indep, true_x, alpha=0.5, label='True Coefficients')
ax.legend(bbox_to_anchor=(1.05, 1))
plt.show()
```

You can get rid of this bias by refitting the model using only the selected features.
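As a rough sketch (not part of the original tutorial), one way to do this is to rerun an ordinary least-squares fit restricted to the columns that survived thresholding; the names `a`, `b`, `maybe_x`, `num_features`, and `model` come from the snippets above.

```python
# A sketch of debiasing by refitting: ordinary least squares on the selected columns only.
# Uses a, b, maybe_x, num_features, and model from the snippets above.
selected = np.where(np.abs(maybe_x) > np.sqrt(model.tol_solver))[0]

debiased_x = np.zeros(num_features)
# Unregularized least squares restricted to the selected features removes the L1 shrinkage.
debiased_x[selected], *_ = np.linalg.lstsq(a[:, selected], b, rcond=None)
```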
Linear Mixed-Effects Models
Below we show how to use Linear Mixed-Effects (LME) models for simultaneous selection of fixed and random effects.
```python
from pysr3.lme.models import L1LmeModelSR3
from pysr3.lme.problems import LMEProblem, LMEStratifiedShuffleSplit

# Here we generate a random linear mixed-effects problem.
# To use your own dataset, check LMEProblem.from_dataframe and LMEProblem.from_x_y
problem, true_parameters = LMEProblem.generate(
    groups_sizes=[10] * 8,  # 8 groups, 10 objects each
    features_labels=["fixed+random"] * 20,  # 20 features, each having both fixed and random components
    beta=np.array([0, 1] * 10),  # True beta (fixed effects) has every other coefficient active
    gamma=np.array([0, 0, 0, 1] * 5),  # True gamma (variances of random effects) has every fourth coefficient active
    obs_var=0.1,  # The errors have standard deviations of sqrt(0.1) ~= 0.33
    seed=seed  # random seed, for reproducibility
)

# LMEProblem provides a very convenient representation
# of the problem. See the documentation for more details.

# It also can be converted to a more familiar representation:
x, y, columns_labels = problem.to_x_y()
# columns_labels describes the roles of the columns in x:
# fixed effect, random effect, or both, as well as group labels and observation standard deviation.

# You can also convert it to a pandas dataframe if you'd like.
pandas_dataframe = problem.to_dataframe()
```
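For orientation, here is a quick illustrative check (not from the original README) of what `to_x_y` returns, using the `x`, `y`, and `columns_labels` produced above.

```python
# A quick illustrative check (not from the original README): x is a plain numpy
# matrix, y is the vector of observations, and columns_labels records the role
# of each column of x (fixed effect, random effect, both, group label, or observation std).
print(x.shape, y.shape)
print(columns_labels)
```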
```python
# We use the SR3-empowered LASSO model, but many other popular models are also available.
# See the glossary of models for more details.
model = L1LmeModelSR3(practical=True)

# We're going to select features by varying the strength of the prior
# and choosing the model that yields the best information criterion
# on the validation set.
params = {
    "lam": loguniform(1e-3, 1e2),
    "ell": loguniform(1e-1, 1e2)
}

# We use the standard functionality of sklearn to perform grid-search.
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=30,  # number of points from the parameter space to sample
                              # the class below implements CV-splits for LME models
                              cv=LMEStratifiedShuffleSplit(n_splits=2,
                                                           test_size=0.5,
                                                           random_state=seed,
                                                           columns_labels=columns_labels),
                              # The function below will evaluate the information criterion
                              # on the test-sets during cross-validation.
                              # We use cAIC from Vaida, but other options (BIC, Muller's IC) are also available
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y,
                                                                                       columns_labels=columns_labels,
                                                                                       ic="vaida_aic"),
                              random_state=seed,
                              n_jobs=20)
selector.fit(x, y, columns_labels=columns_labels)
best_model = selector.best_estimator_

maybe_beta = best_model.coef_["beta"]
maybe_gamma = best_model.coef_["gamma"]

# Since the solver stops within sqrt(tol) from the minimum, we use it as a criterion for whether a feature
# is selected or not
ftn, ffp, ffn, ftp = confusion_matrix(y_true=true_parameters["beta"],
                                      y_pred=abs(maybe_beta) > np.sqrt(best_model.tol_solver)).ravel()
rtn, rfp, rfn, rtp = confusion_matrix(y_true=true_parameters["gamma"],
                                      y_pred=abs(maybe_gamma) > np.sqrt(best_model.tol_solver)).ravel()

print(f"The model found {ftp} out of {ftp + ffn} correct fixed features, and also chose {ffp} out of {ftn + ffp} extra irrelevant fixed features. \n"
      f"It also identified {rtp} out of {rtp + rfn} random effects correctly, and got {rfp} out of {rtn + rfp} non-present random effects. \n"
      f"The best sparsity parameter is {selector.best_params_}")
```
The model found 10 out of 10 correct fixed features, and also chose 0 out of 10 extra irrelevant fixed features.
It also identified 5 out of 5 random effects correctly, and got 0 out of 15 non-present random effects.
The best sparsity parameter is {'ell': 0.3972110727381912, 'lam': 0.3725393839578885}
```python
fig, axs = plt.subplots(1, 2, figsize=(9, 3), sharey=True)

indep_beta = list(range(np.size(true_parameters["beta"])))
indep_gamma = list(range(np.size(true_parameters["gamma"])))

axs[0].set_title(r"$\beta$, Fixed Effects")
axs[0].scatter(indep_beta, maybe_beta, label='Discovered')
axs[0].scatter(indep_beta, true_parameters["beta"], alpha=0.5, label='True')

axs[1].set_title(r"$\gamma$, Variances of Random Effects")
axs[1].scatter(indep_gamma, maybe_gamma, label='Discovered')
axs[1].scatter(indep_gamma, true_parameters["gamma"], alpha=0.5, label='True')
axs[1].legend(bbox_to_anchor=(1.55, 1))
plt.show()
```

Owner
- Name: Aleksei Sholokhov
- Login: aksholokhov
- Kind: user
- Location: Seattle
- Company: University of Washington
- Website: aksholokhov.github.io
- Twitter: aksholokhov
- Repositories: 2
- Profile: https://github.com/aksholokhov
Ph.D. Student in Applied Mathematics, Graduate Research Assistant in IHME UW
JOSS Publication
pysr3: A Python Package for Sparse Relaxed Regularized Regression
Authors
Tags
feature selection, linear models, mixed-effect models, regularization
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Aleksei Sholokhov | a****h@u****u | 366 |
| GitHub Action | a****n@g****m | 6 |
| Aleksandr Aravkin | s****n@g****m | 2 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 2
- Total pull requests: 10
- Average time to close issues: 23 days
- Average time to close pull requests: 13 minutes
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 3.5
- Average comments per pull request: 0.1
- Merged pull requests: 10
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mhu48 (1)
- blakeaw (1)
Pull Request Authors
- aksholokhov (10)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 16 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 4
- Total maintainers: 1
pypi.org: pysr3
Python Library for Sparse Relaxed Regularized Regression.
- Homepage: https://github.com/aksholokhov/pysr3
- Documentation: https://pysr3.readthedocs.io/
- License: GNU GPLv3
- Latest release: 0.3.5 (published over 2 years ago)
Rankings
Maintainers (1)
Dependencies
- PyYAML >=5.4.1
- ipython *
- numpy >=1.21.1
- pandas >=1.3.1
- scikit_learn >=0.24.2
- scipy >=1.7.1
- JamesIves/github-pages-deploy-action v4 composite
- actions/checkout v1 composite
- actions/setup-python v1 composite
- r-lib/actions/setup-pandoc v1 composite
- actions/checkout v2 composite
- actions/upload-artifact v1 composite
- openjournals/openjournals-draft-action master composite
- actions/checkout master composite
- actions/setup-python master composite
- codecov/codecov-action v2 composite
- actions/checkout v1 composite
- actions/setup-python v1 composite
- ad-m/github-push-action master composite
