https://github.com/chris-santiago/steps
A SciKit-Learn style feature selector using best subsets and stepwise regression.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails (1 of 3 committers, 33.3%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.5%, to scientific vocabulary)
Repository
A SciKit-Learn style feature selector using best subsets and stepwise regression.
Basic Info
- Host: GitHub
- Owner: chris-santiago
- License: MIT
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://chris-santiago.github.io/steps/
- Size: 781 KB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
step-select
A SciKit-Learn style feature selector using best subsets and stepwise regression.
Install
Create a virtual environment with Python 3.8 and install from PyPI:
```bash
pip install step-select
```
Use
Preliminaries
Note: this example requires two additional packages: pandas and statsmodels.
In this example we'll show how the ForwardSelector and SubsetSelector classes can be used on their own or in conjunction with a Scikit-Learn Pipeline object.
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```
We'll download the auto dataset via Statsmodels; we'll use mpg as the endogenous variable and the remaining variables as exogenous. We won't use make, as that would create several dummies and push the number of parameters past 12, which is too many for the SubsetSelector class; we'll also drop price.
```python
data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()
```
| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | 0 |
| 1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | 0 |
| 2 | AMC Spirit | 3799 | 22 | 0.0 | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | 0 |
| 3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | 0 |
| 4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | 0 |
```python
X = data.iloc[:, 3:]
y = data['mpg']
```
Forward Stepwise Selection
The ForwardSelector follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
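To make the loop concrete, here is a minimal sketch of greedy forward selection by AIC, written with statsmodels; it illustrates the algorithm described above and is not the package's internal implementation:

```python
# Illustrative sketch only -- not the package's internal implementation.
from statsmodels.api import OLS
from statsmodels.tools import add_constant

def forward_select(X, y):
    remaining = list(X.columns)
    selected = []
    best_aic = float('inf')
    while remaining:
        # Fit one candidate model per remaining feature and score it by AIC
        scores = {col: OLS(y, add_constant(X[selected + [col]])).fit().aic
                  for col in remaining}
        col, aic = min(scores.items(), key=lambda kv: kv[1])
        if aic >= best_aic:  # stop once adding a feature no longer improves the fit
            break
        best_aic = aic
        selected.append(col)
        remaining.remove(col)
    return selected
```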
The ForwardSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.
|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default False|
|metric|str|Optimization metric to use; must be one of aic or bic; default aic|
The ForwardSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.
```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```
ForwardSelector(normalize=True)
```python
X.loc[:, selector.best_support_]
```
| | rep78 | weight | length | gear_ratio | foreign |
|---|---|---|---|---|---|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |
74 rows × 5 columns
Best Subset Selection
The SubsetSelector follows a very simple algorithm: compare all possible models with $k$ predictors, and select the model that minimizes our selection criterion. This approach is only appropriate for roughly $p \leq 12$ features, as it quickly becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!\,(p-k)!}$ possible models with $k$ predictors, and $2^p$ models overall, where $p$ is the total number of candidate features and $k$ is the number of features included in the model.
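As a concrete illustration of that enumeration, here is a minimal sketch of exhaustive best subset selection by AIC using itertools and statsmodels; again, this shows the algorithm, not the package's internal implementation:

```python
# Illustrative sketch only -- not the package's internal implementation.
from itertools import combinations
from statsmodels.api import OLS
from statsmodels.tools import add_constant

def best_subset(X, y):
    best_aic, best_cols = float('inf'), None
    # Enumerate all 2^p - 1 non-empty subsets of the p candidate features
    for k in range(1, len(X.columns) + 1):
        for cols in combinations(X.columns, k):
            aic = OLS(y, add_constant(X[list(cols)])).fit().aic
            if aic < best_aic:
                best_aic, best_cols = aic, cols
    return list(best_cols)
```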
The SubsetSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.
|Parameter|Type|Description|
|---------|----|-----------|
|normalize|bool|Whether to normalize features; default False|
|metric|str|Optimization metric to use; must be one of aic or bic; default aic|
The SubsetSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.
```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```
SubsetSelector(normalize=True)
```python
X.loc[:, selector.get_support()]
```
| | rep78 | weight | length | gear_ratio | foreign |
|---|---|---|---|---|---|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |
74 rows × 5 columns
Comparing with the full model
Using the features selected by SubsetSelector yields a model with 4 fewer parameters and slightly improved AIC and BIC. The summaries indicate possible multicollinearity in both models, likely caused by weight, length, displacement, and other features that all relate to the size and weight of a vehicle.
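One way to check that suspicion is with variance inflation factors; here is a brief sketch using statsmodels' variance_inflation_factor, which is not part of the original walkthrough:

```python
# Hypothetical multicollinearity check -- not part of the original walkthrough.
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

exog = add_constant(X)
# As a common rule of thumb, VIFs well above ~10 suggest strong multicollinearity
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns)}
```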
Note: Selection using BIC as the optimization metric yields a model where weight is the only selected feature. The Bayesian information criterion penalizes additional parameters more heavily than AIC.
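To reproduce that behavior, pass metric='bic' when constructing the selector (output omitted here):

```python
bic_selector = SubsetSelector(normalize=True, metric='bic')
bic_selector.fit(X, y)
X.loc[:, bic_selector.best_support_]  # per the note above, only weight survives
```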
```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```
| Dep. Variable: | mpg | R-squared: | 0.720 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.681 |
| Method: | Least Squares | F-statistic: | 18.33 |
| Date: | Sat, 07 Aug 2021 | Prob (F-statistic): | 1.29e-14 |
| Time: | 15:37:36 | Log-Likelihood: | -187.23 |
| No. Observations: | 74 | AIC: | 394.5 |
| Df Residuals: | 64 | BIC: | 417.5 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 39.0871 | 9.100 | 4.295 | 0.000 | 20.907 | 57.267 |
| rep78 | 1.0021 | 0.357 | 2.809 | 0.007 | 0.290 | 1.715 |
| headroom | -0.0167 | 0.611 | -0.027 | 0.978 | -1.237 | 1.204 |
| trunk | -0.0772 | 0.154 | -0.503 | 0.617 | -0.384 | 0.230 |
| weight | -0.0037 | 0.002 | -1.928 | 0.058 | -0.008 | 0.000 |
| length | -0.0752 | 0.061 | -1.229 | 0.223 | -0.197 | 0.047 |
| turn | -0.1762 | 0.187 | -0.941 | 0.350 | -0.550 | 0.198 |
| displacement | 0.0131 | 0.011 | 1.180 | 0.243 | -0.009 | 0.035 |
| gear_ratio | 3.7067 | 1.751 | 2.116 | 0.038 | 0.208 | 7.206 |
| foreign | -4.4633 | 1.385 | -3.222 | 0.002 | -7.230 | -1.696 |
| Omnibus: | 28.364 | Durbin-Watson: | 2.523 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 52.945 |
| Skew: | 1.389 | Prob(JB): | 3.18e-12 |
| Kurtosis: | 6.074 | Cond. No. | 7.55e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are strong multicollinearity or other numerical problems.

```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```
| Dep. Variable: | mpg | R-squared: | 0.710 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.688 |
| Method: | Least Squares | F-statistic: | 33.25 |
| Date: | Sat, 07 Aug 2021 | Prob (F-statistic): | 5.22e-17 |
| Time: | 15:37:40 | Log-Likelihood: | -188.63 |
| No. Observations: | 74 | AIC: | 389.3 |
| Df Residuals: | 68 | BIC: | 403.1 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 40.3703 | 7.860 | 5.136 | 0.000 | 24.687 | 56.054 |
| rep78 | 0.9040 | 0.342 | 2.647 | 0.010 | 0.223 | 1.586 |
| weight | -0.0030 | 0.002 | -1.770 | 0.081 | -0.006 | 0.000 |
| length | -0.1058 | 0.053 | -1.990 | 0.051 | -0.212 | 0.000 |
| gear_ratio | 2.6905 | 1.511 | 1.780 | 0.079 | -0.325 | 5.706 |
| foreign | -4.0123 | 1.320 | -3.040 | 0.003 | -6.646 | -1.379 |
| Omnibus: | 24.257 | Durbin-Watson: | 2.442 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 39.774 |
| Skew: | 1.252 | Prob(JB): | 2.31e-09 |
| Kurtosis: | 5.576 | Cond. No. | 6.59e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Use in Scikit-Learn Pipeline
Both ForwardSelector and SubsetSelector objects are compatible with Scikit-Learn Pipeline objects, and can be used as feature selection steps:
```python
pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
```
Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
('regression', LinearRegression())])
```python
pl.score(X, y)
```
0.7097132531085899
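The fitted selection step remains accessible inside the pipeline, so the chosen features can be recovered after fitting; a short sketch using scikit-learn's named_steps:

```python
# Retrieve the fitted selector from the pipeline and apply its boolean mask
mask = pl.named_steps['feature_selection'].best_support_
X.loc[:, mask]
```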
Owner
- Name: Chris Santiago
- Login: chris-santiago
- Kind: user
- Repositories: 64
- Profile: https://github.com/chris-santiago
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| chris-santiago | c****o@g****u | 37 |
| Chris Santiago | c****o@2****n | 4 |
| Chris Santiago | c****o@2****l | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 3 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.33
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- chris-santiago (11)
Packages
- Total packages: 1
- Total downloads: 8 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 3
- Total maintainers: 1
pypi.org: step-select
A SciKit-Learn style feature selector using best subsets and stepwise regression.
- Homepage: https://github.com/chris-santiago/steps
- Documentation: https://chris-santiago.github.io/steps/
- License: MIT License
- Latest release: 0.1.2 (published almost 2 years ago)
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite