https://github.com/chris-santiago/steps

A SciKit-Learn style feature selector using best subsets and stepwise regression.

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Keywords

best-subset-selection data-science python scikit-learn stepwise-selection

Keywords from Contributors

transformers
Last synced: 5 months ago

Repository

A SciKit-Learn style feature selector using best subsets and stepwise regression.

Basic Info
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
best-subset-selection data-science python scikit-learn stepwise-selection
Created over 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

step-select

A SciKit-Learn style feature selector using best subsets and stepwise regression.

Install

Create a virtual environment with Python 3.8 and install from PyPI:

```bash
pip install step-select
```

Use

Preliminaries

Note: this example requires two additional packages: pandas and statsmodels.

In this example we'll show how the ForwardSelector and SubsetSelector classes can be used on their own or in conjunction with a Scikit-Learn Pipeline object.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import statsmodels.datasets
from statsmodels.api import OLS
from statsmodels.tools import add_constant

from steps.forward import ForwardSelector
from steps.subset import SubsetSelector
```

We'll download the auto dataset via Statsmodels; we'll use mpg as the endogenous variable and the remaining variables as exogenous. We won't use make, as that would create several dummies and increase the number of parameters to 12+, which is too many for the SubsetSelector class; we'll also drop price.

```python
data = statsmodels.datasets.webuse('auto')
data['foreign'] = pd.Series([x == 'Foreign' for x in data['foreign']]).astype(int)
data.fillna(0, inplace=True)
data.head()
```

| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|---|------|-------|-----|-------|----------|-------|--------|--------|------|--------------|------------|---------|
| 0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | 0 |
| 1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | 0 |
| 2 | AMC Spirit | 3799 | 22 | 0.0 | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | 0 |
| 3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | 0 |
| 4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | 0 |

```python
X = data.iloc[:, 3:]
y = data['mpg']
```

Forward Stepwise Selection

The ForwardSelector follows the standard stepwise regression algorithm: begin with a null model, iteratively test each variable and select the one that gives the most statistically significant improvement of the fit, and repeat. This greedy algorithm continues until the fit no longer improves.
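The greedy loop can be sketched in plain Python. Here `criterion` is a hypothetical callable that scores a candidate feature set (lower is better), standing in for the OLS-based AIC/BIC that the package actually computes; this is an illustration of the algorithm, not the package's implementation:

```python
def forward_select(features, criterion):
    """Greedy forward selection: start from the empty (null) model and, at
    each step, add the single feature that most improves `criterion`
    (lower is better). Stop when no addition improves the score."""
    selected = []
    best_score = criterion(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate model formed by adding one more feature.
        scores = {f: criterion(selected + [f]) for f in remaining}
        best_f = min(scores, key=scores.get)
        if scores[best_f] >= best_score:
            break  # no candidate improves the fit; stop (greedy)
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = scores[best_f]
    return selected

# Toy criterion: distance from a "true" feature set {'a', 'b'}
crit = lambda subset: len({'a', 'b'}.symmetric_difference(subset))
print(forward_select(['a', 'b', 'c'], crit))  # ['a', 'b']
```

Because the search is greedy, each step commits to the locally best feature, so the loop evaluates at most $p + (p-1) + \dots + 1$ models rather than every subset.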

The ForwardSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

| Parameter | Type | Description |
|-----------|------|-------------|
| normalize | bool | Whether to normalize features; default `False` |
| metric | str | Optimization metric to use; must be one of `aic` or `bic`; default `aic` |

The ForwardSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

```python
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

ForwardSelector(normalize=True)

```python
X.loc[:, selector.best_support_]
```

| | rep78 | weight | length | gear_ratio | foreign |
|---|-------|--------|--------|------------|---------|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |

74 rows × 5 columns

Best Subset Selection

The SubsetSelector follows a very simple algorithm: compare all possible models with $k$ predictors, and select the model that minimizes our selection criteria. This algorithm is only appropriate for $p \le 12$ features, as it quickly becomes computationally expensive: there are $\binom{p}{k} = \frac{p!}{k!(p-k)!}$ possible models with exactly $k$ of the $p$ total predictors, or $2^p$ candidate models overall.
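The exhaustive search and its combinatorial cost can be sketched as follows; as before, `criterion` is a hypothetical scoring callable standing in for an OLS-based information criterion, not the package's own code:

```python
import math
from itertools import combinations

def best_subset(features, criterion):
    """Exhaustive best-subset search: score every non-empty subset of the
    features and keep the one that minimizes `criterion` (lower is better)."""
    best, best_score = None, float('inf')
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = criterion(subset)
            if score < best_score:
                best, best_score = subset, score
    return list(best)

# The search space grows combinatorially: summing C(p, k) over all k
# gives 2^p - 1 non-empty subsets to evaluate.
p = 12
total = sum(math.comb(p, k) for k in range(1, p + 1))
print(total)  # 4095
```

At $p = 12$ that is already 4,095 model fits, which is why the class caps the feature count.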

The SubsetSelector is instantiated with two parameters: normalize and metric. Normalize defaults to False, assuming that this class is part of a larger pipeline; metric defaults to AIC.

| Parameter | Type | Description |
|-----------|------|-------------|
| normalize | bool | Whether to normalize features; default `False` |
| metric | str | Optimization metric to use; must be one of `aic` or `bic`; default `aic` |

The SubsetSelector class follows the Scikit-Learn API. After fitting the selector using the .fit() method, the selected features can be accessed using the boolean mask under the .best_support_ attribute.

```python
selector = SubsetSelector(normalize=True, metric='aic')
selector.fit(X, y)
```

SubsetSelector(normalize=True)

```python
X.loc[:, selector.get_support()]
```

| | rep78 | weight | length | gear_ratio | foreign |
|---|-------|--------|--------|------------|---------|
| 0 | 3.0 | 2930 | 186 | 3.58 | 0 |
| 1 | 3.0 | 3350 | 173 | 2.53 | 0 |
| 2 | 0.0 | 2640 | 168 | 3.08 | 0 |
| 3 | 3.0 | 3250 | 196 | 2.93 | 0 |
| 4 | 4.0 | 4080 | 222 | 2.41 | 0 |
| ... | ... | ... | ... | ... | ... |
| 69 | 4.0 | 2160 | 172 | 3.74 | 1 |
| 70 | 5.0 | 2040 | 155 | 3.78 | 1 |
| 71 | 4.0 | 1930 | 155 | 3.78 | 1 |
| 72 | 4.0 | 1990 | 156 | 3.78 | 1 |
| 73 | 5.0 | 3170 | 193 | 2.98 | 1 |

74 rows × 5 columns

Comparing the full model

Using the SubsetSelector selected features yields a model with 4 fewer parameters and slightly improved AIC and BIC metrics. The summaries indicate possible multicollinearity in both models, likely caused by weight, length, displacement and other features that are all related to the weight of a vehicle.

Note: Selection using BIC as the optimization metric yields a model where weight is the only selected feature. The Bayesian information criterion penalizes additional parameters more heavily than AIC.

```python
mod = OLS(endog=y, exog=add_constant(X)).fit()
mod.summary()
```

```
OLS Regression Results
Dep. Variable: mpg            R-squared: 0.720
Model: OLS                    Adj. R-squared: 0.681
Method: Least Squares         F-statistic: 18.33
Date: Sat, 07 Aug 2021        Prob (F-statistic): 1.29e-14
Time: 15:37:36                Log-Likelihood: -187.23
No. Observations: 74          AIC: 394.5
Df Residuals: 64              BIC: 417.5
Df Model: 9
Covariance Type: nonrobust

                 coef    std err        t    P>|t|   [0.025   0.975]
const         39.0871      9.100    4.295    0.000   20.907   57.267
rep78          1.0021      0.357    2.809    0.007    0.290    1.715
headroom      -0.0167      0.611   -0.027    0.978   -1.237    1.204
trunk         -0.0772      0.154   -0.503    0.617   -0.384    0.230
weight        -0.0037      0.002   -1.928    0.058   -0.008    0.000
length        -0.0752      0.061   -1.229    0.223   -0.197    0.047
turn          -0.1762      0.187   -0.941    0.350   -0.550    0.198
displacement   0.0131      0.011    1.180    0.243   -0.009    0.035
gear_ratio     3.7067      1.751    2.116    0.038    0.208    7.206
foreign       -4.4633      1.385   -3.222    0.002   -7.230   -1.696

Omnibus: 28.364               Durbin-Watson: 2.523
Prob(Omnibus): 0.000          Jarque-Bera (JB): 52.945
Skew: 1.389                   Prob(JB): 3.18e-12
Kurtosis: 6.074               Cond. No. 7.55e+04
```


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.55e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

```python
mod = OLS(endog=y, exog=add_constant(X.loc[:, selector.best_support_])).fit()
mod.summary()
```

```
OLS Regression Results
Dep. Variable: mpg            R-squared: 0.710
Model: OLS                    Adj. R-squared: 0.688
Method: Least Squares         F-statistic: 33.25
Date: Sat, 07 Aug 2021        Prob (F-statistic): 5.22e-17
Time: 15:37:40                Log-Likelihood: -188.63
No. Observations: 74          AIC: 389.3
Df Residuals: 68              BIC: 403.1
Df Model: 5
Covariance Type: nonrobust

                 coef    std err        t    P>|t|   [0.025   0.975]
const         40.3703      7.860    5.136    0.000   24.687   56.054
rep78          0.9040      0.342    2.647    0.010    0.223    1.586
weight        -0.0030      0.002   -1.770    0.081   -0.006    0.000
length        -0.1058      0.053   -1.990    0.051   -0.212    0.000
gear_ratio     2.6905      1.511    1.780    0.079   -0.325    5.706
foreign       -4.0123      1.320   -3.040    0.003   -6.646   -1.379

Omnibus: 24.257               Durbin-Watson: 2.442
Prob(Omnibus): 0.000          Jarque-Bera (JB): 39.774
Skew: 1.252                   Prob(JB): 2.31e-09
Kurtosis: 5.576               Cond. No. 6.59e+04
```


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
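As a check, the AIC and BIC figures reported in both summaries follow directly from the standard definitions applied to the reported log-likelihoods and parameter counts (a plain-Python sketch, not code from this package):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * loglik

# Values from the OLS summaries above: n = 74 observations,
# full model (10 parameters incl. constant) vs reduced model (6).
n = 74
print(round(aic(-187.23, 10), 1), round(bic(-187.23, 10, n), 1))  # 394.5 417.5
print(round(aic(-188.63, 6), 1), round(bic(-188.63, 6, n), 1))    # 389.3 403.1
```

Since BIC charges ln(74) ≈ 4.3 per parameter versus AIC's flat 2, it is the stiffer penalty here, consistent with BIC selecting the smaller, weight-only model noted earlier.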

Use in Scikit-Learn Pipeline

Both ForwardSelector and SubsetSelector objects are compatible with Scikit-Learn Pipeline objects, and can be used as feature selection steps:

```python
pl = Pipeline([
    ('feature_selection', SubsetSelector(normalize=True)),
    ('regression', LinearRegression())
])
pl.fit(X, y)
```

Pipeline(steps=[('feature_selection', SubsetSelector(normalize=True)),
                ('regression', LinearRegression())])

```python
pl.score(X, y)
```

0.7097132531085899

Owner

  • Name: Chris Santiago
  • Login: chris-santiago
  • Kind: user

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 42
  • Total Committers: 3
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.119
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
chris-santiago c****o@g****u 37
Chris Santiago c****o@2****n 4
Chris Santiago c****o@2****l 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.33
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • chris-santiago (11)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 8 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: step-select

A SciKit-Learn style feature selector using best subsets and stepwise regression.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 8 Last month
Rankings
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Stargazers count: 25.1%
Average: 27.6%
Forks count: 29.8%
Downloads: 51.4%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
setup.py pypi