https://github.com/yuenshingyan/ForwardStepwiseFeatureSelection

ForwardStepwiseFeatureSelection

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

data-science feature-selection machine-learning python
Last synced: 6 months ago

Repository

ForwardStepwiseFeatureSelection

Basic Info
  • Host: GitHub
  • Owner: yuenshingyan
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 356 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Topics
data-science feature-selection machine-learning python
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

ForwardStepwiseFeatureSelection

Open Source Love PyPI version MIT Licence

ForwardStepwiseFeatureSelection selects the best subset of features for machine learning tasks according to a chosen scoring metric, building on packages such as numpy, pandas and scikit-learn.

Quick Start

# Install ForwardStepwiseFeatureSelection
!pip install ForwardStepwiseFeatureSelection

Quick Example

# Import dependencies
from ForwardStepwiseFeatureSelection import ForwardStepwiseFeatureSelection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Read dataframe
insurance = pd.read_csv('insurance.csv')

# Label Encoding
for col in ['sex', 'smoker', 'region']:
    insurance[col].replace(insurance[col].unique(), range(insurance[col].nunique()), inplace=True)

X = insurance.drop('charges', axis=1)
y = insurance['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Scale our data (fit the scaler on the training set only to avoid leaking
# information from the test set)
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# Instantiate the estimator
rfc = RandomForestRegressor()

# Instantiate ForwardStepwiseFeatureSelection
fsfs = ForwardStepwiseFeatureSelection(estimators=rfc, cv=3, scoring='neg_mean_absolute_error', mode=None, verbose=1, tolerance=3)

# Start feature selection
fsfs.fit(X_train, y_train)

print(fsfs.best_subsets)

>> {'RandomForestRegressor': ['smoker', 'age', 'bmi', 'children', 'region']}
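The greedy forward-stepwise idea behind the package can be sketched in plain Python. This is an illustration of the algorithm, not the package's internals; the `evaluate` scorer below is a toy stand-in (an assumption for demonstration) where the package would use cross-validated model scores:

```python
def forward_stepwise(features, evaluate, tolerance=3):
    """Greedily add the feature that most improves the score.

    Stops after `tolerance` consecutive trials without improvement,
    mirroring the package's `tolerance` argument.
    """
    selected, best_score, fails = [], float("-inf"), 0
    remaining = list(features)
    while remaining and fails < tolerance:
        # Score every candidate subset formed by adding one more feature
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feat = max(scored)
        if score > best_score:
            best_score, fails = score, 0
            selected.append(feat)
            remaining.remove(feat)
        else:
            fails += 1
    return selected, best_score

# Toy scorer: pretend 'smoker' and 'age' are the only useful features
useful = {"smoker": 2.0, "age": 1.0}
evaluate = lambda subset: sum(useful.get(f, -0.1) for f in subset)

subset, score = forward_stepwise(["age", "bmi", "smoker", "children"], evaluate)
print(subset)  # ['smoker', 'age']
```

The search stops once adding any further feature fails to beat the best score `tolerance` times in a row, so weak features like 'bmi' and 'children' are never admitted here.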

This package is inspired by the talk "A Practical Guide to Dimensionality Reduction" by Vishal Patel, PyData DC 2016 (October 8, 2016).

  • Examples: https://github.com/HindyDS/ForwardStepwiseFeatureSelection/tree/main/examples
  • Email: hindy888@hotmail.com
  • Source code: https://github.com/HindyDS/ForwardStepwiseFeatureSelection/tree/main/ForwardStepwiseFeatureSelection
  • Bug reports: https://github.com/HindyDS/ForwardStepwiseFeatureSelection/issues

It requires the following arguments to run:

  • estimators: machine learning model
  • X (array): features space
  • y (array): target
  • cv (int): number of folds in a (Stratified)KFold
  • scoring (str): see https://scikit-learn.org/stable/modules/model_evaluation.html

Optional arguments:

  • mode (str): None or 'ts'. If 'ts' (time series), cross-validation changes to walk-forward cross-validation.
  • maxtrial (int): number of trials after which FSFS stops searching
  • tolerance (int): how many times FSFS may fail to find a better subset of features before stopping
  • leastgain (int): threshold of scoring-metric gain, as a fraction
  • maxfeats (int): maximum number of features
  • prior (list): starting point for FSFS to search; must correspond to the columns of X
  • exclusions (nested list): if a newly selected feature is in one of the subpools (a list within the nested list), the remaining features in that subpool will no longer be available to form new subsets in the following trials
  • njobs (int): number of jobs to run in parallel
  • n_digit (int): decimal places for scoring
  • verbose (int): level of verbosity of FSFS
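The README does not spell out how mode='ts' splits the data, but "walk forward cross validation" conventionally means expanding-window splits where the model only ever trains on past observations (similar to scikit-learn's TimeSeriesSplit). A minimal sketch of that splitting scheme, under that assumption:

```python
def walk_forward_splits(n_samples, cv):
    """Yield (train_indices, test_indices) pairs with an expanding train
    window: each fold trains on all data before the test segment."""
    fold = n_samples // (cv + 1)
    for i in range(1, cv + 1):
        train = list(range(0, i * fold))
        test = list(range(i * fold, min((i + 1) * fold, n_samples)))
        yield train, test

for train, test in walk_forward_splits(12, cv=3):
    print(len(train), len(test))
# 3 3
# 6 3
# 9 3
```

Unlike ordinary k-fold, no test index ever precedes a training index, which is what makes the scheme valid for time-series data.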

If you have any ideas for this package, please don't hesitate to bring them forward!

Owner

  • Name: Hindy
  • Login: yuenshingyan
  • Kind: user
  • Location: Hong Kong
  • Company: CT Risk Analytics

Quant | Machine Learning Engineer | Data Scientist | Master's Degree in Physics


Dependencies

setup.py pypi