pyimpetus

PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

https://github.com/atif-hassan/pyimpetus

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, sciencedirect.com
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

feature-selection machine-learning-algorithms markov-blanket minimal-features probability statistics t-test
Last synced: 6 months ago

Repository

PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

Basic Info
  • Host: GitHub
  • Owner: atif-hassan
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 274 KB
Statistics
  • Stars: 134
  • Watchers: 4
  • Forks: 16
  • Open Issues: 3
  • Releases: 0
Topics
feature-selection machine-learning-algorithms markov-blanket minimal-features probability statistics t-test
Created over 5 years ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md


PyImpetus (a.k.a. PPFS)

PyImpetus is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group. This allows the algorithm to select not only the best set of features, but the best set of features that work well together. For example, the best-performing feature might not combine well with others, while the remaining features, taken together, could outperform it. PyImpetus takes this into account and produces the best possible combination. The algorithm thus provides a minimal feature subset, so you do not have to decide how many features to keep; PyImpetus selects the optimal set for you.

PyImpetus has been completely revamped and now supports binary classification, multi-class classification and regression tasks. It has been tested on 14 datasets, where it outperformed state-of-the-art Markov Blanket learning algorithms on all of them, as well as traditional feature selection algorithms such as Forward Feature Selection, Backward Feature Elimination and Recursive Feature Elimination.

How to install?

pip install PyImpetus

Functions and parameters

```python
# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBC is for classification
model = PPIMBC(model, p_val_thresh, num_simul, simul_size, simul_type, sig_test_type, cv, verbose, random_state, n_jobs)
```

- **model** - `estimator object, default=DecisionTreeClassifier()` The model which is used to perform classification in order to find feature importance via significance-test.
- **p_val_thresh** - `float, default=0.05` The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
- **num_simul** - `int, default=30` **(This parameter has a huge impact on speed)** Number of train-test splits performed to check the usefulness of each feature. For large datasets, the value should be considerably reduced, though do not go below 5.
- **simul_size** - `float, default=0.2` The size of the test set in each train-test split.
- **simul_type** - `boolean, default=0` Whether or not to apply stratification:
  - `0` means the train-test splits are not stratified.
  - `1` means the train-test splits are stratified.
- **sig_test_type** - `string, default="non-parametric"` Determines the type of significance test to use:
  - `"parametric"` means a parametric significance test will be used (Note: this test selects very few features).
  - `"non-parametric"` means a non-parametric significance test will be used.
- **cv** - `cv object/int, default=0` Determines the number of splits for cross-validation. An sklearn CV object can also be passed. A value of 0 means CV is disabled.
- **verbose** - `int, default=2` Controls the verbosity: the higher, the more messages.
- **random_state** - `int or RandomState instance, default=None` Pass an int for reproducible output across multiple function calls.
- **n_jobs** - `int, default=-1` The number of CPUs to use for the computation:
  - `None` means 1 unless in a `joblib.parallel_backend` context.
  - `-1` means using all processors.
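To make the signature concrete, here is a minimal sketch that spells the documented defaults out explicitly (assuming, per the list above, that every parameter carries the default shown):

```python
# Minimal sketch: PPIMBC constructed with its documented defaults spelled out.
from sklearn.tree import DecisionTreeClassifier
from PyImpetus import PPIMBC

selector = PPIMBC(
    model=DecisionTreeClassifier(),  # estimator used for the significance test
    p_val_thresh=0.05,               # candidacy threshold for the final MB
    num_simul=30,                    # number of train-test splits (speed knob)
    simul_size=0.2,                  # test-set fraction in each split
    simul_type=0,                    # 0 = unstratified splits
    sig_test_type="non-parametric",  # significance test variant
    cv=0,                            # 0 disables cross-validation
    verbose=2,
    random_state=None,
    n_jobs=-1,                       # use all processors
)
```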

```python
# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBR is for regression
model = PPIMBR(model, p_val_thresh, num_simul, simul_size, sig_test_type, cv, verbose, random_state, n_jobs)
```

- **model** - `estimator object, default=DecisionTreeRegressor()` The model which is used to perform regression in order to find feature importance via significance-test.
- **p_val_thresh** - `float, default=0.05` The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
- **num_simul** - `int, default=30` **(This parameter has a huge impact on speed)** Number of train-test splits performed to check the usefulness of each feature. For large datasets, the value should be considerably reduced, though do not go below 5.
- **simul_size** - `float, default=0.2` The size of the test set in each train-test split.
- **sig_test_type** - `string, default="non-parametric"` Determines the type of significance test to use:
  - `"parametric"` means a parametric significance test will be used (Note: this test selects very few features).
  - `"non-parametric"` means a non-parametric significance test will be used.
- **cv** - `cv object/int, default=0` Determines the number of splits for cross-validation. An sklearn CV object can also be passed. A value of 0 means CV is disabled.
- **verbose** - `int, default=2` Controls the verbosity: the higher, the more messages.
- **random_state** - `int or RandomState instance, default=None` Pass an int for reproducible output across multiple function calls.
- **n_jobs** - `int, default=-1` The number of CPUs to use for the computation:
  - `None` means 1 unless in a `joblib.parallel_backend` context.
  - `-1` means using all processors.

```python
# To fit PyImpetus on the provided dataset and find the recommended features
fit(data, target)
```

- **data** - A pandas dataframe or a numpy matrix upon which feature selection is to be applied (passing a pandas dataframe allows using correct column names; a numpy matrix will apply default column names).
- **target** - A numpy array, denoting the target variable.

```python
# This function returns the names of the columns that form the MB (these are the recommended features)
transform(data)
```

- **data** - A pandas dataframe or a numpy matrix which needs to be pruned (passing a pandas dataframe allows using correct column names; a numpy matrix will apply default column names).

```python
# To fit PyImpetus on the provided dataset and return the pruned data
fit_transform(data, target)
```

- **data** - A pandas dataframe or a numpy matrix upon which feature selection is to be applied (passing a pandas dataframe allows using correct column names; a numpy matrix will apply default column names).
- **target** - A numpy array, denoting the target variable.

```python
# To plot XGBoost-style feature importance
feature_importance()
```

How to import?

```python
from PyImpetus import PPIMBC, PPIMBR
```

Usage

If data is a pandas dataframe

```python
# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
from sklearn.svm import SVC

# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, simul_size=0.2, simul_type=0, sig_test_type="non-parametric", cv=5, random_state=27, n_jobs=-1, verbose=2)

# The fit_transform function is a wrapper for the fit and transform functions, individually.
# The fit function finds the MB for given data while the transform function provides the pruned form of the dataset
df_train = model.fit_transform(df_train.drop("Response", axis=1), df_train["Response"].values)
df_test = model.transform(df_test)

# Check out the MB
print(model.MB)

# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)

# Get a plot of the feature importance scores
model.feature_importance()
```

If data is a numpy matrix

```python
# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
from sklearn.svm import SVC

# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, simul_size=0.2, simul_type=0, sig_test_type="non-parametric", cv=5, random_state=27, n_jobs=-1, verbose=2)

# The fit_transform function is a wrapper for the fit and transform functions, individually.
# The fit function finds the MB for given data while the transform function provides the pruned form of the dataset
df_train = model.fit_transform(x_train, y_train)
df_test = model.transform(x_test)

# Check out the MB
print(model.MB)

# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)

# Get a plot of the feature importance scores
model.feature_importance()
```
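Since PPIMBC follows the familiar scikit-learn fit/transform pattern, the pruned output drops straight into any downstream estimator. The sketch below is illustrative rather than part of the library: the synthetic data, the RandomForestClassifier, and the variable names are all stand-ins for your own pipeline.

```python
# Sketch: measuring a downstream model on PyImpetus-pruned features.
# The synthetic data and RandomForestClassifier are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from PyImpetus import PPIMBC

x, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=27)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)

selector = PPIMBC(model=SVC(random_state=27, class_weight="balanced"),
                  num_simul=30, random_state=27, n_jobs=-1, verbose=0)
x_train_sel = selector.fit_transform(x_train, y_train)  # find the MB, prune the train set
x_test_sel = selector.transform(x_test)                 # prune the test set to the same columns

clf = RandomForestClassifier(random_state=27).fit(x_train_sel, y_train)
print("Accuracy on selected features:", accuracy_score(y_test, clf.predict(x_test_sel)))
```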

For better accuracy

Note: Play with the values of num_simul, simul_size, simul_type and p_val_thresh, because sometimes a specific combination of these values ends up giving the best results (a sketch of such a sweep follows this list).

  • ~~Increase the cv value~~ In all experiments, cv did not help in getting better accuracy. Use this only when you have an extremely small dataset.
  • Increase the num_simul value.
  • Try one of these values for simul_size: {0.1, 0.2, 0.3, 0.4}.
  • Use non-linear models for feature selection. Apply hyper-parameter tuning on the models.
  • Increase the value of p_val_thresh in order to increase the number of features to include in the Markov Blanket.
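One way to act on these tips is a small manual sweep over the interacting knobs. The grid values below are illustrative, and the cross-validated scorer is just one reasonable choice rather than anything the library prescribes (x_train and y_train are assumed to be your training arrays):

```python
# Illustrative sweep over num_simul, simul_size and p_val_thresh.
# The grid values and the downstream scorer are assumptions, not library defaults.
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from PyImpetus import PPIMBC

best_score, best_mb = -1.0, None
for num_simul, simul_size, p_val in product([30, 50], [0.1, 0.2, 0.3, 0.4], [0.05, 0.1]):
    selector = PPIMBC(model=DecisionTreeClassifier(random_state=27),
                      p_val_thresh=p_val, num_simul=num_simul, simul_size=simul_size,
                      random_state=27, n_jobs=-1, verbose=0)
    x_sel = selector.fit_transform(x_train, y_train)  # x_train/y_train assumed to exist
    # Score the pruned features with a downstream model of your choice
    score = cross_val_score(DecisionTreeClassifier(random_state=27),
                            x_sel, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_mb = score, selector.MB
print(best_score, best_mb)
```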

For better speeds

  • ~~Decrease the cv value. For large datasets cv might not be required. Therefore, set cv=0 to disable the aggregation step. This will result in less robust feature subset selection but at much faster speeds~~
  • Decrease the num_simul value but don't decrease it below 5
  • Set n_jobs to -1
  • Use linear models (a speed-oriented configuration is sketched below)
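Taken together, these tips amount to a configuration like the following sketch (LogisticRegression is just one stand-in for "a linear model"; the exact values are illustrative):

```python
# Speed-oriented configuration: linear model, num_simul at its recommended
# floor of 5, CV disabled, and all cores in use. Values are illustrative.
from sklearn.linear_model import LogisticRegression
from PyImpetus import PPIMBC

fast_selector = PPIMBC(model=LogisticRegression(max_iter=1000),
                       num_simul=5,  # don't go below 5
                       cv=0,         # disable the aggregation step
                       n_jobs=-1,    # use all processors
                       verbose=0)
```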

For selecting fewer features

  • Try reducing the p_val_thresh value
  • Try out sig_test_type = "parametric" (both tweaks are sketched below)
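Both tweaks in code form (the threshold value is illustrative, and the remaining parameters are assumed to keep their documented defaults):

```python
# Stricter selection: parametric significance test plus a lower p-value threshold.
from PyImpetus import PPIMBC

strict_selector = PPIMBC(sig_test_type="parametric",  # selects very few features
                         p_val_thresh=0.01)           # tighter candidacy threshold
```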

Performance in terms of Accuracy (classification) and MSE (regression)

| Dataset | # of samples | # of features | Task Type | Score using all features | Score using featurewiz | Score using PyImpetus | # of features selected | % of features selected | Tutorial |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ionosphere | 351 | 34 | Classification | 88.01% | | 92.86% | 14 | 42.42% | tutorial here |
| Arcene | 100 | 10000 | Classification | 82% | | 84.72% | 304 | 3.04% | |
| AlonDS2000 | 62 | 2000 | Classification | 80.55% | 86.98% | 88.49% | 75 | 3.75% | |
| slice_localization_data | 53500 | 384 | Regression | 6.54 | | 5.69 | 259 | 67.45% | tutorial here |

Note: Here, for the first, second and third tasks, a higher accuracy score is better while for the fourth task, a lower MSE (Mean Squared Error) is better.

Performance in terms of Time (in seconds)

| Dataset | # of samples | # of features | Time (with PyImpetus) |
| --- | --- | --- | --- |
| Ionosphere | 351 | 34 | 35.37 |
| Arcene | 100 | 10000 | 1570 |
| AlonDS2000 | 62 | 2000 | 125.511 |
| slice_localization_data | 53500 | 384 | 1296.13 |

Future Ideas

  • Let me know

Feature Request

Drop me an email at atif.hit.hassan@gmail.com if you want any particular feature

Please cite this work as

Citation is available in the CITATION.bib file

Alternatively, use the following DBLP Bibtex link

Corresponding paper

A wrapper feature selection approach using Markov blankets

Old ArXiv version

PPFS: Predictive Permutation Feature Selection

Owner

  • Name: Atif Hassan
  • Login: atif-hassan
  • Kind: user
  • Location: Kolkata

PhD student at the Center of Excellence for AI, IIT Kharagpur.

Citation (CITATION.bib)

@article{HASSAN2025111069,
title = {A wrapper feature selection approach using Markov blankets},
journal = {Pattern Recognition},
volume = {158},
pages = {111069},
year = {2025},
issn = {0031-3203},
doi = {https://doi.org/10.1016/j.patcog.2024.111069},
url = {https://www.sciencedirect.com/science/article/pii/S0031320324008203},
author = {Atif Hassan and Jiaul Hoque Paik and Swanand Ravindra Khare and Syed Asif Hassan},
keywords = {Feature selection, Markov blanket, Conditional independence test, Classification, Regression},
abstract = {In feature selection, Markov Blanket (MB) based approaches have attracted considerable attention with most MB discovery algorithms being categorized as filter based techniques. Typically, the Conditional Independence (CI) test employed by such methods is different for different data types. In this article, we propose a novel Markov Blanket based wrapper feature selection method. The proposed approach employs Predictive Permutation Independence (PPI), a novel Conditional Independence (CI) test that allows it to work out-of-the-box for both classification and regression tasks on mixed data. PPI can work with any supervised algorithm to estimate the association of a feature with the target variable while also providing a measure of feature importance. The proposed approach also includes an optional MB aggregation step that can be used to find the optimal MB under non-faithful conditions. Our method (implementation and experimental results are available at https://github.com/AnonymousMLSubmissions/PatternRecognitionSubmission) outperforms other MB discovery methods, in terms of F1-score, by 7% on average, over 3 large-scale BN datasets. It also outperforms state-of-the-art feature selection techniques on 13 real-world datasets.}
}

GitHub Events

Total
  • Watch event: 6
  • Push event: 4
  • Fork event: 2
Last Year
  • Watch event: 6
  • Push event: 4
  • Fork event: 2

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 106
  • Total Committers: 4
  • Avg Commits per committer: 26.5
  • Development Distribution Score (DDS): 0.113
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Atif Hassan a****n@g****m 94
Antoni Baum a****m@p****m 10
Myrl m****l@m****s 1
Henri Yandell 4****l 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 5
  • Average time to close issues: 3 months
  • Average time to close pull requests: 7 days
  • Total issue authors: 6
  • Total pull request authors: 3
  • Average comments per issue: 3.38
  • Average comments per pull request: 0.2
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jmrichardson (2)
  • aswinjose89 (2)
  • ogencoglu (1)
  • brunofacca (1)
  • yutianfanxing (1)
  • ivan-marroquin (1)
Pull Request Authors
  • Yard1 (3)
  • hyandell (1)
  • marmarelis (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 104 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 27
  • Total maintainers: 1
pypi.org: pyimpetus

PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 104 Last month
Rankings
Stargazers count: 6.7%
Dependent packages count: 7.3%
Forks count: 9.6%
Downloads: 10.0%
Average: 11.2%
Dependent repos count: 22.1%
Maintainers (1)
Last synced: 6 months ago