stabilitypy

stability selection

https://github.com/rakibalfahad/stabilitypy

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (16.4%) to scientific vocabulary

Last synced: 6 months ago

Repository

stability selection

Basic Info

Host: GitHub
Owner: rakibalfahad
License: bsd-3-clause
Language: Python
Default Branch: main
Size: 29.3 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 8 months ago · Last pushed 8 months ago

Metadata Files

Readme License Citation

StabilityPy

A modern implementation of stability selection for feature selection, with optional GPU acceleration via PyTorch and parallel processing on CPU.

Introduction

Stability selection is a method for feature selection introduced by Meinshausen and Bühlmann (2010). The core idea is to apply a feature selection algorithm repeatedly to random subsamples of the data and select only those features that are consistently selected across many subsamples.

Theory

Stability selection works as follows:

1. Create multiple subsamples of your data by randomly selecting a subset of observations.
2. For each subsample, run a feature selection algorithm (often a penalized regression like LASSO) across a range of regularization parameters.
3. For each feature, calculate its selection probability (stability score) as the fraction of subsamples where it was selected.
4. Choose features with stability scores above a user-defined threshold.
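The four steps above can be sketched in a few lines. This is an illustrative outline only, not StabilityPy's implementation: it assumes scikit-learn's `Lasso` as the base selector, and the function name `stability_scores`, the toy data, and the 0.8 threshold are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_scores(X, y, alphas, n_subsamples=50, sample_fraction=0.5, seed=0):
    """Return selection frequencies with shape (n_features, len(alphas))."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros((n_features, len(alphas)))

    for _ in range(n_subsamples):
        # Step 1: draw a random subsample without replacement.
        idx = rng.choice(n_samples, size=subsample_size, replace=False)
        for j, alpha in enumerate(alphas):
            # Step 2: run the base selector (here, LASSO) at this penalty.
            coef = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx]).coef_
            counts[:, j] += coef != 0

    # Step 3: stability score = fraction of subsamples selecting the feature.
    return counts / n_subsamples


# Toy data (hypothetical): only the first 3 of 30 features carry signal.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + 0.5 * rng.standard_normal(200)

scores = stability_scores(X, y, alphas=np.logspace(-1, 0, 8))

# Step 4: keep features whose best score over the grid clears the threshold.
selected = np.flatnonzero(scores.max(axis=1) >= 0.8)
print(selected)
```

Taking the maximum score over the penalty grid is one common way to combine the per-penalty frequencies; other aggregation rules are possible.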

This approach has several advantages:

- Under the assumptions of Meinshausen and Bühlmann (2010), it gives a finite-sample bound on the expected number of falsely selected features
- It is more robust to small changes in the data
- It works with many different base feature selection methods
- It reduces sensitivity to the choice of regularization parameter
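To make the error-control advantage concrete, the sketch below evaluates the bound from Theorem 1 of Meinshausen and Bühlmann (2010): if the base selector picks on average `q` features per subsample out of `n_features`, and the threshold lies above 0.5, the expected number of false selections is at most `q**2 / ((2 * threshold - 1) * n_features)`. The bound relies on that paper's assumptions (an exchangeability condition on the noise features and a base selector no worse than random guessing). The function name is hypothetical and is not part of StabilityPy.

```python
def expected_false_positives_bound(q, n_features, threshold):
    """Upper bound on the expected number of falsely selected features."""
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0.5, 1]")
    return q**2 / ((2 * threshold - 1) * n_features)


# Example: about 20 features selected per subsample out of 1000,
# with a stability threshold of 0.9.
bound = expected_false_positives_bound(q=20, n_features=1000, threshold=0.9)
print(bound)
```

Read the other way round, the same formula can guide the choice of threshold or penalty range for a target number of tolerated false positives.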

In the randomized LASSO variant, the penalty term for each feature is randomly scaled, which adds another layer of robustness.
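A minimal sketch of this random penalty scaling, assuming scikit-learn's `Lasso` and weights drawn uniformly from `[weakness, 1]`. It uses the standard equivalence that penalizing `|beta_k| / W_k` is the same as scaling column `k` by `W_k` and fitting an ordinary LASSO. The function below is hypothetical and may differ from the package's `RandomizedLasso`, in particular in how the weights are sampled.

```python
import numpy as np
from sklearn.linear_model import Lasso


def randomized_lasso_coef(X, y, alpha, weakness=0.5, rng=None):
    """Fit one randomized LASSO and return coefficients on the original scale."""
    rng = np.random.default_rng(rng)
    # One weight per feature in [weakness, 1]; a smaller weight means a heavier penalty.
    weights = rng.uniform(weakness, 1.0, size=X.shape[1])
    # Scaling column k by W_k turns the weighted penalty into a plain L1 penalty.
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X * weights, y)
    # Map the coefficients back to the original feature scale.
    return model.coef_ * weights


# Toy data (hypothetical): two informative features out of ten.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

coef = randomized_lasso_coef(X, y, alpha=0.1, weakness=0.5, rng=1)
print(np.flatnonzero(coef))
```

With `weakness=1.0` every weight equals 1 and the procedure reduces to the ordinary LASSO.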

Installation

From PyPI (Recommended)

```bash
pip install stability-selection
```

From Source

To install the latest development version:

```bash
git clone https://github.com/rakibalfahad/stabilitypy.git
cd stabilitypy
pip install -e .
```

For development, install with development dependencies:

```bash
pip install -e ".[dev]"
```

For GPU acceleration, install with GPU dependencies:

```bash
pip install -e ".[gpu]"
```

Features

- GPU Acceleration: Uses PyTorch for GPU-accelerated computations when available
- Parallel Processing: Efficient multi-core CPU utilization for bootstrap iterations
- Multiple Bootstrapping Strategies:
  - Subsampling without replacement (default)
  - Complementary pairs subsampling
  - Stratified bootstrapping for imbalanced classification
- Scikit-learn Compatible: Works with scikit-learn pipelines and cross-validation
- CSV/CSV.GZ Processing: Direct support for tabular data formats
- Automated Feature Selection: Process tabular data and visualize results with a single command
- Model Fine-tuning: Automatically fine-tune models with selected features and compare to baselines
- Synthetic Data Generation: Generate controlled datasets for testing and benchmarking
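The resampling strategies in the list above differ only in how row indices are drawn. The sketch below shows one plausible reading of each using NumPy alone. The function names are hypothetical, and the stratified variant draws without replacement within each class, which may not match the package's exact behaviour.

```python
import numpy as np


def subsample(n_samples, rng, fraction=0.5):
    """Plain subsampling without replacement."""
    return rng.choice(n_samples, size=int(fraction * n_samples), replace=False)


def complementary_pairs(n_samples, rng):
    """Two disjoint halves of one random permutation (Shah and Samworth, 2013)."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:2 * half]


def stratified_subsample(y, rng, fraction=0.5):
    """Subsample each class separately so class proportions are preserved."""
    parts = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        size = max(1, int(round(fraction * len(members))))
        parts.append(rng.choice(members, size=size, replace=False))
    return np.concatenate(parts)


rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)  # imbalanced labels (hypothetical)

plain = subsample(len(y), rng)
first, second = complementary_pairs(len(y), rng)
strat = stratified_subsample(y, rng)
print(len(plain), len(first), len(second), int(y[strat].sum()))
```

With imbalanced labels, plain subsampling can leave very few minority-class rows in a subsample, which is the situation the stratified variant guards against.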

Documentation

Example Usage

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from stability_selection import StabilitySelection

# Create a pipeline with a base estimator
base_estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])

# Initialize stability selection with the base estimator
selector = StabilitySelection(
    base_estimator=base_estimator,
    lambda_name='model__C',
    lambda_grid=np.logspace(-5, -1, 50),
    n_jobs=-1,     # Use all available CPU cores
    use_gpu=True   # Use GPU if available
)

# Fit the selector to your data
# (X: feature matrix of shape (n_samples, n_features); y: target labels)
selector.fit(X, y)

# Get selected features
selected_features = selector.get_support(indices=True)
```

Data Processing and Analysis

StabilityPy now includes a comprehensive script for processing tabular data and running the full stability selection workflow:

```bash
# For classification problems
python stability_processor.py --input data.csv --output results_dir --problem_type classification

# For regression problems
python stability_processor.py --input data.csv.gz --output results_dir --problem_type regression --use_gpu
```

The script will:

1. Load and preprocess your data
2. Run stability selection to identify important features
3. Fine-tune a model using only the selected features
4. Compare performance with a baseline model using all features
5. Generate visualizations and save all results to the output directory
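The comparison between selected features and all features can be illustrated without the script. The sketch below assumes scikit-learn, a toy dataset from `make_classification`, and a placeholder `selected` array standing in for the indices that stability selection would return. It is not the script's actual code, and the model and metric it uses are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: with shuffle=False the 5 informative features are the first 5 columns.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)

# Placeholder for the indices returned by stability selection (hypothetical).
selected = np.arange(5)

model = LogisticRegression(max_iter=1000)

# Baseline: all features. Reduced: selected features only.
baseline_acc = cross_val_score(model, X, y, cv=5).mean()
reduced_acc = cross_val_score(model, X[:, selected], y, cv=5).mean()

print(f"all features:      {baseline_acc:.3f}")
print(f"selected features: {reduced_acc:.3f}")
```

Cross-validated accuracy is used here for brevity; a held-out test set or a different metric may suit imbalanced or regression problems better.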

Synthetic Data Generation

You can generate synthetic datasets with controlled properties using the included generator:

```bash
# Generate a classification dataset
python synthetic_data_generator.py --output data.csv --problem_type classification --n_samples 1000 --n_features 100 --n_informative 10

# Generate a compressed regression dataset
python synthetic_data_generator.py --output data.csv.gz --problem_type regression --n_samples 2000 --n_features 500 --n_informative 20 --compress
```
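For readers who want a comparable dataset from Python directly, the sketch below builds a regression table with scikit-learn's `make_regression` and writes it as a gzip-compressed CSV with NumPy. The column layout (`feature_0` ... `target`) and the temporary output path are assumptions for illustration; the generator script's actual format may differ.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_regression

# coef=True also returns the true coefficients, so the informative
# features are known and selection results can be checked against them.
X, y, true_coef = make_regression(
    n_samples=200, n_features=50, n_informative=5,
    noise=0.1, coef=True, random_state=0,
)
informative = np.flatnonzero(true_coef)

# Features followed by the target in the last column (assumed layout).
table = np.column_stack([X, y])
header = ",".join([f"feature_{i}" for i in range(X.shape[1])] + ["target"])

# np.savetxt compresses automatically when the filename ends in .gz.
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")
np.savetxt(path, table, delimiter=",", header=header, comments="")

# np.loadtxt reads .gz files transparently; skip the header row.
loaded = np.loadtxt(path, delimiter=",", skiprows=1)
print(loaded.shape, informative)
```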

Advanced Example with Visualizations

The package includes comprehensive example scripts that demonstrate various use cases:

```bash
# Basic stability selection example
python examples/stability_selection_example.py

# Example with GPU acceleration
python examples/gpu_acceleration_example.py

# Example with synthetic data visualization
python examples/synthetic_data_visualization.py
```

Complete Workflow Example

Here's a complete workflow from data generation to model fine-tuning:

```bash
# 1. Generate synthetic data
python synthetic_data_generator.py \
    --output data/synthetic_classification.csv \
    --problem_type classification \
    --n_samples 1000 \
    --n_features 100 \
    --n_informative 10 \
    --noise 0.1

# 2. Run stability selection with fine-tuning
python stability_processor.py \
    --input data/synthetic_classification.csv \
    --output results/synthetic_classification \
    --problem_type classification \
    --n_bootstrap 100 \
    --use_gpu
```

Output Files and Visualizations

The stability_processor.py script produces a comprehensive set of outputs:

selected_features.csv: CSV file with selected features and their stability scores
feature_importance.csv: Feature importances from the fine-tuned model
performance_metrics.csv: Performance comparison between selected features and all features
stability_selection_results.pkl: Pickled results object with all stability scores
fine_tuned_model.pkl: Trained model using only selected features
baseline_model.pkl: Trained model using all features

Visualizations:

- stability_paths.png: Plot of stability scores across regularization parameters
- stability_heatmap.png: Heatmap of stability scores for top features
- performance_comparison.png: Bar chart comparing model performance
- feature_importance.png: Bar chart of feature importances from fine-tuned model

GPU Acceleration

StabilityPy provides GPU acceleration via PyTorch for faster computation, especially useful for large datasets:

```python
import numpy as np
from stability_selection import StabilitySelection, RandomizedLasso

# For regression tasks, use RandomizedLasso with GPU acceleration
estimator = RandomizedLasso(weakness=0.5, use_gpu=True)
selector = StabilitySelection(
    base_estimator=estimator,
    lambda_name='alpha',
    lambda_grid=np.linspace(0.001, 0.5, num=100),
    threshold=0.9,
    use_gpu=True,  # Enable GPU acceleration
    n_jobs=-1      # Use all CPU cores for operations that can't be GPU accelerated
)
selector.fit(X, y)  # X, y: your feature matrix and regression target
```
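Since PyTorch is an optional dependency, GPU use has to degrade gracefully when it is missing or when no CUDA device is present. The helper below is a hypothetical sketch of such a fallback, not the package's own logic. It uses only `torch.cuda.is_available()` and runs whether or not PyTorch is installed.

```python
def resolve_device(use_gpu=True):
    """Return "cuda" when requested and usable, otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # PyTorch is optional; without it, fall back to the CPU path.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = resolve_device(use_gpu=True)
print(device)
```

Checking the device once up front makes it easy to log which code path a run actually took.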

Development

The project follows standard Python development practices with tools for code quality and testing.

Development Setup

```bash
# Clone the repository
git clone https://github.com/rakibalfahad/stabilitypy.git
cd stabilitypy

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

Code Quality Tools

We use several tools to maintain code quality:

```bash
# Format code
black stability_selection examples --line-length=100
isort stability_selection examples

# Lint code
flake8 stability_selection examples --max-line-length=100

# Run tests
pytest stability_selection/tests

# Run tests with coverage
pytest --cov=stability_selection --cov-report=term
```

For more details, see the Development Guide and Standardization Summary.

References

[1] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.

[2] Shah, R.D. and Samworth, R.J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.

Requirements

Python 3.8+
NumPy >= 1.20.0
SciPy >= 1.7.0
scikit-learn >= 1.0.0
PyTorch >= 1.10.0 (optional, for GPU acceleration)
joblib >= 1.0.0
tqdm >= 4.60.0
matplotlib >= 3.3.0 (for visualization)
seaborn >= 0.11.0 (for visualization)

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Acknowledgments

This project is a modernized version of the original stability-selection package by Thomas Huijskens, with added features for GPU acceleration, improved parallel processing, and standardized code quality.

Owner

Name: Rakib Al-Fahad
Login: rakibalfahad
Kind: user
Company: The University of Memphis

Repositories: 1
Profile: https://github.com/rakibalfahad

I am pursuing my Ph.D. in Computer Engineering at the Department of Electrical and Computer Engineering at The University of Memphis.

Citation (CITATION.md)

# Citation

If you use this package in your research, please consider citing the original stability selection paper:

```
@article{meinshausen2010stability,
  title={Stability selection},
  author={Meinshausen, Nicolai and B{\"u}hlmann, Peter},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
  volume={72},
  number={4},
  pages={417--473},
  year={2010},
  publisher={Wiley Online Library}
}
```

For the complementary pairs bootstrap variant:

```
@article{shah2013variable,
  title={Variable selection with error control: another look at stability selection},
  author={Shah, Rajen D and Samworth, Richard J},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
  volume={75},
  number={1},
  pages={55--80},
  year={2013},
  publisher={Wiley Online Library}
}
```

For the implementation of this package:

```
@software{stability_selection,
  author = {Various contributors},
  title = {Stability Selection - A scikit-learn compatible implementation with GPU acceleration},
  url = {https://github.com/rakibalfahad/stabilitypy},
  year = {2025}
}
```

GitHub Events

Total

Issue comment event: 1
Member event: 1
Push event: 7
Pull request event: 7
Create event: 2

Last Year

Issue comment event: 1
Member event: 1
Push event: 7
Pull request event: 7
Create event: 2

Dependencies

requirements.txt pypi

joblib >=1.3.0
matplotlib >=3.7.0
numpy >=1.24.0
pytest >=7.0.0
pytest-cov >=4.1.0
scikit-learn >=1.3.0
scipy >=1.11.0
seaborn >=0.12.0
torch >=2.0.0
tqdm >=4.65.0

setup.py pypi
