stabilitypy

stability selection

https://github.com/rakibalfahad/stabilitypy

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.4%) to scientific vocabulary

Repository

stability selection

Basic Info
  • Host: GitHub
  • Owner: rakibalfahad
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Size: 29.3 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

StabilityPy

A modern implementation of stability selection for feature selection, with support for GPU acceleration via PyTorch and parallel processing for CPU.

Introduction

Stability selection is a method for feature selection introduced by Meinshausen and Bühlmann (2010). The core idea is to apply a feature selection algorithm repeatedly to random subsamples of the data and select only those features that are consistently selected across many subsamples.

Theory

Stability selection works as follows:

  1. Create multiple subsamples of your data by randomly selecting a subset of observations.
  2. For each subsample, run a feature selection algorithm (often a penalized regression like LASSO) across a range of regularization parameters.
  3. For each feature, calculate its selection probability (stability score) as the fraction of subsamples where it was selected.
  4. Choose features with stability scores above a user-defined threshold.
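The four steps above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's `Lasso` as the base selector and a toy dataset; the names `stability_scores` and `select_stable` are made up for this example and are not part of StabilityPy's API.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_scores(X, y, alphas, n_subsamples=50, sample_fraction=0.5, seed=0):
    """Return an (n_features, n_alphas) array of selection frequencies."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros((n_features, len(alphas)))

    for _ in range(n_subsamples):
        # Step 1: draw a random subsample of observations without replacement.
        idx = rng.choice(n_samples, size=subsample_size, replace=False)
        for j, alpha in enumerate(alphas):
            # Step 2: run the base selector (here: LASSO) at this penalty level.
            model = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx])
            counts[:, j] += model.coef_ != 0

    # Step 3: stability score = fraction of subsamples that kept the feature.
    return counts / n_subsamples


def select_stable(scores, threshold=0.6):
    # Step 4: keep features whose best score over the grid clears the threshold.
    return np.flatnonzero(scores.max(axis=1) >= threshold)


# Toy data: only the first three of twenty features carry signal.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 20))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 3.0 * X[:, 2] + 0.5 * rng.standard_normal(200)

scores = stability_scores(X, y, alphas=[0.3, 0.5, 1.0])
selected = select_stable(scores)
print(selected)
```

Taking the maximum score over the regularization grid before thresholding follows the definition of the stable set in [1]; the grid, threshold, and subsample count here are arbitrary choices for the toy data.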

This approach has several advantages:

  • It provides control over the family-wise error rate of including false positives
  • It is more robust to small changes in the data
  • It works with many different base feature selection methods
  • It reduces the sensitivity to the choice of regularization parameter
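The error-control advantage can be made concrete. Theorem 1 of [1] bounds the expected number of falsely selected features by `q**2 / ((2 * threshold - 1) * p)`, where `q` is the average number of features the base selector picks per subsample and `p` is the total number of features. The bound holds under the assumptions stated in that paper (an exchangeability condition, and a base selector that is no worse than random guessing). The helper name below is hypothetical and only evaluates the formula.

```python
def expected_false_positives_bound(q, p, threshold):
    """Upper bound on the expected number of false selections (Theorem 1 in [1])."""
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be greater than 0.5 and at most 1")
    return q ** 2 / ((2.0 * threshold - 1.0) * p)


# Example: 1000 features, about 30 selected per subsample, threshold 0.9.
bound = expected_false_positives_bound(q=30, p=1000, threshold=0.9)
print(f"expected false positives <= {bound:.3f}")
```

Read the other way around, the formula suggests how sparse the base selector should be for a given threshold and a tolerated number of false positives.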

In the randomized LASSO variant, the penalty term for each feature is randomly scaled, which adds another layer of robustness.
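The random penalty scaling can be implemented by rescaling columns: multiplying feature `k` by a weight `W_k` and fitting an ordinary LASSO is equivalent to penalizing that feature's coefficient by `alpha / W_k`. The sketch below assumes weights drawn uniformly from `[min_weight, 1]` and uses scikit-learn's `Lasso`; the function name and the `min_weight` parameter are illustrative, and the package's own `weakness` parameter may map to the weights differently.

```python
import numpy as np
from sklearn.linear_model import Lasso


def randomized_lasso_coef(X, y, alpha, min_weight=0.5, seed=0):
    """Fit a LASSO whose per-feature penalty is alpha / W_k, W_k ~ U[min_weight, 1]."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(min_weight, 1.0, size=X.shape[1])
    # Shrinking column k by W_k and fitting a plain LASSO is equivalent to
    # penalizing |beta_k| by alpha / W_k on the original columns.
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X * weights, y)
    # Map the coefficients back to the scale of the original features.
    return model.coef_ * weights, weights


# Toy data: only feature 0 influences the target.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = 4.0 * X[:, 0] + 0.5 * rng.standard_normal(100)

coef, weights = randomized_lasso_coef(X, y, alpha=0.2)
print(np.flatnonzero(coef))
```

In stability selection a fresh set of weights would be drawn for every subsample, so a feature has to survive both the resampling of rows and the perturbation of its penalty.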

Installation

From PyPI (Recommended)

```bash
pip install stability-selection
```

From Source

To install the latest development version:

```bash
git clone https://github.com/rakibalfahad/stabilitypy.git
cd stabilitypy
pip install -e .
```

For development, install with development dependencies:

```bash
pip install -e ".[dev]"
```

For GPU acceleration, install with GPU dependencies:

```bash
pip install -e ".[gpu]"
```

Features

  • GPU Acceleration: Uses PyTorch for GPU-accelerated computations when available
  • Parallel Processing: Efficient multi-core CPU utilization for bootstrap iterations
  • Multiple Bootstrapping Strategies:
    • Subsampling without replacement (default)
    • Complementary pairs subsampling
    • Stratified bootstrapping for imbalanced classification
  • Scikit-learn Compatible: Works with scikit-learn pipelines and cross-validation
  • CSV/CSV.GZ Processing: Direct support for tabular data formats
  • Automated Feature Selection: Process tabular data and visualize results with a single command
  • Model Fine-tuning: Automatically fine-tune models with selected features and compare to baselines
  • Synthetic Data Generation: Generate controlled datasets for testing and benchmarking
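Two of the bootstrapping strategies listed above can be illustrated with NumPy alone. This is a sketch under stated assumptions, not the package's implementation: the function names are hypothetical, complementary pairs are taken as two disjoint halves of one random permutation (the scheme described in [2]), and the stratified variant samples without replacement, which may differ from what StabilityPy does.

```python
import numpy as np


def complementary_pairs(n_samples, rng):
    """Split a random permutation into two disjoint halves of size n_samples // 2."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:2 * half]


def stratified_subsample(y, sample_fraction, rng):
    """Draw the same fraction of indices from every class, without replacement."""
    parts = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        size = max(1, int(round(sample_fraction * len(members))))
        parts.append(rng.choice(members, size=size, replace=False))
    return np.sort(np.concatenate(parts))


rng = np.random.default_rng(0)

# With an odd number of samples, one observation is left out of both halves.
first, second = complementary_pairs(101, rng)

# Imbalanced labels: 90 negatives and 10 positives keep their 9:1 ratio.
y = np.array([0] * 90 + [1] * 10)
idx = stratified_subsample(y, sample_fraction=0.5, rng=rng)
print(len(first), len(second), np.bincount(y[idx]))
```

Stratifying matters for imbalanced classification because a plain random subsample can end up with very few, or no, minority-class rows.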

Documentation

Example Usage

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from stability_selection import StabilitySelection

# Create a pipeline with a base estimator
base_estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])

# Initialize stability selection with the base estimator
selector = StabilitySelection(
    base_estimator=base_estimator,
    lambda_name='model__C',
    lambda_grid=np.logspace(-5, -1, 50),
    n_jobs=-1,     # Use all available CPU cores
    use_gpu=True   # Use GPU if available
)

# Fit the selector to your data (X: feature matrix, y: target vector)
selector.fit(X, y)

# Get selected features
selected_features = selector.get_support(indices=True)
```
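The example above assumes that `X` and `y` already exist. For a quick trial run, one option is to generate them with scikit-learn's `make_classification`; the sizes below are arbitrary. With `shuffle=False` the informative features occupy the first columns, which makes it easy to check whether the selector recovers them.

```python
from sklearn.datasets import make_classification

# 500 samples, 50 features, of which the first 5 are informative.
X, y = make_classification(
    n_samples=500,
    n_features=50,
    n_informative=5,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
print(X.shape, y.shape)
```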

Data Processing and Analysis

StabilityPy now includes a comprehensive script for processing tabular data and running the full stability selection workflow:

```bash
# For classification problems
python stability_processor.py --input data.csv --output results_dir --problem_type classification

# For regression problems
python stability_processor.py --input data.csv.gz --output results_dir --problem_type regression --use_gpu
```

The script will:

  1. Load and preprocess your data
  2. Run stability selection to identify important features
  3. Fine-tune a model using only the selected features
  4. Compare performance with a baseline model using all features
  5. Generate visualizations and save all results to the output directory
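Steps 3 and 4 amount to scoring the same model on two feature sets. The sketch below shows that comparison with scikit-learn on generated data. It is not the script's actual code: the model, the cross-validation setup, and the `selected` indices (standing in for the output of stability selection) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=300, n_features=40, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)

# Hypothetical result of stability selection: indices of the kept features.
selected = np.arange(5)


def cv_accuracy(features):
    """Mean 5-fold cross-validated accuracy of a logistic regression."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, y, cv=5).mean()


baseline_score = cv_accuracy(X)               # all features
selected_score = cv_accuracy(X[:, selected])  # selected features only
print(f"all features: {baseline_score:.3f}  selected only: {selected_score:.3f}")
```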

Synthetic Data Generation

You can generate synthetic datasets with controlled properties using the included generator:

```bash
# Generate a classification dataset
python synthetic_data_generator.py --output data.csv --problem_type classification --n_samples 1000 --n_features 100 --n_informative 10

# Generate a compressed regression dataset
python synthetic_data_generator.py --output data.csv.gz --problem_type regression --n_samples 2000 --n_features 500 --n_informative 20 --compress
```
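For readers who prefer to stay inside Python, a comparable dataset can be produced with scikit-learn and written to a gzipped CSV with NumPy. This is only a sketch of the idea: the column names (`feature_0` ... `target`) and the file layout are assumptions and may not match what the generator script writes.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=0.1, random_state=0
)
table = np.column_stack([X, y])
header = ",".join([f"feature_{i}" for i in range(X.shape[1])] + ["target"])

path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")
# NumPy gzips the output automatically when the file name ends in ".gz".
np.savetxt(path, table, delimiter=",", header=header, comments="")

# loadtxt reads gzipped files transparently; skip the header row.
reloaded = np.loadtxt(path, delimiter=",", skiprows=1)
print(reloaded.shape)
```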

Advanced Example with Visualizations

The package includes comprehensive example scripts that demonstrate various use cases:

```bash
# Basic stability selection example
python examples/stability_selection_example.py

# Example with GPU acceleration
python examples/gpu_acceleration_example.py

# Example with synthetic data visualization
python examples/synthetic_data_visualization.py
```

Complete Workflow Example

Here's a complete workflow from data generation to model fine-tuning:

```bash
# 1. Generate synthetic data
python synthetic_data_generator.py \
    --output data/synthetic_classification.csv \
    --problem_type classification \
    --n_samples 1000 \
    --n_features 100 \
    --n_informative 10 \
    --noise 0.1

# 2. Run stability selection with fine-tuning
python stability_processor.py \
    --input data/synthetic_classification.csv \
    --output results/synthetic_classification \
    --problem_type classification \
    --n_bootstrap 100 \
    --use_gpu
```

Output Files and Visualizations

The stability_processor.py script produces a comprehensive set of outputs:

  • selected_features.csv: CSV file with selected features and their stability scores
  • feature_importance.csv: Feature importances from the fine-tuned model
  • performance_metrics.csv: Performance comparison between selected features and all features
  • stability_selection_results.pkl: Pickled results object with all stability scores
  • fine_tuned_model.pkl: Trained model using only selected features
  • baseline_model.pkl: Trained model using all features

Visualizations:

  • stability_paths.png: Plot of stability scores across regularization parameters
  • stability_heatmap.png: Heatmap of stability scores for top features
  • performance_comparison.png: Bar chart comparing model performance
  • feature_importance.png: Bar chart of feature importances from fine-tuned model

GPU Acceleration

StabilityPy provides GPU acceleration via PyTorch for faster computation, especially useful for large datasets:

```python
import numpy as np
from stability_selection import StabilitySelection, RandomizedLasso

# For regression tasks, use RandomizedLasso with GPU acceleration
estimator = RandomizedLasso(weakness=0.5, use_gpu=True)
selector = StabilitySelection(
    base_estimator=estimator,
    lambda_name='alpha',
    lambda_grid=np.linspace(0.001, 0.5, num=100),
    threshold=0.9,
    use_gpu=True,  # Enable GPU acceleration
    n_jobs=-1      # Use all CPU cores for operations that can't be GPU accelerated
)
selector.fit(X, y)
```
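How the GPU flag behaves when no GPU or no PyTorch installation is present is not described here. A common pattern for an optional accelerator is to fall back to the CPU, as in the sketch below; `pick_device` is a hypothetical helper, not a StabilityPy function, and only `torch.cuda.is_available()` is an actual PyTorch call.

```python
def pick_device(use_gpu=True):
    """Return "cuda" when requested and usable, otherwise "cpu"."""
    if not use_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # PyTorch is an optional dependency, so fall back quietly.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on: {device}")
```

Checking the device up front makes it easy to confirm that a run is actually using the GPU before starting a long job.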

Development

The project follows standard Python development practices with tools for code quality and testing.

Development Setup

```bash
# Clone the repository
git clone https://github.com/rakibalfahad/stabilitypy.git
cd stabilitypy

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

Code Quality Tools

We use several tools to maintain code quality:

```bash
# Format code
black stability_selection examples --line-length=100
isort stability_selection examples

# Lint code
flake8 stability_selection examples --max-line-length=100

# Run tests
pytest stability_selection/tests

# Run tests with coverage
pytest --cov=stability_selection --cov-report=term
```

For more details, see the Development Guide and Standardization Summary.

References

[1] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.

[2] Shah, R.D. and Samworth, R.J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.

Requirements

  • Python 3.8+
  • NumPy >= 1.20.0
  • SciPy >= 1.7.0
  • scikit-learn >= 1.0.0
  • PyTorch >= 1.10.0 (optional, for GPU acceleration)
  • joblib >= 1.0.0
  • tqdm >= 4.60.0
  • matplotlib >= 3.3.0 (for visualization)
  • seaborn >= 0.11.0 (for visualization)

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Acknowledgments

This project is a modernized version of the original stability-selection package by Thomas Huijskens, with added features for GPU acceleration, improved parallel processing, and standardized code quality.

Owner

  • Name: Rakib Al-Fahad
  • Login: rakibalfahad
  • Kind: user
  • Company: The University of Memphis

I am pursuing my Ph.D. in Computer Engineering at the Department of Electrical and Computer Engineering at The University of Memphis.

Citation (CITATION.md)

# Citation

If you use this package in your research, please consider citing the original stability selection paper:

```
@article{meinshausen2010stability,
  title={Stability selection},
  author={Meinshausen, Nicolai and B{\"u}hlmann, Peter},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
  volume={72},
  number={4},
  pages={417--473},
  year={2010},
  publisher={Wiley Online Library}
}
```

For the complementary pairs bootstrap variant:

```
@article{shah2013variable,
  title={Variable selection with error control: another look at stability selection},
  author={Shah, Rajen D and Samworth, Richard J},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
  volume={75},
  number={1},
  pages={55--80},
  year={2013},
  publisher={Wiley Online Library}
}
```

For the implementation of this package:

```
@software{stability_selection,
  author = {Various contributors},
  title = {Stability Selection - A scikit-learn compatible implementation with GPU acceleration},
  url = {https://github.com/rakibalfahad/stabilitypy},
  year = {2025}
}
```

GitHub Events

Total
  • Issue comment event: 1
  • Member event: 1
  • Push event: 7
  • Pull request event: 7
  • Create event: 2
Last Year
  • Issue comment event: 1
  • Member event: 1
  • Push event: 7
  • Pull request event: 7
  • Create event: 2

Dependencies

requirements.txt pypi
  • joblib >=1.3.0
  • matplotlib >=3.7.0
  • numpy >=1.24.0
  • pytest >=7.0.0
  • pytest-cov >=4.1.0
  • scikit-learn >=1.3.0
  • scipy >=1.11.0
  • seaborn >=0.12.0
  • torch >=2.0.0
  • tqdm >=4.65.0
setup.py pypi