cubist
A Python package for fitting Quinlan's Cubist regression model
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity — low similarity (13.3%) to scientific vocabulary
Keywords
Repository
A Python package for fitting Quinlan's Cubist regression model
Basic Info
- Host: GitHub
- Owner: pjaselin
- License: gpl-3.0
- Language: C
- Default Branch: main
- Homepage: https://pjaselin.github.io/Cubist/
- Size: 2.16 MB
Statistics
- Stars: 48
- Watchers: 0
- Forks: 4
- Open Issues: 4
- Releases: 17
Topics
Metadata Files
README.md
Cubist
A Python package for fitting Quinlan's Cubist v2.07 regression model. Inspired by and based on the R wrapper for Cubist. Developed as a scikit-learn compatible estimator.
Table of Contents generated with DocToc
- Installation
- Background
- Advantages
- Sample Usage
- Cubist Model Class
- Visualization Utilities
- Considerations
- Benchmarks
- Literature for Cubist
Installation
Model-Only
```bash
pip install --upgrade cubist
```
Optional Dependencies
To enable visualization utilities:
```bash
pip install cubist[viz]
```
For development:
```bash
pip install cubist[dev]
```
Background
Cubist is a regression algorithm developed by John Ross Quinlan for generating rule-based predictive models. It has long been available in the R world thanks to the work of Max Kuhn and his colleagues. This package introduces it to Python as a scikit-learn compatible estimator for use within that ecosystem. Cross-validation and control over whether Cubist creates a composite model are also enabled here.
Advantages
Unlike other ensemble models such as RandomForest and XGBoost, Cubist generates a set of rules, making it easy to understand precisely how the model makes its predictive decisions. Tools such as SHAP and LIME are therefore unnecessary as Cubist doesn't exhibit black-box behavior.
Like XGBoost, Cubist can perform boosting by the addition of more models (called committees) that correct for the error of prior models (i.e. the second model created corrects for the prediction error of the first, the third for the error of the second, etc.).
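The residual-correcting idea behind committees can be sketched in a few lines of numpy. This is a toy illustration using constant models rather than Cubist's full rule-based models: each "committee member" fits the residuals left by the previous members.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, size=100)

# Each "committee member" here is just a constant model fit to the
# residuals left by the previous members: member k corrects the error
# of member k-1, the way Cubist committees do with full rule models.
n_committees = 5
residual = y.copy()
members = []
for _ in range(n_committees):
    member = residual.mean()      # trivial model: predict the mean residual
    members.append(member)
    residual = residual - member  # the next member fits what is left over

prediction = sum(members)         # committee prediction sums the corrections
```

With constant models the corrections after the first are essentially zero, but the mechanics are the same: each successive model targets the error of the ensemble so far.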
In addition to boosting, the model can perform instance-based (nearest-neighbor) corrections to create composite models, combining the advantages of these two methods. Note that with instance-based correction, model accuracy may be improved at the expense of compute time (this extra step takes longer) and some interpretability as the linear regression rules are no longer completely followed. It should also be noted that a composite model might be quite large as the full training dataset must be stored in order to perform instance-based corrections for inferencing. A composite model will be used when auto=False with neighbors set to an integer between 1 and 9. Cubist can be allowed to decide whether to take advantage of composite models with auto=True and neighbors left unset.
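A rough numpy sketch of the instance-based correction idea follows. The `composite_predict` helper and its fixed 50/50 blend are hypothetical and purely illustrative; Cubist's actual neighbor adjustment is more involved, but the sketch shows why the full training set must be kept around for inferencing.

```python
import numpy as np

def composite_predict(x, X_train, y_train, rule_pred, neighbors=5):
    """Blend a rule-based prediction with the mean target of the nearest
    training instances (a fixed 50/50 blend, purely for illustration)."""
    dists = np.linalg.norm(X_train - x, axis=1)     # needs the training data
    nearest = np.argsort(dists)[:neighbors]
    neighbor_pred = y_train[nearest].mean()
    return 0.5 * rule_pred + 0.5 * neighbor_pred

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 3))
y_train = X_train.sum(axis=1)
print(composite_predict(np.zeros(3), X_train, y_train, rule_pred=0.0))
```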
Sample Usage
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from cubist import Cubist

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42
)
model = Cubist(n_rules=2, verbose=True)
model.fit(X_train, y_train)
Cubist [Release 2.07 GPL Edition] Sat Dec 28 19:52:49 2024
Target attribute `outcome'
Read 142 cases (5 attributes)
Model:
Rule 1: [48 cases, mean 0.0, range 0 to 0, est err 0.0]
if
petal width (cm) <= 0.6
then
outcome = 0
Rule 2: [94 cases, mean 1.5, range 1 to 2, est err 0.2]
if
petal width (cm) > 0.6
then
outcome = 0.2 + 0.76 petal width (cm) + 0.271 petal length (cm)
- 0.45 sepal width (cm)
Evaluation on training data (142 cases):
Average |error| 0.1
Relative |error| 0.16
Correlation coefficient 0.98
Attribute usage:
Conds Model
100% 66% petal width (cm)
66% sepal width (cm)
66% petal length (cm)
Time: 0.0 secs
Cubist(n_rules=2, verbose=True)
model.predict(X_test)
array([1.1257    , 0.        , 2.04999995, 1.25449991, 1.30480003,
       0.        , 0.94999999, 1.93509996])
model.score(X_test, y_test)
0.9543285583162371
```
Cubist Model Class
Model Parameters
The following parameters can be passed as arguments to the Cubist() class instantiation:
- n_rules: Limit of the number of rules Cubist will build. Recommended value is 500.
- n_committees: Number of committees to construct. Each committee is a rule based model and beyond the first tries to correct the prediction errors of the prior constructed model. Recommended value is 5.
- neighbors: Integer between 1 and 9 for how many instances should be used to correct the rule-based prediction. If no value is given, Cubist will build a rule-based model only. If this value is set, Cubist will create a composite model with the given number of neighbors. Regardless of the value set, if auto=True, Cubist may override this input and choose a different number of neighbors. Please assess the model for the selected value for the number of neighbors used.
- unbiased: Should unbiased rules be used? Since Cubist minimizes the MAE of the predicted values, the rules may be biased and the mean predicted value may differ from the actual mean. This is recommended when there are frequent occurrences of the same value in a training dataset. Note that MAE may be slightly higher.
- auto: A value of True allows the algorithm to choose whether to use nearest-neighbor corrections and how many neighbors to use; False leaves the choice of whether to use a composite model to the value passed to neighbors.
- extrapolation: Controls how much rule predictions are adjusted to be consistent with the training dataset. Recommended value is 5% as a decimal (0.05).
- sample: Percentage of the data set to be randomly selected for model building (0.0 or greater but less than 1.0) and held out for model testing. When using this parameter, Cubist will report evaluation results on the testing set in addition to the training set results.
- cv: Whether to carry out cross-validation (recommended value is 10)
- random_state: An integer to set the random seed for the C Cubist code.
- target_label: A label for the outcome variable. This is only used for printing rules.
- verbose: Should the Cubist output be printed? 1 if yes, 0 if no.
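To make the extrapolation parameter concrete, here is a minimal numpy sketch of one plausible interpretation: clamping predictions to the training target range widened by the given fraction. The `clamp_extrapolation` helper is an assumption for illustration; the actual adjustment in the C Cubist code may differ.

```python
import numpy as np

def clamp_extrapolation(preds, y_train, extrapolation=0.05):
    """Limit predictions to the training target range, widened on each
    side by `extrapolation` times that range (illustrative only)."""
    lo, hi = y_train.min(), y_train.max()
    margin = extrapolation * (hi - lo)
    return np.clip(preds, lo - margin, hi + margin)

y_train = np.array([1.0, 2.0, 3.0])
print(clamp_extrapolation(np.array([0.0, 2.5, 4.0]), y_train))
```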
Model Attributes
The following attributes are exposed to understand the Cubist model results:
- model_: The trained Cubist model.
- output_: The pretty print summary of the Cubist model.
- feature_importances_: DataFrame of how input variables are used in model conditions and regression equations.
- n_features_in_: The number of features seen during model fitting.
- feature_names_in_: List of features used to train Cubist.
- splits_: Table of the splits created by the Cubist model.
- coeffs_: Table of the regression coefficients found by the Cubist model.
- version_: The Cubist model version.
- feature_statistics_: Model statistics (e.g. global mean, extrapolation %, ceiling value, floor value).
- committee_error_reduction_: Error reduction achieved by using committees.
- n_committees_used_: Number of committees used by Cubist.
Visualization Utilities
Based on the R Cubist package, a few visualization utilities are provided to allow some exploration of trained Cubist models. Differing from the original package, these are extended somewhat to allow configuration of the subplots as well as for selecting a subset of variables/attributes to plot.
Coefficient Display
The CubistCoefficientDisplay plots the linear regression coefficients and intercepts selected by the Cubist model. One subplot is created for each variable/attribute with the rule number or committee/rule pair on the y-axis and the coefficient value plotted along the x-axis.
CubistCoefficientDisplay.from_estimator Parameters
- estimator: The trained Cubist model.
- committee: Optional parameter to filter to only committees at or below this committee number.
- rule: Optional parameter to filter to only rules at or below this rule number.
- feature_names: List of feature names to filter to in the plot. Leaving unselected plots all features.
- ax: An optional Matplotlib axes object.
- scatter_kwargs: Optional keywords to pass to matplotlib.pyplot.scatter.
- gridspec_kwargs: Optional keywords to pass to matplotlib.pyplot.subplots.
CubistCoefficientDisplay Sample Usage
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from cubist import Cubist, CubistCoefficientDisplay

X, y = load_iris(return_X_y=True, as_frame=True)
model = Cubist().fit(X, y)
display = CubistCoefficientDisplay.from_estimator(estimator=model)
plt.show()
```

Coverage Display
The CubistCoverageDisplay is used to visualize the coverage of rule splits for a given dataset. One subplot is created per input variable/attribute/column with the rule number or committee/rule pair plotted on the y-axis and the coverage ranges plotted along the x-axis, scaled to the percentage of the variable values.
CubistCoverageDisplay.from_estimator Parameters
- estimator: The trained Cubist model.
- X: An input dataset comparable to the dataset used to train the Cubist model.
- committee: Optional parameter to filter to only committees at or below this committee number.
- rule: Optional parameter to filter to only rules at or below this rule number.
- feature_names: List of feature names to filter to in the plot. Leaving unselected plots all features.
- ax: An optional Matplotlib axes object.
- line_kwargs: Optional keywords to pass to matplotlib.pyplot.plot.
- gridspec_kwargs: Optional keywords to pass to matplotlib.pyplot.subplots.
CubistCoverageDisplay Sample Usage
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from cubist import Cubist, CubistCoverageDisplay

X, y = load_iris(return_X_y=True, as_frame=True)
model = Cubist().fit(X, y)
display = CubistCoverageDisplay.from_estimator(estimator=model, X=X)
plt.show()
```

Considerations
- For small datasets, using the sample parameter is probably inadvisable as Cubist won't have enough samples to produce a representative model.
- If you are looking for fast inferencing and can spare accuracy, consider skipping the composite model by leaving neighbors unset.
- Models that produce one or more rules without splits (i.e. a single linear model that holds true for the entire dataset) will return an empty splits_ attribute, while the coefficients will be available in the coeffs_ attribute.
Benchmarks
There are many literature examples demonstrating the power of Cubist and comparing it to Random Forest as well as other bootstrapped/boosted models. Some of these are compiled here: Cubist in Use. To demonstrate this, some benchmark scripts are provided in the correspondingly named folder.
Literature for Cubist
Original Paper
Publications Using Cubist
Owner
- Name: Patrick Aselin
- Login: pjaselin
- Kind: user
- Company: @VulcanForms
- Repositories: 3
- Profile: https://github.com/pjaselin
GitHub Events
Total
- Create event: 3
- Release event: 1
- Issues event: 11
- Watch event: 5
- Delete event: 1
- Issue comment event: 32
- Push event: 102
- Pull request event: 3
Last Year
- Create event: 3
- Release event: 1
- Issues event: 11
- Watch event: 5
- Delete event: 1
- Issue comment event: 32
- Push event: 102
- Pull request event: 3
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| pjaselin | 2****n | 347 |
| Morten Pedersen | m****1@g****m | 42 |
| Anderson Chaves | a****s@g****m | 4 |
| Harry Moreno | m****9@g****m | 3 |
| Morten Pedersen | m****n@c****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 28
- Total pull requests: 111
- Average time to close issues: 2 months
- Average time to close pull requests: 9 days
- Total issue authors: 14
- Total pull request authors: 2
- Average comments per issue: 3.04
- Average comments per pull request: 0.16
- Merged pull requests: 107
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 2
- Average time to close issues: 3 months
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 1
- Average comments per issue: 5.67
- Average comments per pull request: 0.5
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- pjaselin (9)
- jwgda (4)
- Paulnkk (4)
- Coding-Deng (2)
- ramongss (1)
- zhihaojin (1)
- lheberling (1)
- Aditya7879 (1)
- pkalita595 (1)
- eggy00 (1)
- piccolomo (1)
- moracabanas (1)
- Silensea (1)
- phrh (1)
Pull Request Authors
- pjaselin (119)
- mortenvester1 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 3,547 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 18
- Total maintainers: 1
pypi.org: cubist
A Python package for fitting Quinlan's Cubist regression model.
- Homepage: https://github.com/pjaselin/Cubist
- Documentation: https://cubist.readthedocs.io/
- License: GNU General Public License v3 (GPLv3)
- Latest release: 1.0.0 (published 8 months ago)
Rankings
Maintainers (1)
Dependencies
- numpy >=1.19.2
- pandas >=1.1.3
- scikit-learn >=0.24.2