cubist
A Python package for fitting Quinlan's Cubist regression model
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity — low similarity (13.3%) to scientific vocabulary
Keywords
Repository
A Python package for fitting Quinlan's Cubist regression model
Basic Info
- Host: GitHub
- Owner: pjaselin
- License: gpl-3.0
- Language: C
- Default Branch: main
- Homepage: https://pjaselin.github.io/Cubist/
- Size: 2.16 MB
Statistics
- Stars: 48
- Watchers: 0
- Forks: 4
- Open Issues: 4
- Releases: 17
Topics
Metadata Files
README.md
Cubist
A Python package for fitting Quinlan's Cubist v2.07 regression model. Inspired by and based on the R wrapper for Cubist. Developed as a scikit-learn compatible estimator.
Table of Contents generated with DocToc
- Installation
- Background
- Advantages
- Sample Usage
- Cubist Model Class
- Visualization Utilities
- Considerations
- Benchmarks
- Literature for Cubist
Installation
Model-Only
```bash
pip install --upgrade cubist
```
Optional Dependencies
To enable visualization utilities:
```bash
pip install cubist[viz]
```
For development:
```bash
pip install cubist[dev]
```
Background
Cubist is a regression algorithm developed by John Ross Quinlan for generating rule-based predictive models. It has long been available in the R world thanks to the work of Max Kuhn and his colleagues. This package introduces it to Python as a scikit-learn compatible estimator for use within that ecosystem. Cross-validation and control over whether Cubist creates a composite model are also enabled here.
Advantages
Unlike other ensemble models such as RandomForest and XGBoost, Cubist generates a set of rules, making it easy to understand precisely how the model makes its predictive decisions. Tools such as SHAP and LIME are therefore unnecessary as Cubist doesn't exhibit black-box behavior.
Like XGBoost, Cubist can perform boosting by the addition of more models (called committees) that correct for the error of prior models (i.e. the second model created corrects for the prediction error of the first, the third for the error of the second, etc.).
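The residual-correcting idea behind committees can be sketched in a few lines of numpy. This is a toy illustration using constant models rather than Cubist's full rule-based models: each "committee member" fits the residuals left by the previous members.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, size=100)

# Each "committee member" here is just a constant model fit to the
# residuals left by the previous members: member k corrects the error
# of member k-1, the way Cubist committees do with full rule models.
n_committees = 5
residual = y.copy()
members = []
for _ in range(n_committees):
    member = residual.mean()      # trivial model: predict the mean residual
    members.append(member)
    residual = residual - member  # the next member fits what is left over

prediction = sum(members)         # committee prediction sums the corrections
```

With constant models the corrections after the first are essentially zero, but the mechanics are the same: each successive model targets the error of the ensemble so far.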
In addition to boosting, the model can perform instance-based (nearest-neighbor) corrections to create composite models, combining the advantages of these two methods. Note that with instance-based correction, model accuracy may be improved at the expense of compute time (this extra step takes longer) and some interpretability as the linear regression rules are no longer completely followed. It should also be noted that a composite model might be quite large as the full training dataset must be stored in order to perform instance-based corrections for inferencing. A composite model will be used when auto=False with neighbors set to an integer between 1 and 9. Cubist can be allowed to decide whether to take advantage of composite models with auto=True and neighbors left unset.
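A rough numpy sketch of the instance-based correction idea follows. The `composite_predict` helper and its fixed 50/50 blend are hypothetical and purely illustrative; Cubist's actual neighbor adjustment is more involved, but the sketch shows why the full training set must be kept around for inferencing.

```python
import numpy as np

def composite_predict(x, X_train, y_train, rule_pred, neighbors=5):
    """Blend a rule-based prediction with the mean target of the nearest
    training instances (a fixed 50/50 blend, purely for illustration)."""
    dists = np.linalg.norm(X_train - x, axis=1)     # needs the training data
    nearest = np.argsort(dists)[:neighbors]
    neighbor_pred = y_train[nearest].mean()
    return 0.5 * rule_pred + 0.5 * neighbor_pred

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 3))
y_train = X_train.sum(axis=1)
print(composite_predict(np.zeros(3), X_train, y_train, rule_pred=0.0))
```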
Sample Usage
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from cubist import Cubist

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42
)
model = Cubist(n_rules=2, verbose=True)
model.fit(X_train, y_train)
Cubist [Release 2.07 GPL Edition] Sat Dec 28 19:52:49 2024
Target attribute `outcome'
Read 142 cases (5 attributes)
Model:
Rule 1: [48 cases, mean 0.0, range 0 to 0, est err 0.0]
if
petal width (cm) <= 0.6
then
outcome = 0
Rule 2: [94 cases, mean 1.5, range 1 to 2, est err 0.2]
if
petal width (cm) > 0.6
then
outcome = 0.2 + 0.76 petal width (cm) + 0.271 petal length (cm)
- 0.45 sepal width (cm)
Evaluation on training data (142 cases):
Average |error| 0.1
Relative |error| 0.16
Correlation coefficient 0.98
Attribute usage:
Conds Model
100% 66% petal width (cm)
66% sepal width (cm)
66% petal length (cm)
Time: 0.0 secs
Cubist(n_rules=2, verbose=True)
model.predict(X_test)
array([1.1257    , 0.        , 2.04999995, 1.25449991, 1.30480003,
       0.        , 0.94999999, 1.93509996])
model.score(X_test, y_test)
0.9543285583162371
```
Cubist Model Class
Model Parameters
The following parameters can be passed as arguments to the Cubist() class instantiation:
- n_rules: Limit of the number of rules Cubist will build. Recommended value is 500.
- n_committees: Number of committees to construct. Each committee is a rule based model and beyond the first tries to correct the prediction errors of the prior constructed model. Recommended value is 5.
- neighbors: Integer between 1 and 9 for how many instances should be used to correct the rule-based prediction. If no value is given, Cubist will build a rule-based model only. If this value is set, Cubist will create a composite model with the given number of neighbors. Regardless of the value set, if auto=True, Cubist may override this input and choose a different number of neighbors. Please assess the model for the selected value for the number of neighbors used.
- unbiased: Should unbiased rules be used? Since Cubist minimizes the MAE of the predicted values, the rules may be biased and the mean predicted value may differ from the actual mean. This is recommended when there are frequent occurrences of the same value in a training dataset. Note that MAE may be slightly higher.
- auto: A value of True allows the algorithm to choose whether to use nearest-neighbor corrections and how many neighbors to use; False leaves the choice of whether to use a composite model to the value passed to neighbors.
- extrapolation: Controls how much rule predictions are adjusted to be consistent with the training dataset. Recommended value is 5% as a decimal (0.05).
- sample: Percentage of the data set to be randomly selected for model building (0.0 or greater but less than 1.0) and held out for model testing. When using this parameter, Cubist will report evaluation results on the testing set in addition to the training set results.
- cv: Whether to carry out cross-validation (recommended value is 10)
- random_state: An integer to set the random seed for the C Cubist code.
- target_label: A label for the outcome variable. This is only used for printing rules.
- verbose: Should the Cubist output be printed? 1 if yes, 0 if no.
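To make the extrapolation parameter concrete, here is a minimal numpy sketch of one plausible interpretation: clamping predictions to the training target range widened by the given fraction. The `clamp_extrapolation` helper is an assumption for illustration; the actual adjustment in the C Cubist code may differ.

```python
import numpy as np

def clamp_extrapolation(preds, y_train, extrapolation=0.05):
    """Limit predictions to the training target range, widened on each
    side by `extrapolation` times that range (illustrative only)."""
    lo, hi = y_train.min(), y_train.max()
    margin = extrapolation * (hi - lo)
    return np.clip(preds, lo - margin, hi + margin)

y_train = np.array([1.0, 2.0, 3.0])
print(clamp_extrapolation(np.array([0.0, 2.5, 4.0]), y_train))
```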
Model Attributes
The following attributes are exposed to understand the Cubist model results:
- model_: The trained Cubist model.
- output_: The pretty print summary of the Cubist model.
- feature_importances_: DataFrame of how input variables are used in model conditions and regression equations.
- n_features_in_: The number of features seen during model fitting.
- feature_names_in_: List of features used to train Cubist.
- splits_: Table of the splits created by the Cubist model.
- coeffs_: Table of the regression coefficients found by the Cubist model.
- version_: The Cubist model version.
- feature_statistics_: Model statistics (e.g. global mean, extrapolation %, ceiling value, floor value).
- committee_error_reduction_: Error reduction achieved by using committees.
- n_committees_used_: Number of committees used by Cubist.
Visualization Utilities
Based on the R Cubist package, a few visualization utilities are provided to allow some exploration of trained Cubist models. Differing from the original package, these are extended somewhat to allow configuration of the subplots as well as for selecting a subset of variables/attributes to plot.
Coefficient Display
The CubistCoefficientDisplay plots the linear regression coefficients and intercepts selected by the Cubist model. One subplot is created for each variable/attribute with the rule number or committee/rule pair on the y-axis and the coefficient value plotted along the x-axis.
CubistCoefficientDisplay.from_estimator Parameters
- estimator: The trained Cubist model.
- committee: Optional parameter to filter to only committees at or below this committee number.
- rule: Optional parameter to filter to only rules at or below this rule number.
- feature_names: List of feature names to filter to in the plot. Leaving unselected plots all features.
- ax: An optional Matplotlib axes object.
- scatter_kwargs: Optional keywords to pass to matplotlib.pyplot.scatter.
- gridspec_kwargs: Optional keywords to pass to matplotlib.pyplot.subplots.
CubistCoefficientDisplay Sample Usage
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from cubist import Cubist, CubistCoefficientDisplay

X, y = load_iris(return_X_y=True, as_frame=True)
model = Cubist().fit(X, y)
display = CubistCoefficientDisplay.from_estimator(estimator=model)
plt.show()
```

Coverage Display
The CubistCoverageDisplay is used to visualize the coverage of rule splits for a given dataset. One subplot is created per input variable/attribute/column with the rule number or committee/rule pair plotted on the y-axis and the coverage ranges plotted along the x-axis, scaled to the percentage of the variable values.
CubistCoverageDisplay.from_estimator Parameters
- estimator: The trained Cubist model.
- X: An input dataset comparable to the dataset used to train the Cubist model.
- committee: Optional parameter to filter to only committees at or below this committee number.
- rule: Optional parameter to filter to only rules at or below this rule number.
- feature_names: List of feature names to filter to in the plot. Leaving unselected plots all features.
- ax: An optional Matplotlib axes object.
- line_kwargs: Optional keywords to pass to matplotlib.pyplot.plot.
- gridspec_kwargs: Optional keywords to pass to matplotlib.pyplot.subplots.
CubistCoverageDisplay Sample Usage
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from cubist import Cubist, CubistCoverageDisplay

X, y = load_iris(return_X_y=True, as_frame=True)
model = Cubist().fit(X, y)
display = CubistCoverageDisplay.from_estimator(estimator=model, X=X)
plt.show()
```

Considerations
- For small datasets, using the sample parameter is probably inadvisable as Cubist won't have enough samples to produce a representative model.
- If you are looking for fast inferencing and can spare accuracy, consider skipping the composite model by leaving neighbors unset.
- Models that produce one or more rules without splits (i.e. a single linear model that holds true for the entire dataset) will return an empty splits_ attribute, while the coefficients will be available in the coeffs_ attribute.
Benchmarks
There are many literature examples demonstrating the power of Cubist and comparing it to Random Forest as well as other bootstrapped/boosted models. Some of these are compiled here: Cubist in Use. To demonstrate this, some benchmark scripts are provided in the correspondingly named folder.
Literature for Cubist
Original Paper
Publications Using Cubist
Owner
- Name: Patrick Aselin
- Login: pjaselin
- Kind: user
- Company: @VulcanForms
- Repositories: 3
- Profile: https://github.com/pjaselin
GitHub Events
Total
- Create event: 3
- Release event: 1
- Issues event: 11
- Watch event: 5
- Delete event: 1
- Issue comment event: 32
- Push event: 102
- Pull request event: 3
Last Year
- Create event: 3
- Release event: 1
- Issues event: 11
- Watch event: 5
- Delete event: 1
- Issue comment event: 32
- Push event: 102
- Pull request event: 3
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| pjaselin | 2****n | 347 |
| Morten Pedersen | m****1@g****m | 42 |
| Anderson Chaves | a****s@g****m | 4 |
| Harry Moreno | m****9@g****m | 3 |
| Morten Pedersen | m****n@c****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 28
- Total pull requests: 111
- Average time to close issues: 2 months
- Average time to close pull requests: 9 days
- Total issue authors: 14
- Total pull request authors: 2
- Average comments per issue: 3.04
- Average comments per pull request: 0.16
- Merged pull requests: 107
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 2
- Average time to close issues: 3 months
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 1
- Average comments per issue: 5.67
- Average comments per pull request: 0.5
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- pjaselin (9)
- jwgda (4)
- Paulnkk (4)
- Coding-Deng (2)
- ramongss (1)
- zhihaojin (1)
- lheberling (1)
- Aditya7879 (1)
- pkalita595 (1)
- eggy00 (1)
- piccolomo (1)
- moracabanas (1)
- Silensea (1)
- phrh (1)
Pull Request Authors
- pjaselin (119)
- mortenvester1 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 3,547 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 18
- Total maintainers: 1
pypi.org: cubist
A Python package for fitting Quinlan's Cubist regression model.
- Homepage: https://github.com/pjaselin/Cubist
- Documentation: https://cubist.readthedocs.io/
- License: GNU General Public License v3 (GPLv3)
- Latest release: 1.0.0 (published 8 months ago)
Rankings
Maintainers (1)
Dependencies
- numpy >=1.19.2
- pandas >=1.1.3
- scikit-learn >=0.24.2