chaid

A python implementation of the common CHAID algorithm

https://github.com/rambatino/chaid

Keywords

chaid marketing-statistics spss stats tree

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: Rambatino
License: apache-2.0
Language: Python
Default Branch: master
Size: 5.25 MB

Statistics

Stars: 159
Watchers: 12
Forks: 55
Open Issues: 1
Releases: 2

Created almost 10 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

Chi-Squared Automatic Interaction Detection

This package provides a Python implementation of the Chi-Squared Automatic Interaction Detection (CHAID) decision tree, as well as exhaustive CHAID.

Installation

CHAID is distributed via PyPI and can be installed with:

```bash
pip3 install CHAID
```

If you need support for graphs, install the optional packages together with CHAID:

```bash
pip install CHAID[graph]
```

If you need to read in a .sav file (SPSS), you will also need to install the optional SPSS packages:

```bash
pip install CHAID[spss]
```

To install multiple optional packages, use a comma-separated list:

```bash
pip install CHAID[graph,spss]
```

Alternatively, you can clone the repository and install it via:

```bash
pip install -e path/to/your/checkout
```

N.B. Although we've made some attempt at supporting Python 2.7, we don't encourage its use, as it has reached its End of Life (EOL).

Creating a CHAID Tree

```python
from CHAID import Tree, NominalColumn
import pandas as pd
import numpy as np

# create the data
ndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
df = pd.DataFrame(ndarr)
df.columns = ['a', 'b', 'c']
arr = np.array(([1] * 5) + ([2] * 5))
df['d'] = arr

>>> df
   a  b  c  d
0  1  2  3  1
1  1  2  3  1
2  1  2  3  1
3  1  2  3  1
4  1  2  3  1
5  2  2  3  2
6  2  2  3  2
7  2  2  3  2
8  2  2  3  2
9  2  2  3  2

# set the CHAID input parameters
independent_variable_columns = ['a', 'b', 'c']
dep_variable = 'd'

# create the Tree via pandas
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable)

# create the same tree, but without pandas helper
tree = Tree.from_numpy(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5)

# create the same tree using the tree constructor
cols = [
    NominalColumn(ndarr[:,0], name='a'),
    NominalColumn(ndarr[:,1], name='b'),
    NominalColumn(ndarr[:,2], name='c')
]
tree = Tree(cols, NominalColumn(arr, name='d'), {'min_child_node_size': 5})

>>> tree.print_tree()
([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))
├── ([1], {1: 5, 2: 0}, <Invalid Chaid Split>)
└── ([2], {1: 0, 2: 5}, <Invalid Chaid Split>)

# to get a LibTree object,
tree.to_tree()

# the different nodes of the tree can be accessed like
first_node = tree.tree_store[0]

>>> first_node
([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))

# the properties of the node can be accessed like
>>> first_node.members
{1: 5, 2: 5}

# the properties of split can be accessed like
>>> first_node.split.p
0.001565402258002549
>>> first_node.split.score
10.0
```
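To see where `score=10.0`, `dof=1` and `p=0.001565402258` in the output above come from, here is a minimal sketch that recomputes a plain Pearson chi-squared test (no continuity correction) for column 'a' against 'd' using only the standard library. The helper name `chi_squared` is hypothetical and this is not the library's own code; the p-value shortcut via `erfc` only holds for one degree of freedom.

```python
from collections import Counter
from math import erfc, sqrt

# Column 'a' and dependent column 'd' from the example data above.
a = [1] * 5 + [2] * 5
d = [1] * 5 + [2] * 5

def chi_squared(x, y):
    """Pearson chi-squared statistic and degrees of freedom for two categorical lists."""
    n = len(x)
    observed = Counter(zip(x, y))          # missing cells count as 0
    x_totals, y_totals = Counter(x), Counter(y)
    score = 0.0
    for x_value, x_count in x_totals.items():
        for y_value, y_count in y_totals.items():
            expected = x_count * y_count / n
            score += (observed[(x_value, y_value)] - expected) ** 2 / expected
    dof = (len(x_totals) - 1) * (len(y_totals) - 1)
    return score, dof

score, dof = chi_squared(a, d)
# For dof == 1 the chi-squared survival function reduces to erfc(sqrt(score / 2)).
p = erfc(sqrt(score / 2))
print(score, dof, p)  # score 10.0, dof 1, p ≈ 0.0015654
```

Every expected cell count is 2.5 while the observed counts are 5, 0, 0 and 5, so each cell contributes 2.5 to the score.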

Creating a Tree using Bartlett's or Levene's Significance Test for Continuous Variables

When the dependent variable is continuous, the chi-squared test does not work due to very low frequencies of values across subgroups. As a consequence, and because the F-test is very susceptible to deviations from normality, the normality of the dependent set is determined and Bartlett's test for significance is used when the data is normally distributed (although the subgroups may not necessarily be so) or Levene's test is used when the data is non-normal.
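As a rough illustration of what Bartlett's test measures, the sketch below computes the statistic by hand for two subgroups using only the standard library. The function name and the sample values are hypothetical, the normality check and Levene's alternative are not shown, and this is not the library's implementation. With two groups the statistic is compared against a chi-squared distribution with one degree of freedom, which is why `erfc` is enough for the p-value.

```python
from math import erfc, log, sqrt
from statistics import variance

def bartlett_two_groups(x, y):
    """Bartlett's test for equal variances between two samples (k = 2, so dof = 1)."""
    groups = (x, y)
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * log(pooled) - sum(
        (n - 1) * log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    statistic = max(0.0, numerator / correction)  # guard against tiny negative rounding
    p_value = erfc(sqrt(statistic / 2))           # chi-squared survival function, dof = 1
    return statistic, p_value

# Hypothetical dependent-variable values for two candidate child nodes.
low_spread = [10.0, 11.0, 9.0, 10.5, 9.5]
high_spread = [10.0, 20.0, 0.0, 15.0, 5.0]
statistic, p_value = bartlett_two_groups(low_spread, high_spread)
print(statistic, p_value)
```

A small p-value indicates that the subgroups differ in variance, which is the signal used here in place of a chi-squared test on frequencies.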

```python
from CHAID import Tree
import numpy as np
import pandas as pd

# create the data
ndarr = np.array(([1, 2, 3] * 5) + ([2, 2, 3] * 5)).reshape(10, 3)
df = pd.DataFrame(ndarr)
df.columns = ['a', 'b', 'c']
df['d'] = np.random.normal(300, 100, 10)
independent_variable_columns = ['a', 'b', 'c']
dep_variable = 'd'

>>> df
   a  b  c           d
0  1  2  3  262.816747
1  1  2  3  240.139085
2  1  2  3  204.224083
3  1  2  3  231.024752
4  1  2  3  263.176338
5  2  2  3  440.371621
6  2  2  3  221.762452
7  2  2  3  197.290268
8  2  2  3  275.925549
9  2  2  3  238.471850

# create the Tree via pandas
tree = Tree.from_pandas_df(df, dict(zip(independent_variable_columns, ['nominal'] * 3)), dep_variable, dep_variable_type='continuous')

# print the tree (though not enough power to split)
>>> tree.print_tree()
([], {'s.t.d': 86.562258585515579, 'mean': 297.52027436303212}, <Invalid Chaid Split>)
```

Parameters

df: Pandas DataFrame
i_variables: Dict<string, string>: Independent variable column names as keys and the type as the values (nominal or ordinal)
d_variable: String: Dependent variable column name
opts: {}:
- alpha_merge: Float (default = 0.05): If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha_merge value, the least significant predictor categories are merged and the splitting of the node is attempted with the newly formed categories
- max_depth: Integer (default = 2): The maximum depth of the tree
- min_parent_node_size: Float (default = 30): The minimum number of respondents required for a split to occur on a particular node
- min_child_node_size: Float (default = 0): If splitting a node would produce a child node with fewer than min_child_node_size cases, that child is merged with the most similar child node, as measured by the largest p-value. If only one child node would remain after merging, the node is not split.
- max_splits: Integer or None (default = None): If specified, child nodes will continue to be merged until the number of splits at a single node is at max equal to max_splits. If not specified, this will be ignored.
- split_threshold: Float (default = 0): The split threshold when bucketing root node surrogate splits
- weight: String (default = None): The name of the weight column
- dep_variable_type (default = categorical, other_options = continuous): Whether the dependent variable is 'categorical' or 'continuous'

Running from the Command Line

You can play around with the repo by cloning and running this from the command line:

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05

It calls the print_tree() method, which prints the tree to terminal:

```python
([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, chi=365.886947811, groups=[['female'], ['male']]))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, chi=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, chi=16.4413525404, groups=[['C'], ['Q', 'S']]))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
```

or to test the continuous dependent variable case:

python -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous

```python
([], {'s.t.d': 51.727293077231302, 'mean': 33.270043468296414}, (embarked, p=8.46027456424e-24, score=55.3476155546, groups=[['C'], ['Q', '<missing>'], ['S']]), dof=1308))
├── (['C'], {'s.t.d': 84.029951444532529, 'mean': 62.336267407407405}, (sex, p=0.0293299541476, score=4.7994643184, groups=[['female'], ['male']]), dof=269))
│   ├── (['female'], {'s.t.d': 90.687664523113241, 'mean': 81.12853982300885}, <Invalid Chaid Split>)
│   └── (['male'], {'s.t.d': 76.07029674707077, 'mean': 48.810619108280257}, <Invalid Chaid Split>)
├── (['Q', '<missing>'], {'s.t.d': 15.902095006812658, 'mean': 13.490467999999998}, <Invalid Chaid Split>)
└── (['S'], {'s.t.d': 37.066877311088625, 'mean': 27.388825164113786}, (sex, p=3.43875930713e-07, score=26.3745361415, groups=[['female'], ['male']]), dof=913))
    ├── (['female'], {'s.t.d': 48.971933059814894, 'mean': 39.339305154639177}, <Invalid Chaid Split>)
    └── (['male'], {'s.t.d': 28.242580058030033, 'mean': 21.806819261637241}, <Invalid Chaid Split>)
```

Note that the frequency of the dependent variable is replaced with the standard deviation and mean of the continuous set at each node and that any NaNs in the dependent set are automatically converted to 0.0.
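Groupings such as `['Q', 'S']` in the trees above come from the merge step controlled by `--alpha-merge`. The sketch below is a simplified illustration of that idea for a binary dependent variable, not the library's code: it repeatedly merges the pair of categories whose difference is least significant until every remaining pair differs at the `alpha_merge` level. The function names and the per-category counts are hypothetical, and real CHAID adds further steps (for example handling ordinal predictors) that are omitted here.

```python
from itertools import combinations
from math import erfc, sqrt

def pair_p_value(a, b):
    """Pearson chi-squared p-value (dof = 1) for two categories' [class 0, class 1] counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return erfc(sqrt(chi2 / 2))

def merge_categories(counts, alpha_merge=0.05):
    """Merge the least significantly different pair until all pairs differ at alpha_merge."""
    groups = {(name,): list(c) for name, c in counts.items()}
    while len(groups) > 1:
        (g1, g2), p = max(
            ((pair, pair_p_value(groups[pair[0]], groups[pair[1]]))
             for pair in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if p <= alpha_merge:
            break  # even the most similar pair differs significantly
        merged = [x + y for x, y in zip(groups.pop(g1), groups.pop(g2))]
        groups[g1 + g2] = merged
    return groups

# Hypothetical [died, survived] counts per embarkation category (not the Titanic data).
counts = {"C": [10, 90], "Q": [35, 65], "S": [70, 130]}
merged_groups = merge_categories(counts)
print(merged_groups)  # {('C',): [10, 90], ('Q', 'S'): [105, 195]}
```

In this made-up data Q and S have the same survival proportion, so they are merged, while C stays separate because it differs significantly from the merged group.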

Generating Splitting Rules

Append --rules to the CLI command, or call tree.classification_rules(node). Pass in a node to get the rules for that node; if node is None, all splitting rules are returned.

python -m CHAID tests/data/titanic.csv fare sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous --rules

```python
{'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['C']}]}
{'node': 3, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['C']}]}
{'node': 4, 'rules': [{'variable': 'embarked', 'data': ['Q', '<missing>']}]}
{'node': 6, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['S']}]}
{'node': 7, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['S']}]}
```
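One way to use rules in this shape is to assign a new record to a terminal node. The following is a minimal sketch that assumes the rules are plain dictionaries exactly as printed above; the helper `node_for` is hypothetical and not part of the library.

```python
# Rules copied from the --rules output above.
RULES = [
    {'node': 2, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['C']}]},
    {'node': 3, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['C']}]},
    {'node': 4, 'rules': [{'variable': 'embarked', 'data': ['Q', '<missing>']}]},
    {'node': 6, 'rules': [{'variable': 'sex', 'data': ['female']}, {'variable': 'embarked', 'data': ['S']}]},
    {'node': 7, 'rules': [{'variable': 'sex', 'data': ['male']}, {'variable': 'embarked', 'data': ['S']}]},
]

def node_for(record, rules=RULES):
    """Return the id of the first node whose conditions all match the record, else None."""
    for rule in rules:
        if all(record.get(cond['variable']) in cond['data'] for cond in rule['rules']):
            return rule['node']
    return None

print(node_for({'sex': 'female', 'embarked': 'C'}))  # 2
print(node_for({'sex': 'male', 'embarked': 'Q'}))    # 4
```

Each rule is a conjunction: a record belongs to a node only if every listed variable takes one of the listed values.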

Parameters

Run python -m CHAID -h to see description of command line arguments

How to Read the Tree

We'll start with a real world example using the titanic dataset.

First make sure to install all required packages:

```bash
python setup.py install && pip install ipdb
```

Then place an ipdb breakpoint in __main__.py immediately after the tree is created (line 85 in the listing below, marked with an arrow) and run:

```bash
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05
```

The parameters set the maximum depth to 4 levels and the minimum parent node size to 2, and merge groups if the p-value is greater than 0.05 when comparing them.

```python
     82     tree = Tree.from_pandas_df(data, independent_variables,
     83                                nspace.dependent_variable[0],
     84                                variable_types=types, **config)
---> 85     import ipdb; ipdb.set_trace()
     86
     87     if nspace.classify:
     88         predictions = pd.Series(tree.node_predictions())
     89         predictions.name = 'node_id'
     90         data = pd.concat([data, predictions], axis=1)
     91         print(data.to_csv())
     92     elif nspace.predict:
```

Running tree.print_tree() gives:

```python
([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
```

The first line is the root node, which contains all the data. The vertical bars originating from a node represent the paths to that node's children.

Running tree.tree_store will give you a list of all the nodes in the tree:

```python
[
  ([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)),
  (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1)),
  (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>),
  (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>),
  (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1)),
  (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>),
  (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
]
```

So let's inspect the root node tree.tree_store[0]:

```python
([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
```

Nodes have certain properties. First, they show the choices that lead to this node, i.e. the categories of the parent's split column that the node contains (for the root node this is empty: []). The second property, {0: 809, 1: 500}, shows the members of that node and represents the current frequency of each value of the dependent variable. In this case, it covers all the answers in the 'survived' column, as that was the first column passed to the program on the command line (python -m CHAID tests/data/titanic.csv survived). The next property represents the split of the node: which column was chosen to make the split (in this case, sex), the p-value of the split, the chi-squared score and, most importantly, which categories of sex form the new nodes, along with the degrees of freedom associated with that split (1, in this case).

These properties can be accessed:

```python
ipdb> root_node = tree.tree_store[0]
ipdb> root_node.choices
[]
ipdb> root_node.members
{0: 809, 1: 500}
ipdb> root_node.split
(sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1)
```

The split variable can be further inspected:

```python
ipdb> split = root_node.split
ipdb> split.column
'sex'
ipdb> split.p
1.4714531016922664e-81
ipdb> split.score
365.88694781112048
ipdb> split.dof
1
ipdb> split.groupings
"[['female'], ['male']]"
```

Therefore, in this example, the root node is split on the column 'sex' in the data, separating the females from the males. The females and males each form a new node, and further down, the all-male and all-female nodes are each split on the column 'embarked' (although they needn't split on the same column). An <Invalid Chaid Split> is reached when either the node is pure (only one value of the dependent variable remains) or a terminating parameter is met (e.g. minimum node size or maximum depth; see the tree parameters above).

The conclusion drawn from this tree is: "Gender was the most important factor driving the survival of people on the Titanic: females had a much higher likelihood of surviving (in the survived column, 1 means they survived and 0 means they died). Of those females, those who embarked at Cherbourg (embarked = 'C', node 2) had a much higher likelihood of surviving."
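The conclusion can be checked directly from the member counts in the printed tree. This short sketch turns each node's `{0: ..., 1: ...}` counts into a survival rate; the labels and the helper name `survival_rate` are made up for illustration, and the counts are copied from the tree output above.

```python
# Member counts copied from the printed tree: {0: died, 1: survived}.
nodes = {
    "all passengers": {0: 809, 1: 500},
    "female": {0: 127, 1: 339},
    "female, embarked C or <missing>": {0: 11, 1: 104},
    "female, embarked Q or S": {0: 116, 1: 235},
    "male": {0: 682, 1: 161},
}

def survival_rate(members):
    """Share of the node's members whose dependent value is 1 (survived)."""
    return members[1] / sum(members.values())

for label, members in nodes.items():
    print(f"{label}: {survival_rate(members):.1%}")
```

This gives roughly 73% survival for females against 19% for males, and about 90% for females in the 'C' group against 67% for females in the 'Q'/'S' group.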

Exporting the tree

If you want to export the tree to a dot file, then use:

```python
tree.to_tree()
```

This creates a treelib tree, which has a .to_graphviz() method.

To graph the CHAID tree visually, you'll need to install two more libraries that aren't distributed via PyPI:

- graphviz - see its documentation for platform-specific installation
- orca - see its README.md for platform-specific installation

You can export the tree to .gv and .png using:

```python
tree.render(path=None, view=False)
```

This saves the tree to the file specified by path; with view=True, the result is opened for viewing immediately.

This can also be triggered from the command line using --export or --export-path. The former causes it to be stored in a newly created trees folder and the latter specifies the location of the file. Both will trigger an auto-viewing of the tree. E.g:

```bash
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export
```

```bash
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --export-path YOUR_PATH.gv
```

The output is a rendered image of the tree (not reproduced here).

Testing

CHAID uses pytest for its unit testing. The tests can be run from the root of a checkout with:

```bash
py.test
```

If you wish to run the unit tests across multiple Python versions to make sure your changes are compatible, run `tox` (or `detox` to run in parallel). You may need to run `pip install tox tox-pyenv detox && brew install pyenv` beforehand.

Caveats

Unlike SPSS, this library doesn't modify the data internally. This means that weight variables aren't rounded as they are in SPSS.
Every row is valid, even if all values are NaN or undefined. This is different to SPSS where in the weighted case it will strip out all rows if all the independent variables are NaN

Upcoming Features

Accuracy Estimation using Machine Learning techniques on the data
Binning of continuous independent variables

Generating the CHANGELOG.md

gem install github_changelog_generator && github_changelog_generator --exclude-labels maintenance,refactor,testing

Owner

Name: Mark Ramotowski
Login: Rambatino
Kind: user
Location: London, United Kingdom
Company: @me

Repositories: 8
Profile: https://github.com/Rambatino

Algoholic at @axiomhq

GitHub Events

Total

Issues event: 4
Watch event: 7
Issue comment event: 8
Fork event: 4

Last Year

Issues event: 4
Watch event: 7
Issue comment event: 8
Fork event: 4

Committers

Last synced: 9 months ago

All Time

Total Commits: 224
Total Committers: 5
Avg Commits per committer: 44.8
Development Distribution Score (DDS): 0.299

Past Year

Commits: 2
Committers: 2
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.5

Top Committers

Name	Email	Commits
Mark Ramotowski	m**k@i**m	157
Richard Fitzgerald	r**d@i**m	55
Mark Ramotowski	m**i@g**m	6
Jihae Hwang	j****r	4
Martijn Pieters	g**m@z**m	2

Committer Domains (Top 20 + Academic)

intellectionsoftware.com: 2
zopatista.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 51
Total pull requests: 59
Average time to close issues: 2 months
Average time to close pull requests: 17 days
Total issue authors: 37
Total pull request authors: 6
Average comments per issue: 3.75
Average comments per pull request: 1.8
Merged pull requests: 54
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 14 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 5.5
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

asram6 (4)
diogoalvesderesende (4)
Rambatino (4)
xulaus (3)
mjpieters (2)
Ranji321 (2)
divatemangesh (2)
surendraatgithub-zz (1)
rleiva (1)
XinyuanHu (1)
rburroughs720 (1)
sriramab (1)
KamilGos (1)
appleyuchi (1)
1DanielG (1)

Pull Request Authors

Rambatino (39)
xulaus (13)
jihaekor (5)
mjpieters (2)
kenny-devarapalli (1)
ralic (1)

Top Labels

Issue Labels

help wanted (3) question (3) testing (1) bug (1) v4 changes (1)

Pull Request Labels

enhancement (14) bug (13) maintenance (5) refactor (4) testing (2) v4 changes (1)

Packages

Total packages: 1
Total downloads:
- pypi 7,115 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 37
Total maintainers: 1

pypi.org: chaid

A CHAID tree building algorithm

Homepage: https://github.com/Rambatino/CHAID
Documentation: https://chaid.readthedocs.io/
License: Apache License 2.0
Latest release: 5.4.2
published over 1 year ago

Versions: 37
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 7,115 Last month

Rankings

Downloads: 5.4%

Forks count: 5.8%

Stargazers count: 5.9%

Average: 9.7%

Dependent packages count: 10.0%

Dependent repos count: 21.7%

Maintainers (1)

Rambatino

Last synced: 6 months ago

chaid

Science Score: 13.0%
