mltools

Exploratory and diagnostic machine learning tools for R

https://github.com/ben519/mltools

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.7%) to scientific vocabulary

Keywords

exploratory-data-analysis machine-learning r

Last synced: 7 months ago

Repository

Basic Info

Host: GitHub
Owner: ben519
License: other
Language: R
Default Branch: master
Size: 172 KB

Statistics

Stars: 73
Watchers: 11
Forks: 26
Open Issues: 8
Releases: 0

Topics

exploratory-data-analysis machine-learning r

Created over 9 years ago · Last pushed over 4 years ago

Metadata Files

Readme License

mltools

Exploratory and diagnostic machine learning tools for R

About

The goal of this package is multifold:

Speed up data preparation for feeding machine-learning models
Identify structure and patterns in a dataset
Evaluate the results of a machine-learning model

Installation

CRAN

install.packages("mltools")

or GitHub (development version)

install.packages("devtools")
devtools::install_github("ben519/mltools")

Demonstration

Predict whether or not someone is an alien.

```r
library(data.table)
library(mltools)

# Copy the toy datasets since they are locked from being modified
train <- copy(alien_train)
test <- copy(alien_test)

train
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     green     300 type1 type1  type4    TRUE
2:     white      95 type1 type2  type4   FALSE
3:     brown     105 type2 type6 type11   FALSE
4:     white     250 type4 type5  type2    TRUE
5:      blue     115 type2 type7 type11    TRUE
6:     white      85 type4 type5  type2   FALSE
7:     green     130 type1 type2  type4    TRUE
8:     white     115 type1 type1  type4   FALSE

test
   SkinColor IQScore  Cat1  Cat2  Cat3
1:     white      79 type4 type5 type2
2:     green     100 type4 type5 type2
3:     brown     125 type3 type9 type7
4:     white      90 type1 type8 type4
5:       red     115 type1 type2 type4
```

Questions about the data:

Are there any pairs of categorical fields which are highly/perfectly correlated?
Are there any parent-child related categorical fields?
How does the target variable change with IQScore?
What's the cardinality and skewness of each feature?
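One common way to approach the third question is to bin the numeric feature and compare the target rate per bin. The following is a minimal base R sketch of that idea using `cut()`, not this package's own helpers. The vectors `iq` and `is_alien` are the IQScore and IsAlien columns of the toy training data, copied by hand for illustration.

```r
# IQScore and IsAlien from the toy training data (copied by hand)
iq <- c(300, 95, 105, 250, 115, 85, 130, 115)
is_alien <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

# Left-closed bins of width 50; include.lowest = TRUE closes the last bin on the right
bins <- cut(iq, breaks = seq(0, 300, by = 50), right = FALSE, include.lowest = TRUE)

# Share of aliens in each bin (empty bins yield NA and are dropped)
alien_rate <- tapply(is_alien, bins, mean)
alien_rate[!is.na(alien_rate)]
```

In this toy data the alien rate rises with the IQScore bin, which is the kind of pattern binning is meant to expose.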

```r
# Combine train (excluding IsAlien) and test
alien.all <- rbind(train[, !"IsAlien", with=FALSE], test)

#--------------------------------------------------
# Check for correlated and hierarchical fields

gini_impurities(alien.all, wide=TRUE)  # weighted conditional gini impurities
        Var1      Cat1      Cat2      Cat3 SkinColor
1:      Cat1 0.0000000 0.3589744 0.0000000 0.4743590
2:      Cat2 0.0000000 0.0000000 0.0000000 0.3461538
3:      Cat3 0.0000000 0.3589744 0.0000000 0.4743590
4: SkinColor 0.4102564 0.5384615 0.4102564 0.0000000

# (Cat1, Cat3) = (Cat3, Cat1) = 0 => Cat1 and Cat3 perfectly correspond to each other
# (Cat1, Cat2) > 0 and (Cat2, Cat1) = 0 => Cat1-Cat2 exhibit a parent-child relationship.
# You can guess Cat1 by knowing Cat2, but not vice-versa.

#--------------------------------------------------
# Check relationship between IQScore and IsAlien by binning IQScore into groups

train[, BinIQScore := bin_data(IQScore, bins=seq(0, 300, by=50))]
   IQScore BinIQScore
1:     300 [250, 300]
2:      95  [50, 100)
3:     105 [100, 150)
4:     250 [250, 300]
5:     115 [100, 150)
6:      85  [50, 100)
7:     130 [100, 150)
8:     115 [100, 150)

train[, list(Samples=.N, IQScore=mean(IQScore)), keyby=BinIQScore]
   BinIQScore Samples IQScore
1:  [50, 100)       2   90.00
2: [100, 150)       4  116.25
3: [250, 300]       2  275.00

# Remove column BinIQScore
train[, BinIQScore := NULL]

#--------------------------------------------------
# Check skewness of fields

skewness(alien.all)
$SkinColor
   SkinColor Count       Pcnt
1:     white     6 0.46153846
2:     green     3 0.23076923
3:     brown     2 0.15384615
4:      blue     1 0.07692308
5:       red     1 0.07692308

$Cat1
    Cat1 Count       Pcnt
1: type1     6 0.46153846
2: type4     4 0.30769231
3: type2     2 0.15384615
4: type3     1 0.07692308
...
```
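To make the "weighted conditional gini impurities" table easier to read, here is a hand-rolled base R sketch of the quantity it appears to report: the Gini impurity of one field computed within each group of another field, averaged with group-size weights. This is an illustration under that assumption, not the package's implementation. The helper names `gini` and `conditional_gini` are hypothetical, and `cat1` and `cat2` are the Cat1 and Cat2 columns of the combined train and test rows, copied by hand.

```r
# Gini impurity of a single categorical vector
gini <- function(x) {
  p <- table(x) / length(x)
  1 - sum(p^2)
}

# Impurity of `target` within each group of `given`, weighted by group size
conditional_gini <- function(target, given) {
  groups <- split(target, given)
  weights <- lengths(groups) / length(target)
  sum(weights * vapply(groups, gini, numeric(1)))
}

# Cat1 and Cat2 for the 8 train rows followed by the 5 test rows
cat1 <- c("type1", "type1", "type2", "type4", "type2", "type4", "type1", "type1",
          "type4", "type4", "type3", "type1", "type1")
cat2 <- c("type1", "type2", "type6", "type5", "type7", "type5", "type2", "type1",
          "type5", "type5", "type9", "type8", "type2")

conditional_gini(cat2, cat1)  # 0.3589744: Cat2 is still uncertain once Cat1 is known
conditional_gini(cat1, cat2)  # 0: Cat1 is fully determined by Cat2
```

The two values match the (Cat1, Cat2) and (Cat2, Cat1) entries of the table, which suggests the entry in row A, column B is the impurity of B within groups of A.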

Preparing for ML model

Categorical fields in train and test should be factors with the same levels
Split the training dataset to do cross validation
Convert datasets to sparse matrices
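The first requirement can be illustrated without the package. The base R sketch below gives two character vectors one shared set of factor levels and optionally folds rare values into a placeholder level. It is an assumption-laden illustration: the function name `harmonize_levels`, its `threshold` and `other` arguments, and the choice to count frequencies across both vectors combined are all choices made for this sketch, not a description of `set_factor`.

```r
harmonize_levels <- function(train_vals, test_vals, threshold = 0, other = "_other_") {
  # Count each value across both vectors combined (an assumption of this sketch)
  counts <- table(c(train_vals, test_vals))
  rare <- names(counts)[counts <= threshold]

  # Replace rare values with the placeholder label
  recode <- function(v) ifelse(v %in% rare, other, v)
  train_vals <- recode(train_vals)
  test_vals <- recode(test_vals)

  # Build both factors from one shared level set
  shared_levels <- sort(unique(c(train_vals, test_vals)))
  list(
    train = factor(train_vals, levels = shared_levels),
    test = factor(test_vals, levels = shared_levels)
  )
}

# SkinColor values from the toy train and test sets (copied by hand)
train_skin <- c("green", "white", "brown", "white", "blue", "white", "green", "white")
test_skin <- c("white", "green", "brown", "white", "red")

skin <- harmonize_levels(train_skin, test_skin, threshold = 1)
levels(skin$train)
levels(skin$test)
```

Sharing one level set matters because one-hot encoding train and test separately would otherwise produce matrices with different columns.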

```r
set.seed(711)

#--------------------------------------------------
# Set SkinColor as a factor, such that it has the same levels in train and test
# Set low frequency skin colors (1 or fewer occurrences) as "_other_"

skincolors <- list(train$SkinColor, test$SkinColor)
skincolors <- set_factor(skincolors, aggregationThreshold=1)
train[, SkinColor := skincolors[[1]] ]  # update train with the new values
test[, SkinColor := skincolors[[2]] ]  # update test with the new values

# Repeat the process above for other categorical fields (without setting low freq. values as "_other_")
for(col in c("Cat1", "Cat2", "Cat3")){
  vals <- list(train[[col]], test[[col]])
  vals <- set_factor(vals)
  set(train, j=col, value=vals[[1]])
  set(test, j=col, value=vals[[2]])
}

#--------------------------------------------------
# Randomly split the training data into 2 equally sized datasets

# Partition train into two folds, stratified by IsAlien
train[, FoldID := folds(IsAlien, nfolds=2, stratified=TRUE, seed=2016)]

cvtrain <- train[FoldID==1, !"FoldID"]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     green     300 type1 type1  type4    TRUE
2:     brown     105 type2 type6 type11   FALSE
3:     green     130 type1 type2  type4    TRUE
4:     white     115 type1 type1  type4   FALSE

cvtest <- train[FoldID==2, !"FoldID"]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     white      95 type1 type2  type4   FALSE
2:     white     250 type4 type5  type2    TRUE
3:   _other_     115 type2 type7 type11    TRUE
4:     white      85 type4 type5  type2   FALSE

#--------------------------------------------------
# Convert cvtrain and cvtest to sparse matrices
# Note that unordered factors are one-hot-encoded

library(Matrix)

cvtrain.sparse <- sparsify(cvtrain)
4 x 21 sparse Matrix of class "dgCMatrix"
     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...
[1,]                 .               .               1               .     300          1
[2,]                 .               1               .               .     105          .
[3,]                 .               .               1               .     130          1
[4,]                 .               .               .               1     115          1

cvtest.sparse <- sparsify(cvtest)
4 x 21 sparse Matrix of class "dgCMatrix"
     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...
[1,]                 .               .               .               1      95          1
[2,]                 .               .               .               1     250          .
[3,]                 1               .               .               .     115          .
[4,]                 .               .               .               1      85          .
```
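For readers unfamiliar with stratified splitting, the idea behind the fold assignment above can be sketched in base R: shuffle the row indices within each class, then deal them out to folds in turn so every fold receives a similar class mix. This is a generic illustration with a hypothetical function name `stratified_folds`. It is not the package's `folds` implementation and will not necessarily reproduce the same assignment for a given seed.

```r
stratified_folds <- function(y, nfolds, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  fold_id <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle positions within the class, then deal them out to folds in turn
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(nfolds), length(idx))
  }
  fold_id
}

# IsAlien from the toy training data (copied by hand)
is_alien <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
fold_id <- stratified_folds(is_alien, nfolds = 2, seed = 2016)

# Each fold should contain two aliens and two non-aliens
table(fold_id, is_alien)
```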

Evaluate model

What was the model's AUC ROC score?
How good were the model's predictions for each sample?

```r
#--------------------------------------------------
# Naive model that guesses someone is an alien if their IQScore is > 130

cvtest[, Prediction := ifelse(IQScore > 130, TRUE, FALSE)]

#--------------------------------------------------
# Evaluate predictions

# Area Under the ROC Curve (AUC ROC)
auc_roc(preds=cvtest$Prediction, actuals=cvtest$IsAlien)
0.75

# Individual scores to determine which predictions were good/bad (see help(roc_scores) for details)
cvtest[, ROCScore := roc_scores(preds=Prediction, actuals=IsAlien)]
cvtest[order(ROCScore)]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien Prediction  ROCScore
1:     white      95 type1 type2  type4   FALSE      FALSE 0.0000000
2:     white     250 type4 type5  type2    TRUE       TRUE 0.0000000
3:     white      85 type4 type5  type2   FALSE      FALSE 0.0000000
4:   _other_     115 type2 type7 type11    TRUE      FALSE 0.1666667
```
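The AUC value above can be checked by hand. The sketch below uses the standard rank-based (Mann-Whitney) formulation of AUC in base R: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function name `auc_by_rank` is hypothetical and the code is an independent cross-check, not the package's `auc_roc` source. The two vectors are the Prediction and IsAlien columns of `cvtest`, copied by hand.

```r
auc_by_rank <- function(preds, actuals) {
  # Average ranks so tied predictions count as half a win each
  r <- rank(as.numeric(preds))
  n_pos <- sum(actuals)
  n_neg <- sum(!actuals)
  (sum(r[actuals]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Prediction and IsAlien for the four cvtest rows
preds <- c(FALSE, TRUE, FALSE, FALSE)
actuals <- c(FALSE, TRUE, TRUE, FALSE)

auc_by_rank(preds, actuals)  # 0.75
```

Of the four positive-negative pairs, two are ranked correctly and two are ties, giving (2 + 2 × 0.5) / 4 = 0.75, in agreement with the output shown above.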

Contact

If you'd like to contact me regarding bugs, questions, or general consulting, feel free to drop me a line - bgorman519@gmail.com

Support

Found this package helpful? Show your support and buy some merch!

Owner

Name: Ben
Login: ben519
Kind: user
Location: New Orleans, LA
Company: GormAnalysis

Website: https://gormanalysis.com/
Repositories: 39
Profile: https://github.com/ben519

Data Scientist and Founder of GormAnalysis

Committers

Last synced: 10 months ago

All Time

Total Commits: 175
Total Committers: 3
Avg Commits per committer: 58.333
Development Distribution Score (DDS): 0.017

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ben Gorman	b**9@g**m	172
Michael Chirico	m**4@g**m	2
zane	z**r@v**u	1

Committer Domains (Top 20 + Academic)

vols.utk.edu: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 14
Total pull requests: 6
Average time to close issues: about 1 month
Average time to close pull requests: about 1 month
Total issue authors: 5
Total pull request authors: 6
Average comments per issue: 1.21
Average comments per pull request: 2.83
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ben519 (8)
mikoontz (2)
cregouby (2)
fatimamb (1)
S-UP (1)

Pull Request Authors

pford221 (1)
shuckle16 (1)
cycks (1)
fanli-gcb (1)
andredd (1)
MichaelChirico (1)

Top Labels

Issue Labels

bug (5) enhancement (2)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- cran 2,007 last-month
Total docker downloads: 42,045

Total dependent packages: 8
(may contain duplicates)
Total dependent repositories: 13
(may contain duplicates)
Total versions: 9
Total maintainers: 1

cran.r-project.org: mltools

Machine Learning Tools

Homepage: https://github.com/ben519/mltools
Documentation: http://cran.r-project.org/web/packages/mltools/mltools.pdf
License: MIT + file LICENSE
Latest release: 0.3.5 (published almost 8 years ago)

Versions: 8
Dependent Packages: 8
Dependent Repositories: 13
Downloads: 2,007 Last month
Docker Downloads: 42,045

Rankings

Forks count: 2.8%

Stargazers count: 5.0%

Dependent packages count: 6.1%

Downloads: 7.0%

Dependent repos count: 8.0%

Average: 8.7%

Docker downloads count: 23.1%

Maintainers (1)

bgorman@GormAnalysis.com

Last synced: 8 months ago

conda-forge.org: r-mltools

Homepage: https://github.com/ben519/mltools
License: MIT
Latest release: 0.3.5 (published almost 6 years ago)

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 29.4%

Stargazers count: 33.5%

Dependent repos count: 34.0%

Average: 37.0%

Dependent packages count: 51.2%

Last synced: 7 months ago

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
Matrix * imports
data.table >= 1.9.7 imports
methods * imports
stats * imports
testthat * suggests
