mltools

Exploratory and diagnostic machine learning tools for R

https://github.com/ben519/mltools

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

exploratory-data-analysis machine-learning r
Last synced: 6 months ago · JSON representation

Repository

Exploratory and diagnostic machine learning tools for R

Basic Info
  • Host: GitHub
  • Owner: ben519
  • License: other
  • Language: R
  • Default Branch: master
  • Size: 172 KB
Statistics
  • Stars: 73
  • Watchers: 11
  • Forks: 26
  • Open Issues: 8
  • Releases: 0
Topics
exploratory-data-analysis machine-learning r
Created over 9 years ago · Last pushed over 4 years ago
Metadata Files
Readme License

README.md

mltools

Travis-CI Build Status

Exploratory and diagnostic machine learning tools for R

About

The goal of this package is multifold:

  • Speed up data preparation for feeding machine-learning models
  • Identify structure and patterns in a dataset
  • Evaluate the results of a machine-learning model

Installation

CRAN

r install.packages("mltools")

or Github (development version)

r install.packages("devtools") devtools::install_github("ben519/mltools")

Demonstration

Predict whether or not someone is an alien.

```r library(data.table) library(mltools)

Copy the toy datasets since they are locked from being modified

train <- copy(alientrain) test <- copy(alientest)

train SkinColor IQScore Cat1 Cat2 Cat3 IsAlien 1: green 300 type1 type1 type4 TRUE 2: white 95 type1 type2 type4 FALSE 3: brown 105 type2 type6 type11 FALSE 4: white 250 type4 type5 type2 TRUE 5: blue 115 type2 type7 type11 TRUE 6: white 85 type4 type5 type2 FALSE 7: green 130 type1 type2 type4 TRUE 8: white 115 type1 type1 type4 FALSE

test SkinColor IQScore Cat1 Cat2 Cat3 1: white 79 type4 type5 type2 2: green 100 type4 type5 type2 3: brown 125 type3 type9 type7 4: white 90 type1 type8 type4 5: red 115 type1 type2 type4 ```

Questions about the data:

  • Are there any pairs of categorical fields which are highly/perfectly correlated?
  • Are there any parent-child related categorical fields?
  • How does the target variable change with IQScore?
  • What's the cardinality and skewness of each feature?

```r

Combine train (excluding IsAlien) and test

alien.all <- rbind(train[, !"IsAlien", with=FALSE], test)

--------------------------------------------------

Check for correlated and hierarchical fields

gini_impurities(alien.all, wide=TRUE) # weighted conditional gini impurities Var1 Cat1 Cat2 Cat3 SkinColor 1: Cat1 0.0000000 0.3589744 0.0000000 0.4743590 2: Cat2 0.0000000 0.0000000 0.0000000 0.3461538 3: Cat3 0.0000000 0.3589744 0.0000000 0.4743590 4: SkinColor 0.4102564 0.5384615 0.4102564 0.0000000

(Cat1, Cat3) = (Cat3, Cat1) = 0 => Cat1 and Cat3 perfectly correspond to each other

(Cat1, Cat2) > 0 and (Cat2, Cat1) = 0 => Cat1-Cat2 exhibit a parent-child relationship.

You can guess Cat1 by knowing Cat2, but not vice-versa.

--------------------------------------------------

Check relationship between IQScore and IsAlien by binning IQScore into groups

train[, BinIQScore := bin_data(IQScore, bins=seq(0, 300, by=50))] IQScore BinIQScore 1: 300 [250, 300] 2: 95 [50, 100) 3: 105 [100, 150) 4: 250 [250, 300] 5: 115 [100, 150) 6: 85 [50, 100) 7: 130 [100, 150) 8: 115 [100, 150)

train[, list(Samples=.N, IQScore=mean(IQScore)), keyby=BinIQScore] BinIQScore Samples IQScore 1: [50, 100) 2 90.00 2: [100, 150) 4 116.25 3: [250, 300] 2 275.00

Remove column BinIQScore

train[, BinIQScore := NULL]

--------------------------------------------------

Check skewness of fields

skewness(alien.all) $SkinColor SkinColor Count Pcnt 1: white 6 0.46153846 2: green 3 0.23076923 3: brown 2 0.15384615 4: blue 1 0.07692308 5: red 1 0.07692308

$Cat1 Cat1 Count Pcnt 1: type1 6 0.46153846 2: type4 4 0.30769231 3: type2 2 0.15384615 4: type3 1 0.07692308 ... ```

Preparing for ML model

  • Cateogrical fields in train and test should be factors with the same levels
  • Split the training dataset to do cross validation
  • Convert datasets to sparses matrices

```r set.seed(711)

--------------------------------------------------

Set SkinColor as a factor, such that it has the same levels in train and test

Set low frequency skin colors (1 or fewer occurences) as "other"

skincolors <- list(train$SkinColor, test$SkinColor) skincolors <- set_factor(skincolors, aggregationThreshold=1) train[, SkinColor := skincolors[[1]] ] # update train with the new values test[, SkinColor := skincolors[[2]] ] # update test with the new values

Repeat the process above for other categorical fields (without setting low freq. values as "other")

for(col in c("Cat1", "Cat2", "Cat3")){ vals <- list(train[[col]], test[[col]]) vals <- set_factor(vals) set(train, j=col, value=vals[[1]]) set(test, j=col, value=vals[[2]]) }

--------------------------------------------------

Randomly split the training data into 2 equally sized datasets

Partition train into two folds, stratified by IsAlien

train[, FoldID := folds(IsAlien, nfolds=2, stratified=TRUE, seed=2016)]

cvtrain <- train[FoldID==1, !"FoldID"] SkinColor IQScore Cat1 Cat2 Cat3 IsAlien 1: green 300 type1 type1 type4 TRUE 2: brown 105 type2 type6 type11 FALSE 3: green 130 type1 type2 type4 TRUE 4: white 115 type1 type1 type4 FALSE

cvtest <- train[FoldID==2, !"FoldID"] SkinColor IQScore Cat1 Cat2 Cat3 IsAlien 1: white 95 type1 type2 type4 FALSE 2: white 250 type4 type5 type2 TRUE 3: other 115 type2 type7 type11 TRUE 4: white 85 type4 type5 type2 FALSE

--------------------------------------------------

Convert cvtrain and cvtest to sparse matrices

Note that unordered factors are one-hot-encoded

library(Matrix)

cvtrain.sparse <- sparsify(cvtrain) 4 x 21 sparse Matrix of class "dgCMatrix" SkinColor_other SkinColorbrown SkinColorgreen SkinColorwhite IQScore Cat1type1 ... [1,] . . 1 . 300 1 [2,] . 1 . . 105 . [3,] . . 1 . 130 1 [4,] . . . 1 115 1

cvtest.sparse <- sparsify(cvtest) 4 x 21 sparse Matrix of class "dgCMatrix" SkinColor_other SkinColorbrown SkinColorgreen SkinColorwhite IQScore Cat1type1 ... [1,] . . . 1 95 1 [2,] . . . 1 250 . [3,] 1 . . . 115 . [4,] . . . 1 85 . ```

Evaluate model

  • What was the model's AUC ROC score?
  • How good was the model's predictions for each sample?

```r

--------------------------------------------------

Naive model that guesses someone is an alien if their IQScore is > 130

cvtest[, Prediction := ifelse(IQScore > 130, TRUE, FALSE)]

--------------------------------------------------

Evaluate predictions

Area Under the ROC Curve (AUC ROC)

auc_roc(preds=cvtest$Prediction, actuals=cvtest$IsAlien) 0.75

Individual scores to determine which predictions were good/bad (see help(roc_scores) for details)

cvtest[, ROCScore := rocscores(preds=Prediction, actuals=IsAlien)] cvtest[order(ROCScore)] SkinColor IQScore Cat1 Cat2 Cat3 IsAlien Prediction ROCScore 1: white 95 type1 type2 type4 FALSE FALSE 0.0000000 2: white 250 type4 type5 type2 TRUE TRUE 0.0000000 3: white 85 type4 type5 type2 FALSE FALSE 0.0000000 4: _other 115 type2 type7 type11 TRUE FALSE 0.1666667 ```

Contact

If you'd like to contact me regarding bugs, questions, or general consulting, feel free to drop me a line - bgorman519@gmail.com

Support

Found this package helpful? Show your support and buy some merch!

Owner

  • Name: Ben
  • Login: ben519
  • Kind: user
  • Location: New Orleans, LA
  • Company: GormAnalysis

Data Scientist and Founder of GormAnalysis

GitHub Events

Total
Last Year

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 175
  • Total Committers: 3
  • Avg Commits per committer: 58.333
  • Development Distribution Score (DDS): 0.017
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Ben Gorman b****9@g****m 172
Michael Chirico m****4@g****m 2
zane z****r@v****u 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 14
  • Total pull requests: 6
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 month
  • Total issue authors: 5
  • Total pull request authors: 6
  • Average comments per issue: 1.21
  • Average comments per pull request: 2.83
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ben519 (8)
  • mikoontz (2)
  • cregouby (2)
  • fatimamb (1)
  • S-UP (1)
Pull Request Authors
  • pford221 (1)
  • shuckle16 (1)
  • cycks (1)
  • fanli-gcb (1)
  • andredd (1)
  • MichaelChirico (1)
Top Labels
Issue Labels
bug (5) enhancement (2)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 2,007 last-month
  • Total docker downloads: 42,045
  • Total dependent packages: 8
    (may contain duplicates)
  • Total dependent repositories: 13
    (may contain duplicates)
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: mltools

Machine Learning Tools

  • Versions: 8
  • Dependent Packages: 8
  • Dependent Repositories: 13
  • Downloads: 2,007 Last month
  • Docker Downloads: 42,045
Rankings
Forks count: 2.8%
Stargazers count: 5.0%
Dependent packages count: 6.1%
Downloads: 7.0%
Dependent repos count: 8.0%
Average: 8.7%
Docker downloads count: 23.1%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-mltools
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 29.4%
Stargazers count: 33.5%
Dependent repos count: 34.0%
Average: 37.0%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5.0 depends
  • Matrix * imports
  • data.table >= 1.9.7 imports
  • methods * imports
  • stats * imports
  • testthat * suggests