mlr3resampling

Resampling algorithms for mlr3 framework in R

https://github.com/tdhock/mlr3resampling

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Resampling algorithms for mlr3 framework in R

Basic Info
  • Host: GitHub
  • Owner: tdhock
  • Language: R
  • Default Branch: main
  • Size: 981 KB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 2
  • Open Issues: 8
  • Releases: 0
Created over 2 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog

README.org

mlr3resampling provides new cross-validation algorithms for the mlr3
framework in R

| [[file:tests/testthat][tests]]    | [[https://github.com/tdhock/mlr3resampling/actions][https://github.com/tdhock/mlr3resampling/workflows/R-CMD-check/badge.svg]] |
| [[https://github.com/jimhester/covr][coverage]] | [[https://app.codecov.io/gh/tdhock/mlr3resampling?branch=main][https://codecov.io/gh/tdhock/mlr3resampling/branch/main/graph/badge.svg]]  |

** Installation

#+begin_src R
  install.packages("mlr3resampling")#release version from CRAN
  ## OR: development version from GitHub:
  install.packages("remotes")
  remotes::install_github("tdhock/mlr3resampling")
#+end_src

** Description

For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]], the
[[https://arxiv.org/abs/2410.08643][SOAK arXiv paper]], and [[https://github.com/tdhock/mlr3resampling/wiki/Articles][other articles]].

*** SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer resamplers vignette]] and data viz for
[[https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/][regression]] and [[https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/][classification]].

A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set.  If each data
point belongs to a subset (such as geographic region, year, etc), then
how do we know if subsets are similar enough so that we can get
accurate predictions on one subset, after training on Other subsets?
And how do we know if training on All subsets would improve prediction
accuracy, relative to training on the Same subset?  SOAK,
Same/Other/All K-fold cross-validation, can be used to answer these
questions, by fixing a test subset, training models on Same/Other/All
subsets, and then comparing test error rates (Same versus Other and
Same versus All).

- subsets are similar (for the learning algorithm) if All is more
  accurate than Same, and the subset with more data (Same/Other) is
  more accurate.
- subsets are different (for the learning algorithm) if All/Other is
  less accurate than Same.
- Other can be just as bad as featureless baseline (or worse) when the
  subsets have different learnable/predictable patterns.
- Note that results depend on the data set and on the learning
  algorithm. For a given data set, a linear model may not be able to
  learn on one subset and accurately predict on the other, but a more
  complex learning algorithm may be able to.

This is implemented in =ResamplingSameOtherSizesCV= when you use it on
a task that defines the =subset= role, for example the Arizona trees
data, for which each row is a pixel in an image, and we want to
do binary classification -- does the pixel contain a tree or not?

#+begin_src R
> data(AZtrees,package="mlr3resampling")
> table(AZtrees$region3)

  NE   NW    S 
1464 1563 2929 
#+end_src

We see in the output above that the =region3= column has three values
(NE, NW, S). Each represents the region/area in which the pixel was
found. If we want good predictions in the south (S), can we train on
the north? (NE+NW) We can use the code below to setup the CV
experiment.  The rows 12,15,18 below represent splits that attempt to
answer that question (test.subset=S, train.subsets=other).

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y"  #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon"  #keep data together when splitting.
> task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
    test.subset train.subsets test.fold
                      
 1:          NE           all         1
 2:          NW           all         1
 3:           S           all         1
 4:          NE           all         2
 5:          NW           all         2
 6:           S           all         2
 7:          NE           all         3
 8:          NW           all         3
 9:           S           all         3
10:          NE         other         1
11:          NW         other         1
12:           S         other         1
13:          NE         other         2
14:          NW         other         2
15:           S         other         2
16:          NE         other         3
17:          NW         other         3
18:           S         other         3
19:          NE          same         1
20:          NW          same         1
21:           S          same         1
22:          NE          same         2
23:          NW          same         2
24:           S          same         2
25:          NE          same         3
26:          NW          same         3
27:           S          same         3
    test.subset train.subsets test.fold
#+end_src

The rows in the output above represent different kinds of splits:

- train.subsets=same is used as a baseline, to answer the question,
  "what is the baseline error rate, if we train on data from the same
  subset as test?" This imporant baseline allows us to
  see if the all and other models do better or worse.
- train.subsets=all is used to answer the question, "can we get
  accurate predictions on a given test subset, by training on data
  from all subsets?"
- train.subsets=other is used to answer the question, "can be get
  accurate predictions on a given test subset, by training using data
  from other subsets?"

Code to re-run:

#+begin_src R
  data(AZtrees,package="mlr3resampling")
  table(AZtrees$region3)
  same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
  task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
  task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
  task.obj$col_roles$stratum <- "y"  #keep data proportional when splitting.
  task.obj$col_roles$group <- "polygon"  #keep data together when splitting.
  task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
  same_other_sizes_cv$instantiate(task.obj)
  same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
#+end_src

*** Algorithm 2: cross-validation for comparing different sized train sets

See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer Resamplers vignette]] and data viz for
[[https://tdhock.github.io/2023-12-26-train-sizes-regression/][regression]] and [[https://tdhock.github.io/2023-12-27-train-sizes-classification/][classification]].

How many train samples are required to get accurate predictions on a
test set? Cross-validation can be used to answer this question, with
variable size train sets. For example consider the Arizona Trees data
below,

#+begin_src R
> dim(AZtrees)
[1] 5956   25
> length(unique(AZtrees$polygon))
[1] 189
#+end_src

The output above indicates we have 5956 rows and 189 polygons. We can
do cross-validation on either polygons (if task has =group= role) or
rows (if no =group= role set). The code below sets a down-sampling
=ratio= of 0.8, and four =sizes= of down-sampled train sets.

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$param_set$values$sizes <- 4
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y"  #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon"  #keep data together when splitting.
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
    n.train.groups test.fold
                  
 1:             51         1
 2:             64         1
 3:             80         1
 4:            100         1
 5:            126         1
 6:             51         2
 7:             64         2
 8:             80         2
 9:            100         2
10:            126         2
11:             51         3
12:             64         3
13:             80         3
14:            100         3
15:            126         3
#+end_src

The output above has one row per train/test split that will be
computed in the cross-validation experiment. The full train set size
is 126 polygons, and there are four smaller train set sizes (each a
factor of 0.8 smaller). Each train set size will be computed for each
fold ID from 1 to 3.

Code to re-run:

#+begin_src R
  data(AZtrees,package="mlr3resampling")
  dim(AZtrees)
  length(unique(AZtrees$polygon))
  same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
  same_other_sizes_cv$param_set$values$sizes <- 4
  same_other_sizes_cv$param_set$values$ratio <- 0.8
  task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
  task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
  task.obj$col_roles$stratum <- "y"  #keep data proportional when splitting.
  task.obj$col_roles$group <- "polygon"  #keep data together when splitting.
  same_other_sizes_cv$instantiate(task.obj)
  same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
#+end_src

** Related work

- mlr3resampling code was copied/modified from Resampling and
  ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- As of Oct 2024, scikit-learn in python implements support for groups
  via [[https://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold][GroupKFold]] (keeping samples together when splitting) but not
  subsets (test data come from one subset, train data come from
  Same/Other/All subsets).

Owner

  • Name: Toby Dylan Hocking
  • Login: tdhock
  • Kind: user
  • Location: Flagstaff, AZ
  • Company: Northern Arizona University

GitHub Events

Total
  • Issues event: 18
  • Watch event: 1
  • Issue comment event: 30
  • Push event: 83
  • Pull request review comment event: 2
  • Pull request review event: 3
  • Pull request event: 14
  • Gollum event: 2
  • Fork event: 1
  • Create event: 8
Last Year
  • Issues event: 18
  • Watch event: 1
  • Issue comment event: 30
  • Push event: 83
  • Pull request review comment event: 2
  • Pull request review event: 3
  • Pull request event: 14
  • Gollum event: 2
  • Fork event: 1
  • Create event: 8

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 21
  • Total pull requests: 18
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 12 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 1.14
  • Average comments per pull request: 1.17
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 14
  • Pull requests: 12
  • Average time to close issues: 3 days
  • Average time to close pull requests: 5 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.36
  • Average comments per pull request: 0.92
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tdhock (17)
  • be-marc (3)
Pull Request Authors
  • tdhock (24)
  • sebffischer (2)
  • be-marc (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 366 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: mlr3resampling

Resampling Algorithms for 'mlr3' Framework

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 366 Last month
Rankings
Dependent packages count: 28.5%
Dependent repos count: 36.6%
Average: 50.2%
Downloads: 85.5%
Maintainers (1)
Last synced: 9 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R6 * imports
  • checkmate * imports
  • data.table * imports
  • mlr3 * imports
  • mlr3misc * imports
  • paradox * imports
  • animint2 * suggests
  • future * suggests
  • knitr * suggests
  • nc * suggests
  • rmarkdown * suggests
  • rpart * suggests
  • testthat * suggests