mlr3resampling
Resampling algorithms for mlr3 framework in R
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (14.3%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
Resampling algorithms for mlr3 framework in R
Basic Info
- Host: GitHub
- Owner: tdhock
- Language: R
- Default Branch: main
- Size: 981 KB
Statistics
- Stars: 5
- Watchers: 2
- Forks: 2
- Open Issues: 8
- Releases: 0
Created over 2 years ago
· Last pushed 10 months ago
Metadata Files
Readme
Changelog
README.org
mlr3resampling provides new cross-validation algorithms for the mlr3
framework in R
| [[file:tests/testthat][tests]] | [[https://github.com/tdhock/mlr3resampling/actions][https://github.com/tdhock/mlr3resampling/workflows/R-CMD-check/badge.svg]] |
| [[https://github.com/jimhester/covr][coverage]] | [[https://app.codecov.io/gh/tdhock/mlr3resampling?branch=main][https://codecov.io/gh/tdhock/mlr3resampling/branch/main/graph/badge.svg]] |
** Installation
#+begin_src R
install.packages("mlr3resampling")#release version from CRAN
## OR: development version from GitHub:
install.packages("remotes")
remotes::install_github("tdhock/mlr3resampling")
#+end_src
** Description
For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]], the
[[https://arxiv.org/abs/2410.08643][SOAK arXiv paper]], and [[https://github.com/tdhock/mlr3resampling/wiki/Articles][other articles]].
*** SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets
See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer resamplers vignette]] and data viz for
[[https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/][regression]] and [[https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/][classification]].
A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a subset (such as geographic region, year, etc), then
how do we know if subsets are similar enough so that we can get
accurate predictions on one subset, after training on Other subsets?
And how do we know if training on All subsets would improve prediction
accuracy, relative to training on the Same subset? SOAK,
Same/Other/All K-fold cross-validation, can be used to answer these
questions, by fixing a test subset, training models on Same/Other/All
subsets, and then comparing test error rates (Same versus Other and
Same versus All).
- subsets are similar (for the learning algorithm) if All is more
accurate than Same, and the subset with more data (Same/Other) is
more accurate.
- subsets are different (for the learning algorithm) if All/Other is
less accurate than Same.
- Other can be just as bad as featureless baseline (or worse) when the
subsets have different learnable/predictable patterns.
- Note that results depend on the data set and on the learning
algorithm. For a given data set, a linear model may not be able to
learn on one subset and accurately predict on the other, but a more
complex learning algorithm may be able to.
This is implemented in =ResamplingSameOtherSizesCV= when you use it on
a task that defines the =subset= role, for example the Arizona trees
data, for which each row is a pixel in an image, and we want to
do binary classification -- does the pixel contain a tree or not?
#+begin_src R
> data(AZtrees,package="mlr3resampling")
> table(AZtrees$region3)
NE NW S
1464 1563 2929
#+end_src
We see in the output above that the =region3= column has three values
(NE, NW, S). Each represents the region/area in which the pixel was
found. If we want good predictions in the south (S), can we train on
the north? (NE+NW) We can use the code below to setup the CV
experiment. The rows 12,15,18 below represent splits that attempt to
answer that question (test.subset=S, train.subsets=other).
#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
test.subset train.subsets test.fold
1: NE all 1
2: NW all 1
3: S all 1
4: NE all 2
5: NW all 2
6: S all 2
7: NE all 3
8: NW all 3
9: S all 3
10: NE other 1
11: NW other 1
12: S other 1
13: NE other 2
14: NW other 2
15: S other 2
16: NE other 3
17: NW other 3
18: S other 3
19: NE same 1
20: NW same 1
21: S same 1
22: NE same 2
23: NW same 2
24: S same 2
25: NE same 3
26: NW same 3
27: S same 3
test.subset train.subsets test.fold
#+end_src
The rows in the output above represent different kinds of splits:
- train.subsets=same is used as a baseline, to answer the question,
"what is the baseline error rate, if we train on data from the same
subset as test?" This imporant baseline allows us to
see if the all and other models do better or worse.
- train.subsets=all is used to answer the question, "can we get
accurate predictions on a given test subset, by training on data
from all subsets?"
- train.subsets=other is used to answer the question, "can be get
accurate predictions on a given test subset, by training using data
from other subsets?"
Code to re-run:
#+begin_src R
data(AZtrees,package="mlr3resampling")
table(AZtrees$region3)
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
#+end_src
*** Algorithm 2: cross-validation for comparing different sized train sets
See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/Newer_resamplers.html][Newer Resamplers vignette]] and data viz for
[[https://tdhock.github.io/2023-12-26-train-sizes-regression/][regression]] and [[https://tdhock.github.io/2023-12-27-train-sizes-classification/][classification]].
How many train samples are required to get accurate predictions on a
test set? Cross-validation can be used to answer this question, with
variable size train sets. For example consider the Arizona Trees data
below,
#+begin_src R
> dim(AZtrees)
[1] 5956 25
> length(unique(AZtrees$polygon))
[1] 189
#+end_src
The output above indicates we have 5956 rows and 189 polygons. We can
do cross-validation on either polygons (if task has =group= role) or
rows (if no =group= role set). The code below sets a down-sampling
=ratio= of 0.8, and four =sizes= of down-sampled train sets.
#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$param_set$values$sizes <- 4
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
n.train.groups test.fold
1: 51 1
2: 64 1
3: 80 1
4: 100 1
5: 126 1
6: 51 2
7: 64 2
8: 80 2
9: 100 2
10: 126 2
11: 51 3
12: 64 3
13: 80 3
14: 100 3
15: 126 3
#+end_src
The output above has one row per train/test split that will be
computed in the cross-validation experiment. The full train set size
is 126 polygons, and there are four smaller train set sizes (each a
factor of 0.8 smaller). Each train set size will be computed for each
fold ID from 1 to 3.
Code to re-run:
#+begin_src R
data(AZtrees,package="mlr3resampling")
dim(AZtrees)
length(unique(AZtrees$polygon))
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$param_set$values$sizes <- 4
same_other_sizes_cv$param_set$values$ratio <- 0.8
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
#+end_src
** Related work
- mlr3resampling code was copied/modified from Resampling and
ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- As of Oct 2024, scikit-learn in python implements support for groups
via [[https://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold][GroupKFold]] (keeping samples together when splitting) but not
subsets (test data come from one subset, train data come from
Same/Other/All subsets).
Owner
- Name: Toby Dylan Hocking
- Login: tdhock
- Kind: user
- Location: Flagstaff, AZ
- Company: Northern Arizona University
- Website: http://tdhock.github.io
- Repositories: 88
- Profile: https://github.com/tdhock
GitHub Events
Total
- Issues event: 18
- Watch event: 1
- Issue comment event: 30
- Push event: 83
- Pull request review comment event: 2
- Pull request review event: 3
- Pull request event: 14
- Gollum event: 2
- Fork event: 1
- Create event: 8
Last Year
- Issues event: 18
- Watch event: 1
- Issue comment event: 30
- Push event: 83
- Pull request review comment event: 2
- Pull request review event: 3
- Pull request event: 14
- Gollum event: 2
- Fork event: 1
- Create event: 8
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 21
- Total pull requests: 18
- Average time to close issues: about 1 month
- Average time to close pull requests: 12 days
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 1.14
- Average comments per pull request: 1.17
- Merged pull requests: 16
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 14
- Pull requests: 12
- Average time to close issues: 3 days
- Average time to close pull requests: 5 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 1.36
- Average comments per pull request: 0.92
- Merged pull requests: 10
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- tdhock (17)
- be-marc (3)
Pull Request Authors
- tdhock (24)
- sebffischer (2)
- be-marc (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- cran 366 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 9
- Total maintainers: 1
cran.r-project.org: mlr3resampling
Resampling Algorithms for 'mlr3' Framework
- Homepage: https://github.com/tdhock/mlr3resampling
- Documentation: http://cran.r-project.org/web/packages/mlr3resampling/mlr3resampling.pdf
- License: LGPL-3
-
Latest release: 2025.6.23
published 12 months ago
Rankings
Dependent packages count: 28.5%
Dependent repos count: 36.6%
Average: 50.2%
Downloads: 85.5%
Maintainers (1)
Last synced:
9 months ago
Dependencies
.github/workflows/R-CMD-check.yaml
actions
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml
actions
- actions/checkout v3 composite
- actions/upload-artifact v3 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION
cran
- R6 * imports
- checkmate * imports
- data.table * imports
- mlr3 * imports
- mlr3misc * imports
- paradox * imports
- animint2 * suggests
- future * suggests
- knitr * suggests
- nc * suggests
- rmarkdown * suggests
- rpart * suggests
- testthat * suggests