splitTools

Lightweight R package for fast data splitting, either for cross-validation or into train/valid/test splits

https://github.com/mayer79/splittools

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.5%) to scientific vocabulary

Keywords

cross-validation machine-learning rstats time-series validation
Last synced: 6 months ago

Repository

Statistics
  • Stars: 13
  • Watchers: 2
  • Forks: 5
  • Open Issues: 2
  • Releases: 5
Created about 6 years ago · Last pushed 11 months ago

README.md

{splitTools}

Overview

{splitTools} is a toolkit for fast data splitting. It has no dependencies beyond base R (it imports only {stats}).

Its two main functions, partition() and create_folds(), support:

  • data partitioning (e.g. into training, validation and test),
  • creating (in- or out-of-sample) folds for cross-validation (CV),
  • creating repeated folds for CV,
  • stratified splitting,
  • grouped splitting as well as
  • blocked splitting (if the sequential order of the data should be retained).
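
As a rough illustration of the grouped and blocked modes listed above, the following sketch selects them through the `type` argument of partition(). The grouping vector `group_id` and the split proportions are made up for this example.

```r
library(splitTools)

# Hypothetical grouping variable: 20 groups with 5 rows each
group_id <- rep(1:20, each = 5)

# Grouped partition: all rows of a group should land in the same subset
grouped <- partition(
  group_id, p = c(train = 0.7, test = 0.3), type = "grouped", seed = 1
)
shared_groups <- intersect(group_id[grouped$train], group_id[grouped$test])
length(shared_groups)  # number of groups appearing in both subsets

# Blocked partition: the sequential order of the data is retained
blocked <- partition(1:100, p = c(train = 0.8, test = 0.2), type = "blocked")
range(blocked$train)
range(blocked$test)
```

Grouped splitting is typically used when rows within a group are correlated (e.g. several measurements per patient), so that no group leaks from training into test data.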

The function create_timefolds() does time-series splitting where the out-of-sample data follows the (extending or moving) in-sample data.
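
A minimal sketch of the moving-window variant, assuming the window behavior is chosen via `type = "moving"` (with the extending window as the default). The input `1:100` simply stands in for a time-ordered response.

```r
library(splitTools)

# Time-series folds with a moving in-sample window
folds <- create_timefolds(1:100, k = 5, type = "moving")

# For each fold, check that all in-sample indices precede the out-of-sample ones
ordered <- vapply(
  folds,
  function(f) max(f$insample) < min(f$outsample),
  logical(1)
)
ordered
```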

The result of create_folds() can be directly passed to the folds argument in CV functions of XGBoost or LightGBM. Since these functions expect out-of-sample indices, set the option invert = TRUE.
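
The effect of `invert = TRUE` can be checked without loading a modeling package. In this sketch, the comparison against the in-sample folds assumes that the same seed produces the same fold assignment in both calls.

```r
library(splitTools)

y <- iris$Species

in_sample  <- create_folds(y, k = 5, seed = 1)                 # default: in-sample indices
out_sample <- create_folds(y, k = 5, seed = 1, invert = TRUE)  # out-of-sample indices

# Each out-of-sample fold should be the complement of its in-sample fold
is_complement <- mapply(
  function(i, o) length(intersect(i, o)) == 0 && setequal(c(i, o), seq_along(y)),
  in_sample, out_sample
)
all(is_complement)

# `out_sample` is the list that would go into the folds argument
# of the XGBoost or LightGBM CV functions.
```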

Installation

```r
# From CRAN
install.packages("splitTools")

# Development version
devtools::install_github("mayer79/splitTools")
```

Usage

```r
library(splitTools)

p <- c(train = 0.5, valid = 0.25, test = 0.25)

# Train/valid/test indices for iris data stratified by Species
str(inds <- partition(iris$Species, p, seed = 1))
# List of 3
#  $ train: int [1:73] 1 3 5 7 8 10 12 13 14 15 ...
#  $ valid: int [1:38] 4 9 19 21 27 28 29 30 32 35 ...
#  $ test : int [1:39] 2 6 11 16 18 22 26 37 38 40 ...

# Same, but different output interface
head(inds <- partition(iris$Species, p, split_into_list = FALSE, seed = 1))
# [1] train test  train valid train test
# Levels: train valid test

# In-sample indices for 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, seed = 1))
# List of 5
#  $ Fold1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#  $ Fold2: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#  $ Fold3: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#  $ Fold4: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#  $ Fold5: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...

# In-sample indices for 3 times repeated 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, m_rep = 3, seed = 1))
# List of 15
#  $ Fold1.Rep1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#  $ Fold2.Rep1: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#  $ Fold3.Rep1: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#  $ Fold4.Rep1: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#  $ Fold5.Rep1: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...
#  $ Fold1.Rep2: int [1:120] 1 2 3 4 5 6 8 9 11 12 ...
#  $ Fold2.Rep2: int [1:120] 1 3 6 7 8 9 10 12 13 14 ...
#  [...]

# Indices for time-series splitting
str(inds <- create_timefolds(1:100, k = 5))
# List of 5
#  $ Fold1:List of 2
#   ..$ insample : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ outsample: int [1:17] 18 19 20 21 22 23 24 25 26 27 ...
#  $ Fold2:List of 2
#   ..$ insample : int [1:34] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ outsample: int [1:17] 35 36 37 38 39 40 41 42 43 44 ...
#  $ Fold3:List of 2
#  [...]
```

For more details, check out the vignette.

Owner

  • Name: Michael Mayer
  • Login: mayer79
  • Kind: user

Responsible statistics | ML

GitHub Events

Total
  • Issues event: 4
  • Watch event: 1
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 10
  • Pull request event: 6
  • Create event: 3
Last Year
  • Issues event: 4
  • Watch event: 1
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 10
  • Pull request event: 6
  • Create event: 3

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 69
  • Total Committers: 3
  • Avg Commits per committer: 23.0
  • Development Distribution Score (DDS): 0.232
Past Year
  • Commits: 4
  • Committers: 2
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.25
Top Committers
  • mayer79 (m****9@g****m): 53 commits
  • kapsner (l****r@g****m): 15 commits
  • olivroy (5****y): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 11
  • Total pull requests: 24
  • Average time to close issues: 6 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 3
  • Total pull request authors: 3
  • Average comments per issue: 2.82
  • Average comments per pull request: 0.38
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 3
  • Average time to close issues: about 4 hours
  • Average time to close pull requests: 7 minutes
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.33
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mayer79 (6)
  • DarioS (3)
  • kapsner (2)
  • bbb801 (1)
Pull Request Authors
  • mayer79 (22)
  • kapsner (3)
  • olivroy (2)
Top Labels
Issue Labels
enhancement (4) wontfix (1) bug (1)
Pull Request Labels
enhancement (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 1,224 last-month
  • Total docker downloads: 21,613
  • Total dependent packages: 4
    (may contain duplicates)
  • Total dependent repositories: 5
    (may contain duplicates)
  • Total versions: 11
  • Total maintainers: 1
cran.r-project.org: splitTools

Tools for Data Splitting

  • Versions: 9
  • Dependent Packages: 4
  • Dependent Repositories: 5
  • Downloads: 1,224 Last month
  • Docker Downloads: 21,613
Rankings
Docker downloads count: 0.6%
Forks count: 10.8%
Average: 10.8%
Downloads: 11.9%
Dependent repos count: 13.0%
Dependent packages count: 13.7%
Stargazers count: 15.1%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-splittools
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 50.5%
Average: 50.9%
Forks count: 51.3%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • stats * imports
  • knitr * suggests
  • ranger * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests