splitTools
Lightweight R package to do fast data splitting for cross-validation or train/valid/test splits
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: mayer79
- License: gpl-2.0
- Language: R
- Default Branch: main
- Homepage: https://mayer79.github.io/splitTools/
- Size: 1020 KB
Statistics
- Stars: 13
- Watchers: 2
- Forks: 5
- Open Issues: 2
- Releases: 5
Metadata Files
README.md
{splitTools} 
Overview
{splitTools} is a toolkit for fast data splitting. It does not have any dependencies.
Its two main functions partition() and create_folds() support
- data partitioning (e.g. into training, validation and test),
- creating (in- or out-of-sample) folds for cross-validation (CV),
- creating repeated folds for CV,
- stratified splitting,
- grouped splitting as well as
- blocked splitting (if the sequential order of the data should be retained).
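The last two list items can be sketched as follows. This is a minimal illustration, assuming that the `type` argument of partition() accepts "grouped" and "blocked"; the `group` vector is made-up data, not part of the package.

```r
library(splitTools)

# Hypothetical data: 20 rows belonging to 5 groups (e.g. repeated measurements)
group <- rep(c("a", "b", "c", "d", "e"), each = 4)

# Grouped split: all rows of a group land in the same partition
inds_grouped <- partition(
  group, p = c(train = 0.6, test = 0.4), type = "grouped", seed = 1
)
shared <- intersect(group[inds_grouped$train], group[inds_grouped$test])
length(shared)  # no group should appear on both sides

# Blocked split: the sequential order of the rows is retained
inds_blocked <- partition(
  1:20, p = c(train = 0.6, test = 0.4), type = "blocked"
)
str(inds_blocked)
```

A grouped split is useful when rows are not independent, for example several rows per customer; a blocked split keeps earlier rows in the first partition and later rows in the last one.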
The function create_timefolds() does time-series splitting where the out-of-sample data follows the (extending or moving) in-sample data.
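A short sketch of the two window types, assuming the `type` argument of create_timefolds() takes the values "extending" and "moving":

```r
library(splitTools)

# 100 observations in time order; the values only serve as placeholders
y <- 1:100

ext <- create_timefolds(y, k = 5, type = "extending")
mov <- create_timefolds(y, k = 5, type = "moving")

# Size of the in-sample window per fold for both variants
n_ext <- sapply(ext, function(fold) length(fold$insample))
n_mov <- sapply(mov, function(fold) length(fold$insample))
n_ext
n_mov

# In every fold, the out-of-sample block lies after the in-sample block
follows <- function(folds) {
  all(vapply(folds, function(f) max(f$insample) < min(f$outsample), logical(1)))
}
follows(ext)
follows(mov)
```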
The result of create_folds() can be directly passed to the folds argument in CV functions of XGBoost or LightGBM. Since these functions expect out-of-sample indices, set the option invert = TRUE.
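As a minimal sketch of that option, the block below creates out-of-sample folds and checks that they cover every row once. The modeling call itself is left out to keep the example free of extra dependencies.

```r
library(splitTools)

# Out-of-sample indices for 5-fold CV, stratified by Species
folds <- create_folds(iris$Species, k = 5, invert = TRUE, seed = 1)
str(folds)

# Without repetitions, each row should fall into exactly one out-of-sample fold
all_rows <- sort(unlist(folds, use.names = FALSE))
all(all_rows == seq_len(nrow(iris)))

# 'folds' is a list of index vectors, the shape that the 'folds' argument of
# xgboost::xgb.cv() or lightgbm::lgb.cv() expects (not run here)
```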
Installation
```r
# From CRAN
install.packages("splitTools")

# Development version
devtools::install_github("mayer79/splitTools")
```
Usage
```r
library(splitTools)

p <- c(train = 0.5, valid = 0.25, test = 0.25)

# Train/valid/test indices for iris data stratified by Species
str(inds <- partition(iris$Species, p, seed = 1))
#> List of 3
#>  $ train: int [1:73] 1 3 5 7 8 10 12 13 14 15 ...
#>  $ valid: int [1:38] 4 9 19 21 27 28 29 30 32 35 ...
#>  $ test : int [1:39] 2 6 11 16 18 22 26 37 38 40 ...

# Same, but different output interface
head(inds <- partition(iris$Species, p, split_into_list = FALSE, seed = 1))
#> [1] train test train valid train test
#> Levels: train valid test

# In-sample indices for 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, seed = 1))
#> List of 5
#>  $ Fold1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#>  $ Fold2: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#>  $ Fold3: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#>  $ Fold4: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#>  $ Fold5: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...

# In-sample indices for 3 times repeated 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, m_rep = 3, seed = 1))
#> List of 15
#>  $ Fold1.Rep1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#>  $ Fold2.Rep1: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#>  $ Fold3.Rep1: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#>  $ Fold4.Rep1: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#>  $ Fold5.Rep1: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...
#>  $ Fold1.Rep2: int [1:120] 1 2 3 4 5 6 8 9 11 12 ...
#>  $ Fold2.Rep2: int [1:120] 1 3 6 7 8 9 10 12 13 14 ...
#> [...]

# Indices for time-series splitting
str(inds <- create_timefolds(1:100, k = 5))
#> List of 5
#>  $ Fold1:List of 2
#>   ..$ insample : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ outsample: int [1:17] 18 19 20 21 22 23 24 25 26 27 ...
#>  $ Fold2:List of 2
#>   ..$ insample : int [1:34] 1 2 3 4 5 6 7 8 9 10 ...
#>   ..$ outsample: int [1:17] 35 36 37 38 39 40 41 42 43 44 ...
#>  $ Fold3:List of 2
#> [...]
```
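The index lists returned above are typically used to subset the data. The following sketch shows one possible workflow; the linear model and the RMSE metric are illustrative choices, not something the package prescribes.

```r
library(splitTools)

# Split iris into training, validation, and test data frames
p <- c(train = 0.5, valid = 0.25, test = 0.25)
inds <- partition(iris$Species, p, seed = 1)
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]

# Manual 5-fold CV on the training data using in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 1)
rmse <- numeric(length(folds))
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  outsample <- train[-folds[[i]], ]
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = insample)
  pred <- predict(fit, newdata = outsample)
  rmse[i] <- sqrt(mean((outsample$Sepal.Length - pred)^2))
}
mean(rmse)  # average out-of-sample RMSE across folds
```

Because create_folds() returns in-sample indices by default, the out-of-sample rows of each fold are obtained by negative indexing.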
For more details, check out the vignette.
Owner
- Name: Michael Mayer
- Login: mayer79
- Kind: user
- Repositories: 12
- Profile: https://github.com/mayer79
Responsible statistics | ML
GitHub Events
Total
- Issues event: 4
- Watch event: 1
- Delete event: 3
- Issue comment event: 3
- Push event: 10
- Pull request event: 6
- Create event: 3
Last Year
- Issues event: 4
- Watch event: 1
- Delete event: 3
- Issue comment event: 3
- Push event: 10
- Pull request event: 6
- Create event: 3
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 11
- Total pull requests: 24
- Average time to close issues: 6 months
- Average time to close pull requests: 5 days
- Total issue authors: 3
- Total pull request authors: 3
- Average comments per issue: 2.82
- Average comments per pull request: 0.38
- Merged pull requests: 24
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 3
- Average time to close issues: about 4 hours
- Average time to close pull requests: 7 minutes
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.33
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mayer79 (6)
- DarioS (3)
- kapsner (2)
- bbb801 (1)
Pull Request Authors
- mayer79 (22)
- kapsner (3)
- olivroy (2)
Packages
- Total packages: 2
- Total downloads:
  - cran: 1,224 last month
- Total docker downloads: 21,613
- Total dependent packages: 4 (may contain duplicates)
- Total dependent repositories: 5 (may contain duplicates)
- Total versions: 11
- Total maintainers: 1
cran.r-project.org: splitTools
Tools for Data Splitting
- Homepage: https://github.com/mayer79/splitTools
- Documentation: http://cran.r-project.org/web/packages/splitTools/splitTools.pdf
- License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
- Latest release: 1.0.1 (published over 2 years ago)
conda-forge.org: r-splittools
- Homepage: https://github.com/mayer79/splitTools
- License: GPL-2.0-or-later
- Latest release: 0.3.2 (published about 4 years ago)
Dependencies
- stats * imports
- knitr * suggests
- ranger * suggests
- rmarkdown * suggests
- testthat >= 3.0.0 suggests