cvAUC

Computationally efficient confidence intervals for cross-validated AUC estimates in R

https://github.com/ledell/cvauc

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.5%) to scientific vocabulary

Keywords

auc confidence-intervals cross-validation machine-learning r statistics variance

Last synced: 6 months ago

Repository

Computationally efficient confidence intervals for cross-validated AUC estimates in R

Basic Info

Host: GitHub
Owner: ledell
License: apache-2.0
Language: R
Default Branch: master
Size: 138 KB

Statistics

Stars: 23
Watchers: 3
Forks: 11
Open Issues: 7
Releases: 0

Topics

auc confidence-intervals cross-validation machine-learning r statistics variance

Created about 11 years ago · Last pushed about 4 years ago

Metadata Files

Readme License

cvAUC

The cvAUC R package provides a computationally efficient means of estimating confidence intervals (or variance) of cross-validated Area Under the ROC Curve (AUC) estimates. This allows you to generate confidence intervals in seconds, compared to other techniques that are many orders of magnitude slower.

In binary classification problems, the AUC is commonly used to evaluate the performance of a prediction model. Often, it is combined with cross-validation in order to assess how the results will generalize to an independent data set. In order to evaluate the quality of an estimate for cross-validated AUC, we obtain an estimate of its variance.
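As a reminder of what is being estimated, the AUC equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The base-R sketch below computes it with the rank-sum (Mann-Whitney) identity on a tiny made-up example. The function name `auc_rank` and the toy data are illustrative assumptions, not part of the package, which computes AUC through ROCR.

```r
# Illustrative helper (not part of cvAUC): AUC via the rank-sum identity.
auc_rank <- function(scores, labels) {
  stopifnot(all(labels %in% c(0, 1)))
  r  <- rank(scores)                 # average ranks, so ties count as 1/2
  n1 <- sum(labels == 1)             # number of positives
  n0 <- sum(labels == 0)             # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: 3 positives, 3 negatives
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
labels <- c(1,   1,   0,   1,   0,   0)

# 8 of the 9 positive/negative pairs are ordered correctly, so AUC = 8/9
print(auc_rank(scores, labels))
```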

For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, the process of cross-validating a predictive model on even a relatively small data set can still require a large amount of computation time. Thus, in many practical settings, the bootstrap is a computationally intractable approach to variance estimation. As an alternative to the bootstrap, a computationally efficient influence curve based approach to obtaining a variance estimate for cross-validated AUC can be used.

The primary functions of the package are ci.cvAUC() and ci.pooled.cvAUC(), which report cross-validated AUC and compute confidence intervals for cross-validated AUC estimates based on influence curves for i.i.d. and pooled repeated measures data, respectively. One benefit to using influence curve based confidence intervals is that they require much less computation time than bootstrapping methods. The utility functions, AUC() and cvAUC(), are simple wrappers for functions from the ROCR package.
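A minimal sketch of how these functions fit together on simulated data is shown here. It assumes the cvAUC package is installed; the simulated scores, the 5-fold split, and the variable names are illustrative only, and the argument names follow the package help pages as best understood.

```r
library(cvAUC)

set.seed(1)
n <- 200
labels <- rbinom(n, 1, 0.5)                      # simulated binary outcome
predictions <- plogis(2 * labels + rnorm(n))     # noisy scores correlated with the outcome
folds <- split(sample(n), rep(1:5, length.out = n))  # list of index vectors, one per fold

# Utility functions (wrappers around ROCR)
auc_all <- AUC(predictions = predictions, labels = labels)
cv <- cvAUC(predictions = predictions, labels = labels, folds = folds)

# Influence-curve-based confidence interval for the cross-validated AUC
fit <- ci.cvAUC(predictions = predictions, labels = labels,
                folds = folds, confidence = 0.95)

print(auc_all)
print(cv$cvAUC)
print(fit)
```

For pooled repeated measures data, ci.pooled.cvAUC() is used in the same way with an additional ids argument identifying the independent units.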

Erin LeDell, Maya L. Petersen & Mark J. van der Laan, "Computationally Efficient Confidence Intervals for Cross-validated Area Under the ROC Curve Estimates." (Electronic Journal of Statistics) - Open access article: https://doi.org/10.1214/15-EJS1035

Install cvAUC

You can install:

The latest released version from CRAN with:

```r
install.packages("cvAUC")
```

The latest development version from GitHub with:

```r
remotes::install_github("ledell/cvAUC")
```

Using cvAUC

Here is a demo of how you can use the package, along with some benchmarks of the speed of the method. For a simpler example that runs faster, you can check out the help files for the various functions inside the R package.

In this example of the ci.cvAUC() function, we do the following:

Load an i.i.d. data set with a binary outcome.
We will use 10-fold cross-validation, so we need to divide the indices randomly into 10 folds. In this step, we stratify the folds by the outcome variable. Stratification is not necessary, but is commonly performed in order to create validation folds with similar distributions. This information is stored in a 10-element list called folds. Below, the function that creates the folds is called .cvFolds.
For the v^th iteration of the cross-validation (CV) process, fit a model on the training data (i.e. observations in folds {1,...,10}\v) and then using this saved fit, generate predicted values for the observations in the v^th validation fold. The .doFit() function below does this procedure. In this example, we use the Random Forest algorithm.
Next, the .doFit() function is applied across all 10 folds to generate the predicted values for the observations in each validation fold.
These predicted values are stored in a vector called predictions, in the original order of the training observations.
Lastly, we use the ci.cvAUC() function to calculate CV AUC and to generate a 95% confidence interval for this CV AUC estimate.
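The stratified fold assignment described in the steps above can be checked on simulated labels before running it on real data. The sketch below uses only base R. The name `make_folds` and the simulated outcome are assumptions for illustration, and the logic mirrors the stratify-by-outcome idea rather than reproducing the package example exactly.

```r
# Illustrative helper: split indices into V folds, stratified by a 0/1 outcome.
make_folds <- function(Y, V) {
  idx0 <- which(Y == 0)
  idx1 <- which(Y == 1)
  Y0 <- split(sample(idx0), rep(1:V, length.out = length(idx0)))
  Y1 <- split(sample(idx1), rep(1:V, length.out = length(idx1)))
  lapply(seq_len(V), function(v) c(Y0[[v]], Y1[[v]]))
}

set.seed(1)
Y <- rbinom(1000, 1, 0.3)          # simulated binary outcome
folds <- make_folds(Y, V = 10)

# Every observation appears in exactly one validation fold
stopifnot(identical(sort(unlist(folds)), seq_along(Y)))

# Stratification keeps the share of positives nearly identical across folds
pos_rate <- sapply(folds, function(idx) mean(Y[idx]))
print(round(pos_rate, 3))
```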

First, we define a few utility functions:

```r
.cvFolds <- function(Y, V){
  # Create CV folds (stratify by outcome)
  Y0 <- split(sample(which(Y==0)), rep(1:V, length = length(which(Y==0))))
  Y1 <- split(sample(which(Y==1)), rep(1:V, length = length(which(Y==1))))
  folds <- vector("list", length = V)
  for (v in seq(V)) {folds[[v]] <- c(Y0[[v]], Y1[[v]])}
  return(folds)
}

.doFit <- function(v, folds, train, y){
  # Train & test a model; return predicted values on test samples
  set.seed(v)
  ycol <- which(names(train) == y)
  params <- list(x = train[-folds[[v]], -ycol],
                 y = as.factor(train[-folds[[v]], ycol]),
                 xtest = train[folds[[v]], -ycol])
  fit <- do.call(randomForest, params)
  pred <- fit$test$votes[,2]
  return(pred)
}
```

This function will execute the example:

```r
iid_example <- function(train, y = "response", V = 10, seed = 1) {

  # Create folds
  set.seed(seed)
  folds <- .cvFolds(Y = train[,c(y)], V = V)

  # Generate CV predicted values
  cl <- makeCluster(detectCores())
  registerDoParallel(cl)
  predictions <- foreach(v = 1:V, .combine = "c",
                         .packages = c("randomForest"),
                         .export = c(".doFit")) %dopar% .doFit(v, folds, train, y)
  stopCluster(cl)
  predictions[unlist(folds)] <- predictions

  # Get CV AUC and 95% confidence interval
  runtime <- system.time(res <- ci.cvAUC(predictions = predictions,
                                         labels = train[,c(y)],
                                         folds = folds,
                                         confidence = 0.95))
  print(runtime)
  return(res)
}
```
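The line `predictions[unlist(folds)] <- predictions` is what puts the fold-ordered predictions back into the original row order. A small base-R sketch of that step follows. The stand-in "predictions" are made up so that the correct ordering is easy to verify, and all names are illustrative.

```r
set.seed(1)
n <- 12
folds <- split(sample(n), rep(1:3, length.out = n))   # 3 folds of row indices

# Stand-in predictions returned fold by fold: the value for row i is i / 100,
# so the correct final ordering is simply (1:n) / 100.
fold_preds <- unlist(lapply(folds, function(idx) idx / 100), use.names = FALSE)

# fold_preds[k] belongs to row unlist(folds)[k]; assign each value to its row
predictions <- fold_preds
predictions[unlist(folds)] <- predictions

stopifnot(isTRUE(all.equal(predictions, (1:n) / 100)))
print(predictions)
```

The right-hand side is evaluated in full before the assignment, so the vector can safely be reordered in place.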

Load a sample binary outcome training set into R with 10,000 rows:

```r
train_csv <- "https://erin-data.s3.amazonaws.com/higgs/higgs_train_10k.csv"
train <- read.csv(train_csv, header = TRUE, sep = ",")
```

Run the example:

```r
library(randomForest)
library(doParallel)  # to speed up the model training in the example
library(cvAUC)

res <- iid_example(train = train, y = "response", V = 10, seed = 1)
#   user  system elapsed
#  0.096   0.005   0.102

print(res)
# $cvAUC
# [1] 0.7818224
#
# $se
# [1] 0.004531916
#
# $ci
# [1] 0.7729400 0.7907048
#
# $confidence
# [1] 0.95
```
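The reported interval is consistent with a normal-approximation interval of the form cvAUC ± z × se. The base-R check below uses only the rounded values printed in the output above, so agreement is expected to about six decimal places. The variable names are illustrative.

```r
cv_auc     <- 0.7818224      # $cvAUC from the output above
se         <- 0.004531916    # $se from the output above
confidence <- 0.95

z  <- qnorm(1 - (1 - confidence) / 2)   # two-sided standard normal quantile
ci <- cv_auc + c(-1, 1) * z * se

print(round(ci, 7))
```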

cvAUC Performance

For the example above (10,000 observations), it took ~0.1 seconds to calculate the cross-validated AUC and the influence curve based confidence intervals. This was benchmarked on a 3.1 GHz Intel Core i7 processor using cvAUC package version 1.1.3.

For bigger (i.i.d.) training sets, here are a few rough benchmarks:

100,000 observations: ~0.4 seconds
1 million observations: ~5.0 seconds

To try it on bigger datasets yourself, feel free to replace the 10k-row training csv with either of these files here:

```r
train_csv <- "https://erin-data.s3.amazonaws.com/higgs/higgs_train_100k.csv"
train_csv <- "https://erin-data.s3.amazonaws.com/higgs/higgs_train_1M.csv"
```

Owner

Name: Erin LeDell
Login: ledell
Kind: user
Location: Oakland, California, USA
Company: @h2oai

Website: http://twitter.com/ledell
Repositories: 51
Profile: https://github.com/ledell

Chief Machine Learning Scientist at H2O.ai

Committers

Last synced: over 2 years ago

All Time

Total Commits: 22
Total Committers: 4
Avg Commits per committer: 5.5
Development Distribution Score (DDS): 0.545

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
ledell	e**n@h**i	10
Erin LeDell	o**s@l**g	8
ledell	l**l@s**u	3
Michael Chirico	m**o@g**m	1

Committer Domains (Top 20 + Academic)

grabtaxi.com: 1
stat.berkeley.edu: 1
ledell.org: 1
h2o.ai: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 11
Total pull requests: 2
Average time to close issues: over 1 year
Average time to close pull requests: over 1 year
Total issue authors: 6
Total pull request authors: 2
Average comments per issue: 1.45
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

ledell (4)
ck37 (3)
sgruber65 (1)
Tato14 (1)
reiniervlinschoten (1)
beckermr (1)

Pull Request Authors

ledell (1)
MichaelChirico (1)

Packages

Total packages: 2
Total downloads:
- cran 4,600 last-month
Total docker downloads: 43,550

Total dependent packages: 9
(may contain duplicates)
Total dependent repositories: 183
(may contain duplicates)
Total versions: 7
Total maintainers: 1

cran.r-project.org: cvAUC

Cross-Validated Area Under the ROC Curve Confidence Intervals

Homepage: https://github.com/ledell/cvAUC
Documentation: http://cran.r-project.org/web/packages/cvAUC/cvAUC.pdf
License: Apache License (== 2.0)
Latest release: 1.1.4
published about 4 years ago

Versions: 5
Dependent Packages: 8
Dependent Repositories: 183
Downloads: 4,600 Last month
Docker Downloads: 43,550

Rankings

Docker downloads count: 0.6%

Dependent repos count: 1.4%

Average: 5.5%

Dependent packages count: 6.1%

Forks count: 6.3%

Downloads: 7.3%

Stargazers count: 11.2%

Maintainers (1)

oss@ledell.org

Last synced: 6 months ago

conda-forge.org: r-cvauc

Homepage: https://github.com/ledell/cvAUC
License: Apache-2.0
Latest release: 1.1.4
published about 4 years ago

Versions: 2
Dependent Packages: 1
Dependent Repositories: 0

Rankings

Dependent packages count: 28.8%

Dependent repos count: 34.0%

Average: 36.8%

Forks count: 40.0%

Stargazers count: 44.2%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

ROCR * imports
data.table * imports
