marchmadnessconformal

Code and data to accompany "Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments"

https://github.com/chancejohnstone/marchmadnessconformal


Repository


Basic Info
  • Host: GitHub
  • Owner: chancejohnstone
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 664 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed almost 2 years ago

README.md

Conformal Win Probability for NCAA Basketball

Chancellor Johnstone and Dan Nettleton

Introduction

This repository holds all data and code for “Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments,” which uses conformal predictive distributions to estimate win probabilities for NCAA Division I women’s and men’s basketball. We also implement a simplified closed-form approach for estimating each team’s probability of making the NCAA tournament (March Madness). To use the code and data, clone this repository.

Workflow

The workflow for the results in the paper breaks down into three main steps: 1) data cleaning, 2) win probability generation, and 3) calibration. We describe each step and its associated code in the following sections.

Data Cleaning

The data cleaning step differs for the men’s and women’s data. The men’s data is freely and easily accessible from:

https://www.sportsbookreviewsonline.com/scoresoddsarchives/ncaabasketball/ncaabasketballoddsarchives.htm

The women’s data, which is not available in a ready-to-use format, was collected through a lengthy scraping process from:

https://www.ncaa.com/scoreboard/basketball-women/d1/

All scraping was performed using ncaaw_scrape.R. You can scrape all scores from the 2014-2015 season all the way to the 2020-2021 season with:

``` r
source("ncaaw_scrape.R")
```

The full scraped dataset is saved as ncaaw_2014_2021_data_fix.csv. The men’s data could be scraped in a similar fashion, but we chose to use the freely available archive for the men’s analysis. We then clean the two datasets to put them in a form suitable for our analysis with:

``` r
source("clean_ncaaw.R")
source("clean_ncaab.R")
```

The cleaned data is saved as clean_ncaaw_scores_2014_2020_all.csv and clean_ncaam_scores_2014_2020_all.csv for the women and men, respectively.

One issue with both the men’s and women’s data is that team names are extremely inconsistent, both within and across the two datasets. For example, North Carolina A&T appears in seven different forms across the two datasets, e.g., NCA&T, NCarolinaAT, NCAT. To remedy this, we select one name for each school as “correct,” then replace every occurrence of a team with its correct name in the cleaned datasets. The correct team names are saved in ncaa_clean_names_list.csv. The name cleaning is crucial to getting correct estimates of each team’s strength; otherwise some name variants would be treated as separate teams with only one or two games in the dataset.
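The name-replacement step can be sketched with a small lookup table. This is only an illustration, not the code in the cleaning scripts: the column names `variant` and `clean`, the canonical spelling `NorthCarolinaAT`, and the toy `games` data frame are all assumptions, and the actual layout of ncaa_clean_names_list.csv may differ.

``` r
# Hypothetical lookup table: each raw spelling maps to one canonical name.
name_map <- data.frame(
  variant = c("NCA&T", "NCarolinaAT", "NCAT"),
  clean   = "NorthCarolinaAT",
  stringsAsFactors = FALSE
)

# Toy game records with inconsistent spellings (not real results).
games <- data.frame(
  home = c("NCA&T", "Baylor", "NCAT"),
  away = c("Baylor", "NCarolinaAT", "Connecticut"),
  stringsAsFactors = FALSE
)

# Replace any known variant with its canonical name; leave other names alone.
standardize_names <- function(x, map) {
  idx <- match(x, map$variant)
  ifelse(is.na(idx), x, map$clean[idx])
}

games$home <- standardize_names(games$home, name_map)
games$away <- standardize_names(games$away, name_map)

# Games per team after cleaning: the three variants now count as one team.
table(c(games$home, games$away))
```

Without the mapping, each of the three spellings would be treated as a separate team with a single game, which is the estimation problem described above.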

Win Probability Generation

To provide a case study for the 2019-2020 season, we needed to generate conference win probabilities starting from the point at which each conference tournament was canceled. The conference tournament schedules as they stood prior to cancellation for the women and men are in conf_tournament_schedule.csv and conf_tournament_schedulew.csv, respectively. We took care to account for byes within each conference tournament. Conference win probability estimates for both women and men were generated using make_tourn_probs_conformal.R and are saved in conf_wp_w_conf.csv and conf_wp_m_conf.csv, respectively. The conformal win probabilities are generated within this R file, but win probabilities for the linear and logistic approaches are generated using separate R files:

all_year_ranks_test_ncaa_both.R

all_wp_prep_seasons_fresh.R

The first R file generates models for each season along with training and test datasets in specific formats. The second file generates conference win probabilities for each of the two aforementioned methods.
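To illustrate how byes enter a tournament win probability, the sketch below propagates pairwise win probabilities through a small single-elimination bracket. It is not taken from make_tourn_probs_conformal.R: the team ratings, the logistic link used to build the pairwise matrix `P`, and the nested-list bracket representation are all assumptions made for this example. A team with a bye is simply placed higher in the bracket.

``` r
# Hypothetical team ratings; P[i, j] is the probability that i beats j.
rating <- c(A = 8, B = 3, C = 0, D = -2)
P <- outer(rating, rating, function(a, b) plogis((a - b) / 5))
dimnames(P) <- list(names(rating), names(rating))

# A bracket is a nested list: a leaf is a team name, a node is a pair of
# sub-brackets. Returns each team's probability of winning the bracket.
win_dist <- function(node, P) {
  if (is.character(node)) return(setNames(1, node))
  left  <- win_dist(node[[1]], P)
  right <- win_dist(node[[2]], P)
  # A team advances if it wins its own side and then beats whichever
  # opponent emerges from the other side.
  advance <- function(side, other) {
    sapply(names(side), function(i) {
      side[[i]] * sum(other * P[i, names(other)])
    })
  }
  c(advance(left, right), advance(right, left))
}

# A has a bye to the final, B has a bye to the semifinal,
# and C and D meet in the first round.
bracket <- list("A", list("B", list("C", "D")))
champ <- win_dist(bracket, P)
round(champ, 3)
```

Because each pairwise probability and its reverse sum to one, the championship probabilities sum to one across the field, which is a useful sanity check for any bracket encoding.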

Calibration

Once the previous workflow steps are complete, the calibration results for the paper can be generated with:

``` r
source("calibrate_ncaa_all.R")
```
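As a rough illustration of what a calibration check involves, the sketch below bins predicted win probabilities and compares each bin’s average prediction with the observed win frequency. It uses simulated data and is not the procedure in calibrate_ncaa_all.R; the variable names, the ten equal-width bins, and the Brier score summary are assumptions chosen for this example.

``` r
set.seed(1)

# Simulated predictions and outcomes (calibrated by construction).
n <- 5000
p_hat <- runif(n)                       # hypothetical predicted win probabilities
y <- rbinom(n, size = 1, prob = p_hat)  # 1 if the predicted team won

# Group games into ten equal-width probability bins.
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# For well-calibrated estimates, mean_pred and obs_freq should be close.
calib <- data.frame(
  bin       = levels(bins),
  mean_pred = as.vector(tapply(p_hat, bins, mean)),
  obs_freq  = as.vector(tapply(y, bins, mean)),
  n_games   = as.vector(table(bins))
)
print(calib)

# Brier score: mean squared difference between prediction and outcome.
brier <- mean((p_hat - y)^2)
```

Plotting `obs_freq` against `mean_pred` gives a reliability diagram; points near the diagonal indicate good calibration.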

March Madness Probabilities

In addition to conformal win probabilities, in the paper we also introduce a closed-form approach to generate probabilities for teams making the March Madness tournament field, under a simplified selection process. The probabilities of making the tournament for teams in situations 2 and 3 can be reconstructed with

``` r
source("make_tourn_probs_conformal.R")
```

The March Madness field probabilities for teams in situation 4, who can only make the March Madness field by winning their conference tournament, have been generated in previous sections.

March Madness win probabilities for each of the exemplar brackets described in the paper can be generated with

``` r
source("bracket_exemplars_conformal.R")
```

Conformal Predictive Distribution Example

In the paper we construct the conformal predictive distribution for the margin of victory between Baylor and Oregon State using data from the 2019-2020 regular season. We reproduce the example below:

To generate additional CPDs for different games, use the generate_cpd() function within generate_cpd_ncaa.R. An example of the function in use is shown below, with CPDs generated for Baylor against South Carolina and Connecticut:

``` r
source("generate_cpd_ncaa.R")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#> Warning: package 'ggplot2' was built under R version 4.0.4
#> Warning: package 'Iso' was built under R version 4.0.3
#> Iso 0.0-18.1
#> Warning: package 'doParallel' was built under R version 4.0.2
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel

# Two women's games from the 2019 season, both with Baylor as the home team
numgames <- 2
yearvec <- rep(2019, times = numgames)
leaguevec <- rep("w", times = numgames)
hometeamvec <- rep("Baylor", times = numgames)
awayteamvec <- c("SouthCarolina", "Connecticut")
cpdBaylor <- generate_cpd(yearvec, leaguevec, hometeamvec, awayteamvec)

# Candidate margins of victory at which each CPD is plotted
y_cand <- seq(-50, 50, length.out = 201)
w <- 4
h <- 3
par(pin = c(w, h))
plot(y = cpdBaylor$cpd[, 1], x = y_cand, type = "l", lwd = 2,
     ylab = expression(pi(mov, tau ~ "=1/2")), xlab = "mov", cex.lab = 1.5)
lines(y = cpdBaylor$cpd[, 2], x = y_cand, lwd = 2, col = "red", lty = 2)
abline(v = -1, col = "blue", lty = 3, lwd = 2)
legend("topleft", legend = c("South Carolina", "Connecticut"),
       col = c("black", "red"), lwd = 2, lty = c(1, 2))
```
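For readers who want to see the idea behind a conformal predictive distribution without the repository data, the following is a minimal sketch of a split conformal predictive distribution for margin of victory. It is a simplified stand-in, not the construction used by generate_cpd(), which may differ: the simulated data, the single covariate `x`, the linear model, and the helper `split_cpd()` are all assumptions made for this example. The win probability is read off as one minus the distribution evaluated at a margin of zero.

``` r
set.seed(2020)

# Simulated margins of victory (home minus away); x is a hypothetical
# strength difference between the two teams. None of this is real data.
n <- 400
x <- rnorm(n, sd = 8)
mov <- 3 + x + rnorm(n, sd = 10)
dat <- data.frame(x = x, mov = mov)

# Split the games: one half fits the model, the other half calibrates it.
train <- dat[1:200, ]
calib <- dat[201:400, ]
fit <- lm(mov ~ x, data = train)

# Residuals on the calibration half act as conformity scores.
res <- calib$mov - predict(fit, newdata = calib)

# Split conformal predictive distribution at candidate margins y_cand
# for a new game with covariate x_new, using tau = 1/2 for ties.
split_cpd <- function(y_cand, x_new, tau = 0.5) {
  mu_new <- predict(fit, newdata = data.frame(x = x_new))
  sapply(y_cand, function(y) {
    a <- y - mu_new
    (sum(res < a) + tau * sum(res == a) + tau) / (length(res) + 1)
  })
}

y_cand <- seq(-50, 50, length.out = 201)
cpd <- split_cpd(y_cand, x_new = 5)

# Estimated probability that the home team wins (margin above zero).
win_prob <- 1 - split_cpd(0, x_new = 5)
```

The resulting `cpd` vector is nondecreasing in the candidate margin and stays strictly between 0 and 1, so it can be plotted against `y_cand` in the same way as the curves above.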

References

Johnstone, Chancellor. 2020. “Shape-Restricted Random Forests and Semiparametric Prediction Intervals.” PhD thesis, Iowa State University.
Johnstone, Chancellor, and Dan Nettleton. 2023. “Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments.” *The American Statistician*: 1–14.
Vovk, Vladimir, Jieli Shen, Valery Manokhin, and Min-ge Xie. 2019. “Nonparametric Predictive Distributions Based on Conformal Prediction.” *Machine Learning* 108 (3): 445–74.

Owner

  • Login: chancejohnstone
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this repo, please use the following citation."
authors:
  - family-names: Johnstone
    given-names: Chancellor
    orcid: https://orcid.org/0000-0002-2185-2208
title: "Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments"
date-released: 2022-03-29
