marchmadnessconformal
Code and data to accompany "Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments"
Conformal Win Probability for NCAA Basketball
Chancellor Johnstone and Dan Nettleton
Introduction
This repository holds all data and code for "Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments", which uses conformal predictive distributions to estimate win probabilities for NCAA Division I women’s (and men’s) basketball. We also implement a simplified closed-form approach to generate probabilities of making the NCAA tournament (March Madness). To use the code and data, clone this repository.
Workflow
The workflow for the results in the paper can be broken down into three main thrusts: 1) data cleaning, 2) win probability generation and 3) calibration. We describe this workflow and associated code in the following section.
Data Cleaning
The data cleaning portion of the workflow consisted of different approaches for the men’s and women’s data. The men’s data was freely and easily accessible from:
The women’s data, unavailable in any easy format, was collected through a lengthy scraping process from:
https://www.ncaa.com/scoreboard/basketball-women/d1/
All scraping was performed using ncaaw_scrape.R. You can scrape all scores from the 2014-2015 season all the way to the 2020-2021 season with:
r
source("ncaaw_scrape.R")
The full scraped dataset is saved as ncaaw_2014_2021_data_fix.csv. The men’s data could be scraped in a similar fashion, but we chose to use the freely available data for the men’s analysis. We then “clean” the two datasets, putting them in a form suited to our analysis, with:
r
source("clean_ncaaw.R")
source("clean_ncaab.R")
The cleaned data is saved as clean_ncaaw_scores_2014_2020_all.csv and clean_ncaam_scores_2014_2020_all.csv for the women and men, respectively.
One issue with both datasets is that team naming is extremely inconsistent, both within and across the men’s and women’s data. For example, North Carolina A&T appears in seven different forms across the two datasets, e.g., NCA&T, NCarolinaAT, NCAT. To remedy this, we select one name for each school as “correct,” then replace every occurrence of a team with its correct name in the “cleaned” datasets. The correct team names are saved in ncaa_clean_names_list.csv. The name cleaning is crucial to getting correct estimates of each team’s strength; otherwise a single team would be split across several names, some with only one or two games in the dataset.
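As an illustration of the name-cleaning step, the following is a minimal sketch and not the code in clean_ncaaw.R or clean_ncaab.R. The lookup table, its column names (`raw`, `clean`), the canonical spelling, and the `standardize_names()` helper are all assumptions made for the example; the actual layout of ncaa_clean_names_list.csv may differ.

```r
# Hypothetical lookup table: each raw spelling maps to one canonical name.
# Column names and the canonical spelling are assumptions for illustration.
name_map <- data.frame(
  raw   = c("NCA&T", "NCarolinaAT", "NCAT"),
  clean = c("NorthCarolinaA&T", "NorthCarolinaA&T", "NorthCarolinaA&T"),
  stringsAsFactors = FALSE
)

# Replace every known variant with its canonical name; leave other names as-is.
standardize_names <- function(x, map) {
  idx <- match(x, map$raw)
  ifelse(is.na(idx), x, map$clean[idx])
}

# Toy game log with three different spellings of the same school.
games <- data.frame(
  home = c("NCA&T", "Baylor", "NCAT"),
  away = c("Baylor", "NCarolinaAT", "Connecticut"),
  stringsAsFactors = FALSE
)
games$home <- standardize_names(games$home, name_map)
games$away <- standardize_names(games$away, name_map)

# Games per team after cleaning: the three variants now count as one team.
games_per_team <- table(c(games$home, games$away))
print(games_per_team)
```

Without the replacement, each spelling would be treated as a separate team with a single game, which is the failure mode described above.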
Win Probability Generation
In order to provide a case study for the 2019-2020 season, we needed to generate conference win probabilities starting from when each conference tournament was cancelled. The conference tournament schedules as they were prior to cancellation for the women and men are shown in conf_tournament_schedule.csv and conf_tournament_schedulew.csv, respectively. Careful consideration was taken to account for byes within each conference tournament. Conference win probability estimates for both women and men were generated using make_tourn_probs_conformal.R. Conference win probabilities for the women and men are saved in conf_wp_w_conf.csv and conf_wp_m_conf.csv, respectively. The conformal win probabilities are generated within this R function, but win probabilities for the linear and logistic approaches are generated using separate R files:
all_year_ranks_test_ncaa_both.R
The first R file generates models for each season along with training and test datasets in specific formats. The second file generates conference win probabilities for each of the two aforementioned methods.
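To show how byes enter a tournament win probability calculation, here is a minimal sketch that is independent of make_tourn_probs_conformal.R. It assumes a single-elimination bracket written as a nested list, in which a team with a bye simply enters at a later round. The team labels, the strengths used to build the pairwise matrix `P`, and the `advance_probs()` helper are hypothetical; in the paper the pairwise probabilities would come from the win probability methods described above.

```r
# Hypothetical pairwise win probabilities: P[i, j] = P(team i beats team j).
# They are built from made-up strengths purely for illustration.
teams <- c("A", "B", "C", "D", "E")
strength <- c(5, 3, 2, 1.5, 1)
P <- outer(strength, strength, function(a, b) a / (a + b))
dimnames(P) <- list(teams, teams)

# Bracket as a nested list. Team A has a first-round bye and meets the
# winner of D vs. E; B plays C on the other side of the bracket.
bracket <- list(list("A", list("D", "E")), list("B", "C"))

# Probability that each team emerges as the winner of a sub-bracket.
advance_probs <- function(node, P) {
  if (is.character(node)) {
    return(setNames(1, node))  # a single team reaches its first game for sure
  }
  left  <- advance_probs(node[[1]], P)
  right <- advance_probs(node[[2]], P)
  # P(i wins here) = P(i arrives) * sum_j P(j arrives) * P(i beats j)
  left_out  <- left  * as.vector(P[names(left),  names(right), drop = FALSE] %*% right)
  right_out <- right * as.vector(P[names(right), names(left),  drop = FALSE] %*% left)
  c(left_out, right_out)
}

tourn_probs <- advance_probs(bracket, P)
print(round(tourn_probs, 3))
```

Because the recursion only combines the two sides of each game, the same function covers brackets with any number of byes, as long as the nesting reflects the schedule.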
Calibration
Once the previous workflow steps are complete, the calibration results for the paper can be generated with
r
source("calibrate_ncaa_all.R")
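For readers unfamiliar with calibration checks, the sketch below shows one common approach: bin the predicted win probabilities and compare the average prediction in each bin with the observed win frequency. This is a generic illustration on simulated data, not a description of what calibrate_ncaa_all.R does; the variable names and the choice of ten equal-width bins are assumptions.

```r
set.seed(1)

# Simulated data (assumption): predicted win probabilities and outcomes
# drawn from a model that is calibrated by construction.
n <- 5000
p_hat <- runif(n)
won <- rbinom(n, 1, p_hat)

# Group games into ten equal-width probability bins.
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# For each bin, compare the mean predicted probability with the win frequency.
calib <- data.frame(
  bin       = levels(bins),
  mean_pred = as.vector(tapply(p_hat, bins, mean)),
  obs_freq  = as.vector(tapply(won, bins, mean)),
  n_games   = as.vector(table(bins))
)
print(calib)

# Largest gap between predicted and observed frequency across bins.
max_gap <- max(abs(calib$mean_pred - calib$obs_freq))
print(max_gap)
```

A well-calibrated method should show small gaps in every bin; large or systematic gaps suggest probabilities that are too extreme or too conservative.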
March Madness Probabilities
In addition to conformal win probabilities, in the paper we also introduce a closed-form approach to generate probabilities for teams making the March Madness tournament field, under a simplified selection process. The probabilities of making the tournament for teams in situations 2 and 3 can be reconstructed with
r
source("make_tourn_probs_conformal.R")
The March Madness field probabilities for teams in situation 4, who can only make the March Madness field by winning their conference tournament, have been generated in previous sections.
March Madness win probabilities for each of the exemplar brackets described in the paper can be generated with
r
source("bracket_exemplars_conformal.R")
Conformal Predictive Distribution Example
In the paper we construct the conformal predictive distribution for the margin of victory between Baylor and Oregon State using data from the 2019-2020 regular season. We reproduce the example below:

To generate additional CPDs for different games, one can use the generate_cpd() function within generate_cpd_ncaa.R. An example of the function in use is shown below, with CPDs generated for Baylor against South Carolina and Connecticut:
``` r
source("generate_cpd_ncaa.R")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
#> Warning: package 'ggplot2' was built under R version 4.0.4
#> Warning: package 'Iso' was built under R version 4.0.3
#> Iso 0.0-18.1
#> Warning: package 'doParallel' was built under R version 4.0.2
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel

# CPDs for Baylor (home) against two opponents in the women's data
numgames <- 2
yearvec <- rep(2019, times = numgames)
leaguevec <- rep("w", times = numgames)
hometeamvec <- rep("Baylor", times = numgames)
awayteamvec <- c("SouthCarolina", "Connecticut")
cpdBaylor <- generate_cpd(yearvec, leaguevec, hometeamvec, awayteamvec)

# Plot both CPDs over the grid of candidate margins of victory
y_cand <- seq(-50, 50, length.out = 201)
w <- 4
h <- 3
par(pin = c(w, h))
plot(y = cpdBaylor$cpd[, 1], x = y_cand, type = "l", lwd = 2,
     ylab = expression(pi(mov, tau ~ "=1/2")), xlab = "mov", cex.lab = 1.5)
lines(y = cpdBaylor$cpd[, 2], x = y_cand, lwd = 2, col = "red", lty = 2)
abline(v = -1, col = "blue", lty = 3, lwd = 2)
legend("topleft", legend = c("South Carolina", "Connecticut"),
       col = c("black", "red"), lwd = 2, lty = c(1, 2))
```
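To give a sense of how a win probability can be read off a conformal predictive distribution, the sketch below builds a CPD with a split-conformal simplification on simulated data. It is not the procedure implemented in generate_cpd(): the single covariate `x`, the linear model, the simulated margins, and the `split_cpd()` helper are all assumptions made for illustration. The home team's win probability is taken here as the mass the CPD places on positive margins of victory.

```r
set.seed(2020)

# Simulated data (assumption): x is a strength difference (home minus away)
# and mov is the home team's margin of victory.
n <- 400
x <- rnorm(n, sd = 8)
mov <- x + rnorm(n, sd = 10)

# Fit on the first half, keep residuals from the second (calibration) half.
train <- seq_len(n) <= n / 2
fit <- lm(mov ~ x, data = data.frame(mov = mov[train], x = x[train]))
resid_calib <- mov[!train] - predict(fit, newdata = data.frame(x = x[!train]))

# Split conformal predictive distribution with residual conformity scores:
# Q(y, tau) = (#{r_i < y - y_hat} + tau * (#{r_i = y - y_hat} + 1)) / (m + 1)
split_cpd <- function(y, y_hat, resid, tau = 0.5) {
  m <- length(resid)
  sapply(y, function(v) {
    (sum(resid < v - y_hat) + tau * (sum(resid == v - y_hat) + 1)) / (m + 1)
  })
}

# CPD for a hypothetical matchup in which the home team is stronger by x = 6.
y_cand <- seq(-50, 50, length.out = 201)
y_hat <- unname(predict(fit, newdata = data.frame(x = 6)))
cpd <- split_cpd(y_cand, y_hat, resid_calib)

# Home win probability: one minus the CPD evaluated at a margin of zero.
win_prob <- 1 - split_cpd(0, y_hat, resid_calib)
print(win_prob)
```

The same final step could be applied to a column of `cpdBaylor$cpd`, for example by interpolating it at a margin of zero, assuming the column is the CPD of the home team's margin of victory evaluated on `y_cand`.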

Owner
- Login: chancejohnstone
- Kind: user
- Repositories: 2
- Profile: https://github.com/chancejohnstone
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this repo, please use the following citation."
authors:
  - family-names: Johnstone
    given-names: Chancellor
    orcid: https://orcid.org/0000-0002-2185-2208
title: "Using Conformal Win Probability to Predict the Winners of the Canceled 2020 NCAA Basketball Tournaments"
date-released: 2022-03-29