Going Under the Hood of the NCAA Tournament Visualization


I was planning on putting together an extensive post on how I created my NCAA Tournament visualization as soon as I had a chance to go back and clean up the code, but since (a) my next two weeks are packed with work, and (b) I have received several questions about how I put together the NCAA visualization, I’ll just remove the veil and show the clunky inner workings of it.

What follows is a guide about how I collected, analyzed, and presented the data. It is not a guide about how it should have been done, but how I actually did it. This is a long-winded way of saying: this code is functional, but not exemplary. To that end, I welcome any suggestions for improvement, especially with the R code. (Seriously, you will see some things that will make your head spin, and some routines that take exponentially longer than they should.) Please e-mail me or leave a comment should you have any suggestions, and I’ll happily work it into the post.

Without further ado, here’s a walk-through, from start to finish.


 

Instructions

Collecting the Data

All of the data came from scraping the NCAA Stats website. To our benefit, NCAA Stats uses a very simple, clean, and predictable interface for displaying its statistics. This makes it easy to leverage Python and the wonderful BeautifulSoup library to create an accurate and thorough scraper. BeautifulSoup is designed to help scrapers get the job done quickly, and I highly recommend it for any kind of Python-based scraping. Among other things, it can parse a document, translate it into a tree, and allow easy navigation across that tree (e.g., selecting a specific element based on its ID, and selecting elements following or preceding that element). The scraper, and all of the code powering it, may be obtained here.

Basically, it does the following: (1) opens the page containing all of the Division 1 teams for a given year, as seen here, and identifies all of the hyperlinks that have a specific path (to the team page). The script then (2) opens each of these links (example) and (a) creates a list of the games that team played in (example) and (b) opens a separate page containing the aggregate statistics (example); in both instances, it again relies on identifying hyperlinks with specific paths. Finally, (3) for both the individual game and the player stats, it looks for structural features on that page (e.g., the first table with a given ID or class) to generate statistics. For example, to get the number of free throws attempted by the away team’s third player, the script would look for the second table with the class “mytable”, the third table row (of all minus 1 table rows), and the ninth cell in that row. Again, because the NCAA Stats website is fairly static, we can tell the script to capture data based on (mostly) absolute positions. However, BeautifulSoup has the tools required to deal with more dynamic pages as well.

That script spits out five tab-delimited data files: game_data.tsv (summary statistics for each game), player_data.tsv (individual player data for each game), summary_player_data.tsv (summary statistics for each player), summary_team_data.tsv (summary statistics for each team), and team_data.tsv (essentially game_data, but split by team). I tried to make the variable names fairly descriptive, but feel free to leave a comment if any are unclear.

Again, the scraper and its source code may be found here.

 

Calculating and Plotting

To perform a few additional calculations (e.g., point differentials, opponent season stats), I used R, which I highly recommend to anyone interested in statistics. R is basically a programming language for statistical computing, enabling users to perform a variety of statistical tests in just a few lines of code. It is also extensible, with several great packages. To be clear, R has a rather considerable learning curve and can appear very daunting at first, but it is very powerful (and F/OSS!), so I recommend you stick with it (and perhaps seek out a local R User Group). Although one can use a simple text editor to create scripts (or work in interactive mode within R), I highly recommend using RStudio (also F/OSS) to make your life easier.

The first thing I did was clean up the data files produced by the scraper. For example, I cleared cells that contained “N/A” or “-”; R will just treat those blanks as NAs. I also noticed that a few schools (Cal St. Northridge, Duke, Ill.-Chicago, Northern Ariz., Quinnipiac, and South Dakota) had a few data points that were miscalculated by the NCAA; I recalculated those myself. You can make these edits by opening the .tsv file in a simple text editor or in a spreadsheet program (be careful with how you import and export with the latter).
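As an alternative to blanking those cells by hand, read.csv() can be told which strings count as missing through its na.strings argument. A minimal sketch, using a made-up three-row table in place of the real .tsv files (the column names and values here are illustrative only):

```r
# Hypothetical stand-in for one of the scraper's .tsv files
raw <- "team_id\tteam_name\tfgm\tfga\n1\tTeam A\t25\t60\n2\tTeam B\tN/A\t-\n3\tTeam C\t30\t\n"

# Treat "N/A", "-", and empty cells as missing while reading
demo <- read.csv(text = raw, sep = "\t", header = TRUE, row.names = 1,
                 na.strings = c("N/A", "-", ""), stringsAsFactors = FALSE)

# fgm and fga come back as numeric columns, with NA where the markers were
str(demo)
```

This would leave the scraper's output files untouched, which makes it easier to re-run the whole pipeline after a fresh scrape.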

From that point on, it was just working with R. The cleaned-up data files and the R script (everything needed to produce the plots) may be downloaded here. The script alone may be found here. I do want to reiterate that this is not particularly good code, but it is functional. If I had a bit more time, there’s a lot I would change about it. In addition to a lot of hacky parts (like eval(parse(paste())); not pretty), it also calculates several things that aren’t being used here (I planned to use those variables for other things, but never got around to it). If you have any suggestions on how to make the code better or more efficient, please leave a comment or e-mail me, and I’d be happy to append any suggestions to this post.
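On the eval(parse(paste())) point: R's [[ operator accepts a column name held in a string, which would likely cover most of the dynamic column access in the script. A small sketch on a toy data frame (the data and the loop are illustrative, not taken from the script):

```r
# Toy stand-in for ind.game; values are made up
games <- data.frame(home_team_fgm = c(20, 25), home_team_fga = c(50, 50),
                    away_team_fgm = c(18, 30), away_team_fga = c(60, 60))

for (side in c("home", "away")) {
  made      <- paste0(side, "_team_fgm")
  attempted <- paste0(side, "_team_fga")
  # Same effect as building "games$<side>_team_fgpct <- ..." as text and eval()-ing it
  games[[paste0(side, "_team_fgpct")]] <- games[[made]] / games[[attempted]]
}

games$home_team_fgpct  # 0.4 0.5
games$away_team_fgpct  # 0.3 0.5
```

Besides being shorter, this form fails with an ordinary error at the offending line if a column name is misspelled, rather than inside a generated string.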

The first thing I did was load the required libraries (you can install these with install.packages("[package name]")), the data files (along with instructions about how to treat certain columns), a file containing a full textual mapping for each variable (to have nice-looking Y-axes), and finally a vector containing each tournament team (using the team IDs):

################# LOAD LIBRARIES
library("Cairo")
library("ggplot2")
library("gridExtra")
library("psych")
library("reshape")
library("scales")
library("sm")

################# LOAD DATA
# Load aggregate player data
agg.player <- read.csv(file="data/summary_player_data.tsv", sep="\t", header=TRUE, row.names=1, na.strings="?")
agg.player$player_name <- as.character(agg.player$player_name)
agg.player$team_name <- as.character(agg.player$team_name)
agg.player$pos <- factor(agg.player$pos, c("G", "F", "C"))
agg.player$year <- factor(agg.player$year, c("Fr", "So", "Jr", "Sr"))
agg.player$height = sapply(strsplit(as.character(agg.player$height), "-"), function(x){12*as.numeric(x[1]) + as.numeric(x[2])})
agg.player$minutes = sapply(strsplit(as.character(agg.player$minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})

# Load aggregate team data
agg.team <- read.csv(file="data/summary_team_data.tsv", sep="\t", header=TRUE, row.names=1, na.strings="?")
agg.team$team_name <- as.character(agg.team$team_name)
agg.team$team_minutes = sapply(strsplit(as.character(agg.team$team_minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})
agg.team$opp_team_minutes = sapply(strsplit(as.character(agg.team$opp_team_minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})

# Load Individual Game Data
ind.game <- read.csv(file="data/game_data.tsv", sep="\t", header=TRUE, row.names=1, na.strings="?")
ind.game$game_date <- as.Date(ind.game$game_date, format='%m/%d/%Y')
ind.game$home_team_name <- as.character(ind.game$home_team_name)
ind.game$away_team_name <- as.character(ind.game$away_team_name)
ind.game$home_team_minutes = sapply(strsplit(as.character(ind.game$home_team_minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})
ind.game$away_team_minutes = sapply(strsplit(as.character(ind.game$away_team_minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})


# Load Individual Player Data
ind.player <- read.csv(file="data/player_data.tsv", sep="\t", header=TRUE, row.names=NULL, na.strings="?")
ind.player$player_name <- as.character(ind.player$player_name)
ind.player$team_name <- as.character(ind.player$team_name)
ind.player$pos <- factor(ind.player$pos, c("G", "F", "C"))
ind.player$minutes = sapply(strsplit(as.character(ind.player$minutes), ":"), function(x){as.numeric(x[1]) + as.numeric(x[2])/60})
ind.player$game_date <- as.Date(ind.player$game_date, format='%m/%d/%Y')

# Load Variable Names (to label axes when generating graphs)
varnames <- read.csv(file="data/varnames.csv", header=TRUE, row.names=NULL)

# Create a vector with all of the 64 teams in the NCAA Tournament
tourneyteams <- c("328", "251", "235", "418", "740", "110", "626", "457", "739", "522", "428", "5", "649", "508", "28755", "772", "260", "518", "473", "327", "796", "29", "513", "545", "782", "311", "14927", "433", "340", "275", "310", "665", "306", "415", "387", "688", "465", "87", "301", "490", "690", "157", "83", "107", "441", "173", "534", "317", "367", "193", "416", "609", "521", "404", "169", "156", "434", "140", "610", "529", "472", "735", "14", "488")
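The loading code above repeats the same strsplit()/sapply() pattern for every minutes and height column. As a sketch, that arithmetic could be wrapped in two small helpers (to_minutes and to_inches are hypothetical names, not part of the original script):

```r
# Convert "MM:SS" strings to decimal minutes (same arithmetic as the sapply() calls above)
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
}

# Convert "feet-inches" strings to total inches
to_inches <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  sapply(parts, function(p) 12 * as.numeric(p[1]) + as.numeric(p[2]))
}

to_minutes(c("200:00", "32:30"))  # 200.0 32.5
to_inches(c("6-3", "7-0"))        # 75 84
```

A call such as agg.player$height <- to_inches(agg.player$height) would then replace each of the longer one-liners.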

Next, I (1) calculated a few basic percentages for the individual game statistics (e.g., three-point percentage); (2) calculated some additional team-level information (e.g., average home game point differential); (3) added all of this information to each individual game; (4) created a new data frame of team-level data for each game (e.g., Team A stats and Team B (opponent) stats, followed by Team B stats and Team A (opponent) stats; thus, each game would have two entries in this data frame); (5) appended each respective team’s aggregate season statistics to each game; and finally (6) took that aggregate opponent data and fed it into our team-level data frame, created in step 4. Again, several of these variables were not used in the plots that were generated, but some were (e.g., opponents’ average home point differential). This is the code used for that:

################# TAKE OUR EXISTING DATA, ADD TO IT, AND CREATE NEW DATAFRAMES
##### Create a list with our basic stats
basicgamestats <- c("fgm", "fga", "three_fgm", "three_fga", "ft", "fta", "pts", "ptsavg", "offreb", "defreb", "totreb", "rebavg", "ast", "to", "stl", "blk", "fouls", "dbldbl", "trpdbl")
basicgamestats_team <- basicgamestats


##### Add a few pieces of data to each game
# Point differential for each game
ind.game$ptsdiff <- ind.game$home_team_pts - ind.game$away_team_pts
# Field goal percentage
ind.game$away_team_fgpct <- ind.game$away_team_fgm/ind.game$away_team_fga
ind.game$home_team_fgpct <- ind.game$home_team_fgm/ind.game$home_team_fga
# Three-point percentage
ind.game$away_team_three_fgpct <- ind.game$away_team_three_fgm/ind.game$away_team_three_fga
ind.game$home_team_three_fgpct <- ind.game$home_team_three_fgm/ind.game$home_team_three_fga
# Free-throw percentage
ind.game$away_team_ftpct <- ind.game$away_team_ft/ind.game$away_team_fta
ind.game$home_team_ftpct <- ind.game$home_team_ft/ind.game$home_team_fta

# Add our changes to our list of variables
basicgamestatsaddpct <- c("fgpct", "three_fgpct", "ftpct")
basicgamestats_team <- append(basicgamestats_team, basicgamestatsaddpct)
basicgamestatsaddunique <- c("ptsdiff")


##### Calculate additional team data
# Create a vector with each team
agg.team.teams <- row.names(agg.team)

# Create empty vectors to later populate and transform into data frame
vec_home_avg_ptsdiff <- NULL
vec_away_avg_ptsdiff <- NULL
vec_avg_ptsdiff <- NULL
vec_home_wins <- NULL
vec_away_wins <- NULL
vec_wins <- NULL
vec_home_losses <- NULL
vec_away_losses <- NULL
vec_losses <- NULL
vec_winpct <- NULL
vec_guard_points <- NULL
vec_guard_avg_height <- NULL
vec_forward_points <- NULL
vec_forward_avg_height <- NULL

# Parse each team and generate statistics
for (team in seq(along=agg.team.teams)) {
  gen_team_id <- as.character(agg.team.teams[team])
  gen_agg_team_stats <- agg.team[gen_team_id, ]
  gen_agg_player_stats <- agg.player[agg.player$team_id == gen_team_id, ]
  
  # Calculate a handful of general team statistics
  gen_home_ptsdiff <- ind.game[ind.game$home_team_id == gen_team_id, ]$ptsdiff
  gen_away_ptsdiff <- ind.game[ind.game$away_team_id == gen_team_id, ]$ptsdiff * -1
  gen_total_ptsdiff <- c(gen_home_ptsdiff, gen_away_ptsdiff)
  gen_home_avg_ptsdiff <- mean(gen_home_ptsdiff)
  gen_away_avg_ptsdiff <- mean(gen_away_ptsdiff)
  gen_avg_ptsdiff <- mean(gen_total_ptsdiff)
  gen_home_wins <- length(which(gen_home_ptsdiff > 0))
  gen_away_wins <- length(which(gen_away_ptsdiff > 0))
  gen_wins <- gen_home_wins + gen_away_wins
  gen_home_losses <- length(which(gen_home_ptsdiff < 0))
  gen_away_losses <- length(which(gen_away_ptsdiff < 0))
  gen_losses <- gen_home_losses + gen_away_losses
  gen_winpct <- (gen_wins)/(gen_wins+gen_losses)  
  
  # Calculate a handful of position-specific stats
  gen_guards <- gen_agg_player_stats[gen_agg_player_stats$pos == "G", ]
  gen_guard_points <- sum(gen_guards$pts, na.rm=TRUE)/sum(gen_agg_team_stats$team_pts, na.rm=TRUE)
  gen_guard_avg_height <- mean(gen_guards$height, na.rm=TRUE)
  gen_forwards <- gen_agg_player_stats[grep("F|C", gen_agg_player_stats$pos), ]
  gen_forward_points <- sum(gen_forwards$pts, na.rm=TRUE)/sum(gen_agg_team_stats$team_pts, na.rm=TRUE)
  gen_forward_avg_height <- mean(gen_forwards$height, na.rm=TRUE)
  
  # Append these variables to their respective vectors to add to data frame
  vec_home_avg_ptsdiff <- append(vec_home_avg_ptsdiff, gen_home_avg_ptsdiff)
  vec_away_avg_ptsdiff <- append(vec_away_avg_ptsdiff, gen_away_avg_ptsdiff)
  vec_avg_ptsdiff <- append(vec_avg_ptsdiff, gen_avg_ptsdiff)
  vec_home_wins <- append(vec_home_wins, gen_home_wins)
  vec_away_wins <- append(vec_away_wins, gen_away_wins)
  vec_wins <- append(vec_wins, gen_wins)
  vec_home_losses <- append(vec_home_losses, gen_home_losses)
  vec_away_losses <- append(vec_away_losses, gen_away_losses)
  vec_losses <- append(vec_losses, gen_losses)
  vec_winpct <- append(vec_winpct, gen_winpct)
  vec_guard_points <- append(vec_guard_points, gen_guard_points)
  vec_guard_avg_height <- append(vec_guard_avg_height, gen_guard_avg_height)
  vec_forward_points <- append(vec_forward_points, gen_forward_points)
  vec_forward_avg_height <- append(vec_forward_avg_height, gen_forward_avg_height)
  
}

# Add our extra bits of data to the agg.team data frame
agg.team$team_home_avg_ptsdiff <- vec_home_avg_ptsdiff
agg.team$team_away_avg_ptsdiff <- vec_away_avg_ptsdiff
agg.team$team_avg_ptsdiff <- vec_avg_ptsdiff
agg.team$team_home_wins <- vec_home_wins
agg.team$team_away_wins <- vec_away_wins
agg.team$team_wins <- vec_wins
agg.team$team_home_losses <- vec_home_losses
agg.team$team_away_losses <- vec_away_losses
agg.team$team_losses <- vec_losses
agg.team$team_winpct <- vec_winpct
agg.team$team_guard_points <- vec_guard_points
agg.team$team_guard_avg_height <- vec_guard_avg_height
agg.team$team_forward_points <- vec_forward_points
agg.team$team_forward_avg_height <- vec_forward_avg_height

# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))

# Add a few more elements to our vector of variables
basicgamestatsaddsingle <- c("home_avg_ptsdiff","away_avg_ptsdiff","avg_ptsdiff","home_wins","away_wins","wins","home_losses","away_losses","losses","winpct","guard_points","guard_avg_height","forward_points","forward_avg_height")
basicgamestats_team <- append(basicgamestats_team, basicgamestatsaddsingle)


##### Append new team data to each game
# Create our list of games
ind.game.games <- row.names(ind.game)

home_away_vars_to_populate <- NULL
for (variable in seq(along=basicgamestats_team)) {
  gen_var_name <- basicgamestats_team[variable]
  home_away_vars_to_populate <- append(home_away_vars_to_populate, paste0("team_", gen_var_name))
}

# Create empty vectors for the information to be added
for (variable in seq(along=home_away_vars_to_populate)) {
  gen_var_name <- home_away_vars_to_populate[variable]
  command <- paste0("gen_var_away_vec_", gen_var_name, " <- NULL")
  eval(parse(text=command))
  command <- paste0("gen_var_home_vec_", gen_var_name, " <- NULL")
  eval(parse(text=command))
}
  
# Parse each game and generate statistics
for (game in seq(along=ind.game.games)) {
  gen_game_id <- ind.game.games[game]
  gen_away_team_id <- as.character(ind.game[gen_game_id, ]$away_team_id)
  gen_home_team_id <- as.character(ind.game[gen_game_id, ]$home_team_id)
  
  # For each one of our variables
  for (variable in seq(along=home_away_vars_to_populate)) {
    gen_var_name <- home_away_vars_to_populate[variable]
    
    # Get values from team data frame
    command <- paste0("gen_values_away_", gen_var_name, " <- agg.team['", gen_away_team_id, "', ]$", gen_var_name)
    eval(parse(text=command))
    command <- paste0("gen_values_home_", gen_var_name, " <- agg.team['", gen_home_team_id, "', ]$", gen_var_name)
    eval(parse(text=command))

    # Append it to our earlier vector
    command <- paste0("gen_var_away_vec_", gen_var_name, " <- append(gen_var_away_vec_", gen_var_name, ", gen_values_away_", gen_var_name, ")")
    eval(parse(text=command))
    command <- paste0("gen_var_home_vec_", gen_var_name, " <- append(gen_var_home_vec_", gen_var_name, ", gen_values_home_", gen_var_name, ")")
    eval(parse(text=command))
  }
}

# Add the data to the data frame
for (variable in seq(along=home_away_vars_to_populate)) {
  gen_var_name <- home_away_vars_to_populate[variable]
  command <- paste0("ind.game$away_season_", gen_var_name, " <- gen_var_away_vec_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("ind.game$home_season_", gen_var_name, " <- gen_var_home_vec_", gen_var_name)
  eval(parse(text=command))
}

# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))


##### Create a new team-level data frame that has individual game stats and looks like: Team Stats, Opponent Stats (Note: Each game will thus have two entries!)
# First, the away team
vec_team_away <- data.frame(thisteam_team_id = ind.game$away_team_id, thisteam_team_name = ind.game$away_team_name, game_id = row.names(ind.game), game_date = ind.game$game_date, neutral_site = ind.game$neutral_site, home = 0, opp_team_id = ind.game$home_team_id, opp_team_name = ind.game$home_team_name)
for (variable in seq(along=basicgamestats_team)) {
  gen_var_name <- basicgamestats_team[variable]
  command <- paste0("vec_team_away$thisteam_", gen_var_name, " = ind.game$away_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_away$opp_", gen_var_name, " = ind.game$home_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_away$thisteam_season_", gen_var_name, " = ind.game$away_season_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_away$opp_season_", gen_var_name, " = ind.game$home_season_team_", gen_var_name)
  eval(parse(text=command))
}
for (variable in seq(along=basicgamestatsaddunique)) {
  gen_var_name <- basicgamestatsaddunique[variable]
  command <- paste0("vec_team_away$", gen_var_name, " = ind.game$", gen_var_name)
  eval(parse(text=command))
}
vec_team_away$ptsdiff <- vec_team_away$ptsdiff * -1 # The way we calculate ptsdiff is Home Team Score - Away Team Score, so we need to take the inverse here.

# Then, the home team
vec_team_home <- data.frame(thisteam_team_id = ind.game$home_team_id, thisteam_team_name = ind.game$home_team_name, game_id = row.names(ind.game), game_date = ind.game$game_date, neutral_site = ind.game$neutral_site, home = 1, opp_team_id = ind.game$away_team_id, opp_team_name = ind.game$away_team_name)
for (variable in seq(along=basicgamestats_team)) {
  gen_var_name <- basicgamestats_team[variable]
  command <- paste0("vec_team_home$thisteam_", gen_var_name, " = ind.game$home_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_home$opp_", gen_var_name, " = ind.game$away_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_home$thisteam_season_", gen_var_name, " = ind.game$home_season_team_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_home$opp_season_", gen_var_name, " = ind.game$away_season_team_", gen_var_name)
  eval(parse(text=command))
}
for (variable in seq(along=basicgamestatsaddunique)) {
  gen_var_name <- basicgamestatsaddunique[variable]
  command <- paste0("vec_team_home$", gen_var_name, " = ind.game$", gen_var_name)
  eval(parse(text=command))
}

# Bind the two dataframes into ind.team, which will have team-level data for each game
ind.team <- rbind(vec_team_away, vec_team_home)

# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))


##### Add in opponents' winning percentage, etc. (all stats from basicgamestatsaddsingle, but for opponents)
# Create a vector with each team
agg.team.teams <- row.names(agg.team)

# Create empty vectors for the information to be added
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("gen_var_vec_opp_season_", gen_var_name, " <- NULL")
  eval(parse(text=command))
}

# Parse each TEAM and generate statistics
for (team in seq(along=agg.team.teams)) {
  gen_team_id <- as.character(agg.team.teams[team])

  # For each one of our variables
  for (variable in seq(along=basicgamestatsaddsingle)) {
    gen_var_name <- basicgamestatsaddsingle[variable]
    
    # Calculate means for each variable
    command <- paste0("gen_values_opp_season_", gen_var_name, " <- mean(ind.team[ind.team$thisteam_team_id == ", gen_team_id, ", ]$opp_season_", gen_var_name, ", na.rm = TRUE)")
    eval(parse(text=command))
    
    # Append it to our earlier vector
    command <- paste0("gen_var_vec_opp_season_", gen_var_name, " <- append(gen_var_vec_opp_season_", gen_var_name, ", gen_values_opp_season_", gen_var_name, ")")
    eval(parse(text=command))
    
    command <- paste0("head(gen_var_vec_opp_season_", gen_var_name, ")")
    eval(parse(text=command))
  }
}

# Add the data to the data frame
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("agg.team$opp_team_", gen_var_name, " <- gen_var_vec_opp_season_", gen_var_name)
  eval(parse(text=command))
}

# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))


##### Now, we append all of that opponent data to each game
# Create our list of games
ind.game.games <- row.names(ind.game)

# Determine what variables need to be populated
opp_vars_to_populate <- NULL
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  opp_vars_to_populate <- append(opp_vars_to_populate, paste0("opp", gen_var_name))
}

# Create empty vectors for the information to be added
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("gen_var_away_vec_", gen_var_name, " <- NULL")
  eval(parse(text=command))
  command <- paste0("gen_var_home_vec_", gen_var_name, " <- NULL")
  eval(parse(text=command))
}

# Parse each game and generate statistics
for (game in seq(along=ind.game.games)) {
  gen_game_id <- ind.game.games[game]
  gen_away_team_id <- as.character(ind.game[gen_game_id, ]$away_team_id)
  gen_home_team_id <- as.character(ind.game[gen_game_id, ]$home_team_id)
  
  # For each one of our variables
  for (variable in seq(along=basicgamestatsaddsingle)) {
    gen_var_name <- basicgamestatsaddsingle[variable]
    
    # Get values from team data frame
    command <- paste0("gen_values_away_", gen_var_name, " <- agg.team['", gen_away_team_id, "', ]$opp_team_", gen_var_name)
    eval(parse(text=command))
    command <- paste0("gen_values_home_", gen_var_name, " <- agg.team['", gen_home_team_id, "', ]$opp_team_", gen_var_name)
    eval(parse(text=command))
    
    # Append it to our earlier vector
    command <- paste0("gen_var_away_vec_", gen_var_name, " <- append(gen_var_away_vec_", gen_var_name, ", gen_values_away_", gen_var_name, ")")
    eval(parse(text=command))
    command <- paste0("gen_var_home_vec_", gen_var_name, " <- append(gen_var_home_vec_", gen_var_name, ", gen_values_home_", gen_var_name, ")")
    eval(parse(text=command))
  }
}

# Add the data to the data frame
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("ind.game$away_season_opp_", gen_var_name, " <- gen_var_away_vec_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("ind.game$home_season_opp_", gen_var_name, " <- gen_var_home_vec_", gen_var_name)
  eval(parse(text=command))
}


# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))


#### And now we take that opponent data and feed it back into our aggregate team dataframe
# First, the away team
# Note: each half gets one row per game (row names from ind.game, not ind.team)
vec_team_away <- data.frame(row.names=row.names(ind.game))
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("vec_team_away$thisteam_season_opp_", gen_var_name, " = ind.game$away_season_opp_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_away$opp_season_opp_", gen_var_name, " = ind.game$home_season_opp_", gen_var_name)
  eval(parse(text=command))
}

# Then, the home team
vec_team_home <- data.frame(row.names=row.names(ind.game))
for (variable in seq(along=basicgamestatsaddsingle)) {
  gen_var_name <- basicgamestatsaddsingle[variable]
  command <- paste0("vec_team_home$thisteam_season_opp_", gen_var_name, " = ind.game$home_season_opp_", gen_var_name)
  eval(parse(text=command))
  command <- paste0("vec_team_home$opp_season_opp_", gen_var_name, " = ind.game$away_season_opp_", gen_var_name)
  eval(parse(text=command))
}

# Stack away rows, then home rows (the same order used to build ind.team), and bind the columns once
vec_ind.team <- rbind(vec_team_away, vec_team_home)
ind.team <- cbind(ind.team, vec_ind.team)

# Erase any existing data frame-populating vectors
rm(list = ls(pattern = "\\bvec_"))
rm(list = ls(pattern = "\\bgen_"))
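For anyone adapting the block above, much of the looping can probably be replaced by indexing. A sketch on toy data (team IDs and scores are invented): tapply() produces per-team averages in one call, and indexing a data frame by a character vector of row names looks up season stats for every game at once, without append() or eval(parse()).

```r
# Toy stand-ins for ind.game and agg.team; all values are made up
games <- data.frame(home_team_id = c(1, 2, 1), away_team_id = c(2, 3, 3),
                    home_team_pts = c(70, 60, 80), away_team_pts = c(65, 66, 70))
teams <- data.frame(team_name = c("A", "B", "C"), row.names = c("1", "2", "3"),
                    stringsAsFactors = FALSE)

games$ptsdiff <- games$home_team_pts - games$away_team_pts

# Average home point differential per team, without an explicit loop
home_avg <- tapply(games$ptsdiff, games$home_team_id, mean)

# Align the averages with the rows of teams; teams with no home games get NA
teams$team_home_avg_ptsdiff <-
  as.vector(home_avg[match(row.names(teams), names(home_avg))])

# Look up each game's home-team season value by row name, all games at once
games$home_season_team_home_avg_ptsdiff <-
  teams[as.character(games$home_team_id), "team_home_avg_ptsdiff"]

games
```

Growing vectors with append() inside a loop copies the vector on each pass, so the indexed form should also run noticeably faster on a full season of games.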

Lastly, it was time to plot all of that data using the ggplot2 library, so I (1) created a few new data frames that only included information for the tournament teams; (2) created vectors containing the variables that I wanted plotted; (3) generated the minimum and maximum values for each variable (to have consistent Y-axes); and (4) ran a loop for each team where, for each variable in each of the vectors in step 2, it would create a temporary data frame with the relevant values, generate a given type of plot, and save it as a PNG file.
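One possible simplification for the plotting code: the nine evenly spaced Y-axis breaks, which the script spells out term by term inside scale_y_continuous(), can be generated with seq(). A sketch with made-up limits (min_avg and max_avg mirror the script's variable names, but the values here are hypothetical):

```r
min_avg <- 10   # hypothetical axis minimum
max_avg <- 50   # hypothetical axis maximum

# Nine evenly spaced breaks from min to max, inclusive
breaks <- seq(min_avg, max_avg, length.out = 9)
breaks  # 10 15 20 25 30 35 40 45 50

# The manual form used in the script gives the same values
range_avg <- max_avg - min_avg
manual <- c(min_avg, (range_avg / 8 * 1:7) + min_avg, max_avg)
stopifnot(isTRUE(all.equal(breaks, manual)))
```

The result could be passed directly as the breaks argument, e.g. breaks = seq(min_avg, max_avg, length.out = 9).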

################# NOW, WE'RE READY TO PLOT! ALL PLOTS GO IN ./PLOTS/
#### Preparations for plotting
# Subset the data to the field of tournament teams
tourney_team_agg_stats <- agg.team[tourneyteams, ]
tourney_team_ind_stats <- subset(ind.team, thisteam_team_id %in% tourneyteams)
tourney_player_ind_stats <- subset(ind.player, team_id %in% tourneyteams)

# Melt the data for percentile rankings
tourney_team_agg_stats.m <- melt(tourney_team_agg_stats)
tourney_team_agg_stats.m[is.na(tourney_team_agg_stats.m)] <- 0
tourney_team_agg_stats.rescale <- ddply(tourney_team_agg_stats.m, .(variable), transform, rescale = scale(value))
tourney_team_agg_stats.percentile <-  ddply(tourney_team_agg_stats.m, .(variable), transform, percentile=ecdf(value)(value))

# Create lists of variables we want to plot, by dataframe
ind.player.plot <- c("three_fga", "three_fgm", "ast", "blk", "fga", "fgm", "fouls", "fta", "ft", "pts", "defreb", "offreb", "totreb", "stl", "to")
ind.team.plot <- c("three_fgpct", "three_fga", "three_fgm", "ast", "blk", "fgpct", "fga", "fgm", "fouls", "ftpct", "fta", "ft", "pts", "defreb", "offreb", "stl", "totreb", "to")
agg.team.plot <- c("three_fgpct", "three_fga", "three_fgm", "ast", "forward_avg_height", "guard_avg_height", "avg_ptsdiff", "away_avg_ptsdiff", "home_avg_ptsdiff", "ptsavg", "rebavg", "blk", "fgpct", "fga", "fgm", "fouls", "ftpct", "fta", "ft", "losses", "away_losses", "home_losses", "forward_points", "guard_points", "pts", "defreb", "offreb", "totreb", "stl", "to", "winpct", "away_wins", "home_wins", "wins")

# Get Min, Mean, and Max (to set limits for Y axes in each plot).
range_team_agg_stats <- data.frame(stats=c("min", "mean", "max"))
for (variable in seq(along=agg.team.plot)) {
  gen_var_name <- agg.team.plot[variable]
  command <- paste0("try(range_team_agg_stats$", gen_var_name, " <- c(min(agg.team$team_", gen_var_name, ", na.rm=TRUE), mean(agg.team$team_", gen_var_name, ", na.rm=TRUE), max(agg.team$team_", gen_var_name, ", na.rm=TRUE)))")
  eval(parse(text=command))
}

range_team_ind_stats <- data.frame(stats=c("min", "mean", "max"))
for (variable in seq(along=ind.team.plot)) {
  gen_var_name <- ind.team.plot[variable]
  command <- paste0("try(range_team_ind_stats$", gen_var_name, " <- c(min(ind.team$thisteam_", gen_var_name, ", na.rm=TRUE), mean(ind.team$thisteam_", gen_var_name, ", na.rm=TRUE), max(ind.team$thisteam_", gen_var_name, ", na.rm=TRUE)))")
  eval(parse(text=command))
}

range_player_ind_stats <- data.frame(stats=c("min", "mean", "max"))
for (variable in seq(along=ind.player.plot)) {
  gen_var_name <- ind.player.plot[variable]
  command <- paste0("try(range_player_ind_stats$", gen_var_name, " <- c(min(ind.player$", gen_var_name, ", na.rm=TRUE), mean(ind.player$", gen_var_name, ", na.rm=TRUE), max(ind.player$", gen_var_name, ", na.rm=TRUE)))")
  eval(parse(text=command))
}

# Set plot size
X11(width=6.4, height=6.4)


#### Start plotting, running a loop for each team in tourneyteams
for (item in seq(along=tourneyteams)) {
  team <- tourneyteams[item]
  teamname <- agg.team[team, "team_name"]
  team_ind_games <- ind.team[ind.team$thisteam_team_id == team, ]
  rows_in_team_ind_games <- nrow(team_ind_games)
  
  # Create bar graphs for each aggregate team stat
  for (variable in seq(along=agg.team.plot)) {
    tryCatch({
      gen_var_name <- agg.team.plot[variable]
      gen_thisteam_var_name <- paste0("team_", gen_var_name)
      gen_opp_var_name <- paste0("opp_team_", gen_var_name)
      tmp_thisteam_stat_df <- subset(agg.team, team_name == teamname, select = c("team_name", gen_thisteam_var_name))
      colnames(tmp_thisteam_stat_df) <- c("team", "vals")
      tmp_opp_stat_df <- subset(agg.team, team_name == teamname, select = c(gen_opp_var_name))
      colnames(tmp_opp_stat_df) <- c("vals")
      tmp_opp_stat_df$team <- c ("Opposing Teams")
      command <- paste0('tmp_avg_stat_df <- data.frame(team=c("National Average", "Tournament Average"), vals=c(mean(agg.team$team_', gen_var_name, ', na.rm=TRUE), mean(tourney_team_agg_stats$team_', gen_var_name, ', na.rm=TRUE)))')
      eval(parse(text=command))
      tmp_stat_df <- rbind(tmp_thisteam_stat_df, tmp_opp_stat_df, tmp_avg_stat_df)
      
      tmp_stat_df$team <- factor(tmp_stat_df$team, c(teamname, 'Opposing Teams', 'National Average', 'Tournament Average'))
      eval(parse(text=paste0("min_avg <- range_team_agg_stats$", gen_var_name, "[1]")))
      eval(parse(text=paste0("max_avg <- range_team_agg_stats$", gen_var_name, "[3]")))
      range_avg = max_avg - min_avg
      p <- ggplot(data=tmp_stat_df, aes(x=team, y=vals, fill=team)) +
        geom_bar(stat="identity") +
        scale_y_continuous(limits=c(min_avg,max_avg), breaks=c(min_avg,((range_avg/8*1)+min_avg),((range_avg/8*2)+min_avg),((range_avg/8*3)+min_avg),((range_avg/8*4)+min_avg),((range_avg/8*5)+min_avg),((range_avg/8*6)+min_avg),((range_avg/8*7)+min_avg),max_avg), oob=rescale_none) +
        xlab("") +
        ylab(varnames[varnames$Variable == gen_var_name, "Text"]) +
        scale_fill_manual(values=c("#9E373E", "#319C86", "#999999", "#D1B993")) +
        guides(fill=FALSE) +
        theme_bw()
      p <- arrangeGrob(p, sub = textGrob("rodrigozamith.com", x = 0, hjust = -6, vjust=-0.3, gp = gpar(fontface = "italic", fontsize = 8)))
      plotfile = paste(team,"_team_agg_", gen_var_name, ".png", sep="")
      ggsave(filename=plotfile, path="plots/", width=6.4, height=6.4, plot=p, type = "cairo-png")
    }, error = function(e) NULL)
  }
  
  # Create boxplots for each individual team stat
  for (variable in seq_along(ind.team.plot)) {
    tryCatch({
      gen_var_name <- ind.team.plot[variable]
      gen_this_team_var_name <- paste0("thisteam_", gen_var_name)
      gen_opp_var_name <- paste0("opp_", gen_var_name)
      tmp_thisteam_stat_df <- subset(ind.team, thisteam_team_id == team, select = c("thisteam_team_name", gen_this_team_var_name))
      colnames(tmp_thisteam_stat_df) <- c("team", "vals")
      tmp_opp_stat_df <- subset(ind.team, thisteam_team_id == team, select = c(gen_opp_var_name))
      colnames(tmp_opp_stat_df) <- c("vals")
      tmp_opp_stat_df$team <- c ("Opposing Teams")
      tmp_stat_df <- rbind(tmp_thisteam_stat_df, tmp_opp_stat_df)
      
      tmp_stat_df$team <- factor(tmp_stat_df$team, c(teamname, 'Opposing Teams'))
      # Look up the stat's range by name
      min_avg <- range_team_ind_stats[[gen_var_name]][1]
      max_avg <- range_team_ind_stats[[gen_var_name]][3]
      p <- ggplot(data=tmp_stat_df, aes(x=team, y=vals, fill=team)) +
        geom_jitter(aes(color = team), alpha = 0.4) +
        geom_boxplot(alpha = 0.7) +
        scale_y_continuous(limits=c(min_avg,max_avg), breaks=seq(min_avg, max_avg, length.out=9), oob=rescale_none) +
        xlab("") +
        ylab(varnames[varnames$Variable == gen_var_name, "Text"]) +
        guides(fill=FALSE, color=FALSE) +
        theme_bw()
      p <- arrangeGrob(p, sub = textGrob("rodrigozamith.com", x = 0, hjust = -6, vjust=-0.3, gp = gpar(fontface = "italic", fontsize = 8)))
      plotfile = paste(team,"_team_ind_", gen_var_name, ".png", sep="")
      ggsave(filename=plotfile, path="plots/", width=6.4, height=6.4, plot=p, type = "cairo-png")
    }, error = function(e) NULL)
  }
  
  # Create boxplots for each individual player stat
  for (variable in seq_along(ind.player.plot)) {
    tryCatch({
      gen_var_name <- ind.player.plot[variable]
      tmp_stat_df <- subset(ind.player, team_id == team, select = c("player_name", gen_var_name))
      colnames(tmp_stat_df)[2] <- "vals"
      tmp_stat_df[is.na(tmp_stat_df)] <- 0
      
      # Plot it
      min_avg <- range_player_ind_stats[[gen_var_name]][1]
      max_avg <- range_player_ind_stats[[gen_var_name]][3]
      
      p <- ggplot(data=tmp_stat_df, aes(x=reorder(substring(player_name, 1, 20), -vals), y=vals, fill=player_name)) +
        geom_jitter(aes(color = player_name), alpha = 0.4) + geom_boxplot(alpha = 0.7) +
        scale_y_continuous(limits=c(min_avg,max_avg), breaks=seq(min_avg, max_avg, length.out=9), oob=rescale_none) +
        #xlab("Player Name") +
        xlab("") +
        ylab(varnames[varnames$Variable == gen_var_name, "Text"]) +
        guides(fill=FALSE, color=FALSE) + theme_bw() + theme(axis.text.x=element_text(angle = 330, hjust = 0))
      p <- arrangeGrob(p, sub = textGrob("rodrigozamith.com", x = 0, hjust = -6, vjust=-0.3, gp = gpar(fontface = "italic", fontsize = 8)))
      plotfile = paste(team,"_player_ind_", gen_var_name, ".png", sep="")
      ggsave(filename=plotfile, path="plots/", width=6.4, height=6.4, plot=p, type = "cairo-png")
    }, error = function(e) NULL)
  }
  
  # Create date-based line graph for each team stat
  for (variable in seq_along(ind.team.plot)) {
    tryCatch({
      gen_var_name <- ind.team.plot[variable]
      gen_this_team_var_name <- paste0("thisteam_", gen_var_name)
      gen_opp_var_name <- paste0("opp_", gen_var_name)
      tmp_thisteam_stat_df <- subset(ind.team, thisteam_team_id == team, select = c("thisteam_team_name", "game_date", gen_this_team_var_name))
      colnames(tmp_thisteam_stat_df) <- c("team", "game_date", "vals")
      tmp_opp_stat_df <- subset(ind.team, thisteam_team_id == team, select = c("game_date", gen_opp_var_name))
      colnames(tmp_opp_stat_df) <- c("game_date", "vals")
      tmp_opp_stat_df$team <- c ("Opposing Teams")
      tmp_stat_df <- rbind(tmp_thisteam_stat_df, tmp_opp_stat_df)
      tmp_stat_df[is.na(tmp_stat_df)] <- 0
      
      # Plot it
      min_avg <- range_team_ind_stats[[gen_var_name]][1]
      max_avg <- range_team_ind_stats[[gen_var_name]][3]
      p <- ggplot(data=tmp_stat_df, aes(x=game_date, y=vals, group=team, color=team)) +
        geom_line() +
        scale_y_continuous(limits=c(min_avg,max_avg), breaks=seq(min_avg, max_avg, length.out=9), oob=rescale_none) +
        xlab("") +
        ylab(varnames[varnames$Variable == gen_var_name, "Text"]) +
        labs(color="") +
        theme_bw() +
        theme(axis.text.x=element_text(angle = 330, hjust = 0), legend.position="bottom")
      p <- arrangeGrob(p, sub = textGrob("rodrigozamith.com", x = 0, hjust = -6, vjust=-0.3, gp = gpar(fontface = "italic", fontsize = 8)))
      plotfile = paste(team,"_team_time_", gen_var_name, ".png", sep="")
      ggsave(filename=plotfile, path="plots/", width=6.4, height=6.4, plot=p, type = "cairo-png")
    }, error = function(e) NULL)
  }
  
  # Create a percentile graph for each stat (This might be a bit confusing since high own-team = good, high opp-team = bad. Ideally, you want high, low, respectively.)
  tmp_stat_df <- subset(tourney_team_agg_stats.percentile, team_name == teamname, select = c("team_name", "variable", "percentile"))
  tmp_stat_df$variable <- factor(tmp_stat_df$variable, c("team_fgpct", "opp_team_fgpct", "team_three_fgpct", "opp_team_three_fgpct", "team_ptsavg", "opp_team_ptsavg", "team_offreb", "opp_team_offreb", "team_rebavg", "opp_team_rebavg", "team_ast", "opp_team_ast", "team_to", "opp_team_to", "team_stl", "opp_team_stl", "team_blk", "opp_team_blk", "team_avg_ptsdiff", "opp_team_avg_ptsdiff"))
  
  p <- ggplot(data=tmp_stat_df, aes(x=variable, y=percentile, fill=percentile)) +
    geom_bar(stat="identity") +
    theme(legend.position = "none", axis.text.x=element_text(size=10, angle = 90, hjust = 0)) +
    scale_x_discrete(expand = c(0, 0), labels=c("FG Pct", "Opp FG Pct", "3Pt Pct", "Opp 3Pt Pct", "Avg Pts", "Opp Avg Pts", "Offensive Reb", "Opp Off Reb", "Avg Reb", "Opp Avg Reb", "Assists", "Opp Assists", "Turnovers", "Opp Turnovers", "Steals", "Opp Steals", "Blocks", "Opp Blocks", "Avg Point Diff", "Opp Avg Pt Diff")) +
    ylim(0, 1) +
    xlab("") +
    ylab("Percentile (relative to tournament teams)") +
    scale_fill_gradient(low = "#4199c4", high = "#d12d10") +
    guides(fill=FALSE)
  p <- arrangeGrob(p, sub = textGrob("rodrigozamith.com", x = 0, hjust = -6, vjust=-0.3, gp = gpar(fontface = "italic", fontsize = 8)))
  plotfile = paste(team,"_team_spe_percentile.png", sep="")
  ggsave(filename=plotfile, path="plots/", width=6.4, height=6.4, plot=p, type = "cairo-png")
}

And, with that, I produced about 5,500 images that followed this pattern: [team id]_[type of graph]_[statistic].png

 

Creating the web interface

Now, it was time to present all of this. The first thing I noticed is that my code generated some rather large images (1920×1920 pixels); this was far too large, and I instead aimed for 480×480 pixels. When I quickly attempted to resize the images in the R code (setting a different width and height in ggsave()), I got some funky output; for example, the chart area would be tiny relative to the text, even after I removed the line of code specifying the text size. While this can likely be solved by playing around with the theme options, I instead simply performed batch resizing using ImageMagick’s convert tool (e.g., convert 488_team_spe_percentile.png -resize 480x480 resized/488_team_spe_percentile.png). With that, I had a 480×480 copy of every single one of my plots.
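For anyone who would rather script the batch step than type each command, here is a minimal sketch of how the convert calls could be generated. It is not what I actually ran: the helper names (resizeArgs, resizeCommands) and the sample file list are made up for illustration, and it only builds the command lines, which you would then run in a shell with ImageMagick installed.

```javascript
// Hypothetical helper: build the argument list for one ImageMagick `convert` call.
function resizeArgs(file, size, outDir) {
  return [file, '-resize', size + 'x' + size, outDir + '/' + file];
}

// Build one command line per PNG plot; anything that is not a .png is skipped.
function resizeCommands(files, size, outDir) {
  return files
    .filter(function (f) { return /\.png$/.test(f); })
    .map(function (f) { return 'convert ' + resizeArgs(f, size, outDir).join(' '); });
}

// Sample input (made up): one plot and one file that should be ignored.
var commands = resizeCommands(['488_team_spe_percentile.png', 'notes.txt'], 480, 'resized');
console.log(commands.join('\n'));
```

Note that the size is written with a plain ASCII "x" (480x480), which is what ImageMagick expects.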

I then had to create an HTML container, which looked like this (all of the front-end code, sans the plots, may be obtained here):

<!DOCTYPE html>
<meta charset="utf-8">
<link rel="stylesheet" type="text/css" href="custom.css" />
<link rel="stylesheet" type="text/css" href="tooltipster.css" />

<title>NCAA Basketball Stats</title>
<body>

<script src="libraries/jquery-1.9.0.min.js"></script>
<script src="libraries/jquery.tooltipster.min.js"></script>
<script src="libraries/custom-imagechanger.js"></script>
<script src="libraries/custom-tooltipster.js"></script>

<div id="header"><h3>Regular Season Stats for NCAA Tournament Teams (2012-2013)</h3></div><br>

<div id="statstable">
<table width="1000">
<tr>
<td width="480">
<center><span id="TeamLeftText" class="team">Florida</span></center>
<img src="plots/235_player_ind_pts.png" width="480" height="480" id="TeamLeftImage">
</td>
<td width="40">
</td>
<td width="480">
<center><span id="TeamRightText" class="team">Minnesota</span></center>
<img src="plots/428_player_ind_pts.png" width="480" height="480" id="TeamRightImage">
</td>
</tr>
</table>
</div><br>

<div id="settingspanel">
	<p class="header">Settings:</p><form>
	<p class="options">Select the team on the <b>left</b>: <select id="TeamLeft"><option value="5">Akron</option><option value="14">Albany (NY)</option><option value="29">Arizona</option><option value="14927">Belmont</option><option value="83">Bucknell</option><option value="87">Butler</option><option value="107">California</option><option value="140">Cincinnati</option><option value="157">Colorado</option><option value="156">Colorado St.</option><option value="169">Creighton</option><option value="173">Davidson</option><option value="193">Duke</option><option value="28755">FGCU</option><option value="235" selected>Florida</option><option value="251">Georgetown</option><option value="260">Gonzaga</option><option value="275">Harvard</option><option value="301">Illinois</option><option value="306">Indiana</option><option value="310">Iona</option><option value="311">Iowa St.</option><option value="317">James Madison</option><option value="328">Kansas</option><option value="327">Kansas St.</option><option value="340">La Salle</option><option value="367">Louisville</option><option value="387">Marquette</option><option value="404">Memphis</option><option value="415">Miami (FL)</option><option value="418">Michigan</option><option value="416">Michigan St.</option><option value="428">Minnesota</option><option value="434">Missouri</option><option value="441">Montana</option><option value="488">N.C. 
A&T</option><option value="473">New Mexico</option><option value="472">New Mexico St.</option><option value="457">North Carolina</option><option value="490">North Carolina St.</option><option value="508">Northwestern St.</option><option value="513">Notre Dame</option><option value="518">Ohio St.</option><option value="522">Oklahoma</option><option value="521">Oklahoma St.</option><option value="433">Ole Miss</option><option value="529">Oregon</option><option value="534">Pacific</option><option value="545">Pittsburgh</option><option value="609">Saint Louis</option><option value="626">San Diego St.</option><option value="649">South Dakota St.</option><option value="665">Southern U.</option><option value="610">St. Mary's (CA)</option><option value="688">Syracuse</option><option value="690">Temple</option><option value="110">UCLA</option><option value="465">UNLV</option><option value="735">Valparaiso</option><option value="740">VCU</option><option value="739">Villanova</option><option value="772">Western Ky.</option><option value="782">Wichita St.</option><option value="796">Wisconsin</option></select></p>
	
	<p class="options">Select the team on the <b>right</b>: <select id="TeamRight"><option value="5">Akron</option><option value="14">Albany (NY)</option><option value="29">Arizona</option><option value="14927">Belmont</option><option value="83">Bucknell</option><option value="87">Butler</option><option value="107">California</option><option value="140">Cincinnati</option><option value="157">Colorado</option><option value="156">Colorado St.</option><option value="169">Creighton</option><option value="173">Davidson</option><option value="193">Duke</option><option value="28755">FGCU</option><option value="235">Florida</option><option value="251">Georgetown</option><option value="260">Gonzaga</option><option value="275">Harvard</option><option value="301">Illinois</option><option value="306">Indiana</option><option value="310">Iona</option><option value="311">Iowa St.</option><option value="317">James Madison</option><option value="328">Kansas</option><option value="327">Kansas St.</option><option value="340">La Salle</option><option value="367">Louisville</option><option value="387">Marquette</option><option value="404">Memphis</option><option value="415">Miami (FL)</option><option value="418">Michigan</option><option value="416">Michigan St.</option><option value="428" selected>Minnesota</option><option value="434">Missouri</option><option value="441">Montana</option><option value="488">N.C. 
A&T</option><option value="473">New Mexico</option><option value="472">New Mexico St.</option><option value="457">North Carolina</option><option value="490">North Carolina St.</option><option value="508">Northwestern St.</option><option value="513">Notre Dame</option><option value="518">Ohio St.</option><option value="522">Oklahoma</option><option value="521">Oklahoma St.</option><option value="433">Ole Miss</option><option value="529">Oregon</option><option value="534">Pacific</option><option value="545">Pittsburgh</option><option value="609">Saint Louis</option><option value="626">San Diego St.</option><option value="649">South Dakota St.</option><option value="665">Southern U.</option><option value="610">St. Mary's (CA)</option><option value="688">Syracuse</option><option value="690">Temple</option><option value="110">UCLA</option><option value="465">UNLV</option><option value="735">Valparaiso</option><option value="740">VCU</option><option value="739">Villanova</option><option value="772">Western Ky.</option><option value="782">Wichita St.</option><option value="796">Wisconsin</option></select></p>
	
	<p class="options">Select the <b>statistic</b> of interest: <select id="Statistic"><option value="player_ind_three_fga">Individual Player: 3Pt Field Goals Attempted</option><option value="player_ind_three_fgm">Individual Player: 3Pt Field Goals Made</option><option value="player_ind_ast">Individual Player: Assists</option><option value="player_ind_blk">Individual Player: Blocks</option><option value="player_ind_fga">Individual Player: Field Goals Attempted</option><option value="player_ind_fgm">Individual Player: Field Goals Made</option><option value="player_ind_fouls">Individual Player: Fouls</option><option value="player_ind_fta">Individual Player: Free Throws Attempted</option><option value="player_ind_ft">Individual Player: Free Throws Made</option><option value="player_ind_pts" selected>Individual Player: Points Scored</option><option value="player_ind_defreb">Individual Player: Rebounds (Defensive)</option><option value="player_ind_offreb">Individual Player: Rebounds (Offensive)</option><option value="player_ind_totreb">Individual Player: Rebounds (Total)</option><option value="player_ind_stl">Individual Player: Steals</option><option value="player_ind_to">Individual Player: Turnovers</option><option value="team_ind_three_fgpct">Team (Variation): 3Pt Field Goal Percentage</option><option value="team_ind_three_fga">Team (Variation): 3Pt Field Goals Attempted</option><option value="team_ind_three_fgm">Team (Variation): 3Pt Field Goals Made</option><option value="team_ind_ast">Team (Variation): Assists</option><option value="team_ind_blk">Team (Variation): Blocks</option><option value="team_ind_fgpct">Team (Variation): Field Goal Percentage</option><option value="team_ind_fga">Team (Variation): Field Goals Attempted</option><option value="team_ind_fgm">Team (Variation): Field Goals Made</option><option value="team_ind_fouls">Team (Variation): Fouls</option><option value="team_ind_ftpct">Team (Variation): Free Throw Percentage</option><option 
value="team_ind_fta">Team (Variation): Free Throws Attempted</option><option value="team_ind_ft">Team (Variation): Free Throws Made</option><option value="team_ind_pts">Team (Variation): Points Scored</option><option value="team_ind_defreb">Team (Variation): Rebounds (Defensive)</option><option value="team_ind_offreb">Team (Variation): Rebounds (Offensive)</option><option value="team_ind_stl">Team (Variation): Steals</option><option value="team_ind_totreb">Team (Variation): Total Rebounds</option><option value="team_ind_to">Team (Variation): Turnovers</option><option value="team_time_three_fgpct">Team (Over Time): 3Pt Field Goal Percentage</option><option value="team_time_three_fga">Team (Over Time): 3Pt Field Goals Attempted</option><option value="team_time_three_fgm">Team (Over Time): 3Pt Field Goals Made</option><option value="team_time_ast">Team (Over Time): Assists</option><option value="team_time_blk">Team (Over Time): Blocks</option><option value="team_time_fgpct">Team (Over Time): Field Goal Percentage</option><option value="team_time_fga">Team (Over Time): Field Goals Attempted</option><option value="team_time_fgm">Team (Over Time): Field Goals Made</option><option value="team_time_fouls">Team (Over Time): Fouls</option><option value="team_time_ftpct">Team (Over Time): Free Throw Percentage</option><option value="team_time_fta">Team (Over Time): Free Throws Attempted</option><option value="team_time_ft">Team (Over Time): Free Throws Made</option><option value="team_time_pts">Team (Over Time): Points Scored</option><option value="team_time_defreb">Team (Over Time): Rebounds (Defensive)</option><option value="team_time_offreb">Team (Over Time): Rebounds (Offensive)</option><option value="team_time_totreb">Team (Over Time): Rebounds (Total)</option><option value="team_time_stl">Team (Over Time): Steals</option><option value="team_time_to">Team (Over Time): Turnovers</option><option value="team_agg_three_fgpct">Team (vs National/Tourney): 3Pt Field Goal 
Percentage</option><option value="team_agg_three_fga">Team (vs National/Tourney): 3Pt Field Goals Attempted</option><option value="team_agg_three_fgm">Team (vs National/Tourney): 3Pt Field Goals Made</option><option value="team_agg_ast">Team (vs National/Tourney): Assists</option><option value="team_agg_forward_avg_height">Team (vs National/Tourney): Average Forward Height</option><option value="team_agg_guard_avg_height">Team (vs National/Tourney): Average Guard Height</option><option value="team_agg_avg_ptsdiff">Team (vs National/Tourney): Average Point Differential</option><option value="team_agg_away_avg_ptsdiff">Team (vs National/Tourney): Average Point Differential (Away Games)</option><option value="team_agg_home_avg_ptsdiff">Team (vs National/Tourney): Average Point Differential (Home Games)</option><option value="team_agg_ptsavg">Team (vs National/Tourney): Average Points Scored</option><option value="team_agg_rebavg">Team (vs National/Tourney): Average Rebounds</option><option value="team_agg_blk">Team (vs National/Tourney): Blocks</option><option value="team_agg_fgpct">Team (vs National/Tourney): Field Goal Percentage</option><option value="team_agg_fga">Team (vs National/Tourney): Field Goals Attempted</option><option value="team_agg_fgm">Team (vs National/Tourney): Field Goals Made</option><option value="team_agg_fouls">Team (vs National/Tourney): Fouls</option><option value="team_agg_ftpct">Team (vs National/Tourney): Free Throw Percentage</option><option value="team_agg_fta">Team (vs National/Tourney): Free Throws Attempted</option><option value="team_agg_ft">Team (vs National/Tourney): Free Throws Made</option><option value="team_agg_losses">Team (vs National/Tourney): Losses</option><option value="team_agg_away_losses">Team (vs National/Tourney): Losses (Away)</option><option value="team_agg_home_losses">Team (vs National/Tourney): Losses (Home)</option><option value="team_agg_forward_points">Team (vs National/Tourney): Percent of Points by 
Forwards</option><option value="team_agg_guard_points">Team (vs National/Tourney): Percent of Points by Guards</option><option value="team_agg_pts">Team (vs National/Tourney): Points Scored</option><option value="team_agg_defreb">Team (vs National/Tourney): Rebounds (Defensive)</option><option value="team_agg_offreb">Team (vs National/Tourney): Rebounds (Offensive)</option><option value="team_agg_totreb">Team (vs National/Tourney): Rebounds (Total)</option><option value="team_agg_stl">Team (vs National/Tourney): Steals</option><option value="team_agg_to">Team (vs National/Tourney): Turnovers</option><option value="team_agg_winpct">Team (vs National/Tourney): Winning Percentage</option><option value="team_agg_away_wins">Team (vs National/Tourney): Wins (Away)</option><option value="team_agg_home_wins">Team (vs National/Tourney): Wins (Home)</option><option value="team_agg_wins">Team (vs National/Tourney): Wins</option><option value="team_spe_percentile">Percentile Rankings per Stat (among tournament teams)</option></select></p>
	
	<br><p class="header">Description:</p>
	<p class="description">These statistics relate to player and team performance in the 2012-2013 college basketball season. Tournament teams include only those that were seeded on Selection Sunday, and not the teams that had to play in the play-in game.</p>
	<br><p class="header">Sources:</p>
	<ul class="sourcelist">
	<li><a href="http://stats.ncaa.org/team/inst_team_list?sport_code=MBB&division=1" class="tooltip" title="All statistics were originally sourced from the NCAA's own site, with additional calculations performed by the author.">National Collegiate Athletic Association (NCAA)</a></li>
	</ul>
</div>

<div id="footer">Created by Rodrigo Zamith. Licensed under <a href="http://creativecommons.org/licenses/by-nc/3.0" class="tooltip" title="This license lets you remix, tweak, and build upon this work non-commercially, provided you acknowledge the author; additionally derivative works do not have to be licensed under these terms.">CC BY-NC</a> terms.</div>
</body>

I’d like to point out a few things in the code above. First, notice that I start by loading specific images, but that these images are given an id in the img tag (e.g., id="TeamLeftImage"). Preceding each image is also a span with an id (e.g., id="TeamLeftText"), which contains a team’s name. Similarly, notice that every one of the drop-down menus (select) also has an id (e.g., id="TeamLeft"), and each option within those select elements has a value attached to it (e.g., value="5").

By using jQuery, I can have the browser load new images in response to different selections, using nothing more than the information in these ids, values, and the text contained within each option. My little jQuery script essentially does the following whenever either of the select elements with the ids "TeamLeft" or "TeamRight" changes: (1) get the current value of the select element with the id "Statistic"; (2) get the value of the current selection for the select element that changed; (3) using those two pieces of information, generate a new file name string; (4) get the text of the option for the current selection; and finally (5) change the image source of the respective image to the file name string from step 3 and change the text contained in the respective span to the string from step 4. We do something similar for the select element "Statistic", only we change two images at the same time. This is what that code looks like:

$(document).ready(function() {
    // Swap the left-hand plot and team label when a new left team is chosen
    $("#TeamLeft").change(function() {
        var stat = $("#Statistic").val();
        var team = $(this).val();
        var newimg = 'plots/' + team + '_' + stat + '.png';

        var teamname = $("#TeamLeft option:selected").text();

        $("#TeamLeftImage").attr('src', newimg);
        $("#TeamLeftText").text(teamname);
    });

    // Same as above, for the right-hand team
    $("#TeamRight").change(function() {
        var stat = $("#Statistic").val();
        var team = $(this).val();
        var newimg = 'plots/' + team + '_' + stat + '.png';

        var teamname = $("#TeamRight option:selected").text();

        $("#TeamRightImage").attr('src', newimg);
        $("#TeamRightText").text(teamname);
    });

    // A new statistic changes both plots at once; the team labels stay as they are
    $("#Statistic").change(function() {
        var stat = $(this).val();
        var teamLeft = $("#TeamLeft").val();
        var teamRight = $("#TeamRight").val();
        var teamLeftnewimg = 'plots/' + teamLeft + '_' + stat + '.png';
        var teamRightnewimg = 'plots/' + teamRight + '_' + stat + '.png';

        $("#TeamLeftImage").attr('src', teamLeftnewimg);
        $("#TeamRightImage").attr('src', teamRightnewimg);
    });
});
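
The two team handlers above are nearly identical, so one possible cleanup is to pull the shared pieces into small functions. What follows is only a sketch of that idea, not the code running on the site: plotPath and bindTeamSelect are names I made up for illustration, and bindTeamSelect assumes jQuery is already loaded as $ and that the ids match the HTML above.

```javascript
// Build the image path from a team id and a statistic key, following the
// [team id]_[type of graph]_[statistic].png pattern used for the plots.
function plotPath(team, stat) {
  return 'plots/' + team + '_' + stat + '.png';
}

// Wire up one team drop-down; side is "Left" or "Right".
// Assumes jQuery is available as $ when this function is called.
function bindTeamSelect(side) {
  $('#Team' + side).change(function () {
    var stat = $('#Statistic').val();
    var teamname = $(this).find('option:selected').text();
    $('#Team' + side + 'Image').attr('src', plotPath($(this).val(), stat));
    $('#Team' + side + 'Text').text(teamname);
  });
}

// In the page, these calls would stand in for the two team handlers:
// $(document).ready(function () { bindTeamSelect('Left'); bindTeamSelect('Right'); });
console.log(plotPath('235', 'player_ind_pts'));
```

The handler for the "Statistic" menu would still be needed; it could call plotPath twice, once per team.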

I should also note that instead of typing all of that information (team names and the myriad statistics) into the index.html file, one may simply generate JSON output for the relevant columns with R and then use jQuery to import that information and automatically generate the appropriate elements. I took a simpler (if less attractive) option: I copied the output from R (e.g., team_id and team_name), pasted it into a spreadsheet, and then used a simple formula to generate the code I needed (e.g., ="<option value="""&A1&""">"&B1&"</option>" where A1 contains the team id and B1 contains the team name), which I then copied and pasted into my index.html file.
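
To give a rough idea of the JSON route, here is a minimal sketch of how the option elements could be built from data instead of being pasted in by hand. I did not do it this way, so treat everything here as an assumption: the teams array stands in for whatever R would export (it could be fetched with jQuery’s $.getJSON from a file such as a hypothetical teams.json), and escapeHtml and buildOptions are illustrative names.

```javascript
// Hypothetical data, shaped like a JSON export of team_id and team_name from R.
var teams = [
  { team_id: 235, team_name: 'Florida' },
  { team_id: 428, team_name: 'Minnesota' },
  { team_id: 488, team_name: 'N.C. A&T' }
];

// Escape characters that are special in HTML so names like "N.C. A&T" stay valid.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the <option> markup for a select element, marking one team id as selected.
function buildOptions(rows, selectedId) {
  return rows.map(function (row) {
    var selected = row.team_id === selectedId ? ' selected' : '';
    return '<option value="' + escapeHtml(row.team_id) + '"' + selected + '>' +
      escapeHtml(row.team_name) + '</option>';
  }).join('');
}

var html = buildOptions(teams, 235);
console.log(html);
// With jQuery loaded, the markup could be placed in the page with something like:
// $('#TeamLeft').html(buildOptions(teams, 235));
```

The same function could fill both team menus, with a different selectedId for each.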

 

Final Thoughts

This was a nice from-scratch project to help hone my skills in a couple of different programming languages, but there’s no question that the code could be improved upon. This is particularly the case for R; that code has a lot of nasty stuff in it and could be made considerably faster. If any readers would like to share their thoughts on how it could be improved, please do so by commenting below or sending me an e-mail. I’ll try to append some of the suggestions to the bottom of this post.

I should note here two applications that I have not yet experimented with, but that are very relevant to this type of project. The first is Shiny, a package by the folks behind RStudio that aims to make it easy to create dynamic and interactive web applications, with the full power of R behind it. The second is rApache, which embeds the R interpreter inside the Apache web server. While they certainly show a lot of promise, it should be noted that both require additional software to be installed on the server hosting the application. Thus, for anyone who uses shared hosting (including me), this is likely to present a barrier (unless your host is very accommodating). Nonetheless, I suggest you keep them on your radar.


9 Responses to Going Under the Hood of the NCAA Tournament Visualization

  1. Igor Sosa says:

    thanks! This is really helpful!

    • Rodrigo Zamith says:

      Glad you found it helpful, Igor. I’ve been really short on time lately, but I’m hoping to do something with soccer data (especially the English Premier League) sometime in the future.

  2. Ken Butler says:

    I enjoyed reading this. Thank you for sharing what you did.

    My own preferred combo for this is R plus Perl. Perl has a package called HTML::TreeBuilder that I use for web scraping, which enables me to do things like “find the a links whose class is “fred” and extract the href part if there is one”. I’m not saying it’s the best way, but I don’t know Python (not yet, anyway).

    I’ve been calculating Bradley-Terry ratings for the NCAA basketball teams (scraping from USA Today, which likewise has a predictable interface). This is a way of estimating how strong a team is from its W-L record and who it’s played. For basketball it works like a charm. (For football there are not really enough games to estimate anything.)

    • Rodrigo Zamith says:

      Hey Ken,

      I’m glad you enjoyed it. I have not touched Perl in years (and even then, it was a brief romance), but the package you referenced sounds very similar to the BeautifulSoup library mentioned in the post (it would also enable you to find all hrefs with class ‘fred’), so I would recommend you take a look at it if you ever decide to try your hand with Python. Thanks for sharing; I’ll be sure to mention it to colleagues who are partial to Perl.

      As for the Bradley-Terry ratings, thank you very much for sharing that. I was hoping to find something along those lines purely for predictive modeling purposes, but I couldn’t find a listing that included all D-I teams. Sure enough, a quick Google search of “Bradley-Terry ratings” got me this: http://dbaker.50webs.com/cbb13_rank.html. I’ll definitely have to keep that in mind for next year!

    • Igor Sosa says:

      hi Ken,

      this sounds very interesting. Also because I am using in the last time more often perl than python. Have you any website where you have put your analysis on?

  3. Josh Bredeweg says:

    Thank you for taking the time to post all this information. This topic is one I’ve been thinking about for a long time, that being how to integrate R and an interactive web interface. How much more complexity would be added to this process by pulling the data in real time? Say as games are happening/being completed.

    • Rodrigo Zamith says:

      Hey Josh,

      This is something that would probably be best accomplished using Shiny, which is referenced at the bottom of the post (note: I have not yet played with Shiny; I have only read about it). (Alternatively, you could set up a crontab entry to run an R script every few minutes to generate and replace a static image, but Shiny would almost certainly be the better option.) That said, you’ll need an input (the numbers to plot, which would also have to arrive in real time). Thus, your data source would be very important here. For example, if it’s providing data in a standardized format (e.g., JSON), then you might be able to set up Shiny/R to just pull the data directly from the source and plot it. If it’s not using a standardized format (e.g., box scores from ESPN), then you’d probably have to write a scraper that would scrape the page with the numbers every few seconds, which would probably be very resource-intensive and could get you blacklisted. Anyway, it is possible, but the level of complexity would largely depend on the data source and the approach you wish to use for presentation. I would definitely recommend you read up on Shiny, though.

