NCAA Scraping

Bill Petti

2016-11-22

The latest release of baseballr includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).

To look up teams, you can either load the teams for all divisions from the baseballr-data repository or pull them directly from the NCAA website for a given year and division.

Loading from the baseballr-data repository:

library(baseballr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ncaa_teams_df <- load_ncaa_baseball_teams()

From the NCAA website:

try(ncaa_teams(year = most_recent_ncaa_baseball_season(), division = "1"))
#> ── NCAA Baseball Teams data from stats.ncaa.org ───────────── baseballr 1.5.0 ──
#> ℹ Data updated: 2023-03-20 10:00:07 EDT
#> # A tibble: 305 × 8
#>    team_id team_name      team_url        confer…¹ confe…² divis…³  year seaso…⁴
#>    <chr>   <chr>          <chr>           <chr>    <chr>   <chr>   <dbl> <chr>  
#>  1 140     Cincinnati     /team/140/16340 823      AAC     1        2023 16340  
#>  2 196     East Carolina  /team/196/16340 823      AAC     1        2023 16340  
#>  3 288     Houston        /team/288/16340 823      AAC     1        2023 16340  
#>  4 404     Memphis        /team/404/16340 823      AAC     1        2023 16340  
#>  5 651     South Fla.     /team/651/16340 823      AAC     1        2023 16340  
#>  6 718     Tulane         /team/718/16340 823      AAC     1        2023 16340  
#>  7 128     UCF            /team/128/16340 823      AAC     1        2023 16340  
#>  8 782     Wichita St.    /team/782/16340 823      AAC     1        2023 16340  
#>  9 67      Boston College /team/67/16340  821      ACC     1        2023 16340  
#> 10 147     Clemson        /team/147/16340 821      ACC     1        2023 16340  
#> # … with 295 more rows, and abbreviated variable names ¹​conference_id,
#> #   ²​conference, ³​division, ⁴​season_id
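Once the teams table is loaded, it can be narrowed down with ordinary data frame operations. The following is a minimal sketch in base R, assuming the column names shown in the output above (team_id, team_name, conference, division, year). The teams data frame here is a small hand-built stand-in holding a few of the rows printed above, so the snippet runs without a network connection; in practice you would use the result of load_ncaa_baseball_teams() or ncaa_teams().

```r
# Hypothetical stand-in for a few rows of the teams table shown above;
# the real table comes from load_ncaa_baseball_teams() or ncaa_teams().
teams <- data.frame(
  team_id    = c("140", "196", "67", "147"),
  team_name  = c("Cincinnati", "East Carolina", "Boston College", "Clemson"),
  conference = c("AAC", "AAC", "ACC", "ACC"),
  division   = c("1", "1", "1", "1"),
  year       = c(2023, 2023, 2023, 2023),
  stringsAsFactors = FALSE
)

# Keep only ACC teams for the 2023 season.
acc_2023 <- teams[teams$conference == "ACC" & teams$year == 2023, ]

# Named vector mapping team names to their NCAA team_id values.
acc_ids <- setNames(acc_2023$team_id, acc_2023$team_name)
print(acc_ids)
```

A named vector like acc_ids is convenient when you later want to loop over several schools and keep track of which result belongs to which team.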

The function, ncaa_team_player_stats(), requires values for three parameters:

team_id: numerical code used by the NCAA for each school

year: a four-digit year

type: whether to pull data for batters or pitchers
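Because each call goes out to the NCAA website, it can be useful to check the three arguments before making a request. The helper below is a hypothetical sketch, not part of baseballr; it assumes that type takes the values "batting" or "pitching", as in the examples in this article.

```r
# Hypothetical helper (not part of baseballr): check the three arguments
# before calling ncaa_team_player_stats().
check_stats_args <- function(team_id, year, type) {
  stopifnot(
    # team_id: a single value that can be read as a number
    length(team_id) == 1,
    !is.na(suppressWarnings(as.numeric(team_id))),
    # year: a single four-digit year
    length(year) == 1,
    grepl("^[0-9]{4}$", as.character(year)),
    # type: assumed to be either "batting" or "pitching"
    length(type) == 1,
    type %in% c("batting", "pitching")
  )
  invisible(TRUE)
}

# Passes silently for well-formed input.
check_stats_args(team_id = 234, year = 2023, type = "batting")

# Stops with an error for an unsupported type.
bad_call <- try(check_stats_args(234, 2023, "fielding"), silent = TRUE)
print(inherits(bad_call, "try-error"))
```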

If you want to pull batting statistics for Florida State for the most recent season (2023 in the output below), first look up the team_id and then call the function:


team_id <- ncaa_teams_df %>% 
  dplyr::filter(.data$team_name == "Florida St.") %>% 
  dplyr::select("team_id") %>% 
  dplyr::distinct() %>% 
  dplyr::pull("team_id")

year <- most_recent_ncaa_baseball_season()

ncaa_team_player_stats(team_id = team_id, year = year, type = "batting")
#> ── NCAA Baseball Team Batting Stats data from stats.ncaa.org ───────────────────
#> ℹ Data updated: 2023-03-20 10:00:11 EDT
#> # A tibble: 37 × 35
#>     year team_name team_id confe…¹ confe…² divis…³ playe…⁴ playe…⁵ playe…⁶ Yr   
#>    <int> <chr>       <dbl>   <int> <chr>     <dbl>   <int> <chr>   <chr>   <chr>
#>  1  2023 Florida …     234     821 ACC           1 2649339 http:/… Tibbs … So   
#>  2  2023 Florida …     234     821 ACC           1 2649334 http:/… Ferrer… So   
#>  3  2023 Florida …     234     821 ACC           1 2478605 http:/… Carrio… Jr   
#>  4  2023 Florida …     234     821 ACC           1 2468075 http:/… Vincen… Sr   
#>  5  2023 Florida …     234     821 ACC           1 2112619 http:/… De Sed… Sr   
#>  6  2023 Florida …     234     821 ACC           1 2649307 http:/… Rank, … So   
#>  7  2023 Florida …     234     821 ACC           1 2797459 http:/… Smith,… Fr   
#>  8  2023 Florida …     234     821 ACC           1 2649340 http:/… Bush, … So   
#>  9  2023 Florida …     234     821 ACC           1 2797428 http:/… Kamaka… Fr   
#> 10  2023 Florida …     234     821 ACC           1 2797465 http:/… Willia… So   
#> # … with 27 more rows, 25 more variables: Pos <chr>, Jersey <chr>, GP <dbl>,
#> #   GS <dbl>, BA <dbl>, OBPct <dbl>, SlgPct <dbl>, R <dbl>, AB <dbl>, H <dbl>,
#> #   `2B` <dbl>, `3B` <dbl>, TB <dbl>, HR <dbl>, RBI <dbl>, BB <dbl>, HBP <dbl>,
#> #   SF <dbl>, SH <dbl>, K <dbl>, DP <dbl>, CS <dbl>, Picked <dbl>, SB <dbl>,
#> #   RBI2out <dbl>, and abbreviated variable names ¹​conference_id, ²​conference,
#> #   ³​division, ⁴​player_id, ⁵​player_url, ⁶​player_name
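The returned tibble can be filtered and sorted like any other data frame. As a minimal sketch, the snippet below builds a simple batting-average leaderboard with a minimum at-bat cutoff, assuming the column names shown in the output above (player_name, AB, BA). The player names and numbers are invented placeholders, not real NCAA statistics, and the cutoff of 50 at-bats is an arbitrary choice for illustration.

```r
# Invented placeholder data with the same column names as the batting output;
# these are NOT real player statistics.
batting <- data.frame(
  player_name = c("Player A", "Player B", "Player C", "Player D"),
  AB = c(80, 12, 75, 60),
  BA = c(0.310, 0.500, 0.345, 0.280),
  stringsAsFactors = FALSE
)

# Drop hitters below an (arbitrary) minimum number of at-bats.
qualified <- batting[batting$AB >= 50, ]

# Sort the remaining hitters by batting average, highest first.
leaders <- qualified[order(-qualified$BA), c("player_name", "AB", "BA")]
print(leaders)
```

The cutoff matters because rate statistics over a handful of at-bats, like Player B here, would otherwise dominate the top of the list.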

The same can be done for pitching, just by changing the type parameter:

ncaa_team_player_stats(team_id = team_id, year = year, type = "pitching")
#> ── NCAA Baseball Team Pitching Stats data from stats.ncaa.org ──────────────────
#> ℹ Data updated: 2023-03-20 10:00:16 EDT
#> # A tibble: 37 × 43
#>     year team_name team_id confe…¹ confe…² divis…³ playe…⁴ playe…⁵ playe…⁶ Yr   
#>    <int> <chr>       <dbl>   <int> <chr>     <dbl>   <int> <chr>   <chr>   <chr>
#>  1  2023 Florida …     234     821 ACC           1 2649339 http:/… Tibbs … So   
#>  2  2023 Florida …     234     821 ACC           1 2649334 http:/… Ferrer… So   
#>  3  2023 Florida …     234     821 ACC           1 2478605 http:/… Carrio… Jr   
#>  4  2023 Florida …     234     821 ACC           1 2468075 http:/… Vincen… Sr   
#>  5  2023 Florida …     234     821 ACC           1 2112619 http:/… De Sed… Sr   
#>  6  2023 Florida …     234     821 ACC           1 2649307 http:/… Rank, … So   
#>  7  2023 Florida …     234     821 ACC           1 2797459 http:/… Smith,… Fr   
#>  8  2023 Florida …     234     821 ACC           1 2649340 http:/… Bush, … So   
#>  9  2023 Florida …     234     821 ACC           1 2797428 http:/… Kamaka… Fr   
#> 10  2023 Florida …     234     821 ACC           1 2797465 http:/… Willia… So   
#> # … with 27 more rows, 33 more variables: Pos <chr>, Jersey <chr>, GP <dbl>,
#> #   App <dbl>, GS <dbl>, ERA <dbl>, IP <dbl>, H <dbl>, R <dbl>, ER <dbl>,
#> #   BB <dbl>, SO <dbl>, SHO <dbl>, BF <dbl>, `P-OAB` <dbl>, `2B-A` <dbl>,
#> #   `3B-A` <dbl>, Bk <dbl>, `HR-A` <dbl>, WP <dbl>, HB <dbl>, IBB <dbl>,
#> #   `Inh Run` <dbl>, `Inh Run Score` <dbl>, SHA <dbl>, SFA <dbl>,
#> #   Pitches <dbl>, GO <dbl>, FO <dbl>, W <dbl>, L <dbl>, SV <dbl>, KL <dbl>,
#> #   and abbreviated variable names ¹​conference_id, ²​conference, ³​division, …
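Since only the type argument differs between the two calls, both tables can be collected in one step. The wrapper below is a hypothetical sketch, not part of baseballr. It takes the fetching function as an argument so that it can be tried offline with a stub; by default it would call baseballr::ncaa_team_player_stats(), which needs the package installed and a network connection.

```r
# Hypothetical wrapper (not part of baseballr): fetch batting and pitching
# tables for one team and season and return them as a named list.
fetch_both_types <- function(team_id, year,
                             fetch = baseballr::ncaa_team_player_stats) {
  types <- c("batting", "pitching")
  results <- lapply(types, function(type) {
    fetch(team_id = team_id, year = year, type = type)
  })
  setNames(results, types)
}

# Offline stub standing in for the real function, for demonstration only.
stub_fetch <- function(team_id, year, type) {
  data.frame(team_id = team_id, year = year, type = type)
}

both <- fetch_both_types(234, 2023, fetch = stub_fetch)
print(names(both))
```

With the real function, both$batting and both$pitching would hold the two tibbles shown above.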

The function depends on the user knowing the team_id used by the NCAA website. To help with that, I’ve included the ncaa_school_id_lu() function so that users can find the team_id they need.

Pass a string to the function and it will return possible matches based on the school’s name, with one row per team and season:

ncaa_school_id_lu("Vand")
#> ───────────────────────────────────────────────────────────── baseballr 1.5.0 ──
#> # A tibble: 14 × 8
#>    team_id team_name  team_url        conference…¹ confe…² divis…³  year seaso…⁴
#>      <dbl> <chr>      <chr>                  <dbl> <chr>     <dbl> <dbl>   <dbl>
#>  1     736 Vanderbilt /team/736/16340          911 SEC           1  2023   16340
#>  2     736 Vanderbilt /team/736/15860          911 SEC           1  2022   15860
#>  3     736 Vanderbilt /team/736/15580          911 SEC           1  2021   15580
#>  4     736 Vanderbilt /team/736/15204          911 SEC           1  2020   15204
#>  5     736 Vanderbilt /team/736/15204          911 SEC           1  2019   15204
#>  6     736 Vanderbilt /team/736/12973          911 SEC           1  2018   12973
#>  7     736 Vanderbilt /team/736/12560          911 SEC           1  2017   12560
#>  8     736 Vanderbilt /team/736/12360          911 SEC           1  2016   12360
#>  9     736 Vanderbilt /team/736/12080          911 SEC           1  2015   12080
#> 10     736 Vanderbilt /team/736/11620          911 SEC           1  2014   11620
#> 11     736 Vanderbilt /team/736/11320          911 SEC           1  2013   11320
#> 12     736 Vanderbilt /team/736/10942          911 SEC           1  2012   10942
#> 13     736 Vanderbilt /team/736/10561          911 SEC           1  2011   10561
#> 14     736 Vanderbilt /team/736/10240          911 SEC           1  2010   10240
#> # … with abbreviated variable names ¹​conference_id, ²​conference, ³​division,
#> #   ⁴​season_id
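Because the lookup returns one row per season, the team_id usually has to be reduced to a single value before it is passed on. The following is a minimal sketch in base R, assuming the column names shown in the output above. The lookup data frame is a hand-built stand-in containing three of the Vanderbilt rows printed above; in practice it would be the result of ncaa_school_id_lu("Vand").

```r
# Stand-in for the first three rows of the lookup output shown above.
lookup <- data.frame(
  team_id   = c(736, 736, 736),
  team_name = c("Vanderbilt", "Vanderbilt", "Vanderbilt"),
  year      = c(2023, 2022, 2021),
  season_id = c(16340, 15860, 15580),
  stringsAsFactors = FALSE
)

# Collapse the per-season rows to the distinct team_id for one school.
team_id <- unique(lookup$team_id[lookup$team_name == "Vanderbilt"])

# A partial search string can match several schools, so make sure
# exactly one team_id is left before using it.
stopifnot(length(team_id) == 1)

# Seasons available for this school, oldest first.
years <- sort(lookup$year)
print(team_id)
print(years)

# The values could then be passed on, for example:
# ncaa_team_player_stats(team_id = team_id, year = 2023, type = "batting")
```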