I'm gonna start a thread of what I hope are helpful R tips for wrangling the huge NFL Big Data Bowl data. If you're an advanced R programmer this is probably not for you, but feel free to correct me if I make a mistake or to offer better alternatives
#1
dplyr::slice_sample() if you want to quickly preview what your result might look like using a random sample of rows from your data
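a minimal sketch, assuming the week-1 tracking file is named week1.csv (adjust to your local file names):

library(dplyr)
library(readr)

week1 <- read_csv("week1.csv")   # load one week of tracking data
week1 %>% slice_sample(n = 10)   # peek at 10 random rows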
#2
janitor::clean_names() if variable names with random capitalization, spaces and other undesired characters make you sick
with the defaults you can turn gameTimeEastern into game_time_eastern
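a quick sketch with made-up column names to show the defaults in action:

library(dplyr)
library(janitor)

df <- tibble::tibble(gameTimeEastern = "13:00", `Home Team` = "KC")
df %>% clean_names()   # columns become game_time_eastern and home_team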
#3
lubridate::mdy() to convert a variable into a Date
data %>% mutate(game_date = mdy(game_date))
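for example, a date stored as "09/10/2018" becomes a proper Date:

lubridate::mdy("09/10/2018")
#> [1] "2018-09-10"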
#4
lubridate::parse_date_time() for inconsistent date formats
players %>%
  mutate(birth_date = lubridate::parse_date_time(birth_date,
                                                 orders = c("y-m-d", "m/d/y")))
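for instance, with birth dates stored in two different formats (note that parse_date_time() returns a POSIXct date-time, so wrap it in as.Date() if you want a plain Date):

library(lubridate)

birth_date <- c("1990-05-14", "7/22/1988")
parse_date_time(birth_date, orders = c("y-m-d", "m/d/y"))
#> [1] "1990-05-14 UTC" "1988-07-22 UTC"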
#7
if you're going to bind all 17 weeks of tracking data into one dataset, save it to disk as a parquet file via {arrow}. from my very unscientific testing of different file formats (rda, fst, feather, rds, tsv.gz), parquet was the fastest to read back
More on {arrow}: https://arrow.apache.org/docs/r/
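a minimal sketch of the bind-then-save workflow, assuming the weekly files are named week1.csv through week17.csv:

library(readr)
library(arrow)

# read and row-bind all 17 weekly tracking files
tracking <- purrr::map_dfr(1:17, ~ read_csv(paste0("week", .x, ".csv")))
write_parquet(tracking, "tracking.parquet")

# in later sessions, skip the CSVs entirely
tracking <- read_parquet("tracking.parquet")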