R: how to toggle HTML page selection in web scraping

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the URL changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my code so that it can obtain the table for different years, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the HTML source code, but no luck.

With rvest, you can set the year value in the form and submit it. Wrap that in purrr::map_dfr to iterate over the years and row-bind the results into a single data frame:
library(rvest)

sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")

baseball <- purrr::map_dfr(
  2017:2015,
  function(y) {
    Sys.sleep(10 + runif(1)) # be polite

    form <- sess %>%
      html_node(xpath = '//form[@action="index.php"]') %>%
      html_form() %>%
      set_values(year = y)

    sess <- submit_form(sess, form)

    sess %>%
      read_html() %>%
      html_node('#TTdata') %>%
      html_table(header = TRUE)
  }
)

tibble::as_data_frame(baseball) # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>
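As a quick sanity check that each requested season actually came back, you can count rows per YEAR. This is a small usage example of my own, not part of the original answer; it assumes the baseball data frame built above and that dplyr is installed:
# assumes the `baseball` data frame from above; dplyr is only used for the count
dplyr::count(baseball, YEAR)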

Related

Scraping Website with Unchanging URL in R

I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.
Using library(rvest) this bit of code will reliably download the table that corresponds to today's date (I'm only interested in the first of the three tables).
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html() %>%
  html_table()
off <- off[[1]]
How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?
I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
  html_table()
off <- off[[1]]
Scrolling through past dates in my browser, I've identified various places in the HTML that reference the prior date, but I can't seem to change it from R, let alone get the table I download to reflect a change:
webad %>%
  read_html() %>%
  html_nodes("#main > div > section:nth-child(1) > article > header > div")
I've messed around some with html_form(), follow_link(), and set_values() also, but to no avail.
Is there a good way to navigate this particular URL in R?
You can consider the following approach:
library(RSelenium)
library(rvest)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
Here is another approach that can be considered:
library(RDCOMClient)
library(rvest)
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 8 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate" NA NA
2 "S" "M" "T" "W" "T" "F" "S"
3 "" "" "" "" "" "" "1"
4 "2" "3" "4" "5" "6" "7" "8"
5 "9" "10" "11" "12" "13" "14" "15"
6 "16" "17" "18" "19" "20" "21" "22"
7 "23" "24" "25" "26" "27" "28" "29"
8 "30" "31" "" "" "" "" ""
[[3]]
# A tibble: 7 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "S" "M" "T" "W" "T" "F" "S"
2 "" "" "" "" "" "" "1"
3 "2" "3" "4" "5" "6" "7" "8"
4 "9" "10" "11" "12" "13" "14" "15"
5 "16" "17" "18" "19" "20" "21" "22"
6 "23" "24" "25" "26" "27" "28" "29"
7 "30" "31" "" "" "" "" ""
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
If you install Docker (see https://docs.docker.com/engine/install/), you can consider the following approach with Firefox:
library(RSelenium)
library(rvest)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
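In all three variants the referee-assignments table is the first element returned by html_table(), so you can keep just that one, mirroring the off <- off[[1]] step from the question (a short usage note added here, not part of the original answers):
# keep only the first table: the assignments for the requested date
tables <- read_html(html_Content) %>% html_table()
off <- tables[[1]]
off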

C5_rules() in Tidymodels

I would like to use tidymodels to fit a C5.0 rule-based classification model. I have specified the model as follows
library(tidymodels)
library(rules) # provides C5_rules()

c5_spec <-
  C5_rules() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
In the documentation for the C5_rules() command, I read the following.
The model is not trained or fit until the fit.model_spec() function is used with the data.
I'm not quite sure what I need to do with the parsnip model object after that. Every time I try to fit the model, I get the following error
preprocessor 1/1, model 1/1 (predictions): Error in predict.C5.0(object = object$fit, newdata = new_data, type = "class"): either a tree or rules must be provided
What am I missing?
Thank you very much!
That's a good start! You've defined your model spec, but if you want to fit using a workflow, you'll also need to create a recipe and a workflow. Julia Silge's blog is hands down the best resource for getting used to working with tidymodels. Here's a reprex that fits a C5.0 classifier once to training data:
# load tidymodels & rules
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(rules)
#> Warning: package 'rules' was built under R version 4.1.1
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
  cars_train %>%
  mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
  recipe(cyl ~ ., data = cars_train)

# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
  C5_rules() %>%
  set_engine("C5.0")

# create workflow
cars_wf <-
  workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(cars_spec)

# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)

# add predictions to df
cars_preds <-
  predict(cars_fit, new_data = cars_train) %>%
  bind_cols(cars_train) %>%
  select(.pred_class, cyl)

cars_preds
#> # A tibble: 32 x 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # ... with 22 more rows
# confusion matrix
cars_preds %>%
  conf_mat(truth = cyl,
           estimate = .pred_class)
#> Warning in vec2table(truth = truth, estimate = estimate, dnn = dnn, ...): `truth`
#> was converted to a factor
#> Truth
#> Prediction 4 6 8
#> 4 11 0 0
#> 6 0 7 0
#> 8 0 0 14
Created on 2021-09-30 by the reprex package (v2.0.1)
I tried the reprex by Mark Rieke and got an error for the last command (conf_mat).
# load tidymodels & rules
library(tidymodels)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
  cars_train %>%
  mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
  recipe(cyl ~ ., data = cars_train)

# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
  C5_rules() %>%
  set_engine("C5.0")

# create workflow
cars_wf <-
  workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(cars_spec)

# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)

# add predictions to df
cars_preds <-
  predict(cars_fit, new_data = cars_train) %>%
  bind_cols(cars_train) %>%
  select(.pred_class, cyl)

cars_preds
#> # A tibble: 32 × 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # … with 22 more rows
# confusion matrix
cars_preds %>%
  conf_mat(truth = cyl,
           estimate = .pred_class)
#> Error in `yardstick_table()`:
#> ! `truth` must be a factor.
#> ℹ This is an internal error in the yardstick package, please report it to the package authors.
#> Backtrace:
#> ▆
#> 1. ├─cars_preds %>% conf_mat(truth = cyl, estimate = .pred_class)
#> 2. ├─yardstick::conf_mat(., truth = cyl, estimate = .pred_class)
#> 3. └─yardstick:::conf_mat.data.frame(., truth = cyl, estimate = .pred_class)
#> 4. └─yardstick:::yardstick_table(truth = truth, estimate = estimate, case_weights = case_weights)
#> 5. └─rlang::abort("`truth` must be a factor.", .internal = TRUE)
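The first reprex shows the older behaviour, where conf_mat() merely warned that `truth` was converted to a factor; newer yardstick releases error instead. A minimal fix, added here as a sketch (it assumes the same cars_preds object built above), is to convert cyl to a factor before computing the confusion matrix:
# make the character outcome a factor with the same levels as the predictions
cars_preds %>%
  mutate(cyl = factor(cyl, levels = levels(.pred_class))) %>%
  conf_mat(truth = cyl, estimate = .pred_class)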

list of lists of matrices (from JSON) into single data.frame - purrr has problems with differing row numbers?

I'm trying to use the information contained in keyed JSON names to add context to the data contained in their nested matrices. The matrices have different numbers of rows, and some of the matrices are missing (list element NULL). I am able to extract the relevant data and retain information as list names from the hierarchy using map and at_depth from the purrr package, but I cannot find a clean way to get this into a single data.frame.
I have attempted to use purrr::transpose as exemplified here, and I've tried using tidyr::unnest as shown here, but I think their desired results and inputs differ enough from mine that they are not applicable. There seem to be too many problems with the differing row numbers and/or the missing matrices. I am also new to the purrr package, so there could be something simple that I'm missing here.
Here is my own attempt, which produces nearly the desired result. I think I could modify it a bit more to remove the for loop and use another layer of 'apply' functions, but I suspect there are better ways to go about this.
Minimal reproducible example
# Download data (getURL is from RCurl; fromJSON with simplifyDataFrame is jsonlite; map/at_depth are purrr)
library(RCurl)
library(jsonlite)
library(purrr)

json <- getURL("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi?type=lake_survey&id=69070100")

# Surveys are the relevant data
data.listed <- fromJSON(json, simplifyDataFrame = FALSE)
surveys <- data.listed$result$surveys

# Get list of lists of matrices - fish size count data
fcounts <- map(surveys, "lengths") %>%
  at_depth(2, "fishCount") %>%
  at_depth(2, data.frame) # side note: is this a good way to convert the inner matrices to data.frames?

# top-level - list       - surveys
# 2nd-level - list       - species in each survey
# 3rd-level - data.frame - X1: measured_size, X2: counts

# Use survey IDs as names for the top-level list,
# just as species are used as names for the 2nd-level lists
names(fcounts) <- sapply(surveys, function(s) s$surveyID)

# This produces nearly the correct result
for (i in seq_along(fcounts)) {
  surv.id <- names(fcounts)[[i]]
  if (length(fcounts[[i]]) > 0) {
    listed.withSpecies <- lapply(names(fcounts[[i]]),
                                 function(species) cbind(fcounts[[i]][[species]], species))
    surv.fishCounts <- do.call(rbind, listed.withSpecies)
    colnames(surv.fishCounts) <- c("size", "count", "species")
    surv.fishCounts$survey.ID <- surv.id
    print(surv.fishCounts)
  }
}
This is one way to get the nested length-count data frames into one big data frame:
library(httr)
library(tidyverse)

res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
           query = list(type = "lake_survey", id = "69070100"))

content(res, as = "text") %>%
  jsonlite::fromJSON(simplifyDataFrame = FALSE, flatten = FALSE) -> x

x$result$surveys %>%
  map_df(~{
    tmp_df <- flatten_df(.x[c("surveyDate", "surveyID", "surveyType", "surveySubType")])
    lens <- .x$lengths
    if (length(lens) > 0) {
      fish <- names(lens)
      data_frame(fish,
                 max_length = map_dbl(lens, "maximum_length"),
                 min_length = map_dbl(lens, "minimum_length"),
                 lens = map(lens, "fishCount") %>%
                   map(~set_names(as_data_frame(.), c("catch_len", "ct")))) %>%
        mutate(surveyDate = tmp_df$surveyDate,
               surveyType = tmp_df$surveyType,
               surveySubType = tmp_df$surveySubType,
               surveyID = tmp_df$surveyID) -> tmp_df
    }
    tmp_df
  }) -> lengths_df
glimpse(lengths_df)
## Observations: 21
## Variables: 8
## $ surveyDate <chr> "1988-07-19", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-...
## $ surveyID <chr> "107278", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "10...
## $ surveyType <chr> "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey"...
## $ surveySubType <chr> "Population Assessment", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re...
## $ fish <chr> NA, "PMK", "BLB", "LMB", "YEP", "BLG", "WTS", "WAE", "NOP", "GSF", "BLC", NA, "HSF", "PMK", "...
## $ max_length <dbl> NA, 6, 12, 16, 6, 7, 18, 18, 36, 4, 10, NA, 8, 7, 12, 12, 6, 8, 23, 38, 12
## $ min_length <dbl> NA, 3, 10, 1, 3, 3, 16, 16, 6, 4, 4, NA, 7, 4, 10, 12, 5, 3, 12, 9, 7
## $ lens <list> [NULL, <c("3", "6"), c("1", "3")>, <c("10", "11", "12"), c("1", "1", "4")>, <c("1", "16", "2...
print(lengths_df, n=nrow(lengths_df))
## # A tibble: 21 × 8
## surveyDate surveyID surveyType surveySubType fish max_length min_length lens
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <list>
## 1 1988-07-19 107278 Standard Survey Population Assessment <NA> NA NA <NULL>
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 <tibble [2 × 2]>
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 <tibble [3 × 2]>
## 4 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 <tibble [6 × 2]>
## 5 1995-07-17 107539 Standard Survey Re-Survey YEP 6 3 <tibble [3 × 2]>
## 6 1995-07-17 107539 Standard Survey Re-Survey BLG 7 3 <tibble [5 × 2]>
## 7 1995-07-17 107539 Standard Survey Re-Survey WTS 18 16 <tibble [3 × 2]>
## 8 1995-07-17 107539 Standard Survey Re-Survey WAE 18 16 <tibble [2 × 2]>
## 9 1995-07-17 107539 Standard Survey Re-Survey NOP 36 6 <tibble [17 × 2]>
## 10 1995-07-17 107539 Standard Survey Re-Survey GSF 4 4 <tibble [1 × 2]>
## 11 1995-07-17 107539 Standard Survey Re-Survey BLC 10 4 <tibble [6 × 2]>
## 12 1992-07-24 107587 Standard Survey Re-Survey <NA> NA NA <NULL>
## 13 2005-07-11 107906 Standard Survey Population Assessment HSF 8 7 <tibble [2 × 2]>
## 14 2005-07-11 107906 Standard Survey Population Assessment PMK 7 4 <tibble [4 × 2]>
## 15 2005-07-11 107906 Standard Survey Population Assessment BLB 12 10 <tibble [3 × 2]>
## 16 2005-07-11 107906 Standard Survey Population Assessment LMB 12 12 <tibble [1 × 2]>
## 17 2005-07-11 107906 Standard Survey Population Assessment YEP 6 5 <tibble [2 × 2]>
## 18 2005-07-11 107906 Standard Survey Population Assessment BLG 8 3 <tibble [6 × 2]>
## 19 2005-07-11 107906 Standard Survey Population Assessment WAE 23 12 <tibble [8 × 2]>
## 20 2005-07-11 107906 Standard Survey Population Assessment NOP 38 9 <tibble [20 × 2]>
## 21 2005-07-11 107906 Standard Survey Population Assessment BLC 12 7 <tibble [4 × 2]>
You can expand the nested catch observations this way:
filter(lengths_df, !map_lgl(lens, is.null)) %>%
  unnest(lens)
## # A tibble: 98 × 9
## surveyDate surveyID surveyType surveySubType fish max_length min_length catch_len ct
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <int>
## 1 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 3 1
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 6 3
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 10 1
## 4 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 11 1
## 5 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 12 4
## 6 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 1 1
## 7 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 16 1
## 8 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 2 6
## 9 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 4 4
## 10 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 5 2
## # ... with 88 more rows
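One side note of mine, not part of the original answer: a few helpers used in this thread have since been superseded, so a current tidyverse may warn about them. The replacements are drop-in:
# data_frame(...)          -> tibble(...)
# as_data_frame(x)         -> as_tibble(x)
# purrr::at_depth(x, 2, f) -> purrr::map_depth(x, 2, f)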

Stacking columns without melting by id

I imported a JSON file using rjson and converted it to a data.frame, but all of the data is spread out widthwise and the column names contain the key information.
library(rjson)
stations <- fromJSON(file = "station_information.json")
test <- as.data.frame(stations[3])
What this looks like is:
> dim(test)
[1] 2 5985
> test[1:27]
data.stations.station_id data.stations.name data.stations.short_name
1 72 W 52 St & 11 Ave 6926.01
2 72 W 52 St & 11 Ave 6926.01
data.stations.lat data.stations.lon data.stations.region_id
1 40.76727 -73.99393 71
2 40.76727 -73.99393 71
data.stations.rental_methods data.stations.capacity
1 KEY 39
2 CREDITCARD 39
data.stations.eightd_has_key_dispenser data.stations.station_id.1
1 FALSE 79
2 FALSE 79
data.stations.name.1 data.stations.short_name.1 data.stations.lat.1
1 Franklin St & W Broadway 5430.08 40.71912
2 Franklin St & W Broadway 5430.08 40.71912
data.stations.lon.1 data.stations.region_id.1 data.stations.rental_methods.1
1 -74.00667 71 KEY
2 -74.00667 71 CREDITCARD
data.stations.capacity.1 data.stations.eightd_has_key_dispenser.1
1 33 FALSE
2 33 FALSE
data.stations.station_id.2 data.stations.name.2 data.stations.short_name.2
1 82 St James Pl & Pearl St 5167.06
2 82 St James Pl & Pearl St 5167.06
data.stations.lat.2 data.stations.lon.2 data.stations.region_id.2
1 40.71117 -74.00017 71
2 40.71117 -74.00017 71
data.stations.rental_methods.2 data.stations.capacity.2
1 KEY 27
2 CREDITCARD 27
data.stations.eightd_has_key_dispenser.2
1 FALSE
2 FALSE
As you can see, this cannot be fixed with a simple t() transpose or a melt() call. I am wondering what I am doing wrong on import, or on conversion to data.frame, that leaves me with a stretched data frame in which the index of what should be rows is appended to the column names.
I have tried both of these methods but I am left with the same stretched data:
plyr::ldply(stations, data.frame)
do.call(rbind, lapply(stations, data.frame, stringsAsFactors=FALSE))
In the end I would like my output to look like every 9 columns were "cut" and stacked onto the first 9, so that I am left with 655 rows and 9 columns. Any suggestions would be appreciated.
NOTE: I am taking the JSON directly from https://gbfs.citibikenyc.com/gbfs/en/station_information.json (it is not a large file).
Here is a reproducible example of the first 27 columns, which should be reshaped to a data frame with 3 rows and 9 columns:
> dput(df)
structure(list(data.stations.station_id = structure(c(1L, 1L), class = "factor", .Label = "72"),
data.stations.name = structure(c(1L, 1L), class = "factor", .Label = "W 52 St & 11 Ave"),
data.stations.short_name = structure(c(1L, 1L), class = "factor", .Label = "6926.01"),
data.stations.lat = c(40.76727216, 40.76727216), data.stations.lon = c(-73.99392888,
-73.99392888), data.stations.region_id = c(71, 71), data.stations.rental_methods = structure(c(2L,
1L), .Label = c("CREDITCARD", "KEY"), class = "factor"),
data.stations.capacity = c(39, 39), data.stations.eightd_has_key_dispenser = c(FALSE,
FALSE), data.stations.station_id.1 = structure(c(1L, 1L), class = "factor", .Label = "79"),
data.stations.name.1 = structure(c(1L, 1L), class = "factor", .Label = "Franklin St & W Broadway"),
data.stations.short_name.1 = structure(c(1L, 1L), class = "factor", .Label = "5430.08"),
data.stations.lat.1 = c(40.71911552, 40.71911552), data.stations.lon.1 = c(-74.00666661,
-74.00666661), data.stations.region_id.1 = c(71, 71), data.stations.rental_methods.1 = structure(c(2L,
1L), .Label = c("CREDITCARD", "KEY"), class = "factor"),
data.stations.capacity.1 = c(33, 33), data.stations.eightd_has_key_dispenser.1 = c(FALSE,
FALSE), data.stations.station_id.2 = structure(c(1L, 1L), class = "factor", .Label = "82"),
data.stations.name.2 = structure(c(1L, 1L), class = "factor", .Label = "St James Pl & Pearl St"),
data.stations.short_name.2 = structure(c(1L, 1L), class = "factor", .Label = "5167.06"),
data.stations.lat.2 = c(40.71117416, 40.71117416), data.stations.lon.2 = c(-74.00016545,
-74.00016545), data.stations.region_id.2 = c(71, 71), data.stations.rental_methods.2 = structure(c(2L,
1L), .Label = c("CREDITCARD", "KEY"), class = "factor"),
data.stations.capacity.2 = c(27, 27), data.stations.eightd_has_key_dispenser.2 = c(FALSE,
FALSE)), .Names = c("data.stations.station_id", "data.stations.name",
"data.stations.short_name", "data.stations.lat", "data.stations.lon",
"data.stations.region_id", "data.stations.rental_methods", "data.stations.capacity",
"data.stations.eightd_has_key_dispenser", "data.stations.station_id.1",
"data.stations.name.1", "data.stations.short_name.1", "data.stations.lat.1",
"data.stations.lon.1", "data.stations.region_id.1", "data.stations.rental_methods.1",
"data.stations.capacity.1", "data.stations.eightd_has_key_dispenser.1",
"data.stations.station_id.2", "data.stations.name.2", "data.stations.short_name.2",
"data.stations.lat.2", "data.stations.lon.2", "data.stations.region_id.2",
"data.stations.rental_methods.2", "data.stations.capacity.2",
"data.stations.eightd_has_key_dispenser.2"), row.names = c(NA,
-2L), class = "data.frame")
So the output structure should look like this (obviously with the real values rather than NA). Each row corresponds to the appended index number from the original data frame's column names:
> output
data.stations.station_id data.stations.name data.stations.short_name
1 NA NA NA
2 NA NA NA
3 NA NA NA
data.stations.lat data.stations.lon data.stations.region_id
1 NA NA NA
2 NA NA NA
3 NA NA NA
data.stations.rental_methods data.stations.capacity
1 NA NA
2 NA NA
3 NA NA
data.stations.eightd_has_key_dispenser
1 NA
2 NA
3 NA
I would try:
library(data.table)
rbindlist(
  lapply(split(seq_along(df), c(0, (seq_along(df) %/% 9)[-length(df)])),
         function(x) df[, x]),
  use.names = FALSE
)
## data.stations.station_id data.stations.name data.stations.short_name data.stations.lat
## 1: 72 W 52 St & 11 Ave 6926.01 40.76727
## 2: 72 W 52 St & 11 Ave 6926.01 40.76727
## 3: 79 Franklin St & W Broadway 5430.08 40.71912
## 4: 79 Franklin St & W Broadway 5430.08 40.71912
## 5: 82 St James Pl & Pearl St 5167.06 40.71117
## 6: 82 St James Pl & Pearl St 5167.06 40.71117
## data.stations.lon data.stations.region_id data.stations.rental_methods
## 1: -73.99393 71 KEY
## 2: -73.99393 71 CREDITCARD
## 3: -74.00667 71 KEY
## 4: -74.00667 71 CREDITCARD
## 5: -74.00017 71 KEY
## 6: -74.00017 71 CREDITCARD
## data.stations.capacity data.stations.eightd_has_key_dispenser
## 1: 39 FALSE
## 2: 39 FALSE
## 3: 33 FALSE
## 4: 33 FALSE
## 5: 27 FALSE
## 6: 27 FALSE
That is, create a list of data.frames with 9 columns each, and rbind them. This way, you won't have problems of data coercion when converting to a matrix.
This results in a 6 row x 9 column data.table. Not sure what rule you want to use to drop rows to end up with just 3 rows....
But I think you're trying to solve a problem that does not exist. Try reading your data like this:
library(jsonlite)
x <- fromJSON("https://gbfs.citibikenyc.com/gbfs/en/station_information.json")
head(x[[3]]$stations)
## station_id name short_name lat lon region_id
## 1 72 W 52 St & 11 Ave 6926.01 40.76727 -73.99393 71
## 2 79 Franklin St & W Broadway 5430.08 40.71912 -74.00667 71
## 3 82 St James Pl & Pearl St 5167.06 40.71117 -74.00017 71
## 4 83 Atlantic Ave & Fort Greene Pl 4354.07 40.68383 -73.97632 71
## 5 116 W 17 St & 8 Ave 6148.02 40.74178 -74.00150 71
## 6 119 Park Ave & St Edwards St 4700.06 40.69609 -73.97803 71
## rental_methods capacity eightd_has_key_dispenser
## 1 KEY, CREDITCARD 39 FALSE
## 2 KEY, CREDITCARD 33 FALSE
## 3 KEY, CREDITCARD 27 FALSE
## 4 KEY, CREDITCARD 62 FALSE
## 5 KEY, CREDITCARD 39 FALSE
## 6 KEY, CREDITCARD 19 FALSE
dim(x[[3]]$stations)
# [1] 665 9
You can use matrix(), but make sure all of your factor columns are characters first, i.e.
ind <- sapply(df, is.factor)
df[ind] <- lapply(df[ind], as.character)
final_df <- as.data.frame(matrix(unlist(df), ncol = 9, byrow = TRUE))
final_df[c(TRUE, FALSE),]
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1 72 72 W 52 St & 11 Ave W 52 St & 11 Ave 6926.01 6926.01 40.76727216 40.76727216 -73.99392888
#3 79 79 Franklin St & W Broadway Franklin St & W Broadway 5430.08 5430.08 40.71911552 40.71911552 -74.00666661
#5 82 82 St James Pl & Pearl St St James Pl & Pearl St 5167.06 5167.06 40.71117416 40.71117416 -74.00016545
On the other hand, as #A5C1D2H2I1M1N2O1R2T1 notes, you might be looking for this instead:
as.data.frame(matrix(c(t(df)), ncol = 9, byrow = TRUE))
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1 72 W 52 St & 11 Ave 6926.01 40.76727 -73.99393 71 KEY 39 FALSE
#2 79 Franklin St & W Broadway 5430.08 40.71912 -74.00667 71 KEY 33 FALSE
#3 82 St James Pl & Pearl St 5167.06 40.71117 -74.00017 71 KEY 27 FALSE
#4 72 W 52 St & 11 Ave 6926.01 40.76727 -73.99393 71 CREDITCARD 39 FALSE
#5 79 Franklin St & W Broadway 5430.08 40.71912 -74.00667 71 CREDITCARD 33 FALSE
#6 82 St James Pl & Pearl St 5167.06 40.71117 -74.00017 71 CREDITCARD 27 FALSE
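If you also want to keep the original column names instead of V1-V9 (a small addition of mine, not part of the answer above), you can reuse the first nine names from df, which carry no .1/.2 suffix:
out <- as.data.frame(matrix(c(t(df)), ncol = 9, byrow = TRUE))
names(out) <- names(df)[1:9]  # e.g. data.stations.station_id, data.stations.name, ...
out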

How can I replace empty cells with NA in R?

I'm new to R and have been trying a bunch of examples, but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is a list of two tables, so you need to work on each list element, which can be done with lapply:
table <- lapply(table, function(x){
  x[x == ""] <- NA
  return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
  x[] <- lapply(x, function(y) type.convert(as.character(y), as.is = TRUE))
  x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or, more concisely but with a lot of packages:
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>
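A final caveat from me rather than the original answer: dmap() was later moved out of purrr into the purrrlyr package, so on a current tidyverse the same step can be written with mutate(across()) instead:
# assumes `table` is the list returned by readHTMLTable() above,
# with tidyverse and janitor already attached
df_list <- map(table, ~ .x %>%
  clean_names() %>%
  mutate(across(everything(), as.character)) %>%
  type_convert())
df_list[[1]]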