I'm trying to retrieve the medals table from the Wikipedia page for the 2012 Summer Olympics.
library(rvest)
library(magrittr)
url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
xpath0 <- '//*[@id="mw-content-text"]/table[1]'
xpath1 <- '//*[@id="mw-content-text"]/table[2]'
xpath2 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]'
xpath3 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]/table'
tb <- url %>%
html() %>%
html_nodes(xpath=xpath0) %>%
html_nodes("") %>%
html_table()
Both xpath0 and xpath1 return an error:
Error in parse_simple_selector(stream) :
Expected selector, got <EOF at 1>
xpath2 and xpath3 return empty lists.
At the same time, I tried to use SelectorGadget (https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html) to point to the exact element. I got
//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)] |
//*[contains(concat( " ", @class, " " ), concat( " ",
"headerSortDown", " " ))]
and the Error
Error in parse_simple_selector(stream) :
Expected selector, got
I really appreciate any help.
Joa
The first table with the names has a complicated structure and seems to be very difficult to convert into a standard format. At least I didn't succeed.
A summary of the number of medals by sport and the total medals can be obtained with
library(rvest) #v.0.2.0.9000
url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
tb <- read_html(url) %>% html_node("table.wikitable:nth-child(2)") %>% html_table(fill=TRUE)
#> head(tb)
# Medals by sport NA NA NA NA NA NA
#1 Sport 01 ! 02 ! 03 ! Total NA NA
#2 Swimming 16 9 6 31 NA NA
#3 Track & field 9 12 7 28 NA NA
#4 Gymnastics 3 1 2 6 NA NA
#5 Shooting 3 0 1 4 NA NA
#6 Tennis 3 0 1 4 NA NA
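If you want usable column names, one option is to promote that first data row to the header and drop the all-NA filler columns. A minimal cleanup sketch, assuming the structure printed above (the names Gold/Silver/Bronze stand in for the medal-icon columns):
# Drop columns that are entirely NA, then give the remaining ones sensible names
tb_clean <- tb[, colSums(is.na(tb)) < nrow(tb)]
names(tb_clean) <- c("Sport", "Gold", "Silver", "Bronze", "Total")
tb_clean <- tb_clean[-1, ]                          # first row only repeated the header
tb_clean[-1] <- lapply(tb_clean[-1], as.integer)    # medal counts as integers
head(tb_clean)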
Then there is another table summarizing all competitors that you can get with
tb2 <- read_html(url) %>% html_node("table.wikitable:nth-child(20)") %>% html_table()
#> head(tb2)
# Sport Men Women Total
#1 Archery 3 3 6
#2 Athletics (track and field) 63 62 125
#3 Badminton 2 1 3
#4 Basketball 12 12 24
#5 Boxing 9 3 12
#6 Canoeing 5 2 7
And this is the table of multiple medalists:
tb3 <- read_html(url) %>% html_node("table.wikitable:nth-child(8)") %>% html_table(fill=TRUE)
#> head(tb3)
# Multiple medalists NA NA NA NA NA NA
#1 Name Sport 01 ! 02 ! 03 ! Total NA
#2 Michael Phelps Swimming 4 2 0 6 NA
#3 Missy Franklin Swimming 4 0 1 5 NA
#4 Allison Schmitt Swimming 3 1 1 5 NA
#5 Ryan Lochte Swimming 2 2 1 5 NA
#6 Allyson Felix Track & field 3 0 0 3 NA
It really depends on which table you want to have, as pointed out by @Metrics. There are many tables on that page.
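If you are not sure which nth-child index a table has, one way (a sketch) is to pull every wikitable on the page at once and inspect them:
library(rvest)

url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
all_tables <- read_html(url) %>%
  html_nodes("table.wikitable") %>%   # every wikitable on the page
  html_table(fill = TRUE)

length(all_tables)        # how many tables were found
lapply(all_tables, head)  # peek at each one to find the table you want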
Related
I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.
Using library(rvest) this bit of code will reliably download the table that corresponds to today's date (I'm only interested in the first of the three tables).
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_table()
off <- off[[1]]
How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?
I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
html_table()
off <- off[[1]]
Scrolling through past dates in my browser, I've identified various places in the HTML that reference the prior date, but I can't seem to change it from R, let alone get the downloaded table to reflect the change:
webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > header > div")
I've messed around some with html_form(), follow_link(), and set_values() also, but to no avail.
Is there a good way to navigate this particular URL in R?
You can consider the following approach:
library(RSelenium)
library(rvest)
# Start a Selenium-driven Chrome session on a port chosen at random
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()

# Open the referee-assignments page
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)

# Open the date filter, clear the date input and type the desired date
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()

# Close the date picker again and submit the filter form ("Go")
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()

# Grab the rendered page source and parse all tables with rvest
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
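When you are done with this approach, you can close the browser and stop the Selenium server:
# Clean up the Selenium session when finished
remDr$close()
rd$server$stop()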
Here is another approach that can be considered:
library(RDCOMClient)
library(rvest)
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 8 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate" NA NA
2 "S" "M" "T" "W" "T" "F" "S"
3 "" "" "" "" "" "" "1"
4 "2" "3" "4" "5" "6" "7" "8"
5 "9" "10" "11" "12" "13" "14" "15"
6 "16" "17" "18" "19" "20" "21" "22"
7 "23" "24" "25" "26" "27" "28" "29"
8 "30" "31" "" "" "" "" ""
[[3]]
# A tibble: 7 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "S" "M" "T" "W" "T" "F" "S"
2 "" "" "" "" "" "" "1"
3 "2" "3" "4" "5" "6" "7" "8"
4 "9" "10" "11" "12" "13" "14" "15"
5 "16" "17" "18" "19" "20" "21" "22"
6 "23" "24" "25" "26" "27" "28" "29"
7 "30" "31" "" "" "" "" ""
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
If you install the Docker software (see https://docs.docker.com/engine/install/), you can consider the following approach with Firefox:
library(RSelenium)
library(rvest)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana @ Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland @ Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto @ Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas @ Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix @ L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
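To cover a range of dates (say 2022-10-04 through 2022-10-06), you can wrap the date-picking steps above in a small helper and loop over the dates with the session still open. This is only a sketch reusing the same selectors; the helper name scrape_date is just illustrative:
# Hypothetical helper: fetch the first referee table for a given date,
# reusing the open `remDr` session and the selectors shown above
scrape_date <- function(remDr, date_chr) {
  remDr$navigate("https://official.nba.com/referee-assignments/")
  remDr$findElement("css selector", "#ref-filters-menu > li > div > button")$clickElement()
  date_input <- remDr$findElement("id", "ref-date")
  date_input$clearElement()
  date_input$sendKeysToElement(list(date_chr))
  remDr$findElement("css selector", "#date-filter")$submitElement()
  Sys.sleep(2)  # give the page a moment to refresh
  tables <- read_html(remDr$getPageSource()[[1]]) %>% html_table()
  tables[[1]]
}

dates <- as.character(seq(as.Date("2022-10-04"), as.Date("2022-10-06"), by = "day"))
results <- lapply(dates, scrape_date, remDr = remDr)
names(results) <- dates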
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the url changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my algorithm so that it can obtain the table across different years, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the html source code, but no luck.
With rvest, you can set the value in the form and submit it. Wrapped in purrr::map_dfr() to iterate and row-bind the results into a data frame:
library(rvest)
sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")
baseball <- purrr::map_dfr(
2017:2015,
function(y){
Sys.sleep(10 + runif(1)) # be polite
form <- sess %>%
html_node(xpath = '//form[@action="index.php"]') %>%
html_form() %>%
set_values(year = y)
sess <- submit_form(sess, form)
sess %>%
read_html() %>%
html_node('#TTdata') %>%
html_table(header = TRUE)
}
)
tibble::as_data_frame(baseball) # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>
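Note that from rvest 1.0.0 on, html_session(), set_values() and submit_form() have been renamed to session(), html_form_set() and session_submit(). A rough sketch of the equivalent calls with the newer API:
library(rvest)

sess <- session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")
form <- sess %>%
  html_element(xpath = '//form[@action="index.php"]') %>%
  html_form() %>%
  html_form_set(year = 2016)          # set the Year field in the form
sess <- session_submit(sess, form)
sess %>% html_element("#TTdata") %>% html_table(header = TRUE)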
I am trying to subset a data frame based on each userid and order_date.
If ecomm_id and pulse_id exist in a row for that userid and order_date, that row should be selected into the new data frame.
Otherwise, only one row (with no ecomm_id) should be kept in the new data frame and all other rows discarded.
Sample data:
userid returning device store_n testid ecomm_id pulse_id order_date
1.00 1 0 9328 Experience E 1 23 7/25/2015
1.00 1 0 NA Experience E NA NA 7/25/2015
2.00 1 1 NA Experience C NA NA 7/14/2015
3.00 1 0 3486 Experience F 2 86 7/23/2015
3.00 1 0 NA Experience F NA NA 7/24/2015
3.00 1 0 NA Experience F NA NA 7/24/2015
Expected Output:
userid returning device store_n testid ecomm_id pulse_id order_date
1.00 1 0 9328 Experience E 1 23 7/25/2015
2.00 1 1 NA Experience C NA NA 7/14/2015
3.00 1 0 3486 Experience F 2 86 7/23/2015
3.00 1 0 NA Experience F NA NA 7/24/2015
Hope this helps!
df <- data.frame(userid=c(1,1,2,3,3,3),
returning=c(1,1,1,1,1,1),
device=c(0,0,1,0,0,0),
store_n=c(9328,NA,NA,3486,NA,NA),
testid=c('Experience E','Experience E','Experience C','Experience F','Experience F','Experience F'),
ecomm_id=c(1,NA,NA,2,NA,NA),
pulse_id=c(23,NA,NA,86,NA,NA),
order_date=c('7/25/2015','7/25/2015','7/14/2015','7/23/2015','7/24/2015','7/24/2015')
)
library(dplyr)
df1 <- unique(df) %>% group_by(userid, order_date) %>% summarise(count = n())
df1 <- merge(unique(df), df1, by = c("userid", "order_date"))
final_df <- df1[!(is.na(df1$ecomm_id) & is.na(df1$pulse_id) & df1$count > 1), -ncol(df1)]
Don't forget to let us know if it solved your problem :)
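As an aside, a more compact dplyr variant of the same idea filters within each group (a sketch):
library(dplyr)

df %>%
  group_by(userid, order_date) %>%
  # keep the rows that have an ecomm_id; if none in the group does, keep just one row
  filter(if (any(!is.na(ecomm_id))) !is.na(ecomm_id) else row_number() == 1) %>%
  ungroup()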
With data.table, this becomes a concise "one-liner":
library(data.table)
setDT(DT)[order(ecomm_id), .SD[1], keyby = .(userid, order_date)]
userid order_date returning device store_n testid ecomm_id pulse_id
1: 1.00 7/25/2015 1 0 9328 Experience E 1 23
2: 2.00 7/14/2015 1 1 NA Experience C NA NA
3: 3.00 7/23/2015 1 0 3486 Experience F 2 86
4: 3.00 7/24/2015 1 0 NA Experience F NA NA
By ordering by ecomm_id, the NA entries are moved to the bottom. Now, for each combination of userid and order_date the first element within that group is picked.
Note that this assumes that there is at most one entry per group in case of non-NA ecomm_ids because the OP has specified:
If ecomm_id and pulse_id exists in the row for that userid and for order_date, that row should be selected to new dataframe.
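For completeness, with the df constructed in the dplyr answer above:
library(data.table)

DT <- as.data.table(df)   # df as built in the previous answer
DT[order(ecomm_id), .SD[1], keyby = .(userid, order_date)]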
I have imported a .json file using library(jsonlite) and stream_in(file(".json")).
However, one of the columns still looks like JSON.
I'm not really sure how to proceed in order to extract the ID and email fields from that JSON column.
My example:
date <- as.Date(as.character( c("2015-02-13",
"2015-02-14",
"2015-02-14")))
ID <- c(1,2,3)
name <- c("John","Michael","Thomas")
drinks <- c("Beer","Coffee","Tee")
consumed <- c(2,5,3)
john <- "{\"employeID\":\"1\",\"other_details\":{\"email\":\"john@gmx.com\"},\"computer\":\"yes\"}"
michael <- "{\"employeID\":\"2\",\"other_details\":{\"email\":\"michael@yahoo.com\"},\"computer\":\"yes\"}"
thomas <- "{\"employeID\":\"3\",\"other_details\":{\"email\":\"thomas@gmail.com\"},\"computer\":\"yes\"}"
json <- c(john,michael,thomas)
df <- data.frame(date,ID,name,drinks,consumed,json)
Given that data.frame, I would like to get the following format:
date ID name drinks consumed email computer
#1 2015-02-13 1 John Beer 2 john@gmx.com yes
#2 2015-02-14 2 Michael Coffee 5 michael@yahoo.com no
#3 2015-02-14 3 Thomas Tee 3 thomas@gmail.com yes
What I tried first was to use library(jsonlite) again in different variations, but it always results in:
fromJSON(df$json[1])
Error: Argument 'txt' must be a JSON string, URL or file.
How can I extract these fields properly?
df$json is a factor vector while fromJSON only accepts a JSON string, URL or file. You can try
fromJSON(as.character(df$json[1]))
or add stringsAsFactors=FALSE when you create df.
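For example, keeping the column as character from the start avoids the coercion altogether:
# Keep json as a character column so fromJSON() can parse it directly
df <- data.frame(date, ID, name, drinks, consumed, json,
                 stringsAsFactors = FALSE)
jsonlite::fromJSON(df$json[1])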
To do your task, you can try:
library(tidyverse)
df %>%
filter(json != "{}") %>% # Drop rows with json == "{}"
rowwise() %>%
do(data.frame(ID = .$ID, jsonlite::fromJSON(.$json), stringsAsFactors=FALSE)) %>%
merge(df %>% select(-json), by="ID", all.y=TRUE)
Output:
ID employeID email computer date name drinks consumed
1 1 1 john@gmx.com yes 2015-02-13 John Beer 2
2 2 2 michael@yahoo.com yes 2015-02-14 Michael Coffee 5
3 3 3 thomas@gmail.com yes 2015-02-14 Thomas Tee 3
It can handle cases with "{}" in json column.
df2 <- df %>%
rbind(data.frame(date="2015-02-14", ID=4, name="Kitman",
drinks="Chocolate", consumed=1, json="{}"))
df2 %>%
filter(json != "{}") %>%
rowwise() %>%
do(data.frame(ID = .$ID, jsonlite::fromJSON(.$json), stringsAsFactors=FALSE)) %>%
merge(df2 %>% select(-json), by="ID", all.y=TRUE)
Output:
ID employeID email computer date name drinks consumed
1 1 1 john@gmx.com yes 2015-02-13 John Beer 2
2 2 2 michael@yahoo.com yes 2015-02-14 Michael Coffee 5
3 3 3 thomas@gmail.com yes 2015-02-14 Thomas Tee 3
4 4 <NA> <NA> <NA> 2015-02-14 Kitman Chocolate 1
Outdated:
cbind(
df %>% select(-json),
df$json %>%
map(~as.data.frame(jsonlite::fromJSON(.))) %>%
do.call("rbind", .)
)
Output:
date ID name drinks consumed employeID email computer
1 2015-02-13 1 John Beer 2 1 john@gmx.com yes
2 2015-02-14 2 Michael Coffee 5 2 michael@yahoo.com yes
3 2015-02-14 3 Thomas Tee 3 3 thomas@gmail.com yes
First, try:
ndjson::stream_in("filename.json")
The ndjson package is faster than jsonlite and was built for flattening (it's very task-specific and not as swiss-army-knife-ish as the highly useful jsonlite pkg).
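A minimal usage sketch (the file name is just a placeholder):
library(ndjson)

# stream_in() returns an already-flattened table, so nested fields such as
# other_details.email come back as ordinary columns
flat <- stream_in("employees.json")
head(flat)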
Or, we can keep the tidyverse idioms all the way through:
library(tidyverse)
map_df(df$json, ~jsonlite::fromJSON(as.character(.))) %>%
bind_cols(select(df, -json)) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.list, as.character) %>%
select(ID, name, drinks, consumed, everything())
## # A tibble: 3 × 8
## ID name drinks consumed computer employeID other_details.email date
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <date>
## 1 1 John Beer 2 yes 1 john@gmx.com 2015-02-13
## 2 2 Michael Coffee 5 yes 2 michael@yahoo.com 2015-02-14
## 3 3 Thomas Tee 3 yes 3 thomas@gmail.com 2015-02-14
And, you get your character columns.
I'm new to R, and have been trying a bunch of examples but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is a list of two tables, so you need to work on each list element, which can be done using lapply:
table <- lapply(table, function(x){
x[x == ""] <- NA
return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
x[] <- lapply(x, function(y){type.convert(as.character(y), as.is = TRUE)});
x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
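The key behaviour here is that type.convert() treats blank fields as missing when a column is otherwise numeric, which is what turns the empty cells into NA:
type.convert(c("442", "", "25"), as.is = TRUE)
#> [1] 442  NA  25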
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>