Robust Scaler in recipes Package - tidymodels

Is there a robust scaler method in the recipes package in the R programming language? As a result of my research, I could not find this method.

I'm assuming you are referring to the RobustScaler from scikit-learn. You are correct that there isn't a similar step in the recipes package.
It is implemented in the extrasteps package which you can install with
# install.packages("devtools")
devtools::install_github("EmilHvitfeldt/extrasteps")
Then you can use the step_robust() which will do what you are expecting.
library(recipes)
library(extrasteps)
rec <- recipe(~., data = mtcars) %>%
step_robust(all_predictors()) %>%
prep()
rec %>%
bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.244 0 -0.177 -0.156 0.244 -0.685 -0.623 0 1 0 1
#> 2 0.244 0 -0.177 -0.156 0.244 -0.437 -0.344 0 1 0 1
#> 3 0.488 -0.5 -0.430 -0.359 0.185 -0.977 0.448 1 1 0 -0.5
#> 4 0.298 0 0.301 -0.156 -0.732 -0.107 0.862 1 0 -1 -0.5
#> 5 -0.0678 0.5 0.798 0.623 -0.649 0.112 -0.344 0 0 -1 0
#> 6 -0.149 0 0.140 -0.216 -1.11 0.131 1.25 1 0 -1 -0.5
#> 7 -0.664 0.5 0.798 1.46 -0.577 0.238 -0.932 0 0 -1 1
#> 8 0.705 -0.5 -0.242 -0.731 -0.00595 -0.131 1.14 1 0 0 0
#> 9 0.488 -0.5 -0.271 -0.335 0.268 -0.170 2.59 1 0 0 0
#> 10 0 0 -0.140 0 0.268 0.112 0.294 1 0 0 1
#> # … with 22 more rows
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 15.4 robust_hS9q6
#> 2 mpg median 19.2 robust_hS9q6
#> 3 mpg higher 22.8 robust_hS9q6
#> 4 cyl lower 4 robust_hS9q6
#> 5 cyl median 6 robust_hS9q6
#> 6 cyl higher 8 robust_hS9q6
#> 7 disp lower 121. robust_hS9q6
#> 8 disp median 196. robust_hS9q6
#> 9 disp higher 326 robust_hS9q6
#> 10 hp lower 96.5 robust_hS9q6
#> # … with 23 more rows
rec <- recipe(~., data = mtcars) %>%
step_robust(all_predictors(), range = c(0.1, 0.9)) %>%
prep()
rec %>%
bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.114 0 -0.115 -0.0732 0.171 -0.337 -0.281 0 1 0
#> 2 0.114 0 -0.115 -0.0732 0.171 -0.215 -0.155 0 1 0
#> 3 0.229 -0.5 -0.280 -0.169 0.129 -0.480 0.202 1 1 0
#> 4 0.140 0 0.196 -0.0732 -0.512 -0.0526 0.388 1 0 -0.5
#> 5 -0.0317 0.5 0.519 0.293 -0.453 0.0550 -0.155 0 0 -0.5
#> 6 -0.0698 0 0.0910 -0.101 -0.778 0.0645 0.563 1 0 -0.5
#> 7 -0.311 0.5 0.519 0.687 -0.403 0.117 -0.420 0 0 -0.5
#> 8 0.330 -0.5 -0.157 -0.344 -0.00416 -0.0645 0.514 1 0 0
#> 9 0.229 -0.5 -0.176 -0.158 0.187 -0.0837 1.16 1 0 0
#> 10 0 0 -0.0910 0 0.187 0.0550 0.132 1 0 0
#> # … with 22 more rows, and 1 more variable: carb <dbl>
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 14.3 robust_MygTA
#> 2 mpg median 19.2 robust_MygTA
#> 3 mpg higher 30.1 robust_MygTA
#> 4 cyl lower 4 robust_MygTA
#> 5 cyl median 6 robust_MygTA
#> 6 cyl higher 8 robust_MygTA
#> 7 disp lower 80.6 robust_MygTA
#> 8 disp median 196. robust_MygTA
#> 9 disp higher 396 robust_MygTA
#> 10 hp lower 66 robust_MygTA
#> # … with 23 more rows

Related

Scraping Website with Unchanging URL in R

I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.
Using library(rvest) this bit of code will reliably download the table that corresponds to today's date (I'm only interested in the first of the three tables).
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_table()
off <- off[[1]]
How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?
I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
html_table()
off <- off[[1]]
Scrolling through past dates in my browser, I've identified various places in the html that reference the prior date; but I can't seem to change it from R, yet alone get the table I download to reflect a change:
webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > header > div")
I've messed around some with html_form(), follow_link(), and set_values() also, but to no avail.
Is there a good way to navigate this particular URL in R?
You can consider the following approach :
library(RSelenium)
library(rvest)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
Here is another approach that can be considered :
library(RDCOMClient)
library(rvest)
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 8 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate" NA NA
2 "S" "M" "T" "W" "T" "F" "S"
3 "" "" "" "" "" "" "1"
4 "2" "3" "4" "5" "6" "7" "8"
5 "9" "10" "11" "12" "13" "14" "15"
6 "16" "17" "18" "19" "20" "21" "22"
7 "23" "24" "25" "26" "27" "28" "29"
8 "30" "31" "" "" "" "" ""
[[3]]
# A tibble: 7 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "S" "M" "T" "W" "T" "F" "S"
2 "" "" "" "" "" "" "1"
3 "2" "3" "4" "5" "6" "7" "8"
4 "9" "10" "11" "12" "13" "14" "15"
5 "16" "17" "18" "19" "20" "21" "22"
6 "23" "24" "25" "26" "27" "28" "29"
7 "30" "31" "" "" "" "" ""
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
If you install the Docker software (see https://docs.docker.com/engine/install/), you can consider the following approach with firefox :
library(RSelenium)
library(rvest)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA

C5_rules() in Tidymodels

I would like to use tidymodels to fit a C5.0 rule-based classification model. I have specified the model as follows
c5_spec <-
C5_rules() %>%
set_engine("C5.0") %>%
set_mode("classification")
In the documentation for the C5_rules() command, I read the following.
The model is not trained or fit until the fit.model_spec() function is used with the data.
I'm not quite sure what I need to do with the parsnip model object after that. Every time I try to fit the model, I get the following error
preprocessor 1/1, model 1/1 (predictions): Error in predict.C5.0(object = object$fit, newdata = new_data, type = "class"): either a tree or rules must be provided
What am I missing?
Thank you very much!
That's a good start! You've defined your model spec, but if you're wanting to fit using a workflow, you'll need to create a recipe & workflow as well. Julia Silge's blog is hands down the best resource for getting used to working with tidymodels. Here's a reprex that fits a C5 classifier once to training data:
# load tidymodels & rules
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(rules)
#> Warning: package 'rules' was built under R version 4.1.1
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
cars_train %>%
mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
C5_rules() %>%
set_engine("C5.0")
# create workflow
cars_wf <-
workflow() %>%
add_recipe(cars_rec) %>%
add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
predict(cars_fit, new_data = cars_train) %>%
bind_cols(cars_train) %>%
select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 x 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # ... with 22 more rows
# confusion matrix
cars_preds %>%
conf_mat(truth = cyl,
estimate = .pred_class)
#> Warning in vec2table(truth = truth, estimate = estimate, dnn = dnn, ...): `truth`
#> was converted to a factor
#> Truth
#> Prediction 4 6 8
#> 4 11 0 0
#> 6 0 7 0
#> 8 0 0 14
Created on 2021-09-30 by the reprex package (v2.0.1)
I tried reprex by Mark Rieke and I got an error for the last command (conf_mat).
load tidymodels & rules
library(tidymodels)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
cars_train %>%
mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
C5_rules() %>%
set_engine("C5.0")
# create workflow
cars_wf <-
workflow() %>%
add_recipe(cars_rec) %>%
add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
predict(cars_fit, new_data = cars_train) %>%
bind_cols(cars_train) %>%
select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 × 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # … with 22 more rows
# confusion matrix
cars_preds %>%
conf_mat(truth = cyl,
estimate = .pred_class)
#> Error in `yardstick_table()`:
#> ! `truth` must be a factor.
#> ℹ This is an internal error in the yardstick package, please report it to the package authors.
#> Backtrace:
#> ▆
#> 1. ├─cars_preds %>% conf_mat(truth = cyl, estimate = .pred_class)
#> 2. ├─yardstick::conf_mat(., truth = cyl, estimate = .pred_class)
#> 3. └─yardstick:::conf_mat.data.frame(., truth = cyl, estimate = .pred_class)
#> 4. └─yardstick:::yardstick_table(truth = truth, estimate = estimate, case_weights = case_weights)
#> 5. └─rlang::abort("`truth` must be a factor.", .internal = TRUE)

web scraping understat website to retrieve table failing in R

I am trying to to pull out a table from the the website https://understat.com/league/EPL
The table I am trying to import into R is highlighted in red in the screenshot here;
screenshot of website
Using inspect tools I can see the xpath to the table as follows;
//*[#id="league-chemp"]/table
full XPath is
/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table
My code is as follows;
library(rvest)
library(selectr)
library(xml2)
library(jsonlite)
library(htmltab)
library(RCurl)
library(XML)
url <- 'https://understat.com/league/EPL'
webpage <- read_html('https://understat.com/league/EPL')
xpath <- "/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table/tbody"
nodes <- html_nodes(webpage, xpath = xpath)
However the response is;
> nodes
{xml_nodeset (0)}
I've hit a dead end, I think there maybe some embedded JSON code and javascript within the main html body of the response that is causing issues, but its all above my expertise right now.
I have been able to extract the table with the following code :
library(rvest)
library(RSelenium)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://understat.com/league/EPL"
remDr$navigate(url)
Sys.sleep(5)
html_Content <- remDr$getPageSource()[[1]]
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`<U+2116>` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`<U+2116>` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA
Here is another approach that can be considered :
library(RDCOMClient)
url <- "https://understat.com/league/EPL"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$Body()$innerHTML()
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`?` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`?` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA

R: how to toggle html page selection in web scraping

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the url changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my algorithm so that it can obtain the table across different Year, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the html source code, but no luck.
With rvest, you can set the value in the form and submit it. Wrapped in purrr::map_dfr to iterate and row-bind the results in to a data frame,
library(rvest)
sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")
baseball <- purrr::map_dfr(
2017:2015,
function(y){
Sys.sleep(10 + runif(1)) # be polite
form <- sess %>%
html_node(xpath = '//form[#action="index.php"]') %>%
html_form() %>%
set_values(year = y)
sess <- submit_form(sess, form)
sess %>%
read_html() %>%
html_node('#TTdata') %>%
html_table(header = TRUE)
}
)
tibble::as_data_frame(baseball) # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>

How can I replace empty cells with NA in R?

I'm new to R, and have been trying a bunch of examples but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is giving you a list of two tables, so you need to work on each list element, which can be done using lapply
table <- lapply(table, function(x){
x[x == ""] <- NA
return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
x[] <- lapply(x, function(y){type.convert(as.character(y), as.is = TRUE)});
x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>