How to scrape an HTML table with nested columns with rvest?

I have run into a problem scraping an HTML table with nested columns.
The table is from the immigration department of Hong Kong.
A screenshot is shown here:
I tried to do it with rvest, but the result is messy.
library(rvest)
library(tidyverse)

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
immiTable
My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and to produce a data frame of the whole table and save it to Excel.
I realize that teaching material on unnesting tables and scraping nested tables is rather scarce, so I need your guidance. Thank you very much for your help.

Here is a way. The header format complicates things, but the code below works. It extracts the entire table, not just the first row.
suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
  library(ggplot2)
})

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
page <- url_data %>% read_html()

# Column names: the "Arrival" and "Departure" header cells, prefixed accordingly
page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names

# Row names: the control points
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names

# Cell values: strip thousands separators (gsub, not sub, so every comma
# is removed), then reshape into a data frame
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  gsub(",", "", .) %>%
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final
final
#> Arrival Hong Kong Residents
#> Airport 4258
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 333
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 3404
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 7995
#> Arrival Mainland Visitors Arrival Other Visitors
#> Airport 1488 422
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 28 39
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 348 37
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 1864 498
#> Arrival Total Departure Hong Kong Residents
#> Airport 6168 3775
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 400 243
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 3789 1301
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 10357 5319
#> Departure Mainland Visitors
#> Airport 1154
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 194
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 524
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 1872
#> Departure Other Visitors Departure Total
#> Airport 315 5244
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 15 452
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 28 1853
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 358 7549
Created on 2022-09-18 with reprex v2.0.2
To make the pie chart in ggplot2, plot a bar chart and then switch to polar coordinates.
Airport <- final[1, ]
Airport %>%
  t() %>%
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
  coord_polar(theta = "y", start = 0) +
  theme_void()
Created on 2022-09-18 with reprex v2.0.2
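The question also asks for saving the table to Excel; a minimal sketch, assuming the writexl package (any Excel writer, e.g. openxlsx, would do):
library(writexl)

final %>%
  tibble::rownames_to_column("Control_Point") %>%  # keep the row names as a column
  write_xlsx("immd_stat_20220901.xlsx")            # file name is arbitrary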

An alternative would be to select the tbody rows, filtering out the hidden items by attribute, then add in the headers later.
library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>%
  html_elements(".table-passengerTrafficStat tbody tr")

prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c(
  "Control_Point",
  crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname()
)

df <- map_dfr(rows, function(x) {
  x %>%
    html_elements("td[headers]") %>%
    set_names(headers) %>%
    html_text()
}) %>%
  mutate(across(-1, ~ str_replace_all(.x, ",", "") %>% as.integer()))  # drop all thousands separators
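For reference, the header vector built with crossing() and unite() comes out sorted (crossing() sorts its inputs), arrivals before departures:
head(headers, 5)
#> [1] "Control_Point"           "arr_Hong Kong Residents"
#> [3] "arr_Mainland Visitors"   "arr_Other Visitors"
#> [5] "arr_Total"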
Or, somewhat condensed:
library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>%
  html_elements(".table-passengerTrafficStat tbody tr")

prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c(
  "Control_Point",
  crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname()
)

df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
  mutate(across(-1, ~ str_replace_all(.x, ",", "") %>% as.integer()))

Related

RSelenium and Rvest to create a table without html_table() from oddsportal.com

Recently, https://www.oddsportal.com/ changed their format. I can no longer use html_table() to parse the game result table. It seems like the only option here is to use html_text2() and reconstruct the table manually.
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)

url_results <- "https://www.oddsportal.com/basketball/australia/nbl/results/"
rD <- rsDriver(port = sample(7600)[1], browser = c("firefox"), chromever = NULL)
remDr <- rD$client
remDr$navigate(url_results)

# accept the cookie banner if present
try(remDr$findElement(using = "xpath", '//*[@id="onetrust-accept-btn-handler"]')$clickElement())
page <- remDr$getPageSource()
remDr$close()
rD$server$stop()

# this used to work before the site changed its format:
# R_table <- 0
# pop <- page[[1]] %>%
#   read_html() %>%
#   html_nodes(xpath = '//*[@id="tournamentTable"]') %>%
#   html_table()
# try(R_table <- pop[[1]])
# table <- R_table

R_table <- 0
pop <- page[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="app"]/div/div[1]/div/main/div[2]/div[7]') %>%
  html_text2()
try(R_table <- pop[[1]])
table <- R_table
Would anyone know good ways to reconstruct the table the way the website presents it? This is the output I used to get before they changed the format, using html_table():
V1 V2 V3 V4 V5
Today, 10 Jan 1 2 B's
21:30 Perth – New Zealand Breakers 93:90 1.98 1.79 16
19:30 Illawarra Hawks – Tasmania JackJumpers 89:92 3.95 1.24 16
08 Jan 2023 1 2 B's
16:00 Cairns Taipans – South East Melbourne 94:85 1.54 2.43 16
14:00 Adelaide – New Zealand Breakers 83:85 1.91 1.85 16
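One way to start the manual reconstruction, as a rough sketch: html_text2() preserves line breaks, so the block can be split into lines and the game rows picked out by the presence of a score. The regex and the whitespace-based column split below are assumptions about the text layout, not oddsportal's confirmed format:
library(tidyverse)

parse_results <- function(txt) {
  # one row per text line, keeping only lines that contain a score like "93:90"
  tibble(line = str_split(txt, "\n")[[1]]) %>%
    filter(str_detect(line, "\\b\\d+:\\d+\\b")) %>%
    separate(line,
             into = c("time", "match", "score", "odds1", "odds2", "bookies"),
             sep = "\\s{2,}", fill = "right", extra = "merge")
}

# e.g. parse_results(R_table)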

Placing "NA" into an Empty Position?

I am trying to scrape name/address information from Yellow Pages (https://www.yellowpages.ca/). I have a function (from (R) Webscraping Error: arguments imply differing number of rows: 1, 0) that is able to retrieve this information:
library(rvest)
library(dplyr)

scraper <- function(url) {
  page <- url %>%
    read_html()
  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
However, the address information is not always present. For example, there are several barbers listed on this page, https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON, and they all have addresses except one. As a result, when I run this function, I get the following error:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
My question: Is there some way I can modify the definition of the scraper function so that, when no address is listed, an NA appears in that row? For example:
barber address
1 barber111 address111
2 barber222 address222
3 barber333 NA
Is there some statement I could add, similar to SQL's CASE WHEN, that would grab the address or place an NA when the address is not there?
In order to match the businesses with their addresses, it is best to find a root node for each listing and get the text from the relevant child node. If the child node is empty, you can add an NA:
library(rvest)
library(dplyr)

scraper <- function(url) {
  nodes <- read_html(url) %>% html_elements(".listing_right_section")
  tibble(
    name = nodes %>% sapply(function(x) {
      x <- html_text2(html_elements(x, css = ".jsListingName"))
      if (length(x)) x else NA
    }),
    address = nodes %>% sapply(function(x) {
      x <- html_text2(html_elements(x, css = ".listing__address--full"))
      if (length(x)) x else NA
    })
  )
}
So now we can do:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
#> # A tibble: 14 x 2
#> name address
#> <chr> <chr>
#> 1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A~
#> 2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
#> 3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3~
#> 4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
#> 5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
#> 6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
#> 7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P ~
#> 8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
#> 9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
#> 10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0~
#> 11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
#> 12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
#> 13 Faiella Classic Hair <NA>
#> 14 Ben's Barbershop & Hairstyling <NA>
Created on 2022-09-16 with reprex v2.0.2
Perhaps an even simpler solution: html_element() (singular) returns a missing value when there is no match, so the name and address columns stay aligned.
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html() %>%
    html_elements(".listing_right_top_section")
  tibble(
    name = page %>%
      html_element(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_element(".listing__address--full") %>%
      html_text2()
  )
}
# A tibble: 14 x 2
name address
<chr> <chr>
1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A 1Z7
2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3C 5L1
4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P 1J6
8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0M 1N0
11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
13 Faiella Classic Hair NA
14 Ben's Barbershop & Hairstyling NA
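A minimal demo of that behavior with made-up inline HTML: html_element() returns exactly one (possibly missing) match per node, so html_text2() yields NA instead of silently dropping the listing, which is what html_elements() would do.
library(rvest)

# two listings; the second has no <span>, so its slot becomes NA
doc <- minimal_html('
  <div class="listing"><span>first</span></div>
  <div class="listing"></div>
')
doc %>%
  html_elements(".listing") %>%
  html_element("span") %>%
  html_text2()
#> [1] "first" NA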

Tidymodels prediction methods giving different results

I'm a bit confused about getting metrics from resamples using tidymodels.
I seem to be getting 3 different metrics from the same set of resamples, depending on whether I use collect_predictions() %>% metrics() or simply collect_metrics().
Here is a simple example...
library(tidyverse)
library(tidymodels)

starwars_df <- starwars %>% select(name:sex) %>% drop_na()

lasso_linear_reg_glmnet_spec <-
  linear_reg(penalty = .1, mixture = 1) %>%
  set_engine('glmnet')

basic_rec <-
  recipe(mass ~ height + sex + skin_color,
         data = starwars_df) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors())

sw_wf <- workflow() %>%
  add_recipe(basic_rec) %>%
  add_model(lasso_linear_reg_glmnet_spec)

sw_boots <- bootstraps(starwars_df, times = 50)

resampd <- fit_resamples(
  sw_wf,
  sw_boots,
  control = control_resamples(save_pred = TRUE)
)
The following three lines give different results:
resampd %>% collect_predictions(summarize = TRUE) %>% metrics(mass, .pred)
resampd %>% collect_predictions(summarize = FALSE) %>% metrics(mass, .pred)
resampd %>% collect_metrics()
As an additional question, what would be the best/correct way to get confidence intervals for the RMSE in the above example? Here is one way:
individ_metrics <- resampd %>%
  collect_predictions() %>%
  group_by(id) %>%
  rmse(mass, .pred)
confintr::ci_mean(individ_metrics$.estimate)
mean(individ_metrics$.estimate)
Thanks!
The reason that none of those are the same is that they are not aggregated in the same way. It turns out that taking a mean of a set of means doesn't give you the same (right) result as taking the mean of the whole underlying set. If you were to do something like resampd %>% collect_predictions(summarize = TRUE) %>% metrics(mass, .pred), that is like taking a mean of a set of means.
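A toy illustration of that pitfall, using unequal group sizes (as bootstrap assessment sets generally are):
x <- c(1, 2, 3, 10)
g <- c("a", "a", "a", "b")
mean(tapply(x, g, mean))  # mean of per-group means: (2 + 10) / 2 = 6
mean(x)                   # pooled mean: 16 / 4 = 4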
It turns out that these two things are the same:
## these are the same:
resampd %>%
  collect_predictions(summarize = FALSE) %>%
  group_by(id) %>%
  metrics(mass, .pred)
#> # A tibble: 150 × 4
#> id .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 Bootstrap01 rmse standard 16.4
#> 2 Bootstrap02 rmse standard 23.1
#> 3 Bootstrap03 rmse standard 31.6
#> 4 Bootstrap04 rmse standard 17.6
#> 5 Bootstrap05 rmse standard 9.59
#> 6 Bootstrap06 rmse standard 25.0
#> 7 Bootstrap07 rmse standard 16.3
#> 8 Bootstrap08 rmse standard 35.1
#> 9 Bootstrap09 rmse standard 25.7
#> 10 Bootstrap10 rmse standard 25.3
#> # … with 140 more rows
resampd %>% collect_metrics(summarize = FALSE)
#> # A tibble: 100 × 5
#> id .metric .estimator .estimate .config
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 Bootstrap01 rmse standard 16.4 Preprocessor1_Model1
#> 2 Bootstrap01 rsq standard 0.799 Preprocessor1_Model1
#> 3 Bootstrap02 rmse standard 23.1 Preprocessor1_Model1
#> 4 Bootstrap02 rsq standard 0.193 Preprocessor1_Model1
#> 5 Bootstrap03 rmse standard 31.6 Preprocessor1_Model1
#> 6 Bootstrap03 rsq standard 0.608 Preprocessor1_Model1
#> 7 Bootstrap04 rmse standard 17.6 Preprocessor1_Model1
#> 8 Bootstrap04 rsq standard 0.836 Preprocessor1_Model1
#> 9 Bootstrap05 rmse standard 9.59 Preprocessor1_Model1
#> 10 Bootstrap05 rsq standard 0.860 Preprocessor1_Model1
#> # … with 90 more rows
Created on 2022-08-23 with reprex v2.0.2

Web scraping with rvest not working from HTML page, table showing NAs - McDonald's

I am trying to scrape data from https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html to build a data frame with all the nutrition values and the allergens from the drop-down menu (Further information, per 100g, per portion, contained allergens), but rvest cannot detect the information as a table.
It doesn't even return any of the required values.
library(rvest)

url4 <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
test <- url4 %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="collapseOne"]/div/div/div/div[1]') %>%
  html_table()
test <- as.data.frame(test)
I also tried this:
library(rvest)
library(stringr)
library(tidyr)
url <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
How could this be done? I'm very new to web scraping and don't know whether the HTML tags or the link are correct.
(A screenshot of the data I want was attached here.)
You can request the information from their JSON API:
library(tidyverse)
library(httr2)

"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  .$item %>%
  .$nutrient_facts %>%
  .$nutrient %>%
  as_tibble() %>%
  select(4:9)
# A tibble: 10 x 6
id name nutrient_~1 uom uom_d~2 value
<int> <chr> <chr> <chr> <chr> <chr>
1 1 Serving Size primary_se~ g grams 302
2 2 Brennwert energy_kJ kJ kiloJo~ 2992
3 3 Brennwert energy_kcal kcal kilo c~ 716
4 4 Fett fat g grams 40
5 5 davon gesättigte Fettsäuren saturated_~ g grams 16
6 6 Kohlenhydrate carbohydra~ g grams 44
7 7 davon Zucker sugar g grams 11
8 8 Ballaststoffe fiber g grams 3.3
9 9 Eiweiß protein g grams 40
10 10 Salz salt g grams 2.4
# ... with abbreviated variable names 1: nutrient_name_id,
# 2: uom_description
Information on the allergens:
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$item_allergen %>%
str_split(pattern = ", ") %>%
getElement(1)
[1] "Milch (einschl. Laktose)"
[2] "Eier"
[3] "Glutenhaltiges Getreide: Weizen (wie Dinkel und Khorasan-Weizen)"
[4] "Senf"
[5] "Sesamsamen"

tidy() function can't process last_fit() objects

Functions like last_fit() from the tune package produce last_fit objects, which are large nested lists containing the fit results. I tried to transform them into data frames using the tidy() function from the broom package, but this resulted in the following error.
MRE:
library(tidymodels)
library(tidyverse)

data <- mtcars
# `split` is needed by last_fit() below; its definition was missing from the original MRE
split <- initial_split(data)

model_default <-
  parsnip::boost_tree(
    mode = "regression"
  ) %>%
  set_engine('xgboost', objective = 'reg:squarederror')

wf <- workflow() %>%
  add_model(model_default) %>%
  add_recipe(recipe(mpg ~ ., data))

lf <- last_fit(wf, split)
tidy_lf <- tidy(lf)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
6: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
7: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
Question: How can I use tidy() with a last_fit() output?
The object that last_fit() creates is a tibble (containing metrics, predictions, etc.), not a model that can be tidied. You can use extract_workflow() to extract the fitted workflow from the object created by last_fit(), and this object can be tidied:
library(tidymodels)

car_split <- initial_split(mtcars)

wf <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(recipe(mpg ~ ., mtcars))

lf <- last_fit(wf, car_split)
lf
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [24/8]> train/test split <tibble> <tibble> <tibble [8 × 4]> <workflow>
lf %>%
  extract_workflow() %>%
  tidy()
#> # A tibble: 11 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -33.6 36.0 -0.935 0.367
#> 2 cyl -0.0296 1.34 -0.0221 0.983
#> 3 disp 0.0252 0.0269 0.934 0.367
#> 4 hp -0.00539 0.0319 -0.169 0.868
#> 5 drat -0.167 2.54 -0.0659 0.948
#> 6 wt -5.69 2.79 -2.04 0.0623
#> 7 qsec 3.32 1.76 1.89 0.0820
#> 8 vs -4.40 3.80 -1.16 0.268
#> 9 am 2.54 2.67 0.950 0.360
#> 10 gear 2.69 2.28 1.18 0.259
#> 11 carb -0.0486 1.11 -0.0439 0.966
Created on 2022-03-23 by the reprex package (v2.0.1)
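If you only need the underlying parsnip model rather than the whole fitted workflow, extract_fit_parsnip() can be used the same way; in this example it tidies to the same coefficients (a sketch relying on tune's extract helpers):
lf %>%
  extract_fit_parsnip() %>%  # just the fitted parsnip model object
  tidy()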