Scrape page content after option tag is selected - html

I'd like to scrape the content of a page once the province (and the commune) are selected.
The following code correctly outputs the provinces and their values.
library(rvest)
page <- read_html(x = "https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati/")
text <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_text()
values <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_attr("value")
Res <- data.frame(text = text, values = values, stringsAsFactors = FALSE)
Res
Now, I'd like to access the page for each value, e.g. this might be helpful for getting access to value=19.
text <- page %>% html_nodes(xpath = "//*/option[@value = '19']") %>% html_text()
text
The source code is the following
<div class="row results_form_search">
<form role="search" method="POST" class="search-form" action="/progetto-torelli/progetto-torelli-risultati/" id="search_location">
<input type="hidden" name="comune_from" value="" />
<div class="form-row">
<input type="text" name="cognome" placeholder="Cognome" autocomplete="off" value="">
<select name="provincia">
<option value="0" selected>Seleziona Provincia</option>
<option value="74"
>-
</option>
<option value="75"
>AGRIGENTO
</option>
<option value="19"
>ALESSANDRIA
This is where the content that I want to scrape might be.
<div class="row">
<ul class="listing_search">
</ul>
</div>
Thank you so much for your advice!

RSelenium may end up being the way to go. However, if you can insert some judicious waits, or chunk your requests so the server isn't swamped, you can use rvest and make the same requests the page does.
You first need to generate all the combinations of province and comune (filtering out unwanted values). This can be done by making XMLHttpRequests, using the value attribute of the options within the province select, to retrieve the comune dropdown options and their associated values.
You then make further requests, one for each combination pair, to get the page content you would see when making those selections manually and pressing CERCA.
Pauses are needed because there are 10,389 valid combinations by my reckoning, and if you attempt to make all those requests one after the other, on top of the initial requests, the server will cut off the connection.
Another option is to split combined into smaller dataframes, make the requests for each chunk at timed intervals, and then combine the results.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
nodes <- read_html(link) %>%
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
df <- data.frame(
Provincia = nodes %>% html_text(trim = T),
id0 = nodes %>% html_attr("value")
)
return(df)
}
get_comunes <- function(id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
df <- data.frame(
id0 = id, # id1
Comune = nodes %>% html_text(trim = T),
id3 = nodes %>% html_attr("value")
)
return(df)
}
get_page <- function(prov_id, com_id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
prov_id,
com_id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
page <- read_html(link)
# print(page %>% html_node(".listing_name") %>% html_text(trim = T))
# print(tibble(id3 = com_id, page = page))
return(tibble(id3 = com_id, page = page))
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
combined <- dplyr::right_join(provincias, comunes, by = "id0")
# length(combined$Comune) -> 10389
results <- map2_dfr(combined$id0, combined$id3, .f = get_page)
final <- dplyr::inner_join(combined, results, by = "id3")
Below is a longer version, with the additional info you requested, where I played around with adding pauses. I still found that I could run everything up to, and including,
combined <- dplyr::right_join(provincias, comunes, by = "id0")
in one go. But after that I needed to chunk the requests into batches of about 2,000 with 20-30 minutes in between. You can try tweaking the timings below. I ended up using the commented-out section to run each batch and then left a pause of 30 minutes in between.
Some things to consider:
It seems that you can have comune values like ... which still return listings. With that in mind, you may wish to remove the :not parts of this selector:
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
as I assumed those were filtering out invalid results.
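For example, one way to keep those entries while still dropping the "Seleziona Provincia" placeholder might be this minimal variant of the selector above (my suggestion, not tested against every edge case):
html_nodes('[name="provincia"] > option:not([selected])')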
Next, you might consider writing a helper function around httr::RETRY, to make the requests with backoff/retry rather than relying on fixed pauses.
The core call might look like this:
httr::RETRY(
  "GET",
  <request url>,
  times = 3,
  pause_min = 20 * 60,
  pause_base = 20 * 60
)
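Wrapped up as a helper function, that might look like the sketch below. The function name, and the idea of re-parsing the response body with read_html, are my additions rather than part of the original answer:
get_with_retry <- function(url) {
  resp <- httr::RETRY(
    "GET",
    url,
    times = 3,
    pause_min = 20 * 60,
    pause_base = 20 * 60
  )
  # parse the response body so callers can keep using rvest selectors
  read_html(httr::content(resp, as = "text"))
}
# e.g. inside get_data(): page <- get_with_retry(link)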
Anyway, those are some ideas. Even without the server cutting the connection, when using waits, I still found it started to throttle requests, meaning some requests took quite a long time to complete. Optimizing this could take a lot of time and effort; I spent a good few days playing around with chunk sizes and waits.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
nodes <- read_html(link) %>%
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
df <- data.frame(
Provincia = nodes %>% html_text(trim = T),
id0 = nodes %>% html_attr("value")
)
return(df)
}
get_comunes <- function(id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
df <- data.frame(
id0 = id, # id1
Comune = nodes %>% html_text(trim = T),
id3 = nodes %>% html_attr("value")
)
return(df)
}
get_data <- function(prov_id, com_id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
prov_id,
com_id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
page <- read_html(link)
df <- data.frame(
cognome = page %>% html_nodes(".listing_name") %>% html_text(trim = T),
livello = page %>% html_nodes(".listing_level") %>% html_text(trim = T),
id3 = com_id,# for later join back on comune
id0 = prov_id
)
Sys.sleep(.25) # pause for 0.25 sec between requests
return(df)
}
get_chunks <- function(df, chunk_size) { # adapted from @BenBolker https://stackoverflow.com/a/7060331
n <- nrow(df)
r <- rep(1:ceiling(n / chunk_size), each = chunk_size)[1:n]
d <- split(df, r)
return(d)
}
write_rows <- function(df, filename) {
flag <- file.exists(filename)
df2 <- purrr::map2_dfr(df$id0, df$id3, .f = get_data)
write.table(df2,
file = filename, sep = ",",
append = flag,
quote = F, col.names = !flag,
row.names = F
)
Sys.sleep(60*10)
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
Sys.sleep(60*5)
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
Sys.sleep(60*10)
combined <- dplyr::right_join(provincias, comunes, by = "id0")
Sys.sleep(60*10)
chunked <- get_chunks(combined, 2000) # https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows
filename <- "prov_com_cog_liv.csv"
map(chunked, ~ write_rows(.x, filename))
## #### test case #####################
# df <- chunked[[6]]
#
# flag <- file.exists(filename)
#
# df2 <- map2_dfr(df$id0, df$id3, .f = get_data)
#
# write.table(df2,
# file = filename, sep = ",",
# append = flag,
# quote = F, col.names = !flag,
# row.names = F
# )
####################################
results <- read.csv(filename)
final <- dplyr::right_join(combined, results, by = "id3")

Related

R: Inferring a Common Merge Key

I am trying to scrape a site to get addresses for a set of names (part A) along with the longitudes and latitudes (part B). I don't know how to do this all in one go, so I did it in two parts:
# part A
library(tidyverse)
library(rvest)
library(httr)
library(XML)
# Define function to scrape 1 page
get_info <- function(page_n) {
cat("Scraping page ", page_n, "\n")
page <- paste0("https://www.mywebsite/",
page_n, "?extension") %>% read_html
tibble(title = page %>%
html_elements(".title a") %>%
html_text2(),
adress = page %>%
html_elements(".marker") %>%
html_text2(),
page = page_n)
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Here is part B:
# Recognize pattern in websites
part1 = "https://www.mywebsite/"
part2 = c(0:55)
part3 = "extension"
temp = data.frame(part1, part2, part3)
# Create list of websites
temp$all_websites = paste0(temp$part1, temp$part2, temp$part3)
# Scrape
df_2 <- list()
for (i in 1:10)
{tryCatch({
url_i <-temp$all_websites[i]
page_i <-read_html(url_i)
b_i = page_i %>% html_nodes("head")
listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
df_2[[i]] <- listanswer_i
print(listanswer_i)
}, error = function(e){})
}
# Extract long/lat from results
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 = data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
df_2 = df_2 %>% filter(X1 != "LatLngBounds();")
> dim(df_2)
[1] 86 3
We can see that df_1 and df_2 have a different number of rows, and there is no common merge key between them. How can I rewrite my code to create a merge key between df_1 and df_2, so that I can merge the common records from these two dataframes?
I am not sure multiple requests to the same URIs are needed. Some lat/long values are not listed either on the results pages or on the result-specific linked webpage, e.g. Toronto Beaches Dentist from the current page 2 results has no lat/long shown on either page 2 or its own page. In those cases, you may choose to fill the blanks using another service which returns lat/long based on an address.
You can rewrite your function and alter your regex patterns to produce two dataframes which can be joined, and return the resulting dataframe. With the regex changes given below, you can use the address column to join the two dataframes. I dislike using an address as a key, but it does appear to be internally consistent across the result pages. I have used a left join to return all rows from the dentist listings, i.e. the practice business names.
library(tidyverse)
library(rvest)
urls <- sprintf("https://www.dentistsearch.ca/search-doctor/%i?category=0&services=0&province=55&city=&k=", 1:10)
pages <- lapply(urls, read_html)
get_dentist_info <- function(page) {
page_text <- page %>% html_text()
address_keys <- page_text %>%
str_match_all('marker_\\d+\\.set\\("content", "(.*?)"\\);') %>%
.[[1]] %>%
.[, 2]
lat_long <- page_text %>%
str_match_all("LatLng\\((.*)\\);(?![\\s\\S]+myOptions)") %>%
.[[1]] %>%
.[, 2]
lat_lon <- tibble(address = address_keys, lat_long = lat_long) %>%
separate(lat_long, into = c("lat", "long"), sep = ", ") %>%
mutate(lat = as.numeric(lat), long = as.numeric(long))
practice_info <- tibble(
title = page %>% html_elements(".title > a") %>% html_text(trim = T),
address = page %>% html_elements(".marker") %>% html_text()
)
dentist_info <- left_join(practice_info, lat_lon, by = "address")
return(dentist_info)
}
all_dentist_info <- map_dfr(pages, get_dentist_info)
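If you do want to fill in the missing lat/long values mentioned above, one option is a geocoding service. Below is a rough sketch with tidygeocoder; the package choice is my suggestion, not part of the original answer, the lat/long column names are those produced by get_dentist_info, and you should check the geocoding service's usage limits before running it over many addresses:
library(tidygeocoder)
missing_coords <- all_dentist_info %>%
  filter(is.na(lat)) %>%
  # geocode only the rows without coordinates, writing to temporary columns
  geocode(address = address, method = "osm", lat = "lat_geo", long = "long_geo") %>%
  mutate(lat = lat_geo, long = long_geo) %>%
  select(-lat_geo, -long_geo)
all_dentist_info <- all_dentist_info %>%
  filter(!is.na(lat)) %>%
  bind_rows(missing_coords)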

Scraping several webpages from a website (newspaper archive) using RSelenium

I managed to scrape one page from a newspaper archive according to explanations here.
Now I am trying to automate the process so that a list of pages can be accessed by running a single script.
Making a list of URLs was easy as the newspaper's archive has a similar pattern of links:
https://en.trend.az/archive/2021-XX-XX
The problem is writing a loop to scrape data such as title, date, time, and category. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()
## Generating subpage list
for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d")) {
list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)
# Launching browser
driver <- rsDriver(browser = c("firefox")) #Version 93.0 (64-bit)
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(list_of_url[i])
remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
page <- read_html(remDr$getPageSource()[[1]])
# Scraping article headlines
get_headline <- page %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
get_time <- str_sub(get_time, start= -5)
length(get_time)
}
}
In total, the length should have been 157+166+140 = 463. In fact, I did not manage to collect all the data even from one page (length(get_time) = 126).
I assumed that, after the first set of commands in the loop, I had obtained three remDr objects for the three dates specified, but they were not recognised independently later on.
Because of that, I tried to start a second loop inside the initial one, before or after the page <- line, using either
for (remDr0 in remDr) {
page <- read_html(remDr0$getPageSource()[[1]])
# substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
# substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists, as this is my first time using R for such purposes.
I hope it will be possible to correct the existing loop, or perhaps you can suggest a shorter approach, for example by writing a function.
Slight broadening for scraping multiple categories
library(RSelenium)
library(dplyr)
library(rvest)
Specify the date period
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt = seq(d1, d2, by = "days") # contains the date sequence
#launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
`get_headline`: function for the newspaper headlines
get_headline = function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
headlines = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
headlines
return(headlines)
}
`get_time`: function for the time of publishing
get_time <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selector of time on the website
time <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-date') %>%
html_text() %>%
str_sub(start= -5)
time
return(time)
}
`get_number`: numbering of all articles from one page/day
get_number <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of headlines on the website
headline <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
number <- seq(1:length(headline))
return(number)
}
`get_data_table`: collects the output of all functions into a tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
headline <- get_headline(x)
time <- get_time(x)
headline_number <- get_number(x)
# Combine into a tibble
combined_data <- tibble(Num = headline_number,
Article = headline,
Time = time)
}
Finally, use lapply to loop through all the dates in dt:
df = lapply(dt, get_data_table)
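If a single tibble is preferred over a list with one tibble per day, the list can be bound together afterwards (a small addition on my part; bind_rows comes from dplyr, which is already loaded):
df_all <- dplyr::bind_rows(df)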

Web scraping in R with Selenium to click new pages

I am trying to go through the different pages of this dynamic site (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to open each project; there are 12 projects per page.
Once each of these projects has been visited and the desired information obtained (if I manage to get it), the script should continue to the next page. That is, after the 12 projects on page 1, it should collect the 12 projects on page 2, and so on.
How can this be done? Any help is much appreciated. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the dynamic page
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)
The page makes a POST request that you can mimic/simplify. To keep it dynamic, you first need to grab an API key and application id from a source js file, then pass those in the subsequent POST requests.
In the following I simply extract the urls from each request. I set the querystring for the POST to request the maximum of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function over the page numbers, altering the page param and extracting the urls from each POST response.
You end up with a list of urls for all the projects, which you can then visit to extract info from; or, potentially, make xmlhttp requests to.
N.B. The code could be refactored a little as a tidy-up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- 1:(num_pages - 1) # remaining pages; page 0 was already requested above
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
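As a sketch of the follow-up step, each url in df could then be visited with rvest, reusing selectors from the question (shown here for the campaign title only; the other fields would follow the same pattern, and rvest is assumed to be loaded):
get_project_title <- function(url) {
  read_html(url) %>%
    html_node(".a-campaign-title") %>%
    html_text(trim = TRUE)
}
# titles <- purrr::map_chr(df$url, get_project_title) # add pauses/retries before running this at scale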
@David Perea, see this page for a comparison of scraping methods, including Selenium. The method proposed by QHarr is very good, but it doesn't use Selenium and it also requires good knowledge of HTTP.

Error in eval_tidy(xs[[j]], mask): object 'titles' not found

I want to scrape data from multiple pages, but when I run the script below I get an error that says:
Error in eval_tidy(xs[[j]], mask): object 'titles' not found
library(tidyverse)
library(rvest)
# function to scrape all elements also missing elements
scrape_css <- function(css, group, html_page) {
txt <- html_page %>%
html_nodes(group) %>%
lapply(
. %>%
html_nodes(css) %>%
html_text() %>%
ifelse(identical(., character(0)), NA, .)
) %>%
unlist()
return(txt)
}
# Get all elements from 1 page
get_one_page <- function(url) {
html <- read_html(url)
titles <- scrape_css(
".recipe-card_title__1oIb-",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
minutes <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(1)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
callories <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(2)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
}
return(tibble(titles = titles, minutes = minutes, callories = callories))
url <- ("https://www.ah.nl/allerhande/recepten/R-L1473207825981/suikerbewust")
appie <- get_one_page(url)
Two things:
return should be inside the curly braces at the end of the function
return(tibble(titles = titles, minutes = minutes, callories = callories))
}
You forgot to update the variable name such that you have
html <- read_html(url)
when you need
html_page <- read_html(url)
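Putting both fixes together, the corrected function might look like this (selectors copied unchanged from the question):
get_one_page <- function(url) {
  html_page <- read_html(url)
  titles <- scrape_css(
    ".recipe-card_title__1oIb-",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  minutes <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(1)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  callories <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(2)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  return(tibble(titles = titles, minutes = minutes, callories = callories))
}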
This does not per se answer @Laila's question, but I ran into the
Error in eval_tidy(xs[[j]], mask) : object '' not found
error when I did something similar to
tibble(
a = 1,
b = ,
c = NA
)
So basically, I had forgotten to enter a value for a tibble column. It is a rather uninformative error message, though, so I am leaving this here for my future self (and others) as a reference, since there are not many posts related to this error.

Scraping table with multiple headers in R using any package? (XML, rCurl, rlist htmltab, rvest etc)

I am attempting to scrape this table
http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1
Here are all my attempts. None of them get even close to extracting any information. Am I missing something?
library("rvest")
library("tidyverse")
# METHOD 1
url <- "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1"
data <- url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="t1"]/tbody/tr[1]') %>%
html_table()
data <- data[[1]]
# METHOD 2
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# METHOD 3
library(htmltab)
tab <- htmltab("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",
which = '//*[@id="t1"]/tbody/tr[4]',
header = '//*[@id="t1"]/tbody/tr[3]',
rm_nodata_cols = TRUE)
# METHOD 4
website <-read_html("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
scraped <- website %>%
html_nodes("table") %>%
.[(2)] %>%
html_table(fill = TRUE) %>%
`[[`(1)
# METHOD 5
getHrefs <- function(node, encoding)
if (!is.null(xmlChildren(node)$a)) {
paste(xpathSApply(node, './a', xmlGetAttr, "href"), collapse = ",")
} else {
return(xmlValue(xmlChildren(node)$text))
}
data <- ( readHTMLTable("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1", which = 1, elFun = getHrefs) )
The expected result should be the 12 column names in the table and the data below them.
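One possible reason none of these attempts find anything is that the table on that page appears to be built by JavaScript after the page loads, so the static HTML returned by read_html()/getURL() contains no table rows to parse. Below is a minimal sketch using RSelenium (as in the answers above) to render the page before handing it to rvest, assuming the rendered table keeps the id "t1" used in the XPath attempts:
library(RSelenium)
library(rvest)
# start a browser session and let the page's JavaScript build the table
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
Sys.sleep(5) # crude wait for the table to render
daily <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_node("#t1") %>% # id taken from the XPath attempts above
  html_table(fill = TRUE)
remDr$close()
driver$server$stop()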