Scraping web table using R and rvest - html

I'm new to web scraping with R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case there is just one record in the table, but there could be more (I'm actually interested in the first column, though the whole table is fine).
I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, probably because of my limited knowledge of how web pages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr) # needed for extract() used in the pipe below
pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text")
json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")
table_contents <- JSON %>%
{gsub("\\\\n", "\n", .)} %>%
{gsub("\\\\/", "/", .)} %>%
{gsub("\\\\\"", "\"", .)} %>%
strsplit("html\":\"") %>%
unlist %>%
extract(2) %>%
substr(1, nchar(.) -2) %>%
paste0("</tbody>")
new_page <- gsub("</tbody>", table_contents, resp)
read_html(new_page) %>%
html_nodes("table") %>%
html_table()

The data comes from another endpoint, which you can see in the Network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the JSON you need from the response.
library(httr)
library(jsonlite)
library(rvest) # for read_html(), html_node(), html_text()
params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>% html_text())
print(data[[1]]$e)
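As a small follow-up, here is a hedged sketch that collects the value from every record rather than only the first, assuming (as the print call above suggests) that each element of data holds the first-column value in $e as a single string; the field name comes from the answer and is not verified against any API documentation.
# Collect the assumed first-column value ($e) from every record returned by the search
first_column <- vapply(data, function(x) x$e, character(1))
print(first_column)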

Related

Web scraping table R

I'm trying to get the data from the rating column on this site https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/, but I'm having problems with "{xml_nodeset (0)}".
My attempt:
library("rvest")
`%>%` <- magrittr::`%>%`
page <- read_html("https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/")
table <- page %>%
html_nodes("table")
df <- table[2] %>%
html_table()
The ratings column shown on that page is the data I want.
By inspecting the page and looking at the "Network" tab, you can see the call it makes to create the table.
The response is JSON, which is easily parsed into an R list.
Much of the querystring is probably unnecessary for your purpose, so you can shorten it.
If you want more than 25 rows, increase length=25 or remove it.
page <- httr::GET(
paste0("https://www.ratingraph.com/show-episodes-list/17673/?draw=1&columns%5B0%5D%5Bdata%5D=trend&",
"columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=false&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=season&",
"columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=false&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=episode&",
"columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=false&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=name&",
"columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=false&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=start&",
"columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=false&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=total_votes&",
"columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=false&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=average_rating&",
"columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=false&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=1&",
"order%5B0%5D%5Bdir%5D=asc&order%5B1%5D%5Bcolumn%5D=2&order%5B1%5D%5Bdir%5D=asc&start=0&length=25&search%5Bvalue%5D=&search%5Bregex%5D=false&_=", Sys.time() %>% as.numeric() %>% paste0("000")))
table <- page %>% httr::content(as = 'parsed')
avg_ratings <- sapply(table$data, `[[`, 'average_rating') %>% as.numeric()
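If you want the whole table rather than just the ratings, a minimal sketch is to bind all returned records into a data frame, assuming each element of table$data is a flat named list (consistent with the average_rating extraction above):
library(dplyr)
episodes_df <- dplyr::bind_rows(table$data)
head(episodes_df)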

Web scraping in R with Selenium to click new pages

I am trying to work through the different pages of this dynamic website (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to enter each project; there are 12 projects per page.
Once the script has entered each of these projects and obtained the desired information (that is, if I manage to get it), I want it to continue to the next page. That is, once it has obtained the 12 projects on page 1, it must obtain the 12 projects on page 2, and so on.
How can this be done? Any help is much appreciated. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)
The page does a POST request that you can mimic/simplify. To keep it dynamic, you first need to grab an API key and an application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to ask for the maximum of 20 results per page. After an initial request, in which I retrieve the number of pages, I map a function across the page numbers, extracting the urls from each POST response while altering the page param.
You end up with a list of urls for all the projects, which you can then visit to extract info from (a short follow-up sketch appears after the code), or potentially make xmlhttp requests to.
N.B. The code could be refactored a little as a tidy-up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- seq_len(num_pages - 1) # page 0 was already fetched above
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
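As a follow-up, here is a hedged sketch of the "visit each url" step, pulling just the campaign title with the .a-campaign-title selector taken from the question's own code; that selector is an assumption and may break if GoFundMe changes its markup.
library(rvest)
library(purrr)
titles <- purrr::map_chr(df$url, function(u) {
  page <- tryCatch(read_html(u), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)  # skip urls that fail to load
  page %>% html_node(".a-campaign-title") %>% html_text()
})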
@David Perea, see this page for a comparison of scraping methods, including Selenium. The method proposed by QHarr is very good, but it doesn't use Selenium and also requires good knowledge of HTTP.

Web-Scraping using R. I want to extract some table-like data from a website

I'm having some problems scraping data from a website, and I don't have a lot of experience with web scraping. My plan is to scrape some data using R from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands
More precisely, I want to extract the brands on the right-hand side.
My idea so far:
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% html_nodes(xpath='/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>% html_text()
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
That data is dynamically pulled from a script tag. You can pull the content of that script tag and parse it as JSON, subset just the items of interest from the returned list, and then extract the brand names:
library(rvest)
library(jsonlite)
library(stringr)
library(purrr) # for map()
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json()
data <- data$props$pageProps$apolloState
mask <- map(names(data), str_detect, '^Brand:') %>% unlist()
data <- subset(data, mask)
brands <- lapply(data, function(x){x$name})
I find the above easier to read, but you could try other methods such as:
library(rvest)
library(jsonlite)
library(stringr)
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json() %>%
{.$props$pageProps$apolloState} %>%
subset(., {str_detect(names(.), 'Brand:')}) %>%
lapply(. , function(x){x$name})
Using {} so that the call is treated as an expression rather than a function is something I read in a comment by @asachet.
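Either way, the result is a named list; as an optional convenience step (nothing here depends on the site), you can flatten it into a plain character vector:
brand_names <- unlist(brands, use.names = FALSE)
head(brand_names)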

More efficient way to scrape ratings values from TripAdvisor using R

I am trying to scrape the ratings from TripAdvisor. So far, I have managed to extract the HTML nodes, turn them into character strings, pull out the substring that represents the numeric value I need, convert it to a number, and finally divide it by 10 to get the value it actually represents.
library(rvest)
url <- "https://www.tripadvisor.co.uk/Attraction_Review-g1466790-d547811-Reviews-Royal_Botanic_Gardens_Kew-Kew_Richmond_upon_Thames_Greater_London_England.html"
ratings_too_big <- url %>%
read_html() %>%
html_nodes("#REVIEWS .ui_bubble_rating") %>%
as.character() %>%
substr(38,39) %>%
as.numeric()
ratings_too_big/10
This is without doubt very messy - what's a cleaner, more efficient way to do this? I have also tried Hadley Wickham's example shown here:
library(rvest)
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
read_html() %>%
html_nodes("#REVIEWS .innerBubble")
rating <- reviews %>%
html_node(".rating .rating_s_fill") %>%
html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>%
as.integer()
This was not successful, as no data was returned (there appears to be nothing in the HTML node ".rating .rating_s_fill"). I am new to scraping and CSS selectors, so apologies if the answer is obvious.
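One possible tidier sketch, assuming the rating is encoded in the bubble class name (e.g. "bubble_45"), which is what the substr(38, 39) call above was effectively picking out of the serialized node; the page structure may have changed, so treat this as untested:
library(rvest)
library(stringr)
ratings <- url %>%
  read_html() %>%
  html_nodes("#REVIEWS .ui_bubble_rating") %>%
  html_attr("class") %>%      # e.g. "ui_bubble_rating bubble_45"
  str_extract("\\d+") %>%     # keep only the digits
  as.numeric()
ratings / 10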

Scraping HTML webpage using R

I am scraping JFK's website to get flight schedules. The link to the flight schedules is here:
http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures
To begin with, I am inspecting one of the fields of any given flight and noting down its XPath. The idea is to see the output and then develop the code from there. This is what I have so far:
library(rvest)
Departure_url <- read_html('http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures')
Departures <- Departure_url %>% html_nodes(xpath = '//*[@id="ffAlLbl"]') %>% html_text()
I am getting an empty character object as output for the 'Departures' object in the code above.
I am not sure why this happens. I am looking for a node through which the entire schedule can be downloaded.
Any help is appreciated!
Scraping that table is kind of tricky.
First of all, what you are trying to scrape is live content, so you need a browser-automation tool such as RSelenium.
Second, the content is actually inside an iframe that is inside another iframe, so you need to switch frames twice.
Finally, the content is not a table, so you need to get all the column vectors and combine them into a table.
The following code should do the job:
library(RSelenium)
library(rvest)
library(stringr)
library(glue)
library(tidyverse)
#Rselenium
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
myclient$navigate("http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures")
#Switch frames twice
webElems <- myclient$findElement(using = "css",value = "[name=webfidsBox]")
myclient$switchToFrame(webElems)
webElems <- myclient$findElement(using = "css",value = "#coif02")
myclient$switchToFrame(webElems)
#get page souce of the content
myPagesource <- read_html(myclient$getPageSource()[[1]])
selected_node <- myPagesource %>% html_node("#fvData")
#get content as vectors in list and merge into table
result_list <- map(1:7,~ myPagesource %>% html_nodes(str_c(".c",.x)) %>% html_text())
result_list2 <- map(c(5,6),~myPagesource %>% html_nodes(glue::glue("tr>td:nth-child({i})",i=.x)) %>% html_text())
result_list[[5]] <- c(result_list[[5]],result_list2[[1]])
result_list[[6]] <- c(result_list[[6]],result_list2[[2]])
result_df <- do.call("cbind", result_list)
colnames(result_df) <- result_df[1,]
result_df <- as.tibble(result_df[-1,])
You can do some data cleaning afterward.
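As a minimal, generic example of that clean-up (only whitespace trimming; column-specific parsing depends on the site's current layout), assuming the tidyverse and stringr are already loaded as above:
result_df <- result_df %>%
  mutate(across(everything(), str_squish))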