Scraping HTML webpage using R

I am scraping JFK's website to get flight schedules. The link to the flight schedules is here:
http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures
To begin with, I am inspecting one of the fields of a given flight and noting down its XPath. The idea is to see the output and then develop the code from there. This is what I have so far:
library(rvest)
Departure_url <- read_html('http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures')
Departures <- Departure_url %>% html_nodes(xpath = '//*[@id="ffAlLbl"]') %>% html_text()
The code above returns an empty character vector for the 'Departures' object.
I am not sure why this happens. I am looking for a node through which the entire schedule can be downloaded.
Any help is appreciated !!

Scraping that table is kind of tricky.
First of all, what you are trying to scrape is live (dynamically loaded) content, so you need to drive a real browser with a tool such as RSelenium.
Second, the content is actually inside an iframe that is inside another iframe, so you need to call switchToFrame twice.
Finally, the content is not an HTML table, so you need to extract each column as a vector and combine the vectors into a table.
The following code should do the job:
library(RSelenium)
library(rvest)
library(stringr)
library(glue)
library(tidyverse)
# Start a Selenium server and open a browser session with RSelenium
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
myclient$navigate("http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures")
# Switch frames twice: the outer iframe first, then the inner one
webElems <- myclient$findElement(using = "css",value = "[name=webfidsBox]")
myclient$switchToFrame(webElems)
webElems <- myclient$findElement(using = "css",value = "#coif02")
myclient$switchToFrame(webElems)
# Get the page source of the inner frame
myPagesource <- read_html(myclient$getPageSource()[[1]])
selected_node <- myPagesource %>% html_node("#fvData")
# Get each column as a vector in a list, then merge the vectors into a table
result_list <- map(1:7,~ myPagesource %>% html_nodes(str_c(".c",.x)) %>% html_text())
result_list2 <- map(c(5,6),~myPagesource %>% html_nodes(glue::glue("tr>td:nth-child({i})",i=.x)) %>% html_text())
result_list[[5]] <- c(result_list[[5]],result_list2[[1]])
result_list[[6]] <- c(result_list[[6]],result_list2[[2]])
result_df <- do.call("cbind", result_list)
colnames(result_df) <- result_df[1,]
result_df <- as_tibble(result_df[-1,])
You can do some data cleaning afterward.
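The "collect one vector per column, then bind them" step can be tried offline without Selenium. The following is only a minimal sketch: the markup, the class names c1 to c3, and the flight values are made up for illustration and are not taken from the real page.

```r
library(rvest)

# Hypothetical markup: cells are tagged with classes c1..c3 instead of
# living in a real <table>; the first "row" holds the column headers.
html <- read_html('
  <div id="fvData">
    <div><span class="c1">Airline</span><span class="c2">Flight</span><span class="c3">Destination</span></div>
    <div><span class="c1">Delta</span><span class="c2">DL 100</span><span class="c3">Boston</span></div>
    <div><span class="c1">JetBlue</span><span class="c2">B6 200</span><span class="c3">Chicago</span></div>
  </div>')

# One character vector per column class
cols <- lapply(1:3, function(i) html_text(html_nodes(html, paste0(".c", i))))

# Bind the columns, then promote the first row to the header
m <- do.call(cbind, cols)
flights <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(flights) <- m[1, ]
print(flights)
```

This only works when every column yields the same number of cells, which is why the answer above patches columns 5 and 6 before calling cbind.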

Related

Web scraping table R

I'm trying to get the data from the rating column on this site https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/, but I'm having problems with "{xml_nodeset (0)}".
My attempt:
library("rvest")
`%>%` <- magrittr::`%>%`
page <- read_html("https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/")
table <- page %>%
html_nodes("table")
df <- table[2] %>%
html_table()
These are the data I want:
By inspecting the page and looking on the "Network" tab, you can see the call it makes to create the table.
The response is in JSON, which is easily parsed into an R list.
Much of the query string is probably unnecessary for your purpose, so you can shorten it.
If you want more than 25 rows, increase the value in length=25, or remove that parameter.
page <- httr::GET(
paste0("https://www.ratingraph.com/show-episodes-list/17673/?draw=1&columns%5B0%5D%5Bdata%5D=trend&",
"columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=false&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=season&",
"columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=false&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=episode&",
"columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=false&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=name&",
"columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=false&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=start&",
"columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=false&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=total_votes&",
"columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=false&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=average_rating&",
"columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=false&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=1&",
"order%5B0%5D%5Bdir%5D=asc&order%5B1%5D%5Bcolumn%5D=2&order%5B1%5D%5Bdir%5D=asc&start=0&length=25&search%5Bvalue%5D=&search%5Bregex%5D=false&_=", Sys.time() %>% as.numeric() %>% paste0("000")))
table <- page %>% httr::content(as = 'parsed')
avg_ratings <- sapply(table$data, `[[`, 'average_rating') %>% as.numeric()
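To see how the parsed response turns into a vector, here is a small offline sketch. The JSON below is a made-up payload whose shape (a data array of records with an average_rating field) is an assumption based on the code above, not a copy of the real response.

```r
library(jsonlite)

# Hypothetical response body; field names other than average_rating are
# assumptions for illustration.
json <- '{"draw":1,"recordsTotal":3,"data":[
  {"season":1,"episode":1,"average_rating":"8.1"},
  {"season":1,"episode":2,"average_rating":"7.9"},
  {"season":1,"episode":3,"average_rating":"8.4"}]}'

# parse_json() returns a nested list, similar to httr::content(as = "parsed")
parsed <- parse_json(json)
avg_ratings <- as.numeric(sapply(parsed$data, `[[`, "average_rating"))

# Alternatively, fromJSON() simplifies the records into a data frame
episodes <- fromJSON(json)$data
print(avg_ratings)
```

If you need several columns rather than just the ratings, the data frame from fromJSON() is probably the more convenient starting point.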

Web-Scraping using R. I want to extract some table like data from a website

I'm having some problems scraping data from a website, and I don't have much experience with web scraping. My plan is to use R to scrape some data from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands
More precisely, I want to extract the brands on the right-hand side.
My idea so far:
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% html_nodes(xpath='/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>% html_text()
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
That data is dynamically pulled from a script tag. You can pull the content of that script tag and parse it as JSON, subset the returned list to just the items of interest, and then extract the brand names:
library(rvest)
library(jsonlite)
library(stringr)
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json()
data <- data$props$pageProps$apolloState
mask <- str_detect(names(data), '^Brand:')  # str_detect is vectorised, so purrr::map is not needed
data <- subset(data, mask)
brands <- lapply(data, function(x){x$name})
I find the above easier to read, but you could try other methods, such as:
library(rvest)
library(jsonlite)
library(stringr)
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json() %>%
{.$props$pageProps$apolloState} %>%
subset(., {str_detect(names(.), 'Brand:')}) %>%
lapply(. , function(x){x$name})
Using {} so that the call is treated as an expression rather than a function is something I read in a comment by @asachet.
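If you want to test the script-tag approach without hitting the site, a minimal sketch on inline HTML looks like this. The page, keys, and brand names below are invented; only the #__NEXT_DATA__ id and the props$pageProps$apolloState path come from the answer above.

```r
library(rvest)
library(jsonlite)

# Hypothetical page: the application state sits in a JSON <script> tag
page <- read_html('<html><body>
  <script id="__NEXT_DATA__" type="application/json">
  {"props":{"pageProps":{"apolloState":{
    "Brand:1":{"name":"Acme"},
    "Supplier:9":{"name":"Example Supplier"},
    "Brand:2":{"name":"Globex"}}}}}
  </script></body></html>')

state <- page %>%
  html_node("#__NEXT_DATA__") %>%
  html_text() %>%
  parse_json()
state <- state$props$pageProps$apolloState

# Keep only the entries whose key starts with "Brand:"
brands <- state[startsWith(names(state), "Brand:")]
brand_names <- vapply(brands, function(x) x$name, character(1), USE.NAMES = FALSE)
print(brand_names)
```

vapply() is used here instead of lapply() so the result is a plain character vector and a missing name fails loudly.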

rvest how to get last page number in r language

I'm learning web scraping and want to create an example for myself.
https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93
I want to scrape the last page number, which is 100, from the URL above. I tried several different approaches, but none of them worked.
url %>%
read_html(x) %>%
html_nodes('div.leftContainer') %>%
html_nodes('a[href^="/search?page=100&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93"]') %>%
html_text()
I used html_nodes to get text '100' but it failed. I want to use length() and as.integer() to get the number.
I would like to know how to get the value of last page number.
You should be able to use nth-last-child to get the penultimate link whose href contains "page":
library(rvest)
url <- 'https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93'
last_page <- read_html(url) %>% html_node('[href*=page]:nth-last-child(2)') %>% html_text() %>% as.integer()
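The selector can be checked against a small mock-up of a pagination bar. The markup below is an assumption about the general layout (numbered links followed by a "next" link), not the site's actual HTML.

```r
library(rvest)

# Hypothetical pagination bar: the last numbered link is the
# second-to-last child, just before the "next" link.
page <- read_html('<div class="pagination">
  <span class="previous_page disabled">previous</span>
  <em class="current">1</em>
  <a href="/search?page=2">2</a>
  <a href="/search?page=3">3</a>
  <span class="gap">...</span>
  <a href="/search?page=100">100</a>
  <a class="next_page" href="/search?page=2">next</a>
</div>')

last_page <- page %>%
  html_node("[href*=page]:nth-last-child(2)") %>%
  html_text() %>%
  as.integer()
print(last_page)
```

The selector depends on the "next" link being the final child of the pagination container, so it would need adjusting if the layout differs.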
Below is another possible solution:
library(RSelenium)
remDr <- rsDriver(port=4555L,browser = "firefox")
remoteDriver<- remDr[["client"]]
url <- "https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93"
remoteDriver$navigate(url)
# Get the last page number
last_page<-remoteDriver$findElement(using = 'xpath', value = '/html/body/div[2]/div[3]/div[1]/div[2]/div[2]/div[3]/div/a[10]')$getElementText()
print(last_page)
[[1]]
[1] "100"

Scraping web table using R and rvest

I'm new to web scraping using R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case, it's just one record in the table but it could be more (I am actually interested in the first column but the whole table is ok).
I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, maybe because of my limited knowledge of how webpages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text")
json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")
table_contents <- JSON %>%
{gsub("\\\\n", "\n", .)} %>%
{gsub("\\\\/", "/", .)} %>%
{gsub("\\\\\"", "\"", .)} %>%
strsplit("html\":\"") %>%
unlist %>%
extract(2) %>%
substr(1, nchar(.) -2) %>%
paste0("</tbody>")
new_page <- gsub("</tbody>", table_contents, resp)
read_html(new_page) %>%
html_nodes("table") %>%
html_table()
The data comes from another endpoint, which you can see in the network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the JSON you need from the response.
library(httr)
library(jsonlite)
library(rvest)  # provides read_html(), html_node(), html_text() and %>%
params = list('k' = 'saperda tridentata','s' = 1,'m' = 1,'t' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>%html_text())
print(data[[1]]$e)
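To check what request the query list produces before sending it, you can build the URL without making a network call. This is just a sketch of how httr encodes the parameters; it says nothing about what the endpoint returns.

```r
library(httr)

# Same parameters as in the answer above
params <- list(k = "saperda tridentata", s = 1, m = 1, t = 0)

# modify_url() applies the same query encoding that GET(url, query = ...) uses
request_url <- modify_url("https://gd.eppo.int/ajax/search", query = params)
print(request_url)
```

Comparing this URL with the one shown in the browser's network tab is a quick way to confirm that the parameters match.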

Cannot find numbers of pages of website in web scraping

I want to get the number of pages from a website. I tried to do it as shown in a tutorial, using this function:
get_last_page <- function(html){
pages_data <- html %>%
# The '.' indicates the class
html_nodes('.pagination-page') %>%
# Extract the raw text as a list
html_text()
# The second to last of the buttons is the one
pages_data[(length(pages_data)-1)] %>%
# Take the raw string
unname() %>%
# Convert to number
as.numeric()
}
first_page <- read_html(url)
(latest_page_number <- get_last_page(first_page))
For the website
url <-'http://www.trustpilot.com/review/www.amazon.com'
it works fine. When I tried it with
url <-'https://energybase.ru/en/oil-gas-field/index'
I got integer(0).
I changed
html_nodes('.pagination-page')
to
html_nodes('.html_nodes('data-page')')
and it failed.
How can I change my code to make it work?
I think you have to go about this a little differently here.
The energybase.ru URL isn't organized quite the same way as the TrustPilot URL.
For our purposes here, we're interested in the fact that the last page has its own node .last. From there, you just have to extract the value of the data-page attribute and increment it by 1.
library("rvest")
library("magrittr")
url <- 'https://energybase.ru/en/oil-gas-field/index'
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attr("data-page") %>% as.numeric()+1
# [1] 21
Edit: note, you can always intercept the piping at html_children() (by adding a %>% html_attrs() to it) to find out what attributes are available at your disposal there.
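The same pipeline can be exercised on a small mock-up. The markup below is hypothetical; it assumes, as the answer implies, that the .last item wraps a link whose data-page attribute is one less than the page number.

```r
library(rvest)

# Hypothetical pagination list with a dedicated "last" item
page <- read_html('<ul class="pagination">
  <li class="active"><a href="/index?page=1" data-page="0">1</a></li>
  <li><a href="/index?page=2" data-page="1">2</a></li>
  <li class="last"><a href="/index?page=21" data-page="20">Last</a></li>
</ul>')

last_index <- page %>%
  html_nodes(".last") %>%
  html_children() %>%
  html_attr("data-page") %>%
  as.numeric()

# data-page is zero-based in this mock-up, hence the + 1
number_of_pages <- last_index + 1
print(number_of_pages)
```

Assigning the intermediate value first avoids relying on operator precedence between %>% and + at the end of the pipe.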
You could also select the node with the attribute rel=last and extract the page number from its href:
library("rvest")
library("magrittr")
library("stringr")  # needed for str_match_all()
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
number_of_pages <- str_match_all(pg %>% html_node("[rel=last]") %>% html_attr("href"),'page=(\\d+)')[[1]][,2] %>% as.numeric()
Or, there are a number of ways you could calculate it, given that there are more pages than the visible pagination shows. One way is to get the total count from the appropriate li in the drop-down and divide it by the number of results per page.
library(rvest)
library(magrittr)
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
total_sites <- strtoi(pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount'), base = 0L)
# or use: total_sites <- pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount') %>% as.numeric()
sites_per_page <- length(pg %>% html_nodes('.index-list-item'))
number_of_pages <- ceiling(total_sites/sites_per_page)
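The arithmetic is easy to verify on a toy page. In this sketch the total of 10 items and the four list items per page are invented numbers, and the selector is simplified to the only li in the mock-up.

```r
library(rvest)

# Hypothetical page: a total count in a data attribute and
# four results listed on the current page
page <- read_html('<div>
  <ul id="navbar-facilities"><li data-amount="10">Fields</li></ul>
  <div class="index-list-item">A</div>
  <div class="index-list-item">B</div>
  <div class="index-list-item">C</div>
  <div class="index-list-item">D</div>
</div>')

total_sites <- page %>%
  html_node("#navbar-facilities > li") %>%
  html_attr("data-amount") %>%
  as.numeric()
sites_per_page <- length(html_nodes(page, ".index-list-item"))

# 10 items at 4 per page need 3 pages; ceiling() accounts for the partial last page
number_of_pages <- ceiling(total_sites / sites_per_page)
print(number_of_pages)
```

This approach assumes the first page is full; if the whole result set fits on one page, the count per page equals the total and the result is 1.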