rvest cannot find node with xpath - html

This is the website I am scraping:
PPP projects
I want to use XPath to select a node, like below.
The XPath I get by using "inspect element" is //*[@id="pppListUl"]/li[1]/div[2]/span[2]/span
My script is as follows:
a <- read_html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span')
b
Then I got this result:
{xml_nodeset (0)}
Then I checked the page source, and I couldn't find anything about the project I selected.
I am wondering why I cannot find it in the page source, and in turn, how I can get the node with rvest.

The page makes an XHR request for the content. Just work with that data directly (it's pretty clean):
library(httr)
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
     encode = "form",
     body = list(queryPage = 1,
                 distStr = "",
                 induStr = "",
                 investStr = "",
                 projName = "",
                 sortby = "",
                 orderby = "",
                 stageArr = "")) -> res
content(res, as = "text") %>%
  jsonlite::fromJSON(flatten = TRUE) %>%
  dplyr::glimpse()
(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).
It's a 4-element list with fields totalCount, list (which has the actual data), currentPage, and totalPage.
It looks like you can change the queryPage form variable to iterate through the pages and get the whole list/database, something like:
library(httr)
library(purrr)
library(dplyr)
get_page <- function(page_num = 1, .pb = NULL) {
  if (!is.null(.pb)) .pb$tick()$print()
  POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
       encode = "form",
       body = list(queryPage = page_num,
                   distStr = "",
                   induStr = "",
                   investStr = "",
                   projName = "",
                   sortby = "",
                   orderby = "",
                   stageArr = "")) -> res
  content(res, as = "text") %>%
    jsonlite::fromJSON(flatten = TRUE) -> dat
  dat$list
}
n <- 5 # change this to the value in `totalPage`
pb <- progress_estimated(n)
df <- map_df(1:n, get_page, .pb = pb)
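If you don't want to hard-code n, a small sketch (reusing the same POST call as above) could read totalPage from the first page of results:
first <- POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
              encode = "form",
              body = list(queryPage = 1, distStr = "", induStr = "", investStr = "",
                          projName = "", sortby = "", orderby = "", stageArr = ""))
#the response reports its own page count, so use it to drive the loop
n <- jsonlite::fromJSON(content(first, as = "text"))$totalPage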

Related

R read_html running indefinitely

I am trying to scrape data from this website: Edmunds cost-to-own data. But whenever I run read_html("link"), nothing happens; as far as I can tell, it just runs indefinitely:
library(rvest)
htm <- read_html("https://www.edmunds.com/lexus/rx-350/2019/cost-to-own/?style=401771404")
I have also tried things like the following, but they all just run forever. Why can't I read this HTML?
library(httr)
library(XML)
library(dplyr)
library(rvest)
h <- handle("https://www.edmunds.com/lexus/rx-350/2019/cost-to-own/?style=401771404")
res <- GET(handle = h)
#parse the HTML
resXML <- htmlParse(content(res, as = "text"))
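One thing worth trying (a sketch, not a confirmed fix: it assumes the server is holding the connection open for clients that don't look like browsers, and the User-Agent string here is an arbitrary example) is to set an explicit timeout and a browser-like User-Agent so the request fails fast instead of hanging:
library(httr)
res <- GET("https://www.edmunds.com/lexus/rx-350/2019/cost-to-own/?style=401771404",
           user_agent("Mozilla/5.0"),  #present a browser-like User-Agent
           timeout(10))                #give up after 10 seconds instead of hanging
status_code(res)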

HTML content not showing when using html_nodes from rvest

I'm trying to get a specific number from webpages of https://ideas.repec.org/. More specifically, I'm looking for the number of search results (the figure shown on IDEAS' search results page).
However, when I'm applying the following code, I get an empty string:
library(rvest)
library(httr)
x <- GET("https://ideas.repec.org/cgi-bin/htsearch?form=extended&wm=wrd&dt=range&ul=&q=labor&cmd=Search%21&wf=4BFF&s=R&db=01%2F01%2F1950&de=31%2F12%2F1950")
webpage <- read_html(x)
hits_html <- html_nodes(webpage, xpath = '//*[@id="content-block"]/p')
hits <- html_text(hits_html)
hits
[1] ""
You could regex it out of the appropriate node. This assumes a constant surrounding string and case; you could also make it case-insensitive with (?i)found\\s+(\\d+)\\s+results, as in the sketch after the code.
library(rvest)
library(stringr)
page = read_html("https://ideas.repec.org/cgi-bin/htsearch?form=extended&wm=wrd&dt=range&ul=&q=labor&cmd=Search%21&wf=4BFF&s=R&db=01%2F01%2F1950&de=31%2F12%2F1950")
r = page %>% html_node("#content-block") %>% html_text() %>% toString()
x <- str_match_all(r, 'Found\\s+(\\d+)\\s+results')
print(x[[1]][, 2])
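For instance, the case-insensitive variant with the capture converted to a number (a small sketch, reusing r from above):
m <- str_match(r, '(?i)found\\s+(\\d+)\\s+results')
as.integer(m[, 2])  #the first capture group holds the count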

Scraping HTML webpage using R

I am scraping JFK's website to get flight schedules. The link to the flight schedules is here:
http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures
To begin with, I am inspecting one of the fields of a given flight and noting down its XPath. The idea is to see the output and then develop the code from there. This is what I have so far:
library(rvest)
Departure_url <- read_html('http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures')
Departures <- Departure_url %>% html_nodes(xpath = '//*[@id="ffAlLbl"]') %>% html_text()
I am getting an empty character vector for the Departures object in the code above.
I am not sure why this happens. I am looking for a node through which the entire schedule can be downloaded.
Any help is appreciated!
Scraping that table is kind of tricky.
First of all, what you are trying to scrape is live content, so you need to drive a real browser, for example with RSelenium.
Second, the content is inside an iframe that is itself inside another iframe, so you need to switch frames twice.
Finally, the content is not an HTML table, so you need to get the columns as vectors and combine them into a table.
The following code should do the job:
library(RSelenium)
library(rvest)
library(stringr)
library(glue)
library(tidyverse)
#RSelenium
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
myclient$navigate("http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures")
#switch frames twice
webElems <- myclient$findElement(using = "css", value = "[name=webfidsBox]")
myclient$switchToFrame(webElems)
webElems <- myclient$findElement(using = "css", value = "#coif02")
myclient$switchToFrame(webElems)
#get page source of the content
myPagesource <- read_html(myclient$getPageSource()[[1]])
selected_node <- myPagesource %>% html_node("#fvData")
#get content as vectors in a list and merge into a table
result_list <- map(1:7, ~ myPagesource %>% html_nodes(str_c(".c", .x)) %>% html_text())
result_list2 <- map(c(5, 6), ~ myPagesource %>% html_nodes(glue::glue("tr>td:nth-child({i})", i = .x)) %>% html_text())
result_list[[5]] <- c(result_list[[5]], result_list2[[1]])
result_list[[6]] <- c(result_list[[6]], result_list2[[2]])
result_df <- do.call("cbind", result_list)
colnames(result_df) <- result_df[1, ]
result_df <- as_tibble(result_df[-1, ])
You can do some data cleaning afterward.
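For example, a first cleaning pass might look like this (a sketch, assuming dplyr 1.0+ for across(); the exact columns depend on what the page returns):
result_df <- result_df %>%
  mutate(across(everything(), str_squish))  #collapse stray whitespace in every column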

Read HTML into R

I would like R to take a word in a column in a dataset, and return a value from a website. The code I have so far is below. So, for each word in the data frame column, it will go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the website, and it's unclear what I would need to enter to return this value - it's between <tt> and </tt> but there are many of these. I'm also not sure how to then get that value into R. Thank you.
library(xml2)
for (word in df$word) {
  result <- read_html(paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s"))
}
Parsing HTML is a tricky task in R. There are a couple of ways, though. If the HTML converts cleanly to XML and the website/API always returns the same structure, then you can use XML parsing tools. Otherwise, you could use regex and call stringr::str_extract() on the HTML (a sketch of that route is at the end of this answer).
For your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt> tags, but the one you want is always the second instance, so you can just pull that one out.
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
for (word in wordlist) {
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  #retrieve the second <tt>
  print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
  #don't abuse your API
  Sys.sleep(0.1)
}
>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
Good luck!
EDIT: This code will return a dataframe:
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
#initialize the dataframe with a pronunciation field
pronunciation_list <- data.frame(pronunciation = character(), stringsAsFactors = F)
#loop over the words
for (word in wordlist) {
  #build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  #parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  #retrieve the second <tt>
  to_add <- data.frame(pronunciation = (getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))
  #bind the data
  pronunciation_list <- rbind(pronunciation_list, to_add)
  #don't abuse your API
  Sys.sleep(0.1)
}
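For completeness, the regex route mentioned at the top could look something like this (a sketch; it assumes the pronunciation sits directly inside the second <tt> element, with no nested tags, just as the XML code does):
library(httr)
library(stringr)
res <- GET("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=happy&stress=-s")
raw_html <- content(res, as = "text")
#grab the contents of every <tt> and keep the second one
matches <- str_match_all(raw_html, "<tt>([^<]*)</tt>")[[1]]
matches[2, 2]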

Web scraping with R, solution with Jsonlite seems flaky

I maintain small scripts to extract financial data from websites. One of them retrieves the Dutch natural gas grid balance. However, I keep having problems with it: it works for a while, then I get an error message and finally have to find a workaround. It seems I am using a rather flaky method. Could anyone point me toward a better direction (package) for getting this done?
Below is the code (which has stopped working again):
library(curl)
library(bitops)
url <- "https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh"
h <- new_handle(copypostfields = "moo=moomooo")
handle_setheaders(h, "Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow")
req <- curl_fetch_memory(url, handle = h)
x <- rawToChar(req$content)
library(jsonlite)
json_data <- fromJSON(x)
data <- json_data[, c(1, 4)]
n <- tail(data, 1)
Many thanks
You can use rvest for this (though there could be better approaches too):
library(rvest)
json_data <- read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
  html_text() %>%
  jsonlite::fromJSON(.)
data <- json_data[, c(1, 4)]
n <- tail(data, 1)
n
Output:
> n
sbsdatetime position
37 2017-11-16 12:00:00 -9
A slightly more elegant solution, if the data frame isn't required:
library(rvest)
library(dplyr)
read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
  html_text() %>%
  jsonlite::fromJSON(.) %>%
  select(1:4) %>%
  tail(n = 1)
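Since the underlying complaint is flakiness, a slightly more defensive variant (a sketch, not tested against the live endpoint) would fail fast on HTTP errors instead of trying to parse an error page:
library(httr)
library(jsonlite)
url <- "https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh"
res <- GET(url)
stop_for_status(res)  #raise an R error on any 4xx/5xx response
json_data <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
tail(json_data[, c(1, 4)], 1)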