Web scraping with R, solution with jsonlite seems flaky

I maintain small scripts to extract financial data from websites. One of them retrieves the Dutch natural gas grid balance. However, I keep having problems with it: it works for a while, then I get an error message, and eventually I find a workaround. It seems that I am using a rather flaky method. Could anyone point me to a better way (or package) of getting this done?
Below is the code (which has stopped working again):
library(curl)
library(bitops)
url <- "https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh"
h <- new_handle(copypostfields ="moo=moomooo")
handle_setheaders(h, "Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow")
req <- curl_fetch_memory(url, handle=h)
x <- rawToChar(req$content)
library(jsonlite)
json_data <- fromJSON(x)
data <- json_data[,c(1,4)]
n=tail(data,1)
Many thanks

You can use rvest for this (but there could be better approaches too)
library(rvest)
json_data <- read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.)
data <- json_data[,c(1,4)]
n=tail(data,1)
n
Output:
> n
sbsdatetime position
37 2017-11-16 12:00:00 -9
A slightly more elegant solution if the intermediate data frame isn't required:
library(rvest)
library(dplyr)
read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.) %>%
select(1, 4) %>%
tail(n=1)
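Selecting columns by position, as in c(1,4), can break silently if the endpoint ever reorders its fields. Below is a minimal offline sketch of selecting by name instead. The names sbsdatetime and position are taken from the output shown above; the two middle columns (col_b, col_c) and all values are made up for illustration, and the data frame merely stands in for what jsonlite::fromJSON() returns.

```r
# Stand-in for the data frame returned by jsonlite::fromJSON();
# `col_b` and `col_c` are hypothetical placeholder columns.
json_data <- data.frame(
  sbsdatetime = c("2017-11-16 11:00:00", "2017-11-16 12:00:00"),
  col_b = c(1, 2),
  col_c = c(3, 4),
  position = c(-7, -9),
  stringsAsFactors = FALSE
)

wanted <- c("sbsdatetime", "position")

# Fail early with a clear error if the endpoint changed its column names.
stopifnot(all(wanted %in% names(json_data)))

# Keep only the named columns and take the most recent row.
latest <- tail(json_data[, wanted], 1)
print(latest)
```

If the check fails, the script stops at a well-defined point instead of returning the wrong columns.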

Related

Scrape object from html with rvest

I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The URL is
here and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with HTML and CSS, so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes( source , '.cmp-gridStat__item-container' ) %>% html_node( '.number' ) %>% html_text() %>% toString()
We can convert the text obtained from the "Daily situation update" section using the vroom package:
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns the values of all matching nodes, whereas html_node only returns the first one. The former is therefore what you want when you need every node.
html_text vs html_text2
html_text2 retains the line breaks in strings: it converts <br> to \n and adds blank lines around paragraphs. These breaks are helpful when splitting the text afterwards.
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
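The two distinctions above can be checked offline on a small inline document. This is only a sketch: the HTML fragment and the .item class are invented for illustration, and it assumes rvest 1.0 or later (for minimal_html, html_elements and html_text2).

```r
library(rvest)

# Inline HTML so the example runs without network access.
page <- minimal_html(
  '<div class="item"><p>First<br>line</p></div><div class="item"><p>Second</p></div>'
)

# html_elements() returns every match, html_element() only the first one.
all_items  <- page %>% html_elements(".item") %>% html_text2()
first_item <- page %>% html_element(".item") %>% html_text2()

# html_text() drops the <br>, html_text2() turns it into a newline.
raw_text <- page %>% html_element(".item") %>% html_text()

print(length(all_items))
print(length(first_item))
print(raw_text)
print(first_item)
```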
There is probably a much more elegant way to do this efficiently, but when I need to brute-force something like this, I try to break it down into small parts.
Use the httr library to get the raw HTML.
Use str_extract from the stringr library to extract the specific piece of data from the HTML.
I use a regex with both a positive lookbehind and a lookahead to get the exact piece of data I need. It basically takes the form "(?<=text_right_before).+?(?=text_right_after)".
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html<-content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
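The lookbehind/lookahead pattern can be tried without network access or extra packages. The sketch below uses base R (regexpr with perl = TRUE) in place of stringr; the HTML fragment is a made-up stand-in for the downloaded page, and extract_first is a hypothetical helper name.

```r
# Static HTML fragment standing in for the downloaded page (illustrative only).
html <- "<p>Normal care: 57<br>Intensive care: 23</p>"

# Return the first match of `pattern` in `text`, or NA if there is none.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)  # perl = TRUE enables lookarounds
  if (m == -1) NA_character_ else regmatches(text, m)
}

# The lookbehind anchors on the label, the lookahead on the closing markup;
# neither is included in the returned match.
normal_care    <- extract_first(html, "(?<=Normal care: ).+?(?=<br>)")
intensive_care <- extract_first(html, "(?<=Intensive care: ).+?(?=</p>)")

print(as.integer(normal_care))
print(as.integer(intensive_care))
```

Returning NA when nothing matches makes it easier to notice when the page markup changes.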
I wondered if you could get the same data from any of their public APIs. If you simply want a PDF with that table (plus lots of other tables of useful info), you can use the API to extract it.
If you want it as a data frame (resembling the table on the webpage), you can write a user-defined function, with the help of pdftools, to reconstruct the table from the PDF. It is a bit more effort, but as you already have other answers covering rvest I thought I'd have a look at this. I looked at tabularize but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without needing to parse the PDF publication I use; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. A few of the calculations at the bottom of the webpage are not included below. I have only processed the testing info table from the PDF.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
data <- pdf_text(filename)
text <- data[[1]] %>% strsplit(split = "\n{2,}")
web_data <- text[[1]][3:12]
df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
unlist() %>%
matrix(nrow = 10, ncol = 5, byrow = T) %>%
as_tibble()
colnames(df) <- text[[1]][2] %>%
strsplit(split = "\\s{2,}") %>%
map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
unlist()
title <- text[[1]][1] %>%
strsplit(split = "\n") %>%
unlist() %>%
tail(1) %>%
gsub("\\s+", " ", .) %>%
gsub(" TOTAL", "", .)
colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
colnames(df)[1] <- "Metric"
clean_col <- function(x) {
gsub("\\s+|,", "", x) %>% as.numeric()
}
clean_col2 <- function(x) {
gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
}
df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
Metric = clean_col2(Metric)
)
return(df)
}
View(get_latest_daily_df(filename))
Alternate:
If you simply want to pull the items and process them afterwards, you could extract each column as an item in a list. Replace the br elements so that the content between them ends up in a comma-separated list:
library(rvest)
library(magrittr)
library(purrr) # provides map(), used below
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") # This method is from https://stackoverflow.com/a/46755666 by @hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
html_elements("p") %>%
html_text(trim = T) %>%
gsub("\n\\s{2,}", " ", .)
%>%
stri_remove_empty())

Web-Scraping using R. I want to extract some table like data from a website

I'm having some problems scraping data from a website, and I do not have much experience with web scraping. My plan is to scrape some data using R from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands
More precisely, I want to extract the brands on the right-hand side.
My idea so far:
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>% html_nodes(xpath='/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>% html_text()
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
That data is dynamically pulled from a script tag. You can pull the content of that script tag and parse it as JSON. Then subset the returned list to just the items of interest and extract the brand names:
library(rvest)
library(jsonlite)
library(stringr)
library(purrr) # provides map(), used below
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json()
data <- data$props$pageProps$apolloState
mask <- map(names(data), str_detect, '^Brand:') %>% unlist()
data <- subset(data, mask)
brands <- lapply(data, function(x){x$name})
I find the above easier to read, but you could try other approaches, such as:
library(rvest)
library(jsonlite)
library(stringr)
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json() %>%
{.$props$pageProps$apolloState} %>%
subset(., {str_detect(names(.), 'Brand:')}) %>%
lapply(. , function(x){x$name})
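The filtering step (keep only the entries whose key starts with "Brand:", then pull out each name) can be sketched offline in base R. The list below is an invented stand-in for data$props$pageProps$apolloState; its keys and brand names are illustrative, not taken from the real page.

```r
# Stand-in for the parsed apolloState list; keys and names are invented.
apollo_state <- list(
  "Brand:1"        = list(name = "Acme"),
  "Supplier:59787" = list(name = "Example Supplier"),
  "Brand:2"        = list(name = "Globex")
)

# Logical mask over the list names, equivalent to str_detect(names(.), '^Brand:').
is_brand <- startsWith(names(apollo_state), "Brand:")

# vapply() enforces that every entry yields exactly one character value.
brands <- vapply(apollo_state[is_brand], function(x) x$name, character(1))

print(unname(brands))
```

Using vapply instead of lapply gives a plain character vector and raises an error if an entry has no name field.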
Using {} so that the call is treated as an expression rather than a function is something I read in a comment by @asachet.

Scraping web table using R and rvest

I'm new to web scraping using R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case there is just one record in the table, but there could be more (I am actually interested in the first column, but the whole table is fine).
I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, maybe because of my limited knowledge of how webpages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text")
json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")
table_contents <- JSON %>%
{gsub("\\\\n", "\n", .)} %>%
{gsub("\\\\/", "/", .)} %>%
{gsub("\\\\\"", "\"", .)} %>%
strsplit("html\":\"") %>%
unlist %>%
extract(2) %>%
substr(1, nchar(.) -2) %>%
paste0("</tbody>")
new_page <- gsub("</tbody>", table_contents, resp)
read_html(new_page) %>%
html_nodes("table") %>%
html_table()
The data comes from another endpoint you can see in the network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the json you need from the response.
library(httr)
library(jsonlite)
library(rvest) # provides read_html(), html_node(), html_text() and %>%
params = list('k' = 'saperda tridentata','s' = 1,'m' = 1,'t' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>%html_text())
print(data[[1]]$e)
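To see roughly what request the query list above turns into, the URL can be assembled by hand in base R. This is only an illustrative sketch: httr does the encoding itself, and its exact output may differ in detail from what URLencode produces here.

```r
# Same parameters as in the answer above.
params <- list(k = "saperda tridentata", s = 1, m = 1, t = 0)

# Percent-encode each value (the space in the search phrase becomes %20).
encoded <- vapply(
  params,
  function(v) URLencode(as.character(v), reserved = TRUE),
  character(1)
)

# Join name=value pairs with "&" and append them to the endpoint.
query <- paste0(names(encoded), "=", encoded, collapse = "&")
full_url <- paste0("https://gd.eppo.int/ajax/search?", query)

print(full_url)
```

Passing the parameters as a list and letting the HTTP client encode them avoids hand-written separators such as the "+" in the question's pest.name.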

Scraping HTML webpage using R

I am scraping JFK's website to get flight schedules. The link to the flight schedules is here:
http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures
To begin with, I am inspecting one of the fields of a given flight and noting down its XPath. The idea is to see the output and then develop the code from there. This is what I have so far:
library(rvest)
Departure_url <- read_html('http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures')
Departures <- Departure_url %>% html_nodes(xpath = '//*[@id="ffAlLbl"]') %>% html_text()
I am getting an empty character object as output for 'Departures' object in the code above.
I am not sure why this happens. I am looking for a node through which the entire schedule can be downloaded.
Any help is appreciated !!
Scraping that table is rather tricky.
First of all, what you are trying to scrape is live content, so you need a browser-automation tool such as RSelenium.
Second, the content is actually inside an iframe that is inside another iframe, so you need to call switchToFrame twice.
Finally, the content is not a table, so you need to get all the columns as vectors and combine them into a table.
The following code should do the job:
library(RSelenium)
library(rvest)
library(stringr)
library(glue)
library(tidyverse)
#Rselenium
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client
myclient$navigate("http://www.flightview.com/airport/JFK-New_York-NY-(Kennedy)/departures")
# Switch to the nested frame (two switches)
webElems <- myclient$findElement(using = "css",value = "[name=webfidsBox]")
myclient$switchToFrame(webElems)
webElems <- myclient$findElement(using = "css",value = "#coif02")
myclient$switchToFrame(webElems)
# Get the page source of the content
myPagesource <- read_html(myclient$getPageSource()[[1]])
selected_node <- myPagesource %>% html_node("#fvData")
#get content as vectors in list and merge into table
result_list <- map(1:7,~ myPagesource %>% html_nodes(str_c(".c",.x)) %>% html_text())
result_list2 <- map(c(5,6),~myPagesource %>% html_nodes(glue::glue("tr>td:nth-child({i})",i=.x)) %>% html_text())
result_list[[5]] <- c(result_list[[5]],result_list2[[1]])
result_list[[6]] <- c(result_list[[6]],result_list2[[2]])
result_df <- do.call("cbind", result_list)
colnames(result_df) <- result_df[1,]
result_df <- as_tibble(result_df[-1, ]) # as.tibble() is deprecated in favour of as_tibble()
You can do some data cleaning afterward.
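The last step of that answer (bind the column vectors together, then promote the first row to column names) can be sketched offline in base R. The flight values below are invented, and the length check is an extra assumption added here because cbind() recycles shorter vectors rather than failing.

```r
# Each vector is one scraped column; the first element is its header.
# All values are made up for illustration.
result_list <- list(
  c("Airline", "AA", "DL"),
  c("Flight", "100", "200"),
  c("Destination", "BOS", "LAX")
)

# Guard against columns of unequal length, which cbind() would recycle.
stopifnot(length(unique(lengths(result_list))) == 1)

# Bind the vectors column-wise into a character matrix.
result_mat <- do.call("cbind", result_list)

# Use the first row as column names, then drop it from the data.
colnames(result_mat) <- result_mat[1, ]
result_df <- as.data.frame(result_mat[-1, , drop = FALSE],
                           stringsAsFactors = FALSE)

print(result_df)
```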

rvest cannot find node with xpath

This is the website I scrape:
ppp projects
I want to use XPath to select a node on it.
The XPath I get from "Inspect element" is //*[@id="pppListUl"]/li[1]/div[2]/span[2]/span
My script is as follows:
a <- html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span')
b
Then I got the result
{xml_nodeset (0)}
Then I checked the page source and could not find anything about the project I selected.
I was wondering why I cannot find it in the page source and, in turn, how I can get the node with rvest.
It makes an XHR request for the content. Just work with that data (it's pretty clean):
library(httr)
library(magrittr) # provides %>%, used below
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
encode="form",
body=list(queryPage=1,
distStr="",
induStr="",
investStr="",
projName="",
sortby="",
orderby="",
stageArr="")) -> res
content(res, as="text") %>%
jsonlite::fromJSON(flatten=TRUE) %>%
dplyr::glimpse()
(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).
It's a 4 element list with fields totalCount, list (which has the actual data), currentPage and totalPage.
It looks like you can change the queryPage form variable to iterate through the pages to get the whole list/database, something like:
library(httr)
library(purrr)
library(dplyr)
get_page <- function(page_num=1, .pb=NULL) {
if (!is.null(.pb)) .pb$tick()$print()
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
encode="form",
body=list(queryPage=page_num,
distStr="",
induStr="",
investStr="",
projName="",
sortby="",
orderby="",
stageArr="")) -> res
content(res, as="text") %>%
jsonlite::fromJSON(flatten=TRUE) -> dat
dat$list
}
n <- 5 # change this to the value in `totalPage`
pb <- progress_estimated(n)
df <- map_df(1:n, get_page, pb)
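The paging pattern itself (call a per-page function for every page number and row-bind the results) can be tried offline. In the sketch below, fake_get_page is a hypothetical stand-in for get_page that returns invented rows instead of calling the server, and base R's lapply/rbind stand in for map_df.

```r
# Hypothetical stand-in for get_page(): returns two invented rows per page
# instead of sending a POST request.
fake_get_page <- function(page_num) {
  data.frame(page = page_num, id = (page_num - 1) * 2 + 1:2)
}

total_pages <- 3  # in the real script this would come from `totalPage`

# Fetch every page, then stack the per-page data frames into one.
pages <- lapply(seq_len(total_pages), fake_get_page)
all_rows <- do.call(rbind, pages)

print(all_rows)
```

In a real run it may be worth pausing between requests (for example with Sys.sleep) to avoid hammering the server.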