Extract the element from html page in R - html

I am new to R and trying to scrape the map data from the following webpage:
https://www.svk.se/en/national-grid/the-control-room/. The map is called "The flow of electricity". I am trying to scrape the capacity numbers (in blue) and the corresponding countries. So far I could not find a solution on how to find the countries' names in the HTML code and consequently scrape them.
Here is an example of data I need:
Would you have any idea?
Thanks a lot in advance.

The data is not in the table, hence we need to extract all the information individually.
Here is a way to do this using rvest.
library(rvest)
url <-'https://www.svk.se/en/national-grid/the-control-room/'
webpage <- url %>% read_html() %>%html_nodes('div.island')
tibble::tibble(country = webpage %>% html_nodes('span.country') %>% html_text(),
watt = webpage %>% html_nodes('span.watt') %>% html_text() %>%
gsub('\\s', '', .) %>% as.numeric(),
unit = webpage %>% html_nodes('span.unit') %>% html_text())
# country watt unit
# <chr> <dbl> <chr>
#1 SWEDEN 3761 MW
#2 DENMARK 201 MW
#3 NORWAY 2296 MW
#4 FINLAND 1311 MW
#5 ESTONIA 632 MW
#6 LATVIA 177 MW
#7 LITHUANIA 1071 MW

The flow data comes from an API call so you need to make an additional xhr (to an url you can find in the network tab via dev tools ) to get this data. You don't need to specify values for the timestamp (Ticks) and random (rnd) params in the querystring.
library(jsonlite)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
As a dataframe:
library(jsonlite)
library (plyr)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
df <- ldply (data, data.frame)

Related

rvest error on form submission "`Form` doesn't contain a `action` attribute"

I am trying to send search requests with rvest, but I get always the same error. I have tried several ways included this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(URL)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I get always the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the way is to edit the attributes of those buttons. Maybe has someone the answer to this. Thanks in advance.
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹​report_title
Created on 2023-01-15 with reprex v2.0.2

Tabulizing data off website PDFs (w/ various formats) assigning each event values according to HTML link titles

I've been trying to automate the process of manually typing down the data from the ATF's trace data site (see "URL") but it's been a fairly big pain as I've only been able to collect the each URL link that holds the PDFs and assign it to its correct State/Territory and Year. The newer files 2017-2019 have data "tables that are relatively easier to pull data from compared to 2014-2016, e.g. I'm referring to page 10 of the Trace Data report 2019 and Trace Data report 2014.
It's the latter that I'm having trouble with the most as the data is not stored in something that looks like a table but surrounding a (!)pie-chart. There have been some promising R packages such as "pdftools" and "tesseract". But I'm very much an amateur when it comes to trouble-shooting advanced analytical packages such as these.
It's my guess that I'm still a ways off from where I want to be with the final product as I would need to mine the bottom text of page 10 to find how many "other" weapons were traced to a city, as well the number of weapons where a recovery city couldn't be determined. But if anyone has any suggestions on what I could try next or to even make the working code more efficient, I'd appreciate it.
URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
x <- c('\t\t\t<div>([^<]*)</div>','\t\t</tr><tr><td>([^<]*)</td>','\t\t\t<td>([^<]*)</td>')
r <- read_html(URL) %>% html_nodes("a") %>% map_df(~{
Link <- .x %>% html_attr("href")
Title <- .x %>% html_text()
data_frame(Link, Title)
}) %>%
dplyr::filter(grepl('node',Link, fixed = T))
r <- as.data.frame(r)
x <- c('<ul><li>([^<]*)</li>','\t<li>([^<]*)</li>')
states <- c('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Guam & Northern Mariana Islands','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Puerto Rico','Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming')
s <- list()
for(i in 1:nrow(r)){
s[[i]] <- read_html(r$Link[i]) %>% html_nodes("a") %>% map_df(~{
Link <- .x %>% html_attr("href")
Title <- .x %>% html_text()
data_frame(Link, Title)
}) %>% mutate(Year <- r$Title[i]) %>%
dplyr::filter(Title %in% states | str_detect(Title, "Virgin Islands")) %>%
dplyr::filter(grepl('download',Link, fixed = T))
trace_list = do.call(rbind, s)
}
names(trace_list)[3] <- "Year"
Progress so far...
library(pdftools)
pdf_file <- "https://www.atf.gov/file/146951/download"
text <- pdf_text(pdf_file)
cat(text[10])
vtext <- as.list(str_split(text[10],"\n"))
x <- data.frame(matrix(unlist(vtext), nrow=length(vtext), byrow=TRUE),stringsAsFactors=FALSE)
x1 <- pivot_longer(x, cols = 1:length(x),names_to="X1",values_to="X2")
x1$X2 <- trimws(x1$X2)
x1 <- x1[c(8,12),]
x1[1,2] <- sub(" ","_",x1[1,2],fixed=T)
library(splitstackshape)
x1 <- as.data.frame(cSplit(x1, 'X2', sep=" ", type.convert=FALSE))
x1 <- x1[,c(2:length(x1))]
colnames(x1) <- x1[1,]
x1 <- x1[-1, ]
x2 <- pivot_longer(x1, cols = 1:length(x1),names_to="city",values_to="count")
mixing both pdftools amd tesseract...
library(tesseract)
img_file <- pdftools::pdf_convert("https://www.atf.gov/file/89621/download", format = 'tiff', dpi = 400)
text <- ocr(img_file)
strsplit(text[10],"\n")
Expected output:
year
state
city
count
2019
AL
Birmingham
100
2018
CA
Los Angeles
200
2017
CA
None
30
2017
CA
Other
400

Scrape website's Power BI dashboard using R

I have been trying to scrape my local government's Power BI dashboard using R but it seems like it might be impossible. I've read from the Microsoft site that it is not possible to scrable Power BI dashboards but I am going through several forums showing that it is possible, however I am going through a loop
I am trying to scrape the Zip Code tab data from this dashboard:
https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2
I've tried several "techniques" from the given code below
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[#id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[#id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
But I always either get an output saying character(0) when using xpath and getting the div class and id, or even {xml_nodeset (0)} when trying to go through html_nodes. The weird thing is that it wouldn't show the whole html of the tableau data when I do:
scc_webpage %>%
html_nodes("div")
And this would be the output, leaving the chunk that I needed blank:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
I guess the issue may be because the numbers are within a series of nested div attributes??
The main data I am trying to get are the numbers from the table showing the Zip code, confirmed cases, % total cases, deaths, % total deaths.
If this is possible to do in R or possibly in Python using Selenium, any help with this would be greatly appreciated!!
The problem is that the site you want to analyze relies on JavaScript to run and fetch the content for you. In such a case, httr::GET is of no help to you.
However, since manual work is also not an option, we have Selenium.
The following does what you're looking for:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
library(selectr)
# using wdman to start a selenium server
selServ <- selenium(
port = 4444L,
version = 'latest',
chromever = '84.0.4147.30', # set this to a chrome version that's available on your machine
)
# using RSelenium to start chrome on the selenium server
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4444L,
browserName = 'chrome'
)
# open a new Tab on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
Now we have the page source read into R, probably what you had in mind when you started your scraping task.
From here on it's smooth sailing and merely about converting that xml to a useable table:
col_headers <- zipcode_data_table %>%
querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
rownames <- zipcode_data_table %>%
querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
zipcode_data <- zipcode_data_table %>%
querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
map(xml_parent) %>%
unique() %>%
map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
setNames(col_headers) %>%
bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
type_convert(trim_ws = T, na = c(""))
The resulting df looks like this:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%

parse Google Scholar search results scraped with rvest

I am trying to use rvest to scrape one page of Google Scholar search results into a dataframe of author, paper title, year, and journal title.
The simplified, reproducible example below is code that searches Google Scholar for the example terms "apex predator conservation".
Note: to stay within the Terms of Service, I only want to process the first page of search results that I would get from a manual search. I am not asking about automation to scrape additional pages.
The following code already works to extract:
author
paper title
year
but it does not have:
journal title
I would like to extract the journal title and add it to the output.
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
df
source: https://stackoverflow.com/a/58192323/8742237
So the output of that code looks like this:
#> titles
#> 1 [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2 Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3 Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark
#> authors years
#> 1 A Ordiz, R Bischof, JE Swenson 2013
#> 2 A Barnett, KG Abrantes, JD Stevens, JM Semmens 2011
Two questions:
How can I add a column that has the journal title extracted from the raw data?
Is there a reference where I can read and learn more about how to work out how to extract other fields for myself, so I don't have to ask here?
One way to add them is this:
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
leftovers <- authors_years %>%
str_remove_all(authors) %>%
str_remove_all(years)
journals <- str_split(leftovers, "-") %>%
map_chr(2) %>%
str_extract_all("[:alpha:]*") %>%
map(function(x) x[x != ""]) %>%
map(~paste(., collapse = " ")) %>%
unlist()
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
For your second question: the css selector gadget chrome extension is nice for getting the css selectors of the elements you want. But in your case all elements share the same css class, so the only way to disentangle them is to use regex. So I guess learn a bit about css selectors and regex :)

Web scraping with R, solution with Jsonlite seems flaky

I maintain small scrips to extract financial data from websites. One of them retrieves the dutch natural gas grid balance. However, I keep getting problems with it as it works for a while, then get an error message and finally find a work around. Anyway, it seems that I am using a rather flaky method to do it. Could anyone guide me to a better direction (package) of getting this done?
Below I add the code (which again stopped working)
library(curl)
library(bitops)
url <- "https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh"
h <- new_handle(copypostfields ="moo=moomooo")
handle_setheaders(h, "Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow")
req <- curl_fetch_memory(url, handle=h)
x <- rawToChar(req$content)
library(jsonlite)
json_data <- fromJSON(x)
data <- json_data[,c(1,4)]
n=tail(data,1)
Many thanks
You can use rvest for this (but there could be better approaches too)
library(rvest)
json_data <- read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.)
data <- json_data[,c(1,4)]
n=tail(data,1)
n
Output:
> n
sbsdatetime position
37 2017-11-16 12:00:00 -9
Slightly elegant solution if the dataframe isn't required:
library(rvest)
library(dplyr)
read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.) %>%
select(1:4) %>%
tail(n=1)