I have been trying to scrape my local government's Power BI dashboard using R, but it seems like it might be impossible. I've read on the Microsoft site that it is not possible to scrape Power BI dashboards, yet several forums suggest that it is, so I'm going around in circles.
I am trying to scrape the Zip Code tab data from this dashboard:
https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2
I've tried several "techniques", shown in the code below:
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
But I always either get an output saying character(0) when using the xpath or the div class and id, or {xml_nodeset (0)} when going through html_nodes. The weird thing is that it doesn't show the whole html of the table data when I do:
scc_webpage %>%
html_nodes("div")
And this would be the output, leaving the chunk that I needed blank:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
I guess the issue may be that the numbers are buried within a series of nested div elements?
The main data I am trying to get are the numbers from the table showing the Zip code, confirmed cases, % total cases, deaths, % total deaths.
If this is possible to do in R or possibly in Python using Selenium, any help with this would be greatly appreciated!!
The problem is that the site you want to analyze relies on JavaScript to run and fetch the content for you. In such a case, a plain HTTP request (httr::GET or xml2::read_html) is of no help to you.
However, since manual work is also not an option, we have Selenium.
The following does what you're looking for:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
library(selectr)
# using wdman to start a selenium server
selServ <- selenium(
  port = 4444L,
  version = 'latest',
  chromever = '84.0.4147.30' # set this to a chrome version that's available on your machine
)
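# (optional sketch, an addition not in the original answer) if you are unsure which
# chromedriver versions are available locally, binman (used by wdman under the hood)
# can list them so you can pick a valid value for `chromever` above
binman::list_versions("chromedriver")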
# using RSelenium to start chrome on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)
# open a new Tab on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
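# the Zip Code visual is rendered asynchronously, so give it a moment to load before
# reading the page source (an added assumption; adjust the delay or use an explicit wait)
Sys.sleep(5)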
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
Now we have the page source read into R, probably what you had in mind when you started your scraping task.
From here on it's smooth sailing and merely a matter of converting that xml to a usable table:
col_headers <- zipcode_data_table %>%
  querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)

rownames <- zipcode_data_table %>%
  querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)

zipcode_data <- zipcode_data_table %>%
  querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
  map(xml_parent) %>%
  unique() %>%
  map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
  setNames(col_headers) %>%
  bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
  type_convert(trim_ws = TRUE, na = c(""))
The resulting df looks like this:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%
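If you later need the percentage columns as plain numbers, one possible follow-up (a sketch that is not part of the original answer; it assumes the column names shown above and dplyr >= 1.0 for across()):
df_final %>%
  # parse "17.53%" etc. into numeric values for the two "% of Total" columns
  mutate(across(contains("% of Total"), readr::parse_number))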
I am trying to send search requests with rvest, but I always get the same error. I have tried several ways, including this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(url)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I always get the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the way forward is to edit the attributes of those buttons. Maybe someone has the answer to this. Thanks in advance.
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
  title = data %>%
    html_elements(".document-title") %>%
    html_text2(),
  report_title = data %>%
    html_elements(".info") %>%
    html_text2() %>%
    str_remove_all("\r") %>%
    str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹report_title
Created on 2023-01-15 with reprex v2.0.2
I need to scrape this webpage so that I end up with a data.frame like this:
value01 value02 id
SECTION I LIVE ANIMALS ANIMAL PRODUCTS sectionI
CHAPTER 1 LIVE ANIMALS chap0100000000
0101 Live horses, asses, mules and hinnies : (TN701) 0101000000-1
- Horses : 0101210000-2
0101 21 - - Pure-bred breeding animals (NC018) 0101210000-80
0101 29 - - Other : 0101290000-3
0101 29 10 - - - For slaughter 0101291000-80
0101 29 90 - - - Other 0101299000-80
0101 30 - Asses 0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the rest of the values of value01 and value02 I use the following (I need to clean the obtained values afterwards, but I think there is a better way to obtain the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code you need to run for the previous examples to work:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')
It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table , .chapter') |>
  rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1])
You'll need to clean the scraped data to your liking.
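If the goal is a single data.frame with the id column attached, one possible way to put it together (a sketch, not from the original answer; it assumes each element of data yields the two text columns you showed and that the first attribute of each line is its id, so you may need to adapt the indexing):
# sketch: pair each scraped line with its id and stack everything into one data.frame
df <- do.call(rbind, Map(function(tbl, id) {
  vals <- unlist(tbl, use.names = FALSE)
  data.frame(value01 = vals[1], value02 = vals[2], id = unname(id),
             stringsAsFactors = FALSE)
}, data, ids))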
If this is not what you are looking for, you should clarify your question further.
I am currently working on a forecasting model and to do this I would like to import data from an HTML website into R and save the values-part of the data set into a new list.
I have used the following approach in R:
library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document,"//*/comment()"))
doc.tables<-readHTMLTable(document)
# show BID/ASK block:
doc.tables[2]
which (doc.tables[2]) in this case gives me the result:
$`NULL`
Bid 0,765
1 Ask 0,80
How can I filter the numbers (0,765 & 0,80) out of the table and save them into a list?
The issue is that 0,765 is actually the name of your data.frame column.
Your data frame being doc.tables[[2]]
You can grab the name by calling names(doc.tables[[2]])[2]
Store that as a variable, like name <- names(doc.tables[[2]])[2]
then you can grab the 0,80 by using doc.tables[[2]][[2]], store that as a variable if you like.
Final code should look like... my_list <- list(name, doc.tables[[2]][[2]])
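Putting those pieces together, a minimal sketch of the steps above (converting the comma decimals to numeric is optional):
# pull both quotes out of the second table and store them in a named list
name <- names(doc.tables[[2]])[2]   # "0,765" -- it came through as the column name
ask  <- doc.tables[[2]][[2]]        # "0,80"
my_list <- list(Bid = as.numeric(sub(",", ".", name)),
                Ask = as.numeric(sub(",", ".", ask)))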
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"),
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)
# Extract column from BID/ASK table
BidAsk = doc.tables[[2]][,2]

# Replace commas with point decimal separator
BidAsk = gsub(",", ".", BidAsk)

# Convert to numeric
BidAsk = as.numeric(BidAsk)
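If you also want to keep the Bid/Ask labels from the first column attached to the values, a possible follow-up sketch (not part of the original answer):
# name the numeric values with the labels from column 1 ("Bid", "Ask")
bid_ask <- setNames(as.list(BidAsk), as.character(doc.tables[[2]][, 1]))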
I am new to R and trying to scrape the map data from the following webpage:
https://www.svk.se/en/national-grid/the-control-room/. The map is called "The flow of electricity". I am trying to scrape the capacity numbers (in blue) and the corresponding countries. So far I have not been able to work out how to find the countries' names in the HTML code so that I can scrape them.
Here is an example of the data I need: each country's name together with its capacity value (e.g. SWEDEN 3761 MW).
Would you have any idea?
Thanks a lot in advance.
The data is not in the table, hence we need to extract all the information individually.
Here is a way to do this using rvest.
library(rvest)
url <-'https://www.svk.se/en/national-grid/the-control-room/'
webpage <- url %>% read_html() %>% html_nodes('div.island')

tibble::tibble(country = webpage %>% html_nodes('span.country') %>% html_text(),
               watt = webpage %>% html_nodes('span.watt') %>% html_text() %>%
                 gsub('\\s', '', .) %>% as.numeric(),
               unit = webpage %>% html_nodes('span.unit') %>% html_text())
# country watt unit
# <chr> <dbl> <chr>
#1 SWEDEN 3761 MW
#2 DENMARK 201 MW
#3 NORWAY 2296 MW
#4 FINLAND 1311 MW
#5 ESTONIA 632 MW
#6 LATVIA 177 MW
#7 LITHUANIA 1071 MW
The flow data comes from an API call, so you need to make an additional XHR request (to a URL you can find in the network tab of your browser's dev tools) to get this data. You don't need to specify values for the timestamp (Ticks) and random (rnd) params in the query string.
library(jsonlite)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
As a dataframe:
library(jsonlite)
library(plyr)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
df <- ldply(data, data.frame)
I'm having an issue trying to get Rankings from the Freeride World Tour website.
I first tried to get a CSS selector for rvest using selectorGadget in Chrome, but could only get the riders and their overall score. What I'm interested in is getting the points a rider scored in each heat. I'm new to web-scraping and CSS/HTML so please hang in there with me.
# Get the website url
url <- read_html("https://www.freerideworldtour.com/rankings-detailed?season=165&competition=2&discipline=38")
Download everything from the page,
(all_text <- url %>%
  html_nodes("div") %>%
  html_text())
and then look for Kristofer Turdell's first score of 2500 pts. with grep("2500 pts.", all_text), but I find... nothing.
When I right-click the 2500 pts. on the website and select "Inspect" I can see that the html code for this section is:
<div class="field__item even">2500 pts.</div>
So I tried to use the div class:
url %>%
html_nodes(".field__item.even:) %>%
html_text()
This only returns the overall score for the participants (e.g. Kristofer Turdell 7870 pts.).
Next, I tried using the right-click option to save Xpath from "Inspect".
url %>%
html_nodes(xpath = "//*[#id="page-content"]/div/div/div[2]/div/div/div/div[1]/div[2]/div/div/div[1]/div/div[4]/div/div/div") %>%
html_text()
I'm not having any luck on this so I'd really appreciate your help.
url %>%
  html_node("div.panel-second") %>%
  html_text() %>%
  gsub("\\s*\\n+\\s*", ";", .) %>%
  gsub("pts.", "\n", .) %>%
  read.table(text = ., fill = TRUE, sep = ";", row.names = NULL) %>%
  subset(select = 3:4) %>%
  na.omit()
V3 V4
1 Kristofer Turdell 7870
2 Markus Eder 7320
3 Mickael Bimboes 6930
4 Loic Collomb-Patton 6660
5 Yann Rausis 6290
6 Berkeley Patterson 5860
7 Leo Slemett 5835
8 Ivan Malakhov 5800
9 Craig Murray 5705
10 Logan Pehota 5655
11 Reine Barkered 5470
12 Grifen Moller 4765
13 Sam Lee 4580
14 Ryan Faye 3210
15 Conor Pelton 3185
16 George Rodney 3115
17 Taisuke Kusunoki 3060
18 Trace Cooke 2905
19 Aymar Navarro 2855
20 Felix Wiemers 2655
21 Fabio Studer 2305
22 Stefan Hausl 2240
23 Drew Tabke 1880
24 Carl Regnér Eriksson 1310
Writing that much code in the comments was awful, so here goes. You can store the scraped data into a dataframe and not be limited to printing it to the console:
library(tidyverse)
library(magrittr)
library(rvest)
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- c("kristofer-turdell", "markus-eder", "mickael-bimboes")
output <- tibble()
for (i in riders) {
  temp <- read_html(paste0(url_base, i)) %>%
    html_node("div") %>%
    html_text() %>%
    gsub("\\s*\\n+\\s*", ";", .) %>%
    gsub("pts.", "\n", .) %>%
    read.table(text = ., fill = TRUE, sep = ";", row.names = NULL,
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>%
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) &
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
  output <- bind_rows(output, temp)
}
I put in parts such as as.character(Points) != "" to exclude the sum of points (such as Mickael Bimboes' 6930 pts) rather than the individual scores.
Again, much credit goes to @Onyambu; many lines are borrowed from his answer.