R web scraping with RSelenium and rvest - html

I need to scrape this webpage so that I can get a data.frame like this:
value01 value02 id
SECTION I LIVE ANIMALS ANIMAL PRODUCTS sectionI
CHAPTER 1 LIVE ANIMALS chap0100000000
0101 Live horses, asses, mules and hinnies : (TN701) 0101000000-1
- Horses : 0101210000-2
0101 21 - - Pure-bred breeding animals (NC018) 0101210000-80
0101 29 - - Other : 0101290000-3
0101 29 10 - - - For slaughter 0101291000-80
0101 29 90 - - - Other 0101299000-80
0101 30 - Asses 0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the rest of the values of value01 and value02 I use the code below (I have to clean the obtained values afterwards, but I suspect there is a better way to get the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code you need to run for the previous examples to work:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')

It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table , .chapter') |>
rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1]) # the id is the first attribute of each line element
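To put everything together (the id column plus the scraped values), a rough, untested sketch is to pair each line's table text with its id; you will most likely still have to split the text into value01 and value02 yourself:
line_text <- sapply(data, function(tbl) paste(unlist(tbl), collapse = " "))
combined <- data.frame(text = line_text, id = ids, row.names = NULL)
head(combined)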
You'll need to clean the scraped data to your liking.
If this is not what you are looking for, you should clarify your question further.

Related

Scrape object from html with rvest

I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The URL is
here and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with HTML and CSS, so if you have any useful source or advice on how to extract objects from a webpage I would really appreciate it; SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
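If you also want each figure paired with its description, a small, untested sketch reusing the same selectors could be:
containers <- html_nodes(source, '.cmp-gridStat__item-container')
numbers <- containers %>% html_node('.number') %>% html_text(trim = TRUE)
# the first line of each container's text reads "<number> <description>"
labels <- sapply(strsplit(html_text2(containers), "\n"), `[`, 1)
labels <- trimws(sub("^[0-9 ]+", "", labels)) # drop the leading figure
data.frame(number = numbers, label = labels)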
We can convert the text obtained from the Daily situation update into a table using the vroom package:
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
  read_html() %>%
  html_nodes('.cmp-gridStat__item-container') %>%
  html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
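If you then want the label and the figure in separate columns, a hedged follow-up (lines without a colon simply get NA in the value column) could be:
library(tidyr)
vroom(df, delim = '\\n', col_names = F) %>%
  separate(X1, into = c("metric", "value"), sep = ":\\s*", fill = "right", extra = "merge")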
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns all the values associated with the matched nodes, whereas html_node only returns the first node. Thus the former fetches all the nodes, which is really helpful here.
html_text vs html_text2
html_text2 retains the breaks in strings (usually \n), which is helpful when working with strings.
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
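A tiny self-contained illustration of both differences (just a sketch, not from the original answer):
library(rvest)
doc <- minimal_html("<p>first<br>second</p><p>third</p>")
html_element(doc, "p") %>% html_text()   # first <p> only: "firstsecond"
html_elements(doc, "p") %>% html_text()  # both <p>s, <br> ignored
html_elements(doc, "p") %>% html_text2() # both <p>s, <br> kept as a line break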
There is probably a much more elegant way to do this efficiently, but when I need to brute-force something like this, I try to break it down into small parts.
Use the httr library to get the raw HTML.
Use str_extract from the stringr library to extract the specific piece of data from the HTML.
I use both a positive lookbehind and a lookahead regex to get the exact piece of data I need. It basically takes the form (?<=text_right_before).+?(?=text_right_after).
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract it.
If you want it as a data.frame (resembling the webpage) you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. It is a bit more effort, but as you already have other answers covering rvest I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without having to parse the pdf publication I use below; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. A few of the bottom calculations from the webpage are not included below; I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
  data <- pdf_text(filename) # one string per pdf page
  text <- data[[1]] %>% strsplit(split = "\n{2,}")
  web_data <- text[[1]][3:12] # the rows of the testing table
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()
  # column headers come from the second text block
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()
  # the table title sits in the last line of the first block
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)
  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }
  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
                      Metric = clean_col2(Metric))
  return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull the items and then process them, you could extract each column as an item in a list. Replace the br elements so that the content around them ends up in a comma-separated list:
library(rvest)
library(magrittr)
library(purrr) # for map() below
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
# replace each <br> with a "," span, then drop the <br> nodes
# (method from https://stackoverflow.com/a/46755666 by hrbrmstr)
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",")
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())

Scrape website's Power BI dashboard using R

I have been trying to scrape my local government's Power BI dashboard using R but it seems like it might be impossible. I've read on the Microsoft site that it is not possible to scrape Power BI dashboards, but I have gone through several forums showing that it is possible; however, I keep going around in circles.
I am trying to scrape the Zip Code tab data from this dashboard:
https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2
I've tried several "techniques", as shown in the code below:
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
But I always either get an output saying character(0) when using the xpath and div class/id approaches, or {xml_nodeset (0)} when going through html_nodes. The weird thing is that it doesn't show the whole HTML of the table data when I do:
scc_webpage %>%
html_nodes("div")
And this would be the output, leaving the chunk that I needed blank:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
I guess the issue may be that the numbers are within a series of nested div elements?
The main data I am trying to get are the numbers from the table showing the Zip code, confirmed cases, % total cases, deaths, % total deaths.
If this is possible to do in R or possibly in Python using Selenium, any help with this would be greatly appreciated!!
The problem is that the site you want to analyze relies on JavaScript to run and fetch the content for you. In such a case, httr::GET is of no help to you.
However, since manual work is also not an option, we have Selenium.
The following does what you're looking for:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
library(selectr)
# using wdman to start a selenium server
selServ <- selenium(
  port = 4444L,
  version = 'latest',
  chromever = '84.0.4147.30' # set this to a chrome version that's available on your machine
)
# using RSelenium to start chrome on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)
# open a new Tab on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
Now we have the page source read into R, probably what you had in mind when you started your scraping task.
From here on it's smooth sailing: it's merely a matter of converting that XML to a usable table:
col_headers <- zipcode_data_table %>%
  querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)
rownames <- zipcode_data_table %>%
  querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)
zipcode_data <- zipcode_data_table %>%
  querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
  map(xml_parent) %>%
  unique() %>%
  map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
  setNames(col_headers) %>%
  bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
  type_convert(trim_ws = T, na = c(""))
The resulting df looks like this:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%
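If you also want the percentage columns as numbers, a possible extra step (my addition, not part of the original answer) is to strip the "%" signs:
library(dplyr)
df_final <- df_final %>%
  mutate(across(contains("%"), ~ as.numeric(sub("%", "", .x))))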

parse Google Scholar search results scraped with rvest

I am trying to use rvest to scrape one page of Google Scholar search results into a dataframe of author, paper title, year, and journal title.
The simplified, reproducible example below is code that searches Google Scholar for the example terms "apex predator conservation".
Note: to stay within the Terms of Service, I only want to process the first page of search results that I would get from a manual search. I am not asking about automation to scrape additional pages.
The following code already works to extract:
author
paper title
year
but it does not have:
journal title
I would like to extract the journal title and add it to the output.
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
df
source: https://stackoverflow.com/a/58192323/8742237
So the output of that code looks like this:
#> titles
#> 1 [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2 Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3 Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark
#> authors years
#> 1 A Ordiz, R Bischof, JE Swenson 2013
#> 2 A Barnett, KG Abrantes, JD Stevens, JM Semmens 2011
Two questions:
How can I add a column that has the journal title extracted from the raw data?
Is there a reference where I can read and learn more about how to work out how to extract other fields for myself, so I don't have to ask here?
One way to add them is this:
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
library(purrr) # for map_chr()/map() used below
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
leftovers <- authors_years %>%
  str_remove_all(authors) %>%
  str_remove_all(years)
journals <- str_split(leftovers, "-") %>%
  map_chr(2) %>%
  str_extract_all("[:alpha:]*") %>%
  map(function(x) x[x != ""]) %>%
  map(~paste(., collapse = " ")) %>%
  unlist()
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
For your second question: the SelectorGadget Chrome extension is nice for getting the CSS selectors of the elements you want. But in your case all the elements share the same CSS class, so the only way to disentangle them is with regex. So I guess learn a bit about CSS selectors and regex :)
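For instance, since the .gs_a line usually reads "authors - journal, year - publisher", a hedged regex-only alternative (it will return NA for entries that lack a journal) would be:
journals_alt <- trimws(str_extract(authors_years, "(?<= - ).*?(?=, \\d{4})"))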

How to download an HTML table with an inconsistent number of columns in R?

I'm currently trying to download a table from the following URL:
url1<-"http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=de"
I downloaded and saved the file as .xls with the following code, because I thought it was an Excel file:
temp <- paste0(tempfile(), ".xls")
download.file(url1, destfile = temp, mode = "wb")
First I tried to read it into R as an Excel file, but it seems to be HTML (it can be opened by Excel though):
dfAMS <- read_excel(path = temp, sheet = "Sheet1", range = "I7:I37")
Therefore:
df <- read_html(temp)
Now unfortunately I'm stuck because the following lines of code won't give me the intended result (a nice table, or at least column I7:I37 from the .xls):
dfAMS <- html_node(df, "table") %>% html_table(fill = T) %>% tibble::as_tibble()
dplyr::glimpse(df)
I'm pretty sure the solution is rather simple, but I'm currently stuck and can't find it...
Thanks in advance!
Klamsi, the URL points to an HTML file renamed to have a ".xls" extension. This is somewhat common practice among webmasters. Try it yourself by renaming the ".xls" extension to ".html".
A second problem is that the HTML has a very messy table configuration. The table of interest is the fifth table in the document.
This is a workaround to obtain the values for the overall population (or "range A7:B37, I7:K37")
url <- "http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=en"
df <- read_html(url) %>%
html_table(header = TRUE, fill = TRUE) %>%
.[[5]] %>% #Extract the fifth table in the list
as.data.frame() %>%
.[,c(1:11)] %>%
select(1:2, 9:11)
names <- unlist(df[1, ]) # the first row holds the column labels
names[1:2] <- c("item", "Bundesland")
colnames(df) <- names
df <- df[-1, ] # drop the header row from the data
df %>% head()
item Bundesland Bestand Veränderung zum VJ absolut Veränderung zum VJ in %
2 Arbeitslosigkeit Bgld 7119 -973 -0.120242214532872
3 Arbeitslosigkeit Ktn 16564 -2160 -0.115359965819269
4 Arbeitslosigkeit NÖ 46342 -6095 -0.116234719758949
5 Arbeitslosigkeit OÖ 29762 -4649 -0.135102147569091
6 Arbeitslosigkeit Sbg 11173 -643 -0.0544177386594448
7 Arbeitslosigkeit Stmk 28677 -5602 -0.1634236704688
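The numeric columns still arrive as character; if needed, a final conversion step (my addition, assuming the column names shown above) is:
library(dplyr)
df <- df %>%
  mutate(across(-c(item, Bundesland), as.numeric))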

html_nodes returns different numbers of rows (R)

I'm new to R and trying to scrape a website. The website contains many products with their prices. When I scrape this, somehow the number of prices exceeds the number of products.
library(rvest)
url <- 'https://website'
webpage <- read_html(url)
SKU_data <- html_nodes(webpage,'.title') %>% html_text()
Price_data <- html_nodes(webpage,'.price') %>% html_text()
res <- data.frame(SKU_data,Price_data)
When I execute this I receive an error:
Error in data.frame(SKU_data, Price_data) :
arguments imply differing number of rows: 511, 521
The number of products on the website is 511, but there are 521 prices. How can I solve this?
The reason for the different lengths is that the website gives multiple prices for some products. You would want the lowest one, right? The lowest price is contained in the element that looks like <span style="position:relative;">3 486,-<span class="grn">грн.</span></span>. Using XPath, you can extract it:
SKU_data <- html_nodes(webpage,'.title') %>% html_text()
price_xpath <- "//span[contains(@style, 'position:relative')]"
Price_data <- html_nodes(webpage, xpath = price_xpath) %>%
html_text()
res <- data.frame(SKU_data, Price_data)
head(res)
# SKU_data Price_data
# 1 Кресло Чинция Пластик Неаполь N-20 1 699,-грн.
# 2 Стул Луиза хром Неаполь N-20 479,-грн.
# 3 OM-100 Стол письменный (1350х600х750мм) бук/бук 659,-грн.