I got an R script from a colleague, but it is not entirely working. Its intention is to read the price of a product from a website.
The code is as follows:
vec_tectake <- try(paste0('https://www.tectake.ch/de/', j) %>%
  read_html() %>%
  html_nodes('[itemprop="price"]') %>%
  html_attr('content'))
To give an example of a full link, j could be "rudergerat-mit-trainingscomputer-401074".
After running the code, the vec_tectake I get is empty.
Now I'm not really sure why, as the same code has worked on another webpage. Could it be because the price is marked as "meta content"?
Thanks for your help
The thing is, the price is in its own <span> tag, not within the attributes of a meta tag:
library(rvest)
j <- "rudergerat-mit-trainingscomputer-401074"
read_html(paste0('https://www.tectake.ch/de/', j)) |>
  html_nodes(".price") |>
  html_text()
#> [1] "CHF 256.00"
Created on 2022-03-07 by the reprex package (v2.0.1)
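If you want to keep the try() wrapper from your original snippet, the fixed selector drops straight in. A minimal sketch, assuming you loop over a vector of product slugs (the slug vector and the NA fallback are my additions):
library(rvest)

# Sketch: loop the fixed ".price" selector over product slugs with try().
# Only the first slug comes from the question; the vector is illustrative.
slugs <- c("rudergerat-mit-trainingscomputer-401074")

vec_tectake <- sapply(slugs, function(j) {
  res <- try(
    read_html(paste0("https://www.tectake.ch/de/", j)) |>
      html_nodes(".price") |>
      html_text(),
    silent = TRUE
  )
  if (inherits(res, "try-error") || length(res) == 0) NA_character_ else res[1]
})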
I am attempting to extract data from a wiki fandom website using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Please see below for my attempts at dealing with this issue:
library(tidyverse)
library(data.table)
library(rvest)
library(httr)
url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")
#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()
#So now just extract data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()
The second attempt does succeed at extracting the raw data, but it comes out in a form that is not easy to shape into a clean dataframe. So I then attempted to extract each element of the table individually, which might be easier to clean and structure into a dataframe. However, when I attempt to do so using the XPath, I get an empty result:
df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2()
So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a dataframe-friendly format? If not, would someone be able to point me towards why my attempt to extract each panel individually is not working?
If you target the div.pi-data directly, you could do something like this:
bind_rows(
  read_html(url) %>%
    rvest::html_nodes("div.pi-data") %>%
    map(.f = ~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      text = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
    ) %>% unnest(text))
)
Output:
# A tibble: 29 x 2
label text
<chr> <chr>
1 Homeworld Tatooine[1]
2 Born 41 BBY,[2] Tatooine[3]
3 Died 4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
4 Species Human[1]
5 Gender Male[1]
6 Height 1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]
7 Mass 120 kilograms in armor[7]
8 Hair color Blond,[8] light[9] and dark[10]
9 Eye color Blue,[11] later yellow (dark side)[12]
10 Skin color Light,[11] later pale[5]
# ... with 19 more rows
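If you need the same thing for other characters, the logic above can be wrapped in a small function. A sketch, assuming every Fandom infobox uses the same portable-infobox classes (pi-data, pi-data-label, pi-data-value):
library(tidyverse)
library(rvest)

# Sketch: reusable infobox scraper for Fandom pages that use the
# standard portable-infobox markup.
scrape_infobox <- function(url) {
  read_html(url) %>%
    html_nodes("div.pi-data") %>%
    map_dfr(~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      text  = html_elements(.x, ".pi-data-value") %>% html_text2() %>%
        strsplit(split = "\n")
    )) %>%
    unnest(text)
}

scrape_infobox("https://starwars.fandom.com/wiki/Anakin_Skywalker")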
I want to get the price from this page: https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg
My code:
url <-"https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg"
x <- xml2::read_html(url)
price<-x%>% html_node('span.product-price smaller-price') %>%
html_text()
but it returns NA
What can I do?
You have a space in your CSS selector where you really need a period. Try html_node('span.product-price.smaller-price') in your code and see if that works.
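For completeness, the corrected call in context (a sketch; the page markup may have changed since the question was asked):
library(rvest)

url <- "https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg"

# A period chains two classes on the same element ("product-price" AND
# "smaller-price"); a space would look for a descendant element instead.
read_html(url) %>%
  html_node("span.product-price.smaller-price") %>%
  html_text()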
I am trying to extract the current production number from this website: http://okg.se/sv/Produktionsinformation/ (displayed in the blue gauge on the page).
Here is the part of the HTML I need to use:
<tspan dy="0" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">518</tspan>
Here is an example of the code I used:
url <- "http://okg.se/sv/Produktionsinformation//"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content %>% html_nodes(".content__info__item__value")
But the result I get shows that there are no nodes available:
{xml_nodeset (0)}
Do you have any ideas on how to solve this issue?
Thanks in advance!
I am not quite sure which value you need, but this works:
library(rvest)
# page url
url <- "http://okg.se/sv/Produktionsinformation/"
# current value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-current")

# Max value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-max")
The HTML you see in your browser has been processed by JavaScript, so it isn't the same as the HTML you get with rvest.
The raw data you are looking for is actually stored in attributes of a div with the id "gauge", so you get it like this:
library(rvest)
#> Loading required package: xml2
"http://okg.se/sv/Produktionsinformation//" %>%
read_html() %>%
html_node("#gauge") %>%
html_attrs() %>%
`[`(c("data-current", "data-max"))
#> data-current data-max
#> "553" "1450"
Note that you don't need to save the HTML to your local drive to process it. You can read it directly from the internet by giving the URL to read_html().
Created on 2020-02-20 by the reprex package (v0.3.0)
I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.
I am using the code given in this link: Using r to navigate and scrape a webpage with drop down html forms.
However, I am not able to get the data, probably due to a change in the structure of the page. In the code from the above link, pgform <- html_form(pgsession)[[3]] was used to change the values of the form. I was not able to find a similar form in my case.
url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)
pgsource <- read_html(url)
pgform <- html_form(pgsession)
The result in my case:
> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':
Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:
library(rvest)
page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()
station_id <- page %>% html_nodes('button#cityname + ul a') %>%
  html_attr('onclick') %>%  # If you need names, grab the `href` attribute, too.
  sub(".*'(.*)'.*", '\\1', .)
The station IDs can then be put into expand.grid with the months and years to generate all the necessary combinations:
df <- expand.grid(station_id = station_id,
                  month = sprintf('%02d', 1:12),
                  year = 2014:2016)
(Note if you want 2017 data, you'll need to construct those separately and rbind so as not to construct months that haven't happened yet.)
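For example, if the current date were in March 2017, the extra rows could be built and appended like this (a sketch; adjust the month cutoff to whatever is current):
# Sketch: add 2017 up to the last complete month (February, in this example).
df_2017 <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', 1:2),
                       year = 2017)
df <- rbind(df, df_2017)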
The combinations can then be paste0ed into URLs:
urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
               df$station_id, '_', df$year, df$month, '.csv')
which can be lapplyed across to download all the files:
# Warning! This will download a lot of files! Make sure you're in a clean directory.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
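Once downloaded, the files can be read back in and stacked; a sketch, assuming all the CSVs share the same column layout:
# Sketch: read every downloaded CSV and stack them into one data frame.
files <- list.files(pattern = '^DAILYDATA_.*\\.csv$')
weather <- do.call(rbind, lapply(files, read.csv))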
I'm trying to scrape ratings and review numbers on Goodreads, but I'm getting an NA result. Why is this?
SelectorGadget finds "span span" for the average rating on hover-over, but reports no "valid path" found at the bottom.
Using the same method on other sites (e.g. IMDB, theatlantic.com) works fine.
Here's my code and result (I've also tried replacing html_text with html_tag):
Rating<- html("http://www.goodreads.com/book/show/22444789-delicious-foods")
Rating %>%
  html_node("span span") %>%
  html_text() %>%
  as.numeric()
[1] NA
Warning message:
In function_list[[k]](value) : NAs introduced by coercion
I didn't have any success using SelectorGadget with the Goodreads site, but sometimes you just have to look at the HTML source and find what you're looking for that way.
In this case, you can use the .average class selector:
Rating %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()
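A complete runnable version, using read_html() (the current replacement for the deprecated html() in the question):
library(rvest)

# read_html() supersedes the old html() used in the question.
"http://www.goodreads.com/book/show/22444789-delicious-foods" %>%
  read_html() %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()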