How to get information from html in R? - html

I want to get information about price from this page: https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg
My code
url <-"https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg"
x <- xml2::read_html(url)
price<-x%>% html_node('span.product-price smaller-price') %>%
html_text()
but it returns NA
What can I do?

You have a space in your html statement when you really need to have a period. Try html_node('span.product-price.smaller-price') in your code and see if that works.

Related

How to read meta data from html with R

I got an R script from a colleague but it is not entirely working. Its intention is to read a price for a product from a website.
the code is as follows:
vec_tectake <- try(paste0('https://www.tectake.ch/de/',j)%>%
read_html %>%
html_nodes('[itemprop="price"]') %>%
html_attr('content'))
to give an example of a full link, "j" could be "rudergerat-mit-trainingscomputer-401074"
After running the code, the vec_tectake i get is emtpy.
Now i'm not really sure why, as it has worked with the same code on another webpage. Could it be that it is because the price is marked as "meta content"?
Thanks for your help
The thing is the price is in own <span> tag, not within attributes:
library(rvest)
j <- "rudergerat-mit-trainingscomputer-401074"
read_html(paste0('https://www.tectake.ch/de/',j)) |>
html_nodes(".price") |>
html_text()
#> [1] "CHF 256.00"
Created on 2022-03-07 by the reprex package (v2.0.1)

Rvest getting an specific text from html_node

I want to extract only "Beech Valley Solutions - "
When I run
html_nodes('li') %>%
html_nodes(".flexbox.empLoc") %>%
html_text()
All the information comes out. "Beech Valley Solutions - Atlanta, GA Today 24hr"
There is one more way of doing scraping using rvest.
Instead of passing css selector item in html_nodes(), you can pass xpath within html_nodes().Just an example below -
page %>% html_nodes(xpath = "//*[#id='series-matches']/div[20]/div[3]/div[1]/a[1]/span")
Reference:
https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/
x path is easier to fetch -
Right click the section for which you want to fetch xpath.
Select inspect code from the drop down. 3. html page will appear to the right side, from which click the right click and press Copy option.
Drop will appear from which select "Copy xpath".
Ctrl V (Paste) the xpath within html_nodes(xpath = "xpath here"). I hope this will help you.

Getting data from an html page using R

Anyone can help me why the below code doe not have any data for the selected table?
library('httr')
library('rvest')
url= read_html("http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB")
table = html_node(url,"table#f05v5-sorting-table.border-top2.border-allside.clearboth")
Thanks!
You are missing some steps. Your workflow should look like this:
dat_html <- read_html(
"http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)
dat_nodes <- html_nodes(dat_html, xpath = "xxxx")
dat <- html_table(dat_nodes)
dat will be a list, so if you want a data frame, you could do something like:
dat_df <- as.data.frame(dat)
Or, if you like tibbles:
dat_tbl <- as_tibble(dat)
I cannot find the table you are interested in on that webpage, so you have to replace "xxxx" by the xpath of the table you are interested in.
To find the xpath, if you are inspecting the page from chrome or chromium, you can right click on the node in the inspector window, and look for Copy, then Copy XPath.

R to change the values in html form and scrape web data

I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.
I am using the code given in this link Using r to navigate and scrape a webpage with drop down html forms.
However, I am not able to get the data probably due to change in structure of the page. In the code from the above link pgform <-html_form(pgsession)[[3]] was used to change the values of the form. I was not able to find a similar form in my case.
url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)
pgsource <- read_html(url)
pgform <- html_form(pgsession)
result in my case
> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':
Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:
library(rvest)
page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()
station_id <- page %>% html_nodes('button#cityname + ul a') %>%
html_attr('onclick') %>% # If you need names, grab the `href` attribute, too.
sub(".*'(.*)'.*", '\\1', .)
which can then be put into expand.grid with the months and years to generate all the necessary combinations:
df <- expand.grid(station_id,
month = sprintf('%02d', 1:12),
year = 2014:2016)
(Note if you want 2017 data, you'll need to construct those separately and rbind so as not to construct months that haven't happened yet.)
The combinations can then be paste0ed into URLs:
urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
df$station_id, '_', df$year, df$month, '.csv')
which can be lapplyed across to download all the files:
# Warning! This will download a lot of files! Make sure you're in a clean directory.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})

Using rvest to scrape GoodReads pages

I'm trying to scrape ratings and review numbers on goodreads, but getting an NA result. Why is this?
SelectorGadget finds "span span" for the average rating on hover over, but there's no "valid path" found at the bottom.
Using the same method on other sites (e.g. IMDB, theatlantic.com) works fine.
Here's my code and result (I've also tried replacing html_text with html_tag)
Rating<- html("http://www.goodreads.com/book/show/22444789-delicious-foods")
Rating %>%
html_node("span span") %>%
html_text () %>%
as.numeric()
[1] NA
Warning message:
In function_list[[k]](value) : NAs introduced by coercion
I didn't have any success using selectorgadget with the Goodreads site, but sometimes you just have to look at the html source and find what you're looking for that way.
In this case, you can use the .average class selector:
Rating %>%
html_node(".average") %>%
html_text %>%
as.numeric