I would like to scrape basic data about wines from Vivino. I have never done any scraping before, but based on some tutorials and a lecture on DataCamp I tried the basic code below, using the rvest library.
However, it does not seem to work and returns zero values.
Could anyone please tell me where the problem is? Is the code completely wrong and I should use some other method, or am I just missing something?
Thank you in advance for any answers!
library(rvest)
library(dplyr)
url <- 'https://www.vivino.com/explore?e=eJwNybEOQDAQBuC3ubkG4z-abMQkIqdO00RbuTbF2_OtX1A0FHyEocAPWmPIvhh7suimga5_3YHK6qXwSWmDcvHR5ZWrKDuhhF2ypbvMC5oP96QajA%3D%3D&cart_item_source=nav-explore'
web <- read_html(url)
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
The page loads its results dynamically with JavaScript, which is why rvest alone will not work; you also need RSelenium.
Assuming you use Firefox, the following code should work:
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser="firefox", port=4546L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate(url)
# Scroll down a couple of times to reach the bottom of the page
# so that additional data load dynamically with each scroll.
# Here I scroll 4 times, but perhaps you will need much more than that.
for(i in 1:4){
remDr$executeScript(paste("scroll(0,",i*10000,");"))
Sys.sleep(3)
}
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
# close RSelenium
remDr$close()
gc()
rD$server$stop()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
# now we can go on to our rvest code and scrape the data
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
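If all selectors match the same set of wine cards, you can then collect the vectors into one data frame. A minimal sketch, assuming every vector comes back the same length (a mismatch would mean some cards lack a field) and converting the rating to numeric:
# collect the scraped vectors into a single data frame
# (assumes every selector matched the same number of wine cards)
wines <- data.frame(
  winery    = winery_data,
  wine      = wine_name,
  country   = wine_country,
  region    = wine_region,
  rating    = as.numeric(wine_rating),
  n_ratings = n_ratings,
  stringsAsFactors = FALSE
)
head(wines)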
Related
I am trying to scrape the following page(s):
https://mywebsite.com
In particular, I would like to get the name of each entry. I noticed that the text I am interested in (MY TEXT) always sits between these two tags: <div class="title"> MY TEXT
I know how to search for these tags individually:
#load libraries
library(rvest)
library(httr)
library(XML)
# set up page
url<-"https://www.mywebsite.com"
page <-read_html(url)
#option 1
b = page %>% html_nodes("title")
option1 <- b %>% html_text() %>% strsplit("\\n")
#option 2
b = page %>% html_nodes("a")
option2 <- b %>% html_text() %>% strsplit("\\n")
Is there some way I could have specified the html_nodes argument so that it picked up "MY TEXT" - i.e. scraped between <div class="title"> and </a>:
<div class="title"> MY TEXT
Scraping pages 1 to 10:
library(tidyverse)
library(rvest)
my_function <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
                 page_n, "?category=0&services=0&province=55&city=&k=") %>% read_html
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         address = page %>%
           html_elements(".marker") %>%
           html_text2(),
         page = page_n)
}
df <- map_dfr(1:10, my_function)
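If you scale this beyond ten pages, it is polite to pause between requests so you do not hammer the server. A minimal sketch wrapping the function above with a delay (the one-second pause is an arbitrary choice):
# same scraper, but waiting one second before each request
polite_function <- function(page_n) {
  Sys.sleep(1)
  my_function(page_n)
}
df <- map_dfr(1:10, polite_function)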
You can use the xpath argument inside html_elements to locate each a tag inside a div with class "title".
Here's a complete reproducible example.
library(rvest)
"https://www.mywebsite.ca/extension1/" %>%
paste0("2?extension2") %>%
read_html() %>%
html_elements(xpath = "//div[@class='title']/a") %>%
html_text()
Or to get all entries on the first 10 pages:
library(rvest)
unlist(lapply(1:10, function(page){
"https://www.mywebsite.ca/extension1/" %>%
paste0(page, "?extension2") %>%
read_html() %>%
html_elements(xpath = "//div[@class='title']/a") %>%
html_text()}))
Created on 2022-07-26 by the reprex package (v2.0.1)
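If some pages fail to load, read_html() will abort the whole lapply(). A minimal sketch with error handling, assuming you would rather skip a broken page than stop (a failed page contributes zero entries):
library(rvest)
unlist(lapply(1:10, function(page){
  tryCatch(
    "https://www.mywebsite.ca/extension1/" %>%
      paste0(page, "?extension2") %>%
      read_html() %>%
      html_elements(xpath = "//div[@class='title']/a") %>%
      html_text(),
    # on error, return an empty character vector for this page
    error = function(e) character(0))
}))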
When I use the rvest package with an XPath to try to get the embedded links (football team names) from the site, I get an empty result. Could someone help with this?
The code is as follows:
library(rvest)
url <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
xpath <- as.character('/html/body/div[2]/div[11]/div[1]/div[2]/div[2]/div')
url %>%
html_node(xpath=xpath) %>%
html_attr('href')
You can get all the links using:
library(rvest)
url <- 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>%
paste0('https://www.transfermarkt.com', .) %>%
unique() %>%
head(20)
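If you also want the team names next to the links, you can read both from the same nodes. A minimal sketch, assuming the anchor text of each td.hauptlink a is the team name (duplicates are dropped with unique(), as above):
library(rvest)
page <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
anchors <- page %>% html_nodes('td.hauptlink a')
# pair each anchor's text with its href
teams <- data.frame(
  name = anchors %>% html_text(trim = TRUE),
  link = anchors %>% html_attr('href'),
  stringsAsFactors = FALSE
)
teams <- unique(teams[teams$link != '#', ])
teams$link <- paste0('https://www.transfermarkt.com', teams$link)
head(teams, 20)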
I am trying to read off the URLs to data from NRCan as follows:
# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"
x1 <- read_html(url) %>%
html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
html_attr("href")
# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"
x2 <- read_html(url2) %>%
html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
html_attr("href")
Doing so returns two empty lists; I am confused, as this approach worked for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop over the list and read off the tables on each page like so:
# requires openxlsx (write.xlsx), stringr (str_c) and tibble (as_tibble)
for (i in 1:length(x2)){
  out.data <- read_html(x2[i]) %>%
    html_table(fill = TRUE) %>%
    `[[`(1) %>%
    as_tibble()
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))
}
In order to extract all URLs, I recommend using the CSS selector ".field-item li a" and subsetting according to a pattern.
library(stringr) # str_subset() comes from stringr
links <- read_html(url) %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")
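From there you can plug the links into the table-reading loop from the question. A minimal sketch, assuming (as in your loop) that the first table on each linked page is the one you want:
# read the first table from each extracted link into a list
tables <- lapply(links, function(lnk) {
  read_html(lnk) %>%
    html_table(fill = TRUE) %>%
    `[[`(1)
})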
Your XPath needs to be fixed. You can use the following one:
//strong[contains(.,"Oil")]/following-sibling::ul//a
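For completeness, a minimal sketch plugging that XPath into the same pipeline (url as defined in the question):
links <- read_html(url) %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")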
I have been trying to display the review rating of this song using rvest in R, from Pitchfork: https://pitchfork.com/reviews/albums/us-girls-heavy-light/. In this case it is 8.5, but somehow I do not get that value.
Here is my code:
library(rvest)
library(dplyr)
library(RCurl)
library(tidyverse)
URL="https://pitchfork.com/reviews/albums/us-girls-heavy-light/"
webpage = read_html(URL)
cat("Review Rating")
webpage %>%
  html_nodes("div span") %>%
  html_text()
We can get the relevant information from the div whose class is "score-circle".
library(rvest)
webpage %>% html_nodes('div.score-circle') %>% html_text() %>% as.numeric()
#[1] 8.5
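If you want to look up several albums, the same selector can be wrapped in a small helper. A minimal sketch; note that generated class names like score-circle can change whenever Pitchfork redesigns, so treat this as fragile:
# fetch the numeric review score for a given Pitchfork album URL
get_score <- function(album_url) {
  read_html(album_url) %>%
    html_nodes('div.score-circle') %>%
    html_text() %>%
    as.numeric()
}
get_score("https://pitchfork.com/reviews/albums/us-girls-heavy-light/")
#[1] 8.5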
I want to get the number of pages from a web site. I tried to do it as in a tutorial, using this function:
get_last_page <- function(html){
  pages_data <- html %>%
    # The '.' indicates the class
    html_nodes('.pagination-page') %>%
    # Extract the raw text as a list
    html_text()
  # The second to last of the buttons is the one
  pages_data[(length(pages_data)-1)] %>%
    # Take the raw string
    unname() %>%
    # Convert to number
    as.numeric()
}
first_page <- read_html(url)
(latest_page_number <- get_last_page(first_page))
For the website
url <- 'http://www.trustpilot.com/review/www.amazon.com'
it works fine. When I tried it with
url <- 'https://energybase.ru/en/oil-gas-field/index'
I got integer(0).
I changed
html_nodes('.pagination-page')
to
html_nodes('data-page')
and that failed too.
How can I change my code to make it work?
I think you have to go about this a little differently here.
The energybase.ru URL isn't organized quite the same way as the TrustPilot URL.
For our purposes here, we're interested in the fact that the last page has its own node, .last. From there, you just have to extract the value of its data-page attribute and increment it by 1.
library("rvest")
library("magrittr")
url <- 'https://energybase.ru/en/oil-gas-field/index'
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attr("data-page") %>% as.numeric()+1
# [1] 21
Edit: note that you can always intercept the pipe at html_children() (by adding %>% html_attrs() to it) to find out what attributes are at your disposal there.
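For example, to list every attribute available on those child nodes:
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attrs()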
You could select the node whose rel attribute equals last and extract the page number from its href:
library("rvest")
library("magrittr")
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
number_of_pages <- str_match_all(pg %>% html_node("[rel=last]") %>% html_attr("href"),'page=(\\d+)')[[1]][,2] %>% as.numeric()
Or, there are a number of ways you could calculate it, given that there are more pages than the visible pagination shows. One way is to get the total count from the appropriate li in the drop-down and divide it by the per-page result count.
library(rvest)
library(magrittr)
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
total_sites <- strtoi(pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount'), base = 10L)
# or use: total_sites <- pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount') %>% as.numeric()
sites_per_page <- length(pg %>% html_nodes('.index-list-item'))
number_of_pages <- ceiling(total_sites/sites_per_page)
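Once number_of_pages is known, you can build the full set of page URLs to iterate over. A minimal sketch, assuming the site paginates with a page query parameter (as the rel=last href above suggests):
# construct one URL per page of results
page_urls <- paste0('https://energybase.ru/en/oil-gas-field/index?page=', seq_len(number_of_pages))
head(page_urls)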