How can I extract selective data from a webpage using rvest?

I have been trying to display the review rating of this album using rvest in R, from Pitchfork: https://pitchfork.com/reviews/albums/us-girls-heavy-light/ . In this case, it is 8.5. But somehow I get a long vector of unrelated text instead.
Here is my code:
library(rvest)
library(dplyr)
library(RCurl)
library(tidyverse)

URL <- "https://pitchfork.com/reviews/albums/us-girls-heavy-light/"
webpage <- read_html(URL)

cat("Review Rating")
webpage %>%
  html_nodes("div span") %>%
  html_text()

We can get the relevant information from the div with class "score-circle".
library(rvest)
webpage %>% html_nodes('div.score-circle') %>% html_text() %>% as.numeric()
#[1] 8.5
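Class names like score-circle can change when a site is redesigned, so a slightly more defensive sketch (the CSS selector is the same assumption as above, and get_score is a hypothetical helper) checks that the node was actually found before converting:

```r
library(rvest)

# Hypothetical helper; the selector is the same assumption as above
get_score <- function(url) {
  node <- read_html(url) %>% html_node("div.score-circle")
  if (inherits(node, "xml_missing")) {
    warning("score node not found; the page layout may have changed")
    return(NA_real_)
  }
  node %>% html_text() %>% as.numeric()
}

get_score("https://pitchfork.com/reviews/albums/us-girls-heavy-light/")
```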

Related

Web-Scraping using R. I want to extract some table like data from a website

I'm having some problems scraping data from a website, and I do not have a lot of experience with web scraping. My plan is to scrape some data using R from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands
More precisely, I want to extract the brands on the right-hand side.
My idea so far:
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>%
  html_text()
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
That data is dynamically pulled from a script tag. You can pull the content of that script tag, parse it as JSON, subset just the items of interest from the returned list, and then extract the brand names:
library(rvest)
library(jsonlite)
library(purrr)    # needed for map() below
library(stringr)

data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_node('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json()

data <- data$props$pageProps$apolloState
mask <- map(names(data), str_detect, '^Brand:') %>% unlist()
data <- subset(data, mask)
brands <- lapply(data, function(x) x$name)
I find the above easier to read, but you could try other methods such as:
library(rvest)
library(jsonlite)
library(stringr)

brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_node('#__NEXT_DATA__') %>%
  html_text() %>%
  jsonlite::parse_json() %>%
  {.$props$pageProps$apolloState} %>%
  subset(., str_detect(names(.), 'Brand:')) %>%
  lapply(., function(x) x$name)
Using {} to have the call treated as an expression rather than a function is something I read in a comment by @asachet.
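To see what the braces do, here is a minimal sketch with a plain nested list standing in for the parsed JSON (hypothetical data): when the dot appears only in nested calls, magrittr still inserts the piped value as the first argument of the right-hand side, so a bare .$props$pageProps$apolloState would not work as intended; wrapping it in {} makes magrittr evaluate the whole block as an expression with . bound to the piped value.

```r
library(magrittr)

# Hypothetical stand-in for the parsed JSON structure
x <- list(props = list(pageProps = list(apolloState = list(a = 1, b = 2))))

# With braces, the block is evaluated as an expression with `.` = x
x %>% {.$props$pageProps$apolloState} %>% names()
# [1] "a" "b"
```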

Can't extract href link from html_node in rvest

When I use the rvest package with an XPath and try to get the embedded links (football team names) from the site, I get an empty result. Could someone help with this?
The code is as follows:
library(rvest)
url <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
xpath <- as.character('/html/body/div[2]/div[11]/div[1]/div[2]/div[2]/div')
url %>%
  html_node(xpath = xpath) %>%
  html_attr('href')
You can get all the links using:
library(rvest)
url <- 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'
url %>%
  read_html() %>%
  html_nodes('td.hauptlink a') %>%
  html_attr('href') %>%
  .[. != '#'] %>%
  paste0('https://www.transfermarkt.com', .) %>%
  unique() %>%
  head(20)
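If you also want the team names alongside the hrefs, the same selector can be reused, pairing html_text() with html_attr() (a sketch under the same selector assumption as the answer above; note the site may require a browser-like User-Agent header to respond):

```r
library(rvest)

page <- read_html('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
anchors <- page %>% html_nodes('td.hauptlink a')

teams <- data.frame(
  name = anchors %>% html_text(trim = TRUE),
  href = anchors %>% html_attr('href'),
  stringsAsFactors = FALSE
)
# Drop placeholder '#' links and duplicates, then build absolute URLs
teams <- unique(teams[teams$href != '#', ])
teams$href <- paste0('https://www.transfermarkt.com', teams$href)
head(teams, 20)
```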

Vivino - Scraping with R

I would like to scrape basic data about wines from Vivino. I have never done scraping before, but based on some tutorials and a lecture on DataCamp I tried basic code using the rvest library.
However, it does not seem to work and returns zero values.
Could anyone please tell me where the problem is? Is the code completely wrong and should I use some other method, or am I just missing something and doing it wrong?
Thank you in advance for any answers!
library(rvest)
library(dplyr)
url <- 'https://www.vivino.com/explore?e=eJwNybEOQDAQBuC3ubkG4z-abMQkIqdO00RbuTbF2_OtX1A0FHyEocAPWmPIvhh7suimga5_3YHK6qXwSWmDcvHR5ZWrKDuhhF2ypbvMC5oP96QajA%3D%3D&cart_item_source=nav-explore'
web <- read_html(url)
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
The page loads dynamically, which is why rvest alone will not work; you also need to use RSelenium.
Assuming you use Firefox, the following code should work:
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser="firefox", port=4546L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate(url)
# Scroll down a couple of times to reach the bottom of the page
# so that additional data load dynamically with each scroll.
# Here I scroll 4 times, but perhaps you will need much more than that.
for (i in 1:4) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
# close RSelenium
remDr$close()
gc()
rD$server$stop()
# Windows-only: kill the leftover Java process behind the Selenium server
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
# now we can go on to our rvest code and scrape the data
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()

How to retrieve a multiple tables from a webpage using R

I want to extract all the vaccine tables, with the description on the left and the details inside each table, using R.
This is the link for the webpage.
This is how the first table looks on the webpage:
I tried using the XML package, but I wasn't successful. I used:
vup <- readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which = 5)
I get an error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: ''
How to do this?
This webpage does not use tables, which is the reason for your error. Due to the multiple subsections and hidden text, the formatting of the page is quite complicated and requires finding the nodes of interest individually.
I prefer the "rvest" and "xml2" packages for their easier and more straightforward syntax.
This is not a complete solution, but it should get you moving in the correct direction.
library(rvest)
library(xml2)
library(dplyr)

page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/")

#find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()
#find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")
#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()
#get the vaccine type
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>%
  html_node('div.is_h3') %>% html_text()
#determine the number of vaccines in each category
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length() %>% sum()
#make a vector of the correct length
VaccineType <- rep(vaccinetypes, each = lengthvector)

answer <- data.frame(VaccineType, company, product, phase)
head(answer)
Generating this code involved reading the HTML and identifying the correct nodes and the unique attributes for the desired information.

More efficient way to scrape ratings values from TripAdvisor using R

I am trying to scrape the ratings from TripAdvisor. So far, I have managed to extract the HTML nodes, turn them into character strings, extract the substring that represents the rating, convert it to a number, and finally divide it by 10 to get the value actually displayed.
library(rvest)
url <- "https://www.tripadvisor.co.uk/Attraction_Review-g1466790-d547811-Reviews-Royal_Botanic_Gardens_Kew-Kew_Richmond_upon_Thames_Greater_London_England.html"
ratings_too_big <- url %>%
  read_html() %>%
  html_nodes("#REVIEWS .ui_bubble_rating") %>%
  as.character() %>%
  substr(38, 39) %>%
  as.numeric()
ratings_too_big / 10
This is without doubt very messy - what's a cleaner, more efficient way to do this? I have also tried Hadley Wickham's example shown here:
library(rvest)
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
  read_html() %>%
  html_nodes("#REVIEWS .innerBubble")
rating <- reviews %>%
  html_node(".rating .rating_s_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()
This was not successful, as no data was returned (there appears to be nothing matching the selector ".rating .rating_s_fill"). I am new to scraping and CSS identifiers, so apologies if the answer is obvious.
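One cleaner alternative, assuming the rating is encoded in the bubble element's class name (e.g. bubble_45 for 4.5, which is what the substr(38, 39) trick above relies on), is to read the class attribute and extract the digits instead of slicing a fixed substring:

```r
library(rvest)
library(stringr)

url <- "https://www.tripadvisor.co.uk/Attraction_Review-g1466790-d547811-Reviews-Royal_Botanic_Gardens_Kew-Kew_Richmond_upon_Thames_Greater_London_England.html"

ratings <- url %>%
  read_html() %>%
  html_nodes("#REVIEWS .ui_bubble_rating") %>%
  html_attr("class") %>%      # e.g. "ui_bubble_rating bubble_45" (assumed format)
  str_extract("\\d+") %>%     # keep just the digits
  as.numeric() / 10           # 45 -> 4.5
ratings
```

This avoids depending on the exact character positions of the serialized HTML, which can shift if attributes are reordered.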