Using rvest to scrape GoodReads pages - html

I'm trying to scrape the average rating and number of reviews from Goodreads, but I get an NA result. Why is this?
SelectorGadget finds "span span" for the average rating on hover, but reports "no valid path found" at the bottom.
The same method works fine on other sites (e.g. IMDb, theatlantic.com).
Here's my code and the result (I've also tried replacing html_text with html_tag):
Rating <- read_html("http://www.goodreads.com/book/show/22444789-delicious-foods")
Rating %>%
  html_node("span span") %>%
  html_text() %>%
  as.numeric()
[1] NA
Warning message:
In function_list[[k]](value) : NAs introduced by coercion

I didn't have any success using SelectorGadget with the Goodreads site either, but sometimes you just have to look at the HTML source and find what you're looking for that way.
In this case, you can use the .average class selector:
Rating %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()
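Note that html() has since been deprecated in rvest in favour of read_html(). The class-selector mechanics can also be checked offline against a minimal stand-in document (the markup below is illustrative only, not GoodReads' actual page structure):

```r
library(rvest)

# A stripped-down stand-in for the book-meta markup (hypothetical;
# the live page is far more complex and may have changed)
doc <- minimal_html('
  <div id="bookMeta">
    <span class="average">3.85</span>
    <span class="votes">12345 ratings</span>
  </div>')

# The same pipeline as the answer above, run against the stand-in
doc %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()
#> [1] 3.85
```

This also shows why "span span" fails: the selector matches a span nested inside another span, which is not how this markup is structured.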

Related

Parsing rvest output from an unstructured infobox

I am attempting to extract data from a wiki fandom website using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Please see below for my attempts at dealing with this issue:
library(tidyverse)
library(data.table)
library(rvest)
library(httr)
url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")
# See here that the infobox information does not appear when checking for HTML tables on the page
df <- read_html(url) %>%
  html_table()
# So now just extract the data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()
The second attempt does succeed at extracting the raw data, but it comes back in a form that is not easy to shape into a clean dataframe. So I then attempted to extract each element of the table individually, which might be easier to clean and structure into a dataframe. However, when I attempt to do so using XPath, I get an empty result:
df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2()
So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a dataframe-friendly format? If not, could someone point me towards why my attempt to extract each panel individually is not working?
If you target the div.pi-data nodes directly, you could do something like this:
bind_rows(
  read_html(url) %>%
    rvest::html_nodes("div.pi-data") %>%
    map(.f = ~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      text  = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
    ) %>% unnest(text))
)
Output:
# A tibble: 29 x 2
label text
<chr> <chr>
1 Homeworld Tatooine[1]
2 Born 41 BBY,[2] Tatooine[3]
3 Died 4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
4 Species Human[1]
5 Gender Male[1]
6 Height 1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]
7 Mass 120 kilograms in armor[7]
8 Hair color Blond,[8] light[9] and dark[10]
9 Eye color Blue,[11] later yellow (dark side)[12]
10 Skin color Light,[11] later pale[5]
# ... with 19 more rows
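If a one-row layout is more convenient than the long label/text pairs, the tibble above can be reshaped with tidyr::pivot_wider. A sketch, using a small hand-built stand-in for the scraped tibble (the labels and the collapse separator are illustrative; the real result has 29 rows and repeated labels for multi-valued fields):

```r
library(tidyverse)

# Stand-in for the long label/text tibble produced above
infobox <- tibble(
  label = c("Homeworld", "Species", "Hair color", "Hair color"),
  text  = c("Tatooine[1]", "Human[1]", "Blond[8]", "dark[10]")
)

infobox %>%
  # Collapse multi-valued fields so each label maps to one cell
  group_by(label) %>%
  summarise(text = paste(text, collapse = "; "), .groups = "drop") %>%
  pivot_wider(names_from = label, values_from = text)
#> # A tibble: 1 x 3
```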

How to read meta data from html with R

I got an R script from a colleague, but it is not entirely working. Its intention is to read a price for a product from a website.
The code is as follows:
vec_tectake <- try(paste0('https://www.tectake.ch/de/', j) %>%
  read_html() %>%
  html_nodes('[itemprop="price"]') %>%
  html_attr('content'))
To give an example of a full link, "j" could be "rudergerat-mit-trainingscomputer-401074".
After running the code, the vec_tectake I get is empty.
Now I'm not really sure why, as the same code has worked on another webpage. Could it be because the price is marked as "meta content"?
Thanks for your help.
The thing is, the price is in its own <span> tag, not within the attributes:
library(rvest)
j <- "rudergerat-mit-trainingscomputer-401074"
read_html(paste0('https://www.tectake.ch/de/', j)) |>
  html_nodes(".price") |>
  html_text()
#> [1] "CHF 256.00"
Created on 2022-03-07 by the reprex package (v2.0.1)
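The returned string still carries the currency prefix; if a numeric value is needed downstream, readr::parse_number() strips it. A small sketch using the string from the output above:

```r
library(readr)

# parse_number() drops the non-numeric prefix and parses the rest
parse_number("CHF 256.00")
#> [1] 256
```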

"rvest" not fetching the product details using html_nodes()

I was using rvest to scrape product details (names, price, and availability) from Amazon's product search results. I was able to fetch the webpage with read_html(), but I am not able to fetch the details of the products. The page has a <span> tag with class = "a-size-medium a-color-base a-text-normal". I used html_nodes("span.a-size-medium a-color-base a-text-normal"), but got NA.
Here is the reproducible code:
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
data <- page %>%
  html_node("span.a-size-medium a-color-base a-text-normal") %>%
  html_text()
print(data)
You just need to change the CSS selector a little bit: a compound class selector joins the classes with dots and no spaces, whereas spaces mean descendant elements. I was able to get the names and the prices; the availability was a little bit trickier :/
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
name <- page %>% html_nodes(".a-size-medium.a-color-base.a-text-normal") %>% html_text()
price <- page %>% html_nodes(".a-price-whole") %>% html_text()
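One caveat when combining the two vectors into a data frame: products that lack a price are simply absent from the price vector, so the lengths can differ on the live page. A defensive sketch (the helper function and warning message are my own, not from the answer; selectors as above):

```r
library(rvest)
library(tibble)

# Hypothetical helper: build a tibble only when the fields line up
scrape_results <- function(page) {
  name  <- page %>% html_nodes(".a-size-medium.a-color-base.a-text-normal") %>% html_text()
  price <- page %>% html_nodes(".a-price-whole") %>% html_text()
  if (length(name) != length(price)) {
    warning("name/price lengths differ; some products are missing a field")
    return(NULL)
  }
  tibble(name = name, price = price)
}
```

A more robust pattern is to select each product container first and extract the fields per container, so missing values stay aligned with their product.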

Extracting text/number from tspan class tag HTML with R

I am trying to extract the current production number from this website, http://okg.se/sv/Produktionsinformation/ (shown in a blue gauge on the page).
Here is the HTML code part I need to use:
<tspan dy="0" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">518</tspan>
Here is an example of the code I used:
url <- "http://okg.se/sv/Produktionsinformation//"
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")
content %>% html_nodes(".content__info__item__value")
But the result shows that there are no nodes available:
{xml_nodeset (0)}
Do you have any ideas on how to solve this issue?
Thanks in advance!
I am not entirely sure which value you need, but this works:
library(rvest)
# page url
url <- "http://okg.se/sv/Produktionsinformation/"
# current value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-current")
# max value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-max")
The HTML you see in your browser has been processed by JavaScript, so it isn't the same as the HTML you see with rvest.
The raw data you are looking for is actually stored in attributes of a div with the id "gauge", so you get it like this:
library(rvest)
#> Loading required package: xml2
"http://okg.se/sv/Produktionsinformation//" %>%
  read_html() %>%
  html_node("#gauge") %>%
  html_attrs() %>%
  `[`(c("data-current", "data-max"))
#> data-current data-max
#> "553" "1450"
Note that you don't need to save the HTML to your local drive to process it; you can read it directly from the internet by passing the URL to read_html().
Created on 2020-02-20 by the reprex package (v0.3.0)
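The attribute pipeline can be tried offline against a minimal stand-in of the #gauge div (attribute values copied from the answer's output above; the live values will differ), and the named character vector it returns converts straight to numeric:

```r
library(rvest)

# Offline stand-in for the div described above
doc <- minimal_html('<div id="gauge" data-current="553" data-max="1450"></div>')

doc %>%
  html_node("#gauge") %>%
  html_attrs() %>%
  `[`(c("data-current", "data-max")) %>%
  as.numeric()
#> [1]  553 1450
```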

How to get values of select menu?

I'm trying to get the values (all regions) of the select menu on this webpage. What am I doing wrong? I've tried almost every combination, but the result is empty. One of them is:
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes("div") %>%
  html_nodes("span") %>%
  html_nodes(xpath = '//*[@id="select2-ys-areaSelector-container"]') %>%
  html_attr("title")
Thanks in advance.
XPath is kind of an ugly beast. Get the id of the select element, then get all option groups, and finally get their text data. Use html_text() to convert the result to an R character vector.
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes(xpath = '//*[@id="ys-areaSelector"]/optgroup/*/text()') %>%
  html_text()
I'd use a CSS selector combination, assuming all option values are wanted:
library(rvest)
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
options <- page %>%
  html_nodes('#ys-areaSelector [data-url]') %>%
  html_text()
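Both approaches can be compared offline against a minimal stand-in for the select markup. The structure below is assumed from the selectors in the answers (ids and data-url attributes as named there; the region names are invented placeholders):

```r
library(rvest)

doc <- minimal_html('
  <select id="ys-areaSelector">
    <optgroup label="European side">
      <option data-url="/besiktas">Besiktas</option>
      <option data-url="/sisli">Sisli</option>
    </optgroup>
    <optgroup label="Asian side">
      <option data-url="/kadikoy">Kadikoy</option>
    </optgroup>
  </select>')

# CSS: any descendant of the select carrying a data-url attribute
doc %>%
  html_nodes("#ys-areaSelector [data-url]") %>%
  html_text()

# XPath equivalent, drilling through the optgroups explicitly
doc %>%
  html_nodes(xpath = '//*[@id="ys-areaSelector"]/optgroup/*/text()') %>%
  html_text()
```

Both pipelines return the three region names; the CSS version is shorter and easier to maintain.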