How To Extract Name in this HTML Element using rvest - html

I've searched through many rvest scraping posts but can't find an example like mine. I'm following the R vignette example (https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/) for selectorgadget, but inputting my use case as necessary. None of selector gadget's suggestions get me what I need. I need to extract the name for each review on the page. A sample of what the name looks like under the hood is as follows:
<span itemprop="name" class="sg_selected">This Name</span>
Here's my code to this point. Ideally, this code should get me the individual names on this web page.
library(rvest)
library(dplyr)
dsa_reviews <-
read_html("https://www.directsalesaid.com/companies/traveling-
vineyard#reviews")
review_names <- html_nodes(dsa_reviews,'#reviews span')
df <- bind_rows(lapply(xml_attrs(review_names), function(x)
data.frame(as.list(x), stringsAsFactors=FALSE)))
Apologies if this is a duplicate question or if it's not formatted correctly. Please feel free to request any necessary edits.

Here it is :
library(rvest)
library(dplyr)
dsa_reviews <-
read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
html_nodes(dsa_reviews,'[itemprop=name]') %>%
html_text()
[1] "Traveling Vineyard" ""
[3] "Kiersten Ray-kuhn" "Miley Sama"
[5] " Nancy Shawtone " "Amanda Moore"
[7] "Matt" "Kathy Barzal"
[9] "Lesa Brinker" "Lori Stryker"
[11] "Jeanette Holtman" "Penny Notarnicola"
[13] "Laura Ann" "Nicole Lafave"
[15] "Gretchen Hess Miller" "Gina Devine"
[17] "Ashley Lawton Converse" "Morgan Williams"
[19] "Angela Baston Mckeone" "Traci Feshler"
[21] "Kisha Marshall Dlugos" "Jody Cole Dvorak"
Colin

Related

No connection in R

I've been trying to learn webscraping from an online course, and they give the following as an example
url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website<- read_html(url)
treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
treaties_links <-treaties_links[23:30]
treaties_links_full <- lapply(treaties_links, function(x) (paste("https://www.canada.ca",x,sep="")))
treaties_links_full[8] <-treaties_links[8]
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
When I get to this last line it returns an error
Error in open.connection(x, "rb") :
Could not resolve host: www.canada.cahttp
Your error is in your lapply() code. If you print treaties_links, you will see that they are not all internal links, i.e. links starting with /, and some are links to other domains:
print(treaties_links)
[1] "/en/employment-social-development/services/labour-relations/international/agreements/chile.html"
[2] "/en/employment-social-development/services/labour-relations/international/agreements/costa-rica.html"
[3] "/en/employment-social-development/services/labour-relations/international/agreements/peru.html"
[4] "/en/employment-social-development/services/labour-relations/international/agreements/colombia.html"
[5] "/en/employment-social-development/services/labour-relations/international/agreements/jordan.html"
[6] "/en/employment-social-development/services/labour-relations/international/agreements/panama.html"
[7] "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
[8] "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"
This means that when you are running paste("https://www.canada.ca",x,sep="") on e.g. link 7, you get:
"https://www.canada.cahttp://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
Assuming you want to keep that link you might change your lapply to:
treaties_links_full <- lapply(
treaties_links,
function(x) {
ifelse(
substr(x,1,1)=="/",
paste("https://www.canada.ca",x,sep=""),
x
)
}
)
This will only prepend "https://www.canada.ca" to the links within that domain.

How to correctly identify html node

I want to scrape the price of a product on a webshop, but I struggle to correctly allocate the correct nodes to the price i want to scrape.
The relevant part of my code looks like this:
"https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/"%>%
read_html()%>%
html_nodes('span.woocommerce-Price-amount.amount')%>%
html_text()
When executing this code, I do get prices as a result, but not the ones i want (it shows the prices of other produts that are listed beneath.
How can I now correctly identify the node to the price of the product itself (375.-)
First: I don't know R.
This page uses JavaScript to add this price in HTML
but I don't know if rvest can run JavaScript.
But I found this value in <form data-product_variations="..."> as JSON
and I could display prices for all options:
data <- "https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
read_html() %>%
html_nodes('form.variations_form.cart') %>%
html_attr('data-product_variations') %>%
fromJSON
data$display_price
data$regular_price
data$image$title
Result:
> data$display_price
[1] 479 375 439 479 479
> data$display_regular_price
[1] 699 549 629 699 699
> data$image$title
[1] "aqua marina fusion bamboo padddel"
[2] "aqua marina fusion aluminium padddel"
[3] "aqua marina fusion carbon padddel"
[4] "aqua marina fusion hibi padddel"
[5] "aqua marina fusion silver padddel"
> colnames(data)
[1] "attributes" "availability_html" "backorders_allowed"
[4] "dimensions" "dimensions_html" "display_price"
[7] "display_regular_price" "image" "image_id"
[10] "is_downloadable" "is_in_stock" "is_purchasable"
[13] "is_sold_individually" "is_virtual" "max_qty"
[16] "min_qty" "price_html" "sku"
[19] "variation_description" "variation_id" "variation_is_active"
[22] "variation_is_visible" "weight" "weight_html"
[25] "is_bookable" "number_of_dates" "your_discount"
[28] "gtin" "your_delivery"
EDIT:
To work with page which uses JavaScript you may need other tools - like phantomjs
How to Scrape Data from a JavaScript Website with R | R-bloggers

rvest returns empty list for html table

I really can't get my head around this problem and I would be grateful for any piece of advice that you could give me.
I am trying to scrape the Bitcoin implied volatility index (BitVol) on this website:
https://t3index.com/indices/bit-vol/
It is possible to show the raw values in the Chart via this button and by clicking on "View data table":
The id of the relevant html table is "highcharts-data-table-1":
I have used the rvest package to scrape this table. This is what I got so far:
library(rvest)
library(tidyverse)
url5 <- "https://t3index.com/indices/bit-vol/"
output <- url(url5) %>%
read_html() %>%
html_nodes(xpath='//*[#id="highcharts-data-table-1"]//table[1]') %>%
html_table()
The code runs smoothly without returning any errors but still the query returns an empty list in the variable output despite the fact that I have followed the recommendations in this article as well:
rvest returning empty list
This is the current R Version that I am using:
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "4"
$minor
[1] "0.3"
$year
[1] "2020"
$month
[1] "10"
$day
[1] "10"
$`svn rev`
[1] "79318"
$language
[1] "R"
$version.string
[1] "R version 4.0.3 (2020-10-10)"
$nickname
[1] "Bunny-Wunnies Freak Out"
Any help would be highly appreciated!

Get nodes from a html webpage to crawl URLs using R

https://i.stack.imgur.com/xeczg.png
I am trying to get the URLs under the node '.2lines' from the webpage 'https://www.sgcarmart.com/main/index.php'
library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()
Which I receive an error for html_nodes function:
Error in parse_simple_selector(stream) :
Expected selector, got <NUMBER '.2' at 1>
How do I get around this error?
You can use an xpath selector to find the nodes you want. The links are actually contained in <a> tags within the <p> tags you are trying to reference by class. You can access them in a single xpath:
library(rvest)
site <- 'https://www.sgcarmart.com'
urls <- site %>%
paste0("/main/index.php") %>%
read_html() %>%
html_nodes(xpath = "//*[#class = '2lines']/a") %>%
html_attr("href") %>%
{paste0(site, .)}
urls
#> [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#> [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#> [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#> [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#> [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#> [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#> [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#> [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#> [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"

Turning a table in HTML into a data frame

I'm trying my hand at scraping tables from Wikipedia and I'm reaching an impasse. I'm using the squads of the FIFA 2014 World Cup as an example. In this case, I want to extract the list of the participating countries from the table of the contents from the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
html_node(xpath = '//*[#id="toc"]/ul') %>%
htmlTreeParse() %>%
xmlRoot())
This spits out a bunch of HTML code that I won't copy/paste here. I specifically am looking to extract all lines with the tag <span class="toctext"> such as "Group A", "Brazil", "Cameroon", etc. and have them saved as a vector. What function would make this happen?
You can read the text from a node using html_text()
url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
read_html() %>%
html_node(xpath = '//*[#id="toc"]') %>%
html_text()
This gives you a single character vector. You can then split on the \n character to give you the results as a vector (and you can clean out the blanks)
contents <- strsplit(toc, "\n")[[1]]
contents[contents != ""]
# [1] "Contents" "1 Group A" "1.1 Brazil"
# [4] "1.2 Cameroon" "1.3 Croatia" "1.4 Mexico"
# [7] "2 Group B" "2.1 Australia" "2.2 Chile"
# [10] "2.3 Netherlands" "2.4 Spain" "3 Group C"
# [13] "3.1 Colombia" "3.2 Greece" "3.3 Ivory Coast"
# [16] "3.4 Japan" "4 Group D" "4.1 Costa Rica"
# [19] "4.2 England" "4.3 Italy" "4.4 Uruguay"
# ---
# etc
Generally, to read tables in an html document you can use the html_table() function, but in this case the table of contents isn't read.
url %>%
read_html() %>%
html_table()