I've searched through many rvest scraping posts but can't find an example like mine. I'm following the rvest vignette example (https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/) for SelectorGadget, adapting it to my use case as necessary, but none of SelectorGadget's suggestions get me what I need. I need to extract the name for each review on the page. Under the hood, a sample name looks like this:
<span itemprop="name" class="sg_selected">This Name</span>
Here's my code to this point. Ideally, this code should get me the individual names on this web page.
library(rvest)
library(dplyr)
dsa_reviews <-
  read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
review_names <- html_nodes(dsa_reviews,'#reviews span')
df <- bind_rows(lapply(xml_attrs(review_names), function(x)
data.frame(as.list(x), stringsAsFactors=FALSE)))
Apologies if this is a duplicate question or if it's not formatted correctly. Please feel free to request any necessary edits.
Here it is:
library(rvest)
library(dplyr)
dsa_reviews <-
read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
html_nodes(dsa_reviews,'[itemprop=name]') %>%
html_text()
[1] "Traveling Vineyard" ""
[3] "Kiersten Ray-kuhn" "Miley Sama"
[5] " Nancy Shawtone " "Amanda Moore"
[7] "Matt" "Kathy Barzal"
[9] "Lesa Brinker" "Lori Stryker"
[11] "Jeanette Holtman" "Penny Notarnicola"
[13] "Laura Ann" "Nicole Lafave"
[15] "Gretchen Hess Miller" "Gina Devine"
[17] "Ashley Lawton Converse" "Morgan Williams"
[19] "Angela Baston Mckeone" "Traci Feshler"
[21] "Kisha Marshall Dlugos" "Jody Cole Dvorak"
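Note that the first two entries are the company name and an empty node, and some names carry stray whitespace. A small base-R cleanup sketch; the vector is hard-coded here from the output above, but in practice it would be the `html_text()` result:

```r
# raw output of html_text() on the [itemprop=name] nodes (abridged sample)
names_raw <- c("Traveling Vineyard", "", "Kiersten Ray-kuhn", " Nancy Shawtone ")

names_clean <- trimws(names_raw)               # strip stray whitespace
names_clean <- names_clean[names_clean != ""]  # drop empty matches
names_clean <- names_clean[-1]                 # drop the company name itself
names_clean
#> [1] "Kiersten Ray-kuhn" "Nancy Shawtone"
```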
Colin
I've been trying to learn web scraping from an online course, and it gives the following as an example:
url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website<- read_html(url)
treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
treaties_links <- treaties_links[23:30]
treaties_links_full <- lapply(treaties_links, function(x) paste("https://www.canada.ca", x, sep = ""))
treaties_links_full[8] <- treaties_links[8]
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
When I get to this last line, it returns an error:
Error in open.connection(x, "rb") :
Could not resolve host: www.canada.cahttp
Your error is in your lapply() code. If you print treaties_links, you will see that not all of them are internal links (i.e. links starting with /); some are links to other domains:
print(treaties_links)
[1] "/en/employment-social-development/services/labour-relations/international/agreements/chile.html"
[2] "/en/employment-social-development/services/labour-relations/international/agreements/costa-rica.html"
[3] "/en/employment-social-development/services/labour-relations/international/agreements/peru.html"
[4] "/en/employment-social-development/services/labour-relations/international/agreements/colombia.html"
[5] "/en/employment-social-development/services/labour-relations/international/agreements/jordan.html"
[6] "/en/employment-social-development/services/labour-relations/international/agreements/panama.html"
[7] "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
[8] "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"
This means that when you are running paste("https://www.canada.ca",x,sep="") on e.g. link 7, you get:
"https://www.canada.cahttp://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
Assuming you want to keep that link you might change your lapply to:
treaties_links_full <- lapply(
  treaties_links,
  function(x) {
    ifelse(
      substr(x, 1, 1) == "/",
      paste("https://www.canada.ca", x, sep = ""),
      x
    )
  }
)
This will only prepend "https://www.canada.ca" to the links within that domain.
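Since startsWith() and paste0() are vectorized, the same fix can also be written without lapply() at all. A base-R sketch using two of the links printed above:

```r
base_url <- "https://www.canada.ca"
links <- c(
  "/en/employment-social-development/services/labour-relations/international/agreements/chile.html",
  "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
)

# prepend the domain only to site-internal links (those starting with "/")
full <- ifelse(startsWith(links, "/"), paste0(base_url, links), links)
```

This returns a character vector directly instead of a list, which is usually more convenient for a later read_html() loop.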
I want to scrape the price of a product on a web shop, but I'm struggling to identify the correct node for the price I want to scrape.
The relevant part of my code looks like this:
"https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/"%>%
read_html()%>%
html_nodes('span.woocommerce-Price-amount.amount')%>%
html_text()
When executing this code, I do get prices as a result, but not the ones I want (it shows the prices of other products that are listed beneath).
How can I correctly identify the node for the price of the product itself (375.-)?
First: I don't know R.
This page uses JavaScript to add the price to the HTML, and I don't know whether rvest can run JavaScript.
However, I found the value as JSON in the <form data-product_variations="..."> attribute, and with it I could display the prices for all options:
library(jsonlite)  # for fromJSON()

data <- "https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
  read_html() %>%
  html_nodes('form.variations_form.cart') %>%
  html_attr('data-product_variations') %>%
  fromJSON()
data$display_price
data$display_regular_price
data$image$title
Result:
> data$display_price
[1] 479 375 439 479 479
> data$display_regular_price
[1] 699 549 629 699 699
> data$image$title
[1] "aqua marina fusion bamboo padddel"
[2] "aqua marina fusion aluminium padddel"
[3] "aqua marina fusion carbon padddel"
[4] "aqua marina fusion hibi padddel"
[5] "aqua marina fusion silver padddel"
> colnames(data)
[1] "attributes" "availability_html" "backorders_allowed"
[4] "dimensions" "dimensions_html" "display_price"
[7] "display_regular_price" "image" "image_id"
[10] "is_downloadable" "is_in_stock" "is_purchasable"
[13] "is_sold_individually" "is_virtual" "max_qty"
[16] "min_qty" "price_html" "sku"
[19] "variation_description" "variation_id" "variation_is_active"
[22] "variation_is_visible" "weight" "weight_html"
[25] "is_bookable" "number_of_dates" "your_discount"
[28] "gtin" "your_delivery"
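To work with the variations side by side, the parallel vectors can be combined into a data frame. A sketch using the values printed above (hard-coded here for illustration; in practice they come straight from data):

```r
titles <- c("aqua marina fusion bamboo padddel",
            "aqua marina fusion aluminium padddel",
            "aqua marina fusion carbon padddel",
            "aqua marina fusion hibi padddel",
            "aqua marina fusion silver padddel")

prices <- data.frame(
  option  = titles,
  price   = c(479, 375, 439, 479, 479),
  regular = c(699, 549, 629, 699, 699)
)

# e.g. the cheapest variation, the 375.- option the question asks about
prices$option[which.min(prices$price)]
#> [1] "aqua marina fusion aluminium padddel"
```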
EDIT:
To work with pages that use JavaScript, you may need other tools, such as phantomjs. See: How to Scrape Data from a JavaScript Website with R | R-bloggers
I really can't get my head around this problem and I would be grateful for any piece of advice that you could give me.
I am trying to scrape the Bitcoin implied volatility index (BitVol) on this website:
https://t3index.com/indices/bit-vol/
It is possible to show the raw values in the chart by clicking on "View data table".
The id of the relevant HTML table is "highcharts-data-table-1".
I have used the rvest package to scrape this table. This is what I got so far:
library(rvest)
library(tidyverse)
url5 <- "https://t3index.com/indices/bit-vol/"
output <- url(url5) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="highcharts-data-table-1"]//table[1]') %>%
  html_table()
The code runs smoothly without returning any errors, but the query still returns an empty list in the variable output, despite the fact that I have followed the recommendations in this article as well:
rvest returning empty list
This is the current R Version that I am using:
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "4"
$minor
[1] "0.3"
$year
[1] "2020"
$month
[1] "10"
$day
[1] "10"
$`svn rev`
[1] "79318"
$language
[1] "R"
$version.string
[1] "R version 4.0.3 (2020-10-10)"
$nickname
[1] "Bunny-Wunnies Freak Out"
Any help would be highly appreciated!
I am trying to get the URLs under the node '.2lines' from the webpage 'https://www.sgcarmart.com/main/index.php'
library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()
This gives an error from the html_nodes() function:
Error in parse_simple_selector(stream) :
Expected selector, got <NUMBER '.2' at 1>
How do I get around this error?
You can use an xpath selector to find the nodes you want. The links are actually contained in <a> tags within the <p> tags you are trying to reference by class. You can access them in a single xpath:
library(rvest)
site <- 'https://www.sgcarmart.com'
urls <- site %>%
  paste0("/main/index.php") %>%
  read_html() %>%
  html_nodes(xpath = "//*[@class = '2lines']/a") %>%
  html_attr("href") %>%
  {paste0(site, .)}
urls
#> [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#> [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#> [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#> [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#> [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#> [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#> [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#> [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#> [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
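If you prefer to stay with CSS, an attribute selector sidesteps the parser error, since a class name beginning with a digit is not valid in a plain .class selector. A sketch on an inline snippet with hypothetical hrefs (the same selector works on the live page, though note that [class='2lines'] matches the exact attribute value, not "has this class among others"):

```r
library(rvest)

# toy markup mimicking the page structure; hrefs are made up for illustration
page <- read_html('<p class="2lines"><a href="/new_cars/a.php">A</a></p>
                   <p class="other"><a href="/new_cars/b.php">B</a></p>')

hrefs <- page %>%
  html_nodes("[class='2lines'] a") %>%  # attribute selector: no leading digit to parse
  html_attr("href")
hrefs
#> [1] "/new_cars/a.php"
```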
I'm trying my hand at scraping tables from Wikipedia and I'm reaching an impasse. I'm using the squads of the FIFA 2014 World Cup as an example. In this case, I want to extract the list of the participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
  html_node(xpath = '//*[@id="toc"]/ul') %>%
  htmlTreeParse() %>%
  xmlRoot())
This spits out a bunch of HTML code that I won't copy/paste here. I am specifically looking to extract all lines with the tag <span class="toctext"> such as "Group A", "Brazil", "Cameroon", etc. and have them saved as a vector. What function would make this happen?
You can read the text from a node using html_text()
url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
  read_html() %>%
  html_node(xpath = '//*[@id="toc"]') %>%
  html_text()
This gives you a single character vector. You can then split on the \n character to give you the results as a vector (and you can clean out the blanks)
contents <- strsplit(toc, "\n")[[1]]
contents[contents != ""]
# [1] "Contents" "1 Group A" "1.1 Brazil"
# [4] "1.2 Cameroon" "1.3 Croatia" "1.4 Mexico"
# [7] "2 Group B" "2.1 Australia" "2.2 Chile"
# [10] "2.3 Netherlands" "2.4 Spain" "3 Group C"
# [13] "3.1 Colombia" "3.2 Greece" "3.3 Ivory Coast"
# [16] "3.4 Japan" "4 Group D" "4.1 Costa Rica"
# [19] "4.2 England" "4.3 Italy" "4.4 Uruguay"
# ---
# etc
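Alternatively, if you only want the text inside the <span class="toctext"> nodes the question points at, you can target them directly with XPath instead of splitting strings. A sketch on a minimal TOC-like snippet (the same html_nodes() call applies to the live page):

```r
library(rvest)

# toy markup mimicking Wikipedia's table of contents
toc <- read_html('<div id="toc"><ul>
  <li><span class="tocnumber">1</span> <span class="toctext">Group A</span>
    <ul><li><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></li></ul>
  </li>
</ul></div>')

entries <- toc %>%
  html_nodes(xpath = '//*[@id="toc"]//span[@class="toctext"]') %>%
  html_text()
entries
#> [1] "Group A" "Brazil"
```

This skips the section numbers entirely, so no post-hoc cleaning of "1.1" prefixes is needed.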
Generally, to read tables in an HTML document you can use the html_table() function, but in this case the table of contents isn't read as a table.
url %>%
read_html() %>%
html_table()
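For reference, html_table() does handle regular <table> markup, such as the squad tables further down that page. A minimal inline example with toy data:

```r
library(rvest)

# toy table with a header row, standing in for one squad table
page <- read_html('<table>
  <tr><th>Group</th><th>Country</th></tr>
  <tr><td>A</td><td>Brazil</td></tr>
  <tr><td>A</td><td>Cameroon</td></tr>
</table>')

tbl <- html_table(page)[[1]]  # list of tables; take the first
tbl$Country
#> [1] "Brazil"   "Cameroon"
```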