web-scraping: how to include quote character in HTML node - html

I am using the rvest package to scrape information from a website. Some of the information I need belongs to the class iinfo" (note the literal quote character in the class name). Unfortunately, if I use this string inside the function html_nodes() I get the following error:
Error in parse_simple_selector(stream) :
Expected selector, got <STRING '' at 7>
Here's a reprex:
library(rvest)
library(xml2)
webpage <- read_html(x = paste0("https://www.gstsvs.ch/fr/trouver-un-medecin-veterinaire.html?tx_datapool_pi1%5Bhauptgebiet%5D=3&tx_datapool_pi1%5Bmapsearch%5D=cercare&tx_datapool_pi1%5BdoSearch%5D=1&tx_datapool_pi1%5Bpointer2303%5D=",
0))
webpage_address <- webpage %>%
  html_nodes('.iinfo"') %>%
  html_text() %>%
  gsub(pattern = "\r|\t|\n",
       replacement = " ")
That class marks the addresses listed inside every box on the website. You can find it if you inspect the page structure in the browser and navigate to one of those boxes: when you select the address division with the mouse, a flag reading div.iinfo\" appears.
Thanks a lot for your help!

The CSS selector parser chokes on the literal quote character, but an XPath string literal can carry it:
webpage_address <- webpage %>%
  html_nodes(xpath = "//*[@class='iinfo\"']") %>%
  html_text(trim = TRUE)
Result:
> webpage_address
[1] "Anne-Françoise HenchozEnvers 412400 Le Locle, NE"
[2] "Téléphone: 032 931 10 10Urgences: 032 931 10 10Fax: 032 931 36 10afhenchoz(at)bluewin.chafhenchoz.com"
[3] "Ursi Dommann ScheuberHauptstrasse 156222 Gunzwil, LU"
[4] "Téléphone: 041 930 14 44tiergesundheit(at)bluewin.ch"
[5] "Dr. Med. Vet. Anne KramerBaggwilgraben 33267 Seedorf, BE"
[6] "Téléphone: 079 154 70 15anne(at)alpakavet.chwww.alpakavet.ch"
[7] "Dr. med. vet. Andrea FeistAdelbodenstrasse 103714 Frutigen, BE"
[8] "Téléphone: 033 671 15 60Urgences: 033 671 15 60Fax: 033 671 86 60alpinvet(at)bluewin.chwww.alpinvet.ch"
[9] "Dr. med. vet. Peter KürsteinerAlpsteinstr. 289240 Uzwil, SG"
[10] "Téléphone: 071 951 85 44"
[11] "Kathrin Urscheler-Hollenstein, Eveline Muhl-ZollingerSchaffhauserstrasse 2458222 Beringen, SH"
[12] "Téléphone: 052 685 20 20Fax: 052 685 34 20praxis(at)tieraerzte-team.chwww.tieraerzte-team.ch"
[13] "Dr. med. vet. Erwin VincenzVia Santeri 127130 Ilanz, GR"
[14] "Téléphone: 081/925 23 23Urgences: 081/925 23 23Fax: 081/925 23 25info(at)anima-veterinari.ch"
[15] "Dr. Zlatko MarinovicMühlerain 3853855072 oeschgen, AG"
[16] "Téléphone: 49628715060Urgences: 49628715060Fax: 49628712439z.marin(at)sunrise.ch"
[17] "Manser ChläusSchwalbenweg 73186 Düdingen, FR"
[18] "Téléphone: 026 493 10 60animans.tierarzt(at)gmail.com"
[19] "W.A.GeesBrünigstrasse 38aHauptstrasse 100, 3855 Brienz3860 Meiringen, BE"
[20] "Téléphone: 033 / 971 60 42Urgences: 033 / 971 60 42Fax: 033 / 971 01 50info(at)tierarzt-meiringen.chanisano.ch"

How to scrape span info using Rvest in R

Usually when scraping websites, I use "SelectorGadget". If not, I have to inspect elements on the page.
However, I am running into a bit of trouble when trying to scrape this one website.
The HTML looks like this:
<div class="col-span-2 mt-16 sm:mt-4 flex justify-between sm:block space-x-12 font-bold"><span>103 m²</span><span>8 650 000 kr</span></div>
Elements that I want:
<span>103 m²</span>
<span>8 650 000 kr</span>
They look like this:
103 m²
8 650 000 kr
My simple R code:
library(rvest)

# The URL (page number filled in via sprintf)
url <- "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"
page_outside <- read_html(sprintf(url, 1))
element_1 <- page_outside %>% html_nodes("x") %>% html_text()  # "x" = the selector I cannot find
Anyone got any tips or ideas on how I can access these?
thanks!
Here is a possibility: parse out the span nodes under a div with class "justify-between".
url = "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"
page_outside <- read_html(sprintf(url,1))
element_1 <- page_outside %>% html_elements("div.justify-between span")
element_1
{xml_nodeset (100)}
[1] <span>47 m²</span>
[2] <span>3 250 000 kr</span>
[3] <span>102 m²</span>
[4] <span>2 400 000 kr</span>
[5] <span>100 m²</span>
[6] <span>10 000 000 kr</span>
[7] <span>122 m²</span>
[8] <span>9 950 000 kr</span>
[9] <span>90 m²</span>
[10] <span>4 790 000 kr</span>
...
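If every listing carries both values, the flat node set can be reshaped into an area/price table. A sketch assuming a strict area-then-price order:
values <- element_1 %>% html_text()
# Odd positions hold areas, even positions hold prices
listings <- data.frame(
  area  = values[seq(1, length(values), by = 2)],
  price = values[seq(2, length(values), by = 2)]
)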
Update
If there is some missing data, then a slightly longer solution is needed to track which element is missing:
divs <- page_outside %>% html_elements("div.justify-between")
answer <- lapply(divs, function(node) {
  values <- node %>% html_elements("span") %>% html_text()
  if (length(values) == 2) {
    # both area and price are present
    results <- t(values)
  } else if (grepl("kr", values)) {
    # only the price is present
    results <- c(NA, values)
  } else {
    # only the area is present
    results <- c(values, NA)
  }
  results
})
answer <- do.call(rbind, answer)
answer
[,1] [,2]
[1,] "87 m²" "2 790 000 kr"
[2,] "124 m²" "5 450 000 kr"
[3,] "105 m²" "4 500 000 kr"
[4,] "134 m²" "1 500 000 kr"

Recognizing and Keeping Elements Containing Certain Patterns in a List

I want to try and web scrape my own Stack Overflow profiles! By this I mean, get an HTML link to every question I have ever asked:
https://stackoverflow.com/users/18181916/antonoyaro8
https://math.stackexchange.com/users/1024449/antonoyaro8
I tried to do this as follows:
library(rvest)
library(httr)
library(XML)

url <- "https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest"
page <- read_html(url)
resource <- GET(url)
parse <- htmlParse(resource)
links <- list(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
I tried to pick up on a pattern and noticed that all question links contain a number, so I wrote code that checks whether elements in the list contain a number and keeps those links:
rv <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
final <- unique(grep(paste(rv, collapse = "|"),
                     links, value = TRUE))
But I don't think I am doing this correctly: apart from the messy formatting, the final object returns links that do not contain any numbers at all.
Can someone please show me how to scrape these links properly, and then repeat this for all pages (e.g. https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=2, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=3)?
If worst comes to worst, and I can only do it for one of these pages, I can manually copy/paste the code for each page and proceed that way.
Thank you!
The output is a list of length 1. We need to extract ([[) the element before applying grep:
unique(grep(paste(rv, collapse = "|"),
            links[[1]], value = TRUE))
Note that rv includes the digits 0 to 9, so it matches any link with a digit anywhere in it. If the intention is to keep only question links, match on the digits following questions/:
grep("questions/\\d+", links[[1]], value = TRUE)
-output
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
...
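As a side note, the same extraction works with rvest alone, without httr and XML. A sketch reusing the page object from read_html(url) above:
# Pull every href attribute and keep only question links
links <- page %>% html_elements("a") %>% html_attr("href")
grep("questions/\\d+", links, value = TRUE)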
If there are multiple pages, append the page= parameter with paste or sprintf:
urls <- c(url, sprintf("%s&page=%d", url, 2:3))
out_lst <- lapply(urls, function(url) {
  resource <- GET(url)
  parse <- htmlParse(resource)
  links <- list(xpathSApply(parse, path = "//a", xmlGetAttr, "href"))
  grep("questions/\\d+", links[[1]], value = TRUE)
})
-output
> out_lst
[[1]]
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
[14] "/questions/72710448/removing-files-from-global-environment-with-a-certain-pattern"
[15] "/questions/72710203/r-sql-is-the-default-option-sampling-with-replacement"
[16] "/questions/72695401/allocating-max-memory-in-r"
[17] "/questions/72681898/randomly-delete-columns-from-datasets"
[18] "/questions/72663516/are-rds-files-more-efficient-than-csv-files"
[19] "/questions/72625690/importing-files-using-list-files"
[20] "/questions/72623856/second-most-common-element-in-each-row"
[21] "/questions/72623744/counting-the-position-where-a-pattern-is-completed"
[22] "/questions/72620501/bulk-import-export-files-from-r"
[23] "/questions/72613413/counting-every-position-where-a-pattern-appears"
[24] "/questions/72612577/counting-the-position-of-the-first-0-in-each-row"
[25] "/questions/72607160/taking-averages-across-lists"
[26] "/questions/72589276/functions-for-finding-out-the-midpoint-interpolation"
[27] "/questions/72587298/sandwiching-values-between-rows"
[28] "/questions/72569338/integration-error-lengthlower-1-is-not-true"
[29] "/questions/72568817/synchronizing-nas-in-r"
[30] "/questions/72568661/finding-the-loser-in-each-row"
[[2]]
[1] "/questions/72566170/making-a-race-between-two-variables"
[2] "/questions/72418723/making-a-list-of-random-numbers"
[3] "/questions/72418364/random-uniform-numbers-without-runif"
[4] "/questions/72353102/integrate-normal-distribution-between-2-values"
[5] "/questions/72174868/placing-commas-between-names"
[6] "/questions/72163297/simulate-flipping-french-fries-in-r"
[7] "/questions/71982286/alternatives-to-the-partition-by-statement-in-sql"
[8] "/questions/71970960/converting-lists-into-data-frames"
[9] "/questions/71970672/random-numbers-are-too-similar-to-each-other"
[10] "/questions/71933753/making-combinations-of-items"
[11] "/questions/71874791/sorting-rows-in-specified-order"
[12] "/questions/71866097/hiding-the-legend-in-this-graph"
[13] "/questions/71866048/understanding-the-median-in-this-graph"
[14] "/questions/71852517/nas-produced-when-number-of-iterations-increase"
[15] "/questions/71791906/assigning-unique-colors-to-multiple-lines-on-a-graph"
[16] "/questions/71787336/finding-identical-rows-in-multiple-datasets"
[17] "/questions/71758983/multiple-replace-lookups"
[18] "/questions/71758648/create-ascending-id-in-a-data-frame"
[19] "/questions/71731208/webscraping-data-which-pokemon-can-learn-which-attacks"
[20] "/questions/71728273/webscraping-pokemon-data"
[21] "/questions/71683045/identifying-smallest-element-in-each-row-of-a-matrix"
[22] "/questions/71671488/connecting-all-nodes-together-on-a-graph"
[23] "/questions/71641774/overriding-colors-in-ggplot2"
[24] "/questions/71641404/applying-a-function-to-a-data-frame-lapply-vs-traditional-way"
[25] "/questions/71624111/sending-emails-from-r"
[26] "/questions/71623019/sql-joining-tables-from-2-different-servers-r-vs-sas"
[27] "/questions/71429265/overriding-sql-errors-during-r-uploads"
[28] "/questions/71429129/splitting-a-dataset-into-uneven-portions"
[29] "/questions/71418533/multiplying-and-adding-values-across-rows"
[30] "/questions/71417489/tricking-an-sql-server-to-accept-a-file-from-r"
[[3]]
[1] "/questions/71417218/splitting-a-dataset-into-arbitrary-sections"
[2] "/questions/71398804/plotting-vector-fields-and-gradient-fields"
[3] "/questions/71387596/animating-the-mandelbrot-set"
[4] "/questions/71358405/repeat-a-set-of-ids-for-every-n-rows"
[5] "/questions/71344822/time-series-graphs-with-different-symbols"
[6] "/questions/71341865/creating-a-data-frame-with-commas"
[7] "/questions/71287944/converting-igraph-to-visnetwork"
[8] "/questions/71282863/fixing-the-first-and-last-numbers-in-a-random-list"
[9] "/questions/71282403/adding-labels-to-graph-nodes"
[10] "/questions/71262761/understanding-list-and-do-call-commands"
[11] "/questions/71261431/adjusting-graph-layouts"
[12] "/questions/71255038/overriding-non-existent-components-in-a-loop"
[13] "/questions/71244872/fixing-cluttered-titles-on-graphs"
[14] "/questions/71243676/directly-adding-titles-and-labels-to-visnetwork"
[15] "/questions/71232353/removing-all-edges-in-igraph"
[16] "/questions/71230273/writing-a-function-that-references-elements-in-a-matrix"
[17] "/questions/71227260/generating-random-graphs-according-to-some-conditions"
[18] "/questions/71087349/adding-combinations-of-numbers-in-a-matrix"

How to correctly identify html node

I want to scrape the price of a product in a webshop, but I am struggling to identify the correct node for the price I want to scrape.
The relevant part of my code looks like this:
"https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/"%>%
read_html()%>%
html_nodes('span.woocommerce-Price-amount.amount')%>%
html_text()
When executing this code I do get prices as a result, but not the ones I want (it shows the prices of other products that are listed beneath).
How can I correctly identify the node for the price of the product itself (375.-)?
First: I don't know R.
This page uses JavaScript to add the price to the HTML, and I don't know whether rvest can run JavaScript.
But I found the value as JSON in <form data-product_variations="...">, and I could display the prices for all options:
data <- "https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
read_html() %>%
html_nodes('form.variations_form.cart') %>%
html_attr('data-product_variations') %>%
fromJSON
data$display_price
data$regular_price
data$image$title
Result:
> data$display_price
[1] 479 375 439 479 479
> data$display_regular_price
[1] 699 549 629 699 699
> data$image$title
[1] "aqua marina fusion bamboo padddel"
[2] "aqua marina fusion aluminium padddel"
[3] "aqua marina fusion carbon padddel"
[4] "aqua marina fusion hibi padddel"
[5] "aqua marina fusion silver padddel"
> colnames(data)
[1] "attributes" "availability_html" "backorders_allowed"
[4] "dimensions" "dimensions_html" "display_price"
[7] "display_regular_price" "image" "image_id"
[10] "is_downloadable" "is_in_stock" "is_purchasable"
[13] "is_sold_individually" "is_virtual" "max_qty"
[16] "min_qty" "price_html" "sku"
[19] "variation_description" "variation_id" "variation_is_active"
[22] "variation_is_visible" "weight" "weight_html"
[25] "is_bookable" "number_of_dates" "your_discount"
[28] "gtin" "your_delivery"
EDIT:
To work with pages that use JavaScript you may need other tools, such as phantomjs.
See: How to Scrape Data from a JavaScript Website with R | R-bloggers

how to just retrieve the titles from the query result using rvest

I use rvest to retrieve the titles from a Google query result. My code is like this:
> url = URLencode(paste0("https://www.google.com.au/search?q=","600d"))
> page <- read_html(url)
> page %>%
html_nodes("a") %>%
html_text()
However, the result includes not only just titles, but also something else, like:
[24] "Past month"
[25] "Past year"
[26] "Verbatim"
[27] "EOS 600D - Canon"
[28] "Similar"
[29] "Canon 600D | BIG W"
[30] "Cached"
[31] "Similar"
......
[45] ""
[46] ""
where what I need are [27] "EOS 600D - Canon" and [29] "Canon 600D | BIG W": the entries shown as result titles on the Google results page.
All the others are just noise to me. Could anyone please tell me how to get rid of those?
Also, if I want the description part as well, what should I do?
To get just the titles, do not select <a> (links) but <h3> (the result headings):
page %>%
  html_nodes("h3") %>%
  html_text()
[1] "EOS 600D - Canon"
[2] "Canon EOS 600D - Wikipedia"
[3] "Canon 600D | BIG W"
[4] "Canon EOS 600D Digital SLR Camera with 18-55mm IS Lens Kit ..."
[5] "Canon Rebel T3i / EOS 600D Review: Digital Photography Review"
[6] "Canon EOS 600D review - CNET"
[7] "canon eos 600d | Cameras | Gumtree Australia Free Local Classifieds"
[8] "Images for 600d"
[9] "Canon 600D - Snapsort"
[10] "Canon EOS 600D - Georges Cameras"

Encoding Issue in R htmlParse XML

I try to scrape a website but can't handle this encoding issue:
library(XML)

# putting together the url:
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=", search_str,
             "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")
# get content and parse it:
doc <- htmlParse(url)
# encoding issue, like here:
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)
[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
[5] "Mà Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
[7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"
[8] "MO Rödel, R Ernst - Ecotropica, 2004 - gtoe.de"
# ...
any pointers?
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.1
> getOption("encoding")
[1] "native.enc"
This worked to some degree for me:
doc <- htmlParse(url, encoding = "UTF-8")
head(xpathSApply(doc, '//div[@class="gs_a"]', xmlValue))
#[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
#[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
#[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
#[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
#[5] "MÁ Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
#[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
though
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)[[81]]
was displaying incorrectly on my Windows box, for example.
Switching to the font DotumChe via the GUI preferences, however, showed it displaying correctly, so it may just be a display issue, not a parsing one.
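One way to separate a parsing problem from a display problem is to check the declared encoding of the extracted strings. A sketch:
res <- xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)
# "UTF-8" here suggests the text parsed correctly and any remaining
# garbling comes from the console font or locale
Encoding(res[[81]])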