Rvest html_nodes span div other items - html

I'm scraping this HTML and I want to extract the text inside the <span data-testid="distance"> element:
<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text i'm obtaining</span>
</span>
distancia <- hoteles_verdes %>%
html_elements("span.class1") %>%
html_text()
How can I isolate the element with data-testid="distance" in html_elements() so that I can then retrieve its text with html_text()?
This is my first question here. Thanks!

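You can use a CSS attribute selector.
For example, the [attribute|="value"] selector matches elements whose "data-testid" attribute is exactly "distance" or starts with "distance-" (note the single quotes around the selector and the double quotes around the value). The plain [attribute="value"] form matches the exact value only.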
library(rvest)
hoteles_verdes %>%
html_nodes('[data-testid|="distance"]') %>%
html_text()
Result:
[1] "the text i want"
Data:
hoteles_verdes <- read_html('<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text im obtaining</span>
</span>')
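
If only the exact value should match, the plain [attribute="value"] form is a little stricter than |=. Here is a minimal sketch on the same fragment. The object name `page` is made up for illustration, and html_elements() assumes rvest 1.0.0 or later:

```r
library(rvest)

# Same fragment as in the question; `page` is an illustrative name
page <- read_html('<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text</span></span>
</span>')

# Exact attribute match on data-testid
exact <- page %>%
  html_elements('[data-testid="distance"]') %>%
  html_text()

# Same match, restricted to direct children of span.class1
scoped <- page %>%
  html_elements('span.class1 > span[data-testid="distance"]') %>%
  html_text()

exact
scoped
```

The scoped variant is only useful if other parts of the page reuse the same data-testid value.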

Related

How can I extract this text from HTML, using RSelenium?

I want to scrape the 57, but I'm only able to get the text that says "Search Results:".
This is the code that I use to grab the element with RSelenium:
element<- remDr$findElements(using = 'id','resultListHeadingName')
lapply(element,function (x) x$getElementText()) %>% unlist()
This is the output:
[1] "Search Results:"
This is the HTML from the website:
<h1 class="page-title alt selectorgadget_selected" xpath="1">
<span id="resultListHeadingName" class="color-p2">Search Results:</span> 1 - 50 of 57 </h1>
Find the parent:
element<- remDr$findElements(using = 'xpath', "//*[@id='resultListHeadingName']/..")
lapply(element,function (x) x$getElementText()) %>% unlist()
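
The same parent lookup can be tried offline on the static HTML. This is only a sketch with rvest rather than RSelenium, assuming the heading looks exactly like the snippet in the question. The regular expression that pulls out the total is my own assumption about the "of N" wording:

```r
library(rvest)

# Heading from the question, without the selectorgadget attributes
h <- read_html('<h1 class="page-title alt"><span id="resultListHeadingName" class="color-p2">Search Results:</span> 1 - 50 of 57 </h1>')

# Step up from the span to its parent <h1> and take the full text
heading <- h %>%
  html_element(xpath = "//*[@id='resultListHeadingName']/..") %>%
  html_text()

# Keep only the number that follows "of"
total <- as.integer(sub(".*of ([0-9]+).*", "\\1", heading))
total
```

With RSelenium, the same sub() call could be applied to whatever getElementText() returns for the parent element.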

rvest : extract span content

Hello, I have been searching for quite a long time but could not find how to handle this example using html_nodes() from rvest. I would like to extract the data-value from the span, but only for the first number. For the following piece of HTML, it should return only "504 012":
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="504012">504 012</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="1 024 560">$1.02M</span>
</p>
I would be glad for any kind of help.
You can specify the name attribute ("nv") and use html_node() to get only the first occurrence.
library(rvest)
p <- '<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="504012">504 012</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="1 024 560">$1.02M</span>
</p>'
p %>%
read_html() %>%
html_node("span[name='nv']") %>%
html_text()
[1] "504 012"
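
Since the question mentions the data-value attribute, html_attr() can read it directly instead of the displayed text. A small sketch on the same fragment (the variable name `votes` is just for illustration):

```r
library(rvest)

p <- '<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="504012">504 012</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="1 024 560">$1.02M</span>
</p>'

votes <- p %>%
  read_html() %>%
  html_node("span[name='nv']") %>%  # first match only
  html_attr("data-value")           # attribute value instead of the text

as.numeric(votes)
```

The attribute has no thousands separator for the first span, so it converts to a number without extra cleaning.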

How to identify a node with its XML value in XPath?

I use R to scrape a web site, and when parsing the HTML code, I have this code below:
<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
Now I would like to get some values in this code.
How can I identify the span with the XML value "Number" and get the node, in order to extract "number extra"?
I know how to use xpathApply to identify nodes in order to get the xmlValue or some attributes (like href with xmlGetAttr). But I don't know how to identify a node by its xmlValue.
xpathApply(page, '//span[@class="property"]',xmlValue)
If I want to get the "value" 72 for the property class "Surface", what is the most efficient way?
Here's what I started to do:
First, I extract all "property":
xpathApply(page, '//span[@class="property"]',xmlValue)
Then I extract all "value":
xpathApply(page, '//span[@class="value"]',xmlValue)
Then I build a list or a matrix, so that I can identify the value of "Surface", which is 72. But the problem is that sometimes a span with class="property" is not followed by a span with class="value" in the same h2, so I cannot build a proper list.
Could this be the most efficient way? Identify the span with class="property", then identify the h2 that contains this span, then identify the span with class="value"?
For your HTML made to be well-formed by adding a single root element,
<?xml version="1.0" encoding="UTF-8"?>
<r>
<div class="line">
<h2 class="clearfix">
<span class="property">Number
<div>number extra</div>
</span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
</r>
(A) This XPath expression,
//span[@class='property' and starts-with(., 'Number')]/div/text()
will return
number extra
as requested.
(B) This XPath expression,
//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()
will return
72
as requested.
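
To check both expressions from R, something like the following should work. It is a sketch that uses the xml2 package (my choice, not the XML package from the question) and the well-formed document with the added root element; `extra` and `surface` are illustrative names:

```r
library(xml2)

# Well-formed version of the fragment, wrapped in a single root element
doc <- read_xml('<r>
<div class="line"><h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2></div>
<div class="line"><h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2></div>
</r>')

# (A) text of the div inside the "Number" property
extra <- xml_text(xml_find_first(
  doc, "//span[@class='property' and starts-with(., 'Number')]/div/text()"
))

# (B) value that shares an h2 with the "Surface" property
surface <- xml_text(xml_find_first(
  doc, "//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()"
))

extra
surface
```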
XPath can evaluate the contents of a tag using its own function text(). Using rvest for simplicity:
library(rvest)
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Number"]/*') %>% # select node
html_text() # get text contents of node
# [1] "number extra"
XPath also has axes for navigating between related nodes. Here, the following:: axis is used:
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Surface"]/following::*') %>% # select node
html_text() # get text contents of node
# [1] "72"
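
On a larger page, following::* returns every element that comes later in the document, so a narrower axis may be safer. Below is a sketch of a lookup helper built on following-sibling. The function name get_value and the extra "Rooms" row without a value are hypothetical, added only to show the missing-value case from the question:

```r
library(rvest)

page <- read_html('<div class="line"><h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2></div>
<div class="line"><h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2></div>
<div class="line"><h2 class="clearfix">
<span class="property">Rooms</span>
</h2></div>')

# Hypothetical helper: value span that follows a given property span
get_value <- function(page, property) {
  xp <- sprintf(
    '//span[@class="property"][starts-with(normalize-space(.), "%s")]/following-sibling::span[@class="value"][1]',
    property
  )
  # html_element() yields a missing node when nothing matches,
  # and html_text() turns that into NA
  page %>% html_element(xpath = xp) %>% html_text()
}

get_value(page, "Surface")
get_value(page, "Number")
get_value(page, "Rooms")
```

Because following-sibling only looks inside the same h2, a property without a value gives NA instead of picking up the value from the next block.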

how to retrieve data from html between <span> and </span>

I want to get the rating, from 1 to 5, in Amazon customer reviews.
I checked the source and found that this part looks like:
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
How can I use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I do not know R well, but I can give you the XPath string. It seems you want the text of the first span that has no attributes, which would be:
//span[not(@*)][1]/text()
You can put this string into xpathSApply.
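
For completeness, here is a sketch with xpathSApply itself, as asked in the question. It assumes the rating span can be recognised by its swSprite class (an assumption about the page, based only on the snippet above) and keeps just the rating part of the fragment:

```r
library(XML)

doc <- htmlParse('<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
</div>', asText = TRUE)

# Text of the span nested inside the star sprite
label <- xpathSApply(doc, "//span[contains(@class, 'swSprite')]/span", xmlValue)

# Numeric rating: drop everything from " out of" onwards
rating <- as.numeric(sub(" out of.*", "", label))

label
rating
```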

Shiny html output object that takes html code for easy copy and paste

I would like to create a Shiny R application that can take unformatted Stata code input by the user, add html tags, and return the entire block of code for easy copy and paste into an html publishing venue such as blogs or webpages.
I already have the R code that can handle the formatting (see "A Stata HTML syntax highlighter in R"), and most of the Shiny implementation seems very easy. The major challenge I am having is creating an HTML textbox or other object that can easily take a reactive element from Shiny's server.R and return it to the user without rendering the HTML tags.
Example:
Stata code input through a text box
clear
set obs 4000
gen id = _n
gen eta1 = rnormal()
gen eta2 = rnormal()
XX Shiny submit button XX
Return in another text box
<span style="color: #9900FF">set</span> <span style="color: #0000CC"><b>obs</b></span> 4000
<span style="color: #0000CC"><b>gen</b></span> id = <span style="color: #9900FF">_n</span>
<span style="color: #0000CC"><b>gen</b></span> eta1 = <span style="color: #9900FF">rnormal</span>()
<span style="color: #0000CC"><b>gen</b></span> eta2 = <span style="color: #9900FF">rnormal</span>()
Overall, I think this is generally a long question for a potentially very simple answer. Thanks for your consideration.
renderText() does not parse HTML tags. E.g. if you do:
output$code <- renderText({
paste0(
'<span style="color: #9900FF">set</span> <span style="color: #0000CC"><b>obs</b></span> 4000',
'<span style="color: #0000CC"><b>gen</b></span> id = <span style="color: #9900FF">_n</span>',
'<span style="color: #0000CC"><b>gen</b></span> eta1 = <span style="color: #9900FF">rnormal</span>',
'<span style="color: #0000CC"><b>gen</b></span> eta2 = <span style="color: #9900FF">rnormal</span>'
)
})
Where this is your ui.R:
library(shiny)
shinyUI(pageWithSidebar(
headerPanel("Code"),
sidebarPanel(
),
mainPanel(
verbatimTextOutput("code")
)
))
The content comes out as just text.
But since you haven't posted your ui.R (or index.html), I'm not sure how you are rendering your output. If the browser is parsing the HTML instead of displaying it as raw text, you can always replace < with &lt; and > with &gt; like this:
html <- '<span>text</span>'
x <- gsub('<', '&lt;', html)
gsub('>', '&gt;', x)
This will produce &lt;span&gt;text&lt;/span&gt;, which the browser displays as literal text instead of parsing it as HTML.
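
If the code can contain ampersands, they need to be escaped too, and before the angle brackets, otherwise the & in the new entities gets escaped a second time. A minimal base R sketch (escape_html is a made-up helper name; htmltools::htmlEscape() should do the same job if you prefer a packaged function):

```r
# Hypothetical helper: escape the three characters HTML treats specially in text
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

escaped <- escape_html('<span>a & b</span>')
escaped
```

The escaped string can then be returned from the reactive and shown to the user as copyable source.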