How to identify a node by its XML value in XPath?

I use R to scrape a web site, and the HTML I need to parse looks like this:
<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
Now I would like to extract some values from this markup.
How can I identify the span whose XML value is "Number" and get that node, in order to extract "number extra"?
I know how to use xpathApply to select nodes and then get their xmlValue or attributes (such as href with xmlGetAttr), but I don't know how to select a node by its xmlValue.
xpathApply(page, '//span[@class="property"]', xmlValue)
If I want to get the "value" 72 for the property "Surface", what is the most efficient way?
Here is what I have started to do:
First, I extract all "property" spans:
xpathApply(page, '//span[@class="property"]', xmlValue)
Then I extract all "value" spans:
xpathApply(page, '//span[@class="value"]', xmlValue)
Then I build a list or a matrix so that I can look up the value of "Surface", which is 72. The problem is that a span with class="property" is sometimes not followed by a span with class="value" in the same h2, so the two results do not line up and I cannot build a proper list.
Would this be the most efficient way: identify the span with class="property", then the h2 that contains it, then the span with class="value" inside that h2?

For your HTML, made well-formed by adding a single root element,
<?xml version="1.0" encoding="UTF-8"?>
<r>
<div class="line">
<h2 class="clearfix">
<span class="property">Number
<div>number extra</div>
</span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
</r>
(A) This XPath expression,
//span[@class='property' and starts-with(., 'Number')]/div/text()
will return
number extra
as requested.
(B) This XPath expression,
//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()
will return
72
as requested.
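The per-h2 pairing asked about in the question can also be sketched outside R. The following is only an illustration in Python using the standard library's xml.etree.ElementTree, not the method used in the answers above. It assumes the markup has been made well-formed with a root element as shown, and the third "Orphan" block and the function name property_values are hypothetical additions, included to show a property that has no value.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a root element so it parses as XML.
# The third block ("Orphan") is a made-up case: a property with no value span.
doc = """<r>
<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Orphan</span>
</h2>
</div>
</r>"""

root = ET.fromstring(doc)


def property_values(root):
    """Map each property name to its value, looking inside one h2 at a time."""
    pairs = {}
    for h2 in root.iter("h2"):
        prop = h2.find("span[@class='property']")
        val = h2.find("span[@class='value']")
        if prop is None:
            continue
        # prop.text is only the text before any child element, e.g. "Number".
        name = (prop.text or "").strip()
        pairs[name] = val.text.strip() if val is not None and val.text else None
    return pairs


pairs = property_values(root)

# Equivalent of expression (B): the h2 that has a span child equal to
# "Surface", then its span with class="value".
surface = root.find(".//h2[span='Surface']/span[@class='value']").text

print(pairs)
print(surface)
```

Because each h2 is handled on its own, a property without a value gets None instead of shifting every later value by one position.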

XPath can match on an element's text with the text() node test. Using rvest for simplicity:
library(rvest)
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Number"]/*') %>% # select node
html_text() # get text contents of node
# [1] "number extra"
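XPath also has axes for moving between related nodes, in this case following::, which selects every element after the matched node in document order (it returns only 72 here because "Surface" is the last property; following-sibling::span[1] is the narrower choice):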
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Surface"]/following::*') %>% # select node
html_text() # get text contents of node
# [1] "72"

Related

Rvest html_nodes span div other items

I'm scraping this HTML and I want to extract the text inside the <span data-testid="distance">:
<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text i'm obtaining</span>
</span>
distancia <- hoteles_verdes %>%
html_elements("span.class1") %>%
html_text()
The question is how to target data-testid="distance" in html_elements so that I can then retrieve its html_text.
It's my first question posting. thanks!
You can use a CSS attribute selector.
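For example, the [attribute|="value"] selector matches elements whose "data-testid" attribute is exactly "distance" or starts with "distance-"; [data-testid="distance"] would match the exact value only (note the single quotes around the selector and the double quotes inside it):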
library(rvest)
hoteles_verdes %>%
html_nodes('[data-testid|="distance"]') %>%
html_text()
Result:
[1] "the text i want"
Data:
hoteles_verdes <- read_html('<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text im obtaining</span>
</span>')
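For comparison, the same attribute match can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of the rvest answer: the class name AttrTextCollector is hypothetical, and the parser is a deliberately small one that tolerates the unclosed span in the sample markup.

```python
from html.parser import HTMLParser


class AttrTextCollector(HTMLParser):
    """Collect the text of every element whose attribute `name` equals `value`."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value
        self.depth = 0   # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested tag inside a match. Void tags such as <br> written
            # without a slash would unbalance this counter; the sketch
            # ignores that case.
            self.depth += 1
        elif dict(attrs).get(self.name) == self.value:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


html = """<span class="class1">
<span data-testid="distance">the text i want</span>
</span>
<span class="class2">
<span class="class1"><span>the other text im obtaining</span>
</span>"""

collector = AttrTextCollector("data-testid", "distance")
collector.feed(html)
collector.close()
found = collector.texts
print(found)
```

Matching on the attribute rather than on span.class1 is what keeps the second, unrelated span out of the result.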

how to get content within a span tag

#Example 1
<span class="levelone">
<span class="leveltwo" dir="auto">
::before
"Blue"
::after
</span>
</span>
#Example 2
<div class="itemlist">
<div dir="auto" style="text-align: start;">
"mobile"
</div>
</div>
#Example 3
<div class="quantity">
<div class="color">...</div>
<span class="num">10</span>
</div>
Hi, I am trying to use selenium to extract content from html. I managed to extract the content for example 1 & 2, the code that I have used is
example1 = driver.find_elements_by_css_selector("span[class='leveltwo']")
example2 = driver.find_elements_by_css_selector("div[class='itemlist']")
and printed out as text with
data = [dt.text for dt in example1]
print(data)
I got "Blue" for example 1 & "mobile" for example 2. For simplicity purposes, the html given above is for one iteration, I have scraped all elements with the class mentioned above
However, for the 3rd example, I tried to use
example3a = driver.find_elements_by_css_selector("div[class='quantity']")
and
example3b = driver.find_elements_by_css_selector("div[class='num']")
and
example3c = driver.find_element_by_class_name("num")
but all of them returned an empty list. I'm not sure whether that is because there is no dir attribute in example 3. What method should I use to extract the "10"?
For the 3rd example, you can try the CSS selector below:
div.quantity span.num
In code, you can write it like this (find_element, singular, returns one element, so .text can be read from it directly):
example3a = driver.find_element_by_css_selector("div.quantity span.num")
print(example3a.text)
or
print(example3a.get_attribute('innerHTML'))
To extract specifically the 10 you can use (this returns a list of matching elements)
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
To extract both elements inside <div class="quantity"> you can use
example3 = driver.find_elements_by_xpath("//div[@class='quantity']//*")
for el in example3:
    print(el.text)

rvest : extract span content

Welcome, I have been searching for quite a long time but could not find how to handle this example using html_nodes() from rvest. I would like to extract the content of the span that carries a data-value, but only the first number. For the following piece of HTML, it should return only "504 012":
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="504012">504 012</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="1 024 560">$1.02M</span>
</p>
I would be glad for any kind of help.
You can specify the name attribute ("nv") and use html_node() to get only the first occurrence.
library(rvest)
p <- '<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="504012">504 012</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="1 024 560">$1.02M</span>
</p>'
p %>%
read_html() %>%
html_node("span[name='nv']") %>%
html_text()
[1] "504 012"

How to get Nokogiri to scrape text from span in Ruby

I'm trying to scrape information from a website using Nokogiri and Curb, but I can't seem to find the right name to select the element I want to scrape. I'm trying to scrape the API key, which is at the bottom of the HTML code as "xxxxxxx".
The HTML code is:
<body class="html not-front logged-in no-sidebars page-app page-app- page-app-8383900 page-app-keys i18n-en" data-twttr-rendered="true">
<div id="skip-link"></div>
<div id="page-wrapper">
<!--
Code for the global nav
-->
<nav id="globalnav" class="without-subnav"></nav>
<nav id="subnav"></nav>
<section id="hero" class="hero-short"></section>
<section id="gaz-content">
<div class="container">
::before
<div id="messages"></div>
<div id="gaz-content-wrap-outer" class="row">
::before
<div id="gaz-content-wrap-inner" class="span12">
<div class="row">
::before
<div class="article-wrap span12">
<article id="gaz-content-body" class="content">
<header></header>
<div class="header-action"></div>
<div class="tabs"></div>
lass="d-block d-block-system g-main">
<div class="app-details">
<h2>
Application Settings
</h2>
<div class="description"></div>
<div class="app-settings">
<div class="row">
::before
<span class="heading">
Consumer Key (API Key)
</span>
<span>
xxxxxxxxx
</span>
All I can seem to get is the "content" text.
My code looks like:
consumer = html.at("#gaz-content-body")['class']
puts consumer
I'm not sure what to type to select the class and/or span and then the text inside it. All I can get Nokogiri to print is "content".
In this case we need to find the second span after the span class="heading", and inside the div class="app-settings" - I'm being a bit general but not too much. I'm using search instead of at to retrieve the two spans and get the second one:
# Gets the 2 span elements under <div class='app-settings'>.
res = html.search('#gaz-content-body .app-settings span')
# Use .text to get the contents of the 2nd element.
res[1].text.strip
# => "xxxxxxxx"
But you can also use at to target the same:
res = html.at("#gaz-content-body .app-settings span:nth-child(2)")
res.text.strip
# => "xxxxxxxx"

how to retrieve data from html between <span> and </span>

I want to get the rating, from 1 to 5, in Amazon customer reviews.
I checked the page source, and the relevant part looks like this:
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
How can I use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I don't know R well, but I can give you the XPath string. It seems you want the text of the first span that has no attributes, which would be:
//span[not(@*)][1]/text()
You can pass this string to xpathSApply.
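The same idea (take the first span that carries no attributes) can be sketched in Python with the standard library. This is only an illustration of the selection logic, not the R call; it assumes the snippet from the question is well-formed enough to parse as XML, which it happens to be.

```python
import xml.etree.ElementTree as ET

# Snippet from the question, unchanged.
snippet = """<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>"""

root = ET.fromstring(snippet)

# iter() walks the tree in document order; an empty attrib dict means the
# span has no attributes, which mirrors the not(@*) predicate.
first_bare = next((s for s in root.iter("span") if not s.attrib), None)
rating = first_bare.text if first_bare is not None else None
print(rating)
```

On a full review page there may be other attribute-less spans, so in practice the search would probably need to start from the element that wraps a single review.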