XPath over more levels - html

Is it possible with XPath to extract something on another level?
My table has several different rows. One of them contains the text "bpmtestinstanz". I need the URL (href) of the link whose text is "Instanz abbauen", in the table row (tr) that contains "bpmtestinstanz". But I don't know how to write an XPath that spans the different levels of the link and the printed text.
<tr class="htmlobject_tr even" onclick="tr_click(this, 'idp54bf6c72037f1')" onmouseover="tr_hover(this, 'idp54bf6c72037f1')" onmouseout="tr_hover(this, 'idp54bf6c72037f1')">
<td class="htmlobject_td state">
<span class="pill active">
active</span>
</td>
<td class="htmlobject_td config">
<b>
Request ID</b>
14218305642887<br>
<b>
Hostname</b>
bpmtestinstanz<br>
<b>
Typ</b>
KVM VM (localboot)<br>
<b>
CPU</b>
1<br>
<b>
Speicher</b>
1.024 GB<br>
<b>
Kernel</b>
default<br>
<b>
Festplatte</b>
5 GB<br>
<b>
Image</b>
13982361680940.cloud_14218305642887_1_<br>
<b>
IP</b>
192.168.200.166, 172.26.41.105</td>
<td class="htmlobject_td comment">
Requested by user danieldrichardt<hr>
<a class="plugin console" onclick="sshwindow = window.open('https://172.26.41.105:8022','window1722641105', 'location=0,status=0,scrollbars=yes,resizable=yes,width=973,height=500,left=100,top=100,screenX=400,screenY=100'); sshwindow.focus(); return false;" href="#">
Ajaxterm</a>
<a class="plugin novnc" target="_blank" href="api.php?action=novnc&appliance_id=14218305990082">
noVNC</a>
</td>
<td class="htmlobject_td action">
<a data-message="" class="pause" title="Instanz pausieren" href="index.php?cloud_ui=pause&cloudappliance_id[]=14218308730556">
Instanz pausieren</a>
<a data-message="" class="restart" title="Instanz neu starten" href="index.php?cloud_ui=restart&cloudappliance_id[]=14218308730556">
Instanz neu starten</a>
<a data-message="" class="private" title="Privates Image anlegen" href="index.php?cloud_ui=image_private&appliance_id=14218305990082">
Privates Image anlegen</a>
<a data-message="" class="edit" title="Instanz bearbeiten" href="index.php?cloud_ui=appliance_update&cloudappliance_id=14218308730556">
Instanz bearbeiten</a>
<a data-message="" class="remove" title="Instanz abbauen" href="index.php?cloud_ui=deprovision&cloudappliance_id[]=14218308730556">
Instanz abbauen</a>
</td>
</tr>

It can be done, but it's slightly complicated.
//tr[td[@class = 'htmlobject_td config' and contains(b[contains(., 'Hostname')]
/following-sibling::text()[1],'bpmtestinstanz')]]/td[@class = 'htmlobject_td action']
/a[contains(.,'Instanz abbauen')]/@href
which means
//tr[td[@class = 'htmlobject_td config'
    Select all `tr` elements anywhere in the document, but only if they have at least one `td` child element that has a `class` attribute with the value "htmlobject_td config"
and contains(b[contains(., 'Hostname')]
    and only if this `td` child element also has a child element `b` whose text value contains "Hostname"
/following-sibling::text()[1],'bpmtestinstanz')]]
    and if the first following sibling of this `b` element which is a text node contains "bpmtestinstanz"
/td[@class = 'htmlobject_td action']
    of this `tr` select a child `td` element that has a `class` attribute with the value "htmlobject_td action"
/a[contains(.,'Instanz abbauen')]/@href
    and select its child `a` if its text contains "Instanz abbauen" - and finally select the `href` attribute of this `a` element.
This solution requires "bpmtestinstanz" to be immediately after <b>Hostname</b>. If it can appear anywhere in the tr element, you can somewhat simplify the expression.
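If the hostname may appear anywhere in the row, the simplified form would be //tr[contains(., 'bpmtestinstanz')]/td[@class = 'htmlobject_td action']/a[contains(., 'Instanz abbauen')]/@href. As a rough illustration of the same logic, here is a minimal Python sketch using only the standard library. It rests on several assumptions: the markup is a trimmed, well-formed copy of the row (the real page is HTML and would need an HTML-aware parser), the first row with "otherhost" is made up for contrast, and remove_link_for is a hypothetical helper name. ElementTree supports only a subset of XPath without contains(), so the substring checks are done in Python.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the table in the question.
# The "otherhost" row is invented so there is something to skip.
TABLE = """
<table>
  <tr>
    <td class="htmlobject_td config"><b>Hostname</b> otherhost<br/></td>
    <td class="htmlobject_td action">
      <a class="remove" title="Instanz abbauen" href="index.php?cloud_ui=deprovision&amp;cloudappliance_id[]=1">Instanz abbauen</a>
    </td>
  </tr>
  <tr>
    <td class="htmlobject_td config"><b>Hostname</b> bpmtestinstanz<br/></td>
    <td class="htmlobject_td action">
      <a class="pause" title="Instanz pausieren" href="index.php?cloud_ui=pause&amp;cloudappliance_id[]=14218308730556">Instanz pausieren</a>
      <a class="remove" title="Instanz abbauen" href="index.php?cloud_ui=deprovision&amp;cloudappliance_id[]=14218308730556">Instanz abbauen</a>
    </td>
  </tr>
</table>
"""

def remove_link_for(root, hostname, link_text="Instanz abbauen"):
    """Return the href of the link with link_text in the row mentioning hostname."""
    for tr in root.iter("tr"):
        # Equivalent of tr[contains(., hostname)]: test the row's full text.
        if hostname not in "".join(tr.itertext()):
            continue
        # Step down into the action cell of the same row.
        for a in tr.iterfind("td[@class='htmlobject_td action']/a"):
            if link_text in "".join(a.itertext()):
                return a.get("href")
    return None

root = ET.fromstring(TABLE)
href = remove_link_for(root, "bpmtestinstanz")
print(href)
```

The point of the structure is the same as in the XPath: first pick the row by a condition on one cell, then walk down a different cell of that same row.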

Related

Xpath issues selecting <spans> nested in <td>

I'm trying to extract text from a lot of XHTML documents with a program that uses XPath queries to map the text into a structured table. The XHTML document looks like this:
<td class="td-3 c12" valign="top">
<p class="pa-4">
<span class="ca-5">text I would like to select </span>
</p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2">
<span class="ca-0">some more text I want to select </span>
</p>
<p class="pa-2">
<span class="ca-0">
<br>
</br>
</span>
</p>
<p class="pa-2">
<span class="ca-5">text and values I don't want to select.</span>
</p>
<p class="pa-2">
<span class="ca-5"> also text and values I don't want to </span>
</p>
</td>
I'm able to select the spans by their class and retrieve the text/values, but the span classes are not unique enough, so I also need to filter by the class of the enclosing table cell: for example, only the text from a span with class ca-0 that is inside a td with class td-3 c13,
which would be <span class="ca-0">some more text I want to select </span>
I've tried all these combinations:
//xhtml:td[@class="td-3 c13"]/xhtml:span[@class = "ca-0"]
//xhtml:span[@class = "ca-0"] //ancestor::xhtml:td[@class= "td-3 c13"]
//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"]
I'm not sure how much your sample xml reflects your actual xml, but strictly based on your sample xml (AND disregarding possible namespaces issues you will probably face), the following xpath expression:
//td[contains(@class,"td-3")]/p[1]/span/text()
selects
text I would like to select
some more text I want to select
According to the doc, and to support namespaces, you should write something like this (fn:...) :
//*:td[fn:contains(@class,"td-3")]/*:p[1]/*:span
Or with a namespace binding:
node.xpath("//xhtml:td[fn:contains(@class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})
This expression should work too (select the first span of the first p of each td element) :
//*:td/*:p[1]/*:span[1]
Side notes :
Your own XPath expressions can be fixed. The span is not a child of the td but a descendant, so use //. Wrapping the expression in () and appending [1] keeps only the first result.
(//xhtml:td[@class="td-3 c13"]//xhtml:span[@class = "ca-0"])[1]
(//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"])[1]
Or replace the // ancestor step with a predicate []:
(//xhtml:span[@class = "ca-0"][ancestor::xhtml:td[@class= "td-3 c13"]])[1]
Test your XPath with : https://docs.marklogic.com/cts.validIndexPath
The solution is
//td[(@class = "td-3") and (@class = "c13")]/p/span
For some reason it sees
<td class="td-3 c13">
as separate classes, e.g.
<td class = "td-3" and class = "c13"
so you need to treat them as such.
Thanks to @E.Wiest and @JackFleeting for validating and pointing me in the right direction.
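A note on portability: in plain XPath 1.0 an attribute value is a single string, so two equality tests on the same @class cannot both be true. The expression above presumably works because the processor in question treats class as a list of tokens. The usual XPath 1.0 idiom for matching one class token is contains(concat(' ', normalize-space(@class), ' '), ' c13 '). The Python sketch below shows the same token-based idea with the standard library. It assumes a trimmed sample without namespaces, and has_classes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, namespaces omitted.
XHTML = """
<table><tr>
  <td class="td-3 c12"><p class="pa-4"><span class="ca-5">text I would like to select </span></p></td>
  <td class="td-3 c13">
    <p class="pa-2"><span class="ca-0">some more text I want to select </span></p>
    <p class="pa-2"><span class="ca-5">text and values I don't want to select.</span></p>
  </td>
</tr></table>
"""

def has_classes(elem, *wanted):
    """True if every wanted name is one of the whitespace-separated class tokens."""
    tokens = (elem.get("class") or "").split()
    return all(name in tokens for name in wanted)

root = ET.fromstring(XHTML)
matches = [
    (span.text or "").strip()
    for td in root.iter("td") if has_classes(td, "td-3", "c13")
    for span in td.iter("span") if has_classes(span, "ca-0")
]
print(matches)
```

Matching on tokens avoids two pitfalls at once: an exact string comparison breaks when the order or spacing of the classes changes, and a plain contains(@class, "c1") would also match c12 and c13.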

(Reverse) Traverse XPath Query for Accessing a DIV with a particular Text Value

I'm working with a DOM in which the same HTML block is repeated 100+ times. Each block looks like this:
<div class="intro">
<div class="header">
<h1 class="product-code"> <span class="code">ZY001</span> <span class="intro">ZY001 Title/Intro</span> </h1>
</div>
<div>
<table>
<tbody>
<tr>
<td>Available</td>
<td> S </td>
<td> M </td>
<td> XL </td>
</tr>
I was previously using this XPath query to get ALL the node values back (all 100+ instances, each with a variable number of td nodes following Available):
//div[@class='intro']/div/table/tbody/tr/td[contains(text(),'Available')]/following-sibling::td
object(DOMNodeList)[595]
public 'length' => int 591
Now I need to target the product-code / code specifically, to retrieve all the td values for one particular code.
Because the div that contains the unique identifier (in the example above, ZY001) is not a direct ancestor, my thinking is I have to do a Reverse XPath Query
Here's one of my attempts:
//h1[@class='product-code']/span[contains(@class, 'code') and text() = 'ZY001']/../../div[@class='intro']/div/table/tbody/tr/td[contains(text(),'Available')]/following-sibling::td
As I am defining /span[contains(@class, 'code') and text() = 'ZY001'] and then attempting to traverse the DOM backwards twice using /../../ I was hoping/expecting to get back the div[@class='intro'] with the text ZY001 immediately above it, or rather a public 'length' => int 1
But all my attempts thus far have resulted in 0 results. Not false, indicating an improper XPath, but 0.
How can I modify my XPath Query to get back the single instance in the one-of-many <div class="intro">'s that contain the <h1 class="product-code">/<span class="code"> text value ZY001?
Use
//h1[@class='product-code']/span[contains(@class, 'code') and text() = 'ZY001']/../../../div/table/tbody
instead of
//h1[@class='product-code']/span[contains(@class, 'code') and text() = 'ZY001']/../../div[@class='intro']/div/table/tbody
You can use any of the XPath expressions below for that. The leading . inside the predicate keeps the test relative to each div; a bare // there would search the whole document and match every block:
//div[@class='intro' and .//h1[@class='product-code']/span[@class='code' and text()='ZY001']]//tbody/tr[td[text()='Available']]/td[2]
//div[@class='intro' and .//span[@class='code' and text()='ZY001']]//tbody/tr[td[text()='Available']]/td[2]
//div[@class='intro' and .//span[@class='code' and text()='ZY001']]//tr[td[text()='Available']]/td[2]
Change td[2] to td[3] and td[4] to get the 3rd and 4th td respectively
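To illustrate why the test for the product code has to be scoped to each block, here is a minimal Python sketch with the standard library. It assumes a well-formed, closed-off version of the fragment from the question; the second block (ZY002) is invented for contrast, and available_sizes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Two repeated blocks; the ZY002 block is made up for illustration.
DOC = """
<body>
  <div class="intro">
    <div class="header">
      <h1 class="product-code"><span class="code">ZY001</span> <span class="intro">ZY001 Title/Intro</span></h1>
    </div>
    <div><table><tbody><tr><td>Available</td><td> S </td><td> M </td><td> XL </td></tr></tbody></table></div>
  </div>
  <div class="intro">
    <div class="header">
      <h1 class="product-code"><span class="code">ZY002</span> <span class="intro">ZY002 Title/Intro</span></h1>
    </div>
    <div><table><tbody><tr><td>Available</td><td> L </td></tr></tbody></table></div>
  </div>
</body>
"""

def available_sizes(root, code):
    """Return the cells after 'Available' in the block whose code span equals code."""
    for block in root.iterfind(".//div[@class='intro']"):
        # Relative search: only code spans inside this block are considered.
        codes = [s.text for s in block.iterfind(".//span[@class='code']")]
        if code not in codes:
            continue
        for tr in block.iterfind(".//tr"):
            cells = [(td.text or "").strip() for td in tr.findall("td")]
            if cells and cells[0] == "Available":
                return cells[1:]
    return []

root = ET.fromstring(DOC)
sizes = available_sizes(root, "ZY001")
print(sizes)
```

Selecting the common ancestor first and filtering it by a descendant avoids the chain of /.. steps entirely, which is also what the predicate-based expressions do.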

XPath of element preceding some text

Trying to find the XPath for this code:
<a class="account-name" long-text="Sample Text" ng-click="accountClick(account)" ng-href="sample URL" href="sample URL1">
<span>Plan Name</span></a>
I've tried this:
//span[text()='Plan Name']
and it doesn't seem to be working.
For this HTML,
<a class="account-name" long-text="Sample Text" ng-click="accountClick(account)" ng-href="sample URL" href="sample URL1"/>
<span>Plan Name</span>
This XPath,
//span[.="Plan Name"]/preceding-sibling::a[1]
selects the immediately preceding a element of each span element whose string value is "Plan Name".
For this HTML,
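<a class="account-name" long-text="Sample Text" ng-click="accountClick(account)" ng-href="sample URL" href="sample URL1">
<span>Plan Name</span>
</a>
This XPath,
//a[.="Plan Name"]
will select all a elements with string values equal to "Plan Name". If the markup has whitespace around the span, as in the line breaks shown here, compare the normalized value instead: //a[normalize-space()="Plan Name"].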
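The effect of surrounding whitespace on the string value can be seen in a small Python sketch using only the standard library. It assumes a trimmed version of the markup without the ng- attributes; the second link ("Other Plan") is invented for contrast, and links_with_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The second link is made up so that the filter has something to reject.
FRAGMENT = """
<div>
  <a class="account-name" href="sample URL1">
    <span>Plan Name</span>
  </a>
  <a class="account-name" href="sample URL2">
    <span>Other Plan</span>
  </a>
</div>
"""

def links_with_text(root, wanted):
    """Return a elements whose whitespace-normalized string value equals wanted."""
    return [
        a for a in root.iter("a")
        # Join all descendant text, then collapse runs of whitespace,
        # which mirrors normalize-space() in XPath.
        if " ".join("".join(a.itertext()).split()) == wanted
    ]

root = ET.fromstring(FRAGMENT)
raw_value = "".join(root.find("a").itertext())
hrefs = [a.get("href") for a in links_with_text(root, "Plan Name")]
print(repr(raw_value))
print(hrefs)
```

The raw string value of the first link includes the line breaks and indentation around the span, which is why an exact comparison against "Plan Name" can fail on pretty-printed markup.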

css selector - get the value

What is the css selector to get the text value '2017-10-09' ?
<div class="col-xs-6">
<strong>
<!--ko text: date -->
2017-10-09
<!--/ko-->
<!--ko text: time -->
12:55
<!--/ko-->
</strong><br>
<!--ko text: locationName -->
City
<!--/ko-->
</div>
There is no such selector.
With a few exceptions (such as :first-line), a selector only allows you to select an element.
The text 2017-10-09 is not an element, it isn't even the whole text content of an element.
strong would allow you to select <strong><!--ko text: date -->2017-10-09<!--/ko--><!--ko text: time -->12:55<!--/ko--></strong> but that is more than you are asking for.
You could select the strong element, then read its text content, and then parse that (e.g. by splitting it across space characters or using a regular expression such as /(\d{4}-\d{2}-\d{2})/).
As far as I know, there are no CSS selectors that can check the data format inside a text node. You should wrap a tag around the date and then use a proper selector for that tag, like:
<div class="col-xs-6">
<strong>
<!--ko text: date -->
<span class="date">
2017-10-09
</span>
<!--/ko-->
<!--ko text: time -->
12:55
<!--/ko-->
</strong><br>
<!--ko text: locationName -->
City
<!--/ko-->
</div>
Alternatively, you can target the strong tag inside the div and then extract the date from its text content.
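The "select the element, then parse its text" approach could look roughly like the following Python sketch, which uses only the standard library. It assumes the fragment is made well-formed (the br is self-closed) and reuses the date pattern suggested above; DATE_RE is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed copy of the markup from the question.
FRAGMENT = """
<div class="col-xs-6">
  <strong>
    <!--ko text: date -->
    2017-10-09
    <!--/ko-->
    <!--ko text: time -->
    12:55
    <!--/ko-->
  </strong><br/>
  <!--ko text: locationName -->
  City
  <!--/ko-->
</div>
"""

# Four digits, two digits, two digits, separated by hyphens.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

root = ET.fromstring(FRAGMENT)
strong = root.find("strong")           # the closest thing a selector can reach
text = "".join(strong.itertext())      # date and time together, plus whitespace
match = DATE_RE.search(text)
date = match.group(0) if match else None
print(date)
```

The selection step only narrows things down to the strong element; picking the date out of its text is ordinary string processing and has to happen outside the selector.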

Using Nokogiri's CSS method to get all elements within an alt tag

I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of the alt="" attribute (the name of the person), as it is nested under a few divs.
Currently, using div.consultant, it gets all the names + the roles, i.e. Person 1Founder, Chairman; CTO, instead of just the person's name in alt=.
How could I get just the value of alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet. text against a NodeSet results in all text being concatenated, which often results in mangled text content forcing you to figure out how to pull it apart again, which, in the end, is horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
Get the inner text of all contained Node objects
Instead, use text (AKA inner_text or content) against the individual nodes, resulting in the exact text for that node, that you can then join as you want:
Returns the content for this Node
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
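The same distinction between per-node values and concatenated text exists outside Ruby. As a loose analogue, not Nokogiri itself, here is a Python sketch using only the standard library's ElementTree. The markup is a well-formed combination of the two small examples above.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<html><body><div class="consultant">'
    '<img alt="foo"/><img alt="bar"/>'
    '<p>foo</p><p>bar</p>'
    '</div></body></html>'
)

root = ET.fromstring(DOC)

# Read the attribute from each img node individually.
alts = [img.get("alt") for img in root.iterfind(".//div[@class='consultant']/img")]

# Joining all text at once loses the boundaries between nodes.
joined = "".join(root.itertext())

# Taking the text per node keeps them separate.
texts = [p.text for p in root.iter("p")]

print(alts, joined, texts)
```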