xPath/HTML: Select node based on related node - html

<html>
<body>
<table>
<tr>
<th>HeaderA</th>
<th>HeaderB</th>
<th>HeaderC</th>
<th>HeaderD</th>
</tr>
<tr>
<td>ContentA</td>
<td>ContentB</td>
<td>ContentC</td>
<td>ContentD</td>
</tr>
</table>
</body>
</html>
I am looking for the most efficient way to select the content 'td' node based on the heading in the corresponding 'th' node..
My current xPath expression..
/html/body/table/tr/td[count(/html/body/table/tr/th[text() = 'HeaderA']/preceding-sibling::*)+1]
Some questions..
Can you use relative paths (../..) inside count()?
What other options to find current node number td[?] or is count(/preceding-sibling::*)+1 the most efficient?

It is possible to use relative paths inside count()
I have never heard of another way to find the node number...
Here is the code with relative xpath-code inside count()
/html/body/table/tr/td[count(../../tr/th[text()='HeaderC']/preceding-sibling::*)+1]
But well, it is not much shorter... It won't be shorter than this in my opinion:
//td[count(../..//th[text()='HeaderC']/preceding-sibling::*)+1]

Harmen's answer is exactly what you need for a pure XPATH solution.
If you are really concerned with performance, then you could define an XSLT key:
<xsl:key name="columns" match="/html/body/table/tr/th" use="text()"/>
and then use the key in your predicate filter:
/html/body/table/tr/td[count(key('columns', 'HeaderC')/preceding-sibling::th)+1]
However, I suspect you probably won't be able to see a measurable difference in performance unless you need to filter on columns a lot (e.g. for-each loops with checks for every row for a really large document).

I would have left Xpath aside... since I assume it was DOM parsed, I'd use a Map data structure, and match the nodes in either client side or server side (JavaScript / Java) manually.
Seems to me XPath is being streatched beyond its limit here.

Perhaps you want position() and XPath axes?

Related

HtmlAgilityPack Wildcard Search in Powershell

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to shorten the above SelectSingleNode twice to something like this
$contactsBlock.SelectSingleNode(".//table/*/table") and skip the in-between.
Is there a way to wild-card like this?
An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. Doesn't match the structure suggested by the XPath expression from your question, too.

Selenium automation- finding best xpath

I am looking to avoid using xpaths that are 'xpath position'. Reason being, the xpath can change and fail an automation test if a new object is on the page and shifts the expected xpath position.
But on some web pages, this is the only xpath I can find. For example, I am looking to click a tab called 'FooBar'.
If I use the Selenium IDE FireFox plugin, I get:
//td[12]/a/font
If I use the FirePath Firefox plugin, I get:
html/body/form/table[2]/tbody/tr/td[12]/font
If a new tab called "Hello, World" is added to the web page (before FooBar tab) then FooBar tab will change and have an xpath position of
//td[13]/a/font
What would you suggest to do?
TY!
Instead of using absolute xpath you could use relateive xpath which is short and more reliable.
Say
<td id="FooBar" name="FooBar">FooBar</td>
By.id("FooBar");
By.name("FooBar");
By.xpath("//td[text()='FooBar']") //exact match
By.xpath("//td[#id='FooBar']") //with any attribute value
By.xpath("//td[contains(text(),'oBar')]") //partial match with contains function
By.xpath("//td[starts-with(text(),'FooB')]") //partial match with startswith function
This blog post may be useful for you.
Relative xpath is good idea. relative css is even better(faster)
If possible suggest/request id for element.
Check also chrome -> check element -> copy css/xpath
Using //td is not a good idea because it will return all your td nodes. Any predicate such as //td[25] will be a very fragile selection because any td added to any previous table will change its result. Using plugins to generate XPath is great to find quickly what you want, but its always best to use it just as a starting point, and then analyze the structure of the file to write a locator that will be harder to break when changes occur.
The best locators are anchored to invariant values or attributes. Plugins usually won't suggest id or attribute anchors. They usually use absolute positional expressions. If can rewrite your locator path in terms of invariant structures in the file, you can then select the elements or text that you want relative to it.
For example, suppose you have
<body> ...
... lots of code....
<h1>header that has a special word</h1>
... other tags and text but not `h1` ...
<table id="some-id">
...
<td>some-invariant-text</td>
<td>other text</td>
<td>the field that you want</td>
...
The table has an ID. That's the best anchor. Now you can select the table as
//table[#id='some-id']
But many times you don't have the id, or even some other invariant attribute. You can still try to discover a pattern. For example: suppose that the last <h1> before the table you want contains a word you can match, you could still find the table using:
//table[preceding::h1[1][contains(.,'word')]]
Once you have the table, you can use relative axes to find the other nodes. Let's assume you want an td but there are no attributes on any tbody, tr, etc. You can still look for some invariant text. Tables usually have headers, or some fixed text which you can match. In the example above, if you find a td that is 2 fields before the one that you want, you could use:
//table[preceding::h1[1][contains(.,'word')]]/td[preceding-sibling::td[2][.='some-invariant-text']]
This is a simple example. If you apply some of these suggestions to the file you are working on, you can improve your XPath expression and make your selection code more robust.

Using XPath to select table that includes specific class

I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[#class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.
The closest you can do is with the contains function:
//table[contains(#class,'info')]
But please be aware that this would capture a table with the class information, or anything else that has the info substring. As far as I know XPath can't distinguish whole-word matches. So you'd have to filter results to check for this possible condition.
What you'd ideally need is a CSS selector like table.info. And some XPath engines and toolkits fo XML/HTML parsing do support these selectors, which are translated to XPath expressions internally, e.g. cssselect if you use Python and which is included in lxml, or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(#class), ' '), ' info')]
I know that you did not asked for this answer, but I think it will help you to make your queries more precise.
//table[ (contains(#class,"result-cont") or contains(#class,"resultCont")) and not(contains(#class,"hide")) ]
This will get classes that contain 'result-cont' or 'resultCont', and do not have the 'hide' class.
XPath 1.0 is , indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with() substring() and similar functions. See this answer for creating something similar to a regex.
XSLT2.0 (which not all browsers and software support) has support for regex.

Excel VBA: get content from online HTML table

can anybody pleas show me part of VBA code, which will get text "hello" from this example online HTML table? first node will be found by his ID (id="something").
...
<table id="something">
<tr>
<td><TABLE><TR><TD></TD></TR><TR><TD></TD></TR></TABLE></td><td></td>
</tr>
<tr>
<td></td><td></td><td>hello</td>
</tr>
...
i think it will be something like child->sibling->child->sibling->sibling->child, but I don't know the exact way.
EDIT
updated code tags are CAPITALS. so if I use getElemenetsById("something").getElemenetsByTagName('tr') it get only two tr tags to collection, or four (with tags which are deeper children)?
If you did search for an answer, you might want to broaden your scope next time. There are plenty of questions and answers that deal with DOM stuff and VBA.
Use getElementById on HTMLElement instead of HTMLDocument
While the question (and answers) aren't exactly what you want, it will show you how to create something you can work with.
You'll need to use a mixture of getElementById() and getElemenetsByTagName() to retrieve your desired "hello"
eg: Document.getElementById("something").getElementsByTagName("tr")(1).getElementsByTagName("td")(2).innerText
Get the element "something"
Inside "something" get all "tr" tags (specifically the one at index 1)
Inside the returned tr tag get all "td" tags (specifically the one at index 2)
Get the innerText of the previous result
These objects use a 0 based array so the first item is item(0).
Update
document.getElementById() will return an (singular) IHTMLElement (which will include all of its children) or nothing/null if it does not exist.
document.getElementsByTagName() will return a collection of IHTMLElement (again, each element will include all of its children). (or an empty collection if none exist)
document.getElementsByTagName("tr") this will return all tr elements inside the "document" element.
document.getElementsByTagName("tr")(0) will return the first (singular) IHTMLElement from the collection. (note the index at the end?)
There is no (that i could find) "sibling" feature of the InternetExplorer object in VBA, so you'd have to do it manually using the child index.
Using the DOM Functions is the clean way to do it. Its much clearer than just looking at a chain "Element.Children(0).children(1).children(2)" as you've no idea what the index means without manually looking it up.
I looked all over for the answer to this question, too. I finally found the solution by talking to a coworker which was actually through recording a macro.
I know, you all think you are above this, but it is actually the best way. See the full post here: http://automatic-office.com/?p=344
In short, you want to record the macro and go to data --> from web and navigate to your website and select the table you want.
I have used the above solutions "get element by id" type stuff in the past, and it is great for a few elements, but if you want a whole table, and you aren't super experienced, just record a macro.
don't tell your friends and then reformat it to look like your own work so no one knows you used the macro tool ;)

Semantics: Is a table or a list better for a "list" of details

I'm thinking, for a list like that:
OS iOS
Version 5.1
Size 12.5MB
(There shouldn't actually be a dot or any other list style in front of it, it just to make it more readable.)
Is it better to use a list (like this for example):
<ul>
<li>
<h3>OS</h3>
<p>iOS</p>
</li>
</ul>
Or a table:
<table>
<tbody>
<tr>
<th>OS</th>
<td>iOS</td>
</tr>
</tbody>
</table>
Or is even a definition list (dl) the most accurate?
Personally I would go with a list, but I'm not sure how to wrap the content semantically correct within the <li>-tags.
I've read an interesting article about it "DEFINITION LISTS VERSUS TABLES", but I'm still not sure.
As of HTML5, a dl would definitely be the most semantically appropriate:
http://html5doctor.com/the-dl-element/.
Unless you're planning on listing multiple instances of data/rows in a way that would be tabular, a table wouldn't make sense at all.
I generally prefer lists over tables for reasons like responsive design, but as a rule of thumb unless you're dealing with something that is conceptually a view of a matrix rather than a detail of a single record or otherwise a single set of key/value attributes it would be difficult to justify a table (and no matter what in this case the dl is an exact match for the use).
I would use table. You are trying to organize key-value pairs, which I think are easier to convey in a tabular format than a list. I would use list if you did not include the key names, i.e. if you omitted OS, Version, and Size.