HtmlAgilityPack Wildcard Search in Powershell - html

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to collapse the two SelectSingleNode calls into a single call, something like this:
$contactsBlock.SelectSingleNode(".//table/*/table")
skipping the elements in between.
Is there a way to wild-card like this?

An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. It also doesn't match the structure suggested by the XPath expression in your question.
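The difference between // and /*/ can be tried out in a small sketch. It uses Python's standard xml.etree.ElementTree rather than HtmlAgilityPack (ElementTree only implements a subset of XPath, but it covers both forms), and the markup is a made-up stand-in for $contactsBlock, not the page from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for $contactsBlock: an outer table with an inner
# table nested several levels down (tr > td > div > div > table).
contacts_block = ET.fromstring("""
<div id="contacts">
  <table id="outer">
    <tr><td>header</td></tr>
    <tr><td><div><div>
      <table id="inner"><tr><td>cell</td></tr></table>
    </div></div></td></tr>
  </table>
</div>
""")

# Two chained lookups, as in the question.
chained = contacts_block.find(".//table").find(".//table")

# One lookup: '//' skips any number of intermediate levels.
single = contacts_block.find(".//table//table")

# '*' stands for exactly one level, so this requires table/<anything>/table.
one_level = contacts_block.findall(".//table/*/table")

print(chained.get("id"), single.get("id"), one_level)
```

Here the chained and the single lookup land on the same inner table, while the /*/ form finds nothing because the inner table is not a grandchild of the outer one.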

Related

XPath to separately select each of two values in a table cell?

<td _ngcontent-wp class="align-middle">
"4.79728"
<small _ngcontent-wp class="neo_red_dark"> -0.08% </small>
</td>
My XPath as follows:
(//table[@class="table"]/tbody/tr/td[3])[1]
It works, but it gets two values together (4.79728 -0.08%). How can I get them separately?
You can get the value before the space and after the space using:
substring-before() and substring-after()
or change your XPath to target the text() descendants of the td instead of the td itself (which is producing the calculated text value).
In order to select "4.79728":
(//table[@class="table"]/tbody/tr/td[3])[1]/text()
In order to select -0.08%:
(//table[@class="table"]/tbody/tr/td[3])[1]/small/text()
You should indicate with XPath questions which XPath version you are using.
If it's version 1.0, remember that the set of data types you can return is very limited: a single string, number, or boolean, or a node-set. And some APIs only allow you to return a node-set.
Your current query is returning a node-set containing one node, namely a td element, whose string value contains the concatenation of all the text within. You could return a node-set containing all the text nodes individually by appending //text() to the query. But of course, it won't always be the case that the two numbers are in separate text nodes.
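To illustrate the difference between the string value of the td and its individual text nodes, here is a rough sketch with Python's standard xml.etree.ElementTree. It is not an XPath engine in the full sense (it has no text() step), so the text nodes are read through the .text attributes instead. The markup is simplified from the question: the valueless _ngcontent-wp attribute is dropped because it is not well-formed XML, and the quotes around the number are assumed to be DevTools display only:

```python
import xml.etree.ElementTree as ET

# Simplified copy of the cell from the question (assumptions noted above).
td = ET.fromstring("""
<td class="align-middle">
  4.79728
  <small class="neo_red_dark"> -0.08% </small>
</td>
""")

# String value of the td: all descendant text concatenated.
combined = " ".join("".join(td.itertext()).split())

# Text directly inside the td, before the <small> child (like td/text()).
price = td.text.strip()

# Text inside the <small> child (like td/small/text()).
change = td.find("small").text.strip()

print(combined)
print(price)
print(change)
```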

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I've to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
with this XPath query I get the names (maybe one, maybe more)
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() I can only extract the first element (result would be: Name1)
Thanks for helping me!
Just wrap your XPath in the data() function (available in XPath 2.0+ and XQuery), e.g. data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]), to retrieve the text.
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
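If only XPath 1.0 is available, the joining can be done in the host language instead. A minimal sketch in Python with the standard xml.etree.ElementTree, assuming a made-up table that follows the layout described in the question (the [text()] predicate is left out because ElementTree does not support it):

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the one described in the question.
body = ET.fromstring("""
<body><table><tbody>
  <tr>
    <td>Deployed to</td><td>-</td>
    <td><div>
      <span class="log_entry_text">Name1</span>
      <span class="log_entry_text">Name2</span>
      <span class="log_entry_text">Name3</span>
    </div></td>
  </tr>
  <tr><td>Other</td><td>-</td><td><div><span>Ignored</span></div></td></tr>
</tbody></table></body>
""")

# Select the span elements, then join their text outside of XPath.
spans = body.findall("./table/tbody/tr[td='Deployed to']/td[3]/div//span")
names = " ".join(span.text for span in spans)

print(names)
```

The query still returns elements; the tags disappear only because the code reads each element's text rather than serializing the element.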

What is the role of parentheses in XPath 1.0?

In Chrome DevTools > Elements, when I search for //tr/td/span I find an element (because such an element exists on my page).
When I search for (//tr)/td/span or (//tr/td)/span I also find this element.
But neither //tr(/td)/span nor //tr/(td)/span nor //tr/(td/)span find anything.
What is the meaning of these parentheses in XPath?
Parentheses in XPath are used as they are in other programming languages:
Function argument grouping: e.g: //tr/td[contains(.,"e")]
Evaluation precedence indication: e.g: normal arithmetic expression grouping as well as leading path grouping (trace LocationPath through to PrimaryExpr in the XPath grammar) as in (//td)[1] to find the first td in the document as opposed to //td[1] which finds the td elements that are the first child of their respective parent elements.
They're also used in
node tests: e.g: node(), element(), ...
processing instructions: e.g: PageBreak().
Your examples that do not find anything (e.g: //tr(/td)/span, //tr/(td)/span [1], etc) have parentheses embedded within the path that do not fall into one of the above categories. Such uses of parentheses are actually syntactically invalid and should have been reported as such rather than silently failing.
[1] Note that this expression would actually be syntactically valid under XPath 2.0/3.0. Thanks, @Andersson, for noticing.
I don't think the parentheses mean anything in your case, but they can be used to return the required node or node-set depending on the index that follows.
For instance, HTML is like below:
<table>
<tr>
<td>
<span>first</span>
</td>
<td>
<span>second</span>
</td>
</tr>
<tr>
<td>
<span>third</span>
</td>
<td>
<span>fourth</span>
</td>
</tr>
</table>
(//tr)[1]/td will return cells for first row only (first, second)
(//tr)[2]/td - for second row (third, fourth)
(//tr/td)[1] - first cell of first row (first). Note that //tr/td[1] will return the first cell of each row (first, third)
...
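The three cases above can be reproduced with Python's standard xml.etree.ElementTree as a rough sketch. ElementTree does not accept a parenthesized path such as (//tr)[1], so the grouping is emulated by indexing the result list in Python; the texts() helper is introduced here only for illustration:

```python
import xml.etree.ElementTree as ET

table = ET.fromstring("""
<table>
  <tr><td><span>first</span></td><td><span>second</span></td></tr>
  <tr><td><span>third</span></td><td><span>fourth</span></td></tr>
</table>
""")

def texts(cells):
    """Return the span text of each td in cells (illustrative helper)."""
    return [cell.find("span").text for cell in cells]

# //tr/td[1]: the predicate is applied per parent, one hit per row.
first_cell_of_each_row = texts(table.findall(".//tr/td[1]"))

# (//tr)[1]/td: index the whole result first, then step down.
cells_of_first_row = texts(table.findall(".//tr")[0].findall("td"))

# (//tr/td)[1]: first item of the whole result.
first_cell_overall = texts(table.findall(".//tr/td")[:1])

print(first_cell_of_each_row, cells_of_first_row, first_cell_overall)
```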

Confounded by XPath

When it comes to indexing in XPath, I feel like I'm missing something here.
If I have two table tags in an HTML document, and within the Chrome console I type $x("//table[1]");, I expect to get the first table tag on the page.
Instead, I get a list containing both table tags. I suspected it might have something to do with using // but using an absolute XPath expression yielded the same results.
I think this is a pretty simple misunderstanding, but I'm not seeing it when reading the docs.
//table[1] returns all tables that are the first table child of their respective parents.
To get the first table, use /descendant::table[1] or (//table)[1]; both forms are valid in XPath 1.0 as well as 2.0.
Here it is in the standard:
The path expression //para[1] does not mean the same as the path expression /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their respective parents.
Use
(//table)[1]
i.e. the first of all the tables.

xPath/HTML: Select node based on related node

<html>
<body>
<table>
<tr>
<th>HeaderA</th>
<th>HeaderB</th>
<th>HeaderC</th>
<th>HeaderD</th>
</tr>
<tr>
<td>ContentA</td>
<td>ContentB</td>
<td>ContentC</td>
<td>ContentD</td>
</tr>
</table>
</body>
</html>
I am looking for the most efficient way to select the content 'td' node based on the heading in the corresponding 'th' node..
My current xPath expression..
/html/body/table/tr/td[count(/html/body/table/tr/th[text() = 'HeaderA']/preceding-sibling::*)+1]
Some questions..
Can you use relative paths (../..) inside count()?
What other options to find current node number td[?] or is count(/preceding-sibling::*)+1 the most efficient?
It is possible to use relative paths inside count()
I have never heard of another way to find the node number...
Here is the code with relative xpath-code inside count()
/html/body/table/tr/td[count(../../tr/th[text()='HeaderC']/preceding-sibling::*)+1]
But well, it is not much shorter... It won't be shorter than this in my opinion:
//td[count(../..//th[text()='HeaderC']/preceding-sibling::*)+1]
Harmen's answer is exactly what you need for a pure XPATH solution.
If you are really concerned with performance, then you could define an XSLT key:
<xsl:key name="columns" match="/html/body/table/tr/th" use="text()"/>
and then use the key in your predicate filter:
/html/body/table/tr/td[count(key('columns', 'HeaderC')/preceding-sibling::th)+1]
However, I suspect you probably won't be able to see a measurable difference in performance unless you need to filter on columns a lot (e.g. for-each loops with checks for every row for a really large document).
I would have left XPath aside... since I assume it was DOM parsed, I'd use a Map data structure, and match the nodes in either client side or server side (JavaScript / Java) manually.
Seems to me XPath is being stretched beyond its limit here.
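A sketch of that map-based idea, written in Python with the standard xml.etree.ElementTree rather than JavaScript or Java. The names column_of and cells_under are made up for this illustration, and the code assumes header texts are unique:

```python
import xml.etree.ElementTree as ET

html = ET.fromstring("""
<html><body><table>
  <tr><th>HeaderA</th><th>HeaderB</th><th>HeaderC</th><th>HeaderD</th></tr>
  <tr><td>ContentA</td><td>ContentB</td><td>ContentC</td><td>ContentD</td></tr>
</table></body></html>
""")
table = html.find("./body/table")

# Map each header text to its 1-based column position, built once.
column_of = {
    th.text: position
    for position, th in enumerate(table.findall("./tr/th"), start=1)
}

def cells_under(header):
    """Return the text of every td in the column below the given header."""
    position = column_of[header]
    return [td.text for td in table.findall(f"./tr/td[{position}]")]

print(cells_under("HeaderC"))
```

The dictionary lookup plays the role of count(.../preceding-sibling::*)+1, but the header row is scanned only once no matter how many columns are queried.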
Perhaps you want position() and XPath axes?