Using XPath to select table that includes specific class


I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[@class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.

The closest you can get is with the contains function:
//table[contains(@class,'info')]
But please be aware that this would also capture a table with the class "information", or any other class that has "info" as a substring. As far as I know XPath can't distinguish whole-word matches. So you'd have to filter the results to check for this possible condition.
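The filtering step mentioned above could look roughly like the following. This is only a sketch in Python using the standard library: the sample markup (including the "information" decoy table) and the helper name has_class are made up for illustration, and it assumes the document is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second inner table is a decoy whose class
# merely contains "info" as a substring.
HTML = """<html><body>
<table class="no-border"><tr><td>
  <table class="info no-border"><tr><td>inner</td></tr></table>
  <table class="information"><tr><td>decoy</td></tr></table>
</td></tr></table>
</body></html>"""


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in (element.get("class") or "").split()


root = ET.fromstring(HTML)

# Step 1: the loose substring test, equivalent to contains(@class,'info').
candidates = [t for t in root.iter("table") if "info" in (t.get("class") or "")]

# Step 2: keep only the tables where "info" is a whole class token.
matches = [t for t in candidates if has_class(t, "info")]
classes = [t.get("class") for t in matches]
print(len(candidates), classes)
```

The substring test lets both inner tables through; the token check keeps only the one whose class list really includes info.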

What you'd ideally need is a CSS selector like table.info. Some XPath engines and toolkits for XML/HTML parsing do support these selectors, translating them to XPath expressions internally, e.g. cssselect if you use Python (lxml integrates with it), or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(@class), ' '), ' info ')]
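To see why the padding works, here is a plain-Python imitation of what that predicate computes for a single class attribute value. It is an illustration only: the function name xpath_class_test is made up, and Python's split() treats a few more characters as whitespace than normalize-space() does.

```python
def xpath_class_test(class_attr, name):
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    normalized = " ".join(class_attr.split())  # normalize-space(@class)
    padded = " " + normalized + " "            # concat(' ', ..., ' ')
    return (" " + name + " ") in padded        # contains(..., ' name ')


for value in ["info no-border", "no-border   info", "information", "info-box", ""]:
    print(repr(value), xpath_class_test(value, "info"))
```

The token is padded with a space on both sides. With only a leading space, ' info' would still be found inside ' no-border information ', so a table with the class information would slip through.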

I know that you did not ask for this, but I think it will help you make your queries more precise.
//table[ (contains(@class,"result-cont") or contains(@class,"resultCont")) and not(contains(@class,"hide")) ]
This selects tables whose class attribute contains 'result-cont' or 'resultCont' and does not contain 'hide'.

XPath 1.0 is, indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with(), substring() and similar functions. See this answer for creating something similar to a regex.
XPath 2.0 and XSLT 2.0 (which not all browsers and software support) do support regular expressions.

Related

HtmlAgilityPack Wildcard Search in Powershell

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to shorten the above SelectSingleNode twice to something like this
$contactsBlock.SelectSingleNode(".//table/*/table") and skip the in-between.
Is there a way to wild-card like this?

An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. It also doesn't match the structure suggested by the XPath expression in your question.
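The difference between the two expressions can be checked with a small sketch. It uses Python's standard-library ElementTree rather than HtmlAgilityPack (ElementTree supports only a subset of XPath, but both of these paths fall inside it), and the sample markup and id values are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the inner table sits several levels below the outer one.
doc = ET.fromstring(
    "<div>"
    "<table id='outer'><tr><td><div>"
    "<table id='inner'><tr><td>x</td></tr></table>"
    "</div></td></tr></table>"
    "</div>"
)

# '//' spans any number of levels, so this finds the nested table.
nested = doc.findall(".//table//table")

# '*' stands for exactly one element, so this only finds a table that is
# a grandchild of another table; there is none in this sample.
grandchildren = doc.findall(".//table/*/table")

print([t.get("id") for t in nested], grandchildren)
```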

regex to get tags previous and next

I have some html in a single string that may or may not have newlines. It could look something like this:
<table><tr><th>blah1</th></tr><tr><input class="inputClass"><span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span></input></tr></table>
formatted nicely:
<table>
<tr>
<th>blah1</th>
</tr>
<tr>
<input class="inputClass">
<span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span>
</input>
</tr>
</table>
I'd like to search this string for
(open<pfel|close/>pfel|single'pfel|double"pfel)
but also get two open tags before and two close tags after. So I'd like to get something like:
<input class="inputClass"><span>open<pfelclose/>pfelsingle'pfeldouble"pfel</span></input>
I cannot assume that input or span will be there, nor can I assume that there are necessarily two tags before or two tags after.
My attempt seems to always pull the entire start of the string:
.*[<]{0,2}?(open<pfel|close/>pfel|single'pfel|double"pfel)[/>]{0,2}?

The trouble with your situation is that you want to find matching tags (the open and close tags before and after the text you're searching for). Regex cannot do this. It isn't capable of parsing a nested structure like HTML. Regex parses regular languages, and HTML isn't one. Advanced Regex engines can sometimes be coerced into doing almost what you're trying to do here, but it's usually more trouble than it's worth.
Your solution in the comments is probably the correct one. Find what you're looking for with the regex, and then use an HTML parser to get what you need.

Powershell modifying HTML from ConvertTo-HTML

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)

Disclaimer: HTML cannot be parsed by a regular expression parser. A regular expression will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, is there some reason you can't modify the HTML generation to do this, instead of waiting until the HTML is already generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to accept newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, it would treat your entire string as one match and only replace the first <tr>.
The second string is the replacement. It puts the <tr> element and its contents back in place with an added style attribute, using $1 and $2 to reinsert the text captured by the two groups.
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard code several replaces or loop over status/color pairs and build your pattern and replace strings from them. A full blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
Credit where it's due: I took the HTML style attribute from @FrankieTheKneeMan.
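Looping over status/color pairs, as suggested above, might look like this. The sketch is in Python rather than PowerShell; the "Failed" status, the colors, and the names STATUS_COLORS and colorize_rows are assumptions made for the example. Unlike a plain .*?, the pattern refuses to run past a </tr>, so a row with a different status is not swallowed into the match.

```python
import re

# Hypothetical mapping; only "Delivered" appears in the question.
STATUS_COLORS = {"Delivered": "green", "Failed": "red"}

# Matches any single character, but never steps over a closing </tr>.
WITHIN_ROW = r"(?:(?!</tr>).)*?"


def colorize_rows(html, status_colors):
    """Add a style attribute to each plain <tr> whose cells include a status."""
    for status, color in status_colors.items():
        pattern = re.compile(
            "<tr>(" + WITHIN_ROW + "<td>" + re.escape(status) + "</td>"
            + WITHIN_ROW + ")</tr>",
            re.DOTALL,  # same role as (?s): let '.' match newlines
        )
        html = pattern.sub(rf'<tr style="color: {color}">\1</tr>', html)
    return html


sample = (
    "<tr>\n<td>Report A</td>\n<td>Failed</td>\n</tr>\n"
    "<tr>\n<td>Report B</td>\n<td>Delivered</td>\n</tr>"
)
colored = colorize_rows(sample, STATUS_COLORS)
print(colored)
```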

xPath/HTML: Select node based on related node

<html>
<body>
<table>
<tr>
<th>HeaderA</th>
<th>HeaderB</th>
<th>HeaderC</th>
<th>HeaderD</th>
</tr>
<tr>
<td>ContentA</td>
<td>ContentB</td>
<td>ContentC</td>
<td>ContentD</td>
</tr>
</table>
</body>
</html>
I am looking for the most efficient way to select the content 'td' node based on the heading in the corresponding 'th' node.
My current XPath expression:
/html/body/table/tr/td[count(/html/body/table/tr/th[text() = 'HeaderA']/preceding-sibling::*)+1]
Some questions:
Can you use relative paths (../..) inside count()?
What other options are there to find the current node number td[?], or is count(/preceding-sibling::*)+1 the most efficient?

It is possible to use relative paths inside count().
I have never heard of another way to find the node number.
Here is the expression with a relative path inside count():
/html/body/table/tr/td[count(../../tr/th[text()='HeaderC']/preceding-sibling::*)+1]
But well, it is not much shorter... It won't be shorter than this in my opinion:
//td[count(../..//th[text()='HeaderC']/preceding-sibling::*)+1]

Harmen's answer is exactly what you need for a pure XPath solution.
If you are really concerned with performance, then you could define an XSLT key:
<xsl:key name="columns" match="/html/body/table/tr/th" use="text()"/>
and then use the key in your predicate filter:
/html/body/table/tr/td[count(key('columns', 'HeaderC')/preceding-sibling::th)+1]
However, I suspect you probably won't be able to see a measurable difference in performance unless you need to filter on columns a lot (e.g. for-each loops with checks for every row for a really large document).

I would have left XPath aside. Since I assume the document was parsed into a DOM, I'd use a Map data structure and match the nodes manually, on either the client side or the server side (JavaScript / Java).
Seems to me XPath is being stretched beyond its limit here.
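A rough sketch of that map-based approach, written in Python with the standard library instead of JavaScript or Java, and reusing the table from the question. The names header_index and cells_under are made up, and the snippet assumes the markup parses as XML.

```python
import xml.etree.ElementTree as ET

HTML = """<html><body><table>
<tr><th>HeaderA</th><th>HeaderB</th><th>HeaderC</th><th>HeaderD</th></tr>
<tr><td>ContentA</td><td>ContentB</td><td>ContentC</td><td>ContentD</td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)
table = root.find("./body/table")

# Map each header text to its column index (0-based here, whereas the
# XPath count(...)+1 expressions are 1-based).
header_index = {th.text: i for i, th in enumerate(table.findall("./tr/th"))}


def cells_under(header):
    """Return the text of every td that sits in the column named `header`."""
    column = header_index[header]
    result = []
    for row in table.findall("./tr"):
        cells = row.findall("td")
        if column < len(cells):  # the header row has no td cells
            result.append(cells[column].text)
    return result


print(cells_under("HeaderC"))
```

The header lookup is built once, so selecting many columns does not rescan the th elements each time.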

Perhaps you want position() and XPath axes?

PHP Regex: Get info between groups of HTML tags?

I have been programming a word-unscrambler. I need to parse the information between a group of tags and another, and put all matches into an array. The beginning tag is:
<tr> <td></td><td><li>
and the ending tag is:
</li></td> </tr>
I know some regular expressions, but I am unfamiliar with PHP.

<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>
Test is here: http://rubular.com/regexes/13241

PHP can cope with both Perl-style and POSIX regexes, so you'll need to use the function for the flavour that you know. I believe that the pattern posted by @Gazler should work in either case, though it will need to be surrounded by / delimiters and quoted:
$reg = '/<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>/';
There's a test site for PHP's Perl-style regexes here: http://j-r.camenisch.net/regex/
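For collecting all matches into an array, the same pattern can be tried out in Python, where re.findall returns every captured group as a list. This is only an illustration with invented input (the words tac and act), not PHP code. It also swaps the greedy (.*) for the lazy (.*?), because with several rows on one line the greedy form captures everything from the first <li> to the last </li>.

```python
import re

# Hypothetical input: two result rows on a single line.
page = (
    "<tr> <td></td><td><li>tac</li></td> </tr>"
    "<tr> <td></td><td><li>act</li></td> </tr>"
)

lazy = re.findall(r"<tr> <td></td><td><li>(.*?)</li></td> </tr>", page)
greedy = re.findall(r"<tr> <td></td><td><li>(.*)</li></td> </tr>", page)

print(lazy)         # one entry per row
print(len(greedy))  # the greedy form collapses both rows into one match
```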
