Get last </td></tr> with regular expression? - html

I need to get all tags between last </td> and the closing </tr> in each row. The regular expression I use <\/TD\s*>(.*?)<\/TR\s*> retrieve all from first </TD> till last </TR> - marked with bold on sample below.
<TABLE>
<TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
But a what I really need is
<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>

Its not recommended to use regular expressions to parse HTML, html is non regular and there for notoriously unreliable when trying to use regular expressions.
Heres a good blog post explaining the logic and offering alternatives:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

</TD>((?:(?!</T[DR]>).)*)</TR>
The regex starts to match at the first </TD>, but fails as soon as it reaches the second </TD> because of the (?!</T[DR]>)., which matches any character that's not the first character of a </TD> or </TR> tag. That's optional because of the enclosing (?:...)*, so it tries to match the next part of the regex, which is </TR>. That fails too, so the match attempt is abandoned.
It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.

Related

What is the role of parentheses in XPath 1.0?

In Chrome DevTools > Elements, when I search for //tr/td/span I find an element (because such an element exists on my page).
When I search for (//tr)/td/span or (//tr/td)/span I also find this element.
But neither //tr(/td)/span nor //tr/(td)/span nor //tr/(td/)span find anything.
What is the meaning of these parentheses in XPath?
Parenthesis in XPath are used as they are in other programming languages:
Function argument grouping: e.g: //tr/td[contains(.,"e")]
Evaluation precedence indication: e.g: normal arithmetic expression grouping as well as leading path grouping (trace LocationPath through to PrimaryExpr in the XPath grammar) as in (//td)[1] to find the first td in the document as opposed to //td[1] which finds the td elements that are the first child of their respective parent elements.
They're also used in
node tests: e.g: node(), element(), ...
processing instructions: e.g: PageBreak().
Your examples that do not find anything (e.g: //tr(/td)/span, //tr/(td)/span1, etc) have parenthesis embedded within the path that do not follow in one of the above categories. Such use of parenthesis are actually syntactically invalid and should have been reported as such rather than silently failing.
1Note that this expression would actually be syntatically valid under XPath 2.0/3.0. Thanks, #Andersson, for noticing.
I don't think that parenthesis mean something in your case, but it might be used to return required node/nodes set depending on passed index
For instance, HTML is like below:
<table>
<tr>
<td>
<span>first</span>
</td>
<td>
<span>second</span>
</td>
</tr>
<tr>
<td>
<span>third</span>
</td>
<td>
<span>fourth</span>
</td>
</tr>
</table>
(//tr)[1]/td will return cells for first row only (first, second)
(//tr)[2]/td - for second row (third, fourth)
(//tr/td)[1] - first cell of first row (first). Note that //tr/td[1] will returns each first cell of each row (first, third)
...

RegEx multiple instances

I know that also that is not recommended regex to parse html, but this question is rather to help understand regex than parse html with it.
So I have a sample string (singleline without any of linebreak or newline):
<tr><th> H1<th>H2 <th> H3 <tr><td> R1C1<td>R1C2 <td> R1C3 <tr><td> R2C1<td>R2C2 <td> R2C3 <tr><td> R3C1<td>R3C2 <td> R3C3 < ..
For a better understanding there is 3 row and 3 cells, and the end is unknown tag what but no TR or TD:
<tr><th> H1<th>H2 <th> H3
<tr><td> R1C1<td>R1C2 <td> R1C3
<tr><td> R2C1<td>R2C2 <td> R2C3
<tr><td> R3C1<td>R3C2 <td> R3C3
< ..
First try I'd like to get only all of the rows, here is how I think with the expected results:
start with <tr>
following zero or more any character
the end starts with < what follows zero or more any character but not td or th
I'm tried with 'basics' to see how it works..
With pattern (<tr>.*?) why only get <tr> strings, and not TR to TR?
With pattern (<tr>.*?<tr) why only get 1st and 3rd rows only?
I don't find any pattern what is good for the end of the string. I've tried with this pattern:
(<tr>.*?<(?!(td)|(th)))
..but I'm not sure is this good, moreover this gives back only 1st and 3rd rows.
Here is a DEMO what I've tried.
The pattern is:
(<tr>.*?(?=<tr))
Now the answer to the first 2 questions.
Question 1: With pattern (<tr>.*?) why only get <tr> strings, and not TR to TR?
Because with *? you are asking for a lazy operator. Since it is lazy, it tries to match less character as possible. And since 0 characters fulfill the pattern, it stops to 0.
Question 2: With pattern (<tr>.*?<tr) why only get 1st and 3rd rows only?
Because the parser's cursor is already passed over the 2nd <tr while you are getting the first match. In the second match it doesn't go back, because the 2nd <tr is part of the first match.
Using lookahead ((?=<tr)) the parser does not consume the 2nd <tr in the first match. Your last pattern is almost good, but even with that, the < of the 2nd TR tag is consumed in the first match, so it can't be part of the second match.

When doing a search I want to skip all code that ends in a closing tag

I have HTML that looks like this:
<td class="danish"> Det
tycker jag!</td>
I'm fixing the line break with this:
<td class="danish">(.*)
\s*(.*)</td>
But sometimes the HTML ends in a tag on the same line:
<td class="danish">Det tyckeg jag</td>
I want it to skip lines like these when searching and find the next broken line.
In case anyone thinks it's just a frivolous thing to make the code look good, the rest of the code looks like this (not required reading):
<td class="danish"> Det
tycker jag!</td>
<td>
<?php audioButton("../../audio//det_lyder_godt","det_lyder_godt"); ?>
I ultimately have to take the text in the table and replace the one in the audiobutton a thousand times, but that's a different problem
I think this is what you're looking for:
(<td class="danish">(?:(?!</td>).)*)\r?\n\s*
This matches from <td class="danish"> to the next newline, unless there's a </td> tag first. Replace with "$1 " or "\1 " (without the quotes).
Using \r?\n instead of a literal newline makes the regex more robust. Even better is \R, if your regex flavor supports it.
\s means "any white-space character", which includes spaces and new lines. You could explicitly search for lines that must contain a new line, by using something like:
<td class="danish">(.*)\n\s*(.*)</td>
Note the additional \n in the regex.

Using XPath to select table that includes specific class

I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[#class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.
The closest you can do is with the contains function:
//table[contains(#class,'info')]
But please be aware that this would capture a table with the class information, or anything else that has the info substring. As far as I know XPath can't distinguish whole-word matches. So you'd have to filter results to check for this possible condition.
What you'd ideally need is a CSS selector like table.info. And some XPath engines and toolkits fo XML/HTML parsing do support these selectors, which are translated to XPath expressions internally, e.g. cssselect if you use Python and which is included in lxml, or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(#class), ' '), ' info')]
I know that you did not asked for this answer, but I think it will help you to make your queries more precise.
//table[ (contains(#class,"result-cont") or contains(#class,"resultCont")) and not(contains(#class,"hide")) ]
This will get classes that contain 'result-cont' or 'resultCont', and do not have the 'hide' class.
XPath 1.0 is , indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with() substring() and similar functions. See this answer for creating something similar to a regex.
XSLT2.0 (which not all browsers and software support) has support for regex.

Powershell modifying HTML from ConvertTo-HTML

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)
Disclaimer: HTML cannot be parsed by regular expression parser. A regular expression will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, though, is there some reason you can't modify the HTML generation to do this then instead of waiting until the HTML is already generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to accept newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, it would treat your entire string as one match and only replace the first <tr>.
The second string is the replacement. It plops the <tr> element and its contents back in place with an added style attribute, and all without back ref.
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard code several replaces or loop over status/color pairs and build your pattern and replace strings from them. A full blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
Credit where it's due: I took the HTML style attribute from #FrankieTheKneeMan.