Nokogiri: how to traverse every row of a table with two classes

I am attempting to parse an HTML table using Nokogiri. The table is marked up well and has no structural issues, except that the table header is embedded as an actual row instead of using <thead>. I want every row but the first, because I'm not interested in the header, only in what follows it. Here's an example of how the table is structured:
<table id="foo">
<tbody>
<tr class="headerrow">....</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
</tbody>
</table>
I'm interested in grabbing only the rows with the class row or row_alternate. However, this syntax doesn't work in Nokogiri as far as I'm aware:
doc.css('.row .row_alternate').each do |a_row|
# do stuff with a_row
end
What's the best way to solve this with Nokogiri?

I would try this:
doc.css(".row, .row_alternate").each do |a_row|
# do stuff with a_row
end

A CSS selector can contain multiple components separated by commas:
A comma-separated list of selectors represents the union of all elements selected by each of the individual selectors in the list. (A comma is U+002C.) For example, in CSS when several selectors share the same declarations, they may be grouped into a comma-separated list. White space may appear before and/or after the comma.
doc.css('.row, .row_alternate').each do |a_row|
p a_row.to_html
end
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"

Try doc.at_css(".headerrow").remove to drop the header row first, and then iterate over the remaining rows:
doc.css("tr").each do |row|
#some code
end

Related

HtmlAgilityPack Wildcard Search in Powershell

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to shorten the above SelectSingleNode twice to something like this
$contactsBlock.SelectSingleNode(".//table/*/table") and skip the in-between.
Is there a way to wild-card like this?
An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. It also doesn't match the structure suggested by the XPath expression in your question.

Using readHTMLTable with multiple tbody

Suppose I have an HTML table with multiple <tbody>, which we know is perfectly legal HTML, and attempt to read it with readHTMLTable as follows:
library(XML)
table.text <- '<table>
<thead>
<tr><th>Col1</th><th>Col2</th></tr>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'
readHTMLTable(table.text)
The output I get only takes the first <tbody> element:
$`NULL`
Col1 Col2
1 1a 2a
and ignores the rest. Is this expected behavior? (I can't find any mention in the documentation.) And what are the most flexible and robust ways to access the entire table?
I'm currently using
table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)
which prevents me from using readHTMLTable directly on a URL to get a table like this, and also doesn't feel very robust.
If you look at the source for readHTMLTable, via getMethod(readHTMLTable, "XMLInternalElementNode"), it contains the lines
if (length(tbody))
node = tbody[[1]]
so it is purposefully designed to select only the content of the first tbody. Also ?readHTMLTable describes the function as providing
somewhat robust methods for extracting data from HTML tables in an HTML document
It is designed to be a utility function. It's great when it works, but you may need to hack around it.

Powershell modifying HTML from ConvertTo-HTML

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)
Disclaimer: HTML cannot be parsed reliably with regular expressions, and a regular expression will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, is there some reason you can't modify the HTML generation to do this, instead of waiting until the HTML is already generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to match newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, the pattern would match from the first <tr> to the last </tr> as a single match, so only the first <tr> would receive the style attribute.
The second string is the replacement. It puts the <tr> element and its contents back in place with an added style attribute, using $1 and $2 to reinsert the text captured by the two groups.
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard code several replaces or loop over status/color pairs and build your pattern and replace strings from them. A full blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
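The loop over status/color pairs suggested above could look roughly like the following sketch. It is written in Ruby rather than PowerShell purely for illustration, and the STATUS_COLORS table, the "Failed" status, the color values, and the colorize_rows helper are all hypothetical names that do not come from the original script. The pattern also differs from the one above in two ways: it uses a negative lookahead so that a match can never run across a </tr> boundary, and it assumes the status is the last cell of the row.

```ruby
# Hypothetical status-to-color table; the statuses other than "Delivered"
# and all color values are assumptions for illustration.
STATUS_COLORS = {
  "Delivered" => "green",
  "Failed"    => "red"
}.freeze

# Adds a style attribute to every <tr> whose last cell holds a known status.
def colorize_rows(html)
  STATUS_COLORS.reduce(html) do |out, (status, color)|
    # (?:(?!</tr>).)*? consumes characters lazily but never crosses a </tr>,
    # so a match always stays inside a single row. /m lets . match newlines.
    pattern = %r{<tr>((?:(?!</tr>).)*?<td>#{Regexp.escape(status)}</td>\s*)</tr>}m
    out.gsub(pattern) { %(<tr style="color: #{color}">#{Regexp.last_match(1)}</tr>) }
  end
end

sample = <<~HTML
  <tr>
  <td>Report A</td>
  <td>Delivered</td>
  </tr>
  <tr>
  <td>Report B</td>
  <td>Failed</td>
  </tr>
  <tr>
  <td>Report C</td>
  <td>Pending</td>
  </tr>
HTML

puts colorize_rows(sample)
```

Rows whose status is not in the table (here, "Pending") are left untouched. The same caveat as above applies: this only holds up while the generated HTML keeps this exact, flat row structure.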
Credit where it's due: I took the HTML style attribute from @FrankieTheKneeMan.

Regex in html code

<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody>
I'm looking to find a Regex to get this value from a big text.
I tried this one but without result:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#
Regex with html is often a bad idea, because of potential recursive tags. Have you tried using an XML/HTML parser? For example, XmlDocument, XmlElement and XmlAttribute.
EDIT: The problem with regex and html in your example:
A regex cannot keep count of nested tbody tags.
Can the tbody tag look like <tbody>...</tbody> or <tbody .../>?
Even if you know there will be one start and end tag, how do you know there won't be any plain text containing "tbody" somewhere inside the table, thus breaking the regex?
You may want to tell your regex engine that it should match newlines with the . as well.
In PHP, that would make the regex:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#s
Note the trailing s.
Warning: if there are two tbody elements, this regex will match everything from the first tbody (with this ID) to the end of the last tbody (regardless of its ID).
Example:
<tbody id="clavier:infractionList2:tb">Some data</tbody>
<tbody id="tbody2"></tbody>
will also be matched.
This works:
/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is
Or full PHP:
<?php
$html = '<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody> ';
preg_match_all('/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is', $html, $matches);
var_dump($matches[1]);
That gives you the <tr...>....</tr> as a result. If you only want the dots you'll need to use something like:
/<tbody id="clavier:infractionList2:tb">.*?<tr.*?>(.*?)<\/tr>.*?<\/tbody>/is
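The difference between the greedy (.*) and the lazy (.*?) versions discussed above can be checked in a few lines. The sketch below uses Ruby instead of PHP; Ruby's /m flag plays the role of PHP's trailing s (the dot also matches newlines). The sample markup is the two-tbody example from the warning, and the variable names are only illustrative.

```ruby
html = <<~HTML
  <tbody id="clavier:infractionList2:tb">Some data</tbody>
  <tbody id="tbody2"></tbody>
HTML

# Greedy: (.*) runs on to the LAST </tbody> in the input.
greedy = html[%r{<tbody id="clavier:infractionList2:tb">(.*)</tbody>}m, 1]

# Lazy: (.*?) stops at the FIRST </tbody> after the opening tag.
lazy = html[%r{<tbody id="clavier:infractionList2:tb">(.*?)</tbody>}m, 1]

p greedy # => "Some data</tbody>\n<tbody id=\"tbody2\">"
p lazy   # => "Some data"
```

The lazy version is still only safe as long as no tbody is nested inside the one being matched.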

RegEx - Get the HTML TRs where they have no TRs nested

I am trying to get the contents of TRs on a web page that have no TRs nested inside them. The HTML contains many nested TRs.
I am limited to RegEx only for this problem.
This is good:
TR
Contents
/TR
This is not
TR
other HTML
TR
Contents
This is actually not that much of a problem with regex (assuming you can guarantee that <tr> will not show up in comments, strings etc.; otherwise the regex will mis-match):
<tr\b(?:(?!</?tr\b).)*</tr>
will only match innermost tr tags. Use the dot-matches-newlines option of your regex engine, or it won't work correctly. If you don't have one (JavaScript, I'm talking to you!), then use [\s\S] instead of the dot.
Explanation:
<tr\b # Match a tag that starts with tr
(?: # Match...
(?! # (unless it's possible to match
</?tr\b # <tr or </tr at the current position)
)
. # any character
)* # any number of times.
</tr> # Match </tr>
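A quick way to check the pattern is to run it against a small nested sample. The sketch below does this in Ruby, where the /m flag is the dot-matches-newlines option; the sample markup and the names INNERMOST_TR and innermost are made up for illustration.

```ruby
# /m makes . match newlines, as the answer requires.
INNERMOST_TR = %r{<tr\b(?:(?!</?tr\b).)*</tr>}m

html = <<~HTML
  <table>
  <tr><td>
    <table>
    <tr><td>inner 1</td></tr>
    <tr><td>inner 2</td></tr>
    </table>
  </td></tr>
  </table>
HTML

# The outer row contains another <tr>, so the lookahead rejects it;
# only the two rows without nested rows are returned.
innermost = html.scan(INNERMOST_TR)
puts innermost
```

Because the pattern has no capture groups, scan returns the full text of each match, so innermost holds the two inner rows and nothing from the outer one.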