Using readHTMLTable with multiple tbody

Suppose I have an HTML table with multiple <tbody>, which we know is perfectly legal HTML, and attempt to read it with readHTMLTable as follows:
library(XML)
table.text <- '<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'
readHTMLTable(table.text)
The output I get only takes the first <tbody> element:
$`NULL`
Col1 Col2
1 1a 2a
and ignores the rest. Is this expected behavior? (I can't find any mention in the documentation.) And what are the most flexible and robust ways to access the entire table?
I'm currently using
table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)
which prevents me from using readHTMLTable directly on a URL to get a table like this, and also doesn't feel very robust.

If you look at the source for readHTMLTable, via getMethod(readHTMLTable, "XMLInternalElementNode"), it contains the lines
if (length(tbody))
node = tbody[[1]]
so it is purposefully designed to select only the content of the first tbody. Also ?readHTMLTable describes the function as providing
somewhat robust methods for extracting data from HTML tables in an HTML document
It is designed to be a utility function. It's great when it works, but you may need to hack around it.

Related

Can I use knitr to apply CSS styles to individual table cells?

Is it possible to apply a class attribute to individual table cells using knitr? I have successfully applied a class attribute to the section heading that contains a knitr::kable generated table and used that to format the entire table. However, I would like to be able to conditionally format individual cells which would require being able to apply a class to specific <td> elements.
My current workaround is to programmatically wrap the cell contents in a pair of <span> tags and pass that on to knitr::kable. This approach only allows me to format the text inside the cell versus the entire cell (e.g. setting the cell background color). Here's an example of what I'm currently using:
## Read in the report, process the data, send to kable
rpt <- generate.report()
mutate(rpt, Col2 = ifelse(abs(Col2) > Threshold,
paste('<span class="warning">',
sprintf("%.2f", Col2), '</span>'),
sprintf("%.2f", Col2))) %>%
knitr::kable(format="markdown", align = c("l", rep("r", 4)),
col.names = gsub("\\.", "<br>", colnames(.)))
Which results in the following example HTML output:
<td align="right"><span class="warning"> -1.74 </span></td>
I would like to be able to have knitr::kable generate something like this:
<td align="right" class="warning"> -1.74 </td>
That way I could apply CSS styles to the <td> tag rather than the <span> tag.

The ReporteRs package may help; have a look at its FlexTable.
You can then get the corresponding HTML code with the function as.html and reuse it within your knitr code.

OK, this may not be the answer, but it may point you in the right direction. I had a similar problem formatting individual cells in knitr to prepare a PDF. In the end, I used xtable and wrote a function that relied on a logical matrix to decide whether or not a cell in the output table would be formatted.
I couldn't quite get it to work smoothly by myself so I had to post it on here and with the help of ivyleavedtoadflax I was able to develop a reasonably easy to use function to apply formatting to certain cells in an xtable in knitr.
Here's the link to my post
As I say, it's not the exact solution to your problem but it may point you in the right direction.

Selenium automation- finding best xpath

I am looking to avoid using xpaths that are 'xpath position'. Reason being, the xpath can change and fail an automation test if a new object is on the page and shifts the expected xpath position.
But on some web pages, this is the only xpath I can find. For example, I am looking to click a tab called 'FooBar'.
If I use the Selenium IDE FireFox plugin, I get:
//td[12]/a/font
If I use the FirePath Firefox plugin, I get:
html/body/form/table[2]/tbody/tr/td[12]/font
If a new tab called "Hello, World" is added to the web page (before FooBar tab) then FooBar tab will change and have an xpath position of
//td[13]/a/font
What would you suggest to do?
TY!

Instead of using an absolute XPath, you could use a relative XPath, which is shorter and more reliable.
Say
<td id="FooBar" name="FooBar">FooBar</td>
By.id("FooBar");
By.name("FooBar");
By.xpath("//td[text()='FooBar']") //exact match
By.xpath("//td[@id='FooBar']") //with any attribute value
By.xpath("//td[contains(text(),'oBar')]") //partial match with contains function
By.xpath("//td[starts-with(text(),'FooB')]") //partial match with startswith function
This blog post may be useful for you.

A relative XPath is a good idea. A relative CSS selector is even better (faster).
If possible, suggest or request an id for the element.
Also check Chrome -> inspect element -> copy CSS/XPath.

Using //td is not a good idea because it will return all your td nodes. Any predicate such as //td[25] will be a very fragile selection because any td added to any previous table will change its result. Using plugins to generate XPath is a great way to quickly find what you want, but it's always best to use the result just as a starting point, and then analyze the structure of the file to write a locator that will be harder to break when changes occur.
The best locators are anchored to invariant values or attributes. Plugins usually won't suggest id or attribute anchors. They usually use absolute positional expressions. If you can rewrite your locator path in terms of invariant structures in the file, you can then select the elements or text that you want relative to it.
For example, suppose you have
<body> ...
... lots of code....
<h1>header that has a special word</h1>
... other tags and text but not `h1` ...
<table id="some-id">
...
<td>some-invariant-text</td>
<td>other text</td>
<td>the field that you want</td>
...
The table has an ID. That's the best anchor. Now you can select the table as
//table[@id='some-id']
But many times you don't have the id, or even some other invariant attribute. You can still try to discover a pattern. For example: suppose that the last <h1> before the table you want contains a word you can match, you could still find the table using:
//table[preceding::h1[1][contains(.,'word')]]
Once you have the table, you can use relative axes to find the other nodes. Let's assume you want a td but there are no attributes on any tbody, tr, etc. You can still look for some invariant text. Tables usually have headers, or some fixed text which you can match. In the example above, if you find a td that is 2 fields before the one that you want, you could use (note the // before td, since cells sit under tr rather than directly under table):
//table[preceding::h1[1][contains(.,'word')]]//td[preceding-sibling::td[2][.='some-invariant-text']]
This is a simple example. If you apply some of these suggestions to the file you are working on, you can improve your XPath expression and make your selection code more robust.
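To try the anchored locator outside a browser, here is a minimal sketch using the JDK's built-in XPath 1.0 engine (javax.xml.xpath). The page content, the class name AnchoredLocator and the word 'special' are made up for illustration, and the markup is assumed to be well-formed XML, which real pages often are not. It uses // before td because cells sit under tr rather than directly under table.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class AnchoredLocator {
    // Hypothetical page: two tables share the same invariant text,
    // but only the second one follows the <h1> with the special word.
    static final String PAGE =
        "<html><body>"
        + "<h1>first header</h1>"
        + "<table><tr><td>some-invariant-text</td><td>x</td><td>decoy</td></tr></table>"
        + "<h1>header that has a special word</h1>"
        + "<p>other tags and text</p>"
        + "<table><tr><td>some-invariant-text</td><td>other text</td>"
        + "<td>the field that you want</td></tr></table>"
        + "</body></html>";

    // Table anchored on the nearest preceding <h1>, cell anchored on the
    // invariant text two siblings before it.
    static final String EXPR =
        "//table[preceding::h1[1][contains(.,'special')]]"
        + "//td[preceding-sibling::td[2][.='some-invariant-text']]";

    // Returns the string value of the first node selected by expr.
    static String find(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(find(PAGE, EXPR));
    }
}
```

In a Selenium test the same expression string would be passed to By.xpath(...); checking it against a saved copy of the page first makes it easier to see which anchors actually survive layout changes.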

Using XPath to select table that includes specific class

I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[@class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.

The closest you can do is with the contains function:
//table[contains(@class,'info')]
But please be aware that this would capture a table with the class information, or anything else that has the info substring. As far as I know XPath can't distinguish whole-word matches. So you'd have to filter results to check for this possible condition.

What you'd ideally need is a CSS selector like table.info. Some XPath engines and toolkits for XML/HTML parsing do support these selectors, which are translated to XPath expressions internally, e.g. cssselect for Python (available through lxml.cssselect), or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(@class), ' '), ' info ')]
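The difference between the plain contains() test and the space-padded version, with a space on both sides of the class name, can be checked with a small sketch using the JDK's javax.xml.xpath. The sample markup, including the extra table with class "information", and the class name ClassTokenMatch are assumptions added for illustration; the input must be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class ClassTokenMatch {
    // Hypothetical page: nested tables as in the question, plus a decoy
    // whose class merely contains the substring "info".
    static final String PAGE =
        "<html><body>"
        + "<table class=\"no-border\"><tr><td>"
        + "<table class=\"info no-border\"><tr><td>inner</td></tr></table>"
        + "</td></tr></table>"
        + "<table class=\"information\"><tr><td>decoy</td></tr></table>"
        + "</body></html>";

    // Substring match: also hits class="information".
    static final String SUBSTRING = "//table[contains(@class,'info')]";

    // Whole-token match: pad the class list and the name with spaces.
    static final String TOKEN =
        "//table[contains(concat(' ', normalize-space(@class), ' '), ' info ')]";

    // Counts the nodes selected by tableExpr.
    static int count(String xml, String tableExpr) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                "count(" + tableExpr + ")",
                new InputSource(new StringReader(xml)),
                XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("substring: " + count(PAGE, SUBSTRING)); // 2 tables
        System.out.println("token:     " + count(PAGE, TOKEN));     // 1 table
    }
}
```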

I know that you did not ask for this answer, but I think it will help you make your queries more precise.
//table[ (contains(@class,"result-cont") or contains(@class,"resultCont")) and not(contains(@class,"hide")) ]
This will get classes that contain 'result-cont' or 'resultCont', and do not have the 'hide' class.

XPath 1.0 is, indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with(), substring() and similar functions. See this answer for creating something similar to a regex.
XSLT 2.0 (which not all browsers and software support) has support for regex.

Powershell modifying HTML from ConvertTo-HTML

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)

Disclaimer: HTML cannot be parsed by regular expression parser. A regular expression will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, though, is there some reason you can't modify the HTML generation to do this then instead of waiting until the HTML is already generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to accept newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, it would treat your entire string as one match and only replace the first <tr>.
The second string is the replacement. It plops the <tr> element and its contents back in place with an added style attribute, and all without back ref.
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard code several replaces or loop over status/color pairs and build your pattern and replace strings from them. A full blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
Credit where it's due: I took the HTML style attribute from @FrankieTheKneeMan.
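The "loop over status/color pairs" idea could look roughly like the sketch below. It is written in Java rather than PowerShell; Java's regex dialect also supports (?s), lazy quantifiers and $1 in the replacement. The "Failed" status, the colors and the class name RowColorizer are hypothetical. One deliberate change from the command above: with (.*?) alone, a match can begin at the <tr> of a row with a different status and run on to the next Delivered cell, so the sketch uses (?:(?!</tr>).)*? to keep each match inside a single row.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RowColorizer {
    // Status -> color pairs. Only "Delivered" comes from the question;
    // "Failed" and both colors are made up for illustration.
    static final Map<String, String> COLORS = new LinkedHashMap<>();
    static {
        COLORS.put("Delivered", "green");
        COLORS.put("Failed", "red");
    }

    static String colorize(String html) {
        // Any run of characters that does not cross a closing </tr>.
        String insideRow = "(?:(?!</tr>).)*?";
        for (Map.Entry<String, String> e : COLORS.entrySet()) {
            String cell = "<td>" + Pattern.quote(e.getKey()) + "</td>";
            String pattern = "(?s)<tr>(" + insideRow + cell + insideRow + ")</tr>";
            String replacement = "<tr style=\"color: " + e.getValue() + "\">$1</tr>";
            // Rows already styled no longer start with a bare <tr>,
            // so later passes leave them alone.
            html = html.replaceAll(pattern, replacement);
        }
        return html;
    }

    public static void main(String[] args) {
        String email = "<tr>\n<td>Report A</td>\n<td>Pending</td>\n</tr>\n"
                     + "<tr>\n<td>Report B</td>\n<td>Delivered</td>\n</tr>";
        System.out.println(colorize(email));
    }
}
```

As the answer says, this only holds up while the generated HTML keeps this exact shape (bare <tr> tags, one status per <td>); anything less regular is a case for a real parser.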

Retrieve <TD> text using WATIR

I am using WATIR for automated testing, and I need to copy in a variable the value of a rate. In the example below (from webpage source code), I need that variable myrate has value 2.595. I know how to retrieve value from <input> or <span> (see below), but not directly from a <td>. Any help? Thanks
<TABLE>
<TR>
<TD></TD>
<TD>Rate</TD>
<TD>2.595</TD>
</TR>
</TABLE>
For a <span> I use this code:
raRetrieved = browser.span(:name => 'myForm.raNumber').text

Try this: find the row you want using a regular expression to match a row that contains the word 'Rate', then get the text of the third cell in the row.
myrate = browser.tr(:text, /Rate/).td(:index => 2).text
#or you can use the more user-friendly aliases for those tags
myrate = browser.row(:text, /Rate/).cell(:index => 2).text
If the word 'Rate' might appear elsewhere in other text in that table, but is always the only entry in the second cell of the row you want, then find the cell with that exact text, use the parent method to locate the row that holds that cell, and then get the text from the third cell.
myrate = browser.cell(:text, 'Rate').parent.cell(:index => 2).text
Use of .cell & .row vs .td & .tr is up to you; some people prefer the tags, others like the more descriptive names. Use whatever you feel makes the code the most readable for you or others who will work with it.
Note: the code above presumes use of Watir-Webdriver or Watir 2.x, which both use zero-based indexing. For older versions of Watir, change the index values to 3.
And for the record, I totally agree with the comments of others about the lack of testability of the code sample you posted. It's horrid. Asking for something to locate the proper elements, such as ID values or names, is not out of line in terms of making the page easier to test.

Try this:
browser.td(how, what).text
The problem here is that table, tr and td tags do not have any attributes. You can try something like this (not tested):
browser.table[0][2].text

In case this helps anyone who is having the same issue, it is working for me like this:
browser.td(:text => "Rate").parent.cell(:index, 2).text
Thank you all
