Xpath works in Chrome Dev Tools, but not RSelenium - html

I have been working on a project to pull an HTML table that contains specific text ("Current Prison Sentence History:") from multiple URLs that vary by a person's ID. I first tried a CSS selector, but because some pages have more tables than others, the selector changes from page to page. I therefore thought I could use XPath to select the table based on its text contents. The HTML is below:
<table class="dcCSStableLight" border="1" cellspacing="0" cellpadding="1"
bordercolor="#ececd7">
<tbody>
<tr>
<td class="dark" align="left" colspan="8" bgcolor="#B0C4DE">
<b>Current Prison Sentence History:</b>
</td>
</tr>
<tr bgcolor="#B0C4DE">
<th><b>Offense Date</b>
</th>
<th><b>Offense</b>
</th>
<th><b>Sentence Date</b>
</th>
<th><b>County</b>
</th>
<th><b>Case No.</b>
</th>
<th><b>Prison Sentence Length</b>
</th>
</tr>
<tr valign="top" bgcolor="#FFFFFF">
<td>06/14/2015</td>
<td>BURG/DWELL/OCCUP.CONVEY</td>
<td>08/04/2016</td><td>ST. JOHNS</td>
<td>1501553</td>
<td nowrap="">5Y 0M 0D </td>
</tr>
</tbody>
</table>
I came up with the following xpath to pull the table
//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),"Current")]/ancestor::table
When I check the xpath with Chrome Developer tools it returns the table that I want, however in my R Selenium code, it returns an empty list.
for(i in 1:2){
remDR$navigate(URLs[i])
remDR$screenshot(display=TRUE)
remDR$setImplicitWaitTimeout(10000)
CPSHList[[i]] <- remDR$getPageSource()[[1]] %>%
read_html()%>%
html_nodes(xpath = "//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),'Current')]/ancestor::table")%>%
html_table()%>%
data.frame(stringsAsFactors = FALSE)
}

Instead of spelling out the full path from the container, try to find the table that contains a b element with this text:
//table[.//b[contains(text(), 'Current')]]
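As a rough illustration of what that shorter expression selects, here is a minimal sketch in Python (standard library only, used purely for illustration; the question itself is about R). `xml.etree.ElementTree` has no `contains()`, so the text test is written as a loop. The sample markup and the helper name `tables_containing` are assumptions made for this example, not the real page or an existing API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page: two tables, only one of
# which has a <b> whose text contains "Current".
SAMPLE = """
<div id="dcCSScontentContainer">
  <table><tr><td><b>Aliases:</b></td></tr></table>
  <div>
    <table>
      <tr><td><b>Current Prison Sentence History:</b></td></tr>
      <tr><td>06/14/2015</td><td>ST. JOHNS</td></tr>
    </table>
  </div>
</div>
""".strip()


def tables_containing(root, needle):
    """Return every <table> with a descendant <b> whose text contains needle.

    Same idea as //table[.//b[contains(text(), needle)]]: nothing is assumed
    about how deeply the table is nested or whether a <tbody> is present.
    """
    return [
        table
        for table in root.iter("table")
        if any(needle in (b.text or "") for b in table.iter("b"))
    ]


root = ET.fromstring(SAMPLE)
matches = tables_containing(root, "Current")
# Cells of the second row of the matched table.
second_row_cells = [td.text for td in matches[0].findall("tr")[1]]
print(len(matches), second_row_cells)
```

Because the match depends only on the table and the text inside it, it keeps working when the number of tables or the wrapper elements differ between pages.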

Related

Make a table row span multiple columns using kable and kableExtra

I am trying to create an HTML table using R and the kable and kableExtra packages. I am having problems creating a row that spans several columns. I want to create a table where the last row contains the same values for all the columns without actually repeating this value. I've created a small example of what I am trying to do below.
library(kableExtra)
library(knitr)
summary_stats <- matrix(c(51,43,22,22),ncol=2,byrow=TRUE)
colnames(summary_stats) <- c("Mean","SD")
rownames(summary_stats) <- c("Age","Observations")
summary_stats
kable_table <- kable(summary_stats) %>%
kable_styling()
Instead of repeating the number 22 on the last row for the two columns, I'd like to center it between the two columns.
I am able to achieve what I want with the following HTML code using the colspan argument:
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> Mean </th>
<th style="text-align:right;"> SD </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Age </td>
<td style="text-align:right;"> 51 </td>
<td style="text-align:right;"> 43 </td>
</tr>
<tr>
<td style="text-align:left;"> Observations </td>
<td style="text-align:center;" colspan = "2"> 22 </td>
</tr>
</tbody>
</table>
Note that the HTML code is just the output of the kable_table object I created in R where I've manually edited the HTML code to include the colspan argument. I would like to do this programmatically within R instead of having to manually change the code.
I've tried to use the row_spec function from the kableExtra package to add the necessary code but I am limited by the fact that the add_css option (as expected) only accepts arguments related to styling. In other words, I cannot pass the colspan argument to the option.
My question is whether there is a reasonable way of adding the necessary HTML to the table after I've created it, or whether there is an option within the kable/kableExtra framework that allows me to do this and that I've missed.

Accessible Table with Sub Headings / Category Separation

EDIT: To the person who tagged this as having nothing to do with ADA. This question has everything to do with ADA. I have tons of websites with tables formatted like that which I am trying to figure out how to make them understandable to someone using a screen reader.
Hello, I am trying to figure out a way to make a table with subheadings / separator rows announce the proper headings when read by a screen reader.
The first table works as I would like, announcing the rowgroup's TH and then the column heading. However, the second table doesn't announce as I was hoping. For example, Jill announces "Field Techs, Name, Jill" instead of "Office, Name, Jill" as I had expected.
I've tried scope="col" and scope="colgroup" but neither helped. Is this even possible, or is it just a badly structured table?
Thank you for reading and any help/advice you may offer!
table thead, table th { background:#d3d3d3; }
table { margin-bottom:40px; }
<!-- This table's headings seem to work properly -->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<td> </td>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
</thead>
<tbody>
<tr>
<th id="office_row" scope="rowgroup" rowspan="2">Office</th>
<td headers="office_row name_col">Jill</td>
<td headers="office_row position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_row name_col">Robert</td>
<td headers="office_row position_col">Project Manager</td>
</tr>
<tr>
<th id="field_row" scope="rowgroup" rowspan="2">Field Techs</th>
<td headers="field_row name_col">Jason</td>
<td headers="field_row position_col">Tech</td>
</tr>
<tr>
<td headers="field_row name_col">Mike</td>
<td headers="field_row position_col">Tech</td>
</tr>
</tbody>
</table>
<!-- This table's headings don't announce correctly. Jill announces "Field Techs, Name, Jill"-->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
<tr>
<th id="office_group" colspan="2">Office</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="office_group name_col">Jill</td>
<td headers="office_group position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_group name_col">Robert</td>
<td headers="office_group position_col">Project Manager</td>
</tr>
</tbody>
<thead>
<tr>
<th id="field_group" colspan="2">Field Techs</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="field_group name_col">Jason</td>
<td headers="field_group position_col">Tech</td>
</tr>
<tr>
<td headers="field_group name_col">Mike</td>
<td headers="field_group position_col">Tech</td>
</tr>
</tbody>
</table>
A table can have only zero or one thead element (see documentation).
Permitted contents : An optional caption element, followed by zero or more colgroup elements, followed by an optional thead element
If you have multiple thead elements, only the last one is considered by your browser and your screen reader. You can use ARIA attributes and roles to handle multiple separate heading rows (for instance, using the aria-labelledby attribute to specify the heading).
One example from WCAG:
ARIA9: Using aria-labelledby to concatenate a label from several text nodes
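A quick way to spot this problem across many pages is to count the thead elements per table. The following is only a sketch in Python (standard library, used for illustration); the class `TheadCounter`, the helper `thead_counts` and the sample strings are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TheadCounter(HTMLParser):
    """Count <thead> elements per <table>, tracking nesting with a stack."""

    def __init__(self):
        super().__init__()
        self._stack = []   # one running counter per currently open <table>
        self.counts = []   # final <thead> count of each table, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append(0)
        elif tag == "thead" and self._stack:
            self._stack[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.counts.append(self._stack.pop())


def thead_counts(html):
    """Return the number of <thead> elements found in each <table>."""
    parser = TheadCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical inputs: one table with a single thead, one with two.
VALID = "<table><thead><tr><th>Name</th></tr></thead><tbody><tr><td>Jill</td></tr></tbody></table>"
INVALID = (
    "<table><thead><tr><th>Office</th></tr></thead><tbody></tbody>"
    "<thead><tr><th>Field Techs</th></tr></thead><tbody></tbody></table>"
)
print(thead_counts(VALID), thead_counts(INVALID))
```

Any table whose count is greater than 1 has the structure described above and is a candidate for restructuring.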
You are using both the scope method and the headers/id method in one table, which will create problems. Also, as others have pointed out, you're using multiple <thead> and <tbody> elements, which isn't good either.
I've prepared some code samples here showing how to code this table correctly using either the scope method or the headers/id method:
https://jsfiddle.net/oody1b8x/
It's worth noting that <thead> and <tbody> are not accessibility-related elements, even though they appear to be. They are essentially only used when printing: they let the printer know that the header rows can be repeated on the next page if the table requires pagination.
Also -- don't use ARIA for this purpose; it will only create more problems. The native HTML semantics are perfectly capable of communicating how this data is structured.

HTML Text after table

I have been trying to place text after a table in HTML, but for some reason I cannot get this to work. I have tried using a div, padding and margin on the table, but nothing seems to work. No matter what I do, the text always ends up beside the first row of the table unless I use <br>.
Here is my HTML:
<div>
<h2 align=left>1. Delivery schedule</h2>
<body> The table below list the various delivery cycles per store:</body>
<br>
<br/>
<p>
<table border="1" align="left" width="61%" height="100px" frame="border">
<tr>
<th height="30" bgcolor="#002387">Store name</th>
<th height="30" bgcolor="#002387">Order deadline</th>
<th height="30" bgcolor="#002387">Delivery lead time from approval date</th>
</tr>
<tr>
<td colspan="3" bgcolor="#002387" ><font color="#FFFFFF"> Cycle 1</font></td>
</tr>
<tr>
<td>Borehamwood</td>
<td>Friday 1st August 2014 by midday</td>
<td>2-4 working days</td>
</tr>
<tr>
<td>Hemel</td>
<td>Friday 1st August 2014 by midday</td>
<td>2-4 working days</td>
</tr>
</table>
</div>
<importantLink>Please note that the advertised 2-4 working days delivery lead time is conditional of the orders being approved by the regional operation managers by the end of order deadline day.</importantLink>
Your code is bleeding from many wounds. First of all, you should forget about the align attribute, and use a CSS class instead.
.align-left {
text-align: left;
}
<h2 class="align-left">1. Delivery schedule</h2>
Then, you have an unclosed <p> tag right before your table, which could be causing your problem. Invalid markup can lead to unexpected results. And finally, importantLink is not a standard HTML element and, depending on your <!DOCTYPE>, is likely not valid (you have a doctype, right?). Use a standard element instead: an <a> tag if it really is a link, and if you want to be able to tell it apart from the rest, use a class or id attribute to give it a reusable or unique name, respectively. In your case, the text in that tag is nothing like a link, so I suppose a <p> tag is the most suitable.
<p class="importantLink">Please note that the advertised 2-4 working days delivery lead time is conditional of the orders being approved by the regional operation managers by the end of order deadline day.</p>
HTML elements usually flow one after another unless you give an element a property such as float:left or float:right or, in your case, the align="left" attribute. Elements after the table are then not placed under it; they are positioned to its right, starting from the top.
If you want that text to appear after (under) the table, remove align="left" from the table.
Also, when writing HTML, make sure the opening and closing tags match and that your content is inside the body tag. Here is the correction for that:
<html> <!--<div>-->
<body>
<h2 align=left>1. Delivery schedule</h2>
<p> The table below list the various delivery cycles per store:</p>
<br>
<br/>
<div><!--<p>-->
<table border="1" align="left" width="61%" height="100px" frame="border">
<tr>
<th height="30" bgcolor="#002387">Store name</th>
<th height="30" bgcolor="#002387">Order deadline</th>
<th height="30" bgcolor="#002387">Delivery lead time from approval date</th>
</tr>
<tr>
<td colspan="3" bgcolor="#002387" ><font color="#FFFFFF"> Cycle 1</font></td>
</tr>
<tr>
<td>Borehamwood</td>
<td>Friday 1st August 2014 by midday</td>
<td>2-4 working days</td>
</tr>
<tr>
<td>Hemel</td>
<td>Friday 1st August 2014 by midday</td>
<td>2-4 working days</td>
</tr>
</table>
</div>
<importantLink>Please note that the advertised 2-4 working days delivery lead time is conditional of the orders being approved by the regional operation managers by the end of order deadline day.</importantLink>
</body>
</html>
This line, <table border="1" align="left" width="61%" height="100px" frame="border">,
is causing your issue. Either remove align="left" or change <importantLink> to <importantLink style="clear:both">.
If you want the text within <importantLink> to be displayed below the table, wrap it in a div tag as below:
<div style="clear:both">
<importantLink>Your text comes here....</importantLink>
</div>
Replace your importantLink with the three lines of code above.

R-Advanced Web Scraping-bypassing aspNetHidden using xmlTreeParse()

This question takes a bit of time to introduce, bear with me. It will be fun to solve if you can get there. This scrape would be replicated over thousands of pages on this website using a loop.
I'm trying to scrape the website http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ to capture the data in the table with Digi-Key Part Number, Quantity Available, etc., including the right-hand side with Price Break, Unit Price and Extended Price.
Using the R function readHTMLTable() doesn't work and only returns NULL values. The reason for this (I believe) is that the website has hidden its content using the tag "aspNetHidden" in the HTML code.
For this reason I also found difficulty using htmlTreeParse() and xmlTreeParse() with the whole section parented by not appearing in the results.
Using the R function scrape() from the scrapeR package
require(scrapeR)
URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
does return the full html code including the lines of interest:
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
However, I haven't been able to select the nodes out of this block of code with the error being returned:
no applicable method for 'xpathApply' applied to an object of class "list"
I've received that error using different functions such as:
xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')
getNodeSet(URL,"//html[@class='rd-product-details-page']")
I'm not the most familiar with xpath but have been identifying the xpath using inspect element on the webpage and copy xpath.
Any help you can give on this would be much appreciated!
You've not read the help for scrape have you? It returns a list, you need to get parts of that list (if parse=TRUE) and so on.
Also I think that web page is doing some heavy heavy browser detection. If I try and wget the page from the command line I get an error page, the scrape function gets something usable (but seems different to you) and Chrome gets the full junk with all the encoded stuff. Yuck. Here's what works for me:
> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
<tr class="product-details-top"/>
<tr class="product-details-bottom">
<td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
</tr>
<tr>
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
<td class="catalog-pricing" rowspan="6" align="center" valign="top">
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
Adjust to your use-case, here I'm getting all the tables and showing the second one, which has the info you want, some of it in the pricing table which you can get directly with:
pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]
> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
</tr>
and so on.
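In XPath, an attribute is tested with `@`, as in `[@id="pricing"]`. To show the same lookup end to end, including turning the rows into plain values, here is a minimal sketch in Python (standard library, for illustration only; the thread itself uses R's XML package). The `SAMPLE` markup is a trimmed, well-formed stand-in for the page, not the real response.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the product page (an assumption for this
# sketch): the pricing table is nested inside a cell of the outer table.
SAMPLE = """
<table class="product-details">
  <tr>
    <th align="right">Digi-Key Part Number</th>
    <td id="reportpartnumber">A25077-ND</td>
    <td class="catalog-pricing">
      <table id="pricing">
        <tr><th>Price Break</th><th>Unit Price</th><th>Extended Price
        </th></tr>
        <tr><td align="center">1</td><td align="right">2.75000</td><td align="right">2.75</td></tr>
      </table>
    </td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree supports a subset of XPath, including [@attribute='value'].
pricing = root.find(".//table[@id='pricing']")

# One list of stripped cell texts per row of the pricing table.
rows = [[(cell.text or "").strip() for cell in tr] for tr in pricing.findall("tr")]
part_number = root.find(".//td[@id='reportpartnumber']").text

print(part_number, rows)
```

Selecting by id rather than by position (such as "the second table") keeps the lookup stable if the surrounding layout changes.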

How to extract only the 1st table tag from a html page having various nested table tag

I have the following html page. I want to extract data only within the 1st table tag in C#. the html page code is:
<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tbody>
<tr>
<td align=right><b>11/09/2013 at 09:48</b></td>
</tr>
</tbody>
</table>
<center>
<table border="1" bordercolor="silver" cellpadding="2" cellspacing="0" width="100%">
<thead>
<tr>
<th width=100>ETA</th>
<th width=100>Ship Name</th>
<th width=80>From port</th>
<th width=80>To berth</th>
<th width=130>Agent</th>
</tr>
</thead>
<tbody>
<tr><td>11/09/2013 at 09:00 </td>
<td>SONANGOL KALANDULA </td>
<td>Cabinda </td>
<td>Valero 6 </td>
<td>Graypen </td>
</tr>
</tbody>
</table>
To be more specific, I want to extract only the row with the date 11/09/2013 at 09:48, which is inside the first table tag. I am using this regex:
"<table[^>]*>([^<]*(?:(?!</table)<[^<]*)*)[</table>]*"
With this, however, I get the whole page source, that is, the data between all the table tags, whereas I want only the text inside the first table tag.
Can anyone suggest a regular expression that extracts only this particular portion of the HTML page?
When I try out your version here, it seems to work for me on the input you specified, though [</table>]* should really be just </table> ([</table>]* means any number of characters from the set <, /, t, a, b, l, e, >).
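The difference between the bracketed form and the literal closing tag can be checked directly. This is a small sketch in Python (used for illustration; character classes behave the same way in .NET regexes), and the test string "battle<<//" is just an arbitrary example.

```python
import re

# A character class matches single characters from the set, not the
# literal string written between the brackets.
char_class = re.compile(r"[</table>]*")
literal = re.compile(r"</table>")

# Any mix of the characters < / t a b l e > is accepted by the class...
class_accepts_tag = char_class.fullmatch("</table>") is not None
class_accepts_scramble = char_class.fullmatch("battle<<//") is not None

# ...while the literal pattern only accepts the exact closing tag.
literal_accepts_tag = literal.fullmatch("</table>") is not None
literal_accepts_scramble = literal.fullmatch("battle<<//") is not None

print(class_accepts_tag, class_accepts_scramble,
      literal_accepts_tag, literal_accepts_scramble)
```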
This seems like it would bear simplification, though. This should also work (in C#, pass RegexOptions.Singleline so that . also matches the line breaks between tags):
<table[^>]*>.*?</table>
All bets are off if you have nested tables, of course.
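To make the nested-table caveat concrete, here is a sketch in Python (standard library, for illustration; the question is about C#). It runs the non-greedy pattern on a flat and a nested input, then shows a depth-counting alternative built on `html.parser`. The sample strings and the names `first_table_regex`, `FirstTableText` and `first_table_text` are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

FLAT = """<table><tr><td><b>11/09/2013 at 09:48</b></td></tr></table>
<center><table><tr><td>SONANGOL KALANDULA</td></tr></table></center>"""

NESTED = ("<table><tr><td><table><tr><td>inner</td></tr></table></td>"
          "<td>outer</td></tr></table>")

# DOTALL lets "." match newlines; "*?" stops at the first closing tag.
FIRST_TABLE = re.compile(r"<table[^>]*>.*?</table>", re.DOTALL)


def first_table_regex(html):
    """Return the first <table>...</table> span found by the regex, or None."""
    match = FIRST_TABLE.search(html)
    return match.group(0) if match else None


class FirstTableText(HTMLParser):
    """Collect the text of the first top-level <table>, nested tables included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <table> elements are currently open
        self.done = False   # set once the first top-level table has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def first_table_text(html):
    parser = FirstTableText()
    parser.feed(html)
    parser.close()
    return parser.parts


print(first_table_regex(FLAT))
print(first_table_regex(NESTED))   # cut off at the inner </table>
print(first_table_text(NESTED))
```

On the flat input the regex returns exactly the first table. On the nested input it stops at the inner closing tag and loses the rest of the outer table, which is where counting open and close tags (or using a real HTML parser) becomes necessary.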