R-Advanced Web Scraping-bypassing aspNetHidden using xmlTreeParse() - html

This question takes a bit of time to introduce, so bear with me. It will be fun to solve if you can get there. This scrape would be replicated over thousands of pages on this website using a loop.
I'm trying to scrape the website http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ to capture the data in the table with Digi-Key Part Number, Quantity Available, etc., including the right-hand side with Price Break, Unit Price, and Extended Price.
The R function readHTMLTable() doesn't work and only returns NULL values. I believe this is because the website has hidden its content using the tag "aspNetHidden" in the HTML code.
For the same reason I had difficulty with htmlTreeParse() and xmlTreeParse(): the whole section containing the table does not appear in the results.
Using the R function scrape() from the scrapeR package
require(scrapeR)
URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
does return the full html code including the lines of interest:
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
However, I haven't been able to select nodes from this block of code. The error returned is:
no applicable method for 'xpathApply' applied to an object of class "list"
I've received that error using different functions such as:
xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')
getNodeSet(URL,"//html[@class='rd-product-details-page']")
I'm not very familiar with XPath; I have been identifying paths using Inspect Element on the webpage and then Copy XPath.
Any help you can give on this would be much appreciated!

You've not read the help for scrape have you? It returns a list, you need to get parts of that list (if parse=TRUE) and so on.
Also I think that web page is doing some heavy heavy browser detection. If I try to wget the page from the command line I get an error page, the scrape function gets something usable (though it seems different from what you got), and Chrome gets the full junk with all the encoded stuff. Yuck. Here's what works for me:
> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
<tr class="product-details-top"/>
<tr class="product-details-bottom">
<td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
</tr>
<tr>
<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
<td class="catalog-pricing" rowspan="6" align="center" valign="top">
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
Adjust to your use case. Here I'm getting all the tables and showing the second one, which has the info you want. Some of it is in the pricing table, which you can get directly with:
> pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]
> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr>
<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>
</tr>
and so on.

Related

How to write HTML and CSS code in Node js to Send Emails to Users

I want to send emails using node-cron and want to style the message. Basically, I want to send the user a list of things, as shown below. I wanted to write something like this:
<table id="schedule">
<tr>
<th>Centre Name</th>
<th>Vaccine</th>
<th>Address</th>
<th>PinCode</th>
<th>Date</th>
<th>Fee Type</th>
</tr>
</table>
I also wanted to style these elements of the table. When I tried to write CSS code in Node.js it showed an error.
How can I do that in Node.js?
return `
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="center" valign="top" bgcolor="#f6f6f8" style="background-color:#eeeeee;">
Some awesome things here
</td>
</tr>
</table>
`
Like this
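To expand on the template-literal idea above, here is a minimal sketch that builds the schedule table from data and puts the CSS inline in style attributes, which tends to be safer for email because some mail clients ignore or strip separate style blocks. The helper names (escapeHtml, buildScheduleHtml) and the field names on each centre object are assumptions made for illustration, not part of the original question.

```javascript
// Hypothetical helper: builds an HTML email body with inline styles.
// Field names (name, vaccine, address, pincode, date, feeType) are assumptions.
const CELL_STYLE = 'border:1px solid #cccccc;padding:6px;';
const HEAD_STYLE = CELL_STYLE + 'background-color:#eeeeee;text-align:left;';

// Escape user-supplied text so it cannot break the markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function buildScheduleHtml(centres) {
  const headers = ['Centre Name', 'Vaccine', 'Address', 'PinCode', 'Date', 'Fee Type'];
  const headRow = headers
    .map((h) => `<th style="${HEAD_STYLE}">${h}</th>`)
    .join('');
  const bodyRows = centres
    .map((c) => {
      const cells = [c.name, c.vaccine, c.address, c.pincode, c.date, c.feeType]
        .map((v) => `<td style="${CELL_STYLE}">${escapeHtml(v)}</td>`)
        .join('');
      return `<tr>${cells}</tr>`;
    })
    .join('\n');
  return `<table id="schedule" cellspacing="0" cellpadding="0" style="border-collapse:collapse;">
<tr>${headRow}</tr>
${bodyRows}
</table>`;
}

// Example call with made-up data.
const body = buildScheduleHtml([
  { name: 'City Clinic', vaccine: 'X', address: '1 Main St', pincode: '110001', date: '01-06-2021', feeType: 'Free' },
]);
console.log(body);
```

The returned string would then be passed as the HTML body to whichever mail library is in use (for example, the html field of the message object, assuming nodemailer), from inside the node-cron job.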

THEAD Not repeating after certain size

I am using this to create a quotation template for our online project management software (which prints it to a PDF):
<thead>
<tr>
<th colspan="2" align="right" style="width:146px" valign="top">
<img class="body_table img_logo" name="user:logo_url" src="/images/logo.jpg" />
</th>
</tr>
<tr class="address_details" style="font-weight:500;">
<th align="left"><span name="job:company">{{job.company}}</span></th>
<th align="right"><span name="user:company">{{user.company}}</span></th>
</tr>
<tr class="address_details" style="font-weight:400">
<th align="left"><span name="job:address">{{job.address}}</span></th>
<th align="right">
<span name="user:address">{{user.address}}</span>
<br/>
<span name="user:depot_email">{{user.depot_email}}</span>
<br/>
<span name="depot:telephone">{{depot.telephone}}</span>
</th>
</tr>
</thead>
So it works as intended, giving repeated branding on each printed page, but as soon as the last <span name="depot:telephone">{{depot.telephone}}</span> is added, it stops repeating on subsequent pages. It doesn't seem to be that one line, though: if I comment out another random bit, it starts working. So I assume it's a length thing.
I am too much of a newb to HTML to know what I broke, any ideas? Is there a max length that the THEAD can be?
The stuff in the "{{xxx}}" is what draws data from the online software.
The PDF generator options are "Webkit", which doesn't seem to do the repetitions right, or "Chromium", which does.

Xpath works in Chrome Dev Tools, but not RSelenium

I have been working on a project to pull an HTML table containing specific text ("Current Prison History:") from multiple URLs that change according to one's ID. I tried using a CSS selector, but because some pages have more tables than others, the selector changes from page to page. I therefore thought I could use XPath to get the table I am looking for based on its text contents. The HTML is below:
<table class="dcCSStableLight" border="1" cellspacing="0" cellpadding="1"
bordercolor="#ececd7">
<tbody>
<tr>
<td class="dark" align="left" colspan="8" bgcolor="#B0C4DE">
<b>Current Prison Sentence History:</b>
</td>
</tr>
<tr bgcolor="#B0C4DE">
<th><b>Offense Date</b>
</th>
<th><b>Offense</b>
</th>
<th><b>Sentence Date</b>
</th>
<th><b>County</b>
</th>
<th><b>Case No.</b>
</th>
<th><b>Prison Sentence Length</b>
</th>
</tr>
<tr valign="top" bgcolor="#FFFFFF">
<td>06/14/2015</td>
<td>BURG/DWELL/OCCUP.CONVEY</td>
<td>08/04/2016</td><td>ST. JOHNS</td>
<td>1501553</td>
<td nowrap="">5Y 0M 0D </td>
</tr>
</tbody>
</table>
I came up with the following xpath to pull the table
//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),"Current")]/ancestor::table
When I check the XPath with Chrome Developer Tools it returns the table that I want; however, in my RSelenium code it returns an empty list.
for (i in 1:2) {
  remDR$navigate(URLs[i])
  remDR$screenshot(display = TRUE)
  remDR$setImplicitWaitTimeout(10000)
  CPSHList[[i]] <- remDR$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes(xpath = "//*[@id='dcCSScontentContainer']/div/table/tbody/tr/td/b[contains(text(),'Current')]/ancestor::table") %>%
    html_table() %>%
    data.frame(stringsAsFactors = FALSE)
}
You should try to find the table that contains a b that has this text.
//table[.//b[contains(text(), 'Current')]]

Accessible Table with Sub Headings / Category Separation

EDIT: To the person who tagged this as having nothing to do with ADA. This question has everything to do with ADA. I have tons of websites with tables formatted like that which I am trying to figure out how to make them understandable to someone using a screen reader.
Hello I am trying to figure out a way to make a table which has subheadings / separator rows to announce the proper headings when being read by a screen reader.
The first table works as I would like, announcing the rowgroup's TH and then the column heading. However the second table doesn't announce as I was hoping. For example, Jill announces "Field Techs, Name, Jill" Instead of "Office, Name, Jill" as I had expected.
I've tried scope="col" and scope="colgroup" but neither helped. Is this even possible, or is it just a badly structured table?
Thank you for reading and any help/advice you may offer!
table thead, table th { background:#d3d3d3; }
table { margin-bottom:40px; }
<!-- This table's headings seem to work properly -->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<td> </td>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
</thead>
<tbody>
<tr>
<th id="office_row" scope="rowgroup" rowspan="2">Office</th>
<td headers="office_row name_col">Jill</td>
<td headers="office_row position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_row name_col">Robert</td>
<td headers="office_row position_col">Project Manager</td>
</tr>
<tr>
<th id="field_row" scope="rowgroup" rowspan="2">Field Techs</th>
<td headers="field_row name_col">Jason</td>
<td headers="field_row position_col">Tech</td>
</tr>
<tr>
<td headers="field_row name_col">Mike</td>
<td headers="field_row position_col">Tech</td>
</tr>
</tbody>
</table>
<!-- This table's headings don't announce correctly. Jill announces "Field Techs, Name, Jill"-->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
<tr>
<th id="office_group" colspan="2">Office</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="office_group name_col">Jill</td>
<td headers="office_group position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_group name_col">Robert</td>
<td headers="office_group position_col">Project Manager</td>
</tr>
</tbody>
<thead>
<tr>
<th id="field_group" colspan="2">Field Techs</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="field_group name_col">Jason</td>
<td headers="field_group position_col">Tech</td>
</tr>
<tr>
<td headers="field_group name_col">Mike</td>
<td headers="field_group position_col">Tech</td>
</tr>
</tbody>
</table>
A table can have at most one thead element (see the documentation):
Permitted contents: an optional caption element, followed by zero or more colgroup elements, followed by an optional thead element
With multiple thead elements, only the last one is considered by your browser and your screen reader. You can use ARIA attributes and roles to handle multiple separated heading lines (for instance, using the aria-labelledby attribute to specify the heading).
One example from WCAG:
ARIA9: Using aria-labelledby to concatenate a label from several text nodes
You are using both the scope method and the headers/id method in one table, which will create problems. Also, as others have pointed out, you're using multiple <thead> and <tbody> elements, which isn't good either.
I've prepared some code samples here on how to correctly code this table using both the scope method and the headers/id method:
https://jsfiddle.net/oody1b8x/
It's worth noting that <thead> and <tbody> are not accessibility-related elements, even though they appear to be. These are essentially only used when printing. They let the printer know that the header rows can be repeated on the next page if the table requires pagination.
Also -- don't use ARIA for this purpose; it will only create more problems. The native HTML semantics are perfectly capable of communicating how this data is structured.

How to extract only the 1st table tag from a html page having various nested table tag

I have the following html page. I want to extract data only within the 1st table tag in C#. the html page code is:
<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tbody>
<tr>
<td align=right><b>11/09/2013 at 09:48</b></td>
</tr>
</tbody>
</table>
<center>
<table border="1" bordercolor="silver" cellpadding="2" cellspacing="0" width="100%">
<thead>
<tr>
<th width=100>ETA</th>
<th width=100>Ship Name</th>
<th width=80>From port</th>
<th width=80>To berth</th>
<th width=130>Agent</th>
</tr>
</thead>
<tbody>
<tr><td>11/09/2013 at 09:00 </td>
<td>SONANGOL KALANDULA </td>
<td>Cabinda </td>
<td>Valero 6 </td>
<td>Graypen </td>
</tr>
</tbody>
</table>
To be more specific, I want to extract only the row with the date 11/09/2013 at 09:48, which is under the first table tag. I am using this regex:
"<table[^>]*>([^<]*(?:(?!</table)<[^<]*)*)[</table>]*"
But with this I am getting the whole page source; that is, I get the data between all the table tags, when I want only the text inside the first table tag.
Can anyone suggest a regular expression that extracts only this particular portion from the whole HTML page?
When trying out your version here, it seems to work for me on the input you specified, though [</table>]* should really be just </table> ([</table>]* means any number of characters from the set <, /, t, a, b, l, e, >).
This seems like it would bear simplification, though. This should also work:
<table[^>]*>.*?</table>
All bets are off if you have nested tables, of course.
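As an illustration of the simplified pattern, here is a small sketch in JavaScript rather than the question's C#. The function name firstTable and the sample page string are made up for the example. Because the table spans several lines, the dot has to be allowed to match newlines: the s flag does that here, and in .NET the corresponding setting would be RegexOptions.Singleline.

```javascript
// Hypothetical helper: returns the first <table>...</table> block, or null.
// The "s" flag lets "." match newlines; "i" ignores tag case.
function firstTable(html) {
  const match = html.match(/<table[^>]*>.*?<\/table>/is);
  return match ? match[0] : null;
}

// Cut-down version of the page from the question.
const page = `<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tr><td align=right><b>11/09/2013 at 09:48</b></td></tr>
</table>
<center>
<table border="1">
<tr><td>SONANGOL KALANDULA</td></tr>
</table>`;

const first = firstTable(page);
console.log(first.includes('09:48'));     // true
console.log(first.includes('SONANGOL'));  // false
```

The lazy quantifier stops at the first closing tag it finds, which is why a table nested inside the first one would cut the match short.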