XPath: how to extract one td at a time from a tbody in HTML using HTML Agility Pack

I am trying to parse the table at the URL below (Google Finance):
http://www.google.com/finance/historical?q=BOM:533278
I am trying to extract only the values in the Close column. But when I try the XPath
hd.DocumentNode.SelectSingleNode("//td[@class='rgt']")
I get all the nodes that have a class attribute with the value rgt concatenated into a single Node.InnerText. I need the values one by one, not all at the same time. I must be doing something silly here. Thank you.
The actual XPath found using Firebug is as follows:
/html/body/div/div/div[3]/div[2]/div/div[2]/div[2]/div/form/div[2]/table/tbody/tr[2]/td[5]
But somehow, after the form tag, HtmlAgilityPack returns a null node. I never thought this would take so long to implement.

If you're using Firebug or any Firefox extension (like XPather) to obtain the XPath of the elements you need to parse, you might need to remove the tbody tags from the XPath.
Take a look at the following answer here on SO: Why does firebug add <tbody> to <table>?
If you're using HtmlAgilityPack, the XPath returned by Firebug or by any other Firefox-related tool may differ, because the HTML source you're parsing can be different from the HTML source Firefox sees.
Sometimes it can be useful to open the same page in Internet Explorer 8 and, using its Developer Tools (F12), do the same thing you're doing with Firebug; alternatively, use another tool like HAP Explorer, which can be downloaded from the HtmlAgilityPack page.
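For example, assuming the rest of the browser-generated path matches the raw source, the XPath from the question with the tbody step removed would be:
/html/body/div/div/div[3]/div[2]/div/div[2]/div[2]/div/form/div[2]/table/tr[2]/td[5]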

There are many ways to do it. Here is one solution, which is based on the Date td (the one with the 'lm' class):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// ... load the doc ...
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='lm']/../td[5]"))
{
    Console.WriteLine("node=" + node.InnerText);
}

The XPath for the first cell in the Close column is //div[@id='prices']/table/tbody/tr[2]/td[5]; for the second one it's //div[@id='prices']/table/tbody/tr[3]/td[5], and so on.

Related

Why is scrapy shell returning an empty list when my XPath selector works as it should in the “Elements” tab of my Chrome browser?

The XPath selector in Scrapy shell, response.xpath('//div[@class="chr-lot-header__bid-details"]//span[@class="chr-lot-header__value-field"]'), returns an empty list, while the same XPath selector selects the right HTML tag in the "Elements" tab of my Chrome browser.
Here's the website the XPath selector is intended for:
https://www.christies.com/en/lot/lot-5973059
The output I want the XPath selector to produce is "GBP 11,282,500".
I just checked the website you mentioned: the required information is loaded dynamically, which means it cannot be scraped directly. Scrapy only scrapes statically available data, not dynamically loaded data. To scrape dynamically loaded data, you can either mimic a real browser, e.g. by using Selenium or Playwright and integrating their libraries into your Scrapy code, or you can look in the network tab for the API calls through which the required data is loaded and fetch it from there.
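To illustrate the browser route, here is a minimal sketch using the scrapy-playwright plugin; the settings keys are the ones that plugin documents, the URL and XPath come from the question, and the spider scaffolding names are placeholders:

# settings.py: route requests through Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# spider
import scrapy

class LotSpider(scrapy.Spider):
    name = "lot"

    def start_requests(self):
        # meta={"playwright": True} renders the page in a real browser,
        # so the dynamically loaded bid details end up in the response.
        yield scrapy.Request(
            "https://www.christies.com/en/lot/lot-5973059",
            meta={"playwright": True},
        )

    def parse(self, response):
        yield {
            "bid": response.xpath(
                '//div[@class="chr-lot-header__bid-details"]'
                '//span[@class="chr-lot-header__value-field"]/text()'
            ).get(),
        }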

JSoup Select Tag Recursive Search

I recently tried to work with JSoup to parse HTML documents. I went through the tutorial on JSoup and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document which possess a certain class. To test this, I tried it with the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used, and then I tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately I can't get the element I am looking for; deals always has a size of 0. I tried to modify my select with only part of the classes, I added the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element which is higher up in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.
No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect that they might not be part of the DOM as it is loaded by JSoup. The reason might be that the inner DOM nodes are filled by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either look into the AJAX calls directly and analyze them, or you can move on to another solution like Selenium WebDriver, which runs a real browser including a working JavaScript engine.
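As a rough sketch of the Selenium route in Python (the same approach exists in Selenium's Java bindings); the selector is copied from the question and the class names may well have changed on Amazon's side:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
# Shortened here; use the full deals URL from the question.
driver.get("https://www.amazon.de/gp/angebote")
# The browser executes the JavaScript that fills in the deal elements,
# so the dynamically created spans are present in the DOM we query.
deals = driver.find_elements(
    By.CSS_SELECTOR,
    "span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight",
)
for deal in deals:
    print("Found deal:", deal.text)
driver.quit()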

Save generated HTML using Canopy

Can a website's generated HTML be saved using Canopy? Looking at the documentation under 'Getting Started', I could not find anything related.
You can run arbitrary JavaScript using js; document.documentElement.outerHTML will return the current DOM, so
let html = js "return document.documentElement.outerHTML" |> string
does the trick.
Canopy is a wrapper around Selenium that provides some useful helper functions, but it also provides access to the Selenium IWebElement instances in case you need them, via the element function (halfway down the page; there don't seem to be internal anchors on that page, so I couldn't link directly to the function). Once you have the IWebElement object, your problem becomes similar to this one, where the answer seems to be elem.getAttribute("innerHtml"), where elem is the element whose content you want (which might even be the html element). Note that the innerHtml attribute is not a standard DOM attribute, so this won't work with all Selenium drivers; it will depend on which browser you're running in. But it apparently works in all major web browsers.
See Get HTML Source of WebElement in Selenium WebDriver using Python for a related question using Python, which has more discussion about whether the innerHtml attribute will work in all browsers. If it doesn't, Canopy also has the js function, which you could leverage to run some JavaScript to get the HTML you're looking for -- but if you're having trouble with that, you probably need to ask a JavaScript question rather than an F# question.
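For reference, the Python version from that linked question looks roughly like this (a sketch; how innerHTML/outerHTML resolve depends on the browser driver):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.org/")
# outerHTML on the root element serializes the whole generated DOM;
# these attributes are resolved by the driver, not defined by the DOM
# standard, so behaviour can differ between browsers.
elem = driver.find_element(By.TAG_NAME, "html")
html = elem.get_attribute("outerHTML")
driver.quit()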

Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

This is meant to provide a canonical Q&A for all the similar questions (much too specific to be close-target candidates) popping up once or twice a week.
I'm developing an application that needs to parse a website with tables in it. As deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.
Example input looks like this:
<!-- snip -->
<table id="example">
  <tr>
    <th>Example Cell</th>
    <th>Another one</th>
  </tr>
  <tr>
    <td>foobar</td>
    <td>42</td>
  </tr>
</table>
<!-- snip -->
I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression
//table[@id="example"]/tbody/tr[2]/td[1]
which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut the query down to //table[@id], it works again.
What's going wrong?
The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tool, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are enclosed in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says:
The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.
There is an in-depth explanation in another answer on stackoverflow.
On the other hand, HTML does not necessarily require that tag to be used:
The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.
Most XPath Processors Work on Raw XML
Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. HTML parser libraries like tag-soup and htmltidy also only output XHTML, not "DOM-HTML".
This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser) -- or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), while Firebug will always show them.
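You can also reproduce it programmatically. A minimal sketch with Python's lxml, whose HTML parser (like most raw-source processors) does not insert <tbody/>:

from lxml import html

# Raw HTML as served: no <tbody> in the source.
source = """
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
"""
tree = html.fromstring(source)

# The browser-derived XPath finds nothing against the raw source...
print(tree.xpath('//table[@id="example"]/tbody/tr[2]/td[1]/text()'))  # []
# ...while the same query without the tbody step succeeds.
print(tree.xpath('//table[@id="example"]/tr[2]/td[1]/text()'))  # ['foobar']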
Solution 1: Remove /tbody Axis Step
Check if the table you're stuck at really does not contain a <tbody/> element (see last paragraph). If it does, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution which is likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
Replace the /tbody axis step by a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
Solution 3: Allow Both Input With and Without <tbody/> Tags
If you're not sure in advance whether your table will have a <tbody/> tag, or you want to use the query in both "HTML source" and DOM contexts, and you don't want to or cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
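A quick check, again sketched with lxml, that the XPath 1.0 union matches whether or not <tbody/> is present:

from lxml import html

query = ('//table[@id="example"]/tr[2]/td[1]'
         ' | //table[@id="example"]/tbody/tr[2]/td[1]')

raw = '<table id="example"><tr><th>h</th></tr><tr><td>foobar</td></tr></table>'
dom = ('<table id="example"><tbody><tr><th>h</th></tr>'
       '<tr><td>foobar</td></tr></tbody></table>')

for source in (raw, dom):
    # The same query finds the cell in both variants.
    print(html.fromstring(source).xpath(query)[0].text)  # foobar, twice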
Just came across the same problem. I almost wrote a recursive function to check for every tbody tag whether it exists and traverse the DOM that way; then I remembered I know regex. :)
Before parsing, get the HTML as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.
Jens Erat gives a good explanation, but here is
Solution 4: Make sure the HTML source always has the <tbody> tags with regex
JavaScript
var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
// Strings are immutable, so keep the result of the replacements.
html = html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g, "$1<tbody>")
           .replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g, "$1</tbody>$4");
PHP
$html = $dom->saveHTML();
$html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
$dom->loadHTML($html);
Just the regex:
The first pattern matches a `<table>` tag (with whatever else is inside the tag) plus anything between it and the next tag, provided the next tag is NOT `<tbody>` (again allowing anything inside that tag):
/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/
replace with
$1<tbody>
where $1 references the captured `<table>` tag with its contents.
Do the same for the closing tag like this:
/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/
replace with
$1</tbody>$4
This way the DOM will ALWAYS have the <tbody> tags where necessary.

How to create an XPath for an HTML DOM element?

How can I create an XPath for an HTML DOM element,
for example, "/HTML/BODY/DIV[1]/TABLE[1]/TR[2]/TD[1]/INPUT"?
Given a DOM element, how can I get this XPath string?
Any ideas?
Thanks,
Dattebayo.
You can create a new DOMDocument and then import the node element:
$DD = new DOMDocument('1.0', 'utf-8');
$DD->loadXML("<html></html>");
$DD->documentElement->appendChild($DD->importNode($DE, true));
Then you can use XPath inside the DOM element:
$xpath = new DOMXPath($DD);
As I recall, the XPath Checker extension for Firefox gives you a point-and-click interface for getting the XPath of DOM elements in an HTML document.
After a lot of struggle I found a way to do so.
Along with the DOM path, also use the sourceIndex of each node, like "/Html:1/Body:2/Div:5/Input:6".
But again:
1. This might not work in the case of a dynamic page (AJAX modifying the content).
2. This might not be unique across browsers, since the sourceIndex might vary depending on how the browser's rendering engine arranges the nodes (not sure of this yet though, just a thought).
In Mozilla, an XPath generator component was implemented, although it never made it into default builds.
You can find the tests in the "final patch" attached to the bug I linked to; they show how it can be used. You can also look at its implementation, which might be helpful.
Here is a Chrome extension that might help you (ChromyQlip):
https://chrome.google.com/extensions/detail/bkmllkjbfbeephbldeflbnpclgfbjfmn