
How do I create an XPath for an HTML DOM element?
For example: "/HTML/BODY/DIV[1]/TABLE[1]/TR[2]/TD[1]/INPUT".
Given a DOM element, how do I get this XPath string?
Any ideas?
Thanks,
Dattebayo.

You can create a new DOMDocument and then import the node element:
$DD = new DOMDocument('1.0', 'utf-8');
$DD->loadXML("<html></html>");
$DD->documentElement->appendChild($DD->importNode($DE, true));
Then you can use XPath inside the new document:
$xpath = new DOMXPath($DD);
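For example, you can then query the imported element (the path below is just an illustration; adjust it to your document). PHP's DOMNode also has a built-in getNodePath() method that returns an XPath-like string for a node, which is close to what the question asks for:
$nodes = $xpath->query('//div[1]/table[1]//input');
foreach ($nodes as $node) {
    echo $node->getNodePath(), "\n"; // e.g. /html/body/div[1]/table[1]//input's path
}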

As I recall, the XPath Checker extension for Firefox gives you a point-and-click interface for getting the XPath of DOM elements in an HTML document.

After a lot of struggle I found a way to do so.
Along with the DOM path, also use the sourceIndex of each node, like "/Html:1/Body:2/Div:5/Input:6".
But again:
1. This might not work for dynamic pages (where Ajax modifies the content).
2. This might not be unique across browsers, since the sourceIndex may vary depending on how each browser's rendering engine arranges the nodes (not sure of this yet though, just a thought).
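For reference, a plain-DOM sketch that builds a positional XPath like the one in the question by counting preceding siblings of the same tag, so it does not depend on the IE-only sourceIndex (note it always emits an index, e.g. /HTML[1]/BODY[1]/...):
// Walk up from the element, recording each tag name with its
// 1-based position among siblings of the same name.
function getXPath(el) {
    var parts = [];
    while (el && el.nodeType === 1) { // 1 = ELEMENT_NODE
        var index = 1;
        for (var sib = el.previousSibling; sib; sib = sib.previousSibling) {
            if (sib.nodeType === 1 && sib.nodeName === el.nodeName) index++;
        }
        parts.unshift(el.nodeName + '[' + index + ']');
        el = el.parentNode;
    }
    return '/' + parts.join('/');
}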

In Mozilla, an XPath generator component was implemented, although it never made it into default builds.
You can find the tests in the "final patch" attached to the bug I linked to, to see how it can be used. You can also look up its implementation; it might be helpful.

Here is a Chrome extension that might help you (ChromyQlip):
https://chrome.google.com/extensions/detail/bkmllkjbfbeephbldeflbnpclgfbjfmn

Related

How to use Chrome DevTools to find elements based on CSS class or ID?

Long-time automation developer here (just for context).
It's been bugging me for quite a while that the tools in Chrome DevTools used for finding elements just don't seem to work as I expect. Hopefully someone can point out what I'm doing wrong.
Looking at, say, the Sauce Labs page: https://saucelabs.com/blog/selenium-tips-finding-elements-by-their-inner-text-using-contains-a-css-pseudo-class
That page has divs and anchors, and indeed I can do find('a') or find('div').
But why do I have a problem using classes or IDs?
The find() method refers to window.find(), a non-standard API for the browser's built-in Find function. It does not find web elements the way Selenium or Capybara do, so it does not parse its input as a selector.
You find elements with selectors in Chrome DevTools using document.querySelector() or document.querySelectorAll(). There are no special methods in Chrome DevTools for this; however, it does provide the $() and $$() aliases (respectively) to save you time and keystrokes.
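For example, in the DevTools console (the .foo class and #bar id are just placeholders):
document.querySelectorAll('a');  // standard DOM API, works in page scripts too
document.querySelector('#bar');  // first element with id "bar"
$('.foo')   // console-only alias for document.querySelector('.foo')
$$('.foo')  // console-only alias for document.querySelectorAll('.foo')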
You can use jQuery code in the Chrome console. For example, if you want to find something with a class of "foo" you can write $('.foo'), and for an id of "bar" you write $('#bar').
You can read all about it here.
You can also just Google what you want, e.g. "jQuery how to find a div with id".

JSoup Select Tag Recursive Search

I recently tried to work with JSoup to parse HTML documents. I went through the tutorial on JSoup and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document that have a certain class. To test that, I tried this with the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used, and then I tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately I can't get the element I am looking for: deals always has a size of 0. I tried modifying my select with only part of the classes, I added the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element which is above this element in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.
No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect that they might not be part of the DOM as it is loaded by JSoup. The reason might be that the inner DOM nodes are filled by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either look into the AJAX calls directly and analyze them, or you can move to another solution like Selenium WebDriver, which runs a real browser including a working JavaScript engine.
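If you take the Selenium route, one common pattern is to let the browser render the page and then hand the final HTML back to JSoup, so you keep the familiar select() API. A minimal sketch, assuming ChromeDriver is installed and shortening both the deals URL and the selector from the question:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DealScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.amazon.de/gp/angebote");
            // In practice you may need an explicit wait here for the AJAX content.
            // The browser has executed the page's JavaScript, so the rendered
            // source contains the dynamically loaded deals.
            Document doc = Jsoup.parse(driver.getPageSource());
            for (Element deal : doc.select("span.dealTitleTwoLine")) {
                System.out.println("Found deal: " + deal.text());
            }
        } finally {
            driver.quit();
        }
    }
}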

Save generated HTML using Canopy

Can a website's generated HTML be saved using Canopy? Looking at the documentation under 'Getting Started', I could not find anything related.
You can run arbitrary JavaScript using js. document.documentElement.outerHTML will return the current DOM, so
let html = js "return document.documentElement.outerHTML" |> string
does the trick.
Canopy is a wrapper around Selenium that provides some useful helper functions. But it also provides access to the Selenium IWebElement instances in case you need them, via the element function (halfway down the page; there don't seem to be internal anchors on that page, so I couldn't link directly to the function). Once you have the IWebElement object, your problem becomes similar to this one, where the answer seems to be elem.getAttribute("innerHtml"), where elem is the element whose content you want (which might even be the html element). Note that innerHtml is not a standard DOM attribute, so this won't work with all Selenium drivers; it will depend on which browser you're running in. But it apparently works in all major web browsers.
See Get HTML Source of WebElement in Selenium WebDriver using Python for a related question using Python, which has more discussion about whether the innerHtml attribute will work in all browsers. If it doesn't, Canopy also has the js function, which you could leverage to run some JavaScript to get the HTML you're looking for -- but if you're having trouble with that, you probably need to ask a JavaScript question rather than an F# question.
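Putting the pieces together, a minimal F# sketch (the "html" selector and output file name are just examples; GetAttribute is the .NET Selenium spelling, and the attribute lookup is browser-dependent as noted above):
// Grab the rendered DOM via canopy's element helper and save it to disk.
let el = element "html"                   // returns a Selenium IWebElement
let html = el.GetAttribute("outerHTML")   // non-standard attribute; see caveats above
System.IO.File.WriteAllText("page.html", html)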

How to programmatically get a list of used css images from IE WebBrowser (IHTMLDocument2)

It is relatively straightforward to iterate through IHTMLStyleSheetsCollection, IHTMLStyleSheet, IHTMLStyleSheetRulesCollection etc. of IHTMLDocument2 to obtain a list of all styles in the current document.
Any ideas on how to get a list of only the styles used in the document? To be more precise, I am looking for a way to find out which images from the CSS files are being used in the document.
There is a program that claims to be able to do this (determine which CSS images are being used) if IE8/IE9 is installed.
Thanks
OK, I have found an answer to this:
Recent browser versions (Firefox 3.5, IE 8) have implemented a querySelector method that can be used to check whether a selector matches anything on a page.
See https://developer.mozilla.org/En/DOM/Document.querySelector for more info.
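A sketch of the idea as a page script (this assumes same-origin stylesheets, so the rules are readable, and only looks at plain style rules):
// Collect background images from rules whose selectors actually match something.
var usedImages = [];
for (var i = 0; i < document.styleSheets.length; i++) {
    var rules = document.styleSheets[i].cssRules || document.styleSheets[i].rules;
    if (!rules) continue;
    for (var j = 0; j < rules.length; j++) {
        var rule = rules[j];
        if (!rule.selectorText || !rule.style) continue; // skip @media, @import, etc.
        var img = rule.style.backgroundImage;
        if (img && img !== 'none') {
            try {
                if (document.querySelector(rule.selectorText)) {
                    usedImages.push(img); // e.g. url("sprite.png")
                }
            } catch (e) { /* selector not supported by querySelector */ }
        }
    }
}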

XPath: how to extract one td at a time from a tbody in HTML using HtmlAgilityPack

I am trying to parse the table from the Google Finance URL below:
http://www.google.com/finance/historical?q=BOM:533278
I am trying to extract only the values in the Close column. But when I try the XPath
hd.DocumentNode.SelectSingleNode("//td[@class='rgt']")
I get all the nodes with a class attribute of 'rgt' in a single Node.InnerText.
I need the values one by one, not all at the same time. I must be doing something silly here. Thank you.
The actual XPath found using Firebug is as follows:
/html/body/div/div/div[3]/div[2]/div/div[2]/div[2]/div/form/div[2]/table/tbody/tr[2]/td[5]
But somehow, after the form tag, HtmlAgilityPack is returning a null node. I never thought this would take so long to implement.
If you're using Firebug or any Firefox extension (like XPather) to obtain the XPath of the elements you need to parse, you might need to remove the tbody tags from the XPath.
Take a look at the following answer here on SO: Why does firebug add <tbody> to <table>?
If you're using HtmlAgilityPack, the XPath returned by Firebug or by any other Firefox-related tool may differ, because the HTML source you're parsing can be different from the HTML source in Firefox.
Sometimes it might be useful to open the same page in Internet Explorer 8 and do with Developer Tools (F12) the same thing you're doing with Firebug, or alternatively use another tool like HAP Explorer, which can be downloaded from the HtmlAgilityPack page.
There are many ways to do it. Here is one solution, which is based on the Data td (the one with the 'lm' class):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
... load the doc ...
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='lm']/../td[5]"))
{
    Console.WriteLine("node=" + node.InnerText);
}
The XPath for the first cell in the Close column is //div[@id='prices']/table/tbody/tr[2]/td[5], for the second it's //div[@id='prices']/table/tbody/tr[3]/td[5], and so on.
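Putting it together, a minimal HtmlAgilityPack sketch; note that tbody is dropped from the path here, since the raw source that HtmlAgilityPack parses typically has no tbody element (see the answer above), and the exact path may need adjusting against the live page:
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.google.com/finance/historical?q=BOM:533278");
// The 5th cell of every data row under the prices div is the Close column.
var cells = doc.DocumentNode.SelectNodes("//div[@id='prices']/table/tr/td[5]");
if (cells != null)
{
    foreach (var cell in cells)
    {
        Console.WriteLine(cell.InnerText.Trim());
    }
}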