HtmlUnit's DomElement can't save state of XPath result


When I search a big document, I sometimes need to save the result of a big XPath expression. But if I store the XPath result in a DomElement object and then run a new XPath query on just that DomElement, I get results based on the whole document. For example:
DomElement block = page.getByXPath("//div[@class='block_of_code']");
System.out.println(block.getByXPath("//span[@class='red']"));
So, the first line will fetch all divs on the page with class='block_of_code'. But when I try to print out all span elements from the block object, I get back all span elements that are on the page, not only those in that block.
Is there an alternative (preferably in the HtmlUnit package) for storing small chunks of HTML and running XPath queries on just that chunk, not the whole page?
Thanks!

An XPath expression that starts with a / character is absolute: it is always evaluated from the document root, even when you call it on a context node.
To make the query relative to the context node, start it with a . character.
Also note that getByXPath returns a List, so use getFirstByXPath to get a single DomElement. The following should achieve what you want:
DomElement block = page.getFirstByXPath("//div[@class='block_of_code']");
System.out.println(block.getByXPath(".//span[@class='red']"));
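The difference between the absolute and the relative form can be reproduced without HtmlUnit. The following is a minimal sketch that uses the JDK's javax.xml.xpath API instead of HtmlUnit; the class name, the helper countFromBlock and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {
    // Hypothetical sample markup: one red span outside the block, one inside it.
    static final String XML =
        "<html><body>"
        + "<span class='red'>outside</span>"
        + "<div class='block_of_code'><span class='red'>inside</span></div>"
        + "</body></html>";

    /** Evaluates spanQuery with the block div as context node and returns the match count. */
    static int countFromBlock(String spanQuery) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node block = (Node) xpath.evaluate(
            "//div[@class='block_of_code']", doc, XPathConstants.NODE);
        NodeList spans = (NodeList) xpath.evaluate(
            spanQuery, block, XPathConstants.NODESET);
        return spans.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Absolute path: starts at the document root, so both spans match.
        System.out.println(countFromBlock("//span[@class='red']"));   // 2
        // Relative path: starts at the context node, so only the inner span matches.
        System.out.println(countFromBlock(".//span[@class='red']"));  // 1
    }
}
```

The context node is the same in both calls; only the leading dot changes where the search starts.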

Related

Watir finds elements by class, but elements are null?

I am able to find elements by class using Watir, but I can't figure out how to do additional processing with them after selection - the elements found are nil (see below).
I would love to see the html text of each element found.

You have instances of Watir::HTMLElement, which at the time of definition only store the parent and the selector. The @element variable, which represents the object in the DOM located by Selenium through a browser driver, will only be populated when you take an action on the element.
To see the text of each element, just put puts event.text inside your loop.

Why does $x return items outside of the context?

I am attempting to use an XPath locator within a context for a Codeception test using the Selenium driver with Firefox. Specifically, I am trying to click the second link in the message body of an email, viewed with Roundcube.
The body of the email is in the div with XPath //div[@class="rcmBody"]
I can get the link with this path: (//div[@class="rcmBody"]//a)[2]
But for some reason, when I try //a[2] within the context of the body div, it returns all a elements within the iframe.
An example from Codeception (after selecting the correct iframe):
$I->click('//a[2]', '//div[@class="rcmBody"]')
This causes the web driver to click the second link in the iframe, which comes before the body div begins.
I can also test this directly in Chrome:
$x('//a', $x('//div[@class="rcmBody"]')[0])
This returns a list of all a elements within the iframe, not just those within the context.
How can I get the context part to work?

Add a dot to the beginning of the XPath to make it context-specific:
$I->click('(.//a)[2]', '//div[@class="rcmBody"]')
        HERE^
Note that the parentheses are also important here: (.//a)[2] selects the second a descendant of the context node, whereas .//a[2] selects every a that is the second a child of its own parent.
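The effect of the parentheses can be checked with a small standalone sketch. It uses the JDK's javax.xml.xpath API rather than Codeception, and the class name, helper and sample markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PositionalXPathDemo {
    // Hypothetical message body: three links spread over two paragraphs.
    static final String XML =
        "<div class='rcmBody'>"
        + "<p><a>first</a></p>"
        + "<p><a>second</a><a>third</a></p>"
        + "</div>";

    /** Evaluates query relative to the rcmBody div and joins the matched link texts with commas. */
    static String textOfMatches(String query) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node body = doc.getDocumentElement();
        NodeList links = (NodeList) xpath.evaluate(query, body, XPathConstants.NODESET);
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < links.getLength(); i++) {
            if (i > 0) joined.append(',');
            joined.append(links.item(i).getTextContent());
        }
        return joined.toString();
    }

    public static void main(String[] args) throws Exception {
        // [2] applies to the whole parenthesized node set: second link overall.
        System.out.println(textOfMatches("(.//a)[2]")); // second
        // [2] applies per parent: links that are the second a child of their parent.
        System.out.println(textOfMatches(".//a[2]"));   // third
    }
}
```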

contenteditable div in UIWebView - new lines are not saved when clicking on done

I have the following div in UIWebView:
<div contenteditable="true"></div>
If the user inserts a new line (using the return key on the on-screen keyboard) and then taps Done in the grey previous/next/done bar above the keyboard, the lines are combined into one line.
How can I avoid it?

Perhaps this JSFiddle can shed some light on what's happening within your application. If you type some lines in the top DIV (gray background color), the HTML code returned by its innerHTML property is displayed in the textarea field below it, including the HTML tags.
As you will soon see, it's not merely what you'd expect to handle in your application ('line one' + CRLF + 'line two'...): it also contains HTML elements separating one line from another. That's how browsers are able to display contenteditable DIVs as if they were 'memo' type controls, by parsing their HTML (that's what browsers do).
This HTML formatted text is also what your application receives as the user submitted text, and there you have to decide what to do with the formatting. You can either strip it away, replacing HTML elements like <DIV></DIV> and so on with a space character (which is, I suspect, what happens when you set that object's property and it deals with that for you), or choose (with your control's property, or in code) to handle the formatting whichever way you'd like.
I'm not familiar with UIWebView though, and you'll have to find out on your own how to retrieve the complete HTML formatted value that you want to apply to the next DIV element that you're displaying (or to the same one that you're assigning new values to).
UPDATE: After searching the web for a UIWebView reference, I stumbled across a related thread on SO that shows how to retrieve the innerHTML value of an element in your underlying HTML document:
//where 'wView' is your UIWebView
NSString *webText = [wView stringByEvaluatingJavaScriptFromString:@"document.getElementById('inputDIV').innerHTML"];
This way you can retrieve the whole innerHTML string of the contenteditable DIV into the webText string variable and parse its HTML formatted text into whatever suits your needs. Note, though, that different browsers format contenteditable DIVs differently when the Enter key is pressed: some will return the next line enclosed in a new DIV, while others might enclose it in a paragraph P and/or end the line with a break <BR> or <BR /> when Shift+Enter is used to move to the next line. You will have to account for all these possibilities when processing your input string. Open the JSFiddle script in your UIWebView component to check which formatting applies.
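As a rough illustration of accounting for those variants, here is a minimal sketch of the normalization step. It is written in Java for brevity (the same regular expressions port to other languages), the class and method names are hypothetical, and it only covers the DIV, P and BR patterns described above, not arbitrary HTML.

```java
public class ContentEditableText {
    /** Converts the innerHTML of a contenteditable DIV into newline-separated plain text. */
    static String toPlainText(String innerHtml) {
        String s = innerHtml;
        // A <br> right before a closing block tag is only a placeholder for an empty line.
        s = s.replaceAll("(?i)<br\\s*/?>\\s*(?=</(div|p)>)", "");
        // Every remaining <br> is an explicit line break.
        s = s.replaceAll("(?i)<br\\s*/?>", "\n");
        // An opening <div> or <p> starts a new line.
        s = s.replaceAll("(?i)<(div|p)(\\s[^>]*)?>", "\n");
        // Drop all other tags, including the closing ones.
        s = s.replaceAll("<[^>]+>", "");
        // Decode the few entities that typically show up in typed text.
        s = s.replace("&nbsp;", " ").replace("&lt;", "<")
             .replace("&gt;", ">").replace("&amp;", "&");
        // When the first line was wrapped in a block tag, remove the newline it produced.
        if (s.startsWith("\n")) {
            s = s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("line one<div>line two</div><div>line three</div>"));
        System.out.println(toPlainText("<p>line one</p><p>line two</p>"));
        System.out.println(toPlainText("line one<br>line two"));
    }
}
```

Regular expressions are enough for this narrow case; for anything beyond simple line wrappers, a real HTML parser is the safer choice.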
Of course, in your case, it might be simpler to replace your contenteditable DIV with a textarea, whose value is plain text with lines separated by \n (LF). DIVs, however, are easier to style, so choose whichever suits your needs better.
Cheers!

I don't believe there's a solution to this from the Objective-C side of the stack. The standard HTML element only delivers a single string. It might be possible to achieve this through some JavaScript magic or similar on the web end of things.
My HTML skills are not up to scratch, but if you also control that end, perhaps changing the element to a textarea might help?

How can I access an element by using its DOM hierarchy(parent element)?

I want to access an element through the DOM hierarchy, starting from its parent nodes. I am trying to find the DOM hierarchy through Firebug and want something like <parent_node1>.<child_node1>.<child_node2> (not document.getElementById or getElementsByName) to access an element.
I want to automate a scenario where I have column headers and corresponding values, and I want to test whether the values under each column header are correct.
I am thinking of using the DOM to automate this case, but how can I find the DOM hierarchy?
What I see through Inspect Element in Firebug is a list of events and elements that does not look like a hierarchical node structure. Can somebody help in this regard, please?

As discussed, you probably mean the DOM Element properties like element.childNodes, element.firstChild or similar.
Have a look at the DOM Element property reference over at JavaScriptKit, you'll get a good overview there how to access the hierarchy.
var currentTD = document.getElementsByTagName("td")[0];
var currentTable = document.getElementsByTagName("table")[0];
currentTD.parentNode // contains the TR element the TD resides in.
currentTable.childNodes // contains THEAD TBODY and TFOOT if present.
DOM Tables even have more properties like a rows collection and a cells collection.
A word of caution: these collections are live, so iterating over them and reading collection.length in each iteration can be really slow, because the DOM has to be queried each time to get the length. Store the length in a variable before the loop instead.

document.getElementById and document.getElementsByTagName are themselves DOM operations. They take an object within the DOM (the document object; getElementsByTagName can also be called on any element) and return a single element or a collection of zero or more elements, respectively. From there you can do other DOM operations on the results, like getting children, parents or siblings, changing values etc.
All DOM operations come down to:
1. Take a starting point. This is often document, though the first thing we do is so often a call to document.getElementById or document.getElementsByTagName that we could really consider the result of that call the starting point.
2. Find the element or elements we are interested in relative to the starting point, whether through document.getElementById* or startingPoint.getElementsByTagName, perhaps combined with some test (e.g. only working on those with a particular class name, or those that have children of particular types).
3. Read and/or change certain values, add new child nodes and/or delete nodes.
In a case like yours the starting point will be one or more tables found by document.getElementById(someID), document.getElementById(someID).getElementsByTagName('table')[0], or similar. From that table, myTable.getElementsByTagName('th') will get you the column headings. Depending on the structure, and what you are doing with it, you could just select the corresponding elements from myTable.getElementsByTagName('td'), or go through each row and then work on curRow.getElementsByTagName('td').
You could also just use firstChild, childNodes etc., though it's normally more convenient to have elements you don't care about filtered out by tag name.
*Since there can only be one element with a given id in a document, getElementById is defined on document rather than on individual elements. If you want to act only when the element found is a descendant of your current element, check that afterwards, for example with currentElement.contains(foundElement).
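The header-to-values lookup described above can be sketched as follows. The example uses Java's org.w3c.dom API, which offers the same getElementsByTagName call on elements as the browser DOM; the class name, the columns helper and the sample table are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableColumnsDemo {
    // Hypothetical table with two column headers and two data rows.
    static final String XML =
        "<table id='prices'>"
        + "<thead><tr><th>Item</th><th>Price</th></tr></thead>"
        + "<tbody>"
        + "<tr><td>Apple</td><td>1.20</td></tr>"
        + "<tr><td>Pear</td><td>0.90</td></tr>"
        + "</tbody></table>";

    /** Maps each column header to the cell values found under it, in row order. */
    static Map<String, List<String>> columns() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        Element table = doc.getDocumentElement(); // the starting point

        // Column headings, found relative to the table.
        NodeList headers = table.getElementsByTagName("th");
        List<String> names = new ArrayList<>();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (int i = 0, n = headers.getLength(); i < n; i++) {
            String name = headers.item(i).getTextContent();
            names.add(name);
            result.put(name, new ArrayList<>());
        }

        // Go through each row and work on that row's td cells only.
        NodeList rows = table.getElementsByTagName("tr");
        for (int r = 0, n = rows.getLength(); r < n; r++) {
            Element row = (Element) rows.item(r);
            NodeList cells = row.getElementsByTagName("td");
            for (int c = 0; c < cells.getLength() && c < names.size(); c++) {
                result.get(names.get(c)).add(cells.item(c).getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(columns()); // {Item=[Apple, Pear], Price=[1.20, 0.90]}
    }
}
```

A test can then compare the list stored under each header with the expected values. The sketch assumes a simple table without colspan or nested tables.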

Scraping largest block of text from HTML document

I am working on an algorithm that will try to pick out, given an HTML file, what it thinks is the parent element that most likely contains the majority of the page's content text.
For example, it would pick the div "content" in the following HTML:
<html>
<body>
<div id="header">This is the header we don't care about</div>
<div id="content">This is the <b>Main Page</b> content. it is the
longest block of text in this document and should be chosen as
most likely being the important page content.</div>
</body>
</html>
I have come up with a few ideas, such as traversing the HTML document tree to its leaves, adding up the length of the text, and only seeing what other text the parent has if the parent gives us more content than the children do.
Has anyone ever tried something like this, or does anyone know of an algorithm that can be applied? It doesn't have to be bulletproof; as long as it can guess a container that holds most of the page's content text (for articles or blog posts, for example), that would be awesome.

One word: Boilerpipe

Here's roughly how I would approach this:
// get array of all elements (body is used as parent here but you could use whatever)
var elms = document.body.getElementsByTagName('*');
var nodes = Array.prototype.slice.call( elms, 0 );
// get inline elements out of the way (incomplete list)
nodes = nodes.filter(function (elm) {
    return !/^(a|br?|hr|code|i(ns|mg)?|u|del|em|s(trong|pan))$/i.test( elm.nodeName );
});
// sort elements by most text first
nodes.sort(function (a, b) {
    if (a.textContent.length == b.textContent.length) return 0;
    if (a.textContent.length > b.textContent.length) return -1;
    return 1;
});
Using ancestry functions like a.compareDocumentPosition(b), you can also sink elements during sorting (or after), depending on how complex this thing needs to be.

You will also have to formulate a level on which you want to select the node. In your example, the 'body' node has an even larger amount of text in it. So you have to formulate what a 'parent element' exactly is.

You could create an app that looks for the largest contiguous block of text, disregarding formatting tags (if required). You could do this by using a DOM parser and walking the tree, keeping track of the immediate parent (because that is your output).
Start from the parent nodes and traverse the tree. For each child node that is just formatting, continue the character 'count' within the current block instead of starting a new one.
Once you find the block with the most content, traverse back up the tree to its parent to get your answer.
I think your solution relies on how you traverse the DOM and keep track of the nodes that you are scanning.
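One possible reading of that traversal, as a minimal sketch: text inside formatting tags is counted toward the nearest non-formatting ancestor, and the element with the longest such text wins. It assumes the input is well-formed enough for an XML parser (real pages usually need an HTML parser); the class name, the INLINE tag list and the sample markup are illustrative choices, not a complete solution.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class MainContentFinder {
    // Formatting tags whose text counts toward the enclosing block (incomplete list).
    static final Set<String> INLINE = new HashSet<>(
        Arrays.asList("a", "b", "i", "u", "em", "strong", "span", "code"));

    // Sample page modeled on the question's example.
    static final String SAMPLE =
        "<html><body>"
        + "<div id='header'>This is the header we don't care about</div>"
        + "<div id='content'>This is the <b>Main Page</b> content. it is the "
        + "longest block of text in this document and should be chosen as "
        + "most likely being the important page content.</div>"
        + "</body></html>";

    private Element best;
    private int bestLength = -1;

    /** Scores one block element and, recursively, every block nested inside it. */
    private void visit(Element block) {
        StringBuilder text = new StringBuilder();
        collect(block, text);
        int length = text.toString().replaceAll("\\s+", " ").trim().length();
        if (length > bestLength) {
            bestLength = length;
            best = block;
        }
    }

    /** Appends the text that belongs to the current block, descending only into formatting tags. */
    private void collect(Node parent, StringBuilder text) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getNodeValue());
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                if (INLINE.contains(c.getNodeName().toLowerCase())) {
                    collect(c, text);   // formatting: keep counting in the same block
                } else {
                    visit((Element) c); // nested block: scored on its own
                }
            }
        }
    }

    /** Returns the id attribute of the element that directly holds the most text. */
    static String mainBlockId(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        MainContentFinder finder = new MainContentFinder();
        finder.visit(doc.getDocumentElement());
        return finder.best.getAttribute("id");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(mainBlockId(SAMPLE)); // content
    }
}
```

Because body and html hold no text of their own here, they score zero, which sidesteps the problem of an outer container always having the most text in total.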
What language are you using? Any other details for your project? There may be language specific or package specific tools you could use as well.

I can also say that word banks are a great help: for example, lists of common 'advertisey' words like twitter and click, or a check for several capitalized nouns in a row. Having a POS tagger can improve accuracy. For news sites, a list of all known major cities in the world can help with the separation. In fact, you can almost scrape a page without even looking at the HTML.
