Getting first level using TFHpple - html

I am having some trouble with TFHpple.
I would like to parse the following HTML:
<div class="head" style="height: 69.89px; line-height: 69.89px;">
    <div class="cell editable" style="width: 135px;" contenteditable="true">
        <p> 1</p>
    </div>
    <div class="cell" style="width: 135px;" contenteditable="false">
        <p>2</p>
    </div>
</div>
<div style="height: 69.89px; line-height: 69.89px;" class="head">
    <div class="cell" style="width: 135px; text-align: left;" contenteditable="false">
        <p>3 </p>
    </div>
    <div class="cell" style="width: 135px;" contenteditable="false">
        <p>4</p>
    </div>
</div>
<div style="height: 69.89px; line-height: 69.89px;" class="">
    <div class="cell" style="width: 135px;" contenteditable="false">
        <p>5</p>
    </div>
    <div class="cell" style="width: 135px;" contenteditable="false">
        <p>6</p>
    </div>
</div>
For now I would like to put the first-level div elements (sorry if that is not the proper terminology) in an array.
So I tried simply passing /div as the XPath to the searchWithXPathQuery: method, but it doesn't find anything.
My second idea was to use a path of this kind: //div[@class="head"], while also allowing [@class=""], but I don't even know if that is possible.
(I would like to do so because I need the elements to be in the same order in the array as they are in the data.)
So here is my question: is there a particular reason why TFHpple wouldn't work with /div?
And if there is no way to take just the first level of divs, is it possible to write a predicate on the value of an attribute with XPath (here the class attribute)? And how? I have looked for quite a while and couldn't find anything.
Thanks for your help.
PS: If it helps, here is the code I use to try to parse the data, which is first contained in the string self.material.Text:
NSData *data = [self.material.Text dataUsingEncoding:NSUnicodeStringEncoding];
TFHpple *tableParser = [TFHpple hppleWithHTMLData:data];
NSString *firstXPath = @"/div";
NSArray<TFHppleElement *> *tableHeader = [tableParser searchWithXPathQuery:firstXPath];
NSLog(@"We found: %lu", (unsigned long)tableHeader.count);

You wrote:
Getting first level using TFHpple
I assume you mean: without also getting all descendants?
Taking your other requirements into account, you can do so as follows:
//div[not(ancestor::div)][@class='head' or @class='']
Dissecting this:
Select all div elements (yes, that is the correct term ;) at any level in the whole document: //div
Filter with a predicate (the part between square brackets) to keep only the divs that are not nested inside another div, by checking that there is no div ancestor (parent, parent of a parent, and so on): [not(ancestor::div)]
Filter by your other requirements: [@class='head' or @class='']
Note 1: your given markup is not well-formed XML, because it contains multiple root elements. An XML document must have exactly one root element.
Note 2: if your requirements are to first get all divs by @class or empty @class, and then only those that are "first level", reverse the predicates:
//div[@class='head' or @class=''][not(ancestor::div)]
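For anyone who wants to try the selection outside of TFHpple, here is a minimal sketch in Python's standard xml.etree.ElementTree. That module only supports a subset of XPath (no ancestor axis, no "or" inside predicates), so the sketch takes the direct children of a wrapper element and filters on the class attribute in plain Python instead. The <root> wrapper, the shortened markup, and the extra "other" row are assumptions made for the example. As a side note, HTML parsers built on libxml2 typically wrap a fragment in html and body elements, which would explain why /div matches nothing, but I have not verified that against TFHpple itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the markup from the question,
# wrapped in a single <root> element so that it parses as XML.
FRAGMENT = """
<root>
  <div class="head"><div class="cell"><p>1</p></div></div>
  <div class="head"><div class="cell"><p>3</p></div></div>
  <div class=""><div class="cell"><p>5</p></div></div>
  <div class="other"><div class="cell"><p>7</p></div></div>
</root>
"""


def first_level_divs(xml_text):
    """Return top-level divs whose class is 'head' or empty, in document order."""
    root = ET.fromstring(xml_text)
    # './div' matches direct children only, so the nested "cell" divs are skipped.
    return [d for d in root.findall("./div") if d.get("class") in ("head", "")]


rows = first_level_divs(FRAGMENT)
# Text of the first cell of every selected row.
first_cells = [d.find("./div/p").text for d in rows]
print(first_cells)
```

The list keeps document order, which is what the question asks for.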

You can use the following XPath expression to get the div elements (that is indeed the correct term) whose class attribute equals "head" or is empty:
//div[@class='head' or @class='']


Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = <<~HTML
  <html>
  <body>
  <div class="item">
  <div class="profile">
  <span class="itemize">
  <div class="r12321">Plains</div>
  <div class="as124223">Trains</div>
  <div class="qwss12311232">Automobiles</div>
  </span>
  </div>
  <div class="profile">
  <span class="itemize">
  <div class="lknoijojkljl98799999">Love</div>
  <div class="vssdfsd0809809">First</div>
  <div class="awefsaf98098">Sight</div>
  </span>
  </div>
  </div>
  </body>
  </html>
HTML
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').each do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes, even when they contain nothing but whitespace.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children
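The difference between "all child nodes" and "child elements only" is not specific to Nokogiri. As a rough illustration, here is the same idea with Python's standard xml.dom.minidom, where childNodes also includes the whitespace text nodes and has to be filtered by node type. The snippet is a hypothetical, well-formed cut-down of the markup in the question, not the asker's real page.

```python
from xml.dom import minidom

# Hypothetical well-formed snippet modelled on the question's markup.
MARKUP = """<span class="itemize">
    <div class="r12321">Plains</div>
    <div class="as124223">Trains</div>
    <div class="qwss12311232">Automobiles</div>
</span>"""

span = minidom.parseString(MARKUP).documentElement

# childNodes includes the whitespace-only text nodes between the divs.
all_children = list(span.childNodes)

# Keep element nodes only, then map class -> text.
elements = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]
item_hash = {e.getAttribute("class"): e.firstChild.data for e in elements}

print(len(all_children), len(elements))
print(item_hash)
```

Filtering by node type plays the same role here as element_children does in the Nokogiri answer.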

Get div class title content text using xpath

I need to get the text "ELECTRONIC ARTS" (this value changes with the data) by locating it through the div with class "title" and text "Offered By" (which is the same for every item), using XPath. I tried various XPath expressions but couldn't get the result I want. Any help would be appreciated.
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div> </div>
This is one possible XPath expression to start with, which you can then simplify or extend with more criteria as needed (formatted for readability):
//div[
    @class='meta-info'
      and
    div[@class='title' and normalize-space()='Offered By']
]/div[@class='content']
Explanation:
//div[@class='meta-info' and ... : find a div element whose class attribute equals "meta-info" and ...
div[@class='title' and normalize-space()='Offered By']] : ... that has a child div whose class attribute equals "title" and whose whitespace-normalized text equals "Offered By"
/div[@class='content'] : from that div (the <div class="meta-info">, to be clear), return the child div whose class attribute equals "content"
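If you want to check the logic of that expression without a browser, the following is a small sketch using Python's standard xml.etree.ElementTree. ElementTree has no normalize-space(), so the whitespace normalization is done in Python. The <body> wrapper and the extra "Updated" block are made up for the example, to show that only the matching block is picked.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second block is the one from the question,
# the first one is invented to show that it is skipped.
PAGE = """
<body>
  <div class="meta-info">
    <div class="title"> Updated</div>
    <div class="content">January 1</div>
  </div>
  <div class="meta-info">
    <div class="title"> Offered By</div>
    <div class="content">ELECTRONIC ARTS</div>
  </div>
</body>
"""


def content_for_title(xml_text, title):
    """Return the 'content' text of the meta-info block with the given title."""
    root = ET.fromstring(xml_text)
    for block in root.findall(".//div[@class='meta-info']"):
        title_div = block.find("./div[@class='title']")
        if title_div is None:
            continue
        # " ".join(s.split()) mimics XPath normalize-space().
        if " ".join((title_div.text or "").split()) == title:
            content = block.find("./div[@class='content']")
            return content.text if content is not None else None
    return None


result = content_for_title(PAGE, "Offered By")
print(result)
```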
Using the examples on Mozilla:
var xpath = document.evaluate("//div[@class='content']", document, null, XPathResult.STRING_TYPE, null);
document.write('The text found is: "' + xpath.stringValue + '".');
console.log(xpath);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>
By the way, I think document.querySelector or document.querySelectorAll are much more convenient in this situation:
var content = document.querySelector('.meta-info .content').innerText;
document.write('The text found is: "' + content + '".');
console.log(content);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>

How can I get list of elements or data which are on same level with same attributes?

I have a web application with a single HTML page.
The page is structured like this:
<div class = 'abc'>
<div class = 'pqr'>test1</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>test2</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
Here I want to take the data from test1 to test2.
I have tried XPath with [node number], but every node is found at position [1].
Is there any way to get all the data, or a list of elements, from test1 to test2, including the "-" entries?
I have seen this kind of issue before.
You have to use following-sibling here.
First, use an XPath of this type:
//div[text()='test1']/../following-sibling::div/div[@class='pqr' and not(contains(text(),'test'))]
Then iterate over the matches in your script (note: I wrote my code in Java).
Logic :
while(element found text = '-')
{
//get data here
}
Please try this approach.
I guess you want the following XPath:
(//div[@class='pqr'])[position()<=4]
Notice the parentheses () before the position() predicate.
Output in an XPath tester:
Element='<div class="pqr">test1</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">test2</div>'
I think you can't use the test1 and test2 elements as identifiers because they are at the same level as the nodes you want to collect. Otherwise, you can use findElements(By.xpath("pattern_to_search")), which returns a collection of the elements matching your pattern.
One more way, without using XPath:
List<WebElement> elements = driver.findElements(By.className("pqr"));
// size() - 1 skips the last "pqr" element (the "-" after test2)
for (int i = 0; i < elements.size() - 1; i++) {
    System.out.println(elements.get(i).getText());
}
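The answers above rely on a fixed position or on skipping the last element. If the number of "-" rows can vary, another option is to walk the elements in document order and collect everything from the start marker through the end marker. Here is a minimal sketch of that idea in Python with the standard xml.etree.ElementTree. The <body> wrapper and the texts_between helper are assumptions for the example, and this is not Selenium code.

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<div class="abc"><div class="pqr">test1</div></div>
<div class="abc"><div class="pqr">-</div></div>
<div class="abc"><div class="pqr">-</div></div>
<div class="abc"><div class="pqr">test2</div></div>
<div class="abc"><div class="pqr">-</div></div>
</body>"""


def texts_between(xml_text, start, end):
    """Collect 'pqr' texts from the start marker through the end marker."""
    root = ET.fromstring(xml_text)
    collected, inside = [], False
    # findall returns matches in document order.
    for div in root.findall(".//div[@class='pqr']"):
        if div.text == start:
            inside = True
        if inside:
            collected.append(div.text)
        if inside and div.text == end:
            break
    return collected


selected = texts_between(PAGE, "test1", "test2")
print(selected)
```

The same loop should translate directly to a list returned by findElements, comparing getText() against the two markers.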

How to access div element text based on adjacent text

I have the following HTML code and am trying to access "QA1234", which is the value of the Serial Number. Can you let me know how I can access this text?
<div class="dataField">
<div class="dataName">
<span id="langSerialNumber">Serial Number</span>
</div>
<div class="dataValue">QA1234</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langHardwareRevision">Hardware Revision</span>
</div>
<div class="dataValue">05</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langManufactureDate">Manufacture Date</span>
</div>
<div class="dataValue">03/03/2011</div>
</div>
I assume you are trying to get the "QA1234" text in terms of being the "Serial Number". If that is correct, you basically need to:
Locate the "dataField" div that includes the serial number span.
Get the "dataValue" within that div.
One way is to get all the "dataField" divs and find the one that includes the span:
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langSerialNumber').exists? }
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langManufactureDate').exists? }
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
Another option is to find the serial number span and then traverse up to the parent "dataField" div:
parent = browser.span(id: 'langSerialNumber').parent.parent
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.span(id: 'langManufactureDate').parent.parent
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
I find the first approach to be more robust to changes since it is more flexible to how the serial number is nested within the "dataField" div. However, for pages with a lot of fields, it may be less performant.
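For comparison, the first approach (find the field that contains the label, then read its value) can be sketched outside of Watir as well. The example below uses Python's standard xml.etree.ElementTree on a shortened copy of the markup. The <body> wrapper and the value_for helper are assumptions for the example.

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<div class="dataField">
  <div class="dataName"><span id="langSerialNumber">Serial Number</span></div>
  <div class="dataValue">QA1234</div>
</div>
<div class="dataField">
  <div class="dataName"><span id="langHardwareRevision">Hardware Revision</span></div>
  <div class="dataValue">05</div>
</div>
</body>"""


def value_for(xml_text, span_id):
    """Return the dataValue text of the dataField that contains the given span id."""
    root = ET.fromstring(xml_text)
    for field in root.findall(".//div[@class='dataField']"):
        # Search at any depth inside the field, so the span's nesting may change.
        if field.find(f".//span[@id='{span_id}']") is not None:
            value = field.find("./div[@class='dataValue']")
            return value.text if value is not None else None
    return None


serial = value_for(PAGE, "langSerialNumber")
revision = value_for(PAGE, "langHardwareRevision")
print(serial, revision)
```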

How to get a single text node in Selenium WebDriver

I want to get the text from a tag but without the text from nested tags. I.e. in the example below, I only want to get the string 183591 from inside the <small> tag and exclude the text Service Request ID: from the <span> tag. This is not trivial because the <span> tag is nested in the <small> tag. Is this possible with WebDriver and XPath?
The text in the tag is going to change every time.
<div id="claimInfoBox" style="background-color: transparent;">
<div class="col-md-3 rhtCol">
<div class="cib h530 cntborder">
<h4 class="no-margin-bottom">
<p>
<small style="background-color: transparent;">
<span class="text-primary" style="background-color: transparent;">Service Request ID:</span>
183591
</small>
</p>
<div class="border-bottom" style="background-color: transparent;"></div>
<div id="CIB_PersonalInfo_DisplayMode" class="cib_block">
<div id="CIB_PersonalInfo_EditMode" class="cib_block" style="display: none">
</div>
</div>
<script type="text/javascript">
</div>
</div>
You are going to have to use String manipulation. Something like:
// you will need to adjust these XPaths to suit your needs
String outside = driver.findElement(By.xpath("//small")).getText();
String inside = driver.findElement(By.xpath("//span")).getText();
String edge = outside.replace(inside, "");
The simplest way I've found is to get the parent small node and the child span node, and then remove the number of characters in the child from the text of the parent:
public String getTextNode() {
    WebElement parent = driver.findElement(By.xpath("//small")); // or By.tagName("small")
    WebElement child = parent.findElement(By.xpath(".//span")); // or By.tagName("span")
    return parent.getText().substring(child.getText().length()).trim();
}
Arguably the simplest way is to use the JavaScript executor, as below (this assumes jQuery is loaded on the page, since the script relies on $):
JavascriptExecutor js = (JavascriptExecutor) driver;
String text = (String) js.executeScript("return $(\"small\").clone().children().remove().end().text();");
This returns only the text that belongs directly to the parent small element. Use trim() to remove leading and trailing whitespace. For a full explanation of what is happening here, please refer to the link below.
Reference:
http://exploreselenium.com/selenium/exclude-text-content-of-child-elements-of-the-parent-element-in-selenium-webdriver/
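To make the idea of "the element's own text without the nested tags" concrete, here is a small sketch in Python's standard xml.etree.ElementTree rather than WebDriver. In ElementTree, text that follows a child element is stored in that child's tail, so the parent's own text is its text plus the tails of its children. The own_text helper and the trimmed markup are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the <small> element from the question.
MARKUP = """<small>
  <span class="text-primary">Service Request ID:</span>
  183591
</small>"""


def own_text(element):
    """Return the text that belongs to the element itself, not to its children."""
    parts = [element.text or ""]
    # Text placed after a child element is stored in that child's .tail.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


small = ET.fromstring(MARKUP)
request_id = own_text(small)
print(request_id)
```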