Get div class title content text using xpath - html

I need to get the text "ELECTRONIC ARTS" (this value changes with the data) that sits below the div with class "title" and text "Offered By" (this part is the same for every record), using XPath. I tried various XPath expressions but couldn't get the result I want. Any help would be appreciated.
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div> </div>

This is one possible XPath expression to start with, which you can then simplify or extend with more criteria as needed (formatted across several lines for readability):
//div[
    @class='meta-info'
    and
    div[@class='title' and normalize-space()='Offered By']
]/div[@class='content']
Explanation:
//div[@class='meta-info' and ... : find a div element whose class attribute equals "meta-info" and ...
div[@class='title' and normalize-space()='Offered By']] : ... which has a child div whose class attribute equals "title" and whose whitespace-normalized content equals "Offered By"
/div[@class='content'] : from that div (the <div class="meta-info">, to be clear), return the child div whose class attribute equals "content"
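The same selection can be sketched in Python with only the standard library. This is an illustration under assumptions, not part of the original answer: `xml.etree.ElementTree` supports only a small XPath subset without `normalize-space()`, so the trimming is done in Python; the helper name `content_for_title`, the `<root>` wrapper, and the second "Updated" block are made up for the example, and the markup is assumed to be well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# Sample markup; the second block is invented to show that only the
# block with the matching title is picked.
HTML = """
<root>
  <div class="meta-info">
    <div class="title"> Offered By</div>
    <div class="content">ELECTRONIC ARTS</div>
  </div>
  <div class="meta-info">
    <div class="title">Updated</div>
    <div class="content">1 January</div>
  </div>
</root>
"""

def content_for_title(root, title):
    """Return the 'content' text of the meta-info block whose title matches."""
    for block in root.iterfind(".//div[@class='meta-info']"):
        title_div = block.find("./div[@class='title']")
        # Trim whitespace in Python, standing in for normalize-space().
        if title_div is not None and (title_div.text or "").strip() == title:
            content_div = block.find("./div[@class='content']")
            if content_div is not None:
                return (content_div.text or "").strip()
    return None

root = ET.fromstring(HTML)
print(content_for_title(root, "Offered By"))
```

With the sample markup this prints ELECTRONIC ARTS. A full XPath 1.0 engine could evaluate the expression above directly instead of looping.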

Using the examples on Mozilla:
var xpath = document.evaluate("//div[@class='content']", document, null, XPathResult.STRING_TYPE, null);
document.write('The text found is: "' + xpath.stringValue + '".');
console.log(xpath);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>
By the way, I think document.querySelector or document.querySelectorAll are much more convenient in this situation:
var content = document.querySelector('.meta-info .content').innerText;
document.write('The text found is: "' + content + '".');
console.log(content);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>

Related

How to get a div or span class from a related span class?

I've found the lowest class: <span class="pill css-1a10nyx e1pqc3131"> of multiple elements of a website but now I want to find the related/linked upper-class so for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT - Search all elements of a page with class <span class="pill css-1a10nyx e1pqc3131"> (lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something with an if class = (searchable class) then get (desired higher class) should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture per clarification where the title(highest class) = "Wooferland Festival 2022" and the number(lowest class) = 253
As mentioned, the question needs some more information before a concrete answer can be given.
Assuming you want to scrape the information in the picture based on your example HTML, you can select your pill and use .find_previous() to locate the other elements:
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
Assuming there is a container tag in the HTML structure, such as <a> or similar, you can select it on the condition that it contains a <span> with class pill:
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Note: Instead of using CSS classes, which can be highly dynamic, try to use more static attributes or the HTML structure.
Example
See both options; for the first one the <a> does not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>
'''
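# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# Option 1: start at the pill and walk backwards through the document
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)

print('---------')

# Option 2: start at the container that has a pill inside it
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)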
Output
some date information
some title
some number
---------
some date information
some date information
some number

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = <<~HTML
  <html>
  <body>
    <div class="item">
      <div class="profile">
        <span class="itemize">
          <div class="r12321">Plains</div>
          <div class="as124223">Trains</div>
          <div class="qwss12311232">Automobiles</div>
      </div>
      <div class="profile">
        <span class="itemize">
          <div class="lknoijojkljl98799999">Love</div>
          <div class="vssdfsd0809809">First</div>
          <div class="awefsaf98098">Sight</div>
      </div>
    </div>
  </body>
  </html>
HTML
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
require 'nokogiri'

page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').each do |divs|
  # children returns every child node, text nodes included
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{\"r12321\"=>\"Plains\", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", \"lknoijojkljl98799999\"=>\"Love\", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
But I'm ending up with a mess like this:
{nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"r12321\"=>\"Plains\", nil=>\" \", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"lknoijojkljl98799999\"=>\"Love\", nil=>\" \", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children
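The difference between "all child nodes" and "child elements only" is not specific to Nokogiri. As a rough illustration in Python (not Ruby, and not part of the original answer), the standard-library `xml.dom.minidom` shows the same behaviour: `childNodes` includes the whitespace-only text nodes, and filtering on the node type leaves just the elements. The markup below is a trimmed, well-formed copy of one span from the question, and the variable names are made up for the sketch.

```python
from xml.dom import minidom

# One span from the question, with the closing tag added so it parses as XML.
MARKUP = """<span class="itemize">
    <div class="r12321">Plains</div>
    <div class="as124223">Trains</div>
    <div class="qwss12311232">Automobiles</div>
</span>"""

span = minidom.parseString(MARKUP).documentElement

# childNodes includes the whitespace-only text nodes between the divs.
all_children = list(span.childNodes)

# Keep only element nodes, the same filtering element_children performs.
elements = [n for n in span.childNodes if n.nodeType == n.ELEMENT_NODE]

# Map each element's class attribute to its text.
item_hash = {el.getAttribute("class"): el.firstChild.data for el in elements}

print(len(all_children), len(elements))
print(item_hash)
```

Here the span has seven child nodes but only three child elements, which is why iterating over all children produced the nil keys in the question.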

Getting first level using TFHpple

I have some trouble using TFHpple, so here it is :
I would like to parse the following lines :
<div class=\"head\" style=\"height: 69.89px; line-height: 69.89px;\">
<div class=\"cell editable\" style=\"width: 135px;\"contenteditable=\"true\">
<p> 1</p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>2</p>
</div>
</div>
<div style=\"height: 69.89px; line-height: 69.89px;\" class=\"head\">
<div class=\"cell\" style=\"width: 135px; text-align: left;\"contenteditable=\"false\">
<p>3 </p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>4</p>
</div>
</div>
<div style=\"height: 69.89px; line-height: 69.89px;\" class=\"\">
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>5</p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>6</p>
</div>
</div>
For now I would like to put the first-level div "elements" (sorry, I don't know the proper terminology) into an array.
So I tried simply passing /div as the XPath to the searchWithXPathQuery method, but it doesn't find anything.
My second idea was to use a path like //div[@class=\"head\"] while also allowing [@class=\"\"], but I don't even know if that is possible.
(I would like to do so because I need the elements to be in the same order in the array as they are in the data)
So here is my question: is there a particular reason why TFHpple wouldn't work with /div?
And if there is no way to take just the first level of div, is it possible to write a predicate on the value of an attribute in XPath (here the class attribute)? And how? I have searched quite a lot and couldn't find anything.
Thanks for your help.
PS : If it helps, here is the code I use to try and parse the data, it is first contained in the string self.material.Text :
NSData *data = [self.material.Text dataUsingEncoding:NSUnicodeStringEncoding];
TFHpple *tableParser = [TFHpple hppleWithHTMLData:data];
NSString *firstXPath = @"/div";
NSArray<TFHppleElement *> *tableHeader = [tableParser searchWithXPathQuery:firstXPath];
// count is an NSUInteger, so cast it and print it with %lu
NSLog(@"We found : %lu", (unsigned long)tableHeader.count);
You wrote:
Getting first level using TFHpple
I assume you mean: without also getting all descendants?
Taking your other requirements into account, you can do so as follows:
//div[not(ancestor::div)][@class='head' or @class='']
Dissecting this:
Select all div elements (yes, correct term ;) at any level in the whole document: //div
Filter with a predicate (the part between square brackets) to keep only elements that are not themselves inside another div, by checking that there is no div ancestor (parent, parent of a parent, and so on): [not(ancestor::div)]
Filter by your other requirements: [@class='head' or @class='']
Note 1: your given XML is not valid, it contains multiple root elements. XML can have at most one root element.
Note 2: if your requirements are to first get all divs by @class or empty @class, and then only those that are "first level", reverse the predicates:
//div[@class='head' or @class=''][not(ancestor::div)]
You can use the following XPath expression to get the div elements (that is indeed the correct term) whose class attribute equals "head" or is empty:
//div[@class='head' or @class='']
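To see the "first level only" filtering in action outside TFHpple, here is a small Python sketch using the standard library. It is an illustration under assumptions: `xml.etree.ElementTree` has no `ancestor::` axis and no `or` in predicates, so the sketch wraps the fragment in a made-up `<wrapper>` root (which also gives it a single root element), selects direct children with `./div`, and does the class test in Python. The markup is a shortened version of the question's data.

```python
import xml.etree.ElementTree as ET

# Shortened version of the fragment from the question (three top-level divs).
FRAGMENT = """
<div class="head"><div class="cell"><p> 1</p></div><div class="cell"><p>2</p></div></div>
<div class="head"><div class="cell"><p>3 </p></div><div class="cell"><p>4</p></div></div>
<div class=""><div class="cell"><p>5</p></div><div class="cell"><p>6</p></div></div>
"""

# Wrap the fragment so the document has exactly one root element.
root = ET.fromstring("<wrapper>" + FRAGMENT + "</wrapper>")

# "./div" matches only direct children of the wrapper, in document order,
# so the nested "cell" divs are skipped (the role of not(ancestor::div)).
first_level = [
    div for div in root.findall("./div")
    if div.get("class") in ("head", "")
]

for div in first_level:
    cells = ["".join(p.itertext()).strip() for p in div.iter("p")]
    print(repr(div.get("class")), cells)
```

The three outer divs come back in document order, and none of the nested cell divs are included.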

How can I get list of elements or data which are on same level with same attributes?

I have a web application with a single HTML page.
The page structure looks like this:
<div class = 'abc'>
<div class = 'pqr'>test1</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>test2</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
Here I want to get the data from test1 to test2.
I have tried XPath with [node number], but every node is found at position [1].
Is there any way to get all the data, or a list of elements, from test1 to test2 including the "-" entries?
I have seen this kind of issue before.
You have to use following-sibling here.
First I use this kind of XPath:
//div[text()='test1']/..//following-sibling::div[@class='pqr' and not(contains(text(),'test'))]
Then you need to change your script. (Note: I wrote my code in Java.)
Logic :
while(element found text = '-')
{
//get data here
}
Please try this approach.
I guess you want the following xpath :
(//div[@class='pqr'])[position()<=4]
Notice the brackets () before position() predicate.
output in xpath tester :
Element='<div class="pqr">test1</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">test2</div>'
I think you can't use the test1 and test2 elements as identifiers because they are on the same level as the nodes you want to collect. Otherwise, I think you can use findElements(By.xpath("pattern_to_search")), which returns a collection of the elements that match your pattern.
one more way without using xpath:
List<WebElement> elements = driver.findElements(By.className("pqr"));
// size() - 1 stops before the last element, the trailing "-" after test2
for (int i = 0; i < elements.size() - 1; i++) {
    System.out.println(elements.get(i).getText());
}
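The answers above rely on fixed positions (the first four elements, or everything but the last). If the number of "-" rows can vary, one alternative is to collect the elements in document order and keep those from the start marker through the end marker. A minimal sketch of that idea in Python with the standard library follows; it is not Selenium code, the helper name `texts_between` and the `<body>` wrapper are made up for the example, and the page is assumed to parse as XML.

```python
import xml.etree.ElementTree as ET

# The structure from the question, wrapped in a single root element.
PAGE = """
<body>
  <div class="abc"><div class="pqr">test1</div></div>
  <div class="abc"><div class="pqr">-</div></div>
  <div class="abc"><div class="pqr">-</div></div>
  <div class="abc"><div class="pqr">test2</div></div>
  <div class="abc"><div class="pqr">-</div></div>
</body>
"""

def texts_between(root, start, end):
    """Collect 'pqr' texts from the start marker through the end marker."""
    collected = []
    inside = False
    for div in root.iterfind(".//div[@class='pqr']"):
        text = (div.text or "").strip()
        if text == start:
            inside = True
        if inside:
            collected.append(text)
        if inside and text == end:
            break
    return collected

root = ET.fromstring(PAGE)
print(texts_between(root, "test1", "test2"))
```

The same loop translates directly to a list returned by findElements: iterate in order, start collecting at test1, stop after test2.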

Dojo StackController: how can I set a title for each button?

(using dojo 1.10.1)
I am working with dojo's dijit/layout/StackContainer and dijit/layout/StackController, which are working fine; here is a simplified example. My problem is that I can't find a "clean" way to add mouseover titles to each controller button that the StackController creates.
html
<div>
<div data-dojo-type="dijit/layout/StackContainer"
data-dojo-props="id: 'contentStack'">
<div data-dojo-type="dijit/layout/ContentPane" title="one">
<h4>Group 1 Content</h4>
</div>
<div data-dojo-type="dijit/layout/ContentPane" title="two">
<h4>Group 2 Content</h4>
</div>
<div data-dojo-type="dijit/layout/ContentPane" title="three">
<h4>Group 3 Content</h4>
</div>
</div>
<div data-dojo-type="dijit/layout/StackController" data-dojo-props="containerId:'contentStack'"></div>
</div>
So for each title in each child contained within the StackContainer, a button is created by the StackController with the same label, but the button has no mouseover text, and I need to add that as well.
I am not interested in any solution that involves looping over the nodes and finding each button; it's just not nice.
One of the best solutions would be to send properties, methods and events of buttons via corresponding ContentPanes. For example:
<div data-dojo-type="dijit/layout/ContentPane" title="one"
     data-dojo-props="controllerProps: {onMouseOver: function(){'doSomething'}}">
    <h4>Group 1 Content</h4>
</div>
But as far as I understand, this is not possible, because StackController passes only "title" and a few other, less important properties of the ContentPane to its buttons. So if you really want a solution like the one above, you have to override the default behavior of StackController, which is possible but takes more time! :)
So I suggest another solution, which works and is faster. Give the StackController div an id:
<div id="myController" data-dojo-type="dijit/layout/StackController" data-dojo-props="containerId:'contentStack'"></div>
Use "dijit/registry" to look up that id:
var controllerWidget = registry.byId("myController");
You now have the StackController widget. Call its getChildren() method and you get an array of Button widgets. The rest, I guess, is straightforward.
Here is the JSFiddle example.
Cheers!
Update:
Hey I have found another solution, which satisfies your requirements: "No button search"
These are the properties which StackController passes to buttonWidget:
var Cls = lang.isString(this.buttonWidget) ? lang.getObject(this.buttonWidget) : this.buttonWidget;
var button = new Cls({
    id: this.id + "_" + page.id,
    name: this.id + "_" + page.id, // note: must match id used in pane2button()
    label: page.title,
    disabled: page.disabled,
    ownerDocument: this.ownerDocument,
    dir: page.dir,
    lang: page.lang,
    textDir: page.textDir || this.textDir,
    showLabel: page.showTitle,
    iconClass: page.iconClass,
    closeButton: page.closable,
    title: page.tooltip,
    page: page
});
So if you give your ContentPane a "tooltip" attribute, it will appear in the buttonWidget as "title".
<div data-dojo-type="dijit/layout/ContentPane" title="one" tooltip="First Page">
Here is another JSFiddle example.