Find Element with Text, then Select Element Many Levels Above it - html

I am making a Chrome extension that deletes a section of a website. To do this, I need to find a <span> that contains some text, and then select the containing <div> tag. Often this tag will be many levels above the span in the DOM, and it doesn't have a consistent attribute to select by.
HTML
<body>
<div> <!-- I want to select this DIV -->
<div>
<div>
<div>
<span>some text</span>
</div>
</div>
</div>
</div>
</body>
I have used //span[text() = 'some text'] to find the right <span>, but now I need to go back up to the first <div> in the example HTML.
I have tried //*[ancestor::span[text() = 'some text']] and //span[ancestor::*[text() = 'some text']]. I realize these would only go up to the first parent, but even that is not working for me, even though they come up as valid XPath expressions when I test them on XPath Tester.
What is the simplest way of writing an XPath expression that can do this?

You might try the ../ syntax to go up one level (i.e. to the immediate parent) and chain the steps like so:
// Evaluate an XPath expression and return the matching nodes as an array.
const getnodes = function( expr, parent ) {
    let results = [];
    let contextNode = parent || document;
    let query = document.evaluate( expr, contextNode, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null );
    for ( let i = 0, length = query.snapshotLength; i < length; ++i ) {
        results.push( query.snapshotItem( i ) );
    }
    return results;
};
// Climb five levels from the span, then select the div children at that level.
let col = getnodes( '//span[ text()="some text" ]/../../../../../div', document.body );
col.forEach( n => console.info( n.textContent ) );
<!--This is the div I want to select -->
<div>top
<div>a
<div>b
<div>c
<span>some text</span>
</div>
</div>
</div>
</div>

A better solution I have found is to use the ancestor axis in the XPath query, because I often want to traverse up many levels.
Examples
//span[text() = 'some text']/ancestor::*[4]
selects the fourth element up from the selected span, regardless of what type those elements might be.
//span[text() = 'some text']/ancestor::a[2]
selects the second anchor (<a>) element up from the selected span, counting only anchors that are ancestors of the span.
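For anyone who wants to experiment with the same idea outside the browser, here is a minimal sketch in Python using only the standard library. Note that xml.etree.ElementTree does not support the ancestor axis, so the sketch emulates ancestor::*[n] with a child-to-parent map. The helper name nth_ancestor, the id="target" attribute, and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup mirroring the question (well-formed so ElementTree can parse it).
# The id attribute is added only so the result is easy to recognise.
HTML = """
<body>
  <div id="target">
    <div>
      <div>
        <div>
          <span>some text</span>
        </div>
      </div>
    </div>
  </div>
</body>
"""

def nth_ancestor(root, element, n):
    """Return the n-th ancestor of element (1 = parent), or None if there is none."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    current = element
    for _ in range(n):
        current = parents.get(current)
        if current is None:
            return None
    return current

root = ET.fromstring(HTML)
# Rough equivalent of //span[text() = 'some text']
span = next(el for el in root.iter("span") if el.text == "some text")
# Rough equivalent of //span[text() = 'some text']/ancestor::*[4]
target = nth_ancestor(root, span, 4)
print(target.tag, target.get("id"))
```

With a library that implements full XPath 1.0, the ancestor expression itself could be used directly; the parent map here is only a stand-in for that axis.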

Related

BeautifulSoup - Trying to get text inside span tags

I want to pull the text inside the span tags, but when I try to use .text or get_text() I get errors (either after print(spans) or in the for loop). What am I missing? For now I have it set to do this only for the first div of class col, just to test whether it works, but I will want it to work for the second as well.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
    spans = x.find('span')
    print (spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    print(span.text)
If you want to find the spans within the specific divs, then you can do:
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    spans = div.find_all('span')
    for span in spans:
        print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all( 'div', {'class': 'col'})
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times like you said happened, you need to change up the logic a little bit. First you parse and add all matches you find to a list and THEN you loop through the list with all the matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
<div class=col>
<span/>
</div>
</div>
By using a list instead of printing right away, you can handle any matches that are identical to existing records.
In the above HTML example you can see how the span could be added twice with the way matches are found in the answer suggested by Kyle. It's all about making sure your logic only finds the matches you need. How you do it often depends on how the HTML is formatted, but it's also important to be creative!
Good luck.
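The collect-first, print-later idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, the markup is a made-up, well-formed version of the nested example above, and unique_spans is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed sample: a div.col nested inside another div.col, as in the example above.
HTML = """
<body>
  <div class="col">
    <div class="col">
      <span>Liverpool</span>
    </div>
  </div>
</body>
"""

def unique_spans(root):
    """Collect each <span> under any div.col once, even if the divs are nested."""
    seen = set()   # span elements already collected
    matches = []   # matches in the order they were first found
    for div in root.iter("div"):
        if div.get("class") != "col":
            continue
        for span in div.iter("span"):
            if span not in seen:
                seen.add(span)
                matches.append(span)
    return matches

root = ET.fromstring(HTML)
# Naive approach: the span is reached once through each enclosing div.col.
naive = [s.text for d in root.iter("div") if d.get("class") == "col" for s in d.iter("span")]
print(naive)
# Collect first, then print: each span appears once.
for span in unique_spans(root):
    print(span.text)
```

The duplicate check compares elements, not their text, so two different spans that happen to hold the same text are both kept.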

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing and which returns only the children that are elements:
children = divs.element_children

Get div class title content text using xpath

I need to get the text "ELECTRONIC ARTS" (this can change according to the data) by locating the div with class "title" and text "Offered By" (this will be the same for all records), using XPath. I tried various XPath expressions but couldn't get the results I want. I'm really looking for someone's help on this.
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div> </div>
This is one possible XPath expression to start with, which you can then simplify or extend with more criteria as needed (formatted here for readability):
//div[
    @class='meta-info'
    and
    div[@class='title' and normalize-space()='Offered By']
]/div[@class='content']
Explanation:
//div[@class='meta-info' and ... : find a div element whose class attribute equals "meta-info" and ...
div[@class='title' and normalize-space()='Offered By']] : ... that has a child div whose class attribute equals "title" and whose content equals "Offered By"
/div[@class='content'] : from that div (the <div class="meta-info">, to be clear), return the child div whose class attribute equals "content"
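To make the logic of that expression concrete, here is a small sketch in Python using only the standard library. ElementTree's XPath subset has no normalize-space(), so the whitespace check is done in Python instead. The function name content_for_title is an assumption for illustration, and the second meta-info block is invented to show that only the matching title is used.

```python
import xml.etree.ElementTree as ET

# First block is from the question; the second block is invented sample data.
HTML = """
<body>
  <div class="meta-info">
    <div class="title"> Offered By</div>
    <div class="content">ELECTRONIC ARTS</div>
  </div>
  <div class="meta-info">
    <div class="title"> Updated</div>
    <div class="content">June 2015</div>
  </div>
</body>
"""

def content_for_title(root, title):
    """Return the content text of the meta-info block whose title matches, else None."""
    # //div[@class='meta-info' and ...]
    for block in root.iterfind(".//div[@class='meta-info']"):
        title_div = block.find("div[@class='title']")
        # Stand-in for normalize-space()='Offered By'
        if title_div is not None and " ".join((title_div.text or "").split()) == title:
            # .../div[@class='content']
            content_div = block.find("div[@class='content']")
            return content_div.text if content_div is not None else None
    return None

root = ET.fromstring(HTML)
print(content_for_title(root, "Offered By"))
```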
Using the examples on Mozilla:
var xpath = document.evaluate("//div[@class='content']", document, null, XPathResult.STRING_TYPE, null);
document.write('The text found is: "' + xpath.stringValue + '".');
console.log(xpath);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>
By the way, I think document.querySelector or document.querySelectorAll are much more convenient in this situation:
var content = document.querySelector('.meta-info .content').innerText;
document.write('The text found is: "' + content + '".');
console.log(content);
<div class="meta-info">
<div class="title"> Offered By</div>
<div class="content">ELECTRONIC ARTS</div>
</div>

Selenium WebDriver how to verify Text from Span Tag

I'm trying to verify the text in the span by using WebDriver. Here is the span tag:
<span class="value">
/Company Home/IRP/tranzycja
</span>
I tried something like this:
driver.findElement(By.xpath("//span[@id='/Company Home/IRP/tranzycja']'"));
driver.findElement(By.cssSelector("span./Company Home/IRP/tranzycja"));
but neither of these works.
Any help would be really appreciated. Thanks
More code:
<span id="uniqName_64_0" class="alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small" data-dojo-attach-point="renderedValueNode" widgetid="uniqName_64_0">
<span class="inner" tabindex="0" data-dojo-attach-event="ondijitclick:onLinkClick">
<span class="label">
In folder:
</span>
<span class="value">
/Company Home/IRP/tranzycja
</span>
</span>
uniqName shouldn't be used as a target because there are a lot of them and they change.
Here is the full HTML code:
http://www.filedropper.com/spantag
Here I am assuming you are trying to verify the text in the span tag.
i.e '/Company Home/IRP/tranzycja'
Try Below code
String expected_String = "/Company Home/IRP/tranzycja";
String actual_String = driver.findElement(By.xpath("//span[@class='alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small']//span[@class='value']")).getText();
if (expected_String.equals(actual_String))
{
    System.out.println("Text is Matched");
}
else
{
    System.out.println("Text is not Matched");
}
You can try using XPath ('some text' can be replaced by a variable, as @Rupesh suggested):
driver.findElement(By.xpath("//span/span[@class='value'][normalize-space(.) = 'some text']"))
or
driver.findElement(By.xpath("//span/span[@class='value'][contains(text(),'some text')]"))
(Be aware that findElement returns only the first matching element, so if there are span elements with the text 'some text 1' and 'some text 2', the second expression matches both but only the first occurrence is returned.)
Of course, both calls will throw NoSuchElementException if no element with the given text is found on the page. If you're using Java, you can easily catch that exception and print a suitable message if needed.
One possible xpath to find that <span> element :
//span[normalize-space(.) = '/Company Home/IRP/tranzycja']
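The reason normalize-space matters here is that the span's text is surrounded by newlines and indentation. The following sketch illustrates the difference in plain Python with the standard library rather than WebDriver; the normalize_space helper is a hypothetical approximation of the XPath function, and the markup is a trimmed copy of the snippet from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question (attributes other than class removed).
HTML = """
<span class="inner">
  <span class="label">
    In folder:
  </span>
  <span class="value">
    /Company Home/IRP/tranzycja
  </span>
</span>
"""

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse runs of whitespace."""
    return " ".join((text or "").split())

root = ET.fromstring(HTML)
expected = "/Company Home/IRP/tranzycja"

# Exact comparison, like text() = '...': fails because of the surrounding whitespace.
exact = [el for el in root.iter("span") if el.text == expected]
# Normalized comparison, like normalize-space(.) = '...'
normalized = [
    el for el in root.iter("span")
    if normalize_space("".join(el.itertext())) == expected
]

print(len(exact), len(normalized))
print(normalized[0].get("class"))
```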
I think you're going to want to use something like
driver.findElement(By.xpath("//span[@id='/Company Home/IRP/tranzycja']")).getText();
The getText() call will get the text within that span.
You can use text() inside the XPath. I hope this resolves your problem.
String str1 = driver.findElement(By.xpath("//span[text()='/Company Home/IRP/tranzycja']")).getText();
System.out.println(str1);
Output = /Company Home/IRP/tranzycja

How to access div element text based on adjacent text

I have the following HTML code and am trying to access "QA1234", which is the value of the Serial Number. Can you let me know how I can access this text?
<div class="dataField">
<div class="dataName">
<span id="langSerialNumber">Serial Number</span>
</div>
<div class="dataValue">QA1234</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langHardwareRevision">Hardware Revision</span>
</div>
<div class="dataValue">05</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langManufactureDate">Manufacture Date</span>
</div>
<div class="dataValue">03/03/2011</div>
</div>
I assume you are trying to get the "QA1234" text in terms of being the "Serial Number". If that is correct, you basically need to:
Locate the "dataField" div that includes the serial number span.
Get the "dataValue" within that div.
One way is to get all the "dataField" divs and find the one that includes the span:
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langSerialNumber').exists? }
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langManufactureDate').exists? }
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
Another option is to find the serial number span and then traverse up to the parent "dataField" div:
parent = browser.span(id: 'langSerialNumber').parent.parent
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.span(id: 'langManufactureDate').parent.parent
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
I find the first approach to be more robust to changes since it is more flexible to how the serial number is nested within the "dataField" div. However, for pages with a lot of fields, it may be less performant.
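The first approach (find the field that contains the label, then read its value) is not specific to Watir. Here is a minimal sketch of the same lookup in Python with the standard library, offered only as an illustration; value_for is a hypothetical helper name and the markup is a shortened copy of the HTML from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question.
HTML = """
<body>
  <div class="dataField">
    <div class="dataName">
      <span id="langSerialNumber">Serial Number</span>
    </div>
    <div class="dataValue">QA1234</div>
  </div>
  <div class="dataField">
    <div class="dataName">
      <span id="langManufactureDate">Manufacture Date</span>
    </div>
    <div class="dataValue">03/03/2011</div>
  </div>
</body>
"""

def value_for(root, span_id):
    """Return the dataValue text of the dataField that contains the given span id."""
    for field in root.iterfind(".//div[@class='dataField']"):
        # Does this field contain the label span anywhere below it?
        if field.find(f".//span[@id='{span_id}']") is not None:
            value = field.find("div[@class='dataValue']")
            return value.text if value is not None else None
    return None

root = ET.fromstring(HTML)
print(value_for(root, "langSerialNumber"))
print(value_for(root, "langManufactureDate"))
```

Because the label span is searched for at any depth below the field, the lookup keeps working if the nesting inside "dataName" changes, which is the robustness argument made above.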