I want to get the text from a tag but without the text from nested tags. I.e. in the example below, I only want to get the string 183591 from inside the <small> tag and exclude the text Service Request ID: from the <span> tag. This is not trivial because the <span> tag is nested in the <small> tag. Is this possible with WebDriver and XPath?
The text in the tag is going to change every time.
<div id="claimInfoBox" style="background-color: transparent;">
<div class="col-md-3 rhtCol">
<div class="cib h530 cntborder">
<h4 class="no-margin-bottom">
<p>
<small style="background-color: transparent;">
<span class="text-primary" style="background-color: transparent;">Service Request ID:</span>
183591
</small>
</p>
<div class="border-bottom" style="background-color: transparent;"></div>
<div id="CIB_PersonalInfo_DisplayMode" class="cib_block">
<div id="CIB_PersonalInfo_EditMode" class="cib_block" style="display: none">
</div>
</div>
<script type="text/javascript">
</div>
</div>
You are going to have to use String manipulation. Something like:
// you will need to adjust these XPaths to suit your needs
String outside = driver.findElement(By.xpath("//small")).getText();
String inside = driver.findElement(By.xpath("//span")).getText();
String edge = outside.replace(inside, "");
The simplest way I've found is by getting the parent small node and the child span node and removing the number of characters in the child from the text of the parent:
public String getTextNode() {
    WebElement parent = driver.findElement(By.xpath("//small")); // or By.tagName("small")
    WebElement child = parent.findElement(By.xpath(".//span")); // or By.tagName("span")
    return parent.getText().substring(child.getText().length()).trim();
}
Arguably the simplest way is to use the JavaScript executor, as below. Note that the $ call assumes jQuery is loaded on the page:
JavascriptExecutor js = ((JavascriptExecutor)driver);
String text = (String) js.executeScript("return $(\"small\").clone().children().remove().end().text();");
This returns only the text that belongs directly to the parent element 'small'. Use trim() to strip leading and trailing whitespace. For a full explanation of what is happening here, refer to the link below.
Reference:
http://exploreselenium.com/selenium/exclude-text-content-of-child-elements-of-the-parent-element-in-selenium-webdriver/
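To make the idea behind these answers concrete, here is a minimal sketch of "own text only" extraction. It is not WebDriver code: it uses Python's standard-library ElementTree on a trimmed copy of the <small> element from the question, and the helper name own_text is made up for illustration. The point it shows is that the number lives in a text node that follows the <span>, so it can be collected without touching the span's text at all.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <small> element from the question.
SNIPPET = """
<small>
  <span class="text-primary">Service Request ID:</span>
  183591
</small>
"""


def own_text(element):
    """Return the text sitting directly inside `element`, skipping child elements.

    `own_text` is a hypothetical helper name, not part of any library.
    """
    parts = [element.text or ""]
    # Text that follows a child element is stored on that child as `.tail`.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


small = ET.fromstring(SNIPPET.strip())
print(own_text(small))  # prints 183591
```

Unlike the replace() and substring() approaches, this does not depend on where the label appears in the combined text or on the label being unique.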
I've found the lowest class: <span class="pill css-1a10nyx e1pqc3131"> of multiple elements of a website but now I want to find the related/linked upper-class so for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT - Search all elements of a page with class <span class="pill css-1a10nyx e1pqc3131"> (lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something with an if class = (searchable class) then get (desired higher class) should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture per clarification where the title(highest class) = "Wooferland Festival 2022" and the number(lowest class) = 253
As mentioned, the question needs some more information before a concrete answer can be given.
Assuming you want to scrape the information in the picture, based on your example HTML you can select your pill and use .find_previous() to locate the other elements:
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
Assuming there is a container tag in the HTML structure, such as <a> or similar, you can select it on the condition that it contains a <span> with class pill:
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Note: Instead of relying on CSS classes, which can be highly dynamic, try to use more static attributes or the HTML structure.
Example
Both options are shown below; for the first one, the <a> does not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
print('---------')
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Output
some date information
some title
some number
---------
some date information
some date information
some number
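For readers who want to see the "climb from the pill to its container" step spelled out, here is a rough sketch using only Python's standard library instead of BeautifulSoup. The class names are shortened stand-ins for the generated ones in the question, and climb_to and parent_of are hypothetical names. ElementTree keeps no parent pointers, so the sketch builds a child-to-parent lookup first.

```python
import xml.etree.ElementTree as ET

# Simplified version of the example HTML; class names are shortened stand-ins.
SNIPPET = """
<a>
  <div class="outer">
    <header>some date information</header>
    <div class="title">some title</div>
    <footer><span class="pill">some number</span></footer>
  </div>
</a>
"""

root = ET.fromstring(SNIPPET.strip())

# ElementTree has no parent pointers, so build a child -> parent lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def climb_to(element, tag):
    """Walk upwards from `element` until an ancestor with the given tag is found."""
    node = parent_of.get(element)
    while node is not None and node.tag != tag:
        node = parent_of.get(node)
    return node


rows = []
for pill in root.iter("span"):
    # Only spans whose class list contains "pill" are of interest.
    if "pill" not in pill.get("class", "").split():
        continue
    card = climb_to(pill, "a")
    # ".//div/div" is the first <div> nested in another <div>: the title here.
    rows.append((card.find(".//header").text, card.find(".//div/div").text, pill.text))

print(rows)
```

With BeautifulSoup itself, find_parent() or the :has() selector shown above does the same job with less code.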
#Example 1
<span class="levelone">
<span class="leveltwo" dir="auto">
::before
"Blue"
::after
</span>
</span>
#Example 2
<div class="itemlist">
<div dir="auto" style="text-align: start;">
"mobile"
</div>
</div>
#Example 3
<div class="quantity">
<div class="color">...</div>
<span class="num">10</span>
</div>
Hi, I am trying to use selenium to extract content from html. I managed to extract the content for example 1 & 2, the code that I have used is
example1 = driver.find_elements_by_css_selector("span[class='leveltwo']")
example2 = driver.find_elements_by_css_selector("div[class='itemlist']")
and printed out as text with
data = [dt.text for dt in example1]
print(data)
I got "Blue" for example 1 & "mobile" for example 2. For simplicity purposes, the html given above is for one iteration, I have scraped all elements with the class mentioned above
However, for the 3rd example, I tried to use
example3a = driver.find_elements_by_css_selector("div[class='quantity']")
and
example3b = driver.find_elements_by_css_selector("div[class='num']")
and
example3c = driver. find_element_by_class_name("num")
but all of them returned an empty list. Is that because there is no dir attribute in example 3? What method should I use to extract the "10"?
For the third example, you can try the CSS selector below:
div.quantity span.num
In code, you can write it like this (find_elements returns a list, so loop over the matches):
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
for el in example3a:
    print(el.text)
or
for el in example3a:
    print(el.get_attribute('innerHTML'))
To extract specifically the 10 you can use
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
To extract both elements inside <div class="quantity"> you can use
example3 = driver.find_elements_by_xpath("//div[@class='quantity']//*")
for el in example3:
    print(el.text)
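One likely reason the div[class='num'] attempt came back empty can be checked without a browser. The sketch below is an assumption-laden stand-in: it runs the same two lookups with Python's standard-library ElementTree on the Example 3 markup rather than with Selenium, and shows that the element carrying class "num" is a <span>, not a <div>.

```python
import xml.etree.ElementTree as ET

# Example 3 from the question.
SNIPPET = """
<div class="quantity">
  <div class="color">...</div>
  <span class="num">10</span>
</div>
"""

root = ET.fromstring(SNIPPET.strip())

# The attempt from the question: there is no <div> with class "num".
as_div = root.findall(".//div[@class='num']")
# The element is a <span>, so matching on the right tag finds it.
as_span = root.findall(".//span[@class='num']")

print(as_div)                       # prints []
print([el.text for el in as_span])  # prints ['10']
```

If the real page still returns nothing with the corrected selector, the content may be loaded late or inside a frame, which this offline sketch cannot show.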
I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which is closer to what you are already doing and returns only the children that are elements:
children = divs.element_children
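The difference between "all child nodes" and "element children" is not specific to Nokogiri. As an illustration only, here is the same distinction in Python's standard-library minidom, using one of the <span class="itemize"> blocks from the question (with the closing </span> added so that it parses as XML). The variable names are made up for this sketch.

```python
from xml.dom import minidom

# One <span class="itemize"> block from the question, closed so it parses as XML.
SNIPPET = """<span class="itemize">
    <div class="r12321">Plains</div>
    <div class="as124223">Trains</div>
    <div class="qwss12311232">Automobiles</div>
</span>"""

span = minidom.parseString(SNIPPET).documentElement

# childNodes mixes whitespace-only text nodes with the real elements.
all_children = span.childNodes
# Keeping only element nodes drops the whitespace entries.
elements = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]

# Map each element's class attribute to its text content.
item_hash = {el.getAttribute("class"): el.firstChild.data for el in elements}

print(len(all_children), len(elements))  # prints 7 3
print(item_hash)
```

The four extra nodes are the whitespace runs between the tags, which is where the nil keys in the question's output come from.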
I have some trouble using TFHpple, so here it is :
I would like to parse the following lines :
<div class=\"head\" style=\"height: 69.89px; line-height: 69.89px;\">
<div class=\"cell editable\" style=\"width: 135px;\"contenteditable=\"true\">
<p> 1</p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>2</p>
</div>
</div>
<div style=\"height: 69.89px; line-height: 69.89px;\" class=\"head\">
<div class=\"cell\" style=\"width: 135px; text-align: left;\"contenteditable=\"false\">
<p>3 </p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>4</p>
</div>
</div>
<div style=\"height: 69.89px; line-height: 69.89px;\" class=\"\">
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>5</p>
</div>
<div class=\"cell\" style=\"width: 135px;\" contenteditable=\"false\">
<p>6</p>
</div>
</div>
For now I would like to put the first level of div "element" (sorry I don't know the proper terminology) in an array.
So I tried simply passing /div as the XPath to the searchWithXPathQuery method, but it doesn't find anything.
My second idea was to use a path like //div[@class=\"head\"] that also allows [@class=\"\"], but I don't know whether that is possible.
(I would like to do so because I need the elements to be in the same order in the array as they are in the data)
So here is my question, is there a particular reason why TFHpple wouldn't work with /div?
And if there is no way to take just the first level of div, is it possible to write a predicate on the value of an attribute with XPath (here the class attribute)? And how? I have searched quite a lot and couldn't find anything.
Thanks for your help.
PS : If it helps, here is the code I use to try and parse the data, it is first contained in the string self.material.Text :
NSData * data = [self.material.Text dataUsingEncoding:NSUnicodeStringEncoding];
TFHpple * tableParser = [TFHpple hppleWithHTMLData:data];
NSString * firstXPath = @"/div";
NSArray<TFHppleElement *> * tableHeader = [tableParser searchWithXPathQuery:firstXPath];
NSLog(@"We found : %lu", (unsigned long)tableHeader.count);
You wrote:
Getting first level using TFHpple
I assume you mean: without also getting all descendants?
Taking your other requirements into account, you can do so as follows:
//div[not(ancestor::div)][@class='head' or @class='']
Dissecting this:
Select all div elements (yes, correct term ;) at any level in the whole document: //div
Filter with a predicate (the thing between brackets) for elements that are not themselves inside another div, by checking that there is no div ancestor (parent of a parent of a parent of a....): [not(ancestor::div)]
Filter by your other requirements: [@class='head' or @class='']
Note 1: your given XML is not valid, it contains multiple root elements. XML must have exactly one root element.
Note 2: if your requirements are to first get all divs by @class or empty @class, and then only those that are "first level", reverse the predicates:
//div[@class='head' or @class=''][not(ancestor::div)]
You can use the following XPath expression to get div elements (that is indeed the correct term) whose class attribute value equals "head" or is empty:
//div[@class='head' or @class='']
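As a rough, non-TFHpple illustration of "first-level divs only, filtered on class, in document order", here is a sketch in Python's standard-library ElementTree. Two things in it are assumptions: the fragment is a shortened version of the data from the question, and it is wrapped in a made-up <root> element because the fragment has several top-level elements. ElementTree's XPath subset supports neither the ancestor:: axis nor `or`, so the class test is done in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Shortened version of the data from the question (style attributes dropped).
FRAGMENT = """
<div class="head"><div class="cell"><p>1</p></div><div class="cell"><p>2</p></div></div>
<div class="head"><div class="cell"><p>3</p></div><div class="cell"><p>4</p></div></div>
<div class=""><div class="cell"><p>5</p></div><div class="cell"><p>6</p></div></div>
"""

# The fragment has three top-level elements, so wrap it in a hypothetical
# <root> element to give the parser a single root.
root = ET.fromstring("<root>" + FRAGMENT + "</root>")

# "./div" matches direct children only, so the nested "cell" divs are skipped;
# the class test mirrors [@class='head' or @class=''].
top_level = [d for d in root.findall("./div") if d.get("class") in ("head", "")]

# Collect the <p> texts per top-level div, in document order.
rows = [[p.text for p in d.iter("p")] for d in top_level]
print(rows)  # prints [['1', '2'], ['3', '4'], ['5', '6']]
```

In a full XPath 1.0 engine, the single expression from the answers above does the same selection in one step.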
I'm trying to verify the text in the span by using WebDriver. There is the span tag:
<span class="value">
/Company Home/IRP/tranzycja
</span>
I tried something like this:
driver.findElement(By.xpath("//span[@id='/Company Home/IRP/tranzycja']'"));
driver.findElement(By.cssSelector("span./Company Home/IRP/tranzycja"));
but neither of these works.
Any help would be really appreciated. Thanks
More code:
<span id="uniqName_64_0" class="alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small" data-dojo-attach-point="renderedValueNode" widgetid="uniqName_64_0">
<span class="inner" tabindex="0" data-dojo-attach-event="ondijitclick:onLinkClick">
<span class="label">
In folder:
</span>
<span class="value">
/Company Home/IRP/tranzycja
</span>
</span>
uniqName shouldn't be used as a target because there are a lot of them and they change.
Here is the full HTML:
http://www.filedropper.com/spantag
I am assuming you are trying to verify the text in the span tag,
i.e. '/Company Home/IRP/tranzycja'.
Try the code below:
String expected_String = "/Company Home/IRP/tranzycja";
String actual_String = driver.findElement(By.xpath("//span[@class='alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small']//span[@class='value']")).getText();
if (expected_String.equals(actual_String))
{
    System.out.println("Text is Matched");
}
else
{
    System.out.println("Text is not Matched");
}
You can try using XPath ('some text' can be replaced by a variable, as @Rupesh suggested):
driver.findElement(By.xpath("//span/span[@class='value'][normalize-space(.) = 'some text']"))
or
driver.findElement(By.xpath("//span/span[@class='value'][contains(text(),'some text')]"))
(Be aware that findElement returns only the first matching element, so if there are span elements with the text 'some text 1' and 'some text 2', only the first occurrence will be found.)
Of course, both calls throw NoSuchElementException if no element with the given text is found on the page. If you're using Java, you can easily catch that exception and print a suitable message.
One possible xpath to find that <span> element :
//span[normalize-space(.) = '/Company Home/IRP/tranzycja']
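Why normalize-space matters here can be shown without a browser. The sketch below is only an illustration in Python's standard library, not WebDriver code: normalize_space is a hand-written approximation of the XPath function (trim, then collapse runs of whitespace), applied to a shortened copy of the markup from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the nested spans from the question.
SNIPPET = """
<span class="inner">
  <span class="label">
    In folder:
  </span>
  <span class="value">
    /Company Home/IRP/tranzycja
  </span>
</span>
"""


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join((text or "").split())


root = ET.fromstring(SNIPPET.strip())
value = root.find(".//span[@class='value']")
expected = "/Company Home/IRP/tranzycja"

# The raw text node still carries the newlines and indentation from the markup.
print(value.text == expected)                   # prints False
print(normalize_space(value.text) == expected)  # prints True
```

This is why an exact comparison against the raw text node can miss, while the normalize-space(.) form matches.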
I think you're going to want to use something like
driver.findElement(By.xpath("//span[@class='value']")).getText();
getText() returns the text within that span.
You can use the text() node test inside XPath; wrapping it in normalize-space() handles the whitespace around the value. I hope this resolves your problem.
String str1 = driver.findElement(By.xpath("//span[normalize-space(text())='/Company Home/IRP/tranzycja']")).getText();
System.out.println(str1);
Output = /Company Home/IRP/tranzycja