Getting text in <div> using Selenium

I have a question related to Selenium in Python:
I want to obtain the text content "D. New Jersey" on a webpage. In addition, the text that I want to get can be different on different pages, but it is always under "COURT:".
The HTML code is:
<div class="span4">
<strong>COURT:</strong>
D. New Jersey
</div>
The code I use now is as follows, and it doesn't work:
self.driver.get(address)
element=driver.findElement("//a[contains(@class,'span4') and contains(div/div/text(),'COURT:')]").gettext()
I have also tried the following solutions with no luck, and no Selenium exception is being thrown either:
text = self.driver.find_element_by_xpath("//div[strong[text()='COURT:']]").text
and
text = self.driver.find_element_by_xpath("//a[contains(@class,'span4') and contains(div/div/text(),'COURT:')]").text
Is there anyone who knows how to get the text from this code using Selenium?
Thanks

In Python, you can get the text like this:
text = self.driver.find_element_by_xpath("//div[strong[text()='COURT:']]").text
This uses an XPath to select the div element, using its inner strong element to ensure we have selected the correct div. Then, we read the WebElement's text property (a property, not a method) to get the div's text.
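Note that the div's text also contains the label "COURT:", because the text property covers all visible descendants. A minimal sketch of stripping the label follows; text_after_label is a hypothetical helper, and the commented usage assumes Selenium 4, where find_element(By.XPATH, ...) replaced the find_element_by_xpath helper.

```python
def text_after_label(container_text, label):
    """Return the part of container_text that follows label, stripped."""
    _, found, rest = container_text.partition(label)
    if not found:
        raise ValueError(f"label {label!r} not found in text")
    return rest.strip()


# Assumed Selenium 4 usage (needs a running browser, so it is left as a comment):
#   from selenium.webdriver.common.by import By
#   div = driver.find_element(By.XPATH, "//div[strong[text()='COURT:']]")
#   court = text_after_label(div.text, "COURT:")

# Stand-in for the div's text as it appears in the question's HTML.
print(text_after_label("COURT:\nD. New Jersey", "COURT:"))  # D. New Jersey
```

Because the helper only splits on the label, it works for any court name that follows "COURT:" on other pages.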

Related

How can I get the element of a-tag in the div class with selenium?

I am working on a project in which I have to get an element from a specific website.
I want to get the text shown in the markup below.
<div class="block-content">
<div class="block-heading">
<a href="https://www~~~~~~">
<i class="fa fa-map">
::before
</i>
"Text I want to get"
</a>
</div>
</div>
I have been trying to solve this for a while, but I could not find anything that works.
I would be grateful for any help.
Thank you.
According to the information you provided, the text you are looking for is inside an a element, so the XPath for this element is something like:
//a[contains(@href,'https://www')]
But since there is also an i element inside it, getting the text from the a element will give you both the text contained in a itself and the text inside i.
So you should get the text from i, which here looks like just a space, and remove it from the text you receive from a.
If you want to do this for all a elements that have an href attribute and contain an i element, you can use the following XPath:
//a[@href and ./i]
If there are more specific criteria for the elements you are looking for, the XPath above should be updated accordingly.
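The subtraction described above could be sketched in Python roughly as follows. own_text is a hypothetical helper, and the sample strings are made up for illustration; in practice the two arguments would come from the text of the a element and of its i child.

```python
def own_text(parent_text, child_texts):
    """Remove each child's text once from the parent's text and strip the rest."""
    for child_text in child_texts:
        if child_text:  # skip children that render no text at all
            parent_text = parent_text.replace(child_text, "", 1)
    return parent_text.strip()


# Made-up values standing in for a.text and i.text.
print(own_text("icon Text I want to get", ["icon"]))  # Text I want to get
```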
From your comment, I understand that you would like to extract that text. Here is code that extracts it:
Selenium::WebDriver::Wait
  .new(timeout: 60)
  .until { !driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text.empty? }
p driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text[/(?<=before \")\w+ \w+ \w+ \w+ \w+/]
Output:
"Text I want to get"
I couldn't get the elements that I wanted directly, so here's what I did.
I simply post-processed the element text with a few methods.
def seller_name
  shop_info_elements = @driver.find_elements(:class_name, "block-content")
  shop_info_text = shop_info_elements.first.text
  shop_info_text_array = shop_info_text.lines
  seller_name = shop_info_text_array.first.chomp
  seller_name
end
It is not pretty, but it works for other pages on the same site.

XPath: Find a node within a text node

I have the following html:
<code>The first code block</code>
<p>Some text and <code>the second code block</code> followed by other text</p>
I need to find and remove all code blocks from it. I use the XPath '//code', but it finds only the first code block while the second remains.
Question: Why is '//code' not able to catch the second code block? How can I fix it?
Details: I'm doing it in Ruby using Nokogiri. My code looks like this:
html = Nokogiri::HTML(File.read(htmlFile))
html.search('//code').remove
UPDATE:
The XPath did in fact work. I had just made a mistake somewhere else.
It seems you forgot to iterate over the results.
Try:
html = Nokogiri::HTML(File.read(htmlFile))
html.search('//code').each{|htm| htm.remove}
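As the update in the question notes, '//code' does match code elements at any depth. A small Python sketch using the standard library's ElementTree illustrates this; the wrapping <root> element is an assumption added only so that the fragment parses as well-formed XML, and './/code' is ElementTree's spelling of the same descendant search.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a hypothetical <root> element.
FRAGMENT = (
    "<root>"
    "<code>The first code block</code>"
    "<p>Some text and <code>the second code block</code> followed by other text</p>"
    "</root>"
)


def find_code_blocks(markup):
    """Return the text of every <code> element, at any nesting depth."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(".//code")]


print(find_code_blocks(FRAGMENT))
# ['The first code block', 'the second code block']
```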

HTML XPath: Extracting text mixed in with multiple level and complex tags?

Related earlier questions:
HTML XPath: Extracting text mixed in with multiple tags?
HTML XPath: Selectively avoiding tags when extracting text
(Sorry for my poor English.)
I'm a beginner at writing web crawlers. I'm trying to extract the main content from web pages (in Chinese) using XPath (though I have learned that there are both traditional and machine-learning algorithms for extracting the main content of a page), and I'm very new to writing XPath rules.
I'm faced with a web page that contains text mixed into complex tags. I summarize it as follows, where letters (e.g. A, A2) stand for plain text and '...' stands for more tags, possibly nested, without text. I want to get "AA2BB2CDEFGHIJKLMNOP".
...
<div id="artibody" class="art_context">
<div align="center">...</div>
<div align="center"><font>A</font>A2</div>
<div align="left"><br><br><strong>B</strong>B2</div>
<div align="left">
<p>C<a>D</a>E</p>
<p>F<a>G</a>H<a>I</a>J</p>K
</div>
<div align="center">...</div>
<div align="center"><font>L</font></div>
<p>M</p><!--M contains only text, luckily-->
<p>N</p>
<p>O</p>
<p>P<span>...</span><div class="shareBox">...</div>
</p>
<span id="arctTailMark"></span>
<script>
var page_navigation = document.getElementById('page_navigation');
...
</script>
<div style="padding:10px 0 30px 0">...</div>
</div>
Thanks to the previous questions, I wrote this rule:
'string(//div[@class=\"art_context\"])'
It gives me all the content I want as plain text without tags, but the JS code in <script> is extracted as well. I tried the following, but it does not help; the JS code is still there.
'string(//div[@class=\"art_context\" and not(self::script)])'
The following one returns only "\r\n".
'//div[@class=\"art_context\" and not(self::script)]/text()'
Here are my questions:
1. How do I write an XPath rule that meets my need: extracting the content in div[@id="artibody"] except the code in <script>?
2. Is the rule for question 1 simple and powerful? I may meet more pages with a div[@id="artibody"] whose descendant nodes are quite different.
3. Any further suggestions on my task? I am extracting content from one website, but the main content lies in <div> elements with different id, class, and descendant node structure. I run the spider on my laptop (Intel Core i5 3225, 8 GB RAM), and using machine learning algorithms may decrease the crawl speed significantly. At the same time, writing many XPath rules seems tedious.
I'd appreciate any suggestions on this question (and on my English).
To get all descendant text nodes except the script contents, you can use this:
//div[@class="art_context"]//*[not(self::script)]/text()
In natural language: “For every div[@class="art_context"] element, get the text nodes of all its descendant elements that are not script elements”.
The // after div[@class="art_context"] is needed to select descendants, not just children.
In comparison, the //div[@class="art_context" and not(self::script)]/text() expression in the question says “Get all text-node children of all div[@class="art_context"] elements that are not also script elements.”
So the and not(self::script) part of the expression in the question is redundant: a div can never be a script element, so the expression selects just //div[@class="art_context"] anyway, and the /text() part then selects only the text nodes that are direct children of that div, which are just line breaks.
Also, if instead of using XPath to get the set of text nodes you want to get the result as a single string, you can use the functions string-join(…) and normalize-space(…). Note that string-join requires XPath 2.0 or later; XPath 1.0 engines do not provide it.
normalize-space(string-join(//div[@class="art_context"]//*[not(self::script)]/text(), ""))
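Where only an XPath 1.0 engine is available, the same filtering can be done in code instead. The following is a rough Python sketch using the standard library's ElementTree and a plain tree walk rather than XPath. The sample markup is a shortened, well-formed stand-in for the page in the question, and text_without_scripts is a hypothetical helper; real-world HTML would need an HTML parser first. Unlike the XPath above, this walk also keeps text that is a direct child of the starting div.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the page structure in the question.
SAMPLE = (
    '<div id="artibody" class="art_context">'
    '<div align="center"><font>A</font>A2</div>'
    '<div align="left"><p>C<a>D</a>E</p>K</div>'
    "<script>var page_navigation = 1;</script>"
    "<p>M</p>"
    "</div>"
)


def text_without_scripts(element):
    """Concatenate all text below element, skipping the contents of <script>."""
    parts = []

    def walk(node):
        if node.tag == "script":
            return  # ignore the script's own text and children
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:  # text that follows the child inside this node
                parts.append(child.tail)

    walk(element)
    return "".join(parts)


print(text_without_scripts(ET.fromstring(SAMPLE)))  # AA2CDEKM
```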

HTML inside TextArea?

So I have this textarea in my website. By default, it has something like this as its contents:
Name : Sample Value
Age : Sample Value
Location : Sample Value
It is editable before the user hits the button and inserts it into the database, although I am not using a rich text editor since it's nothing but simple text.
Since basic HTML codes are not browser readable inside the textarea tag, I used
to separate lines.
Now my problem is that I am not able to include the HTML code when I read the value of the textarea tag on the server side.
Thus, the value inserted into the database is not HTML formatted either, and when it is fetched into a web browser again, it has no formatting at all.
What alternatives do I have? Thanks.
Not possible using textarea, use contenteditable DIV instead.
<div contenteditable="true"></div>
You can get and set the content as shown below:
// Get content
var contents = document.getElementById("divId").innerHTML;
// Set content
document.getElementById("divId").innerHTML = contents;
Here is the browser support for this approach.
Why don't you use jQuery and call $(textarea).val() to get the value of the textarea as a string, and use it on the server side? You might have to consider using Ajax to call the server-side method you want to pass the HTML data to.
The answer is very simple.
Use contenteditable DIVs instead of TextBox and TextArea.
But remember to add contenteditable="false" to all your inner HTML tags.
This worked for me.

Write Selenium Test to test text wrapping?

I am trying to write a Selenium test using Selenium 2.0 for the attached scenario.
The HTML code is as follows:
<div width="100px" style = "background-color:blue">
This%is%Normal%Text%on%page%without%dynamic%view%and%not%width%set.
</div>
I need to verify whether the text is wrapping or not. The text should wrap in the 2nd scenario.
Any help would be appreciated.
Thanks,
Sahil
I'm afraid this goes too deep into the browser's internals, so it is not possible with Selenium. Maybe you can check whether the height of the wrapping element (the div) exceeds the height of a single line, but that looks fragile to me.
Maybe you should try URL encoding/decoding to see whether that particular text is encoded or not?
You can check if the text is overflowed.
public boolean isElementOverflowed(CoreLocator element) {
    return (boolean) coreDriver.executeScript(
        "return arguments[0].scrollWidth > arguments[0].clientWidth",
        element.getWebElement());
}
and call it in your code.
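The same overflow check could be written with Selenium's Python bindings roughly as follows. This is a sketch only: is_element_overflowed is a hypothetical helper, and it assumes driver is a WebDriver instance and element a WebElement obtained from it. The only Selenium call used is execute_script, which passes the element to the page as arguments[0].

```python
# JavaScript run in the page: content wider than the visible box means overflow.
OVERFLOW_SCRIPT = "return arguments[0].scrollWidth > arguments[0].clientWidth;"


def is_element_overflowed(driver, element):
    """Return True if the element's content is wider than its visible box."""
    return bool(driver.execute_script(OVERFLOW_SCRIPT, element))
```

A test would then assert on the result, for example assert not is_element_overflowed(driver, div) for the scenario in which the text is expected to wrap.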