Using XPath to get text of paragraph with links inside - html

I'm parsing an HTML page with XPath and want to grab the whole text of a specific paragraph, including the text of any links inside it.
For example, I have the following paragraph:
<p class="main-content">
This is sample paragraph with <a href="...">link</a> inside.
</p>
I need to get the following text as the result: "This is sample paragraph with link inside". However, applying "//p[@class='main-content']/text()" gives me only "This is sample paragraph with inside".
Could you please assist? Thanks.

To get the whole text content of a node, use the string function:
string(//p[@class="main-content"])
Note that this returns a string value, not nodes. If you want the text nodes themselves (as text() returns them), search at all depths instead of only among the direct children:
//p[@class="main-content"]//text()
This returns three text nodes: "This is sample paragraph with", "link" and "inside".
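The difference between the two expressions can be reproduced offline. The sketch below is only an illustration: it uses Python's standard-library ElementTree, whose XPath subset has neither string() nor text(), so itertext() stands in for //text() and a join of its results stands in for string(). The href value and the wrapping div are placeholders, not part of the question.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the href and the outer <div> are placeholders.
SNIPPET = (
    '<div><p class="main-content">This is sample paragraph with '
    '<a href="#">link</a> inside.</p></div>'
)

root = ET.fromstring(SNIPPET)
p = root.find(".//p[@class='main-content']")

# All descendant text nodes in document order (the role of //p//text()).
text_nodes = list(p.itertext())

# Concatenation of those nodes (the role of string(//p)).
full_text = "".join(text_nodes)

print(text_nodes)
print(full_text)
```

With a full XPath 1.0 engine the two original expressions can be used directly; the stand-ins here only mirror their results.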

Related

Get text of a tag and the text of child tags

I have this HTML
<p>
<strong>aquiline</strong>
<i> adj. </i>
of or like the eagle.
</p>
This whole node is wrapped in a div with class="field-item even".
I would like to receive "Aquiline adj. of or like the eagle.". At the moment I have this incorrect XPath: response.xpath('//div[@class="field-item even"]//descendant-or-self::p/text()').getall()
Your XPath is almost correct. Replace p with * so that the text of all descendant elements is included, not only the text nodes that are direct children of the paragraph. Wrapping the expression in normalize-space also gives you the text as one whitespace-normalized string instead of a list, because the function takes the string value of the first matched node. See the snippet below.
response.xpath('normalize-space(//div[@class="field-item even"]//descendant-or-self::*)').get()
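What normalize-space does to the paragraph's string value can be sketched without Scrapy. This is an assumption-laden illustration using only the standard library: itertext() collects the descendant text, and split/join imitates normalize-space (trim the ends, collapse internal whitespace runs to single spaces). The wrapping div is reconstructed from the question's description.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in the div it describes.
SNIPPET = """<div class="field-item even">
<p>
<strong>aquiline</strong>
<i> adj. </i>
of or like the eagle.
</p>
</div>"""

root = ET.fromstring(SNIPPET)
p = root.find("p")

# Raw string value: every descendant text node, concatenated as-is.
raw = "".join(p.itertext())

# Imitation of normalize-space(): strip and collapse whitespace runs.
normalized = " ".join(raw.split())

print(repr(raw))
print(normalized)
```

The raw value still contains the newlines from the source markup, which is why the plain text() list in the question looked fragmented.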

Extracting full text from HTML span element with XPath expression

I have a HTML tree which looks like this:
<div id="RF4FOEQ3OPBEX" data-hook="review" class="a-section review aok-relative">
<div data-hook="review-collapsed" aria-expanded="false" class="a-expander-content reviewText review-text-content a-expander-partial-collapse-content">
<span>
Text line1.
<br>
Text line2.
</span>
I am trying to extract all the text from the span with the following XPath expression:
//div[@data-hook="review"]//div[@data-hook="review-collapsed"]/span/text()
However, this approach only returns the first line of text, up to the <br>. How should I approach this in order to extract the full text content of the span? I would appreciate any help, and thank you in advance for the support.
Use //text() instead of /text() together with the getall method to get all the text inside a specific element.
getall returns a list, so just join it:
txt = "".join(response.xpath('//div[@data-hook="review"]//div[@data-hook="review-collapsed"]/span//text()').getall())

Get text of an element when it has more elements with text

I am using Selenium in Python with Firefox and want to get the text "TextB" from the following piece of HTML. I have tried element.get_attribute('textContent'), but it returns "TextA" too. Is there any way of getting ONLY that text?
<p class="class_name" id="id_name">
<i class="class_name2"></i>
<b>TextA</b>
TextB
</p>
Try to get the text content of the last child node only:
element = driver.find_element_by_id('id_name')
driver.execute_script('return arguments[0].lastChild.textContent', element)
Given the HTML you have provided, another way to extract the text TextB is to split the element's innerHTML with splitlines() and pick the fourth line. Note that this depends on the exact line layout of the markup:
myText = driver.find_element_by_xpath("//p[@class='class_name' and @id='id_name']").get_attribute("innerHTML").splitlines()[3]
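The line below gives you TextA, which is the text of the b tag:
driver.find_element_by_xpath("//p[@class='class_name']/b").text
The line below gives you the rendered text of the whole p tag. Because .text includes the text of child elements, the result contains TextA as well as TextB:
driver.find_element_by_xpath("//p[@class='class_name']").text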
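The idea behind the lastChild answer is that TextB is a text node of the p itself, sitting after the b element. A minimal offline sketch of that idea follows. It assumes the markup is parsed with the standard-library ElementTree rather than read from a live browser, where the text following an element is exposed as that element's tail.

```python
import xml.etree.ElementTree as ET

# Markup from the question, parsed offline instead of via a WebDriver.
SNIPPET = """<p class="class_name" id="id_name">
<i class="class_name2"></i>
<b>TextA</b>
TextB
</p>"""

p = ET.fromstring(SNIPPET)
b = p.find("b")

# Text that follows </b> inside <p>: the paragraph's last text node.
own_text = b.tail.strip()

# More general: join every text node that is a direct child of <p>,
# skipping text that lives inside child elements such as <b>.
direct_text = "".join([p.text or ""] + [child.tail or "" for child in p]).strip()

print(own_text)
print(direct_text)
```

The second form does not depend on TextB being the last node, only on it being a direct child of the paragraph.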

Get text followed by certain text or get all text if that text is missing

I need to get the text from HTML pages, but some of them contain unwanted text that comes after a separator paragraph ('---------').
Example of HTML page 1:
...
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
...
Example of HTML page 2:
...
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
...
So if a page contains '---------', I need to grab only the text before it; otherwise I need to grab everything. As noted in "Get text followed by certain text", I can use:
//p[following-sibling::p[contains(.,'---------')]][1]/text()
for the first example. But is there a way to use one XPath for both cases?
//p[ not(contains(.,'---------'))
and not(preceding-sibling::p[contains(.,'---------')])]//text()
This returns
This is correct text. Everything after it is wrong
for your first case and
This is correct text. Everything after it is wrong
This text is also valid
This is another correct text
for your second case, as requested.
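The selection rule in that XPath (keep a p only if neither it nor any earlier sibling contains the separator) can be checked with a small sketch. The standard-library ElementTree does not support preceding-sibling, so the loop below is a plain-Python restatement of the same rule, not the XPath itself. The body wrapper and the helper name texts_before_separator are hypothetical.

```python
import xml.etree.ElementTree as ET

SEPARATOR = "---------"

def texts_before_separator(markup):
    """Return the text of each sibling <p> up to the first separator paragraph."""
    root = ET.fromstring(markup)
    kept = []
    for p in root.findall("p"):
        text = "".join(p.itertext())
        if SEPARATOR in text:
            break  # this paragraph and every later sibling are dropped
        kept.append(text.strip())
    return kept

# Both pages from the question, wrapped in an assumed <body> element.
PAGE_1 = """<body>
<p> This is correct text. Everything after it is wrong</p>
<p>---------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
</body>"""

PAGE_2 = """<body>
<p> This is correct text. Everything after it is wrong</p>
<p> This text is also valid </p>
<p> This is another correct text</p>
</body>"""

print(texts_before_separator(PAGE_1))
print(texts_before_separator(PAGE_2))
```

When the separator is absent the loop never breaks, which is the same reason the single XPath covers both pages.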

Get text followed by certain text

I need to get the text, but only the part that comes before a certain separator ('---------------').
Example of HTML code:
...
<p> This is correct text. Everything after it is wrong</p>
<p>---------------------</p>
<p><strong>This is wrong text</strong></p>
<p> This is wrong another text</p>
...
I'm trying to solve this with the following XPath expression:
/p/text()[normalize-space()][not(ancestor::p[contains(.,'---')])]
Unfortunately, this doesn't work as expected.
I would appreciate a correct solution.
This XPath selects the text of the first p, among its siblings, that has a following sibling p containing ---:
//p[following-sibling::p[contains(.,'---')]][1]/text()