I would like to count the text of all descendant elements which do not have a link as an ancestor.
//*[string-length(normalize-space(//*[not(ancestor::a)])) > 10]
Which, if tested on this structure, would return [Get This Text]:
<b>
ignore
<a>ignore</a>
Get This Text
</b>
It's not really clear what you mean by "count the text", but the following expression returns all elements that don't have a link as an ancestor and whose normalized string value is longer than 10 characters:
//*[not(ancestor::a) and string-length(normalize-space()) > 10]
Since you want the expression to return the string 'Get This Text', maybe you want to select text nodes, not elements:
//text()[not(ancestor::a) and string-length(normalize-space()) > 10]
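A quick way to check both expressions is a standalone sketch in Python with lxml (the question itself is library-agnostic; lxml is just one XPath engine that supports selecting text nodes):

```python
from lxml import etree

doc = etree.fromstring("""<b>
    ignore
    <a>ignore</a>
    Get This Text
</b>""")

# elements with no <a> ancestor whose normalized string value is > 10 chars
elems = doc.xpath("//*[not(ancestor::a) and string-length(normalize-space()) > 10]")
print([e.tag for e in elems])       # ['b'] -- the whole <b> element qualifies

# text nodes with no <a> ancestor longer than 10 chars once normalized
texts = doc.xpath("//text()[not(ancestor::a) and string-length(normalize-space()) > 10]")
print([t.strip() for t in texts])   # ['Get This Text']
```

This also illustrates the difference between the two answers: the element expression matches `<b>` itself (its string value includes all descendant text), while the text-node expression isolates just the wanted string.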
Related
HTML (particularly MathML) can be heavily nested. With Ruby-Nokogiri, I want to search for a node at the uppermost levels, which are arbitrary, within a parent node. Here is an example HTML/MathML.
<math><semantics>… (arbitrary depth)
<mrow> (call it (1))
<mrow> (1-1)
<mrow> (2)
<mrow> (2-1)
<mrow> (2-2)
For a Nokogiri::HTML object page, page.css("math mrow") returns an Array of all the <mrow> nodes (size 5 in this case), with the last node being "<mrow> (2-2)".
My goal is to identify the last <mrow> node at the upper-most level, i.e., "<mrow> (2)" in the example above (so that I can add another node after it).
In other words, I want to get the "last node of a certain kind at the shallowest depth among all the nodes of the kind". The depth of the uppermost level for the type of node is unknown and so I cannot limit the depth for the search.
If you want the uppermost mrow node in terms of depth, you could select, among all mrow:first-of-type matches, the one with the fewest ancestors:
first_mrow = page.css('mrow:first-of-type').min_by.with_index { |node, index| [node.ancestors.size, index] }
Adding with_index ensures that for nodes with identical number of ancestors, the first one will be picked.
To get the first mrow node from the start of the document (regardless of depth), you could simply use:
first_mrow = page.at_css('mrow')
With the first mrow node you can then select its parent node:
parent = first_mrow.parent
and finally retrieve the last element from the parent's (immediate) mrow nodes:
last_mrow = parent.css('> mrow').last
The latter can also be expressed via the :last-of-type CSS pseudo-class:
last_mrow = parent.at_css('> mrow:last-of-type')
Sounds like a breadth-first search problem at first glance: add a node to the queue; if it's not of the desired type, remove it and add its children to the queue in reversed order; repeat until you find the desired node. It will be the shallowest one because of BFS properties, and the last one because we add the children reversed.
Quick and dirty example:
require "nokogiri"
def find_last_shallowest(root)
  queue = [root]
  while queue.any?
    element = queue.shift
    break element if matching?(element)
    queue += element.children.reverse
  end
end

def matching?(element)
  # Put your matching logic here
  element.name == "m"
end
doc = <<~XML
<foo>
  <bar>
    <m>
      <x></x>
    </m>
  </bar>
  <baz>
  </baz>
  <m>
    <y></y>
  </m>
  <m>
    <z></z>
  </m>
</foo>
XML
xml = Nokogiri::XML(doc)
find_last_shallowest(xml.root) # => #(Element:0xf744 { name = "m", children = [ #(Text "\n "), #(Element:0xf758 { name = "z" }), #(Text "\n ")] })
It finds the m element which has z as a child, i.e. the last one at the shallowest depth.
I want to click the [more recipes] (in German: [mehr Rezepte]) button via RSelenium on the following webpage: https://migusto.migros.ch/de/rezept-uebersicht/mexiko
I tried the following:
rD<-rsDriver(browser = 'chrome', port = 427L, chromever = '87.0.4280.88')
remDr<-rD$client
remDr$navigate('https://migusto.migros.ch/de/rezept-uebersicht/mexiko')
load_btn <- remDr$findElement(using = 'class', value = '.icon-right')
load_btn$clickElement
Does someone know how to find the right input into findElement() to get the button clicked via Rselenium?
Thank you a lot and BR
David
RSelenium can locate elements in the HTML page using several different strategies.
You can use any of the following:
findElement(using = c("xpath", "css selector", "id", "name", "tag name",
                      "class name", "link text", "partial link text"),
            value = "the locator you find in the HTML page")$clickElement()
class name : Returns an element whose class name contains the search value; compound class names are not permitted.
css selector : Returns an element matching a CSS selector.
id : Returns an element whose ID attribute matches the search value.
name : Returns an element whose NAME attribute matches the search value.
link text : Returns an anchor element whose visible text matches the search value.
partial link text : Returns an anchor element whose visible text partially matches the
search value.
tag name : Returns an element whose tag name matches the search value.
xpath : Returns an element matching an XPath expression.
Below is a small example:
library(RSelenium)
rD<-rsDriver(browser = 'chrome', port = 428L, chromever = '87.0.4280.88')
remDr<-rD$client
remDr$navigate('https://migusto.migros.ch/de/rezept-uebersicht/mexiko')
remDr$findElement(using = 'xpath', value = '//*[@id="top"]/div[3]/main/div[1]/div/div[2]/form/a')$clickElement()
If you want to dig deeper into the arguments, see the official documentation.
I want to extract from the following HTML code:
<li>
<a test="test" href="abc.html" id="11">Click Here</a>
"for further reference"
</li>
I'm trying to do it with the following extract command:
response.css("article div#section-2 li::text").extract()
But it only gives the "for further reference" line.
The expected output is "Click Here for further reference" as one string.
How can I do this?
How can I modify this to handle the following patterns as well:
Text Hyperlink Text
Hyperlink Text
Text Hyperlink
There are at least a couple of ways to do that:
Let's first build a test selector that mimics your response:
>>> response = scrapy.Selector(text="""<li>
... <a test="test" href="abc.html" id="11">Click Here</a>
... "for further reference"
... </li>""")
First option, with a minor change to your CSS selector.
Look at all text descendants, not only text children (notice the space between li and ::text pseudo element):
# this is your CSS selector,
# which only gives the direct text children of the selected LI
>>> response.css("li::text").extract()
[u'\n ', u'\n "for further reference"\n']
# notice the extra space
# here
# |
# v
>>> response.css("li ::text").extract()
[u'\n ', u'Click Here', u'\n "for further reference"\n']
# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n Click Here\n "for further reference"\n'
Another option is to chain your .css() call with XPath 1.0 string() or normalize-space() inside a subsequent .xpath() call:
>>> response.css("li").xpath('string()').extract()
[u'\n Click Here\n "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']
# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'
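The normalize-space() approach also covers the other patterns mentioned in the question (text before the link, after it, or on both sides). A standalone sketch using plain lxml as a stand-in for Scrapy's selector, which is built on lxml (the sample snippets here are made up for illustration):

```python
from lxml import etree

snippets = [
    '<li>before <a href="x.html">link</a> after</li>',  # Text Hyperlink Text
    '<li><a href="x.html">link</a> after</li>',         # Hyperlink Text
    '<li>before <a href="x.html">link</a></li>',        # Text Hyperlink
]
# normalize-space() concatenates all descendant text and collapses whitespace
results = [etree.fromstring(s).xpath('normalize-space()') for s in snippets]
print(results)  # ['before link after', 'link after', 'before link']
```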
I'd use XPath; in that case the selectors would be:
response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()  # text of the hyperlink >> "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()   # href of the hyperlink >> "abc.html"
response.xpath('//article/div[@id="section-2"]/li/text()').extract()    # text of the li >> "for further reference"
From the following HTML code, I want to select only the first text node of the outer span.
<span class="item_amount order_minibasket_amount order_full_minibasket">10
<span class="article">Article
<i class="icon"> >
</i>
</span>
</span>
This is my current XPATH :
//span[contains(@class,'order_minibasket_amount')]
When I use this in my Selenium test, I get the whole span text, like:
10 Article >
I just want to get the "10" article amount.
AMOUNT(new PageElement(By.xpath("//span[contains(@class,'order_minibasket_amount')]/text()[1]"), "not such Element...."))
public String getAmount() {
    return amount = PageObjectUtil.findAndInitElementInside(webElement, PageElements.AMOUNT.pe, amount, String.class);
}
Many thanks in advance,
Cheers,
koko
What you want cannot be done directly; you will have to resort to string manipulation. Something like:
String completeString = driver.findElement(By.className("item_amount")).getText();
String endString = driver.findElement(By.className("article")).getText();
String beginString = completeString.replace(endString, "");
You could add /text() to the end:
//span[contains(@class,'order_minibasket_amount')]/text()
Instead of selecting the span element node, this XPath selects the set of text nodes that are direct children of the span element. That should be a set of two nodes: one containing "10", a newline, and four spaces (the text between the opening tag of the target span and the opening tag of the nested span), and one containing just a newline (the text between the closing tags of the two spans). If you only want the first text node child ("10", newline, spaces), then use
//span[contains(@class,'order_minibasket_amount')]/text()[1]
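Note that Selenium's findElement can only return element nodes, not text nodes, so a /text() step cannot be used with it directly; the expression itself is valid in a full XPath engine. A minimal illustration with Python's lxml as a stand-in (markup adapted from the question):

```python
from lxml import etree

span = etree.fromstring(
    '<span class="item_amount order_minibasket_amount order_full_minibasket">10\n'
    '    <span class="article">Article <i class="icon"> &gt; </i></span>\n'
    '</span>'
)
# the first text-node child of the outer span is "10" plus trailing whitespace
amount = span.xpath("//span[contains(@class,'order_minibasket_amount')]/text()[1]")[0]
print(amount.strip())  # 10
```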
For now I am using a workaround solution, but I am not happy with it. :-(
public String getAmount() {
    String tempAmount = PageObjectUtil.waitFindAndInitElement(PageElements.AMOUNT.pe).getText();
    String output = tempAmount.replaceAll("[a-zA-Z->]", "");
    return amount = output.trim();
}
cheers,
KoKo
I have a file that has HTML within XML tags and I want that HTML as raw text, rather than have it be parsed as children of the XML tag. Here's an example:
import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")
If I try:
root.find('text').text
it returns None (which displays as no output),
but root.find('text/p').text will return the paragraph text without the tags. I want everything within the text tag as raw text, but I can't figure out how to get this.
Your solution is reasonable. An Element object behaves like a list of its children. The .text attribute of an element holds only the text that precedes its first child element; text inside nested elements belongs to those elements (and text following a child is stored in that child's .tail).
There is one thing to improve in your code, though. In Python, repeated string concatenation is an expensive operation. It is better to build a list of substrings and join them afterwards, like this:
output_lst = []
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))
output_text = ''.join(output_lst)
The list can also be built using a list comprehension, so the code becomes:
output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]
output_text = ''.join(output_lst)
.join() can consume any iterable that produces strings, so the list need not be constructed in advance. Instead, a generator expression (the same syntax as the list comprehension, just without the surrounding []) can be used:
output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))
The one-liner can be formatted to more lines to make it more readable:
output_text = ''.join(ET.tostring(child, encoding="unicode")
for child in root.find('text'))
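For completeness, here is the generator-expression version as a minimal runnable script, using the document from the question:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")

# serialize every child of <text> and join the pieces into one string
output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
print(output_text)  # <p>This is some text that I want to read</p>
```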
I was able to get what I wanted by appending all child elements of my text tag to a string using ET.tostring:
output_text = ""
for child in root.find('text'):
output_text += ET.tostring(child, encoding="unicode")
>>>output_text
>>>"<p>This is some text that I want to read</p>"
The above solutions will miss the initial part of your HTML if the content begins with text, e.g.:
<root><text>This is <i>some text</i> that I want to read</text></root>
You can do this instead:
node = root.find('text')
output_list = [node.text] if node.text else []
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
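A complete runnable version of this approach, using the example above:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><text>This is <i>some text</i> that I want to read</text></root>")
node = root.find('text')

# start with the leading text (node.text), if any
output_list = [node.text] if node.text else []
# ET.tostring() serializes each child including its .tail text,
# so the text following </i> is preserved as well
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
print(output_text)  # This is <i>some text</i> that I want to read
```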