XPath - all elements except those in header

I'm trying to figure out an XPath that matches all elements except the header and anything inside it. Let's assume that the header can be detected by three conditions:
outer tag is header, e.g. <header><div.....></header>
outer tag has an id which contains the string "header"
outer tag has a class which contains the string "header"
My XPath: //*[not(ancestor::header)] and //*[not(ancestor::*[contains(@id,"header")])] and //*[not(ancestor::*[contains(@class,"header")])]
is not correct.
EDIT:
This should match all links which are inside header:
//*[ancestor::*[contains(@id,"header") or contains(@class,"header") or header]]
Now I want to get all elements except these.
Do you know how to make it work?

Each of the expressions in your original XPath was evaluated separately, testing whether there is an element in the XML document that satisfies that condition, so the whole expression returned a boolean() rather than a set of elements.
Now that you have combined the predicates in order to select the particular element(s) that you don't want, you just need to negate the test:
//*[not(ancestor-or-self::header) and
    not(ancestor::*[contains(@id,"header") or contains(@class,"header")])
   ]
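To see which elements the negated test keeps, here is a minimal Python sketch. It assumes only the standard library, whose ElementTree module has no ancestor axis or contains(), so the same filter is imitated with a recursive walk instead of being evaluated as XPath. The sample page and the helper names (has_header_marker, outside_header) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, not taken from the question.
PAGE = """
<html><body>
  <header><div>logo</div></header>
  <div id="page-header"><span>nav</span></div>
  <div class="content"><p>text</p></div>
</body></html>
"""

def has_header_marker(el):
    """True if the id or class attribute contains the substring 'header'."""
    return "header" in el.get("id", "") or "header" in el.get("class", "")

def outside_header(root):
    """Collect, in document order, the elements the negated test would keep."""
    kept = []

    def walk(el):
        if el.tag == "header":
            return  # not(ancestor-or-self::header): drop it and its subtree
        kept.append(el)
        if has_header_marker(el):
            return  # every descendant has an ancestor with a header id/class
        for child in el:
            walk(child)

    walk(root)
    return kept

root = ET.fromstring(PAGE)
kept_tags = [el.tag for el in outside_header(root)]
print(kept_tags)
```

Note that the div with id="page-header" is itself kept, because the id/class test uses the ancestor axis; only its descendants are dropped. Using ancestor-or-self for that test as well would drop the div too.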

Related

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes this difference between . (dot) and text(), but in short the first returns a single element, whereas the latter returns a list of elements. But this is where it gets a bit weird to me. Because while text() can be used to match either of the elements on the list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?

For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
(1) contains(x,y) expects x to be a string, but text() matches two text nodes.
(2) In XPath 1.0, the rule for converting multiple nodes to a string is this:
A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned. [Emphasis added]
(3) In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0, because only the first text node ("Ask Question") is considered. Without knowing (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
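The XPath 1.0 first-node rule can be imitated in plain Python. This is only a sketch: it assumes the standard-library ElementTree, which stores text as .text and .tail rather than as text nodes, and the helpers text_node_children and contains_xpath1 are hypothetical names, not part of any XPath engine.

```python
import xml.etree.ElementTree as ET

def text_node_children(el):
    """Approximate the XPath text() children of el, in document order."""
    candidates = [el.text] + [child.tail for child in el]
    return [t for t in candidates if t]

def contains_xpath1(el, needle):
    """Mimic XPath 1.0 contains(text(), needle): only the first text node counts."""
    nodes = text_node_children(el)
    first = nodes[0] if nodes else ""  # an empty node-set converts to ""
    return needle in first

a = ET.fromstring("<a>Ask Question<other/>more text</a>")
nodes = text_node_children(a)
found_first = contains_xpath1(a, "Ask Question")
found_second = contains_xpath1(a, "more text")
print(nodes, found_first, found_second)
```

The second lookup fails even though "more text" is a text node child of a, which is exactly the counter-intuitive XPath 1.0 behaviour.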
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you wish to exclude strings from descendant elements of a, such as when you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
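The difference between the two alternatives can be checked with a small Python sketch. It assumes the standard-library ElementTree and imitates the two predicates by hand; string_value and text_node_children are illustrative helper names.

```python
import xml.etree.ElementTree as ET

def string_value(el):
    """XPath string value: all descendant text concatenated in document order."""
    return "".join(el.itertext())

def text_node_children(el):
    """Approximate the immediate text node children of el."""
    candidates = [el.text] + [child.tail for child in el]
    return [t for t in candidates if t]

wanted = ET.fromstring("<a>Ask Question<other/>more text</a>")
unwanted = ET.fromstring(
    "<a>more text before <not>Ask Question</not> more text after</a>"
)

results = []
for a in (wanted, unwanted):
    by_string_value = "Ask Question" in string_value(a)      # //a[contains(.,'Ask Question')]
    by_text_child = "Ask Question" in text_node_children(a)  # //a[text()='Ask Question']
    results.append((by_string_value, by_text_child))
print(results)
```

Both a elements pass the string-value test, but only the first has an immediate text node child equal to "Ask Question".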
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath

The reason for this is that the contains function doesn't accept a nodeset as input - it only accepts a string. (Well, it may be engine dependent, because it works for Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains function will only act on a single text node at a time.

RegEx pattern to scrape data from HTML Tags

What regex pattern would match a header tag that can carry any number of attributes, such as id or class, and can contain zero or more nested strong tags? I want to match a pattern with the following properties:
Any HTML header (h1-h5)
Any attributes can be present inside the header tag.
Zero or more strong tags can be present.
<h5 id="some_id"><strong><strong><strong>SOME_TEXT</strong></strong></strong></h5>

You can try:
import re

# subject is the HTML string to search
match = re.search(r"<(h[1-5])\b(?:[^>]|>[<\s])*>([^<]+)(?:[^<]|<(?!/\1))*</\1>",
                  subject, re.IGNORECASE)
if match:
    result = match.group(2)  # group 2 holds the heading text
else:
    result = ""
I'll add a regex101 in a sec to show how this works.
Here it is: https://regex101.com/r/du8PCn/1 (it's Group 2 that matches).
EDIT: I don't know much about Python, but I believe you'll need to use re.findall or re.finditer above if you're matching an html string with a number of headings in it (instead of re.search).
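For a string with several headings, a minimal sketch with re.finditer could look like this. It reuses the pattern from the answer unchanged; the sample HTML and the names HEADING and headings are assumptions made for the example.

```python
import re

# Same pattern as above, compiled once; group 2 captures the heading text.
HEADING = re.compile(
    r"<(h[1-5])\b(?:[^>]|>[<\s])*>([^<]+)(?:[^<]|<(?!/\1))*</\1>",
    re.IGNORECASE,
)

# Hypothetical input containing two headings.
html = (
    "<h1>Title</h1><p>body</p>"
    '<h5 id="some_id"><strong><strong><strong>SOME_TEXT'
    "</strong></strong></strong></h5>"
)

# finditer yields one match object per heading, in order of appearance.
headings = [m.group(2) for m in HEADING.finditer(html)]
print(headings)
```

re.findall would also work, but with two capture groups it returns tuples, so finditer with group(2) keeps the intent clearer.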

How to use XPath contains() for specific text?

Say we have an HTML table which basically looks like this:
2|1|28|9|
3|8|5|10|
18|9|8|0|
I want to select the cells which contain only 8 and nothing else, that is, only 2nd cell of row2 and 3rd cell of row3.
This is what I tried: //table//td[contains(.,'8')]. It gives me all cells which contain 8. So, I get unwanted values 28 and 18 as well.
How do I fix this?
EDIT: Here is a sample table if you want to try your XPath. Use the calendar on the left side: https://sfbay.craigslist.org/sfc/

Be careful of the contains() function.
It is a common mistake to use it to test if an element contains a value. What it really does is test if a string contains a substring. So, td[contains(.,'8')] takes the string value of td (.) and tests if it contains any '8' substrings. This might be what you want, but often it is not.
This XPath,
//td[.='8']
will select all td elements whose string-value equals 8.
Alternatively, this XPath,
//td[normalize-space()='8']
will select all td elements whose normalize-space() string-value equals 8. (The normalize-space() XPath function strips leading and trailing whitespace and replaces sequences of whitespace characters with a single space.)
Notes:
Both will work even if the 8 is inside of another element such as a, b, span, div, etc.
Both will not match <td>gr8t</td>, <td>123456789</td>, etc.
Using normalize-space() will ignore leading or trailing whitespace surrounding the 8.
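The three tests can be compared side by side in Python. This is a sketch assuming the standard-library ElementTree (Python 3.7+), which supports the [.='text'] predicate but not contains() or normalize-space(), so those two are imitated by hand. The table is the one from the question, except that the third cell of the last row is wrapped in an a element with surrounding spaces, a variation added here to show the whitespace case.

```python
import xml.etree.ElementTree as ET

TABLE = (
    "<table>"
    "<tr><td>2</td><td>1</td><td>28</td><td>9</td></tr>"
    "<tr><td>3</td><td>8</td><td>5</td><td>10</td></tr>"
    "<tr><td>18</td><td>9</td><td> <a>8</a> </td><td>0</td></tr>"
    "</table>"
)

def string_value(el):
    """XPath string value: all descendant text concatenated."""
    return "".join(el.itertext())

def normalize_space(s):
    """Rough stand-in for XPath normalize-space()."""
    return " ".join(s.split())

root = ET.fromstring(TABLE)
cells = root.findall(".//td")

# //td[contains(.,'8')]: substring test on the string value
substring = [string_value(td) for td in cells if "8" in string_value(td)]
# //td[.='8']: exact comparison, evaluated by ElementTree itself
exact = [string_value(td) for td in root.findall(".//td[.='8']")]
# //td[normalize-space()='8']: exact comparison after trimming whitespace
normalized = [
    string_value(td) for td in cells
    if normalize_space(string_value(td)) == "8"
]
print(substring, exact, normalized)
```

The substring test picks up 28 and 18 as well, the exact test finds only the bare 8, and the normalized test also accepts the cell where the 8 is padded with spaces.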
See also:
Why is contains(text(), "string" ) not working in XPath?

Try the following xpath, which will select the whole text contents rather than partial matches:
//table//td[text()='8']
Edit: Your example HTML has a tags inside the td elements, so the following will work:
//table//td/a[text()="8"]
See example in php here: https://3v4l.org/56SBn

XPath: Way to match text inside an arbitrary number of nested elements?

Is it possible for one XPath expression to match all the following <a> elements using the text in the element, in this case "Link"?
Examples:
<a>Link</a>
<a><span>Link</span></a>
<a><div>Link</div></a>
<a><div><span>Link</span></div></a>

This simple XPath expression,
//a[contains(., 'Link')]
will select the a elements of all of your examples because . represents the current node (a), and contains() will check the string value of a to see if it contains 'Link'. The string value of a already conveniently abstracts away from any descendant elements.
This even simpler XPath expression,
//a[. = 'Link']
will also select the a elements in all of your examples. It's appropriate to use if the string value of a will exactly equal, rather than just contain, "Link".
Note: The above expressions will also select <a>Li<br/>nk</a>, which may or may not be desirable.
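A quick way to check this is with Python's standard-library ElementTree (3.7+), which evaluates [.='Link'] against the full text content of each element. The wrapper div and the extra "Other" link are assumptions added to make a parseable sample.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<div>"
    "<a>Link</a>"
    "<a><span>Link</span></a>"
    "<a><div>Link</div></a>"
    "<a><div><span>Link</span></div></a>"
    "<a>Li<br/>nk</a>"
    "<a>Other</a>"
    "</div>"
)

root = ET.fromstring(DOC)
# Compares the concatenated descendant text of each a element with 'Link'.
matches = root.findall(".//a[.='Link']")
match_count = len(matches)
print(match_count)
```

Five of the six a elements match: the four nested variants plus the one split by br, since the string value ignores element boundaries.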

You could use the following:
//a[(.//*|.)[contains(text(), "Link")]]
This will select a elements that contain the text "Link" or a elements that have a descendant element that contains the text "Link".
//a - Select all a elements
( - Open OR grouping
.//* - Select all the descendant elements
| - Or..
. - Select the current node
) - Close OR grouping
[contains(text(), "Link")] - If they contain the text "Link"
Alternatively, you could also use:
//a[(.//*|.)[.="Link"]]

Confounded by XPath

When it comes to indexing in XPath, I feel like I'm missing something here.
If I have two table tags in an HTML document, and within the Chrome console I type $x("//table[1]");, I expect to get the first table tag on the page.
Instead, I get a list containing both table tags. I suspected it might have something to do with using // but using an absolute XPath expression yielded the same results.
I think this is a pretty simple misunderstanding, but I'm not seeing it when reading the docs.

//table[1] returns all tables that are the first table child of their respective parents.
To get the first table in the document, use /descendant::table[1] or (//table)[1]; both work in XPath 1.0 as well as 2.0.
Here it is in the standard:
The path expression //para[1] does not mean the same as the path expression /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their respective parents.

Use
(//table)[1]
i.e. the first of all the tables.
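The difference is easy to reproduce in Python. This sketch assumes the standard-library ElementTree, whose positional predicate follows the same per-parent rule; it has no parenthesised (…)[1] form, so "first of all the tables" is taken from the result list instead. The sample page and ids are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two tables, each the first table child of its own div.
PAGE = (
    "<body>"
    '<div><table id="t1"/></div>'
    '<div><table id="t2"/></div>'
    "</body>"
)

root = ET.fromstring(PAGE)

# Like //table[1]: every table that is the first table child of its parent.
per_parent = [t.get("id") for t in root.findall(".//table[1]")]

# Like (//table)[1]: collect all tables in document order, then take the first.
first_overall = root.findall(".//table")[0].get("id")
print(per_parent, first_overall)
```

Both tables satisfy the per-parent predicate, while indexing the collected list returns only t1.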
