Using XPath to scrape an inconsistent DOM - html

I want to scrape the post name. In the first pattern it is located within a span,
but the forum thread can also look like this (line 7),
because the thread is a poll.
So in my case I can't target the span (line 8, first picture). I tried descendant-or-self but could hardly get it right. What's wrong here?
$postTitle = $xpath->query("//tr/td[@class='row1'][3]/div/div[1]//descendant-or-self::text()");

With this expression you will select the first <a> in the <div> where the text you wish to extract is located:
//tr/td[@class='row1'][3]/div/div[1]/a[1]
I'm assuming you intend to select one element (and not a node-set). For that you can get the string-value of this expression (which will return all the text in the descendant nodes) using string() or normalize-space() (which trims and removes extra spaces):
normalize-space(//tr/td[@class='row1'][3]/div/div[1]/a[1])
This will extract Salary vs age or /ktards are you... depending on the node found.
If there is more than one match it will return a collection, which you should iterate over and get the string value of each one individually. Using those functions on a node-set will give you the text in the first element, discarding the others.
If you only have to deal with two cases: 1) text inside a/span, 2) text inside a, you can select the text nodes directly using a union (|) operator:
//tr/td[@class='row1'][3]/div/div[1]/a[1]/text() | //tr/td[@class='row1'][3]/div/div[1]/a[1]/span/text()
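A minimal Python sketch of the string-value approach, using only the standard library. The markup is a hypothetical stand-in for the forum table (the real page is not shown, so the position predicate on td is dropped), and normalize_space is an illustrative helper that only approximates XPath's normalize-space(). Taking the string value of the a element covers both patterns, with or without the inner span:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one title wrapped in a span, one directly inside the a.
table = ET.fromstring(
    "<table>"
    "<tr><td class='row1'><div><div><a><span> Salary vs  age </span></a></div></div></td></tr>"
    "<tr><td class='row1'><div><div><a>Poll thread title</a></div></div></td></tr>"
    "</table>"
)

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(text.split())

# "".join(a.itertext()) is the element's string value: all descendant text.
titles = [
    normalize_space("".join(a.itertext()))
    for a in table.findall(".//td[@class='row1']/div/div[1]/a[1]")
]
print(titles)
```

Because the string value is taken per element inside the loop, each match keeps its own title instead of only the first one surviving.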

Related

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes this difference between . (dot) and text(), but in short the first returns a single element, where the latter returns a list of elements. But this is where it gets a bit weird to me. Because while text() can be used to match either of the elements on the list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?
For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
1. contains(x,y) expects x to be a string, but text() matches two text nodes.
2. In XPath 1.0, the rule for converting multiple nodes to a string is this:
   A node-set is converted to a string by returning the string-value of
   the node in the node-set that is first in document order. If the
   node-set is empty, an empty string is returned. [Emphasis added]
3. In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0, because only the first text node is considered. Without knowing (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
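The two behaviours can be imitated in plain Python to make the difference concrete. This is only a sketch: it does not run an XPath engine, and child_text_nodes, contains_text_xpath1 and contains_text_xpath2 are hypothetical helper names that mimic the conversion rules described above.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>Ask Question<other/>more text</a>")

def child_text_nodes(elem):
    """Immediate text-node children of elem, in document order (like text())."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        if child.tail:  # text that follows a child element
            nodes.append(child.tail)
    return nodes

def contains_text_xpath1(elem, needle):
    """XPath 1.0 rule: only the first node of the node-set is converted."""
    nodes = child_text_nodes(elem)
    first = nodes[0] if nodes else ""
    return needle in first

def contains_text_xpath2(elem, needle):
    """XPath 2.0+ rule: more than one item where a string is expected is an error."""
    nodes = child_text_nodes(elem)
    if len(nodes) > 1:
        raise TypeError("required cardinality of first argument is one or zero")
    return needle in (nodes[0] if nodes else "")

print(child_text_nodes(a))
print(contains_text_xpath1(a, "Ask Question"))
print(contains_text_xpath1(a, "more text"))
```

The second call is the counter-intuitive case: "more text" is a text node child of a, but it is not the first one, so the XPath 1.0 style check ignores it.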
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you want to exclude strings from descendant elements of a, for example if you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
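As a rough illustration of the two alternatives on these two a elements, here is a standard-library Python sketch. It imitates the predicates rather than evaluating XPath, and string_value and has_text_child_equal are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

wanted = ET.fromstring("<a>Ask Question<other/>more text</a>")
unwanted = ET.fromstring(
    "<a>more text before <not>Ask Question</not> more text after</a>"
)

def string_value(elem):
    """String value of an element: all descendant text, concatenated."""
    return "".join(elem.itertext())

def has_text_child_equal(elem, value):
    """Imitates //a[text()='value']: some immediate text node equals value."""
    texts = [elem.text] + [child.tail for child in elem]
    return any(t == value for t in texts)

for a in (wanted, unwanted):
    # First column imitates contains(., ...), second imitates text() = ...
    print("Ask Question" in string_value(a), has_text_child_equal(a, "Ask Question"))
```

The substring test on the string value accepts both elements, while the text-node equality test accepts only the first, which is the exclusion described above.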
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath
The reason for this is that the contains function doesn't accept a nodeset as input - it only accepts a string. (Well, it may be engine dependent, because it works for Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains method will only act on a single text node at a time.

What is the Correct XPath to Identify Element with Text Occurring Minimum Number of Times?

I'm trying to identify an element that has certain text but I only want to identify the element if the desired text occurs a specific number of times.
For example, imagine we have the following two HTML snippets on the same page:
Snippet 1:
<span id="price">
$36.46
<span>
($0.38 / Count)
</span>
</span>
Snippet 2:
<span id="price">$38.38</span>
I could identify both elements using the XPath: .//span[contains(text(),'$')]. However, I only want to identify the element if it (or any descendant of the span element) contains at least two instances of the character: $
In the above example, it would only identify the first snippet because the second snippet only contains one instance of $, not two.
What is the correct XPath syntax to use?
You can use the XPath //span[count(.//text()[contains(., "$")]) >= 2]
This is a moderately complicated XPath, so here is an explanation, expanding outwards:
.//text()[contains(., "$")]
Select all text nodes descending from the current node that contain "$".
count(.//text()[contains(., "$")])
Count the text nodes descending from the current node that contain "$".
//span[count(.//text()[contains(., "$")]) >= 2]
Select all span elements with two or more descendant text nodes that contain "$".
As a caveat, this only works if the dollar signs are in two different text nodes. If you want to include the span in this example:
<span>
$$
<span>
foo
</span>
</span>
...then you'll need a different approach:
//span[string-length(.) - string-length(translate(., "$", "")) >= 2]
This predicate compares the string length of the span to the string length of the same span with all "$" characters removed.
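The difference between the two predicates can be checked with a small standard-library Python sketch. It imitates the counting logic instead of running XPath, and dollar_text_nodes and dollar_chars are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

snippet1 = ET.fromstring(
    '<span id="price">\n$36.46\n<span>\n($0.38 / Count)\n</span>\n</span>'
)
snippet2 = ET.fromstring('<span id="price">$38.38</span>')
caveat = ET.fromstring("<span>\n$$\n<span>\nfoo\n</span>\n</span>")

def dollar_text_nodes(elem):
    """Imitates count(.//text()[contains(., '$')])."""
    return sum(1 for text in elem.itertext() if "$" in text)

def dollar_chars(elem):
    """Imitates string-length(.) - string-length(translate(., '$', ''))."""
    value = "".join(elem.itertext())
    return len(value) - len(value.replace("$", ""))

for name, elem in [("snippet1", snippet1), ("snippet2", snippet2), ("caveat", caveat)]:
    print(name, dollar_text_nodes(elem), dollar_chars(elem))
```

Only the character count reaches 2 for the "$$" span, since both dollar signs sit in a single text node there.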
One usable XPath-1.0 expression is
string-length(/span[@id='price'])-string-length(translate(/span[@id='price'],'$',''))
In a predicate this could look like
//span[string-length(.)-string-length(translate(.,'$',''))>=2]
This expression selects only the elements with a count of $ >= 2

Select all deepest nodes with XPath 1.0 containing text, ignoring markup

I want to extract elements from the HTML page, containing text, ignoring markup. For example, I want to extract node containing the text "Run, Sarah, run!" from https://en.wiktionary.org/wiki/run. I know about node test text() and function string(). I tried them both:
As you see, if I use string() it returns too many nodes (result includes the nodes that include the node I need) and if I use text() it returns nothing (because of the <b> tag).
How do I find required nodes?
UPD: I want all deepest nodes. That means if the Wiktionary page contained this sentence twice, I would want to select two nodes.
Also, I don't know the node type.
//*[contains(string(.), "Run, Sarah, run!")] returns all elements (from the html node down to the last descendant) that contain that string.
//*[contains(text(), "Run, Sarah, run!")] returns nothing because "Run, Sarah, run!" is text composed of several text nodes, not a single text node.
You can use below to match italic node with required text:
'//i[normalize-space()="Run, Sarah, run!"]'
If you don't want to specify node name, you can try
'//*[normalize-space()="Run, Sarah, run!" and not(./*[normalize-space()="Run, Sarah, run!"])]'
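The "matches, but no child element matches" idea can be sketched in standard-library Python. The markup below is a hypothetical simplification of the Wiktionary structure (the real page is not reproduced here), and normalize_space is an illustrative helper, not an XPath call.

```python
import xml.etree.ElementTree as ET

TARGET = "Run, Sarah, run!"

# Hypothetical fragment: the sentence is split across text nodes by <b> tags.
page = ET.fromstring(
    "<div><dl><dd><i><b>Run</b>, Sarah, <b>run</b>!</i></dd></dl>"
    "<p>Other text</p></div>"
)

def normalize_space(elem):
    """Whitespace-normalized string value of an element."""
    return " ".join("".join(elem.itertext()).split())

# Keep elements whose string value equals TARGET while none of their
# child elements has the same string value (i.e. the deepest match).
deepest = [
    elem
    for elem in page.iter()
    if normalize_space(elem) == TARGET
    and not any(normalize_space(child) == TARGET for child in elem)
]
print([elem.tag for elem in deepest])
```

In this fragment dl, dd and i all have the target string value, but only i has no child with the same value, so the wrappers are filtered out.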

Searching HTML document by ID using XPATH returns wrong result

So, I'd like to get this element from line 200
<p id="Para">Hello, how are you.</p>
To do so I am using the XPATH
HtmlDoc.DocumentNode.SelectSingleNode("//*[contains(@id,'Para')]")
However, the node that is returned is not the one I am looking for and instead gets an element before it on line 10
<p id="ParaInstruction">Click here to begin</p>
I think this is because the ids have the first 4 chars in common so it gets the first one it can find. How do I ensure that the node that is returned only has the chars specified in the XPATH?
Change
//*[contains(@id,'Para')]
to
//*[@id='Para']
to avoid matching every element whose @id contains a "Para" substring, which is what contains() does -- test for substring containment.
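A quick way to see the difference, sketched in standard-library Python rather than the C# from the question. The wrapping body element is an assumption, and the substring test is imitated with Python's in operator because ElementTree's XPath subset has no contains():

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body>"
    "<p id='ParaInstruction'>Click here to begin</p>"
    "<p id='Para'>Hello, how are you.</p>"
    "</body>"
)

# Imitates //*[contains(@id,'Para')]: substring containment on the id.
substring_hits = [e for e in doc.iter() if "Para" in e.get("id", "")]

# Equivalent of //*[@id='Para']: exact attribute match.
exact = doc.find(".//*[@id='Para']")

print([e.get("id") for e in substring_hits])
print(exact.text)
```

The substring version returns both paragraphs with ParaInstruction first in document order, which is why a single-node lookup picked the wrong element.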

XPath: Way to match text inside an arbitrary number of nested elements?

Is it possible for one XPath expression to match all the following <a> elements using the text in the element, in this case "Link"?
Examples:
<a>Link</a>
<a><span>Link</span></a>
<a><div>Link</div></a>
<a><div><span>Link</span></div></a>
This simple XPath expression,
//a[contains(., 'Link')]
will select the a elements of all of your examples because . represents the current node (a), and contains() will check the string value of a to see if it contains 'Link'. The string value of a already conveniently abstracts away from any descendent elements.
This even simpler XPath expression,
//a[. = 'Link']
will also select the a elements in all of your examples. It's appropriate to use if the string value of a will exactly equal, rather than just contain, "Link".
Note: The above expressions will also select <a>Li<br/>nk</a>, which may or may not be desirable.
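Both expressions can be tried against the question's examples with a short standard-library Python sketch. It assumes each example is wrapped in an a element inside a hypothetical body, and it relies on ElementTree's [.='text'] predicate (Python 3.7+); contains() is not in ElementTree's XPath subset, so that check is imitated with the in operator.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body>"
    "<a>Link</a>"
    "<a><span>Link</span></a>"
    "<a><div>Link</div></a>"
    "<a><div><span>Link</span></div></a>"
    "<a>Li<br/>nk</a>"   # the split-text case from the note
    "<a>Other</a>"       # control element that should not match
    "</body>"
)

# //a[. = 'Link']: string value of the a equals 'Link'.
exact = doc.findall(".//a[.='Link']")

# Imitates //a[contains(., 'Link')]: string value contains 'Link'.
partial = [a for a in doc.iter("a") if "Link" in "".join(a.itertext())]

print(len(exact), len(partial))
```

The split-text element is counted in both results because its string value is also "Link", regardless of the br element in the middle.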
You could use the following:
//a[(.//*|.)[contains(text(), "Link")]]
This will select a elements that contain the text "Link" or a elements that have a descendant element that contains the text "Link".
//a - Select all a elements
( - Open OR grouping
.//* - Select all the descendant elements
| - Or..
. - Select the current node
) - Close OR grouping
[contains(text(), "Link")] - If they contain the text "Link"
Alternatively, you could also use:
//a[(.//*|.)[.="Link"]]