How to use XPath contains() for specific text? - html

Say we have an HTML table which basically looks like this:
2|1|28|9|
3|8|5|10|
18|9|8|0|
I want to select the cells which contain only 8 and nothing else, that is, only the 2nd cell of row 2 and the 3rd cell of row 3.
This is what I tried: //table//td[contains(.,'8')]. It gives me all cells which contain 8. So, I get unwanted values 28 and 18 as well.
How do I fix this?
EDIT: Here is a sample table if you want to try your xpath. Use the calendar on the left side-https://sfbay.craigslist.org/sfc/

Be careful of the contains() function.
It is a common mistake to use it to test if an element contains a value. What it really does is test if a string contains a substring. So, td[contains(.,'8')] takes the string value of td (.) and tests if it contains any '8' substrings. This might be what you want, but often it is not.
This XPath,
//td[.='8']
will select all td elements whose string-value equals 8.
Alternatively, this XPath,
//td[normalize-space()='8']
will select all td elements whose normalize-space() string-value equals 8. (The normalize-space() XPath function strips leading and trailing whitespace and replaces sequences of whitespace characters with a single space.)
Notes:
Both will work even if the 8 is inside another element such as a, b, span, div, etc.
Neither will match <td>gr8t</td>, <td>123456789</td>, etc.
Using normalize-space() will ignore leading or trailing whitespace surrounding the 8.
See also:
Why is contains(text(), "string" ) not working in XPath?
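To see the difference on the question's table, here is a minimal Python sketch using only the standard library. The markup is a hypothetical reconstruction of the table (one 8 is wrapped in an a element and one is padded with spaces, to exercise the notes above), and string_value and normalize_space are illustrative helpers that approximate XPath's string-value and normalize-space(), not library functions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question's table.
TABLE = """<table>
<tr><td>2</td><td>1</td><td>28</td><td>9</td></tr>
<tr><td>3</td><td><a>8</a></td><td>5</td><td>10</td></tr>
<tr><td>18</td><td>9</td><td> 8 </td><td>0</td></tr>
</table>"""

def string_value(elem):
    """Concatenate all descendant text, like XPath's string-value of an element."""
    return "".join(elem.itertext())

def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

cells = list(ET.fromstring(TABLE).iter("td"))

# td[contains(., '8')]: substring test, so 28 and 18 match too
contains_8 = [string_value(td) for td in cells if "8" in string_value(td)]
# td[. = '8']: the whole string-value must equal '8'
equals_8 = [string_value(td) for td in cells if string_value(td) == "8"]
# td[normalize-space() = '8']: same, but surrounding whitespace is ignored
normalized_8 = [string_value(td) for td in cells
                if normalize_space(string_value(td)) == "8"]

print(contains_8)    # ['28', '8', '18', ' 8 ']
print(equals_8)      # ['8']
print(normalized_8)  # ['8', ' 8 ']
```

The padded cell is only picked up by the normalize-space() variant, which is usually the safer choice for scraped HTML.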

Try the following xpath, which will select the whole text contents rather than partial matches:
//table//td[text()='8']
Edit: Your example HTML has a tags inside the td elements, so the following will work:
//table//td/a[text()="8"]
See example in php here: https://3v4l.org/56SBn
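The reason the first expression misses cells whose 8 sits inside an a tag is that text() only looks at text nodes that are direct children of the element. A rough Python sketch of that behaviour, assuming a hypothetical row and an illustrative helper child_text_nodes (ElementTree itself has no text() step, so this emulates it):

```python
import xml.etree.ElementTree as ET

def child_text_nodes(elem):
    """Text nodes that are direct children of elem (what text() selects)."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

# Hypothetical row: a plain 28, a plain 8, and an 8 wrapped in a link.
row = ET.fromstring('<tr><td>28</td><td>8</td><td><a href="#">8</a></td></tr>')

# //td[text()='8']: a td with a direct text-node child equal to '8'
plain = [td for td in row.iter("td") if "8" in child_text_nodes(td)]
# //td/a[text()='8']: an a inside a td with such a text node
linked = [a for a in row.findall("./td/a") if "8" in child_text_nodes(a)]

print(len(plain), len(linked))  # 1 1
```

The third td is skipped by the first test because its only text lives one level down, inside the a element.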

Related

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I've to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
with this XPath query I get the names (maybe one, maybe more)
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() I can only extract the first element (result would be: Name1)
Thanks for helping me!
Just wrap your XPath in the data() function (available in XPath 2.0+ and XQuery) to retrieve the text, like data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]).
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
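If the host language only offers XPath 1.0, the same join can be done after the query. A minimal sketch in Python's standard library, assuming a hypothetical row shaped like the one described in the question (ElementTree supports only a subset of XPath, so the path is written relative to the table root):

```python
import xml.etree.ElementTree as ET

# Hypothetical row modelled on the question's description.
DOC = """<table><tbody>
<tr>
  <td>Deployed to</td>
  <td>other</td>
  <td><div>
    <span class="log_entry_text">Name1</span>
    <span class="log_entry_text">Name2</span>
    <span class="log_entry_text">Name3</span>
  </div></td>
</tr>
</tbody></table>"""

root = ET.fromstring(DOC)

# Select the span elements, then take each one's text and join with spaces,
# which is what string-join(..., ' ') does in XPath 2.0.
spans = root.findall("./tbody/tr[td='Deployed to']/td[3]/div//span")
names = " ".join("".join(span.itertext()) for span in spans)

print(names)  # Name1 Name2 Name3
```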

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract a multiline content of a tag, which has specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
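A short Python sketch of the difference, using the sample from the question with a newline added to the second div. The tightened pattern combines the suggestions above ([^>] inside the tag, a capture group for the content, and the s modifier, which is re.DOTALL in Python); the variable names are illustrative.

```python
import re

html = '<div id="1">test</div><div id="2">line one\nline two</div>'
value = "2"

# The question's pattern: (.|\n)*? may run past '>' into the next tag.
loose = re.compile(
    r'<div(.|\n)*?id="' + re.escape(value) + r'"(.|\n)*?>(.|\n)*?</div>'
)

# Tightened pattern: [^>] keeps the attribute search inside one opening tag,
# and re.DOTALL lets . match newlines in the content.
tight = re.compile(
    r'<div\b[^>]*\sid="' + re.escape(value) + r'"[^>]*>(.*?)</div>',
    re.DOTALL,
)

loose_match = loose.search(html).group(0)
content = tight.search(html).group(1)

print(loose_match == html)  # True: both divs were swallowed
print(repr(content))        # 'line one\nline two'
```

Even the tightened pattern stops at the first </div>, so nested divs will still be cut short, which is one reason a DOM parser is the more robust option.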

XPath: Way to match text inside an arbitrary number of nested elements?

Is it possible for one XPath expression to match all the following <a> elements using the text in the element, in this case "Link"?
Examples:
<a>Link</a>
<a><span>Link</span></a>
<a><div>Link</div></a>
<a><div><span>Link</span></div></a>
This simple XPath expression,
//a[contains(., 'Link')]
will select the a elements of all of your examples because . represents the current node (a), and contains() will check the string value of a to see if it contains 'Link'. The string value of a already conveniently abstracts away from any descendant elements.
This even simpler XPath expression,
//a[. = 'Link']
will also select the a elements in all of your examples. It's appropriate to use if the string value of a will exactly equal, rather than just contain, "Link".
Note: The above expressions will also select Li<br/>nk, which may or may not be desirable.
You could use the following:
//a[(.//*|.)[contains(text(), "Link")]]
This will select a elements that contain the text "Link" or a elements that have a descendant element that contains the text "Link".
//a - Select all a elements
( - Open OR grouping
.//* - Select all the descendant elements
| - Or..
. - Select the current node
) - Close OR grouping
[contains(text(), "Link")] - If they contain the text "Link"
Alternatively, you could also use:
//a[(.//*|.)[.="Link"]]
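The string-value behaviour can be checked quickly in Python's standard library. This is a sketch over hypothetical markup (the a elements from the question, plus the Li<br/>nk case and one non-matching link); ElementTree's [.='text'] predicate compares the complete text content including descendants and needs Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: five links whose text is "Link", one that differs.
DOC = """<body>
<a>Link</a>
<a><span>Link</span></a>
<a><div>Link</div></a>
<a><div><span>Link</span></div></a>
<a>Li<br/>nk</a>
<a>Other</a>
</body>"""

root = ET.fromstring(DOC)

# //a[. = 'Link']: exact comparison against the string-value
exact = root.findall(".//a[.='Link']")

# //a[contains(., 'Link')]: ElementTree has no contains(), so test in Python
partial = [a for a in root.iter("a") if "Link" in "".join(a.itertext())]

print(len(exact), len(partial))  # 5 5
```

Both counts include the Li<br/>nk element, because the br contributes no text and the pieces on either side are concatenated.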

Regex match a string that doesn't contains a string

I want to replace every <span...> (including <span id="... and <span class="...) in an HTML document with <span>, except when the tag starts with <span id="textmarker (so, for example, I don't want to keep this span: <span attr="blah" id="textmarker">)
I've tried the regex proposed here and here, I finally came up with this regex that never returns a <span id="textmarker but somehow it sometimes misses the other spans:
<span(?!.*? id="textmarker).*?">
You can see my (simplified) html here : https://regex101.com/r/yT9jG2/2
Strangely, if I run the regex in notepad++ it returns 3 matches (the three spans in the second paragraph) but regex101 only returns 1 match. Notepad++ and regex101 both miss the span in the first paragraph.
This regex also doesn't return every span it should (cf. the spans with gray highlights here):
<span(?![^>]*? id="textmarker)[^>]*?>
Updated: To exclude id="textmarker while including id="anythingelse and all other spans:
(<span(?! *id="textmarker)[^>]*>)
On your posted example at: https://regex101.com/r/yT9jG2/2 , and at the top, choosing version 2, set the fields so:
field 1: (<span(?! *id="textmarker)[^>]*>)
field 2, (the smaller field that lets you set modifier): g
With your example and version 2 selected, it finds 9 matches and lists them on the right, including empty spans as well as spans whose id is not textmarker, such as <span id="YellowType">
Explanation
Field 1:
optional: ( and ). An extra outer parenthesis was added to the expression for educational purposes, just for making use of regex101's matched group listing feature to list results on the right pane in addition to the default inline highlighting of matches. When using Notepad++ you can of course omit these outer ( ) parentheses.
<span: matches <span
(?! starts a negative lookahead assertion for the following,
* meaning space zero or more times, in case you have extra spaces
followed by id="textmarker
) to end the negative lookahead assertion
so if the text right after <span matches what is inside the lookahead, that position is rejected as a match
[^ starts an exclusion set, so none of the following characters, the following being the >
] to stop defining the exclusion
* to match the preceding 0 or more times. The preceding being [^>]
> to match to end of the open-a-span tag
Field 2
g is the global flag (it has nothing to do with greediness)
so the result does not stop at the first match, but will have all matches
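The same expression can be tried outside regex101. Here is a minimal Python sketch on a hypothetical snippet (not the asker's document); findall and sub already act on every match, so no g flag is needed.

```python
import re

# Hypothetical snippet with four opening span tags.
html = (
    '<p><span class="a">x</span> <span id="textmarker1">y</span> '
    '<span attr="blah" id="textmarker">z</span> <span>w</span></p>'
)

# <span, unless it is followed by optional spaces and id="textmarker
pattern = re.compile(r'<span(?! *id="textmarker)[^>]*>')

matches = pattern.findall(html)        # every match, like the g flag
cleaned = pattern.sub("<span>", html)  # replaces every match by default

print(matches)
print(cleaned)
```

Only the tag that begins with id="textmarker is left untouched; the span with attr="blah" before the id is reduced to a bare <span>, as the question asks.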

using Xpath to scrape inconsistent DOM

I want to scrape the post name, which in the first pattern is located within a span,
but the forum thread can also look like this (line 7)
because the thread is a poll.
So in my case I can't target the span (line 8 in the first picture). I tried descendant-or-self but could not get it right. What's wrong here?
$postTitle = $xpath->query("//tr/td[@class='row1'][3]/div/div[1]//descendant-or-self::text()");
With this expression you will select the first <a> in the <div> where the text you wish to extract is located:
//tr/td[@class='row1'][3]/div/div[1]/a[1]
I'm assuming you intend to select one element (and not a node-set). For that you can get the string-value of this expression (which will return all the text in the descendant nodes) using string() or normalize-space() (which trims and removes extra spaces):
normalize-space(//tr/td[@class='row1'][3]/div/div[1]/a[1])
This will extract Salary vs age or /ktards are you... depending on the node found.
If there is more than one match it will return a collection, which you should iterate over and get the string value of each one individually. Using those functions on a node-set will give you the text in the first element, discarding the others.
If you only have to deal with two cases: 1) text inside a/span, 2) text inside a, you can select the text nodes directly using a union (|) operator:
//tr/td[@class='row1'][3]/div/div[1]/a[1]/text() | //tr/td[@class='row1'][3]/div/div[1]/a[1]/span/text()
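The "iterate and take the string value of each match" advice can be sketched in Python's standard library. The markup below is hypothetical (the original screenshots are not available), and post_titles is an illustrative helper. ElementTree's positional predicates count among same-named siblings rather than among the filtered set, so the class filter and the "third match" step are done in Python to keep XPath's meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical forum markup: the first title is wrapped in a span, the second is not.
PAGE = """<table>
<tr>
  <td class="row1">icon</td><td class="row1">status</td>
  <td class="row1"><div><div><a href="#"><span>First thread title</span></a></div></div></td>
</tr>
<tr>
  <td class="row1">icon</td><td class="row1">status</td>
  <td class="row1"><div><div><a href="#">
      Second   thread title
  </a></div></div></td>
</tr>
</table>"""

def post_titles(root):
    titles = []
    for tr in root.iter("tr"):
        # td[@class='row1'][3]: filter by class first, then take the third match
        cells = [td for td in tr.findall("td") if td.get("class") == "row1"]
        if len(cells) < 3:
            continue
        link = cells[2].find("./div/div[1]/a[1]")
        if link is not None:
            # String-value of the link, with whitespace collapsed and trimmed
            # (an approximation of normalize-space()).
            titles.append(" ".join("".join(link.itertext()).split()))
    return titles

titles = post_titles(ET.fromstring(PAGE))
print(titles)  # ['First thread title', 'Second thread title']
```

Because the string value of the a element is used, the same code handles both the a/span case and the plain a case without a union.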