How to select elements containing special characters in XPath?

I am trying to exclude three <td> elements from a result set:
<td>
🥇
</td>
<td>
🥈
</td>
<td>
🥉
</td>
For example, I've tried using:
td[not(contains(., '🥈'))]
but the element I don't want still comes back...

In the XPath expression, you need to use the escape conventions of the host language. Using &-escaping is fine if the host is XSLT, but if it's JavaScript, for example, you'll need to use backslash escaping.

To avoid the labyrinth of escaping conventions, just use literal Unicode characters themselves, which can be searched and then copy-and-pasted from sites such as Compart:
Char Entity Ref | Literal Unicode | XPath
&#129351; | 🥇 | //td[not(contains(.,'🥇'))]
&#129352; | 🥈 | //td[not(contains(.,'🥈'))]
&#129353; | 🥉 | //td[not(contains(.,'🥉'))]
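To illustrate why the host language's escaping rules matter, here is a minimal sketch that assumes Python as the host (the question doesn't say which host is actually in use). It shows that Python resolves its own backslash escape, leaves an XML character reference untouched, and that pasting the literal character sidesteps both:

```python
import html

# Three ways a host program might spell the second-place medal (U+1F948).
literal = "🥈"              # the character pasted directly into the source
escaped = "\U0001F948"      # Python's own backslash escape for the code point
entity = "&#129352;"        # an XML/HTML character reference, left undecoded

# Python resolves its own escape, so these two strings are identical.
same_after_python_escape = (literal == escaped)

# Python does not apply XML escaping: the reference stays a 9-character string.
entity_is_plain_text = (entity != literal and len(entity) == 9)

# Only an explicit XML/HTML decoding step turns the reference into the character.
decoded = html.unescape(entity)

# Build the XPath from the literal character, so no escaping layer is involved.
xpath = f"//td[not(contains(., '{literal}'))]"
```

If the XPath string reaches the engine with the reference still undecoded, contains() looks for the nine characters of the reference rather than the medal, which would explain an unwanted element still coming back.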
Here's a single XPath 2.0+ expression that will select all td elements in the document except those consisting of only the targeted special characters:
//td[not(normalize-space() = ('🥇', '🥈', '🥉'))]
In XPath 1.0, you'd have to write out the clauses separately:
//td[not(normalize-space() = '🥇') and
not(normalize-space() = '🥈') and
not(normalize-space() = '🥉')]
Rearrange via De Morgan's laws to taste. Go back to contains() if you truly want to test via substring containment rather than string value equality.
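The equality-based filter can be sketched outside an XPath engine too. The following is only an approximation in Python's standard library, whose ElementTree does not support normalize-space(), so the two helper functions are hand-written stand-ins; the sample row and names are made up for illustration:

```python
import xml.etree.ElementTree as ET

MEDALS = {"🥇", "🥈", "🥉"}

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())

def string_value(element):
    """Approximate the XPath string value: all descendant text, concatenated."""
    return "".join(element.itertext())

# Hypothetical sample row; the cell contents are invented for this sketch.
row = ET.fromstring(
    "<tr><td>\n🥇\n</td><td>Alice</td>"
    "<td>\n🥈\n</td><td>Bob</td>"
    "<td>\n🥉\n</td><td>🥇 and more</td></tr>"
)

# Keep every td whose normalized string value is not exactly one of the medals.
kept = [
    normalize_space(string_value(td))
    for td in row.iter("td")
    if normalize_space(string_value(td)) not in MEDALS
]
```

Note that the last cell survives because it merely contains a medal; a contains()-style test would have dropped it.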

Related

How to write a common XPath for the same text displayed in different HTML tags?

I want to write a common XPath for the result displayed for my searched text 'Automation Server'
The same text appears in both td and div elements, as shown in the HTML below. Based on several articles I read, I wrote the following XPath:
displayed_text = //td[contains(text(),'Automation Server') or div[contains(text(),' Automation Server ')]
<td role="cell" mat-cell="" class="mat-cell cdk-cell cdk-column-siteName mat-column-siteName ng-star-inserted">Automation Server</td>
<div class="change-list-value ng-star-inserted"> Automation Server </div>
The operator you are looking for in XPath is |. It is a union operator and will return both sets of elements.
The XPath you are looking for is
//td[contains(text(),'Automation Server')] | //div[contains(text(),'Automation Server')]
This XPath,
//*[self::td or self::div][text()[normalize-space()='Automation Server']]
will select all td or div elements with an immediate text node whose normalized string value equals 'Automation Server'.
Cautions regarding other answers here
| is not logical-OR or "OR-like".
It is a union operator over node sets (XPath 1.0) or sequences (XPath 2.0+), not boolean values.
See: Logical OR in XPath? Why isn't | working?
contains(text(), "string") only tests the first text node child.
See: Why is contains(text(), "string" ) not working in XPath?
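The second caution is easy to see with a small experiment. This is a rough sketch using Python's standard-library ElementTree, which has no contains() or normalize-space(), so the text-node logic is emulated by hand; the span and the last div are hypothetical additions to the markup from the question:

```python
import xml.etree.ElementTree as ET

TARGET = "Automation Server"

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())

def direct_text_nodes(element):
    """Yield the element's immediate text nodes, in document order."""
    if element.text:
        yield element.text
    for child in element:
        if child.tail:
            yield child.tail

root = ET.fromstring(
    "<root>"
    '<td class="mat-cell">Automation Server</td>'
    '<div class="change-list-value"> Automation Server </div>'
    "<span>Automation Server</span>"
    "<div>Server list: <b>x</b> Automation Server </div>"
    "</root>"
)

candidates = [el for el in root.iter() if el.tag in ("td", "div")]

# Like text()[normalize-space()='Automation Server']: test every text node.
any_text_node = [
    el.tag for el in candidates
    if any(normalize_space(t) == TARGET for t in direct_text_nodes(el))
]

# Like contains(text(), 'Automation Server') in XPath 1.0: first text node only.
first_text_node = [
    el.tag for el in candidates
    if TARGET in next(direct_text_nodes(el), "")
]
```

The last div is found only by the per-text-node test, because its first text node is "Server list: ".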
A few alternatives to JeffC's answer, using common properties for both:
1. use the * as a wildcard for any element:
//*[contains(@class,'ng-star-inserted') and normalize-space(text())='Automation Server']
2. use in addition the local-name() function to narrow down the names of the elements:
//*[local-name()[.='td' or .='div']][contains(@class,'ng-star-inserted') and normalize-space(text())='Automation Server']
The normalize-space() function can be used to clean-up the optional white space, so a = operator can be used.
You could use the following XPath to test the local-name() of the element in a predicate and whether its text() contains the phrase:
//*[(local-name() = "td" or local-name() = "div") and contains(text(), "Automation Server")]

Selecting elements based on string/text matching in XPath?

For HTML tables on a web page, I am using the following XPath:
/tr/td[2]/.[contains(text(),'Some')]
This works in all cases, but it also matches 'Something'. The alternative,
/tr/td[2]/.[normalize-space(text()) = 'Some']
doesn't work in all cases.
Can somebody comment on what's wrong with the latter XPath?
Your problem likely doesn't involve normalize-space() but rather one of two common areas of confusion in text/string matching:
Text node vs string value
text() matches text nodes.
//td[contains(text(), 'Some')] will match this
<td>Some text</td>
but not
<td><b>Some text</b></td>
To match the latter too, use //td[contains(., 'Some')] instead. This will check that the string value of td contains the string "Some".
For more details, see XPath text() = is different than XPath . =
String contains vs string equals
Note also that contains() tests for substring containment. If you want string equality, use the = operator against a string:
//td[. = 'Some']
Will match
<td><b>Some</b></td>
but not
<td><b>Some text</b></td>
Be aware of the difference.
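Both distinctions can be checked side by side. Here is a small sketch with Python's standard-library ElementTree; since it has no contains(), the three XPath tests are imitated with plain string operations, where td.text stands in for the first text node and the joined itertext() for the string value:

```python
import xml.etree.ElementTree as ET

# The three cells discussed above, wrapped in a row for convenience.
row = ET.fromstring(
    "<tr>"
    "<td>Some text</td>"
    "<td><b>Some text</b></td>"
    "<td><b>Some</b></td>"
    "</tr>"
)

cells = list(row.iter("td"))

def string_value(element):
    """Approximate the XPath string value: all descendant text, concatenated."""
    return "".join(element.itertext())

# Like contains(text(), 'Some'): only a text node directly under td counts.
contains_text = ["Some" in (td.text or "") for td in cells]

# Like contains(., 'Some'): the whole string value is searched.
contains_dot = ["Some" in string_value(td) for td in cells]

# Like . = 'Some': the whole string value must equal the string.
equals_dot = [string_value(td) == "Some" for td in cells]
```

The three lists come out as [True, False, False], [True, True, True], and [False, False, True] respectively.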

One XPath expression doesn't work in Selenium, but works in Firefox

I have one question about xpath.
There is a td like this in Chrome (note the trailing &nbsp;):
<td class="dataCol col02">
"Hello world&nbsp;"
[View Hierarchy]
</td>
But when I inspect the same element in Firefox, it shows neither the &nbsp; nor the double quotes:
<td class="dataCol col02">
Hello world
[View Hierarchy]
</td>
Using FireFinder, the following XPath locates that element:
//td[text()='Hello world']
But with the Selenium API 2.24, the same XPath can't find it:
By.xpath("//td[text()='Hello world']")
Do you have any idea why?
Thanks!
Try with normalize-space() which trims leading and trailing whitespace characters:
//td[normalize-space(text())='Hello world']
Edit, following the comments: here's an XPath expression that's probably better suited to the general case:
//td[starts-with(normalize-space(.), 'Hello world')]
This matches <td> nodes whose whole concatenated string content, less leading and trailing whitespace, starts with "Hello world".
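One plausible reason the prefix test is more forgiving (an assumption, since the actual page isn't available) is the trailing non-breaking space: XPath's normalize-space() treats only space, tab, carriage return, and line feed as whitespace, so U+00A0 survives it. A rough Python imitation of that rule, with a made-up cell value:

```python
import re

# The only characters XPath 1.0 counts as whitespace.
XPATH_WS = " \t\r\n"

def xpath_normalize_space(text):
    """Imitate normalize-space(): collapse runs of XPath whitespace, then trim."""
    collapsed = re.sub(f"[{XPATH_WS}]+", " ", text)
    return collapsed.strip(XPATH_WS)

# Hypothetical cell text: the visible words followed by a non-breaking space.
cell_text = "\n    Hello world\u00a0\n    "

normalized = xpath_normalize_space(cell_text)

# The equality test fails because the non-breaking space is still there ...
equals_hello = (normalized == "Hello world")

# ... while the prefix test still succeeds.
starts_with_hello = normalized.startswith("Hello world")
```

Using the string value of the whole td (the "." in the expression) also means text inside child elements can't get in the way of the prefix test.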
I would try using the contains() function.
Your xpath will look like: //td[contains(text(),'Hello world')]

PowerShell modifying HTML from ConvertTo-HTML

I have a script that generates an array of objects that I want to email out in HTML format. That part works fine. I am trying to modify the HTML string to make certain rows a different font color.
Part of the html string looks like this (2 rows only):
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.pdf'</td>
<td>13124</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
<tr>
<td>ABL - Branch5206 Daily OD Report</td>
<td>'\\CTB052\Shared_Files\FIS-BIC Reporting\Report Output Files\ABL\Operations\Daily\ABL - Branch5206 Daily OD Report.xls'</td>
<td>15716</td>
<td>4/23/2013 8:05:34 AM</td>
<td>29134</td>
<td>0</td>
<td>Delivered</td>
</tr>
I tried regex to add a font color to the beginning and end of the rows where the row ends with "Delivered":
$email = [regex]::Replace($email, "<tr><td>(.*?)Delivered</td></tr>", '<tr><font color = green><td>$1Delivered</td></font></tr>')
This didn't work (I am not sure if you can set font color for a whole row like that).
Any ideas on how to do this easily/efficiently? I have to do it on several different statuses (like Delivered)
Disclaimer: HTML cannot be parsed reliably with regular expressions, so a regular expression will NOT provide a general solution to this problem. If your HTML structure is well known and you don't have any other <tr></tr> elements, though, the following might work. On that note, is there some reason you can't modify the HTML generation to do this, instead of waiting until the HTML is already generated?
Try this command:
PS > $email = $email -replace '(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>','<tr style="color: #FF0000">$1<td>Delivered</td>$2</tr>'
The first string is the pattern. The (?s) tells the parser to allow . to accept newlines; this is called "single line" mode. Then it grabs a <tr> element that contains the string <td>Delivered</td>. The two capture groups grab everything else in the <tr> element around the <td>Delivered</td> string. Take note of the question marks following the *s. * by itself is greedy and matches as much text as possible; *? matches as little text as possible. If we just used * here, it would treat your entire string as one match and only replace the first <tr>.
The second string is the replacement. It puts the <tr> element and its contents back in place with an added style attribute, using $1 and $2 to reinsert the two captured groups.
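The greedy versus lazy behaviour described above is not specific to PowerShell. As a sketch, here is the same pattern in Python's re module (not the PowerShell from this answer; Python writes the group references as \1 and \2 instead of $1 and $2), run against two made-up rows that are both "Delivered":

```python
import re

# Two hypothetical rows, both ending in a Delivered cell.
rows = (
    "<tr>\n<td>Report A</td>\n<td>Delivered</td>\n</tr>\n"
    "<tr>\n<td>Report B</td>\n<td>Delivered</td>\n</tr>"
)

replacement = r'<tr style="color: #FF0000">\1<td>Delivered</td>\2</tr>'

# Lazy quantifiers (*?): each match stops at the nearest closing </tr>.
lazy = r"(?s)<tr>(.*?)<td>Delivered</td>(.*?)</tr>"
lazy_result, lazy_count = re.subn(lazy, replacement, rows)

# Greedy quantifiers (*): one match swallows everything up to the last </tr>.
greedy = r"(?s)<tr>(.*)<td>Delivered</td>(.*)</tr>"
greedy_result, greedy_count = re.subn(greedy, replacement, rows)
```

With the lazy pattern both rows get the style attribute; with the greedy one there is a single match, so only the first <tr> is changed.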
One other minor note is the quoting. I tend toward single quotes anyway, but in this case, you're likely to have double quotes in the replacement string. So single quotes are probably the way to go.
As for how you could do this for different statuses, regular expressions really aren't designed for conditional content like that; it's like trying to use a screwdriver as a drill. You can hard-code several replaces or loop over status/color pairs and build your pattern and replacement strings from them. A full-blown HTML parser would be more efficient if you can find one for .NET; you might try to get away with an XML parser if you can guarantee it's valid XML. Or, going back to my question at the beginning, you could modify the HTML generation. If your e-mails are few in number, though, this may not be a bottleneck worth addressing. Development time spent is also costly. See if it's fast enough and try a different route if not.
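The "loop over status/color pairs" idea might look something like the sketch below. It is written in Python rather than PowerShell, and it is a variation rather than the exact pattern above: each row is matched on its own and then inspected, so a lazy match cannot run from one row into the next. The "Failed" and "Pending" statuses and the colors are hypothetical; only "Delivered" comes from the question.

```python
import re

# Hypothetical status-to-color mapping; only "Delivered" is from the question.
STATUS_COLORS = {"Delivered": "green", "Failed": "red"}

# One match per row; DOTALL lets "." span the newlines inside a row.
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)

def color_row(match):
    """Return the row with a style attribute if it has a known status cell."""
    body = match.group(1)
    for status, color in STATUS_COLORS.items():
        if f"<td>{status}</td>" in body:
            return f'<tr style="color: {color}">{body}</tr>'
    return match.group(0)  # unknown status: leave the row untouched

def color_rows(html_text):
    return ROW.sub(color_row, html_text)

sample = (
    "<tr>\n<td>Report A</td>\n<td>Failed</td>\n</tr>\n"
    "<tr>\n<td>Report B</td>\n<td>Delivered</td>\n</tr>\n"
    "<tr>\n<td>Report C</td>\n<td>Pending</td>\n</tr>"
)
colored = color_rows(sample)
```

This still assumes the well-known, flat row structure mentioned in the disclaimer; nested tables or attributes on <tr> would need a real parser.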
Credit where it's due: I took the HTML style attribute from @FrankieTheKneeMan.

PHP Regex: Get info between groups of HTML tags?

I have been programming a word-unscrambler. I need to parse the information between a group of tags and another, and put all matches into an array. The beginning tag is:
<tr> <td></td><td><li>
and the ending tag is:
</li></td> </tr>
I know some regular expressions, but I am unfamiliar with PHP.
<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>
Test is here: http://rubular.com/regexes/13241
PHP can cope with both Perl-compatible (PCRE) and POSIX regexes, so you'll need to use the function family for the flavour that you know. I believe that the code posted by @Gazler should work in either case, though it will need to be surrounded by / delimiters and quoted:
$reg = '/<tr> <td><\/td><td><li>(.*)<\/li><\/td> <\/tr>/';
There's a PHP test site for Perl-compatible regexes here: http://j-r.camenisch.net/regex/
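For collecting every match into an array, one thing to watch is the greedy (.*) in the pattern: when several groups sit on one line, it captures from the first opening tag to the last closing tag. A minimal sketch of the difference, written in Python's re module rather than PHP and using made-up words as input:

```python
import re

# Hypothetical input: two tag groups on a single line.
page = (
    "<tr> <td></td><td><li>cat</li></td> </tr>"
    "<tr> <td></td><td><li>act</li></td> </tr>"
)

# Greedy capture, as in the pattern above: one oversized match.
greedy = re.findall(r"<tr> <td></td><td><li>(.*)</li></td> </tr>", page)

# Lazy capture (.*?): one match per group of tags.
lazy = re.findall(r"<tr> <td></td><td><li>(.*?)</li></td> </tr>", page)
```

Here lazy is the list ["cat", "act"], while greedy holds a single string that runs across both rows. The same *? quantifier is available in Perl-compatible patterns.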