xpath replace not working? - html

With this HTML
<tr>
<td class="listOddRow">Chequing </td>
<td class="listOddRow">00-227-03</td>
<td class="listOddRow" nowrap="">0275-1</td>
<td class="listOddRow" align="right" nowrap="">$ 28.08</td>
</tr>
Does anyone know why this works
//td[contains(text(),"00-227-03")]/parent::tr//a
but this does not? I want to remove the dashes from text() before calling contains().
//td[contains(replace(text(), "-", ""),"0022703")]/parent::tr//a

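XPath 1.0 has no replace() function (it was added in XPath 2.0), but it does have translate(). translate() works character by character: each character of the first string that appears in the second string is replaced by the character at the same position in the third string, or removed if the third string has no character at that position. So you can use this XPath: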
//td[contains(translate(text(), "-", ""),"0022703")]/parent::tr//a
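To make the character-by-character behaviour of translate() concrete, here is a small Python sketch that mimics it. The helper name xpath_translate is made up for illustration and is not part of any XPath library; it only models the rules as described in the XPath 1.0 specification.

```python
def xpath_translate(source: str, from_chars: str, to_chars: str) -> str:
    """Hypothetical helper that mimics XPath 1.0 translate()."""
    mapping = {}
    for index, char in enumerate(from_chars):
        if char in mapping:
            continue  # the first occurrence in from_chars wins
        # No counterpart in to_chars means the character is removed.
        mapping[char] = to_chars[index] if index < len(to_chars) else None
    result = []
    for char in source:
        replacement = mapping.get(char, char)
        if replacement is not None:
            result.append(replacement)
    return "".join(result)


# Same call as translate(text(), "-", "") on the cell from the question.
stripped = xpath_translate("00-227-03", "-", "")
print(stripped)               # 0022703
print("0022703" in stripped)  # True, which is what contains() then checks
```

Because the third argument is empty, every dash is dropped, which is why translate() can stand in for replace() in this particular case. It cannot replace multi-character substrings.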

Related

Regular expression different result

Hello, I want to match a table tag, followed by any characters but not another table, followed by an element with id ContentPlaceHolder1, and finally followed by the closing </table> tag.
I wrote this regular expression:
~\<table[^>]*>.*?ContentPlaceHolder1.+?<\/table>~is
In my text editor (EmEditor) it works fine, but in a PHP script it matches the first table tag of the page and all the code that follows.
Can anyone tell me what's wrong?
Thanks a lot
I can only guess at what you wish to achieve. As Matt commented on your question, a code snippet with an explanation of exactly what you are trying to do would help us help you.
So, in that context, I will try to guess the issue:
I'm guessing that your code has an element with id ContentPlaceHolder1 near the end and maybe nowhere else. What leads me to assume that is your statement:
"in a PHP script it matches the first table tag of the page and all the code that follows"
and also
"I want to match a table tag, followed by any characters but not another table"
That is not what the pattern does, though. In fact, your regex does the following:
1. Match the first <table> tag with any attributes there might be inside it ([^>]*)
2. Match any character as few times as possible (.*?)
3. Match ContentPlaceHolder1
4. Match one or more of any character, as few as possible (.+?)
5. Match a closing </table> tag (written <\/table> in the pattern)
I believe you are misinterpreting step #2. The lazy .*? does not skip over intervening <table> tags; it only makes the match stop at the first occurrence of the keyword ContentPlaceHolder1 after the opening tag, rather than at a later occurrence.
Consider the following example (please ignore that the HTML is broken, it's just an example):
<table border="3" cellpadding="10" cellspacing="10">
<td>
<table border="3" cellpadding="3" cellspacing="3">
<td>2nd table</td>
<some_element id="ContentPlaceHolder1"></some_element>
</table>
<td>2nd table</td>
<tr>
<td>2nd table</td>
<td>2nd table</td>
</tr>
</table>
<some_element id="ContentPlaceHolder1"></some_element>
</td>
<td> the cell next to this one has a smaller table inside of it, a table inside a table.</td>
</table>
Here, .*? does not tell the regex engine to avoid matching a second <table> tag; it tells the engine to match up to the first occurrence of the keyword ContentPlaceHolder1 instead of greedily matching up to the last one.
What you are trying to achieve can be done with a negative lookahead, which tells the regex engine to look ahead and reject a match of the first subpattern if the second one follows. You can see this in practice in this demo, where I use a negative lookahead so that the regex engine only matches a <table> tag if it is not followed by another <table> tag: <table[^>]*>(?!.*<table[^>]*>)
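As a rough illustration of the lookahead idea, here is a Python sketch. It uses a slightly different arrangement from the demo pattern: the negative lookahead is applied before every single character (sometimes called a tempered dot), so the part between the opening tag and ContentPlaceHolder1 can never run across another <table. The sample markup is invented for the example, and re.IGNORECASE and re.DOTALL stand in for the i and s modifiers of the original pattern. The same pattern has not been tested in PHP here.

```python
import re

# Invented page fragment: a plain table, then a table with a nested table
# that contains the element we are looking for.
html = """<table border="3">
<tr><td>aaaaa</td></tr>
</table>
<table border="3">
<tr><td>
<table border="1">
<tr><td><a id="ContentPlaceHolder1">link</a></td></tr>
</table>
</td></tr>
</table>"""

# (?:(?!<table).)*? consumes one character at a time, but only while the
# upcoming text does not start another <table tag.
pattern = re.compile(
    r"<table[^>]*>(?:(?!<table).)*?ContentPlaceHolder1.*?</table>",
    re.IGNORECASE | re.DOTALL,
)

match = pattern.search(html)
innermost = match.group(0) if match else ""
print(innermost)
```

The match starts at the <table border="1"> tag, because attempts starting at the two earlier tables hit another <table before reaching the id. Note that the trailing .*?</table> still stops at the first closing tag, so this sketch would cut the result short if the matched table itself contained further nested tables.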
Please review my answer, and if it does not solve your issue, please add more information and a sample of your code so that we can provide further assistance.
Regards
Thanks for the answer.
This is a hypothetical page:
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>
<!-- from here -->
<table border="3" cellpadding="3" cellspacing="3">
<tr>
<td>aaaa</td>
<td><a id="ContentPlaceHolder1">link</a></td>
</tr>
</table>
<!-- to here -->
</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
I want to match from the first comment to the second.
In other words, I want to match the complete table that contains the element with id ContentPlaceHolder1.
My regexp in PHP matches starting from the first table tag of the page.
Thanks a lot

XPath using sibling or following across two different cells

Put bluntly, I want to locate TestCoupon10% inside a td, then move to a sibling td, then locate //a[contains(@id,"cmdOpen")]. I did try sibling and following, but I likely didn't do it right, because
//span[./text()="TestCoupon10%"]/following-sibling:a[contains(@id,"cmdOpen")]
results in an invalid XPath. The HTML structure looks as follows:
<tr>
<td>
<span id="oCouponGrid_ctl03_lblCode">TestCoupon10%</span>
</td>
<td>...</td>
<td>...</td>
<td valign="middle" align="right">
<a id="oCouponGrid_ctl03_cmdOpen"></a>
</td>
</tr>
I need to find cmdOpen in the row of the test coupon. Does anyone have an idea how to do that?
Axes are delimited with double colons, not single ones (those are used for namespace prefixes). You wanted to say this:
//span[./text()="TestCoupon10%"]/following-sibling::a[contains(@id,"cmdOpen")]
But the <a> is not a following sibling of the <span> in question. You need to do some navigating:
//span[./text()="TestCoupon10%"]/parent::td/following-sibling::td/a[contains(@id,"cmdOpen")]
Or simply avoid descending into the tree, so you don't have to "climb up" again in the first place.
//td[span = "TestCoupon10%"]/following-sibling::td/a[contains(@id,"cmdOpen")]
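The navigation in the last expression can be spelled out step by step in Python. This is only a sketch: the standard-library ElementTree does not support the following-sibling axis, so the code walks the cells by hand, and it assumes a simplified, well-formed copy of the row (real HTML pages often cannot be parsed as XML). The function name find_open_link is made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the row from the question (an assumption).
row_xml = """<table>
<tr>
<td><span id="oCouponGrid_ctl03_lblCode">TestCoupon10%</span></td>
<td>...</td>
<td valign="middle" align="right"><a id="oCouponGrid_ctl03_cmdOpen">open</a></td>
</tr>
</table>"""


def find_open_link(root, coupon_code):
    """Return the first <a> with 'cmdOpen' in its id that sits in a cell
    after the cell whose <span> text equals coupon_code, or None."""
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        for index, td in enumerate(cells):
            # Corresponds to the predicate td[span = "..."].
            if any("".join(s.itertext()) == coupon_code for s in td.findall("span")):
                # Corresponds to following-sibling::td/a[contains(@id, ...)].
                for later_td in cells[index + 1:]:
                    for link in later_td.findall("a"):
                        if "cmdOpen" in link.get("id", ""):
                            return link
    return None


link = find_open_link(ET.fromstring(row_xml), "TestCoupon10%")
print(link.get("id") if link is not None else "not found")
```

With a full XPath 1.0 engine the single expression above does all of this in one step; the loop is only meant to show which nodes each step selects.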

Xpath help to Find Unique value

I want to find the first tr tag with the text PONumber:, but I am not able to. Any help? I can find it using //table/tbody/tr/td[contains(text(),'PONumber')], but that returns 2 objects. I want to find only the first one.
<tr>
<td class="clsLabel" align="right"> PONumber: </td>
<td class="clsInput"> PN659 </td>
</tr>
<tr>
<td class="clsLabel" align="right"> PreviousPONumber: </td>
<td class="clsInput"/>
</tr>
You can use the following XPath to find exactly the object you want:
//tr/td[normalize-space(.)='PONumber:']
You can use something like
(//tr/td[contains(text(),'PONumber')])[1]
That is, put the XPath in parentheses and use [1] to return only the first entry. Alternatively, you could use something like:
//tr/td[contains(text(),'PONumber') and not(contains(text(),'Previous'))]
so that "Previous" is excluded from the search results.
You can limit the XPath result to return only the first matched by using [1] :
(//table/tbody/tr/td[contains(.,'PONumber')])[1]
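The difference between the substring test and the exact test on normalized text can be seen in a short Python sketch. It assumes the two rows are wrapped in a <table> so they parse as XML, and normalize_space is a hypothetical helper that only approximates the XPath function of the same name.

```python
import xml.etree.ElementTree as ET

# The two rows from the question, wrapped in a <table> (an assumption).
rows_xml = """<table>
<tr><td class="clsLabel" align="right"> PONumber: </td><td class="clsInput"> PN659 </td></tr>
<tr><td class="clsLabel" align="right"> PreviousPONumber: </td><td class="clsInput"/></tr>
</table>"""


def normalize_space(text):
    """Trim the ends and collapse runs of whitespace, roughly like XPath."""
    return " ".join((text or "").split())


cells = list(ET.fromstring(rows_xml).iter("td"))

# contains(text(), 'PONumber') also hits "PreviousPONumber:".
substring_hits = [td for td in cells if "PONumber" in (td.text or "")]
# normalize-space(.) = 'PONumber:' hits only the exact label.
exact_hits = [td for td in cells if normalize_space(td.text) == "PONumber:"]
# (...)[1] in XPath is the first item; Python indexing starts at 0.
first_hit = substring_hits[0]

print(len(substring_hits), len(exact_hits))  # 2 1
```

Either approach returns the same cell here. The exact comparison is the more robust one if the order of the rows might change.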

Overcoming line breaks in a block of HTML text

I know that I shouldn't be using REGEX for parsing HTML, and I promise to check out the HTML agility pack, but in the meantime, could some expert tell me if there's a pattern that will match this entire block?
<tr bgcolor="#f4f4ff"><td align="center"><font size="2">42</font></td>
<td align="center"><font size="2">35</font></td>
<td><font size="2"><b>Bears</b></font></td>
<td><font size="2">BV</font></td>
<td align="right"><font size="2"><b>$33,845</b></font></td>
<td align="right"><font size="2"><font color="#ff0000">-60.1%</font></font></td>
<td align="right"><font size="2">75</font></td>
<td align="right"><font size="2"><font color="#ff0000">-35</font></font></td>
<td align="right"><font size="2">$451</font></td>
<td align="right"><font size="2">$17,492,470</font></td>
<td align="right"><font size="2">-</font></td>
<td align="center"><font size="2">8</font></td>
</tr>
I'm using VBA, and regexoptions don't seem to be available. I've fiddled endlessly, and "should works" like
<tr[.\n]+tr>
<tr[.\s]+tr>
<tr[.\x0C\x\0A]+tr>
don't. I can match everything up to the first line break, then I hit a brick wall. Is there any workaround if the singleline option isn't available? Maybe using the VBA REPLACE function to change all vbcrlf instances to some other character before I try to match? And can somebody point me to an example of how much easier this would be with the HTML agility pack?
Okay, so you've heard the warnings against using regex to parse html?
As far as I know, the VBScript RegExp object doesn't have a DOTALL mode, so to match across lines we have to make a fake DOTALL dot with something like this: (?:.|[\r\n]) or [\s\S]
This regex will match your block:
<tr[\s\S]*?</tr>
But if there are nested rows, you will be sorry you used regex. :)
Sample code
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
Dim myRegExp, myMatches, matched_block
Set myRegExp = New RegExp
myRegExp.Pattern = "<tr[\s\S]*?</tr>"
' SubjectString holds the HTML text to search.
Set myMatches = myRegExp.Execute(SubjectString)
If myMatches.Count >= 1 Then
    matched_block = myMatches(0).Value
Else
    matched_block = ""
End If
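For comparison, here is the same idea as a Python sketch, since Python's re module has both options available. The sample text is a shortened, invented version of the block from the question with Windows line breaks. It shows that a plain dot fails on a multi-line row, while [\s\S] and a real DOTALL flag give the same result.

```python
import re

# Two shortened rows, each spread over several lines.
block = (
    '<tr bgcolor="#f4f4ff"><td align="center"><font size="2">42</font></td>\r\n'
    '<td><font size="2"><b>Bears</b></font></td>\r\n'
    '</tr>\r\n'
    '<tr><td>43</td>\r\n'
    '<td>Other</td>\r\n'
    '</tr>'
)

# "." does not match "\n" by default, so no row can be matched.
dot_only = re.search(r"<tr.*?</tr>", block)
# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
fake_dotall = re.search(r"<tr[\s\S]*?</tr>", block)
# With DOTALL the plain dot behaves the same way.
real_dotall = re.search(r"<tr.*?</tr>", block, re.DOTALL)

print(dot_only)
print(fake_dotall.group(0) == real_dotall.group(0))
```

The lazy quantifier *? is what keeps the match to the first row; a greedy * would run to the last </tr> in the text.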

Regular expression in Visual Studio find Replace

Hi, I have an HTML file with hundreds of lines like this
<tr>
<td class="text-column">
Risk
</td>
<td>
7,848,705
</td>
<td>
7,828,750
</td>
<td>
19,955
</td>
</tr>
To save time formatting it, does anyone know a Visual Studio find/replace regular expression that will produce
<tr>
<td class="text-column">Risk</td>
<td>7,848,705</td>
<td>7,828,750</td>
<td>19,955</td>
</tr>
I plan to fill in the figures with razor later and this will ease readability.
Find: {\<[^\>]+\>}[:b\n]*{[^\n]*}[:b\n]*{\</[^\>]+\>}
Replace: \1\2\3
Explanation:
{\<[^\>]+\>} -- capture open tag
[:b\n]* -- discard whitespace
{[^\n]*} -- get contents (assuming no line breaks)
[:b\n]* -- discard whitespace
{\</[^\>]+\>} -- capture closing tag
Not perfect, but it produces the expected output on your sample.
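The same find/replace can be sketched outside the editor in Python. The {…} groups and :b in the pattern above belong to the Visual Studio Find/Replace dialect; the version below is a rough translation into Python's re syntax, not an exact equivalent, and the sample input is a shortened, indented stand-in for the markup in the question. One deliberate difference: the first group only accepts opening tags, so a closing tag followed by another closing tag is left alone.

```python
import re

# Shortened stand-in for the markup in the question.
html = """<tr>
    <td class="text-column">
        Risk
    </td>
    <td>
        7,848,705
    </td>
</tr>"""

# 1: opening tag, then whitespace, 2: single-line content, then whitespace,
# 3: closing tag. The whitespace sits outside the groups, so it is dropped.
pattern = re.compile(r"(<[^>/][^>]*>)\s*([^\n<]*?)\s*(</[^>]+>)")
compact = pattern.sub(r"\1\2\3", html)
print(compact)
```

On this input the two cells collapse to <td class="text-column">Risk</td> and <td>7,848,705</td>, while the line breaks between cells are kept. Like the original pattern, it assumes the cell content contains no line breaks or nested tags.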
Did it with code in the end. But thanks to Devon for taking the time