Regular expression different result - html

Hello, I want to match a table tag, followed by any characters but not another table, followed by an element with id ContentPlaceHolder1, and finally followed by the closing /table tag.
I wrote this regexp:
~\<table[^>]*>.*?ContentPlaceHolder1.+?<\/table>~is
In my text editor (EmEditor) it works fine, but in a PHP script it matches the first table tag of the page and all the code that follows.
Can anyone tell me what's wrong?
Thanks a lot

I can only guess at what you want to achieve; as Matt commented on your question, a code snippet and an explanation of exactly what you are trying to do would help us help you.
So, in that context, I will try to guess the issue:
I'm guessing that your page has an element with id ContentPlaceHolder1 near the end and maybe nowhere else. What leads me to assume that is that you state:
in a PHP script it matches the first table tag of the page and all the code that follows.
and also
I want to match a table tag, followed by any characters but not another table
That is not what your regex does, though. In fact, it does the following:
Match the first <table> tag, with any attributes it may contain ([^>]*)
Match any character, as few times as possible (.*?)
Match ContentPlaceHolder1
Match one or more of any character, but as few as possible (.+?)
Match a closing </table> tag
What I believe you are misinterpreting is step #2. That step does not skip over leading <table> tags; it only makes the match stop at the first occurrence of the keyword ContentPlaceHolder1 rather than a later one.
Consider the following example (please ignore that the html is broken, it's just an example):
<table border="3" cellpadding="10" cellspacing="10">
<td>
<table border="3" cellpadding="3" cellspacing="3">
<td>2nd table</td>
<some_element id="ContentPlaceHolder1"></some_element>
</table>
<td>2nd table</td>
<tr>
<td>2nd table</td>
<td>2nd table</td>
</tr>
</table>
<some_element id="ContentPlaceHolder1"></some_element>
</td>
<td> the cell next to this one has a smaller table inside of it, a table inside a table.</td>
</table>
Here, .*? does not instruct the regex engine to avoid matching a second <table> tag; it instructs the engine to stop at the first occurrence of the keyword ContentPlaceHolder1 instead of greedily matching up to the last one.
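To see this behaviour concretely, here is a minimal Python sketch. Python's re module stands in for PHP's PCRE, and the two-table page is a made-up sample rather than the asker's real markup. It suggests that the lazy .*? does nothing to stop the match from starting at the first table:

```python
import re

# Hypothetical sample: the id only appears inside the second table.
html = (
    '<table id="first">\n<tr><td>aaaa</td></tr>\n</table>\n'
    '<table id="second">\n<tr><td><a id="ContentPlaceHolder1">link</a></td></tr>\n</table>'
)

# Same pattern as in the question; re.I | re.S correspond to the "is" modifiers.
pattern = re.compile(r"<table[^>]*>.*?ContentPlaceHolder1.+?</table>", re.I | re.S)

match = pattern.search(html)
# The engine starts at the leftmost <table> and lets .*? grow across the
# first </table> and the second <table> until it reaches the id.
print(match.start())                   # 0
print(match.group(0).count("<table"))  # 2
```

Lazy quantifiers only affect where a match ends, not where it begins: the leftmost possible start always wins.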
What you are trying to achieve can be done with a negative lookahead. A negative lookahead instructs the regex engine to look ahead from the current position and fail the match if the pattern inside the lookahead can be matched there. You can see this in practice in this demo, where I'm using a negative lookahead so that a <table> tag is only matched if it is not followed by another <table> tag: <table[^>]*>(?!.*<table[^>]*>)
Please review my answer, and if it does not solve your issue, please add more information and a sample of your code so that we can provide further assistance.
Regards

Thanks for the answer.
This is a hypothetical page:
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>
<!-- from here -->
<table border="3" cellpadding="3" cellspacing="3">
<tr>
<td>aaaa</td>
<td><a id="ContentPlaceHolder1">link</a></td>
</tr>
</table>
<!-- to here -->
</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>aaaaa</td>
</tr>
</table>
I want to match from the first comment to the second.
In other words, I want to match the complete table that contains the element with id ContentPlaceHolder1.
My regexp in PHP matches from the first table tag of the page.
Thanks a lot
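One possible way to get exactly that table, sketched in Python (the re module stands in for PHP's PCRE, and the shortened page below is a stand-in for the one above). The assumption is that the wanted table contains no nested table of its own: the negative lookahead (?!<table\b) is checked before every consumed character, so the match can never run across another opening table tag.

```python
import re

# Shortened stand-in for the hypothetical page above.
page = """<table border="3" cellpadding="10" cellspacing="10">
<tr><td>aaaaa</td></tr>
</table>
<table border="3" cellpadding="10" cellspacing="10">
<tr>
<td>
<table border="3" cellpadding="3" cellspacing="3">
<tr>
<td>aaaa</td>
<td><a id="ContentPlaceHolder1">link</a></td>
</tr>
</table>
</td>
</tr>
</table>"""

# (?:(?!<table\b).)*? consumes one character at a time, but only while no
# new <table tag starts at the current position.
no_table = r"(?:(?!<table\b).)*?"
pattern = re.compile(
    r"<table\b[^>]*>" + no_table + r"ContentPlaceHolder1" + no_table + r"</table>",
    re.I | re.S,
)

match = pattern.search(page)
print(match.group(0))  # the inner table only
```

In PHP the same pattern could presumably be passed to preg_match with the ~...~is delimiters and modifiers used in the question; that variant is untested here.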

Related

RegEx and HTML: How to match an element "foo", which contains at least two other elements "bar"? (negative look ahead assertion)

I would like to match the element "table" which has the class "zot" and contains at least two elements "td".
For example, a table which contains only "th" but no "td" should not be matched.
I tried the following expression without success:
<table class="zot">([\S\s]*?(?!\/table>)<td){2,}
The same expression in more readable free spacing syntax:
<table class="zot"> # literal
( # begin of group
[\S\s] # non whitespace or whitespace
* # quantifier
? # lazy (non-greedy) modifier
(?!\/table>) # negative look ahead assertion with the literal "/table>"
<td # literal
) # end of group
{2,} # quantifier
Probably my understanding of the negative lookahead is wrong.
I created a demo for the case: https://regexr.com/43mmh
What is my mistake, please? Thanks.
Below you find my HTML code for the test (the same as in the demo):
<table class="zot">
<tr>
<th>a</th>
<th>b</th>
</tr>
<tr>
<td>c</td>
<td>d</td>
</tr>
</table>
<p>Lorem</p>
<table class="zot">
<tr>
<th>e</th>
</tr>
<tr>
<td>f</td>
</tr>
</table>
<table class="zot">
<tr>
<th>g</th>
<th>h</th>
</tr>
<tr>
<td>i</td>
<td>j</td>
</tr>
</table>
Which matches do I wish to have?
<table class="zot">
<tr>
<th>a</th>
<th>b</th>
</tr>
<tr>
<td>c</td>
<td
and
<table class="zot">
<tr>
<th>g</th>
<th>h</th>
</tr>
<tr>
<td>i</td>
<td
Assuming you want foo to come before bar, you can use
<table class="zot">((?!\/table>).)+foo(?1)+bar(?1)+<\/table>
https://regexr.com/43nkb
The general idea is to repeat any character that is not the start of /table>, match foo, repeat the previous pattern again, match bar, repeat the previous pattern again, and finally match the end table tag.
Note the s flag and the use of (?1) syntax, which makes the regex a lot easier to read. Without that, you'll have to use [\s\S] instead of ., and type out the first subpattern manually instead of the (?1)s, eg
<table class="zot">(?:(?!\/table>)[\s\S])+foo(?:(?!\/table>)[\s\S])+bar(?:(?!\/table>)[\s\S])+<\/table>
That said, if at all possible, in whatever environment you're using, it would likely be more elegant to use a proper HTML parser.
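As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module. The class name ZotTdCounter and the shortened sample markup are made up for this example, and the sketch assumes the zot tables are not nested inside each other:

```python
from html.parser import HTMLParser


class ZotTdCounter(HTMLParser):
    """Counts <td> start tags inside each <table class="zot">."""

    def __init__(self):
        super().__init__()
        self.td_counts = []      # one counter per zot table, in document order
        self.inside_zot = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "table" and ("class", "zot") in attrs:
            self.inside_zot = True
            self.td_counts.append(0)
        elif tag == "td" and self.inside_zot:
            self.td_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.inside_zot = False


html = (
    '<table class="zot"><tr><th>a</th></tr><tr><td>c</td><td>d</td></tr></table>'
    '<table class="zot"><tr><th>e</th></tr><tr><td>f</td></tr></table>'
)

counter = ZotTdCounter()
counter.feed(html)
print(counter.td_counts)                                        # [2, 1]
print([i for i, n in enumerate(counter.td_counts) if n >= 2])   # [0]
```

The second print lists the indexes of the tables that have at least two td cells, which is the condition the regex is trying to express.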
I have totally rewritten my answer, now you will get 1 match per table with more than one table cell.
The regex:
<table class="zot">(?:(?:[\S\s](?!\/table>))*?<td){2,}[\S\s]*?<\/table>
Explanation:
<table class="zot"> matches the literal string <table class="zot">.
(?: opens the outer non-capturing group.
(?: opens the inner non-capturing group.
[\S\s] matches any single character (whitespace or non-whitespace).
(?!\/table>) is a negative lookahead: the character just matched must not be followed by '/table>'.
*? repeats the inner group zero or more times, non-greedily.
<td matches the literal string <td.
{2,} repeats the outer group two or more times.
[\S\s]*? matches any characters, zero or more times, non-greedily.
<\/table> matches the literal string </table>.
You need to set the 'global' flag.
Now you get one match per table that contains at least 2 table cells.
You can test it on Regexr or here.
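A quick way to check this outside a browser is a small Python sketch (Python's re module is an assumption here, since the answer was written for RegExr). The markup is a condensed version of the test HTML from the question, and the slashes are left unescaped because Python patterns have no delimiters:

```python
import re

# Condensed version of the three test tables from the question.
html = """<table class="zot">
<tr><th>a</th><th>b</th></tr>
<tr><td>c</td><td>d</td></tr>
</table>
<p>Lorem</p>
<table class="zot">
<tr><th>e</th></tr>
<tr><td>f</td></tr>
</table>
<table class="zot">
<tr><th>g</th><th>h</th></tr>
<tr><td>i</td><td>j</td></tr>
</table>"""

pattern = re.compile(
    r'<table class="zot">(?:(?:[\S\s](?!/table>))*?<td){2,}[\S\s]*?</table>'
)

# findall plays the role of the 'global' flag: it returns every match.
tables = pattern.findall(html)
print(len(tables))  # 2
```

The middle table has only one td, so the {2,} repetition cannot be satisfied before the lookahead blocks the match at its closing tag.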

HTML/CSS, getting table cells to balance

I have a table formatted as:
<table>
<tr>
<td>
Long list of info
Line two
</td>
<td>
Shorter list of info
</td>
</tr>
</table>
How can I get them to both display from the top of 'tr'? I assume there's a way to stop automatic vertical alignment with CSS?
To clarify for future viewers, I got it working using:
CSS:
td {
vertical-align: top;
}
HTML:
<table>
<tr>
<td>
Long list of info
Line two
</td>
<td>
Shorter list of info
</td>
</tr>
</table>
(Here's the fiddle)
Hey, if you have questions about how to use tables, please have a look at some examples here:
http://www.w3schools.com/html/html_tables.asp
http://www.w3schools.com/css/css_table.asp
If you prefer to avoid CSS you can do the same thing inline with the deprecated valign attribute. valign is not supported in HTML5 but is the recommended method if you are working within the context of an HTML email.
<table border="1">
<tr>
<td width="110">
Long list of info Line two
</td>
<td valign="top">
Shorter list of info
</td>
</tr>
</table>
Or you could always use two rows if you can predict and control where your breaks are going to be and want to simplify the markup further.

How to extract only the 1st table tag from a html page having various nested table tag

I have the following HTML page. I want to extract only the data within the first table tag, in C#. The HTML page code is:
<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tbody>
<tr>
<td align=right><b>11/09/2013 at 09:48</b></td>
</tr>
</tbody>
</table>
<center>
<table border="1" bordercolor="silver" cellpadding="2" cellspacing="0" width="100%">
<thead>
<tr>
<th width=100>ETA</th>
<th width=100>Ship Name</th>
<th width=80>From port</th>
<th width=80>To berth</th>
<th width=130>Agent</th>
</tr>
</thead>
<tbody>
<tr><td>11/09/2013 at 09:00 </td>
<td>SONANGOL KALANDULA </td>
<td>Cabinda </td>
<td>Valero 6 </td>
<td>Graypen </td>
</tr>
</tbody>
</table>
To be more specific, I want to extract only the row containing the date 11/09/2013 at 09:48, which is inside the first table tag. I am using the regex
"<table[^>]*>([^<]*(?:(?!</table)<[^<]*)*)[</table>]*"
but with this I get the whole page source, that is, the data between all the table tags, whereas I want only the text inside the first table tag.
Can anyone tell me a regular expression with which I can extract only this particular portion from the whole HTML page?
When trying out your version here, it seems to work for me on the input you specified, though [</table>]* should really be just </table> ([</table>]* means any number of characters from the set <, /, t, a, b, l, e, >).
This seems like it could be simplified, though. This should also work, provided . is allowed to match newlines (RegexOptions.Singleline in .NET):
<table[^>]*>.*?</table>
All bets are off if you have nested tables, of course.
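Here is a minimal Python sketch of the simplified pattern (Python rather than C# is an assumption made for illustration, and the page is abbreviated). Asking for a single match instead of all matches returns only the first table; in C# the counterpart would presumably be Regex.Match rather than Regex.Matches.

```python
import re

# Abbreviated version of the page from the question.
html = """<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tbody>
<tr>
<td align=right><b>11/09/2013 at 09:48</b></td>
</tr>
</tbody>
</table>
<center>
<table border="1" width="100%">
<tbody>
<tr><td>11/09/2013 at 09:00 </td><td>SONANGOL KALANDULA </td></tr>
</tbody>
</table>"""

# re.S lets . match newlines; search() stops at the first match.
first = re.search(r"<table[^>]*>(.*?)</table>", html, re.S | re.I)
inner = first.group(1)

# Pull the cell contents out of that first table only.
cells = re.findall(r"<td[^>]*>(.*?)</td>", inner, re.S | re.I)
print(cells)  # ['<b>11/09/2013 at 09:48</b>']
```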

xpath ignore node

I'm looking for a way to select a node in XPath, given that a node on its path may or may not exist. Just like '?' works in regexp ;)
For instance, I'd like an XPath query that gets to <td> regardless of whether the <tbody> node exists, with something like /table/(tbody)?/tr/td. I'd like it to work in both cases:
<table>
<tr>
<td />
</tr>
</table>
and
<table>
<tbody>
<tr>
<td />
</tr>
</tbody>
</table>
This may fail to cover more complex cases, but in this example using /table/tbody/tr/td | /table/tr/td should do the trick.
You can do:
//table/descendant::tr/td
or
//table//tr/td
depending on your taste. The double slash is a "look that up somewhere on this level or deeper" (more formally, descendant-or-self:: axis). The spec is, surprisingly, a very good read on this!
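A small check of the double-slash approach, sketched with Python's xml.etree.ElementTree (an assumption for illustration; the question does not name an XPath engine). ElementTree only supports a subset of XPath, so the relative form .//tr/td is evaluated from the table root instead of //table//tr/td:

```python
import xml.etree.ElementTree as ET

# The two documents from the question, with and without <tbody>.
without_tbody = ET.fromstring("<table><tr><td /></tr></table>")
with_tbody = ET.fromstring("<table><tbody><tr><td /></tr></tbody></table>")

for root in (without_tbody, with_tbody):
    # .// searches all descendants of the root, so an intermediate
    # <tbody> level does not matter.
    cells = root.findall(".//tr/td")
    print(len(cells))  # 1 in both cases
```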

html table syntax validation

This should be an easy one.
I have a table like so:
<table>
<tr>
<td></td><td></td><td></td><td></td>
</tr>
<tr>
<td></td>
</tr>
</table>
My Firefox 3 validator says this is acceptable code. It just seems wrong to me. Are there any possible issues with leaving the table rows uneven like this? It works in IE7 too.
You should use 'rowspan' or 'colspan' attributes
<table>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="3"></td>
</tr>
</table>
Table rows are not required to have the same number of cells. The number of columns in the table is determined from the row with most cells.
Your second table row will just have three cells that are blank (which is not the same as empty cells).
If you want rows with different numbers of cells, you should use rowspan and/or colspan attributes to indicate this.
eg:
<table>
<tr><td></td><td></td><td></td></tr>
<tr><td colspan="3"></td></tr>
</table>
As Guffa corrected me below, colspan isn't technically needed, but it never hurts to be explicit about your intent.
Well, there are no syntax errors there, and I really can't see why you should be sceptical about a table like that, as long as you use the colspan attribute of the td-element:
<table>
<tr>
<td></td><td></td><td></td><td></td>
</tr>
<tr>
<td colspan="3"></td>
</tr>
</table>
Hope that helped.
That code is fine from a structural point of view. It's valid XHTML. Compare this:
<orders>
<order id='2009/1'>
<item id='1'/><item id='2'/><item id='3'/>
</order>
<order id='2009/2'>
<item id='33'/>
</order>
</orders>
It might look strange though, hence the suggestion to use colspan. That way you can get the single TD to fill up the row, instead of being the width of the TD above it.