I know that using regex to parse HTML is not recommended, but this question is meant to help me understand regex rather than to parse HTML with it.
So I have a sample string (a single line, without any line breaks):
<tr><th> H1<th>H2 <th> H3 <tr><td> R1C1<td>R1C2 <td> R1C3 <tr><td> R2C1<td>R2C2 <td> R2C3 <tr><td> R3C1<td>R3C2 <td> R3C3 < ..
For easier reading: there is a header row plus 3 rows of 3 cells each, and the string ends with an unknown tag that is neither TR nor TD:
<tr><th> H1<th>H2 <th> H3
<tr><td> R1C1<td>R1C2 <td> R1C3
<tr><td> R2C1<td>R2C2 <td> R2C3
<tr><td> R3C1<td>R3C2 <td> R3C3
< ..
As a first try I'd like to get just the rows. Here is how I think about it, with the expected result:
start with <tr>
followed by zero or more of any character
the end starts with < followed by zero or more of any character, but not td or th
I tried with the 'basics' to see how it works.
With the pattern (<tr>.*?), why do I only get the <tr> strings, and not everything from TR to TR?
With the pattern (<tr>.*?<tr), why do I get only the 1st and 3rd rows?
I can't find any pattern that works for the end of the string. I've tried this pattern:
(<tr>.*?<(?!(td)|(th)))
..but I'm not sure whether it is good; moreover, it returns only the 1st and 3rd rows.
Here is a DEMO of what I've tried.
The pattern is:
(<tr>.*?(?=<tr))
Now the answer to the first 2 questions.
Question 1: With the pattern (<tr>.*?), why do I only get the <tr> strings, and not everything from TR to TR?
Because *? is a lazy quantifier. Being lazy, it tries to match as few characters as possible. Since zero characters already satisfy the pattern, it stops at zero.
Question 2: With the pattern (<tr>.*?<tr), why do I get only the 1st and 3rd rows?
Because the regex engine's cursor has already passed the 2nd <tr by the time the first match is complete. The second match doesn't go back, because the 2nd <tr is part of the first match.
With a lookahead ((?=<tr)) the engine does not consume the 2nd <tr in the first match. Your last pattern is almost right, but even there the < of the 2nd TR tag is consumed by the first match, so it can't be part of the second match.
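To see the difference between the patterns side by side, here is a minimal sketch in Python (the question does not name a regex flavor, so Python's re module is an assumption). The last pattern, with the lookahead (?=<(?!t[dh])), is not from the answer above; it is one possible variation that also stops at the unknown trailing tag.

```python
import re

# Sample string from the question (single line, ends with an unknown tag).
s = ("<tr><th> H1<th>H2 <th> H3 <tr><td> R1C1<td>R1C2 <td> R1C3 "
     "<tr><td> R2C1<td>R2C2 <td> R2C3 <tr><td> R3C1<td>R3C2 <td> R3C3 < ..")

# Lazy .*? is satisfied by zero characters, so each match is just "<tr>".
lazy_only = re.findall(r"<tr>.*?", s)

# The trailing "<tr" is consumed, so the next search starts after it
# and every second row is skipped.
consuming = re.findall(r"<tr>.*?<tr", s)

# The lookahead checks for "<tr" without consuming it.
# The last row is not matched because no "<tr" follows it.
lookahead = re.findall(r"<tr>.*?(?=<tr)", s)

# Assumed variation: stop before any "<" that does not start a td/th tag.
any_end = re.findall(r"<tr>.*?(?=<(?!t[dh]))", s)

print(len(lazy_only), len(consuming), len(lookahead), len(any_end))
```

With the sample string, the consuming pattern returns the header row and the R2 row, the (?=<tr) pattern returns every row except the last, and the variation returns all four.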
Related
In Chrome DevTools > Elements, when I search for //tr/td/span I find an element (because such an element exists on my page).
When I search for (//tr)/td/span or (//tr/td)/span I also find this element.
But neither //tr(/td)/span nor //tr/(td)/span nor //tr/(td/)span find anything.
What is the meaning of these parentheses in XPath?
Parentheses in XPath are used as they are in other programming languages:
Function argument grouping: e.g: //tr/td[contains(.,"e")]
Evaluation precedence indication: e.g: normal arithmetic expression grouping as well as leading path grouping (trace LocationPath through to PrimaryExpr in the XPath grammar) as in (//td)[1] to find the first td in the document as opposed to //td[1] which finds the td elements that are the first child of their respective parent elements.
They're also used in
node tests: e.g: node(), element(), ...
processing instructions: e.g: PageBreak().
Your examples that do not find anything (e.g., //tr(/td)/span, //tr/(td)/span (see Note 1), etc.) have parentheses embedded within the path that do not fall into one of the above categories. Such uses of parentheses are actually syntactically invalid and should have been reported as such rather than silently failing.
Note 1: This expression would actually be syntactically valid under XPath 2.0/3.0. Thanks, @Andersson, for noticing.
I don't think the parentheses mean anything special in your case, but they can be used to return the required node or node set depending on the index passed.
For instance, HTML is like below:
<table>
<tr>
<td>
<span>first</span>
</td>
<td>
<span>second</span>
</td>
</tr>
<tr>
<td>
<span>third</span>
</td>
<td>
<span>fourth</span>
</td>
</tr>
</table>
(//tr)[1]/td will return cells for first row only (first, second)
(//tr)[2]/td - for second row (third, fourth)
(//tr/td)[1] - first cell of first row (first). Note that //tr/td[1] will return the first cell of each row (first, third)
...
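The difference between //tr/td[1] and (//tr/td)[1] can be checked with a small Python sketch. Note the assumption: the standard library's xml.etree.ElementTree only supports a subset of XPath and does not accept a parenthesized path, so the grouped forms are emulated here by indexing the result list in Python. The helper name texts is made up for this example.

```python
import xml.etree.ElementTree as ET

# Same table as above, written compactly.
TABLE = """<table>
<tr><td><span>first</span></td><td><span>second</span></td></tr>
<tr><td><span>third</span></td><td><span>fourth</span></td></tr>
</table>"""

root = ET.fromstring(TABLE)


def texts(cells):
    """Return the span text of each cell (hypothetical helper)."""
    return [cell.findtext("span") for cell in cells]


# //tr/td[1]: the predicate applies per parent, so one cell per row.
per_row_first = texts(root.findall(".//tr/td[1]"))

# (//tr/td)[1]: the predicate applies to the whole node set.
# Emulated by taking the first item of the full result list.
overall_first = texts(root.findall(".//tr/td")[:1])

# (//tr)[2]/td: cells of the second row, again emulated by indexing.
second_row = texts(root.findall(".//tr")[1].findall("td"))

print(per_row_first, overall_first, second_row)
```

An engine with full XPath 1.0 support would accept the parenthesized expressions directly; the list indexing only mirrors what the grouping does.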
Please help me with this validation error. I can't understand what it means or what's not standards compliant with my HTML.
I'll repost it here since hopefully I'll fix it and that link will no longer work:
Table column 2 established by element td has no cells beginning in it.
…="tooltip_table"><tr><td colspan="2">20 yd range</td></tr><tr><td colspan="2"
↑
When you say colspan="2", the cell is supposed to stretch across two columns. My guess would be that there is no second column defined anywhere else in the table, thus making colspan="2" impossible (and unnecessary).
I can't find anything in the spec explicitly saying it's illegal. Maybe the table calculating algorithm quoted in that spec is different from 4.01, but it's way too late in my time zone to try and get around that :)
However, the error message makes too much sense to me to be an outright bug.
Table column 2 established by element td has no cells beginning in it.
By using colspan="2", you imply the existence of a second column, which doesn't exist in that case. Common sense tells me it is correct to nag about.
Maybe somebody can shed some light on this... Or it is, indeed, a bug.
HTML 5.2 Draft: Section 4.9.12.1 Forming a table
http://w3c.github.io/html/tabular-data.html#forming-a-table
Step 22: If there exists a row or column in the table containing only slots that do not have a cell anchored to them, then this is a table model error.
I believe it is a bug, and still unfixed. Consider this example page and run it through the W3C validator. It gives errors for "Table column 3 established by element td has no cells beginning in it.", and yet each table has 4 cells/columns, and the "colspan" of 2 is called on the second cell.
Looks like an issue with the HTML5 validator.
That error does not come up if you validate it with HTML 4.01 Transitional, and the table HTML has not changed that much in HTML5.
http://validator.w3.org/check?uri=http://www.wowpanda.net/s9712&charset=(detect+automatically)&doctype=HTML+4.01+Transitional&ss=1&outline=1&group=0&verbose=1&user-agent=W3C_Validator/1.654
Reporting it is probably a good idea
I had the same error on a dynamically created table. Depending on the input, some rows were displayed or not. Like this:
Causes no error:
<table>
<tr>
<td> cell 1 in row 1 </td>
<td> cell 2 in row 1 </td>
</tr>
<tr>
<td colspan=2> one cell in row 2 </td>
</tr>
</table>
Causes no error:
<table>
<tr>
<td> cell 1 in row 1 </td>
<td> cell 2 in row 1 </td>
</tr>
</table>
Causes an error:
<table>
<tr>
<td colspan=2> one cell in what is now the only row </td>
</tr>
</table>
Once I programmed the page to delete the colspan from the last example when the first row was not displayed, the error disappeared. Something like this:
<?php /* $firstRowDisplayed: hypothetical flag, true when the two-cell first row is shown */ if ($firstRowDisplayed) echo 'colspan=2'; ?>
I find this logical. colspan=2 with only single cells is like telling someone visiting me to turn right on a street that does not have any junctions, believing that they will continue straight on. They won't. Instead they will get hung up searching for something that is not there. Maybe not a completely accurate analogy, but you can imagine a dumb browser creating display errors while looking for stuff that you tell it is there, but is not. Browsers shouldn't be expected to "think" that maybe you meant your code differently from how you wrote it.
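The three examples above can be reproduced with a rough Python sketch of the rule "a column has no cells beginning in it". This is a simplification and not the validator's actual algorithm: it ignores rowspan, and the function name and the input format (each row as a list of colspan values) are assumptions made for the example.

```python
def columns_without_anchor(rows):
    """Return 1-based numbers of columns in which no cell begins.

    rows: list of rows, each row a list of colspan values.
    Simplified model: rowspan is ignored.
    """
    width = max(sum(row) for row in rows)
    anchored = set()
    for row in rows:
        col = 0
        for span in row:
            anchored.add(col)  # a cell begins in this column
            col += span        # the next cell starts after the span
    return [c + 1 for c in range(width) if c not in anchored]


# Two cells in row 1, one colspan=2 cell in row 2: no error.
print(columns_without_anchor([[1, 1], [2]]))
# Only the two-cell row: no error.
print(columns_without_anchor([[1, 1]]))
# Only the colspan=2 row: column 2 has no cell beginning in it.
print(columns_without_anchor([[2]]))
```

In this model the colspan=2 cell is only a problem when no other row starts a cell in the second column, which matches the behaviour described above.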
Just fixing the link for Alohci's answer.
https://w3c.github.io/html/single-page.html#forming-a-table
If there exists a row or column in the table containing only slots that do not have a cell anchored to them, then this is a table model error.
This thread is a bit old but I post this for anyone bumping into it.
Defining each column using tag removes the message and also gives the colspan something to relate to.
More info in the answer here: Why is colspan not applied as expected
If you initiate the table - it fixes the validation column errors.
If your table has 8 columns then the first row must have 8 elements, which if you are only initiating you don't want to see. The css element is:
tr.Init{border:none;}
and the following first row of an 8 column table.
The result is - you don't see the first row and your validation errors are fixed.
Having this html table:
<table class="info">
<tbody>
<tr><td class="name">Year</td><td>2011</td></tr>
<tr><td class="name">Area</td><td>45 m<sup>2</sup></td></tr>
<tr><td class="name">Condition</td><td>Renovated</td></tr>
</tbody>
</table>
I am trying to extract the data from the 2nd cell in each row (that is: 2011, 45 m, Renovated).
I use this XPath expression:
//table[@class="info"]//td[2]//text()
Received output (wrong):
2011
45 m
2
Renovated
Desired output:
2011
45 m
Renovated
As you can see, from the 2nd row I also received the value enclosed in the <sup> tags. I want to exclude this value.
I know that instead of my current XPath expression I can use this one (with one slash removed near the end):
//table[@class="info"]//td[2]/text()
It solves the problem, but I need to exclude only this specific <sup> tag inside <td>, because sometimes there are other tags inside <td> that I do not want to exclude.
So, I want to get the data from the 2nd cell in each row and exclude the value in <sup> tags.
For every tr get the second td and get the /text() (single slash) to avoid getting the element children texts. Worked for me:
//table[@class="info"]//tr/td[2]/text()
Prints:
2011
45 m
Renovated
Or, if you want to exclude sup element only:
//table[@class="info"]//tr/td[2]//text()[not(parent::sup)]
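If no full XPath engine is at hand, the same idea can be sketched with Python's standard library. This is a swapped-in technique, not the XPath above: xml.etree.ElementTree has no text() node test, so a small recursive helper (text_excluding, a name invented for this example) collects the text and skips the listed tags. Unlike not(parent::sup), it drops everything nested inside a sup element, not only its direct text.

```python
import xml.etree.ElementTree as ET

TABLE = """<table class="info">
<tbody>
<tr><td class="name">Year</td><td>2011</td></tr>
<tr><td class="name">Area</td><td>45 m<sup>2</sup></td></tr>
<tr><td class="name">Condition</td><td>Renovated</td></tr>
</tbody>
</table>"""


def text_excluding(elem, skip_tags):
    """Concatenate the text inside elem, skipping elements in skip_tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip_tags:
            parts.append(text_excluding(child, skip_tags))
        # Text after the child's closing tag still belongs to elem.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(TABLE)  # root is the table element itself
second_cells = root.findall(".//tr/td[2]")
values = [text_excluding(td, {"sup"}) for td in second_cells]
print(values)
```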
<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody>
I'm looking to find a Regex to get this value from a big text.
I tried this one but without result:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#
Regex with html is often a bad idea, because of potential recursive tags. Have you tried using an XML/HTML parser? For example, XmlDocument, XmlElement and XmlAttribute.
EDIT: The problem with regex and html in your example:
Cannot keep count of recursive tbody tags
Can the tbody tag look like <tbody>...</tbody> or <tbody .../>?
Even if you know there will be one start and end tag, how do you know there won't be any plain text containing "tbody" somewhere inside the table, thus breaking the regex?
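As an illustration of the parser route, here is a minimal sketch using Python's html.parser (the answer names .NET classes, so the choice of Python and the class name TbodyText are assumptions for this example). It tracks nested tbody tags with a depth counter and is not confused by the word "tbody" appearing as plain text.

```python
from html.parser import HTMLParser


class TbodyText(HTMLParser):
    """Collect the text found inside the tbody with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target tbody
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "tbody":
                self.depth += 1  # nested tbody inside the target
        elif tag == "tbody" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "tbody":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Made-up input for the example; the cell text mentions "tbody" on purpose.
html = ('<table><tbody id="clavier:infractionList2:tb">'
        '<tr class="rich-table-row rich-table-firstrow ">'
        '<td>tbody is just text here</td></tr></tbody>'
        '<tbody id="tbody2"><tr><td>other</td></tr></tbody></table>')

parser = TbodyText("clavier:infractionList2:tb")
parser.feed(html)
print(parser.chunks)
```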
You may want to tell your regex engine that it should match newlines with the . as well.
In PHP, that would make the regex:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#s
Note the trailing s
Warning: if there are two tbody elements, this regex will match everything from the first tbody (with this ID) to the last closing tbody tag (whatever its ID).
Example:
<tbody id="clavier:infractionList2:tb">Some data</tbody>
<tbody id="tbody2"></tbody>
will also be matched.
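The over-matching can be reproduced with a short sketch. It uses Python's re module instead of PHP (an assumption for this example); re.DOTALL plays the role of the trailing s flag. The second search uses a lazy (.*?) for comparison.

```python
import re

html = ('<tbody id="clavier:infractionList2:tb">Some data</tbody>\n'
        '<tbody id="tbody2"></tbody>')

# Greedy (.*) runs to the LAST </tbody> in the input.
greedy = re.search(
    r'<tbody id="clavier:infractionList2:tb">(.*)</tbody>',
    html, re.DOTALL).group(1)

# Lazy (.*?) stops at the FIRST </tbody> after the opening tag.
lazy = re.search(
    r'<tbody id="clavier:infractionList2:tb">(.*?)</tbody>',
    html, re.DOTALL).group(1)

print(repr(greedy))
print(repr(lazy))
```

The lazy version still breaks if the target tbody itself contains a nested tbody, since it stops at the first closing tag it sees.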
This works:
/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is
Or full PHP:
<?php
$html = '<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody> ';
preg_match_all('/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is', $html, $matches);
var_dump($matches[1]);
That gives you the <tr...>....</tr> as a result. If you only want the dots you'll need to use something like:
/<tbody id="clavier:infractionList2:tb">.*?<tr.*?>(.*?)<\/tr>.*?<\/tbody>/is
I need to get all tags between the last </td> and the closing </tr> in each row. The regular expression I use, <\/TD\s*>(.*?)<\/TR\s*>, retrieves everything from the first </TD> to the closing </TR>, as marked in bold in the sample below.
<TABLE>
<TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
But what I really need is:
<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
It's not recommended to use regular expressions to parse HTML; HTML is not a regular language, and regular expressions are therefore notoriously unreliable for it.
Here's a good blog post explaining the logic and offering alternatives:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
</TD>((?:(?!</T[DR]>).)*)</TR>
The regex starts to match at the first </TD>, but fails as soon as it reaches the second </TD> because of the (?!</T[DR]>)., which matches any character that's not the first character of a </TD> or </TR> tag. That's optional because of the enclosing (?:...)*, so it tries to match the next part of the regex, which is </TR>. That fails too, so the match attempt is abandoned.
It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.
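Since no regex flavor was given, here is a sketch in Python as one assumed flavor, where re.DOTALL is the "dot-matches-all" mode. It runs the original pattern and the pattern from this answer on the sample table.

```python
import re

table = """<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>
</TABLE>"""

# Original pattern: starts at the FIRST </TD> of each row.
original = re.findall(r"</TD\s*>(.*?)</TR\s*>", table, re.DOTALL)

# Pattern from this answer: each character is consumed only if it does
# not start a </TD> or </TR> tag, so a match can only begin at the
# LAST </TD> of a row.
tempered = re.findall(r"</TD>((?:(?!</T[DR]>).)*)</TR>", table, re.DOTALL)

print(original[0])
print(tempered)
```

The second pattern returns only the SPAN blocks, one per row, while the first also drags in the remaining TD cells.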