Regex in html code


<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody>
I'm looking for a regex to extract this block from a large text.
I tried this one, but it matches nothing:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#

Using regex on HTML is often a bad idea, because tags can be nested. Have you tried an XML/HTML parser instead? For example, XmlDocument, XmlElement and XmlAttribute.
EDIT: The problems with regex and HTML in your example:
A regex cannot keep count of nested tbody tags.
Can the tbody tag look like <tbody>...</tbody> or like <tbody .../>?
Even if you know there will be exactly one start tag and one end tag, how do you know there won't be plain text containing "tbody" somewhere inside the table, breaking the regex?
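To illustrate the parser route: the answer above names .NET classes, but the same idea can be sketched with Python's standard html.parser module. Everything here (the class name TbodyText, the helper tbody_text, and the sample markup) is made up for illustration, so treat it as one possible approach under those assumptions, not the only one.

```python
from html.parser import HTMLParser


class TbodyText(HTMLParser):
    """Collects the text found inside the <tbody> that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # how many <tbody> levels deep we are inside the target
        self.chunks = []  # non-empty text pieces found inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "tbody":
            return
        if self.depth:
            self.depth += 1  # a nested tbody: counted, which a regex cannot do
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "tbody" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Plain text that merely contains the word "tbody" is just data here.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def tbody_text(html, target_id):
    parser = TbodyText(target_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample markup, loosely modelled on the question.
sample = """
<table>
<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow "><td>first</td><td>tbody is just text here</td></tr>
</tbody>
<tbody id="tbody2"><tr><td>other</td></tr></tbody>
</table>
"""

print(tbody_text(sample, "clavier:infractionList2:tb"))
```

The second tbody and the literal word "tbody" inside a cell do not confuse the parser, which is exactly where the regex approach gets fragile.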

You may want to tell your regex engine that the . should match newlines as well.
In PHP, that would make the regex:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#s
Note the trailing s.
Warning: if there are two tbody elements, this regex will match everything from the first tbody (with this ID) through the last closing tbody tag (whatever its ID).
Example:
<tbody id="clavier:infractionList2:tb">Some data</tbody>
<tbody id="tbody2"></tbody>
will also be matched.
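The over-matching can be reproduced in a few lines. This is a small sketch in Python rather than PHP; re.DOTALL plays the role of PHP's s modifier, and the variable names are made up for the demonstration. It compares the greedy (.*) with the non-greedy (.*?) on the two-tbody example.

```python
import re

html = (
    '<tbody id="clavier:infractionList2:tb">Some data</tbody>\n'
    '<tbody id="tbody2"></tbody>'
)

# Greedy: .* runs to the end of the input, then backs up to the LAST </tbody>.
greedy = re.compile(r'<tbody id="clavier:infractionList2:tb">(.*)</tbody>', re.DOTALL)

# Non-greedy: .*? stops at the FIRST </tbody> it can reach.
lazy = re.compile(r'<tbody id="clavier:infractionList2:tb">(.*?)</tbody>', re.DOTALL)

greedy_capture = greedy.search(html).group(1)
lazy_capture = lazy.search(html).group(1)

print(repr(greedy_capture))  # 'Some data</tbody>\n<tbody id="tbody2">'
print(repr(lazy_capture))    # 'Some data'
```

Note that the non-greedy version still breaks if a tbody is nested inside the one you want, because it stops at the inner closing tag.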

This works:
/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is
Or full PHP:
<?php
$html = '<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody> ';
preg_match_all('/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is', $html, $matches);
var_dump($matches[1]);
That gives you the <tr...>....</tr> as a result. If you only want the dots you'll need to use something like:
/<tbody id="clavier:infractionList2:tb">.*?<tr.*?>(.*?)<\/tr>.*?<\/tbody>/is

Related

When doing a search I want to skip all code that ends in a closing tag

I have HTML that looks like this:
<td class="danish"> Det
tycker jag!</td>
I'm fixing the line break with this:
<td class="danish">(.*)
\s*(.*)</td>
But sometimes the HTML ends in a tag on the same line:
<td class="danish">Det tyckeg jag</td>
I want it to skip lines like these when searching and find the next broken line.
In case anyone thinks it's just a frivolous thing to make the code look good, the rest of the code looks like this (not required reading):
<td class="danish"> Det
tycker jag!</td>
<td>
<?php audioButton("../../audio//det_lyder_godt","det_lyder_godt"); ?>
I ultimately have to take the text in the table and replace the one in the audioButton call a thousand times, but that's a different problem.

I think this is what you're looking for:
(<td class="danish">(?:(?!</td>).)*)\r?\n\s*
This matches from <td class="danish"> to the next newline, unless there's a </td> tag first. Replace with "$1 " or "\1 " (without the quotes).
Using \r?\n instead of a literal newline makes the regex more robust. Even better is \R, if your regex flavor supports it.
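Here is how that search-and-replace might look in a script. It is a sketch in Python (the question does not say which tool or regex flavor is in use), and the names BROKEN_CELL and join_broken_cells are invented for the example. The sample reuses the two cells from the question.

```python
import re

# From <td class="danish"> up to the next line break, but only if no </td>
# appears first. (?:(?!</td>).)* consumes one character at a time and
# refuses to step onto the start of a closing </td> tag.
BROKEN_CELL = re.compile(r'(<td class="danish">(?:(?!</td>).)*)\r?\n\s*')


def join_broken_cells(html):
    # Put back what was captured, followed by a single space.
    return BROKEN_CELL.sub(r'\1 ', html)


sample = (
    '<td class="danish"> Det\n'
    'tycker jag!</td>\n'
    '<td class="danish">Det tyckeg jag</td>\n'
)

fixed = join_broken_cells(sample)
print(fixed)
```

The first cell is joined onto one line; the second, which already closes on the same line, is left alone.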

\s means "any whitespace character", which includes spaces and newlines. You could explicitly require a newline in the match by using something like:
<td class="danish">(.*)\n\s*(.*)</td>
Note the additional \n in the regex.

Nokogiri how to traverse every row of a table with two classes

I am attempting to parse an HTML table using Nokogiri. The table is marked up well and has no structural issues, except that the table header is embedded as an actual row instead of using <thead>. The problem is that I want every row but the first: I'm not interested in the header, only in everything that follows. Here's an example of how the table is structured.
<table id="foo">
<tbody>
<tr class="headerrow">....</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
</tbody>
</table>
I'm interested in grabbing only rows with the class row and row_alternate. However, this syntax is not legal in Nokogiri as far as I'm aware:
doc.css('.row .row_alternate').each do |a_row|
# do stuff with a_row
end
What's the best way to solve this with Nokogiri?

I would try this:
doc.css(".row, .row_alternate").each do |a_row|
# do stuff with a_row
end

A CSS selector can contain multiple components separated by comma:
A comma-separated list of selectors represents the union of all elements selected by each of the individual selectors in the list. (A comma is U+002C.) For example, in CSS when several selectors share the same declarations, they may be grouped into a comma-separated list. White space may appear before and/or after the comma.
doc.css('.row, .row_alternate').each do |a_row|
p a_row.to_html
end
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"

Try doc.at_css(".headerrow").remove and then:
doc.css("tr").each do |row|
#some code
end

One Xpath expression doesn't work in selenium, but works in Firefox

I have a question about XPath.
There is a td like this in Chrome:
<td class="dataCol col02">
"Hello world&nbsp;"
[View Hierarchy]
</td>
(Note the &nbsp; after the text.) When I inspect the same element in Firefox, it shows neither the &nbsp; nor the double quotes:
<td class="dataCol col02">
Hello world
[View Hierarchy]
</td>
When I use FireFinder with this XPath:
//td[text()='Hello world']
it can locate that element.
But when I use the Selenium API 2.24, it can't find the element:
By.xpath("//td[text()='Hello world']")
Do you have any idea why?
Thanks!

Try normalize-space(), which trims leading and trailing whitespace characters:
//td[normalize-space(text())='Hello world']
Edit, following the comments:
Here's an XPath expression that's probably better suited to the general case:
//td[starts-with(normalize-space(.), 'Hello world')]
It matches <td> nodes if the concatenated string content of the whole <td>, less leading and trailing whitespace, starts with "Hello world".
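To see why the exact comparison is brittle, here is a rough Python stand-in for XPath 1.0's normalize-space(). It is only an emulation written for this illustration, not Selenium or browser code, and the two sample strings are guesses at what the text node might contain in each browser. XPath treats only space, tab, CR and LF as whitespace, so a non-breaking space (U+00A0) survives normalization.

```python
import re

XML_WHITESPACE = " \t\r\n"  # the characters XPath 1.0 counts as whitespace


def normalize_space(text):
    """Emulates XPath normalize-space(): collapse whitespace runs, trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(XML_WHITESPACE)


# Assumed contents of the text node, based on the markup in the question.
firefox_text = "\n    Hello world\n    "
chrome_text = "\n    Hello world\u00a0\n    "  # with the trailing &nbsp;

print(normalize_space(firefox_text) == "Hello world")            # exact match works
print(normalize_space(chrome_text) == "Hello world")             # &nbsp; breaks it
print(normalize_space(chrome_text).startswith("Hello world"))    # starts-with still works
```

If the &nbsp; really is present in the page Selenium sees, that would explain why the exact comparison fails while starts-with() succeeds.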

I would try the contains() function.
Your XPath would look like: //td[contains(text(),'Hello world')]

Get last </td></tr> with regular expression?

I need to get all tags between the last </td> and the closing </tr> in each row. The regular expression I use, <\/TD\s*>(.*?)<\/TR\s*>, retrieves everything from the first </TD> to the closing </TR> (marked in bold in the sample below).
<TABLE>
<TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
But what I really need is
<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>

It's not recommended to use regular expressions to parse HTML. HTML is not a regular language, so regex-based parsing of it is notoriously unreliable.
Here's a good blog post explaining the logic and offering alternatives:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

</TD>((?:(?!</T[DR]>).)*)</TR>
The regex starts to match at the first </TD>, but fails as soon as it reaches the second </TD> because of the (?!</T[DR]>)., which matches any character that's not the first character of a </TD> or </TR> tag. That's optional because of the enclosing (?:...)*, so it tries to match the next part of the regex, which is </TR>. That fails too, so the match attempt is abandoned.
It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.
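For a concrete run, here is the same pattern applied to the sample table in Python. That is only one possible flavor, since the question doesn't name one; in Python, dot-matches-all is re.DOTALL. The names LAST_CELL_TAIL and tails are invented for this sketch.

```python
import re

# After a </TD>, consume characters one at a time, never stepping onto the
# start of another </TD> or </TR>, then require </TR>. Only the LAST </TD>
# of a row can satisfy that.
LAST_CELL_TAIL = re.compile(r'</TD>((?:(?!</T[DR]>).)*)</TR>', re.DOTALL | re.IGNORECASE)

table = """<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>
</TABLE>"""

# With a single capture group, findall returns just the captured text.
tails = LAST_CELL_TAIL.findall(table)
print(tails)
```

Each entry in tails is the markup between the last </TD> and the </TR> of one row.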

How to get a HTML element by text using XPath?

I've encountered a problem: I can't get an HTML element by its text. My HTML looks like:
...
<table>
...
<tr>
...
<td class="oMain">test value</td>
...
</tr>
...
</table>
...
For some special reasons, I have to get the '<td class="oMain">' element using its text 'test value'. I tried '//tr[td='test value']/td' but got no result. How can I write the XPath expression?
Any help is welcome. Thanks!

Your expression
//tr[td='test value']/td
places the predicate on the parent node "tr". Maybe that's what's causing the problem.
What you probably want is this:
//td[@class = "oMain" and child::text() = 'test value']
Here's a link to the W3C specification of the XPath language for further reading: http://www.w3.org/TR/xpath/

Your XPath expression seems to be correct. Do you have a default namespace (e.g. XHTML) in your html? If so, you can modify your XPath like this:
//*[local-name()='td' and text()='test value']
If you can figure out how to use namespaces, you could also do
//xhtml:tr[xhtml:td='test value']/xhtml:td
Does that help?
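The namespace effect is easy to reproduce. Below is a small sketch using Python's standard xml.etree.ElementTree, which may not be the parser the asker is using. The sample document and the prefix x are made up, and ElementTree supports only a subset of XPath, so the text comparison is done in Python rather than in the expression.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML document with a default namespace on the root element.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    '<tr><td class="oMain">test value</td><td>other</td></tr>'
    '</table></body></html>'
)

# An unqualified name matches nothing: every element is really {namespace}td.
unqualified = doc.findall(".//td")

# Binding a prefix to the namespace makes the same path work.
qualified = doc.findall(".//x:td", {"x": XHTML_NS})

# Filter on the cell text in Python.
matches = [td for td in qualified if td.text == "test value"]

print(len(unqualified))
print([td.get("class") for td in matches])
```

If the unqualified query comes back empty on your document while the qualified one finds the cells, a default namespace is the likely culprit.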

In the XPath expression, first put the element node, which in your case is td, and then apply the filter text()='text node':
//td[text()='test value']
Hope this helps.

What are you using to do the parsing? In Ruby + Hpricot, you can do
doc.search("//td.oMain").each do |cell|
if cell.inner_html == "test value"
return cell
end
end
In this case, cell would be:
<td class="oMain">test value</td>
