I am trying to get the contents of TRs on a web page that have no TRs nested inside them. The HTML is nested with many TRs.
I am limited to RegEx only for this problem.
This is good:
TR
Contents
/TR
This is not
TR
other HTML
TR
Contents
This is actually not that much of a problem with regex (assuming you can guarantee that <tr> will not show up in comments, strings etc.; otherwise the regex will mis-match):
<tr\b(?:(?!</?tr\b).)*</tr>
will only match innermost tr tags. Use the dot-matches-newlines option of your regex engine, or it won't work correctly. If you don't have one (JavaScript, I'm talking to you!), then use [\s\S] instead of the dot.
Explanation:
<tr\b # Match a tag that starts with tr
(?: # Match...
(?! # (unless it's possible to match
</?tr\b # <tr or </tr at the current position)
)
. # any character
)* # any number of times.
</tr> # Match </tr>
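To see the innermost-only behaviour in action, here is a minimal sketch in Python (the thread itself names no language, so Python is an assumption, as is the sample markup). re.DOTALL plays the role of the dot-matches-newlines option mentioned above.

```python
import re

# The pattern from the answer; re.DOTALL lets "." match newlines too.
INNERMOST_TR = re.compile(r"<tr\b(?:(?!</?tr\b).)*</tr>", re.DOTALL | re.IGNORECASE)

# Hypothetical sample: one outer row wrapping a nested table with two rows.
sample = (
    "<tr><td><table>"
    "<tr><td>inner 1</td></tr>"
    "<tr><td>inner 2</td></tr>"
    "</table></td></tr>"
)

# The outer <tr> never matches, because another <tr appears before its </tr>.
innermost_rows = INNERMOST_TR.findall(sample)
for row in innermost_rows:
    print(row)
```

Only the two inner rows are reported; the outer row is skipped because the lookahead refuses to step over the nested <tr.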
I am attempting to parse an HTML table using Nokogiri. The table is marked up well and has no structural issues except for table header is embedded as an actual row instead of using <thead>. The problem I have is that I want every row but the first row, as I'm not interested in the header, but everything that follows instead. Here's an example of how the table is structured.
<table id="foo">
<tbody>
<tr class="headerrow">....</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
<tr class="row">...</tr>
<tr class="row_alternate">...</tr>
</tbody>
</table>
I'm interested in grabbing only rows with the class row and row_alternate. However, this syntax is not legal in Nokogiri as far as I'm aware:
doc.css('.row .row_alternate').each do |a_row|
# do stuff with a_row
end
What's the best way to solve this with Nokogiri?
I would try this:
doc.css(".row, .row_alternate").each do |a_row|
# do stuff with a_row
end
A CSS selector can contain multiple components separated by comma:
A comma-separated list of selectors represents the union of all elements selected by each of the individual selectors in the list. (A comma is U+002C.) For example, in CSS when several selectors share the same declarations, they may be grouped into a comma-separated list. White space may appear before and/or after the comma.
doc.css('.row, .row_alternate').each do |a_row|
p a_row.to_html
end
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"
# "<tr class=\"row\">...</tr>"
# "<tr class=\"row_alternate\">...</tr>"
try doc.at_css(".headerrow").remove and then
doc.css("tr").each do |row|
#some code
end
<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody>
I'm looking to find a Regex to get this value from a big text.
I tried this one but without result:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#
Regex with html is often a bad idea, because of potential recursive tags. Have you tried using an XML/HTML parser? For example, XmlDocument, XmlElement and XmlAttribute.
EDIT: The problem with regex and html in your example:
Cannot keep count of recursive tbody tags
Can the tbody tag look like <tbody>...</tbody> or <tbody .../>?
Even if you know there will be one start and end tag, how do you know there won't be any plain text containing "tbody" somewhere inside the table, thus breaking the regex?
You may want to tell your regex engine that it should match newlines with the . as well.
In PHP, that would make the regex:
#<tbody id=\"clavier:infractionList2:tb\">(.*)</tbody>#s
Note the trailing s
Warning: if there are two tbody elements, this regex will match everything from the first tbody (the one with this ID) through the closing tag of the last tbody (whatever its ID).
Example:
<tbody id="clavier:infractionList2:tb">Some data</tbody>
<tbody id="tbody2"></tbody>
will also be matched.
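The over-matching described in this warning is easy to reproduce. The sketch below uses Python rather than PHP (an assumption made for illustration; re.DOTALL corresponds to the trailing s modifier) and compares the greedy (.*) with a non-greedy (.*?) on the two-tbody example.

```python
import re

# The two-tbody example from the warning above.
html = (
    '<tbody id="clavier:infractionList2:tb">Some data</tbody>\n'
    '<tbody id="tbody2"></tbody>'
)

# Greedy: (.*) runs to the LAST </tbody> in the input.
greedy = re.search(
    r'<tbody id="clavier:infractionList2:tb">(.*)</tbody>', html, re.DOTALL
)
# Non-greedy: (.*?) stops at the FIRST </tbody> after the opening tag.
lazy = re.search(
    r'<tbody id="clavier:infractionList2:tb">(.*?)</tbody>', html, re.DOTALL
)

print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
```

The greedy capture swallows the second tbody's opening tag, while the non-greedy capture contains only "Some data". Neither variant copes with tbody elements nested inside the target one.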
This works:
/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is
Or full PHP:
<?php
$html = '<tbody id="clavier:infractionList2:tb">
<tr class="rich-table-row rich-table-firstrow ">
..............
..............
............
</tr>
</tbody> ';
preg_match_all('/<tbody id="clavier:infractionList2:tb">(.*?)<\/tbody>/is', $html, $matches);
var_dump($matches[1]);
That gives you the <tr...>....</tr> as a result. If you only want the dots you'll need to use something like:
/<tbody id="clavier:infractionList2:tb">.*?<tr.*?>(.*?)<\/tr>.*?<\/tbody>/is
Firstly I would like to say to the more experienced people than myself that it has to be done in regex. No access to a DOM parser due to weird situation.
So I have a full HTML/XHTML string and would like to strip everything from it except the links. Basically just the <a> tags are important. I need the tags to keep their information fully, so href, target, class, etc., and it should work whether it's a self-terminating tag or has a separate end tag, i.e. <a /> or <a></a>.
Thanks for any HELP guys!
Of course you have the possibility to parse HTML in a Firefox extension. Have a look at HTML to DOM, especially the second and third way.
It might seem to be more complex, but it is less error prone than a regular expression.
As soon as you have a reference to the parsed content, all you have to do is to call ref.getElementsByTagName('a') and you are done.
result = subject.match(/<a[^<>]*?(?:\/>|>(?:(?!<\/a>).)*<\/a>)/ig);
gets you an array of all <a> tags in the HTML source (even self-closed tags which are illegal but which you specifically asked for). Is that sufficient?
Explanation:
<a # Match <a
[^<>]*? # Match any characters besides angle brackets, as few as possible
(?: # Now either match
/> # /> (self-closed tag)
| # or
> # a closing angle bracket
(?: # followed by...
(?!</a>) # (if we're not at the closing tag)
. # any character
)* # any number of times
</a> # until the closing tag
)
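For readers who want to try the pattern outside JavaScript, here is a rough Python equivalent (Python and the sample string are assumptions for illustration). The i flag becomes re.IGNORECASE, and findall stands in for the g flag.

```python
import re

# Same pattern as the JavaScript answer, minus the /.../ delimiters.
A_TAG = re.compile(r"<a[^<>]*?(?:/>|>(?:(?!</a>).)*</a>)", re.IGNORECASE)

# Hypothetical input with one normal link and one self-closed link.
html = (
    '<p>See <a href="/one" target="_blank">one</a>, '
    '<a class="x" href="/two"/> and <b>bold</b>.</p>'
)

links = A_TAG.findall(html)
for link in links:
    print(link)
```

Note that the pattern starts with a bare <a, so a tag such as <abbr> would also begin a match; adding \b after the a is one way to tighten it. Like the original, this version has no dot-matches-newlines flag, so link text spanning several lines is not matched.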
the regex will look something like this
/\<\a.*[\/]{0,1}>(.*<\/\a>){0,1}/gm
I'm trying to construct a regular expression to look for empty html tags that may have embedded JSTL. I'm using Perl for my matching.
So far I can match any empty html tag that does not contain JSTL with the following:
/<\w+\b(?!:)[^<]*?>\s*<\/\w+/si
The \b(?!:) will avoid matching an opening JSTL tag, but that doesn't address whether JSTL may be within the HTML tag itself (which is allowable). I only want to know if this HTML tag has no children (only whitespace or empty). So I'm looking for a pattern that would match both of the following:
<div id="my-id">
</div>
<div class="<c:out var="${my.property}" />"></div>
Currently the first div matches. The second does not. Is it doable? I tried several variations using lookahead assertions, and I'm starting to think it's not. However, I can't say for certain or articulate why it's not.
Edit: I'm not writing something to interpret the code, and I'm not interested in using a parser. I'm writing a script to point out potential issues/oversights. And at this point, I'm curious, too, to see if there is something clever with lookaheads or lookbehinds that I may be missing. If it bothers you that I'm trying to "solve" a problem this way, don't think of it as looking for a solution. To me it's more of a challenge now, and an opportunity to learn more about regular expressions.
Also, if it helps, you can assume that the html is xhtml strict.
Try
<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>
A short explanation:
< # match a '<'
(\w+) # match one or more a-z, A-Z, 0-9 or '_' and store it in group 1
(?: # open non-matching-group 1
\s+ # match one or more white space characters
\w+ # match one or more a-z, A-Z, 0-9 or '_'
=" # match '="'
[^"]+ # match one or more characters other than '"'
(?: # open non-matching-group 2
"\$ # match '"$'
[^"]+ # match one or more characters other than '"'
" # match '"'
[^"]+ # match one or more characters other than '"'
)? # close non-matching-group 2, and make it optional
" # match '"'
)* # close non-matching-group 1, and make repeat itself zero or more times
> # match '>'
\s* # match zero or more white space characters
</\1> # match '</X>' where `X` is what is captured in group 1
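A quick way to check the pattern against the two examples from the question is sketched below in Python (an assumption; the question uses Perl, but the pattern uses no Perl-specific syntax). The helper name has_empty_tag and the third sample are hypothetical.

```python
import re

# Pattern from the answer above, unchanged. The optional inner group allows
# one quoted ${...} expression inside an attribute value.
EMPTY_TAG = re.compile(r'<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>')


def has_empty_tag(fragment):
    """Return True if the fragment contains a tag holding only whitespace."""
    return EMPTY_TAG.search(fragment) is not None


plain = '<div id="my-id">\n</div>'
with_jstl = '<div class="<c:out var="${my.property}" />"></div>'
with_child = '<div id="x"><span>child</span></div>'  # hypothetical non-empty case

for fragment in (plain, with_jstl, with_child):
    print(has_empty_tag(fragment), fragment.replace("\n", " "))
```

Both examples from the question are reported as empty, and the div with a child element is not. The pattern only tolerates a single embedded expression per attribute, so other JSTL constructs may still slip through.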
This works for both your examples, but I am sure someone can construct html that you want to match but that will not be matched by the regex. After reading your 'edit', though, it seems you are aware of that.
It's not a good idea to use regexes for HTML as there are many constructs that cannot be matched by most regex systems. Also much HTML (as opposed to XHTML) has many difficult constructs. Suggest you use an HTML parser. (This has been frequently addressed on SO and the universal answer is: don't use regex.)
Using an HTML parser doesn't mean you're interpreting or running the content: it means you are transforming it from a string of characters into a nested object. HTML is not regular, so regular expressions aren't the best solution to this problem.
See the docs for HTML::TreeBuilder as a good place to start. Other good resources include HTML::Parser and of course this site. :)
Edit: I'll pretend that your question has nothing to do with HTML and is just an interesting regex puzzle, and as such will ponder it... ...[still thinking.. edit coming] (puzzle abandoned in the face of a really awesome solution presented above)
If you assume that your input is valid XML, as you say, my tool of choice would be XML::Twig.
Based on what I have read, I believe (?: is a non-capturing group, not a non-matching group, so the comments on the regex should be changed.
A non-matching group would be (?!
I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
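The same token-filtering idea carries over to other languages. As a rough sketch, here it is with Python's standard-library html.parser instead of HTML::TokeParser (a swapped-in library chosen for illustration, not the answer's own code); the class name KeepOnlyP and the function keep_only_p are hypothetical.

```python
from html.parser import HTMLParser


class KeepOnlyP(HTMLParser):
    """Re-emit text and <p> tags; drop every other start or end tag."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers untouched.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # get_starttag_text() returns the tag as written in the source.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("</p>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def keep_only_p(html):
    parser = KeepOnlyP()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(keep_only_p('<div><p class="x">Hi <b>there</b></p><pre>code</pre></div>'))
```

Because the parser reports tag names rather than raw text, tags such as <pre> or <param> are dropped without any special casing, which is the main advantage over the regex variants.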
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
I used Xetius's regex and it works fine, except for some flex generated tags which can be : with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from flex generated html text so i also added more excepted tags :
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
See this live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
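The "capture what you want to keep, match everything else" trick from the Perl snippet translates directly. Below is a minimal Python sketch (Python is an assumption here, and strip_except_p is a hypothetical helper name): the replacement callback returns group 1 when the kept alternative matched and an empty string otherwise.

```python
import re

# Left alternative captures <p ...> and </p>; right alternative matches
# every other tag without capturing it.
TAG = re.compile(r"(</?p[^>]*>)|<[^>]*>")


def strip_except_p(subject):
    # group(1) is None when the right-hand alternative matched.
    return TAG.sub(lambda m: m.group(1) or "", subject)


print(strip_except_p('<b>x</b><p id="a">y</p>'))
```

As written, the left alternative also keeps any tag whose name merely starts with p, such as <pre>; adding \b after the p narrows it to real p tags.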
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
Since HTML is not a regular language
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.
Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
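The backtracking behaviour is easy to observe. The Python sketch below is an illustration under assumptions: Python only gained (?>...) in version 3.11, so the atomic group is emulated here with the lookahead-plus-backreference idiom (?=(/?))\1, which behaves the same way because a lookahead is never re-entered once it has succeeded. The slash is written as plain / rather than escaped.

```python
import re

# Without an atomic group: "/?" can give back the slash on backtracking.
NAIVE = re.compile(r"</?(?!p).+?>")
# (?=(/?))\1 emulates (?>/?): the slash, once taken, is never released.
ATOMIC = re.compile(r"<(?=(/?))\1(?!p).+?>")

sample = "<div><p>text</p></div>"

# finditer + group(0) is used because ATOMIC contains a capturing group.
naive_hits = [m.group(0) for m in NAIVE.finditer(sample)]
atomic_hits = [m.group(0) for m in ATOMIC.finditer(sample)]

print(naive_hits)
print(atomic_hits)
```

The naive pattern still reports </p>, because after (?!p) fails the engine retries with /? matching nothing and the lookahead then sees the slash. The emulated atomic version reports only the div tags.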
(That said I agree that generally parsing HTML with regexes is not the way to go).
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except "p", followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me because all the solutions above failed for other html tags starting with p such as param pre progress, etc. It also takes care of the html attributes too.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this, is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
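One way to sketch that replacement step, in Python rather than Perl (an assumption for illustration; strip_p_attributes is a hypothetical helper name):

```python
import re

# \b keeps the pattern from touching <pre>, <param> and other p-prefixed tags.
P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)


def strip_p_attributes(html):
    """Replace every opening <p ...> tag with a bare <p>."""
    return P_OPEN.sub("<p>", html)


risky = '<p onclick="document.location.href=\'http://www.evil.com\'">Clickable text</p>'
print(strip_p_attributes(risky))
```

This assumes no attribute value contains a literal > character; if one does, the tag is cut short at that point, which is one more reason to prefer a parser for untrusted input.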