$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";
I need to extract "Alpha-Seeking" and "No Underlying Index ," from the above 2 strings.
Basically, need everything from ('>) to the last character of the string.
Tried two ways,
1) The standard intuitive
if ($string1=~ /\'>(.*?)/) {print "got $1";}
but this does not seem to work on '>' symbol.
2) Also tried
if ($string1=~ /(?=>)(.*?)/) {print "got $1";}
based on inputs from Greater than and less than symbol in regular expressions, but it is not working.
Any inputs will be useful.
PS: Also, if the answer can include matching the "less than" symbol ("<"), that will be great!
Thanks
Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.
For example:
<tag>
outer
<tag>
middle
<tag>inner</tag>
middle
</tag>
outer
</tag>
Instead, use an HTML parser and search tools such as XPath.
Here is a demonstration using XML::LibXML.
use strict;
use warnings;
use v5.10;
use XML::LibXML;
my $html = q{
<html>
<body>
<a href='/channels/folder1'>Alpha-Seeking</a>
<a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};
# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);
# Find all links.
for my $node ($dom->findnodes('//a')) {
# Print their text.
say $node->textContent;
}
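For readers outside Perl, the same parser-first idea can be sketched with Python's standard-library html.parser. This is only an illustration, not a port of XML::LibXML; the class name LinkTextCollector is made up for this example.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.link_texts = []
        self._buffer = None  # None while we are outside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.link_texts.append("".join(self._buffer).strip())
            self._buffer = None


html_doc = """
<html><body>
<a href='/channels/folder1'>Alpha-Seeking</a>
<a href='/channels/folder2'>No Underlying Index</a>
</body></html>
"""

collector = LinkTextCollector()
collector.feed(html_doc)
collector.close()
print(collector.link_texts)
```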
I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.
Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.
Here's what you have:
if ($string1=~ /\'>(.*?)/) {print "got $1";}
And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. The simplest thing that .*? can capture is nothing - the empty string.
Regexes are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don't want that here. Here, you want their greediness. So just remove that ?.
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings) {
if ($string =~ /'>(.*)/) { # Note: No "?" here
print "got $1\n";
}
}
This displays:
got Alpha-Seeking
got No Underlying Index ,
This works for me
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings)
{
if ($string =~ /'>(.*?)$/)
{
print "got $1\n";
}
}
running it gives
$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
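The difference between the three patterns discussed so far (lazy, greedy, and lazy anchored with $) is not specific to Perl. A small Python sketch, using the first string from the question, shows what each one captures:

```python
import re

string1 = "<a href='/channels/folder1'>Alpha-Seeking"

# Lazy quantifier with nothing after it: the shortest possible match is empty.
lazy = re.search(r"'>(.*?)", string1).group(1)

# Greedy quantifier: runs to the end of the string.
greedy = re.search(r"'>(.*)", string1).group(1)

# Lazy quantifier followed by $: forced to extend until the end of the string.
anchored = re.search(r"'>(.*?)$", string1).group(1)

print(repr(lazy), repr(greedy), repr(anchored))
```

So the '>' was matching all along; the capture was simply empty.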
While exploring various options, I managed to get this working with the following:
Replace the greater than sign with some other generic symbol (like a pipe)
$string=~ s/>/\|/g; #Interestingly, '>' matches here without any issues
After that, split on the pipe char, and print/parse the second part:
($o1,$o2) = split(/\|/, $string);
print "$o2|";
Works perfectly as a work-around.
I'm using the HTML::Strip module to remove all HTML tags from a file. I then want to manipulate the resulting text (stripped of HTML) and finally return the HTML tags to their original positions.
The text manipulation I'm doing requires breaking the text into arrays using split(/ /, $text). I then do some natural language processing of the resulting arrays (including adding new html tags to some key words). Once I'm finished processing the text, I'd like to return the original tags to their places while keeping the text manipulations I've done in the meantime intact.
I would be satisfied if I could simply remove all whitespace from within the original tags (since whitespace within tags is ignored by browsers). That way my NLProcessing could simply ignore words that are tags (contain '<' or '>').
I've tried diving into the guts of HTML::Strip (in an effort to modify it to my needs), but I can't understand what the following piece of code does:
my $stripped = $self->strip_html( $text );
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Seems like strip_html is a sub, but I can't find that sub anywhere.
Anyway thanks for any and all advice.
... the next day...
After a bit of back and forth with @amon, I have come up with a solution that I believe is sufficient for my purposes. amon pushed me in the right direction even though he recommended I not do what I've done anyway, haha.
It is a brutish method, but gets the job done satisfactorily. Gonna leave it here in the off chance that someone else has the same wishes as me and doesn't mind a quick and dirty solution:
my $input = text.html;
my $stripped = $hs->parse($input);
$hs->eof;
so now I have two string variables. One is the html file I want to manipulate, and the other is the same file stripped of html.
my @marks = split(/\s/, $stripped);
@marks = uniq(@marks);
Now I have a list of all non-HTMLtag-associated words that appear in my file.
$input = HTML::Entities::decode($input);
$input =~ s/\</ \</g;
$input =~ s/\>/\> /g;
$input =~ s/\n/ \n /g;
$input =~ s/\r/ \r /g;
$input =~ s/\t/ \t /g;
Now I've decoded my HTML containing var and have ensured that no word is up against a "<", or ">" or non-space whitespace character.
foreach my $mark(@marks) { $input =~ s/ \Q$mark\E / TAQ\+$mark\TAQ /g; }
$input =~ s/TAQ\+TAQ//g;
Now I've "tagged" each word with a "+" and have separated words from non-words with the TAQ delimiter. I can now split on TAQ and ignore any item that does not contain a "+" when performing my NLP and text manipulation. Once I'm done, I rejoin and strip all of the "+". Follow that with some clever encoding, remove all the extra spaces I inserted, and BAM! I've now got my NLProcessing completed, have manipulated the text, and still have all of my HTML in the right places.
There are a lot of caveats here, and I'm not going to go into all of them. Most problematic is the need to decode and then encode, coupled with the fact that HTML::Strip doesn't always strip all the javascript or invalid HTML. There are ways to work around that, but again I don't have room or time to discuss that here.
Thanks amon for your help, and I welcome any criticism or suggestions. I'm new to this.
The module HTML::Strip uses the XS glue language to connect the Perl code with C code. You can find the XS file e.g. on (meta-)cpan. It includes a file strip_html.c that implements the actual algorithm. Due to the definitions in the XS file, a strip_html sub is available in the Perl code as part of the HTML::Strip package. Therefore, it can be invoked as a method on an appropriate object.
Explanation of that piece of code
my $stripped = $self->strip_html( $text );
This will invoke the C function on the contents of $text to strip all the HTML tags. The stripped data will then be assigned to $stripped.
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Suffixing variable names with -p is a lispish tradition to indicate boolean variables (or predicates, in mathematics). Here, it indicates if HTML::Entities could be loaded: my $_html_entities_p = eval 'require HTML::Entities';. If the configuration option decode_entities was set to a true value, and HTML::Entities could be loaded, then entities will be decoded in the stripped data.
Example: given the input
<code> $x &lt; $y </code>
then stripping would produce
$x &lt; $y
and decoding the entities would make it
$x < $y
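The two stages (strip the tags, then decode the entities) can be mimicked in a few lines of Python. Note the assumptions: the one-line regex below is only a stand-in for the C routine in strip_html.c, not its actual algorithm, and html.unescape plays the role of HTML::Entities::decode. The input is the example above, with the less-than sign written as the entity &lt;.

```python
import html
import re

raw = "<code> $x &lt; $y </code>"

# Stage 1: drop the tags (naive stand-in for HTML::Strip's C implementation).
stripped = re.sub(r"<[^>]*>", "", raw)

# Stage 2: decode entities, as done when decode_entities is enabled.
decoded = html.unescape(stripped)

print(repr(stripped))
print(repr(decoded))
```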
I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on how many opening brackets ( come before it.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
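To see the three modifiers in action outside Perl, here is a rough Python equivalent of the regex from the question, where re.S, re.I and re.X correspond to s, i and x. The sample chunk is made up for the illustration (upper-case tags, to exercise the i flag) and starts after the <dt>, as it would after the split in the script.

```python
import re

POST_RE = re.compile(
    r'''<a\sname="
        ([^"]*)                   # anchor name
        ">
        <strong>
        ([^>]*)                   # post title
        </strong></a></dt>\s*<dd>
        (.*)                      # body of post
        </dd>''',
    re.S | re.I | re.X,
)

chunk = '''<A NAME="2004-10-25"><STRONG>October 25th</STRONG></A></DT>
<DD>
<P>
[Content]
</P>
</DD>'''

match = POST_RE.search(chunk)
print(match.group(1), "|", match.group(2))

# Without the "s" flag, "." cannot cross the newlines in the body,
# so the same pattern no longer matches.
without_s = re.compile(POST_RE.pattern, re.I | re.X).search(chunk)
print(without_s)
```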
[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". This is surrounded by quotes, forming a quoted string, the kind that follows <a name=
[^>]* is similar to the above, it means any string that doesn't contain >. Note here that you probably mean [^<], to match until the opening < for the next tag, not including the actual opening.
that's a collection of php specific regexp flags. I know i means case insensitive, not sure about the rest.
The code is an extended regex. It allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise like normal.
Same.
The characters are regex modifiers.
I'm trying to extract the attributes of a anchor tag (<a>). So far I have this expression:
(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+
which works for strings like
<a href="test.html" class="xyz">
and (single quotes)
<a href='test.html' class="xyz">
but not for a string without quotes:
<a href=test.html class=xyz>
How can I modify my regex making it work with attributes without quotes? Or is there a better way to do that?
Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.
Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note that regex101.com did not exist when I originally wrote this answer)
<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?
Update 2021 bis: Dave proposes in the comments, to take into account an attribute value containing an equal sign, like <img src="test.png?test=val" />, as in this regex101:
(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Update (2020): Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note that regex101.com did not exist when I originally wrote this answer)
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Applied to:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
Original answer (2008):
If you have an element like
<name attribute=value attribute="value" attribute='value'>
this regex could be used to find successively each attribute name and value
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Applied on:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
it would yield:
'href' => 'test.html'
'class' => 'xyz'
Note: This does not work with single-character attribute values, e.g. <div id="1"> yields "1 (with the leading quote) instead of 1.
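A quick way to check the original (2008) expression is to run it over the sample tags. The sketch below uses Python's re module, which accepts this pattern unchanged; the variable names are only for the illustration. It also shows what happens with a one-character value.

```python
import re

# The 2008 expression, copied verbatim.
ATTR_RE = re.compile(r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?''')

samples = [
    '<a href=test.html class=xyz>',
    '<a href="test.html" class="xyz">',
    '''<a href='test.html' class="xyz">''',
]

for tag in samples:
    print(tag, ATTR_RE.findall(tag))

# The value group needs at least two characters, so for a one-character
# quoted value the opening quote ends up inside the capture.
print(ATTR_RE.findall('<div id="1">'))
```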
Edited: Improved regex for getting attributes with no value and values with " ' " inside.
([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?
Applied on:
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
it would yield:
'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:
/
\G # start where the last match left off
(?> # begin non-backtracking expression
.*? # *anything* until...
<[Aa]\b # an anchor tag
)?? # but look ahead to see that the rest of the expression
# does not match.
\s+ # at least one space
( \p{Alpha} # Our first capture, starting with one alpha
\p{Alnum}* # followed by any number of alphanumeric characters
) # end capture #1
(?: \s* = \s* # a group starting with a '=', possibly surrounded by spaces.
(?: (['"]) # capture a single quote character
(.*?) # anything else
\2 # which ever quote character we captured before
| ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
) # end group
)? # attribute value was optional
/msx;
"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)
(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.
Token Mantra response: you should not tweak/modify/harvest/or otherwise produce HTML/XML using regular expressions.
There are too many corner-case conditionals, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the dozens of other tried and tested tools for this job instead of inventing your own.
I don't really care which one you use, as long as its recognized, tested, and you use one.
my $foo = Someclass->parse( $xmlstring );
my @links = $foo->getChildrenByTagName("a");
my @srcs = map { $_->getAttribute("src") } @links;
# @srcs now contains an array of src attributes extracted from the page.
You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.
So either don’t use named captures:
(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+
Or don’t use the quantifier on this expression:
(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)
This does also allow attribute values like bar=' baz='quux:
foo="bar=' baz='quux"
Well the drawback will be that you have to strip the leading and trailing quotes afterwards.
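As a rough illustration of the second variant (named captures, no quantifier) together with the quote stripping afterwards, here is a Python sketch. Python spells named groups (?P<name>...), and extract_attributes is a hypothetical helper name.

```python
import re

ATTR_RE = re.compile(
    r'''(?P<name>\b\w+\b)\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^"'<>\s]+)'''
)


def extract_attributes(tag):
    """Return a dict of attribute names to values, quotes removed."""
    attrs = {}
    for m in ATTR_RE.finditer(tag):
        value = m.group("value")
        # A quoted alternative always ends with the quote it started with.
        if value[:1] in ('"', "'"):
            value = value[1:-1]
        attrs[m.group("name")] = value
    return attrs


print(extract_attributes('<a href=test.html class=xyz>'))
print(extract_attributes('''<a href='test.html' class="xyz">'''))
print(extract_attributes('''foo="bar=' baz='quux"'''))
```

Iterating over the matches replaces the outer + quantifier, which is why each name/value pair stays available.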
Just to agree with everyone else: don't parse HTML using regexp.
It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.
There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.
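As one example of such a library, Python's standard html.parser already copes with unquoted, single-quoted and double-quoted values. A minimal sketch follows; the class name AnchorAttributeCollector is invented for this example, and other languages have comparable parsers.

```python
from html.parser import HTMLParser


class AnchorAttributeCollector(HTMLParser):
    """Record the attributes of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, quotes removed.
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorAttributeCollector()
collector.feed(
    '<a href=test.html class=xyz>'
    '<a href="test.html" class="xyz">'
    "<a href='test.html' class=\"xyz\">"
)
collector.close()
print(collector.anchors)
```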
PHP (PCRE) and Python
Simple attribute extraction (See it working):
((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))
Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):
(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)
(Works better with the "gisx" flags.)
Javascript
As Javascript regular expressions don't support look-behinds, it won't support most features of the previous expressions I propose. But in case it might fit someone's needs, you could try this version. (See it working).
(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
This is my best RegEx to extract properties in HTML Tag:
# Trim the match inside of the quotes (single or double)
(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2
# Without trim
(\S+)\s*=\s*([']|["])([\W\w]*?)\2
Pros:
You are able to trim the content inside of quotes.
Match all the special ASCII characters inside of the quotes.
If you have title="You're mine" the RegEx does not break
Cons:
It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.
This is the online RegEx example:
https://regex101.com/r/aVz4uG/13
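For anyone who prefers to try the two variants locally, here is a small Python sketch. The patterns are copied as-is; the padded sample tag is made up to show the effect of trimming.

```python
import re

TRIMMED = re.compile(r'''(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2''')
UNTRIMMED = re.compile(r'''(\S+)\s*=\s*([']|["])([\W\w]*?)\2''')

# An apostrophe inside a double-quoted value does not end the match.
groups = TRIMMED.search('<div title="You\'re">').groups()
print(groups)

# Leading and trailing spaces inside the quotes are dropped only by TRIMMED.
padded = '<div title="  padded  ">'
trimmed_value = TRIMMED.search(padded).group(3)
untrimmed_value = UNTRIMMED.search(padded).group(3)
print(repr(trimmed_value), repr(untrimmed_value))
```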
I normally use this RegEx to extract the HTML Tags:
I recommend this if you don't use a tag type like <div, <span, etc.
<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
For example:
<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">
This is the online RegEx example:
https://regex101.com/r/aVz4uG/15
The bug in this RegEx is:
<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
In this tag:
<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>
Returns <div '> but it should not return any match:
Match: <div '>
To "solve" this remove the [^/]+? pattern:
<div(?:\".*?\"|'.*?'|.*?)*?>
Answer #317081 is good, but it does not match properly in these cases:
<div id="a"> # It returns "a instead of a
<div style=""> # It does not match at all instead of returning an empty value
<div title = "c"> # It does not recognize the spaces around the equals sign (=)
This is the improvement:
(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?
vs
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Allow spaces around the equals sign:
(\S+)\s*=\s*((?:...
Change the last + and . for:
|[>"']))?[^"']*)["']?
This is the online RegEx example:
https://regex101.com/r/aVz4uG/8
Tags and attributes in HTML have the form
<tag
attrnovalue
attrnoquote=bli
attrdoublequote="blah 'blah'"
attrsinglequote='bloob "bloob"' >
To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:
attr(?=(attr)*\s*/?\s*>)
The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?
Unimportant groups are made non-capturing. The first matching group $1 gives you the name of the attribute; the value is one of $2 or $3 or $4. I use $2$3$4 to extract the value.
The final regex is
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)
Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.
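To make the construction concrete, the final regex can be assembled from the attr pattern and its non-capturing twin. The Python sketch below does that; the names ATTR, ATTR_NOCAP and attributes are made up for the example. It also checks that attribute-like text outside a tag is ignored.

```python
import re

# attr with capturing groups: name, then double-quoted, single-quoted or bare value.
ATTR = r'''\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?'''
# The same pattern without captures, for use inside the lookahead.
ATTR_NOCAP = r'''\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?'''

TAG_ATTR_RE = re.compile(ATTR + r"(?=(?:" + ATTR_NOCAP + r")*\s*/?\s*>)")


def attributes(html):
    """Return (name, value) pairs; valueless attributes get an empty string."""
    return [
        (m.group(1), m.group(2) or m.group(3) or m.group(4) or "")
        for m in TAG_ATTR_RE.finditer(html)
    ]


tag = """<tag
 attrnovalue
 attrnoquote=bli
 attrdoublequote="blah 'blah'"
 attrsinglequote='bloob "bloob"' >"""

print(attributes(tag))
# "b=c" sits in the text, not in a tag, so only class is reported.
print(attributes('<p class="x">a b=c d</p>'))
```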
splattne,
@VonC's solution partly works, but there is an issue if the tag has a mix of unquoted and quoted attributes.
This one works with mixed attributes
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
to test it out
<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
$code = ' <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/> ';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$ms would then contain the keys in its 2nd element and the values in its 4th element (the 3rd holds the opening quote, if any).
$keys = $ms[1];
$values = $ms[3];
I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
something like this might be helpful
'(\S+)\s*?=\s*([\'"])(.*?|)\2
If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?
I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.
If you're in .NET I recommend the HTML Agility Pack, which is very robust even with malformed HTML.
Then you can use XPath.
I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainability you are about to shoot yourself in both feet.
My adaptation to also get the boolean attributes and empty attributes like:
<input autofocus='' disabled />
/(\w+)=["']((?:.(?!["']\s+(?:\S+)=|\s*\/[>"']))+.)["']|(\w+)=["']["']|(\w+)/g
I also needed this and wrote a function for parsing attributes, you can get it from here:
https://gist.github.com/4153580
(Note: It doesn't use regex)
I have created a PHP function that could extract attributes of any HTML tags. It also can handle attributes like disabled that has no value, and also can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:
/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
$matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
$results = array(
'element' => $matches[2],
'attributes' => null,
'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
);
if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
$results['attributes'] = array();
foreach($attrs[1] as $i => $attr) {
$results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
}
}
return $results;
}
Test Code
$test = array(
'<div class="foo" id="bar" data-test="1000">',
'<div>',
'<div class="foo" id="bar" data-test="1000">test content</div>',
'<div>test content</div>',
'<div>test content</span>',
'<div>test content',
'<div></div>',
'<div class="foo" id="bar" data-test="1000"/>',
'<div class="foo" id="bar" data-test="1000" />',
'< div class="foo" id="bar" data-test="1000" />',
'<div class id data-test>',
'<id="foo" data-test="1000">',
'<id data-test>',
'<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);
foreach($test as $t) {
var_dump($t, extract_html_attributes($t));
echo '<hr>';
}
This works for me. It also takes into consideration some edge cases I have encountered.
I am using this Regex for XML parser
(?<=\s)[^><:\s]*=*(?=[>,\s])
Extract the element:
var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]
Then use jQuery to parse and extract the bit you want:
$(htmlStr).attr('style')
have a look at this
Regex & PHP - isolate src attribute from img tag
perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body-tag
I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
&gt;
</title>
</head>
<body>
<p>
&gt;
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
I used Xetius's regex and it works fine. Except for some flex generated tags which can be : with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working :
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from flex generated html text so i also added more excepted tags :
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
See this live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
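The same "match what you want to keep first, then everything else" trick carries over to other engines. A rough Python version, where a replacement function stands in for Perl's s///e (the helper name strip_tags_except_p is made up):

```python
import re

# Group 1 captures <p ...> and </p>; the second alternative matches any other tag.
# As in the original, <\/?p[^>]*> would also keep tags such as <pre>.
TAG_RE = re.compile(r"(</?p[^>]*>)|<[^>]*>")


def strip_tags_except_p(html):
    """Remove every tag except those captured by group 1."""
    return TAG_RE.sub(lambda m: m.group(1) or "", html)


subject = ('Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> '
           '<p class="blue">second</p>')
replaced = strip_tags_except_p(subject)
print(replaced)
```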
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
Since HTML is not a regular language
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.
Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
(That said I agree that generally parsing HTML with regexes is not the way to go).
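The backtracking problem is easy to reproduce. The Python sketch below deliberately avoids the atomic group (Python's re only supports (?>...) from version 3.11) and instead moves the optional slash inside the lookahead. That is a different fix from the one above, shown here only as an assumption-level alternative with the same effect on these inputs.

```python
import re

# Original idea: optional slash, then "not p". The engine gives the slash
# back, so the lookahead is tested against "/" and </p> still matches.
leaky = re.compile(r"</?(?!p).+?>")

# Alternative: put the optional slash inside the lookahead, so there is
# nothing for the engine to give back.
anchored = re.compile(r"<(?!/?p).+?>")

for tag in ("<p>", "</p>", "</div>"):
    print(tag, bool(leaky.search(tag)), bool(anchored.search(tag)))
```

Like the original pattern, both variants also skip other tags that begin with p, such as <pre>.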
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me because all the solutions above failed for other html tags starting with p such as param pre progress, etc. It also takes care of the html attributes too.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this, is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.