I'm new to Perl and have a few regex questions - html

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!

how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on the number of brackets ( in we are.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.

[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". This is surrounded by quotes making forming a quoted string, the kind that follows <a name=
[^>]* is similar to the above, it means any string that doesn't contain >. Note here that you probably mean [^<], to match until the opening < for the next tag, not including the actual opening.
that's a collection of php specific regexp flags. I know i means case insensitive, not sure about the rest.

The code is an extended regex. It allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise like normal.
Same.
The characters are regex modifiers.

Related

perl - matching greater than charater in regex

$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";
I need to extract "Alpha-Seeking" and "No Underlying Index ," from the above 2 strings.
Basically, need everything from ('>) to the last character of the string.
Tried two ways,
1) The standard intuitive
($string1=~ /\'>(.*?)/) {print "got $1";}
but this does not seem to work on '>' symbol.
2) Also tried
if ($string1=~ /(?=>)(.*?)/) {print "got $1";}
based on inputs from Greater than and less than symbol in regular expressions, but it is not working.
Any inputs will be useful.
PS: Also, if the answer can include matching the "less than" symbo ("<"), that will be great!
Thanks
Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.
For example:
<tag>
outer
<tag>
middle
<tag>inner</tag>
middle
</tag>
outer
</tag>
Instead, use an HTML parser and search tools such as XPath.
Here is a demonstration using XML::LibXML.
use strict;
use warnings;
use v5.10;
use XML::LibXML;
my $html = q{
<html>
<body>
<a href='/channels/folder1'>Alpha-Seeking</a>
<a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};
# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);
# Find all links.
for my $node ($dom->findnodes('//a')) {
# Print their text.
say $node->textContent;
}
I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.
Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.
Here's what you have:
if ($string1=~ /\'>(.*?)/) {print "got $1";}
And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. The simplest thing that .*? can capture is nothing - the empty string.
Regexes are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don't want that here. Here, you want their greediness. So just remove that ?.
use warnings;
use strict;
my #strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (#strings) {
if ($string =~ /'>(.*)/) { # Note: No "?" here
print "got $1\n";
}
}
This displays:
got Alpha-Seeking
got No Underlying Index ,
This works for me
use warnings;
use strict;
my #strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (#strings)
{
if ($string =~ /'>(.*?)$/)
{
print "got $1\n";
}
}
running it gives
$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
While exploring various options, I managed to get this working with the following:
Replace the greater than sign with some other generic symbol (like a pipe)
$string=~ s/>/\|/g; #Interestingly, '>' matches here without any issues
After that, split on the pipe char, and print/parse the second part:
($o1,$o2) = split(/\|/, $string);
print "$o2|";
Works perfectly as a work-around.

Extracting plain text from html files in R [duplicate]

I'm trying to read web page source into R and process it as strings. I'm trying to take the paragraphs out and remove the html tags from the paragraph text. I'm running into the following problem:
I tried implementing a function to remove the html tags:
cleanFun=function(fullStr)
{
#find location of tags and citations
tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);
#create storage for tag strings
tagStrings=list()
#extract and store tag strings
for(i in 1:dim(tagLoc)[1])
{
tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
}
#remove tag strings from paragraph
newStr=fullStr
for(i in 1:length(tagStrings))
{
newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
}
return(newStr)
};
This works for some tags but not all tags, an example where this fails is following string:
test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal would be to obtain:
cleanFun(test)="junk junk junk junk"
However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.
This can be achieved simply through regular expressions and the grep family:
cleanFun <- function(htmlString) {
return(gsub("<.*?>", "", htmlString))
}
This will also work with multiple html tags in the same string!
This finds any instances of the pattern <.*?> in the htmlString and replaces it with the empty string "". The ? in .*? makes it non greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> instead of the whole string.
You can also do this with two functions in the rvest package:
library(rvest)
strip_html <- function(s) {
html_text(read_html(s))
}
Example output:
> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Note that you should not use regexes to parse HTML.
Another approach, using tm.plugin.webmining, which uses XML internally.
> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
An approach using the qdap package:
library(qdap)
bracketX(test, "angle")
## > bracketX(test, "angle")
## [1] "junk junk junk junk"
It is best not to parse html using regular expressions. RegEx match open tags except XHTML self-contained tags
Use a package like XML. Source the html code in parse it using for example htmlParse and use xpaths to find the quantities relevant to you.
UPDATE:
To answer the OP's question
require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
It may be easier with sub or gsub ?
> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"
First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.
Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.

Perl Regular Expression to extract value from nested html tags

$match = q(<h1><b>Google</b></h1>);
if($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/){
$title = $1;
}else {
$title="";
}
print"$title";
OUTPUT: Google</b></h1>
It Should be : Google
Unable to extract value from link using Regex in Perl, it could have one more or less nesting:
<h1><b><i>Google</i></b></h1>
Please Try this:
1) <td>Unix shell
2) <h1><b>HP</b></h1>
3) generic</td>);
4) <span>[</span>1<span>]</span>
OUTPUT:
Unix shell
HP
generic
[1]
Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:
use Mojo;
my $dom = Mojo::DOM->new(q(<h1><b>Google</b></h1>));
print $dom->at('a[href="#google"]')->all_text, "\n";
Or with HTML::TreeBuilder::XPath:
use HTML::TreeBuilder::XPath;
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<h1><b>Google</b></h1>));
print $dom->findvalue('//a[#href="#google"]'), "\n";
Try this:
if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)
That should take "everything after the href and between the <b>...</b> tags
Instead, to get "everything after the last > and before the first </, you can use
<a.*?href.*?>([^>]*?)<\/
For this simple case you could use: The requirements are no longer simple, look at #amon's answer for how to use an HTML parser.
/<a.*?>([^<]+)</
Match an opening a tag, followed by anything until you find something between > and <.
Though as others have mentioned, you should generally use a HTML parser.
echo '<td>Unix shell
<h1><b>HP</b></h1>
generic</td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
I came up with this regex that works for all your sampled inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*
(?<=>)((?:\w+)(?:\s*))(?1)*
Just take the first element of the returned array, ie array[0]

Regular expression for extracting tag attributes

I'm trying to extract the attributes of a anchor tag (<a>). So far I have this expression:
(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+
which works for strings like
<a href="test.html" class="xyz">
and (single quotes)
<a href='test.html' class="xyz">
but not for a string without quotes:
<a href=test.html class=xyz>
How can I modify my regex making it work with attributes without quotes? Or is there a better way to do that?
Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.
Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note regex101.com did not exist when I wrote originally this answer)
<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?
Update 2021 bis: Dave proposes in the comments, to take into account an attribute value containing an equal sign, like <img src="test.png?test=val" />, as in this regex101:
(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Update (2020), Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note regex101.com did not exist when I wrote originally this answer)
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Applied to:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
Original answer (2008):
If you have an element like
<name attribute=value attribute="value" attribute='value'>
this regex could be used to find successively each attribute name and value
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Applied on:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
it would yield:
'href' => 'test.html'
'class' => 'xyz'
Note: This does not work with numeric attribute values e.g. <div id="1"> won't work.
Edited: Improved regex for getting attributes with no value and values with " ' " inside.
([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?
Applied on:
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
it would yield:
'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
Although the advice not to parse HTML via regexp is valid, here's a expression that does pretty much what you asked:
/
\G # start where the last match left off
(?> # begin non-backtracking expression
.*? # *anything* until...
<[Aa]\b # an anchor tag
)?? # but look ahead to see that the rest of the expression
# does not match.
\s+ # at least one space
( \p{Alpha} # Our first capture, starting with one alpha
\p{Alnum}* # followed by any number of alphanumeric characters
) # end capture #1
(?: \s* = \s* # a group starting with a '=', possibly surrounded by spaces.
(?: (['"]) # capture a single quote character
(.*?) # anything else
\2 # which ever quote character we captured before
| ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
) # end group
)? # attribute value was optional
/msx;
"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)
(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.
Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expression.
there are too may corner case conditionals such as \' and \" which must be accounted for. You are much better off using a proper DOM Parser, XML Parser, or one of the many other dozens of tried and tested tools for this job instead of inventing your own.
I don't really care which one you use, as long as its recognized, tested, and you use one.
my $foo = Someclass->parse( $xmlstring );
my #links = $foo->getChildrenByTagName("a");
my #srcs = map { $_->getAttribute("src") } #links;
# #srcs now contains an array of src attributes extracted from the page.
You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.
So either don’t use named captures:
(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+
Or don’t use the quantifier on this expression:
(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)
This does also allow attribute values like bar=' baz='quux:
foo="bar=' baz='quux"
Well the drawback will be that you have to strip the leading and trailing quotes afterwards.
Just to agree with everyone else: don't parse HTML using regexp.
It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.
There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.
PHP (PCRE) and Python
Simple attribute extraction (See it working):
((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))
Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):
(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)
(Works better with the "gisx" flags.)
Javascript
As Javascript regular expressions don't support look-behinds, it won't support most features of the previous expressions I propose. But in case it might fit someone's needs, you could try this version. (See it working).
(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
This is my best RegEx to extract properties in HTML Tag:
# Trim the match inside of the quotes (single or double)
(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2
# Without trim
(\S+)\s*=\s*([']|["])([\W\w]*?)\2
Pros:
You are able to trim the content inside of quotes.
Match all the special ASCII characters inside of the quotes.
If you have title="You're mine" the RegEx does not broken
Cons:
It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.
This is the online RegEx example:
https://regex101.com/r/aVz4uG/13
I normally use this RegEx to extract the HTML Tags:
I recommend this if you don't use a tag type like <div, <span, etc.
<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
For example:
<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">
This is the online RegEx example:
https://regex101.com/r/aVz4uG/15
The bug in this RegEx is:
<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
In this tag:
<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>
Returns <div '> but it should not return any match:
Match: <div '>
To "solve" this remove the [^/]+? pattern:
<div(?:\".*?\"|'.*?'|.*?)*?>
The answer #317081 is good but it not match properly with these cases:
<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match instead of return only an empty property
<div title = "c"> # It not recognize the space between the equal (=)
This is the improvement:
(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?
vs
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Avoid the spaces between equal signal:
(\S+)\s*=\s*((?:...
Change the last + and . for:
|[>"']))?[^"']*)["']?
This is the online RegEx example:
https://regex101.com/r/aVz4uG/8
Tags and attributes in HTML have the form
<tag
attrnovalue
attrnoquote=bli
attrdoublequote="blah 'blah'"
attrsinglequote='bloob "bloob"' >
To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:
attr(?=(attr)*\s*/?\s*>)
The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?
Unimportant groups are made non capturing. The first matching group $1 gives you the name of the attribute, the value is one of $2or $3 or $4. I use $2$3$4 to extract the value.
The final regex is
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)
Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.
splattne,
#VonC solution partly works but there is some issue if the tag had a mixed of unquoted and quoted
This one works with mixed attributes
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"
to test it out
<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)"
$code = ' <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/> ';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$ms would then contain keys and values on the 2nd and 3rd element.
$keys = $ms[1];
$values = $ms[2];
I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
something like this might be helpful
'(\S+)\s*?=\s*([\'"])(.*?|)\2
If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?
I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.
If youre in .NET I recommend the HTML agility pack, very robust even with malformed HTML.
Then you can use XPath.
I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainabilty you are about to shoot yourself in both feet.
My adaptation to also get the boolean attributes and empty attributes like:
<input autofocus='' disabled />
/(\w+)=["']((?:.(?!["']\s+(?:\S+)=|\s*\/[>"']))+.)["']|(\w+)=["']["']|(\w+)/g
I also needed this and wrote a function for parsing attributes, you can get it from here:
https://gist.github.com/4153580
(Note: It doesn't use regex)
I have created a PHP function that could extract attributes of any HTML tags. It also can handle attributes like disabled that has no value, and also can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:
/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
$matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
$results = array(
'element' => $matches[2],
'attributes' => null,
'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
);
if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
$results['attributes'] = array();
foreach($attrs[1] as $i => $attr) {
$results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
}
}
return $results;
}
Test Code
$test = array(
'<div class="foo" id="bar" data-test="1000">',
'<div>',
'<div class="foo" id="bar" data-test="1000">test content</div>',
'<div>test content</div>',
'<div>test content</span>',
'<div>test content',
'<div></div>',
'<div class="foo" id="bar" data-test="1000"/>',
'<div class="foo" id="bar" data-test="1000" />',
'< div class="foo" id="bar" data-test="1000" />',
'<div class id data-test>',
'<id="foo" data-test="1000">',
'<id data-test>',
'<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);
foreach($test as $t) {
var_dump($t, extract_html_attributes($t));
echo '<hr>';
}
This works for me. It also take into consideration some end cases I have encountered.
I am using this Regex for XML parser
(?<=\s)[^><:\s]*=*(?=[>,\s])
Extract the element:
var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]
Then use jQuery to parse and extract the bit you want:
$(htmlStr).attr('style')
have a look at this
Regex & PHP - isolate src attribute from img tag
perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body-tag

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # p non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
I used Xetius regex and it works fine. Except for some flex generated tags which can be : with no spaces inside. I tried ti fix it with a simple ? after \s and it looks like it's working :
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from flex generated html text so i also added more excepted tags :
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
See this live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
Since HTML is not a regular language
HTML isn't but HTML tags are and they can be adequatly described by regular expressions.
Assuming that this will work in PERL as it does in languages that claim to use PERL-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
(That said I agree that generally parsing HTML with regexes is not the way to go).
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me because all the solutions above failed for other html tags starting with p such as param pre progress, etc. It also takes care of the html attributes too.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this, is to use the regex people suggest here to search for &ltp> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.