Regex select all text between tags - html

What is the best way to select all the text between 2 tags - ex: the text between all the '<pre>' tags on the page.

You can use "<pre>(.*?)</pre>", (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.

The closing tag may be on a different line from the opening tag, which is why \n needs to be added.
<PRE>(.|\n)*?<\/PRE>

To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>) looks for text after <pre>
(?=</pre>) looks for text before </pre>
The result will be the text inside the pre tag.
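A minimal sketch of the lookaround pattern above in Python, assuming the tag is a plain <pre> with no attributes. The sample string is made up for illustration. Python only allows fixed-width lookbehind, which <pre> satisfies.

```python
import re

# Hypothetical sample input with two <pre> blocks, one spanning two lines.
text = "a <pre>one</pre> b <pre>two\nlines</pre>"

# The lookbehind and lookahead assert the tags without including them in the match.
# re.DOTALL lets "." cross line breaks.
inner = re.findall(r"(?<=<pre>)(.*?)(?=</pre>)", text, flags=re.DOTALL)
print(inner)
```

Because the tags are only asserted, each result is already the bare content and no stripping is needed afterwards.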

This is what I would use.
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>)) The selection has to be preceded by the <pre> tag
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is just the regular expression I want to apply. In this case, it selects a letter, a digit, a newline character, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".
+? The plus sign selects one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'non-greedy'.
(?=(</pre>)) The selection has to be followed by the </pre> tag
Depending on your use case you might need to add some modifiers like (i or m)
i - case-insensitive
m - multi-line search
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
Javascript does not support lookbehind
The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript, however, does not support lookbehind, so we have to forget about using (?<=(<pre>)) and look for some kind of workaround. Perhaps simply strip the first four chars from our result for each selection, as in here
https://stackoverflow.com/questions/11592033/regex-match-text-between-tags
Also look at the JAVASCRIPT REGEX DOCUMENTATION for non-capturing parentheses

Use the pattern below to get the content between an element's tags. Replace [tag] with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometimes tags have attributes, such as an anchor tag with href; in that case use the pattern below.
<[tag][^>]*>(.+?)</[tag]>
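As a rough illustration of the attribute-tolerant pattern, here is a Python sketch. The helper name inner_text and the sample markup are assumptions for the example. The \b after the tag name is an addition to the pattern above so that, for instance, "a" does not also match <abbr>.

```python
import re

def inner_text(html, tag):
    """Return the content of every <tag ...>...</tag> pair (no nesting support)."""
    name = re.escape(tag)
    # [^>]* skips any attributes; DOTALL lets the content span several lines.
    pattern = re.compile(rf"<{name}\b[^>]*>(.+?)</{name}>", re.DOTALL | re.IGNORECASE)
    return pattern.findall(html)

# Hypothetical sample: two anchors, the second split across lines.
links = inner_text('<a href="/x">first</a> and <a class="k"\n href="/y">second\nlink</a>', "a")
print(links)
```

Like the original pattern, this uses .+? and therefore skips empty elements.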

This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=<\/)
I tested it in https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a starter for 10. Good luck.
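The same three-part pattern can be tried in Python on the fragment above; this is only a sketch. One detail worth knowing: the newline just before </table> also sits between a '>' and a '</', so it is returned as a whitespace-only match and is filtered out here.

```python
import re

html = """<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>"""

# Lookbehind for '>', capture word/whitespace characters, lookahead for '</'.
raw = re.findall(r"(?<=>)([\w\s]+)(?=</)", html)

# Drop whitespace-only matches such as the newline before </table>.
cells = [text.strip() for text in raw if text.strip()]
print(cells)
```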

This seems to be the simplest regular expression of all that I found
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
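(?:<TAG>) matches the opening tag in a non-capturing group, so it stays out of the captured groups
([\s\S]*) captures any whitespace or non-whitespace characters, including newlines
(?:<\/TAG>) matches the closing tag in a non-capturing group (the overall match still includes both tags; read group 1 for the content)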

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.
In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.
Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between the opening and closing tags is not that tag itself, this will work:
preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:
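$obj = new DOMDocument();
$obj->loadHTML($html); // loadHTML() parses a string; load() expects a file path
$nodes = $obj->getElementsByTagName('el'); // returns a DOMNodeList
$value = $nodes->item(0)->nodeValue; // nodeValue is a property, not a method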
And since this is a proper parser it will be able to handle nesting tags etc.
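For comparison, here is a sketch of the parser approach using only Python's standard library html.parser. The class name TagTextCollector and the sample markup are made up for the example; the depth counter is what lets it cope with a tag nested inside itself.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside each outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.chunks = []    # text pieces of the element being read
        self.results = []   # finished text of each outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.chunks = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

collector = TagTextCollector("div")
collector.feed('<div>a<div>b</div>c</div><p>x</p><div class="k">d</div>')
collector.close()
print(collector.results)
```

Note that this collects text only; markup inside the element is dropped, much like nodeValue in the PHP version.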

Try this....
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)

Since the accepted answer has no JavaScript code, adding it here:
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches); This regex selects everything between the tags, even when the content spans several lines.

In Python, setting the DOTALL flag will capture everything, including newlines.
If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org
#example.py using Python 3.7.4
import re
str="""Everything is awesome! <pre>Hello,
World!
</pre>
"""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>",re.DOTALL)
matches = pattern.search(str)
print(matches.group(1))
python example.py
Hello,
World!
Capturing text between all opening and closing tags in a document
To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.
#example2.py using Python 3.7.4
import re
# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>",re.DOTALL)
matches = pattern.finditer(str)
for i, match in enumerate(matches):
    print(f"tag {i}: {match.group(1)}")
python example2.py
tag 0: Desert Survival Problem
tag 1: humor conditions
tag 2: endless opportunity for self-reflection

(?<=>)[^<]+
for Notepad++
>([^<]+)
for AutoIt (option Return array of global matches).
or
(?=>([^<]+))
https://regex101.com/r/VtmEmY/1

To select all text between pre tag I prefer
preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);
$matches[0] will have results including <pre> tag
$matches[1] will have all the content inside <pre>.
DomDocument cannot work in situations where the requirement is to get text with tag details within the searched tag as it strips all tags, nodeValue & textContent will only return text without tags & attributes.

test.match(/<pre>(.*?)<\/pre>/g)?.map((a) => a.replace(/<pre>|<\/pre>/g, ""))
This should be a preferred solution, especially if you have multiple pre tags in the context.

You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );

const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);
matches[3] is the content text, and this adapts to any tag name with classes (nested structures are not supported).

How about:
<PRE>(\X*?)<\/PRE>

More complex than PyKing's answer but matches any type of tag (except self-closing) and considers cases where the tag has HTML-like string attributes.
/<TAG_NAME(?:STRING|NOT_CLOSING_TAG_NOT_QUOTE)+>INNER_HTML<\/\1 *>/g
Raw: /<([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>(.*?)<\/\1 *>/g
group #1 = tag name
group #2 = string attr
group #3 = inner html
JavaScript code testing it:
let TAG_NAME = '([^\\s</>]+)'; // the backslash must be doubled inside a string literal
let NOT_CLOSING_TAG_NOT_QUOTE = '[^>"]';
let STRING = '("(?:[^"\\\\]|\\\\.)*")';
let NON_SELF_CLOSING_HTML_TAG =
// \1 is a back reference to TAG_NAME
`<${TAG_NAME}(?:${STRING}|${NOT_CLOSING_TAG_NOT_QUOTE})+>(.*?)</\\1 *>`;
let tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG, 'g');
let myStr = `Aenean <abc href="/life<><>\\"<?/abc></abc>"><a>life</a></abc> sed consectetur.
<a href="/work">Work Inner HTML</a> quis risus eget <a href="/about">about inner html</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>`;
let matches = myStr.match(tagRegex);
// Removing 'g' flag to match each tag part in the for loop
tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG);
for (let i = 0; i < matches.length; i++) {
let tagParts = matches[i].match(tagRegex);
console.log(`Tag #${i} = [${tagParts[0]}]`);
console.log(`Tag #${i} name: [${tagParts[1]}]`);
console.log(`Tag #${i} string attr: [${tagParts[2]}]`);
console.log(`Tag #${i} inner html: [${tagParts[3]}]`);
console.log('');
}
Output:
Tag #0 = [<abc href="/life<><>\"<?/abc></abc>"><a>life</a></abc>]
Tag #0 name: [abc]
Tag #0 string attr: ["/life<><>\"<?/abc></abc>"]
Tag #0 inner html: [<a>life</a>]
Tag #1 = [<a href="/work">Work Inner HTML</a>]
Tag #1 name: [a]
Tag #1 string attr: ["/work"]
Tag #1 inner html: [Work Inner HTML]
Tag #2 = [<a href="/about">about inner html</a>]
Tag #2 name: [a]
Tag #2 string attr: ["/about"]
Tag #2 inner html: [about inner html]
Tag #3 = [<ve text="<></ve>>">abc</ve>]
Tag #3 name: [ve]
Tag #3 string attr: ["<></ve>>"]
Tag #3 inner html: [abc]
This doesn't work if:
The tag has any descendant tag of the same type
The tag starts on one line and ends on another. (In my case I remove line breaks from the HTML.)
If you change (.*?)<\/\1 *> to ([\s\S]*?)<\/\1 *> it should match the tag's inner html even if everything is not in the same line. For some reason it didn't work for me on Chrome and Node but worked here with the JavaScript's Regex Engine:
https://www.regextester.com
Regex: <([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>([\s\S]*?)<\/\1 *>
Test String:
Aenean lacinia <abc href="/life<><><?/a></a>">
<a>life</a></abc> sed consectetur.
Work quis risus eget urna mollis ornare about leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>

For multiple lines:
<htmltag>(.+)((\s)+(.+))+</htmltag>

I use this solution:
preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
var_dump($new);

In Javascript (among others), this is simple. It covers attributes and multiple lines:
/<pre[^>]*>([\s\S]*?)<\/pre>/

<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>

Related

Regex to find content, then backtrack to initial HTML tag

I'm trying to use regex to match a string that starts with a <p> tag and has some specific content. Then, I want to replace everything from that specific paragraph tag to the end of the page.
I've tried using the expression <p.*?some content.*</html>, but it grabs the first <p> tag it sees, then follows through all the way to the end. I want it to only recognize the paragraph tag immediately preceding the content, allowing for other content and tags between the paragraph tag and the content.
How can I get to some specific content with the regex, then backtrack to the first paragraph tag it sees before the content, and then select everything from there to the end?
If it helps, I'm using EditPad Pro's "Search & Replace" function (although this could apply to anything that uses regex).
For simple input use regex
<p[^<]*some content.*<\/html>
but safer would be to use regex
<p(?:[^<]*|<(?!p\b))*some content.*<\/html>
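A sketch of the "safer" idea in Python, under a few assumptions: the function name cut_from_paragraph and the sample page are invented for illustration, the repeated group is rewritten one character at a time with a lazy quantifier, and \b is added after <p so that <pre> does not count as a paragraph. The needle is escaped, so it is treated as literal text.

```python
import re

def cut_from_paragraph(html, needle):
    """Drop everything from the <p> that directly precedes `needle` to </html>."""
    pattern = re.compile(
        # After <p, step over any character that does not start another <p tag,
        # then require the needle and everything up to the closing </html>.
        r"<p\b(?:[^<]|<(?!p\b))*?" + re.escape(needle) + r".*</html>",
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(html)
    return html if match is None else html[:match.start()]

page = ("<html><p>first</p><p class='x'>intro <b>bold</b> "
        "some content here</p><p>after</p></html>")
print(cut_from_paragraph(page, "some content"))
```

The first <p> is skipped because another <p starts before the needle is reached, so the match begins at the paragraph that actually contains the content.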
To start, this is Java code, but it can be easily adapted to other regex engines / programming languages, I suppose.
So from what I understand, you want a situation where a given input has a part that starts with <p> and immediately followed by some target content/phrase. You then want to replace everything following the initial <p> tag with some other content?
If that is correct, you could do something like this:
String input; // holds your input text/html
String targetPhrase = "some specific content"; // some target content/phrase
String replacement; // holds the replacement value
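Pattern p = Pattern.compile("<p[^>]*>(" + targetPhrase + ".*)$", Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // DOTALL lets .* run across line breaks to the end of the input
Matcher m = p.matcher(input);
String output = m.replaceFirst(replacement); // replaceFirst returns the new string; input is not modified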
Of course, as mentioned in above comments, you really don't want to be using regex for HTML.
Alternatively, if you know that the <p> tag is just that, with no properties or anything, you could try a substring instead.
So for example, if you're looking for "<p>some specific content", you could try something like:
String input; // your input text/html
String replacement; // the replacement value(s)
int index = input.indexOf("<p>some specific content");
if (index > -1) {
String output = input.substring(0, index);
output += "<p>" + replacement;
// now output holds your modified text/html
}

Matching nested [quote] in using RegExp

I'm trying to get regexp to match some nested tags. (Yes I know I should use a parser, but my input will be correct).
Example:
Text.
More text.
[quote]
First quote
[quote]
Nested second quote.
[/quote]
[/quote]
Let's say I want the regexp to simply change the tags to <blockquote>:
Text.
More text.
<blockquote>
First quote
<blockquote>
Nested second quote.
</blockquote>
</blockquote>
How would I do this, matching both opening and closing tags at the same time?
If you don’t mind correctness, then you could use a simple string replacement and replace each tag separately. Here’s some example using PHP’s str_replace to replace the opening and closing tags:
$str = str_replace('[quote]', '<blockquote>', $str);
$str = str_replace('[/quote]', '</blockquote>', $str);
Or with the help of a regular expression (PHP again):
$str = preg_replace('~\[(/?)quote]~', '<$1blockquote>', $str);
Here the matches of \[(/?)quote] are replaced with <$1blockquote> where $1 is replaced with the match of the first group of the pattern ((/?), either / or empty).
But you should really use a parser that keeps track of the opening and closing tags. Otherwise you can have an opening or closing tag that doesn’t have a counterpart or (if you’re using further tags) that is not nested properly.
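A small Python sketch that combines both points: the one-pass replacement, preceded by a simple depth counter that rejects unbalanced tags. The function name and error messages are illustrative assumptions, and the counter is a basic balance check rather than a full parser.

```python
import re

QUOTE_TAG = re.compile(r"\[(/?)quote\]")

def quotes_to_blockquotes(text):
    """Replace [quote]/[/quote] with blockquote tags after checking the balance."""
    depth = 0
    for m in QUOTE_TAG.finditer(text):
        depth += -1 if m.group(1) else 1
        if depth < 0:
            raise ValueError("closing [/quote] without an opening [quote]")
    if depth != 0:
        raise ValueError("unclosed [quote]")
    # \g<1> inserts the optional slash captured by group 1.
    return QUOTE_TAG.sub(r"<\g<1>blockquote>", text)

print(quotes_to_blockquotes("[quote]First [quote]Nested[/quote][/quote]"))
```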
You can't match (arbitrarily) nested stuff with regular expressions.
But you can replace every instance of [quote] with <blockquote> and [/quote] with </blockquote>.
It's a lousy idea, but you're apparently trying to match something like: \[\(/?\)quote\] and replace it with: <\1blockquote>
You could use 2 expressions.
s/\[quote\]/\<blockquote\>/
s/\[\/quote\]/\<\/blockquote\>/

Regular expression for extracting tag attributes

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:
(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+
which works for strings like
<a href="test.html" class="xyz">
and (single quotes)
<a href='test.html' class="xyz">
but not for a string without quotes:
<a href=test.html class=xyz>
How can I modify my regex making it work with attributes without quotes? Or is there a better way to do that?
Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.
Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note regex101.com did not exist when I originally wrote this answer)
<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?
Update 2021 bis: Dave proposes in the comments, to take into account an attribute value containing an equal sign, like <img src="test.png?test=val" />, as in this regex101:
(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Update (2020), Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note regex101.com did not exist when I originally wrote this answer)
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Applied to:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
Original answer (2008):
If you have an element like
<name attribute=value attribute="value" attribute='value'>
this regex could be used to find successively each attribute name and value
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Applied on:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
it would yield:
'href' => 'test.html'
'class' => 'xyz'
Note: This does not work with numeric attribute values e.g. <div id="1"> won't work.
Edited: Improved regex for getting attributes with no value and values with " ' " inside.
([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?
Applied on:
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
it would yield:
'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:
/
\G # start where the last match left off
(?> # begin non-backtracking expression
.*? # *anything* until...
<[Aa]\b # an anchor tag
)?? # but look ahead to see that the rest of the expression
# does not match.
\s+ # at least one space
( \p{Alpha} # Our first capture, starting with one alpha
\p{Alnum}* # followed by any number of alphanumeric characters
) # end capture #1
(?: \s* = \s* # a group starting with a '=', possibly surrounded by spaces.
(?: (['"]) # capture a single quote character
(.*?) # anything else
\2 # which ever quote character we captured before
| ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
) # end group
)? # attribute value was optional
/msx;
"But wait," you might say. "What about comments?!?!" Okay, then you can replace the . in the non-backtracking section with the following (it also handles CDATA sections):
(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.
Token Mantra response: you should not tweak/modify/harvest/or otherwise produce HTML/XML using regular expressions.
There are too many corner cases, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the many dozens of other tried and tested tools for this job instead of inventing your own.
I don't really care which one you use, as long as it's recognized, tested, and you use one.
my $foo = Someclass->parse( $xmlstring );
my @links = $foo->getChildrenByTagName("a");
my @srcs = map { $_->getAttribute("src") } @links;
# @srcs now contains an array of src attributes extracted from the page.
You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.
So either don’t use named captures:
(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+
Or don’t use the quantifier on this expression:
(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)
This does also allow attribute values like bar=' baz='quux:
foo="bar=' baz='quux"
Well the drawback will be that you have to strip the leading and trailing quotes afterwards.
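Here is a hedged Python sketch of the unnamed-capture variant applied with findall, including the quote stripping mentioned above. The helper name parse_attributes is an assumption for the example. Since \w does not match a hyphen, names such as data-type would need a wider character class.

```python
import re

# Attribute name, then "=", then a double-quoted, single-quoted, or bare value.
ATTR = re.compile(r"""(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)""")

def parse_attributes(tag):
    """Return {name: value} for one tag string, with surrounding quotes removed."""
    attrs = {}
    for name, value in ATTR.findall(tag):
        if value[0] in "\"'" and value[-1] == value[0]:
            value = value[1:-1]  # strip the leading and trailing quote
        attrs[name] = value
    return attrs

print(parse_attributes("<a href=test.html class=xyz>"))
print(parse_attributes("""<a href='test.html' class="xyz">"""))
```

Looping with findall sidesteps the quantifier problem: each attribute is a separate match, so no capture group is repeated within a single match.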
Just to agree with everyone else: don't parse HTML using regexp.
It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.
There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.
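As one example of the library route, Python's standard html.parser hands each start tag's attributes to a callback, with quoting already resolved. This is a sketch; the class and function names are made up for the example.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record (tag name, attribute dict) for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when there is no "=".
        self.tags.append((tag, dict(attrs)))

def collect_attributes(html):
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.tags

print(collect_attributes("<a href=test.html class=xyz>"))
print(collect_attributes("<input autofocus='' disabled>"))
```

Unquoted, single-quoted, and double-quoted values all arrive as plain strings, and valueless attributes such as disabled arrive as None.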
PHP (PCRE) and Python
Simple attribute extraction (See it working):
((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))
Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):
(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)
(Works better with the "gisx" flags.)
Javascript
As Javascript regular expressions don't support look-behinds, it won't support most features of the previous expressions I propose. But in case it might fit someone's needs, you could try this version. (See it working).
(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
This is my best RegEx to extract properties in HTML Tag:
# Trim the match inside of the quotes (single or double)
(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2
# Without trim
(\S+)\s*=\s*([']|["])([\W\w]*?)\2
Pros:
You are able to trim the content inside of quotes.
Match all the special ASCII characters inside of the quotes.
If you have title="You're mine" the RegEx does not break
Cons:
It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.
This is the online RegEx example:
https://regex101.com/r/aVz4uG/13
I normally use this RegEx to extract the HTML Tags:
I recommend this if you don't use a tag type like <div, <span, etc.
<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
For example:
<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">
This is the online RegEx example:
https://regex101.com/r/aVz4uG/15
The bug in this RegEx is:
<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
In this tag:
<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>
Returns <div '> but it should not return any match:
Match: <div '>
To "solve" this remove the [^/]+? pattern:
<div(?:\".*?\"|'.*?'|.*?)*?>
Answer #317081 is good, but it does not match properly in these cases:
<div id="a"> # It returns "a instead of a
<div style=""> # It does not match at all, instead of returning an empty property
<div title = "c"> # It does not recognize the spaces around the equals sign (=)
This is the improvement:
(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?
vs
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Allow spaces around the equals sign:
(\S+)\s*=\s*((?:...
Replace the final + and . with:
|[>"']))?[^"']*)["']?
This is the online RegEx example:
https://regex101.com/r/aVz4uG/8
Tags and attributes in HTML have the form
<tag
attrnovalue
attrnoquote=bli
attrdoublequote="blah 'blah'"
attrsinglequote='bloob "bloob"' >
To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:
attr(?=(attr)*\s*/?\s*>)
The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?
Unimportant groups are made non capturing. The first matching group $1 gives you the name of the attribute, the value is one of $2 or $3 or $4. I use $2$3$4 to extract the value.
The final regex is
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)
Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.
splattne,
@VonC's solution partly works, but there is an issue if the tag has a mix of unquoted and quoted attributes.
This one works with mixed attributes
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
to test it out
<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
$code = ' <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/> ';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$ms would then contain the keys in its 2nd element and the values in its 4th element (the 3rd holds the opening quote).
$keys = $ms[1];
$values = $ms[3];
I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
something like this might be helpful
'(\S+)\s*?=\s*([\'"])(.*?|)\2
If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?
I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.
If you're in .NET, I recommend the HTML Agility Pack; it is very robust even with malformed HTML.
Then you can use XPath.
I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainability you are about to shoot yourself in both feet.
My adaptation to also get the boolean attributes and empty attributes like:
<input autofocus='' disabled />
/(\w+)=["']((?:.(?!["']\s+(?:\S+)=|\s*\/[>"']))+.)["']|(\w+)=["']["']|(\w+)/g
I also needed this and wrote a function for parsing attributes, you can get it from here:
https://gist.github.com/4153580
(Note: It doesn't use regex)
I have created a PHP function that can extract the attributes of any HTML tag. It can also handle attributes like disabled that have no value, and can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:
/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
$matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
$results = array(
'element' => $matches[2],
'attributes' => null,
'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
);
if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
$results['attributes'] = array();
foreach($attrs[1] as $i => $attr) {
$results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
}
}
return $results;
}
Test Code
$test = array(
'<div class="foo" id="bar" data-test="1000">',
'<div>',
'<div class="foo" id="bar" data-test="1000">test content</div>',
'<div>test content</div>',
'<div>test content</span>',
'<div>test content',
'<div></div>',
'<div class="foo" id="bar" data-test="1000"/>',
'<div class="foo" id="bar" data-test="1000" />',
'< div class="foo" id="bar" data-test="1000" />',
'<div class id data-test>',
'<id="foo" data-test="1000">',
'<id data-test>',
'<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);
foreach($test as $t) {
var_dump($t, extract_html_attributes($t));
echo '<hr>';
}
This works for me. It also takes into consideration some edge cases I have encountered.
I am using this Regex for XML parser
(?<=\s)[^><:\s]*=*(?=[>,\s])
Extract the element:
var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]
Then use jQuery to parse and extract the bit you want:
$(htmlStr).attr('style')
Have a look at this:
Regex & PHP - isolate src attribute from img tag
Perhaps you can walk through the DOM and get the desired attributes. It works fine for me for getting attributes from the body tag.

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
for each y_match in x_match.group(1).value.match_all(y_re)
print y_match.group(1).value
next y_match
next x_match
Pseudo-output:
a
b
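The pseudo-code translates fairly directly to Python. This sketch assumes, as the answer does, that neither the x nor the y elements are nested inside themselves; the function name and sample string are illustrative.

```python
import re

X_RE = re.compile(r"<x\b[^>]*>(.*?)</x>", re.DOTALL)
Y_RE = re.compile(r"<y\b[^>]*>(.*?)</y>", re.DOTALL)

def y_values_inside_x(text):
    """Collect the contents of every <y> found inside an <x> element."""
    values = []
    for x_match in X_RE.finditer(text):
        # Search for <y> only within the body of the current <x>.
        values.extend(Y_RE.findall(x_match.group(1)))
    return values

sample = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>"
print(y_values_inside_x(sample))
```

The <y> elements outside any <x> ("c" and "d" in the sample) are never seen by the inner pattern, which is the point of the two-step approach.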
Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
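The same token-based idea can be sketched in Python with the standard library's html.parser module. This is an illustration rather than a port of the Perl code: the class name KeepOnlyP and the function strip_tags_except_p are made up, and comments and declarations are simply dropped because their handlers are not overridden.

```python
from html.parser import HTMLParser

class KeepOnlyP(HTMLParser):
    """Re-emit the document, dropping every tag except <p> and </p>."""

    def __init__(self):
        # Leave entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Reproduce the start tag exactly as written, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("</p>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

def strip_tags_except_p(html):
    parser = KeepOnlyP()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(strip_tags_except_p('<div><p class="a">Hi <b>there</b></p><pre>x</pre></div>'))
# <p class="a">Hi there</p>x
```

Because the parser reports tag names already lowercased and split from their attributes, there is no need to worry about <pre>, <P>, or attributes on the tag.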
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the like)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
I used Xetius's regex and it works fine, except for some Flex-generated tags that contain no spaces. I tried to fix it with a simple ? after \s and it looks like it's working:
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
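The trick in the Perl snippet above (capture what you want to keep in group 1, let a second branch match everything else, and replace each match with group 1) can be sketched in Python as well. One deliberate difference, which is my own assumption and not part of the original answer: a \b is added after the p so that <pre>, <param> and similar tags do not land in the "keep" group. The names TAG_RE and keep_only_p are hypothetical.

```python
import re

# Branch 1 (group 1) captures <p ...> and </p>, which are kept.
# Branch 2 matches any other tag, which is dropped.
# The \b stops <pre>, <param>, etc. from being captured by branch 1.
TAG_RE = re.compile(r"(</?p\b[^>]*>)|<[^>]*>", re.IGNORECASE)

def keep_only_p(html):
    # group(1) is None when branch 2 matched, so fall back to "".
    return TAG_RE.sub(lambda m: m.group(1) or "", html)

subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>'
print(keep_only_p(subject))
# Bad html   <p>My paragraph</p> Italics <p class="blue">second</p>
```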
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
"Since HTML is not a regular language"
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.
Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
(That said I agree that generally parsing HTML with regexes is not the way to go).
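The backtracking problem described above is easy to reproduce. The sketch below uses Python, whose re module only gained atomic groups in version 3.11, so instead of (?>...) it shows a different fix: moving the optional slash inside the lookahead. That variant is my own substitution, not the answer's regex, and it additionally uses \b so that only p itself is excluded (the atomic-group version also skips <pre> and other tags starting with p). The sample string is made up.

```python
import re

# A pattern of the question's shape: the optional slash can backtrack to
# "match nothing", after which (?!p) is tested against "/" and succeeds.
naive = re.compile(r"<\/?(?!p).+?>")

# Alternative without an atomic group: the optional slash lives inside the
# lookahead, so the check always sees the tag name.
fixed = re.compile(r"<(?!\/?p\b)[^>]+>")

sample = "<div><p>text</p><pre>code</pre></div>"

print(naive.findall(sample))  # ['<div>', '</p>', '</pre>', '</div>']
print(fixed.findall(sample))  # ['<div>', '<pre>', '</pre>', '</div>']
```

Note how the naive pattern skips <p> and <pre> but still matches their closing tags, which is exactly the symptom reported in the question.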
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me, because all the solutions above failed for other HTML tags starting with p, such as param, pre, progress, etc. It also takes care of the HTML attributes.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
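A rough sketch of that replacement step in Python, assuming no attribute value contains a literal ">" (the names P_WITH_ATTRS and strip_p_attributes are made up for illustration). For real sanitisation an HTML parser or a dedicated sanitiser is still the safer route.

```python
import re

# Matches an opening <p ...> that carries attributes. Requiring whitespace
# right after "p" keeps <pre>, <param>, etc. from matching.
# Assumption: no ">" appears inside an attribute value.
P_WITH_ATTRS = re.compile(r"<p\s[^>]*>", re.IGNORECASE)

def strip_p_attributes(html):
    return P_WITH_ATTRS.sub("<p>", html)

print(strip_p_attributes('<p onclick="alert(1)">Clickable text</p>'))
# <p>Clickable text</p>
```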