Acceptable use of Regex in HTML parsing?

There is a lot of argument back and forth over whether and when it is appropriate to use a regex to parse HTML.
A common problem is extracting links from HTML, so my question is: would a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags, and you are looking for a pretty specific structure.
It seems like significant overkill to use a full HTML parser. I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect, but the extra constraints of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.
Thoughts?

Consider this valid html:
<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
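To make the comparison concrete, here is a minimal sketch in Python that feeds the test case to the standard library's html.parser and collects every href found on an <a> start tag. The class and function names (HrefCollector, extract_hrefs) are made up for illustration, and the last link line of the test case is assumed to read <a href='my">url<'>click</a>, as the value quoted above implies.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# The test case from the answer above (last link line is an assumption, see lead-in).
TEST_CASE = """<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
"""

if __name__ == "__main__":
    print(extract_hrefs(TEST_CASE))
```

The parser never sees url1 (it sits inside a comment) or url2 (it sits inside an attribute value), which is exactly the kind of context a flat pattern match has no notion of.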

I'm one of those people who think using a regex in this situation is a bad idea.
Even if you just want to match the href attribute of an <a> tag, your regex will still run over the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.
Plus, matching href attributes with an XML parser is anything but overkill.
I have been parsing HTML pages every week for at least 2 years now. At first I used pure regex solutions, thinking that was easier and simpler than using an HTML parser.
But I had to come back to my code quite often, for many reasons:
the source code had changed
one of the source pages had broken HTML and I hadn't tested it
I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work.
...
I found that fixing long regex patterns is not exactly fun; you have to wrap your mind around them again and again.
What I usually do now is:
Use tidy to clean the HTML source.
Use DOM + XPath to actually parse the page and extract the parts I want.
Use regexes only on small text-only parts (like the trimmed textContent of a node).
The code is far more robust, I don't have to spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources, and it just feels proper.
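As a rough illustration of the last step (regex only on the text of a node that a parser has already isolated), here is a sketch in Python. It swaps in the standard library's html.parser for the tidy + DOM + XPath stack described above, and the target element (td), the "EUR" price format, and all names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextOfTag(HTMLParser):
    """Collect the trimmed text content of every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element being read
        self.texts = []     # finished text contents

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# The regex only ever sees plain text, never markup.
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*EUR")


def prices(html):
    parser = TextOfTag("td")
    parser.feed(html)
    parser.close()
    found = []
    for text in parser.texts:
        match = PRICE_RE.search(text)
        if match:
            found.append(float(match.group(1)))
    return found
```

Because the markup is gone by the time the pattern runs, inline tags such as <b> inside the cell do not need to be accounted for in the regex.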
Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I still use DOM-based solutions, to keep improving my skills with DOM libraries and simply to produce better code.
I don't like seeing people here who just comment "Don't use regex on HTML" on every html+regex tagged question without providing sample code or something to start with.
Here is an example that extracts href attributes from links in PHP, just to show that using an HTML parser for such common tasks isn't overkill at all.
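// real-world HTML is often malformed; collect libxml warnings instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}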
I hope this helps.

I proposed this one :
<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>
on this thread.
It can still fail, though, depending on what the name part contains.
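A quick way to see where this pattern breaks is to run it over a couple of awkward inputs. The sketch below ports it to Python, which spells named groups (?P<name>...) rather than (?<name>...); the function name find_links and the sample inputs are made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) syntax for named groups.
LINK_RE = re.compile(r"""<a.*?href=["'](?P<url>.*?)["'].*?>(?P<name>.*?)</a>""")


def find_links(html):
    """Return (url, name) pairs for everything the pattern considers a link."""
    return [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]


if __name__ == "__main__":
    # A link that has been commented out is still reported.
    print(find_links('<!-- <a href="url1">old</a> --><a href="url2">new</a>'))
    # A single-quoted value containing a double quote is cut short.
    print(find_links("""<a href='my">url<'>click</a>"""))
```

The pattern has no idea that the first link is inside a comment, and because the opening and closing quote are matched independently, a value such as my">url< is truncated at the first quote character of either kind.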

Related

Look for Nested XML tag with Regex

This is my first post here, and I'm hoping to get some response. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the cases in those other posts, so I'm giving it a shot.
I'm trying to find all the nested tags, here are some examples
I want to catch:
<a><a></a></a>
I don't want to catch
<a></a><a></a>
So in plain English, I want to catch every
<a> that follows another <a> without a </a> in between them, and I want to look through the entire string, so the match should proceed even if it sees a newline or line break.
Hoping to have this problem solved.
Thanks all!

I hope you are ready for parsing XML with regex.
First of all, let's define what XML tags would look like!
<tag_name␣(optional space (then whatever that doesn't end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>
To match one of these tags we can then use the following regex:
/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:
/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):
/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s
Done - The regex should do.
No seriously, try it out.
I stole an XML file fragment from w3schools XML tutorial and tried it with my regex, I copied a Maven project .xml from aliteralmind's question and tried it with my regex as well. Works best with heavily nested elements.
Cheers.

If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.
And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.
No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
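For comparison, here is roughly what the parser-based route looks like. This is only a sketch: it uses Python's xml.etree.ElementTree rather than the XSLT suggested above, it assumes the input is well-formed XML with a single root element, and the function name find_nested is made up. It reports every element with the given name that contains a descendant of the same name, at any depth.

```python
import xml.etree.ElementTree as ET


def find_nested(xml_text, tag):
    """Return the elements named `tag` that have a descendant also named `tag`."""
    root = ET.fromstring(xml_text)
    nested = []
    for elem in root.iter(tag):
        # elem.iter(tag) also yields elem itself, so exclude it explicitly
        if any(desc is not elem for desc in elem.iter(tag)):
            nested.append(elem)
    return nested


if __name__ == "__main__":
    doc = '<r><a id="1"><b><a id="2"/></b></a><a id="3"/><a id="4"/></r>'
    print([e.get("id") for e in find_nested(doc, "a")])
```

Comments and CDATA sections are handled by the parser, so an <a> that only appears inside one of them is never counted.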

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (the HTML is too crappy to be parsed) and I would like to remove the <a href> tags but keep the text inside them.
I've been searching for a regex but still haven't found a good way to do that.
All advice is welcome!

You could remove anything that looks like an HTML link from a string via regex. Results depend heavily on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
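In Python, that replacement could be sketched as follows. The function name strip_links is made up, and case-insensitive matching is an added assumption so that <A HREF=...> is covered too.

```python
import re

# Opening or closing <a> tag; \b keeps tags such as <abbr> from matching.
A_TAG_RE = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_links(html):
    """Remove <a ...> and </a> tags but keep the text between them."""
    return A_TAG_RE.sub("", html)


if __name__ == "__main__":
    print(strip_links('See <a href="http://example.com">this page</a> and <abbr>HTML</abbr>.'))
```

Note that [^>]* stops at the first ">", so an attribute value that itself contains ">" would leave debris behind; that is the "depends on your input" part.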
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.

Django templatetag for rendering a subset of html

I have some HTML (in this case created via TinyMCE) that I would like to add to a page. However, for security reasons, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.

There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
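For the XHTML case, the parse, filter, re-serialise idea could look something like the sketch below. Treat it as an illustration of the approach under stated assumptions, not as a vetted sanitizer: the whitelists, the URL prefix check and all names are hypothetical, the input must be a well-formed fragment (ET.fromstring raises ParseError otherwise), and disallowed elements are dropped together with their content.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelists; adjust to what your editor is allowed to emit.
ALLOWED_TAGS = {"div", "p", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


def _copy_clean(src):
    """Build a filtered copy of src that contains only whitelisted markup."""
    dst = ET.Element(src.tag)
    for name, value in src.attrib.items():
        if name not in ALLOWED_ATTRS.get(src.tag, ()):
            continue  # drop every attribute that is not explicitly allowed
        if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
            continue  # drop javascript: and other unexpected URL schemes
        dst.set(name, value)
    dst.text = src.text
    last = None
    for child in src:
        if child.tag in ALLOWED_TAGS:
            last = _copy_clean(child)
            last.tail = child.tail
            dst.append(last)
        else:
            # Element and its content are dropped; keep the text that followed it.
            tail = child.tail or ""
            if last is None:
                dst.text = (dst.text or "") + tail
            else:
                last.tail = (last.tail or "") + tail
    return dst


def sanitize_xhtml(fragment):
    """Wrap the fragment in a <div>, filter it, and serialise it again."""
    root = ET.fromstring("<div>" + fragment + "</div>")
    return ET.tostring(_copy_clean(root), encoding="unicode")
```

The important design choice is that the output is rebuilt from known-good pieces rather than produced by deleting known-bad ones, so anything the whitelist does not mention never reaches the page.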
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.

Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.

You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}

Regex to read out HTML tags

I'm looking for a regex that matches all used HTML tags in a text consisting of several lines. It should read out "b", "p" and "script" in the following lines:
<b>
<p class="normalText">
<script type="text/javascript">
Is there such a thing? The start I have is that it should start with a "<" and read until it hits a space or a ">", but at the same time it should not include the starting "<", since I just want to match the letter/word itself. Thoughts?

There are many similar questions on SO:
Filter out HTML tags and resolve entities in python
Regex to match all HTML tags except <p> and </p>
Strip all HTML tags except links
etc. The general agreement is that it's best not to use regular expressions to parse HTML, and to do it properly instead by applying a DOM parser and traversing the DOM tree.

It's virtually impossible to regex HTML once you start considering all the special cases and the malformed HTML that browsers sometimes happily parse anyway. That said, I thought it might be fun to get the names without using capture groups, and thus I present you with the following solution:
(?<=<)\w+(?=[^<]*?>)
For the record I hold little faith in it being at all useful in any but the most trivial of cases.
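To show both the trivial case and one of the non-trivial ones, here is a small Python sketch that runs the lookaround pattern next to the standard library's html.parser. The function and class names are made up for illustration.

```python
import re
from html.parser import HTMLParser

# The lookaround pattern from above: a word right after "<", with a ">" before the next "<".
TAG_NAME_RE = re.compile(r"(?<=<)\w+(?=[^<]*?>)")


def tag_names_regex(text):
    return TAG_NAME_RE.findall(text)


class _TagNames(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def tag_names_parser(text):
    parser = _TagNames()
    parser.feed(text)
    parser.close()
    return parser.names


if __name__ == "__main__":
    sample = '<b>\n<p class="normalText">\n<script type="text/javascript">'
    print(tag_names_regex(sample), tag_names_parser(sample))
    commented = "<!-- <i>old</i> --><b>x</b>"
    print(tag_names_regex(commented), tag_names_parser(commented))
```

On the three lines from the question both approaches agree; on the second input the regex also reports the tag inside the comment, while the parser does not.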

I don't know what system you are using, but it can be done to a certain extent. Look at this online flex-based application. Check out the Published > XML regex examples. You will get an idea.

A regular expression to remove a given (x)HTML tag from a string

Let's say I have a string holding a mess of text and (x)HTML tags. I want to remove all instances of a given tag (and any attributes of that tag), leaving all other tags and text alone. What's the best regex to get this done?
Edited to add: Oh, I appreciate that using a Regex for this particular issue is not the best solution. However, for the sake of discussion can we assume that that particular technical decision was made a few levels over my pay grade? ;)

Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead, there should be one available for your chosen language.
You might be able to get away with something like this:
</?tag[^>]*?>
But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
Use a parser instead :)

I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup when it doesn't make sense to use a full blown parser.
Of course there are times when a parser might be the best option, but if you are looking for a regex then:
<script[^>]*?>[\s\S]*?<\/script>
That would remove script tags and their contents. Make sure that you use case-insensitive matching.
If you don't want to remove the contents of the tag then you can use:
<\/?script[^>]*?>
An example of usage in javascript would be:
function stripScripts(markup) {
    // remove each <script ...>...</script> block, case-insensitively
    return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}
var safeText = stripScripts(textarea.value);

I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)
If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...

There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach... you really shouldn't have to be handling HTML directly anyway.

Off the top of my head, I'd say this will get you started in the right direction.
s/<TAG[^>]*>([^<]*)<\/TAG[^>]*>/\1/
Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.

Corrected answer:
</?TAG\b[^>]*?>
Because Dan's answer would also remove <br />, but you only want to remove <b>.
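The difference the \b makes is easy to check. A minimal Python sketch, with b substituted for TAG and hypothetical function names:

```python
import re

WITHOUT_BOUNDARY = re.compile(r"</?b[^>]*?>")
WITH_BOUNDARY = re.compile(r"</?b\b[^>]*?>")


def strip_b_loose(html):
    """No word boundary: also matches any tag whose name merely starts with 'b'."""
    return WITHOUT_BOUNDARY.sub("", html)


def strip_b_strict(html):
    """With \\b: the tag name has to be exactly 'b'."""
    return WITH_BOUNDARY.sub("", html)


if __name__ == "__main__":
    sample = "<b>bold</b><br />next line"
    print(strip_b_loose(sample))
    print(strip_b_strict(sample))
```

Without the boundary, the "r /" of <br /> is swallowed by [^>]*? and the line break disappears along with the bold tags.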

Here's a regex I wrote for this purpose, it works in a few more situations:
</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>

While using regexes for parsing HTML is generally frowned upon or looked down on, you almost certainly don't want to write your own parser.
You could however use some inbuilt or library functions to achieve what you need.
JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
PHP has the DOM extension.
Python has the awesome Beautiful Soup
...and many more.
