A regular expression to remove a given (x)HTML tag from a string - html

Let's say I have a string holding a mess of text and (x)HTML tags. I want to remove all instances of a given tag (and any attributes of that tag), leaving all other tags and text alone. What's the best Regex to get this done?
Edited to add: Oh, I appreciate that using a Regex for this particular issue is not the best solution. However, for the sake of discussion can we assume that that particular technical decision was made a few levels over my pay grade? ;)

Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead; there should be one available for your chosen language.
You might be able to get away with something like this:
</?tag[^>]*?>
But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
Use a parser instead :)

I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup when it doesn't make sense to use a full blown parser.
Of course there are times when a parser might be the best option, but if you are looking for a regex then:
<script[^>]*?>[\s\S]*?<\/script>
That would remove script tags and their contents. Make sure that you use case-insensitive matching.
If you don't want to remove the contents of the tag then you can use:
<\/?script[^>]*?>
An example of usage in javascript would be:
function stripScripts(markup) {
  return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}
var safeText = stripScripts(textarea.value);

I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)
If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...

There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach... you really shouldn't have to be handling HTML directly, anyway.

Off the top of my head, I'd say this will get you started in the right direction.
s/<TAG[^>]*>([^<]*)<\/TAG[^>]*>/\1/
Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.

Corrected answer:
</?TAG\b[^>]*?>
Because Dan's answer would also remove <br /> when you only want to remove <b>.
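To make the difference concrete, here is a small Python sketch comparing the pattern with and without the word boundary. The sample markup and the variable names are made up for illustration, and the tag being removed is assumed to be `b`.

```python
import re

# Hypothetical sample input, not taken from the question.
markup = '<p>One<br />two <b class="x">bold</b> <B>also bold</B></p>'

# Without \b the pattern matches any tag whose name merely starts with "b".
loose = re.compile(r"</?b[^>]*?>", re.IGNORECASE)
# With \b the tag name must end right after "b", so <br /> is left alone.
strict = re.compile(r"</?b\b[^>]*?>", re.IGNORECASE)

loose_result = loose.sub("", markup)
strict_result = strict.sub("", markup)

print(loose_result)   # <p>Onetwo bold also bold</p>
print(strict_result)  # <p>One<br />two bold also bold</p>
```

Neither version copes with a `>` inside a quoted attribute value, so this is still only suitable for input you know well.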

Here's a regex I wrote for this purpose, it works in a few more situations:
</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>

While using regexes for parsing HTML is generally frowned upon or looked down on, you almost certainly don't want to write your own parser.
You could however use some inbuilt or library functions to achieve what you need.
JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
PHP has the DOM extension.
Python has the awesome Beautiful Soup
...and many more.
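As a rough illustration of the library route, the sketch below uses Python's standard `html.parser` module (not Beautiful Soup) to re-emit a document while dropping one tag name. The class name `TagStripper` and the helper `strip_tag` are my own, and the sketch makes simplifying assumptions: end tags come out lower-cased, and processing instructions are not handled.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit the input, leaving out the start and end tags of one element name."""

    def __init__(self, tag_to_drop):
        # convert_charrefs=False keeps entities such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        self.tag_to_drop = tag_to_drop.lower()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag_to_drop:
            # get_starttag_text() returns the tag exactly as it appeared in the input.
            self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />.
        if tag != self.tag_to_drop:
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != self.tag_to_drop:
            self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append(f"&{name};")

    def handle_charref(self, name):
        self.pieces.append(f"&#{name};")

    def handle_comment(self, data):
        self.pieces.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.pieces.append(f"<!{decl}>")


def strip_tag(markup, tag):
    """Return markup with every <tag ...> and </tag> removed, content kept."""
    parser = TagStripper(tag)
    parser.feed(markup)
    parser.close()
    return "".join(parser.pieces)


print(strip_tag('<p>Hi <b class="x">bold</b><br /> &amp; more</p>', "b"))
# <p>Hi bold<br /> &amp; more</p>
```

Because the parser tokenizes attributes itself, a `>` inside a quoted attribute value does not confuse it the way it confuses the regex versions.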

Related

Look for Nested XML tag with Regex

This is my first post here, hoping to get some response. I've read through a few similar posts and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems to be easier than the ones in other postings, so I'm giving it a shot.
I'm trying to find all the nested tags, here are some examples
I want to catch:
<a><a></a></a>
I don't want to catch
<a></a><a></a>
So in plain English I want to catch all
<a> following another <a> without having </a> in between them, and I want to look through the entire string, so it should proceed even if it sees a newline or line break.
Hoping to have this problem solved.
Thanks all!
I hope you are ready for parsing XML with regex.
First of all, let's define what XML tags would look like!
<tag_name␣(optional space (then whatever that doesnt end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>
To match one of these tags we can then use the following regex:
/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:
/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):
/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s
Done - The regex should do.
No seriously, try it out.
I stole an XML file fragment from w3schools XML tutorial and tried it with my regex, I copied a Maven project .xml from aliteralmind's question and tried it with my regex as well. Works best with heavily nested elements.
Cheers.
If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.
And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.
No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
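For comparison, here is a parser-based sketch in Python. It uses `xml.etree.ElementTree` rather than XSLT, so it is a swapped-in technique, and it assumes the input is well-formed XML. The function name `find_self_nested` is hypothetical. It reports every element that has a descendant with the same tag name, at any depth.

```python
import xml.etree.ElementTree as ET


def find_self_nested(xml_text):
    """Return the elements that contain a descendant with the same tag name."""
    root = ET.fromstring(xml_text)
    hits = []
    for outer in root.iter():
        # Element.iter(tag) yields the element itself as well as its descendants,
        # so skip the element we started from.
        for inner in outer.iter(outer.tag):
            if inner is not outer:
                hits.append(outer)
                break
    return hits


nested = find_self_nested("<r><a><a></a></a></r>")
siblings = find_self_nested("<r><a></a><a></a></r>")

print([e.tag for e in nested])    # ['a']
print([e.tag for e in siblings])  # []
```

Comments and CDATA sections are handled by the parser, which is the part the regex versions cannot get right in general.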

Regular Expressions vs XPath when parsing HTML text

I want to parse some HTML text and find specific parts of it, for example the text in the 3rd div of the 1st row and 2nd column of a table. I have 2 options for parsing: Regular Expressions and XPath. What are the advantages and disadvantages of each?
thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution: the very one shown above, in fact. To go off writing some massively overengineered behemoth to do a full parse to set up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of whacky things you aren’t planning for just cries out for leveraging someone else’s hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can do this and has done it, I have, unlike 99.9999% of the responders here, the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion, here, here, here, here, here, and here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is ok, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With RegExp, it will be up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parses work if they're not using regexp?
XPath is less likely to break if the web developer makes any minor changes. That would be my choice.
Here is the canonical Stackoverflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.
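As a sketch of what the XPath route looks like for the case in the question (the 3rd div in the 1st row, 2nd column of a table), here is a Python example using the limited XPath subset built into `xml.etree.ElementTree`. The sample document is invented, and the sketch assumes well-formed XHTML. Real-world HTML would first need a forgiving HTML parser in front of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
xhtml = """<html><body><table>
  <tr>
    <td>r1c1</td>
    <td><div>first</div><div>second</div><div>third</div></td>
  </tr>
  <tr>
    <td>r2c1</td>
    <td><div>other</div></td>
  </tr>
</table></body></html>"""

root = ET.fromstring(xhtml)

# Positions in XPath predicates are 1-based: 1st row, 2nd cell, 3rd div.
target = root.find(".//table/tr[1]/td[2]/div[3]")
third_div_text = target.text if target is not None else None

print(third_div_text)  # third
```

The path expression states the structure being looked for directly, which is why small changes in whitespace, quoting or attribute order in the page do not affect it.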

Acceptable use of Regex in HTML parsing?

There is a lot of argument back and forth over when, if ever, it is appropriate to use a regex to parse HTML.
Since a common problem is parsing links out of HTML, my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags and you have a pretty specific structure you are looking for.
It seems like significant overkill to use a full HTML parser. While I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect, the extra limitations of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.
Thoughts?
Consider this valid html:
<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
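To see the gap in practice, the Python sketch below runs a naive href regex and the standard library's `html.parser` over the same document. The test document in the code is my reading of the test case above, with the anchor written as `<a href='my">url<'>click</a>` so that it matches the stated answer. The pattern `NAIVE_HREF` and the class `HrefCollector` are illustrative names, not anything from the thread.

```python
import re
from html.parser import HTMLParser

DOC = """<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
"""

# A typical "good enough" pattern: href= followed by a quoted value.
NAIVE_HREF = re.compile(r"""href=["'](.*?)["']""")
regex_hrefs = NAIVE_HREF.findall(DOC)


class HrefCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed(DOC)
collector.close()
parser_hrefs = collector.hrefs

print(regex_hrefs)   # ['url1', 'url2', 'my']
print(parser_hrefs)  # ['my">url<']
```

The regex picks up the link inside the comment and the text inside the class attribute, and truncates the one real URL at the first quote character of either kind.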
I'm one of those people who think using regex in this situation is a bad idea.
Even if you just want to match the href attribute of an <a> tag, your regular expression will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.
Plus, matching href attributes from tags with an XML parser is anything but overkill.
I have been parsing HTML pages every week for at least 2 years now. At first I used pure regex solutions, thinking they were easier and simpler than using an HTML parser.
But I had to come back to my code quite a lot, for many reasons:
the source code had changed
one of the source pages had broken HTML and I hadn't tested it
I hadn't tried my code on every page of the source, only to find out that a few of them didn't work.
...
I found that fixing long regex patterns is not exactly fun; you have to wrap your mind around them again and again.
What I usually do now is:
Use tidy to clean the HTML source.
Use DOM + XPath to actually parse the page and extract the parts I want.
Use regexes only on small text-only parts (like the trimmed textContent of a node).
The code is far more robust, I don't have to spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources, and it just feels proper.
Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I'm still using DOM based solutions, to keep improving my skills with DOM libraries and just produce better code.
I don't like seeing people here who just comment "Don't use regex on html" on every html+regex tagged question without providing sample code or something to start with.
Here is an example to match href attributes from links in PHP, just to show that using a HTML parser for those common tasks isn't overkill at all.
$dom = new DOMDocument();
$dom->loadHTML($html);
// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}
I hope this helps somehow.
I proposed this one :
<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>
On this thread
It can still fail, depending on what appears in name.
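For anyone trying this pattern outside .NET: Python's `re` module spells named groups `(?P<name>...)`, so a direct translation looks like the sketch below. The helper name `extract_links` and the sample strings are made up, and the second call shows one limitation I would expect from the pattern as written (either quote character ends the URL).

```python
import re

# Same pattern, with Python's (?P<name>...) syntax for named groups.
LINK = re.compile(
    r"""<a.*?href=["'](?P<url>.*?)["'].*?>(?P<name>.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(markup):
    """Return (url, name) pairs for every match of LINK in markup."""
    return [(m.group("url"), m.group("name")) for m in LINK.finditer(markup)]


sample = '<p><a class="x" href="http://example.com/">Example</a> and <a href=\'/about\'>About</a></p>'
print(extract_links(sample))
# [('http://example.com/', 'Example'), ('/about', 'About')]

# An apostrophe inside a double-quoted URL cuts the captured value short.
print(extract_links('<a href="/it\'s">x</a>'))
# [('/it', 'x')]
```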

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (the HTML is too crappy to be parsed) and I would like to remove the <a href> tags but keep the text inside them.
I've been searching around for a regex but still haven't found a good way to do that.
All advice is welcome!
You could remove anything from a string that looks like a HTML link via regex. Results heavily depend on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.
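A sketch of the throw-away route in Python, assuming the surrounding XML is itself well-formed even if the HTML inside the CDATA section is not. The element names `item` and `body` and the helper `strip_links` are invented for the example. The XML parser is used only to get at the CDATA text, and the regex is applied to that text alone.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: well-formed XML wrapping sloppy HTML in a CDATA section.
xml_text = (
    "<item><body><![CDATA["
    '<p>See <a href="http://example.com">this <b>page</b></A> now'
    "]]></body></item>"
)

# Matches opening and closing <a> tags only; \b keeps <abbr>, <area> etc. intact.
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_links(html_fragment):
    """Remove <a ...> and </a> tags, keeping whatever was between them."""
    return A_TAG.sub("", html_fragment)


root = ET.fromstring(xml_text)
body = root.find("body")
cleaned = strip_links(body.text)

print(cleaned)  # <p>See this <b>page</b> now
```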

Rails - Escaping HTML using the h() AND excluding specific tags

I have been wondering how to accomplish the following, and so far I have been unable to find any answers online.
Let's say I have a string that contains the following:
my_string = "<strong>Hello</strong>, I am a <i>string</i>."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag in particular can be inserted in very many different ways, and detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (it seems better, though I've never tried it).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business. Follow hrnt's advice, and consider that there are probably an order of magnitude more exploits possible than that, due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating Sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord. But Nokogiri and specifically Loofah seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use Nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.
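Purely to illustrate the "escape everything except a whitelist" idea from the question, here is a minimal Python sketch (not Rails, and not how Sanitize works internally). The function name `escape_except` is hypothetical. It escapes the whole string first and then restores only bare, attribute-less tags from the whitelist, so nothing with attributes ever survives. Treat it as a toy: the advice above to use a reviewed library still stands.

```python
import html
import re


def escape_except(text, allowed_tags):
    """Escape all HTML, then restore bare <tag> and </tag> for whitelisted names."""
    escaped = html.escape(text)
    if not allowed_tags:
        return escaped
    names = "|".join(re.escape(tag) for tag in allowed_tags)
    # Only the exact forms &lt;name&gt; and &lt;/name&gt; are turned back into tags;
    # anything carrying attributes stays escaped.
    pattern = re.compile(rf"&lt;(/?)({names})&gt;", re.IGNORECASE)
    return pattern.sub(r"<\1\2>", escaped)


my_string = "<strong>Hello</strong>, I am a <i>string</i>.<script>alert(1)</script>"
print(escape_except(my_string, ["strong", "i"]))
# <strong>Hello</strong>, I am a <i>string</i>.&lt;script&gt;alert(1)&lt;/script&gt;
```

This does nothing to balance or nest the restored tags, which is one of the things a real sanitizer takes care of.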