How to parse non-strict HTML documents leniently? - html

I've got one more question today:
are there any HTML parsers with non-strict syntax analyzers available?
As far as I can see, such analyzers are built into web browsers.
I mean it would be very nice to get a parser that leniently processes the input document, allowing any of the following situations that are invalid in XHTML and XML:
non-self-closed void tags, for example: <br> or <hr>
mismatched casing in tag pairs: <td>...</TD>
attribute values without quote marks: <span class=hilite>...</SPAN>
and so on.
Please suggest any suitable parser.
Thank you.

TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).
...TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you're happy with Python, Beautiful Soup is just such a parser.
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot

Related

Parsing an HTML document using an XML parser

Can I parse an HTML file using an XML parser?
Why can('t) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.
The intended use is to make an HTML parser that is part of a web crawler application.
You can try parsing an HTML file using an XML parser, but it's likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don't understand.
elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
elements that don't need end tags; e.g., <p>, <dt>, <li> (their end tags can be implied)
elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
attributes with unquoted values; for example, <meta charset=utf-8>
attributes that are empty, with no separate value given at all; e.g., <input disabled>
XML parsers will fail to parse any HTML document that uses any of those features.
HTML parsers, on the other hand, will basically never fail no matter what a document contains.
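To make that contrast concrete, here is a small illustration using nothing but the Python standard library (a sketch of my own, not from the spec):

import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = '<p>void element ahead<br><input disabled>'

# The strict XML parser rejects the unclosed <br> and the empty attribute...
try:
    ET.fromstring(broken)
except ET.ParseError as e:
    print('XML parser gave up:', e)

# ...while the lenient HTML parser reports every tag without complaint.
class TagPrinter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('HTML parser saw:', tag, attrs)

TagPrinter().feed(broken)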
All that said, there's also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty/unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
The intended use is to make an HTML parser that is part of a web crawler application.
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages (see the sketch after this list); e.g.:
parse5 (node.js/JavaScript)
html5lib (python)
html5ever (rust)
validator.nu html5 parser (java)
gumbo (c, with bindings for ruby, objective c, c++, php, c#, perl, lua, D, julia…)
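For example, html5lib implements the HTML standard's parsing algorithm, so it builds the same tree a browser would, even from broken markup. A minimal sketch (my addition, not part of the original answer):

# pip install html5lib
import html5lib

doc = html5lib.parse('<p>unclosed <b>bold', namespaceHTMLElements=False)
print([el.tag for el in doc.iter()])  # ['html', 'head', 'body', 'p', 'b']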
syntactically they are almost identical
Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.
In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".

Is there a parser for the wiki markup language that could create a tree of objects in Java?

I'm looking for a parser for the wiki markup language used by Wikipedia which can convert the input wiki markup text into a parse tree of Java objects. I've come across a few parsers, but they parse the markup text into HTML, like:
java-wikipedia-parser
Mylyn WikiText
WikiText isn't really set up to be parsed in this way.
What you might consider doing is looking at Parsoid – it generates HTML with sufficient annotations that you could convert it into a parse tree.
Otherwise, MediaWiki.org has a page about alternative parsers. It's probably hopelessly out of date, though.

Should I write Polyglot HTML5 documents?

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks of writing them as XML would help to keep my coding habits tidy and valid.
Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?
Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:
No errors when run through the W3C Validator as HTML5
No errors when run through an XML parser
But are there any other rules I'm missing?
Thirdly, seeing as it is a polyglot, does anyone know any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?
Edit: After a small bit of experimenting I found that entities like &nbsp; break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword; I guess I've answered my third question already.
Work on defining how to create HTML5 polyglot documents is currently on-going, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML4.01/XHTML1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness and have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.
One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
I'm late to the party but after 5 years the question is still relevant.
On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.
https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice to use XML mime type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing) but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.
So I think there is a useful middle way:
For now serve only as text/html. Stop worrying that it will actually parse as exactly the same DOM with same runtime behavior in both HTML and XML modes.
Only strive that it parses as some well-formed XML. It helps readers, it helps editors, and it lets me use an XML parser on my own documents (quick sketch below).
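Something as small as this stdlib check covers the "XML parser on my own documents" part (a sketch; the file name is a placeholder):

import xml.etree.ElementTree as ET

try:
    ET.parse('page.html')
    print('parses as well-formed XML')
except ET.ParseError as e:
    print('not well-formed:', e)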
Unfortunately, polyglot tools are rare to non-existent; it's hard to even serialize XML back in a way that also passes the HTML requirements...
No-brainer: always self-close void tags (<hr/>) and separately close non-void tags (<script ...></script>).
More no-brainers: use lowercase tag and attribute names (except for some SVG, but foreign content uses XML rules anyway), always quote attribute values, and always provide attribute values (selected="selected" is more verbose than standalone selected, but I can live with that).
Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:
<script>/*<![CDATA[*/
foo < bar && bar < baz;
/*]]>*/</script>
...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)
Await some future when I can go directly to authoring XHTML, skipping polyglotness. The benefits are that I'll be able to forget the tag-closing limitations and will be able to directly consume and produce it with XML tools. Sure, neglecting XML namespaces and other things now will make the switch harder, but I think I'll create more new documents in this future than convert existing ones.
Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slightly hoping a future HTML spec will find a way to shrink the HTML vs XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML, while still retaining HTML error handling.
Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.
In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.
Should you? Yes. But first some clarification on a couple points.
Sending the Content-Type: application/xhtml+xml header only means it should go through an XML parser; it still has all the benefits of HTML5 as far as I can tell.
About &nbsp;: that isn't defined in XML. The only character entity references XML defines are lt, gt, apos, quot, and amp; you will need to use numeric character references for anything else. The code for nbsp is &#160; or &#xA0;. I personally prefer hex because Unicode code points are represented that way (U+00A0).
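A quick way to see the difference with a strict XML parser (my sketch, not part of the original answer):

import xml.etree.ElementTree as ET

try:
    ET.fromstring('<p>a&nbsp;b</p>')         # named entity, no DTD
except ET.ParseError as e:
    print(e)                                 # undefined entity
print(ET.fromstring('<p>a&#xA0;b</p>').text) # numeric reference parses fine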
Sending the header is useful for testing because you can quickly find problems with your markup such as unclosed tags, stray end tags, text that could be interpreted as a tag, etc, basically stuff that can break the look or even functionality of your site.
Most significantly in my opinion, is if you are allowing user input and it fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might not ever notice a problem until someone starts injecting scripts to harass your users or steal data.
This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell
This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.
I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.
This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

Win32: How to scrape HTML without regular expressions?

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions, yet it doesn't give an alternative.
I want to scrape search results, extracting these values:
<div class="used_result_container">
...
...
<div class="vehicleInfo">
...
...
<div class="makemodeltrim">
...
<a class="carlink" href="[Url]">[MakeAndModel]</a>
...
</div>
<div class="kilometers">[Kilometers]</div>
<div class="price">[Price]</div>
<div class="location">
<span class='locationText'>Location:</span>[Location]
</div>
...
...
</div>
...
...
</div>
...and it repeats
You can see the values I want to extract, [enclosed in brackets]:
Url
MakeAndModel
Kilometers
Price
Location
Assuming we accept the premise that parsing HTML with regular expressions:
is generally a bad idea
rapidly devolves into madness
What's the way to do it?
Assumptions:
native Win32
loose html
Assumption clarifications:
Native Win32
.NET/CLR is not native Win32
Java is not native Win32
perl, python, ruby are not native Win32
assume C++, in Visual Studio 2000, compiled into a native Win32 application
Native Win32 applications can call library code:
copied source code
DLLs containing function entry points
DLLs containing COM objects
DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects
Loose HTML
xml is not loose HTML
xhtml is not loose HTML
strict HTML is not loose HTML
Loose HTML implies that the HTML is not well-formed XML (strict HTML is not well-formed XML anyway), and so an XML parser cannot be used. In reality, I am presenting the assumption that any HTML parser must be generous in the HTML it accepts.
Clarification#2
Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node with a class of used_result_container, which has a child DIV with a class of vehicleInfo. But the nodes don't necessarily have to be direct children of one another.
It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?
And would I not be writing a regular expression parser for DOM nodes? I'm writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.
I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:
walking a DOM tree
xpath queries
Clarification#3
The sample HTML I provided I trimmed down to the important elements and attributes. The mechanism I used to trim the HTML down was based on my internal bias that uses regular expressions. I naturally think that I need various "sign-posts" in the HTML that I look for.
So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.
Update 4
The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?
Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.
In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain a vehicleInfo DIV as a child are not relevant.
Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:
\\div[#class="used_result_container" && .\div[#class="vehicleInfo"]]\*
Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.
Python:
lxml - faster, perhaps better at parsing bad HTML
BeautifulSoup - if lxml fails on your input, try this.
Ruby: (heard of the following libraries, but never tried them)
Nokogiri
hpricot
Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)
Edit: Your post has really grown since it first came out... I'll try to answer what I can.
I don't think XPath can select higher level nodes based on criteria of lower level nodes:
It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
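Here is a sketch of that query with lxml, run against a trimmed-down stand-in for your markup (the values are placeholders, and exact @class matching assumes single-class attributes):

from lxml import html  # pip install lxml

doc = html.fromstring('''
<div class="used_result_container">
  <div class="vehicleInfo">
    <div class="makemodeltrim">
      <a class="carlink" href="/detail/1">Make and Model</a>
    </div>
    <div class="kilometers">12345</div>
    <div class="price">$9,999</div>
  </div>
</div>''')

for hit in doc.xpath('//div[@class="vehicleInfo"]'
                     '/parent::div[@class="used_result_container"]'):
    link = hit.xpath('.//a[@class="carlink"]')[0]
    print(link.get('href'), link.text.strip(),
          hit.xpath('.//div[@class="price"]/text()')[0])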
how then do you access repeating structures of data?
It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad: you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...
I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.
Native Win32
You can always use IHTMLDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).
Use Html Agility Pack for .NET
Update
Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces.
Use Beautiful Soup.
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.
If you are really under Win32, you can use a tiny and fast COM object to do it.
Example code with VBScript:
Set dom = CreateObject("htmlfile")
dom.write("<div>Click for <img src='http://www.google.com/images/srpr/logo1w.png'>Google</a></div>")
WScript.Echo(dom.Images.item(0).src)
You can also do this in JScript, or VB/Delphi/C++/C#/Python etc. on Windows. It uses the mshtml.dll DOM layout engine and parser directly.
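The same trick works from Python through the pywin32 package; a sketch under the assumption that pywin32's dynamic dispatch exposes the same methods as the VBS example above:

# pip install pywin32 (Windows only)
import win32com.client

dom = win32com.client.Dispatch('htmlfile')  # MSHTML document object via COM
dom.write("<div>Click for <img src='http://www.google.com/images/srpr/logo1w.png'>Google</div>")
print(dom.images.item(0).src)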
The alternative is to use an HTML DOM parser. Unfortunately, it seems like most of them have problems with poorly formed HTML, so in addition you need to run it through HTML Tidy or something similar first.
If a DOM parser is out of the question - for whatever reason - I'd go for some variant of PHP's explode() or whatever is available in the programming language that you use.
You could for example start out by splitting by <div class="vehicleInfo">, which would give you each result (remember to ignore the first piece). After that you could loop over the results and split each result by <div class="makemodeltrim"> etc.
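The same idea sketched in Python, since this thread already leans that way (results.html is a placeholder for the fetched page):

page = open('results.html', encoding='utf-8').read()

for chunk in page.split('<div class="vehicleInfo">')[1:]:  # [0] is the preamble
    price = chunk.split('<div class="price">')[1].split('</div>')[0]
    print(price)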
This is by no means an optimal solution, and it will be quite fragile (almost any change in the layout of the document would break the code).
Another option would be to go after some CSS selector library like phpQuery or similar for your programming language.
Use a DOM parser
e.g. for Java, check this list:
Open Source HTML Parsers in Java (I like to use cobra)
Or, if you are sure that you only want to parse a certain subset of your HTML, one which ideally is also valid XML, you could use an XML parser to parse only the fragment you pass it, and then even use XPath to request the values you are interested in.
Open Source XML Parsers in Java (e.g. dom4j is easy to use)
I think libxml2, despite its name, also does its best to parse tag soup HTML. It is a C library, so it should satisfy your requirements. You can find it here.
BTW, another answer recommended lxml, which is a Python library, but is actually built on libxml2. If lxml worked well for him, chances are libxml2 is going to work well for you.
How about using Internet Explorer as an ActiveX control? It will give you a fully rendered structure as it viewed the page.
The HTML::Parser and HTML::Tree modules in Perl are pretty good at parsing most typical so-called HTML on the web. From there, you can locate elements using XPath-like queries.
What do you think about IHTMLDocument2? I think it should help.

Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:
People want to treat a file as a sequence of lines, but this is valid:
<tag
attr="5"
/>
People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:
<img src="imgtag.gif" alt="<img>" />
People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):
<span id="outer"><span id="inner">foo</span></span>
People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):
<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>
Comments may contain poorly formatted or incomplete tags:
foo
<!-- FIXME:
<a href="
-->
bar
What other gotchas are you aware of?
Here's some fun valid XML for you:
<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
<a b="&y;>" />
<![CDATA[[a>b <a>b <a]]>
<?x <a> <!-- <b> ?> c --> d
</x>
And this little bundle of joy is valid HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
<!ENTITY % e "href='hello'">
<!ENTITY e "<a %e;>">
]>
<title>x</TITLE>
</head>
<p id = a:b center>
<span / hello </span>
&amp<br left>
<!---- >t<!---> < -->
&e link </a>
</body>
Not to mention all the browser-specific parsing for invalid constructs.
Good luck pitting regex against that!
EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML/
<HEAD/
<TITLE/>/
<P/>
Actually
<img src="imgtag.gif" alt="<img>" />
is not valid HTML, and is not valid XML either.
It is not valid XML because the '<' and '>' are not valid characters inside attribute strings. They need to be escaped using the corresponding XML entities &lt; and &gt;.
It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.
The correct version in HTML is
<img src="imgtag.gif" alt="<img>">
and the correct version in XHTML and XML is
<img src="imgtag.gif" alt="<img>"/>
The following example you gave is also invalid
<
tag
attr="5"
/>
This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually
<tag
attr="5"
/>
And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character
<img src="image.gif" alt='This is single quoted AND valid!'>
All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tagsoup as HTML doesn't means that you have actually written valid HTML.
Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.
Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.
My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.
HTML is just too hard to even try parsing on your own. First, the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job of handling HTML as tag soup; just use those.
I wrote an entire blog entry on this subject: Regular Expression Limitations
The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting; you must have a context-free grammar in order to count.
The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However, once you start adding recursion to your regexes, you are really stretching the boundaries and should consider a parser.
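A concrete illustration of the counting problem, using the nested spans from the question above (a Python sketch of my own):

import re

nested = '<span id="outer"><span id="inner">foo</span></span>'
print(re.search(r'<span[^>]*>(.*?)</span>', nested).group(1))
# -> <span id="inner">foo   (the outer span's content is cut at the FIRST </span>)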
One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.
It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.
Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?
I totally agree that parsing HTML and XML using a regex is not possible, as other people have answered.
However, if your requirement is not to parse the HTML/XML but just to get at one small bit of data in a "known good" bit of HTML/XML, then maybe a regular expression or even a simpler substring search is good enough.
I'm tempted to say "don't re-invent the wheel". Except that XML is a really, really complex format. So maybe I should say "don't reinvent the synchrotron."
Perhaps the correct cliche starts "when all you have is a hammer..." You know how to use regular expressions, regular expressions are good at parsing, so why bother to learn an XML parsing library?
Because parsing XML is hard. Any effort you save by not having to learn to use an XML parsing library will be more than made up by the amount of creative work and bug-swatting you will have to do. For your own sake, google "XML library" and leverage somebody else's work.
People normally default to writing greedy patterns, which often enough leads to an un-thought-through .* slurping large chunks of the file into the largest possible <foo>.*</foo>.
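For instance (a quick Python demonstration, my addition):

import re

text = '<foo>first</foo> and <foo>second</foo>'
print(re.findall(r'<foo>(.*)</foo>', text))   # ['first</foo> and <foo>second']
print(re.findall(r'<foo>(.*?)</foo>', text))  # ['first', 'second'] - better, but
                                              # still wrong once <foo> can nest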
I think the problems boil down to:
The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.
If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.
If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.
It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.
I believe this classic has the information you are looking for. You can find the point in one of the comments there:
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context-free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar, you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.
Some more info from Wikipedia: Chomsky Hierarchy
I gave a simplified answer to this problem here. While it doesn't account for the 100% mark, I explain how it's possible if you're willing to do some pre-processing work.
Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.
I disagree. If you use recursion in a regex, you can easily find open and close tags.
Here I showed an example of a regex that avoids the parsing errors of the examples in the first message.