I have hundreds of files (ancient ASP and HTML) filled with outdated and often completely invalid HTML code.
Between Visual Studio and ReSharper, this invalid HTML is flagged and easily visible once the editor window is scrolled to where it appears. However, neither tool provides a way to quickly fix the errors across the whole project.
The first few errors on which ReSharper focuses my attention are tags that are either not closed or closed but not opened. Sometimes this occurs because the opening and closing tags overlap - for instance:
<font face=verdana size=5><b>some text</font></b>
<span><p>start of a paragraph
with multiple lines of <i><b>text/html
</i> with a nice mix of junk</b>
</span></p>
Older versions of HTML sometimes allowed opening tags without a corresponding closing tag (or the tools that generated the HTML didn't care about the standards, since most browsers figured out what the author meant anyway). So the mess I'm attempting to clean up has many unclosed HTML tags that ought to be closed.
<font face = tahoma size=2>some more text<b><sup>*</sup></b>
...
...
</body>
</html>
And just for good measure, the code includes lots of closing HTML tags that have no matching start tag.
</b><p>some text that is actually within closed tags</p>
</td>
</tr>
</table>
So, other than writing a new application to parse, flag, and fix all these errors - does anyone have some .Net regular expressions that could be used to locate and preferably fix this stuff with Visual Studio 2012's Search and Replace feature?
Though a single expression that does it all would be nice, multiple expressions that each handle one of the above cases would still be very helpful.
For the case of overlapped HTML tags, I'm using this expression:
(?n)(?<t1s>(?><(?<t1>\w+)[^>]*>))(?<c1>((?!</\k<t1>>)(\n|.))*?)(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>))(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?)(?<t1e></\k<t1>>)(?<c3>(?>(\n|.)*?))(?<t2e></\k<t2>>)
Explanation:
(?n) ExplicitCapture: ignore unnamed captures so only the named groups are stored.
(?<t1s>(?><(?<t1>\w+)[^>]*>)) Get the first tag, capturing the full tag and attributes
for replacement and the name alone for further matching.
(?<c1>((?!</\k<t1>>)(\n|.))*?) Capture content between the first and second tag.
(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>)) Get the 2nd tag, capturing the full
tag and attributes for replacement, the name alone for further matching, and ensuring
it does not match the 1st tag and that the first tag is still open.
(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?) Capture content between the second tag
and the closing of the first tag.
(?<t1e></\k<t1>>) Capture the closing of the first tag, where the second tag is
still open.
(?<c3>(?>(\n|.)*?)) Capture content between the closing of the first tag and the closing
of the second tag.
(?<t2e></\k<t2>>) Capture the closing of the second tag.
With this replacement expression:
${t1s}${c1}${t2s}${c2}${t2e}${c3}${t1e}
The issue with this search expression is that it is painfully slow. Using . instead of (\n|.) for the three content captures is much quicker, but limits the results to just those where the overlapped tags and intervening content are on a single line.
The expression will also match valid, properly closed and properly nested HTML if the first tag appears inside the content of the second tag, like this:
<font color=green><b>hello world</b></font><span class="whatever"><font color=red>*</font></span>
So it is not safe to use the expression in a "Replace All" operation, especially across the hundreds of files in the solution.
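For experimenting with the idea outside Visual Studio, here is a rough Python approximation of the overlapped-tag pattern and swap-replacement above. It is a sketch, not a drop-in port: Python's re has no (?n) or atomic groups, the group names are kept but the details differ, and it shares the same false-positive caveat on properly nested HTML.

```python
import re

# Approximation of the .NET overlapped-tag pattern using Python's re.
# Shares the original's caveat: it can also match properly nested HTML.
OVERLAP = re.compile(
    r'(?P<t1s><(?P<t1>\w+)[^>]*>)'                # first opening tag
    r'(?P<c1>(?:(?!</(?P=t1)>).)*?)'              # content before tag 2
    r'(?P<t2s><(?!(?P=t1)\b)(?P<t2>\w+)[^>]*>)'   # second, different tag
    r'(?P<c2>(?:(?!</(?:(?P=t1)|(?P=t2))>).)*?)'  # content before </t1>
    r'(?P<t1e></(?P=t1)>)'                        # tag 1 closes too early
    r'(?P<c3>.*?)'                                # content in between
    r'(?P<t2e></(?P=t2)>)',                       # tag 2 finally closes
    re.DOTALL,
)

def fix_overlap(html):
    # Swap the two closing tags so the nesting becomes proper.
    return OVERLAP.sub(
        r'\g<t1s>\g<c1>\g<t2s>\g<c2>\g<t2e>\g<c3>\g<t1e>', html)

print(fix_overlap('<font face=verdana size=5><b>some text</font></b>'))
```

Run on the first example from the question, it reorders the closers into `...some text</b></font>`.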
For unclosed tags, I've successfully handled the self-closing tags: <img/>, <meta/>, <input/>, <link/>, <br/>, and <hr/>. However, I've still not attempted the generic case for all the other tags - those that may have content or should be closed with a separate closing tag.
Also, I've no idea how to match closing tags without a matching opening tag. The simple solution of </\w+> will match all closing tags regardless of whether or not they have a matched opening tag.
According to their website, ReSharper has this feature:
Solution-Wide Analysis
Not only is ReSharper capable of analyzing a specific code file for errors, but it can extend its analysis skills to cover your whole solution.
...
All you have to do is explicitly switch Solution-Wide Analysis on, and then, after it analyzes the code of your solution, view the list of errors in a dedicated window:
Even without opening that window, you can still easily navigate through errors in your solution with Go to Next Error in Solution (Shift+Alt+PageDown) and Go to Previous Error in Solution (Shift+Alt+F12) commands.
Your current "solution" is to use regexes on a context-sensitive language (invalid HTML). Please, NO. People already flip out when someone suggests parsing context-free languages with regexes.
On second thought, there might be a solution that we can use regexes for.
For this HTML:
<i><b>text/html
</i> with a nice mix of junk</b>
A better transformation would be (it's more valid, right?):
<i></i><b><i>text/html
</i> with a nice mix of junk</b>
There are many ways this could go wrong (although it's pretty bad as-is), but I assume you have this all backed up. This regex (where i is an example of a tag you may want to do this with):
<(i(?: [^>]+)?)>([^<]*)<(\/?[^i](?: [^>]+)?)>
Might help you out. I don't know how regex replace works in whatever flavor you're using, but if you replace $0 (everything matched by the regex) with <$1>$2</$1><$3><$1>, you'll get the transformation I'm talking about.
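Translated to Python just to show the effect on the snippet from the question (here \1..\3 play the role of $1..$3):

```python
import re

# The answer's pattern, applied with Python's re for illustration.
pat = re.compile(r'<(i(?: [^>]+)?)>([^<]*)<(/?[^i](?: [^>]+)?)>')
html = '<i><b>text/html\n</i> with a nice mix of junk</b>'

# Close the <i>, reopen it inside the new tag, as described above.
print(pat.sub(r'<\1>\2</\1><\3><\1>', html))
```

The output is exactly the "more valid" transformation shown earlier: `<i></i><b><i>text/html ...`.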
Related
Working my way through this tutorial node-express-mongo primer
At one point is the following tag
<li><a href="superhero/{{superhero.id}}"</a>{{superhero.name}}</li>
There is no "closing" greater-than char for the anchor's start tag.
But it works as intended.
Now I can adjust the above to create an identical looking link, but with a complete start tag:
<li><a href="superhero/{{superhero.id}}">{{superhero.name}}</a></li>
However, I'm new at web design and feel like I'm missing some rules somewhere. Is this common practice, and where would I find this kind of information? To me it feels awkward not seeing the tag completed.
Or maybe the browser is just forgiving and this is not a common practice?
Thanks in advance!
As there is no ending bracket for the tag, the </a is interpreted as garbage inside the tag and the > after it ends the starting tag. You get a starting tag without an ending tag.
The browser then figures out that the anchor tag has to end before closing the list item tag.
You should not rely on the browser implicitly closing your tags, as browsers might react differently to incorrect markup. In this specific case it's likely that browsers react the same, but in other cases they may choose to interpret the code differently, for example placing elements in a different order than you expected.
I'm trying to learn awk and sed better, to be able to create cross-compatible terminal tools without needing things like PHP, Perl and so on. I'm now trying to clean up a very long string which is basically a part of an HTML document that I've fetched with curl. I'm wondering about the best way to go about this.
Most solutions that I have found are counting on luxuries like static files or structures, but as I'm trying to clean up fetched HTML code I want to be able to assume that the "periphery" of the string can change a lot, both in size and structure. So what I think I need to be able to do is essentially identify HTML tags, as these likely will not change, and extract the data from those HTML tags, no matter where they are. An example could be something like this:
<span class="unique-class">Payload</span>
I need to be able to look for that entire HTML tag, and when it is found, I need to extract basically everything after the >, until a < is found and another tag starts.
Since my original code is basically useless due to the fact that it just greps lines matching certain words (words that can show up in non-interesting instances on the same page), I'm really open for anything.
You'll very likely need to use regex to find the string segments you need; sed and awk both take regexes, though extended syntax may require a switch (e.g. sed -E). I recommend looking for the tags as wholes, otherwise you might end up grabbing code between a closing tag and an opening tag (</span>stuff here<p>), which you probably don't want.
So, your regexes, at their most basic, might look something like this (not tested, you will probably have to tweak it):
/<[a-zA-Z][^>]*>/   /* Find the opening tag. */
/<\/[a-zA-Z][^>]*>/ /* Find the closing tag, note the "/" after the "<". */
Depending on your needs, you can create a list of tags to look for, specifically, giving you something like:
tags="div|p|article|section"  /* Your list of tags, pipe-delimited for OR logic */
/<($tags)[[:print:]]*>/       /* The regex, looking for something like <div[anything]> */
You may be able to take it further by regexing for the opening tag, storing the base tag in a variable, then finding the matching closing tag. This may take a little more work to get right, but it has the advantage of being more robust and naturally avoids the pitfall of stopping at the wrong closing tag (i.e. stopping at an </a> when it should stop at </p>).
A couple of notes - this may get a little hairy with some of the single-character tags. If you don't write it intelligently enough, your program may confuse things like <a> and <article>, so make sure your code is robust enough to account for that.
Also, don't forget that <input>s are used for generating most of the different form inputs, so if you care about what those are, make sure to look for the type attribute whenever you run across an <input>.
Finally, you can't necessarily assume that a tag will have a closing tag. Some tags don't have one (<br/>/<br>, <hr/>/<hr>) and the HTML specs don't always require them (<li> and <p> don't require closing tags as long as the next opening tag is another <li> or <p>, or is followed by the parent's closing tag). You also can't assume that the HTML you get will be valid. So make sure to account for these situations, so your application doesn't crash and burn.
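Concretely, the extraction the question asks for ("everything after the > of a known tag, until the next <") can be sketched in one line of sed, assuming the whole element sits on a single line as in the question's example (the class name and payload are from the question; everything else is illustrative):

```shell
# Extract the payload of a specific tag with sed; assumes the whole
# element sits on one line, as in the question's example.
echo '<div><span class="unique-class">Payload</span></div>' |
  sed -n 's/.*<span class="unique-class">\([^<]*\)<.*/\1/p'
```

This prints just the text between the tag's > and the next <.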
I'm using JSoup in an attempt to build valid XML from a couple of websites. Most of the time it has worked phenomenally well, but recently I've encountered some cases of bad HTML that JSoup can't seem to fix.
<meta name="saploTags" content="Tag1,Tag2,Tag3," Tag4,Tag5,Tag6"/>
Results in
<meta name="saploTags" content="Tag1,Tag2,Tag3," tag4,tag5,tag6"="" />
This causes problems later on when I'm trying to index the resulting XML. Does anyone have any suggestions what to do? Preferably I'd have everything between the leftmost and rightmost quotation marks escaped or removed in some way to prevent data loss (like content="Tag1,Tag2,Tag3,Tag4,Tag5,Tag6"). Otherwise it would be OK if JSoup cut off after its first "end quote", disregarding the last tags, like content="Tag1,Tag2,Tag3".
(A similar problem I've found is e.g. <img src=".." alt="This text contains the quote "The quote" and here's some more text"/>, which causes the same kind of trouble.)
Is it possible to get around this with jsoup, or have I reached a dead end?
/Regards, Magnus
That's quite simply not valid XML nor HTML. Those double quotes should be turned into character references if they're to be considered as part of the attribute value. Even if you could set a parser to be very lenient, it's not gonna be able to solve this because it is no longer clear where the attribute content ends.
Trying to automatically fix this seems rather difficult. There's all sorts of corner cases that'll wreak havoc on any sort of solution. How's this supposed to be interpreted, for example:
<element attribute="this isn't "quite" the=correct way="to=" do things"" />
Look at how the SO code formatter struggles with it.
Even making sense of this yourself is difficult, let alone writing a tool that's gonna make sense of what is or isn't attribute content.
Simple approach? Just don't accept invalid HTML. It's lenient enough as it is, with most parsers allowing lower case and upper case element names, closing tags not always being mandatory etc. If people still manage to generate invalid HTML, then too bad for them.
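That said, if the lossy fallback from the question ("cut off after the first end quote") is acceptable, a pre-clean pass before the markup ever reaches the parser is one way out. This regex sketch is a heuristic for this one attribute shape, not a general repair:

```python
import re

# Heuristic pre-clean: keep only the first quoted value of 'content'
# and drop the stray tail before the tag closes (lossy by design).
bad = '<meta name="saploTags" content="Tag1,Tag2,Tag3," Tag4,Tag5,Tag6"/>'
fixed = re.sub(r'(content="[^"]*")[^>]*?(/?)>', r'\1\2>', bad)
print(fixed)
```

This yields content="Tag1,Tag2,Tag3," with Tag4-Tag6 discarded, which is exactly the trade-off the question describes.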
In my web application I intend to shorten a lengthy string of HTML formatted text if it is more than 300 characters long and then display the 300 characters and a Read More link on the page.
The issue I came across is when the 300 character limit is reached inside an HTML tag, example: (look for HERE)
<a hreHERE="somewhere">link</a>
<a hre="somewhere">liHEREnk</a>
When this happens, the entire page could become ill-formatted because everything after the HERE in the previous example is removed and the HTML tag is kept open.
I'm thinking of using CSS to hide any overflow beyond a certain limit and create the "Read More" link if the text exceeds a certain length, but this would entail including all the text on the page.
I've also thought about splitting the text at . to ensure that it's split at the end of a sentence, but that would mean I would include more characters than I needed.
Is there a better way to accomplish this?
Note: I have not specified a server side language because this is more of a general question, but I'm using ASP.NET/C# .
Extract the plaintext from the HTML, and display that. There are libraries (like the HTML Agility Pack for .NET) that make this easy, and it's not too hard to do it yourself with an XML parser. Trying to fix a truncated HTML snippet is a losing cause.
One option I can think of is to cut it off at 300 characters, then check whether the last index of '<' is greater than the last index of '>'. If it is, the cut landed inside a tag, so truncate the string right before that last '<', then use a library like HTML Tidy to fix tags that are orphaned (like the </a> in the example).
There are problems with this though. One thing being if there are 300 chars worth of nothing but HTML - your summary will be displayed as empty.
If you do not need the html to be displayed it's far easier to simply extract the plain text and use that instead.
EDIT: Added the suggestion of something like HTML Tidy for orphaned tags. The original answer only handled cuts in the middle of a tag, not cuts between an opening and closing tag.
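The "extract the plain text, then truncate" route suggested above looks roughly like this; stdlib Python is used for illustration, with the parser playing the role the HTML Agility Pack would fill in .NET:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def summarize(html, limit=300):
    # Strip the markup first, so the cut can never land inside a tag.
    p = TextExtractor()
    p.feed(html)
    text = ''.join(p.parts)
    return text if len(text) <= limit else text[:limit] + '…'

print(summarize('<p><a href="somewhere">link</a> and more text</p>', limit=10))
```

Since the truncation happens on plain text, the ill-formatted-page failure mode disappears entirely.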
I'm looking for a regex that matches all used HTML tags in a text consisting of several lines. It should read out "b", "p" and "script" in the following lines:
<b>
<p class="normalText">
<script type="text/javascript">
Is there such thing? The start I have is that it should start with a "<" and read until it hits a space or a ">", but at the same time, it should not include the starting "<" since I just want to match the letter/word itself. Thoughts?
There are many similar questions on SO:
Filter out HTML tags and resolve entities in python
Regex to match all HTML tags except <p> and </p>
Strip all HTML tags except links
etc. The general agreement is that it's best not to parse HTML with regular expressions at all, but to do it properly with a DOM parser and traverse the DOM tree.
It's virtually impossible to regex HTML once you start considering all the special cases and malformed HTML that browsers sometimes happily parse anyway. That said, I thought it might be fun to get the names without using capture groups, and thus I present to you the following solution:
(?<=<)\w+(?=[^<]*?>)
For the record I hold little faith in it being at all useful in any but the most trivial of cases.
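For what it's worth, on the question's own sample the pattern does pull out the three names; Python is shown here since its lookbehind/lookahead syntax matches this flavor:

```python
import re

# The lookaround pattern above, run over the question's sample lines.
sample = '<b>\n<p class="normalText">\n<script type="text/javascript">'
print(re.findall(r'(?<=<)\w+(?=[^<]*?>)', sample))
```

The lookbehind anchors each match just past a <, so the tag name is captured without it.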
I don't know what system you are using, but it can be done to a certain extent. Look at this online flex-based application. Check out the Published > XML regex examples. You will get an idea.