Looking for a way to trim HTML code using terminal commands - html

I'm trying to learn awk and sed better, to be able to create cross-compatible terminal tools without needing things like PHP, Perl and so on. I'm now trying to clean up a very long string which is basically a part of an HTML document that I've fetched with curl. I'm wondering about the best way to go about this.
Most solutions that I have found are counting on luxuries like static files or structures, but as I'm trying to clean up fetched HTML code I want to be able to assume that the "periphery" of the string can change a lot, both in size and structure. So what I think I need to be able to do is essentially identify HTML tags, as these likely will not change, and extract the data from those HTML tags, no matter where they are. An example could be something like this:
<span class="unique-class">Payload</span>
I need to be able to look for that entire HTML tag, and when it is found, I need to extract basically everything after the >, until a < is found and another tag starts.
Since my original code is basically useless due to the fact that it just greps lines matching certain words (words that can show up in non-interesting instances on the same page), I'm really open for anything.

You'll very likely need to use a regex to find the string segments you need; both sed and awk accept regexes, though extended syntax may require a switch. I recommend looking for the tags as wholes, otherwise you might end up grabbing code between a closing tag and an opening tag (</span>stuff here<p>), which you probably don't want.
So, your regexes, at their most basic, might look something like this (not tested, you will probably have to tweak it):
/<[a-zA-Z][^>]*>/      # Find an opening tag.
/<\/[a-zA-Z][^>]*>/    # Find a closing tag; note the "/" right after the "<".
Depending on your needs, you can create a list of tags to look for, specifically, giving you something like:
tags="div|p|article|section"   # Your list of tags, pipe-delimited for OR logic
/<($tags)[^>]*>/               # The regex, looking for something like <div[anything]>
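To pull the payload out of the question's example tag, a single sed substitution will often do. A rough, untested sketch (assuming GNU or BSD sed, that $url stands in for whatever page you are fetching, and that the tag and its payload sit on one line):
# print only the text between <span class="unique-class"> and the next "<"
curl -s "$url" | sed -n 's/.*<span class="unique-class">\([^<]*\)<.*/\1/p'
awk can do much the same with match() and substr() if you need finer control, for example when several matches land on one line.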
You may be able to take it farther by Regexing for the opening tag, storing the base tag in a variable, then finding the matching closing tag. This may take a little more work to get working properly, but it does have the advantage of being more robust and naturally avoids the pitfalls of stopping at the wrong closing tag (ie - stopping at an </a> when it should stop at </p>).
A couple of notes - this may get a little hairy with some of the single-character tags. If you don't write it intelligently enough, your program may confuse things like <a> and <article>, so make sure your code is robust enough to account for that.
Also, don't forget that <input>s are used for generating most of the different form inputs, so if you care about what those are, make sure to look for the type attribute whenever you run across an <input>.
Finally, you can't necessarily assume that a tag will have a closing tag. Some tags don't have one (<br/>/<br>, <hr/>/<hr>) and the HTML specs don't always require them (<li> and <p> don't require closing tags as long as the next opening tag is another <li> or <p>, or is followed by the parent's closing tag). You also can't assume that the HTML you get will be valid. So make sure to account for these situations, so your application doesn't crash and burn.

Related

Regular Expressions to fix invalid HTML

I have hundreds of files (ancient ASP and HTML) filled with outdated and often completely invalid HTML code.
Between Visual Studio and ReSharper, this invalid HTML is flagged and easily visible if the editor window is scrolled to where the invalid HTML appears. However, neither tool is providing any method to quickly fix the errors across the whole project.
The first few errors on which ReSharper focuses my attention are tags that are either not closed or closed but not opened. Sometimes this occurs because the opening and closing tags overlap - for instance:
<font face=verdana size=5><b>some text</font></b>
<span><p>start of a paragraph
with multiple lines of <i><b>text/hmtl
</i> with a nice mix of junk</b>
</span></p>
Sometimes opening tags without a corresponding closing tag were allowed in older versions of HTML (or the tools which generated the HTML didn't care about the standards as some browsers usually figured out what the author meant). So the mess I'm attempting to clean up has many unclosed HTML tags that ought to be closed.
<font face = tahoma size=2>some more text<b><sup>*</sup></b>
...
...
</body>
</html>
And just for good measure, the code includes lots of closing HTML tags that have no matching start tag.
</b><p>some text that is actually within closed tags</p>
</td>
</tr>
</table>
So, other than writing a new application to parse, flag, and fix all these errors - does anyone have some .Net regular expressions that could be used to locate and preferably fix this stuff with Visual Studio 2012's Search and Replace feature?
Though a single expression that does it all would be nice, multiple expressions that each handle one of the above cases would still be very helpful.
For the case of overlapped HTML tags, I'm using this expression:
(?n)(?<t1s>(?><(?<t1>\w+)[^>]*>))(?<c1>((?!</\k<t1>>)(\n|.))*?)(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>))(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?)(?<t1e></\k<t1>>)(?<c3>(?>(\n|.)*?))(?<t2e></\k<t2>>)
Explanation:
(?n) - Ignore unnamed captures.
(?<t1s>(?><(?<t1>\w+)[^>]*>)) - Get the first tag, capturing the full tag and attributes for replacement and the name alone for further matching.
(?<c1>((?!</\k<t1>>)(\n|.))*?) - Capture content between the first and second tag.
(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>)) - Get the second tag, capturing the full tag and attributes for replacement, the name alone for further matching, and ensuring it does not match the first tag and that the first tag is still open.
(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?) - Capture content between the second tag and the closing of the first tag.
(?<t1e></\k<t1>>) - Capture the closing of the first tag, where the second tag is still open.
(?<c3>(?>(\n|.)*?)) - Capture content between the closing of the first tag and the closing of the second tag.
(?<t2e></\k<t2>>) - Capture the closing of the second tag.
With this replacement expression:
${t1s}${c1}${t2s}${c2}${t2e}${c3}${t1e}
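For instance, applied to the first overlapped sample above, this search-and-replace pair should turn <font face=verdana size=5><b>some text</font></b> into <font face=verdana size=5><b>some text</b></font>: c1 and c3 come out empty, and c2 captures the text.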
The issue with this search expression is that it is painfully slow. Using . instead of (\n|.) for the three content captures is much quicker, but limits the results to just those where the overlapped tags and intervening content are on a single line.
The expression will also match valid, properly closed and properly nested HTML if the first tag appears inside the content of the second tag, like this:
<font color=green><b>hello world</b></font><span class="whatever"><font color=red>*</font></span>
So it is not safe to use the expression in a "Replace All" operation, especially across the hundreds of files in the solution.
For unclosed tags, I've successfully handled the self-closing tags: <img/>, <meta/>, <input/>, <link/>, <br/>, and <hr/>. However, I've still not attempted the generic case for all the other tags - those that may have content or should be closed with a separate closing tag.
Also, I've no idea how to match closing tags without a matching opening tag. The simple solution of </\w+> will match all closing tags regardless of whether or not they have a matched opening tag.
According to their website, ReSharper has this feature:
Solution-Wide Analysis
Not only is ReSharper capable of analyzing a specific code file for errors, but it can extend its analysis skills to cover your whole solution.
...
All you have to do is explicitly switch Solution-Wide Analysis on, and then, after it analyzes the code of your solution, view the list of errors in a dedicated window:
Even without opening that window, you can still easily navigate through errors in your solution with Go to Next Error in Solution (Shift+Alt+PageDown) and Go to Previous Error in Solution (Shift+Alt+F12) commands.
Your current "solution" is to use regexes on a context-sensitive language (invalid HTML). Please, NO. People flip out already when people suggest parsing context-free languages with regexes.
On second thought, there might be a solution that we can use regexes for.
For this HTML:
<i><b>text/html
</i> with a nice mix of junk</b>
A better transformation would be (it's more valid, right?):
<i></i><b><i>text/hmtl
</i> with a nice mix of junk</b>
There are many ways this could go wrong (although it's pretty bad as-is), but I assume you have this all backed up. This regex (where i is an example of a tag you may want to do this with):
<(i(?: [^>]+)?)>([^<]*)<(\/?[^i](?: [^>]+)?)>
Might help you out. I don't know how regex replace works in whatever flavor you're using, but if you replace $0 (everything matched by the regex) with <$1>$2</$1><$3><$1>, you'll get the transformation I'm talking about.

Acceptable use of Regex in HTML parsing?

There is a lot of argument back and forth over when and if it is ever appropriate to use a regex to parse HTML.
Since a common problem that comes up is parsing links from HTML, my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags and you have a pretty specific structure you are looking for.
It seems like significant overkill to use a full HTML parser. While I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect, the extra limitations of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.
Thoughts?
Consider this valid html:
<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
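(The href="url1" is inside a comment, url2 appears only inside an attribute value, and the one real link hides quotes and angle brackets inside a single-quoted href value, which is exactly the sort of thing a hand-rolled regex trips over.)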
I'm one of those people who think using regex in this situation is a bad idea.
Even if you just want to match the href attribute of an <a> tag, your regex will still have to run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.
Plus, matching href attributes from tags with an XML parser is anything but overkill.
I have been parsing HTML pages every week for at least 2 years now. At first, I was using full regex solutions, thinking they were easier and simpler than using an HTML parser.
But I had to come back to my code quite a lot, for many reasons:
the source code had changed
one of the source pages had broken HTML and I hadn't tested for it
I didn't try my code on every page of the source, only to find out later that a few of them didn't work.
...
I found that fixing long regex patterns is not exactly the funniest thing; you have to wrap your mind around them again and again.
What I usually do now is:
use tidy to clean the HTML source;
use DOM + XPath to actually parse the page and extract the parts I want;
use regexes only on small text-only parts (like the trimmed textContent of a node).
The code is far more robust; I don't have to spend two hours on a long regex pattern to find out why it isn't working for 1% of the sources. It just feels proper.
Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I'm still using DOM based solutions, to keep improving my skills with DOM libraries and just produce better code.
I don't like seeing people on here who just comment "Don't use regex on HTML" on every html+regex tagged question without providing sample code or something to start with.
Here is an example to match href attributes from links in PHP, just to show that using a HTML parser for those common tasks isn't overkill at all.
$dom = new DOMDocument();
$dom->loadHTML($html);
// loop over every link in the document
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}
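One practical note: DOMDocument::loadHTML() emits warnings on messy markup, so calling libxml_use_internal_errors(true) before the load keeps things quiet while still letting the parser recover as best it can.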
I hope this helps somehow.
I proposed this one:
<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>
on this thread.
It could possibly fail, depending on what can appear in name.

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> tags, but that is not working very well.
Keep in mind that this is for a school project, so I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded; that is, I can't assume I already have the whole HTML page downloaded. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself but also on any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
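If near enough really is good enough, those two passes can even be roughed out with command-line tools. A sketch, assuming GNU sed, a hypothetical page.html, and markup whose <script>/<style> elements sit on a single line and contain no literal "<":
# pass 1: drop <script>/<style> elements with their content; pass 2: drop remaining tags
sed -E -e 's#<(script|style)[^>]*>[^<]*</(script|style)>##gI' -e 's#<[^>]+>##g' page.html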
I'd consider writing a regex to remove all HTML tags; you should be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags; notice, however, that you first need to remove the script, link, style, ... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.

Stripping HTML but retaining block/inline structure

I would like to convert HTML to plain text but retain the minimum structure.
All sections which contain stuff only the browser needs to see, such as <script> and <style>, should be stripped completely.
Convert all block tags to <div> and all inline ones to <span>, or remove inline tags completely without leaving whitespace, turning anything delineated by block-level elements into paragraphs separated by two line breaks.
The idea is to turn random web pages into something suitable for natural language text processing, without artefacts left from naively removing markup that artificially break words up or make unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source, preferably machine-readable, with a full list of elements defining which are block, which are inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying do to.
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like a specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you may also want to save div tag information if they use the id attribute. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and have as yet been unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets; however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
An example of what I'm trying to say might look like:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag in particular can be inserted in very many different ways; detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (it seems better, though I've never tried it).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business; follow hrnt's advice, and consider that there are probably an order of magnitude more exploits possible than that, due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating Sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri, and specifically Loofah, seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use Nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.