How to strip HTML whitespace, HTML comments, and JavaScript comments using Nokogiri?

I'm looking for a way in Nokogiri to strip out HTML whitespace and comments, as well as JavaScript comments (/* */ and //). I'm not doing this because of the size of the document; I'm playing around with Rack middleware that does this job. I know I could do it with regular expressions, but I think that could be troublesome.
If it is not possible with Nokogiri, please give me the best regular expressions for the two cases above.
What I tried using regular expressions:
response = @app.call(env)
body = response.last.body.gsub(/(\n|\t|\r)/, ' ').gsub(/>\s*</, '><').gsub(/<!--[^>]*-->/, ' ').squeeze(' ')
response.last.body = body
response
I think there should be a cleaner way to do this than using regular expressions.
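For comparison, here is a minimal sketch of the same regex pipeline, written in Python rather than Ruby. The function name squeeze_html is hypothetical, and the comment pattern is swapped for a lazy <!--.*?--> so that comments containing > are still removed. It assumes the page has no <pre>, <textarea>, inline <script>, or conditional comments, all of which this approach would damage.

```python
import re

def squeeze_html(html):
    """Drop HTML comments and collapse whitespace (regex sketch, not a parser)."""
    # Remove comments first; DOTALL lets a comment span several lines.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Remove whitespace that sits only between two tags.
    html = re.sub(r">\s+<", "><", html)
    # Collapse every remaining whitespace run to a single space.
    html = re.sub(r"\s+", " ", html)
    return html.strip()

print(squeeze_html("<div>\n  <!-- a > b -->\n  <p>hi   there</p>\n</div>"))
```

Removing comments before collapsing whitespace means a comment that stood between two tags does not leave a stray space behind.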

Loofah is nice, but it won't help you strip JavaScript comments.
This thread deals with stripping JS comments, but there seems to be much disagreement. I agree with those who say you should not do it. However, if you wanted to try the accepted answer with Loofah, you might do:
require 'rubygems'
require "loofah"
scrubber = Loofah::Scrubber.new do |node|
  node.content = node.content.strip if node.name == "text"
  node.remove if node.name == "comment"
  if node.cdata? && node.parent.name == "script"
    node.content = node.content.gsub(/\/\*![^*]*\*+(?:[^*\/][^*]*\*+)*\//,'')
  end
end
puts Loofah.fragment('<p> trim </p><!-- remove --><p> me </p><script>var x=0;/*! remove! */</script>').scrub!(scrubber)
# <p>trim</p><p>me</p><script>var x=0;</script>

Loofah might be what you are looking for:
https://github.com/flavorjones/loofah

I ended up writing a middleware to handle this, since there is no exact solution for it.
I use a very strict regular expression there.
Check the code on my GitHub repo.

Related

Regex to extract text from inside an HTML tag

I know this has been asked at least a thousand times but I can't find a proper regex that will match a name in this string here:
<td><div id="topbarUserName">Donald</div></td>
I want to get the name 'Donald' and the regex that's the closest is >[a-zA-Z0-9]+ but the result is >Donald.
I'm coding in PureBasic (its syntax is similar to that of Basic), and it uses the PCRE library for regular expressions.
Can anyone help?
Josh's pattern will work if you only make use of the numbered group, not the whole match. If you have to use the whole match, use something like (?<=>)(\w+?)(?=<)
Either way, regex is widely known to not be good for parsing HTML.
Explanation:
(?<=) is used to check if something appears before the current item.
\w+? matches one or more "word" characters lazily, stopping as soon as the rest of the pattern can match; in this situation the ? could have been left out.
(?=) is used to check if something appears after the current item.
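To make the difference between the whole match and the numbered group concrete, here is a small sketch in Python, whose re module supports the same lookbehind and lookahead syntax as PCRE. The variable names are only illustrative.

```python
import re

html = '<td><div id="topbarUserName">Donald</div></td>'

# Lookarounds check the surrounding > and < without making them part of the match.
whole = re.search(r"(?<=>)(\w+?)(?=<)", html)

# Without lookarounds the delimiters are consumed, so only group 1 is clean.
grouped = re.search(r">(\w+)<", html)

print(whole.group(0))    # the name alone
print(grouped.group(0))  # the name wrapped in > and <
print(grouped.group(1))  # the name alone
```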
Try this
It should capture anything that is a letter / number
>([\w]+)<
Also I'm not exactly sure what your project limitations are, but it would be much easier to do something like this
$('#topbarUserName').text();
in jQuery instead of using a regex.
>([a-zA-Z]+) should do the trick. Remember to get the grouping right.
Why not do it with plain old basic string functions?
a.w = FindString(HTMLstring.s, "topbarUserName")
If a > 0
  a = a + 16 ; skip the 14 characters of topbarUserName plus 2 for ">
  b.w = FindString(HTMLstring, "<", a)
  If b > 0
    c.w = b - a
    Donald.s = Mid(HTMLstring, a, c)
  EndIf
EndIf
Debug Donald

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either an HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most languages (COM+, .NET, PHP, etc.).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to checking an arithmetic expression for properly balanced brackets, which needs at least a pushdown automaton to succeed.
In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression) and store the result in a list. Then you can iterate through the node list and push every opening tag onto a stack. When you encounter a closing tag in the node list, you must check that the topmost stack entry is an opening tag of the same type and pop it. Otherwise you have found the HTML syntax error you were looking for.
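The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names TAG, VOID and find_tag_errors are hypothetical, the VOID set is a small, incomplete sample of tags that never take a closing tag, and comments, scripts and CDATA are not handled.

```python
import re

# One regex tokenizes tags: group 1 = "/" for closing tags, group 2 = tag name,
# group 3 = "/" for self-closing tags such as <br/>.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # assumption: partial list

def find_tag_errors(html):
    """Return (unclosed, mismatched) lists of (tag name, offset) pairs."""
    stack = []       # opening tags still waiting for their closing tag
    mismatched = []  # closing tags that do not match the top of the stack
    for m in TAG.finditer(html):
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if self_closing or name in VOID:
            continue  # nothing to balance
        if not closing:
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            mismatched.append((name, m.start()))
    return stack, mismatched

unclosed, mismatched = find_tag_errors("<i><b>test<i>ing</i>")
print(unclosed, mismatched)
```

With a strict stack, the closing </i> pairs with the nearest open <i>, so the tags reported as unclosed are the ones still on the stack at the end of the input.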
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use a regex to identify all the HTML begin/end elements and then enumerate them with a Stack, pushing new elements and popping on closing tags. Try this in C#:
// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public static bool ValidateHtmlTags(string html)
{
    // Group 2 captures an opening tag name, group 4 a closing tag name.
    string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
    Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
    var stack = new Stack<Tuple<string, string>>();
    bool valid = true;
    foreach (Match match in regex.Matches(html))
    {
        string element = match.Value;
        string beginTag = match.Groups[2].Value;
        string endTag = match.Groups[4].Value;
        if (beginTag == "")
        {
            // A closing tag must match the most recently opened tag.
            // Checking Count first keeps Peek() from throwing on an empty stack.
            if (stack.Count > 0 && stack.Peek().Item1 == endTag)
                stack.Pop();
            else
            {
                valid = false;
                break;
            }
        }
        else if (!element.EndsWith("/>"))
        {
            // Write more informative message here if desired
            string message = string.Format("Char({0})", match.Index);
            stack.Push(new Tuple<string, string>(beginTag, message));
        }
    }
    if (stack.Count > 0)
        valid = false;
    // Alternative return stack.Peek().Item2 for more informative message
    return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Qt Regex matches HTML Tag InnerText

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
However, list.count() is always 0.
HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
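As an illustration of the parser-based approach (sketched in Python with the standard library's html.parser rather than a Qt class), the following collects the text inside every <pre> element. The class and function names are hypothetical, and markup nested inside <pre> is dropped while its text is kept.

```python
from html.parser import HTMLParser

class PreCollector(HTMLParser):
    """Collects the text content of each top-level <pre> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <pre> elements are currently open
        self.blocks = []  # one string per top-level <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1
            if self.depth == 1:
                self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.blocks[-1] += data

def pre_contents(html):
    parser = PreCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks

print(pre_contents("<html><body><pre>line 1\nline 2</pre></body></html>"))
```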
I did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = clipBoardData.mid(begin, end - begin);
The result includes the <pre> tags, but I found out that's even better ;)
I have to agree with the others. Drupal 6.x and older use regexes to do a lot of work on HTML data, and that quickly breaks if you create pages of 64 KB or more. So using a DOM, or just indexOf() as you've done, is a better and much faster solution.
Now, for those interested in knowing more about regex: QRegExp uses a Perl-like syntax, but it does not support the lazy operator on individual quantifiers (.*?). Instead, setMinimal(true) makes every quantifier in the pattern non-greedy. Your regex would become:
(<pre>.*</pre>)
with minimal matching switched on, so that each match stops at the first </pre> (if you have only one <pre> block, minimal matching is not required). Note that no delimiters at the start and end of the regular expression are required here.
QRegExp re("(<pre>.*</pre>)", Qt::CaseInsensitive);
re.setMinimal(true);
re.indexIn(html_input);
QStringList list = re.capturedTexts();
Now list should hold the whole match followed by the captured <pre> block.

How can I strip HTML in a string using Perl?

Is there any way easier than this to strip HTML from a string using Perl?
$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;
I would appreciate a slimmed-down regular expression, e.g. something like this:
$Error_Msg =~ s|</?[b|h1|br]>||ig;
Is there an existing Perl function that strips any/all HTML from a string, even though I only need bolds, h1 headers and br stripped?
Assuming the code is valid HTML (no stray < or > operators)
$htmlCode =~ s|<.+?>||g;
If you need to remove only bolds, h1's and br's
$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;
And you might want to consider the HTML::Strip module
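The same idea, removing only a fixed list of tags, can be sketched in Python. The function name and the re.IGNORECASE flag (mirroring the i modifier used in the question) are assumptions; the \b after the tag name is what keeps <body> from being mistaken for <b>.

```python
import re

# Matches opening or closing b, h1 and br tags, with or without attributes.
SELECTED_TAGS = re.compile(r"</?(?:b|h1|br)\b[^>]*>", re.IGNORECASE)

def strip_selected_tags(text):
    """Remove only <b>, <h1> and <br> tags, leaving all other markup alone."""
    return SELECTED_TAGS.sub("", text)

print(strip_selected_tags("<H1>Error</H1><b>bad</b> input<br>"))
print(strip_selected_tags("<body><b>x</b></body>"))
```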
You should definitely have a look at the HTML::Restrict which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:
use HTML::Restrict;
my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'
I would recommend to stay away from HTML::Strip because it breaks utf8 encoding.
From perlfaq9: How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .
Here are some tricky cases that you should think about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
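To see why the quoted angle bracket matters, here is a small Python sketch comparing the naive pattern with a quote-aware one. The quote-aware pattern is a simplified variant written for this example, not the perlfaq pattern itself, and the function names are hypothetical. Entities are converted with html.unescape from the standard library.

```python
import html
import re

NAIVE_TAG = re.compile(r"<.*?>", re.DOTALL)
# A tag is "<", then any mix of plain characters and complete quoted strings, then ">".
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>'"]|"[^"]*"|'[^']*')*>""")

def strip_naive(markup):
    return html.unescape(NAIVE_TAG.sub("", markup))

def strip_quote_aware(markup):
    return html.unescape(QUOTE_AWARE_TAG.sub("", markup))

tricky = '<IMG SRC = "foo.gif"\nALT = "A > B">caption'
print(repr(strip_naive(tricky)))        # stops at the > inside the ALT text
print(repr(strip_quote_aware(tricky)))  # removes the whole tag
print(strip_quote_aware("<b>1 &lt; 2</b>"))
```

Even the quote-aware variant still treats comments and <script> bodies as ordinary markup, so those cases need separate handling or a real parser.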

regular expression to extract text from HTML

I would like to extract from a general HTML page, all the text (displayed or not).
I would like to remove
any HTML tags
Any javascript
Any CSS styles
Is there a regular expression (one or more) that will achieve that?
Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
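As a rough idea of what the parser route looks like without any third-party package, here is a sketch built on Python's standard html.parser instead of Beautiful Soup. The class and function names are hypothetical; it keeps every text node except those inside <script> and <style>, and joins them with single spaces.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped automatically.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

page = ("<html><head><style>p{color:red}</style>"
        "<script>if (a<b && a>c) x();</script></head>"
        "<body><p>Hello <b>world</b></p><!-- hidden --></body></html>")
print(extract_text(page))
```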
Needed a regex solution (in php) that would return the plain text just as well (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:
function plaintext($html)
{
// remove comments and any content found in the comment area (strip_tags only removes the actual tags).
$plaintext = preg_replace('#<!--.*?-->#s', '', $html);
// put a space between list items (strip_tags just removes the tags).
$plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
// remove all script and style tags
$plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
// remove br tags (missed by strip_tags)
$plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
// remove all remaining html
$plaintext = strip_tags($plaintext);
return $plaintext;
}
When I tested this on some complicated sites (forums seem to contain some of the tougher html to parse), this method returned the same result as PHPSimpleDOM plaintext, only much, much faster. It also handled the list items (li tags) properly, where PHPSimpleDOM did not.
As for the speed:
SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.
37 times faster!
Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:
//body//text()[not(ancestor::script)][not(ancestor::style)]
Using perl syntax for defining the regexes, a start might be:
!<body.*?>(.*)</body>!smi
Then applying the following replace to the result of that group:
!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi
This of course won't format things nicely as a text file, but it strips out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text from that.
The simplest way for simple HTML (example in Python):
text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
import re
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+",text) if not '<' in t])
Returns this:
'This is my> example HTML, containing tags'
Here's a function to remove even the most complex HTML tags.
function strip_html_tags( $text )
{
$text = preg_replace(
array(
// Remove invisible content
'#<head[^>]*?>.*?</head>#siu',
'#<style[^>]*?>.*?</style>#siu',
'#<script[^>]*?.*?</script>#siu',
'#<object[^>]*?.*?</object>#siu',
'#<embed[^>]*?.*?</embed>#siu',
'#<applet[^>]*?.*?</applet>#siu',
'#<noframes[^>]*?.*?</noframes>#siu',
'#<noscript[^>]*?.*?</noscript>#siu',
'#<noembed[^>]*?.*?</noembed>#siu',
// Add line breaks before & after blocks
'#<((br)|(hr))#iu',
'#</?((address)|(blockquote)|(center)|(del))#iu',
'#</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))#iu',
'#</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))#iu',
'#</?((table)|(th)|(td)|(caption))#iu',
'#</?((form)|(button)|(fieldset)|(legend)|(input))#iu',
'#</?((label)|(select)|(optgroup)|(option)|(textarea))#iu',
'#</?((frameset)|(frame)|(iframe))#iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0",
),
$text );
// Remove all remaining tags and comments and return.
return strip_tags( $text );
}
If you're using PHP, try Simple HTML DOM, available at SourceForge.
Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as &amp; (which is &).
Also, watch out for comments and JavaScript. I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
I believe you can just do
document.body.innerText
Which will return the content of all text nodes in the document, visible or not.
[edit (olliej): sigh nevermind, this only works in Safari and IE, and i can't be bothered downloading a firefox nightly to see if it exists in trunk :-/ ]
Can't you just use the WebBrowser control available with C# ?
System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);
// html holds the page source as a string
string decode = System.Web.HttpUtility.HtmlDecode(html);
Regex objRegExp = new Regex("<(.|\n)+?>");
string replace = objRegExp.Replace(decode, "");
// k is not defined in the original post; it stands for any extra substring to drop
replace = replace.Replace(k, string.Empty);
replace = replace.Trim("\t\r\n ".ToCharArray());
Then take a label, set label.Text = replace; and check the output on the label.