How to Minify HTML code? - html

My idea is to somehow minify HTML code on the server side, so the client receives fewer bytes.
What do I mean by "minify"?
Not zipping. More like what, for example, the jQuery creators do with .min.js versions. In other words, I need to remove unnecessary whitespace and newlines, but I can't remove so much that the presentation of the HTML changes (for example, removing the whitespace between actual words in a paragraph).
Are there any tools that can do it? I know there is HtmlPurifier. Is it able to do it? Any other options?
P.S. Please don't offer regexes. I know that only Chuck Norris can parse HTML with them. =]

A bit late but still... By using output buffering it is as simple as this:
function compress($string)
{
    // Remove HTML comments. The non-greedy .*? stops at the first -->,
    // and the s modifier lets a comment span several lines.
    $string = preg_replace('/<!--.*?-->/s', '', $string);
    // Merge runs of whitespace into one space. Note that this also changes
    // <pre> and <textarea> content and can break inline scripts that use // comments.
    $string = preg_replace('/\s+/', ' ', $string);
    // Remove space between tags. Skip the following if
    // you want as it will also remove the space
    // between <span>Hello</span> <span>World</span>.
    return preg_replace('/>\s+</', '><', $string);
}
ob_start('compress');
// Here goes your html.
ob_end_flush();

You could parse the HTML code into a DOM tree (which should keep content whitespace in the nodes), then serialise it back into HTML, without any prettifying spaces.
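A minimal sketch of the DOM idea in PHP, assuming the bundled DOM extension is available. The function name minifyHtmlViaDom and the list of whitespace-sensitive elements are illustrative choices, not part of the answer above. It collapses whitespace only inside text nodes, leaves pre, textarea, script and style alone, and serialises the tree back to HTML.

```php
<?php
// Hypothetical helper: parse, normalise text nodes, serialise.
function minifyHtmlViaDom(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings (e.g. about HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input declares its charset or is plain ASCII;
    // otherwise loadHTML() may misread non-ASCII characters.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that is not inside a whitespace-sensitive element.
    $xpath = new DOMXPath($doc);
    $query = '//text()[not(ancestor::pre or ancestor::textarea'
           . ' or ancestor::script or ancestor::style)]';
    foreach ($xpath->query($query) as $textNode) {
        // Collapse each run of whitespace to a single space.
        $textNode->data = preg_replace('/\s+/', ' ', $textNode->data);
    }

    // saveHTML() returns a full document, so a fragment comes back
    // wrapped in a doctype and <html><body> elements.
    return $doc->saveHTML();
}

echo minifyHtmlViaDom("<p>Hello   \n   world</p>\n<pre>a\n  b</pre>");
```

The serialiser may still emit a few line breaks of its own between block elements, so treat this as a starting point rather than a finished minifier.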

Are there any tools that can do it?
Yes, here's a tool you could include into a build process or work into a web cache layer:
https://code.google.com/archive/p/htmlcompressor/
Or, if you're looking for a tool to minify HTML that you paste in, try:
http://www.willpeavy.com/minifier/

You can use the Pretty Diff tool: http://prettydiff.com/?m=minify&html It will also minify any CSS and JavaScript in the HTML code, and the minification is done in a way that does not prevent later beautification of the HTML back to readable form.

Are there any tools that can do it?
You can use the CodVerter Online Web Development Editor for compressing mixed HTML code.
The compressor was tested multiple times for reliability and accuracy.
(Full disclosure: I am one of the developers.)

Related

cost and benefit one line html

Can someone give me reasons why we should (or should not) serve HTML as a single line?
As far as I know it only reduces file size (the benefit), but on the server we must add a function to generate the one-line HTML (a server cost), or else we will have trouble when we need to change the code (an editing cost).
You shouldn't minify your HTML. It isn't worth the time and the work you will have to do when editing. Take a look at this very good answer.
You should optimize your CSS though. There's a free program, Scout, that checks for changes to your CSS files and creates minified versions of them automatically. It implements Sass, which will make coding CSS much easier. You should try it.
You don't manually create HTML like that. Personally, my CMS caches its output as static files to be served to the user. Before it caches, it runs the code below, where $tocache is the contents of the page to be displayed.
I do this, then write the result to disk. Apache then serves the static content instead, avoiding the DB and PHP on subsequent access.
// Remove white space
$tocache = str_replace(array("\n", "\t","\r")," ",$tocache);
// Remove unnecessary closing tags (I know </p> could be here, but it caused problems for me)
$tocache = str_replace(array("</option>","</td>","</tr>","</th>","</dt>","</dd>","</li>","</body>","</html>"),"",$tocache);
// remove ' or " around attributes that don't have spaces
$tocache = preg_replace('/(href|src|id|class|name|type|rel|sizes|lang|title|itemtype|itemprop)=(\"|\')([^\"\'\`=<>\s]+)(\"|\')/i', '$1=$3', $tocache);
// Turn any repeated white space into one space
$tocache = preg_replace('!\s+!', ' ', $tocache);
Now, I run that once per page change, then serve up the smaller HTML to users.
This is pretty much pointless though, as the process of gzipping makes the biggest difference. I do it because I might as well – I already am caching these files, so why not make myself feel clever first!
For CSS and JS I use SASS's compressed option, and uglifyJS to get those as one small file.
That means on a page I have 1 HTML file, 1 CSS and 1 JS, minimising the number of HTTP reqs and the amount of data to be transmitted.
Gzip + ensuring 1 css and 1 js is the biggest savings though.
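To check the claim about gzip on your own pages, a rough size comparison can be sketched like this. It assumes PHP's zlib extension is loaded, and the sample table markup is made up for illustration. The whitespace stripping mirrors the preg_replace approach above and is not a full minifier.

```php
<?php
// Hypothetical sample page: repetitive, indented markup like a typical template.
$rows = '';
for ($i = 0; $i < 200; $i++) {
    $rows .= "        <tr>\n            <td class=\"cell\">Row $i</td>\n        </tr>\n";
}
$html = "<table>\n$rows</table>\n";

// Whitespace-only minification: collapse runs, then drop gaps between tags.
$minified = preg_replace('/>\s+</', '><', preg_replace('/\s+/', ' ', $html));

// gzencode() produces the same gzip format a web server would send.
$sizes = [
    'raw'           => strlen($html),
    'minified'      => strlen($minified),
    'gzip'          => strlen(gzencode($html)),
    'minified+gzip' => strlen(gzencode($minified)),
];

foreach ($sizes as $label => $bytes) {
    printf("%-14s %6d bytes\n", $label, $bytes);
}
```

Running it against a real cached page instead of the generated table gives numbers that matter for your site.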

Can I remove linebreaks and spaces from HTML source?

Besides the fact that it becomes unreadable for humans, are there any downsides when I remove every line break and space from the HTML source code?
Do browsers render it differently then? Will the rendering get faster (or maybe slower)?
There are many already answered questions about minifying HTML. Here are some:
Why minify assets and not the markup?
HTML Minification
How to minify HTML code
You will have a smaller file size, so it may download faster (though it'll be probably unnoticeable). There are tools for this, indeed.
If you remove line breaks there is no harm. But according to your question
...when I remove every linebreak and space from the html source code?
If you remove every line break and space, the page will no longer serve its purpose. You should only remove extra line breaks and spaces. Also be careful not to alter the value attributes of form fields, or any other attribute for that matter.
Regarding the improvements it can offer:
It may render faster because there is less data to parse, but the speedup is very small. I even discourage it, as it reduces readability and the speedup is on the order of a few hundred CPU clock cycles. The same goes for download: it saves mere bytes of data (unless the document has a lot of whitespace).
Instead of this, it's better to use GZIP compression for the output on the server side. The following line of PHP enables it. If you have PHP on your server, just rename your *.html file to *.php, then add the following code before any output:
if (substr_count($_SERVER['HTTP_ACCEPT_ENCODING'] ?? '', 'gzip')) ob_start("ob_gzhandler");
You can also do this using the .htaccess file. Search Google for more on this.

How to sanitize user generated html code in ruby on rails

I am storing user-generated HTML code in the database, but some of it is broken (missing end tags), so it messes up the rendering of the whole page.
How can I prevent this sort of behaviour with Ruby on Rails?
Thanks
It's not too hard to do this with a proper HTML parser like Nokogiri which can perform clean-up as part of the processing method:
require 'nokogiri'
bad_html = '<div><p><strong>bad</p>'
puts Nokogiri::HTML.fragment(bad_html).to_s
# <div><p><strong>bad</strong></p></div>
Once parsed properly, you should have fully balanced tags.
My google-fu reveals surprisingly few hits, but here is the top one :)
Valid Well-formed HTML
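Try using the h() escape function in your erb templates. Note that it escapes the markup rather than repairing it, so the user's HTML is displayed as text instead of being rendered. If that is acceptable, it should do the trick.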
Check out Loofah, an HTML sanitization library based on Nokogiri. This will also remove potentially unsafe HTML that could inject malicious script or embed objects on the page. You should also scrub out style blocks, which might mess up the markup on the page.

Perl AJAX stripping html characters out of string?

I have a Perl program that is reading HTML tags from a text file. (I'm pretty sure this is working, because when I run the Perl program on the command line it prints out the HTML as it should.)
I then pass that "html" to the web page as the response to an AJAX request. I then use innerHTML to stick that string into a div.
Here's the problem:
All the text information is getting to where it needs to be, but the "<", ">" and "/" characters are getting stripped.
Does anyone know the answer to this?
The question is a bit unclear to me without some code and data examples, but if it is what it vaguely sounds like, you may need to HTML-encode your text (e.g. using HTML::Entities).
I'm kind of surprised that's an issue when inserting into innerHTML, but without a specific example, that's the first thing that comes to mind.
There could be a module on the server that is removing special characters. Are you running Apache? (I doubt this is what's happening.)
If something is being stripped on the client side, it is most likely in the response handler portion of the AJAX call. Show the code where you stick the string into the div.

Limiting HTML Input into Text Box

How do I limit the types of HTML that a user can input into a textbox? I'm running a small forum using some custom software that I'm beta testing, but I need to know how to limit the HTML input. Any suggestions?
i'd suggest a slightly alternative approach:
don't filter incoming user data (beyond prevention of sql injection). user data should be kept as pure as possible.
filter all outgoing data from the database, this is where things like tag stripping, etc.. should happen
keeping user data clean allows you more flexibility in how it's displayed. filtering all outgoing data is a good habit to get into (along the never trust data meme).
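If the forum happens to be written in PHP (an assumption, since the question doesn't say), filtering on output can be sketched with the built-in htmlspecialchars(). The helper name e() and the sample post are hypothetical. The stored text stays untouched and is escaped only at the moment it is written into the page.

```php
<?php
// Hypothetical helper: escape stored user text at display time.
function e(string $value): string
{
    // ENT_QUOTES also escapes single quotes, so the result is safe inside
    // attribute values; ENT_SUBSTITUTE replaces invalid UTF-8 sequences
    // instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Pretend this came straight from the database, exactly as the user typed it.
$stored = '<script>alert("hi")</script> & more';

echo '<div class="post">' . e($stored) . '</div>';
```

This escapes everything, so it suits plain-text posts. Allowing a limited set of tags needs a whitelist-based filter instead.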
You didn't state what the forum was built with, but if it's PHP, check out:
http://htmlpurifier.org/
Library Features: Whitelist, Removal, Well-formed, Nesting, Attributes, XSS safe, Standards safe
Once the text is submitted, you could strip any/all tags that don't match your predefined set using a regex in PHP.
It would look something like the following:
find open tag (<)
if contents != allowed tag, remove tag (from <..>)
Parse the input provided and strip out all HTML tags that don't exactly match the list you are allowing. This can either be a complex regex, or you can do a stateful iteration through the char[] of the input string, building the allowed output string and stripping unwanted attributes on tags like img.
Use a different code system (BBCode, Markdown)
Find some code online that already does this, to use as a basis for your implementation. For example Slashcode must perform this, so look for its implementation in the Perl and use the regexes (that I assume are there)
Regardless what you use, be sure to be informed of what kind of HTML content can be dangerous.
e.g. a <script> tag is pretty obvious, but a <style> tag is just as bad in IE, because it can invoke JScript commands.
In fact, any style="..." attribute can invoke script in IE.
<object> would be one more tag to be wary of.
PHP comes with a simple function, strip_tags, to strip HTML tags. It allows for certain tags to not be stripped.
Example #1 strip_tags() example
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
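One caveat that ties back to the warning about dangerous attributes: strip_tags() leaves an allowed tag exactly as it was written, attributes included, and it removes script tags but not the text between them. A small sketch with a made-up input string:

```php
<?php
// Made-up input: an allowed <a> tag carrying an event handler, plus a script.
$input = '<a href="#" onclick="evil()">link</a><script>alert(1)</script>';

// Allow only <a>; every other tag is removed.
$output = strip_tags($input, '<a>');

echo $output;
```

The onclick attribute survives, so strip_tags() on its own is not enough when users may post links or images. A whitelist-based library such as HTML Purifier filters attributes as well.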
Personally, for a forum I would use BBCode or Markdown because of the amount of support and the features provided, such as live preview.