How can I minify an html template file without destroying the structure?

I have a handlebars template file that I'd like to minify. I found a couple questions that were related to my issue on StackOverflow, but nothing exactly like it that had an answer. My issue is that spaces that are within the templated values are getting removed when I run the code through a minifier.
Example:
I have this line of code in my template file:
<div>{{{displayName}}} - {{cost}}</div>
When I use the un-minified file to render the page, I get entries like:
ProductName - $5.50
which is what I want. After running the template through an html minifier, my template line now looks like this:
<div>{{{displayName}}}-{{cost}}</div>
and the entries on the rendered page look like:
ProductName-$5.50
Not optimal. Now, I understand that I could just run through the template and put in non-breaking spaces into all the places where I'd like spaces to be. Nice. Simple. Easy... relatively.
But.
A second, larger issue comes into play when I'm selectively adding attributes or classes to a given html element (and what's the point of putting all those non-breaking spaces into my template file if the html minifier causes further problems?).
Example:
I also have lines in my template files that look like:
<div class="paymentMethod{{#if paymentSelected}} active{{/if}}">
On the condition where my template (Handlebars) variable "paymentSelected" is true, the html shows as:
<div class="paymentMethod active">
After minification, however, the minified template file contains:
<div class="paymentMethod{{#if paymentSelected}}active{{/if}}">
which makes the html on the page show as:
<div class="paymentMethodactive">
which, consequently, messes up all of my css and javascript because there is now one unrecognized class on the element instead of two correct classes.
Again, there is a way of getting around this. I could just place all of the class definitions into the template variables. So, my new template would be:
<div class="{{#if paymentSelected}}paymentMethod active{{else}}paymentMethod{{/if}}">
This kind of goes against the idea of removing redundancy though. So I don't like it. And this is a fairly simple case, with only two possible classes.
I'm sure there are more possibilities for hassle with html minification of template files, but I think I've shown my point.
Now, all of that explanation comes to my question:
Is there a tool out there that will minify html but ignore spaces that are between opening and closing template tags? For me, those spaces are similar to the spaces between words. I don't want all the spaces between the words of a sentence removed any more than I want the spaces within my template tags to be removed.
I also went searching for a generic sed solution, but didn't find anything in that direction either.
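One conservative approach to the problem described above, as a minimal sketch rather than an existing tool: collapse each run of whitespace to a single space and delete whitespace only where it sits directly between two tags. Spaces next to text or Handlebars expressions then survive. The function name minify_template is hypothetical, and the sketch assumes the template has no pre elements and no inline elements whose separating space matters.

```python
import re


def minify_template(source: str) -> str:
    """Conservatively minify an HTML template without deleting meaningful spaces."""
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    collapsed = re.sub(r"\s+", " ", source)
    # Drop the remaining space only where it sits directly between two tags.
    # Spaces next to text or {{...}} expressions are left alone.
    return re.sub(r">\s<", "><", collapsed).strip()


template = """
<div class="paymentMethod{{#if paymentSelected}} active{{/if}}">
    <div>{{{displayName}}} - {{cost}}</div>
</div>
"""
print(minify_template(template))
```

Both the space before "active" and the spaces around the dash are kept, because neither is directly between a ">" and a "<".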

Could you just use &nbsp;?
<div class="paymentMethod{{#if paymentSelected}}&nbsp;active{{/if}}">

Okay, so I figured out a better option, and this may be incredibly obvious to some but I'm pretty new to the whole Handlebars gig.
A better solution to minifying the html templates would be to precompile the templates and to then minify the resulting javascript. This way, I also get the savings of no compilation time on the browser side and (because I'm using Handlebars as my templating language) loading the smaller runtime script.
Granted, this solution doesn't explicitly answer the question I posed, but it does solve the ultimate problem I'm trying to solve, which is to minimize the page-load time on a browser by doing everything I can to the necessary assets prior to a browser downloading them.

Related

Including HTML in Markdown

Assuming I am in control of the parsing environment and I'm certain it is only to be converted to HTML (and not any of the many other formats possible), is it ok to embed some HTML within one's Markdown in order to side-step a bug?
Could there be any basic side effects I (as a newbie) couldn't predict but should be aware of?
Non-conventional Markdown example:
_"<strong>This</strong> is an example sentence."_ -**OP**
Which outputs valid HTML:
<em>"<strong>This</strong> is an example sentence."</em> -<strong>OP</strong>
Resulting in successful content:
"This is an example sentence." -OP
Background (don't have to read):
I noticed that if I include HTML in my Markdown, it appears to get skipped during the conversion, resulting in it being seamlessly incorporated in the output HTML.
This appears to be a good thing, at least in my case (Using Hugo to build a website with a template theme) where the Markdown wasn't producing the correct result (leaving a pair of unwanted *s in the HTML: should have been *italic* but asterisks showing).
For those wondering - yes, I confirmed my Markdown was correct using other parsers that handled it fine.
Note: the examples here are simplifications of my specific case

Not only is it okay to do, but it is encouraged. As the rules state:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
And later:
If you want, you can even use HTML tags instead of Markdown formatting; e.g. if you’d prefer to use HTML <a> or <img> tags instead of Markdown’s link or image syntax, go right ahead.
Of course, there are a few things to take into consideration. For example block level tags must be at the document root level (cannot be nested inside blockquotes, lists, etc) and content inside them does not get parsed as Markdown. However, inline tags can be placed anywhere and do not restrict Markdown parsing.

For people using Markdown in highly modular or user-flexible environments (probably slightly more advanced readers):
One should note that although Markdown is most commonly converted to HTML, it can also be used with other formats[1].
For this reason I think it's important to note that if you (as a publisher of content) are not the one who determines what the Markdown will be parsed with, or how it is converted, it may be 'safer' not to embed HTML in it.
[1] as stated in the Markdown Wikipedia page.

Changing specific lines of code from all my HTML pages

Let's assume that
I have thousands of HTML files.
All of them use the same CSS layout.
All of them are made with exactly 100 lines of code.
Now, I would like to replace lines 40-50 from all of these files with a common set of 10 lines. Is this possible?

It's not an HTML-related topic. You should use a programming language.
Create a program.
Loop through each of your HTML files.
Replace the desired lines of code (or any other string) with your new content.
By the way, if your purpose is just to manipulate a bunch of classes/IDs and content, it's doable with JavaScript/jQuery. But unlike the first solution, it will run each time one of your HTML files is loaded in a browser. I don't recommend this.
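The loop-and-replace steps in this answer could look roughly like the following sketch. The function names are hypothetical, line numbers are treated as 1-indexed and inclusive, and the script overwrites files in place, so it would be wise to run it on a copy first.

```python
from pathlib import Path


def replace_line_range(text: str, first: int, last: int, new_lines: list[str]) -> str:
    """Replace lines first..last (1-indexed, inclusive) of text with new_lines."""
    lines = text.splitlines()
    if not 1 <= first <= last <= len(lines):
        raise ValueError(f"line range {first}-{last} is outside the file")
    # Slice assignment swaps the old lines for the new ones, whatever their count.
    lines[first - 1:last] = new_lines
    return "\n".join(lines) + "\n"


def update_all(folder: str, first: int, last: int, new_lines: list[str]) -> int:
    """Rewrite every .html file under folder in place; return how many changed."""
    changed = 0
    for path in Path(folder).rglob("*.html"):
        original = path.read_text(encoding="utf-8")
        updated = replace_line_range(original, first, last, new_lines)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

A call such as update_all("site", 40, 50, replacement_lines) would then process a folder named site (an assumed name), where replacement_lines is the list of common lines to insert.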

How do I remove excess whitespace in an HTML file? (And only excess whitespace)

I have a horrible, ugly HTML file that was spat out by a form generator and slightly modified to look nice. This HTML file needs to be translated, so I hooked up some scripts using po4a and csv2po, and that all works fairly well except for one thing: some of the base strings in our translation templates are surrounded by whitespace, and the translators get rather confused.
The other thing is I have this working with a Makefile (because that generated form is updated quite frequently and I'm a nerd). I'd like to keep it that way because it's nice for my workflow. So, I need a command line tool.
I'm really looking for the simplest solution in this case, so I ran the HTML file through HTML Tidy, and that removes the weird whitespace quite competently. However, it does a lot of stuff I don't need. It messes with the doctype (and it doesn't support an html5 doctype), and I've ended up with a really crazy command line just to get it to not mangle things. It is not very pleasant.
All I really want is a command line tool (not an online one) whose single goal in life is to look at my HTML file and format it nicely. Ideally not a "compressor" thing, but if that's the only option, suggestions would be nice :)

Stick it in an IDE or text editor like Notepad++ or NetBeans and hit the "format code" button, which is available in nearly every IDE?

I'm not sure if it is still being developed, but would HTML Tidy do the trick?

Minimize html, doubts and questions

Minimizing html is the only section on Google's Page Speed where there is still room for improvement.
My site is all dynamic and the HTML is already Deflated so there is no reason to put any more pressure on the server (I don't want to minimize pages real time before sending).
What I could do is minimize the template files. My template files are a mix of PHP and HTML, so I've come up with some code that I think is pretty safe but would like the community to review.
// this will loop through all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);
This turns each template file into a single very long line, interspersed with places where DB content is inserted. Google's homepage source looks more or less like what I get, so I wonder if they follow a similar approach.
Question 1: Do you anticipate potential problems?
Question 2: Is there a better (more efficient) way to do this?
And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).
Edit: Do take into consideration that the template files will be minimized on deploy. As CSS and Javascript files are minimized and compressed using YUI Compressor and Closure, the template files would be minimized likewise, on deploy. Not on client-request.
Thank you.

Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.
In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type that have simple content (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can exclude the value (e.g., say simply checked rather than checked=checked).
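As a small illustration of the quoting rule, the sketch below adds quotes only when a value falls outside the simple-content pattern quoted above. The helper name format_attribute is hypothetical, and the value is assumed to be raw text that has not been HTML-escaped yet.

```python
import html
import re

# Character set quoted in the answer for values that are safe without quotes.
SIMPLE_VALUE = re.compile(r"[-A-Za-z0-9._:]+")


def format_attribute(name: str, value: str) -> str:
    """Return name=value, adding double quotes only when the value needs them."""
    if SIMPLE_VALUE.fullmatch(value):
        return f"{name}={value}"
    # Anything else (spaces, empty values, special characters) stays quoted.
    return f'{name}="{html.escape(value, quote=True)}"'


print(format_attribute("id", "main-nav"))        # id=main-nav
print(format_attribute("class", "btn primary"))  # class="btn primary"
```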
Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.
That being said, besides a couple large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.

White space can be significant (e.g. in pre elements).
When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.
tidy -c -n -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0
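To illustrate the point about significant whitespace, here is a minimal sketch of a collapse step that skips pre and textarea elements. It is regex-based rather than a real HTML parser, the list of protected elements is an assumption, and it does not handle nested pre elements.

```python
import re

# Either a whole protected element (group 1) or a run of whitespace.
TOKEN = re.compile(
    r"(<pre\b.*?</pre\s*>|<textarea\b.*?</textarea\s*>)|\s+",
    re.IGNORECASE | re.DOTALL,
)


def collapse_whitespace(html: str) -> str:
    """Collapse whitespace runs to one space, except inside pre/textarea."""
    def replace(match: re.Match) -> str:
        # Group 1 is a whole protected element: return it untouched.
        if match.group(1) is not None:
            return match.group(1)
        return " "

    return TOKEN.sub(replace, html)


page = "<p>a   b</p>\n<pre>x\n   y</pre>\n<p>c</p>"
print(collapse_whitespace(page))
```

Because the protected element is consumed as one match, the whitespace inside it is never offered to the whitespace branch of the pattern.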

I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.

I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It only affects template content, so there is little risk of messing up an unknown <pre> or similar.
It is run before deploy so there is no impact on the server; actually there should be a small speed-up as the file becomes smaller.
Do remember that content that comes from the database will not be affected since, as said before, this runs before deploy and on template files only.
The method seems solid enough to pass into production.
If anything goes wrong I'll post it here.

What should I consider before minifying HTML?

I've googled around but can't find any HTML minification scripts.
It occurred to me that maybe there is nothing more to HTML minification than removing all unneeded whitespace.
Am I missing something or has my Google Fu been lost?

You have to be careful when removing stuff from HTML as it's a fragile language. Depending on how your pages are coded some of that whitespace might be more significant; also if you have CSS styles such as white-space: pre then you may need to keep the whitespace. Plus there are numerous browser bugs, etc, and basically every character in an HTML file might be there to satisfy some requirement or appease some browser.
In my opinion your best bet is to design the pages well with CSS techniques (I was recently able to take an important page on the site I work for and reduce its size by 50% just by recoding it using CSS instead of tables and nested style="..." attributes). Then, use GZip to reduce the size of your pages for browsers that understand gzip. This will save bandwidth while preserving the structure of the html.

Sometimes, depending on the enclosing tags and/or on the CSS, whitespace may be significant.

Outside of HTML Tidy/removing white space as the other answers mentioned, there isn't much.
This is more of a manual task: pulling out style attributes into CSS (hopefully you're not using FONT tags, etc.), using fewer tags and attributes where possible (like not embedding <strong> tags in an element but using CSS to make the whole element font-weight: bold, unless of course it makes semantic sense to use <strong>), etc.

Yes I guess it's pretty much removing whitespace and comments. You cannot replace identifiers with shorter ones like in javascript, since chances are that CSS classes or javascript will depend on those identifiers.
Also, you should be careful when removing whitespace and make sure that there is always at least one whitespace character left, otherwise allyourtextwilllooklikethis.

There's a pretty lengthy discussion on this Wordpress blog about this topic. You can find a very lengthy proposed solution using PHP and HTML Tidy there.

You can find some good references here to things like HTML tidy and others.
If you don't want to use one of those options, Prototype has a means to clean the whitespace in the DOM. You could do that on your own and copy it via 'View Generated Source' in the Firefox extension Web Developer Toolbar. Then you can replace the original html with prototype's fix. Sorry for not making that apparent nickf.
(I recommend the first link)

I haven’t tried it yet, but htmlcompressor is an HTML minifier, if you fancy giving one a try.

If you have Node.js installed and you are a Windows user, you can create a .bat file that minifies all the HTML files in a folder and writes the output to a min subfolder.
Open the console and run: npm install html-minifier -g
Create the .bat file with the contents below. Don't forget to change the path in the cd command; it's easier to change the folder in the .bat file than to copy and paste it.
In the console, go to the folder containing the .bat file and run it.
cd the_destination_folder
dir /b *.HTML > list1.txt
for /f "tokens=*" %%A in (list1.txt) do html-minifier --collapse-whitespace --remove-comments --remove-optional-tags %%~nxA -o min\%%~nxA
pause

Couldn't JavaScript be used as a decompresser for a compressed HTML string, for instance have a DEV build for the uncompressed format, run a 'publish' script to compress the DEV build to production and attach a JavaScript to the HTML source (with the whitespace and such removed as before)?
The bandwidth would be reduced on the server, but the downside is there is a lot more client strain for decompressing the string to HTML. Also JavaScript would need to be enabled and be able to parse the decompressed string to HTML.
I am not saying it's a definite solution, but something that might work. It all depends on whether you're concerned about bandwidth regardless of the user's JavaScript permissions, system specs, and so on.
Otherwise look for obfuscation scripts; a simple Google search produced http://tinyurl.com/phpob. Depending on what you're looking for, there should be a software package available.
If I am on the wrong lines, please shout and I will see what else I can do.
Good Luck!

I recently found a PHP-based script that minifies your site's HTML, inline CSS, and inline JavaScript on the fly. It is called Dynamic Website Compressor.

I've used this regexp for years, without any problems: s/>\s*</></g
In Python re.sub(r'>\s*<', '><', html)
Or in PHP preg_replace('/>\s*</', '><', $html);
This removes all whitespace between tags, but nowhere else, which makes it fairly safe (but not perfect: there are situations where this will break, but they're rare).
My main reason for doing this isn't speed/file size, but because the whitespace often introduces a, well, space. This would be okay, but when you start mucking about in your DOM with Javascript, spaces are often lost, creating (minor) layout differences.
Consider:
<div>
<a>link1</a>
<a>link2</a>
</div>
There's a space between the links, but now I do something like:
$('div').append('<a>link3</a>')
And there's no space ... I need to manually add the space in my JS, which is fairly ugly & error-prone IMHO.
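Applied to the markup above, the Python form of the regex gives the result below. This is only a demonstration of the substitution; the variable names are illustrative.

```python
import re

html = """<div>
<a>link1</a>
<a>link2</a>
</div>"""

# Remove any whitespace that sits directly between a closing ">" and the next "<".
stripped = re.sub(r">\s*<", "><", html)
print(stripped)  # <div><a>link1</a><a>link2</a></div>
```

With the whitespace gone from the static markup, links written in the file and links appended later from JavaScript start from the same spacing, which can then be controlled with CSS.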

Here is a minifier for HTML5 written in PHP.
<?PHP
$in=file_get_contents('path/to/source.html');
//Strips spaces if there are more than one.
$in=preg_replace('/\s{2,}/m',' ',$in);
//trim
$in=preg_replace('/^\s+|\s+$/m','',$in);
/*Strips spaces between tags.
Use space entities such as &nbsp; or, better, padding or margin if necessary;
otherwise the HTML parser appends a one-space text node.*/
$in=preg_replace('/ ?> < ?/','><',$in);
//Removes tag end slash.
$in=preg_replace('# ?/>#','>',$in);
//Removes HTML comments except conditional IE comments.
$in=preg_replace('/<!--[^\[]*?-->/','',$in);
//Removes quotes where possible.
$in=preg_replace('/="([^ \'"\=><]+)"/','=$1',$in);
$in=preg_replace("/='([^ '\"\=><]+)'/",'=$1',$in);
file_put_contents('path/to/min.html',$in);
?>
After that you have shorter, one-line HTML code.
It would be better to put the regular expressions in an array, but take care to escape the backslashes.
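The array-of-expressions idea could be sketched as follows in Python, where raw strings avoid most of the backslash escaping. This is an adaptation under stated assumptions, not a port of the PHP above: it uses only three rules, and the whitespace-between-tags pattern is slightly broader than the original.

```python
import re

# (pattern, replacement) pairs applied in order; an adapted subset of the rules above.
RULES = [
    (re.compile(r"\s{2,}"), " "),         # collapse runs of two or more whitespace chars
    (re.compile(r"<!--[^\[]*?-->"), ""),  # drop comments that contain no "[" (keeps IE conditionals)
    (re.compile(r">\s+<"), "><"),         # drop whitespace between tags
]


def minify(html: str) -> str:
    """Run every rule over the document in sequence."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html.strip()


print(minify("<p>a</p>  <!-- note -->\n  <p>b</p>"))  # <p>a</p><p>b</p>
```

Keeping the rules in one list makes the order explicit, which matters here because removing a comment can leave new whitespace between tags for the last rule to clean up.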
