How can I minimize HTML for performance using PHP - html

I'm somewhat new to the concept of minimization, but it seems simple enough. I understand how libraries like jQuery can benefit from minimization, but my question is whether or not it should be extended to HTML and CSS (I've seen CSS resets minimize code, so perhaps the increase in performance has been measured).
First, I have no idea how to measure the increased performance, but I was wondering if any of you had any knowledge or experience about the magnitude of performance increase one could expect when minimizing HTML and CSS code (substantial, negligible, etc).
Finally, I know HTML and especially XHTML should be as readable as possible to whoever maintains the code, which is why I was thinking it would be best to minimize this code only at render time, with PHP. If this is viable, what is the best way of doing it? Just trimming the HTML and CSS ($html = trim($html);)?

Yahoo has some rules for high-performance websites. I have quoted some of them below. Read them carefully; they answer your question.
The time it takes to transfer an HTTP request and response across the network can be significantly reduced by decisions made by front-end engineers. It's true that the end-user's bandwidth speed, Internet service provider, proximity to peering exchange points, etc. are beyond the control of the development team. But there are other variables that affect response times. Compression reduces response times by reducing the size of the HTTP response.
Minification is the practice of removing unnecessary characters from code to reduce its size thereby improving load times. When code is minified all comments are removed, as well as unneeded white space characters (space, newline, and tab). In the case of JavaScript, this improves response time performance because the size of the downloaded file is reduced. Two popular tools for minifying JavaScript code are JSMin and YUI Compressor. The YUI compressor can also minify CSS.
So minimizing any content that's going to be transferred over HTTP reduces transfer time, which gives rendering a head start, so your website feels faster. If you enable compression, the improvement will be noticeable. Compressing and minifying JavaScript, HTML, and CSS together speeds things up further.

To measure performance in this case you can use a tool like YSlow (in Firefox, via Firebug) or the Profile tab in the Chrome inspector.
There are many things you can do to speed up a web page. If you have many small images (icons), it may be a good idea to join them all into one big image and pick the correct one using CSS. Also put the CSS and script tags in the right order.

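<?php
// Cache headers: the page expires one hour from now.
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
header('Content-Type: text/html; charset=utf-8');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s') . ' GMT');
header('X-UA-Compatible: IE=Edge,chrome=1');

// Output-buffer callback: strip HTML comments (IE conditional comments are kept)
// and collapse every whitespace run to one space, so words on adjacent lines
// are not glued together.
// Caution: this also collapses whitespace inside <pre>, <textarea>, and inline scripts.
function ob_html_compress($buf) {
    $buf = preg_replace('/<!--(?!\[|<!\[).*?-->/s', '', $buf);
    return preg_replace('/\s+/', ' ', $buf);
}
ob_start('ob_html_compress'); ?>
<?php // Your Code ?>
<?php ob_end_flush(); ?>
I use this simple PHP script to squeeze the HTML onto one line.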
See example: http://cs.lviv.pro/

Related

Does formatted HTML take up more space?

If one views the source code of http://www.google.com, it's highly minified. Even the html part. I am just wondering if formatted html takes up more space than minified HTML.
All I can think of is that in formatted HTML, the space, tab, and newline characters take up space, and that this is the only place where HTML minification can save memory.
Yes, your thinking is correct. Removing whitespace and compressing the HTML will result in smaller download sizes.
If you'd like to see test cases for HTML minification, check out this blog post on Perfection Kills.
Excerpt:
Original size: 217KB (35.8KB gzipped)
Minified size: 206.6KB (34.3KB gzipped)
Savings: 10.4KB (1.5KB gzipped)
Minifying home page of amazon.com saves about 10KB with uncompressed
document, and only 1.5KB with compressed one.
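A comparison like the one in the excerpt can be reproduced for your own pages. The following is only a rough sketch: `size_report()` and `naive_minify()` are hypothetical helper names, the sample markup is made up, and the minifier is deliberately naive (it would damage `<pre>` blocks), so treat the numbers it prints as indicative only.

```php
<?php
// Hypothetical helper: report raw and gzipped byte counts for a string.
function size_report(string $html): array
{
    return [
        'raw'  => strlen($html),
        'gzip' => strlen(gzencode($html, 6)),
    ];
}

// Naive minifier used only for this measurement: collapses whitespace runs.
// It is not safe for <pre>, <textarea>, or inline scripts.
function naive_minify(string $html): string
{
    return trim(preg_replace('/\s+/', ' ', $html));
}

// Sample input (an assumption); substitute the page you want to measure.
$original = str_repeat("<div>\n    <p>\n        Hello,   world\n    </p>\n</div>\n", 200);
$minified = naive_minify($original);

$before = size_report($original);
$after  = size_report($minified);

printf("raw:  %d -> %d bytes\n", $before['raw'], $after['raw']);
printf("gzip: %d -> %d bytes\n", $before['gzip'], $after['gzip']);
```

Comparing the two printed lines shows how much of the minification saving survives once gzip is applied.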
Yes, there’s a difference. But for many (most?) websites this difference is not worth thinking about, because (1) the server will probably serve the HTML gzipped anyway, and (2) you don’t have enough pageviews to make the difference substantial. (Google does.)
Yes, minifying HTML, CSS, and JavaScript by removing spaces, tabs, newlines, and comments saves on bandwidth cost.
In addition to minifying the HTML, you should also be certain your HTML, CSS, and JavaScript is being GZIP'ed when being sent over the wire for even better performance. For more information about GZIP, read: http://developer.yahoo.com/performance/rules.html#gzip
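Gzip is normally switched on in the web server configuration, or in PHP with `ob_start('ob_gzhandler')`. To make the mechanism visible, here is a minimal sketch of what such a handler does. `encode_response()` is a hypothetical name, and the Accept-Encoding check is simplified (it ignores q-values).

```php
<?php
// Hypothetical helper: gzip a response body when the client advertises support.
// Returns the headers to send and the (possibly compressed) body.
function encode_response(string $body, string $acceptEncoding): array
{
    // Simplified check: ignores q-values such as "gzip;q=0".
    if (stripos($acceptEncoding, 'gzip') === false) {
        return ['headers' => [], 'body' => $body];
    }
    return [
        'headers' => ['Content-Encoding: gzip', 'Vary: Accept-Encoding'],
        'body'    => gzencode($body, 6),
    ];
}

// Made-up, repetitive markup of the kind that compresses well.
$html = str_repeat("<li class=\"item\">Example row</li>\n", 500);
$response = encode_response($html, 'gzip, deflate');

printf("%d bytes before, %d bytes after gzip\n", strlen($html), strlen($response['body']));
```

In a real request you would pass `$_SERVER['HTTP_ACCEPT_ENCODING'] ?? ''` as the second argument and send each returned header with `header()`.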
I would also like to add that it is very important to think about bandwidth cost and page speed in this day and age. Mobile web users are on a large upward swing. Even if you are not expecting a large mobile draw to your site, you are doing a disservice to those trying to access it on their mobile 3G devices by not giving proper consideration to bandwidth cost and speed.

If you have CSS that's specific to that page, is it better to include it in the <head> or a separate file?

I'm working on a web app; one of the screens requires some CSS that's very specific to that page (i.e., it isn't used anywhere else in the app/site).
So - I have three options:
1. Include it in the global CSS file
2. Include it in a page-specific CSS file
3. Include it in the <head> of the page
The downside of option 1 is that the CSS will be loaded when the user visits any screen of the app, even if she never visits this specific screen (which is quite likely).
The downside of option 2 is that it's a separate HTTP request; since the CSS itself is trivially small (<1kb) - it seems like the overhead of the http request itself is worse than the actual bandwidth to download the data.
The downside of option 3 is that the user will download the CSS every time she visits the page (i.e., the CSS won't get cached). But since this page is viewed infrequently (and seldom revisited), this seems minor.
To me - it seems like option 3 might be the best. But everything I read seems to discourage that approach.
Given how hard experts push CSS sprites to minimize http requests, doesn't the same logic apply to a tiny CSS file? So, why isn't #3 a good option? Are there other considerations I've missed?
For what it's worth - it seems like this same question applies to any page-specific JavaScript; I could include that in a <script> tag at the end of the page, or in a separate .js file.
Thanks in advance.
Put it in the head and move on to other problems. :)
"Programmers waste enormous amounts of time thinking about, or
worrying about, the speed of noncritical parts of their programs, and
these attempts at efficiency actually have a strong negative impact
when debugging and maintenance are considered. We should forget about
small efficiencies, say about 97% of the time: premature optimization
is the root of all evil. Yet we should not pass up our opportunities
in that critical 3%." --Donald Knuth
If you're using it in a single page, you'd best include it in the <head> directly. It results in fewer HTTP requests going through, less bandwidth usage, and marginally faster loading.
You can also consider using a combination of an internal and external stylesheet: for stuff that you might use site-wide, like the styles for h1, h2, h3, and so on, link to an external stylesheet. For stuff specific to the one page, like the background-image, put style in the <head>.
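If the page is rendered with PHP, the choice between inlining and linking can be made at render time. This is just a sketch: `page_css_tag()` is a hypothetical helper, the 1024-byte threshold is an arbitrary assumption based on the "<1kb" figure in the question, and it assumes the stylesheet does not contain the text `</style>`.

```php
<?php
// Hypothetical helper: inline a stylesheet when it is small, link to it otherwise.
// $path is the file on disk, $href the URL a browser would request;
// the 1024-byte default threshold is an arbitrary assumption.
function page_css_tag(string $path, string $href, int $maxInlineBytes = 1024): string
{
    $size = is_file($path) ? filesize($path) : false;
    if ($size !== false && $size <= $maxInlineBytes) {
        // Small enough: save the extra HTTP request.
        return '<style>' . trim(file_get_contents($path)) . '</style>';
    }
    // Too large (or missing on disk): let the browser fetch and cache it.
    return '<link rel="stylesheet" href="' . htmlspecialchars($href, ENT_QUOTES) . '">';
}
```

Calling it inside the template's <head> keeps the CSS in its own file for maintenance while still avoiding the separate request for tiny stylesheets.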

Overhead of HTML whitespace indentation

I started wondering what the overall impact is of using spaces to indent HTML documents.
Why not simply use tabs to indent? Wouldn't this be more cost-effective: 1 char (\t) vs., for example, 4 chars (spaces)?
I did a little experimenting by converting an ASP.NET page to use tabs and comparing the sizes of the rendered markup.
Replacing the whitespace in only one partial view reduced a 22 kB page to 19.4 kB, a 12% reduction. After changing all indentation, the page came to 16.7 kB, a 24% reduction! (I used Chrome dev tools and Fiddler to verify.)
Is my reasoning sound? Should tabs be the primary choice for indenting HTML? Is there any reason to use spaces (such as compatibility with exotic browsers)?
P.S. Stack Overflow seems to use spaces too. Converting the SO main page to use tabs gave a 9% reduction. Is this a valid observation? If so, why haven't they used tabs?
StackOverflow uses HTTP Compression - when this is turned on, the differences between using spaces versus tabs goes down - a lot.
You need to run your tests against the compressed versions for reliable results.
You do have a point though for the cases when a browser does not support the compression schemes the server supports.
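To run that kind of test yourself, you can generate the same markup with both indentation styles and compare raw and gzipped sizes. The sketch below uses made-up table markup and a hypothetical `build_page()` helper; real pages will give different numbers, so it prints the results rather than promising any.

```php
<?php
// Build the same nested markup twice: once indented with four spaces per
// level, once with one tab per level. The markup itself is made up.
function build_page(string $indentUnit, int $rows = 300): string
{
    $out = "<table>\n";
    for ($i = 0; $i < $rows; $i++) {
        $out .= str_repeat($indentUnit, 1) . "<tr>\n";
        $out .= str_repeat($indentUnit, 2) . "<td>Row $i</td>\n";
        $out .= str_repeat($indentUnit, 2) . "<td>" . md5((string) $i) . "</td>\n";
        $out .= str_repeat($indentUnit, 1) . "</tr>\n";
    }
    return $out . "</table>\n";
}

$spaces = build_page('    ');
$tabs   = build_page("\t");

// Bytes saved by switching to tabs, before and after gzip.
$rawSaving  = strlen($spaces) - strlen($tabs);
$gzipSaving = strlen(gzencode($spaces)) - strlen(gzencode($tabs));

printf("raw saving:  %d bytes\n", $rawSaving);
printf("gzip saving: %d bytes\n", $gzipSaving);
```

The raw saving is exactly three bytes per indentation level per line; the gzip saving is what actually matters for clients that accept compressed responses.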
First thing: HTML has no rule requiring indentation. Programmers do it for readability and to show the program's structure. Moreover, we can reduce the size taken up by indents and whitespace through compression.
Minifying/compacting/compressing HTML: compacting HTML code can save many bytes of data and speed up downloading, parsing, and execution time.
StackOverflow uses HTTP Compression
Minifying HTML has the same benefits as those for minifying CSS and JS: reducing network latency, enhancing compression, and faster browser loading and execution. Moreover, HTML frequently contains inline JS code (in <script> tags) and inline CSS (in <style> tags), so it is useful to minify these as well.
Note: This rule is experimental and is currently focused on size reduction rather than strict HTML well-formedness. Future versions of the rule will also take into account correctness. For details on the current behavior, see the Page Speed wiki.
Tip: When you run Page Speed against a page referencing HTML files, it automatically runs the Page Speed HTML compactor (which will in turn apply JSMin and cssmin.js to any inline JavaScript and CSS) on the files and saves the minified output to a configurable directory.
Refer : http://code.google.com/speed/page-speed/docs/payload.html#MinifyHTML
Why not simply use tabs to indent? Wouldn't this be more cost-effective: 1 char (\t) vs. example 4 chars (spaces)?
If you're worried about downloaded HTML size, you won't fuss over tabs-vs-spaces — you'll compress your HTML as it goes over the wire and minify your markup, CSS, and Javascript, which provide real savings and don't interfere with your own coding guidelines.

HTML compression

Most web pages are filled with significant amounts of whitespace and other useless characters which result in wasted bandwidth for both the client and server. This is especially true with large pages containing complex table structures and CSS styles defined at the level. It seems like good practice to preprocess all your HTML files before publishing, as this will save a lot of bandwidth, and where I live, bandwidth aint cheap.
It goes without saying that the optimisation should not affect the appearance of the page in any way (According to the HTML standard), or break any embedded Javascript or backend ASP code, etc.
Some of the functions I'd like to perform are:
Removal of all whitespace and carriage returns. The parser needs to be smart enough to not strip whitespace from inside string literals. Removal of space between HTML elements or attributes is mostly safe, but iirc browsers will render the single space between div or span tags, so these shouldn't be stripped.
Remove all comments from HTML and client side scripts
Remove redundant attribute values. e.g. <option selected="selected"> can be replaced with <option selected>
As if this wasn't enough, I'd like to take it even farther and compress the CSS styles too. Pages with large tables often contain huge amounts of code like the following: <td style="TdInnerStyleBlaBlaBla">. The page would be smaller if the style label was small. e.g. <td style="x">. To this end, it would be great to have a tool that could rename all your styles to identifiers comprised of the least number of characters possible. If there are too many styles to represent with the set of allowable single digit identifiers, then it would be necessary to move to larger identifiers, prioritising the smaller identifiers for the styles which are used the most.
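The renaming scheme described above can be sketched in a few lines. This assumes the "style labels" are class names (the question writes them as style="..."), and all three function names are hypothetical. It only builds the rename map; applying it consistently to the HTML, the CSS, and any JavaScript that references the names is the hard part and is not shown.

```php
<?php
// Hypothetical sketch: map the most frequently used class names to the
// shortest identifiers (a, b, ..., z, aa, ab, ...).
function short_name(int $index): string
{
    $alphabet = 'abcdefghijklmnopqrstuvwxyz';
    $name = '';
    do {
        $name = $alphabet[$index % 26] . $name;
        $index = intdiv($index, 26) - 1;
    } while ($index >= 0);
    return $name;
}

// Count how often each class name appears in class="..." attributes.
function count_class_usage(string $html): array
{
    $counts = [];
    preg_match_all('/\bclass="([^"]*)"/', $html, $matches);
    foreach ($matches[1] as $value) {
        foreach (preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY) as $class) {
            $counts[$class] = ($counts[$class] ?? 0) + 1;
        }
    }
    return $counts;
}

// Assign identifiers in descending order of use, shortest first.
function build_rename_map(array $usageCounts): array
{
    arsort($usageCounts); // most used first
    $map = [];
    $i = 0;
    foreach (array_keys($usageCounts) as $className) {
        $map[$className] = short_name($i++);
    }
    return $map;
}
```

With more than 26 classes the identifiers grow to two characters, and because the map is ordered by use, the single-character names go to the classes that appear most often.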
In theory it should be quite easy to build a piece of software to do all this, as there are many XML parsers available to do the heavy lifting. Surely someone's already created a tool which can do all these things and is reliable enough to use on real life projects. Does anyone here have experience with doing this?
The term you're probably after is 'minify' or 'minification'.
This is very similar to an existing conversation which you may find helpful:
https://stackoverflow.com/questions/728260/html-minification
Also, depending on the web server you use and the browser used to look at your site, it is likely that your server is already compressing data without you having to do anything:
http://en.wikipedia.org/wiki/HTTP_compression
Your three points are actually called "minimizing HTML/JS/CSS".
You can have a look at these:
HTML online minimizer/compressor?
http://tidy.sourceforge.net/
I have also done some HTML/JS/CSS compression in my personal distributed crawler, which uses gzip, bzip2, or 7zip:
gzip = fastest, ~12-25% original filesize
bzip2 = normal, ~10-20% original filesize
7zip = slow, ~7-15% original filesize

Minimize html, doubts and questions

Minimizing html is the only section on Google's Page Speed where there is still room for improvement.
My site is all dynamic and the HTML is already Deflated so there is no reason to put any more pressure on the server (I don't want to minimize pages real time before sending).
What I could do is minimize the template files. My template files are a mix of PHP and HTML, so I've come up with some code that I think is pretty safe but would like the community to review.
// this will loop trough all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);
This will turn each template file into a single very long line, interrupted only at the places where DB content is inserted. Google's homepage source looks more or less like what I get, so I wonder if they follow a similar approach.
Question 1: Do you anticipate potential problems?
Question 2: Is there a better (more efficient) way to do this?
And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).
Edit: Do take into consideration that the template files will be minimized on deploy. Just as the CSS and JavaScript files are minimized and compressed using YUI Compressor and Closure, the template files would likewise be minimized on deploy, not on client request.
Thank you.
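For reference, the two steps from the question can be wrapped into one function for a deploy script. This is only a sketch: `minify_template()` is a hypothetical name, and the caveats in the comments are things to check against your own templates.

```php
<?php
// Sketch of a deploy-time pass over one template file, following the
// question's two steps. minify_template() is a hypothetical name.
function minify_template(string $path): string
{
    // Step 1: strip PHP comments and whitespace first, so a "// comment"
    // cannot swallow the rest of the file once newlines are collapsed.
    $phpClean = php_strip_whitespace($path);

    // Step 2: collapse every whitespace run to a single space.
    // Caveat: this also rewrites whitespace inside PHP string literals
    // and inside <pre>/<textarea> markup.
    return preg_replace('/\s+/', ' ', $phpClean);
}
```

A deploy script would call this for each template and write the result to the build directory, leaving the readable originals in version control.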
Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.
In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type that have simple content (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can exclude the value (e.g., say simply checked rather than checked=checked).
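Both attribute tricks can be approximated with regular expressions applied to individual start tags. The function names below are hypothetical, the list of boolean attributes is a short assumption rather than the complete set, and a regex has no idea whether it is looking at markup or at text inside a script, so this is a sketch and not a safe general-purpose tool.

```php
<?php
// Hypothetical sketch: drop the double quotes around attribute values that
// match the "simple content" pattern quoted above.
function unquote_simple_attributes(string $tag): string
{
    return preg_replace_callback(
        // Only when the closing quote is followed by whitespace or ">",
        // so class="x"/> is left alone (an unquoted value would absorb the "/").
        '/(\s[A-Za-z][-A-Za-z0-9]*)="([^"]*)"(?=[\s>])/',
        function (array $m): string {
            if (preg_match('/^[-A-Za-z0-9._:]+$/', $m[2])) {
                return $m[1] . '=' . $m[2];
            }
            return $m[0]; // keep quotes for anything else
        },
        $tag
    );
}

// checked="checked" -> checked (short list of attributes; extend as needed).
function collapse_boolean_attributes(string $tag): string
{
    return preg_replace(
        '/\s(checked|selected|disabled|readonly|multiple)=("?)\1\2(?=[\s>])/i',
        ' $1',
        $tag
    );
}
```

Values containing spaces, such as a class attribute with two class names, keep their quotes, as do empty values.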
Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.
That being said, besides a couple large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.
White space can be significant (e.g. in pre elements).
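A whitespace collapser can be taught to skip such elements. The following is a minimal sketch, assuming the four elements listed in the pattern are the only ones that matter on your pages and that they are not nested inside each other; `minify_preserving()` is a hypothetical name.

```php
<?php
// Sketch: collapse whitespace everywhere except inside elements where it
// is significant. The element list is an assumption, not exhaustive.
function minify_preserving(string $html): string
{
    // Alternative 1 matches a whole protected element; alternative 2
    // matches any other whitespace run.
    $pattern = '#(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)|\s+#is';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // A protected element matched: return it byte for byte.
            // Otherwise the match is a whitespace run: collapse it.
            return (isset($m[1]) && $m[1] !== '') ? $m[1] : ' ';
        },
        $html
    );
}

echo minify_preserving("<div>\n  <p>a   b</p>\n<pre>x\n   y</pre>\n</div>"), "\n";
```

Because the protected elements are consumed as single matches, the whitespace inside them never reaches the collapsing branch.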
When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.
tidy -c -n -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0
I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.
I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It will only affect template content, so there is little risk of messing up with unknown <pre> or similar.
It is run before deploy so there is no impact on server - actually there should be a little speed up as the file becomes smaller.
Do remember that all content that comes from the database will not suffer any influence as, like said before, this runs before deploy and on template files only.
The method seems solid enough to put into production.
If anything goes wrong I'll post it here.