Most web pages are filled with significant amounts of whitespace and other useless characters, which waste bandwidth for both the client and the server. This is especially true of large pages containing complex table structures and CSS styles defined at the element level. It seems like good practice to preprocess all your HTML files before publishing, as this will save a lot of bandwidth, and where I live, bandwidth ain't cheap.
It goes without saying that the optimisation should not affect the appearance of the page in any way (according to the HTML standard), or break any embedded JavaScript or backend ASP code, etc.
Some of the functions I'd like to perform are:
Removal of all whitespace and carriage returns. The parser needs to be smart enough to not strip whitespace from inside string literals. Removal of space between HTML elements or attributes is mostly safe, but iirc browsers will render the single space between div or span tags, so these shouldn't be stripped.
Remove all comments from HTML and client side scripts
Remove redundant attribute values. e.g. <option selected="selected"> can be replaced with <option selected>
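As a rough illustration of the last item, here is a minimal Python sketch that collapses a few boolean attributes. The attribute list and the function name are assumptions made for this example, and a regex like this only sees text, so it could also touch matching text outside of tags; a real tool would work on parsed markup.

```python
import re

# Hypothetical, incomplete list of boolean attributes, chosen for illustration.
BOOLEAN_ATTRS = ("checked", "disabled", "multiple", "readonly", "selected")

# Matches name="name", name='name' or name=name when preceded by whitespace
# and followed by whitespace, "/" or ">".
PATTERN = re.compile(
    r"(?<=\s)(%s)=([\"']?)\1\2(?=[\s/>])" % "|".join(BOOLEAN_ATTRS),
    re.IGNORECASE,
)

def collapse_boolean_attributes(markup: str) -> str:
    """Replace selected="selected" style attributes with the bare name."""
    return PATTERN.sub(r"\1", markup)

print(collapse_boolean_attributes('<option selected="selected" value="1">'))
```

Attributes whose name merely ends in one of these words (for example data-selected) are left alone because the pattern requires whitespace directly before the name.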
As if this wasn't enough, I'd like to take it even further and compress the CSS styles too. Pages with large tables often contain huge amounts of code like the following: <td class="TdInnerStyleBlaBlaBla">. The page would be smaller if the style label was short, e.g. <td class="x">. To this end, it would be great to have a tool that could rename all your styles to identifiers made up of the fewest characters possible. If there are too many styles to represent with the set of allowable single-character identifiers, it would be necessary to move to longer identifiers, giving the shortest identifiers to the styles that are used the most.
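The renaming idea could be sketched roughly as follows in Python. This assumes the style labels are values of class attributes written with double quotes, and all function names here are made up for the example. It only rewrites the markup; the stylesheet (and any script that refers to the names) would need the same mapping applied.

```python
import itertools
import re
import string
from collections import Counter

CLASS_ATTR = re.compile(r'class="([^"]*)"')

def short_names():
    """Yield 'a', 'b', ..., 'z', 'aa', 'ab', ... in order of increasing length."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)

def build_rename_map(markup: str) -> dict:
    """Give the shortest identifiers to the most frequently used class names."""
    counts = Counter()
    for value in CLASS_ATTR.findall(markup):
        counts.update(value.split())
    ranked = [name for name, _count in counts.most_common()]
    return dict(zip(ranked, short_names()))

def rename_classes(markup: str, mapping: dict) -> str:
    """Rewrite every class attribute using the mapping."""
    def replace(match):
        names = match.group(1).split()
        return 'class="%s"' % " ".join(mapping.get(n, n) for n in names)
    return CLASS_ATTR.sub(replace, markup)

page = '<td class="TdOuter"></td><td class="TdInner"></td><td class="TdInner"></td>'
mapping = build_rename_map(page)
print(rename_classes(page, mapping))
```

Because the names are ranked by usage count before identifiers are handed out, the single-letter names go to the classes that appear most often.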
In theory it should be quite easy to build a piece of software to do all this, as there are many XML parsers available to do the heavy lifting. Surely someone's already created a tool which can do all these things and is reliable enough to use on real life projects. Does anyone here have experience with doing this?
The term you're probably after is 'minify' or 'minification'.
This is very similar to an existing conversation which you may find helpful:
https://stackoverflow.com/questions/728260/html-minification
Also, depending on the web server you use and the browser used to look at your site, it is likely that your server is already compressing data without you having to do anything:
http://en.wikipedia.org/wiki/HTTP_compression
Your 3 points are actually called "minimizing HTML/JS/CSS".
Have a look at these:
HTML online minimizer/compressor?
http://tidy.sourceforge.net/
I have done some HTML/JS/CSS compression too, in my personal distributed crawler, which uses gzip, bzip2, or 7zip:
gzip = fastest, ~12-25% original filesize
bzip2 = normal, ~10-20% original filesize
7zip = slow, ~7-15% original filesize
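To get comparable figures for your own pages, a small Python sketch like the one below can help. It uses the standard library's gzip, bz2 and lzma modules; lzma is only a stand-in for 7zip here (same compression family, different container), and the sample markup is invented, so the ratios it prints will not match the numbers above.

```python
import bz2
import gzip
import lzma

def compression_report(data: bytes) -> dict:
    """Return compressed size as a fraction of the original size, per codec."""
    sizes = {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }
    return {name: size / len(data) for name, size in sizes.items()}

# Invented, highly repetitive sample; real pages compress less well.
sample = ('<tr><td class="cell">value</td></tr>\n' * 2000).encode("utf-8")
report = compression_report(sample)
for name, ratio in report.items():
    print(f"{name}: {ratio:.1%} of original size")
```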
I am storing the HTML from the body of emails in a SQL Server nvarchar(max) column.
Is there any benefit in minimizing the HTML on the way in?
By minimizing I mean removing redundant white space and carriage returns/linefeeds in the HTML text stream. My terminology might not be quite right: I'm not looking at removing any HTML tags/comments or anything like that.
By benefit I mean in terms of efficiency of storage space, speed of insert/retrieval, so benefits are focused on the database side.
If it is worthwhile to do, what should I look out for (e.g. if I replace linefeeds with a single space, might it render the HTML incorrectly at a later time)?
You'd still have to have a full HTML parser to understand what's HTML and what's not. Most browsers do a bit of 'fixing up' to make otherwise unpresentable HTML graphically renderable, in ways that would be impossible without fully parsing the tree.
Someone could stick some bad HTML in that would goof up your 'simple' parser pretty easily, more often by mistake than by malice. Don't get into the business of fixing HTML; handle it verbatim and let the bad content hang itself.
The HTML will just be stored as a BLOB in the database. You won't be able to parse it, search it, etc. (well, you technically can, but that's silly). In that case, you can (un)compress it in the client and send and store it as varbinary(max) in the database.
The trade off is CPU time to manage compression vs increased storage+network traffic.
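A minimal sketch of the client-side (un)compression, in Python with zlib. The function names are made up for this example, and the database call itself is left out; the bytes returned by pack_html are what would go into the varbinary(max) column.

```python
import zlib

def pack_html(html_text: str) -> bytes:
    """Encode as UTF-8 and compress; the result is what gets stored."""
    return zlib.compress(html_text.encode("utf-8"), 6)

def unpack_html(blob: bytes) -> str:
    """Reverse of pack_html: decompress and decode back to text."""
    return zlib.decompress(blob).decode("utf-8")

body = "<p>Hello,</p>\n<p>see the attached report.</p>\n" * 200
blob = pack_html(body)
print(len(body.encode("utf-8")), "bytes before,", len(blob), "bytes after")
```

The round trip returns the original text byte for byte, so nothing in the email body is altered, which fits the advice to handle the HTML verbatim.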
I wouldn't sanitise the HTML because you'll lose readability and possibly original content.
I've been writing a source-to-display converter for a small project. Basically, it takes an input and transforms the input into an output that is displayable by the browser (think Wikipedia-like).
The idea is there, but it isn't like the MediaWiki style, nor is it like the Markdown style. It has a few innovations of its own. For example, when the user types in a chain of spaces, I would presume he wants the spaces preserved. Since HTML ignores spaces by default, I was thinking of converting these chains of spaces into the respective &nbsp;s (for example, 3 spaces in a row converted to 1 entity).
So what happens is that I can foresee a possibility of a ton of &nbsp; tags per post (and a single page may have multiple posts).
I've been hearing a lot of anti-&nbsp; sentiment on the web, but most of it boils down to readability headaches (in this case, the input is supplied by the user; if he decides to make his post unreadable, he can do so with any of the other formatting actions supplied) or maintenance headaches (which do not apply in this case, since it's a converted output).
I'm wondering what the disadvantages are of having tons of &nbsp; tags on a webpage?
You are rendering every space as &nbsp;?
Besides wasting so much bandwidth, this will not allow dynamic line breaking as "nbsp" means "*n*on *b*reaking *sp*ace". This will most probably cause much trouble.
If it's just being dumped to a client, it's just a matter of size, and if it's gzipped, it barely matters in terms of network traffic.
It'll slow down rendering, I'm sure, and take up DOM space, but whether or not that matters depends on stuff I don't know about your use case(s). You might be able to achieve the same result in other ways, too; not sure.
&nbsp;s aren't tags, but character entities, like &copy;, &lt;, &gt;, etc.
I'd say that the disadvantages would be readability. When I see a word, I expect the spacing to be constant (unless it is in a block of justified text).
Can you show me a case where you'd need &nbsp;s?
Have you considered trying to figure out what the user, by inserting those spaces, is really trying to achieve? Rather than the how (they want to insert the spaces), the what (if the spaces are at the beginning of a line, they want to indent the text in question).
An example of this is many programming sites convert 4 spaces at the start of a line to a pre+code block.
For your purposes, maybe it should be a <block> block.
The end goal being that of converting the spaces not to what the user (with their limited resources) intended to show up there but, rather, what they meant to convey with it.
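As an illustration of the "4 leading spaces become a pre+code block" convention mentioned above, here is a small Python sketch. The render function and its paragraph handling are assumptions for this example, not how any particular site does it.

```python
import html

def render(source: str) -> str:
    """Turn runs of lines indented by 4 spaces into <pre><code> blocks."""
    out = []
    code = []

    def flush():
        # Emit any pending code lines as one block; spacing inside is kept.
        if code:
            out.append("<pre><code>" + html.escape("\n".join(code)) + "</code></pre>")
            code.clear()

    for line in source.split("\n"):
        if line.startswith("    "):
            code.append(line[4:])
        else:
            flush()
            if line.strip():
                out.append("<p>" + html.escape(line) + "</p>")
    flush()
    return "\n".join(out)

print(render("intro\n    x  =  1\n    y < 2\nend"))
```

Inside the pre element the browser keeps the user's spacing as typed, so no &nbsp; entities are needed there.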
Minimizing html is the only section on Google's Page Speed where there is still room for improvement.
My site is all dynamic and the HTML is already Deflated so there is no reason to put any more pressure on the server (I don't want to minimize pages real time before sending).
What I can do is minimize the template files. My template files are a mix of PHP and HTML, so I've come up with some code that I think is pretty safe, but I would like the community to review it.
// this will loop through all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);
This will turn each of my template files into a single very long line, interrupted only at the places where DB content is inserted. Google's homepage source looks more or less like what I get, so I wonder if they follow a similar approach.
Question 1: Do you anticipate potential problems?
Question 2: Is there a better (more efficient) way to do this?
And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).
Edit: Do take into consideration that the template files will be minimized on deploy. Just as the CSS and JavaScript files are minimized and compressed using YUI Compressor and Closure, the template files would likewise be minimized on deploy, not on client request.
Thank you.
Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.
In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type that have simple content (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can exclude the value (e.g., say simply checked rather than checked=checked).
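A small Python sketch of the quoting rule, using the same character class as the regex above. The function name and the escaping of the quoted fallback are assumptions for this example.

```python
import re

# Same pattern as quoted in the answer: values made only of these
# characters are written without quotes.
SAFE_UNQUOTED = re.compile(r"^[-A-Za-z0-9._:]+$")

def format_attribute(name: str, value: str) -> str:
    """Emit name=value without quotes when the value is simple enough."""
    if SAFE_UNQUOTED.fullmatch(value):
        return f"{name}={value}"
    escaped = value.replace("&", "&amp;").replace('"', "&quot;")
    return f'{name}="{escaped}"'

print(format_attribute("id", "main"))     # simple value, no quotes
print(format_attribute("class", "a b"))   # contains a space, stays quoted
```

Empty values and values containing spaces, slashes or quotes fall through to the quoted form.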
Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.
That being said, besides a couple large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.
White space can be significant (e.g. in pre elements).
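One way to take that into account, sketched in Python rather than the PHP used in the question: collapse runs of whitespace everywhere except inside pre and textarea elements. The function name is made up for this example, and the regex assumes those elements are not nested and are properly closed.

```python
import re

# Capturing group so that re.split keeps the preserved blocks in the result.
PRESERVE = re.compile(
    r"(<pre\b.*?</pre>|<textarea\b.*?</textarea>)",
    re.DOTALL | re.IGNORECASE,
)

def collapse_whitespace(markup: str) -> str:
    """Collapse whitespace runs to one space outside <pre>/<textarea>."""
    parts = PRESERVE.split(markup)
    # Even indexes are ordinary markup, odd indexes are preserved blocks.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\s+", " ", parts[i])
    return "".join(parts)

print(collapse_whitespace("<p>a\n   b</p>\n<pre>x\n  y</pre>\n\n<p>c</p>"))
```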
When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.
tidy -c -n -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0
I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.
I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It only affects template content, so there is little risk of messing up an unknown <pre> or similar.
It is run before deploy, so there is no impact on the server; actually, there should be a small speed-up as the files become smaller.
Do remember that content that comes from the database is not affected since, as said before, this runs before deploy and on template files only.
The method seems solid enough to put into production.
If anything goes wrong I'll post it here.
I have a few hand-crafted web pages. When deploying them I would like to run them through a tool so that new smaller HTML files are created, with extraneous whitespace taken out, etc.
We already use YUICompressor for our JavaScript and our CSS, and we tend to follow all of the techniques described by the Yahoo performance team.
Is there a good, free tool that does this? I prefer tools that would fit into our deployment process similarly to YUICompressor.
HTML Tidy does the job.
I use the following on one document that I generate (a rather large one). This saved me about 10% on the post-gzip size.
tidy -c -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0 source.html > target.html
-c — Replace surplus presentational tags and attributes
-omit — Drop optional end tags
-ashtml — use HTML rather than XHTML (HTML is leaner and XHTML provides no benefits for most use cases)
-utf8 — So we don't have to use entities for characters outside the character set (entities are more bytes)
--doctype strict — use Strict (again, leaner)
--drop-proprietary-attributes yes — get rid of proprietary junk
--output-bom no — BOMs cause issues in some clients
--wrap 0 — Have very long lines
Plain old minify will also attack your HTML for you, if you want.
But HTML minification isn't, generally, hugely effective:
Taking runs of whitespace down to one won't do that much. If you're already using gzip/deflate, that'll be compressing the whitespace quite efficiently. You can't remove all whitespace as single whitespaces can often have an effect on rendering that it is desirable to keep.
Taking comments out may have an effect, depending on how much comment content you actually have. But you'd have to be careful not to hit conditional comments.
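A rough Python sketch of comment stripping that leaves conditional comments alone. It is regex-based, so the caveats about regex minifiers apply: for instance, it does not know about comment-like text inside script blocks. The function name is an assumption for this example.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

def strip_comments(markup: str) -> str:
    """Remove HTML comments, keeping IE conditional comments intact."""
    def keep_or_drop(match):
        body = match.group(1)
        # Conditional comments look like <!--[if IE]> ... <![endif]-->
        if body.startswith("[if ") or body.endswith("<![endif]"):
            return match.group(0)
        return ""
    return COMMENT.sub(keep_or_drop, markup)

page = '<p>a</p><!-- note --><!--[if IE]><link href="ie.css"><![endif]--><p>b</p>'
print(strip_comments(page))
```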
Apart from that, there is not much in an HTML document that can be ‘minified’. Obviously the JS idea of packing variable names down to the shortest possible string is inapplicable.
Doing all this with regex, as most minifiers do, is a bit dodgy. You have to stick to a limited ‘normal’ range of markup that won't trip it up.
With HTML minification you're typically getting less gain (and less post-gzip gain) than JS/CSS minification, and for dynamically-generated pages you have more overhead (as you can't pre-minify them like with static scripts/styles). Some templating languages may already have built-in features for trimming whitespace at generation time; if available in your environment, use that.
Simple question hopefully.
We have a style sheet that is over 3000 lines long and there is a noticeable lag when the page is rendering as a result.
Here's the question: Is it better to have one massive style sheet that covers everything, or lots of little style sheets that cover different parts of the page? (eg one for layout, one for maybe the drop down menu, one for colours etc?)
This is for performance only, not really 'which is easier'
3,000 lines? You may want to first go in and look for redundancy, unnecessarily-verbose selectors, and other formatting/content issues. You can opt to create a text stylesheet, a colors stylesheet, and a layout stylesheet, but it's likely not going to improve performance. That is generally done to give you more organization. Once you've tightened up your rules, you could also minify it by removing all formatting, which might shave off a little bit more, but likely not much.
Well, if you split those 3k lines into multiple files, the overall rendering time won't decrease, because:
All 3000 lines will still need to be parsed
Multiple requests are needed to get the CSS files, which slows down the whole thing on another level
According to this probably reliable source: one file, to satisfy rule 1 (minimize HTTP requests). And don't forget rule 10, minify JS and CSS, especially with a 3000-line monster.
It'll be worse if you split them up because of the overhead of the extra HTTP requests and new connections for each one (I believe it is Apache's default behaviour to have keep-alive off)
Either way, it all needs to be downloaded and parsed before anything can happen.
Separating that monster file into smaller ones (layout, format and so on) would make development more efficient. Before deployment you should merge and minify them to avoid multiple HTTP requests. Giving the file a new number (style-x.css) for each new deployment will allow you to configure your HTTP server to set an expiry date far in the future, thereby saving some additional HTTP requests.
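A minimal Python sketch of such a merge step. Instead of a hand-incremented number it derives the x in style-x.css from a hash of the merged content, which is a swapped-in variation on the suggestion above; the function name is made up, and the crude comment/whitespace stripping is no substitute for a real CSS minifier (it would, for example, mangle a string literal that contains comment markers).

```python
import hashlib
import re

def bundle_css(sources):
    """Merge stylesheets, strip comments/whitespace, return (filename, css)."""
    combined = "\n".join(sources)
    combined = re.sub(r"/\*.*?\*/", "", combined, flags=re.DOTALL)
    combined = re.sub(r"\s+", " ", combined).strip()
    # The name changes only when the content changes, so a far-future
    # expiry header on the file stays safe.
    digest = hashlib.sha1(combined.encode("utf-8")).hexdigest()[:8]
    return f"style-{digest}.css", combined

name, css = bundle_css([
    "/* layout */\nbody { margin: 0; }",
    "a {\n  color: red;\n}",
])
print(name)
print(css)
```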
It sounds like you are using CSS in a very inefficient way. I usually have a style sheet with between 400 and 700 lines and some of the sites that I have designed are very intricate. I don't think you should ever need more than 1500 lines, ever.
3000 lines of code is far too many to maintain properly. My advice would be to find things that share the same properties and make them sub-categories. For example, if you want to have one font throughout the page, define it once in the body and forget about it. If you need multiple fonts or multiple backgrounds, you can create a div with class font1 and wrap anything that needs that font style in that div.
Always keep CSS in one file, unless you have completely different styles on each page.
The effort of loading multiple CSS files has to be weighed against the complexity (and hence speed) of parsing, as well as maintenance aspects.
If certain subsets of the monster file can be related to certain html pages (and only those certain pages) then a separation into smaller units would make sense.
example:
you have a family homepage and your all.css contains all the formats for your own range of pages, your spouse's, your kids' and your pet's pages - all together 3000 lines.
./my/*.html call ./css/all.css
./spouse/*.html call ./css/all.css
./kid/*.html call ./css/all.css
./pet/*.html call ./css/all.css
in this case it's rather easy to migrate to
./my/*.html call ./css/my.css
./spouse/*.html call ./css/spouse.css
./kid/*.html call ./css/kid.css
./pet/*.html call ./css/pet.css
better to maintain, easier to transfer responsibilities, better to protect yourself from lousy code crunchers :-)
If all (or most) of your pages are so complex that they absolutely need the majority of the 3000 lines, then don't split. You may want to check for "overcoding".
Good luck
MikeD
The only gain in dividing your CSS would be downloading the parts in parallel. If you host each CSS file on a different server, it could in some cases gain a bit of speed.
But in most cases, having a single 3000-line CSS file should be (a bit) faster.
Check out the Yahoo performance rules. They are backed by lots of empirical research.
Rule #1 is minimize HTTP requests (don't split the file; you could for maintenance purposes, but for performance you should concatenate the parts back together as part of a build process). #5 is place CSS references at the top (in <head>). You can also use the YUI compressor to reduce the file size of CSS by stripping whitespace etc.
More stuff (CDNs, gzipping, cache-control, etc.) in the rules.
I have a site css file that controls the styles for the overall site (layout mainly).
Then I have smaller css files for page specific stuff.
I even sometimes have more than one if I am planning on ripping out an entire section at a later date.
Better to download and parse 2 files at 20kb than 1 file at 200kb.
Update: Besides, isn't this a moot point? It only has to be downloaded once. If the pause is that big a deal, have a 'loading' screen like what GMail has.
3000 lines is not a big deal. If you split them into multiple chunks, it all still needs to be downloaded for rendering. The main concern is the file size. I have over 11000 lines in one of our master CSS files and the size is about 150 KB.
We gzipped the static content and the size dropped drastically, to about 20 KB, and we didn't face any performance issues.