Compression algorithms specifically optimized for HTML content?

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?
If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?
Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.
Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

Brotli is a compression algorithm specialized for HTML/English content.
Source: https://en.wikipedia.org/wiki/Brotli
Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary. The dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[6][7] A pre-defined dictionary can give a compression density boost for short data files.

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.
Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically, all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).
An idea to maybe improve upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using back-references to the very first instance of such a pattern. This gain would be particularly noticeable on smaller HTML documents.
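zlib already exposes a hook for exactly this kind of priming: a preset dictionary shared by compressor and decompressor. A minimal sketch (the tag list below is an arbitrary stand-in for a real, corpus-derived dictionary):

```python
import zlib

# Hypothetical "priming" dictionary: common HTML boilerplate.
# A real dictionary would be derived from a corpus of representative pages.
HTML_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title>'
    b'<link rel="stylesheet" href=""><script src=""></script></head>'
    b'<body><div class=""><span><p><a href=""><img src="" alt="">'
    b'<ul><li><table><tr><td></td></tr></table></body></html>'
)

def compress_html(html_bytes: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=HTML_DICT)
    return co.compress(html_bytes) + co.flush()

def decompress_html(data: bytes) -> bytes:
    do = zlib.decompressobj(zdict=HTML_DICT)
    return do.decompress(data) + do.flush()

page = b'<html><head><title>Hi</title></head><body><p>Hello</p></body></html>'
plain = zlib.compress(page, 9)
primed = compress_html(page)
print(len(page), len(plain), len(primed))  # the primed stream is often smaller for short pages
assert decompress_html(primed) == page
```

Both sides must agree on the same dictionary, which is exactly the "don't send it, imply it" trade-off discussed here.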
A complementary, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the auxiliary data of an LZ-x algorithm (say, the Huffman tree in the case of LZH, etc.), using statistics specific to typical HTML, being careful to exclude from the character counts the [statistically weighted] instances of characters encoded by back-references. Such a filtered character distribution would probably be closer to that of plain English (or the targeted web sites' national language) than the complete HTML text.
Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.
I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.
I may otherwise be looking in the wrong places. There doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).

About the only "lossy" compression I am willing to apply to HTML content, messy or not, is whitespace flattening. This is a typical post-publish step that high-volume sites perform on their content.
You can also flatten large JavaScript libraries using the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, etc. It is very important for large apps using kits like ExtJS, Dojo, etc.

Is gzip compression not sufficient for your needs? It gives you roughly a 10:1 compression ratio, not only with HTML content but also with JavaScript, CSS, etc., and is readily available on most servers and reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule, etc.) and all modern browsers (you can instruct both Apache and Nginx to skip compression for specific browsers based on User-Agent).
You'll be surprised how close gzip compression comes to optimal. Some people have suggested minifying your files; however, minification only helps much when your files contain lots of comments, which a minifier can discard completely (this is probably what you referred to as "lossy" -- though it is something you probably don't want to do with HTML anyway, unless you're sure none of your <script> or <style> tags are inside HTML comments <!-- --> to accommodate antediluvian browsers).
Remember that minifying achieves most of its gains from a technique similar to (yet more limited than) DEFLATE. So expect a minified file to be larger, or much larger, than a gzipped original (particularly true with HTML, in which you are stuck with W3C's tags and attributes, and only gzip can help you there), and expect gzipping a minified file to give you minimal gain over gzipping the original file (again, unless the original file contained lots of comments which can be safely discarded by a minifier).
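If you want to check that on your own pages, here is a rough comparison sketch (the comment/whitespace stripper below is a naive stand-in for a real minifier, and 'page.html' is a placeholder file name):

```python
import gzip
import re

def naive_minify(html: str) -> str:
    # Stand-in for a real minifier: strip HTML comments and collapse whitespace.
    html = re.sub(r'<!--.*?-->', '', html, flags=re.S)
    return re.sub(r'\s+', ' ', html).strip()

def sizes(html: str) -> dict:
    mini = naive_minify(html)
    return {
        'original': len(html.encode('utf-8')),
        'minified': len(mini.encode('utf-8')),
        'gzipped original': len(gzip.compress(html.encode('utf-8'), 9)),
        'gzipped minified': len(gzip.compress(mini.encode('utf-8'), 9)),
    }

with open('page.html', encoding='utf-8') as f:   # any sample page
    for name, size in sizes(f.read()).items():
        print(f'{name:18} {size:8} bytes')
```

Comment stripping is exactly the "lossy" step discussed above, so compare all four numbers before deciding whether it is worth it.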

Use S-expressions instead, saves you a number of characters per tag :)

If I understand your question correctly, what you need is gzip compression, which is available pretty easily with Apache.

Run your code through an HTML minifier/obfuscator that removes as much markup as possible, then let your web server compress it with gzip.

No, there are not any HTML-specific compression algorithms, because the general-purpose ones have proved adequate.
The potential gains would come from knowing ahead of time the likely elements of an HTML page - you could start with a predefined dictionary that would not have to be part of the compressed stream. But this would not give a noticeable gain, as compression algorithms are extraordinarily good at picking out common sub-expressions on the fly.

You would usually use a common algorithm like gzip which is supported by most browsers through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking the browser support of your website.
Additionally, you can minify static HTML files (or do that dynamically).

You could consider each unique grouping (i.e. tags and attributes) as a symbol, determine a minimum symbol count, and re-encode using Shannon entropy; this would generate one large blob of bytes with maximal compression. I will say that this may not be much better than gzip.
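As a back-of-the-envelope check of that idea, you can treat every distinct tag grouping as one symbol and compute the Shannon entropy of the resulting symbol stream; the crude regex tokenizer below is only an illustration:

```python
import math
import re
from collections import Counter

def tag_symbol_entropy(html: str) -> float:
    """Estimated bits per symbol if each unique tag grouping is one symbol."""
    symbols = re.findall(r'<[^>]+>', html)       # crude: each full tag is one symbol
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

html = '<html><body><p class="a">x</p><p class="a">y</p><p class="b">z</p></body></html>'
h = tag_symbol_entropy(html)
n = len(re.findall(r'<[^>]+>', html))
print(f'{h:.2f} bits/symbol, ~{h * n / 8:.1f} bytes for {n} tag symbols (markup only)')
```

This gives a rough lower bound for the markup portion only; the text content would still need its own coding.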

Pros and Cons of using HTML codes vs special characters

When building websites for non-English speaking countries
you have tons of characters that are out of scope.
For the database I usually encode it in either UTF-8 or Latin-1.
I would like to know if there are any issues with performance, speed, space optimization, etc.
For the fixed texts that are in the HTML, what is the difference between using, for example,
&aacute; or á
which look exactly the same: á or á
The things that I have so far for using UTF-8:
Pros:
Easy to read for the developers and the web administrator
Only one character occupied in the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testing)
Cons:
When sending files to other developers, depending on the IDE, software, etc. that they use to read the code, they will break the accent into things like: Ã©
When an automatic minification of the code occurs, it sometimes breaks it too
It usually breaks when it goes through an encoding conversion
The two cons carry more weight than the pros from my perspective, because they affect the visitor.
Just use the actual character á.
This is for many reasons.
First: separation of concerns; the database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use it in another service or a mobile app.
Second: just use UTF-8 for your database, not Latin-1. Again, think ahead: what if your app suddenly needs to support Japanese, then how would you store あ?
You always have the chance to convert it to HTML codes if you really have to... in a view. HTML is an implementation detail, not core to your app.
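For what it's worth, the conversion in both directions is a one-liner with Python's standard library; a minimal sketch (the sample string is arbitrary):

```python
import html

stored = "Crème brûlée á la carte"          # what the database holds: plain UTF-8 text

# View layer, only when an ASCII-only output is really required:
ascii_safe = stored.encode('ascii', errors='xmlcharrefreplace').decode('ascii')
print(ascii_safe)   # Cr&#232;me br&#251;l&#233;e &#225; la carte

# And back again, e.g. when importing legacy content that used entities:
print(html.unescape('&aacute; or &#225;'))  # á or á
```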
If your concern is the user, all major browsers in this day and age support UTF-8. Just use the right meta tag. Easy.
If your problem is developers and their tools, take a look at http://editorconfig.org/ to enforce and automate line endings and the use of UTF-8 in your files.
Maybe add some git attributes to the mix, and why not go the extra mile and have a git pre-commit hook run a checker to make super sure everyone commits UTF-8 files.
Computer time is cheap, developer time is expensive: á is easier to change and understand, so just use it.

Highly efficient filesystem APIs for certain kinds of operations

I occasionally find myself needing certain filesystem APIs which could be implemented very efficiently if supported by the filesystem, but I've never heard of them. For example:
Truncate file from the beginning, on an allocation unit boundary
Split file into two on an allocation unit boundary
Insert or remove a chunk from the middle of the file, again, on an allocation unit boundary
The only way that I know of to do things like these is to rewrite the data into a new file. This has the benefit that the allocation unit is no longer relevant, but is extremely slow in comparison to some low-level filesystem magic.
I understand that the alignment requirements mean that the methods aren't always applicable, but I think they can still be useful. For example, a file archiver may be able to trim down the archive very efficiently after the user deletes a file from the archive, even if that leaves a small amount of garbage either side for alignment reasons.
Is it really the case that such APIs don't exist, or am I simply not aware of them? I am mostly interested in NTFS, but hearing about other filesystems will be interesting too.
For NTFS and FAT there are no such APIs. You can obviously truncate the end of a file but not the beginning.
Implementing this is inadvisable due to file system caching.
Most of the time people implement a layer "on top" of NTFS to support this.
Raymond Chen has essentially answered this question.
His answer is that no, such APIs don't exist, because there is too little demand for them. Raymond also suggests the use of sparse files and decommitting blocks by zeroing them.
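For other filesystems the picture is somewhat better: on Linux, fallocate(2) supports FALLOC_FL_PUNCH_HOLE (and, on newer kernels with ext4/XFS, FALLOC_FL_COLLAPSE_RANGE, which removes a block-aligned range outright). A minimal ctypes sketch of hole punching, as an illustration of the sparse-file idea rather than a portable solution (Linux assumed, not NTFS; the example call is commented out):

```python
import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01   # don't change the file size
FALLOC_FL_PUNCH_HOLE = 0x02  # deallocate the byte range (it reads back as zeros)

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def punch_hole(path: str, offset: int, length: int) -> None:
    """Deallocate [offset, offset+length) inside the file; ext4/XFS, block-aligned."""
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_longlong(offset), ctypes.c_longlong(length))
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

# punch_hole("archive.bin", 4096, 1024 * 1024)  # free 1 MiB starting at the second 4 KiB block
```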

What are the practical benefits of using microformats for every possible thing?

What practical benefits can my client get if I use microformats on his site for every possible thing?
How can I explain these benefits to a non-technical client?
Sometimes it seems like the practical benefits are hard to quantify.
Search engines already pick up and parse microformats (see e.g. https://support.google.com/webmasters/answer/99170). I believe hCard and hCalendar are fairly well supported--and if not, plenty of sites are using it, including places like MySpace.
It's the idea that adding CSS classes and specified IDs makes your existing content easier to parse in a machine-readable manner.
hReview is starting to make some inroads, and hResume looks like it will take off too.
I heavily use rel="nofollow" on uncontrolled links (3rd party sources) which is actually a microformat.
Check the microformats wiki for a decent starting point.
It just means your viewers can share a few generic "formats". You can generalize stylesheets, and parsing mechanisms. Rather than having a webpage consist of one "html document," you have a webpage that consist of "10 formatted micro-documents".
If you need a real world analog: think of it like attaching a formatted invoice, to a receipt, and a business card, rather than writing it all down on notebook paper with your left hand.
Overall the site becomes easier to digest for the rest of the internet. The data can be reused, combined, cross-referenced, and saved.
A simple example would be to have, anywhere on the site, a latitude and a longitude (geo). With microformats, anybody who searches for that latitude and longitude can easily be referred to that website, increasing traffic and awareness of that person or company, and allowing users to easily save that information. (Although I've encountered little of this personally, this is more 'the future' of things than it is current. But it's always good to stay up to date.)
A second example would be a business card (hCard) where a browser can easily save and transfer it to an address book, so that just one visit to the site and the visitor has the information saved locally. Especially useful if they're getting hits from a cell phone.
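As a tiny sketch of that "machine-readable" point, here is how a consumer might pull contact details out of classic hCard markup with BeautifulSoup (the sample markup and dictionary keys are made up for illustration; this is not a full microformats parser):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

HCARD = """
<div class="vcard">
  <span class="fn">Jane Example</span>
  <a class="url" href="https://example.com">example.com</a>
  <span class="tel">+1-555-0100</span>
</div>
"""

soup = BeautifulSoup(HCARD, "html.parser")
card = soup.find(class_="vcard")
contact = {
    "name": card.find(class_="fn").get_text(strip=True),
    "url": card.find(class_="url")["href"],
    "tel": card.find(class_="tel").get_text(strip=True),
}
print(contact)  # ready to drop into an address book entry
```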
I wouldn't recommend using microformats for "every possible thing". Use them for things where you get some benefit, in exchange for the effort of using them.
The main practical benefit I'm aware of is customised search engine results:
https://support.google.com/webmasters/answer/99170
Technically, Google now prefers this to be implemented using microdata (i.e. itemprop attributes) rather than microformats, but it's the same idea.
Having a microformat can be better than no format, since it lets you save every possible thing in the application.
A microformat for every possible thing can be better than a standard format only because it's quicker to create, so it costs less, and it takes less space than some standard formats, like XML.
But all this depends on the context of the application, and so you must explain it to the client in that context.
Microformatting your content extends its reach in every way possible. Using your site's structure as its "API", the possibilities are whatever you set your limits to.

Can I gzip-compress all my HTML content (pages)?

I am trying to find out whether there are any principles for deciding which pages should be gzip-compressed and when to draw the line and send plain HTML content.
It would be helpful if you could share the decisions you made about gzip compression in your own projects.
A good idea is to benchmark how fast the data comes down versus how well it compresses. If it takes 5 seconds to send something that only went from 200K to 160K, it's probably not worth it. There is a cost to compression on the server side, and if the server gets busy it might not be worth it.
For the most part, if your server load is regularly below 0.8, I'd just gzip anything that isn't an already-compressed binary format like JPEGs, PNGs and ZIP files.
There's a good write up here:
http://developer.yahoo.com/performance/rules.html#gzip
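To run the kind of benchmark suggested above, a minimal sketch (the file names are placeholders, and compression level 6 is only an assumption about typical server defaults); it also shows why already-compressed formats aren't worth gzipping:

```python
import gzip
import time

def gzip_stats(path: str) -> tuple:
    with open(path, 'rb') as f:
        data = f.read()
    start = time.perf_counter()
    packed = gzip.compress(data, compresslevel=6)   # assumed "server-like" level
    elapsed = time.perf_counter() - start
    return len(data), len(packed), elapsed

for path in ['index.html', 'app.js', 'style.css', 'photo.jpg']:  # placeholder file names
    raw, packed, secs = gzip_stats(path)
    print(f'{path:12} {raw:9} -> {packed:9} bytes '
          f'({packed / raw:5.1%} of original, {secs * 1000:.1f} ms)')
```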
Unless your server's CPU is heavily utilized, I would always use compression. It's a trade-off between bandwidth and CPU utilization, and web servers usually have plenty of spare CPU cycles.
I don't think there's a good reason not to gzip HTML content.
It takes very little CPU power for big gains in loading speed.
There's one notable exception: There's a bug in Internet Explorer 6, that makes all compressed content turn up blank.
Considering there is a huge gain on the size of the HTML data to download when it's gzipped, I don't see why you shouldn't gzip it.
Maybe it uses a little bit of CPU... but not that much; and it's a real benefit for the client, who has less to download. And it's only a couple of lines in the web server configuration to activate it.
(But let your web server do it: there are modules like mod_deflate for the most widely used servers.)
As a semi-sidenote: you are talking about compressing HTML content pages... but don't stop at HTML pages: you can compress JS and CSS too (they are text files, and so generally compress really well), and it doesn't cost much CPU either.
Considering the big JS/CSS frameworks in use nowadays, the gain from compressing those is probably even greater than from compressing HTML pages.
We made the decision to gzip all content since spending time determining what to gzip or what not to gzip did not seem worth the effort. The overhead of gzipping everything is not significantly higher than gzipping nothing.
This webpage suggests:
"Servers choose what to gzip based on file type, but are typically too limited in what they decide to compress. Most web sites gzip their HTML documents. It's also worthwhile to gzip your scripts and stylesheets, but many web sites miss this opportunity. In fact, it's worthwhile to compress any text response including XML and JSON. Image and PDF files should not be gzipped because they are already compressed. Trying to gzip them not only wastes CPU but can potentially increase file sizes."
If you care about cpu time, I would suggest not gzipping already compressed content. Remember when adding complexity to a system that Programmers/sys admins are expensive, servers are cheap.

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that a part is probably compressed or encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings, and so on.
Try to spot various offsets. Try to convert parts of the binary into 2- or 4-byte integers or into floats, print them, and see if they make sense.
Write some functions that will search for repeating or very similar parts in the data; this way you can easily spot headers.
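The first couple of checks are easy to script; a minimal sketch (the 1 KiB window, the 7.5 bits/byte threshold and 'unknown.bin' are arbitrary choices): compute per-window byte entropy to flag likely compressed/encrypted regions, and scan for a known magic number.

```python
import math
from collections import Counter

def window_entropy(data: bytes, size: int = 1024):
    """Yield (offset, bits-per-byte) for fixed windows; ~8.0 suggests compressed/encrypted data."""
    for off in range(0, len(data), size):
        window = data[off:off + size]
        counts = Counter(window)
        total = len(window)
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        yield off, h

def find_magic(data: bytes, magic: bytes):
    """Yield every offset where a magic number occurs."""
    start = 0
    while (pos := data.find(magic, start)) != -1:
        yield pos
        start = pos + 1

with open('unknown.bin', 'rb') as f:   # placeholder file name
    blob = f.read()

for off, h in window_entropy(blob):
    if h > 7.5:                        # arbitrary threshold for "looks random"
        print(f'0x{off:08x}: entropy {h:.2f} bits/byte (compressed/encrypted?)')

print(list(find_magic(blob, b'\x89PNG')))  # e.g. embedded PNG headers
```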
Try to find as many strings as possible; try different encodings (C strings, Pascal strings, UTF-8/UTF-16, etc.). There are some good tools for that (I think Hex Workshop has such a tool). Strings can tell you a lot.
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni: to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which may be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (ACM Digital Library)
Abstract
Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zero-day vulnerability signature generation, which was not possible with previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT, etc.), and implemented search, copy and later even structure and template support. The structure support is pretty straightforward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's a personal tool which I just released to the public to share it, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
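If you can't track down A.X.E. anymore, the byte-per-pixel graphical view is easy to approximate; a small sketch using Pillow (the width of 256 is an arbitrary choice, and adjusting it makes fixed-size records show up as vertical bands):

```python
from PIL import Image  # pip install pillow

def bytes_to_image(path: str, width: int = 256) -> Image.Image:
    """Render each byte of a file as one grayscale pixel, one row per `width` bytes."""
    with open(path, 'rb') as f:
        data = f.read()
    height = (len(data) + width - 1) // width
    data = data.ljust(width * height, b'\x00')       # pad the last row
    return Image.frombytes('L', (width, height), data)

bytes_to_image('unknown.bin').save('unknown.png')    # placeholder file name
```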
There is Hachoir, which is a Python library for parsing any binary format into fields and then browsing the fields. It has lots of parsers for common formats, but you can also write your own parsers for your files (e.g. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). It looks like the project is pretty much inactive by now, though.
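If you want to try Hachoir despite its inactivity, a minimal field walk might look like the sketch below (assuming the hachoir3 package from PyPI; treat the exact attribute names as my recollection of its API rather than gospel):

```python
# pip install hachoir
from hachoir.parser import createParser

parser = createParser('sample.png')   # placeholder file; Hachoir picks a parser by magic/extension
if parser is None:
    raise SystemExit('no parser found for this file')

with parser:
    for field in parser:               # top-level fields; nested FieldSets can be walked recursively
        print(f'{field.address // 8:6d}  {field.name:20} {field.description}')
```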
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this, using Python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps), and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.