Besides the fact that it becomes unreadable for humans, are there any downsides when I remove every linebreak and space from the html source code?
Do browsers render it differently then? Will the rendering get faster (or maybe slower)?
There are many already answered questions about minifying HTML. Here are some:
Why minify assets and not the markup?
HTML Minification
How to minify HTML code
You will have a smaller file size, so it may download faster (though the difference will probably be unnoticeable). There are indeed tools for this.
Removing line breaks does no harm. But according to your question:
...when I remove every linebreak and space from the html source code?
If you remove every line break and every space, your purpose may not be served, because some whitespace (for example, the spaces between words) is significant. You should remove only the extra line breaks and spaces. Also be careful not to alter the value attributes of form fields, or any other attribute for that matter.
Regarding improvements it can offer:
It may render faster because there is less data to parse, but the speedup is tiny. I would even discourage it, since it reduces readability and the speedup is on the order of a few hundred CPU clock cycles. The same goes for the download: it saves mere bytes of data (unless the document has a lot of whitespace).
Instead, it is better to use GZIP compression for the output on the server side. The following line of PHP enables it. If you have PHP on your server, just rename your *.html file to *.php, then add the following code before any output:
if (substr_count($_SERVER['HTTP_ACCEPT_ENCODING'] ?? '', 'gzip')) ob_start("ob_gzhandler");
You can also do this using the .htaccess file. Search for more on this.
A bit late but still... By using output_buffering it is as simple as that:
function compress($string)
{
    // Remove HTML comments (non-greedy; the "s" flag lets "." span line breaks)
    $string = preg_replace('/<!--.*?-->/s', '', $string);
    // Merge runs of whitespace into one space
    // (note: this also affects <pre> and <textarea> content)
    $string = preg_replace('/\s+/', ' ', $string);
    // Remove space between tags. Skip the following if
    // you want as it will also remove the space
    // between <span>Hello</span> <span>World</span>.
    return preg_replace('/>\s+</', '><', $string);
}
ob_start('compress');
// Here goes your html.
ob_end_flush();
My HTML file contains the code &nbsp;&nbsp;&nbsp; in many places.
It is too short and it doesn't really make sense to replace it with a code like
<span class="three-spaces"></span>
I would like to replace it with something like
##TS##
or
%%TS%%
and the file should start with something like:
SET TS = "&nbsp;&nbsp;&nbsp;"
Is there any way to write the HTML this way? I am not looking for compiling a source file into a HTML. I am looking for a solution that allows directly writing macros into HTML files.
Later edit: I'm coming with another example:
I also need to transform
lnk(http://www.example.com)
into
<a target="_blank" href="http://www.example.com">http://www.example.com</a>
Instead of telling him WHY he should not do something, how about telling him HOW he could do it? Maybe his example is not an appropriate need for it, but there's other situations where being able to create a macro would be nice.
For example... I have an HTML page that I'm working on that deals with unit conversions, and quite often I have to type things like "cm/in" as "<sup>cm</sup>/<sub>in</sub>", or for volumes "cu-cm/cu-in" as "<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>". It would be really nice from a typing and readability standpoint if I could create macros that were just typed as "%%cm-per-in%%", "%%cc-per-cu-in%%", or something like that.
So, the line in the 'sed' file might look like this:
s/%%cc-per-cu-in%%/<sup>cm<sup>3<\/sup><\/sup>\/<sub>in<sup>3<\/sup><\/sub>/g
Since the "/" is a field separator for the substitute command, you need to explicitly quote it with the backslash character ("\") within the replacement portion of the substitute command.
The way that I have handled things like this in the past was to either write my own preprocessor to make the changes or if the "sed" utility was available, I would use it. So for this sort of thing, I would basically have a "pre-HTML" file that I edited and after running it through "sed" or the preprocessor, it would generate an HTML file that I could copy to the web server.
Now, you could create a javascript function that would do the text substitution for you, but in my opinion, it is not as nice looking as an actual preprocessor macro substitution. For example, to do what I was doing in the sed script, I would need to create a function that would take as a parameter the short form "nickname" for the longer HTML that would be generated. For example:
function S( x )
{
    if (x == "cc-per-cu-in") {
        document.write("<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>");
    } else if (x == "cm-per-in") {
        document.write("<sup>cm</sup>/<sub>in</sub>");
    } else {
        document.write("<B>***MACRO-ERROR***</B>");
    }
}
And then use it like this:
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
This is a test of an error <SCRIPT>S("cc-per-in");</SCRIPT> for a
missing macro substitution.
This generates the following:
This is a test of cc-per-cu-in cm3/in3
and cm-per-in cm/in as an alternative to sed. This is a test of an error MACRO-ERROR for a missing macro substitution.
Yeah, it works, but it is not as readable as if you used a 'sed' substitution.
So, decide for yourself... Which is more readable...
This...
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
Or this...
This is a test of cc-per-cu-in %%cc-per-cu-in%% and
cm-per-in %%cm-per-in%% as an alternative to sed.
Personally, I think the second example is more readable and worth the extra trouble to have pre-HTML files that get run through sed to generate the actual HTML files... But, as the saying goes, "Your mileage may vary"...
EDITED: One more thing that I forgot about in the initial post that I find useful when using a pre-processor for the HTML files -- Timestamping the file... Often I'll have a small timestamp placed on a page that says the last time it was modified. Instead of manually editing the timestamp each time, I can have a macro (such as "%%DATE%%", "%%TIME%%", "%%DATETIME%%") that gets converted to my preferred date/time format and put in the file.
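The preprocessing step described above (a pre-HTML file containing %%name%% macros, including a date macro) can also be sketched in JavaScript rather than sed. This is only an illustration under assumptions: the function name expandMacros, the MACROS table, and the ISO date format are hypothetical and not part of the original setup.

```javascript
// Hypothetical macro table; the names mirror the examples in this answer.
const MACROS = {
  "cm-per-in": "<sup>cm</sup>/<sub>in</sub>",
  "cc-per-cu-in": "<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>",
};

// Replace every %%name%% in `source` with its expansion.
// `now` is a parameter so that the %%DATE%% macro is reproducible.
function expandMacros(source, macros = MACROS, now = new Date()) {
  // Assumed date format: YYYY-MM-DD (UTC); pick whatever format you prefer.
  const table = { ...macros, DATE: now.toISOString().slice(0, 10) };
  return source.replace(/%%([A-Za-z0-9_-]+)%%/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(table, name)
      ? table[name]
      : "<b>***MACRO-ERROR***</b>" // unknown macro, flagged like S() does
  );
}

const preHtml = "Ratio: %%cm-per-in%%, updated %%DATE%%, bad: %%nope%%";
console.log(expandMacros(preHtml, MACROS, new Date("2020-01-02T00:00:00Z")));
```

Unlike the sed script, the macro names here may contain '-' without any escaping, and an unknown name is flagged instead of being left in the output unnoticed.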
Since my background is in 'C' and UNIX, if I can't find a way to do something in HTML, I'll often just use one of the command line tools under UNIX or write a small 'C' program to do it. My HTML editing is always in 'vi' (or 'vim' on the PC) and I find that I am often creating tables for alignment of various portions of the HTML page. I got tired of typing all the TABLE, TR, and TD tags, so I created a simple 'C' program called 'table' that I can execute via the '!}' command in 'vi', similar to how you execute the 'fmt' command in 'vi'. It takes as parameters the number of rows & columns to create, whether the column cells are to be split across two lines, how many spaces to indent the tags, and the column widths and generates an appropriately indented TABLE tag structure. Just a simple utility, but saves on the typing.
Instead of typing this:
<TABLE>
<TR>
<TD width=200>
</TD>
<TD width=300>
</TD>
</TR>
<TR>
<TD>
</TD>
<TD>
</TD>
</TR>
<TR>
<TD>
</TD>
<TD>
</TD>
</TR>
</TABLE>
I can type this:
!}table -r 3 -c 2 -split -w 200 300
Now, with respect to the portion of the original question about being able to create a macro to do HTML links, that is also possible using 'sed' as a pre-processor for the HTML files. Let's say that you wanted to change:
%%lnk(www.stackoverflow.com)
to:
<a href="www.stackoverflow.com">www.stackoverflow.com</a>
you could create this line in the sed script file:
s/%%lnk(\(.*\))/<a href="\1">\1<\/a>/g
'sed' uses regular expressions and they are not what you might call 'pretty', but they are powerful if you know what you are doing.
One slight problem with this example is that it requires the macro to be on a single line (i.e. you cannot split the macro across lines) and if you call the macro multiple times in a single line, you get a result that you might not be expecting. Instead of doing the macro substitution multiple times, it assumes the argument to the macro starts with the first '(' of the first macro invocation and ends with the last ')' of the last macro invocation. I'm not a sed regular expression expert, so I haven't figured out how to fix this yet. For the multiple line portion though, a possible fix would be to replace all the LF characters in the file with some other special character that would not normally be used, run sed on that result, and then convert the special characters back to LF characters. Of course, the problem there is that the entire file would be a single line and if you are invoking the macro, it is going to have the results that I described above. I suspect awk would not have that problem, but I have never had a need to learn awk.
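The multiple-invocation problem described above comes from the greedy ".*", which runs to the last ")" on the line. One common way around it is to match "anything except a closing parenthesis" instead. The sketch below shows the difference in JavaScript (the names greedy, narrow and expandLinks are made up for this illustration); the same character class should also work in the sed script, as s/%%lnk(\([^)]*\))/<a href="\1">\1<\/a>/g.

```javascript
// Greedy version, equivalent to the sed pattern above: ".*" runs to the LAST ")".
const greedy = /%%lnk\((.*)\)/g;
// Narrower version: "[^)]*" stops at the FIRST ")" after the opening "(".
const narrow = /%%lnk\(([^)]*)\)/g;

// Expand every %%lnk(...) macro in `text` into an HTML link.
function expandLinks(text, pattern = narrow) {
  return text.replace(pattern, '<a href="$1">$1</a>');
}

const line = "%%lnk(a.example) and %%lnk(b.example)";
console.log(expandLinks(line, greedy)); // one link swallowing both macros
console.log(expandLinks(line));         // two separate links
```

The trade-off is that a URL containing a ")" would be cut short. Because "[^)]" also matches line breaks, running this over a whole file as one string handles a macro split across lines as well, which line-oriented sed does not.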
Upon further reflection, I think there might be an easier solution to both the multi-line problem and the multiple invocations of a macro on a single line: the 'm4' macro preprocessor, a standard UNIX tool that is usually available wherever a 'C' compiler (e.g. gcc) is installed. I haven't tested it much to see what the downside might be, but it seems to work well enough for the tests that I have performed. You would define a macro as such in your pre-HTML file:
define(`LNK', `<a href="$1">$1</a>')
And yeah, it does use the backwards single quote character to start the text string and the normal single quote character to end the text string.
The only problem that I've found so far is that for the macro names, it only allows the characters 'A'-'Z', 'a'-'z', '0'-'9', and '_' (underscore). Since I prefer to type '-' instead of '_', that is a definite disadvantage to me.
Technically inline JavaScript with a <script> tag could do what you are asking. You could even look into the many templating solutions available via JavaScript libraries.
That would not actually provide any benefit, though. JavaScript changes what is ultimately displayed, not the file itself. Since your use case does not change the display it wouldn't actually be useful.
It would be more efficient to consider why &nbsp; is appearing in the first place and fix that.
This …
My html file contains in many places the code &nbsp;&nbsp;&nbsp;
… is actually what is wrong in your file!
&nbsp; is not meant to be used for layout purposes; you should fix that and use CSS instead to lay it out correctly.
&nbsp; is meant to prevent a line break between words that are separated by a space. For example, a number and its unit: 5 liters can end up with 5 at the end of one line and liters on the next line.
To keep them together you would use 5&nbsp;liters. That is what &nbsp; is for and nothing else, especially not for layout purposes.
To still answer your question:
HTML is a markup language not a programming language. That means it is descriptive/static and not functional/dynamic. If you try to generate HTML dynamically you would need to use something like PHP or JavaScript.
Just an observation from a novice. If everyone did as purists suggest (i.e.-the right way), then the web would still be using the same coding conventions it was using 30 years ago. People do things, innovate, and create new ways, then new standards, and deprecate others all the time. Just because someone says "spaces are only for separating words...and nothing else" is silly. For many, many years, when people typed letters, they used one space between words, and two spaces between end punctuation and the next sentence. That changed...yeah, things change. There is absolutely nothing wrong with using spaces and non-breaking spaces in ways which assist layout. It is neither useful nor elegant for someone to use a long span with style over and over and over, rather than simple spaces. You can think it is, and your club of do it right folks might even agree. But...although "right", they are also being rather silly about it. Question: Will a page with 3 non-breaking spaces validate? Interesting.
I have this strange issue, where I get random linebreaks in my HTML when I copy & paste links from mails I get.
The problem is, linebreaks look exactly like any other whitespace and on long lines I have problems seeing if there is any linebreaks.
Normally this wouldn't be a problem, but we are also using an emailing system that doesn't like line breaks in the middle of an element.
Is there a way to see these without manually scanning all the lines, which is impossible given the number of mails we are sending?
Regex maybe?
I'm using Notepad++ as an editor.
In Notepad++, you can use "Extended" mode in the FIND Option. Use "\r\n" to scan all the new lines in the file. Use "\r" to find all carriage returns in the file.
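If searching by hand is still too slow for the volume of mails, the check can be scripted. The following is a minimal sketch, assuming the HTML is available as a string; the function name lineBreaksInsideTags is hypothetical, and the scan is deliberately simple (it does not handle a ">" inside a quoted attribute value).

```javascript
// Return the 1-based line numbers on which a line break falls inside a tag,
// i.e. between "<" and the next ">".
function lineBreaksInsideTags(html) {
  const hits = [];
  let line = 1;
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === "<") inTag = true;
    else if (ch === ">") inTag = false;
    // Count "\n", or a lone "\r"; the "\r" of a "\r\n" pair is skipped
    // so that the pair is counted once.
    else if (ch === "\n" || (ch === "\r" && html[i + 1] !== "\n")) {
      if (inTag) hits.push(line);
      line++;
    }
  }
  return hits;
}

const sample = '<p>ok</p>\r\n<a href="http://example.com/\r\nabc">x</a>\r\n';
console.log(lineBreaksInsideTags(sample));
```

The returned numbers are the lines that end in the middle of a tag, so they can be looked up directly in the editor.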
I noticed on my website, http://www.cscc.org.sg/, there's this odd symbol that shows up.
It says L SEP. In the HTML code, it displays the same thing.
Can someone show me how to remove them?
That character is U+2028 (HTML entity code &#8232;), which is a kind of newline character. It's not actually supposed to be displayed. I'm guessing that either your server-side scripts failed to translate it into a new line or you are using a font that displays it.
But, since we know the HTML and Unicode values for the character, we can add a few lines of jQuery that should get rid of the character. Right now, I'm just replacing it with an empty space in the code below. Just add this:
$(document).ready(function() {
    $("body").children().each(function() {
        // Match the U+2028 character via its escape sequence
        $(this).html($(this).html().replace(/\u2028/g, " "));
    });
});
This should work, though please note that I have not tested this and it may not work, as none of my browsers will display the character.
But if it doesn't, you can always try pasting your text block onto http://www.nousphere.net/cleanspecial.php which will remove any special characters.
Some fonts render LS as L SEP. Such a glyph is designed for unformatted presentations of the character, such as when viewing the raw characters of a file in a binary editor. In a formatted presentation, actual line spacing should be displayed instead of the glyph.
The problem is that neither the web server nor web browser are interpreting the LS as a newline. The web server could detect the LS and replace it with <br>. Such a feature would fit well with a web server that dynamically generates HTML anyway, but would add overhead and complexity to a web server that serves file contents without modification.
If a LS makes its way to the web browser, the web browser doesn't interpret it as formatting. Page formatting is based only on HTML tags. For example, LF and CR just affect formatting of the HTML source code, not the web page's formatting (except in <pre> sections). The browser could in principle interpret LS and PS (paragraph separator) as <br> and <p>, but the HTML standard doesn't tell browsers to do that. (It seems to me like it would be a good addition.)
To replace the raw LS character with the line separation that the content creator likely intended, you'll need to replace the LS characters with HTML markup such as <br>.
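That replacement can be sketched as a small JavaScript function. It is an illustration only: the name separatorsToMarkup is hypothetical, and mapping PS to a bare <p> is an assumption based on the suggestion above, not something the HTML standard prescribes.

```javascript
// Replace U+2028 (line separator) with <br> and U+2029 (paragraph
// separator) with <p>. Meant to be applied to text content before it
// is written into the page, either on the server or in the browser.
function separatorsToMarkup(text) {
  return text.replace(/\u2028/g, "<br>").replace(/\u2029/g, "<p>");
}

console.log(separatorsToMarkup("first line\u2028second line"));
```

Doing this once where the content is stored or generated avoids rewriting the page's HTML in the browser on every load.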
This is the solution for the 'strange symbol' issue.
$(document).ready(function () {
    // One pass over the whole body is enough; no need to loop over its children
    document.body.innerHTML = document.body.innerHTML.replace(/\u2028/g, ' ');
});
The jQuery/JS solutions here work to remove the character, but they broke my Revolution Slider. I ended up doing a search and replace for the character on the wp_posts table with the Better Search Replace plugin: https://wordpress.org/plugins/better-search-replace/
When you copy paste the character from a page to the plugin box, it is invisible, but it does work. Before doing DB replaces, always have a database (or full) backup ready! And be sure to uncheck the bottom checkbox to not do a dry run with the plugin.
Can someone give me a reason why we should (or should not) produce HTML as a single line?
As far as I know it just reduces file size (a benefit), but on the server we must add some function to generate the one-line HTML (server cost), or we will have trouble when we need to change some code (editing cost).
You shouldn't minify your HTML. It isn't worth the time and the work you will have to do when editing. Take a look at this very good answer.
You should optimize your CSS though. There's a free program: Scout, that checks for changes on your css files and creates a minified version of it automatically. It implements Sass, which will make coding CSS much easier. You should try it.
You don't manually create HTML like that. Personally, my CMS caches its output as static files to be served to the user. $tocache is the contents of the page to be displayed. Before caching, it runs the code below, then writes the result to disk. Apache then serves the static content instead, avoiding the DB and PHP on subsequent accesses.
// Remove white space
$tocache = str_replace(array("\n", "\t","\r")," ",$tocache);
// Remove unnecessary closing tags (I know </p> could be here, but it caused problems for me)
$tocache = str_replace(array("</option>","</td>","</tr>","</th>","</dt>","</dd>","</li>","</body>","</html>"),"",$tocache);
// remove ' or " around attributes that don't have spaces
$tocache = preg_replace('/(href|src|id|class|name|type|rel|sizes|lang|title|itemtype|itemprop)=(\"|\')([^\"\'\`=<>\s]+)(\"|\')/i', '$1=$3', $tocache);
// Turn any repeated white space into one space
$tocache = preg_replace('!\s+!', ' ', $tocache);
Now, I run that once per page change, then serve up the smaller HTML to users.
This is pretty much pointless though, as the process of gzipping makes the biggest difference. I do it because I might as well – I already am caching these files, so why not make myself feel clever first!
For CSS and JS I use SASS's compressed option, and uglifyJS to get those as one small file.
That means on a page I have 1 HTML file, 1 CSS and 1 JS, minimising the number of HTTP reqs and the amount of data to be transmitted.
Gzip + ensuring 1 css and 1 js is the biggest savings though.
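To check the claim about gzip on your own pages, you can compare sizes directly. The sketch below uses Node's built-in zlib module on a made-up, highly repetitive sample page; the sample and the crude minification regexes are assumptions for illustration, and real pages will give different ratios.

```javascript
const zlib = require("zlib");

// Hypothetical sample page: 200 indented list items.
const rows = Array.from(
  { length: 200 },
  (_, i) => `    <li class="item">\n        Item ${i}\n    </li>`
);
const html = `<ul>\n${rows.join("\n")}\n</ul>\n`;

// Crude minification: collapse whitespace runs, drop whitespace between tags.
const minified = html.replace(/\s+/g, " ").replace(/>\s+</g, "><");

// Byte counts for each variant, with and without gzip.
const sizes = {
  raw: Buffer.byteLength(html),
  minified: Buffer.byteLength(minified),
  rawGzip: zlib.gzipSync(html).length,
  minifiedGzip: zlib.gzipSync(minified).length,
};
console.log(sizes);
```

On markup like this, gzipping the untouched file already saves far more than whitespace removal alone, which is why minifying the HTML on top of gzip tends to be a small extra.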
My idea is to somehow minify HTML code on the server side, so the client receives fewer bytes.
What do I mean by "minify"?
Not zipping. More like, for example, jQuery creators do with .min.js versions. In other words, I need to remove unnecessary white-spaces and new-lines, but I can't remove so much that presentation of HTML changes (for example remove white-space between actual words in paragraph).
Is there any tools that can do it? I know there is HtmlPurifier. Is it able to do it? Any other options?
P.S. Please don't offer regex'ies. I know that only Chuck Norris can parse HTML with them. =]
You could parse the HTML code into a DOM tree (which should keep content whitespace in the nodes), then serialise it back into HTML, without any prettifying spaces.
Is there any tools that can do it?
Yes, here's a tool you could include into a build process or work into a web cache layer:
https://code.google.com/archive/p/htmlcompressor/
Or, if you're looking for a tool to minify HTML that you paste in, try:
http://www.willpeavy.com/minifier/
You can use the Pretty Diff tool: http://prettydiff.com/?m=minify&html It will also minify any CSS and JavaScript in the HTML code, and the minification occurs in a regressive manner so as not to prevent future beautification of the HTML back to readable form.
Is there any tools that can do it?
You can use CodVerter Online Web Development Editor for compressing mixed html code.
The compressor was tested multiple times for reliability and accuracy.
(Full Disclosure: I am one of the developers).