Can some one give me reason why (or not) we create one line html?
That I know just reduce file size (benefit) but in server we must add some function to generate one line html (server cost) or we will have trouble when we need to change some code (editing cost).
You shouldn't minify your HTML. It isn't worth the time and the work you will have to do when editing. Take a look at this very good answer.
You should optimize your CSS though. There's a free program: Scout, that checks for changes on your css files and creates a minified version of it automatically. It implements Sass, which will make coding CSS much easier. You should try it.
You don't manually create HTML like that. Personally, my CMS caches it's output as static files to be served to the user. Before it caches, it runs:
$tocache is the contents of the page to be displayed. I do this, then write it to disk. Apache then serves the static content instead avoiding the DB and PHP on subsequent access.
// Remove white space
$tocache = str_replace(array("\n", "\t","\r")," ",$tocache);
// Remove unnecessary closing tags (I know </p> could be here, but it caused problems for me)
$tocache = str_replace(array("</option>","</td>","</tr>","</th>","</dt>","</dd>","</li>","</body>","</html>"),"",$tocache);
// remove ' or " around attributes that don't have spaces
$tocache = preg_replace('/(href|src|id|class|name|type|rel|sizes|lang|title|itemtype|itemprop)=(\"|\')([^\"\'\`=<>\s]+)(\"|\')/i', '$1=$3', $tocache);
// Turn any repeated white space into one space
$tocache = preg_replace('!\s+!', ' ', $tocache);
Now, I run that once per page change, then serve up the smaller HTML to users.
This is pretty much pointless though, as the process of gzipping makes the biggest difference. I do it because I might as well – I already am caching these files, so why not make myself feel clever first!
For CSS and JS I use SASS's compressed option, and uglifyJS to get those as one small file.
That means on a page I have 1 HTML file, 1 CSS and 1 JS, minimising the number of HTTP reqs and the amount of data to be transmitted.
Gzip + ensuring 1 css and 1 js is the biggest savings though.
Related
In Objective C to build a Mac OSX (Cocoa) application, I'm using the native Webkit widget to display local files with the file:// URL, pulling from this folder:
MyApp.app/Contents/Resources/lang/en/html
This is all well and good until I start to need a German version. That means I have to copy en/html as de/html, then have someone replace the wording in the HTML (and some in the Javascript (like with modal dialogs)) with German phrasing. That's quite a lot of work!
Okay, that might seem doable until this creates a headache where I have to constantly maintain multiple versions of the html folder for each of the languages I need to support.
Then the thought came to me...
Why not just replace the phrasing with template tags like %CONTINUE%
and then, before the page is rendered, intercept it and swap it out
with strings pulled from a language plist file?
Through some API with this widget, is it possible to intercept HTML before it is rendered and replace text?
If it is possible, would it be noticeably slow such that it wouldn't be worth it?
Or, do you recommend I do a strategy where I build a generator that I keep on my workstation which builds each of the HTML folders for me from a main template, and then I deploy those already completed with my setup application once I determine the user's language from the setup application?
Through a lot of experimentation, I found an ugly way to do templating. Like I said, it's not desirable and has some side effects:
You'll see a flash on the first window load. On first load of the application window that has the WebKit widget, you'll want to hide the window until the second time the page content is displayed. I guess you'll have to use a property for that.
When you navigate, each page loads twice. It's almost not noticeable, but not good enough for good development.
I found an odd quirk with Bootstrap CSS where it made my table grid rows very large and didn't apply CSS properly for some strange reason. I might be able to tweak the CSS to fix that.
Unfortunately, I found no other event I could intercept on this except didFinishLoadForFrame. However, by then, the page has already downloaded and rendered at least once for a microsecond. It would be great to intercept some event before then, where I have the full HTML, and do the swap there before display. I didn't find such an event. However, if someone finds such an event -- that would probably make this a great templating solution.
- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame
{
DOMHTMLElement * htmlNode =
(DOMHTMLElement *) [[[frame DOMDocument] getElementsByTagName: #"html"] item: 0];
NSString *s = [htmlNode outerHTML];
if ([s containsString:#"<!-- processed -->"]) {
return;
}
NSURL *oBaseURL = [[[frame dataSource] request] URL];
s = [s stringByReplacingOccurrencesOfString:#"%EXAMPLE%" withString:#"ZZZ"];
s = [s stringByReplacingOccurrencesOfString:#"</head>" withString:#"<!-- processed -->\n</head>"];
[frame loadHTMLString:s baseURL:oBaseURL];
}
The above will look at HTML that contains %EXAMPLE% and replace it with ZZZ.
In the end, I realized that this is inefficient because of page flash, and, on long bits of text that need a lot of replacing, may have some quite noticeable delay. The better way is to create a compile time generator. This would be to make one HTML folder with %PARAMETERIZED_TAGS% inside instead of English text. Then, create a "Run Script" in your "Build Phase" that runs some program/script you create in whatever language you want that generates each HTML folder from all the available lang-XX.plist files you have in a directory, where XX is a language code like 'en', 'de', etc. It reads the HTML file, finds the parameterized tag match in the lang-XX.plist file, and replaces that text with the text for that language. That way, after compilation, you have several HTML folders for each language, already using your translated strings. This is efficient because then it allows you to have one single HTML folder where you handle your code, and don't have to do the extremely tedious process of creating each HTML folder in each language, nor have to maintain that mess. The compile time generator would do that for you. However -- you'll have to build that compile time generator.
I have a pdf file and I need to get get small pieces of data from it.
It is structured like this :
Page1:
Question 1
......................................
......................................
Question 2
......................................
......................................
Page End
I want to get Question 1 and Question 2 as separate html files, which contain text and image.
I've tried
pdftohtml -c pdffile.pdf output.html
And I got files with png images, but how to do I cut the Image into smaller chunks to fit the size of each Question (I want to separate each question into individual files)?
P.S. I have alot of pdf files, so a command-line tool would be nice.
I'll try to give you an approach on how I would go about it. You mention, that every page in your PDF document might have multiple questions and you basically want have one HTML file for every question.
It's great if pdftohtml works for you, but I also found another decent command line utility that you might want to try out.
Ok, so assuming you have an HTML file converted from the PDF you initially had, you might want to use csplit or awk to split your file into multiple files based on the delimiter 'Question' in your case. (Side note- csplit and awk are linux specific utilites, but I'm sure there are alternatives if you are on Windows or a MAC. I haven't specifically tried the following code)
From a relevant SO Post :
csplit input.txt'/^Question$/' '{*}'
awk '/Question/{filename=NR".txt"}; {print >filename}' input.txt
So, assuming this works, you will have a couple of broken html files. Broken because they'll be unsanitized due to dangling < or > or some other stray HTML elements after the splitting.
So you could start by saving the initial .html as .txt, removing the html, head and body elements specifically and going through the general structure of how the program converts the pdf into html. I'm sure you'll see a pattern around how the string 'Quetion' is wrapped in an element and is something you can take care of. That is why I mention .txt files in the code snippets.
You will basically have a bunch of text files with just the content html and not the usual starting tags for an html file because we removed that initially. Then it's only a matter of reading each file, just taking care of the element that surrounds the string 'Question' and adding the html, head and body elements around the content and saving them as .html files. You could do this in any programming language of your choice that supports file reading and writing (would be a fun exercise)
I hope this gets you started in the right direction.
I use a lot (20 or so) of what FrontPage 2003 (don't laugh!) calls "Site Parameters", which are essentially variables. I use them for the Number of Products, Phone Hours, etc.
I'm upgrading to Expression Web, which does not support changing or adding those Parameters. Also, I'd like to create Variables for our Breadcrumbs trail. (So a product page might have breadcrumbs: Products> This Product> Screenshot.
So if we ever decide to rename Products I want to easily be able to replace it.
What I do not want to do:
Replace the value when the page is served up. That slows things down a bit and forces all pages to be .aspx, etc. I want to stick with plain html.
Replace the value using Javascript (same reason, and a tiny % of brrowsers don't have .js enabled).
I was thinking:
we have <variable.products_count>20</variable.products_count>
But..... it's easy to get another tag and text in there, as happens in this example:
<variable.products_count> we have <strong>20</strong> </variable.products_count>
Now when I replace, I'm replacing the tags and "we have" as well.
What I do not want to do: Replace the value when the page is served up. That slows things down a bit and forces all pages to be .aspx, etc. I want to stick with plain html.
Technically you are correct; there will probably be a performance hit by using an executed versus static page. But the overhead of inserting a few dynamic values is so trivial that it shouldn't even factor into the decision making process.
At it's simplest:
HTML
<body>
<div>Good old HTML</div>
<div>A dynamic value: <%= SiteParameters.Foo %></div>
</body>
c# (or VB.Net, or whatever you prefer)
public static class SiteParameters
{
// This value could be pulled from a config file, a database, etc.
public static readonly string Foo = "Hello World";
}
Your example with the we have problem can be easily re-written to fix the problem:
<variable.products_count> we have <strong>20</strong> </variable.products_count>
to:
we have <strong><variable.products_count></variable.products_count></strong>
Of course, I think the easier approach is to step slightly outside of HTML syntax and pick variable names that are unlikely to occur in your documentation:
we have <strong>##PRODUCTS_COUNT##</strong> ...
When you have several of these you can replace them all with one pass of a simple text-replacement tool such as sed(1):
sed -e 's/##PRODUCTS_COUNT##/20/g'
-e 's/##PRODUCTS##/Products/g'
-e 's/##SCREENSHOTS##/Screen Shots/g' < inputfile > output.html
To automate over many files, you'd probably want to write a little script:
s/##PRODUCTS_COUNT##/20/g;
s/##PRODUCTS##/Products/g;
s/##SCREENSHOTS##/Screen Shots/g;
and run it on all your files:
for f in pre/*.html ; do sed -f /path/to/script < "$f" > "${f/pre/post}" ; done
(The ${f/pre/post} replaces the pre with post; your post-processed files would be in the post/ directory.)
Besides the fact that it becomes unreadable for humans, are there any downsides when I remove every linebreak and space from the html source code?
Does the browsers render different then? Will the rendering get faster (or maybe slower)?
There are many already answered questions about minifying HTML. Here are some:
Why minify assets and not the markup?
HTML Minification
How to minify HTML code
You will have a smaller file size, so it may download faster (though it'll be probably unnoticeable). There are tools for this, indeed.
If you remove line breaks there is no harm. But according to your questions
...when I remove every linebreak and space from the html source code?
If you do remove every linebreak and line space, your purpose may not be served. You should only remove extra line-breaks and spaces. Also be careful not to alter values attributes form data, or any other attribute for that matter.
Regarding improvements it can offer:
It may render faster as it needs to parse lesser data. But this speedup is highly small. I even discourage it as it reduces readability and the speedup is in order of a few hundred clock cycles for CPU. The same goes for download. It reduces mere bites of data (unless the document has too much white spaces)
Insted of this its better to use GZIP compression for the output at the server side. The following is an line from php which enables it. If you have php in your server, then just rename your *.html file to *.php , Then add the following code before any output:
if (substr_count($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip')) ob_start("ob_gzhandler");
you can also do this using the .htaccess file. Google regarding this more.
A bit late but still... By using output_buffering it is as simple as that:
function compress($string)
{
// Remove html comments
$string = preg_replace('/<!--.*-->/', '', $string);
// Merge multiple spaces into one space
$string = preg_replace('/\s+/', ' ', $string);
// Remove space between tags. Skip the following if
// you want as it will also remove the space
// between <span>Hello</span> <span>World</span>.
return preg_replace('/>\s+</', '><', $string);
}
ob_start('compress');
// Here goes your html.
ob_end_flush();
My idea is to somehow minify HTML code in server-side, so client receive less bytes.
What do I mean with "minify"?
Not zipping. More like, for example, jQuery creators do with .min.js versions. In other words, I need to remove unnecessary white-spaces and new-lines, but I can't remove so much that presentation of HTML changes (for example remove white-space between actual words in paragraph).
Is there any tools that can do it? I know there is HtmlPurifier. Is it able to do it? Any other options?
P.S. Please don't offer regex'ies. I know that only Chuck Norris can parse HTML with them. =]
A bit late but still... By using output_buffering it is as simple as that:
function compress($string)
{
// Remove html comments
$string = preg_replace('/<!--.*-->/', '', $string);
// Merge multiple spaces into one space
$string = preg_replace('/\s+/', ' ', $string);
// Remove space between tags. Skip the following if
// you want as it will also remove the space
// between <span>Hello</span> <span>World</span>.
return preg_replace('/>\s+</', '><', $string);
}
ob_start('compress');
// Here goes your html.
ob_end_flush();
You could parse the HTML code into a DOM tree (which should keep content whitespace in the nodes), then serialise it back into HTML, without any prettifying spaces.
Is there any tools that can do it?
Yes, here's a tool you could include into a build process or work into a web cache layer:
https://code.google.com/archive/p/htmlcompressor/
Or, if you're looking for a tool to minify HTML that you paste in, try:
http://www.willpeavy.com/minifier/
You can use the Pretty Diff tool: http://prettydiff.com/?m=minify&html It will also minify any CSS and JavaScript in the HTML code, and the minification occurs in a regressive manner so to not prevent future beautification of the HTML back to readable form.
Is there any tools that can do it?
You can use CodVerter Online Web Development Editor for compressing mixed html code.
the compressor was tested multiple times for reliability and accuracy.
(Full Disclosure: I am one of the developers).