Performance of wkhtmltopdf - html

We are intending to use wkhtmltopdf to convert HTML to PDF, but we are concerned about the scalability of wkhtmltopdf. Does anyone have any idea how it scales? Our web app could potentially attempt to convert hundreds of thousands of (relatively complex) HTML documents, so it's important for us to have some idea. Has anyone got any information on this?

First of all, your question is quite general; there are many variables to consider when asking about the scalability of any project. Obviously there is a difference between converting "hundreds of thousands" of HTML files over a week and expecting to do that in a day, or an hour. On top of that, "relatively complex" HTML can mean different things to different people.
That being said, since I have done something similar to this, converting approximately 450,000 HTML files using wkhtmltopdf, I figured I'd share my experience.
Here was my scenario:
450,000 HTML files
95% of the files were one page in length
generally containing 2 images (relative path, local system)
tabular data (sometimes contained nested tables)
simple markup elsewhere (strong, italic, underline, etc)
A spare desktop PC
8GB RAM
2.4GHz Dual Core Processor
7200RPM HD
I used a simple single-threaded PHP script to iterate over the folders and pass each HTML file path to wkhtmltopdf. The process took about 2.5 days to convert all the files, with very few errors.
I hope this gives you insight into what you can expect from using wkhtmltopdf in your web application. Some obvious improvements would come from running this on better hardware, but mainly from using a multi-threaded application to process files simultaneously; a rough sketch of that approach follows.
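A rough sketch of that multi-threaded idea, written here in Python rather than the original PHP (the folder layout, worker count, and output naming are assumptions, not part of the original script):

import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def convert(html_path):
    # Each conversion shells out to the wkhtmltopdf binary: wkhtmltopdf <in> <out>
    pdf_path = os.path.splitext(html_path)[0] + '.pdf'
    result = subprocess.run(['wkhtmltopdf', html_path, pdf_path], capture_output=True)
    return html_path, result.returncode

if __name__ == '__main__':
    files = glob.glob('input/**/*.html', recursive=True)   # assumed folder layout
    # wkhtmltopdf does the heavy lifting in its own process, so a thread pool
    # is enough to keep several conversions running at once.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for path, returncode in pool.map(convert, files):
            if returncode != 0:
                print('failed:', path)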

In my experience performance depends a lot on your images. If there are lots of large images, it can slow down significantly. If at all possible, I would try to stage a test with an estimate of what the load would be on your servers. Some people do use it for intensive operations, but I have never heard of hundreds of thousands. I guess, like everything, it depends on your content and resources.
The following quote is straight off the wkhtmltopdf mailing list:
I'm using wkHtmlToPDF to convert about 6000 e-mails a day to PDF. It's all done on a quad-core server with 4GB memory... it's even more than enough for that.
There are a few performance tips, but I would suggest finding out what your bottlenecks are before optimizing for performance. For instance, I remember someone saying that, if possible, loading images directly from disk instead of having a web server in between can speed things up considerably.
Edit:
Adding to this, I just had some fun playing with wkhtmltopdf. Currently, on an Intel Centrino 2 with 4GB of memory, generating a PDF with 57 pages of content (mixed p, ul, table), ~100 images, and a TOC consistently takes < 7 seconds. I'm also running Visual Studio, a browser, an HTTP server, and various other software that might slow it down. I use stdin and stdout directly instead of files.
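For reference, a minimal sketch of the stdin/stdout approach, shown with Python's subprocess module (the sample HTML and output filename are made up); wkhtmltopdf treats "-" as "read HTML from stdin" and "write the PDF to stdout":

import subprocess

def html_to_pdf_bytes(html):
    # "wkhtmltopdf - -" reads the HTML from stdin and writes the PDF to stdout,
    # so no temporary files are needed.
    proc = subprocess.run(['wkhtmltopdf', '-', '-'],
                          input=html.encode('utf-8'),
                          stdout=subprocess.PIPE,
                          check=True)
    return proc.stdout

with open('out.pdf', 'wb') as f:
    f.write(html_to_pdf_bytes('<h1>Hello</h1><p>stdin/stdout test</p>'))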
Edit:
I have not tried this, but if you have linked CSS, try embedding it in the HTML file (remember to do a before-and-after test to see the effects properly!). The improvement here most likely depends on things like caching and where the CSS is served from: if it's read from disk every time, or god forbid regenerated from SCSS, it could be pretty slow, but if the result is cached by the web server (I don't think wkhtmltopdf caches anything between instances) it might not have a big effect. YMMV.
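If you want to try that experiment, a naive sketch of the inlining step might look like this (it assumes local, double-quoted href attributes; a real test should use a proper HTML parser):

import re
from pathlib import Path

def inline_local_css(html):
    # Replace <link ... href="something.css"> with an inline <style> block
    # read from the local file, so wkhtmltopdf never fetches the CSS itself.
    def repl(match):
        return '<style>' + Path(match.group(1)).read_text() + '</style>'
    return re.sub(r'<link[^>]+href="([^"]+\.css)"[^>]*>', repl, html)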

We have tried wkhtmltopdf in several implementations. My documents are huge tables of generated coordinate points; a typical PDF is around 500 pages.
We tried the .NET ports of wkhtmltopdf. The results:
- Pechkin - Pro: no other app needed. Con: slow; 500 pages took about 5 minutes to generate.
- PdfCodaxy - only cons: slow (slower than pure wkhtmltopdf), requires wkhtmltopdf to be installed, and has problems with non-Unicode text.
- NReco - only cons: slow (slower than pure wkhtmltopdf), requires wkhtmltopdf to be installed, and (for me) did not correctly release its libs after use.
We also tried the wkhtmltopdf binary invoked from C# code.
Pro: easy to use, faster than the libraries.
Con: needs temporary files (cannot use Stream objects) and, like the other libraries, breaks on very large (100MB+) HTML files.

wkhtmltopdf --print-media-type is blazing fast, but you lose the normal (screen) CSS styling with it.
This may NOT be an ideal solution for exporting complex HTML pages, but it worked for me because my HTML content is pretty simple and in tabular form.
Tested on version wkhtmltopdf 0.12.2.1
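In other words, --print-media-type makes wkhtmltopdf apply your @media print rules instead of the screen styles, which is why the screen-only styling disappears. A hedged example invocation (the filenames are placeholders):

import subprocess

# Render using the print stylesheet; screen-only CSS rules will not apply.
subprocess.run(['wkhtmltopdf', '--print-media-type', 'report.html', 'report.pdf'],
               check=True)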

You can create your own pool of wkhtmltopdf engines. I did this for a simple use case by invoking the API directly instead of starting the wkhtmltopdf.exe process every time. The wkhtmltopdf API is not thread-safe, so it's not easy to do. Also, you should not forget about sharing native code between AppDomains.

Related

Best way to minify html in an asp.net mvc 5 application

I came across several articles about this topic, but most of them are outdated. So what is the best way to minify / get rid of the whitespace when outputting my views' HTML?
I built a very trivial minifier called RazorHtmlMinifier.Mvc5.
It's operating in compile-time when the cshtml Razor files are converted to C# classes, so it won't have any performance overhead at runtime.
The minification is very trivial, basically just replacing multiple spaces with one (because sometimes a space is still significant, e.g. <span>Hello</span> <span>World</span> is different to <span>Hello</span><span>World</span>).
The source code is very recent and very simple (just one file with less than 100 lines of code) and installation involves just a NuGet package and changing one line in Web.config file.
And all of this is built for the latest version of ASP.NET MVC 5.
Usually it's recommended to use gzip encoding to compress HTTP responses, but I found that if you minify the HTML before gzipping, you still get around 11% smaller responses on average. In my opinion, it's still worth it.
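If you want to sanity-check that kind of number for your own pages, here is a quick comparison sketch (Python purely for illustration; the whitespace-collapsing regex stands in for whatever minifier you actually use):

import gzip
import re

def response_sizes(html):
    # Collapse runs of whitespace to a single space (keeping one space, since
    # removing it entirely can change rendering), then compare gzip sizes.
    minified = re.sub(r'\s+', ' ', html)
    return {
        'uncompressed': len(html.encode('utf-8')),
        'gzip_only': len(gzip.compress(html.encode('utf-8'))),
        'minify_then_gzip': len(gzip.compress(minified.encode('utf-8'))),
    }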
Use WebMarkupMin: ASP.NET 4.X MVC. Install the NuGet package and then use the MinifyHtmlAttribute on your action method, controller, or register it in RegisterGlobalFilters in FilterConfig. You can also try the CompressContentAttribute. Here's the wiki: https://github.com/Taritsyn/WebMarkupMin/wiki/WebMarkupMin:-ASP.NET-4.X-MVC
If you use the CompressContentAttribute you'll see the Content-Encoding:deflate header rather than Content-Encoding:gzip header if you were using gzip before applying this attribute.
Some test numbers:
No minification or compression: Content-Length: 21594
Minification only: Content-Length: 19869
Minification and compression: Content-Length: 15539
You'll have to test to see if you're getting speed improvements overall from your changes.
EDIT:
After exhaustive testing locally and on the live site, I've concluded that minifying and compressing HTML with WebMarkupMin in my case slowed the page load time by about 10%. Just compressing (using CompressContentAttribute) or just minifying also slowed it down. So I've decided not to compress (using CompressContentAttribute) or minify my HTML at all.

HTML5: accessing large structured local data

Summary:
Are there good HTML5/JavaScript options for selectively reading chunks of data (say, to be eventually converted to JSON) from a large local file?
Problem I am trying to solve:
An existing program runs locally and outputs a ton of data. I want to provide a browser-based interactive viewer that will allow folks to browse through these results. I have control over how the data is written out. I can write it all out in one big file, but since it's quite large, I can't just read the whole thing into memory. Hence, I am looking for some kind of indexed or db-like access to it from my web app.
Thoughts on solutions:
1. Brute force: the HTML5 FileReader API has a nice slice() method for random access. So I could write out some kind of index at the beginning of the file, use it to look up the positions of the other stored objects, and read them whenever they're needed. I figured I'd ask if there are already JavaScript libraries that do something like this (or better) before trying to implement this ugly thing (a rough sketch of the layout follows this list).
2. HTML5 local database. Essentially, I am looking for an analog of the HTML5 openDatabase() call that would open a (read-only) connection to a database based on a user-specified local file. From what I understand, there's no way to specify a file with a pre-loaded database. Furthermore, even if there were such a hack, it's not clear whether the local file format would be the same across browsers. I've seen the PhoneGap solution that populates the browser's local database from SQL statements. I can do that too, but the data I am talking about is quite large (5-10GB): it would take a while to load, and such duplication seems rather pointless.
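To make option 1 concrete, here is a rough sketch of the "index at the start of the file" layout, written in Python with ordinary file seeks standing in for FileReader.slice(); the header format and record names are assumptions:

import json
import struct

def write_indexed_file(path, records):
    # records: dict mapping record name -> JSON-serializable object.
    blobs = {name: json.dumps(obj).encode('utf-8') for name, obj in records.items()}
    index, offset = {}, 0
    for name, blob in blobs.items():
        index[name] = (offset, len(blob))   # offset within the data section
        offset += len(blob)
    index_bytes = json.dumps(index).encode('utf-8')
    with open(path, 'wb') as f:
        f.write(struct.pack('>Q', len(index_bytes)))   # 8-byte index length
        f.write(index_bytes)
        for blob in blobs.values():
            f.write(blob)

def read_record(path, name):
    with open(path, 'rb') as f:
        index_len = struct.unpack('>Q', f.read(8))[0]
        index = json.loads(f.read(index_len))
        offset, length = index[name]
        f.seek(8 + index_len + offset)      # the File.slice(start, end) equivalent
        return json.loads(f.read(length))

In the browser the same layout would be read with File.slice(start, end) and a FileReader, but the bookkeeping is identical.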
HTML5 does not sound like the appropriate answer for your needs. HTML5's focus is on the client side, and based on your description you're asking a lot out of the browsers, most likely more than they can handle.
I would instead recommend you look at a server-based solution to deliver the desired goal/results to the client view, something like Splunk would be a good product to consider.

How can I create a well-formatted PDF?

I'm working on automating our company's invoicing system. Currently all data is stored in our local MySQL database, and someone manually updates an Excel spreadsheet and then merges this data into an MS Word template. The goal is to automate this process so that the invoice can be generated from our intranet website as a PDF.
My original plan was to create a template in HTML/CSS and use wkhtmltopdf to generate the PDF, but I ran into problems getting a repeatable header and footer on each page. thead and tfoot aren't supported by WebKit, and the fix suggested in this other question does not seem to work either.
So I then stumbled on using XML and XSL-FO, the latter I know nothing about. Is this the best path to take? Are there any libraries or utilities out there that will make converting my HTML+CSS into XML+XSL-FO easier? Are there any other alternatives I'm overlooking?
EDIT
Currently the server is CentOS Linux with a MySQL database. All other code is in PHP for now, but that may change as the whole system is being revamped. Linux and MySQL will almost certainly remain, though.
For your requirement, XSL-FO might just do the trick. It is much cleaner to produce the PDFs directly from the data than to go down the cumbersome HTML path, unless you need to display the HTML as well; then you might consider converting from HTML to PDF, but it will always be messy.
You can get XML results from MySQL quite easily (mysql --xml) and then write one (or several) XSL-FO stylesheets for the data. Then you can produce not only PDFs, but also PostScript files or RTFs with some processors.
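As a hedged sketch of that pipeline (the database name, query, stylesheet name, and the choice of Apache FOP as the XSL-FO processor are all assumptions):

import subprocess

# 1. Dump the invoice data as XML straight from MySQL.
xml = subprocess.run(
    ['mysql', '--xml', '-e', 'SELECT * FROM invoices WHERE id = 42', 'billing'],
    stdout=subprocess.PIPE, check=True).stdout
with open('invoice.xml', 'wb') as f:
    f.write(xml)

# 2. Let the XSL-FO processor apply the stylesheet and render the PDF.
subprocess.run(['fop', '-xml', 'invoice.xml', '-xsl', 'invoice.xsl',
                '-pdf', 'invoice.pdf'], check=True)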
XSL-FO has its limitations, but for your situation it should suffice.
I admit the learning curve can be steep, and maintaining XSLT stylesheets can get very tiring, but as you get to know it better, you end up writing less code.
Another possibility is to do the whole thing in, e.g., Java or C#: send SELECT statements, loop over the results, and iteratively build the PDF using a library like iText.
You could try JODReports or Docmosis as less code-intensive options. You supply Word or OpenOffice Writer documents to act as templates and use these engines to manipulate/populate the templates, then spit out the documents in the format(s) you require. This may mean your existing Word templates can be used directly, which should save you some effort/time.
iText is another library that will let you build and pump out PDFs from code. It's pretty good.
If you could use ASP.NET for the web part, you can use the free ReportViewer library and designer for automated publishing of PDFs.
Here are some references:
http://gotreportviewer.com
http://weblogs.asp.net/srkirkland/archive/2007/10/29/exporting-a-sql-server-reporting-services-2005-report-directly-to-pdf-or-excel.aspx
If you're OK using .NET and C#, you could use DotPdf from Atalasoft (obligatory disclaimer: I work for Atalasoft and wrote most of DotPdf). The Generating namespace is geared for exactly what you're trying to do: automate report generation. From the very basics, you could just create docs directly with the toolkit or you can create template documents that have unpopulated text fields that you can reload and fill later (see here and here for examples).

any FAST tex to html program?

(I'm using Debian Squeeze.)
I tried catdvi (but it's unacceptable - just a lot of '?'s).
Now I am using tex4ht, but it's awfully slow.
For example, generating HTML for this:
takes ~2 seconds (that's 4+ times slower than generating the image!).
Is there something wrong with my config, or is tex4ht really that slow? (I doubt there's something wrong with my config.) Are there any other (FAST), reliable TeX-to-HTML converters?
As already suggested, if you want equations in a web page, MathJax will process TeX math code into proper math display.
What about latex2html? It seems to be the only hit on Google that provides this kind of functionality. Keep in mind that LaTeX is inherently slow, and it may be better to rely on something MathML- or MathJax-related. I have not tested the above for performance.
On Debian squeeze, just do
apt-get install latex2html

Google App Engine - Caching generated HTML

I have written a Google App Engine application that programmatically generates a bunch of HTML code that is really the same output for each user who logs into my system, and I know that this is going to be inefficient when the code goes into production. So, I am trying to figure out the best way to cache the generated pages.
The most probable option is to generate the pages and write them into the database, and then check the time of the database put operation for a given page against the time that the code was last updated. Then, if the code is newer than the last put to the database (for a particular HTML request), new HTML will be generated and served, and cached to the database. If the code is older than the last put to the database, then I will just get the HTML direct from the database and serve it (therefore avoiding all the CPU wastage of generating the HTML). I am not only looking to minimize load times, but to minimize CPU usage.
However, one issue that I am having is that I can't figure out how to programmatically check when the version of the code uploaded to App Engine was updated.
I am open to any suggestions on this approach, or other approaches for caching generated html.
Note that while memcache could help in this situation, I believe it is not the final solution, since I really only need to regenerate HTML when the code is updated (as opposed to every time the memcache entry expires).
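One hedged aside (assuming the first-generation Python runtime): the CURRENT_VERSION_ID environment variable changes with every deployment, so it can be folded into cache keys instead of comparing upload timestamps by hand. A minimal sketch, with the key format made up:

import os

# CURRENT_VERSION_ID looks like "1.385829384929"; the part after the dot
# changes each time the app is uploaded, so keys built from it stop matching
# (and are effectively invalidated) after a new deployment.
DEPLOY_ID = os.environ.get('CURRENT_VERSION_ID', 'dev')

def cache_key(page_name):
    return '%s:%s' % (DEPLOY_ID, page_name)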
In order of speed:
memcache
cached HTML in data store
full page generation
Your caching solution should take this into account. Essentially, I would probably recommend using memcache anyways. It will be faster than accessing the data store in most cases and when you're generating a large block of HTML, one of the main benefits of caching is that you potentially didn't have to incur the I/O penalty of accessing the data store. If you cache using the data store, you still have the I/O penalty. The difference between regenerating everything and pulling from cached html in the data store is likely to be fairly small unless you have a very complex page. It's probably better to get a bunch of very fast cache hits off memcache and do a full regenerate every once in a while than to make a call out to the data store every time. There's nothing stopping you from invalidating the cached HTML in memcache when you update, and if your traffic is high enough to warrant it, you can always do a multi-level caching system.
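A minimal sketch of that memcache-first, datastore-second flow, using the first-generation App Engine Python APIs (the CachedPage model, key names, and render_page() are placeholders, not anything from the question):

from google.appengine.api import memcache
from google.appengine.ext import db

class CachedPage(db.Model):
    html = db.TextProperty()
    updated = db.DateTimeProperty(auto_now=True)

def get_page_html(key_name):
    html = memcache.get(key_name)              # 1. fastest: memcache
    if html is not None:
        return html
    page = CachedPage.get_by_key_name(key_name)
    if page is not None:                       # 2. slower: cached copy in the datastore
        html = page.html
    else:                                      # 3. slowest: full regeneration
        html = render_page(key_name)           # placeholder for your generator
        CachedPage(key_name=key_name, html=html).put()
    memcache.set(key_name, html)               # repopulate memcache either way
    return html

Invalidating on a code update is then just a matter of deleting (or re-keying) the memcache and datastore entries when new code goes out.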
However, my main concern is that this is premature optimization. If you don't have the traffic yet, keep caching to a minimum. App Engine provides a set of really convenient performance analysis tools, and you should be using those to identify bottlenecks after you've got at least a few QPS of traffic.
Anytime you're doing performance optimization, measure first! A lot of performance "optimizations" turn out to either be slower than the original, exactly the same, or they have negative user experience characteristics (like stale data). Don't optimize until you're certain you have to.
A while ago I wrote a series of blog posts about writing a blogging system on App Engine. You may find the post on static generation of HTML pages of particular interest.
This is not a complete solution, but might offer some interesting option for caching.
Google App Engine frontend caching gives you a way of caching without using memcache.
Just serve a static version of your site
It's actually a lot easier than you think.
If you already have a file that contains all of the URLs for your site (e.g. urls.py), half the work is already done.
Here's the structure:
+-/website
+--/static
+---/html
+--/app/urls.py
+--/app/routes.py
+-/deploy.py
/html is where the static files will be served from. urls.py contains a list of all the urls for your site. routes.py (if you moved the routes out of main.py) will need to be modified so you can see the dynamically generated version locally but serve the static version in production. deploy.py is your one-stop static site generator.
How you lay out your urls module is up to you. I personally use it as a one-stop shop to fetch all the metadata for a page, but YMMV.
Example:
main = [
    {'uri': 'about-us', 'url': '/', 'template': 'about-us.html', 'title': 'About Us'},
]
With all of the urls for the site in a structured format it makes crawling your own site easy as pie.
The route configuration is a little more complicated. I won't go into detail because there are just too many different ways this could be accomplished. The important piece is the code required to detect whether you're running on a development or production server.
Here it is:
# Detect whether this the 'Development' server
DEV = os.environ['SERVER_SOFTWARE'].startswith('Dev')
I prefer to put this in main.py and expose it globally because I use it to turn on/off other things like logging but, once again, YMMV.
Last, you need the crawler/compiler:
import os
import sys
import urllib2
from app.urls import main

port = '8080'
local_folder = os.getcwd() + os.sep + 'static' + os.sep + 'html' + os.sep
print 'Outputting to: ' + local_folder
print '\nCompiling:'

for page in main:
    # Fetch the dynamically generated page from the local dev server...
    http = urllib2.urlopen('http://localhost:' + port + page['url'])
    file_name = page['template']
    path = local_folder + file_name
    # ...and write it out as a static HTML file.
    local_file = open(path, 'w')
    local_file.write(http.read())
    local_file.close()
    print ' - ' + file_name + ' compiled successfully...'
This is really rudimentary stuff. I was actually stunned by how easy it was when I created it. This is literally the equivalent of opening your site page by page in the browser, saving it as HTML, and copying that file into the /static/html folder.
The best part is, the /html folder works like any other static folder so it will automatically be cached and the cache expiration will be the same as all the rest of your static files.
Note: This handles a site where the pages are all served from the root folder level. If you need deeper nesting of folders it'll need a slight modification to handle that.
Old thread, but I'll comment anyway as technology has progressed a little...
Another idea that may or may not be appropriate for you is to generate the HTML and store it on Google Cloud Storage, then access the HTML via the CDN link that Cloud Storage provides for you.
No need to check memcache or wait for the datastore to wake up on new requests.
I've started storing all my JavaScript, CSS, and other static content (images, downloads, etc.) like this for my App Engine apps and it's working well for me.
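A hedged sketch of that approach with the google-cloud-storage client library (the bucket name, object naming, and one-hour cache lifetime are assumptions):

from google.cloud import storage

def publish_html(page_name, html):
    client = storage.Client()
    bucket = client.bucket('my-rendered-pages')        # placeholder bucket name
    blob = bucket.blob(page_name + '.html')
    blob.cache_control = 'public, max-age=3600'        # let the CDN/browser cache it
    blob.upload_from_string(html, content_type='text/html')
    return blob.public_url                             # serve this URL to clients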