Any FAST TeX to HTML program?

(I'm using Debian squeeze.)
I tried catdvi, but it's unacceptable: the output is just a lot of '?'s.
Now I am using tex4ht, but it's awfully slow.
For example, generating the HTML for this:
takes ~2 seconds (that's 4+ times slower than generating the image!).
Is there something wrong with my config, or is tex4ht really that slow? (I doubt there's something wrong with my config.) Are there any other (FAST) reliable tex2html converters?

As already suggested, if you want equations in a web page, MathJax will process TeX math code into proper math display.
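For example, a minimal page along these lines lets the browser render the TeX itself, so nothing has to be converted server-side. The CDN URL and delimiters follow MathJax's current documentation; treat the file names and formulas as placeholders and adjust to the MathJax version you actually use:

    <!DOCTYPE html>
    <html>
    <head>
      <!-- MathJax loaded from a CDN; swap in a local copy if you prefer -->
      <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
    </head>
    <body>
      <p>Inline math: \( e^{i\pi} + 1 = 0 \)</p>
      <p>Display math: $$ \int_0^1 x^2 \, dx = \tfrac{1}{3} $$</p>
    </body>
    </html>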

What about latex2html? It seems to be the only hit on Google that provides this kind of functionality. Keep in mind that LaTeX is inherently slow, and it may be better to rely on something MathML- or MathJax-related. I have not tested the above for performance.
On Debian squeeze, just do
apt-get install latex2html
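A rough sketch of a typical invocation (the file name is a placeholder and the options are only illustrative; check the latex2html man page for the full set):

    # convert mydoc.tex into a single HTML page without the navigation panels
    latex2html -split 0 -no_navigation mydoc.tex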

Is there any safe way to convert tabs to spaces in multiple files?

Is there any safe way to automate this process for multiple files? By safe I mean that it will not break the code or introduce some kind of weird side effect that manifests exactly when you don't want it, in production.
I know about http://man.cx/expand. Is this method truly safe?
expand is pretty good, but I seem to recall it can get tricked in some conditions / for some languages, so for safety I'd have to assume "not truly".
Hopefully, however, your source code has plenty of tests that demonstrate its full functionality and correctness before it goes to production.
Alternatively / additionally, if you're compiling or producing bytecode (e.g. Java), you could probably do a binary comparison of the artefacts to prove equivalence between the original and that produced from the de-tabbed source code.
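If you do go the expand route, here is a sketch of the kind of loop I'd use. The file pattern and tab width are assumptions to adapt; the -i flag, which expands only leading tabs, is the "safer" mode since it leaves tabs inside string literals alone:

    # rewrite each file in place, expanding only leading tabs to 4 spaces
    find src -name '*.java' -print0 | while IFS= read -r -d '' f; do
        expand -i -t 4 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done

Then recompile and binary-compare the artefacts, as suggested above, before trusting the result.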

Performance of wkhtmltopdf

We are intending to use wkhtmltopdf to convert HTML to PDF, but we are concerned about the scalability of wkhtmltopdf. Does anyone have any idea how it scales? Our web app could potentially attempt to convert hundreds of thousands of (relatively complex) HTML documents, so it's important for us to have some idea. Has anyone got any information on this?
First of all, your question is quite general; there are many variables to consider when asking about the scalability of any project. Obviously there is a difference between converting "hundreds of thousands" of HTML files over a week and expecting to do that in a day, or an hour. On top of that, "relatively complex" HTML can mean different things to different people.
That being said, since I have done something similar (converting approximately 450,000 HTML files using wkhtmltopdf), I figured I'd share my experience.
Here was my scenario:
- 450,000 HTML files
  - 95% of the files were one page in length
  - generally containing 2 images (relative path, local system)
  - tabular data (sometimes contained nested tables)
  - simple markup elsewhere (strong, italic, underline, etc.)
- A spare desktop PC
  - 8GB RAM
  - 2.4GHz dual-core processor
  - 7200RPM HD
I used a simple single-threaded script written in PHP to iterate over the folders and pass each HTML file path to wkhtmltopdf. The process took about 2.5 days to convert all the files, with very minimal errors.
I hope this gives you insight into what you can expect from utilizing wkhtmltopdf in your web application. Some obvious improvements would come from running this on better hardware, but mainly from utilizing a multi-threaded application to process files simultaneously.
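For what it's worth, a sketch of the "process files simultaneously" idea using nothing but the shell; the paths, the output naming ({}.pdf next to each source file) and the worker count of 4 are assumptions to adapt:

    # run up to 4 wkhtmltopdf processes at a time over every .html file under html/
    find html/ -name '*.html' -print0 |
        xargs -0 -P 4 -I {} wkhtmltopdf --quiet {} {}.pdf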
In my experience, performance depends a lot on your pictures. If there are lots of large pictures, it can slow down significantly. If at all possible I would try to stage a test with an estimate of what the load would be for your servers. Some people do use it for intensive operations, but I have never heard of hundreds of thousands. I guess, like everything, it depends on your content and resources.
The following quote is straight off the wkhtmltopdf mailing list:
I'm using wkHtmlToPDF to convert about 6000 e-mails a day to PDF. It's all done on a quad-core server with 4GB memory... it's even more than enough for that.
There are a few performance tips, but I would suggest finding out what your bottlenecks are before optimizing for performance. For instance, I remember someone saying that, if possible, loading images directly from disk instead of having a web server in between can speed it up considerably.
Edit:
Adding to this, I just had some fun playing with wkhtmltopdf. Currently, on an Intel Centrino 2 with 4GB memory, generating a PDF with 57 pages of content (mixed p, ul, table), ~100 images and a TOC consistently takes < 7 seconds. I'm also running Visual Studio, a browser, an HTTP server and various other software that might slow it down. I use stdin and stdout directly instead of files.
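For reference, the stdin/stdout form is just the two dashes, which avoids writing temporary HTML and PDF files (the file names here are placeholders):

    # read HTML from stdin, write the PDF to stdout
    wkhtmltopdf --quiet - - < report.html > report.pdf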
Edit:
I have not tried this, but if you have linked CSS, try embedding it in the HTML file (remember to do a before-and-after test to see the effect properly!). The improvement here most likely depends on things like caching and where the CSS is served from: if it's read from disk every time or, god forbid, regenerated from SCSS, it could be pretty slow, but if the result is cached by the web server (I don't think wkhtmltopdf caches anything between instances) it might not have a big effect. YMMV.
We have tried to use wkhtmltopdf in several implementations. My documents are huge tables of generated coordinate points; a typical PDF of mine is about 500 pages.
We tried the .NET ports of wkhtmltopdf. The results:
- Pechkin - Pro: no other app needed. Contra: slow; 500 pages took about 5 minutes.
- PdfCodaxy - only contras: slow (slower than pure wkhtmltopdf), requires an installed wkhtmltopdf, problems with non-Unicode text.
- NReco - only contras: slow (slower than pure wkhtmltopdf), requires an installed wkhtmltopdf, does not release the native libs correctly after use (for me).
We also tried the wkhtmltopdf binary invoked from C# code.
Pro: easy to use, faster than the libs.
Contra: needs temporary files (cannot use Stream objects), and it breaks with very huge (100MB+) HTML files, just like the other libs.
wkhtmltopdf --print-media-type is blazing fast, but you lose the normal screen CSS styling with it.
This may NOT be an ideal solution for exporting complex HTML pages, but it worked for me because my HTML content is pretty simple and in tabular form.
Tested on wkhtmltopdf version 0.12.2.1.
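In case it helps, the full call is simply the flag plus input and output (file names are placeholders). The flag makes wkhtmltopdf apply your print stylesheet (@media print) instead of the screen one, which is why the normal styling disappears:

    wkhtmltopdf --print-media-type report.html report.pdf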
You can create your own pool of wkhtmltopdf engines. I did it for a simple use case by invoking the API directly instead of starting the wkhtmltopdf.exe process every time. The wkhtmltopdf API is not thread-safe, so it's not easy to do. Also, you should not forget about sharing native code between AppDomains.

Parse HTML using Ruby core libraries? (i.e., no gems required)

Some friends and I have been working on a set of scripts that make it easier to do work on the machines at uni. One of these tools currently uses Nokogiri, but in order for these tools to run on all machines with as little setup as possible, we've been trying to find a 'native' HTML parser instead of requiring users to install RVM and custom gems (due to disk-space limitations for most users).
Are we pretty much restricted to Nokogiri/Hpricot? Should we look at just writing our own custom parser that fits our needs?
Cheers.
EDIT: If there's posts on here that I've missed in my searches, let me know! S.O. is sometimes just too large to find things effectively...
- There is no HTML parser in the Ruby stdlib; HTML parsers have to be more forgiving of bad markup than XML parsers.
- You could run the HTML through tidy (http://tidy.sourceforge.net) to tidy it up and produce valid markup. This can then be read via REXML :-), which is in the stdlib (see the sketch after this list).
  - REXML is much slower than Nokogiri (last checked in 2009).
  - Sam Ruby had been working on making REXML faster, though.
- A better way would be to have a better deployment. Take a look at http://gembundler.com/bundle_package.html and use capistrano (or some such) to provision servers.
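A sketch of the tidy-plus-REXML route mentioned above. The file names are placeholders; the tidy flags are from its man page (-asxhtml for well-formed output, -numeric so REXML doesn't choke on named HTML entities), and the Ruby one-liner just prints every link as a demonstration:

    # clean the HTML up into well-formed XHTML
    tidy -quiet -numeric -asxhtml page.html > page.xhtml

    # then parse it with REXML from the stdlib (no gems needed)
    ruby -rrexml/document -e '
      doc = REXML::Document.new(File.read("page.xhtml"))
      REXML::XPath.each(doc, "//*") do |el|
        puts el.attributes["href"] if el.name == "a"
      end
    '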

Which technology should I use to transform my LaTeX documents into HTML documents?

I want to write a little program that transforms my TeX files into HTML. I want to parse the documents and turn the macros (the built-in ones and of course my own) into HTML pieces. Here are my requirements:
- predefined rules (e.g. \begin{itemize} \item text \end{itemize} => <br> <p>text </p> <br/>)
- defining my own CSS style
- ability to convert formulas (extract the formulas, load them into an image creator and then save the jpg/png)
- easy to maintain and concise
I know there are several technologies out there, but I don't know exactly which is the best for me. Here are the technologies that come to mind:
- Ruby (I/O is easy, formula loading via webrat)
- XML/XSLT (I don't think I need it; just overhead)
- Perl (there are many libs out there, but I'm not quite familiar with it)
- bash (I worked with sed and was surprised how easy it was to work with regular expressions)
- latex2html ... (these converters won't work for me, and they don't give me freedom in parsing)
Any suggestions, hints and comments are welcome.
Thanks for your time, folks.
Have a look at pandoc. It can also be installed on Linux or OS X, though it won't do your custom macros. The only thing I've seen that can do a decent job with custom macros is tex4ht, but to really work well you need to be producing .DVI files. If you have a ton of custom macros, writing your own converter is going to take an ass load of time. Even if you only have a few custom macros, it's still going to be a pain. Good luck!
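For completeness, a typical pandoc call looks something like this (flags per its manual; file names are placeholders). --mathjax leaves the formulas to the browser, while --webtex turns them into image URLs:

    # standalone HTML page with MathJax handling the math
    pandoc -s --mathjax mydoc.tex -o mydoc.html

    # or render formulas as images via a web service instead
    pandoc -s --webtex mydoc.tex -o mydoc.html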
Six: TeX
Seven: Haskell
(I gave up trying to persuade SO to start numbering my list from 6).

Using diff to find the portions of many files that are the same? (bizzaro-diff, or inverse-diff)

Bizzaro-Diff!!!
Is there a way to do a bizzaro/inverse-diff that only displays the portions of a group of files that are the same? (i.e., way more than three files)
Odd question, I know...but I'm converting someone's ancient static pages to something a little more manageable.
You want a clone detector: it detects similar code chunks across large source systems.
See our CloneDR tool: http://www.semdesigns.com/Products/Clone/index.html
You could try the comm command (for common). It'll only compare 2 files at a time, but you should be able to do 3+ with some clever scripting.
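A sketch of the comm approach for three files; comm needs sorted input, so note that you lose the original line order (file names are placeholders):

    # lines common to a.html and b.html, then intersected with c.html
    comm -12 <(sort a.html) <(sort b.html) | comm -12 - <(sort c.html)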
You could try sim. Been a few years since I've used it, but I recall it being very useful when looking for similarities within a file or in many different files.
This is a classic problem.
If I had to quick-and-dirty it, I'd probably do something like a diff -U 1000000 (assuming a version of diff that supports it), piped through sed to just get the lines in common (and strip the leading spaces). You'd have to loop through all the files, though.
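Concretely, something like this (file names are placeholders; it relies on unified-diff context lines starting with a single space):

    # show only the lines the two files have in common, leading space stripped
    diff -U 1000000 old.html new.html | sed -n 's/^ //p'

One caveat: if the two files are identical, diff prints nothing at all, so you'd want to special-case that when looping over many files.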
Edit: I forgot there is also a Tcl implementation that would be slightly more versatile, but would require more coding. You may be able to find an implementation for your language of choice.