How do I create "accessible" PDFs from HTML? - html

Does anyone have any suggestions on how to generate accessible PDFs (including images) from HTML?
The PDFs need to look like the original HTML, including positions of images etc.
Any special HTML structure required to help make the final PDF accessible?
I've seen questions about creating PDFS none of them specifically address the important issue of accessibility.
My poison of choice is Perl but references to any program, language or library will help.
I have a more in-depth question at TypeDoc if anyone has more general information to offer.
http://doctype.com/TiB
Also,
I, and others, would find it useful if users with accessibility problems could comment if they find the "usability experience" of using PDFs better or worse than reading from Plain Old Semantic HTML (POSH).
Thanks
Mike

Look into PrinceXML. Through CSS you can control margins, page breaking and orientation. While not open source, you can try it for free, but it places a small water mark in the upper right corner.

The Adobe ColdFusion server product does a really fine job of this, not surprisingly. But it's not free, and the open source implementations of the language (Smith and BlueDragon) don't support the pdf stuff.
Developer licenses to Adobe ColdFusion are free, and you can download it.

I've done this thing on a small scale but scripting Safari to print to PDFs. I don't recommend it for large-scale projects though.

By far the most capable PDF publishing tool I've ever come across is reportlab. There is an open source library written with Python and a proprietary system that allows you to construct a document using RML, a custom xml spec. The latter is easier for more complex docs. They tend to be very flexible (and reasonable) with pricing.
Not strictly an answer to your question as it doesn't handle html-to-pdf conversions, but perhaps of use to you.

Related

Make a html to pdf converter

I am pretty new to developing softwares and am intrigued by the huge world out there!! I have working knowledge of C/C++ and Java.. I was thinking of making an application that would convert a webpage to a pdf document.. I know there are many solutions available -- both online and offline..But I want to develop my own.. I googled but couldn't find anything that would help me get started..
I want to know how do we go about a conversion process?? How to get started?? What languages and technologies are pre-requisites for making a converter like this??
Thank You
So at least you need to get to the bottom to following specifications:
HTML specification
CSS specification
JavaScript specification
PDF specification
Moreover here are a lot of minor stuff such as Fonts, Decription/Encription algorithms and many many other minor but still necessary things.
I think you can imagine that this is quite a long way to get all this working. In fact, the complexity of such software is the reason why so many companies make money in this field.
Anyway, I'd suggest you to start from the simple things and grow your software gradually. Start with converting HTML to Image, because it is a bit simpler. Take and parse HTML, its CSS, its JavaScript. Clean HTML. Build DOM of the HTML document. Apply styles. Go thru the DOM and draw elements to the image.
Good luck!

Best way of creating graphics-heavy, cross-platform, eBooks?

I have been working on converting graphics and layout heavy books (created in Adobe InDesign) into cross-platform eBooks with non-standard functionality (embedded video, audio, interactivity, etc).
Exporting from InDesign to EPUB or other similar formats don't seem to work very well with the types of books I work with. There are all kinds of problems with layout being lost, images getting split apart, etc.
Out of all the different formats available, it looks like a combination of HTML5 & JavaScript is the best way to go. Though, the only way I can figure to do this is to export each page as a JPEG (without the text), then put the page images into an HTML shell, and add in the text using DIV tags (formatting with CSS).
This seems like a lot of work, and the idea of exporting images of pages to make an eBook seems embarrassingly amateurish to me.
I realize that HTML5 isn't quite cross-platform at the moment, but that's not an issue for me right now. I know that Adobe has other options that would be better, but I also need these books to work on the iPad. Unless/until Apple changes their terms, I can't use the technology Adobe developed for the purpose of converting books made in InDesign into apps.
Further, I'd like to add custom features that formats like epub and PDF don't support, but are relatively simple using HTML+JS.
Is there a better way of going from InDesign to some kind of cross-platform eBook format? Or getting everything to HTML, rather than just creating images of the pages and manually adding in all the text?
There are as many eBook formats as there are target markets, so there is no easy answer here.
Since you are keen on maintaining the design, there are some cross-platform eBook formats that are web-based that preserve layouts and allow to add real value (Xplana) and others that simply preserve the layout (CourseSmart, VitalBook). Inkling, a new startup for the iPad, offers some promise. Adobe makes it easy to transform content, but other than PDF they have nothing of note right now except for a couple of public experiments with high-end magazines, and Flash is definitely not an answer for eBooks even if Apple allowed it.
In pretty much all cases, you don't worry about the conversion, you just simply hand over web-optimized PDFs, and a publishing services company (or the distributers themselves) handle the conversion for you to the format that you need for your target market.
In the end, you won't find a single cross-platform format that is going to meet all of your needs. That's a bit of a Holy Grail of book publishing right now. I'd recommend writing up an RFP for your needs and contacting some publishing services companies or high-end compositors to see what they have to offer.

Math equations on the web

How can I render Math equations on the web? I am already familiar with LaTeX's Math mode.
The other answers are out-of-date. As of 2012, beautiful math is easy to write and render. The technology is called MathJax. You can see it in quiet action on MathOverflow and hundreds of math blogs.
MathJax is an open source JavaScript display engine for mathematics that works in all modern browsers. No more setup for readers. No more browser plugins. No more font installations… It just works.
Mathjax is reliable and unobtrusive, so you just need to write the math. You do so in Tex (Latex), a concise syntax with which most scientists and mathematicians are familiar (and have shared decades of good tutorials). For Mathjax, you simply write Tex code in-line in your HTML between double dollar signs, eg.
When $$a \ne 0$$, there are two solutions to $$ax^2 + bx + c = 0$$ and they are $$x = {-b \pm \sqrt{b^2-4ac} \over 2a}.$$
To use Mathjax to render your math, put a Javascript line in your HTML header:
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
If you publish on a platform such as Wordpress, Tumblr or Blogger there are plug-ins in their galleries to do this (Wordpress).
How does Mathjax render math? With Javascript it renders your math to beautiful HTML and CSS (remarkably resembling Latex) in a fraction of a second. If a browser supports MathML, it can render math through that too, but that's not important. It's a popular success because the end-user workflow is easy, not because of the technology behind it.
You can choose to use Mathjax (over png images) on Wikipedia if you have an account. Look for Special:Preferences / Appearance.
MathML is ridiculous. It's neither human-readable nor human-writable (the quadratic equation takes 800 characters - it's 50 in Tex). It's just another pointless XML language . Thankfully, it's obsolete before most browsers support it. It doesn't even look as good as Tex or Mathjax's HTML-CSS!
It turns out this is a bit of a pain.
You can use MathML, but browser support is still iffy. If you are starting with latex you've got a few options for converting to html, but they'll all typically end up rendering the actual equations to images and inlining those.
Nothings all that pretty (unless you resort to pdf or something). What's best will depend a bit on what sort of content, how many equations, and how complicated the equations are.
Here is a decent summary.
My two favorite approaches:
Client-side: MathJax. See some examples here. It is very easy to use and install and its development is backed by the AMS and SIAM among other scientific institutions. I expect this to become the defacto standard for displaying math on the Web.
Server-side: LaTeXML. This is used for producing the NIST Digital Library of Mathematical Functions. It tends to hiccup if you have custom macros in your TeX sources but in general it does give very good results.
The jsMath package is another option that uses LaTeX markup and native fonts. Quoting from their webpage http://www.math.union.edu/~dpvc/jsMath/:
The jsMath package provides a method
of including mathematics in HTML pages
that works across multiple browsers
under Windows, Macintosh OS X, Linux
and other flavors of unix. It
overcomes a number of the shortcomings
of the traditional method of using
images to represent mathematics:
jsMath uses native fonts, so they
resize when you change the size of the
text in your browser, they print at
the full resolution of your printer,
and you don't have to wait for dozens
of images to be downloaded in order to
see the mathematics in a web page.
There are also advantages for web-page
authors, as there is no need to
preprocess your web pages to generate
any images, and the mathematics is
entered in TeX form, so it is easy to
create and maintain your web pages.
See for example this page or that one.
Katex
A couple of developers from the Khan Academy released a blazing quick library based off of Tex called Katex:
Fast
High-quality
Self-contained; and,
Can be rendered on the server
Looks like a great modern option.
You can do more math directly in HTML than most people realize. See these notes.
The only safe way to render LaTeX is to save the output as an image. Some sites try to use tools to do this on the fly, and they never work reliably. For example, on some blogs, this works if you visit the web page directly but not if you go through Feedburner/Google Reader.
I've had terrible experience with MathML browser support, both in Firefox and IE. Don't even try it. Not yet. Maybe in a few years.
Here's the site I use to compile LaTeX to gifs.
If you're willing to use PDF instead of HTML, things get much easier. Just create your LaTeX document and use pdflatex to compile it to PDF. If you do go the PDF route, you may be interested in how to include PDF properties such as author, keywords, etc. in your LaTeX file. Also, this page explains how to mark up the LaTeX to make links in your PDF.
texvc can convert LaTeX math equations to png or HTML.
LaTeX and MathML are the only "right" ways to do this. However each has severe limitations. The other options are images (not really optimal if you need to edit the equations later) or complex HTML(requires some training but can be done).
I do render LaTeX formulas "on demand" in my wiki. Basically, I extract the latex code from each wiki section and put it into a .tex file (whose filename is an md5sum of the latex, so if the same code is used again, the same tex and therefore the same image will be used).
The tex file is then latex compiled by a cron task every minute, to produce first a .ps, then with the convert program a .png (named again with the original md5). The wiki entry replaces the latex text with an img tag referring to this png (with the original latex code as an alt, for text readers).
If you want to go this way, be very careful to sanitize your latex as much as you can. there are commands in latex, like \input, that you definitely do not want to let go through, as anybody able to use them would be able to include any readable file in your server disk and include it in the resulting latex output.
To solve this issue, Mediawiki (of wikipedia fame) has a special plugin which sanitizes the latex input, but I didn't want to use it for two reasons: first I did not use mediawiki, second it's written in OCaml and I didn't want to mess with a language I don't know.
I've used ASCIIMathML for this in the past. It's essentially a JavaScript library and can use a plugin in IE to optimize performance, but also works without it in IE & Firefox/Mozilla (although a bit slower). The syntax supports a subset of LaTeX, but the differences cause some confusion, so it may confuse your users, depending on where they are coming from.
Here are some links so you can check it out yourself:
ASCIIMathML
ASCIIMath Tutorial
Not perfect and doesn't work in all browsers (Safari, etc) but it's something that works today at least, albeit in a somewhat selective subset of the web.
I've written an open source javascript module to do this, named jqmath. See http://mathscribe.com/author/jqmath.html. You type equations in a simplified TeX-like syntax, and jqmath converts them to MathML or simple HTML and CSS, depending on the browser. This is more efficient and accessible than using images.
By the way, some of the summaries and notes mentioned in the other answers here are pretty outdated now. Also, Firefox supports MathML now, and webkit (Chrome and Safari) have it in their nightly builds, though they haven't released it yet. Internet Explorer renders MathML if you have the MathPlayer plugin. Opera fakes MathML with a stylesheet. MathML is part of the HTML 5 standard, so presumably all these browsers will natively support it sooner rather than later. It's true that until then, jqmath's output will not look as good as TeX's, but it's certainly readable, and is definitely a better solution for web pages going forward.
If you do use images, will a reader for a blind user be able to read the equation? Some may want to.
There is a little Mac App called LatexIt that makes it very easy to convert LaTeX equations to PDF, PNGs etc.
(I use it to create equations for my slides in Keynote or PowerPoint. It's very nice, with drag 'n drop support, so you can just 'drag' the equations anywhere to insert them.)
I have the impression that MediaWiki will allow you to enter LaTeX markup (or something similar) and dynamically decide the best way to display it. Currently I think that uses HTML where possible for small expressions and images for more complicated expressions that cannot be represented otherwise; I suspect that one day it may take advantage of whatever other methods become state of the art, i.e., MathML if browsers start supporting it. So I think you might find that if you use MediaWiki as if it were your website engine you'll be forward-compatible with whatever comes in the future.
You can generate equation image on-the-fly via a LaTeX server.
http://www.forkosh.com/mimetex.html
If you are using WordPress, you can use LaTeX for WordPress (http://wordpress.org/extend/plugins/latex/) plugin.
Currently the state of client side MathML rendering isn't ready for broad adoption. The means you really need to render the MathML as an image. How you do this will depend on your environment.
Do you have root access to your own server? Are you comfortable installing software on it? In this case, you can render your own images. If your running blogging software or a wiki, generally you can find a plugin which will take advantage of your platforms capabilities. This is usually the idea scenario if you plan to write a lot of math expressions.
If you host your own images, you can either pre-render them, or use an extension like mimetex.cgi. If you allow arbitrary MathML expressions to be rendered, you run the risk of other websites hot linking to your image renderer. If you put a filter in on your web server to prevent hot linking, then people viewing your site through a feed reader will also be blocked.
If you can't render your own images, or if you only have a few expressions you want to render, then you can usually have another service generate the image, and you hot link the image on your site. The downside of course is your dependent on another site, who gets nothing in return for serving up images for you.
Examples of other services (as mentioned in other comments) include:
* http://www.artofproblemsolving.com/LaTeX/AoPS_L_TeXer.php : alt text http://alt2.artofproblemsolving.com/Forum/latexrender/pictures/a/f/c/afc183343d84d030898f589bac12a8d9cf04558a.gif
* http://www.forkosh.com/mimetex.html : mimetex.cgi http://www.forkosh.dreamhost.com/mimetex.cgi?c=%5Csqrt%7Ba%5E2+b%5E2%7D
The advantage of using mimetex is one can easily change the formula and have it re-rendered.

How best to write documentation targeting both HTML and PDF? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Improve this question
Latex-to-html converters I've seen in the past have been pretty awful. Editing raw html is no fun and doesn't seem to translate well to the printed page. How do others solve this problem? Links to examples (both pdf and html) would be great.
Added: Another similar question was just asked:
What formatting language should I use for project documentation
For documenting code, I also recommend Sphinx. ReStructured Text is nice because it is readable and somewhat marked up in plaintext, and can do a nice job converting to html and to pdf. I still like LaTeX for certain things. My wife and I use LaTeX to write our christmas letter, which we mail out via snail mail. The pdf version is pretty fancy, with two columns, and headers and footers. The html version is simpler. I convert with plastex. Examples here:
http://fedibblety.com/annualReports
I don't think any binary format is a good choice (Word) for any sort of document that you might like to read 10 years from now. That is one of the nice things about LaTeX.
Yes, LaTeX-to-HTML converters used to suck (you've probably tried LaTeX2HTML), but of late they've got better. Tex4ht is highly configurable, and produces nice XHTML+CSS. See also other converters.
You can also use Docbook, if you can bear to write in it. There are converters from DocBook to both HTML and LaTeX (or to PDF directly); an example of the latter is dblatex.
See this post: LaTeX vs Docbook.
After many years of anguish and several false starts, I'm about to revisit this, and I'm going to give Sphinx a try. It can generate HTML or LaTeX from ReStructured Text.
I'm hoping it will be a much "lighter" option than full DocBook, but with many of the advantages.
You could take a step back and use something like DocBook and render to PDF via LaTeX and HTML straight from the DocBook files. Alternatively, Adobe Technical Communication Suite (Framemaker) will let you single-source a document to PDF and HTML. See this posting for a rundown on various technical documentation systems.
This is a personal choice but Latex in theory is perfect however in practice it's pain-in-the-arse. I'm using VS.NET HTML editor + raw HTML edit when I need it.
So I think using an WSIWYG HTML editor is best choice. You can always use a simple tool to convert it to PDF, and you can always edit HTML when you need something advanced. Also it's easier to put online when you need.
That's how I'm managing my software documentations and works fine for me.
PlasTeX looks like a nice latex-to-html converter, though I haven't tried it myself.
My friend Rob Felty wrote a blog post extolling its virtues:
http://blog.robfelty.com/2008/03/19/finally-a-better-latex-to-html-converter/
AsciiDoc looks like an interesting possibility.
Read about EPUB format. Its e-book format. http://en.wikipedia.org/wiki/EPUB
Since the answer mentioning Asciidoc was somewhat short on examples, here are some of the things your are looking for:
A pdf generated with Asciidoc
A cheatsheet with a side by side of the Asciidoc markup and the html result.
A list of publications done using Asciidoc, including O'Reilly books and the git documentation (to see both ends of the user scale).
I'm not sure that latex is really the best tool for this. The trouble you're having with the usual latex to html converter is indicative of the problem: html is simple not as expressive as latex.
If you insist on latex to html, take care to use a limited subset that can convert reasonably.
I've used TeXinfo in the past and it does a good job. Here's an example: http://yootles.com/api. I'd prefer to stick with LaTeX though instead of use another language.
If everything else fails you could grab an LaTeX to XML converter and write a simple XSLT stylesheet to convert it to HTML, or create a CSS style sheet and attach it to the XML file directly.
We've been using WebWorks ePublisher (www.webworks.com) which offers both multiple single-source formats (we are using Word) and the ability to output to many output formats (we output to Adobe PDF and Online Help (.CHM).
We were facing this problem in an academic project that involved Eclipse software, and we used plastex to convert Latex to HTML and Eclipse Help. Getting it to work was quite difficult, but the end result looks really nice. You can see all three versions here:
http://handbook.event-b.org/
Further, as this is an open project, the code (build scripts) are available. We have a continuous build system (Jenkins) that rebuilds everything when new Latex is checked in. This is particularly nice, as contributors don't need to install the toolchain on their systems. They just check in the new Latex and check on the server whether the HTML was produced correctly. Sources:
http://sourceforge.net/p/rodin-b-sharp/svn/HEAD/tree/trunk/Handbook/org.rodinp.handbook.feature/
Best, Michael
I don't have enough points to comment, but to bolster the plastex answer, here is the updated plastex example link:
http://robfelty.com/2008/03/19/finally-a-better-latex-to-html-converter
LaTeX? Seriously? I wasn't aware anyone outside academia still used it. I'd go with HTML, which you can save as PDF from the web browser. If you really must have some advanced typographic stuff, go with Word instead - it has a way to save to HTML (probably not as clean as one would like), and you can save as PDF with a free plug-in (downloadable separately).
Oh, and I wouldn't bother using things like InDesign - they are overkill. Also, don't bother paying for Acrobat Professional - there is a zillion free solutions available.

REALLY Simple Website--How Basic Can You Go?

Although I've done programming, I'm not a programmer. I've recently agreed to coordinate getting a Website up for a club. The resources are--me, who has done Web content maintenance (putting content into HTML and ColdFusion templates via a gatekeeper to the site itself; doing simple HTML and XML coding); a serious Web developer who does database programming, ColdFusion, etc., and talks way over the heads of the rest of us; two designers who use Dreamweaver; the guy who created the original (and now badly broken) site in Front Page and wants to use Expression Web; and assorted other club members who are even less technically inclined.
What we need up first is some text and graphics (a gorgeous design has been created in Dreamweaver), some links (including to existing PDF newsletters for download), and maybe hooking up an existing Blogspot blog. Later (or earlier if it's not hard), we may add mouseover menus to the links, a gallery, a calendar, a few Mapquest hotlinks, and so on.
My question--First, is there any real problem with sticking with HTML and jpegs for the initial site? Second, for the "later" part of the site development, what's the simplest we can go with? Third, are there costs in doing this the simple way that will make us regret it down the road? Also, is there a good site/resource where I can learn more about this from a newbie perspective?
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
If you don't require any dynamic content, heck, if you don't plan on editing the content more than once a week, I'd say stick to basic HTML.
Later, you'd probably want a basic, no-fuss and easily installable CMS. The brand really depends on the platform (most likely PHP/Rails/ASP), but most of them can be found by typing " CMS" into Google. Try prefixing it with "free" or "open source" if you want.
I'm pretty sure you can do all this for absolutely free. Most PHP and Ruby CMS's are free and web hosting is free/extremely cheap if you're not demanding.
And last/best tip: Find someone who has done this before, preferably more than once. He'll probably set you up so you never have to look at anything more complicated than a WYSIWYG editor.
Plain old HTML is fine, just as long as you don't use tags like blink and marquee.
I personally love tools like CityDesk.
And I'm not just plugging Joel. (There are others out there in this class I'm sure.) The point is they make making a static website very easy:
The structure is just a filesystem structure
pages have templates to consolidate formatting
all resources are contained in one file
easy and fast Preview and Publish functions
For a dynamic collaborative site, I would just install one of many open source CMSs available on shared hosting sites.
If you're familiar with html/javascript basics I'd look into a CMS - wordpress, drupal, joomla, nuke, etc. All of these are free. Very often your web hosting company will install one of these by default which takes all of the hard part out of your hands. Next is just learning to customize the system and there's tons of docs out there for any of those systems.
All that being said there is noting wrong with good old fashioned html.
In addition to some of the great content management systems already mentioned, consider cms made simple.
It makes it very easy to turn a static site into a content managed site (which sounds like exactly what you might need to do in the future), and the admin area is very easy to use. Our clients have found it much simpler to use than the likes of Joomla.
It's also free and open source.
Good luck!
There's no reason to not go with plain old HTML and JPGs if you don't know any server side scripting languages. Also, once you want to get more advanced, most cheap hosting services have tools that can be installed with one click, and provide things like blogs, photo galleries, bulletin boards (PHPBB), and even content management tools like Joomla.
I had the same problem myself, I was just looking for something really easy to smash together a website quickly. First I went with just plain old HTML, but then I realised a simple CMS would be better.
I went for Wordpress. Wordpress is mostly known as a blogging platform, but in my opinion it is really great as a deadly simple CMS as well.
why not simply use Google pages?
Here is an example of a website I did, takes about 2 hours, easy to maintain (not that I do (-: ) and FREE.
I think that suggesting you mess with HTML for what you need is crazy!
Plain HTML is great, gives you the most control. If you want to make updating a bit easier though, you could use SSI. Most servers have this enabled. It basically let's you attach one file to many pages.
For example, you could have your menu in navigation.html and every page would include this file. That way you wouldn't have to update this one file on every page each time you need to update.
<!--#include virtual="navigation.html" -->
I agree with the other commenters that a CMS might be useful to you, however as I see it, probably a solution like Webby might do it for you. It generates plain HTML pages based on Templates. Think about it as a "webpage preprocessor" which outputs plain HTML files. It has most of the advantages of using a server-based CMS, but without a lot of load on the server, and making it easy for you to change stuff on any of the templates you might use.
It's fine
Rails (or purchase / use a CMS)
Not unless you start becoming crazy-popular
It really depends on what you go with for 2. Rails has a plethora of tutorials on the net and any product you go with will have its own community etc.
To be perfectly honest though, if the dynamic part is someone elses blog and you move the gallery out into flikr you may find that you can actually live with large parts of it being static HTML for a very long time.
If a to Implement a website With User Profiles/Logins, Extensions, Gallery's etc s a Newbi then a CMS like Joomla, Etc are good , but Else if you presently have only Static Content then Its good to go with Good Old HTML, About JPEG , I though Presently Its better to use PNG or GIF as its Less Bulky.
Also About you Query About Shifting to Server Scripts , When you have Database Driven Material or When you have Other Things that Require Advanced Prog Languages , Just use PHP Scripts inside PHP , and Rename teh File as a PHP, Thats IT, No Loss to you HTML Data.....
Do Go Ahead and Launch you Site ......
Dude, you're talking about HTML, obviously you'll be styling your content with CSS. Wait till you run into IE issues and god forbid your client wants ie6 compatibility.
Go with the HTML for now, I'm sure you guys will hack it through. Our prayers are with you.
Personally, I'd never use JPEG images on a website, mainly because of three reasons:
JPEGs often contains artifacts.
Quality is often proportional
with filesize.
Does not support
alpha transparency.
That said, I'd recommend you to use PNGs for images since it's lossless and a 24-bit palette (meaning full colors + alpha transparency). The only quirk is that IE6 and below does not support native alpha for PNGs, however this could be resolved by running a javascript which would fix this issue.
As for designing a website, there's both pros and cons for this. I suggest you read through:
37 Signal's Why We Skip Photoshop
Jeff Croft's Why We Don't Skip Photoshop
As for newbie resources, I'd recommend you flip through the pages at W3 Schools.