Convert doc/docx to semantic HTML - html

I would like to convert doc/docx documents to semantic HTML.
Some wishes/requirements:
• Semantic HTML, such that headers in the document become <h1>, <h2> etc., tables become <table> and so forth.
• Should preferably handle headings, lists, tables and images. Graphs and math formulas are a nice extra.
• Doesn't have to be converted straight from doc/docx to HTML; it could use an intermediary format, such as XML or DocBook.
• Should work programmatically, and with a large number of documents.
The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs, a small user base, and it can't handle a lot of documents. It's more of a proof of concept.

" headers in the document are "
I think this is impossible,
because MS Word only writes down the result, as differently styled <p> elements.
Just like printed text on paper, the original information is not recorded.
Your other wishes can be met.
There are two commercial tools that can do this
(don't trust the free or online tools; they don't do the real work):
1 Word Cleaner by Zapadoo
www.zapadoo.com
2 HTML Cleaner for Word by wonder Studio
www.htmlcleaner.com
I prefer the second one, which was released just last year. You can try them both.
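For .docx files, at least, the heading information may be recoverable: a .docx file is a ZIP archive whose word/document.xml part records a style id on each paragraph. The following is a minimal Python sketch, not a full converter. It assumes the document uses Word's default English style ids (Heading1 to Heading6) and that the XML text has already been read from the archive (for example with the zipfile module). The function name is hypothetical, and lists, tables and images are ignored.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def document_xml_to_html(xml_text):
    """Map each paragraph to <h1>..<h6> or <p> based on its style id.

    Assumption: headings use the default style ids Heading1..Heading6.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        # Concatenate the text of all runs in the paragraph
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        match = re.fullmatch(r"Heading([1-6])", style_id)
        tag = f"h{match.group(1)}" if match else "p"
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)
```

Documents that use custom or localized style names would need a lookup in word/styles.xml, which this sketch does not attempt.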

There's a tool called upCast which is able to convert Word documents into XML.

docx4j (for docx only, not doc) writes clean HTML output. You'd need to change things a bit if you wanted <h1> instead of <p class="h1">, but it's open source so you can do that.
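If changing the converter itself is not an option, the <p class="h1"> form could also be rewritten after the fact. Here is a small Python sketch under a strong assumption: that the output contains the attribute exactly as written in the answer above. docx4j's real output has not been checked here, and the function name is hypothetical.

```python
import re

# Matches <p class="hN">...</p> exactly as quoted in the answer above.
# Extra classes, other attributes or different quoting are NOT handled.
_HEADING_P = re.compile(r'<p class="h([1-6])">(.*?)</p>', re.DOTALL)


def promote_headings(html_text):
    """Rewrite <p class="hN">text</p> as <hN>text</hN>."""
    return _HEADING_P.sub(r"<h\1>\2</h\1>", html_text)
```

For anything beyond this exact pattern, a real HTML parser would be a safer choice than a regular expression.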

I wrote a utility which implements the requirements you listed, excluding images, graphs and maths formulas. It's beta quality (i.e., it works on my machine). I published it at http://www.modeltext.com/word

Just more ideas.
Use Gmail to convert word docs
http://www.oreillynet.com/mac/blog/2006/05/use_gmail_to_convert_word_docs.html

Related

Is XML really more semantic that HTML with classes/ids?

I'm coming from a HTML / JavaScript / PHP background and have recently started learning XML.
I was reading this excerpt from "No Nonsense XML Web Development with PHP" which includes this comparison:
<div>
<div>
<h2>Product One</h2>
<p>Product One is an exciting new widget that will simplify your life.</p>
<p><b>Cost: $19.95</b></p>
<p><b>Shipping: $2.95</b></p>
</div>
</div>
Take a good look at this – admittedly simple – code sample from a computer’s perspective. A human can certainly read this document and make the necessary semantic leaps to understand it, but a computer couldn’t. ....
A computer program (and even some humans) that tried to decipher this document wouldn’t be able to make the kinds of semantic leaps required to make sense of it. The computer would be able only to render the document to a browser with the styles associated with each tag. HTML is chiefly a set of instructions for rendering documents inside a Web browser; it’s not a method of structuring documents to bring out their meaning.
The author then compares this to XML with this:
If the above document were created in XML, it might look a little like this:
<productListing title="ABC Products">
<product>
<name>Product One</name>
<description>Product One is an exciting new widget that will simplify your life.</description>
<cost>$19.95</cost>
<shipping>$2.95</shipping>
</product>
</productListing>
In theory, we should be able to look at any XML document and understand instantly what’s going on. In the example above, we know that a product listing contains products, and that each product has a name, a description, a price, and a shipping cost. You could say, rightly, that each XML document is self-describing, and is readable by both humans and software.
I get the author's point to a degree. Of course a computer would not be able to discern meaning from this HTML, there's no context.
However, I would never expect the HTML to be written in this way. Rather I would expect the HTML to use classes and/or ids to provide the necessary context more like:
<div class="productListing">
<div class="product">
<h2 class="name">Product One</h2>
<p class="description">Product One is an exciting new widget that will simplify your life.</p>
<p class="cost"><b>Cost: $19.95</b></p>
<p class="shipping"><b>Shipping: $2.95</b></p>
</div>
</div>
Given this example, my question is:
Is XML really more semantic than HTML that utilizes classes/ids to provide context to the data it contains?
(Note that I simplified the code examples to avoid TL;DR)
This is an interesting question. I'll give you my two cents.
I jumped onto XML a few years ago when I had to build a dynamic website and my client didn't have access to the database (just FTP access). What I essentially coded was an XML backend, with PHP fetching the data through SimpleXML parsing.
In retrospect, I do think XML is semantically richer than HTML. As a comment pointed out above, the HTML class has been a styling construct. I don't personally remember using, or hearing of anyone using, classes or ids for purposes other than CSS/JS-based styles or animations.
The key advantage of XML over HTML with classes was the flexibility to pass it around. On another project, updating values of XML elements from one system and then having them read and displayed by another system made a lot of things smoother. Additionally, the XML parsing libraries provide a number of functions for traversing the nodes.
It's also important to note that XML allows you to define attributes. These could be viewed as similar to classes and ids in HTML.
Also, let's not forget that RSS feeds are essentially XML, not HTML with more tags.
Therefore, answering your question specifically with respect to semantics, I definitely think XML has the advantage there.
TL;DR: XML is more semantic, in my opinion.
You are correct that in terms of just looking at markup, there is little to no difference between XML's "meaningful" element names and HTML class/id. However, keep in mind that for XML, there is a set of technologies and tools that allow you to easily work with element names. You can write schemas and validate against them. You can compose schemas by using namespaces. You can extract structures by using simple XPath expressions. All of this is much harder with the HTML approach.
So if you have requirements to capture and process "meaningful" structures, then XML is your friend. If all you want is to have a snapshot of something where you can say "this is a product", then there really might not be such a big difference.
My advice would be: If you store and process data using multiple publishing pipelines, XML very likely is a much better starting point. If all you want is to capture snapshots that will get delivered to HTML-based consumers, then "semantically enriched" HTML may be the easier way to go.
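To illustrate the point about extracting structures with simple path expressions, here is a short Python sketch that applies the standard library's ElementTree (which supports only a limited XPath subset) to the product listing from the question. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The XML sample from the question
LISTING = """<productListing title="ABC Products">
  <product>
    <name>Product One</name>
    <description>Product One is an exciting new widget that will simplify your life.</description>
    <cost>$19.95</cost>
    <shipping>$2.95</shipping>
  </product>
</productListing>"""


def product_costs(xml_text):
    """Return a {name: cost} mapping for every <product> in the listing."""
    root = ET.fromstring(xml_text)
    return {
        product.findtext("name"): product.findtext("cost")
        for product in root.findall("./product")
    }
```

The element names carry the structure here, so no knowledge of presentation markup (such as the <b> wrappers in the HTML version) is needed to get at the values.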

Format suitable to export to both html & pdf?

I need to maintain many documents which need to be able to be viewed as 2 different types of format: PDF & HTML. The document will be mostly text, but may contain some images or mathematical formulas.
My current approach is to maintain 2 files for each document. However, this approach is tiresome, as if the content needs to change, I need to modify BOTH versions of the file.
I want to find a way to easily keep both versions of the file in sync. Preferably (but not necessarily), the approach should allow me to use tools like git or svn.
A solution that comes to mind is to use LaTeX: represent the document in LaTeX, then export it to HTML/PDF. This way, whenever there is a change, I only need to modify one file (the LaTeX file).
But I have zero experience working with LaTeX, and I'm not sure whether it is suitable for this, so I need advice. What do you guys think? Is LaTeX suitable for this task? If not, what alternatives do I have?
First of all,
yes, LaTeX is suitable for this (and it works particularly well with formulæ).
The main processing paths are:
Use pdflatex to create a pdf directly from LaTeX
Use latex2html or tex4ht to convert your LaTeX source to HTML
I am biased (having authored a textbook on LaTeX in German), but I think LaTeX is definitely worth learning.
reStructuredText (Python docutils) is good for this. There are a couple of paths from text to PDF; one of them goes through LaTeX and the other one is the pure-Python rst2pdf.
If you have a lot of formulas, it might be worth doing it in LaTeX, but reStructuredText source is a lot more readable than LaTeX source.
Sounds like a good candidate scenario for working in Markdown and using pandoc to convert to both LaTeX and HTML. Formulas can essentially be written in LaTeX (which makes maintaining that output painless), and the Markdown-to-HTML conversion can be run with the --mathjax option to yield proper display in HTML.
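As a rough sketch of that workflow, the Python snippet below builds the two pandoc command lines from a single Markdown source. The file names and function names are hypothetical. It assumes pandoc is installed and, for the PDF path, that a LaTeX engine is available, since pandoc produces PDF through LaTeX by default.

```python
import subprocess


def pandoc_commands(source, stem):
    """Build pandoc invocations for HTML and PDF from one Markdown file."""
    html_cmd = ["pandoc", source, "--standalone", "--mathjax", "-o", f"{stem}.html"]
    pdf_cmd = ["pandoc", source, "-o", f"{stem}.pdf"]
    return [html_cmd, pdf_cmd]


def build(source, stem):
    """Run both conversions; raises CalledProcessError if pandoc fails."""
    for cmd in pandoc_commands(source, stem):
        subprocess.run(cmd, check=True)
```

Calling build("report.md", "report") would then regenerate report.html and report.pdf from the same source, which is also the only file that needs to live in git or svn.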

author html for ms word

My objective is to generate HTML markup to target MS Word. So far my finding is that if all the styles are inline on an element, the document renders properly when opened in Word. However, it is a lengthy task.
<h1 style="font-family:Arial">Inventory</h1>
This is how I try to achieve formatting. If I want to maintain a constant font across the document, I'd have to add font-family to all the elements in my HTML, as I've done above.
Later, I came across a CodeProject article: http://www.codeproject.com/KB/office/Wordyna.aspx. Now I am fairly convinced that you can declare the styles globally, but the styling language and formatting used are not like CSS, and I think they are proprietary to MS Word document formatting. I am looking for any tutorials/articles on this styling.
PS: I am aware of OpenXML etc. I feel it's too complex for me to implement at this point.
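If the markup is generated by a program anyway, the repetition of inline styles can at least be pushed into a helper. Below is a minimal Python sketch; the helper name and the BASE_STYLE constant are hypothetical, and whether Word honours a given CSS property has not been verified here.

```python
from html import escape

# Shared style applied to every generated element (hypothetical constant)
BASE_STYLE = "font-family:Arial"


def styled(tag, text, extra=""):
    """Return <tag style="...">text</tag> with the shared style inlined."""
    style = ";".join(part for part in (BASE_STYLE, extra) if part)
    return f'<{tag} style="{style}">{escape(text)}</{tag}>'
```

For example, styled("h1", "Inventory") reproduces the heading shown above, and changing BASE_STYLE changes the font for the whole generated document.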
Word --should-- open valid HTML (read: not Microsoft's proprietary HTML-ish mess) without fail, as it's the rendering engine for Outlook when you open an HTML email. You could go to the effort of building a document entirely in-line (read: only best practice for Microsoft) as we do for HTML emails, but I suspect there are several different ways to skin this cat.
Personally, if I were trying to get a rich-text formatted document from HTML to Word, I'd use a tool such as PHPDocX to build a proper Word document natively; then, if I really wanted Word HTML, I could simply hit save in Word. I've had to do something similar with Excel, where it will accept CSV, but the outcome is always better with XLSX, and there's a similar plugin to easily author a proper XLSX document.
If that's too difficult a route (and it's not that bad, trust me) then I'd stick to formatting following HTML Email rules. Simple guides are all over the web, such as here. And, since Outlook 07-current uses Word's html rendering engine, one could deduce that it has the same limitations listed here

What language should I use for editing documents?

Document editors are nice but they have their limitations.
What is a good alternative to them?
I already know HTML and CSS and while they can do the job, they are ill-suited for printed documents.
I was thinking of learning LaTeX, because many scholars use it. But I wonder if someone would recommend another language, such as PostScript.
LaTeX is fine. You don't want to write PostScript by hand.
I’m using LaTeX almost exclusively nowadays, at least for text documents (everything from CV over letters to manuals).
For quick one-off notes, I’m actually using Markdown (without a renderer. I just think that Markdown preserves document structure quite nicely even when used in text-only mode).
For presentations and spreadsheets, I use appropriate applications, though. In particular, I don’t think LaTeX is that well-suited to do the former (depending on your style of presentations, obviously. Mine have next to no text though …).
I finally got a chance to write an entire paper in LaTeX for my final semester of college and found it to be easier than I thought it would be. A couple of the nice things I found about it were:
A fairly lightweight syntax for most things (tables being the only real offender, but no one can get text tables right).
An extremely wide array of syntax for doing anything from automatically marking up a chemical formula to writing inline lists.
Beautiful output automatically.
Extremely easy to write modular documents where I might store a chapter in a file and then simply \include{} it in another. One particularly nice use I found for this was to include code that I had written in the document simply by referencing the files.
Wonderful support for footnotes and bibliographic references.
Libraries for just about anything you can imagine.
The major drawbacks are, IMHO:
A lack of any real direction or life in the language. It feels dead, and not because it's done.
A frustrating build process, although there are tools to help with that, from a simple bash script to a full-fledged makefile.
If you're interested in learning LaTeX, I would recommend starting out by reading the Not So Short Introduction to LaTeX 2e PDF.
However, I decided against using LaTeX for most things that I write these days, specifically because it feels dead and has a frustrating build process. I switched over to MultiMarkdown instead, as it is well supported and can be transformed into a large array of other formats, including LaTeX, which can then be massaged by hand if you really need to get it into the format expected by some publication. If you haven't played with MultiMarkdown or Markdown before, then I highly recommend checking them out. The syntax is extremely lightweight and natural, even compared to LaTeX. I find that except for some of the higher-level typographical constructs, MultiMarkdown supports everything I need on a regular basis.
My 2 cents.
It depends on what you want to do. If you are planning to write a formal document, maybe for printing too, just go for LaTeX.
It is not as difficult as it may appear at the very beginning, and it is professional and fulfilling.
If the Web is your goal, go for HTML / CSS.
OpenOffice or Word would do the trick in most cases; do not underestimate them. If you are going to use them (for example, for a job), take the time to learn them.
To expand on zzzzBov's comment, LaTeX is SUPPOSED to allow the writer to concentrate on the content and let the compiler/documentclass handle formatting (and that usually is true). If you use HTML/CSS to format, you will probably spend more time (rather than less) on formatting. Imagine that the LaTeX documentclass is the CSS, only it is already written for you, and your LaTeX source is the content, only the tags are more functional (such as italics or equations) rather than glue between the HTML and the CSS (<div ...>). I recommend the LaTeX wikibook as an easy way to start, and the short-math-guide if you need mathematics. Enjoy!

Why do I need Markdown?

Why do I need Markdown with a front-end editor like WMD? What does Markdown do to the content that's sent from the WMD editor?
How does Markdown store the content in the backend? Is it stored the same way, like *bold*, or in some other format? Why can't I just do an HTML encode?
Sorry if I sound very naïve.

			
				
It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: at some point, for any piece of software to enable rich text, it has to describe the richness in some manner, whatever that may be.
We could call that description of richness (by which I mean something like "this bit of text is bold" or "this bit of text is a hyperlink") "markup": it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is then needed to allow users to interact with it. Such markup languages include HTML, which uses <tag>, and BB code, which uses [tag].
Markdown is one such language.
As opposed to the languages I just mentioned, Markdown has tried to design itself so that the markup resembles common ASCII conventions people already use. For example, it's common for people to put asterisks around text to set it off, *important*, and in Markdown this notation indicates italics.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most Markdown implementations will also recognize HTML intermingled with Markdown formatting, so to be safe, you need to make sure your inputs and caches are sanitized properly.
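The store-the-source, cache-the-rendering approach described above could be sketched as follows in Python with SQLite. The table, column and function names are all hypothetical, and toy_render is only a stand-in that HTML-escapes the text; a real system would plug in an actual Markdown renderer plus a sanitizer.

```python
import sqlite3
from html import escape


def toy_render(markdown_text):
    """Hypothetical stand-in for a Markdown renderer: escapes HTML only."""
    return "<p>" + escape(markdown_text, quote=False) + "</p>"


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, body_md TEXT NOT NULL, body_html TEXT NOT NULL)"
    )


def save_post(conn, post_id, markdown_text, render=toy_render):
    """Store the raw Markdown and cache its rendered HTML in the same row."""
    conn.execute(
        "INSERT OR REPLACE INTO posts (id, body_md, body_html) VALUES (?, ?, ?)",
        (post_id, markdown_text, render(markdown_text)),
    )
    conn.commit()


def load_column(conn, post_id, column):
    """Fetch body_md (for editing) or body_html (for display)."""
    if column not in ("body_md", "body_html"):
        raise ValueError("unknown column")
    row = conn.execute(
        f"SELECT {column} FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0] if row else None
```

Rendering happens once per save rather than once per page view, at the cost of storing each body twice.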
The reason for using an alternate encoding system other than HTML is for security
Markdown and other such wiki style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user-generated content to attempt malware actions on the content reader's computer by scripting access to known security holes
Freeloaders use scripts to subvert the rest of the site by changing the content frame or styles, i.e. ads, menus, logos etc. This can also be criminal behaviour, if not just annoying
By using an intermediate language such as Markdown you have total control on the rendered output
Filtering HTML is possible, but is also complex and risky
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. This usually makes for cleaner-looking and more readable content (compare SO to eBay)
The main reason for using Markdown is the readability of marked-up text. For instance, you can send it in a plain-text email and the reader will still understand the emphasis and the bullets, the text will be divided into paragraphs, et cetera.
As for storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores the data as the user entered it, in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining that they have to convert it back).
If you use the WMD editor, you can show the user what the output will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWYG control: a WYSIWYG control allows the user to use HTML in data you are displaying on your web page. So you have to be the one who decides when it is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <em>something</em>, remove any unknown HTML elements and you're done.
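A toy version of that conversion step might look like the Python sketch below. It is not a Markdown implementation: it handles only *emphasis* and **strong** on a single line, and it escapes all HTML instead of removing unknown elements, which is a deliberately stricter substitute for the filtering described above. The function name is hypothetical.

```python
import re
from html import escape

_STRONG = re.compile(r"\*\*([^*\n]+)\*\*")
_EM = re.compile(r"\*([^*\n]+)\*")


def render_inline(text):
    """Escape raw HTML, then turn **x** into <strong> and *x* into <em>."""
    safe = escape(text, quote=False)  # neutralizes <script> and other tags
    safe = _STRONG.sub(r"<strong>\1</strong>", safe)
    return _EM.sub(r"<em>\1</em>", safe)
```

Because escaping happens before any markup is generated, the only tags that can appear in the output are the ones this function itself emits.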