Starting in Office 2007, it is possible to save documents in PDF or XPS format. This is done (programmatically) by calling the method ExportAsFixedFormat. Googling has not turned up any definition of "fixed format" that seems compatible with what I know about the PDF file format. Is there a widely-accepted definition of this term?
Fixed format means that the document isn't editable.
When you save it as PDF or XPS it's an export to a format that is only intended for viewing.
(They are of course still editable if you happen to have the right software. For PDF for example that would be Adobe Acrobat Professional, i.e. not the free Adobe Reader.)
It probably means that these formats do not have formulas or layout engines.
In other words, your values, data, and layout are fixed.
CSV would also fit this definition, but it predated the feature.
I am trying to find a solution for a website I made so that Word documents no longer download automatically, but instead display directly in a new browser page when I click them.
Can someone help me? How can I do that in HTML?
Thanks a lot in advance
There are two approaches you can take for this:
Ensure that all visitors to your site have a browser extension that can display Word documents. (This isn't something you can control for a typical website, but it might be an option for a company intranet.)
Convert the Word documents to a format that the browser can display (e.g. HTML). You could do this manually or with code (which could be client-side or server-side). The Word document formats are notoriously complex, so you would need to find a third-party library that could do this for you.
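To give a rough idea of what server-side conversion involves, here is a minimal sketch that uses only the Python standard library. It assumes the newer .docx format (a zip archive containing word/document.xml), not legacy .doc files, and it keeps only paragraphs and bold runs. The function name docx_to_html is made up for this example; a real site would most likely still want a dedicated library.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_html(source):
    """Return a minimal HTML fragment for the paragraphs of a .docx file.

    `source` is a path or a binary file-like object. Only paragraphs and
    bold runs are kept; tables, images, lists, and styles are ignored.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    parts = []
    for para in root.iter(W + "p"):
        runs = []
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            text = html.escape(text)
            props = run.find(W + "rPr")
            bold = props.find(W + "b") if props is not None else None
            # <w:b/> means bold unless it is explicitly switched off
            is_bold = bold is not None and bold.get(W + "val") not in ("0", "false")
            runs.append(f"<strong>{text}</strong>" if is_bold else text)
        if runs:
            parts.append("<p>" + "".join(runs) + "</p>")
    return "\n".join(parts)
```

Calling docx_to_html("report.docx") (a hypothetical file name) returns a string of `<p>` elements that can be dropped into a page template.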
I have hundreds of .doc files with text that I need put on web pages.
I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
However, all formatting is lost when a file is converted to .txt (titles should be in bold).
When I convert these .doc files to .html using Microsoft Word, the ~20-line documents become bloated .htm files of more than 300 lines (probably because each paragraph is put into a textbox).
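For reference, the single-page idea described above (one page that picks a .txt file from a query parameter) could look roughly like this. It is a sketch in Python, not the PHP page itself; the function names and the `<pre>` wrapper are assumptions made for illustration. The point it shows is that the requested name has to be checked so it cannot escape the document folder.

```python
import html
from pathlib import Path


def resolve_text_file(base_dir, requested_name):
    """Map a user-supplied name to a .txt file inside base_dir, or return None."""
    base = Path(base_dir).resolve()
    candidate = (base / f"{requested_name}.txt").resolve()
    # Refuse anything that ends up outside base_dir (e.g. "../../secret")
    if base not in candidate.parents or not candidate.is_file():
        return None
    return candidate


def render_page(base_dir, requested_name):
    """Return an HTML fragment with the escaped contents of the requested file."""
    path = resolve_text_file(base_dir, requested_name)
    if path is None:
        return "<p>Document not found.</p>"
    body = html.escape(path.read_text(encoding="utf-8"))
    # Plain text only: any bold titles from the original .doc are already gone
    return f"<pre>{body}</pre>"
```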
Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.
How would you suggest going about this?
Edit: I may have answered my own question by embedding Google Docs in my page.
There is a program suite called wv (formerly mswordview). It includes a program called wvWare, which can transform Word documents to HTML.
Furthermore you can use the output from Word and send it through tidy. This corrects markup and usually can handle the mistakes made by Word.
You can try converting the Word documents to a DocBook intermediate format, then you can easily transform the DocBook with existing tools to (X)HTML.
MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.
Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.
That's why most people use PDFs for this type of issue.
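For anyone who does want to try the strip-and-remarkup route mentioned above, here is a minimal sketch using Python's standard html.parser. The whitelist of tags, the b/i renaming, and the class name Remarkup are choices made for this example rather than an established recipe. It drops every attribute and every tag outside the whitelist, skips `<style>` and `<script>` content, and removes elements that turn out to be empty, such as the nested `<strong>` pairs shown above.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "strong", "em", "h1", "h2", "h3", "ul", "ol", "li"}
RENAME = {"b": "strong", "i": "em"}   # map presentational tags to semantic ones
SKIP_CONTENT = {"style", "script"}    # drop these tags together with their text


class Remarkup(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skipping += 1
            return
        tag = RENAME.get(tag, tag)
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")   # all attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skipping = max(0, self.skipping - 1)
            return
        tag = RENAME.get(tag, tag)
        if tag not in ALLOWED:
            return
        if self.out and self.out[-1] == f"<{tag}>":
            self.out.pop()                # element was empty: remove it
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data))


def clean_word_html(markup):
    """Return markup reduced to the whitelisted tags, without attributes."""
    parser = Remarkup()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

This does not repair badly nested markup, so running the result through tidy afterwards may still be worthwhile.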
My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).
Alright, thanks everyone for your suggestions, but I wanted to make this page accessible to people without PDF viewers as well.
Google Docs allows you to bulk upload your text files (and converts them for you too).
You can then embed the exported documents in any HTML document using an iframe.
Are there any classes, COM objects, command line utilities, or anything else that I can wrap in an API to convert a PDF to an HTML document? Obviously the conversion might be a little rough, since PDFs can contain a lot more than HTML can describe. I found a utility called pdftohtml on SourceForge, but quite honestly it does a horrible job with the conversion. I don't care if the software is free or commercial, but is there anything out there at all that I can incorporate into my own software to do this sort of conversion at least decently? I know Google has developed its own method of doing this, since you can click "View as HTML" on a PDF attached to an email in Gmail, but I was hoping there was something available to the public.
Remember, PDF to HTML. I'm NOT worried about HTML to PDF.
One solution I can think of is to write a small program that reads the PDF text using a library called iText and then generates HTML files.
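The second half of that idea, turning extracted text into HTML, could be sketched as follows. This is Python rather than Java, and it assumes the per-page text has already been pulled out of the PDF by some library (iText or otherwise); that extraction step is not shown. The function name pages_to_html and the one-section-per-page layout are illustrative choices only.

```python
from html import escape


def pages_to_html(pages, title="Converted document"):
    """Wrap already-extracted page text (one string per page) in an HTML document."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # Treat blank lines as paragraph breaks
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        sections.append(f'<section id="page-{number}">\n{body}\n</section>')
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        "<body>\n" + "\n".join(sections) + "\n</body></html>"
    )
```

Layout, fonts, and images are lost with this approach; only the text survives.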
As for Java-based PDF solutions, I don't think we have a clean way yet. All the solutions are primitive and amount to workarounds. There is no easy solution for:
1. Designing a template of a PDF
2. Populating that template with data at runtime from Java, using XML or other data sources
It is such a simple requirement, and nothing offers a good free, open-source solution yet!
Eclipse BIRT comes close, but it does not handle barcode elements out of the box.
You were looking for pdf2htmlEX (C++), which converts PDF to HTML without losing text or format.
To convert further to semantic HTML, you can process pdf2htmlEX output using my project Transcript (Python). It is however not lossless anymore and works best on documents not deviating too much from conventional visual layout.
I'm looking to export a page that looks good in print media to Word.
Can this be done automatically, or mostly automatically, with the Office APIs?
The alternative is to create a program that reads all our style metadata and font metadata, converts it to Word, and forces a download.
The issue is that our style metadata is already built for CSS; it's a web app, after all. And writing my own CSS parser doesn't sound like a good use of time.
I know this sounds too simple to be true, but I believe you can simply rename a ".html" file to ".doc" to force it to open in Word, and let Office's HTML rendering take care of the rest.
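A quick way to try the rename idea is sketched below, assuming the page's HTML is already available as a string. The helper names are made up for this example, and whether Word renders the page's CSS acceptably is something to check by hand. The second helper shows response headers a web app might send to force a download under a .doc name.

```python
from pathlib import Path


def save_html_as_doc(html_text, target):
    """Write HTML to disk under a .doc extension so it opens in Word."""
    target = Path(target).with_suffix(".doc")
    target.write_text(html_text, encoding="utf-8")
    return target


def download_headers(filename):
    """Headers that ask the browser to download the response as a Word file."""
    return {
        "Content-Type": "application/msword",
        "Content-Disposition": f'attachment; filename="{filename}"',
    }
```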
If it's for reporting purposes, and you think you might have use for more of the same in the future, you could look at something like reporting services as a way of creating a report that can be downloaded in various formats. I'm not 100% sure if the newest version allows the creation of .doc files, but you can purchase plugins to permit this.
I would like to take a pdf of a scanned graph paper notebook (with handwriting) and turn it into a text file.
How can I do this?
Thanks
Check out an OCR library, like OCRopus. I don't think it takes PDF, so you may have to convert it to a TIFF or JPEG first.
There are OCR libraries that recognize typed text (OCRopus, Tesseract, etc.).
There are also Java-based handwriting recognition libraries. I am not sure whether OCRopus has that ability. One library I was looking into for handwriting recognition was:
Online Video
Java Neural Networks
Conceivably you could take the PDF, convert it to a TIFF if the software requires it, and the library would give you something.
Good luck!
If you have the notebook as a PDF file, you could e-mail it to a Gmail account; Gmail then allows you to "view" the PDF from within your browser as an HTML file. The pages still remain images, though.
If you would like the text out of it, OCR might work, but it may also be unable to extract the text.