PDF to Structured Format - html

I have tons of PDFs that I need to convert to some structured format that I can interpret (HTML, XML, etc.).
PDFs are in this format:
http://img840.imageshack.us/img840/5407/pdfv.png
So far I have tried a lot of software that converts to HTML, but none of it can separate the images. Each tool just takes something like a screenshot of the page without the text, uses that image as a background in the HTML, and positions the text with CSS.
Like this: http://img37.imageshack.us/img37/5015/examplelp.jpg
I have a lot of PDFs, so processing each one's images manually is not an option. Does anyone know of a solution for this (even paid software)?

I had a similar problem a while back and ended up writing my own solution. It's called PDFX and it's free to use. It converts PDF to a structured-format XML and also renders any bitmap images (not vector graphics) found in the PDF separately.
Example input/output can be found here. You might want to give it a try.

Related

PDF to HTML converter - Stuck

I need to have just one PDF on my website, as an HTML file. I don't need to generate them on my website; I just need to add one PDF to a page and put text over it. Does anyone know the best way to convert the PDF to HTML? I have found services like cloud converter, but they add so much other stuff to the page along with the text that it is impossible to filter through all the CSS and JavaScript to find a way to put text over it without covering it up or causing weird behavior. I just need the text to be formatted relative to itself and placed plainly on the page in HTML. Is this even the right approach? Thanks!

How to automate filling of several pdfs?

Does anyone know a good way to generate an adaptive PDF? I am currently working on a project where I need to create a report in which text and numbers are inserted into a template. I therefore need the PDF to adapt to the length of these inputs. My initial thought was converting HTML to PDF, which works, but it is very time-consuming to write HTML for 14 pages. I have tried different ways to generate the HTML, from canvas to Adobe and different applications found online. Although the result often looks beautiful in HTML, the HTML generated this way does not convert to PDF without compromising the appearance of the report, at least not with my method of using wkhtmltopdf. Does anyone have any ideas on how to efficiently make a beautiful PDF report that adapts to the length of the text input?
Thanks for any answers:)

How to parse a PDF using AS3? (air)

Is there a way to parse a PDF using AS3 via AIR on mobile?
I don't need the full content of the PDF, only some data. Is that possible?
Edit for clarification:
I have a PDF file that was originally created from an XML file. What I need is to be able to retrieve that XML, or at least to find a string inside the PDF so I can make a call to a web service.
Original:
There's nothing native in AS3 for this kind of thing, but there is AlivePDF. It won't let you traverse a document the way you would XML, which seems to be what you're trying to do by taking a small bit of a PDF, but it will let you create PDFs, add pages, change fonts, etc.
You weren't entirely clear on what you're attempting to achieve. If you update your question with a bit more detail, we may be able to help a bit more.
Edit:
From the refined question, AlivePDF is not what you're after, as it's really only for PDF generation. I'm assuming you're after a method to traverse the document like you would XML, by looking for a tag and extracting the information. I've not found a way to do this other than iterating through the document and searching manually, which probably isn't what you're after.
After some searching I found as3-pdfreader, which doesn't seem to be complete at the moment. However, the roadmap on the project home page says that parsing PDF files is complete. I've not been able to try it out yet, though.
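For what it's worth, the "iterate and search manually" route can be sketched roughly like this. It is written in Python rather than AS3, purely to show the idea, and it rests on assumptions: the whole file is already in memory as bytes, the interesting data sits either uncompressed in the file or inside Flate-compressed streams, and the function names (`iter_stream_bodies`, `find_in_pdf`) are made up for this example. It is not a real PDF parser and will miss streams that use other filters or encryption.

```python
import re
import zlib

# Grabs the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_stream_bodies(data: bytes):
    """Yield each stream body, inflated when it is Flate (zlib) data."""
    for match in STREAM_RE.finditer(data):
        body = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            yield zlib.decompressobj().decompress(body)
        except zlib.error:
            # Not Flate-compressed (or another filter): search it as-is.
            yield body


def find_in_pdf(data: bytes, needle: bytes) -> bool:
    """Return True if `needle` occurs in the raw file or in any decoded stream."""
    if needle in data:
        return True
    return any(needle in body for body in iter_stream_bodies(data))
```

If the XML was embedded as-is when the PDF was generated, searching the decoded streams for something like `b"<?xml"` may be enough to locate it. If the generator only rendered the XML's values as page text, the string will be split across drawing operators and this approach will not find it.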

Good PDF to HTML Converter for Mobiles

We have multiple PDFs that contain account tables and balance sheets. We have tried many converters, but the results are not satisfactory. Can anybody suggest a good converter that replicates the contents of a PDF with the exact structure in HTML? If there is a paid converter, please suggest that too.
This is the PDF we want to convert and show in HTML: http://www.marico.com/html/investor/pdf/Quarterly_Updates/Consolidated%20Financial%20Results%20-%20Q3FY11.pdf
Have you looked into this? http://pdftohtml.sourceforge.net/
It's open source as well, so it's free and can be modified if necessary.
There's even a demo showing the before PDF and the after HTML version. Not bad if you ask me.
If you're having issues specifically with tables in PDFs, perhaps the issue is the tables themselves and whatever program was used to generate them. Not all PDFs are created equal.
ALSO: Be aware that all the PDFs I've created and come across over the years have had lots of issues when copy/pasting blocks or lines of text that have other blocks or lines of text at an equal or higher position on the page. I think Acrobat lacks the ability to define a "sequence order" that says which block is selected after which (or most programs don't use it properly), so the system more or less selects content top-down, left-to-right, even if that means jumping over large blank areas or grabbing lines from multiple columns at once when you wouldn't expect it. This may be part of your tabular data issue. Your weak link here is the PDF format itself, and I think you may be expecting too much from it. Turning anything into a PDF is pretty much a one-way street, especially when you start putting lots of editable text into it.
Have you tried http://www.jpedal.org/html_index.php? There is also a free online version.
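To make the top-down, left-to-right point concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `Block` type, the coordinates, and the fixed `column_split` are invented, and real extractors work from much messier position data. It only shows why sorting purely by position interleaves two columns, and how knowing the column boundary changes the result.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x: float   # left edge, measured from the left of the page
    y: float   # top edge, measured from the top of the page
    text: str


def naive_order(blocks):
    """Top-down, then left-to-right: roughly what copy/paste ends up doing."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_split):
    """Read everything left of `column_split` first, then the rest."""
    key = lambda b: (b.x >= column_split, b.y, b.x)
    return [b.text for b in sorted(blocks, key=key)]


# Two columns of two lines each (hypothetical coordinates).
blocks = [
    Block(50, 100, "L1"), Block(300, 100, "R1"),
    Block(50, 120, "L2"), Block(300, 120, "R2"),
]
print(naive_order(blocks))                      # ['L1', 'R1', 'L2', 'R2']
print(column_order(blocks, column_split=200))   # ['L1', 'L2', 'R1', 'R2']
```

For a table the row-by-row order may actually be the one you want, which is part of why no single ordering rule works for every PDF.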

Putting several hundred .doc pages into webpage

I have hundreds of .doc files with text that I need put on web pages.
I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
However, all formatting is lost when a file is converted to .txt (titles should be in bold).
When I convert these .doc files to .html using Microsoft Word, the ~20 line documents become bloated >300 line .htm files (probably because each paragraph is put into textboxes)
Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.
How would you suggest going about this?
Edit: I may have solved my own question by embedding Google Docs into my page.
There is a program suite called wv (formerly mswordview). It includes a program called wvWare, which can transform Word documents to HTML.
Furthermore you can use the output from Word and send it through tidy. This corrects markup and usually can handle the mistakes made by Word.
You can try converting the Word documents to a DocBook intermediate format, then you can easily transform the DocBook with existing tools to (X)HTML.
MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.
Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.
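A strip/remarkup pass does not have to be elaborate. Below is a minimal sketch in Python using only the standard library. The whitelist in `KEEP`, the class name `WordHtmlStripper`, and the function `clean_word_html` are choices made for this example, not anything Word or Dreamweaver provides. It drops every attribute and every tag that is not whitelisted, skips `<head>`, `<style>` and `<script>` content, and then removes empty pairs such as `<strong><strong></strong></strong>`.

```python
import re
from html import escape
from html.parser import HTMLParser

KEEP = {"p", "b", "strong", "i", "em", "h1", "h2", "h3", "ul", "ol", "li", "br"}
VOID = {"br"}                        # kept tags that have no closing tag
SKIP = {"head", "style", "script"}   # content of these is dropped entirely
EMPTY_PAIR = re.compile(r"<(\w+)>\s*</\1>")


class WordHtmlStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0   # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")   # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag not in VOID and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html: str) -> str:
    parser = WordHtmlStripper()
    parser.feed(html)
    parser.close()
    result = "".join(parser.out)
    while True:   # repeat so nested empty pairs collapse from the inside out
        stripped = EMPTY_PAIR.sub("", result)
        if stripped == result:
            return result.strip()
        result = stripped
```

Running a few hundred files through something like this still needs a spot check afterwards, since anything Word expressed only through CSS classes (bold titles styled via a class, for example) is lost along with the attributes.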
That's why most people use PDFs for this type of issue.
My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).
Alright, thanks everyone for your suggestions, but I wanted to make this page accessible to everyone, including people without PDF viewers.
Google Docs allows you to bulk upload your text files (and converts them for you too).
You can then export them into an iframe to embed in any HTML document.