OCR graph paper - ocr

I would like to take a pdf of a scanned graph paper notebook (with handwriting) and turn it into a text file.
How can I do this?
Thanks

Check out an OCR library, like OCRopus. I don't think it takes PDF, so you may have to convert it to a TIFF or JPEG first.

There are OCR libraries that convert typing (OCRopus, tesseract, etc.)
There are also Java based handwriting libraries. I am not sure if OCRopus has that ability, one library I was looking into to do handwriting recognition was:
Online Video
Java Neural Networks
Conceivably you could take the pdf, convert it into a tiff if need be (according to the software), and it would give you something..
Good luck!

If it is the notebook as a PDF file you could e-mail it to a gmail account and then gmail allows you to "view" the PDF from within your browser as an HTML file. Still the pages remain images.
If you would like the text out of it OCR might work but it may also be uncapable of getting the text out of it.

Related

AS3 - Extract text from image

Is it possible to extract the text from an image like this?
(I'd like to display it in an textfield afterwards)
Thanks.
Uli
What you're looking for is Optical Character Recognition. Here is a similar question:
OCR Actionscript
Though sadly it has no clear-cut answer. There is no native class/framework for doing it in AS3, though I'm sure it's possible.
This is a task where you'd employ web service. I know Google Docs can OCR an image for you. ABBYY, whose FineReader is one of the best in the business, also provides an OCR web service. Google has open-sourced their OCR software. You can conceivable set it up on your own server.

PDF to Structured Format

I have tons PDFs that I need to convert to some structured format that I can interpret (HTML/XML/etc)
PDFs are in this format:
http://img840.imageshack.us/img840/5407/pdfv.png
I have tried so far a lot of softwares that convert to HTML but all of them have no capabilities to separate the images, they just take like a printscreen of the page without the text and then use this image as a background in the html, using css to position the text
Like this: http://img37.imageshack.us/img37/5015/examplelp.jpg
I have a bunch of PDFs so process each ones images manually is not an option. Does anyone knows any solution for this (even paid softwares)?
I had a similar problem a while back and ended up writing my own solution. It's called PDFX and it's free to use. It converts PDF to a structured-format XML and also renders any bitmap images (not vector graphics) found in the PDF separately.
Example input/output can be found here. You might want to give it a try.

Putting several hundred .doc pages into webpage

I have hundreds of .doc files with text that I need put on web pages.
I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
However, all formatting is lost when it is converted to .txt (titles should be in bold)
When I convert these .doc files to .html using Microsoft Word, the ~20 line documents become bloated >300 line .htm files (probably because each paragraph is put into textboxes)
Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.
How would you suggest going about this?
edit: I may have solved my own question, trying to embed Google docs into my page.
There is a program suite called wv (former mswordview). It has a program wvWare. This software can transform Word documents to HTML.
Furthermore you can use the output from Word and send it through tidy. This corrects markup and usually can handle the mistakes made by Word.
You can try converting the Word documents to a DocBook intermediate format, then you can easily transform the DocBook with existing tools to (X)HTML.
MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.
Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.
That's why most people use PDFs for this type of issue.
My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).
Alright thanks everyone for your suggestions, but I wanted to make this page accessible to everyone without pdf viewers as well.
Google docs allows you to bulk upload your text files (and converts them for you too)
You can then export them into an iframe to embed in any html document.

How do I convert PDF to HTML programmatically?

Are there any classes, COM objects, command line utilities, or anything else that I can make an API for that can convert a PDF to an HTML document? Obviously the conversion might be a little rough since PDFs can contain a lot more than HTML can describe. I found a utility called pdftohtml on Source Forge, but quite honestly it does a horrible job with the conversion. I don't care if the software is free or commercial, but is there anything out there at all that I can incorporate with my own software to do this sort of conversion at least decently? I know Google's developed their own method of doing this, since you can click "View as HTML" on a PDF attached to an email through Gmail, but I was hoping there was something out available to the public.
Remember, PDF to HTML. I'm NOT worried about HTML to PDF.
well one solution i can think of is to write little program that reads pdf text using library called iText and then generate html files.
well for java based PDF solutions...we dont have a clean way i guess-still.. all solutions are primitive and kind of workarounds... No easy solution for
1. Designing a template of a PDF
2. Then at runtime using java, populate data into this template...either using xml or other datasources...
such a simple requirement and NONE has a good "open-source and free" solution yet !
Eclipse BIRT comes close.. but does not handle Barcode elements ..OOB.
You were looking for pdf2htmlEX (C++), which converts PDF to HTML without losing text or format.
To convert further to semantic HTML, you can process pdf2htmlEX output using my project Transcript (Python). It is however not lossless anymore and works best on documents not deviating too much from conventional visual layout.

best practices to import text into html

What is the best practice for importing text into html from a multipage InDesign document, from designer to non-designer. Document designed on a mac going to CMS on PC - hand off the InDesign File or strip text into word file? Supplying all images and pdf as go-by?
More people are likely to be able to open a PDF than InDesign, especially with font considerations. I prefer to get work in PDF format. I can easily extract the text and I can pull the document into PhotoShop to slice it up. You just have to make sure the quality/compression settings are right so it doesn't muck up the JPEGs too much.