Save and reload OCR result of Abbyy FineReader - ocr

Assume I have OCR-ed multiple PDFs in Abbyy FineReader. Is it possible to reload the OCR results in Abbyy FineReader at a later time in order to correct OCR errors?
My idea is to split running the OCR from correcting the OCR results, which I would do at a later time.

Yes. You can use the "File" - "Save" command to save the recognition results together with all the intermediate data.

Related

POC Help - OCR exploit to run code gathered from JPEG

I have been thinking about security concerns in regards to OCR programs such as Tesseract.
My theory is that malicious code printed out in plain text can be photographed and saved as an image file. (This leaves the hex and headers free from any change.)
Then using OCR the JPEG could be converted to greyscale and the characters then read and executed. Perhaps via an exploit within the OCR application.
Looking back at the way certain worms could self-execute in Windows via preview, perhaps something similar can be done using the above method.
I imagine it's one of the key security concerns for a company developing an OCR application so this may be very hard to provide a proof of concept.
If anyone would like to explore this concept, or perhaps explain why it is, or indeed is not, possible, I would appreciate it.
This is my first post so sorry if any forum rules have been missed.

Kofax Capture Recognition - I vs 1

Using Kofax Capture 10 (SP1, FP2), I have recognition zones set up on some fields on a document. These fields are consistently recognizing I's as 1's. I have tried every combination of settings I can think of that don't obliterate all the characters in the field, to no avail. I have tried Advanced OCR and High Performance OCR, different filters for characters. All kinds of things.
What options can I try to automatically recognize this character? Should I tell the people producing the forms (they're generated by a computer) they need to try using a different font? Convince them that now is the time to consider using Validation?
My current field setup:
Kofax Advanced OCR with no custom settings except Maximize Accuracy in the advanced dialog. This has worked as well as anything else I have tried so far.
The font being used is 8 - 12 pt Arial, btw.
Validation is a MUST if OCR is involved, no matter if e-docs or paper docs are processed. For paper docs it is an even bigger must.
Use at least 11pt Arial and render the document as a 300 dpi image. This will give you, I'd say, 99.9% accuracy (that is, 1 character in every 1000 missed). Accuracy can drop if you have data where digits and letters are mixed within one word, especially 1-I, 0-O, 6-G.
Recognition scripts can be used if you know that you have no such mixed data and OCR still returns mixed digits and letters. You can use the PostRecognition script event to catch the recognition result from the OCR engine and modify it with SBL or VB.NET scripts. But it greatly depends on the documents and data you process.
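As a rough illustration of that kind of post-recognition fix-up, here is a minimal sketch in Python rather than SBL or VB.NET. It does not use the Kofax scripting API at all; the function name fix_confusions and its kind parameter are made up for this example, and it assumes you know in advance that a field is purely numeric or purely alphabetic.

```python
# Hypothetical post-recognition clean-up; this is NOT Kofax API code.
# The confusion pairs are the ones named above: 1-I, 0-O, 6-G.
TO_DIGIT = str.maketrans({"I": "1", "O": "0", "G": "6"})
TO_LETTER = str.maketrans({"1": "I", "0": "O", "6": "G"})


def fix_confusions(text: str, kind: str) -> str:
    """Map look-alike characters in a field known to be all digits or all letters."""
    if kind == "numeric":
        return text.translate(TO_DIGIT)
    if kind == "alpha":
        return text.translate(TO_LETTER)
    # Mixed fields cannot be corrected safely this way; return them unchanged.
    return text
```

For a numeric field, fix_confusions("2O1I", "numeric") gives "2011". Fields that legitimately mix digits and letters are passed through untouched, which is why this only helps when you know the field type.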
Image cleanup will not do any good for e-docs.
I'd say your best bet would be to use validation. At least that will push responsibility to the validation operator.

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to detect the text in images that contain pure text, but the text is set in a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any possibility to improve the result or rather to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
Like Andrew Cash mentioned, it'll be very hard to perform OCR on that T letter because it overlaps several of the characters that follow it.
To improve the results you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work at ABBYY and can provide you additional info on our products if necessary. I've sent the image you've attached to our SDK and got this response:
Maximal size: lall (35)

Reversing an old file format

I’m trying to reverse engineer an old medical imaging format called Stentor for interoperability. It was designed by a company of the same name which was subsequently bought by Philips. But Philips has forgotten how to read Stentor files. I have a Windows program which exports JPEG from Stentor files but it’s closed source. I’d like to automate this process in order to tackle hundreds of files in this format.
The program is a late-1990s Win32 or MFC executable. It runs next to an ActiveX (.ocx) file which I’ve been able to interop with, but that file doesn’t contain the export method. I'm looking for suggestions on how to disassemble the binary in order to unearth the algorithm used to convert Stentor to JPEG. I looked through the Stentor files in a hex editor and didn’t find any evidence of JPEG (although hints on finding that would be appreciated too), so I think that the program has a couple of tricks up its sleeve.
Thanks in advance.
Kyle
Few programmers implement complex routines such as image recoding themselves. Instead they tend to license libraries that do that. A very smart way to start would be searching for text strings and see if you can discover the libraries they use. This will subsequently give you a lot of insight into how the data is encoded.
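If you don't have a strings utility at hand, the search can be sketched in a few lines of Python. This is only an illustration: the function name extract_strings and the minimum length of 6 are my own choices, and it only finds plain ASCII runs, so UTF-16 strings (common in Win32 resources) would need a second pattern.

```python
import re


def extract_strings(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

Read the executable and the .ocx in binary mode, pass the bytes to extract_strings, and look through the output for copyright notices, version banners, or DLL names that point at a third-party imaging library.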
Another good strategy would be to build a program that simply drives the GUI of your export program by sending mouse and keyboard events directly to it. Let this run for a few days to complete your export. Reverse engineering the file format is going to be slow and expensive, so for a one-time job it's probably not worthwhile.
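On the question's aside about finding evidence of JPEG in the files: a JPEG stream starts with the bytes FF D8 followed by another FF, so one thing to try is scanning for that signature at any offset instead of eyeballing a hex editor. A minimal sketch, with find_jpeg_offsets being a name I made up for this example:

```python
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the 0xFF that opens the next marker


def find_jpeg_offsets(data: bytes) -> list[int]:
    """Return every offset at which a JPEG start-of-image signature appears."""
    offsets = []
    pos = data.find(JPEG_SOI)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(JPEG_SOI, pos + 1)
    return offsets
```

An empty result doesn't prove much on its own. The image data could be stored in a different encoding, or wrapped or compressed in a way that hides the markers, which would fit with the exporter doing more than copying bytes out.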

Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Tried different fonts, including OCR-A and OCR-B.
In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so recommendations that come with free or cheap Java-based APIs would help.
Edit 2: if anyone is interested...without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and Asprise Java API producing the worst results...none of the four was acceptable for anything but standard document search grade OCR. I'm beginning to think that this isn't going to work out.
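For what it's worth, the fuzzy-matching idea from the question can be sketched as below. This is an illustration in Python rather than Java; the names edit_distance and match_identifier and the max_dist threshold are assumptions of the sketch, not part of any OCR product. It accepts an OCR reading only if exactly one known identifier is closest, so near-twins like "123G" and "123C" get kicked back instead of guessed.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_identifier(ocr_text: str, known_ids: list[str], max_dist: int = 1):
    """Return the unique closest known ID, or None if nothing is close or there is a tie."""
    scored = sorted((edit_distance(ocr_text, k), k) for k in known_ids)
    if not scored or scored[0][0] > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous (e.g. equally close to "123G" and "123C"): manual review
    return scored[0][1]
```

With known IDs "GG-9192", "EPN/245G", "123G" and "123C", a reading of "GG-9l92" resolves to "GG-9192", while "1236" is one edit away from both "123G" and "123C" and is rejected. That rejection rate is the real cost of the approach when identifiers differ by a single character.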
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR Code, or something similar would be best. It is marked for orientation, and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.