training tesseract and multi page tiff - ocr

I use Tesseract 3.0.1 on Windows 7 64-bit.
The documentation on training says:
Each font should be put in a single multi-page tiff (only if you are
using libtiff!)
I'm not familiar with libtiff. I use ImageMagick to create the multi-page TIFF. So far this is working well, or at least seems to be. Should I expect roadblocks later on? If so, what do I need to do with libtiff: is it enough to run its setup, or do I need to configure something?

Tesseract doesn't care how you produced your multi-page TIFF as long as it can read it with Leptonica (which internally depends on libtiff). If Tesseract can handle your TIFF now, it can do the same for the rest of the training process as well as for OCR runs, so you are good to go.
I produced my multi-page TIFF with the .NET standard library and Tesseract had no problem with it.
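For reference, here is a minimal sketch of bundling single-page scans into one multi-page TIFF by driving ImageMagick from Python. It assumes ImageMagick's `convert` executable is installed; the helper names and file names are hypothetical. On Windows, a bare `convert` may resolve to the unrelated system utility of the same name, so the `executable` parameter lets you pass the full path to ImageMagick's binary.

```python
import subprocess


def build_multipage_tiff_command(page_files, output_tiff, executable="convert"):
    """Return the ImageMagick argument list that joins pages into one TIFF.

    Listing several input images before a single TIFF output makes
    ImageMagick write them as consecutive pages of that file.
    """
    if not page_files:
        raise ValueError("at least one page image is required")
    return [executable, *page_files, output_tiff]


def make_multipage_tiff(page_files, output_tiff, executable="convert"):
    # Runs ImageMagick; raises CalledProcessError if the conversion fails.
    command = build_multipage_tiff_command(page_files, output_tiff, executable)
    subprocess.run(command, check=True)


# Example call (hypothetical file names for one font's training pages):
# make_multipage_tiff(["font_page1.tif", "font_page2.tif"], "font_all_pages.tif")
```

Keeping the command construction separate from the call makes it easy to print and inspect the exact command line before running it.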

Related

Tesseract OCR to PAGE

The tool Tesseract OCR to PAGE located here is a Windows tool that runs Tesseract and outputs a file in PAGE format (an XML file that contains structural information about the document). Do you know of any Mac version of this kind of tool?
This question is linked to my previous one: How do I segment a document then output bounding boxes and labels using tesseract
You can run this tool using Wine; an example is included in my answer to the question linked in the question.

Any free OCR engine or API to identify hand written scan document?

I am using the Python binding for Google's Tesseract engine, https://code.google.com/p/python-tesseract/, to extract the text from an image (http://ceoarunachal.nic.in/eci/affidavits/s02/ge/1/KIREN%20RIJIJU/KirenRijiju_SC1.jpg). I am trying to digitize thousands of similar images. But Tesseract is not able to extract the handwritten text in it correctly, as it is mainly designed for machine-printed text.
Is there any way to optimize the image or train the data to improve recognition, or are there better tools for this?

tesseract 2.x - using multiple fonts at the same time

I have successfully trained Tesseract 2.x to recognize a few specific fonts. However, it seems that I can't make Tesseract recognize all of those fonts at the same time, i.e. when the source image contains all of them. Currently, only one set of Tesseract data can be put into the tessdata folder (i.e. one set with one trained font).
I know that Tesseract 3.x handles multiple fonts correctly. However, I can't upgrade, since there's no decent .NET binding that has the same features as the .NET binding for version 2.x.
Also, I would like to avoid running all the preprocessing and the OCR itself several times, once for each font.
For Tesseract 2.0x, a language data pack can recognize multiple fonts. Did you cluster your training files?
There are a couple of excellent .NET wrappers for Tesseract 3.01. Check its AddOn page for more info.

Is there any OCR SDK for c++ builder?

I'd like to add character recognition functionality to my application, so I'm asking what the best available and affordable OCR SDK is. I looked at ABBYY FineReader Engine 10.0, but I haven't received the trial version I requested from the official site yet.
I've downloaded Asprise OCR SDK, but it doesn't recognize Cyrillic symbols.
How can I implement character recognition in my application? Which libraries, SDKs, APIs and so on should I use?
There's CuneiForm and Google's Tesseract OCR, both of which are free. Personally I've used Tesseract; the SDK was giving me a lot of trouble, so I finally decided to simply call Tesseract's command-line interface with arguments from within my C program using the system() function.
Lots of people face difficulties with the Tesseract installation, so here's a short summary (version 2 works for me, insert appropriate version if necessary):
Download the following from the svn: tesseract-2.00.tar.gz, tesseract-2.00.exe6.tar.gz, tesseract-2.00.eng.tar.gz
Unzip tesseract-2.00.tar.gz to a folder
Unzip tesseract-2.00.exe6.tar.gz and move its contents to the folder where tesseract-2.00.tar.gz was unzipped. A few files will be replaced this way.
Similarly, unzip tesseract-2.00.eng.tar.gz and move its contents to the folder where tesseract-2.00.tar.gz was unzipped; the tessdata folder will be replaced.
After all this is done, open the tesseract.dsw workspace, select All Files and do "Rebuild All." This'll take a while with loads of warnings but hopefully no errors.
The command using DOS shell is tesseract picture.tif textfile -l eng. So basically save your image as a TIFF file, run the command from within your program and then read in the OCR output strings from the text file.
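The same "run the command, then read the text file" approach can be sketched in Python rather than with C's system(). This is an illustration only: it assumes the tesseract executable is on the PATH, and the helper names `tesseract_command` and `ocr_image` are hypothetical. Tesseract appends ".txt" to the output base name, so the result is read from textfile.txt.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Tesseract writes its result to "<output_base>.txt", which is read
    back here once the process has finished successfully.
    """
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Example call, mirroring the command above:
# text = ocr_image("picture.tif", "textfile")
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.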
I can recommend Crystal OCR if you don't need to recognize very complex documents; they sent me a C++ Builder sample on request. IMHO, Tesseract is still buggy, though it's the best free OCR, of course.
You can try KSAI-Toolkits. It includes a complete OCR application, with a C++ API, an OCR model, benchmarks and test data. It also supports different platforms.

Air - Unzipping file

Currently I am using the nochump library for unzipping files, but it's very slow (around 30 seconds for a 2 MB file). Are there any other libraries available that are fast? Or is there a better way to unzip by communicating with the OS?
I have used FZip, but it won't work on Mac, so I can't use it.
Not that I'm aware of... AS3 is quite slow in these areas...
A possible workaround, if you are using zips for loading images, could be to use one big JPEG with all of your images inside it (possibly with an additional XML file to record the dimensions, or maybe even custom metadata). Uncompressing images in Flash is quite fast (and asynchronous).
It might be possible using Alchemy (there are very fast Alchemy libraries to encode JPEG and PNG), but I can't find any existing one for unzipping.
Otherwise, you can use the AIR 2.0 beta (not great for production code... depends on your project) to call a native application which will do the job for you.
Either way, it might get tricky to retrieve progress information if you need it.