I would like to train Tesseract v3.04.
I have generated box files from some images, and I am not sure where to go next.
The images are in PNG format.
The trained data is to be used with pytesseract.
Try using jTessBoxEditor.
jTessBoxEditor is a box editor and trainer for Tesseract OCR (Both 2.0x and 3.0x formats) and provides full automation of Tesseract training.
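If you would rather see what happens after the box files are ready, the manual Tesseract 3.0x steps can be sketched roughly as below. This is only a sketch of the documented command sequence, not how jTessBoxEditor itself is implemented. The language code `xyz`, the font name `myfont`, the `exp0` naming, and the helper names `training_commands` and `run_training` are all placeholders I made up, and it assumes a `font_properties` file already exists. The tools and flags differ between 3.0x releases (`shapeclustering`, for example, only exists from 3.02 on), so check the training documentation for your version.

```python
import os
import subprocess

# Classifier files written by shapeclustering/mftraining/cntraining that must
# be given a "<lang>." prefix before combine_tessdata can pick them up.
CLASSIFIER_FILES = ["inttemp", "normproto", "pffmtable", "shapetable"]


def training_commands(lang, fonts, image_ext="png"):
    """Return the training command lines, in order, for the given fonts.

    Assumes each font has an image <lang>.<font>.exp0.<image_ext> with a
    matching, already corrected .box file next to it, and that a
    font_properties file exists in the working directory.
    """
    bases = ["{}.{}.exp0".format(lang, font) for font in fonts]
    tr_files = [base + ".tr" for base in bases]
    box_files = [base + ".box" for base in bases]
    font_args = ["-F", "font_properties", "-U", "unicharset"]

    # One box.train run per image produces the matching .tr feature file.
    commands = [["tesseract", base + "." + image_ext, base, "box.train"]
                for base in bases]
    # Collect every character that occurs in the box files.
    commands.append(["unicharset_extractor"] + box_files)
    # Cluster the features of all fonts together.
    commands.append(["shapeclustering"] + font_args + tr_files)
    commands.append(["mftraining"] + font_args
                    + ["-O", lang + ".unicharset"] + tr_files)
    commands.append(["cntraining"] + tr_files)
    return commands


def run_training(lang, fonts, image_ext="png"):
    """Run the commands, rename the outputs, and build <lang>.traineddata."""
    for command in training_commands(lang, fonts, image_ext):
        subprocess.check_call(command)
    for name in CLASSIFIER_FILES:
        os.rename(name, lang + "." + name)
    subprocess.check_call(["combine_tessdata", lang + "."])
```

If every step succeeds, `run_training("xyz", ["myfont"])` should leave an `xyz.traineddata` file, which you can copy into the tessdata folder and select from pytesseract with `lang="xyz"`.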
Hope this helps!
The tool Tesseract OCR to PAGE, located here, is a Windows tool that runs Tesseract and outputs a file in PAGE format (an XML file that contains structural information about the document). Do you know of any Mac version of this kind of tool?
This question is related to my previous one: How do I segment a document then output bounding boxes and labels using tesseract
You can run this tool using Wine. An example is included in my answer to the question linked in your question.
I am using the Python binding for Google's Tesseract engine (https://code.google.com/p/python-tesseract/) to extract the text from an image (http://ceoarunachal.nic.in/eci/affidavits/s02/ge/1/KIREN%20RIJIJU/KirenRijiju_SC1.jpg). I need to digitize thousands of similar images. However, Tesseract does not extract the handwritten text correctly, as it is mainly designed for machine-printed text.
Is there any way to optimize the image, or to train Tesseract on this kind of data, so that recognition improves? Or is there a better tool for this?
I have successfully trained Tesseract 2.x to recognize a few specific fonts. However, I can't get Tesseract to recognize all of those fonts at the same time, i.e. when the source image contains all of them. Currently, only one set of Tesseract data can be put into the tessdata folder (i.e. one set with one trained font).
I know that Tesseract 3.x correctly handles multiple fonts. However, I can't upgrade, since there is no decent .NET binding with the same features as the .NET binding for version 2.x.
Also, I would like to avoid doing all the preprocessing and OCR itself several times, for each font.
For Tesseract 2.0x, a language data pack can recognize multiple fonts. Did you cluster your training files?
There are a couple of excellent .NET wrappers for Tesseract 3.01. Check the project's AddOn page for more info.
I use Tesseract 3.0.1 on 64-bit Windows 7.
The documentation on training says:
Each font should be put in a single multi-page tiff (only if you are using libtiff!)
I'm not familiar with libtiff. I use ImageMagick to create the multi-page TIFF. So far this is working well, or at least seems to be. Should I expect roadblocks later on? If so, what should I do about libtiff: is it enough to run its setup, or do I need to configure something?
Tesseract doesn't care how you produced your multi-page TIFF as long as it can read it with Leptonica (which internally depends on libtiff). If Tesseract can handle your TIFF now, it can do the same for the rest of the training process and when running OCR, so you are good to go.
I produced my multi-page TIFF with the .NET standard library and Tesseract had no problem with it.
I'd like to add character recognition functionality to my application, so I am asking what the best available and affordable OCR SDK is. I looked at ABBYY FineReader Engine 10.0, but I haven't yet received the trial version I requested from the official site.
I've downloaded the Asprise OCR SDK, but it doesn't recognize Cyrillic characters.
How can I implement character recognition in my application? Which libraries, SDKs, APIs and so on should I use?
There's CuneiForm and Google's Tesseract OCR, both of which are free. Personally, I've used Tesseract. The SDK was giving me a lot of trouble, so I finally decided to simply call Tesseract's command-line interface with arguments from within my C program using the system() function.
Lots of people face difficulties with the Tesseract installation, so here's a short summary (version 2 works for me, insert appropriate version if necessary):
Download the following from the svn: tesseract-2.00.tar.gz, tesseract-2.00.exe6.tar.gz, tesseract-2.00.eng.tar.gz
Unzip tesseract-2.00.tar.gz to a folder
Unzip tesseract-2.00.exe6.tar.gz and move its contents to the folder where tesseract-2.00.tar.gz was unzipped. A few files will be replaced this way
Similarly, unzip tesseract-2.00.eng.tar.gz and move its contents to the same folder, where the tessdata folder will be replaced.
After all this is done, open the tesseract.dsw workspace, select All Files and do "Rebuild All." This'll take a while with loads of warnings but hopefully no errors.
The command using DOS shell is tesseract picture.tif textfile -l eng. So basically save your image as a TIFF file, run the command from within your program and then read in the OCR output strings from the text file.
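The same approach (run the command-line tool, then read the text file it writes) can be sketched in Python, for example. I did it in C with system(), so this is an illustration of the idea rather than my original code, and the helper names `tesseract_command` and `ocr_image` are made up for this sketch. It assumes the tesseract executable is on the PATH and relies on Tesseract appending ".txt" to the output base name.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    # Raises CalledProcessError if tesseract exits with a non-zero status.
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    # Tesseract writes its result to <output_base>.txt.
    with open(output_base + ".txt", encoding="utf-8") as result_file:
        return result_file.read()
```

Calling `ocr_image("picture.tif", "textfile")` runs the command shown above and returns the contents of textfile.txt.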
I can recommend Crystal OCR if you don't need to recognize very complex documents. They sent me a C++ Builder sample on request. IMHO, Tesseract is still buggy, though it's the best free OCR, of course.
You can try KSAI-Toolkits. It has a complete OCR application, which includes a C++ API, an OCR model, benchmarks and test data. It also supports different platforms.