Tesseract OCR to PAGE - ocr

The tool Tesseract OCR to PAGE, located here, is a Windows tool that runs Tesseract and outputs a file in PAGE format (an XML file that contains structural information about the document). Do you know of a Mac version of this kind of tool?
This question is linked to my previous one: How do I segment a document then output bounding boxes and labels using tesseract

You can run this tool using Wine. An example is included in my answer to the previous question that is linked in the question above.

Related

Any free OCR engine or API to identify hand written scan document?

I am using the Python binding for Google's Tesseract engine (https://code.google.com/p/python-tesseract/) to extract the text in an image (http://ceoarunachal.nic.in/eci/affidavits/s02/ge/1/KIREN%20RIJIJU/KirenRijiju_SC1.jpg). I am trying to digitize thousands of images similar to it. But Tesseract is not able to extract the handwritten text in it correctly, as it is mainly designed for machine-printed text.
Is there any way to optimize the current image, or to train Tesseract on this data, so that recognition improves? Or are there better tools for this?

tesseract 2.x - using multiple fonts at the same time

I have successfully trained Tesseract 2.x to recognize a few specific fonts. However, it seems that I can't make Tesseract recognize all of those fonts at the same time, i.e. when the source image contains all of them. Currently, only one set of Tesseract data can be put into the tessdata folder (i.e. one set with one trained font).
I know that Tesseract 3.x handles multiple fonts correctly. However, I can't upgrade, since there is no decent .NET binding for it that has the same features as the .NET binding for version 2.x.
Also, I would like to avoid running all the preprocessing and the OCR itself several times, once for each font.
For Tesseract 2.0x, a language data pack can recognize multiple fonts. Did you cluster your training files?
There are a couple of excellent .NET wrappers for Tesseract 3.01. Check its AddOn page for more info.
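On the clustering question in the answer above: in the Tesseract 2.x training process, clustering is the step where the .tr feature files of every font are passed to mftraining and cntraining in one run, so that a single language data pack covers all of the fonts. A minimal Python sketch of that step follows. The .tr file names are hypothetical, and it assumes the two training tools are on the PATH.

```python
import subprocess


def clustering_commands(tr_files):
    """Build the two clustering commands, each covering every font's .tr file."""
    names = [str(f) for f in tr_files]
    if not names:
        raise ValueError("need at least one .tr file")
    return [["mftraining", *names], ["cntraining", *names]]


def run_clustering(tr_files):
    """Run both tools in turn; assumes mftraining and cntraining are on PATH."""
    for cmd in clustering_commands(tr_files):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names: one .tr file per trained font.
    for cmd in clustering_commands(["arial.tr", "times.tr", "courier.tr"]):
        print(" ".join(cmd))
```

The point is only that all fonts go through the same invocation. Running the tools once per font would produce one data set per font, which is the situation described in the question.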

training tesseract and multi page tiff

I use tesseract 3.0.1 on windows 7 64 bit.
The documentation on training says:
Each font should be put in a single multi-page tiff (only if you are
using libtiff!)
I'm not familiar with libtiff. I use ImageMagick to create the multi-page TIFF. So far this is working well, or at least seems to be. Should I expect roadblocks later on? If so, what do I need to do about libtiff: is it enough to run its setup, or do I need to configure something?
Tesseract doesn't care how you produced your multi-page TIFF as long as it can read it with Leptonica (which internally depends on libtiff). If Tesseract can handle your TIFF now, it can also handle it for the rest of the training process and for the OCR run itself, so you are good to go.
I've produced my multi-page TIFF with the .NET standard library and Tesseract had no problem with it.
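If you want to confirm that whatever tool you used really wrote several pages into the file, you can count them without libtiff. The sketch below is a standard-library-only Python check, not part of Tesseract: it walks the chain of image file directories (IFDs) in a classic TIFF file. It assumes each chained IFD is one page, which holds for typical multi-page output but not for files that also chain thumbnails, and it does not handle BigTIFF.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count the IFDs chained from the header of a classic TIFF file."""
    if data[:2] == b"II":
        endian = "<"  # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("IFD chain loops back on itself")
        seen.add(offset)
        (entries,) = struct.unpack_from(endian + "H", data, offset)
        # Each entry is 12 bytes; the offset of the next IFD follows them.
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * entries)
        pages += 1
    return pages


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(count_tiff_pages(f.read()))
```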

Is there any OCR SDK for c++ builder?

I'd like to add character recognition functionality to my application, so I'm asking what the best available and affordable OCR SDK is. I looked at ABBYY FineReader Engine 10.0, but I haven't received the trial version I requested from the official site yet.
I've downloaded the Asprise OCR SDK, but it doesn't recognize Cyrillic symbols.
How can I implement character recognition in my application? Which libraries, SDKs, APIs and so on should I use?
There's CuneiForm and Google's Tesseract OCR, both of which are free. Personally I've used Tesseract. The SDK was giving me a lot of trouble, so I finally decided to simply call Tesseract's command line interface with arguments from within my C program using the system() function.
Lots of people face difficulties with the Tesseract installation, so here's a short summary (version 2 works for me, substitute the appropriate version if necessary):
1. Download the following from the svn: tesseract-2.00.tar.gz, tesseract-2.00.exe6.tar.gz, tesseract-2.00.eng.tar.gz
2. Unzip tesseract-2.00.tar.gz to a folder.
3. Unzip tesseract-2.00.exe6.tar.gz and move its contents to where tesseract-2.00.tar.gz was unzipped. A few files will be replaced this way.
4. Similarly, unzip tesseract-2.00.eng.tar.gz and move its contents to where tesseract-2.00.tar.gz was unzipped. The tessdata folder will be replaced.
5. After all this is done, open the tesseract.dsw workspace, select All Files and do "Rebuild All." This will take a while and produce loads of warnings, but hopefully no errors.
The command in a DOS shell is tesseract picture.tif textfile -l eng. So basically: save your image as a TIFF file, run the command from within your program, and then read the OCR output strings from the text file.
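The same call-the-command-line approach can be sketched in Python. The answer above used system() from C; this version uses subprocess instead, and the function names are made up for illustration. It assumes the tesseract executable is on the PATH and relies on Tesseract appending .txt to the output base name.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_image(image, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    # Tesseract writes its result to <output_base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    print(ocr_image("picture.tif", "textfile"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, which a single system() string would have to handle by hand.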
I can recommend Crystal OCR if you don't need to recognize very complex documents. They sent me a C++ Builder sample on request. IMHO, Tesseract is still buggy, though it's the best free OCR of course.
You can try KSAI-Toolkits. It has a complete OCR application, which includes a C++ API, an OCR model, benchmarks and test data. It also supports several platforms.

Seeking a compact format for HTML ebooks for offline reading under Linux

I have a netbook running Linux and a large collection of computer books and reference material as HTML. I'd like some compact way of storing these books so that they can be browsed without unpacking them first. This would save space and reduce wear on my small SSD.
If there was some way to convince Firefox to browse files contained in a ZIP file, this would be ideal. (I know iCab (Mac) had a web archive format that worked this way.) Perhaps a Firefox plugin? A small web server that can serve directly from ZIP files? Some magic FUSE module? Does anyone have any ideas?
On my PDA (which the netbook is largely replacing) I used iSilo for this, but it's not available for Linux, its conversions are lossy and it costs money.
There is a FUSE module for ZIP files here:
http://code.google.com/p/fuse-zip/
Gvfs should also support zip files.
Calibre might help (convert to a compressed format, manage, view e-books).
You can use OpenOffice.org to open the HTML pages, and then save them as OO documents. OO documents are essentially ZIP files.
Another option is to use OO to save as pdf.
You can even do this from a command line using this OO macro.
The same goes for AbiWord: you can use it on the command line to convert.
The AbiWord example shows how to convert all files in a directory to a desired format (PDF). Then you can use pdftools to merge all pages into one document.
Also, I do not know which desktop environment your laptop has, but if it is KDE, Konqueror (the file and web browser for KDE) opens web pages from inside a ZIP file without any problem.
Most probably Gnome's Nautilus can do this as well (I have no Gnome here to test).
Have you ever tried to open a zip file with whatever file manager you have, and then click on a web page inside it?
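The question's other idea, a small web server that serves directly from a ZIP file, can also be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not a hardened server: the script name, the port 8000 and the index.html default are arbitrary choices, and it only handles GET requests for files (no directory listings).

```python
import mimetypes
import posixpath
import sys
import zipfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote, urlsplit


def read_member(archive, url_path):
    """Return (body, content_type) for a URL path, or None if not in the ZIP."""
    # Drop the query string, decode %xx escapes, and collapse "..".
    name = posixpath.normpath(unquote(urlsplit(url_path).path)).lstrip("/")
    if name in ("", "."):
        name = "index.html"
    try:
        body = archive.read(name)
    except KeyError:
        return None
    content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
    return body, content_type


def make_handler(archive):
    """Create a request handler class bound to one open ZipFile."""

    class ZipHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            found = read_member(archive, self.path)
            if found is None:
                self.send_error(404, "Not in archive")
                return
            body, content_type = found
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ZipHandler


if __name__ == "__main__":
    # Usage (hypothetical file names): python zipserve.py books.zip
    with zipfile.ZipFile(sys.argv[1]) as archive:
        HTTPServer(("127.0.0.1", 8000), make_handler(archive)).serve_forever()
```

With the server running, pointing Firefox at http://127.0.0.1:8000/ browses the archive in place. Each page is decompressed in memory per request, so nothing is unpacked to the SSD.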