Tesseract 2.x - using multiple fonts at the same time

I have successfully trained Tesseract 2.x to recognize a few specific fonts. However, I can't get Tesseract to recognize all of those fonts at the same time, i.e. when the source image contains all of them. Currently, only one set of Tesseract data (i.e. one set with one trained font) can be put into the tessdata folder.
I know that Tesseract 3.x handles multiple fonts correctly. However, I can't upgrade, since there is no decent .NET binding for it with the same features as the .NET binding for version 2.x.
Also, I would like to avoid running all the preprocessing and the OCR itself several times, once for each font.

For Tesseract 2.0x, a single language data pack can recognize multiple fonts. Did you cluster the training files for all of your fonts together?
There are a couple of excellent .NET wrappers for Tesseract 3.01. Check its AddOn page for more info.

Forge Viewer local package

I was wondering if there is any (approved) way of having an offline copy of the up-to-date Forge Viewer JS package (v2.14 at the time of writing).
All the documentation I've seen about the viewer uses the CDN (or rather viewingservice) version and emphasizes specifying a version tag (e.g. https://developer.api.autodesk.com/viewingservice/v1/viewers/viewer3D.js?v=v2.14).
In some cases it comes in handy to have the full package (js, css, locales, textures, dds, etc.) locally.
The npm package view-and-data ships version 2.5 inside a zip (but its GitHub repo no longer exists), the forge-rcdb.nodejs GitHub repo used to embed it, and some old forks that still do are online (with an outdated version, obviously).
In the same vein, https://autodeskviewer.com/viewers-dev/ seems to list all files (so one could script and retrieve them) but again is outdated.
No, there is no downloadable package for the Autodesk Viewer. The Viewer is composed of multiple files (js, css, png, rendering config, ...), and the list may change in the future. One easy way to get it locally is to go to https://extract.autodesk.io/, which embeds the viewer files in the extraction process. The Autodesk copyright still applies, but there is no problem with getting a local copy to run your app.
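If you would rather script the retrieval yourself, as the question suggests, a minimal Python sketch could look like the following. It only reuses the version-pinned URL pattern quoted in the question; the file list is a hypothetical placeholder (only viewer3D.js is known from the question), and whether every asset is served under the same base URL is an assumption.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base URL taken from the example in the question.
BASE_URL = "https://developer.api.autodesk.com/viewingservice/v1/viewers/"


def versioned_url(relative_path, version):
    """Build a version-pinned URL such as .../viewers/viewer3D.js?v=v2.14."""
    return f"{BASE_URL}{relative_path}?{urlencode({'v': version})}"


def mirror(relative_paths, version, target_dir):
    """Download each listed file into target_dir, keeping its relative path."""
    for rel in relative_paths:
        destination = Path(target_dir) / rel
        destination.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(versioned_url(rel, version), str(destination))


if __name__ == "__main__":
    # Hypothetical, incomplete file list: you have to supply the real one.
    mirror(["viewer3D.js"], "v2.14", "viewer-local")
```

Because the list of viewer files may change between versions, pin both the version tag and the file list together when you refresh the local copy.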

Tesseract OCR to PAGE

The tool Tesseract OCR to PAGE, located here, is a Windows tool that runs Tesseract and outputs a file in PAGE format (an XML file that contains structural information about the document). Do you know of any Mac version of this kind of tool?
This question is linked to my previous one: How do I segment a document then output bounding boxes and labels using tesseract
You can run this tool using Wine. An example is included in my answer to the question linked above.

What does generating an app bundle in Windows 8.1 do?

Windows 8.1 introduced a new feature in the packaging section of the manifest called "Generate app bundle". It says that "Consider generating an app bundle if your app contains language-specific resources, a variety of image scales, or resources that apply to specific versions of DirectX. If you don't generate one, your app will run just fine, but users will have to download a larger app. For more information about app bundles, see App Packaging."
But users can change their language or run the app on a variety of different monitors at any time without reinstalling the app. So how does this feature work, and what is it actually doing?
Basically, the app package is split up into modular chunks. Each library that you use is split up into its component DLLs. The language resources are also split up into a separate chunk for each language.
This does a few things. For instance, let's say you have two games, BlackJack and Spades. Both of them use the same base engine, with the same images and base game logic. All of these are included in your 'BaseCardGame' library. In the bundle, it will keep a log of the BaseCardGame library and include it in the bundle. Now, let's say you have a user who downloads both of these apps (as you hope they would). The bundle says "I need the BaseCardGame library with XXXXX signature." Your system says "I already have that, so bundle me up the rest of the stuff that I don't have." So your users only have to download that package once.
The same thing is true for the language resources. If the user has added only French and Italian to their system, it's unlikely they will need the Ukrainian language resources, so they don't have to download them. Note: the selection is not based on the language that is currently set, but on all the languages that have been added to the system. If the user later adds a new language, the system will go and fetch the matching language packages for the apps that have them.
This is all at a high level, but describes the basics of the bundling system. Channel 9 has quite a few good videos on it.
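The language-resource behaviour described above can be pictured with a small toy model in Python. This is only an illustration of the idea, not how Windows implements it: the function name, the dictionary layout, and the package names are all hypothetical.

```python
def packages_to_download(bundle, installed_languages, already_present=()):
    """Return the packages a device still needs from a bundle.

    bundle: {"main": <package name>, "resources": {<language>: <package name>}}
    installed_languages: languages the user has added to the system
    already_present: packages that are already on the device
    """
    wanted = [bundle["main"]]
    for language in installed_languages:
        package = bundle["resources"].get(language)
        if package is not None:  # the app may not ship every language
            wanted.append(package)
    return [p for p in wanted if p not in already_present]


if __name__ == "__main__":
    # Hypothetical bundle for the BlackJack example.
    blackjack = {
        "main": "BlackJack.main",
        "resources": {
            "fr": "BlackJack.lang-fr",
            "it": "BlackJack.lang-it",
            "uk": "BlackJack.lang-uk",
        },
    }
    first_install = packages_to_download(blackjack, ["fr", "it"])
    print(first_install)
    # The user later adds Ukrainian: only the missing package is fetched.
    print(packages_to_download(blackjack, ["fr", "it", "uk"], first_install))
```

In this model, adding a language later triggers a download of just the one missing resource package, which matches the behaviour described for newly added languages.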

training tesseract and multi page tiff

I use Tesseract 3.0.1 on 64-bit Windows 7.
The documentation on training says:
Each font should be put in a single multi-page tiff (only if you are using libtiff!)
I'm not familiar with libtiff. I use ImageMagick to create the multi-page TIFF. So far this is working well, or at least seems to be. Should I expect roadblocks later on? If so, what should I do about libtiff: is it enough to run its setup, or do I need to configure something?
Tesseract doesn't care how you produced your multi-page TIFF, as long as it can read it with Leptonica (which internally depends on libtiff). If Tesseract can handle your TIFF now, it can do the same for the rest of the training process as well as for OCR runs, so you are good to go.
I produced my multi-page TIFF with the .NET standard library and Tesseract had no problem with it.
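For reference, combining single-page images into one multi-page TIFF with ImageMagick can be scripted along these lines. This is a sketch in Python that assumes ImageMagick's convert tool is on the PATH (on ImageMagick 7 the executable is magick; on Windows, make sure it is not shadowed by the system convert.exe). The page and output file names are hypothetical.

```python
import subprocess


def build_multipage_tiff_command(page_images, output_tiff, executable="convert"):
    """Return the ImageMagick argument list that joins pages into one TIFF.

    Listing several input images followed by a single .tif output makes
    ImageMagick write them as the pages of one multi-page TIFF.
    """
    if not page_images:
        raise ValueError("at least one page image is required")
    return [executable, *map(str, page_images), str(output_tiff)]


if __name__ == "__main__":
    # Hypothetical file names for the pages of one font.
    command = build_multipage_tiff_command(
        ["myfont-page1.tif", "myfont-page2.tif"], "myfont.tif"
    )
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.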

Is there any OCR SDK for C++ Builder?

I'd like to add character recognition functionality to my application, so I'm asking what the best available and affordable OCR SDK is. I looked at ABBYY FineReader Engine 10.0, but I haven't received the trial version that I requested from the official site yet.
I've downloaded Asprise OCR SDK, but it doesn't recognize Cyrillic symbols.
How can I implement character recognition in my application? Which libraries, SDKs, APIs and so on should I use?
There's CuneiForm and Google's Tesseract OCR, both of which are free. Personally I've used Tesseract. The SDK was giving me a lot of trouble, so I finally decided to simply call the Tesseract command-line interface with arguments from within my C program using the system() function.
Lots of people face difficulties with the Tesseract installation, so here's a short summary (version 2 works for me; substitute the appropriate version if necessary):
Download the following from the svn: tesseract-2.00.tar.gz, tesseract-2.00.exe6.tar.gz, tesseract-2.00.eng.tar.gz
Unzip tesseract-2.00.tar.gz to a folder.
Unzip tesseract-2.00.exe6.tar.gz and move its contents to the folder where tesseract-2.00.tar.gz was unzipped. A few files will be replaced this way.
Similarly, unzip tesseract-2.00.eng.tar.gz and move its contents to the folder where tesseract-2.00.tar.gz was unzipped; the tessdata folder will be replaced.
After all this is done, open the tesseract.dsw workspace, select All Files and do "Rebuild All." This will take a while and produce loads of warnings, but hopefully no errors.
The command in a DOS shell is tesseract picture.tif textfile -l eng. Tesseract appends .txt to the output name, so the result ends up in textfile.txt. So basically, save your image as a TIFF file, run the command from within your program, and then read the OCR output strings from the text file.
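The answer drives the command line from C with system(). The same "run the command, then read the text file" flow is sketched below in Python for brevity, and the structure carries over directly to a C or C++ Builder program. The function names are illustrative only, the tesseract executable is assumed to be on the PATH, and the output file is assumed to be UTF-8.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on a TIFF image and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Same file names as in the command shown above.
    print(run_ocr("picture.tif", "textfile"))
```

With check=True, a failed Tesseract run raises an exception instead of silently leaving a stale or missing text file behind.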
I can recommend Crystal OCR if you don't need to recognize very complex documents. They sent me a C++ Builder sample on request. IMHO, Tesseract is still buggy, though it's the best free OCR, of course.
You can try KSAI-Toolkits. It has a complete OCR application, which includes a C++ API, an OCR model, benchmarks and test data. It also supports different platforms.