I am writing a program that should be able to detect a single character from the image of it.
I think it should be pretty easy given how powerful OCR software have become these days but I have no real idea how to do it.
Here are the specifics:
The language is Persian
The character is not hand written.
There are no words or sentences, the image is of a single character generated from a PDF file. It will look like this:
Now ideally I should be able to perform OCR on this image and determine the character.
But I was using another approach so far. The fonts used in the PDF files are from a finite set of fonts (100 something) and from those only 2-3 fonts are usually used. So I can actually "cheat", and compare this character to all the characters of these 100 fonts and determine what it is.
As an example these are some of the characters in the font "Roya". I intended to compare my character image with all of these and determine the letter. Repeat for every other font until a match is found.
I was doing a bitmap compare with imagemagick but I realized that even if the fonts are the same there are still small differences between the character images generated from the same font.
As an example, these two are both the character "beh" from the font "Zar". But as you can see there won't be an exact match when doing a bitmap compare between them:
So given all this how should I go about doing the OCR?
Other notes:
The program is written in Java, but a standalone application or a C/C++ library is also acceptable.
I tried using Tesseract but I just couldn't get it to detect characters. Persian was very badly documented and it looked like it would need a ton of calibration and training. It also looked like it is optimized for detecting words and gave very bad results when detecting single characters.
Related
When building websites for non-english speaking countries
you have tons of characters that are out of the scope.
For the database I usally encode it on either utf-8 or latin-1.
I would like to know if there is any issue with performance, speed resolution, space optimization, etc.
For the fixed texts that are on the html between using for example
á or á
which looks exactly the same: á or á
The things that I have so far for using it with utf-8:
Pros:
Easy to read for the developers and the web administrator
Only one space ocupied on the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testings)
Cons:
When sending files to other developers depending on the ide, softwares, etc that they use to read the code they will break the accent in things like: é
When an auto minification of code occurs it sometimes break it too
Usually breaks when is inside an encoding
The two cons that I have a bigger weight than the pros by my perspective because the reflect on the visitor.
Just use the actual character á.
This is for many reasons.
First: a separation of concerns, the database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use it in another service or a Mobile App.
Second: just use UTF-8 for your database not latin. Again, think ahead what if your app suddently needs to support Japanese then how you store あ?
You always have the change to convert it to HTML codes if you really have to... in a view. HTML is an implementation detail, not core to your app.
If your concern is the user, all major browsers in this time and age support UTF-8. Just use the right meta tag. Easy.
If your problem are developers and their tools take a look at http://editorconfig.org/ to enforce and automatize line endings and the usage of UTF-8 in your files.
Maybe add some git attributes to the mix and why not go the extra mile and have a git precommit hook running some checker so make super sure everyone commits UTF-8 files.
Computer time is cheap, developer time is expensive: á is easier to change and understand, just use it.
I tried to improved the results of OpenSource OCR software. I'm using tessaract, because I find it still produces better results than gocr, but with bad quality input it has huge problems. So I tried to prepocess the image with various tools I found in the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manuall using GIMP
But I was not able to get good results with this bad test document: (really just for test, I don't need to content of this file)
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly good with this test document:
http://www.onlineocr.net/
I'm wonderung if it is possible using smart preprocessing to get similar results with tesseract. Are the OpenSource OCR engines really so bad compared to commercial ones? Even google uses tesseract to scan documents, so I was expecting more...
Tesseract's precision in recognition is a little bit lower than the precision of the best commercial one (Abbyy FineReader), but it's more flexible because of its nature.
This flexibility entail sometimes some preprocessing, because it's not possible for Tesseract to manage each situation.
Actually is used by google because is Google its main sponsor!
The first thing you could do is to try to expand the text in order to have at least 20 pixel wide characters or more. Since Tesseract works using as features the main segments of the characters' borders, it needs to have a bigger characters' size comparing with other algorithms.
Another thing that you could try, always referring to the test document you mentioned, is to binarize your image with an adaptive thresholding method (here you can find some infos about that https://dsp.stackexchange.com/a/2504), because some changes in the illumination are present. Tesseract binarizes the image internally, but this could be the case when it fails to do that (it's similar to the example here Improving the quality of the output with Tesseract, where you can also find some other useful informations)
I have a pdf of 100+ handwritten pages that I need to convert to machine readable text. So far I have tried tesseract and a free online tool with no success. The output seems to be jibberish.
tesseract myscan.png out -l eng
I've attached one example page. It contains both text, mathematical symbols (eg. integral sign) and occasionally pictures.
Maybe I'm using tesseract wrong? Could anyone try and get a decent output off this?
I use http://www.techsupportalert.com/best-free-ocr-software.htm
Watch out for the installer trying to load you up with other stuff
When it works, it just gives you bits to copy and paste.
But don't rush to download this one, try your's again first.
The problem likely isn't with the software, it's probably your input.
Scan at 600 dpi.
Try to increase the contrast and sharpen the image. The thinner and more defined from the background that the letters are, and the clearer the interspacing of the loops are, the better your chance of OCR capture.
These adjustments are best made in your original scanning software. 8MP or better camera can also make the scan.
Use GIMP to tweak after the scan.
Using Kofax Capture 10 (SP1, FP2), I have recognition zones set up on some fields on a document. These fields are consistently recognizing I's as 1's. I have tried every combination of settings I can think of that don't obliterate all the characters in the field, to no avail. I have tried Advanced OCR and High Performance OCR, different filters for characters. All kinds of things.
What options can I try to automatically recognize this character? Should I tell the people producing the forms (they're generated by a computer) they need to try using a different font? Convince them that now is the time to consider using Validation?
My current field setup:
Kofax Advanced OCR with no custom settings except Maximize Accuracy in the advanced dialog. This has worked as well as anything else I have tried so far.
The font being used is 8 - 12 pt arial, btw.
Validation is a MUST if OCR is involved, no matter if e-docs or paper docs are processed. For paper docs it is an even bigger must.
Use at least 11pt Arial and render the document as 300 dpi image. This will give you I'd say 99.9% accuracy (that is 1 character in every 1000 missed). Accuracy can drop if you have data where digits and letters are mixed within one word especially 1-I, 0-O, 6-G.
Recognition scripts can be used if you know that you have no such mixed data and OCR still returns mixed digits and letters. You can use the PostRecognition script event to catch the recognition result from the OCR engine and modify it with SBL or VB.NET scripts. But it greatly depends on the documents and data you process.
Image cleanup will not do any good for e-docs.
I'd say your best would be to use validation. At least that will push responsibility to the validation operator.
I'm working on getting the Lincoln font to work in Tesseract, and I'm getting abysmal results, even after going through the wildly complicated training process.
This is what the font looks like, so yeah, it's a bit tricky:
I've carefully made a training image, and then used that to make a box file. The training image is here (25MB!). The image is 300 DPI, and has representative characters nicely spaced out vertically and horizontally.
I made a box file for the training image, and it worked properly. I've verified that it's correct using a box file editor.
I took this box file/tif file, and used it to create training data. I did likewise with the 30 or so other sample images/fonts provided by Tesseract.
I created the unicharset file.
I created a font_properties file. There's no guidance on the site about when fraktur should be used. So I've tried it both this way (fraktur on for Lincoln):
eng.lincoln.box 0 0 0 0 1
And this way (fraktur off):
eng.lincoln.box 0 0 0 0 0
And finally, I've tried this with and without dictionary files. When I used dictionary files, they were the wordmap from my search engine, Sphinx, and they have about 15K common words and about 20K uncommon ones.
In all cases, when I try to OCR the first couple lines of this file (3MB), the quality is abysmal. Rather than getting:
United States Court of Appeals
for the Federal Circuit
I get:
OniteiJ %tates C0urt of QppeaIs
for the jfeI1eraICircuit
Why?
I think you'll need a lot more samples (letters) and better training images (clean background, grayscale, 300 DPI, etc.). And try to train with only one font (for instance, Lincoln) first. You can use jTessBoxEditor tool to generate your training images and edit the box files.
Once you master the training process, you can add other fonts to your training. You can test the success of the resultant language data by using it in performing OCR on the training image itself -- the recognition rates should be high.
The font names in font_properties should be like:
lincoln 0 0 0 0 1
I am not a Tesseract expert but I have evaluated nearly every OCR engine available and my comments are based on my experience over the years of analysing OCR errors.
Just wondering why your image has speckles in the background and not a pure white background. I don't know how Tesseract or the training tool works but the background could be causing some problems.
Just reading the sample page is difficult and requires a large amount of concentration. Characters such as F and I are very similar as are U and N. Tesseract like many OCR engines would be using many different techniques to recognise a character and there is not a whole lot difference between many of these characters in terms of the strokes and curves used in the font.
These characters, especially the uppercase characters would confuse many different matching algorithms just because they are so different to standard Latin / Roman type characters. This shows through in your results ie. All capital letters have an OCR error.