OCR (Optical Character Recognition)

I have a question that the search engine results did not make clear.
Can OCR (Optical Character Recognition) read CAPTCHAs, QR codes and barcodes?
1. CAPTCHAs
2. QR codes
3. Barcodes
4. Licence codes

1. It depends on the CAPTCHA. Standard OCR isn't meant for CAPTCHA breaking. A simple CAPTCHA can be preprocessed and then fed to an OCR engine, and sometimes that works. In general, though, CAPTCHA breaking is much more complex than downloading the Tesseract binaries. If it were that easy, all of the paid services would be out of business overnight.
2. QR codes and barcodes are both optical machine-readable data systems capable of conveying large amounts of data. Both are extremely useful in their own right. They have important differences, but none that matter for your question, so see point 3.
3. The error correction capabilities of barcode recognition engines are way beyond those of OCR engines. A damaged barcode can easily be read. Also, most barcodes either work or they don't: OCR can confidently misread letters, while barcodes are "fail-safe".

Related

Barcode recognition using OCR

I am trying to recognize a barcode using a simple CNN, treating it like a multi-digit recognition problem.
The results are not very good, so I was looking for some better deep learning models for the same task. During my search, I did not find any OCR model being tried on barcodes. So my question is: can OCR models be trained to recognize barcodes? I find the task of barcode detection and recognition very similar to text recognition. Is there something I am missing?
While CNNs can be used to read the contents of the barcode, especially in the scenario where massive datasets of images are available for training, it is tough to match the performance of a classical barcode reading algorithm with standard AI approaches.
The difference between reading the text and reading the barcode is structural. Text is fundamentally unstructured, while barcodes are designed to be structured for readability using specifically engineered decoding algorithms.
All these algorithms for reading have rules which are, in many cases, not so hard to implement. On the other hand, CNNs would have a hard time and need vast amounts of data to learn those rules.
Also, many barcode symbologies (EAN included) use error detection or correction algorithms (like check-digits), which can be integrated into the error-recovery loop to increase the performance of the scanning further.
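As a minimal illustration of the check-digit idea (this is only the standard EAN-13 checksum rule, not any particular vendor's scanner), the sketch below validates a decoded digit string. The function names are made up for this example. A scanner could reject a decode that fails this test and try again, which is something plain OCR output does not offer.

```python
def ean13_check_digit(first12: str) -> int:
    """Return the EAN-13 check digit for the first 12 digits of a code."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly 12 digits")
    # Counting from the left, odd positions have weight 1 and even positions weight 3.
    # enumerate() is zero-based, so an odd index i is an even position.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """True if `code` is 13 digits and its last digit matches the checksum."""
    return (
        len(code) == 13
        and code.isdigit()
        and ean13_check_digit(code[:12]) == int(code[12])
    )


if __name__ == "__main__":
    print(is_valid_ean13("4006381333931"))  # a code whose checksum matches
    print(is_valid_ean13("4006381338931"))  # same code with one digit misread
```

Because the weights 1 and 3 are both coprime with 10, any single misread digit changes the checksum, so a one-digit error is always detected.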
So, in theory, OCR and Barcode scanning are similar problems, while in practice, there are substantial differences.
Note: I'm working at Microblink, where we do R&D in the area of barcode scanning and text recognition. When it comes to barcode scanning, we've tried basically everything in the AI repertoire to get the most out of it, and ended up using both CNNs and classical algorithms working tightly together.

POC Help - OCR exploit to run code gathered from JPEG

I have been thinking about security concerns with regard to OCR programs such as Tesseract.
My theory is that malicious code printed out in plain text can be photographed and saved as an image file. (This leaves the hex and headers free from any change.)
Then, using OCR, the JPEG could be converted to greyscale and the characters then read and executed, perhaps via an exploit within the OCR application.
Looking back at the way certain worms could self-execute in Windows via preview, perhaps something similar can be done using the above method.
I imagine it's one of the key security concerns for a company developing an OCR application, so this may be very hard to provide a proof of concept for.
If anyone would like to explore this concept, or perhaps explain why it is, or indeed is not, possible, I would appreciate it.
This is my first post so sorry if any forum rules have been missed.

How to improve OCR results

I tried to improve the results of open-source OCR software. I'm using Tesseract, because I find it still produces better results than GOCR, but with bad-quality input it has huge problems. So I tried to preprocess the image with various tools I found on the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manually using GIMP
But I was not able to get good results with this bad test document (really just for testing, I don't need the content of this file):
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document:
http://www.onlineocr.net/
I'm wondering if it is possible to get similar results with Tesseract by using smart preprocessing. Are the open-source OCR engines really so bad compared to commercial ones? Even Google uses Tesseract to scan documents, so I was expecting more...
Tesseract's recognition precision is a little lower than that of the best commercial engine (ABBYY FineReader), but it's more flexible because of its nature.
This flexibility sometimes entails some preprocessing, because Tesseract cannot handle every situation on its own.
Actually, it is used by Google because Google is its main sponsor!
The first thing you could do is to enlarge the text so that the characters are at least 20 pixels wide. Since Tesseract uses the main segments of the characters' outlines as features, it needs larger characters than other algorithms do.
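A rough sketch of that enlargement step, assuming you have already measured the typical character width in pixels. The helper names are hypothetical, the image is a plain list of pixel rows to keep the example dependency-free, and nearest-neighbour resampling is used only for simplicity. In practice a smoother interpolation from an imaging tool will probably give better input for OCR.

```python
import math


def scale_factor(char_width_px: float, target_px: float = 20.0) -> int:
    """Smallest integer factor that makes characters at least target_px wide."""
    if char_width_px <= 0:
        raise ValueError("character width must be positive")
    return max(1, math.ceil(target_px / char_width_px))


def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling: repeat every pixel `factor` times on both axes."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        # Copy the widened row so the output rows are independent lists.
        out.extend(list(wide) for _ in range(factor))
    return out


if __name__ == "__main__":
    factor = scale_factor(8)  # characters measured at about 8 px wide
    tiny = [[0, 255], [255, 0]]
    for row in upscale_nearest(tiny, factor):
        print(row)
```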
Another thing you could try, again with the test document you mentioned, is to binarize your image with an adaptive thresholding method (you can find some information about that at https://dsp.stackexchange.com/a/2504), because the illumination varies across the page. Tesseract binarizes the image internally, but this could be a case where it fails to do so (it is similar to the example in "Improving the quality of the output with Tesseract", where you can also find other useful information).
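To make the adaptive thresholding idea concrete, here is a minimal pure-Python sketch of one common variant, local mean thresholding. It is not what Tesseract does internally, and the `radius` and `offset` defaults are assumptions picked for illustration. Each pixel is compared with the mean of its own neighbourhood instead of one global cut-off, so a brightness gradient across the page does not swallow the text.

```python
def adaptive_threshold(gray, radius=7, offset=10):
    """Binarize a grayscale image given as a list of rows of 0..255 ints.

    A pixel becomes 0 (ink) if it is more than `offset` darker than the mean
    of its (2*radius+1) x (2*radius+1) neighbourhood, otherwise 255 (paper).
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border: integral[y][x] is the sum of all
    # pixels above and to the left of (y, x), so any box sum costs 4 lookups.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    out = []
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        out_row = []
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # pixel < mean - offset, rewritten without division
            is_ink = gray[y][x] * area < total - offset * area
            out_row.append(0 if is_ink else 255)
        out.append(out_row)
    return out


if __name__ == "__main__":
    # Synthetic page: background fades from bright (left) to dark (right),
    # with one ink pixel on each side.
    page = [[200 - 3 * x for x in range(40)] for _ in range(9)]
    for x in (5, 35):
        page[4][x] -= 60
    binary = adaptive_threshold(page, radius=3)
    print(binary[4][5], binary[4][35])
```

In this synthetic page the ink pixel on the bright side (value 125) is lighter than the bare background on the dark side (95 and below), so no single global threshold can separate ink from paper, while the local comparison handles both.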

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to detect the text in images that contain only text, but the text is set in a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any way to improve the result, or better still, to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
As Andrew Cash mentioned, it'll be very hard to perform OCR on that letter T because it overlaps several of the following characters.
To improve the results you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work at ABBYY and can provide additional info on our products if necessary. I've sent the image you attached to our SDK and got this response:
Maximal size: lall (35)

Handwriting recognition with simple training

I've been reading about (and trying) the OCR programs suggested in previous answers, but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would span multiple lines, but each line is only one or two words long. The text comes from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source and Java), but it doesn't work well even with its own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it in my own app which will be offered as SaaS)
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far the only one giving better results than even ABBYY 11, using the same text for both, has been SimpleOCR, which is freeware for OCR but only a 14-day trial for HCR!
I know I am answering after nearly 6 years, but if anyone's still looking, try using TensorFlow. Their website has a simple example for handwritten digit recognition (MNIST). You can take this example and adapt it to handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get it).