Can Tesseract OCR be extended or trained?

I'm looking for an OCR library that allows me to read text in an image, but only text that is circled. I'd like some feedback on Tesseract OCR for this task. It looks powerful but complex. How would it be used here? Can it be trained for something like this, or would it have to be extended?

Yes, Tesseract is fully trainable, and it also happens to support text in a circle (page segmentation mode 9). Give it a try.
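For reference, a minimal sketch of invoking that mode from Python with pytesseract (assuming pytesseract and Pillow are installed and the tesseract binary is on the PATH; the file name is hypothetical):
from PIL import Image
import pytesseract

# --psm 9 is Tesseract's "single word in a circle" page segmentation mode
# (older 3.x builds spell the flag -psm 9).
img = Image.open("circled_word.png")
print(pytesseract.image_to_string(img, config="--psm 9"))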

Related

How to properly OCR typewriter fonts using tesseract and python

I am using Tesseract-OCR version 3.05-dev from Python to OCR some documents. The main issue I have is with the number 4 in the typewriter font: it almost always misses it and outputs either nothing or some incorrect text instead of 4.
I have uploaded a sample image.
I don't have to use Tesseract either; if you have suggestions for other (better) engines out there, please let me know.
If you are looking for digits only, you could add a whitelist that contains only digits. Example in C++:
tesseract::TessBaseAPI api;
api.Init(nullptr, "eng");  // load the default English traineddata
api.SetVariable("tessedit_char_whitelist", "0123456789");  // restrict recognition to digits
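Since the question mentions Python, roughly the same thing can be done through pytesseract by passing the variable on the command line (a sketch; the file name is hypothetical):
from PIL import Image
import pytesseract

# -c sets a Tesseract config variable; the whitelist restricts output to digits.
text = pytesseract.image_to_string(
    Image.open("label.png"),
    config="-c tessedit_char_whitelist=0123456789",
)
print(text)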
If that doesn't work, I suggest you train Tesseract for this specific font. A good and clear guide can be found here: https://medium.com/apegroup-texts/training-tesseract-for-labels-receipts-and-such-690f452e8f79#.mpllnzu57
Hope this helps solve your problem. :)

AS3 - Extract text from image

Is it possible to extract the text from an image like this?
(I'd like to display it in a text field afterwards.)
Thanks.
Uli
What you're looking for is Optical Character Recognition. Here is a similar question:
OCR Actionscript
Though sadly it has no clear-cut answer. There is no native class/framework for doing it in AS3, though I'm sure it's possible.
This is a task where you'd employ a web service. I know Google Docs can OCR an image for you. ABBYY, whose FineReader is one of the best in the business, also provides an OCR web service. Google has open-sourced their OCR software, so you could conceivably set it up on your own server.
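As a rough sketch of the server-side half (assuming Flask and pytesseract, which are not part of AS3 or the services named above), the AS3 client would POST the image bytes with URLLoader and read back plain text:
from flask import Flask, request
from PIL import Image
import io
import pytesseract

app = Flask(__name__)

@app.route("/ocr", methods=["POST"])
def ocr():
    # The request body is expected to hold the raw image bytes sent by the client.
    image = Image.open(io.BytesIO(request.data))
    return pytesseract.image_to_string(image)

if __name__ == "__main__":
    app.run()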

OCR graph paper

I would like to take a PDF of a scanned graph-paper notebook (with handwriting) and turn it into a text file.
How can I do this?
Thanks
Check out an OCR library, like OCRopus. I don't think it takes PDF, so you may have to convert it to a TIFF or JPEG first.
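A minimal sketch of that pipeline in Python, using pdf2image (which needs the poppler tools installed) to rasterize the pages and Tesseract via pytesseract rather than OCRopus; this would only pick up printed text, not the handwriting:
from pdf2image import convert_from_path
import pytesseract

# Rasterize each PDF page at 300 dpi, OCR it, and append the text to one file.
pages = convert_from_path("notebook.pdf", dpi=300)
with open("notebook.txt", "w") as out:
    for page in pages:
        out.write(pytesseract.image_to_string(page))
        out.write("\n")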
There are OCR libraries that handle typed text (OCRopus, Tesseract, etc.).
There are also Java-based handwriting-recognition libraries. I am not sure if OCRopus has that ability; one library I was looking into for handwriting recognition was:
Online Video
Java Neural Networks
Conceivably you could take the PDF, convert it into a TIFF if need be (depending on the software), and it would give you something.
Good luck!
If you have the notebook as a PDF file, you could e-mail it to a Gmail account; Gmail then lets you "view" the PDF from within your browser as an HTML page. Still, the pages remain images.
If you want the text out of it, OCR might work, but it may also be incapable of extracting it.

Image processing/enhancement algorithms for document OCR / readability?

I'm looking for algorithms, papers, or software to enhance faxes, images from cell phone cameras, and other similar source for readability and OCR.
I'm mainly interested in simple enhancements (e.g. things you could do using ImageMagick), but I'm also interested in more sophisticated techniques. I'm already talking to vendors, so for this question I'm mostly looking for algorithms or open source software.
To further clarify: I'm not looking for OCR software or algorithms; I'm looking for algorithms to clean up the image so it looks more readable to the human eye, and can possibly be used for OCR.
I had a similar problem when I was writing some software to do book scanning; floating around on the internet is a program called pagetools that does straightening of scanned-in pages using a fairly clever mathematical trick called the Radon transform.
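The deskewing trick boils down to something like the following (a rough sketch, not pagetools' actual code): rotate the binarized page by a handful of candidate angles and keep the one whose row-projection profile has the highest variance, which amounts to sampling the Radon transform at those angles. The function assumes a NumPy array with ink pixels as 1:
import numpy as np
from skimage.transform import rotate

def estimate_skew(binary, angles=np.arange(-5.0, 5.5, 0.5)):
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        # Sum of ink per row; this profile is sharpest when text lines are level.
        profile = rotate(binary.astype(float), angle, resize=False).sum(axis=1)
        score = np.var(profile)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle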
I also wrote a small routine that would white out the blank space on the page; OCR algorithms tend to do a lot better when they don't have to contend with background noise. What I did was look for light-colored pixels that were more than a small radius away from dark-colored ones, and then boost those up to pure white.
It's been a few years, though, so I don't have the exact implementation details handy.
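A rough reconstruction of that routine (an assumption, since the exact details are gone): compute each pixel's distance to the nearest dark pixel and push light pixels outside a small radius to pure white. The thresholds and radius are made-up parameters, and the input is assumed to be an 8-bit grayscale NumPy array:
import numpy as np
from scipy import ndimage

def white_out_background(gray, dark_thresh=100, light_thresh=180, radius=5):
    # Distance from every pixel to the nearest "ink" (dark) pixel.
    ink = gray < dark_thresh
    dist = ndimage.distance_transform_edt(~ink)
    cleaned = gray.copy()
    # Light pixels far from any ink are background noise; force them to white.
    cleaned[(gray > light_thresh) & (dist > radius)] = 255
    return cleaned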
One simple image filter to look into is the median filter, which is very straightforward and easy to implement yourself, and helps clean up scanned/photographed text: http://en.wikipedia.org/wiki/Median_filter
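For example, a couple of lines with SciPy (assuming a grayscale scan loaded with Pillow; the file names are hypothetical):
import numpy as np
from PIL import Image
from scipy import ndimage

gray = np.array(Image.open("scan.png").convert("L"))
# Replace each pixel with the median of its 3x3 neighborhood to suppress salt-and-pepper noise.
denoised = ndimage.median_filter(gray, size=3)
Image.fromarray(denoised).save("scan_denoised.png")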
As requested, link to Wikipedia: Optical character recognition
Microsoft Research: Optical character recognition papers
CiteSeerX: Papers on optical character recognition

How does Google Books find text regions?

One challenging topic in computer vision is processing document scans. Typically this involves a number of steps, like noise removal, color analysis, binarization, text block identification, OCR, and then maybe some context analysis and correction.
I'm curious if anyone understands, knows or can point me to literature on how Google identifies text blocks prior to the OCR stage. Any insights?
I believe Google uses the Tesseract OCR engine in conjunction with another tool called Ocropus, both of which are open-source. I don't know anything about how they work but you may be interested in checking out the code, available at the above links.
This is second-hand information from the digitization specialist in my library, but it seems that Google's approach is to just throw everything through the automated process, OCR anything that looks like text, and not fuss too much about cropping individual images or doing much semantic analysis to look for image captions, etc. They may be doing subtle things that aren't obvious, but on the surface they are definitely gunning for quantity over quality, which is smart for them to do for their purposes, IMO.