How can a scanned page be divided into words like the reCaptcha project?

I would like to digitize a book in a similar way to the reCaptcha project. Is there already a system for inputting an image and then outputting little images cropped around words? Any ideas on how to do this?

You should look into the Tesseract OCR project, on which reCaptcha was probably based. It can output the coordinates of recognized words; you then crop the page to those coordinates and you are done.
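A minimal sketch of that approach, assuming Python with the pytesseract wrapper installed (the input file name is just a placeholder):

    # Sketch: get word-level bounding boxes from Tesseract and save one crop per word.
    from PIL import Image
    import pytesseract

    page = Image.open("scanned_page.png")  # placeholder input image
    data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip empty entries (block/line records, whitespace)
        left, top = data["left"][i], data["top"][i]
        width, height = data["width"][i], data["height"][i]
        page.crop((left, top, left + width, top + height)).save(f"word_{i:04d}.png")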

If you just want to split the image into multiple images, one word each, you could find the word bounding boxes and then use those coordinates for the splitting. This can be done by taking histograms/projections of the document in the horizontal direction and then, for each line, in the vertical direction. An example algorithm, with some pictures describing the idea, can be found in the paper "Document Page Decomposition by the Bounding-Box Projection Technique" (http://haralick.org/conferences/71281119.pdf). You could implement this in OpenCV.
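A rough sketch of that projection idea with OpenCV and NumPy; the Otsu threshold and the horizontal dilation width are assumptions that would need tuning for real scans:

    # Sketch: split a page into word boxes via horizontal, then vertical projections.
    import cv2
    import numpy as np

    def runs(profile):
        """Return (start, end) pairs of consecutive non-zero entries in a 1-D profile."""
        spans, start = [], None
        for i, value in enumerate(profile):
            if value and start is None:
                start = i
            elif not value and start is not None:
                spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(profile)))
        return spans

    gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    word_boxes = []
    for top, bottom in runs(binary.sum(axis=1)):          # text lines (row projection)
        line = binary[top:bottom]
        # smear horizontally so gaps between letters close but gaps between words remain
        smeared = cv2.dilate(line, np.ones((1, 9), np.uint8))
        for left, right in runs(smeared.sum(axis=0)):     # words (column projection)
            word_boxes.append((left, top, right, bottom))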
Alternatively, you can use Tesseract as mentioned by beppe9000. Perhaps this helps: Getting the bounding box of the recognized words using python-tesseract
But then you get the whole complexity of training OCR even though you only want the bounding boxes.

Related

Polygonal Search

I have read several of the posts concerning Polygonal Search, but they are all about fixing or updating the programs. I am just wondering how it works. If there is a way I can get something like pseudo code of it or an explanation of how a shape captures the data points.
To further specify my goal: I am trying to make a fixed square that is held over a map (such as Google Maps); the map can move around behind the square, but the square will continue to report whatever cities lie within its bounds. [I will eventually proceed to building it, I just need some guidance]
Thank you.
There is an open-source library which has a function to check whether two shapes overlap. You can check the source code:
http://turfjs.org/static/docs/module-turf_inside.html
If you are looking for the theory behind it, check the hyperplane separation theorem.
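For intuition, here is a hedged sketch of the even-odd (ray casting) test that such libraries typically use under the hood; the square's corners and the city coordinates below are made up for illustration:

    # Sketch: report which points fall inside a polygon using the even-odd rule.
    def point_in_polygon(x, y, polygon):
        inside = False
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            # does this edge cross the horizontal ray shooting right from (x, y)?
            if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
        return inside

    square = [(10.0, 50.0), (12.0, 50.0), (12.0, 52.0), (10.0, 52.0)]  # lon/lat corners
    cities = {"A-town": (11.0, 51.0), "B-ville": (13.5, 49.0)}         # made-up data
    visible = [name for name, (lon, lat) in cities.items()
               if point_in_polygon(lon, lat, square)]
    print(visible)   # -> ['A-town']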

OCR diagonally written text

After a broad search of keywords in Google Scholar, Google Images, and the web, I cannot find anything related to OCR of diagonal text. There are a few close examples:
The page about OpenCV preprocessing of a document for skew is close, but it relates to the entire page.
This document has an example with no skew and a mix of horizontal and diagonal text; the question there does not relate to the diagonal text, though it is a good example.
So, presumably, functions for diagonal text fields do not exist in OpenCV. Is this true? And how are diagonal text fields handled?
It seems you want to perform OCR on a page with both horizontal and diagonal text. There is no straightforward solution in terms of OpenCV, but you could take a divide-and-conquer approach such as:
Partition the image according to prior knowledge about the document (common with forms), or the distribution of white regions (column spacing etc.)
Identify regions where there is a possibility of diagonal text (one method is to look for fat diagonal lines after blurring and thresholding)
Rotate the partition and perform OCR
Merge results for different partitions
You can also try a brute force approach like rotating the image by a range of angles and performing OCR on all of them. The results will have to be merged.
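A sketch of that brute-force idea, assuming OpenCV plus pytesseract; the angle range, the step, and the mean-confidence score used to pick a winner are all assumptions:

    # Sketch: rotate through candidate angles, OCR each one, keep the best-scoring result.
    import cv2
    import pytesseract

    img = cv2.imread("form_with_diagonal_text.png")       # placeholder input
    h, w = img.shape[:2]

    best = (None, -1.0, "")                                # (angle, score, text)
    for angle in range(-60, 61, 15):                       # assumed range and step
        matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(img, matrix, (w, h), borderValue=(255, 255, 255))
        data = pytesseract.image_to_data(rotated, output_type=pytesseract.Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        score = sum(confs) / len(confs) if confs else 0.0
        if score > best[1]:
            best = (angle, score, pytesseract.image_to_string(rotated))

    best_angle, best_score, best_text = best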

How to remove graphics from a scanned document before passing it to Tesseract for OCR?

I'm working on an OCR project, but I don't know how to remove graphics from the scanned document image before passing it to Tesseract.
Some scanned documents from which I want to remove the graphics are below:
http://www.mediafire.com/view/hvmpty2z3cw3vao/IMG_0087.JPG
http://www.mediafire.com/view/1sgy5s2aaj2o8y3/IMG_0086.JPG
Any advice is greatly appreciated. Many thanks.
As the text areas are usually sparse and not connected to each other, you could run Sobel edge detection on the original image and then, with some threshold, find the largest connected area; that area is the graphic.
Alternatively, since the graphic is a rectangular area, another way is to apply a Hough transform to detect the four straight lines that form its rectangle. If you go this way, it is recommended to scale the image down first to reduce the computational cost.
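A sketch of the first suggestion, assuming OpenCV; the edge threshold and the dilation size are guesses that would need tuning for these scans:

    # Sketch: find the largest connected blob of edges and paint it out before OCR.
    import cv2
    import numpy as np

    gray = cv2.imread("IMG_0087.JPG", cv2.IMREAD_GRAYSCALE)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))

    _, mask = cv2.threshold(edges, 60, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))       # merge nearby edge pixels

    count, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if count > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])   # label 0 is the background
        x, y, w, h, _ = stats[largest]
        cleaned = gray.copy()
        cleaned[y:y + h, x:x + w] = 255                         # white out the graphic
        cv2.imwrite("cleaned.png", cleaned)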
You can start by detecting text areas using an algorithm available in AForge.Net. See HorizontalRunLengthSmoothing and VerticalRunLengthSmoothing. The algorithm is not very complicated, and you can implement it easily using your favorite image processing library. The only constraint is that you need to know approximately the size of the characters in your images.
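If you are not on .NET, here is a rough NumPy equivalent of the horizontal run-length smoothing idea; the gap threshold is an assumption tied to the approximate character size mentioned above:

    # Sketch: fuse characters into solid text blocks by filling short white gaps.
    import cv2
    import numpy as np

    def horizontal_rlsa(binary, max_gap):
        """binary: uint8 image with 255 = ink. Fill horizontal gaps of <= max_gap pixels."""
        smoothed = binary.copy()
        for row in smoothed:                       # each row is a view into `smoothed`
            ink = np.flatnonzero(row)
            for a, b in zip(ink[:-1], ink[1:]):
                if b - a <= max_gap:
                    row[a:b] = 255
        return smoothed

    gray = cv2.imread("scan.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder input
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    blocks = horizontal_rlsa(binary, max_gap=25)                # ~2-3x the character width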

How to make Tesseract give relevant results in the presence of noise?

I am using Tesseract 3.0.0 and I ran into the following problem:
When something is too small for Tesseract to recognize, it seems to get merged with other fragments. As a result, nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to Tesseract. Above the rectangle is the result (V over T means a new line).
The last case is the problematic one. Is there some way to improve Tesseract in situations like this?
As far as I know, Tesseract does not have proper image segmentation yet (or document analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image gets split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only on the text areas and don't face the problems you have just described.
Earlier versions of Tesseract did not have that functionality at all; Tesseract was supposed to be used as a line recognizer only, or a so-called field-level recognizer, where you use it on small snippets of text cut from a bigger image.
I have not followed thoroughly what was introduced in 3.0; probably it is already there partially, but obviously it does not work as expected, as you have just found out.
There is another open-source project, OCRopus, that approached this problem exactly as I described: first document analysis (aka segmentation) and only then OCR. Their earlier versions actually used Tesseract for OCR after the analysis step finished, but later they introduced their own OCR engine (which is still not very good) and moved Tesseract plugin support down the priority list.
Here's what you actually can do to address your problem:
If your images have a very typical structure, you can try to do some dumb segmentation and cut the text from the image yourself before passing it to Tesseract (see the sketch below). However, if you expect to support a wide variety of images, just forget it.
You can check OCRopus and see if their segmentation works for your images. If it does, you can spend some time making OCRopus + Tesseract work together.
Well, if what you are doing is not just for fun and you value your time, I would recommend looking at a real OCR engine like ABBYY. You will get much higher accuracy for both segmentation and OCR out of the box, and professional customer support, of course.
Disclaimer: I work for ABBYY
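To illustrate the first option ("cut the text from the image yourself"), here is a minimal sketch that crops fixed, known field regions out of the page and feeds each one to Tesseract as a single line. The field names and coordinates are made up, and pytesseract is assumed to be installed:

    # Sketch: field-level recognition on manually defined regions.
    import cv2
    import pytesseract

    img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)        # placeholder input
    fields = {                                                 # hypothetical (x1, y1, x2, y2)
        "invoice_no": (120, 40, 300, 70),
        "total": (400, 500, 560, 535),
    }

    for name, (x1, y1, x2, y2) in fields.items():
        snippet = img[y1:y2, x1:x2]
        # --psm 7: treat the snippet as a single text line (-psm on very old 3.x builds)
        text = pytesseract.image_to_string(snippet, config="--psm 7")
        print(name, text.strip())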

How to find pixel co-ordinates of corners of a square pattern?

This may not be programming related, but programmers would probably be in the best position to answer it.
For camera calibration I have an 8 x 8 square pattern printed on a sheet of paper. I have to manually enter the pixel coordinates of the squares' corners into a text file; the software then picks them up from there and computes the calibration parameters.
Is there a script or some software that I can run on these images and get the pixel co-ordinates of the 4 corners of each of the 64 squares?
You can do this with a traditional chessboard pattern (i.e. black and white squares with no gaps) using cvFindChessboardCorners(). You can read more about the function in the OpenCV API Reference and see some sample code in O'Reilly's OpenCV Book or elsewhere online. As an added bonus, OpenCV has built-in functions that calculate the intrinsic parameters of the camera and an array of extrinsic parameters for the multiple views of a planar calibration object.
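A sketch of that workflow with OpenCV's Python bindings; note that an 8 x 8 board of squares has 7 x 7 inner corners, and the file names below are placeholders:

    # Sketch: detect chessboard inner corners and dump their pixel coordinates.
    import cv2
    import numpy as np

    pattern = (7, 7)                                # inner corners of an 8 x 8 board
    gray = cv2.imread("calib_view.jpg", cv2.IMREAD_GRAYSCALE)

    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        # one "x y" line per corner, ready for the calibration software's text file
        np.savetxt("corners.txt", corners.reshape(-1, 2), fmt="%.3f")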
I would:
apply threshold and get binarized image.
apply a SobelX filter to the image. You get an image with the vertical lines. These belong to the sides of the squares that are almost vertical. Keep this as image1.
apply a SobelY filter to the image. You get an image with the horizontal lines. These belong to the sides of the squares that are almost horizontal. Keep this as image2.
take the intersection (image1 AND image2). You get a black image with white pixels indicating the corner positions.
Hope it helps.
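A sketch of those steps with OpenCV; the Otsu and edge thresholds are assumptions:

    # Sketch: threshold, take X and Y Sobel responses, intersect them to mark corners.
    import cv2
    import numpy as np

    gray = cv2.imread("pattern.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder input
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    sobel_x = cv2.convertScaleAbs(cv2.Sobel(binary, cv2.CV_32F, 1, 0, ksize=3))
    sobel_y = cv2.convertScaleAbs(cv2.Sobel(binary, cv2.CV_32F, 0, 1, ksize=3))
    _, image1 = cv2.threshold(sobel_x, 127, 255, cv2.THRESH_BINARY)   # near-vertical sides
    _, image2 = cv2.threshold(sobel_y, 127, 255, cv2.THRESH_BINARY)   # near-horizontal sides

    corners = cv2.bitwise_and(image1, image2)        # white where both responses overlap
    ys, xs = np.nonzero(corners)
    print(list(zip(xs.tolist(), ys.tolist()))[:10])  # first few candidate corner pixels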
I'm sure there are many computer vision libraries with varying capabilities and licenses out there, but one that I can remember off the top of my head is ARToolKit, which should be able to recognize this pattern. And if that's not possible, it comes with a set of very good patterns that are tailored so that they can be recognized even if they're partially obscured.
I don't know ARToolKit (although I've heard a lot about it), but with OpenCV this processing is trivial.