OCR diagonally written text

After a broad search of keywords in Google Scholar, Google Images, and the web, I cannot find anything related to OCR of diagonal text. There are a few close examples:
The page about OpenCV preprocessing a document to correct skew is close, but it relates to the entire page.
This document has an example with no skew and a mix of horizontal and diagonal text, but the question there does not relate to the diagonal text, though it is a good example.
So, presumably, functions for diagonal text fields do not exist in OpenCV. Is this true? And how are diagonal text fields handled?

It seems you want to perform OCR on a page with both horizontal and diagonal text. There is no straightforward solution in OpenCV, but you could take a divide-and-conquer approach such as:
Partition the image according to prior knowledge about the document (common with forms), or the distribution of white regions (column spacing etc.)
Identify regions where diagonal text is possible (one method is to look for fat diagonal lines after blurring and thresholding)
Rotate the partition and perform OCR
Merge results for different partitions
You can also try a brute force approach like rotating the image by a range of angles and performing OCR on all of them. The results will have to be merged.
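A minimal sketch of the brute-force rotation idea, assuming Python with OpenCV and the pytesseract wrapper; the angle range, the step size, and the use of mean word confidence to pick the winner are arbitrary choices, not something prescribed above:

```python
# Rotate a region through a range of angles, OCR each rotation, and keep the
# result Tesseract is most confident about. Thresholds and angles need tuning.
import cv2
import pytesseract


def ocr_best_rotation(region, angles=range(-60, 61, 5)):
    h, w = region.shape[:2]
    best_text, best_conf, best_angle = "", -1.0, 0
    for angle in angles:
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rotated = cv2.warpAffine(region, M, (w, h), flags=cv2.INTER_CUBIC,
                                 borderValue=(255, 255, 255))
        data = pytesseract.image_to_data(rotated,
                                         output_type=pytesseract.Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        if not confs:
            continue
        mean_conf = sum(confs) / len(confs)
        if mean_conf > best_conf:
            words = [t for t in data["text"] if t.strip()]
            best_text, best_conf, best_angle = " ".join(words), mean_conf, angle
    return best_text, best_conf, best_angle


# Usage (file name is a placeholder):
# text, conf, angle = ocr_best_rotation(cv2.imread("diagonal_field.png"))
```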

Related

Can I denote a glyph as being two chars (NA) in a box file in Tesseract 3.05

I am using Tesseract 3.05 for reasons beyond my control. I am using source files to train the engine to detect this unique font. As I have a vast number of samples, I am simply using the samples themselves as the training images rather than segmenting them into a font training image, since this should give more variation and expose the training to the specific spacing issues this font has.
When generating the box files, some letters touch at the corners (i.e. there is no clear break between the glyphs), so they are detected as one glyph instead of two separate glyphs. For example, it sometimes struggles with NA, as the front serif of the A has bled into the serif of the N. The image pre-processing I have applied has improved things by leaps and bounds, but there are still some cases I cannot correct in the image.
My question is this: can I simply denote the glyph as being NA in the box file?
If I cannot, what would be the simplest solution? Introducing another glyph box seems like it wouldn't be a good idea, and the only other solution I can see is to manually edit the image to make the separation of the glyphs more obvious. That is itself antithetical, however, as this is exactly the kind of problem the font will have in the future that I am trying to OCR.
Thank you in advance; the documentation isn't specific about whether I can correct a box glyph to be two characters instead of just one (or I just haven't found the relevant section where this is explained).
After scouring the documentation, I managed to find a lone paragraph that wasn't appearing in my website scraping:
"If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both. (As of 3.00, there is a limit of 24 bytes for the description of a "character". This will allow you between 6 and 24 unicodes to describe the character, depending on where your codes sit in the unicode set. If anyone hits this limit, please file an issue describing your situation.)"
Thus you can do what I ask: represent a glyph with two or more characters in a box file for Tesseract.
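For reference, each line of a box file has the form "glyph left bottom right top page", so the joined pair simply becomes one line whose description holds both characters and whose bounding box covers them both (the coordinates here are made up for illustration):

```
NA 112 38 189 80 0
```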

How can a scanned page be divided into words like the reCaptcha project?

I would like to digitize a book in a similar way to the reCaptcha project. Is there already a system for inputting an image and then outputting little images cropped around words? Any ideas on how to do this?
You should look into the Tesseract OCR project, on which reCaptcha was probably based. It can output the coordinates of recognized words; you then crop the page to those coordinates and you are done.
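A rough sketch of that flow, assuming the pytesseract wrapper and Pillow (the file name is a placeholder and word filtering is minimal):

```python
# Ask Tesseract for per-word boxes, then crop one image per recognized word.
import pytesseract
from PIL import Image

page = Image.open("scanned_page.png")
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

for i, word in enumerate(data["text"]):
    if not word.strip():
        continue  # skip empty entries (page/block/line records)
    x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
    page.crop((x, y, x + w, y + h)).save(f"word_{i:04d}.png")
```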
If you just want to split the image into multiple images of one word each, you could try to find the word bounding boxes and then use those coordinates for the splitting. This can be done by taking histograms/projections of the document in the horizontal direction and then, for each line, in the vertical direction. An example algorithm with some pictures describing the idea can be found in the paper "Document Page Decomposition by the Bounding-Box Projection Technique" (http://haralick.org/conferences/71281119.pdf). You could implement this in OpenCV.
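A minimal NumPy/OpenCV sketch of that projection idea; the gap thresholds are arbitrary and would need tuning against real scans:

```python
# Row sums locate the text lines; column sums inside each line locate words,
# with small inter-character gaps merged and only wide gaps treated as breaks.
import cv2
import numpy as np

img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.threshold(img, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]


def runs(profile, min_gap=1):
    """(start, end) spans where the profile is non-zero, merging gaps shorter than min_gap."""
    nonzero = np.flatnonzero(profile > 0)
    if nonzero.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(nonzero) > min_gap)
    starts = np.concatenate(([nonzero[0]], nonzero[breaks + 1]))
    ends = np.concatenate((nonzero[breaks] + 1, [nonzero[-1] + 1]))
    return list(zip(starts, ends))


word_boxes = []
for top, bottom in runs(binary.sum(axis=1)):               # text lines
    line = binary[top:bottom]
    for left, right in runs(line.sum(axis=0), min_gap=8):  # words in the line
        word_boxes.append((int(left), int(top), int(right), int(bottom)))
```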
Alternatively, you can use Tesseract as mentioned by beppe9000. Perhaps this helps: Getting the bounding box of the recognized words using python-tesseract.
But then you get the whole complexity of training OCR even though you only want the bounding boxes.

OCR: match a frame's position to a credit card field

I am developing an OCR application to read credit cards.
After scanning the image, I get a list of words with their positions.
Any tips/suggestions on the best approach to detect which words correspond to each field of the credit card (number, date, name)?
For example:
position = 96.00 491.00
text = CARDHOLDER
Thanks in advance
Your first problem is that most OCRs are not optimised for small amounts of text that take up most of the "page" (or card image, in your case) in spatially separated chunks. They expect lines, or pages of text from a scanned book or a newspaper. So straight away they're not likely to do that well at analysing the image.
Because the font is fairly uniform they'll likely recognise the characters well, but the layout will confuse the page segmentation algorithm and so the text you get out might not be in the right order. For example, the "1234" of the card number and the smaller "1234" below it constitute a single column of text, likewise the second two sets of four numbers and the expiration date.
For specialized cases where you know the layout in advance you really want to develop your own page segmentation algorithm to break up the image into zones, e.g. card number, card holder name, start and expiration dates. This shouldn't be too hard because I think the locations of these components are standardised on credit cards. Assuming good preprocessing and binarization you could basically do a horizontal histogram and split the image at the troughs.
Then extract each zone as a separate image containing just one line of text and feed it to the OCR.
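A rough sketch of that zoning step, assuming OpenCV plus pytesseract; the ink threshold and the use of --psm 7 (single text line) per strip are guesses to be tuned, not values taken from this answer:

```python
# Split the card into horizontal strips at the troughs of the row-ink profile
# and OCR each strip on its own.
import cv2
import pytesseract

img = cv2.imread("card.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

ink_per_row = (binary > 0).sum(axis=1)
is_text_row = ink_per_row > 0.02 * binary.shape[1]   # trough threshold (tune)

zones, start = [], None
for y, flag in enumerate(is_text_row):
    if flag and start is None:
        start = y
    elif not flag and start is not None:
        zones.append((start, y))
        start = None
if start is not None:
    zones.append((start, len(is_text_row)))

for top, bottom in zones:
    text = pytesseract.image_to_string(img[top:bottom], config="--psm 7")
    print(top, bottom, text.strip())
```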
Alternatively (the quick-and-dirty approach):
Instruct the OCR engine that what you want to recognise consists of a single column (i.e. prevent it from trying to figure out the page layout itself). You can do this with Tesseract using the -psm (page segmentation mode) parameter, probably set to 6 (but try it and see what gives you the best results).
Make Tesseract output hOCR format, which you can set in the config file. The hOCR format includes the bounding boxes of the output lines relative to the whole image.
Write an algorithm that compares the bounding boxes in the hOCR output to where you know each card component should be (looking for some percentage of overlap; it won't match exactly, for obvious reasons), as in the sketch below.
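A sketch of that last step, using pytesseract's image_to_data output rather than parsing hOCR directly; the zone rectangles are placeholder values, not real card measurements:

```python
# Get word boxes from Tesseract in single-block mode, then assign each word
# to the card field whose expected zone it overlaps the most.
import pytesseract
from PIL import Image

ZONES = {  # (left, top, right, bottom) in pixels - hypothetical values
    "number": (40, 140, 600, 200),
    "name": (40, 300, 400, 350),
    "expiry": (300, 240, 450, 290),
}


def overlap(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


card = Image.open("card.png")
data = pytesseract.image_to_data(card, config="--psm 6",
                                 output_type=pytesseract.Output.DICT)

fields = {name: [] for name in ZONES}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    box = (data["left"][i], data["top"][i],
           data["left"][i] + data["width"][i],
           data["top"][i] + data["height"][i])
    best = max(ZONES, key=lambda name: overlap(box, ZONES[name]))
    if overlap(box, ZONES[best]) > 0:
        fields[best].append(word)

print({name: " ".join(words) for name, words in fields.items()})
```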
In addition to the good tips provided by Mikesname, you can greatly improve the recognition result regardless of which OCR engine you use if you use image processing to convert the image to bitonal (pure black and white), such as the attached copy of your image.

Pictures with patterns sometimes display strangely

My site is a clothing store, and my partner has complained about the following issue.
The pictures of clothing with more complex patterns (a checkerboard, for example) display like this: instead of this:
I assume the other pictures also display oddly, but it's just less noticeable. As far as I can tell, it happens most often on Macs.
If anyone has any information about this phenomenon it would be much appreciated.
It's called a Moiré pattern: http://en.wikipedia.org/wiki/Moir%C3%A9_pattern
The best solution is to not resize the images at all, so they are displayed at 1:1 scaling. If that isn't possible, generate differently-sized images with a tool like Photoshop, which has better image-resize algorithms that avoid this problem, and then use HTML5's srcset attribute so the right image is loaded for the right DPI; see here: http://www.w3.org/html/wg/drafts/srcset/w3c-srcset/
This is called the Moiré effect. From a Wikipedia article:
In physics, mathematics, and art, a moiré pattern (/mwɑrˈeɪ/; French: [mwaˈʁe]) is a secondary and visually evident superimposed pattern created, for example, when two identical (usually transparent) patterns on a flat or curved surface (such as closely spaced straight lines drawn radiating from a point or taking the form of a grid) are overlaid while displaced or rotated a small amount from one another.
In the context of images, the overlaying comes from anti-aliased pixels (in the case of upsampling) or averaged pixels (in the case of downsampling).
To resize images properly, use high-quality resampling such as bicubic interpolation. Most browsers have built-in support for this, but certain conditions affect which strategy is selected (bicubic or bilinear), for example for performance reasons. The latter is more prone to this effect.
The effect can also be reduced by using a canvas to scale down the image. I have an article here on this topic and an SO answer here showing a concrete example of how to do it.
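If you also control how the images are generated server-side, a pre-resize with a high-quality filter avoids leaving the scaling to the browser at all. A small sketch using Pillow's LANCZOS filter (the target widths and file naming are just examples, not part of the answer above):

```python
# Pre-generate each display size with high-quality resampling so the browser
# never has to scale the image itself.
from PIL import Image


def make_variants(path, widths=(320, 640, 1280)):
    img = Image.open(path)
    stem = path.rsplit(".", 1)[0]
    for w in widths:
        h = round(img.height * w / img.width)
        img.resize((w, h), Image.LANCZOS).save(f"{stem}_{w}w.jpg", quality=90)


make_variants("checkerboard_dress.jpg")
```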

How to remove graphics from a scanned document before passing it to Tesseract for OCR?

I'm working on an OCR project, but I don't know how to remove graphics from the scanned document image before passing it to Tesseract.
Some scanned documents from which I want to remove the graphics are below:
http://www.mediafire.com/view/hvmpty2z3cw3vao/IMG_0087.JPG
http://www.mediafire.com/view/1sgy5s2aaj2o8y3/IMG_0086.JPG
Any advice is much appreciated. Many thanks.
As the text areas are usually sparse and do not connect to each other, you may consider running Sobel edge detection on the original image and detecting the biggest connected area (with some threshold) as the graphic area; see the sketch below.
Meanwhile, as the graphic is a rectangular area, another way is to use a Hough transform to detect straight lines and look for a rectangle formed by four of them. If you go this way, it is recommended that you scale the image down first to reduce the computational cost.
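A rough OpenCV sketch of the first suggestion (edges, then erase the largest connected region). Canny is used here in place of a raw Sobel map, the kernel size and area threshold are guesses, and the findContours call assumes the OpenCV 4 return signature:

```python
# Dilate the edge map so the graphic becomes one big blob, find the largest
# contour, and white out its bounding box before handing the page to Tesseract.
import cv2
import numpy as np

img = cv2.imread("IMG_0087.JPG", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)
edges = cv2.dilate(edges, np.ones((15, 15), np.uint8))

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    largest = max(contours, key=cv2.contourArea)
    if cv2.contourArea(largest) > 0.05 * img.size:   # ignore small blobs
        x, y, w, h = cv2.boundingRect(largest)
        img[y:y + h, x:x + w] = 255                  # erase the graphic region

cv2.imwrite("text_only.png", img)
```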
You can start by detecting the text areas using an algorithm available in AForge.Net; see HorizontalRunLengthSmoothing and VerticalRunLengthSmoothing. The algorithm is not very complicated and you can easily implement it using your favorite image processing library. The only constraint is that you need to know approximately the size of the characters in your images.
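The horizontal pass is easy to reproduce with NumPy, for example (max_gap should be on the order of the character size mentioned above; this is only a rough equivalent of what the AForge filters do, not their actual code):

```python
# Fill short horizontal runs of background so characters smear into solid
# word/line blocks; the blocks' bounding boxes then mark candidate text areas.
import numpy as np


def horizontal_rls(binary, max_gap=25):
    """binary: 2-D array with text pixels == 1. Returns a smoothed copy."""
    out = binary.copy()
    for row in out:                      # rows are views, so edits stick
        ink = np.flatnonzero(row)
        if ink.size < 2:
            continue
        for start, gap in zip(ink[:-1], np.diff(ink)):
            if gap <= max_gap:           # short background run: fill it
                row[start:start + gap] = 1
    return out
```

Applying the same function to the transposed image gives the vertical pass, and combining the two smoothed images yields blocks whose bounding boxes you can feed to the OCR step.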