Image Comparison using CBIR and OCR

I am working on a project that retrieves content from a given image, compares it with other images in a repository, and lists the matching images.
What would be the right approach so that the search does not slow down as the repository grows?
What I was planning as a first level of filtering was to use an image-querying (CBIR) technique to retrieve images matching the overall pattern of the given image.
Then do OCR to get the image content and do a text match.
Please let me know if there is a better approach for this.
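A minimal sketch of such a first-level filter, assuming the Python Pillow and ImageHash libraries are available; a perceptual hash is only one possible CBIR signature, and build_index/candidates are illustrative names:

```python
# Sketch: first-level CBIR filtering with perceptual hashes (assumes Pillow + ImageHash).
from PIL import Image
import imagehash

def build_index(paths):
    """Precompute a perceptual hash for every image in the repository."""
    return {p: imagehash.phash(Image.open(p)) for p in paths}

def candidates(query_path, index, max_distance=10):
    """Return repository images whose hash is within max_distance bits of the query."""
    q = imagehash.phash(Image.open(query_path))
    return [p for p, h in index.items() if q - h <= max_distance]
```

Only the candidates that survive this filter would then go on to the more expensive OCR comparison, which keeps the search from slowing down as the repository grows.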

Steps done
Software
1. Tesseract OCR
2. ImageMagick - for image cleaning
3. Textcleaner script
Found the image orientation using ImageMagick.
The convert tool can report the orientation from EXIF data, but that turned out not to be very useful.
Instead, the image was rotated by 90 degrees three times and the OCR output of each rotation was compared with the others to find the correct orientation (the image with the maximum number of recognized words wins; a sketch follows below).
OCRed the image to get the text and applied filtering to extract the bill number, date and amount.
On success, stores the details in the DB for future searches.
On failure:
Created 10 different images with different filters applied (grayscale mode and sharpening).
OCRed all of them and extracted the required data from the combined results.
The saved data is used by the future search feature to eliminate duplicates.
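A rough sketch of the orientation check and field filtering described above, assuming Pillow and pytesseract; the regular expressions are placeholders, not the actual filters used:

```python
# Sketch of the orientation check (rotate 0/90/180/270 degrees, keep the
# rotation with the most recognized words) and rough field extraction.
import re
from PIL import Image
import pytesseract

def best_orientation_text(path):
    """OCR the image at each 90-degree rotation; the one with the most words wins."""
    img = Image.open(path)
    best_text, best_words = "", -1
    for angle in (0, 90, 180, 270):
        text = pytesseract.image_to_string(img.rotate(angle, expand=True))
        words = len(text.split())
        if words > best_words:
            best_text, best_words = text, words
    return best_text

def extract_fields(text):
    """Very rough filters for bill number, date and amount (illustrative only)."""
    bill = re.search(r"bill\s*(?:no|number)[:.\s]*(\S+)", text, re.I)
    date = re.search(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b", text)
    amount = re.search(r"\b\d+\.\d{2}\b", text)
    return {
        "bill_no": bill.group(1) if bill else None,
        "date": date.group(0) if date else None,
        "amount": amount.group(0) if amount else None,
    }
```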

Related

FirebaseVisionImage / ML Toolkit cropRect() support

I am posting this question by request of a Firebase engineer.
I am using the Camera2 API in conjunction with Firebase-mlkit vision. I am using both barcode and on-platform OCR. The things I am trying to decode are mostly labels on equipment. In testing the application I have found that trying to scan the entire camera image produces mixed results. The main problem is that the field of view is too wide.
If there are multiple bar codes in view, firebase returns multiple results. You can sort of work around this by looking at the coordinates and picking the one closest to the center.
When scanning text, it's more or less the same, except that you get multiple Blocks, many times incomplete (you'll get a couple of letters here and there).
You can't just narrow the camera mode, though - for this type of scanning, the user benefits from the "wide" camera view for alignment. The ideal situation would be if you have a camera image (let's say for the sake of argument it's 1920x1080) but only a subset of the image is given to firebase-ml. You can imagine a camera view that has a guide box on the screen, and you orient and zoom the item you want to scan within that box.
You can select what kind of image comes from the Camera2 API but firebase-ml spits out warnings if you choose anything other than YUV_420_888. The problem is that there's not a great way in the Android API to deal with YUV images unless you do it yourself. That's what I ultimately ended up doing - I solved my problem by writing a RenderScript that takes an input YUV, converts it to RGBA, crops it, then applies any rotation if necessary. The result of this is a Bitmap, which I then feed into either the FirebaseVisionBarcodeDetectorOptions or FirebaseVisionTextRecognizer.
Note that the bitmap itself causes mlkit runtime warnings, urging me to use the YUV format instead. This is possible, but difficult. You would have to read the byte array and stride information from the original camera2 YUV image and create your own. The object that comes from camera2 is unfortunately a package-protected class, so you can't subclass it or create your own instance - you'd essentially have to start from scratch. (I'm sure there's a reason Google made this class package protected but it's extremely annoying that they did).
The steps I outlined above all work, but with format warnings from mlkit. What makes it even better is the performance gain - the barcode scanner operating on an 800x300 image takes a tiny fraction of the time it takes on the full-size image!
It occurs to me that none of this would be necessary if firebase paid attention to cropRect. According to the Image API, cropRect defines what portion of the image is valid. That property seems to be mutable, meaning you can get an Image and change its cropRect after the fact. That sounds perfect. I thought that I could get an Image off of the ImageReader, set cropRect to a subset of that image, and pass it to Firebase and that Firebase would ignore anything outside of cropRect.
This does not seem to be the case. Firebase seems to ignore cropRect. In my opinion, firebase should either support cropRect, or the documentation should explicitly state that it ignores it.
My request to the firebase-mlkit team is:
Define the behavior I should expect with regard to cropRect, and document it more explicitly
Explain at least a little about how images are processed by these recognizers. Why is it so insistent that YUV_420_888 be used? Maybe only the Y channel is used in decoding? Doesn't the recognizer have to convert to RGBA internally? If so, why does it get angry at me when I feed in Bitmaps?
Make these recognizers either pay attention to cropRect, or state that they don't and provide another way to tell these recognizers to work on a subset of the image, so that I can get the performance (reliability and speed) that one would expect from having the ML correlate/transform/whatever a smaller image.
--Chris

OSM direct tile referencing.

I wanted to display a map in an HTML5 notification. Since notifications do not allow a full-fledged HTML+JS app to run inside them, I wanted to show just an image.
The question is how do I get the link to the appropriate map tile image if I know the lat/lon coordinates. I can set some fixed values for zoom level etc. I don't want to run my own tile server or depend on third-party servers (other than openstreetmaps.org) that may go down at any time. I am okay with the lack of ability to customize the image size or center the image around the coordinate.
There are a bunch of example conversions from lat/long to tile number in various languages on this OSM wiki page - hopefully you'll find something there that you can use.
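For reference, the standard conversion from that wiki looks like this in Python; the tile.openstreetmap.org URL pattern is the usual public one, but check the tile usage policy before relying on it:

```python
# Slippy-map tile numbering: lat/lon + zoom -> tile x/y, then the tile URL.
import math

def tile_url(lat, lon, zoom):
    lat_rad = math.radians(lat)
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    return f"https://tile.openstreetmap.org/{zoom}/{x}/{y}.png"

print(tile_url(51.5074, -0.1278, 15))  # a tile covering central London
```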

OCR: match frame's position to field in credit card

I am developing an OCR to detect credit cards.
After scanning the image I get a list of words with their positions.
Any tips/suggestions about the best approach to detect which words correspond to each field of credit card (number, date, name)?
For example:
position = 96.00 491.00
text = CARDHOLDER
Thanks in advance
Your first problem is that most OCRs are not optimised for small amounts of text that take up most of the "page" (or card image, in your case) in spatially separated chunks. They expect lines, or pages of text from a scanned book or a newspaper. So straight away they're not likely to do that well at analysing the image.
Because the font is fairly uniform they'll likely recognise the characters well, but the layout will confuse the page segmentation algorithm and so the text you get out might not be in the right order. For example, the "1234" of the card number and the smaller "1234" below it constitute a single column of text, likewise the second two sets of four numbers and the expiration date.
For specialized cases where you know the layout in advance you really want to develop your own page segmentation algorithm to break up the image into zones, e.g. card number, card holder name, start and expiration dates. This shouldn't be too hard because I think the locations of these components are standardised on credit cards. Assuming good preprocessing and binarization you could basically do a horizontal histogram and split the image at the troughs.
Then extract each zone as a separate image containing just one line of text and feed it to the OCR.
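A rough sketch of that zoning step, assuming OpenCV and NumPy; the threshold on the projection profile is illustrative, and each zone would still need to be mapped to a card field:

```python
# Split a binarized card image into horizontal text zones using the
# horizontal projection profile (ink per row), cutting at the troughs.
import cv2
import numpy as np

gray = cv2.imread("card.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

profile = binary.sum(axis=1)                 # total ink in each row
rows_with_text = profile > 0.05 * profile.max()

# Contiguous runs of text rows become candidate zones (one line of text each).
zones, start = [], None
for y, has_text in enumerate(rows_with_text):
    if has_text and start is None:
        start = y
    elif not has_text and start is not None:
        zones.append((start, y))
        start = None
if start is not None:
    zones.append((start, len(rows_with_text)))

# Each zone can now be cropped and fed to the OCR engine separately.
crops = [binary[y0:y1, :] for y0, y1 in zones]
```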
Alternatively (the quick and dirty approach):
Instruct the OCR that what you want to recognise consists of a single column (i.e. prevent it from trying to figure out the page layout itself). You can do this with Tesseract using the -psm (page segmentation mode) parameter (--psm in Tesseract 4 and later) set to, probably, 6 (but try and see what gives you the best results).
Make Tesseract output hOCR format, which you can set in the config file. hOCR output includes the bounding boxes of the lines relative to the whole image.
Write an algorithm that compares the bounding boxes in the hOCR output to where you know each card component should be (looking for some percentage of overlap; it won't match exactly, for obvious reasons). A sketch of these steps follows below.
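A sketch of those steps using pytesseract; it uses the TSV output (image_to_data), which carries the same bounding boxes as hOCR but is easier to parse, and the zone rectangles are made-up placeholders rather than real card geometry:

```python
# Run Tesseract in single-block mode, get word bounding boxes, and assign
# each word to the card zone it overlaps most.
import pytesseract
from PIL import Image

# Expected zones as (left, top, right, bottom) fractions of the card image.
ZONES = {
    "number": (0.05, 0.40, 0.95, 0.60),
    "expiry": (0.40, 0.65, 0.80, 0.80),
    "holder": (0.05, 0.80, 0.70, 0.95),
}

def overlap(a, b):
    """Intersection area of two (left, top, right, bottom) rectangles."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0

img = Image.open("card.png")
W, H = img.size
zones_px = {name: (l * W, t * H, r * W, b * H) for name, (l, t, r, b) in ZONES.items()}

# --psm 6 tells Tesseract to treat the card as a single uniform block of text.
data = pytesseract.image_to_data(img, config="--psm 6",
                                 output_type=pytesseract.Output.DICT)

fields = {name: [] for name in ZONES}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    box = (data["left"][i], data["top"][i],
           data["left"][i] + data["width"][i],
           data["top"][i] + data["height"][i])
    # Assign the word to the zone it overlaps most, if it overlaps any at all.
    best = max(zones_px, key=lambda name: overlap(box, zones_px[name]))
    if overlap(box, zones_px[best]) > 0:
        fields[best].append(word)

print({name: " ".join(words) for name, words in fields.items()})
```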
In addition to the good tips provided by Mikesname, you can greatly improve the recognition result, regardless of which OCR engine you use, by using image processing to convert the image to bitonal (pure black and white), such as the attached copy of your image.
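A minimal binarization sketch, assuming OpenCV; adaptive thresholding tends to cope better than a single global threshold with unevenly lit card photos:

```python
# Convert a card photo to bitonal (pure black and white) before OCR.
import cv2

gray = cv2.imread("card.png", cv2.IMREAD_GRAYSCALE)
bitonal = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 15)
cv2.imwrite("card_bitonal.png", bitonal)
```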

How to remove graphics from a scanned document before passing it to Tesseract for OCR?

I'm working on an OCR project but I don't know how to remove graphics from the scanned document image before passing it to Tesseract.
Some scanned documents from which I want to remove the graphics are below:
http://www.mediafire.com/view/hvmpty2z3cw3vao/IMG_0087.JPG
http://www.mediafire.com/view/1sgy5s2aaj2o8y3/IMG_0086.JPG
Any advice is much appreciated. Many thanks.
As the text areas are usually sparse and do not connect to each other, you may consider applying Sobel edge detection to the original image and then finding the largest connected area (with some threshold) to locate the graphic region.
Alternatively, since the graphic is a rectangular area, you can use a Hough transform to detect the straight lines that form a rectangle. If you go this way, it's recommended that you downscale the image first to reduce the computational complexity.
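A rough sketch of the first suggestion, assuming OpenCV and NumPy; the thresholds and kernel sizes are illustrative, and on pages where the graphic is small the largest component may well be text instead:

```python
# Sobel edges + morphological closing merge the edge-dense graphic into one
# blob; the largest connected component is then blanked out before OCR.
import cv2
import numpy as np

gray = cv2.imread("IMG_0087.JPG", cv2.IMREAD_GRAYSCALE)

gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))
_, edges = cv2.threshold(edges, 50, 255, cv2.THRESH_BINARY)
closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, np.ones((15, 15), np.uint8))

# Pick the largest connected component (stats[0] is the background label).
n, labels, stats, _ = cv2.connectedComponentsWithStats(closed)
largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
x, y = stats[largest, cv2.CC_STAT_LEFT], stats[largest, cv2.CC_STAT_TOP]
w, h = stats[largest, cv2.CC_STAT_WIDTH], stats[largest, cv2.CC_STAT_HEIGHT]

# Blank out the detected graphic region before handing the page to Tesseract.
cleaned = gray.copy()
cleaned[y:y + h, x:x + w] = 255
cv2.imwrite("cleaned.png", cleaned)
```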
You can start by detecting text areas using an algorithm available in AForge.Net. See HorizontalRunLengthSmoothing and VerticalRunLengthSmoothing. The algorithm is not very complicated and you can implement it easily using your favorite image processing library. The only constraint is to know approximately the size of the characters in your images.
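The run-length smoothing idea itself is simple enough to sketch; this is a rough Python/NumPy equivalent of what the AForge.Net horizontal filter does (the gap threshold depends on your character size, as noted above):

```python
# Horizontal run-length smoothing (RLSA): background runs shorter than
# max_gap between two ink pixels are filled, so letters on the same line
# merge into solid text blocks. Assumes a binarized image where ink is 255.
import numpy as np

def horizontal_rls(binary, max_gap=25):
    out = binary.copy()
    for row in out:                       # each row is a view into `out`
        ink = np.flatnonzero(row)
        for a, b in zip(ink[:-1], ink[1:]):
            if 0 < b - a <= max_gap:
                row[a:b] = 255
    return out
```

The smoothed blocks can then be labelled with connected-component analysis, and anything whose shape does not match the expected character/line size can be discarded as a graphic.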

document image processing

I'm working on an application for processing document images (mainly invoices) and basically, I'd like to convert certain regions of interest into an XML structure and then classify the document based on that data. Currently I am using ImageJ for analyzing the document image and Asprise/Tesseract for OCR.
Now I am looking for something to make development easier. Specifically, I am looking for something to automatically deskew a document image and analyze the document structure (e.g. converting an image into a quadtree structure for easier processing). Although I prefer Java and ImageJ, I am interested in any libraries/code/papers regardless of the programming language they are written in.
While the system I am working on should process data as automatically as possible, the user should oversee the results and, if necessary, correct the classification suggested by the system. Therefore I am interested in using machine learning techniques to achieve more reliable results. When similar documents are processed, e.g. invoices of a specific company, their structure is usually the same. When the user has previously corrected data of documents from a company, these corrections should be considered in the future. I have only limited knowledge of machine learning techniques and would like to know how I could realize my idea.
The following prototype in Mathematica finds the coordinates of blocks of text and performs OCR within each block. You may need to adapt the parameter values to fit the dimensions of your actual images. I do not address the machine learning part of the question; perhaps you would not even need it for this application.
Import the picture, create a binary mask for the printed parts, and enlarge these parts using a horizontal closing (dilation and erosion).
Query for each blob's orientation, cluster the orientations, and determine the overall rotation by averaging the orientations of the largest cluster.
Use the previous angle to straighten the image. At this time OCR is possible, but you would lose the spatial information for the blocks of text, which will make the post-processing much more difficult than it needs to be. Instead, find blobs of text by horizontal closing.
For each connected component, query for the bounding box position and the centroid position. Use the bounding box positions to extract the corresponding image patch and perform OCR on the patch.
At this point, you have a list of strings and their spatial positions. That's not XML yet, but it sounds like a good starting point to be tailored straightforwardly to your needs.
This is the code. Again, the parameters (structuring elements) of the morphological functions may need to change based on the scale of your actual images; also, if the invoice is too tilted, you may need to roughly "rotate" the structuring elements in order to still achieve good "un-skewing."
img = ColorConvert[Import@"http://www.team-bhp.com/forum/attachments/test-drives-initial-ownership-reports/490952d1296308008-laura-tsi-initial-ownership-experience-img023.jpg", "Grayscale"];
b = ColorNegate@Binarize[img];
mask = Closing[b, BoxMatrix[{2, 20}]]
orientations = ComponentMeasurements[mask, "Orientation"];
angles = FindClusters@orientations[[All, 2]]
\[Theta] = Mean[angles[[1]]]
straight = ColorNegate@Binarize[ImageRotate[img, \[Pi] - \[Theta], Background -> 1]]
TextRecognize[straight]
boxes = Closing[straight, BoxMatrix[{1, 20}]]
comp = MorphologicalComponents[boxes];
measurements = ComponentMeasurements[{comp, straight}, {"BoundingBox", "Centroid"}];
texts = TextRecognize@ImageTrim[straight, #] & /@ measurements[[All, 2, 1]];
Cases[Thread[measurements[[All, 2, 2]] -> texts], (_ -> t_) /; StringLength[t] > 0] // TableForm
The paper we use for skew angle detection is: Skew detection and text line position determination in digitized documents by Gatos et al. The only limitation with this paper is that it can only detect skew between -5 and +5 degrees. Beyond that, we need something to slap the user with a message! :)
In your case, where there are primarily invoice scans, you can make very good use of: Multiresolution Analysis in Extraction of Reference Lines from Documents with Gray Level Background by Tag et al.
We wrote the code in MATLAB, if you need help let me know!
I worked on a similar project once, and being a long-time user of OpenCV I ended up using it once again. OpenCV is a popular cross-platform computer vision library that offers programming interfaces for C and C++.
I found an interesting blog with a post on how to detect the skew angle of text using OpenCV, and then another on how to deskew it.
To retrieve the text of the document and be able to pass a smaller image to Tesseract, I suggest taking a look at the bounding box technique.
I don't know if the image acquisition procedure is your responsibility, but if it is you might want to take a look at how to do camera calibration with OpenCV to fix the distortion in the image caused by some camera lenses.
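For reference, a minimal deskew sketch with OpenCV in the spirit of those posts; note that the sign and range convention of minAreaRect's angle differ between OpenCV versions, so verify the rotation direction on your install before trusting it:

```python
# Estimate the skew angle from the minimum-area rectangle around all ink
# pixels, then rotate the page back to horizontal.
import cv2
import numpy as np

gray = cv2.imread("invoice.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

ys, xs = np.where(binary > 0)
points = np.column_stack((xs, ys)).astype(np.float32)
angle = cv2.minAreaRect(points)[-1]
if angle > 45:          # map the reported angle into a small correction around 0
    angle -= 90
elif angle < -45:
    angle += 90

h, w = gray.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed.png", deskewed)
```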