document image processing - language-agnostic

I working on an application for processing document images (mainly invoices) and basically, I'd like to convert certain regions of interest into an XML-structure and then classify the document based on that data. Currently I am using ImageJ for analyzing the document image and Asprise/tesseract for OCR.
Now I am looking for something to make developing easier. Specifically, I am looking for something to automatically deskew a document image and analyze the document structure (e.g. converting an image into a quadtree structure for easier processing). Although I prefer Java and ImageJ I am interested in any libraries/code/papers regardless of the programming language it's written in.
While the system I am working on should as far as possible process data automatically, the user should oversee the results and, if necessary, correct the classification suggested by the system. Therefore I am interested in using machine learning techniques to achieve more reliable results. When similar documents are processed, e.g. invoices of a specific company, its structure is usually the same. When the user has previously corrected data of documents from a company, these corrections should be considered in the future. I have only limited knowledge of machine learning techniques and would like to know how I could realize my idea.

The following prototype in Mathematica finds the coordinates of blocks of text and performs OCR within each block. You may need to adapt the parameters values to fit the dimensions of your actual images. I do not address the machine learning part of the question; perhaps you would not even need it for this application.
Import the picture, create a binary mask for the printed parts, and enlarge these parts using an horizontal closing (dilation and erosion).
Query for each blob's orientation, cluster the orientations, and determine the overall rotation by averaging the orientations of the largest cluster.
Use the previous angle to straighten the image. At this time OCR is possible, but you would lose the spatial information for the blocks of text, which will make the post-processing much more difficult than it needs to be. Instead, find blobs of text by horizontal closing.
For each connected component, query for the bounding box position and the centroid position. Use the bounding box positions to extract the corresponding image patch and perform OCR on the patch.
At this point, you have a list of strings and their spatial positions. That's not XML yet, but it sounds like a good starting point to be tailored straightforwardly to your needs.
This is the code. Again, the parameters (structuring elements) of the morphological functions may need to change, based on the scale of your actual images; also, if the invoice is too tilted, you may need to "rotate" roughly the structuring elements in order to still achieve good "un-skewing."
img = ColorConvert[Import#"http://www.team-bhp.com/forum/attachments/test-drives-initial-ownership-reports/490952d1296308008-laura-tsi-initial-ownership-experience-img023.jpg", "Grayscale"];
b = ColorNegate#Binarize[img];
mask = Closing[b, BoxMatrix[{2, 20}]]
orientations = ComponentMeasurements[mask, "Orientation"];
angles = FindClusters#orientations[[All, 2]]
\[Theta] = Mean[angles[[1]]]
straight = ColorNegate#Binarize[ImageRotate[img, \[Pi] - \[Theta], Background -> 1]]
TextRecognize[straight]
boxes = Closing[straight, BoxMatrix[{1, 20}]]
comp = MorphologicalComponents[boxes];
measurements = ComponentMeasurements[{comp, straight}, {"BoundingBox", "Centroid"}];
texts = TextRecognize#ImageTrim[straight, #] & /# measurements[[All, 2, 1]];
Cases[Thread[measurements[[All, 2, 2]] -> texts], (_ -> t_) /; StringLength[t] > 0] // TableForm

The paper we use for skew angle detection is: Skew detection and text line position determination in digitized documents by Gatos et. al. The only limitation with this paper is that it can detect skew upto -5 and +5 degrees. After that, we need something to slap the user with a message! :)
In your case, where there are primarily invoice scans, you may beautifully use: Multiresolution Analysis in Extraction of Reference Lines from Documents with Gray Level Background by Tag et. al.
We wrote the code in MATLAB, if you need help let me know!

I worked on a similar project once, and for being a long time user of OpenCV I ended up using it once again. OpenCV is a popular-cross-platform-computer-vision-library that offers programming interfaces for C and C++.
I found an interesting blog that had a post on how to detect the skew angle of a text using OpenCV, and then another on how to deskew.
To retrieve the text of the document and be able to pass a smaller image to tesseract, I suggest taking a look at the bounding box technique.
I don't know if the image acquisition procedure is your responsibility, but if it is you might want to take a look at how to do camera calibration with OpenCV to fix the distortion in the image caused by some camera lenses.

Related

FirebaseVisionImage / ML Toolkit cropRect() support

I am posting this question by request of a Firebase engineer.
I am using the Camera2 API in conjunction with Firebase-mlkit vision. I am using both barcode and on-platform OCR. The things I am trying to decode are mostly labels on equipment. In testing the application I have found that trying to scan the entire camera image produces mixed results. The main problem is that the field of view is too wide.
If there are multiple bar codes in view, firebase returns multiple results. You can sort of work around this by looking at the coordinates and picking the one closest to the center.
When scanning text, it's more or less the same, except that you get multiple Blocks, many times incomplete (you'll get a couple of letters here and there).
You can't just narrow the camera mode, though - for this type of scanning, the user benefits from the "wide" camera view for alignment. The ideal situation would be if you have a camera image (let's say for the sake of argument it's 1920x1080) but only a subset of the image is given to firebase-ml. You can imagine a camera view that has a guide box on the screen, and you orient and zoom the item you want to scan within that box.
You can select what kind of image comes from the Camera2 API but firebase-ml spits out warnings if you choose anything other than YUV_420_488. The problem is that there's not a great way in the Android API to deal with YUV images unless you do it yourself. That's what I ultimately ended up doing - I solved my problem by writing a Renderscript that takes an input YUV, converts it to RGBA, crops it, then applies any rotation if necessary. The result of this is a Bitmap, which I then feed into either the FirebaseVisionBarcodeDetectorOptions or FirebaseVisionTextRecognizer.
Note that the bitmap itself cases mlkit runtime warnings, urging me to use the YUV format instead. This is possible, but difficult. You would have to read the byte array and stride information from the original camera2 yuv image and create your own. The object that comes from camear2 is unfortunately a package-protected class, so you can't subclass it or create your own instance - you'd essentially have to start from scratch. (I'm sure there's a reason Google made this class package protected but it's extremely annoying that they did).
The steps I outlined above all work, but with format warnings from mlkit. What makes it even better is the performance gain - the barcode scanner operating on an 800x300 image takes a tiny fraction as long as it does on the full size image!
It occurs to me that none of this would be necessary if firebase paid attention to cropRect. According to the Image API, cropRect defines what portion of the image is valid. That property seems to be mutable, meaning you can get an Image and change its cropRect after the fact. That sounds perfect. I thought that I could get an Image off of the ImageReader, set cropRect to a subset of that image, and pass it to Firebase and that Firebase would ignore anything outside of cropRect.
This does not seem to be the case. Firebase seems to ignore cropRect. In my opinion, firebase should either support cropRect, or the documentation should explicitly state that it ignores it.
My request to the firebase-mlkit team is:
Define the behavior I should expect with regard to cropRect, and document it more explicitly
Explain at least a little about how images are processed by these recognizers. Why is it so insistent that YUV_420_488 be used? Maybe only the Y channel is used in decoding? Doesn't the recognizer have to convert to RGBA internally? If so, why does it get angry at me when I feed in Bitmaps?
Make these recognizers either pay attention to cropRect, or state that they don't and provide another way to tell these recognizers to work on a subset of the image, so that I can get the performance (reliability and speed) that one would expect out of having to ML correlate/transform/whatever a smaller image.
--Chris

Image from concept of a convolutional neural network

Suppose I have a CNN which is trained for classifying images of different animals, the output of such model will be a point (output point) in a n spatial dimension, where n is the number of animal classes the model is trained on; then that output is transformed in a way to convert it into a one-hot vector of n parameters, giving then the correct label for the image from the point of view of the CNN, but let's stick with the n dimensional point, which is the concept of an input image.
Suppose then that I want to take that point and transform it in a way so that the final output is an image with constraint width and height (the dimensions should be the same with different input images) which outputs the same point as the input image's, how do I do that?
I'm basically asking for the methods used (training mostly) for this kind of task, where an image must be reconstructed based on the output point of the CNN -I know the image will never be identical, but I'm looking for images that generate the same (or at least not so different) output point as a input image when that point is inputted to the CNN-. Take in mind that the input of the model I'm asking for is n and the output is a two (or three if it's not in grayscale) dimensional tensor. I noticed that deepdream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" in Google, an online generator is almost always shown, not the actual techniques; so if there are some answers to this I'd love to hear about them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.

how to format the image data for training/prediction when images are different in size?

I am trying to train my model which classifies images.
The problem I have is, they have different sizes. how should i format my images/or model architecture ?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're up for trouble: Here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here's some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; does scale and perspective mean anything to the content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation, there's pieces like resize_image_with_crop_or_pad that take away the bigger work.
As for just don't caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
distorted_image,
lambda x, method: tf.image.resize_images(x, [height, width], method=method),
num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper here called Spatial Pyramid Pooling in Deep Convolution Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer. Then put it after your last convolution layer so that the FC layers always get constant dimensional vectors as input . During training , train the images from the entire dataset using a particular image size for one epoch . Then for the next epoch , switch to a different image size and continue training .

Randomly Generate Directed Graph on a grid

I am trying to randomly generate a directed graph for the purpose of making a puzzle game similar to the ice sliding puzzles from pokemon.
This is essentially what I want to be able to randomly generate: http://bulbanews.bulbagarden.net/wiki/Crunching_the_numbers:_Graph_theory
I need to be able to limit the size of the graph in an x and y dimension. In the example in the link, it would be restricted to an 8x4 grid.
The problem I am running in to is not randomly generating the graph, but randomly generating a graph which I can properly map out in a 2d space, since I need something (like a rock) on the opposite side of a node, to make it visually make sense when you stop sliding. The problem with this is sometimes the rock ends up in the path between two other nodes or possibly on another node itself, which causes the entire graph to become broken.
After discussing the problem with a few people I know, we came to a couple of conclusions that may lead to a solution. Including the obstacles in the grid as part of the graph when constructing it. Start out with a fully filled grid and just draw a random path and delete out blocks that will make that path work, though the problem then becomes figuring out which ones to delete so that you don't accidentally introduce an additional, shorter path. We were also thinking a dynamic programming algorithm may be beneficial, though none of us are too skilled with creating dynamic programming algorithms from nothing. Any ideas or references about what this problem is officially called (if it's an official graph problem) would be most helpful.
I wouldn't look at it as a graph problem, since as you say the representation is incomplete. To generate a puzzle I would work directly on a grid, and work backwards; first fix the destination spot, then place rocks in some way to reach it from one or more spots, and iteratively add stones to reach those other spots, with the constraint that you never add a stone which breaks all the paths to the destination.
You might want to generate a planar graph, which means that the edges of the graph will not overlap each other in a two dimensional space. Another definition of planar graphs ist that each planar graph does not have any subgraphs of the type K_3,3 (complete bi-partite with six nodes) or K_5 (complete graph with five nodes).
There's a paper on the fast generation of planar graphs.

Vector graphics flood fill algorithms?

I am working on a simple drawing application, and i need an algorithm to make flood fills.
The user workflow will look like this (similar to Flash CS, just more simpler):
the user draws straight lines on the workspace. These are treated as vectors, and can be selected and moved after they are drawn.
user selects the fill tool, and clicks on the drawing area. If the area is surrounded by lines in every direction a fill is applied to the area.
if the lines are moved after the fill is applied, the area of fill is changed accordingly.
Anyone has a nice idea, how to implement such algorithm? The main task is basically to determine the line segments surrounding a point. (and storing this information somehow, incase the lines are moved)
EDIT: an explanation image: (there can be other lines of course in the canvas, that do not matter for the fill algorithm)
EDIT2: a more difficult situation:
EDIT3: I have found a way to fill polygons with holes http://alienryderflex.com/polygon_fill/ , now the main question is, how do i find my polygons?
You're looking for a point location algorithm. It's not overly complex, but it's not simple enough to explain here. There's a good chapter on it in this book: http://www.cs.uu.nl/geobook/
When I get home I'll get my copy of the book and see if I can try anyway. There's just a lot of details you need to know about. It all boils down to building a DCEL of the input and maintain a datastructure as lines are added or removed. Any query with a mouse coord will simply return an inner halfedge of the component, and those in particular contain pointers to all of the inner components, which is exactly what you're asking for.
One thing though, is that you need to know the intersections in the input (because you cannot build the trapezoidal map if you have intersecting lines) , and if you can get away with it (i.e. input is few enough segments) I strongly suggest that you just use the naive O(n²) algorithm (simple, codeable and testable in less than 1 hour). The O(n log n) algorithm takes a few days to code and use a clever and very non-trivial data structure for the status. It is however also mentioned in the book, so if you feel up to the task you have 2 reasons to buy it. It is a really good book on geometric problems in general, so for that reason alone any programmer with interest in algorithms and datastructures should have a copy.
Try this:
http://keith-hair.net/blog/2008/08/04/find-intersection-point-of-two-lines-in-as3/
The function returns the intersection (if any) between two lines in ActionScript. You'll need to loop through all your lines against each other to get all of them.
Of course the order of the points will be significant if you're planning on filling them - that could be harder!
With ActionScript you can use beginFill and endFill, e.g.
pen_mc.beginFill(0x000000,100);
pen_mc.lineTo(400,100);
pen_mc.lineTo(400,200);
pen_mc.lineTo(300,200);
pen_mc.lineTo(300,100);
pen_mc.endFill();
http://www.actionscript.org/resources/articles/212/1/Dynamic-Drawing-Using-ActionScript/Page1.html
Flash CS4 also introduces support for paths:
http://www.flashandmath.com/basic/drawpathCS4/index.html
If you want to get crazy and code your own flood fill then Wikipedia has a decent primer, but I think that would be reinventing the atom for these purposes.