Bounding box creator - deep-learning

Friends, I have more than 100 images with their ground truths (their black-and-white masks). How can I automatically get bounding boxes from them in the Pascal VOC format, i.e. the bounding box values saved as XML files?
I mean creating the xmin, xmax, ymin, ymax values from the masks and saving them as XML files. I used LabelImg, but I could not find an automatic way to do this there. I will use them for deep learning with Pascal VOC.
Is there a code, tool or link how to do?

If you want to get bounding boxes from masks, you just need to use numpy.where() to get the indexes of each mask's foreground pixels; then you simply take the max and min values of those indexes, and those are exactly the coordinates of the bounding box.
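For example, a minimal Python sketch of this (the file names and the class name are placeholders, and it assumes a single object per mask):

import numpy as np
from PIL import Image
import xml.etree.ElementTree as ET

def mask_to_voc_xml(mask_path, xml_path, class_name="object"):
    # load the black/white mask and find the indexes of all foreground pixels
    mask = np.array(Image.open(mask_path).convert("L"))
    ys, xs = np.where(mask > 0)
    xmin, xmax = int(xs.min()), int(xs.max())
    ymin, ymax = int(ys.min()), int(ys.max())
    # build a Pascal VOC-style annotation and save it as XML
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = mask_path
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(mask.shape[1])
    ET.SubElement(size, "height").text = str(mask.shape[0])
    ET.SubElement(size, "depth").text = "1"
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = class_name
    box = ET.SubElement(obj, "bndbox")
    ET.SubElement(box, "xmin").text = str(xmin)
    ET.SubElement(box, "ymin").text = str(ymin)
    ET.SubElement(box, "xmax").text = str(xmax)
    ET.SubElement(box, "ymax").text = str(ymax)
    ET.ElementTree(ann).write(xml_path)

mask_to_voc_xml("mask_001.png", "image_001.xml")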

Related

document image processing

I'm working on an application for processing document images (mainly invoices), and basically I'd like to convert certain regions of interest into an XML structure and then classify the document based on that data. Currently I am using ImageJ for analyzing the document image and Asprise/tesseract for OCR.
Now I am looking for something to make development easier. Specifically, I am looking for something to automatically deskew a document image and analyze the document structure (e.g. converting an image into a quadtree structure for easier processing). Although I prefer Java and ImageJ, I am interested in any libraries/code/papers regardless of the programming language they are written in.
While the system I am working on should process data automatically as far as possible, the user should oversee the results and, if necessary, correct the classification suggested by the system. Therefore I am interested in using machine learning techniques to achieve more reliable results. When similar documents are processed, e.g. invoices of a specific company, their structure is usually the same. When the user has previously corrected data of documents from a company, these corrections should be considered in the future. I have only limited knowledge of machine learning techniques and would like to know how I could realize my idea.
The following prototype in Mathematica finds the coordinates of blocks of text and performs OCR within each block. You may need to adapt the parameter values to fit the dimensions of your actual images. I do not address the machine learning part of the question; perhaps you would not even need it for this application.
Import the picture, create a binary mask for the printed parts, and enlarge these parts using a horizontal closing (dilation and erosion).
Query for each blob's orientation, cluster the orientations, and determine the overall rotation by averaging the orientations of the largest cluster.
Use the previous angle to straighten the image. At this time OCR is possible, but you would lose the spatial information for the blocks of text, which will make the post-processing much more difficult than it needs to be. Instead, find blobs of text by horizontal closing.
For each connected component, query for the bounding box position and the centroid position. Use the bounding box positions to extract the corresponding image patch and perform OCR on the patch.
At this point, you have a list of strings and their spatial positions. That's not XML yet, but it should be straightforward to tailor to your needs.
This is the code. Again, the parameters (structuring elements) of the morphological functions may need to change based on the scale of your actual images; also, if the invoice is too tilted, you may need to roughly "rotate" the structuring elements in order to still achieve good deskewing.
(* Import the picture, binarize the printed parts, and enlarge them with a horizontal closing. *)
img = ColorConvert[Import@"http://www.team-bhp.com/forum/attachments/test-drives-initial-ownership-reports/490952d1296308008-laura-tsi-initial-ownership-experience-img023.jpg", "Grayscale"];
b = ColorNegate@Binarize[img];
mask = Closing[b, BoxMatrix[{2, 20}]]
(* Cluster the blob orientations and take the mean of the largest cluster as the skew angle. *)
orientations = ComponentMeasurements[mask, "Orientation"];
angles = FindClusters@orientations[[All, 2]]
\[Theta] = Mean[angles[[1]]]
(* Straighten the image using that angle. *)
straight = ColorNegate@Binarize[ImageRotate[img, \[Pi] - \[Theta], Background -> 1]]
TextRecognize[straight]
(* Find blobs of text by horizontal closing, then measure each connected component. *)
boxes = Closing[straight, BoxMatrix[{1, 20}]]
comp = MorphologicalComponents[boxes];
measurements = ComponentMeasurements[{comp, straight}, {"BoundingBox", "Centroid"}];
(* OCR each bounding box and pair the recognized text with its centroid. *)
texts = TextRecognize@ImageTrim[straight, #] & /@ measurements[[All, 2, 1]];
Cases[Thread[measurements[[All, 2, 2]] -> texts], (_ -> t_) /; StringLength[t] > 0] // TableForm
The paper we use for skew angle detection is "Skew detection and text line position determination in digitized documents" by Gatos et al. The only limitation of this paper is that it can only detect skew between -5 and +5 degrees. Beyond that, we need something to slap the user with a message! :)
In your case, where there are primarily invoice scans, you can make good use of "Multiresolution Analysis in Extraction of Reference Lines from Documents with Gray Level Background" by Tag et al.
We wrote the code in MATLAB, if you need help let me know!
I worked on a similar project once, and being a long-time user of OpenCV I ended up using it once again. OpenCV is a popular cross-platform computer vision library that offers programming interfaces for C and C++.
I found an interesting blog that had a post on how to detect the skew angle of a text using OpenCV, and then another on how to deskew.
To retrieve the text of the document and be able to pass a smaller image to tesseract, I suggest taking a look at the bounding box technique.
I don't know if the image acquisition procedure is your responsibility, but if it is you might want to take a look at how to do camera calibration with OpenCV to fix the distortion in the image caused by some camera lenses.

algorithm to draw filled symmetric polygon?

I'm looking for the series of steps necessary to draw a filled polygon. I will create a function that renders it to a bitmap. I'm writing in a language similar to Visual Basic, but without most of the object-oriented stuff like classes and inheritance. The drawing capabilities are drawline() and drawrect() and that is it, but it can scale and rotate a completed bitmap object. So when I fill the polygon, it will be one dot at a time in a for loop or a while loop. However, I can convert the bitmap to a byte array if that makes any difference (might be faster?), so if you have a method that would treat a completed polygon outline as a byte array and fill it that way, it might be faster than 100,000 plot(x,y) commands. I don't know; either way would be interesting to look at.
I'm not trying to draw irregular polygons, just symmetrical (radial symmetry) with an arbitrary number of sides, minimum 3, centered in the bitmap area.
The drawing method is Cartesian with 0,0 being the upper left of the bitmap. I guess the inputs would look something like:
drawpolygon(bitmapobj,width,height,sides,radius)
Perhaps radius is not necessary since the size of the bitmap will be the limit of the polygon?
Looking for steps in English instead of code, if possible, but code could be useful if it doesn't have too many language specific aspects (for instance, c++ has a bunch of declarations, type casting pointers, stuff I don't have to deal with and am not 100% sure how to convert to the language I'm using).
There is an equation given here (the last one).
By looping over all the x and y coordinates and checking whether the output of this equation is less than zero, you can determine which points are 'inside' and colour them appropriately.
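For a regular (convex) polygon you can get the same effect without that exact equation. Here is a minimal Python sketch of the pixel loop, using a per-edge cross-product (half-plane) test in place of the linked formula; the signature roughly mirrors the drawpolygon(...) call from the question:

import math

def drawpolygon(width, height, sides, radius):
    cx, cy = width / 2.0, height / 2.0          # polygon centred in the bitmap
    # vertices of the regular polygon, first vertex pointing straight up
    verts = [(cx + radius * math.cos(2 * math.pi * k / sides - math.pi / 2),
              cy + radius * math.sin(2 * math.pi * k / sides - math.pi / 2))
             for k in range(sides)]

    def inside(px, py):
        # a point is inside a convex polygon if it lies on the same side of every edge
        sign = 0
        for k in range(sides):
            x1, y1 = verts[k]
            x2, y2 = verts[(k + 1) % sides]
            cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
            if cross != 0:
                if sign == 0:
                    sign = 1 if cross > 0 else -1
                elif (cross > 0) != (sign > 0):
                    return False
        return True

    bitmap = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if inside(x + 0.5, y + 0.5):        # one dot at a time, like plot(x, y)
                bitmap[y][x] = 1
    return bitmap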

How to find pixel co-ordinates of corners of a square pattern?

This may not be programming related, but possibly programmers would be in the best position to answer it.
For camera calibration I have an 8 x 8 square pattern printed on a sheet of paper. I have to manually enter the corner co-ordinates into a text file. The software would then pick them up from there and compute the calibration parameters.
Is there a script or some software that I can run on these images and get the pixel co-ordinates of the 4 corners of each of the 64 squares?
You can do this with a traditional chessboard pattern (i.e. black and white squares with no gaps) using cvFindChessboardCorners(). You can read more about the function in the OpenCV API Reference and see some sample code in O'Reilly's OpenCV Book or elsewhere online. As an added bonus, OpenCV has built-in functions that calculate the intrinsic parameters of the camera and an array of extrinsic parameters for the multiple views of a planar calibration object.
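A minimal sketch of the same idea with the OpenCV Python bindings; the file name is a placeholder, and an 8 x 8 board of squares is assumed to give a 7 x 7 grid of inner corners:

import cv2

img = cv2.imread("calibration_pattern.png", cv2.IMREAD_GRAYSCALE)
# an 8x8 chessboard of squares has 7x7 inner corners
found, corners = cv2.findChessboardCorners(img, (7, 7))
if found:
    # optional sub-pixel refinement of the detected corners
    corners = cv2.cornerSubPix(
        img, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
    for (x, y) in corners.reshape(-1, 2):
        print(f"{x:.2f} {y:.2f}")   # pixel co-ordinates, one corner per line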
I would:
apply a threshold and get a binarized image.
apply a SobelX filter to the image. You get an image with the vertical lines. These belong to the sides of the squares that are almost vertical. Keep this as image1.
apply a SobelY filter to the image. You get an image with the horizontal lines. These belong to the sides of the squares that are almost horizontal. Keep this as image2.
compute (image1 AND image2). You get a black image with white pixels indicating the corner positions, where the vertical and horizontal lines intersect.
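In code, these steps could look roughly like this (Python/OpenCV; the file name, the fixed threshold, and the Sobel kernel size are assumptions):

import cv2
import numpy as np

img = cv2.imread("pattern.png", cv2.IMREAD_GRAYSCALE)
_, binarized = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

# vertical lines (sides of the squares that are almost vertical)
sobel_x = cv2.convertScaleAbs(cv2.Sobel(binarized, cv2.CV_16S, 1, 0, ksize=3))
image1 = cv2.threshold(sobel_x, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# horizontal lines (sides of the squares that are almost horizontal)
sobel_y = cv2.convertScaleAbs(cv2.Sobel(binarized, cv2.CV_16S, 0, 1, ksize=3))
image2 = cv2.threshold(sobel_y, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# pixels where both a vertical and a horizontal line respond: the corners
corners = cv2.bitwise_and(image1, image2)
ys, xs = np.nonzero(corners)
for x, y in zip(xs, ys):
    print(x, y)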
Hope it helps.
I'm sure there are many computer vision libraries with varying capabilities and licenses out there, but one that I can remember off the top of my head is ARToolKit, which should be able to recognize this pattern. And if that's not possible, it comes with a set of very good patterns that are tailored so that they can be recognized even if they're partially obscured.
I don't know ARToolKit (although I've heard a lot about it), but with OpenCV this processing is trivial.

Effective data structure for overlapping spatial areas

I'm writing a game where a large number of objects will have "area effects" over a region of a tiled 2D map.
Required features:
Several of these area effects may overlap and affect the same tile
It must be possible to very efficiently access the list of effects for any given tile
The area effects can have arbitrary shapes but will usually be of the form "up to X tiles distance from the object causing the effect" where X is a small integer, typically 1-10
The area effects will change frequently, e.g. as objects are moved to different locations on the map
Maps could be potentially large (e.g. 1000*1000 tiles)
What data structure would work best for this?
Providing you really do have a lot of area effects happening simultaneously, and that they will have arbitrary shapes, I'd do it this way:
when a new effect is created, it is stored in a global list of effects (not necessarily a global variable, just something that applies to the whole game or the current game-map)
it calculates which tiles it affects, and stores a list of those tiles against the effect
each of those tiles is notified of the new effect, and stores a reference back to it in a per-tile list (in C++ I'd use a std::vector for this, something with contiguous storage, not a linked list)
ending an effect is handled by iterating through the interested tiles and removing references to it, before destroying it
moving it, or changing its shape, is handled by removing the references as above, performing the change calculations, then re-attaching references in the tiles now affected
you should also have a debug-only invariant check that iterates through your entire map and verifies that the list of tiles in the effect exactly matches the tiles in the map that reference it.
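A minimal Python sketch of this bookkeeping (class and method names are illustrative, not from the answer):

class Tile:
    def __init__(self):
        self.effects = []                     # per-tile list of active effects

class Effect:
    def __init__(self, compute_tiles):
        self.compute_tiles = compute_tiles    # callable returning the affected tile coords
        self.tiles = []                       # tiles this effect currently touches

class GameMap:
    def __init__(self, width, height):
        self.grid = [[Tile() for _ in range(width)] for _ in range(height)]
        self.effects = []                     # global list of effects

    def add_effect(self, effect):
        self.effects.append(effect)
        self._attach(effect)

    def remove_effect(self, effect):
        self._detach(effect)
        self.effects.remove(effect)

    def move_effect(self, effect):
        # remove the references, perform the change calculations, re-attach
        self._detach(effect)
        self._attach(effect)

    def _attach(self, effect):
        for (x, y) in effect.compute_tiles():
            tile = self.grid[y][x]
            tile.effects.append(effect)
            effect.tiles.append(tile)

    def _detach(self, effect):
        for tile in effect.tiles:
            tile.effects.remove(effect)
        effect.tiles.clear()

    def check_invariants(self):
        # debug-only: each effect's tile list must exactly match the tiles that reference it
        for effect in self.effects:
            for row in self.grid:
                for tile in row:
                    assert (effect in tile.effects) == (tile in effect.tiles)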
Usually it depends on the density of your map.
If you know that every tile (or the major part of the tiles) contains at least one effect, you should use a regular grid, i.e. a simple 2D array of tiles.
If your map is sparsely filled and there are a lot of empty tiles, it makes sense to use a spatial index such as a quadtree, an R-tree, or a BSP tree.
Usually BSP-Trees (or quadtrees or octrees).
Some brute force solutions that don't rely on fancy computer science:
1000 x 1000 isn't too large - just a meg. Computers have gigs. You could have a 2D array. Each bit in the bytes could be a 'type of area'. The larger 'affected area' could be another bit. If you have a reasonable number of different types of areas you can still use a multi-byte bit mask. If that gets ridiculous you can make the array elements pointers to lists of overlapping area-type objects. But then you lose efficiency.
You could also implement a sparse array - using a hashtable keyed off the coords (e.g., key = 1000*x+y) - but this is many times slower.
Of course, if you don't mind coding the fancy computer science ways, they usually work much better!
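A tiny Python sketch of the bit-mask grid and the sparse hashtable variant described above (the area-type flags are made-up examples):

FIRE, POISON, SLOW = 1, 2, 4              # one bit per type of area

width, height = 1000, 1000
grid = bytearray(width * height)          # about a meg: one byte of flags per tile

def add_area(x, y, flag):
    grid[y * width + x] |= flag

def effects_at(x, y):
    return grid[y * width + x]

# sparse variant: a hashtable keyed off the coords, storing only non-empty tiles
sparse = {}
def add_area_sparse(x, y, flag):
    sparse[1000 * x + y] = sparse.get(1000 * x + y, 0) | flag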
If you have a known maximum range of each area effect, you could use a data structure of your choosing and store the actual sources, only, that's optimized for normal 2D Collision Testing.
Then, when checking for effects on a tile, simply check (collision-detection style, optimized for your data structure) for all effect sources within the maximum range, and then apply a defined test function (for example, if the area is a circle, check if the distance is less than a constant; if it's a square, check if the x and y distances are each within a constant).
If you have a small (<10) amount of effect "field" shapes, you can even do a unique collision detection for each effect field type, within their pre-computed maximum range.
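A small Python sketch of this idea, assuming a fixed maximum range and two illustrative shapes:

MAX_RANGE = 10                                 # known maximum range of any effect

def effects_on_tile(tx, ty, sources):
    # sources: list of (sx, sy, shape, size) effect sources
    hits = []
    for (sx, sy, shape, size) in sources:
        dx, dy = abs(tx - sx), abs(ty - sy)
        if dx > MAX_RANGE or dy > MAX_RANGE:
            continue                           # outside the maximum range, skip early
        if shape == "circle" and dx * dx + dy * dy <= size * size:
            hits.append((sx, sy, shape, size))
        elif shape == "square" and dx <= size and dy <= size:
            hits.append((sx, sy, shape, size))
    return hits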

Element point map for html5 canvas element, need algorithm

I'm currently working on a pure HTML5 canvas implementation of the "flying tag cloud sphere", which many of you have undoubtedly seen as a Flash object on some pages.
The tags are drawn fine, and the performance is satisfactory, but there's one thing about the canvas element that's kind of breaking this idea: you can't identify the objects that you've drawn on a canvas, as it's just a simple flat "image".
What I have to do in this case is catch the click event and try to "guess" which element was clicked. So I would have to have some kind of matrix, which stores a link to a tag object for each pixel on the canvas, AND I would have to update this matrix on every redraw. Now this sounds incredibly inefficient, and before I even start trying to implement it, I want to ask the community: is there some "well known" algorithm that would help me in this case? Or maybe I'm just missing something, and the answer is right around the corner? :)
This is called the point location problem, and it's one of the basic topics in computational geometry. There are a lot of methods you could use that would be much faster than the approach you're thinking of, but the details depend on what exactly you want to accomplish.
For example, each text string is contained in a bounding box. Do you just want to test whether the user clicked somewhere in that box? Then simply store the minimum and maximum coordinates of each rendered string, and test the point against each bounding box to see if it's contained in that range. If you have a large number of points to test, you can build any number of data structures to speed this up (e.g. R-trees), but for a single point the overhead of constructing such a structure probably isn't worthwhile.
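A tiny sketch of that bounding-box test (shown in Python for brevity; the tag fields are assumptions):

def hit_test(px, py, tags):
    # tags: list of dicts like {"text": ..., "xmin": ..., "ymin": ..., "xmax": ..., "ymax": ...}
    return [t for t in tags
            if t["xmin"] <= px <= t["xmax"] and t["ymin"] <= py <= t["ymax"]]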
If you care about whether the point actually falls within the opaque area of the stroked characters, the problem is slightly trickier. One solution would be to use the bounding box approach to first eliminate most of the possibilities, and then render the remaining strings one at a time to an offscreen buffer, checking each time to see if the target point has been touched.