Clustering characters (letters/digits) into words based on their bounding boxes - regression

Below is an image of a medicine package.
I have trained an object detection model (YOLOv4) to identify the characters (digits/letters) in the image. In the image above, the green dots represent the center coordinates of the bounding boxes of the individual characters (I haven't drawn the bounding boxes because they look very cluttered, but I am getting proper boxes). My end goal is to detect the text in the image, i.e. to group the characters into words. Can someone suggest an algorithm to cluster the individual characters into words based on the bounding box coordinates of the characters? The correct word representation of the above image would be:
Word1: 'BNOE2IAT016'
Word2: 'MFG112020'
Word3: 'EXP10'
I was able to get the words correctly by incorporating a text detection model like CRAFT: after getting the bounding box of each word, I simply checked whether the center coordinate of each character lies inside that box or not. But the CRAFT model adds a huge runtime overhead (about 7 seconds). How can I do this without any text detection model?
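One heuristic worth trying without any text detector (a rough sketch, not tested on this exact label): cluster the character boxes into lines by their y-centers, then split each line into words wherever the horizontal gap between consecutive characters is much larger than the average character width. The function name and thresholds below are assumptions that would need tuning:

# Rough sketch: group character boxes into lines by y, then split lines into
# words at large horizontal gaps. The gap thresholds are assumptions to tune.

def group_into_words(boxes, line_tol=0.6, gap_factor=1.5):
    """boxes: list of (char, x_center, y_center, width, height)."""
    # Sort by vertical position and greedily cluster into text lines.
    boxes = sorted(boxes, key=lambda b: b[2])
    lines = []
    for box in boxes:
        h = box[4]
        if lines and abs(box[2] - lines[-1][-1][2]) < line_tol * h:
            lines[-1].append(box)
        else:
            lines.append([box])

    words = []
    for line in lines:
        line.sort(key=lambda b: b[1])               # left to right
        avg_w = sum(b[3] for b in line) / len(line)
        word = [line[0]]
        for prev, cur in zip(line, line[1:]):
            # A gap much wider than an average character starts a new word.
            if cur[1] - prev[1] > gap_factor * avg_w:
                words.append("".join(b[0] for b in word))
                word = []
            word.append(cur)
        words.append("".join(b[0] for b in word))
    return words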

Related

Is there a way to determine whether there's a sufficient amount of space in a Pie chart segment for a label?

In the attached Google Charts pie chart the labels fit well inside the segments. Determining the length of a bit of text on an HTML5 canvas is easy enough - but how do you determine whether the label will fit into a particular segment (using trigonometry)? As you can see in the image, two of the segments don't have labels inside them.
EDIT: Here's an example of what I have at the moment: https://www.rgraph.net/tests/canvas.pie/in-pie-labels.html
As you see the labels for the small segments overlap. What I'm after is a way to calculate whether there's enough space for the labels at the point where they're going to be rendered. If not, I can just not draw the label like in the example image above.
Could chord size be useful to do this?
Here's the formula for the chord length that I found via Google:
"Chord length using trigonometry = 2 × r × sin(θ/2); where 'r' is the radius of the circle and 'θ' is the angle subtended at the center by the chord."
I sorted it out (in about one hour, after 3 days of trying to calculate it with trig) by using the built-in context.isPointInPath() function:
Draw the text (in a transparent color) to get its coordinates (x/y/w/h). You might be able to get away with just measuring it to get the width and height.
Draw the segment in a transparent color and do not stroke or fill it. Also, do not close the path.
Test each corner of the text rectangle (formed by the x/y/w/h that you got above) using the context.isPointInPath() function. If the function returns true for each corner of the rectangle formed by the coordinates of the text, then the text will fit into the segment.
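If you ever need the same corner test outside the canvas API, it can also be done with plain trigonometry instead of context.isPointInPath(); a rough Python sketch, assuming the segment is described by its centre (cx, cy), radius r and start/end angles a0, a1 in radians (measured the same way the segment was drawn):

import math

def point_in_segment(px, py, cx, cy, r, a0, a1):
    """True if (px, py) lies inside the pie segment centred at (cx, cy)."""
    dx, dy = px - cx, py - cy
    if dx * dx + dy * dy > r * r:              # outside the circle
        return False
    ang = math.atan2(dy, dx) % (2 * math.pi)
    a0, a1 = a0 % (2 * math.pi), a1 % (2 * math.pi)
    if a0 <= a1:
        return a0 <= ang <= a1
    return ang >= a0 or ang <= a1              # segment crosses the zero angle

def label_fits(text_x, text_y, text_w, text_h, cx, cy, r, a0, a1):
    """Check all four corners of the text rectangle against the segment."""
    corners = [(text_x, text_y), (text_x + text_w, text_y),
               (text_x, text_y + text_h), (text_x + text_w, text_y + text_h)]
    return all(point_in_segment(x, y, cx, cy, r, a0, a1) for x, y in corners)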

How Can I Convert Dataset Annotations to a Fixed (YOLOv5) Format Without Hand Encoding

I am working on an object detection project where the first task is to identify brand logos. After doing some research I found this dataset available for brand logos (for more about the dataset: here).
DATASET:
This dataset has 2 versions:
FlickrLogos32
FlickrLogos47 (recommended for brand detection)
As the names suggest, 32 and 47 are the numbers of classes offered by this dataset. The documentation itself mentions that the 47-class version is correctly annotated and recommended for object detection & recognition, and it is the version I have used in my project.
Model:
I am using YOLOv5 for object detection. The reason for using YOLOv5 rather than previous versions is that it is well documented, with a couple of tutorials and Jupyter notebooks available.
Problem:
For the YOLOv5 object detection model, each object label should be annotated as
<x_center> <y_center> <width> <height> for the bounding box (see the image below),
whereas the dataset annotations are given in the form
<x1> <y1> <x2> <y2>, where <x1>,<y1> is the upper-left corner of the bounding box
and <x2>,<y2> is the lower-right corner.
How can I transform the corner points <x1>,<y1>,<x2>,<y2> of a bounding box into the native YOLO
annotation format, i.e. <x_center>,<y_center>,<width>,<height>,
without manually going over the images one by one and drawing rectangles in Roboflow?
Also, the labels are annotated in pixels, so they have to be normalized to the range (0,1).
Dataset Insights:
Each dataset example has an image (.png) and, as its label, a ground-truth file (.txt) (see the image below).
The '.mask' file is just a binary mask of the object present in the image.
So a data example looks like:
Image:
gt_data.txt:
Mask:
In general, to calculate the center it should be xmin + (width/2) and ymin + (height/2), so I think you have the /2 in the wrong part of the equation.
Also note that a YOLO annotation line will look like this (preceded by the class index):
0.642859 0.079219 0.148063 0.148062
The coordinates are relative to the size of the photo, ranging from 0 to 1. To normalize the coordinates you need to divide the x dimensions by the photo width and the y dimensions by the photo height.
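Putting that together, a minimal Python sketch of the conversion (the class index that YOLO expects as the first field of each label line would be written in front of these four numbers):

def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corner coordinates to normalized YOLO box format."""
    w = x2 - x1
    h = y2 - y1
    x_center = x1 + w / 2
    y_center = y1 + h / 2
    # Normalize: x values by image width, y values by image height.
    return x_center / img_w, y_center / img_h, w / img_w, h / img_h

# Example: a 100x50 box at (200, 300) in a 1280x720 image
print(corners_to_yolo(200, 300, 300, 350, 1280, 720))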

Anchor Boxes in YOLO: How are they decided?

I have gone through a couple of YOLO tutorials, but I am finding it somewhat hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells and it stated that each cell predicts 5 anchor boxes (bigger than the cell itself; here is my first problem, because it also says the cell would first detect what object is present in it before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? Also, if each cell classifies before predicting its anchor boxes, how can a small cell classify the right object without querying neighbouring cells when only a small part of the object falls within it?
E.g. say one of the 13x13 cells contains only the white pocket of a man's T-shirt; how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when trying to localize a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what is going on everywhere in the image before deciding where the box should be.
PS: My current understanding of how YOLO works is basically that each cell is assigned predetermined anchor boxes, each with a classifier at its end, and the boxes with the highest scores for each class are then selected, but I am sure this doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking craq's answer as correct, because that's how anchor boxes are decided according to the YOLOv2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from in front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represent your dataset: k=5 for YOLOv2 and k=9 for YOLOv3, but there are different numbers of anchors for each YOLO version.
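A rough sketch of that k-means step in Python, using the 1 - IoU distance from the YOLOv2 paper on box widths/heights only (all boxes treated as if centred at the origin); the initialization and fixed iteration count here are simplifications:

import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) width/height boxes and (k, 2) centroids, origin-aligned."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Pick k anchor (width, height) pairs that represent the dataset's boxes."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids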
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to be, in order to detect large objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting confidence close to zero. The final loss function is a combination from all anchors. Since YOLO tries to minimise its overall loss function, the anchor closest to ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
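For reference, this is roughly how YOLOv2 turns a raw prediction (tx, ty, tw, th) for the cell at (cx, cy) and an anchor of size (pw, ph) into a box, in grid units (a sketch of the decoding only, ignoring the confidence and class outputs):

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2-style decoding of raw offsets into a box in grid units."""
    bx = cx + sigmoid(tx)      # center stays inside the responsible cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)     # anchor size scaled by the predicted log-offset
    bh = ph * math.exp(th)
    return bx, by, bw, bh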
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+1+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+1+nbOfClasses): (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), (h, w)-deviation of the bounding box (deviation from the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector indicating a probability for each class.
If you want to have more information about the determination of the anchor priors you can take a look at the original paper in the arxiv: https://arxiv.org/pdf/1612.08242.pdf.

Make a turn-based system like Final Fantasy Tactics in AS3

I want to make a turn-based system like Final Fantasy Tactics. I have already created the map, which is a 5x5 tile grid, and the characters, each of which is placed at the edge of the grid. I have 2 teams, named Red and Yellow.
------Red-------:
First character is at 0,0. Second character is at 0,1. Third character is at 0,2, fourth character is at 0,3, and the last one is at 0,4.
-----Yellow------:
First character is at 5,0. Second character is at 5,1. Third character is at 5,2, fourth character is at 5,3, and the last one is at 5,4.
I want the Red team to move first and make a decision (whether to attack or wait), and after all 5 characters of the Red team have made a decision, the Yellow team makes its decisions (the Yellow team is an AI).
But I don't know how to move my characters to the next grid cell (e.g. from 0,0 to 0,1) by clicking the left mouse button, and also how to display a grid (when a move is selected) that shows how many tiles the character is able to move.
Does anyone know how to do this, or where I can learn more about it? Are there any recommended books or websites?
You have your basic data structures set up, but now you need to get some higher level code to manipulate that data.
First of all, I think you should work on selecting locations on the grid with the mouse. Once you can click and get that grid coordinate saved to a variable, you need to set up a function to move your characters. After the first click (on a character), you need to check the valid moves, and for each valid move, you need to render an image on the grid square (or highlight the square's texture).
Secondly, you need a function which iterates through all the characters in each team, according to who moves next. When you have gone through Red.length (red is an array consisting of each player), then you switch to counting through Yellow.length, and running the AI for each character. If you are trying to make a two player game, you instead ask for user input a second time for the yellow team.
I recommend that you learn how to display your grid and set up a simple way to highlight squares on the grid. After that, you need to convert mouse coordinates into grid coordinates. Your teams should each be an array of characters. I'm not familiar with ActionScript, but in the languages I know, they would look like this:
team[6] = {Character1, Character2, Character3... }
Character1.position = {x, y}
running a turn would be something like this:
while (battle == not finished) {
    for (i = 0; i < red.length; i++) {
        getInput();
        move(red[i], newX, newY); // red[i].position = {newX, newY}
    }
    for (i = 0; i < yellow.length; i++) {
        runAI();
        move(yellow[i], newX, newY);
    }
}
The hardest part will be the mouse selection and drawing the grid/characters. Graphics are always a nuisance. The data itself just takes a bit of thinking. Your question in particular seems to be about game programming. My advice is to make the grid, then figure out how to display the grid. Then get mouse input. Finally, worry about moving the characters and highlighting squares.
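The mouse-to-grid conversion mentioned above is just integer division by the tile size; for example (shown in Python for brevity, with an assumed tile size and grid origin):

TILE_SIZE = 32         # assumed pixel size of one tile
GRID_X, GRID_Y = 0, 0  # assumed pixel position of the grid's top-left corner

def mouse_to_grid(mouse_x, mouse_y):
    """Convert a mouse click in pixels to (column, row) grid coordinates."""
    col = (mouse_x - GRID_X) // TILE_SIZE
    row = (mouse_y - GRID_Y) // TILE_SIZE
    return col, row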

How to expand a skeleton back to a normal vessel with ITK when I have the skeleton line and a radius for every pixel?

I did a thinning operation on vessels, and now I'm trying to reconstruct them.
How can I expand them back to normal vessels in ITK when I have the skeleton line and a radius value for each pixel?
DISCLAIMER: This could be slow, but since no other answer has been suggested, here you go.
Since your question does not indicate this, I'm assuming that you're talking about a 2D image, but the following approach can be extended for 3D too. This is how I'd go about it:
Create a blank image with zero filled pixel values
Create multiple instances of a disk/sphere ShapedNeighborhoodIterator on the blank image, each with a different radius (choose the most common radii from the vessel width histogram).
Visit each pixel in the binary skeleton image. When you come upon a white (vessel skeleton) pixel, look up the vessel radius at that pixel.
If you already have a ShapedNeighborhoodIterator for that radius value, take the iterator to the pixel location in the blank image and fill up a disk/sphere of white pixels centered about that pixel. If you don't have a ShapedNeighborhoodIterator for that radius value, create one and do the same operation.
Once you finish iterating over the skeletonized image, you will have a reconstructed tree in the other image. Note that step 2 is optional, but will help you achieve faster computation.
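If setting up the ShapedNeighborhoodIterator instances proves awkward, the painting step can also be expressed directly on a NumPy array view of the images (a sketch of the same idea rather than ITK code; skeleton is assumed to be a 2D boolean array and radii holds the radius at each skeleton pixel):

import numpy as np

def reconstruct_vessels(skeleton, radii):
    """Paint a filled disk of the stored radius at every skeleton pixel."""
    out = np.zeros(skeleton.shape, dtype=bool)
    disks = {}                                   # cache one disk mask per radius
    for y, x in zip(*np.nonzero(skeleton)):
        r = int(round(float(radii[y, x])))
        if r not in disks:
            yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
            disks[r] = (yy * yy + xx * xx) <= r * r
        # Clip the disk to the image borders before painting it in.
        y0, y1 = max(0, y - r), min(out.shape[0], y + r + 1)
        x0, x1 = max(0, x - r), min(out.shape[1], x + r + 1)
        d = disks[r][y0 - (y - r):y1 - (y - r), x0 - (x - r):x1 - (x - r)]
        out[y0:y1, x0:x1] |= d
    return out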