How to get the nth element of a list using a neural network? - deep-learning

I would like to add a neural network layer that takes as input the output of another layer in the network together with a separate number k, and outputs the kth element of that input. This layer is supposed to be part of a bigger deep network that supplies only the kth element to the succeeding layer.
One way I can think of is to dynamically set the layer's weights to a one-hot array with only the kth element equal to 1 and the rest all zeros.
A second way would be to freeze the weights and multiply the previous layer's output with the one-hot vector, feeding the result into the next layer. But I am not sure how to do this.
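For what it's worth, here is a minimal PyTorch sketch of that second idea (the tensor values and the choice of 0-based indexing are my own, for illustration):

import torch
import torch.nn.functional as F

# Select the element at index k by multiplying with a one-hot vector;
# no learned weights are involved, so nothing needs to be frozen.
x = torch.tensor([3.0, 1.0, 7.0, 5.0])   # output of the previous layer
k = 2                                     # 0-based index of the wanted element
one_hot = F.one_hot(torch.tensor(k), num_classes=x.numel()).float()
kth = (x * one_hot).sum()                 # tensor(7.)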

You can just compose top-k modules from any library:
just_kth_element(x, k) := -topk(-topk(x, k=k), k=1)
Since the kth element is nothing but the smallest element among the top-k elements.
Or equivalently:
just_kth_element(x, k) := min(topk(x, k=k))
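In case it's useful, here is a minimal PyTorch sketch of that composition (note that, like the pseudocode above, it returns the kth-largest value rather than the element at position k):

import torch

def just_kth_element(x, k):
    # topk keeps the k largest entries; the kth largest is the smallest
    # of those, recovered via the double-negation trick from above.
    top_k = torch.topk(x, k=k).values
    return -torch.topk(-top_k, k=1).values.squeeze(-1)

x = torch.tensor([3.0, 1.0, 7.0, 5.0])
print(just_kth_element(x, k=2))  # tensor(5.), the 2nd-largest element

Since topk is differentiable with respect to the selected values, gradients flow back only to the chosen element.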

Anchor Boxes in YOLO: How are they decided?

I have gone through a couple of YOLO tutorials, but I am finding it somewhat hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells, and it stated that each cell predicts 5 anchor boxes (bigger than the cell itself; here is my first problem, because the guide also says each cell first detects what object is present before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? Also, it is said that each cell classifies before predicting its anchor boxes; how can a small cell classify the right object without querying neighbouring cells if only a small part of the object falls within it?
E.g. say one of the 13x13 cells contains only the white pocket of a man wearing a T-shirt: how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when trying to localize a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what is going on everywhere in the image before deciding where the box should be.
PS: What I currently think of how YOLO works is basically that each cell is assigned predetermined anchor boxes, with a classifier for each, before the boxes with the highest scores for each class are selected, but I am sure it doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking @craq's answer as correct, because that is how anchor boxes are decided according to the YOLOv2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from the front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represents your dataset. k = 5 for YOLOv2 (YOLOv3 uses 9, three per detection scale); the number of anchors differs between YOLO versions.
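For concreteness, here is a minimal numpy sketch of that procedure, using the 1 - IoU distance described in the YOLOv2 paper (the function names are mine):

import numpy as np

def iou_wh(boxes, anchors):
    # IoU between boxes and anchors compared by width/height only,
    # as if all were centred at the origin (positions are ignored).
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    # boxes: float array of shape (N, 2) holding the (width, height)
    # of every ground-truth box in the dataset.
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Max IoU corresponds to min (1 - IoU) distance.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors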
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to, in order to detect large objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting their confidence close to zero. The final loss function is a combination over all anchors. Since YOLO tries to minimise its overall loss, the anchor closest to the ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
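As a rough illustration of that matching step (a sketch under my own naming, not YOLO's actual code):

import numpy as np

def wh_iou(wh1, wh2):
    # IoU of two boxes compared by width/height only, both centred
    # at the origin -- offsets are ignored at assignment time.
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    return inter / (wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter)

def assign_anchor(gt_wh, anchors):
    # Each ground-truth box is matched to the single best-fitting
    # anchor; all other anchors count as background for this object.
    return int(np.argmax([wh_iou(gt_wh, a) for a in anchors]))

anchors = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]  # toy (w, h) anchors
print(assign_anchor((1.9, 1.1), anchors))        # 1 -> the 2:1 anchor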
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13x13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2 + 2 + 1 + nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2 + 2 + 1 + nbOfClasses): the (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), the (w, h)-deviation of the bounding box (relative to the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector indicating a probability for each class.
If you want more information about how the anchor priors are determined, you can take a look at the original paper on arXiv: https://arxiv.org/pdf/1612.08242.pdf.
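To make that output format concrete, here is a small sketch of how one anchor's raw values are typically decoded into a box under the YOLOv2 parameterisation (b_x = sigmoid(t_x) + c_x, b_w = p_w * exp(t_w)); the function and argument names are mine:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode(pred, anchor_w, anchor_h, cx, cy, grid=13):
    # pred holds the raw outputs for one anchor in cell (cx, cy):
    # [tx, ty, tw, th, to, class scores...]
    bx = (cx + sigmoid(pred[0])) / grid   # box centre, normalised to [0, 1]
    by = (cy + sigmoid(pred[1])) / grid
    bw = anchor_w * np.exp(pred[2])       # anchor size scaled by the offset
    bh = anchor_h * np.exp(pred[3])
    objectness = sigmoid(pred[4])
    scores = np.exp(pred[5:]) / np.exp(pred[5:]).sum()  # softmax classes
    return bx, by, bw, bh, objectness, scores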

Why do filters and feature layers have the same number of channels?

Some object detection frameworks, such as SSD (Single Shot MultiBox Detector) and Faster R-CNN, have “convolutional filters” for classification and regression. The following is from SSD:
For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value.
My question is: does the number of “small kernels” have to be p? How about setting an arbitrary number k (which is not the same as the number of feature channels)?
In the figure, the part “Extra Feature Layers” shows how the small kernels extract a p-vector at each output location, which predicts detections for different aspect ratios and class categories.
For example, for the first convolutional feature map, p is 3 x (classes + 4), and for the second one it is 6 x (classes + 4). The numbers 3 and 6 indicate the number of anchor boxes defined for those feature maps, and for each of those anchor boxes there are class scores plus 4 box-coordinate outputs.
So you need to fix p based on the number of anchor boxes you decide on for each feature map and the number of classes you want to detect.
My question is: does the number of “small kernels” have to be p? How about setting an arbitrary number k (which is not the same as the feature channels)?
The prediction map is the result of convolving with the 3x3xp kernel, so it will always have p channels, because p is the output-channel count of the kernel. Note that 3x3xp is really 3 x 3 x in_channels x p; for example, the first prediction layer is obtained by convolving the 38x38x512 feature map from VGG with a 3x3x512xp kernel to get a 38x38xp output.
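A minimal PyTorch sketch of such a prediction head, with the class and anchor counts picked arbitrarily for illustration:

import torch
import torch.nn as nn

num_classes = 21   # e.g. 20 object classes + background (an assumption)
num_anchors = 3    # default boxes tied to this feature map
p = num_anchors * (num_classes + 4)

# A single 3x3 conv whose output-channel count p is dictated by the
# anchor and class counts, not chosen freely.
head = nn.Conv2d(in_channels=512, out_channels=p, kernel_size=3, padding=1)

feat = torch.randn(1, 512, 38, 38)  # e.g. the 38x38x512 VGG feature map
print(head(feat).shape)             # torch.Size([1, 75, 38, 38])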

Quadtrees: a common intersect method failing to handle a simple case

I am writing a simple GUI library and am using quadtrees to determine which, if any, objects are interacted with during a mouse event. I was looking through a number of quadtree libraries on github and they all contained a method for adding a rectangular object to a quadtree.
The method, in all cases, simply checked to see if the rectangle intersected with the given quadtree:
return (quadtree.x2 >= rect.x1
        and quadtree.x1 <= rect.x2
        and quadtree.y2 >= rect.y1
        and quadtree.y1 <= rect.y2)
However, this gives an unwanted result in one of the simplest cases: Imagine a 100x100 square area. I place four 50x50 square objects into the area with coordinates (0,0), (0,50), (50,0), and (50,50). If these objects had been placed into a 100x100 quadtree with a maximum capacity of one object, I would (visually) expect that the first layer of the quadtree would split and that the four resulting trees would each exactly contain one of the squares.
If I use the above method to determine which tree the squares are placed into, though, I find that each object intersects with all four trees. This would cause each of the trees to rapidly split until the maximum depth is reached.
The only way I see to avoid this is to use two checks:
return ((quadtree.x2 > rect.x1
         and quadtree.x1 < rect.x2
         and quadtree.y2 > rect.y1
         and quadtree.y1 < rect.y2)
        or (quadtree.x1 == rect.x1
            and quadtree.x2 == rect.x2
            and quadtree.y1 == rect.y1
            and quadtree.y2 == rect.y2))
(in the simplest case; larger objects would have to be viewed within a bounding box, since, for example, an object at (0,0) with w=100, h=100 would belong in the upper-left quadtree as well.)
I could also calculate the overlap between the rectangles and the quadtrees to see if it's non-zero.
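For reference, a minimal sketch of that overlap check, with attribute names matching the snippets above:

def overlap_area(a, b):
    # Positive only if the rectangles share interior area;
    # merely touching edges gives zero width or height.
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(w, 0) * max(h, 0)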
Am I missing something? It seems like this should be an ideal situation for a quadtree, yet, in most implementations, it's a huge mess.
I wouldn't call this an ideal situation, because the four rectangles overlap by a fractional amount. For example, if we assume a (fictional) floating-point precision of 10^(-10), every 'point' is actually a small square of side 10^(-10), and thus the rectangles overlap by 10^(-10). This is why you get the deep tree.
But I also think the tree could be improved with slightly modified overlap checking. With your code, the sub-nodes all overlap by that tiny amount. It would work better if you exclude the minimum (or the maximum) values, for example:
return (quadtree.x2 >= rect.x1
        and quadtree.x1 < rect.x2
        and quadtree.y2 >= rect.y1
        and quadtree.y1 < rect.y2)
So the lower-left coordinate of a node is effectively outside of that node. This at least avoids points turning up in several nodes (such as the point (50,50)), and the lower-left rectangle is stored in only one node.
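A quick runnable check of that behaviour (using plain dicts for the node and rectangle fields, purely for illustration):

def intersects(node, rect):
    # A node's maximum edges are inclusive, its minimum edges exclusive.
    return (node['x2'] >= rect['x1'] and node['x1'] < rect['x2'] and
            node['y2'] >= rect['y1'] and node['y1'] < rect['y2'])

children = [
    {'x1': 0,  'y1': 0,  'x2': 50,  'y2': 50},
    {'x1': 50, 'y1': 0,  'x2': 100, 'y2': 50},
    {'x1': 0,  'y1': 50, 'x2': 50,  'y2': 100},
    {'x1': 50, 'y1': 50, 'x2': 100, 'y2': 100},
]
point = {'x1': 50, 'y1': 50, 'x2': 50, 'y2': 50}  # corner shared by all four
print(sum(intersects(c, point) for c in children))  # 1 -> exactly one node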

Cellular automata: getting non-living neighbours

I'm trying to develop a cellular automata simulation. The problem is that I want to get the close neighbours and far neighbours of each cell (illustrated as blue and beige), determine which of them are dead, and, using some rules, bring them to life. So at each iteration I'll be running through all the cells in the array, and I want to somehow efficiently get all the close and far neighbours of each cell.
However, depending on the position of a cell on the grid, only some of its neighbours will be available, and the only way I have thought of doing this so far is a getNeighbours(cell) method which returns a list of all the available neighbours of that cell, which I then have to iterate over to get the non-living ones.
getNeighbours(cell):
    if cell.row > 0:
        neighbours.add((coordinate, value), CLOSE_TOP_MIDDLE)
    if cell.row > 1:
        neighbours.add((coordinate, value), FAR_TOP_MIDDLE)
    [...]
However, that is a lot of overhead and a lot of comparisons for each cell in the grid!
Is there any generic approach that is commonly used with cellular automata? Maybe a data structure I could use? Because with what I have so far, each iteration will take a long time if the grid is large enough.
Depending on the programming language you use, there may be packages which provide the desired functionality. In Java, for example, there is a package called JCASim, a cellular automata simulation system.
Finding neighbours in a CA can be a non-trivial task (e.g., if you use hexagonal cells). Even the term 'neighbour' has to be defined: Moore neighbourhood or von Neumann neighbourhood (the Wikipedia articles on these also provide some pseudo-code).
In your case, you can implement the neighbour search yourself:
Let's assume your CA consists of n rows and n columns (labelled 0, ..., n-1), as shown in your picture.
Your getNeighbours function has to check all next-neighbour cells (grey background colour in your image).
If you use periodic boundary conditions, you can use the modulus operator (%) to get the eight next-neighbour cells. With periodic boundary conditions, the neighbours of cell (x, y) are ((x+1) % n, y), (x, (y+1) % n), ((x+1) % n, (y+1) % n), ((x+n-1) % n, y), (x, (y+n-1) % n), ...
With open boundaries, you instead have to discard all neighbours where x+1 > n-1, y+1 > n-1, x-1 < 0, or y-1 < 0.
This way, you can check all cells with a grey background colour in your picture.
Call the same function on each of the grey cells; this way you also check the cells with a blue background colour.
Now you have checked all cells in the neighbourhood that you defined.
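A compact way to express this in Python, wrapping with the modulus operator (the radius parameter and function name are my own choice):

def neighbours(n, x, y, radius=1):
    # Periodic (wrap-around) boundaries via %; for open boundaries,
    # skip offsets that would leave the grid instead of wrapping.
    cells = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue  # the cell itself is not its own neighbour
            cells.append(((x + dx) % n, (y + dy) % n))
    return cells

print(len(neighbours(10, 0, 0, radius=1)))  # 8  -> the close (Moore) ring
print(len(neighbours(10, 0, 0, radius=2)))  # 24 -> close plus far cells

radius=1 yields the eight close neighbours; radius=2 additionally covers the outer ring of sixteen, matching the close/far distinction in the question.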

Variable names of unordered set items without implied structure

This question will be asked in a specific form, but it applies to a more general question: how to name unordered set items without implying any sort of structure.
In terms of graph theory, a connected, undirected graph will contain vertices that are connected via edges.
When creating an edge class with two member variables that are vertices, representing the two vertices that the edge connects, there was a difficulty in naming the two variables without some form of implied structure.
Consider
class Edge {
    Vertex v1;
    Vertex v2;
}
or
class Edge {
    Vertex left;
    Vertex right;
}
or
class Edge {
    Vertex a;
    Vertex b;
}
{v1, v2} implies order and a larger possible size than two, though an edge only has two ends.
{a, b} is similar to {v1, v2}, only substituting different symbols.
{left, right} or {up, down} imply direction, which may be counter-intuitive when there is not necessarily any spatial reference to the graph, since raw graphs are pure abstractions.
{start, end} would work for a directed graph but seems arbitrary in an undirected graph.
The closest that I can consider is:
class Edge {
    Vertex oneEnd;
    Vertex otherEnd;
}
but that feels kludgey.
What name complies with good practice for such variables without implying any form of direction, ordering, or structure?
I'd go with Edge { Vertex v1; Vertex v2; }. I don't think the users of your code will interpret the numerical suffixes as an order, but simply as differentiators. What if your unordered set contained 10 or 100 items, as could be the case with, for example, a polygon structure? I'm sure the most intuitive solution would be to use numerical indices/suffixes when naming the items.
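If you want the lack of ordering to be explicit in code rather than just in the names, one option (sketched here in Python, as a matter of taste rather than established practice) is to expose the endpoints as a set:

from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    v1: str
    v2: str

    def endpoints(self):
        # A frozenset makes the absence of order explicit, even though
        # the fields themselves carry numeric suffixes.
        return frozenset((self.v1, self.v2))

print(Edge('a', 'b').endpoints() == Edge('b', 'a').endpoints())  # True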