How to use "top" and "bottom" parameters to build network architecture - caffe

In a Caffe prototxt, every layer specifies "top" and/or "bottom" parameters to define connections between layers. Sometimes, however, the "top" has the same name as the layer itself (why would a layer link to itself?), or a layer has several "top" parameters. What is the exact meaning of the "top" and "bottom" parameters, and what are the rules for using them?

There is a confusion here between layers and blobs.
In Caffe, all data is represented in the form of blobs. Each layer takes in zero or more blobs, transforms them, and sends out zero or more blobs. For example, a ReLU layer accepts a single blob with the data, applies the function f(x) = x if x>0, 0 otherwise, and outputs the result as a single blob. A data layer for classification problems usually has two output blobs, one for the data and the other for the labels, and no input blob.
The blobs are visualized as if they move through the network from the bottom to the top. So, the input blob is called the bottom blob and the output blob is called the top blob.
Now, in the prototxt definition, the name attribute stores the name of the layer. The bottom attribute stores the name of the input blob. The top attribute stores the name of the output blob, which, for convenience, is generally taken to be the same as the name of the layer. If there are multiple input blobs for that layer, there are multiple bottom attributes, and if there are multiple output blobs, there are multiple top attributes.
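For illustration, here is a minimal, hypothetical prototxt sketch of this wiring (layer parameters are omitted; all names are made up):

layer {
  name: "data"            # name of the layer
  type: "Data"
  top: "data"             # first output blob: the images
  top: "label"            # second output blob: the labels
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "data"          # input blob, produced as a "top" by the data layer
  top: "ip1"              # output blob, named after the layer by convention
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"           # two input blobs...
  bottom: "label"         # ...so two bottom attributes
  top: "loss"
}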

In Caffe:
The names of the links between layers (top/bottom parameters) are very important.
Outgoing links from a layer are "top" and incoming links into a layer are "bottom". So, a top from one layer connects as a bottom into another layer, somewhat like a highway (top/bottom) between two towns (layers). Caffe determines the structure of your network from the names of your top/bottom links.
The names of the layers themselves are far less important, and do not carry structural information. You only need these to be sensible and unique. The same highway carries traffic between two towns regardless of how they are named.
The namespaces of layers and tops/bottoms are separate, so you can name a layer the same as a top or bottom. The shared name carries no structural meaning, but it is confusing and should be avoided.

Related

PyTorch transformer argument "dim_feedforward"

I would like to understand what exactly is going on with this argument.
I have read that the feed-forward sub-layer inside the transformer layer is a "pointwise" feed-forward layer. What does "pointwise" mean in this context?
A feed-forward layer takes two arguments: input features and output features.
This argument can't be the output features, since no matter what value I use for it, the output of the transformer layer always has the same shape. It also can't be the input features, since that is determined by the self-attention sublayer.
MOST IMPORTANTLY - where is the argument for the size of the attention tensors, the ones that translate the input into queries, keys and values?
"Position-wise", or "point-wise", means the feed-forward network (FFN) takes each position of a sequence, say, each word of a sentence, as its input. So a point-wise FFN is a single shared FFN that is applied to each position independently, one by one.
As for the second and third points: that's right. It is neither the input features (they are determined by the self-attention sublayer) nor the output features (which have the same size as the input features). It is actually the size of the hidden features. The thing is, this particular FFN in the transformer encoder has two linear layers, according to the implementation of TransformerEncoderLayer:
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
So dim_feedforward is the number of features in the hidden layer of the FFN. Its value is usually set several times larger than d_model (the default is 2048).
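A quick sketch to confirm this (the sizes here are arbitrary): changing dim_feedforward only alters the hidden width of the FFN, so the output of the layer keeps the shape (seq_len, batch, d_model) no matter what value is used.

import torch
import torch.nn as nn

x = torch.rand(10, 32, 512)  # (seq_len, batch, d_model)
for dim_ff in (64, 2048):
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=dim_ff)
    print(layer(x).shape)  # torch.Size([10, 32, 512]) in both cases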

Training out false positives in object detection

This is my first foray into the world of object recognition. I have successfully trained a YOLO model on images that I found on Google and annotated myself in CVAT.
My questions are as follows.
a) How do I train the model to ignore some special variant that I am specifically NOT interested in detecting? Say I am getting false positives because something looks similar to one of my objects, and I want to train so that these are not detected. Does it simply work to include images that contain the unwanted object in the training set, but not annotate the unwanted object?
b) If so, am I right in assuming that training on annotated images that have somehow missed occasional instances of desired objects effectively tells the training engine that I'm not interested in those objects? In other words, is it therefore BAD if images don't have every single instance of the desired objects annotated?
c) If I happen to include an image with an empty annotation file in my training set, and there are desired objects in that image, does that effectively disincentivize the training engine from finding those in the future?
Thanks for any thoughts.
a) This is true. During training, the model will consider the space inside bounding boxes as positive for a given class, and the space outside the boxes as negative for that class.
b) See (a); this is indeed the case.
c) Empty annotation files will be used during training; the model treats such an image as pure 'background', so its contents act as negatives too.
So, in short: annotate all instances of objects of the classes you care about, and perhaps add 'background images' as negative examples to suppress the false positives.
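As a small illustration of that last suggestion, here is a sketch in Python; it assumes the common YOLO-style convention that an image whose label file is empty counts as a pure background negative, and the dataset paths are hypothetical:

from pathlib import Path

images = Path("dataset/images/train")   # hypothetical dataset layout
labels = Path("dataset/labels/train")
labels.mkdir(parents=True, exist_ok=True)

for img in images.glob("*.jpg"):
    lbl = labels / (img.stem + ".txt")
    if not lbl.exists():
        lbl.touch()  # empty label file -> image is a pure negative example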

Question about dimensions when processing lists with a multi layer perceptron

I'm quite new to PyTorch and I'm trying to build a net, composed only of linear layers, that will get a list of objects as input and output some score (a scalar) for each object. I'm wondering whether my input tensor's dimensions should be (batch_size, list_size, object_size), or whether I should flatten each list and get (batch_size, list_size*object_size). As I understand it, in the first option I will have an output dimension of (batch_size, list_size, 1), and in the second (batch_size, list_size). Does it matter? I read the documentation but it still wasn't very clear to me.
If you want to do the classification for each object in your input, you should keep the objects separate from each other; i.e., your input should have the shape (batch_size, list_size, object_size). Then, considering the number of classes you have (say, m classes), the linear layer transforms the input to the shape (batch_size, list_size, m). In this case, you will have m scores for each object, which can be used to predict the class label.
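A minimal sketch of this behavior (sizes are arbitrary; m = 1 here to match the scalar score in the question): nn.Linear operates on the last dimension only, so each object in the list is scored independently.

import torch
import torch.nn as nn

batch_size, list_size, object_size = 4, 10, 16
x = torch.rand(batch_size, list_size, object_size)
scorer = nn.Linear(object_size, 1)   # m = 1: one scalar score per object
scores = scorer(x)                   # shape: (batch_size, list_size, 1)
print(scores.squeeze(-1).shape)      # torch.Size([4, 10])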
But a question now arises: why do we flatten in neural networks at all? The answer is simple: because you want to couple all the information within a single sample (in your specific case, the information pieces are the objects) to see if the pieces somehow affect each other, and if so, whether your network is able to learn these features/patterns. In practice, considering the nature of your problem and the data you are working with, if the different objects really relate to each other, then your network will be able to learn those relationships.

How to convert a directed graph to its most minimal form?

I'm dealing with rooted, directed, potentially cyclic graphs. Each vertex in the graph has a label, which might or might not be unique. Edges do not have labels. The graph has a designated root vertex from which every vertex is reachable. The order of the edges outgoing from a vertex is relevant.
For my purposes, a vertex is equal to another vertex if they share the same label, and if their outgoing edges are also considered equal (and are in the same order). Two edges are equal if they have the same direction and if the vertices at their corresponding ends are equal.
Because of the equality rules above, a graph can contain multiple "sections" that are effectively equal. For example, in the graph below, there are two isomorphic sections containing vertices with labels {1, 2, 3, 4}. The root of the graph is vertex 0.
[image: the example graph described above; source: graphonline.ru]
I need to be able to identify sections that are identical, and then remove all duplication, without changing the "meaning" of the graph (with regard to the equality rules above). Using the above example as input, I need to produce this:
[image: the same graph with the duplicated section merged; source: graphonline.ru]
Is there a known way of doing this within polynomial time?
The solution that ended up working was to essentially run the recursive equality check against every pair of vertices with the same label.
Let S = all pairs of vertices with the same label
For each pair (a, b) in S:
    Compare a and b by recursively comparing their children
    If they compare as equal, take all edges in the graph pointing to b, and point them to a instead
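A minimal Python sketch of that procedure (the Vertex class and function names are made up, and the recursion assumes the compared subgraphs are acyclic; for cyclic graphs the equality check needs a set of already-visited vertex pairs to terminate):

from itertools import combinations

class Vertex:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []  # ordered outgoing edges

def equal(a, b):
    # Recursive structural equality: same label, same ordered children.
    if a is b:
        return True
    if a.label != b.label or len(a.children) != len(b.children):
        return False
    return all(equal(x, y) for x, y in zip(a.children, b.children))

def merge_duplicates(vertices):
    # For every pair that compares equal, redirect edges pointing at b to a.
    for a, b in combinations(vertices, 2):
        if a.label == b.label and equal(a, b):
            for v in vertices:
                v.children = [a if c is b else c for c in v.children]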

CUDA: Scatter communication pattern

I am learning CUDA from Udacity's course on parallel programming. In a quiz, they give a problem of sorting a pre-ranked variable (players' heights). Since there is a one-to-one correspondence between the input and output arrays, should it not be a Map communication pattern instead of a Scatter?
CUDA has no canonical definition of these terms, as far as I know. Therefore my answer is merely a suggestion of how it might be, or might have been, interpreted.
"Since there is a one-to-one correspondence between the input and output arrays"
This statement doesn't appear to be supported by the quiz diagram, which shows gaps in the output array that have no corresponding input point associated with them.
If a smaller set of values is distributed into a larger array (leaving gaps in the output array where no input value corresponds to the gap locations), then "scatter" might be used to describe that operation. Both scatters and maps involve a mapping that describes where the input values go, but the instructor may have defined scatter and map so as to differentiate between these two cases, perhaps with the following plausible definitions:
Scatter: one-to-one relationship from input to output (i.e. unidirectional relationship). Every input location has a corresponding output location, but not every output location has a corresponding input location.
Map: one-to-one relationship between input and output (i.e. bidirectional relationship). Every input location has a corresponding output location, and every output location has a corresponding input location.
Gather: one-to-one relationship from output to input (i.e. unidirectional relationship). Every output location has a corresponding input location, but not every input location has a corresponding output location.
The definition of each communication pattern (map, scatter, gather, etc.) varies slightly from one language/environment/context to another, but since I have followed that same Udacity course, I'll try to explain these terms as I understand them in the context of the course:
The Map operation calculates each output element as a function of its corresponding input element, i.e.:
output[tid] = foo(input[tid]);
The Gather pattern calculates each output element as a function of one or more (usually more) input elements, not necessarily the corresponding one (typically these are elements from a neighborhood). For example:
output[tid] = (input[tid-1] + input[tid+1]) / 2;
Lastly, the Scatter operation has each input element contribute to one or more (again, usually more) output elements. For instance,
atomicAdd( &(output[tid-1]), input[tid]);
atomicAdd( &(output[tid]), input[tid]);
atomicAdd( &(output[tid+1]), input[tid]);
The example given in the question is clearly not a Map, because each output is calculated from an input at a different location.
Also, it is hard to see at first how the same example can be a Scatter, because each input element causes only one write to the output; but it is indeed a Scatter, because each input causes a write to an output location that is determined by the input itself.
In other words, each CUDA thread processes the input element at the location associated with its tid (thread ID number) and calculates where to write the result. More usually, a scatter writes to several places rather than only one, so this is a particular case that might as well be named differently.
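For concreteness, a hypothetical kernel for the quiz example might look like this sketch (the array names and the precomputed rank array are assumptions, not taken from the course); each thread reads one input, and the input itself determines the output location, which is what makes it a scatter:

__global__ void scatterByRank(const float *height, const int *rank,
                              float *output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        output[rank[tid]] = height[tid];  // write location comes from the input
}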
Each player has three properties (name, height, rank), so I think Scatter is correct, because we have to consider these properties together to produce the output. If each player had only one property, like rank, then I think Map would be correct.
reference: Parallel Communication Patterns Recap in this lecture
reference: map/reduce/gather/scatter with image