"Neural nets have a weight space symmetry: we can permute all the hidden units in a given layer and obtain an equivalent solution" (From CSC321, lecture 10, Optimation)
I don't think it make sense, is there something wrong with my understanding?
For example, there is a simple DNN with 2 units in the only hidden layer. And there is one local optima and one global optima like this:
Obviously 2 symmetric points will result in different solution, they will go into different optima(the right-bottom one is the global optima).
Please tell me where it goes wrong?
I think you miss the definition of symmetry.
Geometry is the branch of mathematics studying invariants under some class of transformations. The invariants of a geometry are called the symmetry of the geometry. For instance, the symmetries of Euclidean geometry is length and angles because rotations and translations (the group of Euclidean transformations) preserve them. Simply put, in Euclidean geometry, length and angles are the symmetries of the geometry. In the same vein, the symmetry of the affine geometry is parallelism.
In the context of deep learning, weight space symmetry means that non-identifiable models are invariant to random permutations in their weight layers. This symmetry holds since in deep learning there are generally not enough training samples to rule out all parameter settings but one, there usually exist a large amount of possible weight combinations for a given dataset that yield similar model performance.
Sure, if you permute the weights of input layers randomly - you'll not come with the same result. Becase the order of input elements matter.
The permutations symetry is about permuting the neurons of hidden layers, not about permuting weights of single neuron.
For example, your hidden layer has 2 neorons with weights w11, w12, w13 and w21 w22, w23.
So the permutation principle states that you can easily permute
w11 <-> w21, w12<->w22 and w13<->w23 and the result will remain the same
the weight symmetry here means that there is an equivalent weight that maps the input to output. It doesn't mean the geometrical symmetry in coordinate space. You can have a deeper look in Bishop Ch5.1
Related
I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of features of each query, as the matrix is quadratic. Is it a bug of realization or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix and the projected queries are passed to F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
The attention head sizes are a design decision of PyTorch. In theory, you could have a different head size, so the projection matrix would have a shape of embedding_dim × num_heads * head_dims. Some implementations of transformers (such as C++-based Marian for machine translation, or Huggingface's Transformers) allow that.
...and if so under what circumstances?
A Convolutional Layer usually yields an output of lesser size. Is it possible to reverse/invert such an operation by flipping/transposing the used kernel and providing padding or likewise?
Just looking at the convolutional layer's operation here - without pooling layers, concatenation, non-linear activation functions etc.
I'm not looking for any of the several trainable versions of reverse convolutional operations. Such can be achieved by strides $\geq 1$ in the output space or intrinsic padding in the input space for example. Vincent Dumoulin and Francesco Visin provide very elucidating, animated gifs on their github page. And the Deep Learning community is divided over the naming of these operations: Transpose convolution, fractionally strided convolution and deconvolution are all used (the latter, although widely used, is very misleading since it's no proper mathematical deconvolution).
I believe this is where the difference between a transposed convolution and a deconvolution is essential.
A deconvolution is the mathematical inverse of what a convolution does, whereas a transposed convolution reverses only the spatial transformation between input and output. Meaning, if you want to reverse the changes concerning the shape of the output, a transposed convolution will do the job, but it will not be the mathematical inverse concerning the values it produces. I wrote a few words about this topic in an article.
Well the deconvolution is defined very clearly.
In convo, you multiply the map with the input frame like vector miltiplication, summ it and ASSIGN the value to output.
In deconvo, you take the output value (one), multiplie it with the map, quasi highlighting the points what influated the output and ADD it to former input layer, (what of course must be filled with zeros in the begin. This must give the same "input" layer as in forward pass.
I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNN's) needs to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3 is used.
But, how are the values for these 32x3x3 matrices are determined?
It's not the gradients that are adjusted, the gradient calculated with the backpropagation algorithm is just the group of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductive guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. Same goes for weights in a fully connected layer, or any other layer with weights.
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, ie. you'd like to minimize C as much as possible. Imagine you want to find the minimum value for simple function like f(x)=x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (ie, the derivative) and move down that direction, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (ie. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (ie. you place yourself on a random position on the plot for the cost function C), then compute the gradients, then "move downhill", ie. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states ,"The weights in the convolution layer in your example will be initialized to random values". So if they are random and say I want 10 filters. Every execution algorithm could find different filter. Also say I have Mnist data set. Numbers are formed of edges and curves. Is it guaranteed that there will be a edge filter or curve filter in 10?
I mean is first 10 filters most meaningful most distinctive filters we can find.
best
Link to paper
I'm trying to understand the region proposal network in faster rcnn. I understand what it's doing, but I still don't understand how training exactly works, especially the details.
Let's assume we're using VGG16's last layer with shape 14x14x512 (before maxpool and with 228x228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding box coordinates. My intermediate layer is a 512 dimensional vector.
(image shows 256 from ZF network)
In the paper they write
"we randomly sample 256 anchors in an image to compute the loss
function of a mini-batch, where the sampled positive and negative
anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that for each one of the 9(k) anchor types the particular classifier and regressor are trained with minibatches that only contain positive and negative anchors of that type?
Such that I basically train k different networks with shared weights in the intermediate layer? Therefore each minibatch would consist of the training data x=the 3x3x512 sliding window of the conv feature map and y=the ground truth for that specific anchor type.
And at inference time I put them all together.
I appreciate your help.
Not exactly. From what I understand, the RPN predicts WHk bounding boxes per feature map, and then 256 are randomly sampled per the 1:1 criteria, and these are used as part of the computation for the loss function of that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not of any particular type.
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
I have data with integer target class in the range 1-5 where one is the lowest and five the highest. In this case, should I consider it as regression problem and have one node in the output layer?
My way of handling it is:
1- first I convert the labels to binary class matrix
labels = to_categorical(np.asarray(labels))
2- in the output layer, I have five nodes
main_output = Dense(5, activation='sigmoid', name='main_output')(x)
3- I use 'categorical_crossentropy with mean_squared_error when compiling
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['mean_squared_error'],loss_weights=[0.2])
Also, can anyone tells me: what is the difference between using categorical_accuracy and 'mean_squared_error in this case?
Regression and classification are vastly different things. If you reimagine this as a regression task than the difference of predicting 2 when the ground truth is 4 will be rated more than if you predict 3 instead of 4. If you have class like car, animal, person you do not care for the ranking between those classes. Predicting car is just as wrong as animal, iff the image shows a person.
Metrics do not impact your learning at all. It is just something that is computed additionally to the loss to show the performance of the model. Here the accuracy makes sense, because this is mostly the metric that we care about. Mean squared error does not tell you how well your model performs. If you get something like 0.0015 mean squared error it sounds good, but it is hard to visualize just how well this performs. In contrast using accuracy and achieving 95% accuracy for example is meaningful.
One last thing you should use softmax instead of sigmoid as your final output to get a probability distribution in your final layer. Softmax will output percentages for every class that sum up to 1. Then crossentropy calculates the difference of the probability distribution of your network output and the ground truth.