How does the BERT loss function work? - deep-learning

I'm confused about how cross-entropy works in the BERT masked language model. To calculate the loss we need the ground-truth labels of the masked tokens, but we don't have vector representations of those truth labels, while the predictions are vector representations. So how is the loss calculated?

We already know which words we mask before passing the sequence to BERT, so the one-hot encoding of the actual word is the ground-truth label. The model's output vector at each masked position is projected to a vector of vocabulary size, and a softmax layer turns it into a probability distribution over the vocabulary. We can then calculate the cross-entropy loss between that distribution and the one-hot encoding of the masked word.
Hope this clarifies. For a better explanation, watch this:
https://www.youtube.com/watch?v=xI0HHN5XKDo
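For concreteness, here is a minimal NumPy sketch of that computation for a single masked position. The shapes, weights and token id are placeholders for illustration, not BERT's actual code:

    import numpy as np

    # Toy shapes and token id chosen for illustration; not BERT's actual code.
    vocab_size, hidden_size = 30522, 768
    rng = np.random.default_rng(0)

    hidden = rng.standard_normal(hidden_size)                    # output vector at a [MASK] position
    W = rng.standard_normal((vocab_size, hidden_size)) * 0.01    # projection to vocabulary-size logits

    logits = W @ hidden
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                         # softmax over the vocabulary

    true_token_id = 2003                                         # id of the word that was masked (assumed)
    loss = -np.log(probs[true_token_id])                         # cross-entropy with the one-hot target

Because the target is one-hot, the cross-entropy reduces to the negative log-probability the model assigns to the true token.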

Related

How to implement soft-argmax in caffe?

In the Caffe deep-learning framework there is an argmax layer, which is not differentiable and hence cannot be used for end-to-end training of a CNN.
Can anyone tell me how I could implement the soft version of argmax, i.e. soft-argmax?
I want to regress coordinates from a heatmap and then use those coordinates in loss calculations. I am very new to this framework, so I have no idea how to do this. Any help will be much appreciated.
I don't get exactly what you want, but there are the following options:
Use an L2 loss to train the regression task (EuclideanLoss), or SmoothL1Loss (from SSD Caffe by Wei Liu), or L1 (I don't know where you would get that one).
Use softmax with a cross-entropy loss (SoftmaxWithLoss) to train a classification task with classes corresponding to the possible values of the x or y coordinate; for example, one loss layer for x and one for y. SoftmaxWithLoss accepts the label as a numeric value and casts it to int with static_cast(). Take into account, though, that the implementation doesn't check that the casted value lies within the 0..(num_classes-1) range, so you have to be careful.
If you want something more unusual, such as soft-argmax itself, you'll have to write your own layer in C++, C++/CUDA or Python+NumPy (see the sketch below). This is very often the case unless you are already using someone else's implementation.
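As an illustration of that last option, here is a minimal NumPy sketch of a 2-D soft-argmax; the function name and the temperature parameter beta are my own choices, not part of any Caffe API:

    import numpy as np

    def soft_argmax_2d(heatmap, beta=100.0):
        # Softmax over the whole heatmap; beta sharpens the distribution
        h, w = heatmap.shape
        flat = beta * heatmap.reshape(-1)
        flat = flat - flat.max()                  # numerical stability
        probs = (np.exp(flat) / np.exp(flat).sum()).reshape(h, w)
        # Expected coordinates under that distribution: a differentiable stand-in for argmax
        rows = np.arange(h).reshape(h, 1)
        cols = np.arange(w).reshape(1, w)
        return (probs * rows).sum(), (probs * cols).sum()

The softmax turns the heatmap into a probability distribution, and the expected coordinate under it is differentiable; a larger beta makes the result closer to the hard argmax.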

Predicting continuous-valued output

I am working on predicting Semantic Textual Similarity (SemEval 2017 Task-1) between a pair of texts. The similarity score (output) is a continuous value in [0,5]. The neural network model (link below) therefore has 6 units in the final layer for prediction of values in [0,5]. The objective function used is the Pearson correlation coefficient, and a softmax activation is used. Now, in order to train the model, how can I give the target output values to the model? Since there are 6 output classes, I should probably send one-hot-encoded vectors of the output. In that case, how can we convert the output (which might be a float value such as 2.33) to a one-hot vector of length 6? Or is there any other way of specifying the target output and training the model?
Paper: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval016.pdf
If the value you're trying to predict is continuously-defined, you might be better off configuring this as a regression architecture. This will be simpler to train and interpret and will give you non-integer predictions (which you can then bucket or threshold however you please).
In order to do this, replace your softmax layer with a layer containing a single neuron with a linear activation function. Then you can simply train this network using your real-valued similarity numbers at the output. For loss function, you can use MSE / L2 unless you have a reason to do otherwise.
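A minimal Keras sketch of that change; the feature and hidden-layer sizes here are placeholders, not taken from the paper's model:

    import tensorflow as tf

    # Placeholder sizes; the real encoder/feature dimensions come from your own model.
    features = tf.keras.Input(shape=(128,))                        # sentence-pair feature vector (assumed)
    hidden = tf.keras.layers.Dense(64, activation="relu")(features)
    score = tf.keras.layers.Dense(1, activation="linear")(hidden)  # single real-valued similarity

    model = tf.keras.Model(features, score)
    model.compile(optimizer="adam", loss="mse")                    # L2 / MSE regression loss
    # model.fit(X_pairs, y_scores, ...)  # y_scores are the raw similarities in [0, 5]

The network then learns the score directly, and you can clip or bucket its predictions into [0,5] afterwards if needed.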

Does a Convolutional Layer Have an Exact Inverse

...and if so under what circumstances?
A Convolutional Layer usually yields an output of lesser size. Is it possible to reverse/invert such an operation by flipping/transposing the used kernel and providing padding or likewise?
Just looking at the convolutional layer's operation here - without pooling layers, concatenation, non-linear activation functions etc.
I'm not looking for any of the several trainable versions of reverse convolutional operations. These can be achieved, for example, by strides $\geq 1$ in the output space or intrinsic padding in the input space. Vincent Dumoulin and Francesco Visin provide very elucidating animated GIFs on their GitHub page. The deep learning community is divided over the naming of these operations: transposed convolution, fractionally strided convolution and deconvolution are all used (the latter, although widespread, is very misleading, since it is not a proper mathematical deconvolution).
I believe this is where the difference between a transposed convolution and a deconvolution is essential.
A deconvolution is the mathematical inverse of what a convolution does, whereas a transposed convolution reverses only the spatial transformation between input and output. Meaning, if you want to reverse the changes concerning the shape of the output, a transposed convolution will do the job, but it will not be the mathematical inverse concerning the values it produces. I wrote a few words about this topic in an article.
Well, the deconvolution is defined very clearly.
In convolution, you multiply the kernel with a patch of the input frame (like a vector multiplication), sum the products, and ASSIGN the value to the output.
In deconvolution, you take a single output value, multiply it with the kernel, quasi highlighting the points that influenced that output, and ADD the result to the reconstructed input layer (which of course must be filled with zeros at the start). This yields a layer of the same shape as the input in the forward pass.
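To make the distinction concrete, here is a small 1-D NumPy sketch (my own illustration, not from either answer): a "valid" convolution shrinks the input, and its transpose restores the original length by scatter-adding each output value through the kernel, but the recovered values differ from the original input, so it is not an exact inverse.

    import numpy as np

    def conv1d_valid(x, k):
        # "Valid" convolution (cross-correlation): the output is shorter than the input
        n, m = len(x), len(k)
        return np.array([np.dot(x[i:i + m], k) for i in range(n - m + 1)])

    def conv1d_transposed(y, k):
        # Transpose of the same linear map: scatter-add each output value through the kernel
        m = len(k)
        x_rec = np.zeros(len(y) + m - 1)
        for i, v in enumerate(y):
            x_rec[i:i + m] += v * k
        return x_rec

    x = np.array([1.0, 2.0, 3.0, 4.0])
    k = np.array([1.0, -1.0])
    y = conv1d_valid(x, k)            # length 3: the spatial size shrinks
    x_rec = conv1d_transposed(y, k)   # length 4: the shape is restored, but x_rec != x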

Obtaining multiple outputs in regression using deep learning

Given an RGB image of a hand and the 3-D positions of the hand's keypoints as a dataset, I want to set this up as a regression problem in deep learning. In this case the input will be the RGB image, and the output should be the estimated 3-D positions of the keypoints.
I have seen some material about regression, but most of it tries to estimate one single value. Is it possible to estimate multiple values (outputs) all at once?
For now I have referred to this code. The author is trying to estimate the age of a person in the image.
The output vector from a neural net can represent anything as long as you define the loss function well. Say you want to detect the (x, y, z) coordinates of 10 keypoints: just have a 30-element output vector, say (x1, y1, z1, x2, y2, z2, ..., x10, y10, z10), where xi, yi, zi denote the coordinates of the i-th keypoint; basically you can use any order you feel comfortable with. Just be careful with your loss function. Say you want to calculate an RMSE loss: you would have to extract the triples correctly and then calculate the RMSE loss for each keypoint, or, if you are familiar with linear algebra, just reshape the output into a 3x10 matrix, have your targets also as 3x10 matrices, and then simply use
loss = tf.sqrt(tf.reduce_mean(tf.squared_difference(Y1, Y2)))
But once you have formulated your net you will have to stick to it.
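As a rough sketch of that idea in Keras (the backbone layers and input size are placeholders; only the 30-element head and the reshaping loss matter here):

    import tensorflow as tf

    # Placeholder architecture; only the 30-element output head and the loss matter here.
    inputs = tf.keras.Input(shape=(224, 224, 3))               # assumed RGB input size
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(30)(x)                     # x1,y1,z1,...,x10,y10,z10

    def rmse_loss(y_true, y_pred):
        # Reshape the flat vectors to (batch, 10, 3) so each row is one keypoint
        y_true = tf.reshape(y_true, (-1, 10, 3))
        y_pred = tf.reshape(y_pred, (-1, 10, 3))
        return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=rmse_loss)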

How to train the RPN in Faster R-CNN?

Link to paper
I'm trying to understand the region proposal network (RPN) in Faster R-CNN. I understand what it is doing, but I still don't understand how training exactly works, especially the details.
Let's assume we're using VGG16's last layer with shape 14x14x512 (before max-pooling and with 228x228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding-box coordinates. My intermediate layer is a 512-dimensional vector.
(The image shows 256 because it is drawn for the ZF network.)
In the paper they write:
"we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that for each of the 9 (k) anchor types, the corresponding classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?
So that I basically train k different networks with shared weights in the intermediate layer? In that case each mini-batch would consist of the training data x = the 3x3x512 sliding window of the conv feature map and y = the ground truth for that specific anchor type.
And at inference time I would put them all together.
I appreciate your help.
Not exactly. From what I understand, the RPN predicts W*H*k bounding boxes per feature map, and then 256 of them are randomly sampled according to the 1:1 positive-to-negative criterion; these are used in the computation of the loss function for that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not restricted to any particular anchor type.
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
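To illustrate that sampling step, here is a hedged NumPy sketch of drawing up to 256 anchors per image with at most a 1:1 positive-to-negative ratio; the function name and label convention (1 = positive, 0 = negative, -1 = ignored) are my own, not the paper's code:

    import numpy as np

    def sample_rpn_anchors(labels, batch_size=256, pos_fraction=0.5, seed=None):
        # labels: 1 = positive anchor, 0 = negative anchor, -1 = ignored
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(labels == 1)
        neg = np.flatnonzero(labels == 0)
        n_pos = min(len(pos), int(batch_size * pos_fraction))   # at most a 1:1 ratio
        n_neg = min(len(neg), batch_size - n_pos)               # pad the rest with negatives
        pos_sample = rng.choice(pos, n_pos, replace=False)
        neg_sample = rng.choice(neg, n_neg, replace=False)
        # Only these indices contribute to the RPN loss of this mini-batch
        return np.concatenate([pos_sample, neg_sample])

Note that the sampling ignores which of the k anchor shapes an anchor belongs to, which is why the result is one shared network rather than k separate ones.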