How are matrices multiplied in Hierarchical Softmax model? - deep-learning

As I understood, the simple word2vec approach uses two matrices like the following:
Assuming that the corpus consists of N words.
An input weight matrix (WI) with dimensions NxF (F is the number of features).
An output weight matrix (WO) with dimensions FxN.
We multiply a one-hot vector (1xN) with WI and get a hidden vector (1xF).
Then we multiply the hidden vector with WO and get an output vector (1xN).
We apply softmax function and choose the highest entry (probability) in the vector.
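For concreteness, a minimal NumPy sketch of that forward pass (the vocabulary size, feature dimension, and word index below are made-up example values):

```python
import numpy as np

N, F = 10000, 300                  # vocabulary size and number of features (example values)
WI = np.random.randn(N, F) * 0.01  # input weight matrix, N x F
WO = np.random.randn(F, N) * 0.01  # output weight matrix, F x N

x = np.zeros(N)                    # one-hot vector for the input word
x[42] = 1.0                        # hypothetical word index

h = x @ WI                         # hidden vector, 1 x F (just row 42 of WI)
scores = h @ WO                    # output scores, 1 x N
probs = np.exp(scores - scores.max())
probs /= probs.sum()               # softmax over the whole vocabulary

predicted = probs.argmax()         # highest-probability word
```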
Question: how is this illustrated when using the Hierarchical Softmax model?
What will be multiplied with which matrix to get the 2-dimensional vector that leads to branching left or right?
P.S. I do understand the idea of the Hierarchical Softmax model using a binary tree and so on, but I don't know how the multiplications are done mathematically.
Thanks

To make things easy, assume that N is a power of 2. The binary tree then has N-1 inner nodes, and each inner node gets its own column of WO, which now has dimensions Fx(N-1).
The value for an inner node is the dot product of the hidden vector (1xF) with that node's column of WO. Pass it through something like a sigmoid to get the probability assigned to (say) the left branch; the right branch is just 1 minus the left.
To predict, find the maximum probability path starting from the root to a leaf.
To train, identify the correct leaf and the path of inner nodes from it back to the root. Backpropagate starting with those log(N) nodes.
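As a rough illustration, here is a toy NumPy sketch of computing one leaf's probability. The tree layout and the path for the target word are made-up assumptions; only the shapes and the sigmoid-per-inner-node idea follow the answer above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, F = 8, 4                          # toy sizes; N a power of 2 as in the answer
WI = np.random.randn(N, F) * 0.1     # input embeddings, N x F
WO = np.random.randn(F, N - 1) * 0.1 # one column per inner node, F x (N-1)

h = WI[3]                            # hidden vector for input word 3 (1 x F)

# Hypothetical path for some leaf: (inner_node_index, go_left) pairs.
# In a complete binary tree over N leaves there are log2(N) such decisions.
path = [(0, True), (1, False), (4, True)]

prob = 1.0
for node, go_left in path:
    p_left = sigmoid(h @ WO[:, node])         # left-branch probability at this node
    prob *= p_left if go_left else (1.0 - p_left)

print(prob)                          # probability of reaching that leaf (output word)
```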

Related

Flatten data matrix for better deep learning?

I am implementing Algorithm 1 from this paper. The core of the algorithm is a deep learning optimization procedure (the loss is not important). Each of my data points is a matrix of dimension 2 by N, with n such data points. I have two options for the architecture of the Neural Network:
Make the NN a function from R^2, i.e. 2 input neurons. When I apply it to a 2 by N matrix, I simply apply it to each individual column of the matrix. This makes sense from the physical point of view underlying the problem (each data point is a collection of N interacting particles in 2-dimensional space).
Make the NN a function from R^{2 x N}, i.e. 2 x N input neurons. When I apply it to a 2 by N matrix, I flatten the matrix, and apply the NN to the resulting vector. This makes sense from the deep learning point of view, since the data is "really" 2xN-dimensional.
I have implemented the former architecture. But is the second architecture more "correct"? Can I expect better accuracy from the second approach? Mathematically speaking, it can approximate a strictly larger family of functions (functions with interdependence between the columns of the 2xN input matrix).
I know that in image classification an image matrix gets flattened and fed into the NN as a vector. This suggests using the latter approach over the former. However, images lack the physical structure that my problem possesses: the columns of each 2xN matrix are the positions of N physical particles in 2-dimensional space.
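A minimal PyTorch sketch of the two options described above; N, the hidden width, and the output size are made-up assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

N = 16  # number of particles per data point (assumed value)

# Option 1: the network maps R^2 -> output and is applied column-wise.
per_particle_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def apply_columnwise(net, x):        # x has shape (batch, 2, N)
    cols = x.permute(0, 2, 1)        # (batch, N, 2): one row per particle
    return net(cols).squeeze(-1)     # (batch, N): one output per particle

# Option 2: the network maps R^{2N} -> output on the flattened matrix.
flat_net = nn.Sequential(nn.Linear(2 * N, 64), nn.ReLU(), nn.Linear(64, 1))

def apply_flattened(net, x):                      # x has shape (batch, 2, N)
    return net(x.reshape(x.shape[0], -1))         # (batch, 1)

x = torch.randn(5, 2, N)                          # a batch of 5 data points
print(apply_columnwise(per_particle_net, x).shape)  # torch.Size([5, 16])
print(apply_flattened(flat_net, x).shape)           # torch.Size([5, 1])
```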

Why is the W_q matrix in torch.nn.MultiheadAttention square (quadratic)?

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of the features of each query, since the matrix is square (embed_dim × embed_dim). Is this an implementation bug, or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in the queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix, and the projected queries are passed to the F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
The attention head sizes are a design decision of PyTorch. In theory, you could have a different head size, so the projection matrix would have a shape of embedding_dim × (num_heads * head_dim). Some implementations of transformers (such as the C++-based Marian for machine translation, or Huggingface's Transformers) allow that.
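A small standalone sketch of how one square projection can feed several heads via the reshape/transpose trick; embed_dim, num_heads, and the tensor sizes below are made-up examples, not the library's actual code:

```python
import torch

embed_dim, num_heads = 8, 2
head_dim = embed_dim // num_heads           # embed_dim must be divisible by num_heads

seq_len, batch = 5, 1
query = torch.randn(seq_len, batch, embed_dim)

# One square projection for all heads, analogous to self.q_proj_weight
# with shape (embed_dim, embed_dim).
q_proj_weight = torch.randn(embed_dim, embed_dim)
q = query @ q_proj_weight.t()               # (seq_len, batch, embed_dim)

# Split the projected vector into num_heads chunks of head_dim features each,
# so each head sees its own slice of the projected features.
q_heads = q.reshape(seq_len, batch * num_heads, head_dim).transpose(0, 1)
print(q_heads.shape)                        # (batch * num_heads, seq_len, head_dim)
```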

How is dividing into minibatches implemented in batch normalization for deeper layers?

Suppose, we have dataset X (2D array), and we divide it into batches X_1, ..., X_k.
Then for each batch we do normalization; after that, we multiply the i-th component of each batch element by a parameter gamma_i and add beta_i to it.
A batch normalization layer can be repeated several times in a network, and I haven't found anything about how it is implemented deeper in the network.
In the next BN layers, do we use the same division into batches as at the beginning (the same rows of X as in the first BN layer), just adding new gamma and beta parameters, or do we do it from scratch for every layer's input?
I hope my question is clear.
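As a hedged illustration of the setup being asked about, here is a toy PyTorch network with two BatchNorm1d layers (sizes and batch size are made up). In frameworks like PyTorch the batches are formed once from the rows of X, and each BN layer then normalizes whatever activations reach it for that same batch, with its own gamma and beta:

```python
import torch
import torch.nn as nn

# A small network with two BatchNorm1d layers; sizes are made-up examples.
net = nn.Sequential(
    nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Linear(16, 1),
)

X = torch.randn(100, 10)             # toy dataset
batch_size = 20

for i in range(0, len(X), batch_size):
    X_i = X[i:i + batch_size]        # batches are formed once, from the input rows
    out = net(X_i)                   # every BN layer sees this same batch of 20 rows:
                                     # each normalizes its own incoming activations
                                     # over the batch dimension, with its own gamma/beta
```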

Backpropagation on Two Layered Networks

I have been following the cs231n lectures from Stanford and trying to complete the assignments on my own, sharing the solutions both on GitHub and on my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is the model below: Two Layered Neural Network
Let's assume that our loss function here is a softmax loss function. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I'm going backwards, let's say for b2, db2 would be equal to dSoft*1, or dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
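A small NumPy sketch of those last-layer gradients; the batch size, layer widths, and labels are made-up, and dscores plays the role of dSoft:

```python
import numpy as np

# Toy forward pass for the last layer: scores = x_prev @ W2 + b2, softmax loss.
rng = np.random.default_rng(0)
B, H, C = 4, 5, 3                    # batch size, hidden units, classes (made-up sizes)
x_prev = rng.standard_normal((B, H)) # ReLU outputs feeding the last layer
W2 = rng.standard_normal((H, C))
b2 = np.zeros(C)
y = rng.integers(0, C, size=B)       # correct class labels

scores = x_prev @ W2 + b2
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(B), y]).mean()

# dSoft = dL/dscores for the softmax loss
dscores = probs.copy()
dscores[np.arange(B), y] -= 1
dscores /= B

# Chain rule: scores = x_prev @ W2 + b2, so
dW2 = x_prev.T @ dscores             # dL/dW2 = x_prev^T * dL/dscores
db2 = dscores.sum(axis=0)            # dL/db2 = 1 * dL/dscores, summed over the batch
dx_prev = dscores @ W2.T             # gradient passed back to the ReLU gate
```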

Multidimensional interpolation

Given a dataset of samples in a multi-dimensional space (in my case a 4D space), where samples are present at all the corners of the 4D cube and a substantial number of samples lie within the cube, but not on a neat grid. Each sample has an output value next to its 4D coordinate. The cube has coordinates [0,0,0,0]..[1,1,1,1].
Given a new 4D coordinate, how can I come up with the best interpolated value from these samples? E.g., how do I choose the samples to start with, and how do I interpolate?
As a first guess I would guess that this can be done with a two step process:
find the smallest convex pentachoron (4D equivalent of the 3D tetrahedron / the 2D triangle) around the coordinate we need to interpolate.
interpolate within this pentachoron.
Especially step 1 seems quite complex and slow.
Here's the first approach I'd try.
Step 1
Find the point's 5 nearest neighbors by Euclidean distance. It's important that these 5 points are affinely independent, because next they're used to create a Barycentric coordinate system. Those 5 points become the vertices of your pentachoron (aka 4-simplex).
If nearest-neighbor checks are too slow, try structuring your data into a spatial lookup tree that works in 4D.
Step 2
Now we need to associate a value with the interpolation point X. Start by deriving X's representation in this new Barycentric coordinate system. This Barycentric coordinate consists of 5 numbers, which collectively describe the relative position of the interpolation point with respect to each of the 4-simplex's 5 vertices.
Normalize the Barycentric coordinate so its components sum to 1.
Each of those 5 simplex vertices is a data point and has an output value. Combine those 5 output values into a vector.
Finally, interpolate by calculating the dot product of the normalized coordinate with the vector of output values.
Source: this idea is really just a 4D extension of this gem in the middle of the Barycentric coordinate system page on Wikipedia.
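For a concrete (hedged) way to implement this, SciPy's LinearNDInterpolator builds a Delaunay triangulation of the samples into 4-simplices and interpolates barycentrically inside the simplex containing the query point. It picks the simplex by triangulation rather than by nearest neighbours, but the barycentric interpolation step is the same idea as above; the data below is made up:

```python
import numpy as np
from itertools import product
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)

# Toy data: the 16 corners of the unit 4D cube plus random interior samples.
corners = np.array(list(product([0.0, 1.0], repeat=4)))
interior = rng.random((200, 4))
points = np.vstack([corners, interior])
values = points.sum(axis=1)            # made-up output values

# Delaunay-triangulates the points into 4-simplices, then interpolates
# barycentrically inside the simplex containing each query point.
interp = LinearNDInterpolator(points, values)

x_new = np.array([[0.3, 0.7, 0.1, 0.5]])
print(interp(x_new))                   # returns NaN outside the convex hull
```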