How much is the dimension of some bidirectional LSTM layers? - deep-learning

I read a paper about machine translation, and it uses projection layer. Its encoder has 6 bidirectional LSTM layers. If input embedding dimension is 512, how much will be the dimension of the encoder output? 512*2**5?
The paper's link: https://www.aclweb.org/anthology/P18-1008.pdf

Not quite. Unfortunately, Figure 1 in the mentioned paper is a bit misleading. It is not that the six encoding layers are in parallel, as it might be understood from the figure, but rather that these layers are successive, meaning that the hidden state/output from the previous layer is used in the subsequent layer as an input.
This, and the fact that the input (embedding) dimension is NOT the output dimension of the LSTM layer (in fact, it is 2 * hidden_size) change your output dimension to exactly that: 2 * hidden_size, before it is put into the final projection layer, which again is changing the dimension depending on your specifications.
It is not quite clear to me what the description of add does in the layer, but if you look at a reference implementation it seems to be irrelevant to the answer. Specifically, observe how the encoding function is basically
def encode(...):
encode_inputs = self.embed(...)
for l in num_layers:
prev_input = encode_inputs
encode_inputs = self.nth_layer(...)
# ...
Obviously, there is a bit more happening here, but this illustrates the basic functional block of the network.

Related

PyTorch transformer argument "dim_feedforward"

I would like to understand what exactly is going on with this argument.
I have read that the feed forward sub-layer inside the transformer layer is a "pointwise" feed-forward layer. what does "pointwise" means in this context?
feed-forward layers takes 2 args: input features and output features.
this argument can't be the output features since no matter what value I use for it the output of the transformer layer always has the same shape. it also can't be the input features since it is determined by the self attention sublayer.
MOST IMPORTANTLY - where is the argument for the size of the tensors for the attention? the ones that translate the input into queries, keys and values?
"Position-wise", or "Point-wise", means the feed forward network (FFN) takes each position of a sequence, say, each word of a sentence, as its input. So point-wise FFN is a shared FFN that inputs each word one by one.
(and 3.) That's right. It is neither input features (determined by the self attention sublayer) nor output features (the same value as input features). It is actually the hidden features. The thing is, this particular FFN in transformer encoder has two linear layers, according to the implementation of TransformerEncoderLayer :
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
So dim_feedforward is the feature no. of hidden layer of the FFN. Usually, its value is set to be several times larger than d_model (2048 as default).

Why neural network have weight space symmetry?

"Neural nets have a weight space symmetry: we can permute all the hidden units in a given layer and obtain an equivalent solution" (From CSC321, lecture 10, Optimation)
I don't think it make sense, is there something wrong with my understanding?
For example, there is a simple DNN with 2 units in the only hidden layer. And there is one local optima and one global optima like this:
Obviously 2 symmetric points will result in different solution, they will go into different optima(the right-bottom one is the global optima).
Please tell me where it goes wrong?
I think you miss the definition of symmetry.
Geometry is the branch of mathematics studying invariants under some class of transformations. The invariants of a geometry are called the symmetry of the geometry. For instance, the symmetries of Euclidean geometry is length and angles because rotations and translations (the group of Euclidean transformations) preserve them. Simply put, in Euclidean geometry, length and angles are the symmetries of the geometry. In the same vein, the symmetry of the affine geometry is parallelism.
In the context of deep learning, weight space symmetry means that non-identifiable models are invariant to random permutations in their weight layers. This symmetry holds since in deep learning there are generally not enough training samples to rule out all parameter settings but one, there usually exist a large amount of possible weight combinations for a given dataset that yield similar model performance.
Sure, if you permute the weights of input layers randomly - you'll not come with the same result. Becase the order of input elements matter.
The permutations symetry is about permuting the neurons of hidden layers, not about permuting weights of single neuron.
For example, your hidden layer has 2 neorons with weights w11, w12, w13 and w21 w22, w23.
So the permutation principle states that you can easily permute
w11 <-> w21, w12<->w22 and w13<->w23 and the result will remain the same
the weight symmetry here means that there is an equivalent weight that maps the input to output. It doesn't mean the geometrical symmetry in coordinate space. You can have a deeper look in Bishop Ch5.1

One Hot Encoding dimension - Model Compexity

I will explain my problem:
I have around 50.000 samples, each of one described by a list of codes representing "events"
The number of unique codes are around 800.
The max number of codes that a sample could have is around 600.
I want to represent each sample using one-hot encoding. The representation should be, if we consider the operation of padding for those samples that has fewer codes, a 800x600 matrix.
Giving this new representation as input of a network, means to flatten each matrix to a vector of size 800x600 (460.000 values).
At the end the dataset should consist in 50.000 vectors of size 460.000 .
Now, I have two considerations:
How is it possible to handle a dataset of that size?(I tried data generator to obtain the representation on-the-fly but they are really slow).
Having a vector of size 460.000 as input for each sample, means that the complexity of my model( number of parameters to learn ) is extremely high ( around 15.000.000 in my case ) and, so, I need an huge dataset to train the model properly. Doesn't it?
Why do not you use the conventional model used in NLP?
These events can be translated as you say by embedding matrix.
Then you can represent the chains of events using LSTM (or GRU or RNN o Bilateral LSTM), the difference of using LSTM instead of a conventional network is that you use the same module repeated by N times.
So your input really is not 460,000, but internally an event A indirectly helps you learn about an event B. That's because the LSTM has a module that repeats itself for each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking what I would do would be the following (in Keras pseudo-code):
Detect the number of total events. I generate a unique list.
unique_events = list (set ([event_0, ..., event_n]))
You can perform the translation of a sequence with:
seq_events_idx = map (unique_events.index, seq_events)
Add the necessary pad to each sequence:
sequences_pad = pad_sequences (sequences, max_seq)
Then you can directly use an embedding to carry out the transfer of the event to an associated vector of the dimension that you consider.
input_ = Input (shape = (max_seq,), dtype = 'int32')
embedding = Embedding (len(unique_events),
                    dimensions,
                    input_length = max_seq,
                    trainable = True) (input_)
Then you define the architecture of your LSTM (For example):
lstm = LSTM (128, input_shape = (max_seq, dimensions), dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True) (embedding)
Add the dense and the result you want:
out = Dense (10, activation = 'softmax') (lstm)
I think that this type of model can help you and give better results.

How to reshape a pytorch matrix without mixing elements of items in a batch

In my Neural network model, I represent an 8 word-sentence with a 8x256 dimensional embedding matrix. I want to give it to a LSTM as a input where LSTM takes a single word embedding at a time as input and process it. According to pytorch documentation, the input should be in the shape of (seq_len, batch, input_size). What is the correct way to convert my input to desired shape ? I don't want to mixup the numbers by mistake. I am quite new in PyTorch and row-major calculations, therefore I wanted to ask it here. I do it as follows, is it correct ?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a Singleton dimension for the "batch" dimension, leaving x to be with temporal dimension 8 and input dimension 256.
Since you are new to pytorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None in the dim=1 indicates to pytorch to add a singelton dimension.
Another way is to use view:
x = x.view(8, 1, 256)

Keras LSTM input - Predicting a parabolic trajectory

I want to predict the trajectory of a ball falling. That trajectory is parabolic. I know that LSTM may be too much for this (i.e. a simpler method could suffice).
I thought that we can do this with 2 LSTM layers and a Dense layer at the end.
The end result that I want is to give the model 3 heights h0,h1,h2 and let it predict h3. Then, I want to give it h1, h2, and the h3 it outputted previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer ?
Would it be input_shape = (3,1) ?
Secondly, would the LSTM be able to predict a parabolic path ?
I am getting almost a flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape input.
Thank you
The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences with different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use a stateful=True layer below, we can use batch_input_shape=(1,None,1). Choose the amount of "samples" you want.
Your model can predict the trajectory indeed, but maybe it will need more than one layer. (The exact answer about how many layers and cells depend on knowing how the match inside LSTM works).
Training:
Now, first you need to train your network (only then it will be able to start predicting good things).
For training, suppose you have a sequence of [h1,h2,h3,h4,h5,h6...], true values in the correct sequence. (I suggest you have actually many sequences (samples), so your model learns better).
For this sequence, you want an output predicting the next step, then your target would be [h2,h3,h4,h5,h6,h7...]
So, suppose you have a data array with shape (manySequences, steps, 1), you make:
x_train = data[:,:-1,:]
y_train = data[:,1:,:]
Now, your layers should be using return_sequences=True. (Every input step produces an output step). And you train the model with this data.
A this point, whether you're using stateful=True or stateful=False is not very relevant. (But if true, you always need model.reset_state() before every single epoch and sequence)
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2. And when you input h2 it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase, it's not important to have this, because you're inputting the entire sequences at once. So the speed will be understood between steps of the long sequences).
You can se the method reset_states() as set_current_speed_to(0). You will use it whenever the step you're going to input is the first step in a sequence.
Then you can do loops like this:
model.reset_states() #make speed = 0
nextH = someValueWithShape((1,1,1))
predictions = [nextH]
for i in range(steps):
nextH = model.predict(nextH)
predictions.append(nextH)
There is an example here, but using two features. There is a difference that I use two models, one for training, one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to reset_states() at the beginning of every epoch in training).