I would like to understand what exactly is going on with this argument.
I have read that the feed-forward sub-layer inside the transformer layer is a "pointwise" feed-forward layer. What does "pointwise" mean in this context?
Feed-forward layers take two arguments: input features and output features.
This argument can't be the output features, since no matter what value I use for it, the output of the transformer layer always has the same shape. It also can't be the input features, since that is determined by the self-attention sublayer.
MOST IMPORTANTLY: where is the argument for the size of the tensors for the attention, the ones that translate the input into queries, keys and values?
"Position-wise", or "point-wise", means that the feed-forward network (FFN) takes each position of a sequence (say, each word of a sentence) as its input, independently of the other positions. So a point-wise FFN is one shared FFN that is applied to each word separately, with the same weights every time.
On your second and third points: that's right. dim_feedforward is neither the input features (determined by the self-attention sublayer) nor the output features (the same value as the input features). It is actually the hidden features. This particular FFN in the transformer encoder has two linear layers, according to the implementation of TransformerEncoderLayer:
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
So dim_feedforward is the number of features in the hidden layer of the FFN. It is usually set several times larger than d_model (the default is 2048).
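To make this concrete, here is a small sketch (the sizes 16, 4 and the tested dim_feedforward values are arbitrary assumptions). It builds a few encoder layers and shows that only the inner weight shapes depend on dim_feedforward, while the output shape stays the same. It also checks the "position-wise" property by running the two linear layers on the whole sequence and on each position separately. As a side note, the query/key/value projections are sized from d_model alone, which is visible in the shape of self_attn.in_proj_weight.

```python
import torch
import torch.nn as nn

d_model, nhead = 16, 4                      # arbitrary sizes for illustration
x = torch.randn(2, 5, d_model)              # (batch, seq_len, d_model)

results = {}
for dim_feedforward in (8, 64, 2048):
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=nhead,
        dim_feedforward=dim_feedforward, batch_first=True,
    )
    out = layer(x)
    results[dim_feedforward] = {
        "output": tuple(out.shape),                            # independent of dim_feedforward
        "linear1": tuple(layer.linear1.weight.shape),          # (dim_feedforward, d_model)
        "linear2": tuple(layer.linear2.weight.shape),          # (d_model, dim_feedforward)
        "in_proj": tuple(layer.self_attn.in_proj_weight.shape) # (3 * d_model, d_model)
    }

for dim_feedforward, shapes in results.items():
    print(dim_feedforward, shapes)

# Position-wise check: the FFN (dropout left out) gives the same result whether
# it sees the whole sequence at once or one position at a time.
def ffn(t):
    return layer.linear2(torch.relu(layer.linear1(t)))

whole = ffn(x)
per_position = torch.stack([ffn(x[:, t]) for t in range(x.shape[1])], dim=1)
same = torch.allclose(whole, per_position, atol=1e-5)
print("position-wise:", same)
```

The FFN never mixes information between positions; only the self-attention sublayer does that.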
Related
I'm quite new to PyTorch and I'm trying to build a net that is composed only of linear layers that will get a list of objects as input and output some score (which is a scalar) for each object. I'm wondering if my input tensor's dimensions should be (batch_size, list_size, object_size) or should I flatten each list and get (batch_size, list_size*object_size)? According to my understanding, in the first option I will have an output dimension of (batch_size, list_size, 1) and in the second (batch_size, list_size), does it matter? I read the documentation but it still wasn't very clear to me.
If you want to do the classification for each object in your input, you should keep the objects separate from each other; i.e., your input should be in the shape of (batch_size, list_size, object_size). Then considering the number of classes you got (let's say m classes), the linear layer would transform the input to the shape of (batch_size, list_size, m). In this case, you will have m scores for each object which can be utilized to predict the class label.
But a question arises now: why do we flatten in neural networks at all? The answer is simple: because you want to couple all the pieces of information within a sample (in your specific case, the objects in one list) to see whether they somehow affect each other, and if so, to examine whether your network is able to learn these features/patterns. In practice, considering the nature of your problem and the data you are working with, if different objects really relate to each other, then your network will be able to learn those relations.
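The two layouts from the question can be compared directly. This is a minimal sketch with made-up sizes (4, 3, 5), not a recommendation for a particular architecture:

```python
import torch
import torch.nn as nn

batch_size, list_size, object_size = 4, 3, 5   # arbitrary sizes for illustration
x = torch.randn(batch_size, list_size, object_size)

# Option 1: keep objects separate. nn.Linear acts on the last axis only,
# so every object is scored with the same weights, independently of the others.
per_object = nn.Linear(object_size, 1)
scores_separate = per_object(x)                # (batch_size, list_size, 1)

# Option 2: flatten each list. Every score can now depend on every object.
flattened = nn.Linear(list_size * object_size, list_size)
scores_flat = flattened(x.flatten(start_dim=1))  # (batch_size, list_size)

print(scores_separate.shape, scores_flat.shape)
```

With option 1 the number of objects per list can vary between batches; with option 2 the list size is baked into the weight matrix.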
I read a paper about machine translation, and it uses projection layer. Its encoder has 6 bidirectional LSTM layers. If input embedding dimension is 512, how much will be the dimension of the encoder output? 512*2**5?
The paper's link: https://www.aclweb.org/anthology/P18-1008.pdf
Not quite. Unfortunately, Figure 1 in the mentioned paper is a bit misleading. It is not that the six encoding layers are in parallel, as it might be understood from the figure, but rather that these layers are successive, meaning that the hidden state/output from the previous layer is used in the subsequent layer as an input.
This, and the fact that the input (embedding) dimension is NOT the output dimension of the LSTM layer (in fact, it is 2 * hidden_size) change your output dimension to exactly that: 2 * hidden_size, before it is put into the final projection layer, which again is changing the dimension depending on your specifications.
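A quick way to check the shapes is a stacked bidirectional LSTM in PyTorch. This is only a sketch: hidden_size = 300 and the projection size of 512 are assumed values, not the ones from the paper.

```python
import torch
import torch.nn as nn

embedding_dim = 512
hidden_size = 300        # assumed value, chosen to differ from embedding_dim
num_layers = 6

encoder = nn.LSTM(
    input_size=embedding_dim, hidden_size=hidden_size,
    num_layers=num_layers, bidirectional=True, batch_first=True,
)
embedded = torch.randn(2, 7, embedding_dim)      # (batch, seq_len, embedding_dim)
outputs, (h_n, c_n) = encoder(embedded)

# The layers are stacked, not parallel, so the output width does not grow with depth.
print(outputs.shape)     # last axis is 2 * hidden_size
print(h_n.shape)         # first axis is num_layers * 2 directions

# A final projection layer maps 2 * hidden_size to whatever size you specify.
projection = nn.Linear(2 * hidden_size, 512)     # 512 is an assumed target size
projected = projection(outputs)
print(projected.shape)
```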
It is not quite clear to me what the description of add does in the layer, but if you look at a reference implementation it seems to be irrelevant to the answer. Specifically, observe how the encoding function is basically
def encode(...):
    encode_inputs = self.embed(...)
    for l in range(num_layers):
        prev_input = encode_inputs
        encode_inputs = self.nth_layer(...)
        # ...
Obviously, there is a bit more happening here, but this illustrates the basic functional block of the network.
I want to predict the trajectory of a ball falling. That trajectory is parabolic. I know that LSTM may be too much for this (i.e. a simpler method could suffice).
I thought that we can do this with 2 LSTM layers and a Dense layer at the end.
The end result that I want is to give the model 3 heights h0,h1,h2 and let it predict h3. Then, I want to give it h1, h2, and the h3 it outputted previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer?
Would it be input_shape = (3,1)?
Secondly, would the LSTM be able to predict a parabolic path?
I am getting almost a flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape input.
Thank you
The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences with different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use a stateful=True layer below, we can use batch_input_shape=(1,None,1). Choose the amount of "samples" you want.
Your model can indeed predict the trajectory, but it may need more than one layer. (The exact answer about how many layers and cells depends on knowing how the math inside the LSTM works.)
Training:
Now, first you need to train your network (only then it will be able to start predicting good things).
For training, suppose you have a sequence of [h1,h2,h3,h4,h5,h6...], true values in the correct sequence. (I suggest you have actually many sequences (samples), so your model learns better).
For this sequence, you want an output predicting the next step, then your target would be [h2,h3,h4,h5,h6,h7...]
So, suppose you have a data array with shape (manySequences, steps, 1), you make:
x_train = data[:,:-1,:]
y_train = data[:,1:,:]
Now, your layers should be using return_sequences=True. (Every input step produces an output step). And you train the model with this data.
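The shift between inputs and targets can be checked on synthetic data. The sketch below generates a few falling-ball sequences with NumPy (starting heights and step count are made up) and applies the same slicing:

```python
import numpy as np

g = 9.81
t = np.arange(8, dtype=float)                  # 8 time steps, dt = 1 (assumed)
start_heights = np.array([500.0, 400.0, 350.0])  # made-up starting heights

# data has shape (manySequences, steps, 1): one feature (height) per step
data = (start_heights[:, None] - 0.5 * g * t[None, :] ** 2)[..., None]

x_train = data[:, :-1, :]   # every step except the last
y_train = data[:, 1:, :]    # every step except the first

print(data.shape, x_train.shape, y_train.shape)
# The target at step i is the input at step i + 1.
print(np.array_equal(x_train[:, 1:, :], y_train[:, :-1, :]))
```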
At this point, whether you're using stateful=True or stateful=False is not very relevant. (But if it is True, you always need model.reset_states() before every single epoch and sequence.)
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2. And when you input h2 it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase, it's not important to have this, because you're inputting the entire sequences at once. So the speed will be understood between steps of the long sequences).
You can see the method reset_states() as set_current_speed_to(0). You will use it whenever the step you're going to input is the first step in a sequence.
Then you can do loops like this:
model.reset_states()  # make speed = 0
nextH = someValueWithShape((1,1,1))
predictions = [nextH]
for i in range(steps):
    nextH = model.predict(nextH)
    predictions.append(nextH)
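To see the role of the state without training anything, the loop can be run against a hand-written stand-in. ToyStatefulModel below is hypothetical and is not a Keras class: it keeps a "speed" as its state and applies constant-acceleration physics, which is what a well-trained stateful model would have to approximate. The values of g, dt and the starting height are assumptions.

```python
import numpy as np

class ToyStatefulModel:
    """Hypothetical stand-in for a trained stateful model (not a Keras class)."""

    def __init__(self, g=9.81, dt=0.1):
        self.g, self.dt = g, dt
        self.speed = 0.0

    def reset_states(self):
        # Equivalent to "set_current_speed_to(0)"
        self.speed = 0.0

    def predict(self, h):
        # h has shape (1, 1, 1): one sample, one time step, one feature
        next_h = h + self.speed * self.dt - 0.5 * self.g * self.dt ** 2
        self.speed -= self.g * self.dt   # state carried over to the next call
        return next_h

model = ToyStatefulModel()
model.reset_states()
next_h = np.full((1, 1, 1), 10.0)        # assumed starting height
predictions = [next_h]
for _ in range(5):
    next_h = model.predict(next_h)
    predictions.append(next_h)

trajectory = np.concatenate(predictions, axis=1)   # (1, 6, 1)
print(trajectory[0, :, 0])
```

If reset_states() is skipped before a new sequence, the leftover speed from the previous one leaks into the first prediction.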
There is an example here, but using two features. There is a difference that I use two models, one for training, one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to reset_states() at the beginning of every epoch in training).
I have a dialog corpus like below. And I want to implement a LSTM model which predicts a system action. The system action is described as a bit vector. And a user input is calculated as a word-embedding which is also a bit vector.
t1: user: "Do you know an apple?", system: "no"(action=2)
t2: user: "xxxxxx", system: "yyyy" (action=0)
t3: user: "aaaaaa", system: "bbbb" (action=5)
So what I want to realize is "many to many (2)" model. When my model receives a user input, it must output a system action.
But I cannot understand return_sequences option and TimeDistributed layer after LSTM. To realize "many-to-many (2)", return_sequences==True and adding a TimeDistributed after LSTMs are required? I appreciate if you would give more description of them.
return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
TimeDistributed: This wrapper allows to apply a layer to every temporal slice of an input.
Updated 2017/03/13 17:40
I think I understand the return_sequences option now. But I am still not sure about TimeDistributed. If I add a TimeDistributed after LSTMs, is the model the same as "my many-to-many(2)" below? So I think Dense layers are applied for each output.
The LSTM layer and the TimeDistributed wrapper are two different ways to get the "many to many" relationship that you want.
LSTM will eat the words of your sentence one by one. You can choose via "return_sequences" to output something (the state) at each step (after each word is processed) or only output something after the last word has been eaten. So with return_sequences=True, the output will be a sequence of the same length; with return_sequences=False, the output will be just one vector.
TimeDistributed. This wrapper allows you to apply one layer (say Dense, for example) to every element of your sequence independently. That layer will have exactly the same weights for every element: the same layer is applied to each word, and it will, of course, return the sequence of words processed independently.
As you can see, the difference between the two is that the LSTM "propagates" the information through the sequence: it will eat one word, update its state and return it or not. Then it will go on with the next word while still carrying information from the previous ones. With TimeDistributed, by contrast, the words are each processed in the same way on their own, as if they were in silos, and the same layer applies to every one of them.
So you don't have to use LSTM and TimeDistributed in a row; you can do whatever you want, just keep in mind what each of them does.
I hope it's clearer?
EDIT:
The time distributed, in your case, applies a dense layer to every element that was output by the LSTM.
Let's take an example:
You have a sequence of n_words words that are embedded in emb_size dimensions. So your input is a 2D tensor of shape (n_words, emb_size)
First you apply an LSTM with output dimension = lstm_output and return_sequences = True. The output will still be a sequence, so it will be a 2D tensor of shape (n_words, lstm_output).
So you have n_words vectors of length lstm_output.
Now you apply a TimeDistributed dense layer with say 3 dimensions output as parameter of the Dense. So TimeDistributed(Dense(3)).
This will apply Dense(3) n_words times, to every vector of size lstm_output in your sequence independently; they will all become vectors of length 3. Your output will still be a sequence, so a 2D tensor, now of shape (n_words, 3).
Is it clearer? :-)
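The example above can be traced in plain NumPy. This is a rough sketch, not Keras code: a simple tanh RNN stands in for the LSTM, the weights are random, and the sizes (6, 4, 5) are assumptions. It follows the shapes through both stages and then shows the difference in behaviour: changing the first word affects later recurrent outputs, while changing one vector fed to the time-distributed dense layer affects only that position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, emb_size, lstm_output = 6, 4, 5     # assumed sizes

Wx = rng.normal(size=(emb_size, lstm_output))
Wh = rng.normal(size=(lstm_output, lstm_output)) * 0.5
Wd = rng.normal(size=(lstm_output, 3))       # weights of the stand-in for Dense(3)
bd = rng.normal(size=3)

def recurrent(x):
    """Plain tanh RNN standing in for an LSTM with return_sequences=True."""
    h = np.zeros(lstm_output)
    outputs = []
    for x_t in x:                            # one word at a time, state carried along
        h = np.tanh(x_t @ Wx + h @ Wh)
        outputs.append(h)
    return np.stack(outputs)

def time_distributed_dense(seq):
    """Same dense weights applied to every vector of the sequence independently."""
    return np.stack([v @ Wd + bd for v in seq])

x = rng.normal(size=(n_words, emb_size))     # (n_words, emb_size)
h_seq = recurrent(x)                         # (n_words, lstm_output)
y = time_distributed_dense(h_seq)            # (n_words, 3)
print(x.shape, h_seq.shape, y.shape)

# Change only the first word: the recurrent output at the next step changes too.
x_changed = x.copy()
x_changed[0] += 1.0
rnn_diff = np.abs(recurrent(x_changed) - h_seq).max(axis=1)

# Change only the first vector fed to the dense stage: only position 0 changes.
h_changed = h_seq.copy()
h_changed[0] += 1.0
dense_diff = np.abs(time_distributed_dense(h_changed) - y).max(axis=1)
print(rnn_diff, dense_diff)
```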
return_sequences=True parameter:
If we want a sequence as the output, not just a single vector as with ordinary neural networks, we need to set return_sequences to True. Concretely, say we have an input with shape (num_seq, seq_len, num_feature) and an LSTM layer whose output size is units. If we don't set return_sequences=True, the output will have the shape (num_seq, units); if we do, the output will have the shape (num_seq, seq_len, units).
TimeDistributed wrapper layer:
Since we set return_sequences=True in the LSTM layers, the output is now a three-dimensional tensor. In early Keras versions, feeding that into a Dense layer raised an error because Dense only accepted two-dimensional input, so the TimeDistributed wrapper was needed to apply the layer at every time step. Since Keras 2, Dense operates on the last axis of its input, so Dense and TimeDistributed(Dense) give the same result here. Either way, the sequence shape is maintained, so we get a sequence as output in the end.
I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot of training data like this, practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token-vector and label-vector (of same length), using a token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different length you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most used.
If you want to use future information with RNNs, you want a bidirectional model, where you concatenate two RNNs moving in opposite directions. A plain (unidirectional) RNN only uses a representation of the previous tokens when predicting.
If you have very long sentences, it might be useful to sample a random sub-sequence and train on that, e.g. 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
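For the padding option, the idea can be sketched in a few lines of NumPy. pad_to_length is a hypothetical helper written for this illustration, and it assumes id 0 is reserved for padding. Note that it pads at the end, whereas Keras' pad_sequences pads at the front by default. Tokens and labels must be padded the same way so they stay aligned.

```python
import numpy as np

def pad_to_length(seqs, maxlen, value=0):
    """Hypothetical helper: right-pad (or truncate) integer sequences to maxlen."""
    padded = np.full((len(seqs), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(seqs):
        seq = list(seq)[:maxlen]
        padded[i, :len(seq)] = seq
    return padded

# Made-up token ids and label ids; 0 is assumed to be reserved for padding.
tokens = [[5, 2, 9], [7], [3, 3, 8, 1, 4]]
labels = [[1, 2, 2], [1], [2, 2, 1, 1, 1]]
maxlen = 4

X = pad_to_length(tokens, maxlen)    # (n_sentences, maxlen)
y = pad_to_length(labels, maxlen)    # same shape, aligned with X
mask = X != 0                        # True where a real token is present

print(X)
print(y)
print(mask)
```

Each row is one sequence and each column is one time step, which matches the (samples, time steps) layout the model expects before the feature axis is added.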
I've written an example of how I understand your problem but it's not tested so it might not run. Instead of using integers to represent your data I suggest one-hot encoding if it is possible and then use binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, TimeDistributed
from keras.preprocessing import sequence
# Make sure all sequences are of same length (pad y_train the same way)
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# Add the trailing feature axis: (n_samples, maxlen) -> (n_samples, maxlen, 1)
X_train = X_train[..., None]
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN. return_sequences=True keeps one output per
# time step; Bidirectional runs a forward and a backward copy of the LSTM
# and concatenates their outputs on the last axis.
bidirectional_lstm = Bidirectional(LSTM(128, return_sequences=True), merge_mode='concat')(inputs)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)