Keras LSTM input - Predicting a parabolic trajectory - deep-learning

I want to predict the trajectory of a ball falling. That trajectory is parabolic. I know that LSTM may be too much for this (i.e. a simpler method could suffice).
I thought that we can do this with 2 LSTM layers and a Dense layer at the end.
The end result that I want is to give the model 3 heights h0,h1,h2 and let it predict h3. Then, I want to give it h1, h2, and the h3 it outputted previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer ?
Would it be input_shape = (3,1) ?
Secondly, would the LSTM be able to predict a parabolic path ?
I am getting almost a flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape input.
Thank you

The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences with different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use a stateful=True layer below, we can use batch_input_shape=(1,None,1). Choose the amount of "samples" you want.
Your model can predict the trajectory indeed, but maybe it will need more than one layer. (The exact answer about how many layers and cells depend on knowing how the match inside LSTM works).
Training:
Now, first you need to train your network (only then it will be able to start predicting good things).
For training, suppose you have a sequence of [h1,h2,h3,h4,h5,h6...], true values in the correct sequence. (I suggest you have actually many sequences (samples), so your model learns better).
For this sequence, you want an output predicting the next step, then your target would be [h2,h3,h4,h5,h6,h7...]
So, suppose you have a data array with shape (manySequences, steps, 1), you make:
x_train = data[:,:-1,:]
y_train = data[:,1:,:]
Now, your layers should be using return_sequences=True. (Every input step produces an output step). And you train the model with this data.
A this point, whether you're using stateful=True or stateful=False is not very relevant. (But if true, you always need model.reset_state() before every single epoch and sequence)
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2. And when you input h2 it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase, it's not important to have this, because you're inputting the entire sequences at once. So the speed will be understood between steps of the long sequences).
You can se the method reset_states() as set_current_speed_to(0). You will use it whenever the step you're going to input is the first step in a sequence.
Then you can do loops like this:
model.reset_states() #make speed = 0
nextH = someValueWithShape((1,1,1))
predictions = [nextH]
for i in range(steps):
nextH = model.predict(nextH)
predictions.append(nextH)
There is an example here, but using two features. There is a difference that I use two models, one for training, one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to reset_states() at the beginning of every epoch in training).

Related

Question about dimensions when processing lists with a multi layer perceptron

I'm quite new to PyTorch and I'm trying to build a net that is composed only of linear layers that will get a list of objects as input and output some score (which is a scalar) for each object. I'm wondering if my input tensor's dimensions should be (batch_size, list_size, object_size) or should I flatten each list and get (batch_size, list_size*object_size)? According to my understanding, in the first option I will have an output dimension of (batch_size, list_size, 1) and in the second (batch_size, list_size), does it matter? I read the documentation but it still wasn't very clear to me.
If you want to do the classification for each object in your input, you should keep the objects separate from each other; i.e., your input should be in the shape of (batch_size, list_size, object_size). Then considering the number of classes you got (let's say m classes), the linear layer would transform the input to the shape of (batch_size, list_size, m). In this case, you will have m scores for each object which can be utilized to predict the class label.
But question arises now; why do we flatten in neural networks at all? The answer is simple: because you want to couple the whole information (in your specific case, the information pieces are the objects) within a batch to see if they somehow affect each other, and if that's the case, to examine whether your network is able to learn these features/patterns. In practice, considering the nature of your problem and the data you are working with, if different objects really relate to each other, then your network will be able to learn those.

How to use return_sequences option and TimeDistributed layer in Keras?

I have a dialog corpus like below. And I want to implement a LSTM model which predicts a system action. The system action is described as a bit vector. And a user input is calculated as a word-embedding which is also a bit vector.
t1: user: "Do you know an apple?", system: "no"(action=2)
t2: user: "xxxxxx", system: "yyyy" (action=0)
t3: user: "aaaaaa", system: "bbbb" (action=5)
So what I want to realize is "many to many (2)" model. When my model receives a user input, it must output a system action.
But I cannot understand return_sequences option and TimeDistributed layer after LSTM. To realize "many-to-many (2)", return_sequences==True and adding a TimeDistributed after LSTMs are required? I appreciate if you would give more description of them.
return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
TimeDistributed: This wrapper allows to apply a layer to every temporal slice of an input.
Updated 2017/03/13 17:40
I think I could understand the return_sequence option. But I am not still sure about TimeDistributed. If I add a TimeDistributed after LSTMs, is the model the same as "my many-to-many(2)" below? So I think Dense layers are applied for each output.
The LSTM layer and the TimeDistributed wrapper are two different ways to get the "many to many" relationship that you want.
LSTM will eat the words of your sentence one by one, you can chose via "return_sequence" to outuput something (the state) at each step (after each word processed) or only output something after the last word has been eaten. So with return_sequence=TRUE, the output will be a sequence of the same length, with return_sequence=FALSE, the output will be just one vector.
TimeDistributed. This wrapper allows you to apply one layer (say Dense for example) to every element of your sequence independently. That layer will have exactly the same weights for every element, it's the same that will be applied to each words and it will, of course, return the sequence of words processed independently.
As you can see, the difference between the two is that the LSTM "propagates the information through the sequence, it will eat one word, update its state and return it or not. Then it will go on with the next word while still carrying information from the previous ones.... as in the TimeDistributed, the words will be processed in the same way on their own, as if they were in silos and the same layer applies to every one of them.
So you dont have to use LSTM and TimeDistributed in a row, you can do whatever you want, just keep in mind what each of them do.
I hope it's clearer?
EDIT:
The time distributed, in your case, applies a dense layer to every element that was output by the LSTM.
Let's take an example:
You have a sequence of n_words words that are embedded in emb_size dimensions. So your input is a 2D tensor of shape (n_words, emb_size)
First you apply an LSTM with output dimension = lstm_output and return_sequence = True. The output will still be a squence so it will be a 2D tensor of shape (n_words, lstm_output).
So you have n_words vectors of length lstm_output.
Now you apply a TimeDistributed dense layer with say 3 dimensions output as parameter of the Dense. So TimeDistributed(Dense(3)).
This will apply Dense(3) n_words times, to every vectors of size lstm_output in your sequence independently... they will all become vectors of length 3. Your output will still be a sequence so a 2D tensor, of shape now (n_words, 3).
Is it clearer? :-)
return_sequences=True parameter:
If We want to have a sequence for the output, not just a single vector as we did with normal Neural Networks, so it’s necessary that we set the return_sequences to True. Concretely, let’s say we have an input with shape (num_seq, seq_len, num_feature). If we don’t set return_sequences=True, our output will have the shape (num_seq, num_feature), but if we do, we will obtain the output with shape (num_seq, seq_len, num_feature).
TimeDistributed wrapper layer:
Since we set return_sequences=True in the LSTM layers, the output is now a three-dimension vector. If we input that into the Dense layer, it will raise an error because the Dense layer only accepts two-dimension input. In order to input a three-dimension vector, we need to use a wrapper layer called TimeDistributed. This layer will help us maintain output’s shape, so that we can achieve a sequence as output in the end.

Can I use autoencoder for clustering?

In the below code, they use autoencoder as supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But, can I use autoencoder to cluster data if I did not have its labels.?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
2:564, training_frame = tfidf,
hidden = c(2), auto-encoder = T, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden node weights. f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some aspects encoding data and clustering data share some overlapping theory. As a result, you can use Autoencoders to cluster(encode) data.
A simple example to visualize is if you have a set of training data that you suspect has two primary classes. Such as voter history data for republicans and democrats. If you take an Autoencoder and encode it to two dimensions then plot it on a scatter plot, this clustering becomes more clear. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method does not require only two binary classes, you could also train on as many different classes as you wish. Two polarized classes is just easier to visualize.
This method is not limited to two output dimensions, that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain, large dimension spaces to such a small space.
In cases where the encoded (clustered) layer is larger in dimension it is not as clear to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded(clustered) features to your training labels.
A couple ways to determine what class features belong to is to pump the data into knn-clustering algorithm. Or, what I prefer to do is to take the encoded vectors and pass them to a standard back-error propagation neural network. Note that depending on your data you may find that just pumping the data straight into your back-propagation neural network is sufficient.

How to massage inputs into Keras framework?

I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token-vector and label-vector (of same length), using a token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different length you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most used.
If you want to use future information when using RNNs you want to use a bidirectional model where you concatenate two RNN's moving in opposite directions. RNN will use a representation of all previous information when e.g. predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that. Fx 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem but it's not tested so it might not run. Instead of using integers to represent your data I suggest one-hot encoding if it is possible and then use binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, TimeDistributed
from keras.preprocessing import sequence
# Make sure all sequences are of same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN
lstm_forward = LSTM(128)(inputs)
lstm_backward = LSTM(128, go_backwards=True)(inputs)
bidirectional_lstm = merge([lstm_forward, lstm_backward], mode='concat', concat_axis=2)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)

Why Caffe's Accuracy layer's bottoms consist of InnerProduct and Label?

I'm new to caffe, in the MNIST example, I was thought that label should compared with softmax layers, but it was not the situation in lenet.prototxt.
I wonder why use InnerProduct result and label to get the accuracy, it seems unreasonable. Was it because I missed something in the layers?
The output dimension of the last inner product layer is 10, which corresponds to the number of classes (digits 0~9) of your problem.
The loss layer, takes two blobs, the first one being the prediction(ip2) and the second one being the label provided by the data layer.
loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2,:label])
It does not produce any outputs - all it does is to compute the loss function value, report it when backpropagation starts, and initiates the gradient with respect to ip2. This is where all magic starts.
After training ( in TEST phase), In the last layer desire results come from multiply weights and ip1( that are computed in last layer ); And each of class( one of 10 neurons) has max value is choosen.