I am a newbie to LSTMs and RNNs as a whole, and I've been racking my brain to understand what exactly a timestep is. I would really appreciate an intuitive explanation of this.
Let's start with a great image from Chris Olah's blog (a highly recommended read btw):
In a recurrent neural network you have multiple repetitions of the same cell. Inference goes like this: you take some input (x0) and pass it through the cell to get some output1 (depicted by the black arrow pointing right in the picture). Then you pass output1 as input (possibly adding further input components - x1 in the image) to the same cell, producing a new output, output2. You pass that again as input to the same cell (again possibly with an additional input component, x2), producing output3, and so on.
A time step is a single occurrence of the cell - e.g. on the first time step you produce output1 (h0), on the second time step you produce output2, and so on.
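To make the unrolling concrete, here is a minimal sketch in plain NumPy (the weight shapes and the number of steps are made up just for illustration): each loop iteration is one time step, and the same cell weights are reused at every step.

import numpy as np

# One shared RNN cell: the same weights are used at every time step.
W_xh = np.random.randn(4, 3) * 0.1   # input -> hidden
W_hh = np.random.randn(4, 4) * 0.1   # hidden -> hidden
b_h = np.zeros(4)

def cell(x_t, h_prev):
    # A single time step: combine the current input with the previous state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

xs = [np.random.randn(3) for _ in range(5)]   # x0, x1, ..., x4: one input per time step
h = np.zeros(4)                               # initial state, before anything has been seen

for t, x_t in enumerate(xs):                  # each iteration = one time step
    h = cell(x_t, h)                          # same cell, only the state changes
    print("time step", t, "->", h)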
I want to learn how to backpropagate a neural net offline, so I came up with the following example:
and 1 stands for the bias. The activation function in the last layer is linear.
where
My intuitive problem
I want to calculate backpropagation offline, but I'm not sure how it should be done. I understand the intuition behind online backprop, where we calculate the gradient observation by observation. But I have no idea how it should work offline, i.e. how to calculate it with all observations at once. Could you please give me a hint on which direction I should follow?
So you say that you know how to do backpropagation on a single sample, but not on many samples at the same time, right? Then let's just assume that you have some loss function, e.g. the mean squared error. For a single sample (x, label) your loss would be (y - label)^2. Since you said you know how to do backpropagation on a single sample, you know that we now take the gradient of our loss with respect to our last node, y. This gradient is 2 * (y - label). From there on, we propagate the gradient through the whole network.
If we now have a batch of samples instead, we just use the loss function for the whole batch. Let's say our batch has two samples. Then our mean squared error is just the sum of the individual losses divided by the number of samples. Thus, the loss would be 1/2 * ((y_1 - label_1)^2 + (y_2 - label_2)^2). And now we can do the same as before: we take the gradient of our loss function with respect to our last node, y. It is important to realize that both y_1 and y_2 in our loss are actually the output of our final node y, just for different samples. This means that the gradient is 1/2 * (2*(y_1 - label_1) + 2*(y_2 - label_2)). From here you have a single gradient value for the whole batch (if you plug in y_1, y_2, label_1, and label_2) and can propagate the gradient through the network as before.
As a summary: Instead of calculating the loss for a single sample, we now use a loss function that includes our whole batch (e.g. by just summing over all samples). This produces a single gradient, and we can proceed as before.
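If it helps, here is a tiny NumPy sketch of exactly that (the one-weight "network" is invented just to keep the numbers visible): the batch loss is the averaged per-sample loss, and its gradient is accumulated over the whole batch before a single update.

import numpy as np

# Toy "network": y = w * x, a single weight, so every number stays visible.
w = 0.5
x = np.array([1.0, 2.0])                 # a batch of two samples
label = np.array([2.0, 3.0])

y = w * x                                # forward pass for the whole batch at once
loss = np.mean((y - label) ** 2)         # 1/2 * ((y_1-label_1)^2 + (y_2-label_2)^2)

dloss_dy = 2 * (y - label) / len(x)      # gradient of the batch loss w.r.t. each output
dloss_dw = np.sum(dloss_dy * x)          # propagated back to the weight, summed over the batch

w -= 0.1 * dloss_dw                      # one offline (full-batch) update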
I am currently designing an artificial neural network for a problem with a decay curve.
For example, building a model for predicting the durability of some material. The inputs may include environmental conditions like temperature and humidity.
However, these alone are not adequate to predict the durability of the material. For such a problem, I think it is better to use the output durability of the previous time slot as one of the current inputs to predict the durability of the next time slot.
Moreover, I do not know how to train a model that feeds the output back to the input, since that input column only has an initial value before training.
For this case, I have tried two methods:
Method 1 (fail)
I have tried to fill the input durability of the next row with the predicted output durability of the current row. Nevertheless, doing so prevents loss.backward() from working, so we cannot compute and update the gradients. The gradient function was "CopySlices" instead of "MSELoss" when I copied the predicted output into the next row of the input data.
[Images: feeding the output to the input; the gradient function showing the copy operation]
Method 2 "fill the input column with expected output"
In this method, I fill the blank input column with the expected output of the previous row before training the model. This is only done for training; for real prediction, I will feed the predicted output back to the input. In this case, I managed to train an overfitting model with MSELoss.
However, I do not believe this is the right method, as it uses the expected output as input no matter how badly the model predicts.
Therefore, I want to ask whether it is possible to feed the output back to the input in a regression problem using an artificial neural network.
I apologize for not uploading any code here; it is not convenient for me to share the full code, as it may be confidential.
It looks like you need an RNN (recurrent neural network). This tutorial is pretty helpful for understanding an RNN: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
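If it helps, here is a minimal PyTorch sketch of that idea (all names, sizes and the number of epochs are placeholders, since I don't know your data): the recurrent hidden state carries the information of the previous time slots, so you don't have to copy predicted durabilities into the input table yourself, and loss.backward() works as usual.

import torch
import torch.nn as nn

# Hypothetical setup: 2 input features per time step (e.g. temperature, humidity),
# target is the durability at each step. Shapes: (batch, time_steps, features).
x = torch.randn(8, 20, 2)          # 8 sequences, 20 time slots, 2 conditions
y = torch.randn(8, 20, 1)          # durability to predict at each slot

class DurabilityRNN(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=2, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.rnn(x)          # the hidden state carries the "previous durability" information
        return self.out(h)          # one prediction per time step

model = DurabilityRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                 # gradients flow back through time, no manual copying needed
    optimizer.step()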
I want to predict the trajectory of a ball falling. That trajectory is parabolic. I know that LSTM may be too much for this (i.e. a simpler method could suffice).
I thought that we can do this with 2 LSTM layers and a Dense layer at the end.
The end result that I want is to give the model 3 heights h0,h1,h2 and let it predict h3. Then, I want to give it h1, h2, and the h3 it outputted previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer?
Would it be input_shape = (3,1)?
Secondly, would the LSTM be able to predict a parabolic path?
I am getting almost a flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape input.
Thank you
The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences with different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use stateful=True layers below, we can use batch_input_shape=(1, None, 1), choosing whatever number of "samples" you want for the first dimension.
Your model can indeed predict the trajectory, but maybe it will need more than one layer. (The exact answer about how many layers and cells are needed depends on knowing how the math inside the LSTM works.)
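For instance, a minimal Keras sketch under those assumptions could look like this (the layer sizes here are arbitrary, not a recommendation):

from keras.models import Sequential
from keras.layers import LSTM, Dense

# batch_input_shape = (samples, timeSteps, features) = (1, None, 1):
# one sequence per batch, variable length, a single "height" feature.
model = Sequential()
model.add(LSTM(32, return_sequences=True, stateful=True,
               batch_input_shape=(1, None, 1)))
model.add(LSTM(32, return_sequences=True, stateful=True))
model.add(Dense(1))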
Training:
Now, first you need to train your network (only then will it be able to start predicting good things).
For training, suppose you have a sequence [h1,h2,h3,h4,h5,h6...] of true values in the correct order. (I suggest you actually have many sequences (samples), so your model learns better.)
For this sequence, you want an output predicting the next step, so your target would be [h2,h3,h4,h5,h6,h7...].
So, supposing you have a data array with shape (manySequences, steps, 1), you make:
x_train = data[:,:-1,:]
y_train = data[:,1:,:]
Now, your layers should be using return_sequences=True. (Every input step produces an output step). And you train the model with this data.
At this point, whether you're using stateful=True or stateful=False is not very relevant. (But if True, you always need model.reset_states() before every single epoch and sequence.)
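As a sketch (assuming a model like the one above and the data array with shape (manySequences, steps, 1); the optimizer, loss and epoch count are just placeholders), the stateful training loop could look like this:

model.compile(optimizer='adam', loss='mse')

for epoch in range(200):
    for seq in range(data.shape[0]):
        model.reset_states()                 # start of a new sequence: "speed = 0"
        x_seq = data[seq:seq+1, :-1, :]      # keep the batch dimension of size 1
        y_seq = data[seq:seq+1, 1:, :]
        model.train_on_batch(x_seq, y_seq)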
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2. And when you input h2 it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase this is not important, because you're inputting the entire sequences at once, so the speed will be understood between steps of the long sequences.)
You can see the method reset_states() as set_current_speed_to(0). You will use it whenever the step you're going to input is the first step of a sequence.
Then you can do loops like this:
model.reset_states()                     # make speed = 0
nextH = someValueWithShape((1,1,1))      # the first known height, shaped (1, 1, 1)
predictions = [nextH]
for i in range(steps):
    nextH = model.predict(nextH)         # the state keeps the "current speed" between calls
    predictions.append(nextH)
There is an example here, but using two features. The difference is that I use two models, one for training and one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to call reset_states() at the beginning of every epoch during training).
Hello Guys,
I am currently working on an autoencoder reducing some simple 2D data to 1D. The architecture is 2 - 10 - 1 - 10 - 2 neurons per layer. As activation function I use the sigmoid in every layer except the output layer, where I use the identity.
I am using the Accord.NET Framework to build that.
I am pre-training the autoencoder with RBMs and the CD algorithm, where I can change the initial weights, the learning rate, the momentum and the weight decay.
The fine-tuning is accomplished by backpropagation, where I can configure the learning rate and the momentum.
The data is some artificially created shape and is marked green in the picture:
data + reconstruction
The reconstruction of the autoencoder is the yellow line, which leads to my problem: somehow the autoencoder is not able to produce a non-linear shape as output.
Although I tested around a lot and changed values a dozen times, I am not getting better results. Maybe someone here has an idea how I could find the problem.
Thanks!
Look, in general any neural network is based on a linear representation of your features against the output, so what the net is actually doing (considering two features) is [w1*x1 + w2*x2 = output].
What you need to do to achieve a non-linear representation is to use extra feature(s) that are a non-linear transformation of the old feature(s). For example, use x1^2 as an extra feature, or x2^2, or both of them. Hence, the net will fit the global equation [w1*x1 + w2*x2 + w3*x1^2 = output], which is non-linear in nature, and then you can have a non-linear representation.
The extra feature equation depends mainly on your data. I have used a quadratic term in my example, but it is not always the correct thing to do. Looking at your data, I think you need to use a cos(x) or sin(x) representation.
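A small Python sketch of that idea (the data and the chosen extra columns are made up; the point is only that the model stays linear in the weights while the extra features carry the non-linearity):

import numpy as np

# Hypothetical 1-D data with a non-linear shape.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + 0.05 * np.random.randn(100)

# With the original feature alone, the best fit w1*x + b is a straight line.
# Add non-linear transforms of x as extra feature columns instead.
X = np.column_stack([x, x ** 2, np.sin(x), np.ones_like(x)])

# Least-squares fit: still linear in the weights, non-linear in the input.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w                              # this curve can follow the non-linear shape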
These days I have been studying RNNs and teacher forcing, but there is one point that I can't figure out. What is the principle behind readout and teacher forcing? How do we feed the output (or the ground truth) of the RNN from the previous time step back into the current time step - by using that output as a feature together with this step's input, or by using it as this step's cell state? I have read some papers but I'm still confused o(╯□╰)o. Hoping someone can answer this for me.
Teacher forcing is the act of using the ground truth as the input for each time step, rather than the output of the network. The following is some pseudocode to describe the situation.
x   = inputs           --> [0:n]
y   = expected_outputs --> [1:n+1]
out = network_outputs  --> [1:n+1]

teacher_forcing():
    for step in sequence:
        out[step + 1] = cell(x[step])
As you can see, rather than feeding the output of the network at the previous time step back in as input, it instead provides the ground truth.
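To make the contrast explicit, here is a slightly fuller sketch (cell, x and h are stand-ins like in the pseudocode above; the only difference between the two functions is what gets fed in at each step):

def run_free(cell, x0, h, n_steps):
    # Normal inference: each step consumes the model's own previous output.
    outs, inp = [], x0
    for _ in range(n_steps):
        out, h = cell(inp, h)
        outs.append(out)
        inp = out                 # next input = previous prediction
    return outs

def run_teacher_forced(cell, x, h):
    # Training with teacher forcing: each step consumes the ground-truth input.
    outs = []
    for inp in x:                 # next input = ground truth, regardless of the prediction
        out, h = cell(inp, h)
        outs.append(out)
    return outs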
It was originally motivated as a way to avoid BPTT (backpropagation through time) for models that don't contain hidden-to-hidden connections, e.g. GRUs (gated recurrent units).
It can also be used as a training regime: the idea is that from the beginning of training to the end you slowly decrease the amount of teacher forcing. This has been shown to have a regularizing effect on the network. The paper linked here has further reading, or the Deep Learning Book is good too.