Using output cell and hidden states of one LSTM cell as input states for another - deep-learning

Typically, when discussing stacking LSTMs (with independent weights), the cell and hidden states are unique to each individual cell and not shared between them. Each LSTM cell operates independently with its own set of states.
Is there any reason to use the output cell state and hidden state of one LSTM cell as the input cell state and hidden state of another LSTM cell? Is there any logic to this?
I had in mind a model that only receives one vector/single timestep as input (not a sequence), but I wanted to keep memory between consecutive iterations of the model (using stateful=True in tf.keras.layers.LSTM).
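For context, here is a minimal tf.keras sketch of both ideas; the unit counts (16), feature size (8), and variable names are illustrative assumptions, not from the original post:

```python
import tensorflow as tf

# (1) Wire the final (h, c) of one LSTM in as the initial state of another.
#     The receiving LSTM must have the same number of units (16 here).
x = tf.keras.Input(shape=(None, 8))                     # (timesteps, features)
seq, h, c = tf.keras.layers.LSTM(16, return_sequences=True,
                                 return_state=True)(x)
y = tf.keras.layers.LSTM(16)(seq, initial_state=[h, c])
model = tf.keras.Model(x, y)

# (2) Stateful LSTM fed a single timestep per call: (h, c) persist across
#     consecutive calls until reset_states() is invoked.
stepper = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, stateful=True, batch_input_shape=(1, 1, 8)),
])
for _ in range(3):                              # three consecutive iterations
    out = stepper(tf.random.normal((1, 1, 8)))  # memory carries over
stepper.reset_states()                          # clear the carried memory
```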

What is your goal? An LSTM carries a memory (cell) state and a hidden (output) state, regulated by gates, including a forget gate. The advantage of an LSTM over a plain recurrent network is the memory state, which allows for long-term memory. The forget gate removes noise from the network, making it more efficient by discarding non-contributing information.
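For reference, the standard LSTM update (usual notation, with $\sigma$ the logistic sigmoid and $\odot$ the elementwise product) makes these roles explicit: $c_t$ is the memory/cell state, $h_t$ the hidden/output state, gated by forget, input, and output gates:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$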

Related

How is a stacked RNN (num_layers > 1) implemented in PyTorch?

The GRU layer in PyTorch takes a parameter called num_layers, which lets you stack RNNs. However, it is unclear how exactly the subsequent RNNs use the outputs of the previous layer.
According to the documentation:
Number of recurrent layers. E.g., setting num_layers=2 would mean
stacking two GRUs together to form a stacked GRU, with the second GRU
taking in outputs of the first GRU and computing the final results.
Does this mean that the output of the final cell of the first layer of the GRU is fed as input to the next layer? Or does it mean the outputs of each cell (at each timestep) is fed as an input to the cell at the same timestep of the next layer?
The latter. Each time step's output from the first layer is used as input for the same time step of the second layer.
This figure from a Keras tutorial shows how multilayer RNNs are structured:
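To make the wiring concrete, here is a small PyTorch sketch (sizes are arbitrary) showing that a two-layer nn.GRU is structurally equivalent to two single-layer GRUs where the full per-timestep output sequence of the first is the input sequence of the second:

```python
import torch
import torch.nn as nn

x = torch.randn(5, 3, 10)                 # (seq_len, batch, input_size)

# One stacked GRU with two layers.
stacked = nn.GRU(input_size=10, hidden_size=20, num_layers=2)
out, h_n = stacked(x)
print(out.shape)   # (5, 3, 20): per-timestep outputs of the *top* layer
print(h_n.shape)   # (2, 3, 20): final hidden state of each of the 2 layers

# Equivalent wiring with two single-layer GRUs.
gru1 = nn.GRU(input_size=10, hidden_size=20)
gru2 = nn.GRU(input_size=20, hidden_size=20)
out1, _ = gru1(x)       # one output per time step of layer 1
out2, _ = gru2(out1)    # step t of layer 1 feeds step t of layer 2
```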

LSTM Evolution Forecast

I have some confusion about the way LSTM networks work when forecasting with a horizon that is not finite; rather, I am looking for a prediction at an arbitrary time in the future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as input for the network, such that I schematically have something like (let's consider for simplicity just lag 1 for the output and no lag for the external input):
$[y(t-1), u_1(t), u_2(t), \cdots, u_N(t)] \to y(t)$
Thinking of the network this way, when one wants to do a recursive forecast, one is forced to use the predicted value at the previous step as input for the next step. The error thus propagates, which makes the long-term forecast behave badly.
Now, my confusion is this: I think of an RNN as a kind of (simple) implementation of a state-space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These variables are hidden and not observed.
So here is the question: if there is this kind of variable already taking previous states of the system into account, why would I need to use the lagged output value as input to my network/model?
If I got rid of it, would my long-term forecast be better, since I would no longer expect the error of the forecasted output to propagate? (I guess there would be an error propagating in the internal state anyway.)
Thanks!
Please see DeepAR, an LSTM forecaster that predicts more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN architecture for probabilistic forecasting, incorporating a negative binomial likelihood for count data as well as special treatment for the case when the magnitudes of the time series vary widely; (2) we demonstrate empirically on several real-world data sets that this model produces accurate probabilistic forecasts across a range of input characteristics, thus showing that modern deep learning-based approaches can effectively address the probabilistic forecasting problem, which is in contrast to common belief in the field and the mixed results […]
In this paper, they forecast multiple steps into the future precisely to counter what you describe here, namely the error propagation.
Skipping several steps allows one to get more accurate predictions further into the future.
One more thing done in this paper is predicting percentiles and interpolating, rather than predicting the value directly. This adds stability and an error assessment.
Disclaimer - I read an older version of this paper.
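As a sketch of the general idea (not the DeepAR architecture itself), a direct multi-step forecaster emits the whole horizon in one forward pass, so predictions are never fed back in; the shapes and sizes below are assumptions:

```python
import tensorflow as tf

L, F, H = 24, 3, 12   # hypothetical: lookback window L, F features, horizon H

# Direct multi-step forecaster: one pass predicts all H future values of y,
# avoiding the recursive feedback that compounds forecast error.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(L, F)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(H),   # y(t+1), ..., y(t+H) in one shot
])
model.compile(optimizer="adam", loss="mse")
```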

I wonder why some Kalman filter algorithms have inputs and some don't

Whenever I study the Kalman filter, I see two kinds of Kalman filter algorithms: one has an input matrix, and the other doesn't.
I'm always confused by that.
Could you please let me know what the difference is?
The 'inputs' are often called 'controls', and that is a typical case. For example, if the system were a car, $u$ might represent the position of the steering wheel, the position of the accelerator, and so on. These do not help to determine the position, but they do help when predicting the new state from the old. In some cases the Kalman filter may be part of a larger system that actually produces these control signals.
However, one may not have access to such information. If one were estimating the position (and so on) of a car externally, for example using radar readings, the state of the car's controls might be unknown, and so could not be included.
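Concretely, in the usual notation (state transition $F$, control matrix $B$, process noise covariance $Q$), the prediction step of the two formulations differs only by the control term:

$$
\hat{x}_k = F\,\hat{x}_{k-1} + B\,u_k \;\;\text{(with inputs)}, \qquad
\hat{x}_k = F\,\hat{x}_{k-1} \;\;\text{(without)}, \qquad
P_k = F P_{k-1} F^\top + Q
$$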
Note that there is another difference between the two images: the occurrence (in the input case) of a matrix $Q$ that is added to the predicted state covariance. This is not related to the presence of the inputs, and indeed I think its omission from the no-input case is a mistake. Without such a term, the state error covariance matrix will collapse to zero over time and the filter will fail.
The second image is not a good representation of the Kalman filter. As dmuir said, the initialization process basically gives the filter information about the current states. In the linear case, initial state errors are tolerable, whereas in the nonlinear case, i.e. the Extended KF, which is the most widely used variant of the Kalman filter, initial conditions are very important. Most of the time, if you cannot initialize your filter with a close approximation of the real initial state, it will diverge.
It looks like you need to restudy the whole filter, though. I would suggest this site, which helped me greatly through my MSc: How Kalman Filters Work

Is it possible to feed the output back to the input in an artificial neural network?

I am currently designing an artificial neural network for a problem with a decay curve.
For example, consider building a model for predicting the durability of some material. The inputs may include environmental conditions like temperature and humidity.
However, these alone are not adequate to predict the durability of the material. For such a problem, I think it is better to use the output durability of previous time slots as one of the current inputs to predict the durability of the next time slot.
Moreover, I do not know how to train a model that feeds the output back to the input, since that input column has only an initial value before training.
For this case,
Method 1 (fail)
I tried to fill the input durability of the next row with the predicted output durability of the current row. Nevertheless, doing so prevents the model from running loss.backward(), so we cannot compute and update the gradients. The gradient function became CopySlices instead of MSELoss when I copied the predicted output to the next row of the input data.
Method 2 "fill the input column with expected output"
In this method, I fill the blank input column with expected output (row-1) before training the model. Filling the input column with expected output of previous row is only done for training. For real prediction, I will feed the predicted output to the input. In this case, I am successful to train a overfitting model with MSELoss.
Moreover, I do not believe it is a right method as it uses the expected output as the input no matter how bad it predict. I strongly believed that it is not a right method.
Therefore, I want to ask whether it is possible to feed output to input in linear regression problem using artificial neural network.
I apologize for uploading no code here as I am not convenient to upload the full code here. It may be confidential.
It looks like you need an RNN (recurrent neural network). This tutorial is pretty helpful for understanding an RNN: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
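For what it's worth, here is a minimal PyTorch sketch of an RNN that feeds its own prediction back as the next input; DurabilityRNN and all shapes/names are made up for illustration:

```python
import torch
import torch.nn as nn

# Predict durability d_t from features u_t (e.g. temperature, humidity)
# plus the previous prediction d_{t-1}, fed back inside the loop.
class DurabilityRNN(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.GRUCell(n_features + 1, hidden)  # +1 for fed-back output
        self.head = nn.Linear(hidden, 1)

    def forward(self, u, d0):
        # u: (batch, T, n_features), d0: (batch, 1) initial durability
        h = torch.zeros(u.size(0), self.hidden, device=u.device)
        d, preds = d0, []
        for t in range(u.size(1)):
            h = self.cell(torch.cat([u[:, t], d], dim=1), h)
            d = self.head(h)              # prediction becomes the next input
            preds.append(d)
        return torch.stack(preds, dim=1)  # (batch, T, 1)

model = DurabilityRNN(n_features=2)
u, d0 = torch.randn(4, 10, 2), torch.zeros(4, 1)
loss = nn.MSELoss()(model(u, d0), torch.randn(4, 10, 1))
loss.backward()  # gradients flow back through the fed-back predictions
```

Because the fed-back prediction stays in the computation graph as an ordinary tensor (no in-place slice assignment into the input data), loss.backward() works here.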

How to use return_sequences option and TimeDistributed layer in Keras?

I have a dialog corpus like the one below, and I want to implement an LSTM model which predicts a system action. The system action is described as a bit vector, and the user input is encoded as a word embedding, which is also a bit vector.
t1: user: "Do you know an apple?", system: "no"(action=2)
t2: user: "xxxxxx", system: "yyyy" (action=0)
t3: user: "aaaaaa", system: "bbbb" (action=5)
So what I want to realize is a "many to many (2)" model. When my model receives a user input, it must output a system action.
But I cannot understand the return_sequences option and the TimeDistributed layer used after an LSTM. To realize "many-to-many (2)", are return_sequences=True and a TimeDistributed layer after the LSTM required? I would appreciate a more detailed description of them.
return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
TimeDistributed: This wrapper allows to apply a layer to every temporal slice of an input.
Updated 2017/03/13 17:40
I think I now understand the return_sequences option, but I am still not sure about TimeDistributed. If I add a TimeDistributed layer after the LSTM, is the model the same as my "many-to-many (2)" below? So I think a Dense layer is applied to each output.
The LSTM layer and the TimeDistributed wrapper are two different ways to get the "many to many" relationship that you want.
The LSTM will eat the words of your sentence one by one; via return_sequences you can choose to output something (the state) at each step (after each word is processed) or to output something only after the last word has been eaten. So with return_sequences=True, the output will be a sequence of the same length; with return_sequences=False, the output will be just one vector.
TimeDistributed: this wrapper allows you to apply one layer (say Dense, for example) to every element of your sequence independently. That layer will have exactly the same weights for every element; it is the same layer applied to each word, and it will, of course, return the sequence of words processed independently.
As you can see, the difference between the two is that the LSTM propagates information through the sequence: it eats one word, updates its state, and returns it or not, then goes on with the next word while still carrying information from the previous ones. With TimeDistributed, in contrast, the words are processed in the same way on their own, as if in silos, and the same layer is applied to every one of them.
So you don't have to use LSTM and TimeDistributed in a row; you can do whatever you want, just keep in mind what each of them does.
I hope that's clearer.
EDIT:
TimeDistributed, in your case, applies a Dense layer to every element that was output by the LSTM.
Let's take an example:
You have a sequence of n_words words that are embedded in emb_size dimensions. So your input is a 2D tensor of shape (n_words, emb_size).
First you apply an LSTM with output dimension lstm_output and return_sequences=True. The output will still be a sequence, so it will be a 2D tensor of shape (n_words, lstm_output).
So you have n_words vectors of length lstm_output.
Now you apply a TimeDistributed Dense layer with, say, 3 output dimensions as the parameter of the Dense: TimeDistributed(Dense(3)).
This will apply Dense(3) n_words times, to every vector of size lstm_output in your sequence independently. They will all become vectors of length 3. Your output will still be a sequence, so a 2D tensor, now of shape (n_words, 3).
Is it clearer? :-)
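The worked example above, written out as a (tf.)keras model; the sizes are placeholders:

```python
import numpy as np
import tensorflow as tf

n_words, emb_size, lstm_output = 10, 8, 16   # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_words, emb_size)),           # (n_words, emb_size)
    tf.keras.layers.LSTM(lstm_output, return_sequences=True),   # (n_words, lstm_output)
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(3)),  # (n_words, 3)
])
print(model(np.zeros((1, n_words, emb_size), dtype="float32")).shape)  # (1, 10, 3)
```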
return_sequences=True parameter:
If we want a sequence as the output, not just a single vector as we would get from a normal neural network, it is necessary to set return_sequences=True. Concretely, say we have an input with shape (num_seq, seq_len, num_feature). If we don't set return_sequences=True, the output will have shape (num_seq, num_units), where num_units is the LSTM's output dimension; if we do, we will obtain an output with shape (num_seq, seq_len, num_units).
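A quick shape check of that claim (tf.keras; the numbers are arbitrary):

```python
import numpy as np
import tensorflow as tf

x = np.zeros((4, 7, 5), dtype="float32")  # (num_seq, seq_len, num_feature)
print(tf.keras.layers.LSTM(12)(x).shape)                         # (4, 12)
print(tf.keras.layers.LSTM(12, return_sequences=True)(x).shape)  # (4, 7, 12)
```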
TimeDistributed wrapper layer:
Since we set return_sequences=True in the LSTM layers, the output is now a three-dimensional tensor. If we feed that into the Dense layer directly, it will raise an error, because the Dense layer only accepts two-dimensional input. In order to handle a three-dimensional input, we need to use a wrapper layer called TimeDistributed. This layer helps us maintain the output's shape, so that we can obtain a sequence as the output in the end.