Quoting from the paper "Regularizing and Optimizing LSTM Language Models":
Given a fixed sequence length that is used to break a dataset into fixed length batches, the data set is not efficiently used. To illustrate this, imagine being given 100 elements to perform backpropagation through with a fixed backpropagation through time (BPTT) window of 10. Any element divisible by 10 will never have any elements to backprop into, no matter how many times you may traverse the data set. Indeed, the backpropagation window that each element receives is equal to i mod 10 where i is the element’s index. This is data inefficient, preventing 1/10 of the data set from ever being able to improve itself in a recurrent fashion, and resulting in 8/10 of the remaining elements receiving only a partial backpropagation window compared to the full possible backpropagation window of length 10.
I could not grasp the intuition behind the statements above.
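To make the arithmetic in the quoted passage concrete, here is a minimal sketch. It assumes 0-based element indices and that the 100 elements are cut into consecutive chunks of 10; the names seq_len, bptt and window are illustrative and do not come from the paper. It counts how many earlier elements each position can backpropagate into inside its own chunk.

```python
from collections import Counter

seq_len = 100  # number of elements, as in the quoted example
bptt = 10      # fixed BPTT window, as in the quoted example

# Assumption: chunks start at indices 0, 10, 20, ...
# window[i] = number of earlier elements in the same chunk as element i,
# i.e. how far the gradient from element i can flow back.
window = [i % bptt for i in range(seq_len)]

# Elements that open a chunk have nothing to backpropagate into.
no_context = [i for i, w in enumerate(window) if w == 0]

# How many elements share each window length.
counts = Counter(window)

print("elements with an empty window:", no_context)
print("elements per window length:", dict(sorted(counts.items())))
```

Under these assumptions the positions 0, 10, ..., 90 always open a chunk and get a window of 0, no matter how many epochs are run. That is the 1/10 of the data the passage refers to, and most of the other positions see only part of the 10-step window.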
I have a set of multivariate time series of varying length.
Each time series also corresponds to a vector (not a time series). These vectors are not labels.
I'm trying to model both the time series and the vector data in a single model for extreme rare-event classification.
The vector values have some influence on the time series, so this relation needs to be learnt in order to classify.
Could you please tell me how to model a mixture of time series and flat (vector) data?
I have been stuck for days because I cannot find a way to fit a traditional regression model such as ARIMA with multiple time series as input.
I have a thousand trajectories of different vehicles (xy coordinates for each position). Let's say each sample (trajectory) is composed of 10 positions (the trajectories are not necessarily all of the same length). It means I have got 10*N different time series (N is the total number of samples). I want to fit one model on the x coordinates of all samples and then predict the future position of any new trajectory (test sample) that I give as input. Then I plan to do the same with another model for the y coordinates. I am not claiming the method will work, but I need to implement it in order to compare it with others (neural networks, ...).
The hypothesis: a number of time series can be modeled with a single ARIMAX (or other) model (i.e. the same parameters work for all the time series). What is wanted: to fit them all simultaneously.
Can someone help me please?
Thank you in advance for your support!
Best regards,
I am using several regressors, such as XGBoost, gradient boosting, random forest and decision tree, to predict a continuous target value.
I also have some complementary information: I know that my prediction (target value), based on all the features I have, should lie in a given range.
Is there any way to take these bounds into account more effectively as a feature in any of these algorithms, instead of only checking the range of the already predicted values and doing some post-processing?
Note that simply providing the lower and upper bounds of the target value will not necessarily make these algorithms learn to compute predictions within the given range. I am looking for a more effective way to take these bounds into account as given data.
Thanks
The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
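As a rough illustration of the idea described above, here is a minimal pure-Python sketch of a simple (Elman-style) recurrent step. The function rnn_forward, the weight names W_xh, W_hh, b_h and all sizes are made up for this example and are not taken from any library. The point is only that one fixed set of weights is reused at every time step, so sequences of length 5 and length 50 go through the same model.

```python
import math
import random

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a tiny recurrent network over xs and return the final hidden state.

    xs is a list of input vectors (one per time step). The same
    W_xh, W_hh and b_h are applied at every step.
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size  # initial hidden state
    for x in xs:
        h = [
            math.tanh(
                sum(W_xh[j][k] * x[k] for k in range(len(x)))
                + sum(W_hh[j][k] * h[k] for k in range(hidden_size))
                + b_h[j]
            )
            for j in range(hidden_size)
        ]
    return h

random.seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes

# One set of parameters, independent of the sequence length.
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b_h = [0.0] * hidden_size
n_params = hidden_size * input_size + hidden_size * hidden_size + hidden_size

short_seq = [[random.random() for _ in range(input_size)] for _ in range(5)]
long_seq = [[random.random() for _ in range(input_size)] for _ in range(50)]

h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(n_params, len(h_short), len(h_long))
```

The parameter count depends only on the input and hidden sizes, never on the number of time steps, and both sequences end in a hidden state of the same size.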
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, the sentences were padded to equal length beforehand. Doesn't this defeat the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is to reduce the number of parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that there are similar interesting patterns in different regions of the picture (for example, a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that there are similar patterns at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step where it could occur in your model.
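The saving can be put into numbers with a back-of-the-envelope sketch. The sizes below (100 input features, 128 hidden units) are made up for illustration, and the count is for a plain recurrent step with input-to-hidden weights, hidden-to-hidden weights and a bias; an LSTM would have roughly four times as many per step.

```python
def rnn_step_params(input_size, hidden_size):
    """Parameters of one simple recurrent step: W_xh, W_hh and a bias vector."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size

input_size = 100   # illustrative feature size
hidden_size = 128  # illustrative hidden size
time_steps = 20    # the 20 time steps from the example above

# With sharing: one set of weights reused at all 20 steps.
shared = rnn_step_params(input_size, hidden_size)

# Without sharing: a separate set of weights for each of the 20 steps.
unshared = time_steps * shared

print("shared:", shared)
print("unshared:", unshared)
```

With these assumed sizes, the unshared variant has exactly 20 times as many parameters, and the gap grows linearly with the number of time steps.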
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, this can be done by using parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs needed during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory. But in practice, this is inefficient, because TensorFlow's computational graphs do not allow recurrence; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths, this means 30 different graphs during training, so for large models this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequences (lengths 5 and 100, for example), you can use bucketing and padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100]. Then you create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 to 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
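The bucketing step can be sketched in a few lines of plain Python. The helper bucket_and_pad and the padding value 0 are assumptions made for this illustration, not part of Keras or TensorFlow; the bucket lengths are the ones from the example above. Each sequence goes into the smallest bucket that fits and is right-padded to that bucket's length.

```python
import bisect

def bucket_and_pad(sequences, bucket_lengths, pad_value=0):
    """Group sequences by the smallest fitting bucket and right-pad them.

    Returns a dict mapping bucket length -> list of padded sequences.
    """
    bucket_lengths = sorted(bucket_lengths)
    buckets = {length: [] for length in bucket_lengths}
    for seq in sequences:
        # Index of the first bucket whose length is >= len(seq).
        idx = bisect.bisect_left(bucket_lengths, len(seq))
        if idx == len(bucket_lengths):
            raise ValueError(
                f"sequence of length {len(seq)} exceeds the largest bucket"
            )
        target = bucket_lengths[idx]
        padded = list(seq) + [pad_value] * (target - len(seq))
        buckets[target].append(padded)
    return buckets

# Toy token sequences of lengths 5, 7, 20, 33 and 100.
sequences = [[1] * 5, [2] * 7, [3] * 20, [4] * 33, [5] * 100]
buckets = bucket_and_pad(sequences, [5, 20, 50, 100])

for length, batch in buckets.items():
    print(length, len(batch))
```

Here the length-7 sequence receives 13 padding tokens instead of the 93 it would need if everything were padded to 100. In a real pipeline the padded positions would usually also be masked so that they do not contribute to the loss.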