Estimating error with a Kalman Filter

I'm working on adding a simple 1-D Kalman Filter to an application to process some noisy input data and output a cleaned result.
The example code I'm using comes from the Single-Variable example section of this tutorial and this python code.
This works well for calculating the filtered value; however, when I first read about Kalman filters I was under the impression that they could also be used to give some measure of how much "error" is in the inputs.
As an example, say I'm measuring a value of 10 but my input has a large amount of error. My input data may look like 6, 11, 14, 5, 19, 5, etc. (some Gaussian distribution around 10).
But say I switch to a less-noisy measurement and the measurements are 9.7, 10.3, 10.1, 10.0, 9.8, 10.1.
In both cases, the Kalman Filter will theoretically converge to a proper measurement of 10. What I want is to also have it give me some sort of numerical value estimating how much error there was in these data streams.
I believe this should be quite possible with a Kalman Filter, however, I'm having trouble finding a resource describing this. How can I do this?

In fact the situation is quite the opposite: the KF's estimate of its own error is not affected by your data at all. If you look at the predict/update steps of the KF, you'll see that the P term is never influenced by your state or your measurements. It is computed purely from your estimate of the additive process noise Q and your estimate of the measurement noise R.
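For a scalar filter you can see this directly in the equations. Here is a minimal sketch (not the code from the linked tutorial; the names q, r, and p are just illustrative) showing that the covariance sequence depends only on Q and R, while the measurements only enter the state update:
# Minimal 1-D Kalman filter (random-walk state model).
# Note that p is driven only by q and r: the measurement z influences
# the state x, but never the covariance p.
def kalman_1d(measurements, x0=0.0, p0=1.0, q=1e-4, r=1.0):
    x, p = x0, p0
    estimates, covariances = [], []
    for z in measurements:
        p = p + q                # predict: covariance grows by process noise
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update: z pulls the state toward itself
        p = (1 - k) * p          # update: covariance shrinks, independent of z
        estimates.append(x)
        covariances.append(p)
    return estimates, covariances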
If you have a dataset and want to measure it, you can compute its mean and variance (which are what your state and process covariance represent). If you are talking about your input, then you are talking about measuring the variance of your samples in order to set R.
If your input measurements are actually less noisy than expected, you will get a less noisy state, but with greater latency than you would have had if you had set expectations correctly in R.
In a running filter you can look at your innovation sequence (the differences between the predicted and actual measurements) and compare them to your predicted innovation covariance (usually called S although sometimes rolled directly in as the denominator of K).
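In the scalar case that check is only a few lines. A rough sketch (function and variable names are illustrative): the average of nu^2 / S should be close to 1 if your R matches the real measurement noise.
# Sketch: monitor the innovation sequence of a scalar filter.
# If the average of nu^2 / S is well above 1, the inputs are noisier than
# your R assumes; well below 1, they are cleaner than assumed.
def innovation_check(measurements, x0=0.0, p0=1.0, q=1e-4, r=1.0):
    x, p = x0, p0
    normalized = []
    for z in measurements:
        p = p + q                      # predict
        s = p + r                      # predicted innovation covariance S
        nu = z - x                     # innovation: actual minus predicted measurement
        normalized.append(nu * nu / s)
        k = p / s                      # update
        x = x + k * nu
        p = (1 - k) * p
    return sum(normalized) / len(normalized)

print(innovation_check([6, 11, 14, 5, 19, 5], x0=10.0, r=1.0))                # >> 1: noisier than R assumes
print(innovation_check([9.7, 10.3, 10.1, 10.0, 9.8, 10.1], x0=10.0, r=1.0))   # << 1: cleaner than R assumes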

The Kalman filter won't give you a measure of how much "error" is in the inputs; you will only get the error of the estimated outputs.
Why don't you use an online algorithm to calculate the variance of the inputs?
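For example, Welford's online algorithm gives you a running mean and variance of the raw inputs in a single pass. A minimal sketch (the class name is just illustrative):
# Welford's online algorithm: running mean and sample variance of the inputs.
class RunningVariance:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                      # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

rv = RunningVariance()
for z in [6, 11, 14, 5, 19, 5]:
    rv.update(z)
print(rv.mean, rv.variance())              # roughly 10, plus the input noise variance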

XGBoost regression RMSE individual prediction

I have a simple regression problem with two independent variables and one dependent variable. I tried linear regression from statsmodels and scikit-learn, but I get the best results (R^2 and RMSE) with the XGBoost regressor.
On the new data set, the RMSE is still in line with earlier results, but individual predictions are very different.
For example, the RMSE is 1000, and individual predictions vary from 20 to 3000. Thus, predictions are either almost perfectly accurate or strongly deviate in a few cases, but I don't know why that is.
My question is what is the cause of such variations in individual predictions?
When testing your model with new data, it's normal to get some of the predictions wrong. An RMSE of 1000 means that the square root of the average squared difference between the actual and predicted values is 1000. You can have values that are predicted very well, as well as values that give a very large squared error. The reason for this could be overfitting. It could also be that the new data set contains data that is very different from the data the model was trained on. But since the RMSE is in line with earlier results, I understand that it was around 1000 on the training set as well, so I don't necessarily see a problem with the test set. What I would do is go through the preprocessing steps for the training data and make sure they're done correctly:
standardize the data and remove possible skewness
check for collinearity between independent variables (you only have 2, so it should be easy to do)
check whether the independent variables have acceptable variance. If a variable hardly varies across data points, it may be useless for explaining the dependent variable.
BTW, what is the R2 score for your regression? It should tell you how much of the variability of the target variable is explained by your model. A low R2 score should indicate that the regressors used aren't very useful in explaining your target variable.
You can use the sklearn StandardScaler() to standardize the data.
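For example, a minimal sketch with pandas and scikit-learn (the column names and values are placeholders for your two regressors):
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder stand-in for your two independent variables.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [10.0, 21.0, 29.0, 42.0]})

# Collinearity check: correlation between the two regressors.
print(df.corr())

# Standardize to zero mean / unit variance; fit the scaler on the training data only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[["x1", "x2"]])
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))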

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET furthermore provides an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that model quality decreases when --num-iterations is set to too high a value?
In a personal project, increasing the number of iterations from 100 to 1000 did not change the average accuracy (measured as Mean Reciprocal Rank) on a downstream task, averaged over 10-fold cross-validation. However, within the individual cross-validation splits the performance changed significantly, even though the random seed was fixed and all other parameters were kept the same. What part of the background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using symmetric priors for alpha and beta without hyperparameter optimization, and the parallelized LDA implementation provided by MALLET.
The 1000-iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1,000,000 iterations, and fully half the token assignments never changed from the 1000-iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
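If you want to check the effect of the iteration count on your own data, one option is to sweep --num-iterations with a fixed seed and evaluate each resulting model. A rough sketch that shells out to the MALLET command line (the paths and topic count are placeholders, and the evaluation of each model is left as a comment because it depends on your downstream task):
import subprocess

MALLET = "/path/to/mallet/bin/mallet"      # placeholder path to the MALLET binary
CORPUS = "topic-input.mallet"              # placeholder, produced by 'mallet import-dir'

for n_iter in [100, 200, 500, 1000, 2000]:
    subprocess.run([
        MALLET, "train-topics",
        "--input", CORPUS,
        "--num-topics", "50",              # placeholder topic count
        "--num-iterations", str(n_iter),
        "--random-seed", "1",              # keep the seed fixed across runs
        "--output-state", "state-%d.gz" % n_iter,
    ], check=True)
    # Evaluate each saved state here (perplexity on held-out documents, topic
    # coherence, or your downstream MRR) and compare across iteration counts.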

Deep learning model test accuracy unstable

I am trying to train and test a PyTorch GCN model that is supposed to identify people. But the test accuracy is quite jumpy: for example, it is 49% at epoch 23 and then drops to around 45% at epoch 41. So it is not increasing all the time, even though the loss seems to decrease at every epoch.
My question is not about implementation errors; rather, I want to know why this happens. I don't think there is anything wrong with my code, as I have seen SOTA architectures show this type of behavior as well; the authors just picked the best result and published it.
Is it normal for the accuracy to be jumpy (up and down), and should I just take the best-ever weights that produced it?
Accuracy is naturally more "jumpy", as you put it. In terms of accuracy, you have a discrete outcome for each sample: you either get it right or wrong. This makes the results fluctuate, especially if you have a relatively low number of samples (as you then have a higher sampling variance).
On the other hand, the loss function should vary more smoothly. It is based on the class probabilities calculated at your softmax layer, which vary continuously. With a small enough learning rate, the training loss should decrease nearly monotonically. Any bumps you see are due to the optimization algorithm taking discrete steps, under the assumption that the loss function is roughly linear in the vicinity of the current point.
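On the practical question of keeping the best weights: a common pattern is to checkpoint whenever the held-out accuracy improves, ideally measured on a validation split rather than the test set so the reported test accuracy stays unbiased. A minimal PyTorch sketch, where train_one_epoch and evaluate stand in for your own training and evaluation code:
import torch

def train_with_checkpointing(model, train_one_epoch, evaluate, num_epochs,
                             path="best_model.pt"):
    # train_one_epoch(model) and evaluate(model) are placeholders for your own
    # loops; evaluate should return accuracy on a held-out validation split.
    best_acc = 0.0
    for epoch in range(num_epochs):
        train_one_epoch(model)
        acc = evaluate(model)
        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), path)   # keep the best-ever weights
    return best_acc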

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in PyTorch. My target values can be either between 0 and 100 or between 0 and 1 (they represent a percentage, or a percentage divided by 100).
The data is unbalanced: I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn: the validation loss doesn't improve, and the loss on the 25% largest targets is very big, much bigger than the standard deviation within this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and whether using the 0-1 range is "cheating", that would be great.
Also, should I scale the targets (whether I use the larger or the smaller range)?
Some additional info: I'm trying to fine-tune BERT for a specific task, and I use MSELoss.
Thanks!
I think your observation relates to batch normalization. There is a paper written on the subject, and numerous Medium/Towards Data Science posts, which I will not list here. The idea is that if you have no non-linearities in your model and loss function, scaling doesn't matter. But even MSE is non-linear (it is quadratic), which makes training sensitive to the scaling of both the target and the source data. You can experiment with inserting batch normalization layers into your model, after dense or convolutional layers. In my experience it often improves accuracy.
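If you want to keep reporting percentages, an alternative to relying on which units the labels happen to be in is to scale the targets explicitly and unscale the predictions afterwards. A minimal sketch (the factor of 100 comes from the percentages in the question):
import torch.nn as nn

SCALE = 100.0                              # targets are percentages in [0, 100]
criterion = nn.MSELoss()

def loss_on_scaled_targets(predictions, targets):
    # Train against targets / 100 so the regression head works on a range
    # that MSE and default learning rates handle well.
    return criterion(predictions, targets / SCALE)

def to_percent(predictions):
    # Undo the scaling at inference time to report percentages again.
    return predictions * SCALE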

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, they padded the sentences to equal lengths beforehand. Doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is to reduce the number of parameters that the model has to learn. This is the whole point of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that similar patterns occur at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same thing, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember what word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
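You can make the parameter reduction concrete by counting parameters. A small illustrative sketch in Keras (layer sizes and sequence length are arbitrary), comparing a SimpleRNN that reuses one weight set at every time step against a feed-forward network that gets separate weights for every time step:
from tensorflow.keras import layers, models

TIMESTEPS, FEATURES, UNITS = 20, 50, 32    # arbitrary illustrative sizes

# Recurrent model: one shared input-to-hidden and hidden-to-hidden weight set,
# reused at every time step, so the count does not depend on TIMESTEPS.
rnn = models.Sequential([layers.SimpleRNN(UNITS, input_shape=(TIMESTEPS, FEATURES))])

# Feed-forward model: flattening gives every time step its own weights,
# so the count grows linearly with TIMESTEPS.
ff = models.Sequential([
    layers.Flatten(input_shape=(TIMESTEPS, FEATURES)),
    layers.Dense(UNITS),
])

print(rnn.count_params())   # 2656 = 50*32 + 32*32 + 32, independent of sequence length
print(ff.count_params())    # 32032 = (20*50)*32 + 32, grows with sequence length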
Padding
As for padding the sequences: the main purpose is not directly to let the model handle sequences of varying length; like you said, that is what parameter sharing gives you. Padding is used for efficient training, specifically to keep the number of computational graphs needed during training low. Without padding, we have two options for training:
We unroll the model separately for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps, and update the parameters. This seems intuitive in theory, but in practice it is inefficient, because TensorFlow's computational graphs don't allow recurrence; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths this means 30 different graphs during training, so for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100, for example), you can use bucketing and padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and then create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 to length 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
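As a small illustration of bucketing, assuming the bucket boundaries from the example above and Keras' pad_sequences (the toy sequences are placeholders):
from tensorflow.keras.preprocessing.sequence import pad_sequences

buckets = [5, 20, 50, 100]                    # bucket lengths from the example above

def assign_bucket(seq):
    # Smallest bucket length that still fits the sequence.
    return min(b for b in buckets if len(seq) <= b)

sequences = [[1, 2, 3], [4] * 18, [7] * 60]   # toy sequences of varying length

bucketed = {b: [] for b in buckets}
for seq in sequences:
    bucketed[assign_bucket(seq)].append(seq)

# One padded batch (and, in a static-graph framework, one graph) per bucket.
for b, seqs in bucketed.items():
    if seqs:
        batch = pad_sequences(seqs, maxlen=b, padding="post")
        print(b, batch.shape)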