Kalman-filter with 100 data samples containing noise - kalman-filter

If I have a series of observations of say 100 samples of x and y.
Is this enough to predict the 101th y corresponding to a x value?Can I use some part of this data of 100 samples to update some values(Considering that noise exists and some data might be corrupt) ?

Stack overflow is directed at coding - so if you have code that you expect to work, and it doesn't, you should post it with your question.
A Kalman filter can help in the problem you describe if you have a model for the dependence of y on x. So, for example, if your model is that:
y = a * x + b + Gaussian noise, then the Kalman filter is one way to estimate 'a' and 'b', which then allow you to predict the 101'st y from the 101'st x.

Related

Questions about convolution (in CNN)

I suddenly came up with a question about convolution and just wanted to be clear if I'm missing something. The question is whether if the two operations below are identical.
Case1)
Suppose we have a feature map C^2 x H x W. And, we have a K x K x C^2 Conv weight with stride S. (To be clear, C^2 is the channel dimension but just wanted to make it as a square number, K is the kernel size).
Case2)
Suppose we have a feature map 1 x CH x CW. And, we have a CK x CK x 1 Conv weight with stride CS.
So, basically Case2 is a pixel-upshuffled version of case1 (both feature-map and Conv weight.) As convolutions are simply element-wise multiplication, both operations seem identical to me.
# given a feature map and a conv_weight, namely f_map, conv_weight
#case1)
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=1)
#case2)
f_map = pixelshuffle(f_map, scale=C)
conv_weight = pixelshuffle(f_map, scale=C)
result = convLayer(f_map, stride=C)
But this means that, (for example) given a 256xHxW feature-map with a 3x3 Conv (as in many deep learning models), performing a convolution was simply doing a HUUUGE 48x48 Conv to a 1 x 16*H x 16*W Feature map.
But this doesn't meet my basic intuition of CNNs, stacking multiple of layers with the smallest 3x3 Conv, resulting in somewhat large receptive field, and each channel having different (possibly redundant) information.
You can, in a sense, think of "folding" spatial information into the channel dimension. This is the rationale behind ResNet's trade-off between spatial resolution and feature dimension. In the ResNet case whenever they sample x2 in space they increase feature space x2. However, since you have two spatial dimensions and you sample x2 in both you effectively reduce the "volume" of the feature map by x0.5.

What is exact logic of performing batch normalization in deep learning?

After reading the research paper on batchnorm and its various descriptions in forums, I am still not clear how the basic computations are performed. The core of my questions is: a vector is normalized with respect to the set to which it belongs; we can thus normalize vectors input to layer 1 using the batch selected from the training set. Each input vector to the next layer needs to be normalized with respect to the set to which it belongs, but how do we get hold of that set?
More precisely, let
N = batch size;
Bi = set of vectors Xij (j = 1..n) whose normalized value will be
input to layer i of the network;
BN = Batch normalization function;
BN(Xij, Bi) = Normalized version of j-th vector, Xij, with respect to the set Bi.
BN(X1j, B1), j = 1..n, can be calculated because we know B1. They are inputs to layer 1.
We need BN(X2j, B2), j = 1..n, to input to layer 2, but we do not have B2 readily available. My questions is how to get B2, B3, etc.
We could process each BN(X1j, B1), j = 1..n, by layer 1 and remember the outputs as X2j (that collection will be B2). Then calculate BN(x2j, B2) for each j by normalizing with respect to B2 and input them to layer 2, etc. So the forward pass would consists of many such steps. For simplicity, I have ignored the scale and shift step, as that is not relevant to my question.
Being new to this topic, I would appreciate expert opinion on it.
Batch Norm first subtracts the mean to center the output around zero. Then Batch Normalization devides by the standart derivation to scale the output to unit variance.
This means that each vector is normalized with respect to the batch it belongs to. Assume you run your x through your first layer, lets call its output y1. Batch Normalization is then applied to y1 in the form of (y1 - y1.mean()) / y.std(). The 'set' you're talking about is just the vector y1, the output of the layer.
All this is of course ignoring versions where a bias and a derivation.

feature scaling and intercept

it is a beginner question but I have a dataset of two features house sizes and number of bedrooms, So I'm working on Octave; So basically I try to do a feature scaling but as in the Design Matrix I added A column of ones (for tetha0) So I try to do a mean normalisation : (x- mean(x)/std(x)
but in the columns of one obviously the mean is 1 since there is only 1 in every line So when I do this the intercept colum is set to 0 :
mu = mean(X)
mu =
1.0000 2000.6809 3.1702
votric = X - mu
votric =
0.00000 103.31915 -0.17021
0.00000 -400.68085 -0.17021
0.00000 399.31915 -0.17021
0.00000 -584.68085 -1.17021
0.00000 999.31915 0.82979
So the first column shouldn't be left out of the mean normalization ?
Yes, you're supposed to normalise the original dataset over all observations first, and only then add the bias term (i.e. the 'column of ones').
The point of normalisation is to allow the various features to be compared on an equal basis, which speeds up optimization algorithms significantly.
The bias (i.e. column of ones) is technically not part of the features. It is just a mathematical convenience to allow us to use a single matrix multiplication to obtain our result in a computationally and notationally efficient manner.
In other words, instead of saying Y = bias + weight1 * X1 + weight2 * X2 etc, you create an imaginary X0 = 1, and denote the bias as weight0, which then allows you to express it in a vectorised fashion as follows: Y = weights * X
"Normalising" the bias term does not make sense, because clearly that would make X0 = 0, the effect of which would be that you would then discard the effect of the bias term altogether. So yes, normalise first, and only then add 'ones' to the normalised features.
PS. I'm going on a limb here and guessing that you're coming from Andrew Ng's machine learning course on coursera. You will see in ex1_multi.m that this is indeed what he's doing in his code (line 52).
% Scale features and set them to zero mean
fprintf('Normalizing Features ...\n');
[X mu sigma] = featureNormalize(X);
% Add intercept term to X
X = [ones(m, 1) X];

Backpropagation on Two Layered Networks

i have been following cs231n lectures of Stanford and trying to complete assignments on my own and sharing these solutions both on github and my blog. But i'm having a hard time on understanding how to modelize backpropagation. I mean i can code modular forward and backward passes but what bothers me is that if i have the model below : Two Layered Neural Network
Lets assume that our loss function here is a softmax loss function. In my modular softmax_loss() function i am calculating loss and gradient with respect to scores (dSoft = dL/dY). After that, when i'am following backwards lets say for b2, db2 would be equal to dSoft*1 or dW2 would be equal to dSoft*dX2(outputs of relu gate). What's the chain rule here ? Why isnt dSoft equal to 1 ? Because dL/dL would be 1 ?
The softmax function is outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then to calculate the derivative with respect to W of the last layer you use the chain rule i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b where x_prev is the input to the last node. Therefore dx/dW is just x and dx/db is just 1, which means that dL/dW or simply dW is dsoftmax/dx * x_prev and dL/db or simply db is dsoftmax/dx * 1. Note that here dsoftmax/dx is dSoft we defined earlier.

Tune input features using backprop in keras

I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is you encode each condition as an input parameter and let the network learn dependency between the condition and the feature-label mapping. On a new dataset instead of adapting the entire network you just tune these weights using backprop. For example say my network looks like this
X ---->|----|
|DNN |----> Y
Z --- >|----|
X: features Y: labels Z:condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True with all other layers set to trainable= False. Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the keras backend I was able to compute the gradient of my loss w.r.t to Z directly and used it to drive the update.
Code below:
import keras.backend as K
import numpy as np
model.summary() #Pretrained model
loss = K.categorical_crossentropy(Y, Y_out)
grads = K.gradients(loss, Z)
grads /= (K.sqrt(K.mean(K.square(grads)))+ 1e-5)
iterate = K.function([X,Z],[loss,grads])
step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
loss_val, grads_val = iterate([X_in,Z_adapt])
Z_adapt -= grads_val[0] * step
print "iter:",i,np.mean(loss_value)
print "Before:"
print model.evaluate([X_in, Z_in],Y_out)
print "After:"
print model.evaluate([X_in, Z_adapt],Y_out)
X,Y,Z are nodes in the model graph. Z_in is an initial value for Z'. I set it to an average value from the train set. Z_adapt is after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. Then you can first define an input layer of size m * n x 1. The input will be an m * n x 1 vector of ones. You can define a dense layer containing m * n neurons and set trainable = True for it. The response of this layer will give you a flattened version of Z. Reshape it appropriately and give it as input to the rest of the network that can be appended ahead of this.
Keep in mind that if the size of Z is too large, then network may not be able to learn a dense layer of that many neurons. In that case, maybe you need to put additional constraints or look into convolutional layers. However, convolutional layers will put some constraints on Z.