Will adding the quadratic term x^2 as another independent variable allow OLS to fit the data better? Why? - regression

Can somebody explain this? The model is y = beta0 + beta1*x + u, and x has zero correlation with the residual.

Related

Why is the time complexity of the Bellman equation n^3 for the direct solution?

There is a direct solution for the matrix form of the Bellman equation, but I don't understand why the computational complexity of this form is n^3. I would really appreciate it if anyone could explain. Thanks.
The Bellman equation of the value function in vector form can be written as
V = R + γPV
Where
V is a column vector representing the value function for each state (1..n)
R is a column vector representing the immediate reward after exiting a particular state
γ (gamma) is the discount factor
P is an nxn transition matrix (All the places we may end up)
Because the Bellman equation is a linear equation we can solve it directly. We can solve for V by rearranging the equation like this.
V - γPV = R
(I - γP) V = R
V = (I - γP)^-1 R
Solving for V therefore requires inverting the nxn matrix (I - γP), or equivalently solving the linear system above, which takes O(n^3) operations with the standard Gauss-Jordan algorithm.
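To make the O(n^3) count concrete, here is a minimal Python sketch of the direct solution using Gauss-Jordan elimination on the system (I - γP) V = R. The function name solve_bellman and the two-state P, R and gamma values are made up for illustration; they are not from the question. The three nested levels (pivot column, row, entries of the row) are where the n^3 comes from.

```python
def solve_bellman(P, R, gamma):
    """Solve (I - gamma * P) V = R by Gauss-Jordan elimination."""
    n = len(R)
    # Augmented matrix [I - gamma*P | R].
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)] + [R[i]]
         for i in range(n)]
    for col in range(n):                       # loop 1: n pivot columns
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for row in range(n):                   # loop 2: n rows per pivot
            if row != col:
                f = A[row][col]
                # loop 3: about n entries per row
                A[row] = [a - f * b for a, b in zip(A[row], A[col])]
    return [A[i][n] for i in range(n)]

# Hypothetical two-state example (values chosen only for illustration).
P = [[0.5, 0.5], [0.2, 0.8]]
R = [1.0, 0.0]
gamma = 0.9
V = solve_bellman(P, R, gamma)
print(V)
```

In practice you would normally call a library linear solver instead of forming the inverse explicitly, but the asymptotic cost for a dense matrix is the same.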

feature scaling and intercept

This is a beginner question. I have a dataset with two features, house size and number of bedrooms, and I'm working in Octave. I am trying to do feature scaling, but I have already added a column of ones (for theta0) to the design matrix. I apply mean normalisation: (x - mean(x)) / std(x)
But in the column of ones the mean is obviously 1, since every row contains a 1, so when I do this the intercept column is set to 0:
mu = mean(X)
mu =
1.0000 2000.6809 3.1702
votric = X - mu
votric =
0.00000 103.31915 -0.17021
0.00000 -400.68085 -0.17021
0.00000 399.31915 -0.17021
0.00000 -584.68085 -1.17021
0.00000 999.31915 0.82979
So shouldn't the first column be left out of the mean normalization?
Yes, you're supposed to normalise the original dataset over all observations first, and only then add the bias term (i.e. the 'column of ones').
The point of normalisation is to allow the various features to be compared on an equal basis, which speeds up optimization algorithms significantly.
The bias (i.e. column of ones) is technically not part of the features. It is just a mathematical convenience to allow us to use a single matrix multiplication to obtain our result in a computationally and notationally efficient manner.
In other words, instead of saying Y = bias + weight1 * X1 + weight2 * X2 etc, you create an imaginary X0 = 1, and denote the bias as weight0, which then allows you to express it in a vectorised fashion as follows: Y = weights * X
"Normalising" the bias term does not make sense, because clearly that would make X0 = 0, the effect of which would be that you would then discard the effect of the bias term altogether. So yes, normalise first, and only then add 'ones' to the normalised features.
PS. I'm going out on a limb here and guessing that you're coming from Andrew Ng's machine learning course on Coursera. You will see in ex1_multi.m that this is indeed what he's doing in his code (line 52).
% Scale features and set them to zero mean
fprintf('Normalizing Features ...\n');
[X mu sigma] = featureNormalize(X);
% Add intercept term to X
X = [ones(m, 1) X];
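For readers not using Octave, the same order of operations (normalise first, then add the ones) can be sketched in plain Python. This is only an illustration: feature_normalize here is a hypothetical stand-in, not the course's featureNormalize, and the five rows are the ones recoverable from the output in the question, so their mean differs from the full dataset's mu.

```python
from statistics import mean, stdev

def feature_normalize(X):
    """Return (X_norm, mu, sigma) with each column scaled to zero mean, unit std."""
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    sigma = [stdev(c) for c in cols]  # sample std (N-1), like Octave's default std()
    X_norm = [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in X]
    return X_norm, mu, sigma

# House size and number of bedrooms for five rows (illustrative subset).
X = [[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0], [1416.0, 2.0], [3000.0, 4.0]]

# Step 1: normalise the real features only.
X_norm, mu, sigma = feature_normalize(X)

# Step 2: only now prepend the intercept column of ones.
X_design = [[1.0] + row for row in X_norm]
print(X_design[0])
```

Keep mu and sigma around: any new house you want a prediction for has to be scaled with the same values before the 1 is prepended.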

Kalman-filter with 100 data samples containing noise

Suppose I have a series of observations, say 100 samples of x and y.
Is this enough to predict the 101st y corresponding to a given x value? Can I use some part of these 100 samples to update some values (considering that noise exists and some data might be corrupt)?
Stack Overflow is directed at coding, so if you have code that you expect to work and it doesn't, you should post it with your question.
A Kalman filter can help in the problem you describe if you have a model for the dependence of y on x. So, for example, if your model is that:
y = a * x + b + Gaussian noise, then the Kalman filter is one way to estimate 'a' and 'b', which then allows you to predict the 101st y from the 101st x.
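As a rough sketch of that idea, the Python below treats [a, b] as a constant state and applies the Kalman measurement update once per sample (with no process noise this reduces to recursive least squares). Everything concrete here is an assumption made for the demo: the function name kalman_line_fit, the noise variance r, the large initial covariance p0, and the synthetic data generated from a = 2, b = 1.

```python
import random

def kalman_line_fit(xs, ys, r=0.25, p0=1e6):
    """Estimate [a, b] in y = a*x + b + noise with a static-state Kalman filter."""
    theta = [0.0, 0.0]                      # state estimate [a, b]
    P = [[p0, 0.0], [0.0, p0]]              # state covariance (vague prior)
    for x, y in zip(xs, ys):
        H = [x, 1.0]                        # measurement row: y = H . theta + noise
        PHt = [P[0][0] * H[0] + P[0][1] * H[1],
               P[1][0] * H[0] + P[1][1] * H[1]]
        S = H[0] * PHt[0] + H[1] * PHt[1] + r        # innovation variance
        K = [PHt[0] / S, PHt[1] / S]                 # Kalman gain
        innov = y - (H[0] * theta[0] + H[1] * theta[1])
        theta = [theta[0] + K[0] * innov, theta[1] + K[1] * innov]
        # P <- P - K (H P); P is symmetric, so H P equals PHt transposed.
        P = [[P[i][j] - K[i] * PHt[j] for j in range(2)] for i in range(2)]
    return theta, P

# Synthetic data: 100 noisy samples from a hypothetical line y = 2x + 1.
random.seed(0)
xs = [0.1 * i for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

(a_hat, b_hat), P_final = kalman_line_fit(xs, ys)
x_101 = 10.0
y_101 = a_hat * x_101 + b_hat               # prediction for the 101st sample
print(a_hat, b_hat, y_101)
```

The diagonal of the final covariance gives a sense of how uncertain the estimates of a and b still are. Corrupt samples are not handled here; one option would be to skip updates whose innovation is implausibly large relative to S.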

How to find a function that fits a given set of data points in Julia?

So, I have a vector that corresponds to a given feature (same dimensionality). Is there a package in Julia that would provide a mathematical function that fits these data points, in relation to the original feature? In other words, I have x and y (both vectors) and need to find a decent mapping between the two, even if it's a highly complex one. The output of this process should be a symbolic formula that connects x and y, e.g. (:x)^3 + log(:x) - 4.2454. It's fine if it's just a polynomial approximation.
I imagine this is a walk in the park if you employ Genetic Programming, but I'd rather opt for a simpler (and faster) approach, if it's available. Thanks
Turns out the Polynomials.jl package includes the function polyfit which does Lagrange interpolation. A usage example would go:
using Polynomials # install with Pkg.add("Polynomials")
x = [1,2,3] # demo x
y = [10,12,4] # demo y
polyfit(x,y)
The last line returns:
Poly(-2.0 + 17.0x - 5.0x^2)
which evaluates to the correct values.
The polyfit function accepts a maximal degree for the output polynomial, but defaults to the length of the input vectors x and y minus 1. This is the same degree as the polynomial from the Lagrange formula. Two polynomials of at most that degree that agree on all the inputs must be identical (this is a basic theorem), so we can be certain this is the same Lagrange polynomial, and in fact the only polynomial of such a degree with this property.
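To check the demo result independently of the package, the Lagrange formula can be evaluated by hand. The Python sketch below (the function name lagrange_coefficients is made up for this illustration) builds the coefficients with exact fractions for the same x and y.

```python
from fractions import Fraction

def lagrange_coefficients(xs, ys):
    """Coefficients [c0, c1, ...] (lowest degree first) of the unique polynomial
    of degree len(xs) - 1 passing through the points (xs[i], ys[i])."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        # i-th basis polynomial: prod over j != i of (x - xs[j]) / (xs[i] - xs[j]).
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j in range(n):
            if j == i:
                continue
            # Multiply the current basis polynomial by (x - xs[j]).
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):
                new[k] -= c * xs[j]
                new[k + 1] += c
            basis = new
            denom *= xs[i] - xs[j]
        for k, c in enumerate(basis):
            coeffs[k] += ys[i] * c / denom
    return coeffs

coeffs = lagrange_coefficients([1, 2, 3], [10, 12, 4])
print([float(c) for c in coeffs])  # [-2.0, 17.0, -5.0]
```

The coefficients match -2.0 + 17.0x - 5.0x^2 from the output above.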
Thanks to the developers of Polynomials.jl for leaving me just to google my way to an answer.
Take a look at MARS regression (multivariate adaptive regression splines).

Best technique to approximate a 32-bit function using machine learning?

I was wondering which is the best machine learning technique to approximate a function that takes a 32-bit number and returns another 32-bit number, from a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth taking a look at, though you'll need to scale the inputs to floating-point numbers between 0 and 1, and then map the outputs back to the original range.
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with least-squares method
In that case, you are approximating a hypothesis y = ax + b with the least squares method. This one is really easy to implement, but sometimes, a linear model is not good enough to fit your data. But - I would give this one a try first.
Good thing is that there is a closed form, so you can directly calculate parameters a and b from your data.
See Least Squares
2.) Fitting a non-linear model
Once you have seen that your linear model does not describe your function very well, you can try to fit higher-degree polynomial models to your data.
Your hypothesis then might look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can also use the least squares method to fit these models, or iterative optimization techniques (gradient descent, simulated annealing, ...). See also this thread: Fitting polynomials to data
Or, as in the other answer, try fitting a Neural Network - the good thing is that it will automatically learn the hypothesis, but it is not so easy to explain what the relation between input and output is. But in the end, a neural network is also a linear combination of nonlinear functions (like sigmoid or tanh functions).
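For option 1.), the closed form for y = ax + b can be sketched in a few lines of Python. The function name fit_line and the four data points are made up for illustration; the formulas are the usual ones, a = Sxy / Sxx and b = mean(y) - a * mean(x).

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)                    # spread of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation of x and y
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Illustrative points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

For 32-bit integer inputs it would probably be wise to convert to floats and rescale first, since the sums of squares get very large.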