feature scaling and intercept

feature scaling and intercept - octave

it is a beginner question but I have a dataset of two features house sizes and number of bedrooms, So I'm working on Octave; So basically I try to do a feature scaling but as in the Design Matrix I added A column of ones (for tetha0) So I try to do a mean normalisation : (x- mean(x)/std(x)
but in the columns of one obviously the mean is 1 since there is only 1 in every line So when I do this the intercept colum is set to 0 :
mu = mean(X)
mu =
1.0000 2000.6809 3.1702
votric = X - mu
votric =
0.00000 103.31915 -0.17021
0.00000 -400.68085 -0.17021
0.00000 399.31915 -0.17021
0.00000 -584.68085 -1.17021
0.00000 999.31915 0.82979
So the first column shouldn't be left out of the mean normalization ?

Yes, you're supposed to normalise the original dataset over all observations first, and only then add the bias term (i.e. the 'column of ones').
The point of normalisation is to allow the various features to be compared on an equal basis, which speeds up optimization algorithms significantly.
The bias (i.e. column of ones) is technically not part of the features. It is just a mathematical convenience to allow us to use a single matrix multiplication to obtain our result in a computationally and notationally efficient manner.
In other words, instead of saying Y = bias + weight1 * X1 + weight2 * X2 etc, you create an imaginary X0 = 1, and denote the bias as weight0, which then allows you to express it in a vectorised fashion as follows: Y = weights * X
"Normalising" the bias term does not make sense, because clearly that would make X0 = 0, the effect of which would be that you would then discard the effect of the bias term altogether. So yes, normalise first, and only then add 'ones' to the normalised features.
PS. I'm going on a limb here and guessing that you're coming from Andrew Ng's machine learning course on coursera. You will see in ex1_multi.m that this is indeed what he's doing in his code (line 52).
% Scale features and set them to zero mean
fprintf('Normalizing Features ...\n');
[X mu sigma] = featureNormalize(X);
% Add intercept term to X
X = [ones(m, 1) X];

Related

Questions about convolution (in CNN)

I suddenly came up with a question about convolution and just wanted to be clear if I'm missing something. The question is whether if the two operations below are identical.
Case1)
Suppose we have a feature map C^2 x H x W. And, we have a K x K x C^2 Conv weight with stride S. (To be clear, C^2 is the channel dimension but just wanted to make it as a square number, K is the kernel size).
Case2)
Suppose we have a feature map 1 x CH x CW. And, we have a CK x CK x 1 Conv weight with stride CS.
So, basically Case2 is a pixel-upshuffled version of case1 (both feature-map and Conv weight.) As convolutions are simply element-wise multiplication, both operations seem identical to me.
# given a feature map and a conv_weight, namely f_map, conv_weight
#case1)
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=1)
#case2)
f_map = pixelshuffle(f_map, scale=C)
conv_weight = pixelshuffle(f_map, scale=C)
result = convLayer(f_map, stride=C)
But this means that, (for example) given a 256xHxW feature-map with a 3x3 Conv (as in many deep learning models), performing a convolution was simply doing a HUUUGE 48x48 Conv to a 1 x 16*H x 16*W Feature map.
But this doesn't meet my basic intuition of CNNs, stacking multiple of layers with the smallest 3x3 Conv, resulting in somewhat large receptive field, and each channel having different (possibly redundant) information.

You can, in a sense, think of "folding" spatial information into the channel dimension. This is the rationale behind ResNet's trade-off between spatial resolution and feature dimension. In the ResNet case whenever they sample x2 in space they increase feature space x2. However, since you have two spatial dimensions and you sample x2 in both you effectively reduce the "volume" of the feature map by x0.5.

Octave -inf and NaN

I searched the forum and found this thread, but it does not cover my question
Two ways around -inf
From a Machine Learning class, week 3, I am getting -inf when using log(0), which later turns into an NaN. The NaN results in no answer being given in a sum formula, so no scalar for J (a cost function which is the result of matrix math).
Here is a test of my function
>> sigmoid([-100;0;100])
ans =
3.7201e-44
5.0000e-01
1.0000e+00
This is as expected. but the hypothesis requires ans = 1-sigmoid
>> 1-ans
ans =
1.00000
0.50000
0.00000
and the Log(0) gives -Inf
>> log(ans)
ans =
0.00000
-0.69315
-Inf
-Inf rows do not add to the cost function, but the -Inf carries through to NaN, and I do not get a result. I cannot find any material on -Inf, but am thinking there is a problem with my sigmoid function.
Can you provide any direction?

The typical way to avoid infinity in these cases is to add eps to the operand:
log(ans + eps)
eps is a very, very small value, and won't affect the output for values of ans unless ans is zero:
>> z = [-100;0;100];
>> g = 1 ./ (1+exp(-z));
>> log(1-g + eps)
ans =
0.0000
-0.6931
-36.0437

Adding to the answers here, I really do hope you would provide some more context to your question (in particular, what are you actually trying to do.
I will go out on a limb and guess the context, just in case this is useful. You are probably doing machine learning, and trying to define a cost function based on the negative log likelihood of a model, and then trying to differentiate it to find the point where this cost is at its minimum.
In general for a reasonable model with a useful likelihood that adheres to Cromwell's rule, you shouldn't have these problems, but, in practice it happens. And presumably in the process of trying to calculate a negative log likelihood of a zero probability you get inf, and trying to calculate a differential between two points produces inf / inf = nan.
In this case, this is an 'edge case', and generally in computer science edge cases need to be spotted as exceptional circumstances and dealt with appropriately. The reality is that you can reasonably expect that inf isn't going to be your function's minimum! Therefore, whether you remove it from the calculations, or replace it by a very large number (whether arbitrarily or via machine precision) doesn't really make a difference.
So in practice you can do either of the two things suggested by others here, or even just detect such instances and skip them from the calculation. The practical result should be the same.

-inf means negative infinity. Which is the correct answer because log of (0) is minus infinity by definition.
The easiest thing to do is to check your intermediate results and if the number is below some threshold (like 1e-12) then just set it to that threshold. The answers won't be perfect but they will still be pretty close.
Using the following as the sigmoid function:
function g = sigmoid(z)
g = 1 ./ (1 + e.^-z);
end
Then the following code runs with no issues. Choose the threshold value in the 'max' statement to be less than the expected noise in your measurements and then you're good to go
>> a = sigmoid([-100, 0, 100])
a =
3.7201e-44 5.0000e-01 1.0000e+00
>> b = 1-a
b =
1.00000 0.50000 0.00000
>> c = max(b, 1e-12)
c =
1.0000e+00 5.0000e-01 1.0000e-12
>> d = log(c)
d =
0.00000 -0.69315 -27.63102

Pollard’s p−1 algorithm: understanding of Berkeley paper

This paper explains about Pollard's p-1 factorization algorithm. I am having trouble understanding the case when factor found is equal to the input we go back and change 'a' (basically page 2 point 2 in the aforementioned paper).
Why we go back and increment 'a'?
Why we not go ahead and keep incrementing the factorial? It it because we keep going into the same cycle we have already seen?
Can I get all the factors using this same algorithm? Such as 49000 = 2^3 * 5^3 * 7^2. Currently I only get 7 and 7000. Perhaps I can use this get_factor() function recursively but I am wondering about the base cases.
def gcd(a, b):
if not b:
return a
return gcd(b, a%b)
def get_factor(input):
a = 2
for factorial in range(2, input-1):
'''we are not calculating factorial as anyway we need to find
out the gcd with n so we do mod n and we also use previously
calculate factorial'''
a = a**factorial % input
factor = gcd(a - 1, input)
if factor == 1:
continue
elif factor == input:
a += 1
elif factor > 1:
return factor
n = 10001077
p = get_factor(n)
q = n/p
print("factors of", n, "are", p, "and", q)

The linked paper is not a particularly good description of Pollard's p − 1 algorithm; most descriptions discuss smoothness bounds that make the algorithm much more practical. You might like to read this page at Prime Wiki. To answer your specific questions:
Why increment a? Because the original a doesn't work. In practice, most implementations don't bother; instead, a different factoring method, such as the elliptic curve method, is tried instead.
Why not increment the factorial? This is where the smoothness bound comes into play. Read the page at Mersenne Wiki for more details.
Can I get all factors? This question doesn't apply to the paper you linked, which assumes that the number being factored is a semi-prime with exactly two factors. The more general answer is "maybe." This is what happens at Step 3a of the linked paper, and choosing a new a may work (or may not). Or you may want to move to a different factoring algorithm.
Here is my simple version of the p − 1 algorithm, using x instead of a. The while loop computes the magical L of the linked paper (it's the least common multiple of the integers less than the smoothness bound b), which is the same calculation as the factorial of the linked paper, but done in a different way.
def pminus1(n, b, x=2):
q = 0; pgen = primegen(); p = next(pgen)
while p < b:
x = pow(x, p**ilog(p,b), n)
q, p = p, next(pgen)
g = gcd(x-1, n)
if 1 < g < n: return g
return False
You can see it in action at http://ideone.com/eMPHtQ, where it factors 10001 as in the linked paper as well as finding a rather spectacular 36-digit factor of fibonacci(522). Once you master that algorithm, you might like to move on to the two-stage version of the algorithm.

Multivariate Bisection Method

I need an algorithm to perform a 2D bisection method for solving a 2x2 non-linear problem. Example: two equations f(x,y)=0 and g(x,y)=0 which I want to solve simultaneously. I am very familiar with the 1D bisection ( as well as other numerical methods ). Assume I already know the solution lies between the bounds x1 < x < x2 and y1 < y < y2.
In a grid the starting bounds are:
^
| C D
y2 -+ o-------o
| | |
| | |
| | |
y1 -+ o-------o
| A B
o--+------+---->
x1 x2
and I know the values f(A), f(B), f(C) and f(D) as well as g(A), g(B), g(C) and g(D). To start the bisection I guess we need to divide the points out along the edges as well as the middle.
^
| C F D
y2 -+ o---o---o
| | |
|G o o M o H
| | |
y1 -+ o---o---o
| A E B
o--+------+---->
x1 x2
Now considering the possibilities of combinations such as checking if f(G)*f(M)<0 AND g(G)*g(M)<0 seems overwhelming. Maybe I am making this a little too complicated, but I think there should be a multidimensional version of the Bisection, just as Newton-Raphson can be easily be multidimed using gradient operators.
Any clues, comments, or links are welcomed.

Sorry, while bisection works in 1-d, it fails in higher dimensions. You simply cannot break a 2-d region into subregions using only information about the function at the corners of the region and a point in the interior. In the words of Mick Jagger, "You can't always get what you want".

I just stumbled upon the answer to this from geometrictools.com and C++ code.
edit: the code is now on github.

I would split the area along a single dimension only, alternating dimensions. The condition you have for existence of zero of a single function would be "you have two points of different sign on the boundary of the region", so I'd just check that fro the two functions. However, I don't think it would work well, since zeros of both functions in a particular region don't guarantee a common zero (this might even exist in a different region that doesn't meet the criterion).
For example, look at this image:
There is no way you can distinguish the squares ABED and EFIH given only f() and g()'s behaviour on their boundary. However, ABED doesn't contain a common zero and EFIH does.
This would be similar to region queries using eg. kD-trees, if you could positively identify that a region doesn't contain zero of eg. f. Still, this can be slow under some circumstances.

If you can assume (per your comment to woodchips) that f(x,y)=0 defines a continuous monotone function y=f2(x), i.e. for each x1<=x<=x2 there is a unique solution for y (you just can't express it analytically due to the messy form of f), and similarly y=g2(x) is a continuous monotone function, then there is a way to find the joint solution.
If you could calculate f2 and g2, then you could use a 1-d bisection method on [x1,x2] to solve f2(x)-g2(x)=0. And you can do that by using 1-d bisection on [y1,y2] again for solving f(x,y)=0 for y for any given fixed x that you need to consider (x1, x2, (x1+x2)/2, etc) - that's where the continuous monotonicity is helpful -and similarly for g. You have to make sure to update x1-x2 and y1-y2 after each step.
This approach might not be efficient, but should work. Of course, lots of two-variable functions don't intersect the z-plane as continuous monotone functions.

I'm not much experient on optimization, but I built a solution to this problem with a bisection algorithm like the question describes. I think is necessary to fix a bug in my solution because it compute tow times a root in some cases, but i think it's simple and will try it later.
EDIT: I seem the comment of jpalecek, and now I anderstand that some premises I assumed are wrong, but the methods still works on most cases. More especificaly, the zero is garanteed only if the two functions variate the signals at oposite direction, but is need to handle the cases of zero at the vertices. I think is possible to build a justificated and satisfatory heuristic to that, but it is a little complicated and now I consider more promising get the function given by f_abs = abs(f, g) and build a heuristic to find the local minimuns, looking to the gradient direction on the points of the middle of edges.
Introduction
Consider the configuration in the question:
^
| C D
y2 -+ o-------o
| | |
| | |
| | |
y1 -+ o-------o
| A B
o--+------+---->
x1 x2
There are many ways to do that, but I chose to use only the corner points (A, B, C, D) and not middle or center points liky the question sugests. Assume I have tow function f(x,y) and g(x,y) as you describe. In truth it's generaly a function (x,y) -> (f(x,y), g(x,y)).
The steps are the following, and there is a resume (with a Python code) at the end.
Step by step explanation
Calculate the product each scalar function (f and g) by them self at adjacent points. Compute the minimum product for each one for each direction of variation (axis, x and y).
Fx = min(f(C)*f(B), f(D)*f(A))
Fy = min(f(A)*f(B), f(D)*f(C))
Gx = min(g(C)*g(B), g(D)*g(A))
Gy = min(g(A)*g(B), g(D)*g(C))
It looks to the product through tow oposite sides of the rectangle and computes the minimum of them, whats represents the existence of a changing of signal if its negative. It's a bit of redundance but work's well. Alternativaly you can try other configuration like use the points (E, F, G and H show in the question), but I think make sense to use the corner points because it consider better the whole area of the rectangle, but it is only a impression.
Compute the minimum of the tow axis for each function.
F = min(Fx, Fy)
G = min(Gx, Gy)
It of this values represents the existence of a zero for each function, f and g, within the rectangle.
Compute the maximum of them:
max(F, G)
If max(F, G) < 0, then there is a root inside the rectangle. Additionaly, if f(C) = 0 and g(C) = 0, there is a root too and we do the same, but if the root is in other corner we ignore him, because other rectangle will compute it (I want to avoid double computation of roots). The statement bellow resumes:
guaranteed_contain_zeros = max(F, G) < 0 or (f(C) == 0 and g(C) == 0)
In this case we have to proceed breaking the region recursively ultil the rectangles are as small as we want.
Else, may still exist a root inside the rectangle. Because of that, we have to use some criterion to break this regions ultil the we have a minimum granularity. The criterion I used is to assert the largest dimension of the current rectangle is smaller than the smallest dimension of the original rectangle (delta in the code sample bellow).
Resume
This Python code resume:
def balance_points(x_min, x_max, y_min, y_max, delta, eps=2e-32):
width = x_max - x_min
height = y_max - y_min
x_middle = (x_min + x_max)/2
y_middle = (y_min + y_max)/2
Fx = min(f(C)*f(B), f(D)*f(A))
Fy = min(f(A)*f(B), f(D)*f(C))
Gx = min(g(C)*g(B), g(D)*g(A))
Gy = min(g(A)*g(B), g(D)*g(C))
F = min(Fx, Fy)
G = min(Gx, Gy)
largest_dim = max(width, height)
guaranteed_contain_zeros = max(F, G) < 0 or (f(C) == 0 and g(C) == 0)
if guaranteed_contain_zeros and largest_dim <= eps:
return [(x_middle, y_middle)]
elif guaranteed_contain_zeros or largest_dim > delta:
if width >= height:
return balance_points(x_min, x_middle, y_min, y_max, delta) + balance_points(x_middle, x_max, y_min, y_max, delta)
else:
return balance_points(x_min, x_max, y_min, y_middle, delta) + balance_points(x_min, x_max, y_middle, y_max, delta)
else:
return []
Results
I have used a similar code similar in a personal project (GitHub here) and it draw the rectangles of the algorithm and the root (the system have a balance point at the origin):
Rectangles
It works well.
Improvements
In some cases the algorithm compute tow times the same zero. I thinh it can have tow reasons:
I the case the functions gives exatly zero at neighbour rectangles (because of an numerical truncation). In this case the remedy is to incrise eps (increase the rectangles). I chose eps=2e-32, because 32 bits is a half of the precision (on 64 bits archtecture), then is problable that the function don't gives a zero... but it was more like a guess, I don't now if is the better. But, if we decrease much the eps, it extrapolates the recursion limit of Python interpreter.
The case in witch the f(A), f(B), etc, are near to zero and the product is truncated to zero. I think it can be reduced if we use the product of the signals of f and g in place of the product of the functions.
I think is possible improve the criterion to discard a rectangle. It can be made considering how much the functions are variating in the region of the rectangle and how distante the function is of zero. Perhaps a simple relation between the average and variance of the function values on the corners. In another way (and more complicated) we can use a stack to store the values on each recursion instance and garantee that this values are convergent to stop recursion.

This is a similar problem to finding critical points in vector fields (see http://alglobus.net/NASAwork/topology/Papers/alsVugraphs93.ps).
If you have the values of f(x,y) and g(x,y) at the vertexes of your quadrilateral and you are in a discrete problem (such that you don't have an analytical expression for f(x,y) and g(x,y) nor the values at other locations inside the quadrilateral), then you can use bilinear interpolation to get two equations (for f and g). For the 2D case the analytical solution will be a quadratic equation which, according to the solution (1 root, 2 real roots, 2 imaginary roots) you may have 1 solution, 2 solutions, no solutions, solutions inside or outside your quadrilateral.
If instead you have analytic functions of f(x,y) and g(x,y) and want to use them, this is not useful. Instead you could divide your quadrilateral recursively, however as it was already pointed out by jpalecek (2nd post), you would need a way to stop your divisions by figuring out a test that would assure you would have no zeros inside a quadrilateral.

Let f_1(x,y), f_2(x,y) be two functions which are continuous and monotonic with respect to x and y. The problem is to solve the system f_1(x,y) = 0, f_2(x,y) = 0.
The alternating-direction algorithm is illustrated below. Here, the lines depict sets {f_1 = 0} and {f_2 = 0}. It is easy to see that the direction of movement of the algorithm (right-down or left-up) depends on the order of solving the equations f_i(x,y) = 0 (e.g., solve f_1(x,y) = 0 w.r.t. x then solve f_2(x,y) = 0 w.r.t. y OR first solve f_1(x,y) = 0 w.r.t. y and then solve f_2(x,y) = 0 w.r.t. x).
Given the initial guess, we don't know where the root is. So, in order to find all roots of the system, we have to move in both directions.

Normal vector from least squares-derived plane

I have a set of points and I can derive a least squares solution in the form:
z = Ax + By + C
The coefficients I compute are correct, but how would I get the vector normal to the plane in an equation of this form? Simply using A, B and C coefficients from this equation don't seem correct as a normal vector using my test dataset.

Following on from dmckee's answer:
a x b = (a2b3 − a3b2), (a3b1 − a1b3), (a1b2 − a2b1)
In your case a1=1, a2=0 a3=A b1=0 b2=1 b3=B
so = (-A), (-B), (1)

Form the two vectors
v1 = <1 0 A>
v2 = <0 1 B>
both of which lie in the plane and take the cross-product:
N = v1 x v2 = <-A, -B, +1> (or v2 x v1 = <A, B, -1> )
It works because the cross-product of two vectors is always perpendicular to both of the inputs. So using two (non-colinear) vectors in the plane gives you a normal.
NB: You probably want a normalized normal, of course, but I'll leave that as an exercise.

A little extra color on the dmckee answer. I'd comment directly, but I do not have enough SO rep yet. ;-(
The plane z = Ax + By + C only contains the points (1, 0, A) and (0, 1, B) when C=0. So, we would be talking about the plane z = Ax + By. Which is fine, of course, since this second plane is parallel to the original one, the unique vertical translation that contains the origin. The orthogonal vector we wish to compute is invariant under translations like this, so no harm done.
Granted, dmckee's phrasing is that his specified "vectors" lie in the plane, not the points, so he's arguably covered. But it strikes me as helpful to explicitly acknowledge the implied translations.
Boy, it's been a while for me on this stuff, too.
Pedantically yours... ;-)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

feature scaling and intercept - octave

Related

Questions about convolution (in CNN)

Octave -inf and NaN

Pollard’s p−1 algorithm: understanding of Berkeley paper

Multivariate Bisection Method

Normal vector from least squares-derived plane

Categories

Resources