Initialisation of weights for a deep learning model - deep-learning

I am going through a book on deep learning which initializes weights between two layers of neurons as:
w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
self.W.append(w / np.sqrt(layers[i]))
As per the book, the division by np.sqrt(layers[i]) in the second line of code is done for the following reason:
scale w by dividing by the square root of the number of nodes in the current layer, thereby
normalizing the variance of each neuron’s output
What does it mean exactly? And what would the impact be if we didn't do it?

Weight initialization is very important for tackling vanishing/exploding gradients. For the outputs (and, in the reverse direction, the gradients) to flow properly, the variance of the outputs of each layer should be equal to the variance of its inputs; likewise for the gradients flowing backwards. The numbers of input and output connections of a layer are called its fan-in and fan-out.
To better explain what I mean, let me give you an example. Assume that we have a hundred consecutive layers and we apply a feed-forward calculation with a linear activation (after all, it is just matrix multiplication), and the data is 500 samples of 100 features:
import numpy as np

neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features)) # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons))
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(-4.055498760574568e+95, 8.424477240271639e+98)
You will see that it has a huge mean and standard deviation. Let's break this problem down: matrix multiplication has the property that the standard deviation of the result is very close to the square root of the number of fan-in (input) connections. This property can be verified with this snippet of code:
fan_in = 1000 # change it to any number
X = np.random.normal(size=(100, fan_in))
W = np.random.normal(size=(fan_in, 1))
np.dot(X, W).std()
# result:
32.764359213560454
This happens because each output element is the sum of fan_in (1000 in the above case) products of one element of the input X with one element of a column of W; summing fan_in independent unit-variance terms gives a variance of fan_in, hence a standard deviation of sqrt(fan_in) ≈ 31.6, which matches the result above. Therefore, if we scale every weight by 1/sqrt(fan_in), we maintain the distribution of the flow, as seen in the following snippet:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features)) # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons), scale=np.sqrt(1 / neurons)) # scaled the weights with the fan-in
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(0.0002608301398189543, 1.021452570914829)
You can read more about kernel initialization in the following blog
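To make the rule concrete, here is a minimal sketch (not the book's exact code) of how such an initializer could be applied across a whole list of layer sizes, following the two lines from the question; the layer sizes here are made up:
import numpy as np

layers = [64, 32, 16, 3]  # hypothetical layer sizes

W = []
for i in range(len(layers) - 1):
    # +1 on each dimension for the bias term, as in the question's snippet
    w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
    W.append(w / np.sqrt(layers[i]))  # scale by 1/sqrt(fan-in) of the current layer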


How would you represent a 2D matrix as an input state and have it select the index of the row it thinks is the best action for that state?

I'm trying to build an RL model where the input is an NxM matrix, N being the number of selectable actions and M being the number of features describing each action.
In all the RL problems I've seen so far, the state is either a vector that is passed into a regular neural network, or an image that is passed through a convolutional neural network.
But say we have an environment where the objective is to learn to select the strongest worker for a fixed task, and a single state representation looked like this:
import pandas as pd

names = ['Bob','Henry','Mike','Phil']
max_squat = [300,400,200,100]
max_bench = [200,100,225,100]
max_deadlift = [600,400,300,225]
strongest_worker_df = pd.DataFrame({'Name':names,'Max_Squat':max_squat,'Max_Bench':max_bench,'Max_Deadlift':max_deadlift})
I want to pass in this 2D matrix (without the Name column, of course) as an input and have it return a row index, and then pass that row index as an action to the environment and get a reward. Then run a reinforcement learning algorithm on the gradient of the reward with respect to the action selection.
Any suggestions on how to go about this, specifically the state representation?
Well, as long as your matrix is of fixed size (N and M don't change), you could just vectorize it (concatenate the rows) and the network would work with that.
It is perhaps suboptimal to do this, though: given the problem setting, it seems preferable to pass each row through the same neural net to get per-row features, and then have a top-level discriminator that operates on the concatenated features.
An example model that would do this (in TensorFlow/Keras code) is:
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten
from tensorflow.keras.models import Model

# N: number of rows (selectable actions), M: number of features per row, as in the question
model_input = x = Input(shape=(N, M))
x = Dense(64, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
# The layers above this line define the feature generator; at this point
# your model has 16 features for every person, i.e. an Nx16 matrix.
# Each person's features have gone through the same nodes and have received
# the same transformations from them.
x = Flatten()(x)
# The Nx16 matrix is now flattened; the layers below define the discriminator,
# which has a softmax output of size N (the highest output identifies
# the selected index).
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(N, activation='softmax')(x)
model = Model(inputs=model_input, outputs=x)
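Note that Dense layers applied to a 3D input act on the last axis, so each row is indeed pushed through the same weights. To then turn the softmax output into a row index, one option (a sketch; it assumes the state matrix is the question's DataFrame without the Name column) is to take the argmax:
import numpy as np

state = strongest_worker_df.drop(columns=['Name']).to_numpy()  # shape (N, M)
probs = model.predict(state[None, ...])                        # add a batch dimension -> output shape (1, N)
action = int(np.argmax(probs[0]))                              # row index of the selected worker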

Second derivative using fft

All, I am trying to take the Laplacian of the following function:
g(x,y) = (1/2)*c*x^2 + (1/2)*d*y^2
The Laplacian is c + d, which is a constant. Using the FFT I should get the same (in my FFT example I am padding the function to avoid edge effects).
Here is my code:
import math
import numpy as np

# Define a 2D function
n = 30 # number of points
Lx = 30 # extension in x
Ly = 30 # extension in y
dx = n/Lx # step in x
dy = n/Ly # step in y
c = 4
d = 4
x = np.arange(-Lx/2,Lx/2)
y = np.arange(-Ly/2,Ly/2)
g = np.zeros((Lx,Ly))
lapg = np.zeros((Lx,Ly))
for j in range(Ly):
    for i in range(Lx):
        g[i,j] = (1/2)*c*x[i]**2 + (1/2)*d*y[j]**2
        lapg[i,j] = c + d
kxpad = 2*np.pi*np.fft.fftfreq(2*Lx,d=dx)
#kxpad = (2*np.pi/(2*Lx))*np.arange(-2*Lx/2,2*Lx/2)
#kxpad = np.fft.fftshift(kxpad)
#kypad = (2*np.pi/(2*Ly))*np.arange(-2*Ly/2,2*Ly/2)
#kypad = np.fft.fftshift(kypad)
kypad = 2*np.pi*np.fft.fftfreq(2*Ly,d=dy)
kpad = np.zeros((2*Lx,2*Ly))
for j in range(2*Ly):
    for i in range(2*Lx):
        kpad[i,j] = math.sqrt(kxpad[i]**2+kypad[j]**2)
kpad = np.fft.fftshift(kpad)
gpad = np.zeros((2*Lx,2*Ly))
gpad[:Lx,:Ly] = g # fill the main (top-left) block of gpad with g
gpad[:Lx,Ly:] = g[:,-1::-1] # fill the right block with g mirrored in columns
gpad[Lx:,:Ly] = g[-1::-1,:] # fill the bottom block with g mirrored in rows
gpad[Lx:,Ly:] = g[-1::-1, -1::-1] # fill the bottom-right block with g mirrored in both rows and columns
rdFFT2D = np.zeros((Lx,Ly))
gpadhat = np.fft.fft2(gpad)
dgpadhat = -(kpad**2)*gpadhat #taking the derivative iwFFT(f)
rdpadFFT2D = np.real(np.fft.ifft2(dgpadhat))
rdFFT2D = rdpadFFT2D[:Lx,:Ly]
The first image is a plot of the original function g(x,y), the second is the analytical Laplacian of g, and the third is the Sugarloaf in Rio de Janeiro (lol); actually, it is the Laplacian computed using the FFT. What am I doing wrong here?
Edit: commenting on the ripple effect.
Cris, do you mean the ripple effect due to the set_zlim in the image below? Just a reminder that the result should be 8.
Edit 2: using non-symmetrical x and y values produces the two images.
The padding will not change the boundary condition: You are padding by replicating the function, mirrored, four times. The function is symmetric, so the mirroring doesn't change it. Thus, your padding simply repeats the function four times. The convolution through the DFT (which you're attempting to implement) uses a periodic boundary condition, and thus already sees the input function as periodic. Replicating the function will not improve the convolution results at the edges.
To improve the result at the edges, you would need to implement a different boundary condition, the most effective one (since the input is analytical anyway) is to simply extend your domain and then crop it after applying the convolution. This introduces a boundary extension where the image is padded by seeing more data outside the original domain. It is an ideal boundary extension suitable for an ideal case where we don't have to deal with real-world data.
This implements the Laplacian through the DFT with greatly simplified code, where we ignore any boundary extension as well as the sample spacing (basically setting dx=1 and dy=1):
import numpy as np
import matplotlib.pyplot as pp
n = 30 # number of points
c = 4
d = 4
x = np.arange(-n//2,n//2)
y = np.arange(-n//2,n//2)
g = (1/2)*c*x[None,:]**2 + (1/2)*d*y[:,None]**2
kx = 2 * np.pi * np.fft.fftfreq(n)
ky = 2 * np.pi * np.fft.fftfreq(n)
lapg = np.real(np.fft.ifft2(np.fft.fft2(g) * (-kx[None, :]**2 - ky[:, None]**2)))
fig = pp.figure()
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(x[None,:], y[:,None], g)
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(x[None,:], y[:,None], lapg)
pp.show()
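The code above sets dx = dy = 1. If you do want to account for a physical sample spacing, np.fft.fftfreq accepts it through its d argument; here is a sketch with made-up spacings (same idea, no boundary extension):
import numpy as np
n, dx, dy, c, d = 30, 0.5, 0.5, 4, 4
x = dx * np.arange(-n//2, n//2)  # physical coordinates
y = dy * np.arange(-n//2, n//2)
g = (1/2)*c*x[None,:]**2 + (1/2)*d*y[:,None]**2
# passing d=dx / d=dy gives angular frequencies in physical units,
# so -(kx^2 + ky^2) matches the analytical Laplacian
kx = 2 * np.pi * np.fft.fftfreq(n, d=dx)
ky = 2 * np.pi * np.fft.fftfreq(n, d=dy)
lapg = np.real(np.fft.ifft2(np.fft.fft2(g) * (-kx[None, :]**2 - ky[:, None]**2)))
print(lapg[n//2, n//2])  # centre value; compare with the analytical Laplacian c + d (edge effects as discussed above)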
Edit: Boundary extension would work as follows:
import numpy as np
import matplotlib.pyplot as pp
n_true = 30 # number of pixels we want to compute
n_boundary = 15 # number of pixels to extend the image in all directions
c = 4
d = 4
# First compute g and lapg including the boundary extension
n = n_true + n_boundary * 2
x = np.arange(-n//2,n//2)
y = np.arange(-n//2,n//2)
g = (1/2)*c*x[None,:]**2 + (1/2)*d*y[:,None]**2
kx = 2 * np.pi * np.fft.fftfreq(n)
ky = 2 * np.pi * np.fft.fftfreq(n)
lapg = np.real(np.fft.ifft2(np.fft.fft2(g) * (-kx[None, :]**2 - ky[:, None]**2)))
# Now crop the two images to our desired size
x = x[n_boundary:-n_boundary]
y = y[n_boundary:-n_boundary]
g = g[n_boundary:-n_boundary, n_boundary:-n_boundary]
lapg = lapg[n_boundary:-n_boundary, n_boundary:-n_boundary]
# Display
fig = pp.figure()
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(x[None,:], y[:,None], g)
ax.set_zlim(0, 800)
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(x[None,:], y[:,None], lapg)
ax.set_zlim(0, 800)
pp.show()
Note that I'm scaling the z-axes of the two plots in the same way to not enhance the effects of the boundary too much. Fourier-domain filtering like this is typically much more sensitive to edge effects than spatial-domain (or temporal-domain) filtering because the filter has an infinitely-long impulse response. If you leave out the set_zlim command, you'll see a ripple effect in the otherwise flat lapg image. The ripples are very small, but no matter how small, they'll look huge on a completely flat function because they'll stretch from the bottom to the top of the plot. The equal set_zlim in the two plots just puts this noise in proportion.

Understanding log_prob for Normal distribution in pytorch

I'm currently trying to solve Pendulum-v0 from the OpenAI Gym environment, which has a continuous action space. As a result, I need to use a normal distribution to sample my actions. What I don't understand is the dimension of the log_prob when using it:
import torch
from torch.distributions import Normal
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
dist = Normal(means, stds)
a = torch.tensor([1.2,3.4])
d = dist.log_prob(a)
print(d.size())
I was expecting a tensor of size 2 (one log_prob for each action), but it outputs a tensor of size (2,2).
However, when using a Categorical distribution for a discrete environment, the log_prob has the expected size:
from torch.distributions import Categorical

logits = torch.tensor([[-0.0657, -0.0949],
[-0.0586, -0.1007]])
dist = Categorical(logits = logits)
a = torch.tensor([1, 1])
print(dist.log_prob(a).size())
gives me a tensor of size (2).
Why is the log_prob for the Normal distribution of a different size?
If one takes a look in the source code of torch.distributions.Normal and finds the definition of the log_prob(value) function, one can see that the main part of the calculation is:
return -((value - self.loc) ** 2) / (2 * var) - some other part
where value is a variable containing the values for which you want to calculate the log probability (in your case, a), self.loc is the mean of the distribution (in your case, means) and var is the variance, that is, the square of the standard deviation (in your case, stds**2). One can see that this is indeed the logarithm of the probability density function of the normal distribution, minus some constants and the logarithm of the standard deviation that I have not written out above.
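As a quick sanity check of that formula (here the constant term is written out explicitly; the mean, std and value are taken from the question):
import math
import torch
from torch.distributions import Normal

mean = torch.tensor(0.0538)
std = torch.tensor(0.7865)
value = torch.tensor(1.2)

# log N(value | mean, std) = -(value - mean)^2 / (2 * std^2) - log(std) - log(sqrt(2*pi))
manual = -((value - mean) ** 2) / (2 * std ** 2) - torch.log(std) - math.log(math.sqrt(2 * math.pi))
print(torch.isclose(Normal(mean, std).log_prob(value), manual))  # tensor(True)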
In the first example, you define means and stds as column vectors, while the values form a row vector:
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
a = torch.tensor([1.2,3.4])
But subtracting a row vector from a column vector, which is what the code does in value - self.loc, broadcasts to a matrix (try it!); thus the result you obtain is a log_prob value for each of your two defined distributions and for each of the values in a.
If you want to obtain a log_prob without the cross terms, then define the variables consistently, i.e., either
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
a = torch.tensor([[1.2],[3.4]])
or
means = torch.tensor([0.0538,
0.0651])
stds = torch.tensor([0.7865,
0.7792])
a = torch.tensor([1.2,3.4])
This is what you do in your second example, which is why you obtain the result you expected.
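As a quick check of the shapes involved (shapes only; the numbers are the ones from the question):
import torch
from torch.distributions import Normal

means_col = torch.tensor([[0.0538], [0.0651]])  # shape (2, 1)
stds_col = torch.tensor([[0.7865], [0.7792]])   # shape (2, 1)
a_row = torch.tensor([1.2, 3.4])                # shape (2,)
a_col = torch.tensor([[1.2], [3.4]])            # shape (2, 1)

print(Normal(means_col, stds_col).log_prob(a_row).size())  # torch.Size([2, 2]) -- broadcasting gives the cross terms
print(Normal(means_col, stds_col).log_prob(a_col).size())  # torch.Size([2, 1]) -- consistent column vectors
means_flat = torch.tensor([0.0538, 0.0651])     # shape (2,)
stds_flat = torch.tensor([0.7865, 0.7792])
print(Normal(means_flat, stds_flat).log_prob(a_row).size())  # torch.Size([2]) -- consistent 1-D tensors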

Understanding texture's linear filtering in cuda

According to the CUDA programming guide, the value returned by the texture fetch is
tex(x) = (1-a)T[i] + aT[i+1] for a one-dimensional texture
where i = floor(Xb), a = frac(Xb), Xb=x-0.5
Suppose we have a one-dimensional texture that has only two texels. For example:
T[0] = 0.2, T[1] = 1.5
Say we want to fetch the texel at x = 1, which I think should return T[1], which is 1.5.
However, if you follow the rule given in the programming guide, the return value will be:
Xb = 1 - 0.5 = 0.5
a = 0.5
i = 0
return value = 0.5*T[0] + 0.5*T[1] = 0.85
which does not make any sense to me. Can someone explain why the linear filtering is done this way by CUDA? Thanks
The linear filtering algorithm in CUDA assumes texel values are located at the centroid of the interpolation volume (so voxel centered, if you like). In your 1D filtering example, the input data is implicitly taken as
T[0]=(0.5, 0.2) T[1]=(1.5, 1.5)
So your example is asking for Tex(1), which is the midpoint between the two texel values, i.e.
0.5*0.2 + 0.5*1.5 = 0.85
To get T[1] returned you would need Tex(1.5), and that is the general rule: add 0.5 to the coordinates if you want to treat the texture data as being at the voxel vertices rather than at the voxel centers.
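This is not CUDA code, but a small Python sketch of the fetch rule quoted above makes the arithmetic easy to check (clamping the upper index at the border is an assumption for this two-texel example):
import math

def tex1d_linear(T, x):
    # tex(x) = (1 - a) * T[i] + a * T[i+1], with xB = x - 0.5, i = floor(xB), a = frac(xB)
    xb = x - 0.5
    i = math.floor(xb)
    a = xb - i
    i1 = min(i + 1, len(T) - 1)  # clamp at the border
    return (1 - a) * T[i] + a * T[i1]

T = [0.2, 1.5]
print(tex1d_linear(T, 1.0))  # 0.85 -- midway between the two texel centres
print(tex1d_linear(T, 1.5))  # 1.5  -- exactly at the centre of T[1]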

Correct solution for this tensor

I'm implementing the system in this paper and I've come a little unstuck correctly implementing the radial tensor field.
All tensors in this system are of the form given on page 3, section 4
R [ cos(2t), sin(2t); sin(2t), -cos(2t) ]
The radial tensor field is defined as:
R [ yy - xx, -2xy; -2xy, -(yy-xx) ]
In my system I'm only storing R and Theta, since I can calculate the tensor based off just that information. This means I need to calculate R and Theta for the radial tensor. Unfortunately, my attempts at this have failed. Although it looks correct, my solution fails in the top left and bottom right quadrants.
Addendum: Following on from discussion in the comments about the image of the system not working, I'll put some hard numbers here too.
The entire tensor field is 800x480, the center point is at { 400, 240 }, and we're using the standard graphics coordinate system with a negative y axis (ie. origin in the top left).
At { 400, 240 }, the tensor is R = 0, T = 0
At { 200, 120 }, the tensor is R = 2.95936E+9, T = 2.111216
At { 600, 120 }, the tensor is R = 2.95936E+9, T = 1.03037679
I can easily sample any more points which you think may help.
The code I'm using to calculate values is:
float x = i - center.X;
float xSqr = x * x;
float y = j - center.Y;
float ySqr = y * y;
float r = (float)Math.Pow(xSqr + ySqr, 2);
float theta = (float)Math.Atan2((-2 * x * y), (ySqr - xSqr)) / 2;
if (theta < 0)
    theta += MathHelper.Pi;
Evidently you are comparing formulas (1) and (2) of the paper. Note the scalar multiple l = || (u_x,u_y) || in formula (1), and identify that with R early in the section. This factor is implicit in formula (2), so to make them match we have to factor R out.
Formula (2) works with an offset from the "center" (x0,y0) of the radial map:
x = xp - x0
y = yp - y0
to form the given 2x2 matrix:
y^2 - x^2 -2xy
-2xy -(y^2 - x^2)
We need to factor out a scalar R from this matrix to get a traceless orthogonal 2x2 matrix as in formula (1):
cos(2t) sin(2t)
sin(2t) -cos(2t)
Since cos^2(2t) + sin^2(2t) = 1 the factor R can be identified as:
R = sqrt((y^2 - x^2)^2 + (-2xy)^2) = x^2 + y^2
leaving a traceless orthogonal 2x2 matrix:
C S
S -C
from which the angle, via tan(2t) = S/C, can be extracted by an inverse trig function.
Well, almost. As belisarius warns, we need to check that the angle t is in the correct quadrant. The authors of the paper write at the beginning of Sec. 4 that their "t" (which refers to the tensor) depends on R >= 0 and theta (your t) lying in [0,2pi) according to the formula R [ cos(2t), sin(2t); sin(2t), -cos(2t) ].
Since sine and cosine have period 2pi, t (theta) is only uniquely determined up to an interval of length pi. I suspect the authors meant to write either that 2t lies in [0,2pi) or, more simply, that t lies in [0,pi). belisarius's suggestion to use "the atan2 equivalent" will avoid any division by zero. We may (if the function returns a negative value) need to add pi so that t >= 0. This amounts to adding 2pi to 2t, so it doesn't affect the signs of the entries in the traceless orthogonal matrix (since R >= 0, the pattern of signs should agree in formulas (1) and (2)).
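Here is a minimal NumPy sketch of that recipe, using the centre and one of the sample points from the question, and checking that (R, theta) rebuilds the raw radial matrix:
import numpy as np

def radial_tensor_r_theta(px, py, cx, cy):
    # offset from the centre of the radial field, as in formula (2)
    x, y = px - cx, py - cy
    # factor [[y^2 - x^2, -2xy], [-2xy, -(y^2 - x^2)]] as R * [[cos 2t, sin 2t], [sin 2t, -cos 2t]]
    r = x * x + y * y                                   # magnitude R
    theta = 0.5 * np.arctan2(-2 * x * y, y * y - x * x)
    if theta < 0:
        theta += np.pi                                  # keep t in [0, pi)
    return r, theta

# check: rebuilding the tensor from (R, theta) reproduces the raw radial matrix
r, t = radial_tensor_r_theta(200, 120, 400, 240)
rebuilt = r * np.array([[np.cos(2 * t), np.sin(2 * t)],
                        [np.sin(2 * t), -np.cos(2 * t)]])
x, y = 200 - 400, 120 - 240
raw = np.array([[y * y - x * x, -2 * x * y],
                [-2 * x * y, -(y * y - x * x)]])
print(np.allclose(rebuilt, raw))  # True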