I would like to split the Blob channels in Caffe, so that I can split one Blob of (N, c, w, h) into two output Blobs of size (N, c/2, w, h).
What I have described above is the general case; what I actually want to do is separate a two-channel input image into two different images. One goes to a convolutional layer and the other to a pooling layer. Finally, I concatenate the outputs.
So I am wondering whether a Caffe layer that lets the user do such a thing exists, and how to define it in the prototxt file.
Yes, the Slice layer is for that purpose. From the Layer Catalogue:
The Slice layer is a utility layer that slices an input layer to multiple output layers along a given dimension (currently num or channel only) with given slice indices.
To slice a Blob of size N x 2 x H x W into two Blobs of size N x 1 x H x W, you have to slice axis: 1 (along channels) at slice_point: 1 (after the first channel):
layer {
  name: "slice-conv-pool"
  type: "Slice"
  bottom: "data"
  top: "conv1"
  top: "pool1"
  slice_param {
    axis: 1
    slice_point: 1
  }
}
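If it helps to see the same split outside Caffe, here is a tiny numpy sketch of what Slice does to an (N, 2, H, W) blob:

import numpy as np

x = np.random.rand(4, 2, 32, 32)             # a dummy (N, 2, H, W) blob
conv_in, pool_in = np.split(x, [1], axis=1)  # cut after the first channel
print(conv_in.shape, pool_in.shape)          # (4, 1, 32, 32) (4, 1, 32, 32)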
I know the softmax activation function: the sum of the output layer with a softmax activation is always equal to one, that is, the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when the softmax is used as a classifier, the argmax function is used to get the index of the class. So, what is the difference between getting an accumulated probability of one or higher, if the important result is the index that gives the correct class?
An example in Python, where I made another "softmax" (it is not really a softmax function), but the classifier works in the same way as the classifier with the real softmax function:
import numpy as np
classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
'horse', 'human', 'car', 'table', 'bottle']
# This simulates a NN with its weights and a previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes,512)) # Output from previous layer
w = np.random.normal(0, 0.5, (512,1)) # weights
b = np.random.normal(0, 0.5, (classes,1)) # bias
# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]

# approx solution (the summed "probability" can be greater than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a, w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]
teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The obtained class is always the same with both methods (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method two passes are not necessary to calculate the softmax function: one to find the exponentiated matrix and its sum, and a second to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could still have been trained with softmax (for better accuracy, faster convergence, etc.).
In your case, your insights are right: softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
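A quick numpy check of that last point: softmax is a monotonic transformation, so it can never change which index holds the maximum:

import numpy as np

logits = np.random.normal(size=10)            # raw scores from the last layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax

# the winning index is identical with or without the normalization
print(np.argmax(logits) == np.argmax(probs))  # always True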
I am going through a book on deep learning which initializes weights between two layers of neurons as:
w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
self.W.append(w / np.sqrt(layers[i]))
As per the book, the division by np.sqrt(layers[i]) in the second line of code is done for the following reason:
scale w by dividing by the square root of the number of nodes in the current layer, thereby
normalizing the variance of each neuron’s output
What does it exactly mean? And what impact would it have if we didn't do it?
Weight initialization is very important for tackling vanishing/exploding gradients. For the outputs (and the gradients flowing in the reverse direction) to propagate properly, the variance of each layer's outputs should be equal to the variance of its inputs; likewise for the gradients in the reverse direction. The number of input and output connections of a layer is called its fan-in and fan-out.
To better explain what I mean, let me give you an example. Assume that we have one hundred consecutive layers and we apply a feed-forward calculation with a linear activation (after all, it is just matrix multiplication); the data is 500 samples of 100 features:
import numpy as np

neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features))  # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons))
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(-4.055498760574568e+95, 8.424477240271639e+98)
You will see that it has a huge mean and standard deviation. Let's break this problem down: a property of matrix multiplication is that the result has a standard deviation very close to the square root of the number of fan-in (input) connections. This property can be verified with this snippet of code:
fan_in = 1000 # change it to any number
X = np.random.normal(size=(100, fan_in))
W = np.random.normal(size=(fan_in, 1))
np.dot(X, W).std()
# result:
32.764359213560454
This happens because we sum fan_in (1000 in the above case) products of one row of the input X with one column of W. Therefore, we scale every weight by 1/sqrt(fan_in) to maintain the distribution of the flow, as seen in the following snippet:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features))  # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons), scale=np.sqrt(1 / features))  # weights scaled by 1/sqrt(fan-in)
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(0.0002608301398189543, 1.021452570914829)
You can read more about kernel initialization in the following blog
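Tying this back to the snippet in the question, here is a minimal sketch of the book-style initializer (the layers list is a hypothetical example), which applies exactly that 1/sqrt(fan-in) scaling to every weight matrix:

import numpy as np

layers = [64, 32, 10]  # hypothetical layer sizes
W = []
for i in range(len(layers) - 1):
    # The extra row/column holds the bias term ("bias trick").
    w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
    # Scale by 1/sqrt(fan-in) so each neuron's output variance stays close to 1.
    W.append(w / np.sqrt(layers[i]))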
This is my data layer of net.prototxt:
layer {
  name: "csv"
  type: "MemoryData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  memory_data_param {
    batch_size: 10
    channels: 1
    width: 14
    height: 1
  }
}
I found the function
MemoryDataLayer<Dtype>::Reset(Dtype* data, Dtype* labels, int n)
but I don't know from where I should call this function. I also want to know where the label data comes from, because I only see the label keyword in the Datum struct.
I always use the MemoryData layer when I train a network through the pycaffe module, like this:
import caffe
import numpy as np

solver = caffe.SGDSolver(solver_file)
X = np.zeros((batch_size, 3, im_height, im_width), dtype=np.float32)
Y = np.zeros((batch_size,), dtype=np.float32)
# put processed images into X, put labels into Y
solver.net.set_input_arrays(X, Y)
You can refer to caffe_root/python/caffe/pycaffe.py and _caffe.cpp for details.
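For completeness, a rough sketch of the surrounding training loop; solver_file, n_iters and load_batch are placeholders, not part of pycaffe:

import caffe
import numpy as np

solver = caffe.SGDSolver(solver_file)  # solver_file: path to your solver prototxt

for it in range(n_iters):              # n_iters: placeholder iteration count
    X, Y = load_batch()                # hypothetical helper that fills float32 arrays
    solver.net.set_input_arrays(X.astype(np.float32), Y.astype(np.float32))
    solver.step(1)                     # one forward/backward/update pass on this batch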
Suppose there are N points in a 2-D plane. Each point has some weight attached to it. I am required to draw a straight line in such a way that it divides the points into two groups, and the total weight (the sum of the weights of the points in that group) of the part with the smaller weight is as large as possible. My task is to find this value. How do I go about it?
Note: no three points lie on the same line.
This is not homework or part of any contest.
You could just scan over all angles and offsets until you find the optimal solution.
For ease of computation, I would rotate all the points with a simple rotation matrix to align the points with the scanline, so that you only have to look at their x coordinates.
You only have to check half a circle before the scanline doubles up on itself; that's an angle of 0 to PI, assuming that you're working with radians, not degrees. I'm also assuming that the points can be read from the data as some kind of objects with an x, y and weight value.
Pseudocode:
Initialize points from input data
Initialize bestDifference to sum(weights of points)
Initialize bestAngle to 0
Initialize bestOffset to 0
Initialize angleStepSize to an arbitrary small value (e.g. PI/100)
For angle = 0:angleStepSize:PI
    Initialize rotatedPoints from points and rotationMatrix(angle)
    For offset = (lowest x in rotatedPoints) to (highest x in rotatedPoints)
        weightsLeft = sum of the weights of all points with x < offset
        weightsRight = sum of the weights of all points with x > offset
        difference = abs(weightsLeft - weightsRight)
        If difference < bestDifference
            bestAngle = angle
            bestOffset = offset
            bestDifference = difference
Return bestAngle, bestOffset, bestDifference
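A rough Python sketch of the same scan, in case pseudocode isn't enough; the function name and parameters are my own choice, and it assumes at least two points:

import numpy as np

def best_split(points, angle_steps=100):
    """points: iterable of (x, y, weight). Returns (bestAngle, bestOffset, bestDifference)."""
    pts = np.array([(x, y) for x, y, _ in points], dtype=float)
    weights = np.array([w for _, _, w in points], dtype=float)
    total = weights.sum()
    best_diff, best_angle, best_offset = total, 0.0, 0.0

    for angle in np.linspace(0, np.pi, angle_steps, endpoint=False):
        # Rotate so the candidate line is vertical; only the rotated x coordinate matters.
        xs = pts[:, 0] * np.cos(angle) - pts[:, 1] * np.sin(angle)
        order = np.argsort(xs)
        sorted_x, sorted_w = xs[order], weights[order]

        # Try a cut between every pair of consecutive points.
        left = np.cumsum(sorted_w)[:-1]        # weight on the left of each cut
        diffs = np.abs(left - (total - left))  # |weightsLeft - weightsRight|
        i = int(np.argmin(diffs))
        if diffs[i] < best_diff:
            best_diff = diffs[i]
            best_angle = angle
            best_offset = (sorted_x[i] + sorted_x[i + 1]) / 2.0

    return best_angle, best_offset, best_diff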
I'm looking to write a little comp-geom library, in Ruby.
I'm about to write the code for lines, and was wondering which line equation I should use:
ax + by + c = 0
r + tv (where r and v are vectors)
Thanks.
If using the classical equations is not a requirement, I'd suggest an array of four co-ordinates: xStart, yStart, xEnd and yEnd.
In case you need to make the line position dynamic, you could use an array of two parameters: alpha and radius. The former represents the rotation relative to the horizontal axis and the latter is the length of the line.
Yet another option would be vectors in the form of (X;Y).
Samples in C:
int endpointsLine[4] = {0, 0, 30, 40};
double radialLine[2] = {5.35589, 50};
int vectorLine[2] = {30, 40};
The "endpoints" format is fully compatible with modern line-drawing algorithms, such as Xiaolin Wu's line algorithm and Bresenham's line algorithm but it represents specific screen co-ordinates which is not the case with "radial" and "vector" format.