How do I build a custom RNN layer in Keras? - deep-learning

I am trying to implement a custom RNN layer in Keras and I tried to follow what is explained in this link, which basically shows how to inherit from the existing RNN classes. However, the update equation of the hidden state in my formulation is a bit different: h(t) = tanh(W.x + U.h(t-1) + V.r(t) + b), and I am a bit confused. In this equation, r(t) = f(x, p(t)) is a function of x, the fixed input distributed over time, and of p(t) = O(t-1).alpha + p(t-1), where O(t) is the softmax output of each RNN cell.
I think that after calling super(customRNN, self).step in the inherited step function, the standard h(t) should be overridden by my definition of h(t). However, I am not sure how to modify the states and the get_constants function, and whether or not I need to modify any other parts of the Recurrent and SimpleRNN classes in Keras.
My intuition is that the get_constants function only returns the dropout matrices as extra states to the step function, so I am guessing at least one state should be added for the dropout matrix of V in my equations.
I have just recently started using Keras and I could not find many references on defining custom Keras layers. Sorry if my question is overloaded with parameters; I just wanted to make sure that I am not missing anything. Thanks!
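For reference, the same recurrence can be prototyped in current tf.keras with a custom cell wrapped in keras.layers.RNN, rather than by subclassing SimpleRNN/Recurrent as in the linked approach. The following is only an illustrative sketch: CustomCell, W_r and the additive choice of f(x, p(t)) are assumptions, not part of the question.
import tensorflow as tf

class CustomCell(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = [units, units]            # carries h(t-1) and p(t-1)

    def build(self, input_shape):
        dim = input_shape[-1]
        init = "glorot_uniform"
        self.W = self.add_weight(name="W", shape=(dim, self.units), initializer=init)
        self.U = self.add_weight(name="U", shape=(self.units, self.units), initializer=init)
        self.V = self.add_weight(name="V", shape=(self.units, self.units), initializer=init)
        self.W_r = self.add_weight(name="W_r", shape=(dim, self.units), initializer=init)
        self.alpha = self.add_weight(name="alpha", shape=(self.units, self.units), initializer=init)
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")

    def call(self, x, states):
        h_prev, p_prev = states
        # p(t) = O(t-1).alpha + p(t-1), with softmax(h_prev) standing in for O(t-1)
        p = tf.matmul(tf.nn.softmax(h_prev), self.alpha) + p_prev
        # r(t) = f(x, p(t)); a simple additive f is used here purely for illustration
        r = tf.tanh(tf.matmul(x, self.W_r) + p)
        # h(t) = tanh(W.x + U.h(t-1) + V.r(t) + b)
        h = tf.tanh(tf.matmul(x, self.W) + tf.matmul(h_prev, self.U)
                    + tf.matmul(r, self.V) + self.b)
        return h, [h, p]

layer = tf.keras.layers.RNN(CustomCell(32))         # usable like any other recurrent layer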

Related

What is the output dimension of stacked bidirectional LSTM layers?

I read a paper about machine translation that uses a projection layer. Its encoder has 6 bidirectional LSTM layers. If the input embedding dimension is 512, what will the dimension of the encoder output be? 512*2**5?
The paper's link: https://www.aclweb.org/anthology/P18-1008.pdf
Not quite. Unfortunately, Figure 1 in the mentioned paper is a bit misleading. The six encoding layers are not in parallel, as the figure might suggest; rather, they are successive, meaning that the hidden state/output of each layer is used as the input to the next layer.
This, together with the fact that the input (embedding) dimension is NOT the output dimension of a bidirectional LSTM layer (which is in fact 2 * hidden_size), makes your output dimension exactly that: 2 * hidden_size, before it is passed to the final projection layer, which again changes the dimension depending on your specifications.
It is not quite clear to me what the described add does in the layer, but if you look at a reference implementation it seems to be irrelevant to the answer. Specifically, observe how the encoding function is basically:
def encode(...):
    encode_inputs = self.embed(...)
    for l in range(num_layers):
        prev_input = encode_inputs
        encode_inputs = self.nth_layer(...)
        # ...
Obviously, there is a bit more happening here, but this illustrates the basic functional block of the network.
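To make the dimension claim concrete, here is a small standalone Keras sketch (hidden_size = 512 and the layer count are assumptions; the paper's exact sizes may differ) showing that each stacked bidirectional layer emits 2 * hidden_size features, not 512 * 2**5:
import tensorflow as tf

hidden_size = 512
num_layers = 6

x = tf.keras.Input(shape=(None, 512))        # (batch, time, embedding_dim)
h = x
for _ in range(num_layers):
    # Each layer consumes the previous layer's output (successive, not parallel).
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_size, return_sequences=True))(h)

print(h.shape)   # (None, None, 1024) == 2 * hidden_size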

How to implement soft-argmax in caffe?

In the Caffe deep-learning framework there is an argmax layer, which is not differentiable and hence cannot be used for end-to-end training of a CNN.
Can anyone tell me how I could implement the soft version of argmax, i.e., soft-argmax?
I want to regress coordinates from a heatmap and then use those coordinates in the loss calculation. I am very new to this framework and have no idea how to do this; any help will be much appreciated.
I don't get exactly what you want, but there are the following options:
Use an L2 loss to train the regression task (EuclideanLoss), or SmoothL1Loss (from the SSD Caffe fork by Wei Liu), or an L1 loss (I don't know where you would get one).
Use softmax with cross-entropy loss (SoftmaxWithLoss) to train a classification task with classes corresponding to the possible values of the x or y coordinate, for example one loss layer for x and one for y. SoftmaxWithLoss accepts the label as a numeric value and casts it to int with static_cast<int>(). Take into account that the implementation does not check that the cast value is within the 0..(num_classes-1) range, so you have to be careful.
If you want something more unusual, you'll have to write your own layer in C++, C++/CUDA or Python+NumPy (a small soft-argmax sketch follows below). This is very often the case unless you are already using someone else's implementation.
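For the custom-layer route, the soft-argmax itself is usually computed as a softmax-weighted expectation over positions, which is differentiable. A minimal NumPy sketch of the forward pass (not a ready-made Caffe layer; soft_argmax_1d and beta are illustrative names/choices):
import numpy as np

def soft_argmax_1d(scores, beta=10.0):
    # Softmax over positions; larger beta pushes the result toward the hard argmax.
    e = np.exp(beta * (scores - scores.max()))
    p = e / e.sum()
    # Expected index: sum_i p_i * i, differentiable with respect to the scores.
    return np.sum(p * np.arange(len(scores)))

heat = np.array([0.1, 0.2, 3.0, 0.4])
print(soft_argmax_1d(heat))   # close to 2, the hard argmax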

When exactly are the kernels initialized in Keras?

I'm using the Keras functional API, and I would like to know: when exactly are the kernels initialized? Is it during the creation of the layer, like in
x = Dense(32, kernel_initializer='glorot_uniform')(x)
or is it during the compilation of the model? e.g.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
I guess it's not during model.fit(...) or I wouldn't be able to fine-tune a pre-trained model, because the previous weights would be lost. Am I missing something?
It turns out that the Layer superclass defines a method build(input_shape) that all derived classes with weights, such as Dense and Conv2D, must implement. In this method, among other things, the weight variables are created and initialized. build is in turn called by Layer's __call__ method, which is what gets invoked in the line
x = Dense(32, kernel_initializer='glorot_uniform')(x)
right after the constructor, __init__.
Reference: https://github.com/fchollet/keras/blob/master/keras/engine/topology.py
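A quick standalone check of this behaviour (written against tf.keras; the shapes and names are just illustrative):
import tensorflow as tf

inp = tf.keras.Input(shape=(16,))
layer = tf.keras.layers.Dense(32, kernel_initializer='glorot_uniform')
print(len(layer.weights))   # 0: __init__ alone creates no variables
x = layer(inp)              # __call__ triggers build(input_shape)
print(len(layer.weights))   # 2: kernel and bias now exist, already initialized
print(layer.kernel.shape)   # (16, 32), filled by glorot_uniform before compile/fit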

Need hint for the Exercise posed in the Tensorflow Convolution Neural Networks Tutorial

Below is the exercise question posed on this page https://www.tensorflow.org/versions/0.6.0/tutorials/deep_cnn/index.html
EXERCISE: The output of inference are un-normalized logits. Try editing the network architecture to return normalized predictions using tf.softmax().
In the spirit of the exercise, I want to know if I'm on the right track (not looking for the coded-up answer).
Here's my proposed solution.
Step 1: The last layer (of the inference) in the example is "softmax_linear", i.e., it simply does the unnormalized WX+b transformation. As stipulated, we apply the tf.nn.softmax operation with softmax_linear as input. This normalizes the output as probabilities in the range [0, 1].
Step 2: The next step is to modify the cross-entropy calculation in the loss function. Since we already have normalized output, we need to replace the tf.nn.softmax_cross_entropy_with_logits operation with a plain cross_entropy(normalized_softmax, labels) function that does not further normalize the output before calculating the loss. I believe this function is not available in the TensorFlow library; it needs to be written.
That's it. Feedback is kindly solicited.
Step 1 is more than sufficient if you insert the tf.nn.softmax() in cifar10_eval.py (and not in cifar10.py). For example:
logits = cifar10.inference(images)
normalized_logits = tf.nn.softmax(logits)
top_k_op = tf.nn.in_top_k(normalized_logits, labels, 1)
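As for Step 2, a plain cross-entropy over already-normalized probabilities can be written directly. A hedged sketch against the current TensorFlow API (it assumes one-hot labels, whereas the CIFAR-10 tutorial uses sparse integer labels, so a tf.one_hot conversion would be needed first; eps is an illustrative guard against log(0)):
import tensorflow as tf

def cross_entropy_from_probs(probs, onehot_labels, eps=1e-8):
    # -sum_k y_k * log(p_k), averaged over the batch; probs are assumed to be
    # softmax-normalized already, so no further normalization is applied here.
    return tf.reduce_mean(
        -tf.reduce_sum(onehot_labels * tf.math.log(probs + eps), axis=1))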

How to Solve non-specific non-linear equations?

I am attempting to fit a circle to some data. This requires numerically solving a set of three non-linear simultaneous equations (see the Full Least Squares Method of this document).
To me it seems that the NEWTON function provided by IDL is well suited to this problem. NEWTON requires the name of a function that will compute the values of the equation system for particular values of the independent variables:
FUNCTION newtfunction, X
  RETURN, [Some function of X, Some other function of X]
END
While this works fine, it requires that all parameters of the equation system (in this case the set of data points) are hard-coded in newtfunction. This is fine if there is only one data set to solve for; however, I have many thousands of data sets, and defining a new function by hand for each one is not an option.
Is there a way around this? Is it possible to define functions programmatically in IDL, or even just pass in the data set in some other manner?
I am not an expert on this matter, but if I were to solve this problem I would do the following. Instead of solving a system of 3 non-linear equations to find the three unknowns (i.e. xc, yc and r), I would use an optimization routine that converges to a solution starting from an initial guess. For this, steepest descent, conjugate gradient, or any other multivariate optimization method can be used.
I just quickly derived the least-squares objective for your problem as (please check before use):
F = sum_{i=1}^{N} ((xi - xc)^2 + (yi - yc)^2 - r^2)^2
Calculating the gradient of this function is fairly easy, since it is just a summation, and therefore writing steepest-descent code to calculate xc, yc and r would be straightforward.
I hope it helps.
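To make the suggestion concrete, here is a minimal steepest-descent sketch in Python/NumPy (the question is about IDL, but the update rule translates directly; the learning rate, iteration count and test data are arbitrary illustrative choices):
import numpy as np

def fit_circle(x, y, xc, yc, r, lr=1e-3, steps=5000):
    # Minimize F = sum_i ((xi - xc)^2 + (yi - yc)^2 - r^2)^2 by steepest descent.
    for _ in range(steps):
        d = (x - xc) ** 2 + (y - yc) ** 2 - r ** 2      # per-point residual
        # Gradients of F, averaged over points so the step does not scale with N.
        g_xc = np.mean(-4.0 * d * (x - xc))
        g_yc = np.mean(-4.0 * d * (y - yc))
        g_r = np.mean(-4.0 * d * r)
        xc, yc, r = xc - lr * g_xc, yc - lr * g_yc, r - lr * g_r
    return xc, yc, r

# Noisy points on a circle of radius 2 centred at (1, -1).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 100)
x = 1.0 + 2.0 * np.cos(t) + 0.01 * rng.standard_normal(100)
y = -1.0 + 2.0 * np.sin(t) + 0.01 * rng.standard_normal(100)
print(fit_circle(x, y, xc=0.0, yc=0.0, r=1.0))          # approaches (1, -1, 2)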
It's usual to use a COMMON block in these types of functions to pass in other parameters, cached values, etc. that are not part of the calling signature of the numeric routine.