LSTM gates activations - deep-learning

LSTM gates activations - deep-learning

I would like to study the activation of the gates in the LSTM (forget, input and output) but I can not extract the values, so is there any way to do?
I tried to get the weights and biases but this did not help me.

Related

Resnet50 does not converge. VGG16 works fine

I trained one regression network using resnet50 as backbone. The input of the network is image whose size is 224*224*3, the output of the network is one value, varying from 0 to 1.
but the netwrok can not converge, no matter I use sigmoid or relu as output layer's activation. mae or mse as loss function.
For exampple, I use resnet50 as backbone,mae as loss function, sigmoid is the activation function of output layer. SGD as optimizer. The training loss would be:
Epoch 1 training loss is 0.4900， val_loss is 0.4797
Epoch 2 training loss is 0.4923, val_loss is 0.4794
Epoch 3 training loss is 0.4923, val_loss is 0.4783
...
Epoch 35 training loss is 0.4923, val_loss is 0.4771
The training loss would not change, it is constant 0.4923. the val_loss is always about 0.47. I tested differentoptimizer, learning rate. the network is still not converge.
When I use VGG16 or Mobilenet as backbone, the network converged.
Could anyone give me some suggestions about how I can fix this problem.

Can you somehow validate if the Resnet50 backbone is correctly implemented. Maybe try to train it on MNIST and see if it works in general.
It kinda seems to me that the ResNet varaint just outputs some mean value instead of learning the actual problem.
Can you give some more information on what you want to achieve. How your regression looks like and what input is expected from the backbone. Also you might want to have a look at similar work (if that exists) and read what architectures they were using and what hyperparameters.

Predicting continuous valued output

I am working on predicting Semantic Textual Similarity (SemEval 2017 Task-1) between a pair of texts. The similarity score (output) is a continuous value between [0,5]. The neural network model (link below), therefore, has 6 units in the final layer for prediction between values [0,5]. The objective function used is the Pearson correlation coefficient and softmax activation is used. Now, in order to train the model, how can I give the target output values to the model? Since there are 6 output classes, I should probably send one-hot-encoded vectors of the output. In that case, how can we convert the output (which might be a float value such as 2.33) to a one-hot vector of length 6? Or is there any other way of specifying the target output and training the model?
Paper: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval016.pdf

If the value you're trying to predict is continuously-defined, you might be better off configuring this as a regression architecture. This will be simpler to train and interpret and will give you non-integer predictions (which you can then bucket or threshold however you please).
In order to do this, replace your softmax layer with a layer containing a single neuron with a linear activation function. Then you can simply train this network using your real-valued similarity numbers at the output. For loss function, you can use MSE / L2 unless you have a reason to do otherwise.

What type of value does one LSTM block produce? Is it a number in range [0;1] or a vector with numbers in range [0;1]?

I have read different papers and tutorials about LSTM neural nets and I can't understand this thing. If you look at this article https://arxiv.org/pdf/1609.08144.pdf at page 4 you will see architecture of Google’s Neural Machine Translation system which has 8 layers of LSTM neurons. Let's discuss certain moment of time "t" so we are interested only in 8 LSTM neurons. If first LSTM neuron returns number then second LSTM neuron takes this number as an input(SO NOT A VECTOR that is strange). On the other hand if LSTM neuron produces vector as an output the things get more weird. Help please.

Hard to understand Caffe MNIST example

After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model using in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand, Convolutional layer in Caffe simply calculate the sum of Wx+b for each input, without applying any activation function. If we would like to add the activation function, we should add another layer immediately below that convolutional layer, like Sigmoid, Tanh, or Relu layer. Any paper/tutorial I read on the internet applies the activation function to the neuron units.
It leaves me a big question mark as we only can see the Convolutional layers and Pooling layers interleaving in the model. I hope someone can give me an explanation.
As a site note, another doubt for me is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60.000 images for training, 10.000 images for testing. So why does the max_iter here only 10.000 (and it still can get > 99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure if the accuracy rate is the total correct prediction/test size.
I'm very amazed of this example, as I haven't found any example, framework that can achieve this high accuracy rate in that very short time (only 5 mins to get >99% accuracy rate). Hence, I doubt there should be something I misunderstood.
Thanks.

Caffe uses batch processing. The max_iter is 10,000 because the batch_size is 64. No of epochs = (batch_size x max_iter)/No of train samples. So the number of epochs is nearly 10. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99% as the dataset is not very complicated.

For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, in Le Cun's 2001 paper, it states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layers should have an activation function applied (the tutorial uses ReLU activation functions instead of sigmoid).
MNIST is the hello world example for neural networks. It is very simple to today's standard. A single fully connected layer can solve the problem with accuracy of about 92%. Lenet-5 is a big improvement over this example.

Can one inspect the weights learned by a logistic regression classifier in weka?

I'm training Weka's logistic regression classifier and I'm trying to figure out what is going on under the hood. I know that I can use the classifier to look at the confidence distribution per instance using the logistic.distributionForInstance method but is there a way that I can look at the feature weights learned by the classifier?
thanks

Yes you can get the weights as a 2D array of doubles with the coefficients method. The first index is for the attributes and the second index is for the classes.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008