Need hint for the exercise posed in the TensorFlow Convolutional Neural Networks tutorial - deep-learning

Below is the exercise question posed on this page https://www.tensorflow.org/versions/0.6.0/tutorials/deep_cnn/index.html
EXERCISE: The output of inference are un-normalized logits. Try
editing the network architecture to return normalized predictions
using tf.softmax().
In the spirit of the exercise, I want to know if I'm on the right track (not looking for the coded-up answer).
Here's my proposed solution.
Step 1: The last layer (of the inference) in the example is a "softmax_linear", i.e., it simply performs the unnormalized WX + b transformation. As stipulated, we apply the tf.nn.softmax operation with softmax_linear as input. This normalizes the output as probabilities in the range [0, 1].
Step 2: The next step is to modify the cross-entropy calculation in the loss function. Since we already have normalized output, we need to replace the tf.nn.softmax_cross_entropy_with_logits operation with a plain cross_entropy(normalized_softmax, labels) function that does not normalize the output again before calculating the loss. I believe this function is not available in the TensorFlow library; it needs to be written.
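For what it's worth, a minimal sketch of what such a hand-written function could look like (the epsilon clipping is my addition to avoid log(0); labels is assumed to be a batch of one-hot vectors):
import tensorflow as tf

def cross_entropy(normalized_softmax, labels, epsilon=1e-10):
    # normalized_softmax: probabilities of shape [batch, num_classes]
    # labels: one-hot targets of the same shape
    clipped = tf.clip_by_value(normalized_softmax, epsilon, 1.0)  # guard against log(0)
    per_example = -tf.reduce_sum(labels * tf.log(clipped), 1)     # -sum_c y_c * log(p_c)
    return tf.reduce_mean(per_example)                            # average over the batch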
That's it. Feedback is kindly solicited.

Step 1 is more than sufficient if you insert tf.nn.softmax() in cifar10_eval.py (and not in cifar10.py). For example:
logits = cifar10.inference(images)           # un-normalized scores (WX + b)
normalized_logits = tf.nn.softmax(logits)    # probabilities in [0, 1] that sum to 1
top_k_op = tf.nn.in_top_k(normalized_logits, labels, 1)
(Note that softmax is monotonic, so tf.nn.in_top_k returns the same result with or without it; the normalization only matters if you want to read the outputs as probabilities.)

Related

PyTorch find keypoints: output nodes to be in a range and negative loss

I am a beginner in deep learning.
I am using this dataset and I want my network to detect the keypoints of a hand.
How can I make my output layer's nodes fall in the range [-1, 1] (the range of normalized 2D points)?
Another problem is that when I train for more than 1 epoch, the loss takes negative values.
Criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo.
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()  # intended for multi-label classification, not regression
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
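For example, a minimal sketch of a regression head ending in Tanh (the layer sizes and the 21-keypoint assumption are placeholders, not taken from the question's repo):
import torch.nn as nn

# Hypothetical head: 42 outputs = 21 keypoints x 2 coordinates
head = nn.Sequential(
    nn.Linear(512, 42),
    nn.Tanh(),  # squashes every output into [-1, 1]
)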
The problem of predicting keypoints in an image is more of a regression problem than a classification problem (especially if you're making your model's outputs and targets fall within a continuous interval). Therefore, I suggest you use the L2 loss.
In fact, it could be a good exercise for you to determine, using cross-validation, which of the loss functions appropriate for regression problems gives the lowest expected generalization error. Several such functions (e.g. nn.MSELoss, nn.L1Loss, nn.SmoothL1Loss) are available in PyTorch.
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and scale the outputs to [-1, 1] using the 2*x - 1 transformation.
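For example (a sketch; x stands in for the raw output of the last linear layer):
import torch

x = torch.randn(4, 42)          # stand-in for the last layer's raw output
out = 2 * torch.sigmoid(x) - 1  # sigmoid maps to [0, 1]; the affine rescale maps to [-1, 1]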

How to train two pytorch networks with different inputs together?

I'm totally new to pytorch, so it might be a very basic question. I have two networks that should be trained together.
First one takes data as input and returns its embedding as output.
Second one takes pairs of embedded datapoints and returns their 'similarity' as output.
Partial loss is then computed for every datapoint, and then all the losses are combined.
This final loss should be backpropagated through both networks.
What should the code for that look like? I'm thinking of something like this:
def train_models(inputs, targets):
    network1.train()
    network2.train()
    embeddings = network1(inputs)
    paired_embeddings = pair_embeddings(embeddings)
    similarities = network2(paired_embeddings)
    """
    I don't know how the loss should be calculated here.
    I have a loss formula for every embedded datapoint,
    but not for every similarity.
    But if I only calculate the loss for every embedding (using similarities),
    won't backpropagate() only modify network1,
    since embeddings are network1's outputs
    and have not been modified in network2?
    """
    optimizer1.step()
    optimizer2.step()
    scheduler1.step()
    scheduler2.step()
    network1.eval()
    network2.eval()
I hope this is specific enough. I'll gladly share more details if necessary. I'm just so inexperienced with PyTorch and deep learning in general that I'm not even sure how to ask this question.
You can use a single optimizer for this purpose, and even pass a different learning rate for each network.
optimizer = optim.Adam([
    {'params': network1.parameters()},              # uses the default lr below (1e-4)
    {'params': network2.parameters(), 'lr': 1e-3},  # overrides it for network2
], lr=1e-4)
# ...
optimizer.zero_grad()  # clear gradients left over from the previous step
loss = loss1 + loss2
loss.backward()        # the combined loss backpropagates through both networks
optimizer.step()
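To address the doubt in the question's comment block: autograd records the whole forward graph, from the inputs through network1 and network2, so a single backward() on the combined loss produces gradients for both networks. A minimal sketch of a full step under the question's setup (criterion is a stand-in for the asker's own loss formula):
def train_step(inputs, targets):
    network1.train()
    network2.train()
    optimizer.zero_grad()                    # the single optimizer defined above
    embeddings = network1(inputs)
    paired = pair_embeddings(embeddings)     # the asker's pairing function
    similarities = network2(paired)
    loss = criterion(similarities, targets)  # stand-in for the asker's loss
    loss.backward()                          # gradients reach BOTH networks
    optimizer.step()
    return loss.item()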

How to get the accuracy of classifier on test data in DeepLearning

I am trying to use DL4J for deep learning and have provided the training data with its labels. I am then trying to send test data by assigning a dummy label; without a dummy label, it gives a runtime error. I don't understand why we need to assign a label to the test data.
Additionally, I want to know the accuracy of the prediction made. From what I saw in the DL4J docs, there is something known as a confusion matrix which is generated. I understand that this just gives us an idea of how well the training data has trained the system. Is there a way to get the accuracy of prediction on test data? Since we are giving a dummy label for the test data, I feel that the confusion matrix is also not generated correctly.
First, how can you test whether the network outputs the correct labels if you don't know what the correct labels are? You should always have labels when training and testing, because that way you can check whether the output is correct.
For your second question, I found this on the DL4J webpage:
Evaluation eval = new Evaluation(3);                      // 3 = number of classes
INDArray output = model.output(testData.getFeatures());   // predictions on the test set
eval.eval(testData.getLabels(), output);                  // compare predictions to true labels
log.info(eval.stats());                                   // prints confusion matrix, accuracy, etc.
It is stated there that the .stats() method displays the confusion matrix entries (one per line), accuracy, precision, recall and F1 score. Additionally, the Evaluation class can also calculate and return the following values:
Confusion Matrix
False Positive/Negative Rate
True Positive/Negative
Class Counts
F-beta, G-measure, Matthews Correlation Coefficient and more
I hope this helps you.
You may find people who can respond to your question in the DL4J dev community here: https://gitter.im/deeplearning4j/deeplearning4j/tuninghelp

Tensorflow Multiple Input Loss Function

I am trying to implement a CNN in TensorFlow (with an architecture quite similar to VGG), which splits into two branches after the first fully connected layer. It follows this paper: https://arxiv.org/abs/1612.01697
Each of the two branches of the network outputs a set of 32 numbers. I want to write a joint loss function, which will take 3 inputs:
The predictions of branch 1 (y)
The predictions of branch 2 (alpha)
The labels Y (ground truth) (q)
and calculate a weighted loss, as in the image below:
[Image: loss function definition]
# Weighted average of branch-1 predictions, with branch-2 outputs as the weights
q_hat = tf.divide(tf.reduce_sum(tf.multiply(alpha, y), 0), tf.reduce_sum(alpha, 0))
# L1 distance between the weighted prediction and the ground truth
loss = tf.abs(tf.subtract(q_hat, q))
I understand that I need to use tf functions to implement this loss. Having implemented the above function, the network trains, but once trained, it does not output the expected results.
Has anyone ever tried combining outputs of two branches of a network in one joint loss function? Is this something TensorFlow supports? Maybe I am making a mistake somewhere here? Any help whatsoever would be greatly appreciated. Let me know if you would like me to add any further details.
From the TensorFlow perspective, there is absolutely no difference between a "regular" CNN graph and a "branched" graph. For TensorFlow, it is just a graph that needs to be executed. So, TensorFlow certainly supports this. "Combining two branches into a joint loss" is also nothing special. In fact, it is good that the loss depends on both branches: it means that when you ask TensorFlow to compute the loss, it has to do the forward pass through both branches, which is what you want.
One thing I noticed is that your code for the loss is different from the image. Your code appears to do this instead: https://ibb.co/kbEH95
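For illustration, one plausible reading of a per-example weighted loss (a sketch only: whether the reduction should run over the batch, axis 0, or over the 32 outputs, axis 1, depends on the paper's definition, so check it against the image):
# y, alpha: outputs of the two branches, shape [batch, 32]; q: ground truth
q_hat = tf.reduce_sum(alpha * y, axis=1) / tf.reduce_sum(alpha, axis=1)  # weighted mean per example
loss = tf.reduce_mean(tf.abs(q_hat - q))  # reduce to a scalar for the optimizer
# Minimizing this single scalar backpropagates through both branches:
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)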

Understanding stateful LSTM [closed]

I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows:
1. Training batching size
In the Keras docs on RNNs, I found out that the hidden state of the sample in the i-th position within the batch will be fed as the input hidden state for the sample in the i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1, and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size > 1 and perform gradient descent on that batch?
2. One-Char Mapping Problems
In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping', we're given code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K'): it predicts 'B', then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (the last part only; I kept lines 52 and above):
# demonstrate a random starting point
letter1 = "M"
seed1 = [char_to_int[letter1]]
x = numpy.reshape(seed1, (1, len(seed1), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed1[0]], "->", int_to_char[index])
letter2 = "E"
seed2 = [char_to_int[letter2]]
seed = seed2
print("New start: ", letter1, letter2)
for i in range(0, 5):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
and got these outputs:
M -> B
New start: M E
E -> C
C -> D
D -> E
E -> F
It looks like the LSTM did not learn the alphabet but just the positions of the letters: regardless of the first letter we feed in, it will always predict B (since that's the second letter), then C, and so on.
Therefore, how does keeping the previous hidden state as the initial hidden state for the current batch help with learning, given that during testing, if we start with the letter 'K' for example, the letters A to J will not have been fed in before and the initial hidden state won't be the same as during training?
3. Training an LSTM on a book for sentence generation
I want to train my LSTM on a whole book to learn how to generate sentences, and perhaps learn the author's style too. How can I naturally train my LSTM on that text (inputting the whole text and letting the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from the book myself? I believe stateful LSTMs could help, but I'm not sure how.
Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you can check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, uses an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch in stateful models is just about that difference with non-stateful models; of course the state will always flow within each sequence in the batch, and you do not need batches of size 1 for that to happen. I see two scenarios where stateful models are useful:
You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
At prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with a batch size of 1).
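A minimal sketch of such a stateful setup in Keras (the sizes are placeholders; the key parts are stateful=True, the fixed batch_input_shape it requires, and the explicit reset_states() call):
from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, timesteps, features = 1, 10, 1  # placeholder shapes
model = Sequential()
# stateful=True requires a fixed batch size, hence batch_input_shape
model.add(LSTM(32, batch_input_shape=(batch_size, timesteps, features), stateful=True))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

# The state now persists across batches until you clear it yourself,
# typically at the end of each full sequence:
model.reset_states()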
I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network on that kind of example (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet by training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNNs with textual data, but there are a number of approaches you can research. You can build character-based models (like the ones in the post), where you input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions for that. Having a single number as the feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each word, which is what they call an embedding. You can go even further with the preprocessing and look into something like NLTK, especially if you want to remove stop words, punctuation and things like that.
Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not matter for you), you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the author, your output could be a softmax Dense layer.
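As a rough sketch of the shifted-by-one, embedding-based setup described above (the vocabulary size, sequence length and layer sizes are placeholder values, not something from the question):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size, seq_len = 5000, 40  # placeholders: vocabulary size and excerpt length
model = Sequential()
# Learn a dense vector for each token instead of feeding raw indices
model.add(Embedding(vocab_size, 64, input_length=seq_len))
# return_sequences=True gives one output per time step, as mentioned above
model.add(LSTM(128, return_sequences=True))
# Predict the next token at every position; Y is X one-hot encoded and shifted by one
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')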
Hope that helps.