I have trained AlexNet on my own data (gray-level medical images of size 227*227). Now when I look into the caffemodel file, I see that, apart from some lines at the beginning, all the other values are ff7f or ffff, like this:
Additionally, when I try to visualize the network weights in Python, I get this warning:
UserWarning: Warning: converting a masked element to nan.
and nothing is shown in the figure!
Do you know what those ff7f or ffff values in my caffemodel are, and why they occur? Any recommendation, or pointers to similar issues that have been discussed, would help me figure this out.
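For what it's worth, repeated byte patterns like these in a caffemodel often mean the stored float32 weights have blown up to NaN or ±Inf, which would also explain the "masked element to nan" warning and the empty figure. A minimal sketch to confirm this, assuming pycaffe is available (the file names below are placeholders):

```python
import numpy as np
import caffe

# Placeholder file names; substitute your own prototxt / caffemodel pair.
net = caffe.Net('deploy.prototxt', 'my_model.caffemodel', caffe.TEST)
for name, blobs in net.params.items():
    w = blobs[0].data  # weight blob of this layer
    print(name, 'any NaN:', np.isnan(w).any(), 'any Inf:', np.isinf(w).any())
```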
So, I have been assigned a task to read and save the values of a C struct that was stored within a TIFF tag of a TIFF image as a byte buffer. This might be quite simple, but I am quite new to this realm of programming.
I know the exact positions I need to read bytes from. When I use Python TIFF tag readers, I get weird byte values that I could not make sense of. I was expecting something in the format \xb5\x00\x00\x00\x01, not something strange like \n\xd7#=\n\xd7#=K.
Here is the snippet of weird buffer values
However, in the utility app AsTiffViewer, those values look perfectly fine, as shown here.
How do I decode this? What does this all mean?
\n\xd7#=\n\xd7#=K (0A D7 23 3D 0A D7 23 3D - as per AsTiffViewer)
By the way, these 0A D7 23 3D & 0A D7 23 3D are supposed to be two float values, each of them 4 bytes.
I was expecting the TIFF tag byte buffer to be in a format like \xb5\x00\x00\x00\x01. However, it spit out something in a weird format - \n\xd7#=\n\xd7#=K - and I don't know how to decode or read it.
So, after mucking around a bit, I found out that \n\xd7#=\n\xd7#=K is nothing but how Python represents the floats in a byte string.
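Concretely, the buffer is just the raw little-endian IEEE-754 bytes, so struct can decode it directly. A small sketch (the trailing b"K" byte is simply whatever comes next in the tag and is ignored here):

```python
import struct

buf = b"\n\xd7#=\n\xd7#=K"             # the buffer shown above
# The first 8 bytes are two little-endian float32 values (0A D7 23 3D twice).
vals = struct.unpack("<2f", buf[:8])
print(vals)                             # approximately (0.04, 0.04)
```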
I'm constructing a projector-camera system, and I want to build radiometric compensation for it using deep learning.
Is it possible to use a network as below? (I guess the gradient does not flow, so the weights will not be updated, but I cannot be sure.)
0. I have a ground truth image GT; set Input_image = GT
While True:
1. Encoder-Decoder network structure : projection_image = network(Input_image)
2. project projection_image and capture it as Cap
3. loss calculation : loss = RMSE(Cap, GT)
4. Input_image = projection_image
For this situation:
In ordinary deep learning, the loss would be calculated between the direct output of the network (projection_image) and the ground truth data GT. Of course, that works.
However, in my case I want to calculate the loss between the post-processed network output (network output image -> projection -> capture) and GT.
Here, the post-processing is done on the CPU, so I guess the loss does not affect the network weights. Indeed, in my code the network was not updated.
Is it possible to solve my problem?
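To make the issue concrete, here is a minimal PyTorch sketch of the loop above; project_and_capture is an illustrative stand-in for the real projector/camera round trip. Because the capture step happens outside autograd, the RMSE loss has no path back to the network weights:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the real projector/camera round trip:
# it runs outside autograd, on the CPU/hardware.
def project_and_capture(img):
    arr = img.detach().cpu().numpy()      # the autograd graph is cut right here
    # ... send `arr` to the projector, grab a camera frame ...
    return torch.from_numpy(arr).float()  # fresh tensor with no grad history

network = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                        nn.Conv2d(8, 1, 3, padding=1))
GT = torch.rand(1, 1, 64, 64)             # ground truth image
projection_image = network(GT)
Cap = project_and_capture(projection_image)

loss = torch.sqrt(torch.mean((Cap - GT) ** 2))   # RMSE(Cap, GT)
# loss.backward() would fail here (loss does not require grad), and in any
# case no gradient can reach `network`, because Cap was created off-graph.
```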
2021-03-09
I trained my transformer models in PyTorch. In the first few batches, the loss calculation and gradient updates were all performing well. However, the output of the model turned into NaN values after several iterations. I am confident that there is no flawed data in the dataset. Besides, it's not a classification problem; the labels are float numbers.
2021-03-10
Follow-up:
What an interesting story! When I ran this transformer model with a larger architecture (like 6 encoder layers, 8 heads, etc.), the NaN values disappeared. It seems that the gradient explosion only existed in tiny models.
Solutions:
I searched the PyTorch forum and Stack Overflow and found the exact reason for this NaN instance. First, since the NaN loss didn't appear at the very beginning, we can conclude that the model is probably well defined; the cause is more likely in the data or the training process. I ran torch.autograd.set_detect_anomaly(True) as suggested in https://discuss.pytorch.org/t/gradient-value-is-nan/91663/2. It returned: RuntimeError: Function 'StdBackward1' returned nan values in its 0th output.
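For reference, enabling anomaly detection is a one-liner placed before the training loop:

```python
import torch

# backward() will now raise a RuntimeError naming the backward op
# (e.g. 'StdBackward1') that produced the NaN, instead of silently
# propagating it through the rest of the graph.
torch.autograd.set_detect_anomaly(True)
```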
According to the similar question in https://discuss.pytorch.org/t/gradient-of-standard-deviation-is-nan/14713, I double-checked the output of each layer inside the transformer. Strangely, after dozens of iterations, the positional embedding layer output a vector full of zeros. As a result, the LayerNorm that does the normalization job cannot backpropagate the loss well, since it calculates the standard deviation, and the standard deviation has no gradient at zero (or you can say it's infinite)! A possible fix is to use x.std(unbiased=False) if you are using PyTorch.
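A tiny sketch of the failure mode: when a layer's activation collapses to a constant vector (all zeros here), the standard deviation is 0 and its gradient is 0/0, so backward() yields NaN:

```python
import torch

x = torch.zeros(10, requires_grad=True)  # a constant (all-zero) activation
s = x.std()                              # std is 0
s.backward()
print(x.grad)                            # all NaN: d(std)/dx is 0/0 here
```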
This is my encounter with NaN loss and MSE. I hope my experience can give you some insight when you run into this situation!
Related question: Deep-Learning Nan loss reasons
For what it's worth, I had this problem and it turned out that I had forgotten to initialize an embedding vector, so it was just whatever torch.empty() happened to come upon (likely a lot of zeros).
Recently, I have been working on quantization-aware training in TF 1.x to push a model to the Coral Dev Board. However, after I finished training the model, why are the fake-quantization min/max values of my 2 outputs the same?
Shouldn't they be different when one output's maximum target is 95 and the other's is 2pi?
I have figured out the problem. It occurs when that part of the model is not actually trained with QAT; the output node somehow gets skipped during quantization-aware training. The -6 and 6 values come from the defaults of TF 1.x quantization, as mentioned here.
To overcome the problem, we should provide an op that triggers QAT for the output nodes. In my regression case, I added a dummy op, tf.maximum(output, 0), to the model so that the node is quantized during training; see the sketch below. If your output is strictly between 0 and 1, applying a "sigmoid" activation at the output instead of relu can also solve the problem.
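A rough TF 1.x sketch of what I mean. The dense layers are just a placeholder model; the key part is the tf.maximum wrapper on the regression output before the QAT graph rewrite:

```python
import tensorflow as tf  # TF 1.x

x = tf.placeholder(tf.float32, [None, 4])
hidden = tf.layers.dense(x, 8, activation=tf.nn.relu)
output = tf.layers.dense(hidden, 1)        # regression head, no activation

# Dummy op so the QAT rewriter attaches a FakeQuant node to the output
# and learns its min/max instead of keeping the [-6, 6] defaults.
output = tf.maximum(output, 0.0)

tf.contrib.quantize.create_training_graph(
    input_graph=tf.get_default_graph(), quant_delay=0)
```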
I am training a deep autoencoder (for now 5 encoding layers and 5 decoding layers, using leaky ReLU) to reduce the dimensionality of the data from about 2000 dims to 2. I can train my model on 10k data, and the outcome is acceptable.
The problem arises when I use bigger data (50k to 1M). Using the same model with the same optimizer, dropout, etc. does not work, and the training gets stuck after a few epochs.
I am trying to do some hyper-parameter search on the optimizer (I am using adam), but I am not sure if this will solve the problem.
Should I look for something else to change/check? Does the batch size matter in this case? Should I solve the problem by fine-tuning the optimizer? Should I play with the dropout ratio? ...
Any advice is very much appreciated.
p.s. I am using Keras. It is very convenient. If you do not know about it, then check it out: http://keras.io/
I would have the following questions when trying to find a cause of the problem:
1) What happens if you change the size of the middle layer from 2 to something bigger? Does it improve the performance of the model trained on >50k training set?
2) Are 10k training examples and test examples randomly selected from 1M dataset?
My guess is that your model is simply not able to decompress your 50k-1M data from just 2 dimensions in the middle layer. So, it's easier for the model to fit its parameters on 10k data, and the activations from the middle layer are more sensible in that case, but for >50k data the activations are random noise.
After some investigation, I have realized that the layer configuration I am using is somewhat ill-suited to the problem, and this seems to cause at least part of the problem.
I have been using a sequence of layers for encoding and decoding. The layer sizes were chosen to decrease linearly, for example:
input: 1764 (dims)
hidden1: 1176
hidden2: 588
encoded: 2
hidden3: 588
hidden4: 1176
output: 1764 (same as input)
However, this seems to work only occasionally, and it is sensitive to the choice of hyperparameters.
I tried to replace this with exponentially decreasing layer sizes for encoding (and the reverse for decoding), so:
1764, 128, 16, 2, 16, 128, 1764
Now in this case the training seems to happen more robustly. I still have to run a hyperparameter search to see whether this one is sensitive or not, but a few manual trials seem to show its robustness.
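For reference, a minimal Keras sketch of this exponentially narrowing configuration (assuming LeakyReLU activations, an MSE reconstruction loss, and a reasonably recent Keras):

```python
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

# 1764 -> 128 -> 16 -> 2 -> 16 -> 128 -> 1764, LeakyReLU between layers
sizes = [128, 16, 2, 16, 128]
model = Sequential()
model.add(Dense(sizes[0], input_dim=1764))
model.add(LeakyReLU())
for s in sizes[1:]:
    model.add(Dense(s))
    model.add(LeakyReLU())
model.add(Dense(1764, activation='linear'))   # reconstruct the input

model.compile(optimizer='adam', loss='mse')
# model.fit(X, X, epochs=50, batch_size=256)  # train to reproduce the input
```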
I will post an update if I encounter some other interesting points.