What is wrong when training an autoencoder on the MNIST dataset with Caffe?

I want to use the MNIST dataset to train a simple autoencoder in Caffe with NVIDIA DIGITS.
I have:
caffe: 0.16.4
DIGITS: 5.1
Python 2.7
I use the network structure provided here:
https://github.com/BVLC/caffe/blob/master/examples/mnist/mnist_autoencoder.prototxt
Then I face two problems:
When I use the provided structure as-is, I get this error:
Traceback (most recent call last):
  File "digits/scheduler.py", line 512, in run_task
    task.run(resources)
  File "digits/task.py", line 189, in run
    self.before_run()
  File "digits/model/tasks/caffe_train.py", line 220, in before_run
    self.save_files_generic()
  File "digits/model/tasks/caffe_train.py", line 665, in save_files_generic
    'cannot specify two val image data layers'
AssertionError: cannot specify two val image data layers
When I remove the layer for "test-on-test", I get a bad result like this:
https://screenshots.firefox.com/8hwLmSmEP2CeiyQP/localhost
What is the problem?

The first problem occurs because the .prototxt has two layers named data with the TEST phase. The first layer that consumes data, i.e. flatdata, does not know which of the two to use (test-on-train or test-on-test). That's why the error no longer happens when you remove one of the TEST-phase data layers. Edit: I've checked the solver file, and it has test_state settings that should switch between the two test inputs, but that is clearly not working in your case.
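For reference, this is roughly how the linked BVLC example tells the two TEST-phase data layers apart: each carries a stage in its include rule, and the solver selects the stages via test_state. A sketch from those files (verify against your own copies, since DIGITS generates its own prototxt):
layer {
  name: "data"
  type: "Data"
  top: "data"
  include { phase: TEST stage: "test-on-train" }
  # (data_param omitted)
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  include { phase: TEST stage: "test-on-test" }
  # (data_param omitted)
}
# and in the matching solver:
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100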
The second problem is a little more difficult to solve, and my knowledge of autoencoders is limited. It seems your Euclidean loss changes very little over the iterations; I would check the base learning rate in your solver.prototxt, decrease it, and watch how the losses fluctuate.
Besides that, for the epochs/iterations that achieved a low error, have you checked the output data/images? Do they make sense?
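If you want to inspect the reconstructions, here is a minimal pycaffe sketch (untested; the deploy/weights paths and the load_mnist_image helper are placeholders, and decode1neuron is the sigmoid output blob name in the BVLC example net, so adjust to yours):
import numpy as np
import caffe

# Load the trained net in TEST mode (paths are placeholders).
net = caffe.Net('mnist_autoencoder_deploy.prototxt',
                'snapshot_iter_10000.caffemodel', caffe.TEST)

x = load_mnist_image()  # hypothetical helper returning a (1, 1, 28, 28) array in [0, 1]
out = net.forward(data=x.astype(np.float32))

# 'decode1neuron' is the sigmoid output blob in the BVLC example net.
reconstruction = out['decode1neuron'].reshape(28, 28)
print('mean per-pixel error:', np.abs(reconstruction - x.reshape(28, 28)).mean())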

Related

TPU training fails with certain metric, succeeds on CPU

I'm trying to train a simple EfficientNet-style model on some images. Training works fine on a CPU, but when I switch to a TPU I get the following error:
(0) Invalid argument: {{function_node __inference_train_function_38255}} Output shapes of then and else branches do not match: (s64[1,<=4]) vs. (s64[<=4])
  [[{{node cond}}]]
  [[TPUReplicate/_compile/_5430787790498024493/_4]]
  [[tpu_compile_succeeded_assert/_6318656678166656164/_5/_289]]
(1) Invalid argument: {{function_node __inference_train_function_38255}} Output shapes of then and else branches do not match: (s64[1,<=4]) vs. (s64[<=4])
  [[{{node cond}}]]
  [[TPUReplicate/_compile/_5430787790498024493/_4]]
  [[tpu_compile_succeeded_assert/_6318656678166656164/_5/_225]]
This error only occurs when I use a particular metric, Cohen's Kappa. If I remove this metric, the model trains fine.
I've tried to locate the offending section in CohensKappa and narrowed it down to _update_confusion_matrix: if I override both it and result, the model trains fine.
When I start training, I see this log message:
TPU has inputs with dynamic shapes: [<tf.Tensor 'Const:0' shape=() dtype=int32>, <tf.Tensor 'cond_8/Identity:0' shape=(None, 456, 456, 3) dtype=float32>, <tf.Tensor 'cond_8/Identity_1:0' shape=(None,) dtype=int64>]
This may be related; however, given that the model trains fine when I leave out the metric and I still see that log, it might be a red herring.
Any suggestions for solutions, or for how to debug this, would be very helpful. Switching to eager execution mode isn't an option, as everything works fine on the CPU.
Please share the code snippet that leads to this error. From what you have shown, you seem to have a tensor shape inconsistency problem (i.e. (s64[1,<=4]) vs. (s64[<=4])).
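In the meantime, one idea worth trying: an untested sketch, assuming the metric is tensorflow_addons' CohenKappa used with sparse integer labels, that forces the label and prediction tensors to a static rank before the metric's internals see them:
import tensorflow as tf
import tensorflow_addons as tfa

class FlatCohenKappa(tfa.metrics.CohenKappa):
    # Hypothetical wrapper: flatten inputs to rank 1 so both branches of any
    # internal tf.cond are traced with the same static rank on TPU.
    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.reshape(y_true, [-1])
        y_pred = tf.reshape(y_pred, [-1])
        return super().update_state(y_true, y_pred, sample_weight)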

Exception: Input blob arguments do not match net inputs

This question has been asked before, but in a different context, so please don't mark it as a duplicate.
I want to run a network's forward pass step by step: first forward up to some layer, then take its result, change it, and pass it on to the next layer. Here is the code:
forward_kwargs = {'data': blobs['data'].astype(np.float32, copy=False)}
blobs_out = net.forward(end='proposal', **forward_kwargs)
forward_kwargs = {'proposal': blobs_out['proposal']}  # forward() returns a dict of output blobs
blobs_out = net.forward(start='roi_pool_conv5', **forward_kwargs)
When I run this code, it gives the error
Exception: Input blob arguments do not match net inputs.
This error comes from the file pycaffe.py; the lines raising it are
if set(kwargs.keys()) != set(self.inputs):
    raise Exception('Input blob arguments do not match net inputs.')
This happens because in the prototxt file I have declared only two inputs, data and im_info. But I want to feed my network again from the roi_pool_conv5 layer, and when I pass that blob as a keyword argument, pycaffe checks whether it is among the net's inputs. Clearly it is not. I cannot declare it as an input because I am unsure of its dimensions. Any workaround for this?
I think your problem is that you don't know the dimensions of proposal in advance.
If so, just fill in dummy dimensions in the prototxt file and reshape the blob before you forward.
After your program starts running, the batch size is fixed, right?
Then you can reshape your roi_pool_conv5 input and run your network!
I hope this answer is helpful to you :)
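Concretely, one workaround along these lines (an untested sketch using the blob and layer names from the question; paths are placeholders) is to avoid the input kwargs entirely: write the modified values into the blob yourself and call forward with only the start/end arguments, which bypasses the failing check in pycaffe.py.
import numpy as np
import caffe

net = caffe.Net('net.prototxt', 'weights.caffemodel', caffe.TEST)

# First half: forward up to the 'proposal' layer.
out = net.forward(end='proposal', data=blobs['data'].astype(np.float32, copy=False))
proposal = out['proposal'].copy()

# ... modify `proposal` here ...

# Second half: write the blob directly, then continue from 'roi_pool_conv5'.
# forward() without input kwargs skips the set(kwargs) != set(inputs) check.
net.blobs['proposal'].reshape(*proposal.shape)
net.blobs['proposal'].data[...] = proposal
net.forward(start='roi_pool_conv5')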

Tf-slim: ValueError: Variable vgg_19/conv1/conv1_1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

I am using tf-slim to extract features from several batches of images. The problem is that my code works for the first batch; after that I get the error in the title. My code is something like this:
for i in range(0, num_batches):
    # Obtain the starting and ending image numbers for each batch
    batch_start = i * training_batch_size
    batch_end = min((i + 1) * training_batch_size, read_images_number)
    # Obtain the images for the batch
    images = preprocessed_images[batch_start:batch_end]
    with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
        _, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
    init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'), slim.get_model_variables('vgg_19'))
    feature_conv_2_2 = end_points['vgg_19/pool5']
So as you can see, in each iteration I select a batch of images and use the VGG-19 model to extract features from the pool5 layer. But after the first iteration I get the error on the line where I obtain the end points. One solution I found on the internet is to reset the graph each time, but I don't want to do that, because later in the code I have some weights in my graph that I train using these extracted features, and I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that: you are trying to build the same graph twice.
So it should be (in pseudocode):
create_graph()
load_checkpoint()
for each batch:
    process_data()
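Fleshed out for the code in the question, an untested sketch (assuming TF1-style tf-slim and reusing the variables from your snippet; the 224x224 placeholder shape is an assumption, so match it to your preprocessing):
import os
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import vgg

# Build the graph exactly once, with a placeholder instead of a fixed batch.
images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
with slim.arg_scope(vgg.vgg_arg_scope()):
    _, end_points = vgg.vgg_19(images_ph, num_classes=1000, is_training=False)
features = end_points['vgg_19/pool5']
init_fn = slim.assign_from_checkpoint_fn(
    os.path.join(checkpoints_dir, 'vgg_19.ckpt'),
    slim.get_model_variables('vgg_19'))

with tf.Session() as sess:
    init_fn(sess)
    for i in range(num_batches):
        batch = preprocessed_images[i * training_batch_size:
                                    min((i + 1) * training_batch_size, read_images_number)]
        feats = sess.run(features, feed_dict={images_ph: batch})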

Caffe: Print the softmax score

In the given MNIST example of the Caffe installation:
For any given test image, how do I get the softmax scores for each category and do some processing on them, say, compute their mean and variance?
I am a newbie, so details would help me a lot. I am able to train the model and use the testing feature to get predictions, but I am not sure which files need to be edited in order to get the above results.
You can use the Python interface:
import caffe
net = caffe.Net('/path/to/deploy.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
in_ = read_data(...)  # this is up to you: read a sample and convert it to a numpy array
out_ = net.forward(data=in_)  # assuming your net expects a blob named "data"
Now you have the output of your net in the dictionary out_ (its keys are the names of the output blobs). You can run this in a loop over several examples, etc.
I can try to answer your question. Assuming that in your deploy net the softmax layer looks like this:
layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc6"
  top: "prob"
}
In the Python code that processes your data, building on the code @Shai provided, you can get the probability of each category by adding:
predicted_prob = net.blobs['prob'].data
predicted_prob will be an array containing the probabilities of all categories.
For example, if you only have two categories, predicted_prob[0][0] will be the probability that this test sample belongs to one category and predicted_prob[0][1] the probability of the other one.
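Since the question asks for the mean and variance of the scores, a small follow-on sketch (reusing the net and the "prob" blob name from above):
import numpy as np

probs = net.blobs['prob'].data[0]  # softmax scores for the first sample in the batch
print('scores:', probs)
print('mean: %f, variance: %f' % (probs.mean(), probs.var()))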
PS:
If you don't want to write any additional Python script: according to https://github.com/BVLC/caffe/tree/master/examples/mnist, this example automatically runs testing every 500 iterations, where "500" is defined in the solver, e.g. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
So you would need to trace through the Caffe source code that processes the solver file. I guess it should be https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp
I am not sure solver.cpp is the right file to look at, but in it you can see functions for testing and for calculating some values. I hope this gives you some ideas if no one else answers your question.

Check failed: registry.count(type) == 0 (1 vs. 0) Layer type Split already registered

I have created a Python layer for data augmentation which worked well with DIGITS, but when I train the network from the terminal on Ubuntu 14.04, I get this error:
I1130 16:29:56.155732 18230 layer_factory.hpp:77] Creating layer aug_data
F1130 16:29:56.220578 18230 layer_factory.hpp:69] Check failed: registry.count(type) == 0 (1 vs. 0) Layer type Split already registered.
where aug_data is the custom Python layer. I have made the changes in the configuration file to accept the Python layer, but I think there is something wrong with how the layers are linked that I could not fix. I cannot use DIGITS here, as my data is hyperspectral while DIGITS accepts only grayscale or RGB images.
Any help would be appreciated.
According to your prototxt file, you should be able to run "from digits_python_layers import AugmentationLayer". Does this work (from any directory)?
Old answer:
Your new layer should return something other than "Split" for its layer type (via its type() function).
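For reference, the standard way to declare such a Python layer in the prototxt looks like this (a sketch using the module and class names from the answer above; the bottom/top names are assumptions):
layer {
  name: "aug_data"
  type: "Python"
  bottom: "data"
  top: "aug_data"
  python_param {
    module: "digits_python_layers"
    layer: "AugmentationLayer"
  }
}
Note that the declared type here is "Python"; per the old answer above, whatever type string your layer itself reports must not collide with a built-in layer type such as "Split".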