Why are the layer weights null and TRT not finding its cache when trying to run NVIDIA's tutorial code on a Jetson TX2?

I'm trying to run the tutorial code from nvidia's repo here.
Here's what happens with the console imagenet program on my Jetson TX2:
nvidia@tegra-ubuntu:~/jetson-inference/build/aarch64/bin$ ./imagenet-console orange_0.pjg output_0.jpg
imagenet-console
args (3): 0 [./imagenet-console] 1 [orange_0.pjg] 2 [output_0.jpg]
imageNet -- loading classification network model from:
-- prototxt networks/googlenet.prototxt
-- model networks/bvlc_googlenet.caffemodel
-- class_labels networks/ilsvrc12_synset_words.txt
-- input_blob 'data'
-- output_blob 'prob'
-- batch_size 2
[TRT] TensorRT version 4.0.2
[TRT] attempting to open cache file networks/bvlc_googlenet.caffemodel.2.tensorcache
[TRT] cache file not found, profiling network model
[TRT] platform has FP16 support.
[TRT] loading networks/googlenet.prototxt networks/bvlc_googlenet.caffemodel
Weights for layer conv1/7x7_s2 doesn't exist
[TRT] CaffeParser: ERROR: Attempting to access NULL weights
Weights for layer conv1/7x7_s2 doesn't exist
[TRT] CaffeParser: ERROR: Attempting to access NULL weights
[TRT] Parameter check failed at: ../builder/Network.cpp::addConvolution::40, condition: kernelWeights.values != NULL
error parsing layer type Convolution index 1
[TRT] failed to parse caffe network
failed to load networks/bvlc_googlenet.caffemodel
failed to load networks/bvlc_googlenet.caffemodel
imageNet -- failed to initialize.
imagenet-console: failed to initialize imageNet
I do not have Caffe installed on the Jetson board, as the tutorial specifically states it is not needed. I'm also not sure whether the null-weights error would go away if TensorRT could find its cache. Any ideas?
Python 2.7
Cuda 9.0
TensorRT 4.0

The corporate firewall was blocking the download of the models. Downloading the models manually and putting them in the networks folder solved the problem.
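For anyone hitting the same symptom, a quick sanity check is to confirm that the files in the networks folder are real model files rather than the small HTML error pages a proxy or firewall often returns. A minimal Python sketch (the file names come from the log above; the heuristic is mine, on the assumption that bvlc_googlenet.caffemodel should be roughly 50 MB of binary protobuf data):

import os

# Heuristic check: a blocked download often leaves a tiny HTML error page
# (or an empty file) in place of the real prototxt/caffemodel.
for name in ("networks/googlenet.prototxt", "networks/bvlc_googlenet.caffemodel"):
    size = os.path.getsize(name)
    with open(name, "rb") as f:
        head = f.read(32).lstrip().lower()
    if size == 0 or head.startswith(b"<!doctype") or head.startswith(b"<html"):
        print("%s (%d bytes): looks like a blocked or empty download - fetch it manually" % (name, size))
    else:
        print("%s (%d bytes): header looks plausible" % (name, size))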

Related

CUDA Error: out of memory (Yolov4 custom model training)

I am trying to train a custom model with pretrained weights in darknet, using the YOLOv4 algorithm. After the model loads successfully, I get a CUDA out-of-memory error, as shown below.
896 x 896
Create 6 permanent cpu-threads
Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: D:\darknet\src\dark_cuda.c : cuda_make_array() : line: 492 : build time: Jan 21 2022 - 16:57:15
CUDA Error: out of memory
As suggested in the error, I even changed subdivisions=64 in the configuration file, but I am still getting the same error. I have tried various combinations of batch and subdivisions, but I have been unable to solve this issue.
I am using CUDA 10.1 and an NVIDIA GTX 1050.
A snapshot of my configuration file was attached as an image.
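For reference, the memory-relevant knobs in a darknet cfg are the width, height, batch, and subdivisions entries in the [net] section; an 896x896 input resolution is very large for a 4 GB GTX 1050. A minimal Python sketch of patching those values (the file names and numbers below are purely illustrative, not taken from the question):

import re

def patch_cfg(src_path, dst_path, overrides):
    # Rewrite the first occurrence of each key, which sits in the [net] section.
    with open(src_path) as f:
        text = f.read()
    for key, value in overrides.items():
        text = re.sub(r"(?m)^%s\s*=.*$" % key, "%s=%s" % (key, value), text, count=1)
    with open(dst_path, "w") as f:
        f.write(text)

# A smaller network resolution is usually the biggest memory saver;
# subdivisions=64 with batch=64 means one image per GPU pass.
patch_cfg("yolov4-custom.cfg", "yolov4-custom-lowmem.cfg",
          {"width": 416, "height": 416, "batch": 64, "subdivisions": 64})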

Error after predicting using CNN model loaded from disk

I loaded my CNN model using
model = load_model('./model/model.h5')
This model loads fine (though I do get the warning message: "Call initializer instance with the dtype argument instead of passing it to the constructor").
However, when I try to predict using this model, I get the following error:
UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d/Conv2D}}]]
[[dense_2/Softmax/_273]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored
Any idea how to overcome this issue?
Apparently, I just didn't have CUDA installed on my system. Did that and now it's working!
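For what it's worth, a quick way to check whether TensorFlow can actually see CUDA and a GPU before predicting is the sketch below (it assumes TensorFlow 2.x; the memory-growth setting is a common workaround when cuDNN fails to initialize even though a GPU is present):

import tensorflow as tf

# Confirm the installed TensorFlow build has CUDA support and sees a GPU.
print("Built with CUDA:", tf.test.is_built_with_cuda())
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Common workaround for "Failed to get convolution algorithm": let TensorFlow
# grow GPU memory on demand instead of grabbing it all up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)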

A simple distributed training python program for deep learning models by Horovod on GPU cluster

I am trying to run some example Python 3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on a Databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark 2.4.5, GPU
It is for distributed deep learning model training.
I just tried a very simple example at first:
from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)

def train():
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')

hr.run(train)
But the command just keeps running without making any progress. The only output is:
HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.
### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.
Am I missing something, or do I need to set something up to make it work? Thanks.
Your code does no actual training in it; you might have better luck with the more complete example code:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
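As a hedged illustration only (not copied from the linked docs), a train() function that actually exercises Horovod with tf.keras might look roughly like the sketch below; the tiny MNIST model and the hyperparameters are placeholders:

from sparkdl import HorovodRunner

def train():
    # Import inside the function so the modules exist on every worker process.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Placeholder data and model, purely for illustration.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Wrap the optimizer so gradients are averaged across the workers.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # Keep the workers' initial weights in sync; log only from rank 0.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=128, epochs=1, callbacks=callbacks,
              verbose=2 if hvd.rank() == 0 else 0)

hr = HorovodRunner(np=2)
hr.run(train)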

Regarding Caffe: caffe.ParamSpec has no field named compression

I was trying to run a customized model on Caffe. Unfortunately, I was only provided with a trainval.prototxt and a trainval.caffemodel.
The exact error is as follows
Error parsing text-format caffe.NetParameter: 54:17: Message type "caffe.ParamSpec" has no field named "compression".
This is followed by
[upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile failed to parse param file
A similar question was asked here.
So should I assume that the version of Caffe on my system is different from the client's, and that the client's Caffe has a slightly different proto definition (one that adds a compression field to ParamSpec)?
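One way to confirm that suspicion (a sketch, assuming pycaffe and its compiled protobuf bindings are importable) is to list the fields your local caffe.proto actually defines for ParamSpec; if compression is missing, the prototxt was written for a custom Caffe fork:

from caffe.proto import caffe_pb2

# Inspect the ParamSpec message compiled into the locally installed Caffe.
fields = caffe_pb2.ParamSpec.DESCRIPTOR.fields_by_name
print(sorted(fields.keys()))
print("has a 'compression' field:", "compression" in fields)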

ValueError: Unknown initializer: GlorotUniform

Recently I trained a model on the MNIST dataset in Google Colab and saved the weights using Model.save('model.h5').
I downloaded the weights and tried to run them in other code by loading them offline in Anaconda with Model = keras.models.load_model('model.h5').
But it throws
ValueError: Unknown initializer: GlorotUniform
The issue could be that you're using tf.keras and keras in a mixed way. There could also be a version mismatch between your local and remote Keras versions. Do check this discussion on stackoverflow.
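A minimal sketch of the two usual workarounds (which one applies depends on which Keras flavour saved the model in Colab):

# Option 1: load with the same Keras implementation that saved the model,
# e.g. tf.keras if the Colab notebook used tf.keras.
import tensorflow as tf
model = tf.keras.models.load_model('model.h5')

# Option 2: if you must use standalone keras, map the unknown initializer
# explicitly via custom_objects (shown as comments to keep this runnable):
# from keras.models import load_model
# from keras.initializers import glorot_uniform
# model = load_model('model.h5', custom_objects={'GlorotUniform': glorot_uniform})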