Neural network not training, parameters.grad is None - deep-learning

I implemented a simple NN and a custom objective function for a minimization problem. Everything seems to be working fine, except for the fact that the model does not seem to learn.
I checked if list(network.parameters())[0].grad was None, and indeed this seems to be the problem. Based on previously asked questions, the problem seems to be the detachment of the graph, but I don't know what I am doing wrong.
Here's the link to the code that you can run on colab: Colab code
Thank you!!

This part seems problematic in your code.
output_grad, _ = torch.autograd.grad(q, (x,y))
output_grad.requires_grad_()
Your loss depends on output_grad, so when you call loss.backward() you are trying to backpropagate through output_grad to the parameters.
You cannot backpropagate through output_grad because create_graph is False by default, so output_grad is implicitly detached from the rest of the graph. To fix this, just pass create_graph=True to autograd.grad. You also do not need to call requires_grad_() on output_grad, i.e., the second line is not needed.
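A minimal sketch of the fix, assuming q is computed from inputs x and y that require gradients (the toy q and loss below are stand-ins, not your actual objective):

import torch

x = torch.randn(10, requires_grad=True)
y = torch.randn(10, requires_grad=True)
q = (x ** 2 * y).sum()  # stand-in for q; its gradient w.r.t. x depends on both x and y

# create_graph=True keeps output_grad attached to the autograd graph,
# so a loss built from it can still backpropagate to the leaves/parameters
output_grad, _ = torch.autograd.grad(q, (x, y), create_graph=True)

loss = (output_grad ** 2).mean()  # any loss that depends on output_grad
loss.backward()
print(x.grad is None)  # False: gradients now reach the inputs/parameters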

Related

pass loss function and metrics in config

In the official example, both metrics and loss function are hard coded. I am wondering if we can pass those in the config jsonnet, so I can reuse my model in different datasets with different metrics.
I knew I had seen that question before. Copy and paste from GitHub:
Metric is registrable, so you can easily add a parameter to your model of type List[Metric], and then specify the metrics in Jsonnet. You'll have to make sure those metrics all take exactly the same input.
For the loss, this is a little bit harder. You would create your own Registrable base class, and then implement the losses you want to use this way. You can use the Metric class as an example of how to do this. It would be a bit of typing work, but not difficult.
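A rough sketch of what that could look like (Loss, CrossEntropyLoss, and my_model are hypothetical names, not part of AllenNLP):

from typing import List

import torch
from allennlp.common import Registrable
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.training.metrics import Metric


class Loss(Registrable, torch.nn.Module):
    """Hypothetical Registrable base class for losses, mirroring how Metric works."""


@Loss.register("cross_entropy")
class CrossEntropyLoss(Loss):
    def forward(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.cross_entropy(logits, labels)


@Model.register("my_model")
class MyModel(Model):
    # metrics and loss now come from the Jsonnet config instead of being hard-coded
    def __init__(self, vocab: Vocabulary, metrics: List[Metric], loss: Loss) -> None:
        super().__init__(vocab)
        self._metrics = metrics
        self._loss = loss

In the Jsonnet config you could then write something like "metrics": [{"type": "categorical_accuracy"}] and "loss": {"type": "cross_entropy"}, and swap them per dataset.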

DQN: Access to raw observations after conversion of observation space into a Box environment?

I'm adapting the PyTorch code from Tabor's course on DQNs (https://github.com/philtabor/Deep-Q-Learning-Paper-To-Code) to work with the vizdoomgym library, having previously managed to make a version work in TF.
After training my agent, I want to visualize its performance as an .mp4 video. Previously, I used the sk-video library to record the agent at play, as the built-in Monitor class did not work with the vizdoomgym library. This was achieved by simply saving each observation into an image array.
I have encountered a problem: the code I am following invokes wrappers to convert the observation space into a Box environment, so the images are in effect distorted. These wrappers can be found in the utils.py file, with the main method shown below:
def make_env(env_name, shape=(84, 84, 1), repeat=4, clip_rewards=False,
             no_ops=0, fire_first=False):
    env = gym.make(env_name)
    env = RepeatActionAndMaxFrame(env, repeat, clip_rewards, no_ops, fire_first)
    env = PreprocessFrame(shape, env)
    env = StackFrames(env, repeat)
    return env
I notice that the preprocessing wrappers inherit the observation method, meaning that I should be able to access the observations and store them prior to preprocessing. However, I am not familiar with the memory-management implications of such a solution, or whether it is even feasible. An alternative approach would be to try to "restore" the observations from their distorted representations back into their original form, but that does not seem feasible.
Any advice is appreciated.
As suspected, the preprocessing wrapper can be used to save the frames into an image array before the preprocessing functions run.
This image array can then be converted into an .mp4 video using the sk-video library.
However, a separate version of the wrapper had to be built for this due to the risk of memory overflow, so the method is not ideal.
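A rough sketch of what such a wrapper could look like (RawFrameRecorder and make_env_with_recorder are hypothetical names; the bounded deque is what keeps memory in check):

import gym
from collections import deque
from utils import RepeatActionAndMaxFrame, PreprocessFrame, StackFrames  # from the course repo


class RawFrameRecorder(gym.Wrapper):
    """Keeps a bounded buffer of raw observations before any preprocessing runs."""

    def __init__(self, env, max_frames=10000):
        super().__init__(env)
        self.frames = deque(maxlen=max_frames)  # old frames are dropped automatically

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.frames.append(obs)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return obs, reward, done, info


def make_env_with_recorder(env_name, shape=(84, 84, 1), repeat=4, clip_rewards=False,
                           no_ops=0, fire_first=False):
    env = gym.make(env_name)
    recorder = RawFrameRecorder(env)  # innermost wrapper: sees the unprocessed observations
    env = RepeatActionAndMaxFrame(recorder, repeat, clip_rewards, no_ops, fire_first)
    env = PreprocessFrame(shape, env)
    env = StackFrames(env, repeat)
    return env, recorder  # keep the recorder around so recorder.frames can be read later

After an episode, something like skvideo.io.vwrite('agent.mp4', np.stack(recorder.frames)) can then turn the buffer into a video.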

How to debug a system error in Google Data Studio?

I'm working on a community connector. My fields are getting pulled in properly, but when I try to use the connector in a report I get the error below.
I'm less concerned about the specific error and more concerned with how I can even figure out what is going on or breaking on the server.
Has anyone figured out good ways to debug these errors?
After a lot of research and digging around, I found that it came down to an invalid schema, ugh. I noticed that getData wasn't even being run when trying to use data in a report, which made me think it was failing elsewhere.
In my case I was using a prebuilt JSON object and passing that to the schema field for Data Studio.
Unfortunately, Google provides no feedback for misconfigured JSON schemas.
I simplified the schema and found the issue was an incorrect data type. Once I fixed this, everything worked :)
This method is mentioned here, and even the Google Data Studio documentation says it is hard to debug.
https://developers.google.com/datastudio/connector/semantics
It's nice to have the schema separate from the code, but be careful that the schema is correct; otherwise you'll run into this very generic issue, at least until they add more logging in this area.
Hope this helps someone!
Sharing my discovery here just in case: I ran into the same error, but in my case the problem was occurring in a pie chart, while the same metric combined with another segmentation did not show the problem.
After some testing I found which case generates the problem: this metric, under a specific segmentation, produces a negative total value, which pie charts cannot display.
My field has both string and numeric values, but that has not been a problem (so far at least). To solve my problem I created a new field in which negative values are replaced by 0 (this does not affect the insight I need, because I do not need to track or show those values in a pie chart).
So my suggestion is to check whether the error happens with a specific segmentation or all the time. Within that segmentation, create filters and display each segment on its own; if the problem occurs, you will discover which segment triggers the error. Then check whether the field you are summing, averaging, or otherwise aggregating has negative values. If it does, a pie chart cannot display it, and you will need to filter, replace, or otherwise transform those values.

INFO:tensorflow:Summary name conv2d_1/kernel:0 is illegal

I am trying to use the tensorboard callback in keras. When I run the pretrained inceptionv3 model with the tensorboard callback I am getting the following warning:
INFO:tensorflow:Summary name conv2d_95/kernel:0 is illegal; using conv2d_95/kernel_0 instead.
I saw a comment on GitHub addressing this issue. In that comment, SeaFX pointed out that he solved it by replacing variable.name with variable.name.replace(':', '_'). I am unsure how to do that. Can anyone please help me? Thanks in advance :)
I'm not sure how to get the name replacement to work, but a workaround that may be sufficient for your needs is:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.WARN)
import keras
This will turn off all INFO level logging but keep warnings, errors etc.
See this question for a discussion of the various log levels and how to change them. Personally, I found that setting the TF_CPP_MIN_LOG_LEVEL environment variable didn't work in a Jupyter notebook, but I haven't tested it in base Python.
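For reference, a sketch of how the environment-variable approach is usually combined with the Python-side setting (the variable must be set before TensorFlow is imported, which may be why it has no effect in an already-running notebook):

import os
# Controls C++-side logging only, so the Python-side "Summary name ... is illegal"
# INFO message may still appear; '1' filters INFO, '2' also WARNING, '3' also ERROR.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.WARN)  # Python-side logging, as above

import keras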

Deploy network is missing a Softmax layer

I tried to use a pretrained model (VGG 19) in DIGITS but I got this error.
ERROR: Your deploy network is missing a Softmax layer! Read the
documentation for custom networks and/or look at the standard networks
for examples
I am testing with my own dataset, which has only two classes.
I read this and this and tried to modify the last layer, but I still got an error. How can I modify the layers for a new dataset?
When I tried to modify the last layer I got this error:
ERROR: Layer 'softmax' references bottom 'fc8' at the TRAIN stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.
You're having a problem because you're trying to upload a "train/val" network when you really need to be uploading an "all-in-one" network. Unfortunately, we don't document this very well. I've created an RFE to remind us to improve the documentation.
Try to adjust the last layers in your network to look something like this: https://github.com/NVIDIA/DIGITS/blob/v4.0.0/digits/standard-networks/caffe/lenet.prototxt#L162-L184
For more information, here is how I've proposed updating Caffe's example networks to all-in-one nets, and here is how I updated the default DIGITS networks to be all-in-one nets.
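For reference, the end of an all-in-one network looks roughly like this (a prototxt sketch following the lenet.prototxt pattern linked above; "fc8" stands in for whatever your final fully-connected layer is called):

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
  exclude { stage: "deploy" }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "fc8"
  top: "softmax"
  include { stage: "deploy" }
}

With the include/exclude stage directives, the Softmax layer exists only in the deploy network and the loss/accuracy layers are kept out of it, which addresses both error messages above.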