QAT output nodes for quantized model got the same min/max range

Recently, I have been working on quantization-aware training (QAT) in tf1.x to push a model to the Coral Dev Board. However, after I finished training the model, the min/max ranges of the fake-quantization nodes for my 2 outputs are the same.
Shouldn't they be different, when one output's maximum target is 95 and the other's is 2pi?

I have figured out the problem: that part of the model was not actually trained with QAT. This happens when an output node is somehow skipped by QAT during training. The -6 and 6 values come from the default quantization range of tf1.x, as mentioned here.
To overcome the problem, we should add an op that triggers QAT for the output nodes. In my regression case, I added a dummy op, tf.maximum(output, 0), to the model so that the output node gets fake-quantized. If your output is strictly between 0 and 1, applying a "sigmoid" activation at the output instead of relu can also solve the problem.
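For illustration, a rough tf1.x sketch of that workaround (build_model, features, and labels are placeholders for your own graph; the key point is the tf.maximum ops on the two regression heads before the graph is rewritten for QAT):

import tensorflow as tf  # tf1.x

# build_model() stands in for your own network with two regression heads
out_a, out_b = build_model(features)

# Dummy relu-like ops so the rewriter attaches fake-quant (min/max) nodes
# to the output tensors; without them the outputs keep the default -6..6 range.
out_a = tf.maximum(out_a, 0)   # e.g. the head whose target maximum is 95
out_b = tf.maximum(out_b, 0)   # e.g. the head whose target maximum is 2*pi

loss = tf.losses.mean_squared_error(labels, tf.stack([out_a, out_b], axis=-1))

# Rewrite the graph with fake-quant ops before building the train op
tf.contrib.quantize.create_training_graph(input_graph=tf.get_default_graph(),
                                          quant_delay=0)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)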

Related

Why is `ep_rew_mean` much larger than the reward evaluated by the `evaluate_policy()` function?

I wrote a custom gym environment and trained it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by tensorboard is as follows:
(figure: the ep_rew_mean curve over 100 million steps in total; each episode has 50 steps)
As shown in the figure, the reward is around 15.5 after training and the model converges. However, when I use the function evaluate_policy() on the trained model, the reward is much smaller than the ep_rew_mean value. The first value is the mean reward, the second is the std of the reward:
4.349947246664763 1.1806464511030819
The way I use the function evaluate_policy() is:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
According to my understanding, the initial environment is randomly distributed in an area when using the reset() function, so there should not be an overfitting problem.
I have also tried different learning rates and other parameters, but this did not solve the problem.
I have checked my environment, and I think there is no error.
I have searched the internet, read the stable-baselines3 docs and the issues on GitHub, but did not find a solution.
evaluate_policy has deterministic set to True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).
If you sample from the distribution during training, it may help to evaluate the policy without it selecting actions with an argmax, by passing deterministic=False.
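For example, keeping the same call from the question but sampling stochastic actions instead of taking the deterministic action:

from stable_baselines3.common.evaluation import evaluate_policy

# deterministic=False samples actions from the policy distribution,
# matching what PPO does during training rollouts
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000,
                                          deterministic=False)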

LSTM Evolution Forecast

I am confused about how LSTM networks work when the forecasting horizon is not fixed; rather, I am looking for a prediction at an arbitrary time in the future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) that I want to forecast, and some external inputs $u_1(t), u_2(t), \cdots, u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as an input for the network, so that schematically I have something like (for simplicity, consider just lag 1 for the output and no lag for the external inputs):
$[y(t-1), u_1(t), u_2(t), \cdots, u_N(t)] \to y(t)$
With this view of the network, when one wants to do a recursive forecast, the predicted value at the previous step must be used as input for the next step. This causes error propagation that makes the long-term forecast behave badly.
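To make the recursive scheme concrete, here is a rough sketch in Python (predict_one_step stands in for whatever fitted network maps the lagged output and current exogenous inputs to the next output):

# Recursive (iterated) one-step-ahead forecasting with a lag-1 output.
def recursive_forecast(predict_one_step, y_last, u_future):
    """y_last: last observed output; u_future: exogenous inputs for each future step."""
    forecasts = []
    y_prev = y_last
    for u_t in u_future:
        y_hat = predict_one_step(y_prev, u_t)  # uses the *predicted* previous value...
        forecasts.append(y_hat)
        y_prev = y_hat                         # ...so errors compound over the horizon
    return forecasts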
Now, my confusion is this: I think of an RNN as a kind of (simple) implementation of a state space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These variables are hidden and not observed.
So now the question: if there is this kind of variable already taking previous states of the system into account, why would I need to use the lagged output value as input to my network/model?
If I get rid of it, would my long-term forecast be better, since I would no longer expect the error of the forecasted output to propagate? (I guess there will be some error propagating in the internal state anyway.)
Thanks!
Please see DeepAR, an LSTM forecaster that predicts more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN architecture for probabilistic forecasting, incorporating a negative binomial likelihood for count data as well as special treatment for the case when the magnitudes of the time series vary widely; (2) we demonstrate empirically on several real-world data sets that this model produces accurate probabilistic forecasts across a range of input characteristics, thus showing that modern deep learning-based approaches can effectively address the probabilistic forecasting problem, which is in contrast to common belief in the field and the mixed results...
In this paper, they forecast multiple steps into the future, to counter exactly what you describe here: the error propagation.
Skipping several steps ahead allows one to get more accurate predictions further into the future.
One more thing done in this paper is predicting percentiles and interpolating, rather than predicting the value directly. This adds stability and an error assessment.
Disclaimer - I read an older version of this paper.
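As a rough PyTorch-style sketch of the "multiple steps at once, at several percentiles" idea (this is only an illustration of a direct multi-horizon setup, not the DeepAR architecture itself; all names and sizes are made up):

import torch
import torch.nn as nn

class DirectMultiStepLSTM(nn.Module):
    """Maps a window of past (y, u) values to H future steps at Q quantiles at once,
    so no predicted value is ever fed back in as an input."""
    def __init__(self, n_features, horizon, quantiles=(0.1, 0.5, 0.9), hidden=64):
        super().__init__()
        self.quantiles = quantiles
        self.horizon = horizon
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * len(quantiles))

    def forward(self, x):                     # x: (batch, lookback, n_features)
        _, (h_n, _) = self.lstm(x)
        out = self.head(h_n[-1])              # (batch, horizon * n_quantiles)
        return out.view(-1, self.horizon, len(self.quantiles))

def pinball_loss(pred, target, quantiles):
    """Quantile (pinball) loss; target: (batch, horizon)."""
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[..., i]
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return sum(losses)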

Transformer model outputs NaN values in PyTorch

2021-03-09
I trained my transformer models in PyTorch. In the first few batches, the loss calculation and gradient updates all performed well. However, the output of the model turned into NaN values after several iterations. I am confident that there is no flawed data in the dataset. Besides, it's not a classification problem; the labels are float numbers.
2021-03-10
Follow-up:
What an interesting story! When I ran this transformer model with a larger architecture (like 6 encoder layers, 8 heads, etc.), the NaN values disappeared. It seems that the gradient explosion only existed in the tiny models.
Solutions:
I searched the PyTorch forum and Stack Overflow and found the exact reason for this NaN issue. First, since the NaN loss didn't appear at the very beginning, we can conclude that the model is probably well defined; the cause is more likely the data or the training process. I ran torch.autograd.set_detect_anomaly(True) as suggested in https://discuss.pytorch.org/t/gradient-value-is-nan/91663/2. It returned: RuntimeError: Function 'StdBackward1' returned nan values in its 0th output.
Following the similar question at https://discuss.pytorch.org/t/gradient-of-standard-deviation-is-nan/14713, I double-checked the output of each layer inside the transformer. Strangely, after dozens of iterations, the positional embedding layer output a vector full of zeros. As a result, the LayerNorm that does the normalization cannot backpropagate the loss well, since it computes the standard deviation, and the standard deviation has no gradient at zero (or you can say it's infinite)! A possible fix is to use x.std(unbiased=False) if you are using PyTorch.
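A condensed sketch of those checks (check_activation is just a hypothetical debugging helper, not part of any library):

import torch

# Point the traceback at the op (here StdBackward1) that produced the NaN gradient
torch.autograd.set_detect_anomaly(True)

def check_activation(name, x):
    """Quick sanity check on intermediate layer outputs while debugging."""
    if torch.isnan(x).any():
        print(f"{name}: contains NaN")
    if (x == 0).all():
        print(f"{name}: all zeros - LayerNorm's std will be 0 here")

# If you compute a standard deviation yourself, the biased version mentioned above is:
# sigma = x.std(dim=-1, keepdim=True, unbiased=False)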
That's my encounter with NaN loss and MSE. I hope my experience gives you some insight if you run into this situation!
Related question: Deep-Learning Nan loss reasons
For what it's worth, I had this problem and it turned out that I had forgotten to initialize an embedding vector, so it was just whatever torch.empty() happened to come upon (likely a lot of zeros).

Is it possible to feed the output back to the input in an artificial neural network?

I am currently designing an artificial neural network for a problem with a decay curve.
For example, building a model for predicting the durability of some material. The inputs may include environmental conditions like temperature and humidity.
However, these alone are not adequate for predicting the durability of the material. For such a problem, I think it is better to use the output durability of previous time slots as one of the current inputs when predicting the durability of the next time slot.
However, I do not know how to train a model which feeds the output back to the input, since that input column has only its initial value before training.
For this case,
Method 1 (fail)
I tried to fill the predicted output durability of the current row into the input durability of the next row. However, doing so prevents the model from running "loss.backward()", so we cannot compute and update the gradients. The gradient function became "CopySlices" instead of "MSELoss" when I copied the predicted output to the next row of the input data.
(screenshots: feeding the output back to the input; the tensor's gradient function shows CopySlices)
Method 2 "fill the input column with expected output"
In this method, I fill the blank input column with the expected output of the previous row (row-1) before training the model. Filling the input column with the expected output of the previous row is only done for training; for real prediction, I will feed the predicted output back to the input. In this case, I managed to train an overfitting model with MSELoss.
However, I do not believe this is the right method, since it uses the expected output as input no matter how bad the predictions are. I strongly believe it is not the right approach.
Therefore, I want to ask whether it is possible to feed the output back to the input in a regression problem using an artificial neural network.
I apologize for not uploading any code here, as it is not convenient for me to share the full code. It may be confidential.
It looks like you need an RNN (recurrent neural network). This tutorial is pretty helpful for understanding an RNN: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
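For illustration, a minimal PyTorch sketch of that idea (DurabilityRNN and the feature layout are made up; the LSTM's hidden state carries the past, so no prediction has to be copied back into an input column):

import torch
import torch.nn as nn

class DurabilityRNN(nn.Module):
    """Predicts durability at each time slot from temperature/humidity inputs,
    carrying history in the LSTM hidden state instead of a copied input column."""
    def __init__(self, n_inputs=2, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, n_inputs)
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)    # (batch, time) durability predictions

model = DurabilityRNN()
x = torch.randn(8, 50, 2)                 # 8 samples, 50 time slots, temp + humidity
y = torch.rand(8, 50)                     # durability targets
loss = nn.MSELoss()(model(x), y)
loss.backward()                           # gradients flow normally, no CopySlices issue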

Loss function for ordinal target on SoftMax over Logistic Regression

I am using Pylearn2 or Caffe to build a deep network. My target is ordinal (ordered categorical). I am trying to find a proper loss function but cannot find any in Pylearn2 or Caffe.
I read the paper "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". I get the general idea, but I am not sure I understand what the thresholds would be if my final layer is a SoftMax over Logistic Regression (outputting probabilities).
Can someone help me by pointing to an implementation of such a loss function?
Thanks
Regards
For both Pylearn2 and Caffe, your labels will need to be 0-4 instead of 1-5...it's just the way they work. The output layer will be 5 units, each essentially a logistic unit...and the softmax can be thought of as an adaptor that normalizes the final outputs. But "softmax" is commonly used as an output type. When training, the value of any individual unit is rarely ever exactly 0.0 or 1.0...it's always a distribution across your units - which log-loss can be calculated on. This loss is compared against the "perfect" case and the error is back-propagated to update your network weights. Note that a raw output from Pylearn2 or Caffe is not a specific digit 0, 1, 2, 3, or 4...it's 5 numbers, each associated with the likelihood of each of the 5 classes. When classifying, one just takes the class with the highest value as the 'winner'.
I'll try to give an example...
Say I have a 3-class problem; I train a network with a 3-unit softmax.
The first unit represents the first class, the second unit the second class, and the third unit the third class.
Say I feed a test case through and get...
0.25, 0.5, 0.25 ...0.5 is the highest, so a classifier would say "2" (the second class). This is the softmax output...it makes sure the sum of the output units is one.
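In code, that example looks roughly like this (logit values made up so the softmax comes out near 0.25/0.5/0.25):

import numpy as np

logits = np.array([0.0, 0.69, 0.0])            # raw 3-unit output
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: ~[0.25, 0.50, 0.25], sums to 1
winner = int(probs.argmax())                   # index 1, i.e. the second class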
You should have a look at ordinal (logistic) regression. This is the formal solution to the problem setup you describe (do not use plain regression, as the distance measures of the errors are wrong).
https://stats.stackexchange.com/questions/140061/how-to-set-up-neural-network-to-output-ordinal-data
In particular, I recommend looking at the Coral ordinal regression implementation at
https://github.com/ck37/coral-ordinal/issues.
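For what it's worth, here is a rough PyTorch sketch of my reading of the CORAL idea (K-1 cumulative "is the label greater than k" logits sharing one weight vector, with per-rank biases acting as the thresholds); it is an illustration under those assumptions, not code from that repository:

import torch
import torch.nn as nn

class CoralHead(nn.Module):
    """K-1 cumulative logits: one shared weight vector, one bias (threshold) per cut point."""
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)
        self.biases = nn.Parameter(torch.zeros(num_classes - 1))

    def forward(self, x):
        return self.fc(x) + self.biases        # (batch, K-1) logits for P(y > k)

def label_to_levels(labels, num_classes):
    """E.g. label 3 with 5 classes -> [1, 1, 1, 0]."""
    ranks = torch.arange(num_classes - 1)
    return (labels.unsqueeze(1) > ranks).float()

def coral_loss(logits, levels):
    return nn.functional.binary_cross_entropy_with_logits(logits, levels)

# Predicted rank = number of thresholds the sample is estimated to exceed:
# rank = (torch.sigmoid(logits) > 0.5).sum(dim=1)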