The torch command
x = torch.Tensor(4, 3)
is supposed to create an uninitialized tensor (based on the documentation).
But when we try to print the content of x, there are values there.
>>> from __future__ import print_function
>>> print(x)
0.0000e+00 -8.5899e+09 6.1021e-38
8.5920e+09 1.7470e-21 4.5806e-41
0.0000e+00 0.0000e+00 0.0000e+00
0.0000e+00 0.0000e+00 0.0000e+00
[torch.FloatTensor of size 4x3]
So what is the meaning of uninitialized here?
It means that PyTorch just reserves a certain area within the memory for the tensor, without changing its content.
This part of the memory was previously occupied by something else (another tensor, or maybe something completely different like a browser or a code editor, if the tensor lives in CPU memory). The values inside are not cleared afterwards for performance reasons.
Whatever content was there before (which might be something entirely different) is simply reinterpreted as the values of the new tensor.
Writing zeros or some other initialization requires computational work, so just reserving the area in memory is much faster.
But the values are also completely uncontrolled and can be arbitrarily large, so in many cases you should follow up with an explicit initialization.
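As a quick illustration (a minimal sketch; torch.empty is the modern equivalent of the legacy torch.Tensor(4, 3) constructor):

import torch

# Ask for uninitialized storage: whatever bytes happen to sit in that
# memory region are simply reinterpreted as float values.
x = torch.empty(4, 3)
print(x)                    # arbitrary leftover values, different on every run

# If you need defined contents, initialize explicitly:
zeros = torch.zeros(4, 3)
noise = torch.randn(4, 3)   # samples from a standard normal distribution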
What's the meaning of 'parameterize' in deep learning? As shown in the figure, does it mean that the matrix 'A' can be changed by the optimizer during training?
Yes, when something can be parameterized it means that gradients can be calculated for it.
This means that dE/dw, the derivative of the error with respect to the weight, can be computed (i.e. the operation must be differentiable) and subtracted from the model weights, scaled by a learning rate and whatever other terms the optimizer uses.
What the paper is saying is that if you make a binary matrix a weight, compute the gradient dE/dw of the loss with respect to that weight, and update the binary matrix through backpropagation, there is no activation function (which by requirement must be differentiable) that keeps the values discrete (like 0 and 1); instead you end up with continuous values (like these decimal values).
Therefore, since it is difficult to have binary values as weights and back-propagate them in a way where the weights plus the activation function still yield an updated weight matrix that is also binary, another solution such as a Bernoulli distribution is used instead to initialize the parameters of the model.
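For illustration, here is a minimal sketch (my own example, not the paper's method) of why a hard binary matrix cannot be learned directly, while a real-valued parameterization can:

import torch

# A real-valued matrix that an optimizer can update.
A = torch.randn(3, 3, requires_grad=True)

# A differentiable surrogate: squashing A into (0, 1) keeps gradients flowing,
# but the values are continuous rather than strictly 0/1.
soft = torch.sigmoid(A)
soft.sum().backward()
print(A.grad is not None)        # True: A is 'parameterized'

# Hard binarization gives exact 0/1 values but breaks the gradient path:
# the comparison has no grad_fn, so nothing can be backpropagated to A.
hard = (A > 0).float()
print(hard.requires_grad)        # False

# One workaround in the spirit described above: sample a binary matrix from
# a Bernoulli distribution whose probabilities come from real-valued numbers.
sampled = torch.bernoulli(torch.sigmoid(A.detach()))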
Hope this helps.
So, going through the AMP: Automatic Mixed Precision Training tutorial for normal networks, I found out that there are two pieces, autocast and GradScaler. I just want to know whether it is advisable / necessary to use GradScaler during training, because it is written in the documentation that:
Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
Also, looking at the NVIDIA Apex documentation for PyTorch, they have used it as:
from apex import amp
model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
I think this is what GradScaler does too, so I think it is a must. Can someone help me with this query?
Short answer: yes, your model may fail to converge without GradScaler().
There are three basic problems with using FP16:
Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
Vanishing gradients: with half precision, anything smaller than (roughly) 2^-14 ≈ 6e-5 starts flushing toward zero, as opposed to single precision, where that threshold is around 2^-126 ≈ 1e-38. GradScaler() takes care of this one.
Exploding loss: similarly, overflow is also much more likely with half precision. This is also managed by the autocast() context.
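These effects are easy to reproduce directly (a small numeric sketch, not part of the tutorial):

import torch

# 1 + 0.0001 rounds back to 1 in half precision (the fp16 spacing near 1 is ~1e-3).
one = torch.tensor(1.0, dtype=torch.float16)
small = torch.tensor(1e-4, dtype=torch.float16)
print(one + small)                 # tensor(1., dtype=torch.float16)

# A tiny gradient-sized value survives in fp32 but flushes to zero in fp16.
tiny = torch.tensor(1e-8)
print(tiny)                        # tensor(1.0000e-08)
print(tiny.to(torch.float16))      # tensor(0., dtype=torch.float16)

# GradScaler counteracts the second effect by multiplying the loss (and hence
# all gradients) by a large factor before backward(), then unscaling before
# the optimizer step, so small gradients stay representable in fp16.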
Assume I have a Ricker wavelet. I can compute the envelope of this wavelet as shown below:
This is the normal condition we usually see.
However, if I shift the Ricker wavelet so that it is wholly negative and then compute its envelope, a confusing thing happens: the envelope looks like the mirror image of the original wavelet:
Furthermore, if I shift the Ricker wavelet so that it is wholly positive and compute its envelope, the envelope is almost the same as the original wavelet:
Does anybody know the mathematical explanation behind these phenomena?
And how can we avoid the latter two cases? Should we remove the mean value of the wavelet to force it to have zero mean?
I assume that you are working in Python with the SciPy module:
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
points = 100
a = 4.0
myricker = signal.ricker(points, a)
The envelope is typically computed as the absolute value of the analytic signal, which is obtained from the original signal via the Hilbert transform:
analytic_signal = signal.hilbert(myricker)
amplitude_envelope = np.abs(analytic_signal)
The analytic signal is a complex quantity.
To understand the behavior of the envelope, try plotting its real and imaginary parts:
fig, (ax1, ax2) = plt.subplots(2, sharey=True)
ax1.plot(myricker, label='myricker')
ax1.plot(amplitude_envelope, label='envelope')
ax1.legend()
ax2.plot(np.real(analytic_signal), label='real')
ax2.plot(np.imag(analytic_signal), color='black', label='imaginary')
ax2.legend()
plt.show()
This is what you should get:
The real part of the analytic signal is identical to the original signal. The imaginary part is often referred to as the Hilbert transform itself.
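If you want to convince yourself of the first claim, here is a quick check using the arrays defined above:

# The real part of the analytic signal matches the input up to numerical precision.
print(np.allclose(np.real(analytic_signal), myricker))   # True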
Now if you shift your original ricker upwards by adding a small constant value (e.g. 0.2), the real part of the signal will be shifted accordingly, but the imaginary part will remain the same and therefore its contribution to the envelope will be smaller:
As you increase the shift, the contribution of the imaginary part to the envelope becomes smaller and smaller. Here for a shift of 1, it is so small that the envelope looks very close to the original ricker wavelet:
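To reproduce the effect, the following sketch (using the same Ricker wavelet as above) plots the wavelet and its envelope for the shifts discussed:

from scipy import signal
import numpy as np
import matplotlib.pyplot as plt

points, a = 100, 4.0
myricker = signal.ricker(points, a)

fig, axes = plt.subplots(3, sharex=True)
for ax, shift in zip(axes, (0.0, 0.2, 1.0)):
    shifted = myricker + shift                     # DC offset added to the wavelet
    envelope = np.abs(signal.hilbert(shifted))     # envelope of the shifted signal
    ax.plot(shifted, label=f'ricker + {shift}')
    ax.plot(envelope, label='envelope')
    ax.legend()
plt.show()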
I am working with PyTorch in Colab.
While training, PyTorch consumes an enormous amount of memory.
After training, I saved the model and loaded it into another notebook (notebook 2).
In notebook 2, after loading the state_dict and everything, PyTorch consumes far less memory than it did during training.
So I wonder whether 'useless' data is kept in the graphics card's memory while training (in my case, about 13 GB).
If so, how do I delete the useless data after training?
Also, I tried deleting the variables used during training, but that wasn't nearly enough (it freed only about 2 GB).
This is to be expected while training. During the training process, the operations themselves will take up memory.
For example, consider the following operation -
import numpy as np

a = np.random.rand(100, 500, 300)
b = np.random.rand(200, 500, 300)
c = (a[:, None, :, :] * b[None, :, :, :]).sum(-1).sum(-1)
The memory footprints of a and b are roughly 120 MB and 240 MB, and the result c (shape 100×200) is tiny. However, if you measure the peak memory of the operation itself (with memory_profiler's %memit magic):
%memit (a[:, None, :, :] * b[None, :, :, :]).sum(-1).sum(-1)
That's 23 GB! The line itself needs a lot of memory to actually perform the operation because it materializes a massive intermediate array of shape (100, 200, 500, 300). That array is temporary and is automatically freed once the operation finishes, so deleting a few of your own variables isn't going to do much to reduce the footprint.
The way to get around this is to use memory optimized operations.
For example, doing np.tensordot(a, b, ((1, 2), (1, 2))) instead of multiplying by broadcasting leaves a much better memory footprint.
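As a quick sanity check (a sketch using the same shapes as above), the two expressions agree while tensordot never materializes the huge intermediate:

import numpy as np

a = np.random.rand(100, 500, 300)
b = np.random.rand(200, 500, 300)

# Contract axes 1 and 2 of both arrays; the result has shape (100, 200).
c_tensordot = np.tensordot(a, b, ((1, 2), (1, 2)))

# Broadcasting version, restricted to a small slice so the intermediate
# array stays manageable for the comparison.
c_broadcast = (a[:10, None, :, :] * b[None, :10, :, :]).sum(-1).sum(-1)
print(np.allclose(c_tensordot[:10, :10], c_broadcast))   # True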
So what you need to do is identify which operation in your code requires such a huge amount of memory and see whether you can replace it with a more memory-efficient equivalent (which might not even be possible, depending on your specific use case).
I am training a simple MLP by computing the MSE and get the following error:
UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([1, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
The following gives me the right solution:
target = target.unsqueeze(1), while torch.unsqueeze(target, 1) does not. The former solution is from a previous question and the latter comes from the documentation.
Why does the former fix the UserWarning while the latter doesn't?
torch.unsqueeze returns a new tensor with a dimension of size one inserted at the specified position. That is, it is not an in-place operation, so you need to assign its output to something; i.e. simply do:
target = torch.unsqueeze(target, 1)
Otherwise the tensor will remain the same, as you did not store the changes back into it!
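A minimal sketch of the difference:

import torch

target = torch.randn(1)                # shape: torch.Size([1])

torch.unsqueeze(target, 1)             # the returned tensor is discarded ...
print(target.shape)                    # ... so target is still torch.Size([1])

target = torch.unsqueeze(target, 1)    # assign the result back
print(target.shape)                    # torch.Size([1, 1]); the warning goes away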