PyTorch autograd backward() doesn't work ("which is output 0 of MmBackward, is at version 1; expected version 0 instead") - deep-learning

I'm building a model that mixes a fine-tuned CLIP model with a frozen CLIP model, and I compute a custom loss from a KL-divergence term and a cross-entropy term.
with torch.no_grad():
    zero_shot_image_features = zero_shot_model.encode_image(input_image)
    zero_shot_context_text_features = zero_shot_model.encode_text(context_label_text)
    zero_shot_image_features /= zero_shot_image_features.norm(dim=-1, keepdim=True)
    zero_shot_context_text_features /= zero_shot_context_text_features.norm(dim=-1, keepdim=True)
    zero_shot_output_context = (zero_shot_image_features @ zero_shot_context_text_features.T).softmax(dim=-1)

fine_tunning_image_features = fine_tunning_model.encode_image(input_image)
fine_tunning_context_text_features = fine_tunning_model.encode_text(context_label_text)
fine_tunning_image_features /= fine_tunning_image_features.norm(dim=-1, keepdim=True)
fine_tunning_context_text_features /= fine_tunning_context_text_features.norm(dim=-1, keepdim=True)
fine_tunning_output_context = (fine_tunning_image_features @ fine_tunning_context_text_features.T).softmax(dim=-1)

fine_tunning_label_text_features = fine_tunning_model.encode_text(label_text)
fine_tunning_label_text_features /= fine_tunning_label_text_features.norm(dim=-1, keepdim=True)
fine_tunning_output_label = (fine_tunning_image_features @ fine_tunning_label_text_features.T).softmax(dim=-1)

optimizer_zeroshot.zero_grad()
optimizer_finetunning.zero_grad()
loss.backward(retain_graph=True)
def custom_loss(zero_shot_output_context, fine_output_context, fine_output_label, target, alpha):
    # Compute the cross-entropy loss
    ce_loss = F.cross_entropy(fine_output_label, target)
    # Compute the KL divergence between the zero-shot and fine-tuned context outputs
    kl_loss = F.kl_div(zero_shot_output_context.log(), fine_output_context.log(), reduction='batchmean').requires_grad_(True)
    final_loss = ce_loss + alpha * kl_loss
    return final_loss
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 81
     78 optimizer2.zero_grad()
     79 optimizer.zero_grad()
---> 81 loss.backward(retain_graph=True)
     83 if device == "cpu":
     84     optimizer.step()

File ~/anaconda3/envs/sh_clip/lib/python3.8/site-packages/torch/tensor.py:221, in Tensor.backward(self, gradient, retain_graph, create_graph)
    213 if type(self) is not Tensor and has_torch_function(relevant_args):
    214     return handle_torch_function(
    215         Tensor.backward,
    216         relevant_args,
        (...)
    219         retain_graph=retain_graph,
    220         create_graph=create_graph)
--> 221 torch.autograd.backward(self, gradient, retain_graph, create_graph)

File ~/anaconda3/envs/sh_clip/lib/python3.8/site-packages/torch/autograd/__init__.py:130, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    127 if retain_graph is None:
    128     retain_graph = create_graph
--> 130 Variable._execution_engine.run_backward(
    131     tensors, grad_tensors, retain_graph, create_graph,
    132     allow_unreachable=True)
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [6, 1024]], which is output 0 of MmBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
But when I train the model, the backward() call fails with the error above. How can I fix it?

You use 'a /= b', which is an in-place operation; it will work if you change it to 'a = a / b'.
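For example, the normalization on the fine-tuned branch could be rewritten out-of-place (a sketch using the variable names from the question; only the operator changes):

# out-of-place division creates new tensors instead of mutating the ones autograd saved
fine_tunning_image_features = fine_tunning_image_features / fine_tunning_image_features.norm(dim=-1, keepdim=True)
fine_tunning_context_text_features = fine_tunning_context_text_features / fine_tunning_context_text_features.norm(dim=-1, keepdim=True)
fine_tunning_label_text_features = fine_tunning_label_text_features / fine_tunning_label_text_features.norm(dim=-1, keepdim=True)

The in-place divisions inside the torch.no_grad() block are not tracked by autograd, so only the fine-tuned branch needs this change.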


How to apply sqrt to vector in Cython?

Hello, I'm really a beginner to Cython and C-based languages.
I have a problem getting the square root of a vector.
I have a vector (each value is of double type):
x = [1, 4, 9]
and I want to get:
y = [1, 2, 3]
How can I get this vector?
A solution I thought of is:
cdef floating[::1] y = x
for i in range(length):
    y[i] = x[i] ** 0.5
But this way it's too slow. I want to accelerate it.
Can I use the sqrt or square function from libc.math in this case?
Edit:
If I want to get a cube root (like [1, 8, 27] -> [1, 2, 3]), what function should I use instead of sqrt?
Quick win
First, you should check whether your function is already implemented in NumPy. If so, it will probably be a very fast (C/C++) implementation.
This is the case for your function:
import numpy as np
x = np.array([1, 4, 9])
y = np.sqrt(x)
#> array([1., 2., 3.])
With NumPy arrays
Alternatively (following @joni's comment), you can use NumPy arrays just for the input/output and compute the function element-wise using C/C++:
cimport numpy as cnp
import numpy as np
from libc.math cimport sqrt

cpdef cnp.ndarray[double, ndim=1] cy_sqrt_np(cnp.ndarray[double, ndim=1] x):
    cdef Py_ssize_t i, l = x.shape[0]
    cdef cnp.ndarray[double, ndim=1] y = np.empty(l)
    for i in range(l):
        y[i] = sqrt(x[i])
    return y
With C++ vectors
Lastly, here is a possible implementation with C++ vectors and automatic conversion from/to python lists:
from libc.math cimport sqrt
from libcpp.vector cimport vector

cpdef vector[double] cy_sqrt_vec(vector[double] x):
    cdef Py_ssize_t i, l = x.size()
    cdef vector[double] y
    y.reserve(l)
    for i in range(l):
        y.push_back(sqrt(x[i]))
    return y
Some things to keep in mind in this case and the previous:
- We initialize the y vector to be empty, and then allocate space for it with reserve(). According to SO this seems to be a good option.
- We use a typed i in the for loop, and use push_back to assign new values.
- We use sqrt from libc.math to avoid using Python code inside the loop.
- We type the input of the function as vector[double]. This automatically adds convenient type conversions from other Python types (e.g., a list of ints).
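As a quick usage sketch from Python (the module name cy_sqrt here is hypothetical; it assumes the .pyx above has been compiled, e.g. with cythonize):

# hypothetical module name; build the .pyx first, e.g. cythonize("cy_sqrt.pyx")
from cy_sqrt import cy_sqrt_vec

y = cy_sqrt_vec([1.0, 4.0, 9.0])  # a Python list of floats converts to vector[double]
print(y)  # [1.0, 2.0, 3.0]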
Time comparison
We define a random input x to avoid cached results polluting our measures:
%%timeit -n 10000 -r 7 x = gen_x()
y = np.sqrt(x)
#> executed in 177ms, finished 16:16:57 2022-04-19
#> 2.3 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = x**.5
#> executed in 194ms, finished 16:16:51 2022-04-19
#> 2.46 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = cy_sqrt_np(x)
#> executed in 359ms, finished 16:17:02 2022-04-19
#> 4.9 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = list(gen_x())
y = cy_sqrt_vec(x)
#> executed in 2.85s, finished 16:17:11 2022-04-19
#> 40.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As expected, the np.sqrt version wins. Besides, the vector allocation looks comparatively slower.
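For the cube-root edit in the question, the same patterns apply: NumPy has np.cbrt, and in the Cython versions you could cimport cbrt from libc.math instead of sqrt. A minimal NumPy sketch:

import numpy as np

x = np.array([1.0, 8.0, 27.0])
y = np.cbrt(x)  # element-wise cube root
#> array([1., 2., 3.])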

how to prevent my DNN / MLP converging to average

I want to use several available features to predict a variable. The problem does not seem to be related to vision or NLP, but I believe there are good reasons to think that the variable to be predicted is a non-linear function of these features. So I just use a normal MLP like the following:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(53, 200)
        self.fc2 = nn.Linear(200, 100)
        self.fc3 = nn.Linear(100, 36)
        self.fc4 = nn.Linear(36, 1)

    def forward(self, x):
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        x = F.leaky_relu(self.fc3(x))
        x = self.fc4(x)
        return x

net = Net().to(device)
loss_function = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001, weight_decay=1e-6)

def train_normal(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data = data.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_function(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 100)
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
At first it seemed to work and did learn something:
Train Epoch: 9 [268800/276316 (97%)] Loss: 0.217219
Train Epoch: 9 [275200/276316 (100%)] Loss: 0.234965
predicted actual diff
-1.18 -1.11 -0.08
0.15 -0.15 0.31
0.19 0.27 -0.08
-0.49 -0.48 -0.01
-0.05 0.08 -0.14
0.44 0.50 -0.06
-0.17 -0.05 -0.12
1.81 1.92 -0.12
1.55 0.76 0.79
-0.05 -0.30 0.26
But as it kept learning, the predictions collapsed toward roughly the same value, close to the average of the targets, regardless of the input:
predicted actual diff
-0.16 -0.06 -0.10
-0.16 -0.55 0.39
-0.13 -0.26 0.14
-0.15 0.50 -0.66
-0.16 0.02 -0.18
-0.16 -0.12 -0.04
-0.16 -0.40 0.24
-0.01 1.20 -1.21
-0.07 0.33 -0.40
-0.09 0.02 -0.10
What technique / trick can prevent this? Also, how can I increase the accuracy: should I add more hidden layers or more neurons per layer?
One possible problem is that there is nothing to learn.
Check that your data is standardized and try different learning rates (maybe even a cyclic learning rate); a short standardization sketch follows the regression example below. Something that can happen is that the algorithm is unable to settle into a minimum and keeps jumping around it.
I am not sure if you are already doing this, but use a standard implementation that works on another dataset and then adapt it to your problem, just to avoid small development mistakes. You can check this tutorial, How to apply Deep Learning on tabular data with FastAi, and if you are really new I would strongly recommend doing this MOOC: https://course.fast.ai/. It should help you gain some level and understanding.
If you already have all the data in tabular form, you can also try a classical machine learning algorithm like linear regression / gradient boosting, just to check whether your data contains any signal:
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
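On the standardization point mentioned above, a minimal sketch (the feature matrices here are hypothetical random data with 53 columns, matching the question's first layer; sklearn's StandardScaler is one convenient option):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 53)  # hypothetical training feature matrix
X_test = np.random.rand(20, 53)    # hypothetical test feature matrix

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit mean/std on the training set only
X_test_std = scaler.transform(X_test)        # reuse the same statistics for the test set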
Let me know if you find the solution to your problem!

Gradient behavior in pytorch with multi-layer loss

I have a loss to which each layer contributes. Which is the correct approach in terms of making sure the weights are updated properly?
# option 1
x2 = self.layer1(x1)
x3 = self.layer2(x2)
x4 = self.layer3(x3)
In the second option, I detach when feeding into each subsequent block:
# option 2
# x2 = self.layer1(x1.detach())
# x3 = self.layer2(x2.detach())
# x4 = self.layer3(x3.detach())
Then there are shared ops which calculate 4 losses and sum them:
x4 = F.relu(self.bn1(x4))
loss = some_loss([x1, x2, x3, x4])
Option 1 is correct. When you detach a tensor, its computation history/graph is lost and gradients won't be propagated to the inputs or to computation done before the detach.
This can also be seen by this toy experiment.
In [14]: import torch
In [15]: x = torch.rand(10,10).requires_grad_()
In [16]: y = x**2
In [19]: z = torch.sum(y)
In [20]: z.backward()
In [23]: x.grad is not None
Out[23]: True
Using detach
In [26]: x = torch.rand(10,10).requires_grad_()
In [27]: y = x**2
In [28]: z = torch.sum(y)
In [29]: z_ = z.detach()
In [30]: z_.backward()
# this gives error
This is because when you call detach, it returns a new tensor with the values copied and information about previous computations is lost.
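As a minimal sketch of option 1 with toy layers and a toy loss (hypothetical names; the point is only that no intermediate output is detached, so the summed loss reaches every layer):

import torch
import torch.nn as nn

layer1, layer2, layer3 = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

x1 = torch.randn(4, 8)
x2 = layer1(x1)   # no detach: x2 keeps its graph back to layer1
x3 = layer2(x2)
x4 = layer3(x3)

# one loss term per intermediate output, summed into a single scalar
loss = sum(out.pow(2).mean() for out in (x2, x3, x4))
loss.backward()

print(layer1.weight.grad is not None)  # True: gradients flow back to the earliest layer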

I was training an LSTM network using PyTorch and encountered this error

I was training an LSTM network using PyTorch and encountered this error in a Jupyter notebook.
RuntimeError Traceback (most recent call last)
<ipython-input-16-b6b1e0b8cad1> in <module>()
4
5 # train the model
----> 6 train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)
<ipython-input-14-43dc0cc515e7> in train(net, data, epochs, batch_size, seq_length, lr, clip, val_frac, print_every)
55
56 # calculate the loss and perform backprop
---> 57 loss = criterion(output, targets.view(batch_size*seq_length))
58 loss.backward()
59 # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
~\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
902 def forward(self, input, target):
903 return F.cross_entropy(input, target, weight=self.weight,
--> 904 ignore_index=self.ignore_index, reduction=self.reduction)
905
906
~\Anaconda3\lib\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
1968 if size_average is not None or reduce is not None:
1969 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1970 return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
1971
1972
~\Anaconda3\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
1788 .format(input.size(0), target.size(0)))
1789 if dim == 2:
-> 1790 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
1791 elif dim == 4:
1792 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2 'target'
Cast the target tensor you pass to the loss to Long (you currently have Int), as the error says.
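For example, the failing line from the traceback can be changed like this (a sketch; targets is the tensor built in the question's training loop):

# cast the Int (int32) targets to Long (int64), which cross_entropy/nll_loss expects
loss = criterion(output, targets.view(batch_size * seq_length).long())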
Oh, and please provide a Minimal, Complete and Verifiable example next time you ask a question.

NVVM_ERROR_INVALID_OPTION when using the CUDA kernel with Numbapro api

I want to execute a CUDA kernel in Python using the Numbapro API. I have this code:
import math
import numpy
from timeit import default_timer as timer  # for timer() below
from numbapro import jit, cuda, int32, float32
from matplotlib import pyplot

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
    start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    end = N
    stride = cuda.gridDim.x * cuda.blockDim.x
    for i in range(start, end, stride):
        u_source[i] = strength_source/(2*math.pi) * (X[i]-x_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)
        v_source[i] = strength_source/(2*math.pi) * (Y[i]-x_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)
def main():
    N = 200                                # number of points in each direction
    x_start, x_end = -4.0, 4.0             # boundaries in the x-direction
    y_start, y_end = -2.0, 2.0             # boundaries in the y-direction
    x = numpy.linspace(x_start, x_end, N)  # creates a 1D-array with the x-coordinates
    y = numpy.linspace(y_start, y_end, N)  # creates a 1D-array with the y-coordinates
    X, Y = numpy.meshgrid(x, y)            # generates a mesh grid

    strength_source = 5.0                  # source strength
    x_source, y_source = -1.0, 0.0         # location of the source

    start = timer()

    # calculate grid dimensions
    blockSize = 1024
    gridSize = int(math.ceil(float(N)/blockSize))

    # transfer memory to device
    X_d = cuda.to_device(X)
    Y_d = cuda.to_device(Y)
    u_source_d = cuda.device_array_like(X)
    v_source_d = cuda.device_array_like(Y)

    # launch kernel
    calculate_velocity_field[gridSize, blockSize](X_d, Y_d, u_source_d, v_source_d, x_source, y_source, strength_source, N)

    # transfer memory to host
    u_source = numpy.empty_like(X)
    v_source = numpy.empty_like(Y)
    u_source_d.to_host(u_source)
    v_source_d.to_host(v_source)

    elapsed_time = timer() - start
    print("Exec time with GPU %f s" % elapsed_time)

if __name__ == "__main__":
    main()
It gives me this error:
NvvmError Traceback (most recent call last)
<ipython-input-17-85e4a6e56a14> in <module>()
----> 1 @cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
2 def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
3 start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
4 end = N
5 stride = cuda.gridDim.x * cuda.blockDim.x
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/decorators.py in kernel_jit(func)
89 # Force compilation for the current context
90 if bind:
---> 91 kernel.bind()
92
93 return kernel
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in bind(self)
319 Force binding to current CUDA context
320 """
--> 321 self._func.get()
322
323 @property
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
254 cufunc = self.cache.get(device.id)
255 if cufunc is None:
--> 256 ptx = self.ptx.get()
257
258 # Link
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
226 arch = nvvm.get_arch_option(*cc)
227 ptx = nvvm.llvm_to_ptx(self.llvmir, opt=3, arch=arch,
--> 228 **self._extra_options)
229 self.cache[cc] = ptx
230 if config.DUMP_ASSEMBLY:
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in llvm_to_ptx(llvmir, **opts)
420 cu.add_module(llvmir.encode('utf8'))
421 cu.add_module(libdevice.get())
--> 422 ptx = cu.compile(**opts)
423 return ptx
424
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in compile(self, **options)
211 for x in opts])
212 err = self.driver.nvvmCompileProgram(self._handle, len(opts), c_opts)
--> 213 self._try_error(err, 'Failed to compile\n')
214
215 # get result
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in _try_error(self, err, msg)
229
230 def _try_error(self, err, msg):
--> 231 self.driver.check_error(err, "%s\n%s" % (msg, self.get_log()))
232
233 def get_log(self):
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in check_error(self, error, msg, exit)
118 sys.exit(1)
119 else:
--> 120 raise exc
121
122
NvvmError: Failed to compile
libnvvm : error: -arch=compute_52 is an unsupported option
NVVM_ERROR_INVALID_OPTION
I tried other Numbapro examples and the same error occurs.
I don't know if it's a bug in Numbapro that doesn't support compute capability 5.2, or a problem with Nvidia's NVVM... any suggestions?
In theory it should be supported, but I don't know what is happening.
I'm using Linux with CUDA 7.0 and driver version 346.29
Finally I found a solution here
Solution 1:
conda update cudatoolkit
Fetching package metadata: ....
# All requested packages already installed.
# packages in environment at ~/.anaconda3:
#
cudatoolkit 6.0 p0
It looks like updating the CUDA toolkit this way does not update it to CUDA 7.0. A second solution can be used instead:
Solution 2
conda install -c numba cudatoolkit
Fetching package metadata: ......
Solving package specifications: .
Package plan for installation in environment ~/.anaconda3:
The following packages will be downloaded:
package | build
---------------------------|-----------------
cudatoolkit-7.0 | 1 190.8 MB
The following packages will be UPDATED:
cudatoolkit: 6.0-p0 --> 7.0-1
Proceed ([y]/n)? y
Before:
In [4]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.6.0.37
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.6.0.37
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.6.0.37
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.6.0.37
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.2.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[4]: True
After:
In [6]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.7.0.28
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.7.0.28
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.7.0.35
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.7.0.28
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.3.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[6]: True