My GPU is detected by
torch.cuda.is_available()
torch.randn(1).cuda()
tensorflow.test.is_built_with_cuda()
torch.cuda.device_count()
but not by
device_lib.list_local_devices()
tensorflow.config.list_physical_devices('GPU')
The code I used is:
import torch
import tensorflow as tf
from tensorflow.python.client import device_lib

print(torch.cuda.is_available())
print(torch.randn(1).cuda())
print(device_lib.list_local_devices())
print(tf.test.is_built_with_cuda())
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
print(torch.cuda.device_count())
and the result is:
True
tensor([0.7429], device='cuda:0')
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11914755976629927437
xla_global_id: -1
]
True
2.10.0
[]
cuda:0
1
I'm using an NVIDIA GTX 1070 Ti, NVIDIA driver 460.89, CUDA 11.2, cuDNN 8.1.1, torch 1.7.1+cu110, and torchvision 0.8.2+cu110.
Despite the results above, my deep learning model moves to CUDA successfully. The problem is that I cannot get my tensor data onto the GPU.
When I checked the datatype with print(type(x)), it returned <class 'torch.Tensor'>.
I then tried both x.to(device) and x.cuda().
But in both cases, x.is_cuda still returns False.
When I tried to make this data go through my model, it returned this:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same
Which means my model is on the GPU. I can't figure out why my tensor stays on the CPU while my model doesn't.
My data is a tensor converted from an image, with shape [3, 3, 512, 512].
My model is a GAN.
Never mind. I didn't reassign x.
In case someone bumps into this: using x = x.cuda() instead of just x.cuda() fixed it.
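For reference, a minimal sketch of the non-in-place behaviour (the tensor here is just illustrative):
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 3, 512, 512)   # example tensor, created on the CPU
x.to(device)                      # returns a new tensor; x itself is unchanged
print(x.is_cuda)                  # False

x = x.to(device)                  # reassigning keeps the GPU copy
print(x.is_cuda)                  # True (when a GPU is available)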
I used PyTorch's Faster R-CNN to train on a dataset. It works well with one GPU. However, I have access to a system with 4 GPUs and want to use all of them, but when I check GPU usage, only one GPU is active.
I select device in this manner:
if torch.cuda.is_available() == False and device_name == 'gpu':
    raise ValueError('GPU is not available!')
elif device_name == 'cpu':
    device = torch.device('cpu')
elif device_name == 'gpu':
    if batch_size % torch.cuda.device_count() != 0:
        raise ValueError('Batch size is not divisible by the number of GPUs')
    device = torch.device('cuda')
After that I do this:
# multi GPUs
if torch.cuda.device_count() > 1 and device_name == 'gpu':
    print('=' * 50)
    print('=' * 50)
    print('=' * 50)
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    # model = nn.DataParallel(model, device_ids=[i for i in range(torch.cuda.device_count())])
    model = nn.DataParallel(model)
    print('=' * 50)
    print('=' * 50)
    print('=' * 50)

# transfer model to selected device
model.to(device)
I move data to the device in this way:
# iterate over all batches
counter_batches = 0
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
    # transfer tensors to device (gpu, if not available cpu)
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    # in train mode, faster r-cnn gives losses
    loss_dict = model(images, targets)
    # sum of losses
    losses = sum(loss for loss in loss_dict.values())
I do not know what I did wrong.
Also, I get this warning:
/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
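For what it's worth, that warning is consistent with nn.DataParallel gathering the scalar losses returned by each replica: gather() stacks one 0-dim tensor per GPU into a vector. A hypothetical way to reduce the gathered losses before backpropagation (the optimizer name is assumed, not taken from the code above) might look like:
# hypothetical continuation of the loop above when model is wrapped in nn.DataParallel:
# each value in loss_dict now has shape [num_gpus], one entry per replica
loss_dict = model(images, targets)
losses = sum(loss.mean() for loss in loss_dict.values())  # reduce over replicas to a scalar
optimizer.zero_grad()
losses.backward()
optimizer.step()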
If x is a torch.Tensor of dtype torch.float then are the operations x.item() and float(x) exactly the same?
The operations x.item() and float(x) are not the same.
From the documentation of item(): it can be used to get the value of a tensor as a Python number (only from tensors containing a single value). It returns the value of the tensor as it is and does not modify the tensor.
float(), on the other hand, converts its input to a floating point number, when possible. Find the documentation here.
To see the difference, consider another Tensor y of dtype int64:
import torch
y = torch.tensor(2)
print(y, y.dtype)
>>> tensor(2) torch.int64
print('y.item(): {}, float(y): {}'.format(y.item(), float(y)))
>>> y.item(): 2, float(y): 2.0
print(type(y.item()), type(float(y)))
>>> <class 'int'> <class 'float'>
Note that float(y) does not convert the type in-place. You would need to assign the result if you want to keep that change, for example:
z = float(y)
print('y.dtype: {}, type(z): {}'.format(y.dtype, type(z)))
>>> y.dtype: torch.int64, type(z): <class 'float'>
We can see that z is not a torch.Tensor. It is simply a floating point number.
The float() built-in is not to be confused with the tensor method y.float(). That method performs the Tensor dtype conversion (also not in-place, so it needs assignment).
print('y.float(): {},\n y.float().dtype: {},\n y: {},\n y.dtype: {}'.format(y.float(), y.float().dtype, y, y.dtype))
y.float(): 2.0,
y.float().dtype: torch.float32,
y: 2,
y.dtype: torch.int64
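To tie this back to the original question: for a single-element tensor of a floating dtype, x.item() and float(x) should return the same Python float; the difference only shows up for other dtypes, as in the int64 example above. A quick illustrative check:
x = torch.tensor(0.5)                   # dtype torch.float32
print(x.item(), float(x))               # 0.5 0.5
print(type(x.item()), type(float(x)))   # <class 'float'> <class 'float'>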
I'm trying to create a memoryview to store several vectors as rows, but when I try to change the values of any of them I get an error, as if it were expecting a scalar.
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
cdef DTYPE_t[:, ::1] results = np.zeros(shape=(10, 10), dtype=DTYPE)
results[:, 0] = np.random.rand(10)
This throws the following error:
TypeError: only size-1 arrays can be converted to Python scalars
Which I don't understand, given that I want to overwrite the first column with that vector. Any idea what I am doing wrong?
The operation you would like to use is possible between NumPy arrays (Python functionality) or between Cython memoryviews (C functionality, i.e. Cython generates the right for-loops in the C code), but not when you mix a memoryview (on the left-hand side) with a NumPy array (on the right-hand side).
So you either have to use Cython memoryviews:
...
cdef DTYPE_t[::1] r = np.random.rand(10)
results[:, 0] = r
#check it worked:
print(results.base)
...
or NumPy arrays (we know .base is a NumPy array):
results.base[:, 0] = np.random.rand(10)
#check it worked:
print(results.base)
Cython's version has less overhead, but for large matrices there won't be much difference.
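As a side note (a sketch that assumes the same %%cython cell as above, and relies on typed memoryviews exposing the buffer protocol): np.asarray() gives a NumPy view that shares the memoryview's buffer, so NumPy-style slice assignment also writes through, even when the memoryview was not created from a NumPy array:
arr = np.asarray(results)        # NumPy view sharing the memoryview's buffer
arr[:, 0] = np.random.rand(10)   # writes through to `results`
print(np.asarray(results))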
Keras provides categorical_accuracy, sparse_categorical_accuracy and top_k_categorical_accuracy metrics.
I would like to define a sparse version of top_k_categorical_accuracy.
My tentative implementation is as follows:
def top_k_sparse_categorical_accuracy(y_true, y_pred, z=5):
    return K.mean(K.in_top_k(y_pred, K.max(y_true, axis=-1), z), axis=-1)
This is modeled after top_k_categorical_accuracy and sparse_categorical_accuracy
def sparse_categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.max(y_true, axis=-1),
                          K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                  K.floatx())

def top_k_categorical_accuracy(y_true, y_pred, k=5):
    return K.mean(K.in_top_k(y_pred, K.argmax(y_true, axis=-1), k), axis=-1)
However, this does not work because the second argument of K.in_top_k should be a tensor of integers, whereas here I have a tensor of floats, and K.cast only seems to cast to float types.
Is there a way to perform a cast to int in the Keras backend? Or is there another way to implement this metric?
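Not an authoritative answer, but here is a minimal sketch under the assumption of the TensorFlow backend, where K.cast is a thin wrapper around tf.cast and does accept integer dtype strings such as 'int32':
from keras import backend as K

def sparse_top_k_categorical_accuracy(y_true, y_pred, k=5):
    # y_true holds integer class indices (stored as floats); K.max over the last
    # axis extracts the index, which is then cast to int32 for K.in_top_k
    return K.mean(K.in_top_k(y_pred, K.cast(K.max(y_true, axis=-1), 'int32'), k),
                  axis=-1)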
I am trying to test the effectiveness of the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm using a code snippet provided in CUDAcast #10 and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the demo in the CUDAcast, the presenter gets a 100x speedup by sending the large array operation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
@vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' it is 0.22 seconds. I'm using a Dell Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is this reasonable given my hardware and code?
This question is also interesting for me.
I tried your code and got similar results.
To investigate the issue I wrote a CUDA kernel using cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also take into account the time for copying the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.
For the case above (times in seconds, on my PC):
CPU - 0.0033;
GPU - 0.0096;
Vectorize (target='cuda') - 0.15.
If the copying time is not counted:
GPU - 0.000245
So, what I have learned: (1) Copying from host to device and from device to host is time-consuming; this is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down the calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize the memory copying).
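On point (3), here is a minimal sketch (an illustration, not part of the benchmark above) of one way to keep the @vectorize(target='cuda') path while still minimizing transfers: Numba's CUDA ufuncs can also be called on device arrays, so the copies can be done once and the kernel timed separately:
from numba import vectorize, cuda
import numpy as np
from timeit import default_timer as timer

@vectorize(["float32(float32, float32)"], target='cuda')
def VectorAdd_cuda(a, b):
    return a + b

N = 16 * 50000
A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)

d_A = cuda.to_device(A)          # copy once, outside the timed region
d_B = cuda.to_device(B)

start = timer()
d_C = VectorAdd_cuda(d_A, d_B)   # device arrays in, device array out: no transfers
cuda.synchronize()               # make sure the kernel finished before stopping the timer
print("kernel-only time: %f seconds" % (timer() - start))

C = d_C.copy_to_host()           # copy the result back only when needed
print("C[:5] =", C[:5])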
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in this case the Python program's execution time is comparable to a C program and gives about a 100x speedup. That is because, fortunately, in this case you can run many iterations without data exchange between host and device.
UPD. Software and hardware used: Win7 64-bit, CPU: Intel Core2 Quad 3 GHz, GPU: NVIDIA GeForce GTX 580.