I have a GTX Titan Z graphics card. It has twin GPUs with a total memory of 12 GB (6 GB + 6 GB). When I use the deviceQuery application in the CUDA Samples (v6.5) folder to check the specification, it shows two devices, each with a total memory of 4 GB. Furthermore, in my C++ code I can only access 4 GB of memory. On the other hand, when I run the GPU-Z software, it shows two Titan Zs, each with 6 GB of memory. Could anyone explain what has caused this problem and how it can be resolved?
The problem here was that the program was being compiled as a 32-bit application. A 32-bit process can only address 4 GB of memory. The CUDA call to check device specifications (cudaGetDeviceProperties) appears to recognize this fact and only reports the 4 GB you can actually use.
Compiling as a 64-bit application should resolve this problem.
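To double-check the fix, here is a minimal sketch (the file name check_mem.cu is just illustrative) that prints what each GPU reports; compiled as a 64-bit binary (nvcc -m64 check_mem.cu) it should show the full 6 GB per device:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            cudaSetDevice(dev);                 // cudaMemGetInfo reports for the current device
            size_t freeBytes = 0, totalBytes = 0;
            cudaMemGetInfo(&freeBytes, &totalBytes);
            printf("Device %d (%s): totalGlobalMem %zu MB, currently free %zu MB\n",
                   dev, prop.name,
                   prop.totalGlobalMem / (1024 * 1024),
                   freeBytes / (1024 * 1024));
        }
        return 0;
    }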
So I found this Wikipedia resource:
Maximum number of resident grids per device
(Concurrent Kernel Execution)
and for each compute capability it gives a number of concurrent kernels, which I assume is the maximum number of concurrent kernels.
Now I am getting a GTX 1060 delivered, which according to this NVIDIA CUDA resource has a compute capability of 6.1. From what I have learned about CUDA so far, you can specify the virtual compute capability of your code at compile time in NVCC with the flag -arch=compute_XX.
So will my GPU be hardware constrained to 32 concurrent kernels or is it capable of 128 with the -arch=compute_60 flag?
According to table 13 in the NVIDIA CUDA Programming Guide, compute capability 6.1 devices have a maximum of 32 resident grids, i.e. 32 concurrent kernels.
Even if you use the -arch=compute_60 flag, you will be limited to the hardware limit of 32 concurrent kernels. Compiling for a particular architecture does not allow you to exceed the hardware limits of the device.
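To make that concrete, here is a minimal sketch (names and sizes are made up) that launches 128 small grids on separate streams; a CC 6.1 device keeps at most 32 of them resident at a time, and the remaining launches simply queue up and run as earlier grids finish, with no error reported:

    #include <cuda_runtime.h>

    __global__ void busyKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    }

    int main()
    {
        const int numKernels = 128;              // more than the 32-resident-grid limit
        const int n = 1 << 16;
        float *d_data;
        cudaMalloc((void**)&d_data, (size_t)numKernels * n * sizeof(float));

        cudaStream_t streams[numKernels];
        for (int k = 0; k < numKernels; ++k)
            cudaStreamCreate(&streams[k]);

        // One grid per stream; grids beyond the resident limit wait their turn.
        for (int k = 0; k < numKernels; ++k)
            busyKernel<<<(n + 255) / 256, 256, 0, streams[k]>>>(d_data + (size_t)k * n, n);

        cudaDeviceSynchronize();

        for (int k = 0; k < numKernels; ++k)
            cudaStreamDestroy(streams[k]);
        cudaFree(d_data);
        return 0;
    }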
On my GPU the maximum number of threads per block is 1024. I am working on an image processing project using CUDA. If I want to use shared memory, does that mean I can only work on 1024 pixels with one block and need to copy only those 1024 elements into shared memory?
Your question is quite unclear, so I will answer what is asked in the title.
The amount of data that can be held in shared memory in CUDA depends on the compute capability of your GPU.
For instance, on CC 2.x and 3.x:
On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory.
See the "Configuring the amount of shared memory" section here: NVIDIA Parallel Forall devblog, Using Shared Memory in CUDA C/C++.
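As a small sketch of that configuration step (myKernel is just a placeholder), the runtime API lets you express a preference for the larger shared-memory partition either per device or per kernel:

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }   // placeholder kernel

    int main()
    {
        // Ask for the 48 KB shared / 16 KB L1 split on CC 2.x/3.x devices
        // (this is a preference, not a guarantee).
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // The preference can also be set for an individual kernel:
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        return 0;
    }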
The optimization you have to think about is avoiding bank conflicts by mapping the threads' accesses onto the memory banks. This is introduced in that blog post and you should read about it.
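Regarding the 1024-pixel question: a common pattern is for each block to stage one tile of the image in shared memory and work on that tile. A minimal sketch, assuming a 32x32 tile (exactly 1024 threads per block) and a placeholder element-wise operation:

    __global__ void processTile(const float *in, float *out, int width, int height)
    {
        // One 32x32 tile per block = 1024 pixels, matching the 1024-thread limit.
        // The +1 padding column avoids shared-memory bank conflicts if the tile
        // is later read column-wise (e.g. for a transpose or a vertical filter).
        __shared__ float tile[32][32 + 1];

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;

        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // ... operate on the tile in shared memory here ...

        if (x < width && y < height)
            out[y * width + x] = tile[threadIdx.y][threadIdx.x];
    }

Launch it with dim3 block(32, 32) and a grid of ceil(width/32) by ceil(height/32) blocks; a larger image is simply covered by more blocks, each staging its own tile.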
The GPU I work with is a Tesla C2075 with 6 GB of VRAM, and the OS is 64-bit Ubuntu with CUDA toolkit 5.5.
What do I need to do to allow global __device__ static arrays bigger than 2 GB?
I couldn't find many related topics on Google or here on Stack Overflow.
I frequently see programs that allocate (dynamically, at least) more than 2 GB of data, so allocating a __device__ array of that size statically should in principle be possible too. To try to solve this, here are my suggestions:
Update the graphics driver to the latest version (the new driver supports large allocations of up to 6 GB on the K20; it should support them on your card as well).
Update your CUDA toolkit to 6.5 (this may not be necessary, but CUDA 6.5 may get along better with the latest driver).
If this does not work, I would try allocating page-locked (pinned) memory; check here for more info. After allocating it, you need to map it to make it visible on the device. This enables DMA transfers and might even be a faster solution. I have seen this approach in programs that allocate up to 5 GB of device-visible memory; see here. A short sketch of this approach follows.
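A minimal sketch of that mapped (zero-copy) pinned allocation, assuming a 3 GB buffer just to show a size above the 2 GB mark:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t bytes = (size_t)3 * 1024 * 1024 * 1024;   // 3 GB, i.e. above 2 GB
        float *h_ptr = NULL, *d_ptr = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);           // must come before the context is created

        cudaError_t err = cudaHostAlloc((void**)&h_ptr, bytes, cudaHostAllocMapped);
        if (err != cudaSuccess) {
            printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Device-visible alias of the pinned host buffer; kernels can use d_ptr
        // directly and the accesses are DMA transfers over PCIe.
        cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

        cudaFreeHost(h_ptr);
        return 0;
    }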
I'm working on a "Fujitsu" machine. It has 2 GPUs installed: a Quadro 2000 and a Tesla C2075. The Quadro GPU has 1 GB of RAM and the Tesla GPU has 5 GB (I checked using the output of nvidia-smi -q). When I run nvidia-smi, the output shows 2 GPUs, but the Tesla one's display is shown as Off.
I'm running a memory-intensive program and would like to use the 5 GB of RAM available, but whenever I run the program, it seems to be using the Quadro GPU.
Is there some way to use a particular GPU out of the 2 in a program? Does the Tesla GPU being "disabled" mean its drivers are not installed?
You can control access to CUDA GPUs either using the environment or programmatically.
You can use the environment variable CUDA_VISIBLE_DEVICES to specify a list of one or more GPUs that will be visible to any application, as well as their order of visibility. For example, if nvidia-smi reports your Tesla GPU as GPU 1 (and your Quadro as GPU 0), then you can set CUDA_VISIBLE_DEVICES=1 so that only the Tesla can be used by CUDA code.
See my blog post on the subject.
To control which GPU your application uses programmatically, use CUDA's device management API. Query the number of devices with cudaGetDeviceCount, query each device's properties with cudaGetDeviceProperties, and then call cudaSetDevice on the device that fits your application's criteria. You can also use cudaChooseDevice to select the device that most closely matches the device properties you specify.
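For instance, a minimal sketch that picks the device with the most global memory (which on your machine would be the 5 GB Tesla C2075 rather than the 1 GB Quadro 2000):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0, best = 0;
        size_t bestMem = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            if (prop.totalGlobalMem > bestMem) {      // keep the device with the most memory
                bestMem = prop.totalGlobalMem;
                best = dev;
            }
        }
        cudaSetDevice(best);                          // subsequent CUDA calls use this device
        printf("Using device %d with %zu MB of global memory\n",
               best, bestMem / (1024 * 1024));
        return 0;
    }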
I have a CUDA application that is data-movement bound (i.e. large memcopies from host to device with relatively little computation done in the kernel). On older GPUs I was compute-bound (e.g. on a Quadro FX 5800), but with the Fermi and Kepler architectures that is no longer the case (for my optimized code).
I just moved my code to a GTX 680 and was impressed with the increased compute performance, but perplexed that the bandwidth between the host and the GPU seems to have dropped (relative to my Fermi M2070).
In short, when I run the canned SDK bandwidth test I get ~5000 MB/sec on the GTX 680 versus ~5700 MB/sec on the M2070. I recognize that the GTX is "only a gamer card", but the specs for the GTX 680 seem more impressive than those of the M2070, with the exception of the bus width.
From Wikipedia:
M2070: 102.4 GB/sec, GDDR3, 512-bit bus width
GTX 680: 192 GB/sec, GDDR5, 256-bit bus width
I'm running the canned test with "--wc --memory=pinned" to use write-combined memory.
The improved results I get with this test are mirrored by the results I am getting with my optimized CUDA code.
Unfortunately, I can't run the test on the same machine (and just switch video cards), but I have tried the GTX 680 on older and newer machines and get the same "degraded" results (relative to what I get on the M2070).
Can anyone confirm that they are able to achieve higher-throughput memcopies with the M2070 than with the GTX 680? Doesn't the bandwidth spec take the bus width into consideration? The other possibility is that I'm not doing the memcopies correctly/optimally on the GTX 680, but in that case, is there a patch for the bandwidth test so that it will also show that I'm transferring data faster to the 680 than to the M2070?
Thanks.
As Robert Crovella has already commented, your bottleneck is the PCIe bandwidth, not the GPU memory bandwidth.
Your GTX 680 can potentially outperform the M2070 by a factor of two here, as it supports PCIe 3.0, which doubles the bandwidth over the PCIe 2.0 interface of the M2070. However, you need a mainboard that supports PCIe 3.0 for that.
The bus width of the GPU memory is not a concern in itself, even for programs that are bound by GPU memory bandwidth. NVIDIA managed to substantially increase the clock frequency of the memory bus on the GTX 680, which more than compensates for the reduced bus width relative to the M2070.
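For reference, here is a minimal sketch of a pinned, write-combined host-to-device transfer timed with CUDA events (sizes are illustrative), in the spirit of the SDK bandwidth test run with --wc --memory=pinned. On a PCIe 2.0 x16 link this kind of transfer typically tops out around 5-6 GB/s, while a PCIe 3.0 x16 link roughly doubles that, so the numbers you see reflect the link rather than the GPU memory:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = (size_t)256 * 1024 * 1024;   // 256 MB per transfer
        void *h_buf = NULL, *d_buf = NULL;
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocWriteCombined);   // pinned + write-combined
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host-to-device bandwidth: %.1f MB/s\n",
               (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }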