What does 'compute capability' mean w.r.t. CUDA?

I am new to CUDA programming and don't know much about it. Can you please tell me what 'CUDA compute capability' means? When I ran the following code on my university server, it showed me the following result.
for (device = 0; device < deviceCount; ++device)
{
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("\nDevice %d has compute capability %d.%d.\n", device, deviceProp.major, deviceProp.minor);
}
RESULT:
Device 0 has compute capability 4199672.0.
Device 1 has compute capability 4199672.0.
Device 2 has compute capability 4199672.0.
.
.
cudaGetDeviceProperties returns the two fields major and minor. Can you please tell me what this 4199672.0 means?

The compute capability is the "feature set" (both hardware and software features) of the device. You may have heard the NVIDIA GPU architecture names "Tesla", "Fermi" or "Kepler". Each of those architectures has features that previous versions might not have.
In your CUDA toolkit installation folder on your hard drive, look for the file CUDA_C_Programming_Guide.pdf (or google it), and find Appendix F.1. It describes the differences in features between the different compute capabilities.

As @dialer mentioned, the compute capability is your CUDA device's set of computation-related features. As NVIDIA's CUDA API develops, the 'Compute Capability' number increases. At the time of writing, NVIDIA's newest GPUs are Compute Capability 3.5. You can get some details of what the differences mean by examining this table on Wikipedia.
As @aland suggests, your call probably failed, and what you're getting is the result of reading an uninitialized variable. You should wrap your cudaGetDeviceProperties() call with some kind of error checking; see
What is the canonical way to check for errors using the CUDA runtime API?
for a discussion of the options for doing this.
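For illustration, here is a minimal sketch of one such wrapper (the checkCuda macro name is my own, not from that question, and aborting on error is just one of the strategies discussed there):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a message and abort if a CUDA runtime call fails.
#define checkCuda(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    int deviceCount = 0;
    checkCuda(cudaGetDeviceCount(&deviceCount));
    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        checkCuda(cudaGetDeviceProperties(&deviceProp, device));
        printf("Device %d has compute capability %d.%d.\n",
               device, deviceProp.major, deviceProp.minor);
    }
    return 0;
}

With this in place, a failed call reports the actual error instead of leaving deviceProp full of garbage like the 4199672.0 in the question.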

Related

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have 4 Tesla K20s, if you need this information, and all of them update parts of those arrays; those updates must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on compute capability 6.x or newer devices (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, such as -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards -- these are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
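A minimal sketch of what this might look like, assuming a managed counter updated from every GPU in the system, devices of compute capability 6.0+, and a platform that supports concurrent managed access (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void incrementKernel(int *counter) {
    // atomicAdd_system makes the update atomic across all GPUs (and the CPU),
    // not just within the launching device.
    atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Launch the same kernel on every device; all of them update the
    // same managed allocation.
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        incrementKernel<<<1, 256>>>(counter);
    }
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }

    printf("counter = %d\n", *counter);  // expect 256 * deviceCount
    cudaFree(counter);
    return 0;
}

Compile with something like nvcc -arch=sm_60 (or whatever matches your hardware, per the compute capability requirement above).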

Does the nVidia Titan V support GPUDirect?

I was wondering if someone might be able to help me figure out whether the new Titan V from NVIDIA supports GPUDirect. As far as I can tell, it seems limited to Tesla and Quadro cards.
Thank you for taking the time to read this.
GPUDirect Peer-to-Peer (P2P) is supported between any 2 "like" CUDA GPUs (of compute capability 2.0 or higher), if the system topology supports it, and subject to other requirements and restrictions. In a nutshell, the system topology requirement is that both GPUs participating must be enumerated under the same PCIE root complex. If in doubt, "like" means identical. Other combinations may be supported (e.g. 2 GPUs of the same compute capability) but this is not specified, or advertised as supported. If in doubt, try it out. Finally, these things must be "discoverable" by the GPU driver. If the GPU driver cannot ascertain these facts, and/or the system is not part of a whitelist maintained in the driver, then P2P support will not be possible.
Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is querying the runtime via cudaDeviceCanAccessPeer (the provided CUDA sample codes include tools that do this). So the statement here "is supported" should not be construed to refer to a particular GPU type. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
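As a minimal sketch, a runtime query for a hypothetical two-GPU system (device ordinals 0 and 1 assumed; error checking omitted) might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);  // can device 0 reach device 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);  // and the reverse?
    printf("P2P 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        // Enable peer access in both directions before doing P2P transfers.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    return 0;
}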
GPUDirect RDMA is only supported on Tesla and possibly some Quadro GPUs.
So, if you had a system that had 2 Titan V GPUs plugged into PCIE slots that were connected to the same root complex (usually, except in Skylake CPUs, it should be sufficient to say "connected to the same CPU socket"), and the system (i.e. core logic) was recognized by the GPU driver, I would expect P2P to work between those 2 GPUs.
I would not expect GPUDirect RDMA to work to a Titan V, under any circumstances.
YMMV. If in doubt, try it out, before making any large purchasing decisions.

If I have multiple Nvidia GPUs in my system, how do I check which GPU is currently used by the CUDA compiler?

I have a Windows system with 2 Nvidia GPUs. Can someone tell me which GPU the CUDA compiler is using? Is it possible to switch the GPUs, or to use both together for the same process?
http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
Use 'cudaGetDeviceCount' to get the number of devices. If deviceCount is 2, then device index 0 and device index 1 refer to the two current devices.
And 'cudaGetDeviceProperties' can be used to get many properties of the device.
For example,
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 1);
can be used to get many properties of device 1.
And the way to switch to different GPUs is easy. After initialization, use
'cudaSetDevice(0)'
and
'cudaSetDevice(1)'
to switch to different GPUs.
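Putting those pieces together, a minimal sketch (error checking omitted) that lists the devices and then selects one might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
    }
    // Subsequent runtime calls and kernel launches issued from this host
    // thread will target device 1.
    cudaSetDevice(1);
    return 0;
}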
The CUDA_VISIBLE_DEVICES environment variable will allow you to modify which devices are enabled and in what order they appear.
CUDA_VISIBLE_DEVICES="0,1" will enable both GPU devices to be available to your program.
Possible duplicate of CUDA GPU selected by position, but how to set default to be something other than device 0?

Inconsistency of IDs between 'nvidia-smi -L' and cuDeviceGetName()

I'm running this command into a shell and get:
C:\Users\me>nvidia-smi -L
GPU 0: Quadro K2000 (UUID: GPU-b1ac50d1-019c-58e1-3598-4877fddd3f17)
GPU 1: Quadro 2000 (UUID: GPU-1f22a253-c329-dfb7-0db4-e005efb6a4c7)
But in my code, when I run cuDeviceGetName(.., ID) where ID is the ID given by the nvidia-smi output, the devices have been inverted: GPU 0 becomes Quadro 2000 and GPU 1 becomes Quadro K2000.
Is this expected behavior or a bug? Does anyone know a workaround to make nvidia-smi report the 'real' ID of the GPUs? I could use the UUID to get the proper device with nvmlDeviceGetUUID(), but using the nvml API seems a bit too complicated for what I'm trying to achieve.
This question discusses how CUDA assigns IDs to devices, without a clear conclusion.
I am using CUDA 6.5.
EDIT: I've had a look at nvidia-smi manpage (should have done that earlier...). It states:
"It is recommended that users desiring consistencyuse either UUDI or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent"
Still looking for a kludge...
You can set the device order for the CUDA environment in your shell to follow the PCI bus ID instead of the default (fastest card first). This requires CUDA 7 and up.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
It's expected behavior.
nvidia-smi enumerates in PCI order.
By default, the CUDA driver and runtime APIs do not.
The question you linked clearly shows how to associate the two numbering/ordering schemes.
There is no way to cause nvidia-smi to modify its ordering scheme to match whatever will be generated by the CUDA runtime or driver APIs. However you can modify the CUDA runtime enumeration order through the use of an environment variable in CUDA 8.
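For example, one way to correlate the two orderings yourself is to print each CUDA device's PCI identifiers, which nvidia-smi also reports (a sketch; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // pciDomainID/pciBusID/pciDeviceID match the bus IDs nvidia-smi shows,
        // regardless of how either tool orders its device indices.
        printf("CUDA device %d: %s at PCI %04x:%02x:%02x\n",
               d, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}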
It's the expected behaviour.
nvidia-smi manpage says that
the GPU/Unit's 0-based index in the natural enumeration returned by the driver,
CUDA API enumerates in descending order of compute capability according to "Programming Guide" 3.2.6.1 Device enumeration.
I had this problem, and I have written a program that is an analog of nvidia-smi, but with devices enumerated in an order consistent with the CUDA API. Here is a link to the program:
https://github.com/smilart/nvidia-cdl
I wrote the program because nvidia-smi cannot enumerate devices in an order consistent with the CUDA API.

Dearth of CUDA 5 Dynamic Parallelism Examples

I've been googling around and have only been able to find a trivial example of the new dynamic parallelism in Compute Capability 3.0 in one of their Tech Briefs linked from here. I'm aware that the HPC-specific cards probably won't be available until this time next year (after the nat'l labs get theirs). And yes, I realize that the simple example they gave is enough to get you going, but the more the merrier.
Are there other examples I've missed?
To save you the trouble, here is the entire example given in the tech brief:
__global__ void ChildKernel(void* data){
    // Operate on data
}
__global__ void ParentKernel(void* data){
    ChildKernel<<<16, 1>>>(data);
}
// In host code
ParentKernel<<<256, 64>>>(data);

// Recursion is also supported
__global__ void RecursiveKernel(void* data){
    if (continueRecursion)  // continueRecursion: some termination condition
        RecursiveKernel<<<64, 16>>>(data);
}
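(Not part of the tech brief excerpt: to actually build something like this you need a compute capability 3.5+ target, relocatable device code, and the device runtime library, along these lines, with the file name as a placeholder:)

nvcc -arch=sm_35 -rdc=true dynpar.cu -o dynpar -lcudadevrt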
EDIT:
The GTC talk New Features In the CUDA Programming Model focused mostly on the new Dynamic Parallelism in CUDA 5. The link has the video and slides. Still only toy examples, but a lot more detail than the tech brief above.
Here is what you need, the Dynamic parallelism programming guide. Full of details and examples: http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Just to confirm that dynamic parallelism is only supported on GPUs with a compute capability of 3.5 upwards:
I have a 3.0 GPU with CUDA 5.0 installed. I compiled the Dynamic Parallelism examples with
nvcc -arch=sm_30 test.cu
and received the compile error below:
test.cu(10): error: calling a global function("child_launch") from a global function("parent_launch") is only allowed on the compute_35 architecture or above.
GPU info
Device 0: "GeForce GT 640"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Hope this helps.
I edited the question title to "...CUDA 5...", since Dynamic Parallelism is new in CUDA 5, not CUDA 4. We don't have any public examples available yet, because we don't have public hardware available that can run them. CUDA 5.0 will support dynamic parallelism but only on Compute Capability 3.5 and later (GK110, for example). These will be available later in the year.
We will release some examples with a CUDA 5 release candidate closer to the time the hardware is available.
I think compute capability 3.0 doesn't include dynamic parallelism. It will be included in the GK110 architecture (aka "Big Kepler"); I don't know what compute capability number it will have assigned (3.1, maybe?). Those cards won't be available until late this year (I'm waiting sooo much for those). As far as I know, the 3.0 corresponds to the GK104 chips like the GTX690 or the GT640M for laptops.
Just wanted to check in with you all given that the CUDA 5 RC was released recently. I looked in the SDK examples and wasn't able to find any dynamic parallelism there. Someone correct me if I'm wrong. I searched for kernel launches within kernels by grepping for "<<<" and found nothing.