I am writing a program that queries and displays all the properties of a GPU device in CUDA 6.5 (C++). But when I run it, it does not show the device name as I expect, and the maximum number of threads per block is reported as 1.
I am using an ASUS EN9400GT GPU.
The ASUS EN9400GT is based on the GeForce 9400 GT, whose compute capability is 1.0. CUDA 6.5 dropped support for cc 1.0, so your code won't work. You should use CUDA 6.0 for cc 1.0 devices (link).
You could have found this out yourself if you had used correct error checking code for the CUDA APIs. When checking the return value of a CUDA API call, you should compare it with cudaSuccess, not with an arbitrary integer value. If you had compared GPUAvail with cudaSuccess like this:
if (GPUAvail != cudaSuccess)
    exit(EXIT_FAILURE);
then your program would have stopped. See this article for a proper error checking method.
Also, check out the deviceQuery CUDA sample code; it does exactly what you are trying to do.
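For reference, here is a minimal sketch of such a query loop with error checking in place (the checkCuda helper is a hypothetical name, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message on any CUDA error.
static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    int deviceCount = 0;
    checkCuda(cudaGetDeviceCount(&deviceCount), "cudaGetDeviceCount");

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        checkCuda(cudaGetDeviceProperties(&prop, dev), "cudaGetDeviceProperties");
        printf("Device %d: %s (cc %d.%d), max threads per block: %d\n",
               dev, prop.name, prop.major, prop.minor, prop.maxThreadsPerBlock);
    }
    return 0;
}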
Related
I have a Windows system with 2 NVIDIA GPUs. Can someone tell me which GPU the CUDA compiler is using? Is it possible to switch GPUs, or use both together for the same process?
http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
Use 'cudaGetDeviceCount' to get the number of devices. If deviceCount is 2, then device index 0 and device index 1 refer to the two installed devices.
And 'cudaGetDeviceProperties' can be used to get many properties of the device.
For example,
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, 1);
can be used to get many properties of device 1.
And the way to switch to different GPUs is easy. After initialization, use
'cudaSetDevice(0)'
and
'cudaSetDevice(1)'
to switch to different GPUs.
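As a rough sketch (error checking omitted for brevity), enumerating all devices and then selecting one could look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Print the name of every device the runtime can see.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("Device %d: %s\n", dev, deviceProp.name);
    }

    // All subsequent CUDA calls on this host thread target device 1.
    if (deviceCount > 1)
        cudaSetDevice(1);
    return 0;
}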
The CUDA_VISIBLE_DEVICES environment variable lets you control which devices are visible to your program and the order in which they are enumerated.
CUDA_VISIBLE_DEVICES="0,1" makes both GPU devices available to your program.
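For example, to expose only the second physical GPU (which your program will then see as device 0), you could launch it as follows (my_program is a placeholder for your executable):
CUDA_VISIBLE_DEVICES="1" ./my_program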
Possible duplicate of CUDA GPU selected by position, but how to set default to be something other than device 0?
Recently I started building an application that uses CUDA 8.0 on Visual Studio 2015. Because I have to use dynamic parallelism, I changed the Code Generation setting from compute_20, sm_20 (the default) to compute_35, sm_35. Since I changed it, printf() invoked inside a kernel does not print anything.
Do you know how I can use dynamic parallelism and still print something from inside the kernel?
Perhaps it is worth mentioning that my graphics card is a GeForce GTX 760.
Your GeForce GTX 760 is of compute capability 3.0 and doesn't support dynamic parallelism.
Compiling for the virtual compute_35 architecture prevents your kernel from running at all, as the virtual architecture needs to be less than or equal to your device's compute capability. That is why you see no output from printf() inside the kernel.
As Robert Crovella has remarked above, you would have noticed this with proper error checking.
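As a minimal sketch of what that checking looks like around a kernel launch (myKernel here is a placeholder, not code from the question):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the question's real kernel.
__global__ void myKernel() { printf("hello from the device\n"); }

int main()
{
    myKernel<<<1, 32>>>();
    cudaError_t err = cudaGetLastError();   // catches configuration/launch errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // catches errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}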
I'm running this command in a shell and get:
C:\Users\me>nvidia-smi -L
GPU 0: Quadro K2000 (UUID: GPU-b1ac50d1-019c-58e1-3598-4877fddd3f17)
GPU 1: Quadro 2000 (UUID: GPU-1f22a253-c329-dfb7-0db4-e005efb6a4c7)
But in my code, when I run cuDeviceGetName(.., ID), where ID is the ID given by the nvidia-smi output, the devices are inverted: GPU 0 becomes the Quadro 2000 and GPU 1 becomes the Quadro K2000.
Is this expected behavior or a bug? Does anyone know a workaround to make nvidia-smi report the 'real' IDs of the GPUs? I could use the UUID to get the proper device with nvmlDeviceGetUUID(), but using the NVML API seems a bit too complicated for what I'm trying to achieve.
This question discusses how CUDA assigns IDs to devices, without reaching a clear conclusion.
I am using CUDA 6.5.
EDIT: I've had a look at the nvidia-smi manpage (should have done that earlier...). It states:
"It is recommended that users desiring consistency use either UUID or PCI bus ID, since device enumeration ordering is not guaranteed to be consistent"
Still looking for a kludge...
You can set the device order for the CUDA environment in your shell to follow the PCI bus ID instead of the default order (fastest card first). This requires CUDA 7 and up.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
It's expected behavior.
nvidia-smi enumerates in PCI order.
By default, the CUDA driver and runtime APIs do not.
The question you linked clearly shows how to associate the two numbering/ordering schemes.
There is no way to cause nvidia-smi to modify its ordering scheme to match whatever will be generated by the CUDA runtime or driver APIs. However you can modify the CUDA runtime enumeration order through the use of an environment variable in CUDA 8.
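One way to associate the two orderings (a sketch using the runtime API) is to print each CUDA device ordinal together with its PCI bus ID and match that against nvidia-smi's PCI-ordered output:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        char busId[32];
        // Fills busId with an identifier such as "0000:01:00.0".
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        printf("CUDA device %d has PCI bus ID %s\n", dev, busId);
    }
    return 0;
}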
It's the expected behaviour.
The nvidia-smi manpage says that the reported index is
the GPU/Unit's 0-based index in the natural enumeration returned by the driver,
while the CUDA API enumerates in descending order of compute capability, according to the Programming Guide, section 3.2.6.1 "Device Enumeration".
I had this problem, so I wrote a program that is an analog of nvidia-smi but enumerates devices in an order consistent with the CUDA API:
https://github.com/smilart/nvidia-cdl
I wrote it because nvidia-smi cannot enumerate devices in an order consistent with the CUDA API.
I tried to run PTX assembly code generated from a .cl kernel with the CUDA driver API. The steps I took were these (the standard OpenCL procedure):
1) Load the .cl kernel
2) JIT compile it
3) Get the compiled PTX code and save it.
So far so good.
I noticed some special registers inside the PTX assembly, %envreg3, %envreg6, etc. The problem is that these registers are not set when I try to execute the code with the driver API (according to the PTX ISA, these registers are set by the driver before the kernel launch). So the code falls into an infinite loop and fails to run correctly. But if I set the values manually (more exactly, I replace %envreg6 with the block size inside the PTX), the code executes and I get the correct results (correct compared with the CPU results).
Does anyone know how we can set values for these registers, or maybe whether I am missing something, e.g. a flag on cuLaunchKernel that sets values for these registers?
You are trying to compile an OpenCL kernel and run it using the CUDA driver API. The NVIDIA driver/compiler interface is different between OpenCL and CUDA, so what you want to do is not supported and fundamentally cannot work.
Presumably, the only workaround would be the one you found: to patch the PTX code. But I'm afraid this might not work in the general case.
Edit:
Specifically, OpenCL supports larger grids than most NVIDIA GPUs do, so grid sizes need to be virtualized by dividing them across multiple actual grid launches, which makes offsets necessary. Also, in OpenCL, indices do not necessarily start from (0, 0, 0); the user can specify offsets, which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
I've been googling around and have only been able to find a trivial example of the new dynamic parallelism in Compute Capability 3.0 in one of their Tech Briefs linked from here. I'm aware that the HPC-specific cards probably won't be available until this time next year (after the nat'l labs get theirs). And yes, I realize that the simple example they gave is enough to get you going, but the more the merrier.
Are there other examples I've missed?
To save you the trouble, here is the entire example given in the tech brief:
__global__ void ChildKernel(void* data){
    // Operate on data
}

__global__ void ParentKernel(void* data){
    ChildKernel<<<16, 1>>>(data);
}

// In Host Code
ParentKernel<<<256, 64>>>(data);

// Recursion is also supported
// (continueRecursion is assumed to be defined elsewhere)
__global__ void RecursiveKernel(void* data){
    if(continueRecursion == true)
        RecursiveKernel<<<64, 16>>>(data);
}
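As a side note: once hardware that supports it is available, code using dynamic parallelism has to be compiled with relocatable device code and linked against the device runtime library, along the lines of:
nvcc -arch=sm_35 -rdc=true test.cu -lcudadevrt -o test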
EDIT:
The GTC talk New Features In the CUDA Programming Model focused mostly on the new Dynamic Parallelism in CUDA 5. The link has the video and slides. Still only toy examples, but a lot more detail than the tech brief above.
Here is what you need: the Dynamic Parallelism Programming Guide, full of details and examples: http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Just to confirm: dynamic parallelism is only supported on GPUs with a compute capability of 3.5 upwards.
I have a compute capability 3.0 GPU with CUDA 5.0 installed. I compiled the dynamic parallelism examples with
nvcc -arch=sm_30 test.cu
and received the compile error below:
test.cu(10): error: calling a global function("child_launch") from a global function("parent_launch") is only allowed on the compute_35 architecture or above.
GPU info
Device 0: "GeForce GT 640"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Hope this helps.
I edited the question title to "...CUDA 5...", since Dynamic Parallelism is new in CUDA 5, not CUDA 4. We don't have any public examples available yet, because we don't have public hardware available that can run them. CUDA 5.0 will support dynamic parallelism but only on Compute Capability 3.5 and later (GK110, for example). These will be available later in the year.
We will release some examples with a CUDA 5 release candidate closer to the time the hardware is available.
I think compute capability 3.0 doesn't include dynamic parallelism. It will be included in the GK110 architecture (aka "Big Kepler"); I don't know what compute capability number it will be assigned (3.1, maybe?). Those cards won't be available until late this year (I'm waiting so much for those). As far as I know, 3.0 corresponds to the GK104 chips like the GTX 690 or the GT 640M for laptops.
Just wanted to check in with you all given that the CUDA 5 RC was released recently. I looked in the SDK examples and wasn't able to find any dynamic parallelism there. Someone correct me if I'm wrong. I searched for kernel launches within kernels by grepping for "<<<" and found nothing.