Several questions about CUDA and cuPrintf - cuda

I can successfully compile my code that uses cuPrintf with nvcc, but I cannot compile it in the Visual Studio 2012 environment. The error says that "volatile char *" cannot be converted to "const void *" in the "cudaMemcpyToSymbol" call.
cuPrintf doesn't seem to work: none of the cuPrintf calls in the kernel code appear to execute.
How can I make nvcc export a pdb file?
Is there any other convenient way to debug inside a kernel function? I have only one laptop.

1st, cuPrintf is deprecated (as far as I know it was never officially released). You can print data from a kernel using printf (on devices of compute capability 2.0 or higher), but this is not a recommended way of debugging your kernels.
2nd, you are compiling with the CUDA nvcc compiler, and there is no such thing as a pdb file in CUDA. Do watch the -g and -G flags, though: they may dramatically increase your running time.
3rd, the best way to debug kernels is to use Nsight in Visual Studio.

Related

let Nsight start debugging after certain kernel function is executed

My CUDA program has a lot of kernel functions, and if I enable CUDA debugging mode, I have to wait a whole hour before the breakpoint in a particular kernel function is triggered.
Is there any way for Nsight to start debugging only after certain kernel functions have run, or to debug only one particular kernel function?
I'm using Nsight with VS2012.
In theory you can follow the instructions in the Nsight help file (either the online help or the local help; at the time of writing the page is here).
In short:
In the Nsight Monitor options, CUDA » Use this Monitor for CUDA attach should be True.
Before starting your application, set an environment variable called NSIGHT_CUDA_DEBUGGER to 1.
Then in your CUDA kernel, you can add a breakpoint like this:
asm("brkpt;");
This works similarly to the __debugbreak() intrinsic or the int 3 assembly instruction in host code. When it is hit, you will get a dialog prompting you to attach the CUDA debugger.
In practice, at least for me it Just Doesn't Work™. Maybe you'll have more luck.

How can I tell PyCUDA which GPU to use?

I have two NVIDIA cards in my machine, and both are CUDA capable. When I run the getting-started example script for PyCUDA from http://documen.tician.de/pycuda/ I get the error
nvcc fatal : Value 'sm_30' is not defined for option 'gpu-architecture'
My computing GPU has compute capability 3.0, so sm_30 should be the right option for the nvcc compiler. My graphics GPU is only CC 1.2, so I thought maybe that's the problem. I've installed the CUDA 5.0 release for Linux with no errors, along with all the compiler components and Python components.
Is there a way to tell PyCUDA explicitly which GPU to use?
nvcc isn't going to complain based on the specific GPUs you have installed. It will compile for whatever GPU type you tell it to compile for. The problem is that you are specifying sm_30, which is not a valid value for --gpu-architecture when a --gpu-code option is also specified.
You should be passing compute_30 for --gpu-architecture and sm_30 for --gpu-code.
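As a rough illustration of that pairing, here is a minimal Python sketch that builds the two options for a given compute capability. The helper name nvcc_arch_flags and the file name kernel.cu are made up for this example; only the --gpu-architecture and --gpu-code options themselves come from nvcc.

```python
def nvcc_arch_flags(major, minor):
    """Return nvcc options for one compute capability (hypothetical helper)."""
    cc = f"{major}{minor}"
    return [
        f"--gpu-architecture=compute_{cc}",  # virtual architecture (PTX level)
        f"--gpu-code=sm_{cc}",               # real architecture (GPU binary)
    ]


# kernel.cu is a placeholder file name, not something from the question.
print(" ".join(["nvcc", *nvcc_arch_flags(3, 0), "kernel.cu"]))
```

If I remember correctly, PyCUDA's SourceModule takes arch and code keyword arguments that end up as these two options, so the same pairing (arch="compute_30", code="sm_30") should apply there, but check that against your PyCUDA version.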
Also be sure you have the correct nvcc in use and are not inadvertently using some old version of the CUDA toolkit.
Once you have the compile problem sorted out, there is an environment variable CUDA_DEVICE that pycuda will observe to select a particular installed GPU.
From here:
CUDA_DEVICE=2 python my-script.py
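The same variable can probably be set from inside the script instead of on the command line. The sketch below assumes that PyCUDA reads CUDA_DEVICE when its default context is created (for example on import of pycuda.autoinit), so the variable has to be set before that point. The helper name select_cuda_device is made up for illustration.

```python
import os


def select_cuda_device(index):
    """Set CUDA_DEVICE for this process (hypothetical helper)."""
    os.environ["CUDA_DEVICE"] = str(index)
    return os.environ["CUDA_DEVICE"]


select_cuda_device(1)
# Only after the variable is set would the script go on to do, for example:
#   import pycuda.autoinit
```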
By the way someone else had your problem.
Are you sure you don't have an old version of the CUDA toolkit lying around that PyCUDA is using?
I don't know about the Python wrapper (or about Python in general), but in C++ you have the WGL_NV_gpu_affinity NVIDIA extension, which allows you to target a specific GPU. You can probably write a wrapper for it in Python.
EDIT:
Now that I see you are actually running Linux, the solution is simpler (C++). You just need to pick the X display before context init.
So basically the default GPU is usually targeted with the display string ":0.0".
To open a display on the second GPU you can do something like this:
// _display is assumed to be a Display* declared elsewhere (needs <X11/Xlib.h> and <cstdio>)
const char* gpuNum = ":0.1";
if (!(_display = XOpenDisplay(gpuNum))) {
    printf("error: %s\n", "failed to open display");
} else {
    printf("message: %s\n", "display created");
}
// the rest of the context setup goes here
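At least currently, it seems possible to just say
import pycuda.driver as drv
drv.init()  # the driver must be initialized before Device() can be used
ctx = drv.Device(6).make_context()  # create a context on device 6 and make it current
and this makes a context on device 6 the current context.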

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering whether CUDA uses an interpreter or a compiler.
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
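To make the quoted explanation more concrete, here is a simplified Python model of the choice between a precompiled GPU binary and runtime compilation of PTX. It is only a sketch of the documented compatibility rules, not the driver's actual logic, and the function name pick_code and its arguments are invented for this example.

```python
def pick_code(device_cc, cubins, ptx_cc):
    """Model how a kernel could run on a device (illustrative only).

    device_cc: (major, minor) compute capability of the GPU
    cubins:    list of (major, minor) targets with precompiled binaries
    ptx_cc:    (major, minor) the embedded PTX targets, or None
    """
    major, minor = device_cc
    # A precompiled binary is usable on the same major revision
    # with an equal or higher minor revision.
    usable = [c for c in cubins if c[0] == major and c[1] <= minor]
    if usable:
        return ("binary", max(usable))
    # Otherwise the PTX can be compiled at runtime for any device
    # that is at least as new as the PTX target.
    if ptx_cc is not None and ptx_cc <= device_cc:
        return ("jit", ptx_cc)
    return ("unsupported", None)


# A CC 3.0 card with only a CC 1.2 binary and CC 1.2 PTX available:
print(pick_code((3, 0), [(1, 2)], (1, 2)))
```

In this model the CC 3.0 card cannot use the CC 1.2 binary, so it falls back to compiling the PTX at runtime, which is how one build can serve two cards with different compute capabilities.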
The official documents, the CUDA C Programming Guide and The CUDA Compiler Driver (NVCC), explain all the details of the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range
of conventional compiler options, such as for defining macros and
include/library paths, and for steering the compilation process.
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of bytecode and converted to native code by the underlying driver.

how to extract ptx from cuda exe and some related cuda compiler questions

1) I want to extract the PTX code from a CUDA exe and use that kernel code in another program.
Is there a way to identify the kernel PTX code in an exe? I know it is arbitrarily laid out in the exe file's data section.
I learnt that in Mac executables the PTX kernels start with .version and end with a null string. Is there something like that for Windows exe (PE) files? I guess I need to parse the exe file, gather the PTX statements one at a time, and group them together as kernels, but I don't know how I would go about it. Some help would get me started. I also find a .nvFatBi section in the CUDA exe. What is that supposed to be?
2) I also learnt that there are global constructors which register the cubin with the CUDA runtime. I don't understand this part completely. Does the function cudaRegisterFatBinary come into play here? If so, how can I use this PTX to supply the pointer to cudaRegisterFatBinary? I understand I have to compile the PTX to a cubin file. Is that possible programmatically? In short, I want to emulate nvcc itself in some sense.
Try: cuobjdump --dump-ptx [executable-name]
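If you want to drive this from a script, a minimal Python sketch could look like the following. It assumes cuobjdump is on the PATH; the function names dump_ptx and kernel_names and the hand-written PTX sample are made up for illustration, and the regular expression is a rough approximation rather than a full PTX parser.

```python
import re
import subprocess


def dump_ptx(executable):
    """Run cuobjdump on an executable and return its PTX listing as text."""
    result = subprocess.run(
        ["cuobjdump", "--dump-ptx", executable],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def kernel_names(ptx_text):
    """List the names declared with the PTX .entry directive."""
    return re.findall(r"\.entry\s+([A-Za-z_$%][\w$]*)", ptx_text)


# Hand-written sample standing in for real cuobjdump output.
sample = """
.version 3.1
.target sm_30
.visible .entry _Z3addPfS_(
    .param .u64 a
)
{
    ret;
}
"""
print(kernel_names(sample))
```

With a real binary you would call kernel_names(dump_ptx("my_app.exe")), where my_app.exe is a placeholder for your executable.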

cuda with mingw - updated

We have been developing our code in Linux, but would like to compile a Windows executable. The old non-GPU version compiles just fine with MinGW in Windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile the kernel code with nvcc in Visual Studio, and the rest with gcc in MinGW.
So far, we have easily compiled the .cu file (with the kernel and kernel launches) in Visual Studio. However, we still can't compile the C code in MinGW. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. However, gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the c portion of the code?
If you are really desperate there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with MinGW while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is host-machine independent). Afterwards, link both parts of the code back together with MinGW, or use the driver API to load the fatbin dynamically. Disclaimer: I did not test this!
As far as I know, it is impossible to use CUDA without MSVC. So, you need MSVC to make nvcc work, and you can compile CPU code with mingw and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the Cycles renderer handles this; look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick but it might help you get started.