How can I tell PyCUDA which GPU to use? - cuda

I have two NVIDIA cards in my machine, and both are CUDA-capable. When I run the example script to get started with PyCUDA shown here: http://documen.tician.de/pycuda/ I get the error
nvcc fatal : Value 'sm_30' is not defined for option 'gpu-architecture'
My computing GPU is compute capability 3.0, so sm_30 should be the right option for the nvcc compiler. My graphics GPU is only CC 1.2, so I thought maybe that's the problem. I've installed the CUDA 5.0 release for Linux with no errors, along with all the compiler and Python components.
Is there a way to tell PyCUDA explicitly which GPU to use?

nvcc isn't going to complain based on the specific GPUs you have installed. It will compile for whatever GPU type you tell it to compile for. The problem is that you are specifying sm_30, which is not a valid value for --gpu-architecture when a --gpu-code option is also specified.
You should be passing compute_30 for --gpu-architecture and sm_30 for --gpu-code.
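For reference, a standalone nvcc invocation with that flag combination would look something like this (the file name is just a placeholder):
nvcc --gpu-architecture=compute_30 --gpu-code=sm_30 -cubin kernel.cu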
Also be sure you have the correct nvcc in use and are not inadvertently using some old version of the CUDA toolkit.
Once you have the compile problem sorted out, there is an environment variable CUDA_DEVICE that pycuda will observe to select a particular installed GPU.
From here:
CUDA_DEVICE=2 python my-script.py
By the way, someone else had the same problem.
Are you sure you don't have an old version of the CUDA toolkit laying around that PyCUDA is using?

I don't know about the Python wrapper (or about Python in general), but in C++ you have the WGL_NV_gpu_affinity NVIDIA extension, which allows you to target a specific GPU. You could probably write a wrapper for it in Python.
EDIT:
Now that I see you are actually running Linux, the solution is simpler (in C++). You just need to open the right X display before context initialization.
So basically the default GPU is usually targeted with the display string ":0.0".
To open a display on the second GPU you can do something like this:
#include <X11/Xlib.h>
#include <cstdio>

// display strings have the form host:display.screen, so ":0.1" selects the
// second X screen (second GPU) and ":0.0" the default one
const char* gpuNum = ":0.1";
Display* _display = XOpenDisplay(gpuNum);
if (!_display) {
    printf("error: %s\n", "failed to open display");
} else {
    printf("message: %s\n", "display created");
}
// ...here comes the rest of the context setup

At least currently, it seems possible to just say
import pycuda.driver as drv
drv.init()
drv.Device(6).make_context()
and this creates a context on device 6 and makes it current.

Related

Can I use cuda without using nvcc on my host code?

I'm writing a single-header library that executes a CUDA kernel. I was wondering if there is a way to get around the <<<>>> syntax, or to get C source output from nvcc?
You can avoid the host language extensions by using the CUDA driver API instead. It is a little more verbose and you will require a little more boilerplate code to manage the context, but it is not too difficult.
Conventionally, you would compile to PTX or a binary payload to load at runtime; however, NVIDIA now also ships an experimental JIT CUDA C compiler library, libNVVM, which you could try if you want JIT compilation from source.
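If you go the driver API route, a minimal sketch of a launch without <<<>>> might look like this (error checking omitted; the module file "kernel.ptx" and kernel name "my_kernel" are placeholders, and the PTX would be produced offline, e.g. with nvcc -ptx):

#include <cuda.h>

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  kernel;
    CUdeviceptr d_data;
    int n = 1024;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // "my_kernel" should be declared extern "C" __global__ in kernel.cu
    // to avoid C++ name mangling in the PTX
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&kernel, mod, "my_kernel");

    cuMemAlloc(&d_data, n * sizeof(float));

    // equivalent of my_kernel<<<n / 256, 256>>>(d_data, n)
    void *args[] = { &d_data, &n };
    cuLaunchKernel(kernel, n / 256, 1, 1,   // grid dimensions
                   256, 1, 1,               // block dimensions
                   0, NULL, args, NULL);    // shared memory, stream, arguments
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuCtxDestroy(ctx);
    return 0;
}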

Several questions about CUDA and cuPrintf

I can compile my code using cuPrintf successfully with nvcc, but I cannot compile it in the Visual Studio 2012 environment. It says that "volatile char *" cannot be converted to "const void *" in the "cudaMemcpyToSymbol" function.
cuPrintf doesn't seem to work; no cuPrintf output is produced from the kernel code.
How can I make nvcc export a pdb file?
Is there any other convenient way to debug inside a kernel function? I have only one laptop.
1st, cuPrintf is deprecated (as far as I know it was never officially released). You can print data from a kernel with the in-kernel printf function (compute capability 2.0 and later), but printing is not a recommended way of debugging your kernels.
2nd, you are compiling with the CUDA nvcc compiler; there is no such thing as a pdb file in CUDA. Do watch the '-g' and '-G' flags, though: they may dramatically increase your running time.
3rd, the best way to debug kernels is with NVIDIA Nsight (the Visual Studio edition, in your case).
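For completeness, here is a minimal sketch of the in-kernel printf alternative (requires compute capability 2.0 or later; the kernel and data are purely illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void debug_kernel(const float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // device-side printf is available on compute capability >= 2.0
    printf("thread %d sees %f\n", i, data[i]);
}

int main()
{
    float h[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    debug_kernel<<<1, 4>>>(d);
    cudaDeviceSynchronize();   // also flushes the device-side printf buffer
    cudaFree(d);
    return 0;
}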

Reading mxArray in CUSP or in cuSPARSE

I am trying to read an mxArray from MATLAB into my custom-made .cu file.
I have two sparse matrices to operate on.
How do I read them into CUSP sparse matrices, say A and B (or into cuSPARSE matrices), so that I can perform operations on them and return them back to MATLAB?
One idea I could come up with is to write the mxArrays to a .mtx file and then read from it. But again, are there any alternatives?
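If I did go the .mtx route, I imagine the CUSP side would look roughly like this (the file name and value types are just placeholders):

#include <cusp/csr_matrix.h>
#include <cusp/io/matrix_market.h>

int main()
{
    // read a Matrix Market (.mtx) file written out from MATLAB
    // into a CUSP sparse matrix in device memory
    cusp::csr_matrix<int, double, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "A.mtx");
    return 0;
}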
Further, I am trying to understand the various CUSP mechanisms using the examples posted on its website. But every time I try to compile and run the examples, I get the following error.
terminate called after throwing an instance of
'thrust::system::detail::bad_alloc'
what(): N6thrust6system6detail9bad_allocE: CUDA driver version is
insufficient for CUDA runtime version
Abort
Here is what is installed on the machine I am using.
CUDA v4.2
Thrust v1.6
Cusp v0.3
I am using GTX 480 with Linux x86_64 on my machine.
Strangely enough, the device-query code is also returning this output:
CUDA Device Query...
There are 0 CUDA devices.
Press any key to exit...
I updated my drivers and the SDK a few days ago.
Not sure what's wrong.
I know I am asking a lot in one question, but I have been facing this problem for quite a while, and upgrading and downgrading the drivers doesn't seem to solve it.
Cheers
This error is the most revealing: "CUDA driver version is insufficient for CUDA runtime version". You definitely need to update your driver.
I use CUSPARSE/CUSP through Jacket's Sparse Linear Algebra library. It's been good, but I wish there were more sparse features available in CUSPARSE/CUSP. I hear Jacket is going to get CULA Sparse into it soon, so that'll be nice.

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering: does CUDA use an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction; it's a means for you to express how you want your program to execute. The compiler generates PTX code, which is also not hardware-specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver, which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
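For example, a single nvcc invocation can embed both machine code for a known GPU and PTX that the driver will JIT-compile for newer ones (the architecture numbers and file names here are only illustrative):
nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 -o app app.cu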
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of byte code and converted to native code by the underlying driver.

CUDA with mingw - updated

We have been developing our code in Linux, but would like to compile a Windows executable. The old non-GPU version compiles just fine with mingw in Windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile kernel code with nvcc in Visual Studio, and the rest with gcc in mingw.
So far, we easily compiled the .cu file (with the kernel and kernel launches) in Visual Studio. However, we still can't compile the C code in mingw. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. However, gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the c portion of the code?
If you are really desperate, there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with mingw while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is host-machine independent). Afterwards, link both parts of the code back together with mingw, or use the driver API to load the fatbin dynamically. Disclaimer: did not test!
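For what it's worth, a rough sketch of that flow, with made-up file names: on the Linux side, compile the device code with
nvcc --fatbin -o kernels.fatbin kernels.cu
and then, from the mingw-built host code, load it at runtime through the driver API, e.g. cuModuleLoad(&module, "kernels.fatbin"), much like the driver API example further up.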
As far as I know, it is impossible to use CUDA on Windows without MSVC. So, you need MSVC to make nvcc work, but you can compile the CPU code with mingw and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the Cycles renderer handles this; look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick, but it might help you get started.