How to use cudamemcpy for vectors in C++ ? my code works fine for arrays but vectors it doesnt seem to support. Any idea how to support vectors in CUDA?
The short answer is that you can't just using the basic CUDA APIs.
If you want to use STL containers with CUDA, you should look at the thrust template library, which provides and STL like interface to the GPU and a number of useful GPU algorithms to operate on data within container types.
Related
So I'm trying to see if I can get some significant speedup from using a GPU to solve a small overdetermined system of equations by solving a bunch at the same time. My current algorithm involves using an LU decomposition function from the CULA Dense library that also has to switch back and forth between the GPU and the CPU to initialize and run the CULA functions. I would like to be able to call the CULA functions from my CUDA kernels so that I don't have to jump back to the CPU and copy the data back. This would also allow me to create multiple threads that are working on different data sets to be solving multiple systems concurrently. My question is can I call CULA functions from device functions? I know it's possible with CUBLAS and some of the other CUDA libraries.
Thanks!
The short answer is no. The CULA library routines are designed to be called from host code, not device code.
Note that CULA have their own support forums here which you may be interested in.
I'm using Thrust in my current project so I don't have to write a device_vector abstraction or (segmented) scan kernels myself.
So far I have done all of my work using thrust abstractions, but for simple kernels or kernels that don't translate easily to the for_each or transform abstractions I'd prefer at some point to write my own kernels instead.
So my question is: Can I through Thrust (or perhaps CUDA) ask which device is currently being used and what properties it has (max block size, max shared memory, all that stuff)?
If I can't get the current device, is there then some way for me to get thrust to calculate the kernel dimensions if I provide the kernel registers and shared memory requirements?
You can query the current device with CUDA. See the CUDA documentation on device management. Look for cudaGetDevice(), cudaSetDevice(), cudaGetDeviceProperties(), etc.
Thrust has no notion of device management currently. I'm not sure what you mean by "get thrust to calculate the kernel dimensions", but if you are looking to determine grid dimensions for launching your custom kernel, then you need to do that on your own. It can help to query the properties of the kernel with cudaFuncGetAttributes(), which is what Thrust uses.
I have an existing C++ code that extensively uses the bitset template. I am porting this code to CUDA C, and I am really new to CUDA programming. Can I use the bitset template as a "shared" variable?
As far as I know, you can't use the bitset container in CUDA at all, because there is no device implementation of it. It should be trivial to implement it in terms of a regular array though, and you can put the array in shared memory.
I had done my physics simulation project using C++ , OpenGL in Visual Studio 10. Later I had used OpenMP for CPU Parallelization. Now I want to accelerate my C++ code to CUDA so that I can achieve higher performance. Is it possible to convert my code into CUDA or any GPU devices?
Cuda and C++ are different programming languages (even if they look syntactically similar) with different programming paradigm.
You'll have to recode, and perhaps even redesign, your project to take advantage of Cuda (or of OpenCL).
Actually, you'll need to define what are the numerical kernels that might take advantage of your GPGPU and then recode these kernels (in Cuda, or in OpenCL); you'll also have to write some glue code to make all this work together.
You can determine which parts of your project can be parallelized and then reimplement these parts in Cuda. You can take a look at Fast N-Body Simulation with CUDA.
How do I use Thrust with multiple GPUs?
Is it simply a matter of using cudaSetDevice(deviceId)
and then running the relevant Thrust code?
With CUDA 4.0 or later, cudaSetDevice(deviceId) followed by your thrust code should work.
Just keep in mind that you will need to create and operate on separate vectors on each device (unless you have devices that support peer-to-peer memory access and PCI-express bandwidth is sufficient for your task).