Thrust device management and kernel - cuda

I'm using Thrust in my current project so I don't have to write a device_vector abstraction or (segmented) scan kernels myself.
So far I have done all of my work using thrust abstractions, but for simple kernels or kernels that don't translate easily to the for_each or transform abstractions I'd prefer at some point to write my own kernels instead.
So my question is: Can I through Thrust (or perhaps CUDA) ask which device is currently being used and what properties it has (max block size, max shared memory, all that stuff)?
If I can't get the current device, is there then some way for me to get thrust to calculate the kernel dimensions if I provide the kernel registers and shared memory requirements?

You can query the current device with CUDA. See the CUDA documentation on device management. Look for cudaGetDevice(), cudaSetDevice(), cudaGetDeviceProperties(), etc.
Thrust has no notion of device management currently. I'm not sure what you mean by "get thrust to calculate the kernel dimensions", but if you are looking to determine grid dimensions for launching your custom kernel, then you need to do that on your own. It can help to query the properties of the kernel with cudaFuncGetAttributes(), which is what Thrust uses.
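As a rough sketch of the calls mentioned above, the following queries the current device, prints a few of its limits, and then inspects a kernel with cudaFuncGetAttributes(). The kernel `scale`, the problem size `n`, and the cap of 256 threads per block are hypothetical choices made for illustration; they do not come from Thrust or from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical custom kernel; it only exists so there is something to query.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    // Which device is current for this host thread?
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;

    // What are its limits?
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;
    std::printf("device %d: %s\n", device, prop.name);
    std::printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    // What does the compiled kernel need?
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess) return 1;
    std::printf("kernel registers per thread:   %d\n", attr.numRegs);
    std::printf("kernel static shared memory:   %zu bytes\n", attr.sharedSizeBytes);
    std::printf("kernel max threads per block:  %d\n", attr.maxThreadsPerBlock);

    // Assumed policy: at most 256 threads per block, never more than the
    // kernel allows, and enough blocks to cover n elements.
    const int n = 1 << 20;
    int block = attr.maxThreadsPerBlock < 256 ? attr.maxThreadsPerBlock : 256;
    int grid = (n + block - 1) / block;
    std::printf("launch configuration for n=%d: %d blocks of %d threads\n",
                n, grid, block);
    return 0;
}
```

The block-size policy here is deliberately simple. Newer toolkits also offer occupancy helpers such as cudaOccupancyMaxPotentialBlockSize(), which may be a better starting point if your CUDA version has them.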

Related

CUDA equivalent of OpenCL CL_MEM_USE_HOST_PTR

I'd like to know if there is something similar to CL_MEM_USE_HOST_PTR but for CUDA. Reading the CUDA docs, it seems the only "zero-copy" functionality is implemented through the API function cudaHostAlloc. The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area. In OpenCL this is normal practice: you pass the specified flag to clCreateBuffer.
Maybe I am wrong, but it looks like CUDA doesn't implement such a thing at all.
The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area.
The API call that does that in CUDA is cudaHostRegister(), documented in the CUDA Runtime API reference.
It takes a pointer returned by an ordinary host allocator such as malloc() or new, and converts the memory region into pinned memory. (Which would be suitable for "zero-copy" usage, among other things.)
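A minimal sketch of that approach, assuming a device and driver that support mapped pinned memory: the buffer comes from malloc(), is registered with the cudaHostRegisterMapped flag, and a kernel then works on it through the pointer returned by cudaHostGetDevicePointer(). The kernel `increment` and the buffer size are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel that updates the host buffer in place (zero-copy).
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);

    // Memory the application already owns, from an ordinary allocator.
    int *host = static_cast<int *>(std::malloc(bytes));
    if (!host) return 1;
    for (int i = 0; i < n; ++i) host[i] = i;

    // Pin the existing region and map it into the device address space.
    if (cudaHostRegister(host, bytes, cudaHostRegisterMapped) != cudaSuccess) {
        std::free(host);
        return 1;
    }

    // Ask for the device-side alias of the same memory.
    int *dev = nullptr;
    if (cudaHostGetDevicePointer(reinterpret_cast<void **>(&dev), host, 0)
            != cudaSuccess) {
        cudaHostUnregister(host);
        std::free(host);
        return 1;
    }

    increment<<<(n + 255) / 256, 256>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    else
        std::printf("host[0] = %d, host[n-1] = %d\n", host[0], host[n - 1]);

    // Unpin before handing the memory back to the allocator.
    cudaHostUnregister(host);
    std::free(host);
    return err == cudaSuccess ? 0 : 1;
}
```

On older toolkits or devices, registration may require page-aligned memory or an earlier call to cudaSetDeviceFlags(cudaDeviceMapHost); treat both as possibilities to check against your CUDA version rather than as guarantees.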

CUDA LAPACK libraries (CULA & MAGMA) as device functions [duplicate]

So I'm trying to see if I can get some significant speedup from using a GPU to solve a small overdetermined system of equations by solving a bunch at the same time. My current algorithm involves using an LU decomposition function from the CULA Dense library that also has to switch back and forth between the GPU and the CPU to initialize and run the CULA functions. I would like to be able to call the CULA functions from my CUDA kernels so that I don't have to jump back to the CPU and copy the data back. This would also allow me to create multiple threads that are working on different data sets to be solving multiple systems concurrently. My question is can I call CULA functions from device functions? I know it's possible with CUBLAS and some of the other CUDA libraries.
Thanks!
The short answer is no. The CULA library routines are designed to be called from host code, not device code.
Note that CULA has its own support forums, which you may be interested in.

CUDA How to launch a new kernel call in one kernel function?

I am new to CUDA programming, and I have the following problem: I am trying to use CUDA to process a set of datasets in parallel, and each dataset needs some matrix calculations.
My design is like this:
1. Launch N threads, one per dataset, since the datasets are independent of each other and are all handled the same way.
2. In each thread from step 1, call a new function that itself works like a kernel, since it does matrix calculations, e.g. launch M threads to handle the matrix calculation in parallel.
Does anyone know whether it is possible or not?
You can launch a kernel from a thread in another kernel if you use CUDA dynamic parallelism and your GPU supports it. Dynamic parallelism requires a GPU of compute capability 3.5 or higher.
You can discover the compute capability of your device from the CUDA deviceQuery sample.
You can learn more about how to use CUDA dynamic parallelism from the corresponding section of the CUDA programming guide.
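To make the idea concrete, here is a small sketch of a parent kernel in which each thread launches its own child grid. The kernel names and the tiny launch sizes (2 parent threads, 4 child threads each) are hypothetical; in the design from the question they would be N datasets and M threads per matrix calculation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: stands in for the per-dataset matrix work.
__global__ void child(int dataset)
{
    printf("dataset %d handled by child thread %d\n", dataset, threadIdx.x);
}

// Parent kernel: one thread per dataset, each launching its own child grid.
__global__ void parent()
{
    int dataset = blockIdx.x * blockDim.x + threadIdx.x;
    child<<<1, 4>>>(dataset);
}

int main()
{
    parent<<<1, 2>>>();

    // The parent grid is not considered finished until its children finish,
    // so one host-side synchronization covers both levels.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Device-side launches need relocatable device code and the device runtime library, so a build line along the lines of `nvcc -rdc=true example.cu -lcudadevrt` is typically required, together with an `-arch` value for a target of compute capability 3.5 or higher that your toolkit supports.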

Parallelism in GPU - CUDA / OpenCL

I have a general question about parallelism in CUDA or OpenCL code on a GPU. I use an NVIDIA GTX 470.
I read the CUDA Programming Guide briefly but did not find related answers, hence I am asking here.
I have a top-level function which calls the CUDA kernel (I also have an OpenCL version of the same kernel). This top-level function is itself called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B),
and the actual codelet also processes all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for the different functions in the codelet and the top-level code, and execute them in parallel, exploiting task parallelism? If yes, who creates them, as there is no threading library explicitly included in the code or linked with it?
OR
Does it create threads/tasks for the different 'for loop' iterations, which are independent, and thus achieve data parallelism?
If it does this kind of parallelism, does it exploit it just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
I ask because I don't see any special compiler constructs/intrinsics (parallel for loops as in OpenMP) which tell the compiler/scheduler to schedule such for loops / functions in parallel.
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction, Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA compiler (nvcc) generates the GPU code, and the CUDA runtime tells the GPU driver, and through it the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads per block)). In your example, the top-level function (the host function) only issues the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc generates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm, but you compile your kernels at runtime (unless they are precompiled). You specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
This is highly data parallel: the kernel is written to process a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to set it up by hand: in CUDA the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, the kernels will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it). However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized. The parallelism comes from specifying a grid size, which causes the kernel to be invoked in parallel for each element in that grid (so any for loops inside your kernel will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (CUDA) or get_global_id() (OpenCL) to identify which data item to process in that particular thread.
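The two points above can be sketched together for the R, G, B case from the question. The two pixel loops are replaced by a 2D grid in which each thread handles one pixel, and each colour plane is submitted to its own stream. The kernel `invert`, the 640x480 image size, and the 16x16 block shape are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per pixel: the thread's grid position replaces the x/y loops.
__global__ void invert(unsigned char *plane, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        plane[idx] = static_cast<unsigned char>(255 - plane[idx]);
    }
}

int main()
{
    const int width = 640, height = 480, channels = 3;  // assumed image shape
    const size_t bytes = static_cast<size_t>(width) * height;

    unsigned char *planes[channels];
    cudaStream_t streams[channels];

    // Enough 16x16 blocks to cover the whole image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // One stream per colour plane, so the three kernels may overlap.
    for (int c = 0; c < channels; ++c) {
        cudaStreamCreate(&streams[c]);
        cudaMalloc(&planes[c], bytes);
        cudaMemsetAsync(planes[c], 0, bytes, streams[c]);
        invert<<<grid, block, 0, streams[c]>>>(planes[c], width, height);
    }

    // The host is free to do other work here; wait before using the results.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    for (int c = 0; c < channels; ++c) {
        cudaFree(planes[c]);
        cudaStreamDestroy(streams[c]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

Whether the three kernels actually overlap depends on the device and on how much of it each kernel occupies; as noted above, a plane large enough to fill the GPU leaves little room for concurrency.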
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.

Multiple GPUs with Cuda Thrust?

How do I use Thrust with multiple GPUs?
Is it simply a matter of using cudaSetDevice(deviceId)
and then running the relevant Thrust code?
With CUDA 4.0 or later, cudaSetDevice(deviceId) followed by your Thrust code should work.
Just keep in mind that you will need to create and operate on separate vectors on each device (unless you have devices that support peer-to-peer memory access and PCI-express bandwidth is sufficient for your task).
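A minimal sketch of that pattern, assuming each device gets its own independent vector: the loop selects a device, builds a thrust::device_vector while that device is current, and reduces it there. The vector size and fill values are arbitrary illustration choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        // Make this device current for the Thrust calls that follow.
        if (cudaSetDevice(dev) != cudaSuccess) return 1;

        // Allocated on the current device; arbitrary contents for the demo.
        thrust::device_vector<int> v(1000, dev + 1);

        // The reduction runs on the device that owns the vector.
        int sum = thrust::reduce(v.begin(), v.end(), 0);
        std::printf("device %d: sum = %d\n", dev, sum);

        // v goes out of scope here, while its device is still current.
    }
    return 0;
}
```

This version visits the devices one after another. To keep several GPUs busy at once you would typically drive each device from its own host thread, or issue asynchronous work per device, which this sketch does not attempt.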