I am new to CUDA programming. I am trying to use CUDA to process a set of datasets in parallel, and for each dataset some matrix calculations need to be done.
My design is like this:
1. Launch N threads, one per dataset, since the datasets are independent of each other and the method to handle them is the same.
2. In each thread from step 1, I want to call another function that itself works like a kernel, since the work is matrix calculation, e.g. launch M threads to handle the matrix calculation in parallel.
Does anyone know whether it is possible or not?
You can launch a kernel from a thread in another kernel if you use CUDA dynamic parallelism and your GPU supports it. Currently, the GPUs that support CUDA dynamic parallelism are those of compute capability 3.5 or higher.
You can discover the compute capability of your device from the CUDA deviceQuery sample.
You can learn more about how to use CUDA dynamic parallelism from the dynamic parallelism section of the CUDA programming guide.
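As a rough illustration, here is a minimal sketch of the pattern described in the question using dynamic parallelism; the matrix_multiply and handle_datasets kernel names and the 16x16 block size are assumptions, and the code needs to be compiled with nvcc -arch=sm_35 -rdc=true -lcudadevrt.

__global__ void matrix_multiply(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}

__global__ void handle_datasets(float **a, float **b, float **c, int n, int num_datasets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one parent thread per dataset
    if (i < num_datasets) {
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        // child kernel launch from device code -- this is what dynamic parallelism allows
        matrix_multiply<<<grid, block>>>(a[i], b[i], c[i], n);
    }
}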
I'm trying to compile my CUDA C code for a GPU with sm_10 architecture which does not support invoking malloc from __global__ functions.
I need to keep a tree for which the nodes are created dynamically in the GPU memory. Unfortunately, without malloc apparently I can't do that.
Is there a way to copy an entire tree using cudaMalloc? I think that such an approach would just copy the root of my tree.
Quoting the CUDA C Programming Guide:
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.
For compute capability earlier than 2.0, the only possibilities are:
Use cudaMalloc from host side to allocate as much global memory as you need in your __global__ function;
Use static allocation if you know the required memory size at compile time;
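As a hedged sketch of the first option: allocate a fixed-size pool of nodes with cudaMalloc on the host and let each thread fill its own slot. The Node layout, the MAX_NODES bound and the build_tree kernel below are assumptions, not part of the original question.

#define MAX_NODES 4096   // assumed upper bound, known at compile time

struct Node {
    int value;
    int left;    // index of left child in the pool, -1 if none
    int right;   // index of right child in the pool, -1 if none
};

__global__ void build_tree(Node *pool, int n_nodes)
{
    // each thread initializes its own slot; no malloc and no atomics, so this also works on sm_10
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_nodes) {
        pool[idx].value = idx;
        pool[idx].left  = -1;
        pool[idx].right = -1;
    }
}

// Host side: allocate the whole pool up front
// Node *d_pool;
// cudaMalloc((void**)&d_pool, MAX_NODES * sizeof(Node));
// build_tree<<<(MAX_NODES + 255) / 256, 256>>>(d_pool, MAX_NODES);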
I am about to create a GPU-enabled program using CUDA technology. It will be either C# with Emgu or C++ with the CUDA Toolkit (not yet decided).
I need to use all the GPU's power (I have a card with 16 GPU cores). How do I run 16 tasks in parallel?
First off: 16 GPU cores on the pre-600 series is equal to 16*8 = 128 cores; on the 600 series it is 16*32 = 512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: Emgu seems to be an OpenCV wrapper for .NET and is related to image processing; it generally has nothing to do with GPU programming. Some of its algorithms may be GPU accelerated, but I don't know anything about that. The alternative to CUDA here is OpenCL, not OpenCV. If you will be using CUDA technology as you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads per block, you wish to run. This is done when you call the CUDA kernel itself. You don't want to limit yourself to 128/512 threads either; experiment.
I don't know your level of knowledge of GPGPU programming, but remember that you cannot run tasks the way you do on the CPU. You cannot run 128 different tasks; all threads have to run the exact same instructions (except when branching, which should generally be avoided).
Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is 0.25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
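As a small illustration, both numbers can be read from the device properties at runtime; the snippet below is a sketch of host code assuming device 0.

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
// a quarter of the maximum number of resident threads, as a lower bound
int min_threads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor / 4;
printf("Aim for at least %d threads in flight\n", min_threads);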
Specifically in CUDA now, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
Now when you launch the kernel you specify the number of blocks and the number of threads per block
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
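For completeness, here is a minimal sketch of what the full kernel and a typical launch configuration could look like; the kernel body and the 256-thread block size are assumptions, not taken from the original answer.

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one array element per thread
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

// Host side: round the block count up so all N elements are covered
int n_threads_per_block = 256;
int n_blocks = (N + n_threads_per_block - 1) / n_threads_per_block;
square_array<<<n_blocks, n_threads_per_block>>>(a, N);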
Note: you need to get more familiar with the CUDA parallel programming model, as you are not approaching this in a manner that will use all your GPU power. Consider reading Programming Massively Parallel Processors: A Hands-on Approach.
I'm using Thrust in my current project so I don't have to write a device_vector abstraction or (segmented) scan kernels myself.
So far I have done all of my work using thrust abstractions, but for simple kernels or kernels that don't translate easily to the for_each or transform abstractions I'd prefer at some point to write my own kernels instead.
So my question is: Can I through Thrust (or perhaps CUDA) ask which device is currently being used and what properties it has (max block size, max shared memory, all that stuff)?
If I can't get the current device, is there then some way for me to get thrust to calculate the kernel dimensions if I provide the kernel registers and shared memory requirements?
You can query the current device with CUDA. See the CUDA documentation on device management. Look for cudaGetDevice(), cudaSetDevice(), cudaGetDeviceProperties(), etc.
Thrust has no notion of device management currently. I'm not sure what you mean by "get thrust to calculate the kernel dimensions", but if you are looking to determine grid dimensions for launching your custom kernel, then you need to do that on your own. It can help to query the properties of the kernel with cudaFuncGetAttributes(), which is what Thrust uses.
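As a rough sketch of those calls (my_kernel is a hypothetical custom kernel standing in for your own):

int dev;
cudaGetDevice(&dev);                       // which device is currently in use
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);       // e.g. prop.maxThreadsPerBlock, prop.sharedMemPerBlock

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, my_kernel);   // e.g. attr.numRegs, attr.sharedSizeBytes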
Is it possible to launch multiple kernels on multiple GPUs concurrently from a single thread in cuda 4.0?
To use multiple GPUs from a single thread, you can switch between CUDA contexts (each of which is bound to a GPU) and launch kernels asynchronously. In effect you will be running multiple kernels across multiple GPUs this way.
However, if you have cards with compute capability >= 2.0, you can also run kernels concurrently, as shown in the comments above. You can find the post about concurrent kernel execution over here.
Of course you can use both if you have multiple cards with compute capability >= 2.0.
Yes.
If there are 2 devices you can run kernel1<<<>>> on device0 and kernel2<<<>>> on device1. There is a call, cudaSetDevice(), with which you choose the device on which the kernel will be executed.
Google it; it is in the CUDA 4.0 library.
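As a minimal sketch of that pattern with the CUDA 4.0 runtime API (kernel1, kernel2 and the grid/block variables are placeholders):

cudaSetDevice(0);                       // subsequent calls target device 0
kernel1<<<grid1, block1>>>(d_data0);    // launches are asynchronous, so this returns immediately

cudaSetDevice(1);                       // switch to device 1
kernel2<<<grid2, block2>>>(d_data1);    // now both GPUs are working at the same time

cudaSetDevice(0);
cudaDeviceSynchronize();                // wait for device 0
cudaSetDevice(1);
cudaDeviceSynchronize();                // wait for device 1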
I have a general question about parallelism in CUDA or OpenCL code on the GPU. I use an NVIDIA GTX 470.
I read briefly through the CUDA programming guide but did not find related answers, hence I am asking here.
I have a top-level function which calls the CUDA kernel (for the same kernel I also have an OpenCL version). This top-level function is itself called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B), and the actual codelet also processes all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for different functionality/functions in the codelet and the top-level code, execute them in parallel, and thus exploit task parallelism? If yes, who creates them, given that no threading library is explicitly included in or linked with the code?
OR
Does it create threads/tasks for different 'for loop' iterations that are independent, thus achieving data parallelism?
If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?
I ask because I don't see any special compiler constructs/intrinsics (such as parallel for loops in OpenMP) which tell the compiler/scheduler to schedule such for loops/functions in parallel.
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library handles the details, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and furthermore the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) only executes the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
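As a hedged sketch of what such a kernel and its asynchronous launch could look like (the kernel body, the out parameter and the synchronization call are assumptions):

__global__ void helloworld(int *out)
{
    // every thread runs the same code; its index tells it which element is its own
    // (d_out is assumed to have BLOCKS * THREADS elements)
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    out[global_id] = global_id;
}

helloworld<<<BLOCKS, THREADS>>>(d_out);
cudaDeviceSynchronize();   // the launch returns immediately; wait here if the host needs the results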
OpenCL follows the same paradigm, but you compile your kernels (if they are not precompiled) at runtime. Specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel: the kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand: in CUDA the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out of order, but I don't know if any implementation supports it). However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
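As a minimal sketch of the CUDA streams variant (kernelA, kernelB and their launch parameters are placeholders for two independent kernels):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// each kernel goes into its own stream, so the hardware may overlap them
kernelA<<<gridA, blockA, 0, s0>>>(d_dataA);
kernelB<<<gridB, blockB, 0, s1>>>(d_dataB);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);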
If you have actual for loops in your kernels, they will not in themselves be parallelized. The parallelism comes from specifying a grid size, which causes the kernel to be invoked in parallel for each element in that grid (so any for loops inside your kernel are executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (CUDA) or get_global_id() (OpenCL) to identify which data item to process in that particular thread.
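As a sketch of that idea for the image case (the process_channel kernel and the doubling operation are placeholders for the actual per-pixel processing):

// the two 'for loops' over rows and columns disappear: each thread handles exactly one pixel
__global__ void process_channel(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x] * 2.0f;   // placeholder per-pixel operation
}

// host side: one thread per pixel
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
process_channel<<<grid, block>>>(d_in, d_out, width, height);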
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.