Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, the kernel contains a loop; in each iteration a certain predicate is evaluated on the current state, and depending on the result of that predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers per thread and the occupancy is very low.
I would like to split the kernel into many small kernels, but every block is totally independent from the others, and I (think I) can't use a single thread in the host code to launch multiple small kernels.
I am not sure whether streams are adequate for this kind of work (I have never used them), but since I also have the option of using dynamic parallelism, I would like to know whether that is a good way to implement this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data into global memory to make it available to a sub-kernel?
If I split my big kernel into many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which would let me move the temporary variables into each sub-kernel), will that help me increase occupancy?
I know it is a somewhat generic question, but I do not know this technology and I would like to know whether it fits my case, or whether streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;

    int tid = ...;
    systemState[tid] = initialData[tid];

    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;

        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;

        sample(sampleData, systemState);

        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }

    do_some_stats();
}
This is to show you that there is a main loop, that there are temporary data which are used in some places and not in others, and that there are shared data, synchronization points, etc.
Threads are used to compute vector data, while there is, ideally, one single loop in each block (well, of course it is not literally true, but logically it is)... one "big flow" per block.
Now, I am not sure how to use streams in this case... Where would the "big loop" be? On the host, I guess... But how do I coordinate all the blocks from a single loop? This is what leaves me most doubtful. May I use streams from different host threads (one thread per block)?
I am less doubtful about dynamic parallelism, because I could easily keep the big loop running, but I am not sure whether it would give me any advantage here.

I have benefited from dynamic parallelism when solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;

for (int m = 0; m < (2 * K + 1); m++) {
    PP1      = calculate_PP1(i, m);
    phi_cap1 = calculate_phi_cap1(i, m);
    for (int n = 0; n < (2 * K + 1); n++) {
        PP2      = calculate_PP2(i, n);
        phi_cap2 = calculate_phi_cap2(i, n);
        atomicAdd(&result[PP1][PP2], data[i] * phi_cap1 * phi_cap2);
    }
}
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them into a (2K+1)x(2K+1) child kernel.
From my (possibly incomplete) experience, dynamic parallelism helps if you have a small number of independent iterations. For a larger number of iterations you may end up calling the child kernel many times, so you should check whether the kernel launch overhead becomes the limiting factor.
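To make the idea concrete, here is a minimal sketch of how such a parent/child launch might look. It is only an illustration under assumptions: the calculate_* helpers are taken to be __device__ functions as in the fragment above, result is flattened to a pitched 1D array, and float is used so that the plain atomicAdd overload applies.

#define K 6   // as in the problem above

// Child kernel: one thread per (m, n) pair of the (2K+1) x (2K+1) space.
__global__ void childKernel(int i, float datum, float *result, int resultPitch)
{
    int m = threadIdx.x;   // 0 .. 2K
    int n = threadIdx.y;   // 0 .. 2K

    int   PP1      = calculate_PP1(i, m);       // assumed __device__ helpers
    int   PP2      = calculate_PP2(i, n);
    float phi_cap1 = calculate_phi_cap1(i, m);
    float phi_cap2 = calculate_phi_cap2(i, n);

    atomicAdd(&result[PP1 * resultPitch + PP2], datum * phi_cap1 * phi_cap2);
}

// Parent kernel: one thread per input sample; each thread launches one child grid.
// Dynamic parallelism needs compute capability >= 3.5 and nvcc -rdc=true.
__global__ void parentKernel(const float *data, float *result, int resultPitch, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= N) return;

    dim3 childBlock(2 * K + 1, 2 * K + 1);
    childKernel<<<1, childBlock>>>(i, data[i], result, resultPitch);
}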

Related

how to sync threads in this cuda example

I have the following rough code outline:
run a loop, millions of times
in that loop, compute values 'I's - see example of such functions below
After all 'I's have been computed, compute other values 'V's
repeat the loop
Each computation of an I or V could involve up to 20ish mathematical operations (e.g. I1 = A + B/C * D + 1/exp(V1) - E + F + V2, etc.).
There are roughly:
50 'I's
10 'V's
10 values in each I and V, i.e. they are vectors of length 10
At first I tried running a simple loop in C, with kernel calls for each time step, but this was really slow. It seems like I can get the code to run faster if the main loop is in a kernel that calls other kernels. However, I'm worried about kernel call overhead (maybe I shouldn't be), so I came up with something like the following, where each I and V loops independently, with syncing between the kernels as necessary.
For reference, the variables below are hardcoded as __device__ values, but eventually I will pass some values into specific kernels to make the system interesting.
__global__ void compute_IL1()
{
    int id = threadIdx.x;
    //n_t = 1e6;
    for (int i = 0; i < n_t; i++) {
        IL1[id] = gl_1 * (V1[id] - El_1);
        //atomic, sync, event????,
    }
}

__global__ void compute_IK1()
{
    int id = threadIdx.x;
    for (int i = 0; i < n_t; i++) {
        Ik1[id] = gk_1 * powf(0.75 * (1 - H1[id]), 4) * (V1[id] - Ek_1);
        //atomic, sync, event?
    }
}

__global__ void compute_V1()
{
    int id = threadIdx.x;
    for (int i = 0; i < n_t; i++) {
        //wait for IL1 and Ik1 and others, but how????
        V1[id] = Ik1[id] + IL1[id] + ....
        //trigger the I's again
    }
}

//main function
compute_IL1<<<1,10,0,s0>>>();
compute_IK1<<<1,10,0,s1>>>();
//repeat this for many 50 - 70 more kernels (Is and Vs)
So the question is, how would I sync these kernels? Is an event approach best? Is there a better paradigm to use here?
There is no sane mechanism I can think of to have multiple resident kernels synchronize without resorting to hacky atomic tricks which may well not work reliably.
If you are running blocks with 10 threads, and these kernels cannot execute concurrently for correctness reasons, you are (in the best possible case) using 1/64 of the computational capacity of your device. This problem, as you have described it, sounds completely ill-suited to a GPU.
So, I tried a couple of approaches.
A loop with a few kernel calls, where the last kernel call depends on the previous ones. This can be done with cudaStreamWaitEvent, which can be called once per event to make a stream wait on multiple events. I found this at: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/ . Unfortunately, the kernel calls were too expensive.
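For reference, a minimal sketch of that event-based dependency pattern; the stream and event names are placeholders, and the kernels are assumed to do a single time step per launch (no internal loop):

cudaStream_t s0, s1, s2;
cudaEvent_t  e0, e1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
// Events without timing are cheaper to record and wait on.
cudaEventCreateWithFlags(&e0, cudaEventDisableTiming);
cudaEventCreateWithFlags(&e1, cudaEventDisableTiming);

for (int t = 0; t < n_t; t++) {
    compute_IL1<<<1, 10, 0, s0>>>();
    cudaEventRecord(e0, s0);

    compute_IK1<<<1, 10, 0, s1>>>();
    cudaEventRecord(e1, s1);

    // compute_V1 must wait for both currents: one wait call per event.
    cudaStreamWaitEvent(s2, e0, 0);
    cudaStreamWaitEvent(s2, e1, 0);
    compute_V1<<<1, 10, 0, s2>>>();
}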
Global variables between concurrent streams. The logic was pretty simple, having one thread pause until a global variable equaled the loop variable, indicating that all threads could proceed. This was then followed by a sync-threads call. Unfortunately, this did not work well.
Ultimately, I think I've settled on a nested loop, where the outer loop represents time and the inner loop indicates which of a set of instructions to run, based on dependencies. I also launched the maximum number of threads per block (1024) and broke up the vectors that needed to be processed into warps. The rough pseudocode is:
run_main<<<1,1024>>>();

__global__ void run_main() {
    int warp = threadIdx.x / 32;
    int id = threadIdx.x - warp * 32;
    if (id < 10) {
        for (int i = 0; i < n_t; i++) {
            for (int j = 0; j < n_j; j++) {
                switch (j) {
                case 0:
                    switch (warp) {
                    case 0:
                        I1[id] = a + b + c*d ...
                        break;
                    case 1:
                        I2[id] = f*g/h
                        break;
                    }
                    break;
                //These things depend on case 0 OR
                //we've run out of space in the first pass
                //32 cases max [0 ... 31]
                case 1:
                    switch (warp) {
                    case 0:
                        V1[id] = I1*I2 + ...
                        break;
                    case 1:
                        V2[id] = ...
                        break;
                    }
                    break;
                }
                //syncs across the block
                __syncthreads();
            }
        }
    }
}
This design is based on my impression that each set of 32 threads runs independently but should run the same code, otherwise things can slow down significantly.
So at the end, I'm running roughly 32*10 instructions simultaneously, where 32 is the number of warps (it depends on how many different values I can compute at the same time, due to dependencies) and 10 is the number of elements in each vector. This is slowed down by any imbalance in the number of computations in each warp case, since all warps need to merge before moving on to the next step (due to the __syncthreads call). I'm running different parameters (a parameter sweep) on top of this, so I could potentially run 3 at a time per block, multiplied by the number of streaming multiprocessors on the card.
One thing I need to change is that I'm currently testing on a video card that is attached to a monitor as well. Apparently Windows will kill the kernel if it lasts for more than 5 seconds, so I need to call the kernel in chunked time steps, like once every 1e5 time steps (in my case).
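A minimal host-side sketch of that chunking; the t0/chunk kernel parameters are an assumption, since the pseudocode above would need to accept them and loop only over its own chunk of time steps:

const long long n_t_total = 1000000;   // total time steps
const long long chunk     = 100000;    // steps per launch, to stay under the watchdog

for (long long t0 = 0; t0 < n_t_total; t0 += chunk) {
    run_main<<<1, 1024>>>(t0, chunk);   // hypothetical (start, length) arguments
    cudaDeviceSynchronize();            // let this chunk finish before the next launch
}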

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in such a simple example (the dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads of the first block run first, then the threads of the second block, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
Do all threads have their own copy of the "temp" variable or not (and if not, why is there no race condition)?
How does it work? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation proceed? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provide is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make threads work only on one or a few elements.
For instance, if you have to treat 100 integers with a specific algorithm, you probably want 100 threads to treat 1 element each.
In CUDA, the number of blocks and threads is defined at kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values, which define the sizes along three dimensions.
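For example, a minimal launch might look like this (the sizes are arbitrary, just to illustrate the syntax):

dim3 grid(16, 1, 1);      // 16 blocks along x
dim3 threads(128, 1, 1);  // 128 threads per block along x
myKernel<<<grid, threads>>>(/* kernel arguments */);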
There is no specific order in the execution of threads and blocks. As you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is an ordinary local variable declared inside the kernel, it is not shared: each thread has its own copy, typically stored in a register.
This is equivalent to what happens on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using accesses to the device arrays.
Again, this is equivalent to what is done on the CPU side.
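As for how the summation proceeds: the snippet above stops right after filling cache. In the full example from the book, the kernel typically continues with a shared-memory reduction roughly along these lines (a sketch using the names from the snippet; the book's exact code may differ slightly, and threadsPerBlock is assumed to be a power of two):

    // make every thread's partial sum visible in cache
    __syncthreads();

    // tree reduction in shared memory
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // one partial sum per block; the entries of c[] are then added on the host
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}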
I think you should check whether you are comfortable enough with C/C++ programming on the CPU side before going further into GPU programming. No offense meant, but it seems you are missing some fundamentals.
Since CUDA lets you drive your GPU with C code, the difficulty is not in the syntax, but in the specifics of the hardware.

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on a GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But when l_SrcIntegral is defined as a local array (as it appears in the commented line) instead of being allocated with malloc/free, it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
if (l_ThreadIndex >= d_MaxParallelRows)
return;
int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
if (l_Row <= d_MaxTreatedRow) {
//float l_SrcIntegral[1700];
float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
for (int x=185; x<1407; x++) {
for (int i=0; i<1700; i++)
l_SrcIntegral[i] = i;
}
free(l_SrcIntegral);
}
}
int main()
{
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
int l_ThreadsPerBlock = k_ThreadsPerBlock;
int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
int l_FirstRow = d_MinTreatedRow;
while (l_FirstRow <= d_MaxTreatedRow) {
printf("CUDA: FirstRow=%d\n", l_FirstRow);
fflush(stdout);
myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
cudaDeviceSynchronize();
l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
}
printf("CUDA: Done\n");
return 0;
}
1.
As @aland said, you may even encounter worse performance calculating just one row in each kernel call.
You have to think about processing the whole input in order to actually use the power of the massively parallel processor.
Why start multiple kernels with just 320 threads, just to calculate one row?
How about using as many blocks as you have rows, and letting the threads of each block process one row?
(320 threads per block is not a good choice; check out how to reach better occupancy.)
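A minimal sketch of that launch shape; the kernel and variable names here are hypothetical, just to illustrate the block-per-row mapping:

// Hypothetical kernel: one block per row, the threads of the block walk the columns.
__global__ void processRow(const float *src, float *dst, int numCols)
{
    int row = blockIdx.x;                       // one block per row
    for (int col = threadIdx.x; col < numCols; col += blockDim.x) {
        dst[row * numCols + col] = 2.0f * src[row * numCols + col];  // placeholder work
    }
}

// Host side: as many blocks as there are rows.
// processRow<<<numRows, 256>>>(d_src, d_dst, numCols);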
2.
If your fast resources, such as registers and shared memory, are not enough, you have to use a tiled approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in each column of an arbitrarily sized matrix.
Each block processes one column. In a tile loop, the threads of that block store their elements in a shared memory array; when a tile is done, they calculate its sum using a parallel reduction and then start the next iteration.
At the end, each block has calculated the sum of its column.
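A sketch of that column-sum idea, as a simplified illustration (not the code from the linked answer); TILE is both the block size and the tile height, assumed to be a power of two:

#define TILE 256

// Each block sums one column of a numRows x numCols matrix stored row-major.
__global__ void columnSum(const float *mat, float *colSums, int numRows, int numCols)
{
    __shared__ float tile[TILE];
    int col = blockIdx.x;              // one block per column
    float partial = 0.0f;

    // Tile loop over the rows of this column, with a range check.
    for (int base = 0; base < numRows; base += TILE) {
        int row = base + threadIdx.x;
        tile[threadIdx.x] = (row < numRows) ? mat[row * numCols + col] : 0.0f;
        __syncthreads();

        // Parallel reduction of this tile in shared memory.
        for (int s = TILE / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial += tile[0];
        __syncthreads();               // tile can now be safely overwritten
    }

    if (threadIdx.x == 0)
        colSums[col] = partial;
}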
You can still use dynamic array sizes with shared memory. Just pass a third argument in the <<<...>>> of the kernel call; that is the size of the dynamic shared memory per block, in bytes.
Once you're there, just bring all the relevant data into your shared array, one or several elements per thread (keeping the accesses coalesced where it matters). Sync the threads after the data has been brought in (only if you need to prevent race conditions, i.e. to make sure the whole array is in shared memory before any computation is done) and you're good to go.
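A minimal sketch of that dynamic-shared-memory pattern; the kernel and sizes here are illustrative assumptions, not code from the question:

// extern __shared__ gives the kernel a shared array sized at launch time.
__global__ void rowKernel(const float *src, int rowLen)
{
    extern __shared__ float s_row[];   // sized by the 3rd launch parameter

    // Each thread loads one or more elements of its block's row (coalesced).
    for (int i = threadIdx.x; i < rowLen; i += blockDim.x)
        s_row[i] = src[blockIdx.x * rowLen + i];
    __syncthreads();                   // the whole row is now in shared memory

    // ... compute on s_row ...
}

// Host side: the third <<<>>> argument is the dynamic shared memory size in bytes.
// rowKernel<<<numRows, 256, rowLen * sizeof(float)>>>(d_src, rowLen);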
Also: you should tessellate the work using blocks and threads, not loops. I understand it's just an example using a local array, but it could still be done by tessellating through blocks/threads rather than nested for loops (which are VERY bad for performance!). I hope you're running your sample code with just 1 block and 1 thread, otherwise it wouldn't make much sense.

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere in CUDA. Now, I need a big set of threads to sequentially perform a bunch of operations on that array. Every thread will have to read the SAME element of that array, perform tasks, store the results in shared memory and then read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading it many times would be quite wasteful... Any ideas?
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
    __shared__ double window[block_size];

    // cooperate with my block to load block_size elements
    window[threadIdx.x] = ptr[threadIdx.x];

    // wait until the window is full
    __syncthreads();

    // operate on the data
    ...
}
You can iteratively "slide" the window across the array, block_size (or perhaps some integer multiple of it) elements at a time, to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
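A sketch of that sliding window, under the assumption that num_elements is a multiple of block_size (otherwise a range check is needed); the inner loop shows the broadcast, with every thread reading the same shared-memory element at the same time:

const unsigned int block_size = 256;

__global__ void kernel(const double *ptr, unsigned int num_elements)
{
    __shared__ double window[block_size];

    // slide the window over the array one block_size chunk at a time
    for (unsigned int base = 0; base < num_elements; base += block_size) {
        window[threadIdx.x] = ptr[base + threadIdx.x];  // one coalesced load
        __syncthreads();                                // the window is full

        for (unsigned int j = 0; j < block_size; j++) {
            double x = window[j];   // broadcast: all threads read the same element
            // ... operate on x ...
        }
        __syncthreads();            // finish with this window before overwriting it
    }
}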

CUDA - what is this loop doing

Hey
I've seen this example kernel on a website:
__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;

    for(i=i0;i<N;i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
To compute this function in C
for(i=0;i<N;i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary? You could just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether.
I'm just curious why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so gridDim.x = 64 and blockDim.x = 256.
You need the for loop in the kernel if your vector has more entries than you have launched threads. If possible, it is of course more efficient to launch enough threads.
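For instance, a minimal host-side sketch of launching enough threads so that the loop body runs at most once per thread (d_x, d_y, N and alpha are assumed to be set up already):

int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up to cover N
// With blockDim.x * gridDim.x >= N, each thread's for loop executes at most once.
loop1<<<blocks, threadsPerBlock>>>(N, alpha, d_x, d_y);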
Interesting kernel. The loop inside the kernel is necessary because N is greater than the total number of threads, which is 16,384 (blockDim.x*gridDim.x), but I think it's not good practice to do it (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide you can have at most 65,535 thread blocks in one dimension of the grid. Furthermore, starting from Compute Capability 2.x (Fermi) you can have at most 1024 threads per block (512 before Fermi). Also, you can (if possible) separate the code into multiple (sequential) kernels.
Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.
btw a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (64B since those are floats), all of the memory transactions by this code will be coalesced, provided the threads/block is also divisible by 64.