Dynamic parallelism - launching many small kernels is very slow - cuda

I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common to its whole block. What I want to do is to first launch a coarse (or low-resolution) kernel, in which threads compute the common value just once (as if every thread represented one block). Then each thread creates a small grid of 1 block (16x16 threads) and launches a child kernel for it, passing the common value. In theory it should be faster, because one saves many redundant operations. But in practice the solution runs very slowly, and I don't know why.
This is the code, very simplified, just the idea.
__global__ void coarse_kernel( parameters ){
    // each thread computes the value that is common to one block of the fine grid
    int common_val = compute_common_val();

    // launch one 16x16 child block per parent thread
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);
}

__global__ void child_kernel( int common_val, parameters ){
    // use the common value
    do_computations(common_val, parameters);
}
The number of child kernels is large: one per parent thread, and there are around 400x400 parent threads. From what I understand, the GPU should process all these kernels in parallel, right?
Or are child kernels processed somehow sequentially?
My results show that performance is more than 10 times slower than in the original solution I had.

There is a cost in launching kernels, either parent or child. If your child kernels do not extract much parallelism and there is not much benefit over their non-parallel counterparts, then the faint benefit may be cancelled out by the child kernel launch overheads.
In formulas, let t_o be the overhead to execute a child kernel, t_e its execution time, and t_s the time to execute the same code without the help of dynamic parallelism. The speedup arising from the use of dynamic parallelism is t_s / (t_o + t_e). Perhaps (though this cannot be deduced from your code) t_e < t_s but t_e, t_s << t_o, so that t_s / (t_o + t_e) is roughly t_s / t_o < 1, and you observe a slowdown instead of a speedup.
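To make the formula concrete with purely illustrative numbers (not measured on your code): if the per-launch overhead is t_o ≈ 10 µs while t_e ≈ 1 µs and t_s ≈ 4 µs, the "speedup" is 4 / (10 + 1) ≈ 0.36, i.e. nearly a 3x slowdown, even though the child kernel itself runs faster than the non-parallel code. And with 400x400 parent threads you pay that launch overhead roughly 160,000 times, which under these assumed numbers adds up to more than a second of pure launch cost.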

Related

CUDA: 2 threads from different warps but same block attempt to write into same SHARED memory position: dangerous?

Will this lead to inconsistencies in shared memory?
My kernel code looks like this (pseudocode):
__shared__ uint histogram[32][64];
uint threadLane = threadIdx.x % 32;
for (data){
histogram[threadLane][data]++;
}
Will this lead to collisions, given that, in a block with 64 threads, threads with id "x" and "(x + 32)" will very often write into the same position in the matrix?
This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.
It is dangerous. It leads to race conditions.
If you cannot guarantee that each thread within a block has exclusive write access to a location in shared memory, then you have a race that you need to resolve with synchronization or atomic operations.
Take a look at this paper for a correct and efficient way of using SM for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf
Note that there are plenty of libraries available that let you compute histograms in one line, Thrust for instance.
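If you want to keep the per-lane [32][64] shared layout, here is a minimal race-free sketch (hypothetical kernel and parameter names; it assumes 64 bins and folds the per-lane rows into a global histogram at the end). The key change is doing the increment with atomicAdd on shared memory, so the colliding lanes (thread x and thread x + 32) serialize instead of losing updates:

__global__ void histogram_kernel(const unsigned int *data, int n, unsigned int *global_hist)
{
    __shared__ unsigned int histogram[32][64];

    // zero the shared histogram cooperatively
    for (int i = threadIdx.x; i < 32 * 64; i += blockDim.x)
        histogram[i / 64][i % 64] = 0;
    __syncthreads();

    unsigned int threadLane = threadIdx.x % 32;

    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        unsigned int bin = data[i] & 63;            // assume bin index in 0..63
        atomicAdd(&histogram[threadLane][bin], 1u); // race-free shared-memory increment
    }
    __syncthreads();

    // fold the 32 per-lane rows into the global histogram
    for (int bin = threadIdx.x; bin < 64; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int lane = 0; lane < 32; ++lane)
            sum += histogram[lane][bin];
        atomicAdd(&global_hist[bin], sum);
    }
}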

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000

__device__ volatile int test2 = 0;   // global counter, incremented by every thread
__device__ int test3 = 32767;        // smallest observed increment count (the estimate)

__global__ void kernel(){
    for (int i = 0; i < LOOPS; i++){
        unsigned long long time = clock64();
        // (optional) spread the warps apart in time:
        // while (clock64() < (time + (threadIdx.x * 1000)));
        int start = test2;
        atomicAdd((int *)&test2, 1);
        int end = test2;
        int diff = end - start;      // increments that landed between this thread's two reads
        atomicMin(&test3, diff);     // keep the smallest such count across threads and iterations
    }
}

int main() {
    kernel<<<1, 1024>>>();
    int result;
    cudaMemcpyFromSymbol(&result, test3, sizeof(int));
    printf("result = %d threads\n", result);
    return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx, and write the values to memory. Group the data by the three variables to find the warp size. This is even easier if you limit the launch to a single block; then you only need %warpid (a minimal sketch of this single-block variant follows after this list).
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue the first instruction of a warp on the same cycle
Consequently, on CC 1.0 - 3.5 devices (this may change in the future), all threads in the block that read the same %clock value belong to the same warp.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
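As a minimal sketch of the single-block variant of SOLUTION 1 (hypothetical names; it reads the %warpid special register through inline PTX and the host then counts how many threads share thread 0's warp id):

#include <stdio.h>

__global__ void warp_probe(int *warp_id)
{
    unsigned int wid;
    // %warpid is a PTX special register: the hardware warp slot of this thread
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(wid));
    warp_id[threadIdx.x] = (int)wid;
}

int main()
{
    const int nthreads = 1024;           // max threads per block on most devices
    int *d_ids, h_ids[nthreads];
    cudaMalloc((void **)&d_ids, nthreads * sizeof(int));

    warp_probe<<<1, nthreads>>>(d_ids);  // single block, so %warpid alone is enough
    cudaMemcpy(h_ids, d_ids, nthreads * sizeof(int), cudaMemcpyDeviceToHost);

    // count how many threads landed in the same warp as thread 0
    int warp_size = 0;
    for (int i = 0; i < nthreads; i++)
        if (h_ids[i] == h_ids[0]) warp_size++;

    printf("threads per warp = %d\n", warp_size);
    cudaFree(d_ids);
    return 0;
}

Grouping by %smid and blockIdx as well, as described above, extends this to multi-block launches.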

Synchronization in CUDA

I read the CUDA reference manual about synchronization in CUDA, but I don't understand it clearly. For example, why do we use cudaDeviceSynchronize() or __syncthreads()? If I don't use them, will the program not work correctly? What is the difference between cudaMemcpy and cudaMemcpyAsync in practice? Can you show an example that demonstrates the difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
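As a minimal sketch of that debugging pattern (hypothetical kernel name):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() { /* ... some device work ... */ }

int main()
{
    my_kernel<<<1, 256>>>();                   // returns immediately; the kernel runs asynchronously

    cudaError_t err = cudaDeviceSynchronize(); // CPU blocks here until all pending GPU work is done
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}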
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load its appropriate shared memory location(s) before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
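Since the question asked for an example, here is a minimal sketch (hypothetical buffer names) contrasting the two; note that for cudaMemcpyAsync to actually overlap with other work, the host buffer should be pinned (cudaHostAlloc):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *h_buf, *d_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault); // pinned host memory
    cudaMalloc((void **)&d_buf, bytes);

    // Blocking copy: the CPU thread waits here until the transfer has completed.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    // Asynchronous copy: returns immediately; the transfer proceeds in 'stream'
    // while the CPU (or other streams) can do other work.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... unrelated CPU work could run here, overlapped with the copy ...

    cudaStreamSynchronize(stream); // wait only when the data is actually needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}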
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have conditional paths in your code and then want to run an operation that depends on several array elements.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reach this point at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer based on the if condition. When considering the last line of the example, you need to make sure that all the threads had exactly finished the if-condition and updated myarray correctly. If this wasn't the case, y may use some updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

Failing to read an array from the global mem CUDA

I am implementing a complicated algorithm on CUDA, but there is a really odd problem. It can be summarised as follows: the kernel repeats a series of calculations many times, and the calculation in the current iteration depends on the result of the previous one. I am using an array in global memory to pass information between blocks in each iteration. For example, with 2 blocks, in each iteration block 0 saves its result to global memory and block 1 then reads it from global memory. However, the problem is that block 1 can't read the array from global memory correctly; it sometimes returns the result of the 1st iteration, not the previous one.
a_e and e_a are two arrays on the global mem, the size is [2*8].
d_a_e and d_e_a are on the shared mem, the size is [blockDim.x+1][8].
if(threadIdx.x<8)
{
    //block 0 writes, block 1 reads, this can't work properly
    a_e[blockIdx.x*8+threadIdx.x]=d_a_e[blockDim.x][threadIdx.x];
    if(blockIdx.x>0)
        d_a_e[0][threadIdx.x]=a_e[(blockIdx.x-1)*8+threadIdx.x];

    //block 1 writes, block 0 reads, this can work properly
    e_a[blockIdx.x*8+threadIdx.x]=d_e_a[0][threadIdx.x];
    if(blockIdx.x < gridDim.x-1)
        d_e_a[blockDim.x][threadIdx.x]=e_a[(blockIdx.x+1)*8+threadIdx.x];
}
This setup won't work; you're effectively trying to serialize your blocks, which as talonmies alluded to in his comment, doesn't work. From the CUDA programming guide:
"Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores..."
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
Your best recourse is probably to launch separate kernels (such that you perform the block 0 computation in the 1st kernel, block 1 in the 2nd kernel, etc.), so that the results from the 1st kernel are guaranteed to be complete before they are read in the next kernel. There has been some work done on inter-block synchronization, but you wouldn't derive much benefit from it, as you need to serialize your blocks anyway.
EDIT: I should also point out that the block scheduling isn't documented, and is liable to change at any point, so any inter-block synchronization will be non-portable and liable to break on a driver or CUDA toolkit update.
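As a minimal sketch of the separate-kernel approach (hypothetical names; the real per-iteration work is elided), the launch ordering within a single stream is what guarantees that the previous iteration's global-memory writes are visible to the next one:

__global__ void iteration_step(float *a_e, float *e_a)
{
    // one iteration's work: each block may freely read what *other* blocks
    // wrote to a_e / e_a during the PREVIOUS kernel launch
    // ...
}

void run_iterations(float *d_a_e, float *d_e_a, int num_iterations, int num_blocks, int threads)
{
    for (int i = 0; i < num_iterations; ++i) {
        // kernels issued to the same (default) stream execute in order,
        // so launch i+1 sees all global-memory writes made by launch i
        iteration_step<<<num_blocks, threads>>>(d_a_e, d_e_a);
    }
    cudaDeviceSynchronize();
}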

CUDA finding the max value in given array

I tried to develop a small CUDA program to find the max value in a given array,
int input_data[0...50] = 1,2,3,4,5....,50
max_value is initialized with the first value, input_data[0].
The final answer is stored in result[0].
The kernel gives 0 as the max value. I don't know what the problem is.
I launched it with 1 block of 50 threads.
__device__ int lock=0;

__global__ void max(float *input_data,float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;

    if( input_data[tid] > max_value)
    {
        do{} while(atomicCAS(&lock,0,1));
        max_value=input_data[tid];
        __threadfence();
        lock=0;
    }
    __syncthreads();
    result[0]=max_value; //Final result of max value
}
Even though there are in-built functions, just I am practicing small problems.
You are trying to set up a "critical section", but this approach on CUDA can lead to hang of your whole program - try to avoid it whenever possible.
Why your code hangs?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute synchronously. So, the warp will stop in your do{} while(atomicCAS(&lock,0,1)) until all threads from your warp succeed with obtaining the lock. But obviously, you want to prevent several threads from executing the critical section at the same time. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here (a minimal max-reduction sketch follows the links below):
Parallel prefix sum @ Wikipedia
Parallel Reduction @ CUDA website
NVIDIA's Guide to Reduction
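For the 50-element case in the question, a single-block sketch is enough. This is a minimal version (hypothetical names; it assumes the block has at most 256 threads, blockDim.x is a power of two, and n <= blockDim.x):

__global__ void max_reduce(const float *input, float *result, int n)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;

    // load one element per thread; pad extra threads with input[0], which is neutral for max
    sdata[tid] = (tid < n) ? input[tid] : input[0];
    __syncthreads();

    // tree reduction: halve the number of active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = sdata[0];
}

// e.g. max_reduce<<<1, 256>>>(d_input, d_result, 50);

For larger inputs you would write one partial maximum per block and then reduce those partial results with a second launch, as described further below.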
Your code has a potential race. I'm not sure whether you defined the 'max_value' variable in shared memory or not, but both cases are wrong.
1) If 'max_value' is just a local variable, then each thread holds a local copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of code, all threads write their own max_value to result[0], which results in undefined behavior.
2) If 'max_value' is a shared variable, 49 threads will enter the if-statement block, and they will try to update 'max_value' one at a time using the lock. But the order of execution among those 49 threads is not defined, so some threads may overwrite the actual maximum value with smaller values. You would need to compare against the maximum value again within the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
There are different scopes for variables. When you declare a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid. __shared__ places the variable in the block's shared memory, and it is accessible only by the threads of that block. Finally, if you use no qualifier at all, as in float max_value, the variable is placed in thread registers (or local memory) and can be accessed only by that thread. In your code, each thread has its own local max_value and cannot see the copies held by other threads.
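A minimal illustration of the three scopes (hypothetical names, not the OP's code):

__device__ float g_max;                   // __device__: global memory, visible to every thread in the grid

__global__ void scope_demo(const float *input)
{
    __shared__ float block_max;           // __shared__: one copy per block, visible to that block's threads
    float local_max = input[threadIdx.x]; // no qualifier: a register, private to this thread

    if (threadIdx.x == 0)
        block_max = local_max;            // e.g. thread 0 publishes its value to the rest of the block
    // ...
}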