Why can't the first CUDA kernel overlap with a previous memcpy?

Here is a demo. The kernel does not overlap with the preceding cudaMemcpyAsync, even though they are in different streams.
#include <cmath>
#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>
__global__ void warmUp(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        printf("warm up!");
    }
}
__global__ void kernel(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        long long x = 0;
        // busy loop so the kernel runs long enough to show up in the profiler
        for(int i=0; i<1000000; i++){
            x += i>>1;
        }
        printf("kernel!%lld\n", x); // x is a long long, so use %lld
    }
}
int main(){
    //warmUp<<<1,32>>>();
    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int)); // pinned host memory, needed for a truly asynchronous copy
    cudaMalloc(&data_dev, dataSize*sizeof(int));
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
}
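The Visual Profiler timeline shows no overlap (screenshot not preserved).
After some attempts, I found out that this is due to it being the first kernel call.
If I uncomment warmUp<<<1,32>>>();, the Visual Profiler shows that the copy and the kernel overlap (screenshot not preserved).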
Why?

CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.
The operation will/should work "correctly", but performance measurements may not be as you expect.
Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.
If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.
Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.
Additionally, kernel launches are asynchronous. So this particular coding pattern:
int main(){
...
kernel<<<1, 32, 0, stream2>>>();
}
is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.
I also don't recommend doing performance analysis on kernels that issue printf calls. The printf call imposes additional host/device synchronization requirements, and this can be confusing; it's not easy to predict the performance impact of that.

Related

How to reuse code for CPU fallback in CUDA

I have some calculations that I want to parallelize if my user has a CUDA-compliant GPU, otherwise I want to execute the same code on the CPU. I don't want to have two versions of the algorithm code, one for CPU and one for GPU to maintain. I'm considering the following approach but am wondering if the extra level of indirection will hurt performance or if there is a better practice.
For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it into its own function marked with both device and host directives...
__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
*c = *a + *b;
}
... then modified the kernel to call the aforementioned function on the element identified by threadIdx...
__global__ void addKernel(int* c, const int* a, const int* b)
{
const unsigned i = threadIdx.x;
addSingleItem(c + i, a + i, b + i);
}
So now my application can check for the presence of a CUDA device. If one is found I can use...
addKernel <<<1, size>>> (dev_c, dev_a, dev_b);
... and if not I can forego parallelization and iterate through the elements calling the host version of the function...
int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;
for (int i = 0; i < arraySize; i++)
{
addSingleItem(pC++, pA++, pB++);
}
Everything seems to work in my small test app but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hits? Is there a more generally accepted way to do CPU fallback that I should adopt?
If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.
That is undoubtedly the best approach if it can be managed, for the reason described above.
If it's desired to still have some file-level modularity, it is possible to break code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different than what is described already.
Another possible approach is to use compiler macros to assist in the addition or removal or modification of code to handle the GPU case vs. non-GPU case. There are endless possibilities here, but see here for a simple idea. You can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU vs. non-GPU case, but you may find a clever way to handle it in the same executable.
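As a rough illustration of the macro idea, here is a minimal sketch. HOST_DEVICE and addOnHost are names made up for this example (they are not part of CUDA); the only thing relied on is that nvcc defines __CUDACC__ and a plain host compiler does not. When built without nvcc the qualifiers simply vanish, so the same per-element function serves as the CPU fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: HOST_DEVICE is a hypothetical macro name chosen for this sketch.
// nvcc defines __CUDACC__; an ordinary host compiler does not.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Single definition of the per-element work, usable from a kernel (under nvcc)
// and from plain host code.
HOST_DEVICE inline void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}

// CPU fallback (hypothetical helper): drives the shared function with a serial loop.
inline std::vector<int> addOnHost(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        addSingleItem(&c[i], &a[i], &b[i]);
    }
    return c;
}
```

Calling addOnHost on {1, 2, 3} and {10, 20, 30} yields {11, 22, 33}; the GPU path would keep launching addKernel exactly as in the question.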
Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss here is hard to generalize since it depends heavily on code structure, but it's not unusual to see 10% or 20% performance hit. In that case, you may wish to investigate link-time-optimizations that became available in CUDA 11.
This question may also be of interest, although only tangentially related here.

Implementing of mutex on cuda kernel function happens to be deadlocked

I'm a newcomer to CUDA, and I am trying to implement a mutex in a kernel function.
I read some tutorials and wrote my function, but in some cases a deadlock happens.
Here is my code. The kernel function is very simple: it counts the number of running threads launched by the main function.
#include <iostream>
#include <cuda_runtime.h>
__global__ void countThreads(int* sum, int* mutex) {
    while(atomicCAS(mutex, 0, 1) != 0); // lock
    *sum += 1;
    __threadfence();
    atomicExch(mutex, 0); // unlock
}
int main() {
    int* mutex = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int)); // zero the lock (pass the device pointer, not &mutex)
    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int)); // zero the counter
    int ret = 0;
    // pass, result is 1024
    countThreads<<<1024, 1>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;
    // deadlock, why?
    countThreads<<<1, 2>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;
    return 0;
}
So, can anyone tell me why the program deadlocks when calling countThreads<<<1, 2>>>(), and how to fix it? I want to implement a cross-block mutex, though maybe it is not a good idea. Many thanks.
I experimented for some time and found that the deadlock happens when the threads are in the same block; otherwise, everything works well.
Threads in the same warp attempting to negotiate for a lock or mutex is probably the worst-case scenario. It is fairly difficult to program correctly, and the behavior may change depending on the exact GPU you are running on.
Here is an example of the type of analysis needed to explain the exact reason for the deadlock in a particular case. Such analysis is not readily done on what you have shown here because you have not indicated the type of GPU you are compiling for, or running on. It's also fairly important to provide the CUDA version you are using for compilation. I have witnessed code changes from one compiler generation to another that may impact this. Even if you provided that information, I'm not sure the analysis is really worthwhile, because I consider the negotiation-within-a-warp case to be extra troublesome to program correctly. This question/answer may also be of interest.
My general suggestion for a newcomer in CUDA (as you say) would be to use a method similar to what is described here. Briefly, negotiate for a lock at the threadblock level (i.e. have one thread in each block negotiate among other blocks for the lock), then manage singleton activity within the block using standard, available block-level coordination schemes, such as __syncthreads(), and conditional coding.
You can learn more about this topic by searching on the cuda tag for such keywords as "lock" "critical section" etc.
FWIW, for me, anyway, your code does deadlock on a Kepler device and does not deadlock on a Volta device, as suggested by the reference in the comments. I'm not attempting to communicate any statement about whether your code is defect-free, it's just an observation. If I modify your kernel to look like this:
__global__ void countThreads(int* sum, int* mutex) {
    int old = 1;
    while (old){
        old = atomicCAS(mutex, 0, 1); // lock
        if (old == 0){
            *sum += 1;
            __threadfence();
            atomicExch(mutex, 0); // unlock
        }
    }
}
Then it seems to me to work in either the Kepler case or the Volta case. I'm not advancing this example to suggest it is "correct", rather to show that somewhat innocuous code modifications can change a code from deadlock to non-deadlock case, or vice versa. This kind of fragility is best avoided, certainly in the pre-Volta case, in my opinion.
For Volta and newer GPUs, with CUDA 11 and newer, you may want to use a capability from the libcu++ library, such as a semaphore.

Determining the optimal value for #pragma unroll N in CUDA

I understand how #pragma unroll works, but if I have the following example:
__global__ void
test_kernel( const float* B, const float* C, float* A_out)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
        #pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}
I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x number of threads and y number of blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1 million seems like a very big number for the variable (1 million unrolled iterations will cause register pressure, and I am not sure the compiler will even do that unroll), what is a "fair" number, if any? And how do I determine that limit?
Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.
Here is a compileable version of your kernel with a bit of template decoration:
template<int LIMIT>
__global__ void
test_kernel( const float* B, const float* C, float* A_out, int array_size)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
        #pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}
// explicit instantiations, so each LIMIT value gets its own kernel in the PTX
template __global__ void test_kernel<4>(const float*, const float*, float*, int);
template __global__ void test_kernel<64>(const float*, const float*, float*, int);
template __global__ void test_kernel<256>(const float*, const float*, float*, int);
template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
template __global__ void test_kernel<8192>(const float*, const float*, float*, int);
You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture) the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I do, you can probably play around with the templating to find the exact compiler limit for this code, although I doubt that is particularly instructive to know.
You can also see for yourself via the compiler that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).
CUDA takes advantage of thread-level parallelism, which you expose by splitting work into multiple threads, and instruction-level parallelism, which CUDA finds by searching for independent instructions in your compiled code.
@talonmies' result, showing that your loop might be unrolled somewhere between 4096 and 8192 iterations, was surprising to me because loop unrolling has sharply diminishing returns on a modern CPU, where most iteration overhead has been optimized away with techniques such as branch prediction and speculative execution.
On a CPU, I doubt that there would be much to gain from unrolling more than, say, 10-20 iterations and an unrolled loop takes up more room in the instruction cache so there's a cost to unrolling as well. The CUDA compiler will be considering the cost/benefit tradeoff when determining how much unrolling to do. So the question is, what might be the benefit from unrolling 4096+ iterations? I think it might be because it gives the GPU more code in which it can search for independent instructions that it can then run concurrently, using instruction-level parallelism.
The body of your loop is A_out[i] = B[i] + C[i];. Since the logic in your loop does not access external variables and does not access results from earlier iterations of the loop, each iteration is independent from all other iterations. So i doesn't have to increase sequentially. The end result would be the same even if the loop iterated over each value of i between 0 and LIMIT - 1 in completely random order. That property makes the loop a good candidate for parallel optimization.
But there is a catch, and it's what I mentioned in the comment. The iterations of your loop are only independent if the A buffer is stored separately from your B and C buffers. If your A buffer partially or fully overlaps the B and/or C buffers in memory, a connection between different iterations is created. One iteration may now change the B and C input values for another iteration by writing to A. So you get different results depending on which of the two iterations runs first.
Multiple pointers pointing to the same locations in memory is called pointer aliasing. In general, pointer aliasing can cause "hidden" connections between sections of code that appear to be separate, because writes done by one section of code through one pointer can alter values read by another section of code through another pointer. By default, CPU compilers generate code that takes possible pointer aliasing into account, so the result is correct regardless. The question is what CUDA does because, coming back to talonmies' test results, the only reason I can see for such a large amount of unrolling is that it opens the code up for instruction-level parallelism. But that would then mean that CUDA does not take pointer aliasing into account in this particular situation.
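To make the aliasing hazard concrete, here is a small host-only sketch. addLoop and runOverlapped are hypothetical helpers written for this illustration; the loop body is the same A[i] = B[i] + C[i], but A is deliberately set to overlap B shifted by one element, and the loop is run once in ascending and once in descending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Runs A[i] = B[i] + C[i] for i in [0, n), in ascending or descending order.
inline void addLoop(float* A, const float* B, const float* C, std::size_t n, bool ascending)
{
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = ascending ? k : n - 1 - k;
        A[i] = B[i] + C[i];
    }
}

// Hypothetical setup: B starts at buf[0] and A starts at buf[1], so writing A[i]
// overwrites B[i + 1]. C lives in a separate buffer.
inline std::vector<float> runOverlapped(bool ascending)
{
    std::vector<float> buf{1.0f, 1.0f, 1.0f, 1.0f};
    const std::vector<float> C{1.0f, 1.0f, 1.0f};
    addLoop(buf.data() + 1, buf.data(), C.data(), 3, ascending);
    return buf;
}
```

With these inputs the ascending run leaves buf as {1, 2, 3, 4} while the descending run leaves {1, 2, 2, 2}, so the iterations are no longer order-independent once the buffers overlap. With three separate buffers both orders give the same result.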
Re. your question about running more than a single thread: a regular serial program does not automatically become a parallel program when you increase the number of threads. You have to identify the portions of the work that can run in parallel and then express that in your CUDA kernel. That's what's called thread-level parallelism, and it's the main source of performance increase for your code. In addition, CUDA will search for independent instructions in each kernel and may run those concurrently, which is the instruction-level parallelism. Advanced CUDA programmers may keep instruction-level parallelism in mind and write code that facilitates it, but we mortals should just focus on thread-level parallelism. That means that you should look at your code again and consider what might be able to run in parallel. Since we already concluded that the body of your loop is a good candidate for parallelization, your job becomes rewriting the serial loop in your kernel to express to CUDA how to run separate iterations in parallel.
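The usual way to express that is to drop the loop and let each thread handle the one element selected by its global index. The sketch below only emulates that mapping on the host so it can be run without a GPU; addElement and emulateLaunch are hypothetical names, and the two nested loops stand in for what the blockIdx.x and threadIdx.x values would be in a real launch. It assumes A does not overlap B or C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work of one logical thread: what the kernel body would do for global index j.
inline void addElement(std::size_t j, const float* B, const float* C, float* A, std::size_t n)
{
    if (j < n) {  // guard: the last block may have more threads than remaining elements
        A[j] = B[j] + C[j];
    }
}

// Host-side emulation of a 1D launch: numBlocks blocks of threadsPerBlock threads.
inline std::vector<float> emulateLaunch(const std::vector<float>& B,
                                        const std::vector<float>& C,
                                        std::size_t threadsPerBlock)
{
    assert(B.size() == C.size() && threadsPerBlock > 0);
    const std::size_t n = B.size();
    const std::size_t numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    std::vector<float> A(n, 0.0f);
    for (std::size_t block = 0; block < numBlocks; ++block) {
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread) {
            // same index formula as threadIdx.x + blockIdx.x * blockDim.x
            addElement(thread + block * threadsPerBlock, B.data(), C.data(), A.data(), n);
        }
    }
    return A;
}
```

In an actual kernel the two outer loops disappear: the hardware supplies the block and thread indices, and addElement's body becomes the kernel body.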

Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel; in this loop a certain predicate is evaluated on the current state and, depending on the result of this predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very low.
I would like to split the kernel into many little kernels, but every block is totally independent of the others and I (think I) can't use a single thread on the host code to launch multiple small kernels.
I am not sure whether streams are adequate for this kind of work, as I have never used them, but since I have the option of using dynamic parallelism, I would like to know whether that is a good way to implement this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data to global memory to make it available to a sub-kernel?
If I split my big kernel into many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which allows me to move temporary variables into each sub-kernel), will that help me increase the occupancy?
I know it is a rather generic question, but I do not know this technology and I would like to know whether it fits my case or whether streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;
    int tid = ...;
    systemState[tid] = initialData[tid];
    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;
        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;
        sample(sampleData, systemState);
        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }
    do_some_stats();
}
this is to show you that there is a main loop, that there are temporary data which are used somewhere and not in other points, that there are shared data, synchronization points, etc.
Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not true, but logically it is)... One "big flow" for each block.
Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host I guess... But how do I coordinate, from a single loop, all the blocks? This is what leaves me most dubious. May I use streams from different host threads (One thread per block)?
I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure if I could have advantages here.
I have benefitted from dynamic parallelism for solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;
for(int m=0; m<(2*K+1); m++) {
    PP1 = calculate_PP1(i,m);
    phi_cap1 = calculate_phi_cap1(i,m);
    for(int n=0; n<(2*K+1); n++) {
        PP2 = calculate_PP2(i,m);
        phi_cap2 = calculate_phi_cap2(i,n);
        atomicAdd(&result[PP1][PP2],data[i]*phi_cap1*phi_cap2);
    }
}
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them in a (2K+1)x(2K+1) kernel.
From my (possibly incomplete) experience, dynamic parallelism will help if you have a small number of independent iterations. For a larger number of iterations, you could end up calling the child kernel many times, so you should check whether the kernel launch overhead becomes the limiting factor.

CUDA performance: branching and shared memory

I wish to ask two questions on performance. I have been unable to create simple code to illustrate.
Question 1: How expensive is non-divergent branching? In my code it seems to cost as much as the equivalent of more than 4 non-FMA FLOPs. Note that I am speaking of the BRA PTX instruction, where the predicate has already been calculated.
Question 2: I have been reading a lot about the performance of shared memory, and some articles, such as a Dr Dobbs article, even state that it can be as fast as registers (as long as it is accessed well). In my code all threads within the warps within the block access the same shared variable. I believe in this case shared memory is accessed in broadcast mode, isn't it? Should it reach the performance of registers this way? Are there any special things that should be considered to make it work?
EDIT: I have been able to construct some simple code that give more insight for my query
Here it is
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;
__global__ void test()
{
    __shared__ int t[1024];
    int v=t[0];
    bool b=(v==-1);
    bool c=(v==-2);
    int myValue=0;
    for (int i=0;i<800;i++)
    {
#if 1
        v=i;
#else
        v=t[i];
#endif
#if 0
        if (b) {
            printf("abs");
        }
#endif
        if (c)
        {
            printf ("IT HAPPENED");
            v=8;
        }
        myValue+=v;
    }
    if (myValue==1000)
        printf ("IT HAPPENED");
}
int main(int argc, char *argv[])
{
    cudaEvent_t event_start,event_stop;
    float timestamp;
    float4 *data;
    // Initialise
    cudaDeviceReset();
    cudaSetDevice(0);
    dim3 threadsPerBlock;
    dim3 blocks;
    threadsPerBlock.x=32;
    threadsPerBlock.y=32;
    threadsPerBlock.z=1;
    blocks.x=1;
    blocks.y=1000;
    blocks.z=1;
    cudaEventCreate(&event_start);
    cudaEventCreate(&event_stop);
    cudaEventRecord(event_start, 0);
    test<<<blocks,threadsPerBlock,0>>>();
    cudaEventRecord(event_stop, 0);
    cudaEventSynchronize(event_stop);
    cudaEventElapsedTime(&timestamp, event_start, event_stop);
    printf("Calculated in %f", timestamp);
}
I am running this code on a GTX680.
Now the results are as follows ..
If run as it is, it takes 5.44 ms.
If I change the first #if conditional to 0 (which enables reading from shared memory), it takes 6.02 ms. That is not much more, but it is still not good enough for me.
If I enable the second #if conditional (which inserts a branch that never evaluates to true), then it runs in 9.647040 ms. The performance reduction is very big. What is the cause and what can be done?
I have also changed slightly the code to make further checks with shared memory
Instead of
__shared__ int t[1024]
I did
__shared__ int2 t[1024]
and wherever I access t[] I just access t[].x. I got a further drop in performance to 10 ms (another 400 microseconds). Why should this happen?
Regards
Daniel
Have you determined if your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second would be most relevant if your kernel is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one, while it is the other.
(1) I don't think the cost of a branch has been published. You might be left to determining that experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.
You need to keep in mind that the compiler is highly optimizing. So if you comment out the branch, you also eliminate the evaluation of the conditional, whether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do: run cuobjdump -sass on your binaries and look at the actual differences in machine code).
Using only the .x component of an int2 changes the layout in shared memory so that you go from bank-conflict-free access to a 2-way bank conflict, which causes the slight further slowdown in your example. IIRC the latency of a shared memory access is of the order of 30 cycles, which usually is easily hidden by other threads (as Roger has already mentioned).
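For what it's worth, the bank arithmetic behind a 2-way conflict can be checked on the host. The sketch below is a simplified model, not a description of any particular GPU: it assumes 32 banks that are each 4 bytes wide, a warp of 32 threads, and that thread k reads the 4-byte word at byte offset k * strideBytes (conflictDegree is a hypothetical helper). Two threads conflict when they hit the same bank at different words.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Returns the worst-case number of distinct 4-byte words that one warp requests
// from a single bank (1 means conflict-free, 2 means a 2-way conflict, ...).
// Assumptions of this model: 32 banks, 4-byte bank width, 32 threads per warp.
inline int conflictDegree(std::size_t strideBytes)
{
    constexpr std::size_t kBanks = 32, kBankWidth = 4, kWarp = 32;
    assert(strideBytes % kBankWidth == 0);
    std::array<std::set<std::size_t>, kBanks> wordsPerBank;
    for (std::size_t k = 0; k < kWarp; ++k) {
        const std::size_t word = (k * strideBytes) / kBankWidth;  // 4-byte word index
        wordsPerBank[word % kBanks].insert(word);                 // successive words map to successive banks
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under this model a stride of 4 bytes (one int per thread) gives degree 1, a stride of 8 bytes (the .x of one int2 per thread) gives degree 2, and a stride of 0 (every thread reading the same word, i.e. a broadcast) gives degree 1. Whether the measured slowdown in the question comes from this effect is best confirmed with a profiler.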