Running zero blocks in cuda - cuda

I have a loop like this:
while ( ... ) {
...
kernel<<<blocks, threads>>>( ... );
}
and in some iterations blocks or threads have value 0. When I use this my code runs. My question is if this is considered bad practice, and if there are any other bad side effects.

It's bad practice because it will interfere with proper CUDA error checking.
If you do proper error checking, your kernel launches that have all-zero values for block or grid dimensions will throw an error.
It's preferable to write error free programs for a variety of reasons.
Instead, include a test for these cases and skip the kernel launch when your dimensions are zero. The small overhead in C-code to do this will be more than offset by the reduced API overhead by not making the spurious kernel launch request.

I have tried zero block kernel invocation by simply writing following empty kernel.
File:
#include<stdio.h>
__global__ void fg()
{
}
int main()
{
fg<<<0,1>>>();
}
What I noticed was the only side effect was in terms of the time required for execution.
Run time :
real 0m0.242s,
user 0m0.004s,
sys 0m0.148s.
When I run the same file with kernel invocation commented the side effect of overhead in time decreases.
Run time:
real 0m0.003s,
user 0m0.000s,
sys 0m0.000s.
This side effect arises due to the kernel invocation over head for zero blocks.

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have a opengl particle simulation, where the position of each particle is calculated in a CUDA kernel. Most memory resides within the GPU memory, but there is a single float value, I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell), this slows down the performance quite a bit. I tried to use nvproof to see, which calls take the longest, with these results:
Calls Avg Min Max Name
477 2.9740us 2.8160us 4.5440us simulation(float3*, float*, float3*, float*)
477 89.033us 18.600us 283.00us cudaLaunchKernel
477 47.819us 10.200us 120.70us cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but from the calls, that happen every frame cudaMemcpyAsync() seems to be taking the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here, however for some reason this increases the kernel launch times even more, making more than up for the time saved for not needing the memcopy function.
I guess there has to be a better/faster way to update my single float variable to the GPU?
Easiest way is, that you can add an extra parameter to the simulation kernel function as a value of simple float but not as a pointer to float so that the data goes directly by the kernel launch parameters structure that CUDA sends to GPU when you launch the kernel. Then you evade that data copy command altogether. (I'm assuming CUDA packs whole function parameter descriptor data of kernel into a single copy command because kernel parameter descriptor space is limited by a few kBs or less).
simulation(fooPointer,
barPointer,
fooBarPointer,
floatVariable
);
Or, try double buffering between data update and rendering or between data update and compute so that simulation image follows the simulation calculation by 1-2 frames behind (and per-frame time gets worse) but "frames per second" increases.
If its not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to a user-input into simulation?) then you should embed the float variable to the end of an array that you already send/use in simulation or whatever structure you are using. If you already have a 1MB+ float buffer to send to GPU, then appending 4B(float) to end of it should not make much difference then you can access it from there. 1 copy operation should be faster than 2 copy operations with same total size.
If you are literally sending just 4B to GPU at each frame (with a simple function to generate that data), then (as 3Dave said in comments) you can try adding an extra kernel function to update the value in the GPU and just have the overhead of kernel launch command instead of both copy command overhead and data copy overhead. On a positive side, that extra kernel overhead might be hidden if there is a "graph" of kernels running for each frame automatically without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say per-kernel launch overhead in this "graph" feature is only 3-4 microseconds when there are many(20) kernels in the algorithm.
Since graph supports other commands too, you can try both copy and compute parts in parallel cuda-streams within a graph and switch their inputs with double buffering so all CUDA things can stay within CUDA's context before sending output to rendering.
(Maybe)You don't even have to change the data mechanism at all. Just try sending data of float as binary representation into the pointer value and only read the pointer value (not data value) from kernel and convert it back to float. I don't know if CUDA returns an error for this if you don't try reaching the (wrong) pointer address that the float data represents, in the kernel.
simulation(fooPointer,
barPointer,
fooBarPointer,
toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice if you can simply use "value" type parameters.

Why is the first cudaMalloc the only bottleneck?

I defined this function :
void cuda_entering_function(...)
{
StructA *host_input, *dev_input;
StructB *host_output, *dev_output;
host_input = (StructA*)malloc(sizeof(StructA));
host_output = (StructB*)malloc(sizeof(StructB));
cudaMalloc(&dev_input, sizeof(StructA));
cudaMalloc(&dev_output, sizeof(StructB));
... some more other cudaMalloc()s and cudaMemcpy()s ...
cudaKernel<< ... >>(dev_input, dev_output);
...
}
This function is called several times (about 5 ~ 15 times) throughout my program, and I measured this program's performance using gettimeofday().
Then I found that the bottleneck of cuda_entering_function() is the first cudaMalloc() - the very first cudaMalloc() throughout my whole program. Over 95% of the total execution time of cuda_entering_function() was consumed by the first cudaMalloc(), and this also happens when I changed the size of first cudaMalloc()'s allocating memory or when I changed the executing order of cudaMalloc()s.
What is the reason and is there any way to reduce the first cuda allocating time?
The first cudaMalloc is responsible for the initialization of the device too, because it's the first call to any function involving the device. This is why you take such a hit: it's overhead due to the use of CUDA and your GPU. You should make sure that your application can gain a sufficient speedup to compensate for the overhead.
In general, people use a call to an initialization function in order to setup their device. In this answer, you can see that apparently a call to cudaFree(0) is the canonical way to do so. This sample shows the use of cudaSetDevice, which could be a good habit if you ever work on machines with several CUDA-ready devices.

Synchronization in CUDA

I read cuda reference manual for about synchronization in cuda but i don't know it clearly. for example why we use cudaDeviceSynchronize() or __syncthreads()? if don't use them what happens, program can't work correctly? what difference between cudaMemcpy and cudaMemcpyAsync in action? can you show an example that show this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load it's appropriate shared memory location(s), before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have some conditional paths in your code, and then you want to run an operation that depends on several array element.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reaches here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer based on the if condition. When considering the last line of the example, you need to make sure that all the threads had exactly finished the if-condition and updated myarray correctly. If this wasn't the case, y may use some updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

CUDA, low performance in storing data in shared memroy

Here is my problem, in order to speed up my project, i want to save a value which generated inside kernel into a shared memory, However, i found it takes such a long time to save that value. If i remove "THIS LINE"(see codes below), i.e., remove the "THIS LINE", it is very fast to save that value(100 times speed-up!).
extern __shared__ int sh_try[];
__global__ void xxxKernel (...)
{
float v, e0, e1;
float t;
int count(0);
for (...)
{
v = fetchTexture();
e0 = fetchTexture();
e1 = fetchTexture();
t = someDeviceFunction(v, e0, e1);
if (t>0.0 && t < 1.0) <========== <THIS LINE>
count++;
}
sh_try[threadIdx.x] = count;
}
main()
{
sth..
START TIMING:
xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);
cudaDeviceSynchronize();
END TIMING.
sth...
}
In order to figure out this problem, i simplify my codes that just save the data into the shared mem. and stop. As i know shared mem. is the most efficient mem. besides register, I wonder if this high latency is normal or i've done sth wrong. PLEASE give me some advice!!! Thank you guys in advance!!!
trudi
UPDATE:
If i replace shared memory with global mem., it takes almost the same time, 33ms without "THIS LINE", 297ms with it. Is that normal for saving data to global mem. takes the same time as shared mem.? Is that also a part of 'compiler optimization'?
I have checked the other similar problems on stackoverflow also, i.e., there is a huge time gap between saving data into shared memory or not, which may caused by compiler optimization, since it is pointless to calculating data but not saving them, so the compiler just 'removed' those pointless code.
I am not sure if i share the same reason, since the line changes the game is a hypothesis - "THIS LINE", when i comment it, the variable 'count' increases in EVERY iteration, when i uncomment it, it increases when the t is meaningful.
Any ideas? Please...
Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in your kernels.
A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.
Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.
You can generated ptx by simply adding the -ptx switch to your compile command:
nvcc -ptx mycode.cu
This will generate a file called mycode.ptx which you can inspect. Naturally if your regular compile command requires extra switches (e.g -I/path/to/include/files) then this command may require those same switches. The nvcc manual provides more information on code generation options, and there is a PTX manual to help you learn about PTX, but you may be able to get a rough idea just based on the size of the generated PTX (e.g. number of lines in the .ptx file).

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string comparing jobs on CUDA, and i wonder how can a __global__ function return a value when it finds the exact string that I'm looking for.
I mean, i need the __global__ function which contains a great amount of threads to find a certain string among a big big string-pool simultaneously, and i hope that once the exact string is caught, the __global__ function can stop all the threads and return back to the main function, and tells me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global___ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global___ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great amount of threads like you think it does. It is simply a kernel, function that runs on device, that is called by passing paramaters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside of each block on the grid.
With the type of problem you have it is not really necessary to use anything besides a 1D grid with 1D of threads on in each block because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication)
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudamalloc and cudacopy to device up before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(, Height)
//cudaMemCpy answerIdx back to integer on host
//kernel (Not positive on these types as my CUDA is very rusty
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx
}
}
This is obviously not the most efficient and is most likely not the exact way to pass paramaters and work with memory w/ CUDA, but I hope it gets the point across of splitting the workload and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed up you will get by just dividing the workload onto the device (in a sensible fashion of course) will already give you tremendous performance improvements. To get a sense of the thread model I highly recommend reading up on the documents on Nvidia's site for CUDA. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.