Looping inside cuda code - cuda

I ran some CUDA code that updated an array of floats. I have a wrapper function like the one discussed in How can I compile CUDA code then link it to a C++ project? this question.
Inside my CUDA function I create a for loop like this...
int tid = threadIdx.x;
for(int i=0;i<X;i++)
{
//code here
}
Now the issue is that if X is equal to the value of 100, everything works just fine, but if X is equal to 1000000, my vector does not get updated (almost as if the code inside the for loop does not get executed)
Now inside the wrapper function, if I call the CUDA function in a for loop, it still works just fine, (but is significantly slower for some reason than if I simply did the same process all on the CPU) like this...
for(int i=0;i<1000000;i++)
{
update<<<NumObjects,1>>>(dev_a, NumObjects);
}
Does anyone know why I can loop a million times in the wrapper function but not simply call the CUDA "update" function once and then inside that function start a for loop of a million?

You should be using cudaThreadSynchronize and cudaGetLastError after running this to see if there was some error. I imagine the first time, it timed out. This happens if the kernel takes a long time to complete. The card just gives up on it.
The second thing, the reason it takes much longer to execute, is because there is a set overhead time for each kernel launch. When you had the loop inside the kernel, you experienced this overhead once and ran the loop. Now you're experiencing it X times. The overhead is fairly small, but large enough that as much of the loop should be put inside the kernel as possible.
If X is particularly large, you might look into running as much of the loop in the kernel as possible until it completes in a safe amount of time, and then loop over these kernels.

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have a opengl particle simulation, where the position of each particle is calculated in a CUDA kernel. Most memory resides within the GPU memory, but there is a single float value, I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell), this slows down the performance quite a bit. I tried to use nvproof to see, which calls take the longest, with these results:
Calls Avg Min Max Name
477 2.9740us 2.8160us 4.5440us simulation(float3*, float*, float3*, float*)
477 89.033us 18.600us 283.00us cudaLaunchKernel
477 47.819us 10.200us 120.70us cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but from the calls, that happen every frame cudaMemcpyAsync() seems to be taking the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here, however for some reason this increases the kernel launch times even more, making more than up for the time saved for not needing the memcopy function.
I guess there has to be a better/faster way to update my single float variable to the GPU?
Easiest way is, that you can add an extra parameter to the simulation kernel function as a value of simple float but not as a pointer to float so that the data goes directly by the kernel launch parameters structure that CUDA sends to GPU when you launch the kernel. Then you evade that data copy command altogether. (I'm assuming CUDA packs whole function parameter descriptor data of kernel into a single copy command because kernel parameter descriptor space is limited by a few kBs or less).
simulation(fooPointer,
barPointer,
fooBarPointer,
floatVariable
);
Or, try double buffering between data update and rendering or between data update and compute so that simulation image follows the simulation calculation by 1-2 frames behind (and per-frame time gets worse) but "frames per second" increases.
If its not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to a user-input into simulation?) then you should embed the float variable to the end of an array that you already send/use in simulation or whatever structure you are using. If you already have a 1MB+ float buffer to send to GPU, then appending 4B(float) to end of it should not make much difference then you can access it from there. 1 copy operation should be faster than 2 copy operations with same total size.
If you are literally sending just 4B to GPU at each frame (with a simple function to generate that data), then (as 3Dave said in comments) you can try adding an extra kernel function to update the value in the GPU and just have the overhead of kernel launch command instead of both copy command overhead and data copy overhead. On a positive side, that extra kernel overhead might be hidden if there is a "graph" of kernels running for each frame automatically without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say per-kernel launch overhead in this "graph" feature is only 3-4 microseconds when there are many(20) kernels in the algorithm.
Since graph supports other commands too, you can try both copy and compute parts in parallel cuda-streams within a graph and switch their inputs with double buffering so all CUDA things can stay within CUDA's context before sending output to rendering.
(Maybe)You don't even have to change the data mechanism at all. Just try sending data of float as binary representation into the pointer value and only read the pointer value (not data value) from kernel and convert it back to float. I don't know if CUDA returns an error for this if you don't try reaching the (wrong) pointer address that the float data represents, in the kernel.
simulation(fooPointer,
barPointer,
fooBarPointer,
toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice if you can simply use "value" type parameters.

Simulation loop GPU utilization

I am struggling with utilizing a simulation loop.
There are 3 kernel launched in every cycle.
The next time step size is computed by the second kernel.
while (time < end)
{
kernel_Flux<<<>>>(...);
kernel_Timestep<<<>>>(d_timestep);
memcpy(&h_timestep, d_timestep, sizeof(float), ...);
kernel_Integrate<<<>>>(d_timestep);
time += h_timestep;
}
I only need copy back a single float. What would be the most efficient way to avoid unnecessary synchronizations?
Thank you in advance. :-)
In CUDA all operations running from the default stream are synchronized. So in the code you've posted kernels will run one after another. From what I can see the kernel kernel_integrate() depends from the result of the kernel kernel_Timestep(), so no way of avoiding synchronization. Anyway if the kernels kernel_Flux() and kernel_Timestep() work on independent data, you can try to execute them in parallel, in two different streams.
If you care about the iteration time a lot, you can probably setup a new stream dedicated to the memcpy of the h_timestep out (you need to use cudaMemcpyAsync in this case). Then use something like speculative execution, where your loop proceeds before you figure out the time. To do so you will have to setup the GPU memory buffers for the next several iterations. You can probably do this by using a circular buffer. You also need to use cudaEventRecord and cudaStreamWaitEvent to synchronize the different streams, such that a next iteration is allowed to proceed only if the time corresponds to the buffer you are about to overwrite, has been calculated (the memcpy stream has done the job), because otherwise you will lose the state at that iteration.
Another potential solution, which I haven't tried but I suspect would work, is to make use of dynamic parallelism. If your cards support that, you can probably put the whole loop in GPU.
EDIT:
Sorry, I just realized that you have the third kernel. Your delay due to synchronization may be because you are not doing cudaMemcpyAsync? It's very likely that the third kernel will run longer than the memcpy. You should be able to proceed without any delay. The only synchronization needed is after each iteration.
The ideal solution would be to move everything to the GPU. However, I cannot do so, because I need to launch CUDPP compact after every few iterations, and it does not support CUDA streams nor dynamics parallelism. I know that the Thrust 1.8 library has copy_if method, which does the same, and it is working with dynamic parallelism. The problem is it does not compile with separate compilation on.
To sum up, now I use the following code:
while (time < end)
{
kernel_Flux<<<gs,bs, 0, stream1>>>();
kernel_Timestep<<<gs,bs, 0, stream1>>>(d_timestep);
cudaEventRecord(event, stream1);
cudaStreamWaitEvent(mStream2, event, 0);
memcpyasync(&h_timestep, d_timestep, sizeof(float), ..., stream2);
kernel_Integrate<<<>>>(d_timestep);
cudaStreamSynchronize(stream2);
time += h_timestep;
}

cudaMemcpy replace existing values?

I am making a program in cuda 4.2 and I am having this problem...
I need to execute the same code to 2 images. So I put the code in a for loop and I am calling all the cudaMalloc once before the for loop. The code in the the loop uses cudaMemcpy(..,..,..,cudaMemcpyHostToDevice) to the same device array pointer. So I thought that the new values (from the first image) would replace the old ones (from the second image) in the second loop. But cudaMemcpy(..,..,..,cudaMemcpyHostToDevice) fails...
Do i have to use another function?
Thanks a lot!
I had a potentially similar problem,
It turned out I was calling cudaFree() inside at the end of the loop but cudaMalloc() only once before the loop.
This meant when I tried to copy over the values onto the GPU after the first time the memory was no longer allocated, I ended up making sure CudaMalloc and CudaFree were all outside the loop and it worked fine.

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string comparing jobs on CUDA, and i wonder how can a __global__ function return a value when it finds the exact string that I'm looking for.
I mean, i need the __global__ function which contains a great amount of threads to find a certain string among a big big string-pool simultaneously, and i hope that once the exact string is caught, the __global__ function can stop all the threads and return back to the main function, and tells me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global___ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global___ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great amount of threads like you think it does. It is simply a kernel, function that runs on device, that is called by passing paramaters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside of each block on the grid.
With the type of problem you have it is not really necessary to use anything besides a 1D grid with 1D of threads on in each block because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication)
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudamalloc and cudacopy to device up before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(, Height)
//cudaMemCpy answerIdx back to integer on host
//kernel (Not positive on these types as my CUDA is very rusty
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx
}
}
This is obviously not the most efficient and is most likely not the exact way to pass paramaters and work with memory w/ CUDA, but I hope it gets the point across of splitting the workload and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed up you will get by just dividing the workload onto the device (in a sensible fashion of course) will already give you tremendous performance improvements. To get a sense of the thread model I highly recommend reading up on the documents on Nvidia's site for CUDA. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.

Cuda Kernel with different array sizes

I am working on a fluid dynamic problem in cuda and discovered a problem like this
if I have an array e.g debug_array with the length 600 and an array
value_array with the length 100 and I wanna do sth like
for(int i=0;i<6;i++)
{
debug_array[6*(bx*block_size+tx)+i] = value_array[bx*block_size+tx];
}
block_size would in this example be based on the 100 element array, e.g
4 blocks block_size 25
if value_array contains e.g 10;20;30;.....
I would expect debug_array to have groups of 6 similar values like
10;10;10;10;10;10;20;20;20;20;20;20;30......
The problem is that it is not picking up all values from the values array, any idea
why this isn't working or a good workaround.
What will work is if I define float val = value_array[bx*block_size+tx]; outside the for loop and keep this inside the loop debug_array[bx*block_size+tx+i] = val;
But I would like to avoid that as my kernels have between 5 and 10 device function inside the loop and it makes it just hard to read.
thanks in advance any advice will be appriciated
Markus
There seems to be an error in computing the index:
Let's assume bx = 0 and tx = 0
The first 6 elements in debug_array will be filled with data.
Next thread: tx = 1: Elements 1 to 7 will be filled with data (overwriting existing data).
Due to the threads working in parallel it is not determined which thread will be scheduled first and therefore which values will be written into the debug_array.
You should have written:
debug_array[6*(bx*block_size+tx)+i] = value_array[bx*block_size+tx];
If changing the code to move the value_array expression out of the loop and into a temp variable makes the code work - and that is the only code change you made - then this smells like a compiler bug.
Try changing your nvcc compiler options to reduce or disable optimizations and see if the value_array expression inside the loop changes behavior. Also, make sure you're using the latest CUDA tools.
Optimizing compilers will often attempt to move expressions that aren't dependent on the loop index variable out of the loop, exactly like your manual workaround. It's called "invariant code motion" and it makes loops faster by reducing the amount of code that executes in each iteration of the loop. If manually extracting the invariant code from the loop works, but letting the compiler figure it out on its own doesn't, that casts doubt on the compiler.