Launching CUDA kernels in sequence and transferring data between them - cuda

I have two kernels in the same file. The code should run the first kernel to generate an array, then pass the generated array to the second kernel.
However, when I do this, the second kernel sees all the array elements as 0.
Here is a simplification (not runnable code), just pseudocode:
cudaMalloc(device_input_array)
cudaMalloc(device_result_array)
cudaMemcpy(device_input_array, input_array, size, hostToDevice)
kernel1<<<1,n>>>(device_input_array, device_result_array)
cudaMemcpy(host_result_array, device_result_array, ...)
cudaMalloc(dev_secondarray)
kernel2<<<1,n>>>(dev_secondarray, device_result_array)
For testing, in kernel2 I loop over device_result_array; however, it prints all its elements as zero.
What is the proper way to send data between kernels? Should I reserve space for the result array again? What should I do?

Memory allocated through cudaMalloc exists until the end of the application, or until you explicitly free it. Thus, device_result_array can be passed directly to the second kernel as an input. I would recommend the following pattern:
cudaMalloc(device_input_array)
cudaMalloc(device_intermediate_result_array)
cudaMalloc(device_final_result_array)
cudaMemcpy(device_input_array,host_input_array,size,hosttodevice)
kernel1<<<G,B>>>(device_input_array,device_intermediate_result_array)
kernel2<<<G,B>>>(device_intermediate_result_array,device_final_result_array)
cudaMemcpy(host_result_array,device_final_result_array,size,devicetohost)
If for some reason you actually need to make a copy of the intermediate result in the device, you have an option to call cudaMemcpy(...,cudaMemcpyDeviceToDevice).
In either case, don't copy the intermediate result to host (unless you really need it for other reasons). Host<->Device copies are expensive.
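For illustration, here is a minimal sketch of that pattern; the kernel bodies, the element count n, and all names are placeholders of mine, not the asker's actual code:

__global__ void kernel1(const float* in, float* intermediate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    intermediate[i] = in[i] * 2.0f;              // placeholder computation
}

__global__ void kernel2(const float* intermediate, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = intermediate[i] + 1.0f;             // placeholder computation
}

// Host side (error checking omitted):
float *d_in, *d_mid, *d_out;
cudaMalloc(&d_in,  n * sizeof(float));
cudaMalloc(&d_mid, n * sizeof(float));
cudaMalloc(&d_out, n * sizeof(float));
cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
kernel1<<<1, n>>>(d_in, d_mid);   // writes the intermediate array on the device
kernel2<<<1, n>>>(d_mid, d_out);  // reads it directly; no copy back to the host in between
cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);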


What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have an OpenGL particle simulation where the position of each particle is calculated in a CUDA kernel. Most memory resides in GPU memory, but there is a single float value I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down performance quite a bit. I tried to use nvprof to see which calls take the longest, with these results:
Calls  Avg       Min       Max       Name
  477  2.9740us  2.8160us  4.5440us  simulation(float3*, float*, float3*, float*)
  477  89.033us  18.600us  283.00us  cudaLaunchKernel
  477  47.819us  10.200us  120.70us  cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but of the calls that happen every frame, cudaMemcpyAsync() seems to be taking the longest.
I have also tried using pinned memory and cudaHostGetDevicePointer() as described here; however, for some reason this increases the kernel launch times even more, more than making up for the time saved by not needing the memcpy.
I guess there has to be a better/faster way to update my single float variable on the GPU?
The easiest way is to add an extra parameter to the simulation kernel as a plain float value rather than a pointer to float, so that the data travels inside the kernel-launch parameter block that CUDA sends to the GPU when you launch the kernel. Then you avoid the separate copy command altogether. (I'm assuming CUDA packs the kernel's whole parameter descriptor data into a single transfer, since kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           floatVariable);
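The matching kernel signature then simply takes the float by value; the parameter names below are assumptions of mine, not the asker's actual code:

__global__ void simulation(float3* foo, float* bar, float3* fooBar, float floatVariable)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // floatVariable arrived with the launch parameters; no separate cudaMemcpy was needed
    bar[i] += floatVariable;   // placeholder use of the value
}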
Or, try double buffering between data update and rendering, or between data update and compute, so that the rendered image trails the simulation calculation by 1-2 frames (per-frame latency gets worse) but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to user input into the simulation?), then you should embed the float variable at the end of an array that you already send to and use in the simulation, or whatever structure you are using. If you already have a 1 MB+ float buffer to send to the GPU, then appending 4 B (one float) to the end of it should not make much difference, and you can access it from there. One copy operation should be faster than two copy operations of the same total size.
If you are literally sending just 4 B to the GPU each frame (with a simple function to generate that data), then (as 3Dave said in the comments) you can try adding an extra kernel function that updates the value on the GPU, and pay only the kernel-launch overhead instead of both the copy-command overhead and the data-copy overhead. On the plus side, that extra kernel's overhead might be hidden if there is a "graph" of kernels running for each frame automatically, without enqueueing all of them again and again.
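A minimal sketch of such an update kernel; the names d_value, hostValue, and stream are placeholders:

// One thread is enough to overwrite a single device-resident float.
__global__ void updateValue(float* d_value, float newValue)
{
    *d_value = newValue;
}

// Per frame, instead of a cudaMemcpyAsync:
updateValue<<<1, 1, 0, stream>>>(d_value, hostValue);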
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say the per-kernel launch overhead with this graph feature is only 3-4 microseconds when there are many (20) kernels in the algorithm.
Since graphs support other commands too, you can try running the copy and compute parts in parallel CUDA streams within a graph and switch their inputs with double buffering, so everything stays within CUDA's context before the output is sent to rendering.
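For reference, a bare-bones capture-and-launch sketch following the API as used in that blog post; kernelA, kernelB, grid, block, and stream are placeholders, and note that arguments captured into a graph are frozen unless you update the graph node's parameters afterwards:

cudaGraph_t graph;
cudaGraphExec_t instance;

// Capture the per-frame sequence of launches into a graph once...
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernelA<<<grid, block, 0, stream>>>(dataA);
kernelB<<<grid, block, 0, stream>>>(dataB);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// ...then replay the whole sequence each frame with a single call.
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);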
(Maybe) you don't even have to change the data mechanism at all. Just try sending the float's binary representation as the pointer value, and in the kernel read only the pointer value itself (not the data it points to) and convert it back to a float. I don't know whether CUDA returns an error for this, as long as you never dereference the (bogus) pointer address that the float bits represent.
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           toPtr(floatData));  // <----- float to 64/32-bit pointer value
and in the kernel:
float val = fromPtrToFloat(parameter4); // reinterprets the pointer value itself, not the pointed-to data
But this may not be a preferred practice if you can simply use value-type parameters.

Synchronization in CUDA

I read the CUDA reference manual about synchronization in CUDA, but I don't understand it clearly. For example, why do we use cudaDeviceSynchronize() or __syncthreads()? If I don't use them, will the program not work correctly? What is the difference between cudaMemcpy and cudaMemcpyAsync in practice? Can you show an example of this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load its appropriate shared memory location(s) before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
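As a minimal illustration of that shared-memory pattern (a sketch of mine, assuming 256 threads per block; not from the question):

__global__ void reverseTile(const float* in, float* out)
{
    __shared__ float tile[256];        // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];         // each thread loads its own slot from global memory
    __syncthreads();                   // wait until the whole tile is populated

    // Only after the barrier is it safe to read slots written by other threads in the block.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}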
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
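A small illustrative example of that difference (a sketch; d_data, h_data, bytes, stream, and doOtherCpuWork are placeholders, and h_data should be pinned, e.g. allocated with cudaMallocHost, for the async copy to actually overlap):

// Blocking: the CPU thread waits here until the copy has finished.
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

// Non-blocking: the call returns immediately; the copy is queued on "stream"
// and the CPU is free to do other work in the meantime.
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
doOtherCpuWork();                  // overlaps with the asynchronous copy
cudaStreamSynchronize(stream);     // wait only when the copied data is actually needed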
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have conditional paths in your code and then want to run an operation that depends on several array elements.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reach here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer, depending on the if condition. Considering the last line of the example, you need to make sure that all threads have finished the if condition and updated myarray correctly. If this is not the case, y may mix updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}
__syncthreads();   // All threads will wait till they come to this point
                   // We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

CUDA: sum of data on a global memory variable

I've launched a kernel with 2100 blocks and 4 threads per block.
At some point in this kernel, all the threads have to execute a function and put its result into an array (in global memory) at position threadIdx.x.
I know for certain that, in this phase of the project, the function always returns 1.012086.
Now, I've written this code to do that sum:
currentErrors[threadIdx.x] = 0;
for(i = 0; i < gridDim.x; i++)
{
    if(i == blockIdx.x)
    {
        currentErrors[threadIdx.x] += globalError(mynet, myoutput);
    }
}
But when the kernel ends, every array position holds 1.012086 (instead of 1.012086 * 2100).
Where am I going wrong?
Thanks for your help!
To compute a final sum out of partial results of your blocks, I would suggest doing it the following way:
Let every block write a partial result into a separate cell of a gridDim.x-sized array.
Copy the array to host.
Perform final sum on the host.
I assume each block has a lot to compute on its own, which would warrant the usage of CUDA in the first place.
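A minimal sketch of that pattern; globalError, mynet, and myoutput are taken from the question, while d_blockErrors, h_blockErrors, and numBlocks (2100 here) are placeholders of mine:

__global__ void computeErrors(float* blockErrors /*, ... */)
{
    // Each block computes its own partial result and writes it to its own cell.
    float partial = globalError(mynet, myoutput);
    if (threadIdx.x == 0)
        blockErrors[blockIdx.x] = partial;
}

// Host side: copy the gridDim.x-sized array back and finish the sum on the CPU.
cudaMemcpy(h_blockErrors, d_blockErrors, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
float total = 0.0f;
for (int i = 0; i < numBlocks; ++i)
    total += h_blockErrors[i];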
In your current state, I think there is something wrong in your kernel. It seems to me that every block is summing all the data, returning the final result as if it were a partial result.
The loop you presented does not really make sense. For each block there is only one i which will do anything. The code is equivalent to simply writing:
currentErrors[threadIdx.x]=0;
currentErrors[threadIdx.x]+=globalError(mynet,myoutput);
save for some unpredictable scheduling differences.
Remember that blocks are not executed in sync. Each block can run before, during or after any other block.
Also:
You may be interested in the parallel prefix sum algorithm.
You may want to check an efficient CUDA implementation of the prefix sum.

CUDA: Move content of a volume texture using only one kernel and a threadfence

I want to move the content of a volume texture along the vector vecShift. I think of a kernel like this:
__global__ void
moveVolume(int* vecShift)
{
    // Determine position of current voxel as ptDest
    // Determine position of voxel we copy the content from as ptSrc
    // Read value at ptSrc and store it to voxelColor
    // __threadfence()
    // Write voxelColor to voxel at position ptDest
}
The threadfence will ensure that ALL voxels have read the contents of their "partner" and that there is no write to ptDest before every voxel has done its read, won't it?
If this is true, why do I (sometimes) get blurry artifacts? Or do I have a wrong understanding of how threadfence works?
As talonmies explains in the comments, using __threadfence() here is neither necessary nor sufficient. __threadfence() does not provide global barrier synchronization, it simply ensures that before the thread that calls __threadfence() proceeds, all writes by that thread before the fence are visible to all other active threads in the kernel launch.
What you really want here is to double buffer your volume data (i.e. write to a different array than you read). You cannot overwrite other parts of the array unless you can guarantee that they are only read by other threads in the same thread block. Otherwise you have a race condition and your program is incorrect.
Note: even in a sequential (CPU) implementation, you would need to double buffer your data for this type of computation!
What you are implementing is very similar to an advection kernel, as would be used in fluid dynamics simulations, and I'm sure there are multiple examples of what you want on the web (parallel or sequential).
Mark
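A minimal double-buffered sketch of the move; the flat 1-D indexing, the volume size N, and the buffer names are simplifications of mine, not the original texture setup:

__global__ void moveVolume(const float* src, float* dst, int shift, int N)
{
    int dstIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (dstIdx >= N) return;

    int srcIdx = dstIdx - shift;   // where this voxel's new value comes from
    dst[dstIdx] = (srcIdx >= 0 && srcIdx < N) ? src[srcIdx] : 0.0f;
}

// Host side: launch, then swap the roles of the two buffers for the next step.
moveVolume<<<(N + 255) / 256, 256>>>(d_bufA, d_bufB, shift, N);
// the next iteration reads d_bufB and writes d_bufA; no __threadfence() is needed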

How best to transfer a large number of arrays of chars to the GPU?

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 GB of data to be transferred. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I made a first attempt at this with Thrust, basing my solution loosely on this message ... I made a struct with a static call that allocates all the memory, each individual constructor then copies its array, and I have a transform call which takes in the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't immediately see how they would work, since neither seems to take pointers to pointers as input...
Any suggestions are welcome...
One way to do this is, as marina.k suggested, to batch your transfers, doing them only as you need them. Since you said each array contains only about 1000 chars, you could assign each char to a thread (since on Fermi we can launch 1024 threads per block) and have each array handled by one block. In this case you may be able to transfer all the arrays for one "round" in one call. Can you use a FORTRAN-style layout, where you make one gigantic array, and to get the 5th element of the "third" 1000-char array you would go:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000-char array makes up the first 1000 elements of big_array, the second 1000-char array makes up the second 1000 elements of big_array, and so on? In that case your chars would be contiguous in memory and you could move the set you were going to process with one kernel launch in a single memcpy. Then, as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel you could simply handle each array within one block, so that block N handles the (N-1)-thousandth up to the N-thousandth elements of d_big_array (where you copied all those chars to).
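A sketch of that flattening step; ARRAY_LEN, numArraysPerRound, and host_array_ptrs are placeholders of mine:

const int ARRAY_LEN = 1000;                      // chars per array

// Pack the scattered host arrays for one "round" into one contiguous buffer...
char* big_array = (char*)malloc((size_t)numArraysPerRound * ARRAY_LEN);
for (int a = 0; a < numArraysPerRound; ++a)
    memcpy(big_array + (size_t)a * ARRAY_LEN, host_array_ptrs[a], ARRAY_LEN);

// ...so the whole batch moves to the device in a single transfer.
char* d_big_array;
cudaMalloc(&d_big_array, (size_t)numArraysPerRound * ARRAY_LEN);
cudaMemcpy(d_big_array, big_array, (size_t)numArraysPerRound * ARRAY_LEN, cudaMemcpyHostToDevice);

// In the kernel, block b then owns elements [b * ARRAY_LEN, (b+1) * ARRAY_LEN).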
Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.
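For example, allocating the staging buffer with cudaMallocHost instead of malloc pins (page-locks) it, which typically speeds up host-to-device transfers and is required for cudaMemcpyAsync to be truly asynchronous; big_array and totalBytes are placeholders here:

char* big_array;
cudaMallocHost(&big_array, totalBytes);   // pinned (page-locked) host memory
// ... fill big_array ...
cudaMemcpy(d_big_array, big_array, totalBytes, cudaMemcpyHostToDevice);
cudaFreeHost(big_array);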
Try async: you can assign the same job to different streams, with each stream processing a small part of the data, so that transfer and computation happen at the same time.
Here is the code (nStreams and stream[] refer to previously created CUDA streams, and hostPtr should be pinned memory for the copies to overlap):
for (int i = 0; i < nStreams; ++i) {
    // Each chunk gets its own stream, so its copies and kernel can overlap with other chunks.
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}