Using device variable by multiple threads on CUDA

I am playing around with CUDA.
At the moment I have a problem. I am testing a large array for particular responses, and when I get a response, I have to copy the data into another array.
For example, my test array of 6 elements looks like this:
[ ][ ][v1][ ][ ][v2]
Result must look like this:
[v1][v2]
The problem is: how do I calculate the index in the second array at which to store each result? All elements of the first array are checked in parallel.
I am thinking of declaring a device variable int addr = 0 and incrementing addr every time I find a response. But I am not sure about that, because it means addr may be accessed by multiple threads at the same time. Will that cause problems? Or will each thread wait until another thread finishes using the variable?

It is not as trivial as it seems. I have just finished implementing one, so I can tell you what you need:
read the scan article in GPU Gems 3, in particular section 39.3.1, Stream Compaction.
To implement your own, start from the LargeArrayScan example in the SDK, which gives you just the prescan. Assume you have in device memory a selection array dev_selection_array (an array of 1s and 0s, where 1 means select and 0 means discard), the elements to select from in dev_elements_array, a dev_prescan_array, and a dev_result_array, all of size N. Then you do
prescan(dev_prescan_array,dev_selection_array, N);
scatter(dev_result_array, dev_prescan_array,
dev_selection_array, dev_elements_array, N);
where the scatter is
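// Assumption: the 0/1 flags and their prescan are stored as unsigned int,
// since prescan values are used as indices into the result array.
template <typename T>
__global__ void scatter_kernel(T* dev_result_array,
                               const unsigned int* dev_prescan_array,
                               const unsigned int* dev_selection_array,
                               const T* dev_elements_array, std::size_t size){
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= size) return;
    // the prescan of the flags gives the output position of each selected element
    if (dev_selection_array[idx] == 1){
        dev_result_array[dev_prescan_array[idx]] = dev_elements_array[idx];
    }
}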
For other nice applications of the prescan, see the paper Ble93.
Have fun!

You're talking about classic stream compaction. Generally I would recommend looking at Thrust or CUDPP (those links go to the compaction documentation). Both are open source. If you want to roll your own, I would also suggest looking at the 'scan' SDK sample.
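As a rough illustration of the Thrust route, here is a minimal sketch that compacts the example array with thrust::copy_if. The IsResponse predicate (non-zero means "response") and the compact helper are hypothetical names made up for this example, not something from the question.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical predicate: an element counts as a "response" when it is non-zero.
struct IsResponse {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Copies every element accepted by IsResponse into a dense array, keeping the original order.
std::vector<int> compact(const std::vector<int>& input) {
    thrust::device_vector<int> d_in(input.begin(), input.end());
    thrust::device_vector<int> d_out(input.size());

    // copy_if returns an iterator one past the last element it wrote,
    // so the number of selected elements is d_end - d_out.begin().
    auto d_end = thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), IsResponse());

    std::vector<int> result(d_end - d_out.begin());
    thrust::copy(d_out.begin(), d_end, result.begin());  // device to host
    return result;
}
```

Calling compact on {0, 0, 7, 0, 0, 9} should give back the two selected values in order, with no shared counter needed in user code.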

Related

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in such a simple example (the dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads from the first block run first, then the threads from the second block, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
Do all threads have their own copy of the "temp" variable or not? (If not, why is there no race condition?)
How does it operate? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation proceed? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provided is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the job" is to make each thread work on only one or a few elements.
For instance, if you have to process 100 integers with a specific algorithm, you probably want 100 threads to process 1 element each.
In CUDA the number of blocks and threads is defined at the kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values, which define the size in up to three dimensions.
There is no specific order in the execution of threads and blocks, as you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is declared inside the kernel without any memory space qualifier (such as __shared__), it is not shared between threads: each thread has its own copy, normally stored in a register.
This is equivalent to what is done on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using accesses to the device arrays.
Again, this is equivalent to what is done on the CPU side.
I think you should check whether you are comfortable enough with C/C++ programming on the CPU side before going further into GPU programming. No offense meant, but you seem to have gaps in several core topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specifics of the hardware.
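To make the per-thread versus per-block distinction concrete, here is one way the truncated kernel might be completed. This is a sketch, not the book's listing: the shared memory reduction, the partial output array, the n parameter (replacing the global N), and the dotOnGpu host helper are all assumptions added for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int threadsPerBlock = 256;  // power of two, required by the halving reduction below

__global__ void dot(const float* a, const float* b, float* partial, int n) {
    __shared__ float cache[threadsPerBlock];  // one copy per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;                        // one private copy per thread
    // Each thread accumulates only the elements tid, tid + stride, tid + 2*stride, ...
    while (tid < n) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;                // publish this thread's partial sum
    __syncthreads();
    // Tree reduction: add the per-thread sums of this block together.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];  // one value per block
}

// Hypothetical host helper (error checking omitted for brevity).
float dotOnGpu(const std::vector<float>& a, const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const int blocks = 32;
    float *dA, *dB, *dPartial;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    dot<<<blocks, threadsPerBlock>>>(dA, dB, dPartial, n);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dPartial);
    float sum = 0.0f;
    for (float p : partial) sum += p;         // final sum of the per-block results on the host
    return sum;
}
```

So temp never holds the whole dot product: it holds only the share of one thread, and the complete result appears only after the per-block reduction and the final host-side sum.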

CUDA and addressing bits in parallel

I want to write a CUDA program that returns the locations in a larger array that satisfy a specific criterion.
The trivial way to do it is to write a kernel that returns an array of integers holding 1 where the criterion was satisfied, or 0 where it was not.
Another way might be to return only the indexes that were found, but that would be problematic based on my knowledge of GPU synchronization (it's equivalent to implementing a queue/linked list on the GPU).
The problem with the first idea is that the result array would be as large as the input.
Another way I thought about is to create an array of n/8+1 bytes (n = number of items I check) and use 1 bit for each item (a sort of compressed representation of the output).
The only thing I could not find out is whether CUDA supports bit addressing in parallel.
An example of how I am doing it now:
__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize, int *gputSearchSize, int *resultsGPU)
{
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    // skip start positions where the search string would run past the end of the text
    if (start_idx > *gputTextSize - *gputSearchSize){return;}
    unsigned int wrong=0;
    for(int i=0; i<*gputSearchSize;i++){
        wrong = calculationOnGpu(gpuText, gpuFind, start_idx,i, gputSearchSize);
    }
    resultsGPU[start_idx] = !wrong;
}
What I want to do is, instead of using int or char for the "resultsGPU" variable, to use something else.
Thanks
A CUDA GPU can access items on boundaries of 1,2,4,8, or 16 bytes. It does not have the ability to independently access bits in a byte.
Bits in a byte would be modified by reading a larger size item, such as char or int, modifying the bit(s) in a register, then writing that item back to memory. Thus it would be a read-modify-write operation.
In order to preserve adjacent bits in such a scenario with multiple threads, it would be necessary to atomically update the item (char, int, etc.) There are no atomics that operate on char quantities, so the bits would need to be grouped into quantities of 32, and written e.g. as int. Following that idiom, every thread would be doing an atomic operation.
32 also happens to be the warp size currently, so a warp-based intrinsic might be a more efficient way to go here, in particular the warp vote __ballot() function. Something like this:
__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize, int *gputSearchSize, int *resultsGPU)
{
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (start_idx > *gputTextSize - *gputSearchSize){return;}
    unsigned int wrong=0;
    wrong = calculationOnGpu(gpuText, gpuFind, start_idx,0, gputSearchSize);
    // gather one bit per thread of the warp into a single 32-bit word
    wrong = __ballot(wrong);
    // the first lane of each warp writes the packed word
    if ((threadIdx.x & 31) == 0)
        resultsGPU[start_idx/32] = wrong;
}
You haven't provided complete code, so the above is just a sketch of how it might be done. I'm not sure the loop in your original kernel was an efficient approach anyway, and the above assumes 1 thread per data item to be searched. __ballot() should be safe even in the presence of inactive threads at one end or the other of the array being searched.
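For newer toolkits, a self-contained variant of the same idea might look like the sketch below. It uses __ballot_sync(), which replaced __ballot() from CUDA 9 onward. The criterion (a byte equal to target), the matchBits kernel, and the findMatches host helper are hypothetical stand-ins for the asker's calculationOnGpu, which was not shown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Packs one result bit per input element: bit (i % 32) of bits[i / 32] is set when data[i] == target.
// Assumes blockDim.x is a multiple of 32, so each warp covers one aligned group of 32 elements.
__global__ void matchBits(const char* data, int n, char target, unsigned int* bits) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return here: out-of-range threads vote 0, so every lane named in the mask reaches the ballot.
    int match = (idx < n) && (data[idx] == target);
    unsigned int word = __ballot_sync(0xffffffffu, match);
    // The first lane of each warp stores the packed word, with a plain store and no atomics.
    if ((threadIdx.x & 31) == 0 && idx < n) bits[idx / 32] = word;
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<unsigned int> findMatches(const std::vector<char>& data, char target) {
    const int n = static_cast<int>(data.size());
    const int words = (n + 31) / 32;
    std::vector<unsigned int> bits(words);
    if (n == 0) return bits;
    char* dData;
    unsigned int* dBits;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBits, words * sizeof(unsigned int));
    cudaMemcpy(dData, data.data(), n, cudaMemcpyHostToDevice);
    const int threads = 128;  // multiple of the warp size
    matchBits<<<(n + threads - 1) / threads, threads>>>(dData, n, target, dBits);
    cudaMemcpy(bits.data(), dBits, words * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBits);
    return bits;
}
```

The output needs one 32-bit word per 32 input items, which is the compressed representation the question was after.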

Loading from global memory

Suppose a simple kernel like this:
__global__ void fg(struct s_tp tp, struct s_param p)
{
const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
const uint tid = threadIdx.x;
const uint idx = bid * blockDim.x + tid;
if(idx >= p.ntp) return;
double3 r = tp.rh[idx];
double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
tp.d[idx] = d;
}
Are the following statements true?
double3 r = tp.rh[idx];
The data are loaded from global memory into the variable r.
r is stored in registers or, if there are many variables, in local memory.
r is not stored in shared memory.
d is calculated and then written back to global memory.
Registers are faster than other memories.
If the register space is full (in some big kernels), local memory is used, and access is slower.
When I need doubles, is there any way to speed it up? For example, loading the data into shared memory first and then operating on it?
Thanks to all.
Yes, it's pretty much all true.
When I need doubles, is there any way to speed it up? For example, loading the data into shared memory first and then operating on it?
Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.
Data reuse means that you are using (loading) the data more than once, and for shared memory to be useful, it means you are loading it more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler (automatic) "optimization" of storing it in a register is all you need.
EDIT
The answer given by @Jez has some good ideas for optimal loading. Another idea I would suggest is to convert your AoS (array of structures) data storage scheme to an SoA (structure of arrays) scheme. Data storage transformation is a common approach to improving the speed of CUDA codes.
Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:
__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry, double* s_tp_rz, double* s_tp_d, struct s_param p)
{
const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
const uint tid = threadIdx.x;
const uint idx = bid * blockDim.x + tid;
if(idx >= p.ntp) return;
double rx = s_tp_rx[idx];
double ry = s_tp_ry[idx];
double rz = s_tp_rz[idx];
double d = sqrt(rx*rx + ry*ry + rz*rz);
s_tp_d[idx] = d;
}
This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
It's all true.
When I need doubles, is there any way to speed it up? For example, loading the data into shared memory first and then operating on it?
For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example, a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing this gap is significant, optimisations may be possible.
Looking at your kernel, there are two things that strike me as potentially suboptimal:
There's not much work per thread. If possible, processing multiple elements per thread can improve performance. This is, in part, because it avoids thread initialisation/removal overheads.
Loading from a double3 isn't as efficient as loading from other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
EDIT: Robert Crovella's answer below gives a good solution to the second point which requires changing around your data type. For some reason I had originally thought this wasn't an option, so the below solution is probably over-the-top if you can just change your data type!
While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your datatype) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 double items per thread using a transpose operator, you can process two elements per thread, pack them into a double2, and store them normally.
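The first point (more work per thread) can be tried without any library. The sketch below combines a grid-stride loop with the separate-array layout from the other answer. The lengths kernel, the computeLengths host helper, and the fixed grid of 64 blocks are illustrative assumptions, not tuned values.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread processes elements i, i + stride, i + 2*stride, ... instead of exactly one element.
__global__ void lengths(const double* rx, const double* ry, const double* rz, double* d, int n) {
    const int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        // Consecutive threads read consecutive doubles from each array, so the loads coalesce.
        d[i] = sqrt(rx[i] * rx[i] + ry[i] * ry[i] + rz[i] * rz[i]);
    }
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<double> computeLengths(const std::vector<double>& rx,
                                   const std::vector<double>& ry,
                                   const std::vector<double>& rz) {
    const int n = static_cast<int>(rx.size());
    const size_t bytes = n * sizeof(double);
    double *dX, *dY, *dZ, *dD;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMalloc(&dZ, bytes);
    cudaMalloc(&dD, bytes);
    cudaMemcpy(dX, rx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, ry.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dZ, rz.data(), bytes, cudaMemcpyHostToDevice);
    // A fixed, modest grid: when n exceeds 64 * 256, every thread handles several elements.
    lengths<<<64, 256>>>(dX, dY, dZ, dD, n);
    std::vector<double> d(n);
    cudaMemcpy(d.data(), dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dY); cudaFree(dZ); cudaFree(dD);
    return d;
}
```

Whether this beats the one-element-per-thread version depends on the device and problem size, so it is worth measuring against the cudaMemcpy baseline mentioned above.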

using thrust::sort inside a thread

I would like to know if thrust::sort() can be used inside a thread
__global__
void mykernel(float* array, int arrayLength)
{
int threadID = blockIdx.x * blockDim.x + threadIdx.x;
// array is a vector in device global memory
// is it possible to use inside the thread?
thrust::sort(array, array+arrayLength);
// do something else with the array
}
If yes, does the sort launch other kernels to parallelize the sort?
Yes, thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread):
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
__global__
void mykernel(float* array, int arrayLength)
{
int threadID = blockIdx.x * blockDim.x + threadIdx.x;
// each thread sorts array
// XXX note this causes a data race
thrust::sort(thrust::seq, array, array + arrayLength);
}
Note that your example causes a data race because each CUDA thread attempts to sort the same data in parallel. A correct race-free program would partition array according to thread index.
The thrust::seq execution policy, which is required for this feature, is only available in Thrust v1.8 or better.
@aland already referred you to an earlier answer about calling Thrust's parallel algorithms on the GPU. In that case the asker was simply trying to sort data which was already on the GPU; Thrust called from the CPU can handle GPU-resident data by casting the raw pointers to device pointers.
Assuming your question is different and you really want to call a parallel sort in the middle of your kernel (as opposed to breaking the kernel into multiple smaller kernels and calling sort in between), then you should consider CUB, which provides a variety of primitives suitable for your purposes.
Update: Also see @Jared's answer, in which he explains that you can call Thrust's sequential algorithms from the GPU as of Thrust 1.7.
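A race-free variant of the thrust::seq approach could partition the array by thread index, as in the sketch below. The fixed segmentLength, the sortSegments kernel, and the sortEachSegment host helper are assumptions made for this illustration; the input length is assumed to be a non-zero multiple of segmentLength.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <vector>

// Each thread sequentially sorts its own disjoint segment, so no two threads touch the same data.
__global__ void sortSegments(float* array, int segmentLength, int numSegments) {
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID >= numSegments) return;
    float* first = array + threadID * segmentLength;
    thrust::sort(thrust::seq, first, first + segmentLength);
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<float> sortEachSegment(std::vector<float> data, int segmentLength) {
    const int numSegments = static_cast<int>(data.size()) / segmentLength;
    const size_t bytes = data.size() * sizeof(float);
    float* dData;
    cudaMalloc(&dData, bytes);
    cudaMemcpy(dData, data.data(), bytes, cudaMemcpyHostToDevice);
    const int threads = 64;
    sortSegments<<<(numSegments + threads - 1) / threads, threads>>>(dData, segmentLength, numSegments);
    cudaMemcpy(data.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return data;  // each segment is sorted independently; segments are not merged
}
```

Because every thread runs a sequential sort, this only pays off when there are many small segments; for one large array a parallel sort launched from the host is the better fit.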

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on a GeForce GT 520 card and getting VERY poor performance.
The application processes an image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
By contrast, when l_SrcIntegral is defined as a local array (as in the commented-out line) instead of using malloc/free, it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
if (l_ThreadIndex >= d_MaxParallelRows)
return;
int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
if (l_Row <= d_MaxTreatedRow) {
//float l_SrcIntegral[1700];
float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
for (int x=185; x<1407; x++) {
for (int i=0; i<1700; i++)
l_SrcIntegral[i] = i;
}
free(l_SrcIntegral);
}
}
int main()
{
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
int l_ThreadsPerBlock = k_ThreadsPerBlock;
int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
int l_FirstRow = d_MinTreatedRow;
while (l_FirstRow <= d_MaxTreatedRow) {
printf("CUDA: FirstRow=%d\n", l_FirstRow);
fflush(stdout);
myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
cudaDeviceSynchronize();
l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
}
printf("CUDA: Done\n");
return 0;
}
1.
As @aland said, you may encounter even worse performance when calculating just one row in each kernel call.
You have to think about processing the whole input at once, in order to use, at least in theory, the power of massively parallel processing.
Why start multiple kernels with just 320 threads, just to calculate one row?
How about using as many blocks as you have rows and letting the threads of each block process one row?
(320 threads per block is not a good choice; check how to reach better occupancy.)
2.
If fast resources such as registers and shared memory are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector of an arbitrarily sized matrix.
Each block processes one column, and in a tile loop the threads of that block store their elements in a shared memory array. When finished, they calculate the sum using parallel reduction, then start the next iteration.
At the end, each block has calculated the sum of its vector.
You can still use dynamic array sizes using shared memory. Just pass a third argument in the <<<...>>> of the kernel call. That'd be the size of your shared memory per block.
Once you're there, just bring all relevant data into your shared array (you should still try to keep coalesced accesses) bringing one or several (if it's relevant to keep coalesced accesses) elements per thread. Sync threads after it's been brought (only if you need to stop race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.
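The suggestions above (one block per row, a tile loop with range checks, and shared memory sized at launch time) can be combined in a sketch like the following. It sums rows rather than columns so that accesses to a row-major matrix stay coalesced. The rowSums kernel and the rowSumsOnGpu host helper are hypothetical names, and the block size is assumed to be a power of two.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per row. The shared array is sized by the third <<<...>>> argument, not at compile time.
__global__ void rowSums(const float* matrix, int cols, float* sums) {
    extern __shared__ float tile[];            // blockDim.x floats, allocated at launch
    const float* row = matrix + static_cast<size_t>(blockIdx.x) * cols;
    float total = 0.0f;                        // only thread 0's copy is used
    for (int base = 0; base < cols; base += blockDim.x) {
        int col = base + threadIdx.x;
        tile[threadIdx.x] = (col < cols) ? row[col] : 0.0f;  // range check on the last tile
        __syncthreads();                       // the whole tile is loaded before reducing
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) total += tile[0];
    }
    if (threadIdx.x == 0) sums[blockIdx.x] = total;
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<float> rowSumsOnGpu(const std::vector<float>& matrix, int rows, int cols) {
    float *dMatrix, *dSums;
    cudaMalloc(&dMatrix, matrix.size() * sizeof(float));
    cudaMalloc(&dSums, rows * sizeof(float));
    cudaMemcpy(dMatrix, matrix.data(), matrix.size() * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 128;                   // power of two, required by the reduction
    rowSums<<<rows, threads, threads * sizeof(float)>>>(dMatrix, cols, dSums);
    std::vector<float> sums(rows);
    cudaMemcpy(sums.data(), dSums, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dMatrix);
    cudaFree(dSums);
    return sums;
}
```

The row length can be any runtime value, since only the tile (not the whole row) has to fit in shared memory, and no per-thread malloc is involved.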