CUDA array-to-array sum

I have a short piece of code like this:
typedef struct {
    double sX;
    double sY;
    double vX;
    double vY;
    int rX;
    int rY;
    int mass;
    int species;
    int boxnum;
} particle;

typedef struct {
    double mX;
    double mY;
    double count;
    int rotDir;
    double cX;
    double cY;
    int superDir;
} box;

//....

int i, boxnum;
for (i = 0; i < PART_COUNT; i++) {
    particles[i].boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
}
for (i = 0; i < PART_COUNT; i++) {
    // sum the momenta
    boxnum = particles[i].boxnum;
    boxes[boxnum].mX += particles[i].vX*particles[i].mass;
    boxes[boxnum].mY += particles[i].vY*particles[i].mass;
    boxes[boxnum].count++;
}
Now, I want to port this to CUDA. The first step is easy: spreading the calculation across a bunch of threads is no problem. The issue is the second loop. Since any two particles are equally likely to end up in the same box, I'm not sure how I can partition the work so as to avoid conflicts.
The number of particles is on the order of 10,000 to 10,000,000, and the number of boxes is on the order of 1,024 to 1,048,576.
Ideas?

You can try to use atomicAdd operations to modify your boxes array. Atomic operations on global memory are very slow, but at the same time it's practically impossible to do any optimization involving shared memory here, for two reasons:
1. Under the assumption that the boxnum properties of particles[0]..particles[n] aren't ordered and don't lie within any small range (on the order of a block size), you can't predict which boxes to load from global memory into shared memory. You would first have to collect all the box numbers.
2. If you try to collect all the box numbers, you can't use an array indexed by every possible box number, since there are far too many boxes to fit into shared memory. So you would have to collect the indices with a queue (realized with an array, a pointer to the next free slot and atomic operations), but then you would still have conflicts because the same box number could occur multiple times in your queue.
Conclusion: atomicAdd will at least give you correct behavior. Try it out and test the performance. If you aren't satisfied with the performance, think about whether there's another way to do the same computation that would profit from shared memory.
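For reference, a minimal sketch of the atomicAdd version, assuming the particle and box arrays have already been copied to the device (the kernel name and launch configuration are just illustrative):
__global__ void sumMomenta(const particle *particles, box *boxes, int partCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= partCount) return;

    int b = particles[i].boxnum;
    // note: atomicAdd on double requires compute capability 6.0+;
    // on older devices use the CAS-based emulation from the programming guide
    atomicAdd(&boxes[b].mX, particles[i].vX * particles[i].mass);
    atomicAdd(&boxes[b].mY, particles[i].vY * particles[i].mass);
    atomicAdd(&boxes[b].count, 1.0);
}

// launch, e.g.:
// sumMomenta<<<(PART_COUNT + 255) / 256, 256>>>(d_particles, d_boxes, PART_COUNT);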

As an alternative, you could launch a 2D grid of blocks:
blocks.x = numParticles / threadsPerBlock / repeatPerBlock;
blocks.y = numOfBoxes / 1024;
Each block performs atomic additions in shared memory if and only if boxnum lies between 1024 * blockIdx.y and 1024 * (blockIdx.y + 1). This is followed by a reduction along blocks.x.
This may or may not be faster than atomicAdd on global memory, as the data is read blocks.y times. This could, however, be fixed if the particles are sorted by boxnum in a sorting pass followed by a partitioning pass.
There may be several other ways to do it, but since the problem size varies by a large amount, you may end up having to write 2-3 different methods that are optimized for a given size range.
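A rough sketch of that 2D-grid idea follows. The names are made up, and the final reduction along blocks.x is realized here with one global atomicAdd per box per block, which is only one of several ways to do it:
#define BOXES_PER_TILE 1024

__global__ void sumMomentaTiled(const particle *particles, box *boxes,
                                int partCount, int repeatPerBlock)
{
    __shared__ double s_mX[BOXES_PER_TILE];
    __shared__ double s_mY[BOXES_PER_TILE];
    __shared__ int    s_count[BOXES_PER_TILE];

    // zero this block's tile of boxes
    for (int j = threadIdx.x; j < BOXES_PER_TILE; j += blockDim.x) {
        s_mX[j] = 0.0; s_mY[j] = 0.0; s_count[j] = 0;
    }
    __syncthreads();

    int tileStart = blockIdx.y * BOXES_PER_TILE;              // boxes owned by this block row
    int base      = blockIdx.x * blockDim.x * repeatPerBlock; // particles owned by this block column

    for (int r = 0; r < repeatPerBlock; r++) {
        int i = base + r * blockDim.x + threadIdx.x;
        if (i < partCount) {
            int b = particles[i].boxnum;
            if (b >= tileStart && b < tileStart + BOXES_PER_TILE) {
                // shared-memory atomics (double atomics need compute capability 6.0+)
                atomicAdd(&s_mX[b - tileStart], particles[i].vX * particles[i].mass);
                atomicAdd(&s_mY[b - tileStart], particles[i].vY * particles[i].mass);
                atomicAdd(&s_count[b - tileStart], 1);
            }
        }
    }
    __syncthreads();

    // "reduction along blocks.x": flush this block's partial sums to global memory
    for (int j = threadIdx.x; j < BOXES_PER_TILE; j += blockDim.x) {
        if (s_count[j] > 0) {
            atomicAdd(&boxes[tileStart + j].mX, s_mX[j]);
            atomicAdd(&boxes[tileStart + j].mY, s_mY[j]);
            atomicAdd(&boxes[tileStart + j].count, (double)s_count[j]);
        }
    }
}

// launch, e.g.:
// dim3 grid(numParticles / (threadsPerBlock * repeatPerBlock), numOfBoxes / BOXES_PER_TILE);
// sumMomentaTiled<<<grid, threadsPerBlock>>>(d_particles, d_boxes, numParticles, repeatPerBlock);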

Related

Loading from global memory

Suppose a simple kernel like this:
__global__ void fg(struct s_tp tp, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;
    if (idx >= p.ntp) return;
    double3 r = tp.rh[idx];
    double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    tp.d[idx] = d;
}
Is this true?:
double3 r = tp.rh[idx];
The data is loaded from global memory into the r variable.
r is stored in registers or, if there are too many variables, in local memory.
r is not stored in shared memory.
d is calculated and afterwards written back to global memory.
Registers are faster than other memories.
If the register space is full (in some big kernels), local memory is used, and the access is slower.
When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
Thanks to all.
Yes, it's pretty much all true.
• When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.
Data reuse means that you are using (loading) the data more than once, and for shared memory to be useful, it means you are loading it more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler (automatic) "optimization" of storing it in a register is all you need.
EDIT
The answer given by @Jez has some good ideas for optimal loading. Another suggestion is to convert your AoS data storage scheme to an SoA scheme. Data storage transformation is a common approach to improving the speed of CUDA codes.
Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:
__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry, double* s_tp_rz, double* s_tp_d, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;
    if (idx >= p.ntp) return;
    double rx = s_tp_rx[idx];
    double ry = s_tp_ry[idx];
    double rz = s_tp_rz[idx];
    double d = sqrt(rx*rx + ry*ry + rz*rz);
    s_tp_d[idx] = d;
}
This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
It's all true.
When I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example, a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing this gap is significant, optimisations may be possible.
Looking at your kernel there are two things that strike me as potentially suboptimal:
There's not much work per thread. If possible, processing multiple elements per thread can improve performance. This is, in part, because it avoids thread initialisation/removal overheads.
Loading from a double3 isn't as efficient as loading from other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
EDIT: Robert Crovella's answer gives a good solution to the second point, which requires changing around your data type. For some reason I had originally thought this wasn't an option, so the solution below is probably over-the-top if you can just change your data type!
While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your datatype) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 double items per thread using a transpose operator, you can process two elements per thread, pack them into a double2, and store them normally.
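A rough sketch of the CUB idea, under the assumption that the coordinates sit in memory as a flat array of doubles (x0 y0 z0 x1 y1 z1 ..., which is how a packed double3 array is laid out); the kernel and parameter names are hypothetical, and tail handling is omitted for brevity:
#include <cub/cub.cuh>

const int BLOCK_THREADS    = 128;
const int ITEMS_PER_THREAD = 6;      // 6 doubles = 2 points per thread

__global__ void fg_cub(const double *rh_flat, double *d, int npoints)
{
    typedef cub::BlockLoad<double, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    // each block consumes BLOCK_THREADS * 2 points, i.e. BLOCK_THREADS * 6 doubles
    int block_point_offset  = blockIdx.x * BLOCK_THREADS * 2;
    int block_double_offset = block_point_offset * 3;

    // coalesced (striped) load, transposed so each thread ends up
    // with the 6 consecutive doubles of its two points
    double items[ITEMS_PER_THREAD];
    BlockLoadT(temp_storage).Load(rh_flat + block_double_offset, items);

    double2 out;
    out.x = sqrt(items[0]*items[0] + items[1]*items[1] + items[2]*items[2]);
    out.y = sqrt(items[3]*items[3] + items[4]*items[4] + items[5]*items[5]);

    // pack two results into a double2 and store with a single 128-bit write
    int point_idx = block_point_offset + threadIdx.x * 2;
    if (point_idx + 1 < npoints)
        reinterpret_cast<double2*>(d)[point_idx / 2] = out;
}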

Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel; in this loop a certain predicate is evaluated on the current state and, depending on the result of this predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very, very low.
I would like to split the kernel into many little kernels, but every block is totally independent from the others, and I (think I) can't use a single thread in the host code to launch multiple small kernels.
I am not sure if streams are adequate for this kind of work, I have never used them, but since I have the option to use dynamic parallelism, I would like to know if that is a good option for implementing this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data to global memory to make it available to a sub-kernel?
If I split my big kernel into many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which allows me to move temporary variables into every sub-kernel), will that help me increase the occupancy?
I know it is a bit of a generic question, but I do not know this technology and I would like to know whether it fits my case, or whether streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;

    int tid = ...;
    systemState[tid] = initialData[tid];

    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;

        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;

        sample(sampleData, systemState);

        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }

    do_some_stats();
}
This is to show you that there is a main loop, that there is temporary data which is used in some places and not in others, that there is shared data, synchronization points, etc.
Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not strictly true, but logically it is)... one "big flow" for each block.
Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host, I guess... But how do I coordinate all the blocks from a single loop? This is what leaves me most dubious. May I use streams from different host threads (one thread per block)?
I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure whether I would have any advantage here.
I have benefitted from dynamic parallelism for solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;
for (int m = 0; m < (2*K+1); m++) {
    PP1      = calculate_PP1(i, m);
    phi_cap1 = calculate_phi_cap1(i, m);
    for (int n = 0; n < (2*K+1); n++) {
        PP2      = calculate_PP2(i, n);
        phi_cap2 = calculate_phi_cap2(i, n);
        atomicAdd(&result[PP1][PP2], data[i]*phi_cap1*phi_cap2);
    }
}
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them in a (2K+1)x(2K+1) kernel.
From my (possibly incomplete) experience, dynamic parallelism will help if you have a small number of independent iterations. For a larger number of iterations, you may end up calling the child kernel many times, so you should check whether the kernel launch overhead becomes the limiting factor.
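For illustration only, the parent/child structure might look something like the sketch below. It assumes compute capability 3.5+ and compilation with relocatable device code (-rdc=true), that the calculate_* routines from the snippet above are __device__ functions, and that result is stored as a flattened array; all of these are assumptions rather than details from the original code:
#define K 6

__global__ void child(int i, const double *data, double *result, int result_pitch)
{
    int m = threadIdx.x;   // 0 .. 2K
    int n = threadIdx.y;   // 0 .. 2K

    int    PP1      = calculate_PP1(i, m);
    int    PP2      = calculate_PP2(i, n);
    double phi_cap1 = calculate_phi_cap1(i, m);
    double phi_cap2 = calculate_phi_cap2(i, n);

    // double atomicAdd needs compute capability 6.0+ (or a CAS-based emulation)
    atomicAdd(&result[PP1 * result_pitch + PP2], data[i] * phi_cap1 * phi_cap2);
}

__global__ void parent(const double *data, double *result, int result_pitch, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= N) return;

    // one (2K+1) x (2K+1) child launch per parent thread; with many parent
    // threads the launch overhead itself can become the limiting factor
    child<<<1, dim3(2*K + 1, 2*K + 1)>>>(i, data, result, result_pitch);
}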

Copying 2D arrays to GPU of known variable width

I am looking into how to copy a 2D array of variable width for each row into the GPU.
int rows = 1000;
int cols;
int **host_matrix = (int **)malloc(sizeof(int*) * rows);
int **d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, that won't be a clever way of doing it:
cudaMalloc(d_array, rows*sizeof(int*));
for(int i = 0 ; i < rows ; i++) {
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}
But I cannot think of any other method. Is there any other smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows*sizeof(int*));
cudaMalloc((void **)&d_array, nrows*sizeof(int*));
for(int i = 0 ; i < nrows ; i++) {
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and copying one of the sparse matrix formats for your data instead. Using the classic compressed sparse row (CSR) format as a model, you could do something like this:
int *data, *rows, *lengths;
cudaMalloc((void **)&rows, nrows*sizeof(int));
cudaMalloc((void **)&lengths, nrows*sizeof(int));
cudaMalloc((void **)&data, N*sizeof(int));
In this scheme, store all the data in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]] and each row has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows+1 in your current scheme, i.e. it reduces the overhead from O(N) to O(1).
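A sketch of how the host side might pack and transfer the data under this scheme, using a CSR-style offsets array of nrows+1 entries (the h_* and d_* names are hypothetical; host_matrix and length are the arrays from the question):
// host side: build the row-offset table and flatten the jagged array
int *h_rows = (int *)malloc((nrows + 1) * sizeof(int));
h_rows[0] = 0;
for (int i = 0; i < nrows; i++)
    h_rows[i + 1] = h_rows[i] + length[i];
int N = h_rows[nrows];                       // total number of elements

int *h_data = (int *)malloc(N * sizeof(int));
for (int i = 0; i < nrows; i++)
    memcpy(h_data + h_rows[i], host_matrix[i], length[i] * sizeof(int));

// device side: three allocations and three copies, regardless of nrows
int *d_data, *d_rows, *d_lengths;
cudaMalloc((void **)&d_data,    N * sizeof(int));
cudaMalloc((void **)&d_rows,    (nrows + 1) * sizeof(int));
cudaMalloc((void **)&d_lengths, nrows * sizeof(int));
cudaMemcpy(d_data,    h_data, N * sizeof(int),           cudaMemcpyHostToDevice);
cudaMemcpy(d_rows,    h_rows, (nrows + 1) * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_lengths, length, nrows * sizeof(int),       cudaMemcpyHostToDevice);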
I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0 and so on, i.e. A[i] = length[i].
Then you only need to allocate two arrays on the card and call cudaMemcpy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card).

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But when l_SrcIntegral is defined as a local array (as in the commented-out line) instead of using malloc/free, it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
    int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
    if (l_ThreadIndex >= d_MaxParallelRows)
        return;
    int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
    if (l_Row <= d_MaxTreatedRow) {
        //float l_SrcIntegral[1700];
        float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
        for (int x=185; x<1407; x++) {
            for (int i=0; i<1700; i++)
                l_SrcIntegral[i] = i;
        }
        free(l_SrcIntegral);
    }
}

int main()
{
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(0);
    int l_ThreadsPerBlock = k_ThreadsPerBlock;
    int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
    int l_FirstRow = d_MinTreatedRow;
    while (l_FirstRow <= d_MaxTreatedRow) {
        printf("CUDA: FirstRow=%d\n", l_FirstRow);
        fflush(stdout);
        myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
        cudaDeviceSynchronize();
        l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
    }
    printf("CUDA: Done\n");
    return 0;
}
1.
As @aland said, you may even encounter worse performance calculating just one row in each kernel call.
You have to think about processing the whole input to actually use the power of massively parallel processing.
Why start multiple kernels with just 320 threads just to calculate one row?
How about using as many blocks as you have rows, and letting the threads of each block process one row?
(320 threads per block is not a good choice; check out how to reach better occupancy.)
2.
If fast resources such as registers and shared memory are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector of an arbitrarily sized matrix.
Each block processes one column, and the threads of that block store their elements in a shared memory array in a tile loop. When finished, they calculate the sum using a parallel reduction, and then start the next iteration.
At the end, each block has calculated the sum of its column.
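A condensed sketch of that column-sum pattern: one block per column, each thread accumulating a partial sum in a register while striding down the rows, then a single parallel reduction in shared memory at the end. This accumulates in registers rather than reducing on every tile iteration, which is a simplification of the loop described above; all names are illustrative, and the matrix is assumed to be stored column-major so the loads stay coalesced:
#define TILE 128   // threads per block

__global__ void columnSums(const float *matrix, float *colSums, int nrows, int ncols)
{
    __shared__ float s_partial[TILE];
    int col = blockIdx.x;                 // one block per column
    float acc = 0.0f;

    // tile loop: each thread strides down the rows of its column
    for (int row = threadIdx.x; row < nrows; row += TILE)
        acc += matrix[col * nrows + row]; // column-major keeps these reads coalesced

    s_partial[threadIdx.x] = acc;
    __syncthreads();

    // parallel reduction in shared memory
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            s_partial[threadIdx.x] += s_partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        colSums[col] = s_partial[0];
}

// launch: columnSums<<<ncols, TILE>>>(d_matrix, d_colSums, nrows, ncols);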
You can still use dynamic array sizes with shared memory. Just pass a third argument in the <<<...>>> of the kernel call; that will be the size of your shared memory per block.
Once you're there, just bring all the relevant data into your shared array, bringing one or several elements per thread (whichever helps keep the accesses coalesced). Sync the threads after the data has been brought in (only if you need to prevent race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but it could still be done by tessellating through blocks/threads instead of nested for loops (which are VERY bad for performance!). I hope you're running your sample code with just 1 block and 1 thread, otherwise it wouldn't make much sense.
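A minimal sketch of the dynamic shared memory mechanism described above, with the row length passed at launch time instead of being a compile-time constant (all names are illustrative):
__global__ void processRows(const float *src, float *dst, int rowLength)
{
    // size supplied by the third <<<...>>> launch parameter
    extern __shared__ float s_row[];

    int row = blockIdx.x;
    const float *rowPtr = src + row * rowLength;

    // each thread brings in several elements, keeping accesses coalesced
    for (int i = threadIdx.x; i < rowLength; i += blockDim.x)
        s_row[i] = rowPtr[i];
    __syncthreads();        // make sure the whole row is in shared memory

    // ... operate on s_row[] here ...

    for (int i = threadIdx.x; i < rowLength; i += blockDim.x)
        dst[row * rowLength + i] = s_row[i];
}

// launch: one block per row, shared memory size given as the third parameter
// processRows<<<numRows, 256, rowLength * sizeof(float)>>>(d_src, d_dst, rowLength);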

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere in CUDA. Now, I need a big set of threads to sequentially do a bunch of operations on that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory and read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading it many times would be quite wasteful... Any ideas?
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
    __shared__ double window[block_size];

    // cooperate with my block to load block_size elements
    window[threadIdx.x] = ptr[threadIdx.x];

    // wait until the window is full
    __syncthreads();

    // operate on the data
    ...
}
You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
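For completeness, a sketch of what that sliding might look like in full: every thread of the block still visits every element, but each element is read from global memory only once per block (the per-thread work is left as a placeholder):
const unsigned int block_size = 256;

__global__ void kernel(const double *ptr, unsigned int n)
{
    __shared__ double window[block_size];

    // slide the window over the whole array, block_size elements at a time
    for (unsigned int base = 0; base < n; base += block_size) {
        unsigned int i = base + threadIdx.x;

        // cooperative load: one global read per element per block
        window[threadIdx.x] = (i < n) ? ptr[i] : 0.0;
        __syncthreads();

        // every thread now walks the shared copy of this chunk
        unsigned int limit = min(block_size, n - base);
        for (unsigned int j = 0; j < limit; j++) {
            double value = window[j];   // broadcast from shared memory
            // ... perform this thread's work with 'value' ...
        }
        __syncthreads();   // don't overwrite the window while others still read it
    }
}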