Loading from global memory - cuda

Suppose a simple kernel like this:
__global__ void fg(struct s_tp tp, struct s_param p)
{
const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
const uint tid = threadIdx.x;
const uint idx = bid * blockDim.x + tid;
if(idx >= p.ntp) return;
double3 r = tp.rh[idx];
double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
tp.d[idx] = d;
}
Is the following true?
double3 r = tp.rh[idx];
The data are loaded from global memory into the variable r.
r is stored in registers or, if there are many variables, in local memory.
r is not stored in shared memory.
d is calculated and then written back to global memory.
Registers are faster than the other memories.
If the register space is full (as in some big kernels), local memory is used, and access is slower.
when I need doubles, is there any way to speed it up? For example load data firstly into shared memory and then operate them?
Thanks to all.

Yes, it's pretty much all true.
•when I need doubles, is there any way to speed it up? For example load data firstly into shared memory and then operate them?
Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.
Data reuse means that you are using (loading) the data more than once, and for shared memory to be useful, it means you are loading it more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler (automatic) "optimization" of storing it in a register is all you need.
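To make the reuse case concrete, here is a minimal sketch of a kernel where shared memory does pay off, unlike the kernel in the question. It is a 3-point sum in which every input element is needed by up to three neighbouring threads. The names stencil3, runStencil3 and STENCIL_BLOCK are made up for this illustration and are not from the question.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; the kernel relies on blockDim.x == STENCIL_BLOCK.
#define STENCIL_BLOCK 256

// out[g] = in[g-1] + in[g] + in[g+1], with out-of-range neighbours treated as 0.
// Each input is used by up to three threads, so the block stages it once in
// shared memory instead of having three threads fetch it from global memory.
__global__ void stencil3(const double* in, double* out, int n)
{
    __shared__ double tile[STENCIL_BLOCK + 2];   // tile plus one halo cell per side
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + 1;                     // local index, shifted past the left halo

    tile[l] = (g < n) ? in[g] : 0.0;
    if (threadIdx.x == 0)                        // first thread also loads the left halo
        tile[0] = (g > 0 && g - 1 < n) ? in[g - 1] : 0.0;
    if (threadIdx.x == blockDim.x - 1)           // last thread also loads the right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0;
    __syncthreads();

    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}

// Hypothetical host wrapper: copies the input over, runs the kernel, copies back.
std::vector<double> runStencil3(const std::vector<double>& h_in)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(double);
    std::vector<double> h_out(n);
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    int blocks = (n + STENCIL_BLOCK - 1) / STENCIL_BLOCK;
    stencil3<<<blocks, STENCIL_BLOCK>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In the kernel from the question each element is read by exactly one thread, so there is nothing to share and the plain load into a register is already the right thing.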
EDIT
The answer given by @Jez has some good ideas for optimal loading. Another idea I would suggest is to convert your AoS (array of structures) data storage scheme to an SoA (structure of arrays) scheme. Data storage transformation is a common approach to improving the speed of CUDA codes.
Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:
__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry, double* s_tp_rz, double* s_tp_d, struct s_param p)
{
const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
const uint tid = threadIdx.x;
const uint idx = bid * blockDim.x + tid;
if(idx >= p.ntp) return;
double rx = s_tp_rx[idx];
double ry = s_tp_ry[idx];
double rz = s_tp_rz[idx];
double d = sqrt(rx*rx + ry*ry + rz*rz);
s_tp_d[idx] = d;
}
This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
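If the data already exists as an array of double3 on the host, the repacking step might look like the sketch below. PositionsSoA and toSoA are hypothetical names chosen for this example; your real s_tp layout is not shown in the question, so treat the field names as assumptions.

```cuda
#include <vector>
#include <cuda_runtime.h>   // provides the double3 type on the host as well

// Hypothetical host-side container for the SoA layout: one array per component.
struct PositionsSoA {
    std::vector<double> x, y, z;
};

// Split an array of double3 (AoS) into three separate arrays (SoA).
PositionsSoA toSoA(const std::vector<double3>& aos)
{
    PositionsSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const double3& r : aos) {
        soa.x.push_back(r.x);
        soa.y.push_back(r.y);
        soa.z.push_back(r.z);
    }
    return soa;
}
```

Each of soa.x, soa.y and soa.z can then be copied to its own device array with cudaMemcpy and passed as s_tp_rx, s_tp_ry and s_tp_rz, so that adjacent threads in a warp read adjacent doubles.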

It's all true.
when I need doubles, is there any way to speed it up? For example load
data firstly into shared memory and then operate them?
For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example, a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing this gap is significant, optimisations may be possible.
Looking at your kernel there are two things that strike me as potentially suboptimal:
There's not much work per thread. If possible, processing multiple elements per thread can improve performance. This is, in part, because it avoids thread initialisation/removal overheads.
Loading from a double3 isn't as efficient as loading from other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
EDIT: Robert Crovella's answer gives a good solution to the second point, which requires changing your data type. For some reason I had originally thought this wasn't an option, so the solution below is probably over-the-top if you can just change your data type!
While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your data type) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 double items per thread using a transpose operator, you can process two elements per thread, pack the results into a double2, and store them normally.
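The first point (more work per thread) is commonly done with a grid-stride loop. The following is only a sketch of that idea, not the CUB approach described above. It keeps the double3 input from the question but takes raw pointers instead of the s_tp struct, and fgStrided and runFgStrided are names invented for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2*stride, ... so a small,
// fixed-size grid can cover an array of any length.
__global__ void fgStrided(const double3* rh, double* d, unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        double3 r = rh[i];
        d[i] = sqrt(r.x * r.x + r.y * r.y + r.z * r.z);
    }
}

// Hypothetical host wrapper; blocks and threads are chosen by the caller.
std::vector<double> runFgStrided(const std::vector<double3>& h_rh, int blocks, int threads)
{
    unsigned int n = static_cast<unsigned int>(h_rh.size());
    std::vector<double> h_d(n);
    double3* d_rh = nullptr;
    double* d_d = nullptr;
    cudaMalloc(&d_rh, n * sizeof(double3));
    cudaMalloc(&d_d, n * sizeof(double));
    cudaMemcpy(d_rh, h_rh.data(), n * sizeof(double3), cudaMemcpyHostToDevice);
    fgStrided<<<blocks, threads>>>(d_rh, d_d, n);
    cudaError_t err = cudaMemcpy(h_d.data(), d_d, n * sizeof(double), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_rh);
    cudaFree(d_d);
    return h_d;
}
```

Calling runFgStrided(data, 4, 128) on 10,000 elements, for example, gives each thread about 20 elements. Whether that is faster than one element per thread depends on the device, so it is worth timing both.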

Related

CUDA Constant Memory Best Practices

I present here some code
__constant__ int array[1024];
__global__ void kernel1(int *d_dst) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = array[tId];
}
__global__ void kernel2(int *d_dst, int *d_src) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = d_src[tId];
}
int main(int argc, char **argv) {
int *d_array;
int *d_src;
cudaMalloc((void**)&d_array, sizeof(int) * 1024);
cudaMalloc((void**)&d_src, sizeof(int) * 1024);
int *test = new int[1024];
memset(test, 0, sizeof(int) * 1024);
for (int i = 0; i < 1024; i++) {
test[i] = 100;
}
cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
kernel1<<< 1, 1024 >>>(d_array);
cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
kernel2<<<1, 32 >>>(d_array, d_src);
delete[] test; // allocated with new[], so release with delete[] rather than free()
cudaFree(d_array);
cudaFree(d_src);
return 0;
}
This simply shows constant memory and global memory usage. When run, kernel2 executes about 4 times faster than kernel1.
I understand from the CUDA C Programming Guide that this is because accesses to constant memory are getting serialized. This brings me to the idea that constant memory is best utilized when a warp accesses a single constant value such as an integer, float, or double, but that accessing an array is not beneficial at all. In other terms, I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more? I mean a structure might contain multiple simple types and array for example; when accessing these simple types, are these accesses also serialized or not?
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Can anyone point me to some example code that shows efficient constant memory usage?
regards,
I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
Yes this is generally correct and is the principal intent of usage of constant memory/constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording is as follows:
The constant memory space resides in device memory and is cached in the constant cache.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
An important takeaway from the text above is the desire for uniform access across a warp to achieve best performance. If a warp makes a request to __constant__ memory where different threads in the warp are accessing different locations, those requests will be serialized. Therefore if each thread in a warp is accessing the same value:
int i = array[20];
then you will have the opportunity for good benefit from the constant cache/memory. If each thread in a warp is accessing a unique quantity:
int i = array[threadIdx.x];
then the accesses will be serialized, and the constant data usage will be disappointing, performance-wise.
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more?
You can certainly put structures in constant memory. The same rules apply:
int i = constant_struct_ptr->array[20];
has the opportunity to benefit, but
int i = constant_struct_ptr->array[threadIdx.x];
does not. If you access the same simple type structure element across threads, that is ideal for constant cache usage.
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Yes, if you know that in general your accesses will break the constant memory rule of one 32-bit quantity per cycle, then you'll probably be better off leaving the data in ordinary global memory.
There are a variety of cuda sample codes that demonstrate usage of __constant__ data. Here are a few:
graphics volumeRender
imaging bilateralFilter
imaging convolutionTexture
finance MonteCarloGPU
and there are others.
EDIT: responding to a question in the comments, if we have a structure like this in constant memory:
struct Simple { int a; int b; int c; } s;
And we access it like this:
int p = s.a + s.b + s.c;
         ^     ^     ^
         |     |     |
cycle:   1     2     3
We will have good usage of the constant memory/cache. When the C code gets compiled, under the hood it will generate machine code accesses corresponding to 1,2,3 in the diagram above. Let's imagine that access 1 occurs first. Since access 1 is to the same memory location independent of which thread in the warp, during cycle 1, all threads will receive the value in s.a and it will take advantage of the cache for best possible benefit. Likewise for accesses 2 and 3. If on the other hand we had:
struct Simple { int a[32]; int b[32]; int c[32]; } s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];
This would not give good usage of constant memory/cache. Instead, if this were typical of our accesses to s, we'd probably have better performance locating s in ordinary global memory.
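As an illustration of the uniform-access pattern, here is a minimal sketch in which every thread in a warp reads the same constant address at each step. It evaluates a small polynomial whose coefficients live in __constant__ memory. The names c_coef, evalPoly, runEvalPoly and NCOEF are assumptions made for this example and do not come from the CUDA samples listed above.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define NCOEF 4
__constant__ float c_coef[NCOEF];   // hypothetical polynomial coefficients

// y[i] = c_coef[0] + c_coef[1]*x + c_coef[2]*x^2 + c_coef[3]*x^3 (Horner form).
__global__ void evalPoly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];                 // per-thread data comes from global memory
    float acc = 0.0f;
    // k is the same for every thread in the warp on each iteration, so all
    // threads read the same constant address and no serialization occurs.
    for (int k = NCOEF - 1; k >= 0; --k)
        acc = acc * xi + c_coef[k];
    y[i] = acc;
}

// Hypothetical host wrapper: coef must point to NCOEF floats.
std::vector<float> runEvalPoly(const float* coef, const std::vector<float>& h_x)
{
    int n = static_cast<int>(h_x.size());
    std::vector<float> h_y(n);
    float *d_x = nullptr, *d_y = nullptr;
    cudaMemcpyToSymbol(c_coef, coef, NCOEF * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    evalPoly<<<(n + 127) / 128, 128>>>(d_x, d_y, n);
    cudaError_t err = cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}
```

The per-thread values (x and y) stay in global memory, where a different address per thread is the normal, coalesced case. Only the values shared by every thread go into constant memory.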

Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, the kernel contains a loop in which a certain predicate is evaluated on the current state and, depending on the result of this predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very low.
I would like to split the kernel into many little kernels, but every block is totally independent of the others and I (think I) can't use a single thread in the host code to launch multiple small kernels.
I am not sure if streams are adequate for this kind of work, as I have never used them, but since I have the option of using dynamic parallelism, I would like to know if that is a good option for implementing this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data in global memory to make them available to a sub-kernel?
If I split my big kernel into many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which allows me to move temporary variables into every sub-kernel), will that help me increase the occupancy?
I know it is a rather generic question, but I do not know this technology and I would like to know if it fits my case or if streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
__shared__ int systemState[N];
__shared__ int someTemp[N * 3];
__shared__ int time;
int tid = ...;
systemState[tid] = initialData[tid];
while (time < TIME_END) {
bool c = calc_something(systemState);
if (c)
break;
someTemp[tid] = do_something(systemState);
c = do_check(someTemp);
if (__syncthreads_or(c))
break;
sample(sampleData, systemState);
if (__syncthreads_and(...)) {
do_something(systemState);
sync();
time += some_increment(systemState);
}
else {
calcNewTemp(someTemp, systemState);
sync();
do_something_else(someTemp, systemState);
time += some_other_increment(someTemp, systemState);
}
}
do_some_stats();
}
This is to show you that there is a main loop, that there is temporary data which is used in some places and not in others, that there is shared data, synchronization points, etc.
Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not true, but logically it is)... One "big flow" for each block.
Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host I guess... But how do I coordinate, from a single loop, all the blocks? This is what leaves me most dubious. May I use streams from different host threads (One thread per block)?
I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure if I could have advantages here.
I have benefitted from dynamic parallelism for solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;
for(int m=0; m<(2*K+1); m++) {
PP1 = calculate_PP1(i,m);
phi_cap1 = calculate_phi_cap1(i,m);
for(int n=0; n<(2*K+1); n++) {
PP2 = calculate_PP2(i,m);
phi_cap2 = calculate_phi_cap2(i,n);
atomicAdd(&result[PP1][PP2], data[i]*phi_cap1*phi_cap2);
}
}
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them in a (2K+1)x(2K+1) kernel.
From my (possibly incomplete) experience, dynamic parallelism will help if you have a small number of independent iterations. For a larger number of iterations, you may end up calling the child kernel many times, so you should check whether the kernel launch overhead becomes the limiting factor.

Why is global + shared faster than global alone

I need some help understanding the behavior of Rob Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I don't understand how the use of shared mem gives faster performance than the non-shared memory version. That is, if I add a few more index calculation steps and add another Rd/Wr cycle to access the shared mem, how can this be faster than just using global mem alone? The same number of Rd/Wr cycles access global mem in either case. The data is still accessed only once per kernel instance. Data still goes in/out using global mem. The number of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster? (We are not subtracting any processing steps.) Essentially we are doing more work, and it is getting done faster.
Shared mem access is much faster than global, but it is not zero, (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int inOffset = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
extern __shared__ int s_data[];
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
// Block until all threads in the block have written their data to shared mem
__syncthreads();
// write the data from shared memory in forward order,
// but to the reversed block offset as before
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the writer of the article was clearly talking about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
This is because the shared memory is closer to the computing units, hence the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication).
But most importantly, the top reason is that a lot of the numbers in the tile are being reused by many threads. So if you access from global memory you are retrieving those numbers multiple times. Writing them once to shared memory will eliminate that wasted bandwidth usage.
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both reads and writes forwards. I think the fast code is faster because the cache hierarchy is optimized, in some way, for accessing the global memory in ascending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

CUDA array-to-array sum

I have a short piece of code like this:
typedef struct {
double sX;
double sY;
double vX;
double vY;
int rX;
int rY;
int mass;
int species;
int boxnum;
} particle;
typedef struct {
double mX;
double mY;
double count;
int rotDir;
double cX;
double cY;
int superDir;
} box;
//....
int i;
for(i=0;i<PART_COUNT;i++) {
particles[i].boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
}
for(i=0;i<PART_COUNT;i++) {
//sum the momenta
boxnum = particles[i].boxnum;
boxes[boxnum].mX += particles[i].vX*particles[i].mass;
boxes[boxnum].mY += particles[i].vY*particles[i].mass;
boxes[boxnum].count++;
}
Now, I want to port this to CUDA. The first step is easy; spreading the calculation across a bunch of threads is no problem. The issue is the second. Since any two particles are equally likely to be in any same box, I'm not sure how I can partition it so as to avoid conflicts.
Number of particles is on the order of 10,000 to 10,000,000, and number of boxes is on the order of 1024 to 1048576.
Ideas?
You can try to use atomicAdd operations to modify your boxes array. Atomic operations on global memory are very slow but at the same time it's quite impossible to do any optimizations involving shared memory for two reasons:
Under the assumption that the boxnum properties of the particles particles[0]..particles[n] aren't ordered and do not lie within any small bounds (in the range of a block size), you can't predict which boxes to load from global memory into shared memory. You would first have to collect all the box numbers.
If you try to collect all boxnumbers you can't use an array with every possible boxnumber as an index since there are way too many boxes to fit into shared memory. So you'd have to collect indices with a queue (realized with an array, a pointer to the next free slot and atomic operations), but then you'd still have conflicts because the same boxnumber could occur multiple times in your queue.
Conclusion: atomicAdd will give you at least correct behavior. Try it out and test the performance. If you aren't satisfied by the performance, think if there's another way to do the same computations that would profit from shared memory.
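A minimal sketch of the atomicAdd version is below. To keep it short and portable it makes two assumptions that differ from the question: the data is held in separate arrays rather than in the particle and box structs, and the accumulators are float, because atomicAdd on double in global memory is only available natively on devices of compute capability 6.0 and higher. sumMomenta, BoxSums and runSumMomenta are names invented here.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per particle; conflicting updates to the same box are resolved
// by the hardware atomics instead of by partitioning the particles.
__global__ void sumMomenta(const float* vX, const float* vY, const int* mass,
                           const int* boxnum, float* boxMX, float* boxMY,
                           int* boxCount, int nParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int b = boxnum[i];
    atomicAdd(&boxMX[b], vX[i] * mass[i]);
    atomicAdd(&boxMY[b], vY[i] * mass[i]);
    atomicAdd(&boxCount[b], 1);
}

struct BoxSums { std::vector<float> mX, mY; std::vector<int> count; };

// Hypothetical helper: allocate a device copy of a host vector.
template <typename T>
T* copyToDevice(const std::vector<T>& h)
{
    T* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

BoxSums runSumMomenta(const std::vector<float>& vX, const std::vector<float>& vY,
                      const std::vector<int>& mass, const std::vector<int>& boxnum,
                      int nBoxes)
{
    int n = static_cast<int>(vX.size());
    BoxSums out{std::vector<float>(nBoxes, 0.0f), std::vector<float>(nBoxes, 0.0f),
                std::vector<int>(nBoxes, 0)};
    float *d_vX = copyToDevice(vX), *d_vY = copyToDevice(vY);
    int *d_mass = copyToDevice(mass), *d_box = copyToDevice(boxnum);
    float *d_mX = copyToDevice(out.mX), *d_mY = copyToDevice(out.mY);  // zero-initialised
    int *d_cnt = copyToDevice(out.count);
    sumMomenta<<<(n + 255) / 256, 256>>>(d_vX, d_vY, d_mass, d_box, d_mX, d_mY, d_cnt, n);
    cudaMemcpy(out.mX.data(), d_mX, nBoxes * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(out.mY.data(), d_mY, nBoxes * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(out.count.data(), d_cnt, nBoxes * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_vX); cudaFree(d_vY); cudaFree(d_mass); cudaFree(d_box);
    cudaFree(d_mX); cudaFree(d_mY); cudaFree(d_cnt);
    return out;
}
```

With a million boxes and particles spread roughly evenly, collisions on any one box are rare, which is the case where global atomics tend to behave best. Heavy clustering in a few boxes is the case to measure.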
As an alternative, you could launch a 2D grid of blocks.
blocks.x = numParticles / threadsPerBlock / repeatPerBlock.
blocks.y = numOfBoxes / 1024;
Each block performs atomic additions in shared memory if and only if boxnum lies in between 1024 * blockIdx.y and 1024 * (blockIdx.y + 1);
This is followed by a reduction along blocks.x
This may or may not be faster than atomicAdd on global memory as the data is read blocks.y number of times. This could however be fixed if the "particles" are sorted by boxnum in a sorting pass followed by a partitioning pass.
There may be several other ways to do it, but since the problem size varies by a large amount, you may end up having to write 2-3 different methods that are optimized for a given size range.
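The 2D-grid idea above could be sketched roughly as follows. This is one interpretation, not a tuned implementation: it only accumulates the x momentum (mY and count would follow the same pattern), it uses float accumulators, and instead of a separate reduction along blocks.x it flushes each block's shared-memory totals to global memory with one atomicAdd per box. sumMomentaTiled, runSumMomentaTiled and BOXES_PER_BLOCK are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BOXES_PER_BLOCK 1024

// blockIdx.y selects a window of 1024 boxes; blockIdx.x selects a slice of particles.
__global__ void sumMomentaTiled(const float* vX, const int* mass, const int* boxnum,
                                float* boxMX, int nParticles, int nBoxes)
{
    __shared__ float sMX[BOXES_PER_BLOCK];
    int boxLo = blockIdx.y * BOXES_PER_BLOCK;    // this block owns [boxLo, boxLo + 1024)

    for (int b = threadIdx.x; b < BOXES_PER_BLOCK; b += blockDim.x)
        sMX[b] = 0.0f;
    __syncthreads();

    // Every block row re-reads the particles and keeps only those in its window.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nParticles;
         i += gridDim.x * blockDim.x) {
        int b = boxnum[i] - boxLo;
        if (b >= 0 && b < BOXES_PER_BLOCK)
            atomicAdd(&sMX[b], vX[i] * mass[i]); // cheap shared-memory atomic
    }
    __syncthreads();

    // Flush: at most one global atomic per box per block.
    for (int b = threadIdx.x; b < BOXES_PER_BLOCK; b += blockDim.x)
        if (boxLo + b < nBoxes && sMX[b] != 0.0f)
            atomicAdd(&boxMX[boxLo + b], sMX[b]);
}

std::vector<float> runSumMomentaTiled(const std::vector<float>& vX,
                                      const std::vector<int>& mass,
                                      const std::vector<int>& boxnum, int nBoxes)
{
    int n = static_cast<int>(vX.size());
    std::vector<float> h_mX(nBoxes);
    float *d_vX = nullptr, *d_mX = nullptr;
    int *d_mass = nullptr, *d_box = nullptr;
    cudaMalloc(&d_vX, n * sizeof(float));
    cudaMalloc(&d_mass, n * sizeof(int));
    cudaMalloc(&d_box, n * sizeof(int));
    cudaMalloc(&d_mX, nBoxes * sizeof(float));
    cudaMemcpy(d_vX, vX.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_mass, mass.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_box, boxnum.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_mX, 0, nBoxes * sizeof(float));
    dim3 grid(8, (nBoxes + BOXES_PER_BLOCK - 1) / BOXES_PER_BLOCK);
    sumMomentaTiled<<<grid, 256>>>(d_vX, d_mass, d_box, d_mX, n, nBoxes);
    cudaError_t err = cudaMemcpy(h_mX.data(), d_mX, nBoxes * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_vX); cudaFree(d_mass); cudaFree(d_box); cudaFree(d_mX);
    return h_mX;
}
```

The cost noted above is visible in the particle loop: every block row reads the whole particle slice and discards most of it, which is why sorting by boxnum first can pay off for large box counts.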

Coding a CUDA Kernel that has many threads writing to the same index?

I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain it a bit clearer with the variables.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
if(index_in < cLength)
{
sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
//__threadfence();
__threadfence_block();
}
}
First off, the number of connections in the network is cLength. For every connection, there is a source neuron and a target neuron, as well as a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index and the target neuron index of connection i. The weightArray contains the weight information (so index i of weightArray corresponds to connection i).
As you can see, sumArray is where I'm storing the sums. So the kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively, for all the incoming connections to the neuron, sum all the weights. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by about 12-15x (so the right answer will be 700.0 and what I'm getting is something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.
+= is not atomic, so it is not thread safe. Use atomicAdd.
Also, you should avoid writing to the same memory cell. The problem is that these calls will be serialized: threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.
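A minimal sketch of the atomicAdd fix follows. The memory fences in the original kernel only order memory operations; they do not turn the read-modify-write of += into a single indivisible step, which is what atomicAdd does. The sketch simplifies the indexing to a 1D launch, and kernelSumWeightsAtomic and runSumWeightsAtomic are names made up for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per connection; the target neuron's sum is updated atomically,
// so concurrent updates to the same neuron are not lost.
__global__ void kernelSumWeightsAtomic(float* sumArray, const float* weightArray,
                                       const int2* sourceTargetArray, int cLength)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < cLength)
        atomicAdd(&sumArray[sourceTargetArray[i].y], fabsf(weightArray[i]));
}

// Hypothetical host wrapper returning the per-neuron sums.
std::vector<float> runSumWeightsAtomic(const std::vector<float>& weights,
                                       const std::vector<int2>& sourceTarget,
                                       int nNeurons)
{
    int cLength = static_cast<int>(weights.size());
    std::vector<float> h_sum(nNeurons);
    float *d_sum = nullptr, *d_w = nullptr;
    int2* d_st = nullptr;
    cudaMalloc(&d_sum, nNeurons * sizeof(float));
    cudaMalloc(&d_w, cLength * sizeof(float));
    cudaMalloc(&d_st, cLength * sizeof(int2));
    cudaMemset(d_sum, 0, nNeurons * sizeof(float));   // sums must start at zero
    cudaMemcpy(d_w, weights.data(), cLength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_st, sourceTarget.data(), cLength * sizeof(int2), cudaMemcpyHostToDevice);
    kernelSumWeightsAtomic<<<(cLength + 255) / 256, 256>>>(d_sum, d_w, d_st, cLength);
    cudaError_t err = cudaMemcpy(h_sum.data(), d_sum, nNeurons * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_sum); cudaFree(d_w); cudaFree(d_st);
    return h_sum;
}
```

If many connections share a target neuron, these atomics serialize, which is where the two-phase compute-then-merge approach becomes worth the extra code.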
You need to do a reduction.
Sum the elements assigned to each thread and place the result in an array, cache[threadsPerBlock], then call __syncthreads().
Now reduce the resulting subtotals by repeatedly adding the upper half of the array onto the lower half:
int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;   // threadsPerBlock must be a power of two
while (i != 0)
{
if (cacheIndex < i)
cache[cacheIndex] += cache[cacheIndex + i];   // add the partner element i slots away
__syncthreads();   // every thread must reach this, so keep it outside the if
i /= 2;
}
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).
There is one big caveat with this approach. The additions will not occur in the same order they would with serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.
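For completeness, here is a sketch of the whole pattern as a self-contained kernel: each thread accumulates a private subtotal, the block reduces the subtotals in shared memory, and the host adds up the per-block results. It sums one flat array, so for the per-neuron sums in the question you would apply it to each neuron's incoming weights. blockSum, runBlockSum, RED_THREADS and RED_BLOCKS are names chosen for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define RED_THREADS 256   // must be a power of two for the halving loop
#define RED_BLOCKS  32

__global__ void blockSum(const float* in, float* partial, int n)
{
    __shared__ float cache[RED_THREADS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: each thread sums its own strided share of the input.
    float temp = 0.0f;
    for (int k = tid; k < n; k += blockDim.x * gridDim.x)
        temp += in[k];
    cache[threadIdx.x] = temp;
    __syncthreads();

    // Phase 2: tree reduction in shared memory, halving the active threads each step.
    for (int i = blockDim.x / 2; i != 0; i /= 2) {
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: finishes the sum over the per-block partials on the CPU.
float runBlockSum(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_partial(RED_BLOCKS);
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, RED_BLOCKS * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<RED_BLOCKS, RED_THREADS>>>(d_in, d_partial, n);
    cudaError_t err = cudaMemcpy(h_partial.data(), d_partial, RED_BLOCKS * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_in);
    cudaFree(d_partial);
    float total = 0.0f;
    for (float p : h_partial) total += p;
    return total;
}
```

Because the additions happen in a different order than in a serial loop, compare the result against the serial answer with a tolerance rather than with exact equality when the inputs are arbitrary floats.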