CUDA Constant Memory Best Practices - cuda

I present here some code
__constant__ int array[1024];
__global__ void kernel1(int *d_dst) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = array[tId];
}
__global__ void kernel2(int *d_dst, int *d_src) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = d_src[tId];
}
int main(int argc, char **argv) {
int *d_array;
int *d_src;
cudaMalloc((void**)&d_array, sizeof(int) * 1024);
cudaMalloc((void**)&d_src, sizeof(int) * 1024);
int *test = new int[1024];
memset(test, 0, sizeof(int) * 1024);
for (int i = 0; i < 1024; i++) {
test[i] = 100;
}
cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
kernel1<<< 1, 1024 >>>(d_array);
cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
kernel2<<<1, 32 >>>(d_array, d_src),
free(test);
cudaFree(d_array);
cudaFree(d_src);
return 0;
}
Which simply shows constant memory and global memory usage. On its execution the "kernel2" executes about 4 times faster (in terms of time) than "kernel1"
I understand from the Cuda C programming guide, that this this because accesses to constant memory are getting serialized. Which brings me to the idea that constant memory can be best utilized if a warp accesses a single constant value such as integer, float, double etc. but accessing an array is not beneficial at all. In other terms, I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more? I mean a structure might contain multiple simple types and array for example; when accessing these simple types, are these accesses also serialized or not?
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Anyone can refer me some example code where an efficient constant memory usage is shown.
regards,

I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
Yes this is generally correct and is the principal intent of usage of constant memory/constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording is as follows:
The constant memory space resides in device memory and is cached in the constant cache.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
An important takeaway from the text above is the desire for uniform access across a warp to achieve best performance. If a warp makes a request to __constant__ memory where different threads in the warp are accessing different locations, those requests will be serialized. Therefore if each thread in a warp is accessing the same value:
int i = array[20];
then you will have the opportunity for good benefit from the constant cache/memory. If each thread in a warp is accessing a unique quantity:
int i = array[threadIdx.x];
then the accesses will be serialized, and the constant data usage will be disappointing, performance-wise.
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more?
You can certainly put structures in constant memory. The same rules apply:
int i = constant_struct_ptr->array[20];
has the opportunity to benefit, but
int i = constant_struct_ptr->array[threadIdx.x];
does not. If you access the same simple type structure element across threads, that is ideal for constant cache usage.
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Yes, if you know that in general your accesses will break the constant memory one 32-bit quantity per cycle rule, then you'll probably be better off leaving the data in ordinary global memory.
There are a variety of cuda sample codes that demonstrate usage of __constant__ data. Here are a few:
graphics volumeRender
imaging bilateralFilter
imaging convolutionTexture
finance MonteCarloGPU
and there are others.
EDIT: responding to a question in the comments, if we have a structure like this in constant memory:
struct Simple { int a, int b, int c} s;
And we access it like this:
int p = s.a + s.b + s.c;
^ ^ ^
| | |
cycle: 1 2 3
We will have good usage of the constant memory/cache. When the C code gets compiled, under the hood it will generate machine code accesses corresponding to 1,2,3 in the diagram above. Let's imagine that access 1 occurs first. Since access 1 is to the same memory location independent of which thread in the warp, during cycle 1, all threads will receive the value in s.a and it will take advantage of the cache for best possible benefit. Likewise for accesses 2 and 3. If on the other hand we had:
struct Simple { int a[32], int b[32], int c[32]} s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];
This would not give good usage of constant memory/cache. Instead, if this were typical of our accesses to s, we'd probably have better performance locating s in ordinary global memory.

Related

Is it worthwhile to pass kernel parameters via shared memory?

Suppose that we have an array int * data, each thread will access one element of this array. Since this array will be shared among all threads it will be saved inside the global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory because I allocated memory for this array using cudaMalloc. Now as for the other variables, I've seen some examples that pass an integer without allocating memory, immediately to the kernel function. In my case such variables are a b and c.
If I'm not mistaken, even though we do not call directly cudaMalloc to allocate 4 bytes for each three integers, CUDA will automatically do it for us, so in the end the variables a b and c will be allocated in the global memory.
Now these variables, are only auxiliary, the threads only read them and nothing else.
My question is, wouldn't it be better to transfer these variables to the shared memory?
I imagine that if we had for example 10 blocks with 1024 threads, we would need 10*3 = 30 reads of 4 bytes in order to store the numbers in the shared memory of each block.
Without shared memory and if each thread has to read all these three variables once, the total amount of global memory reads will be 1024*10*3 = 30720 which is very inefficient.
Now here is the problem, I'm somewhat new to CUDA and I'm not sure if it's possible to transfer the memory for variables a b and c to the shared memory of each block without having each thread reading these variables from the global memory and loading them to the shared memory, so in the end the total amount of global memory reads would be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[64];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
Here each thread loads different data inside the shared variable s. So each thread, according to its index, loads the specified data inside the shared memory.
In my case, I want to load only variables a b and c to the shared memory. These variables are always the same, they don't change, so they don't have anything to do with the threads themselves, they are auxiliary and are being used by each thread to run some algorithm.
How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?
The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:
In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
In compute capability 2.x/3.x/4.x/5.x/6.x devices, kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.

Is it possible for cuda threads to have local/private copy of an argument which can be updated without affecting the other

I have come across a situation whereby I need to provide a number of arrays as input to a global function, I need each thread to be able to perform operations on the array in such a manner that they will not affect how others threads copy of the array, I provide the below code as an example of what I am trying to achieve.
__global__ void testLocalCopy(double *temper){
int threadIDx = threadIdx.x + blockDim.x * blockIdx.x;
// what I need is for each thread to set temper[3] to its id without affecting any other threads copy
// so thread id 0 will have a set its copy of temper[3] to 0 and thread id 3 will set it to 3 etc.
temper[3]=threadIDx;
printf("For thread %d the val in temper[3] is %lf \n",threadIDx,temper[3]);
}
just to restate , Is there a method whereby a given thread can be certain that no other thread is updating its value of temper[3] ?
I initially believed I would be able to solve this problem by using constant memory, but as constant memory is readonly this did not meet my needs,
I am using cuda 4.0 , please see the main function below.
int main(){
double temper[4]={2.0,25.9999,55.3,66.6};
double *dev_temper;
int size=4;
cudaMalloc( (void**)&dev_temper, size * sizeof(double) );
cudaMemcpy( dev_temper, &temper, size * sizeof(double), cudaMemcpyHostToDevice );
testLocalCopy<<<2,2>>>(dev_temper);
cudaDeviceReset();
cudaFree(dev_temper);
}
Thanks in advance,
Connor
Within your kernel function you can allocate memory as
int temper_per_thread[4];
Now each thread will have seperate and unique access to this array within your kernel e.g. the code below will populate temper_per_thread with the current thread index:
temper_per_thread[0]=threadIDx;
temper_per_thread[1]=threadIDx;
temper_per_thread[2]=threadIDx;
temper_per_thread[3]=threadIDx;
Of course if you wish to transfer all these thread specific arrays back to the CPU, you will need a different approach. 1) allocate a larger portion of global memory. 2) The size of this larger portion of global memory will be the number of threads multiplied by the number of elements unique to each thread. 3) Index the array writing such that each thread always writes to unique location within global memory. 4) Do a GPU to CPU memcpy after the kernel finishes.

Why is global + shared faster than global alone

I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared mem is giving faster performance over the non-shared memory version. i.e. If I add a few more index calculation steps and use add another Rd/Wr cycle to access the shared mem, how can this be faster than just using global mem alone? The same number or Rd/Wr cycles access global mem in either case. The data is still access only once per kernel instance. Data still goes in/out using global mem. The num of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster. (We are not subtracting any process steps.) Essentially we are doing more work, and it is getting done faster.
Shared mem access is much faster than global, but it is not zero, (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int inOffset = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
extern __shared__ int s_data[];
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
// Block until all threads in the block have written their data to shared mem
__syncthreads();
// write the data from shared memory in forward order,
// but to the reversed block offset as before
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market 2009, so the writer of the article clearly talked about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
this is because the shared memory is closer to the computing units, hence the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication)
But most importantly, the top reason is that a lot of the numbers in the tile are being reused by many threads. So if you access from global you are retrieving those numbers multiple times. Writing them once to shared memory will eliminate that wasted bandwidth usage
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both read and writes forwards. I think the fast code if faster because the cache hierarchy is optimized in, some way, for accessing the global memory in descending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

GPGPU - CUDA: global store efficiency

I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVidia's profiler (I am using CUDA 5 toolkit preview release, on a Fermi GPU).
As far as I understood, this value is the ratio of requested memory transactions to actual nb of transcations performed, therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency both for global load and for global store, as expected:
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = input[offset];
output[offset] = tmp;
}
What I don't understand is why when I start adding useful code in between the input read and the output write, the global store efficiency begins to drop, whereas I have not changed the memory write pattern or the thread block geometry ? The global load stays at 100%, as I expect, though.
Could someone please shed a light on why this happens ? I thought, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and using a "coalescing-friendly" pattern, I should still get 100% whatever I do before, but obviously I must be misunderstanding something on either the meaning of global store efficiency, or on the conditions for global store coalescing.
Thx,
EDIT :
Here is an example: if I use this code (just adding a "round" operation on input), global store efficiency drops from 100% to 95%
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = round(input[offset]);
output[offset] = tmp;
}
Unsure if this is the case, but round probably converts its argument to a double and if there is a register spilling, then each thread would access 8 bytes of memory, which would then be coerced into 4 bytes of tmp. Accessing 8 bytes would reduce the coalescing to half-warp.
However, I believe register spilling shouldn't happen since the number of local variables in your kernel is small. You could check with nvcc --ptxas-options=-v for the spill.
Ok, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result : 100% store efficiency in both cases.

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy in shared memory for each block as to speed up my read-only access. Each thread in each block will use the whole buffer at different positions concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
extern __shared__ int s_array[];
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if( idx < N )
{
...?
}
}
The first point to make is that shared memory is limited to a maximum of either 16kb or 48kb per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only has the scope and lifetime of the block it is associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistant results. It is probably better to think of shared memory as a small, block scope L1 cache which is fully programmer managed than some sort of faster version of global memory.
With those points out of the way, a kernel which loaded sucessive segments of a large input array, processed them and then wrote some per thread final result back input global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
__shared__ int s_array[blocksize];
int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
if( pos < N ) {
s_array[threadIdx.x] = globalMemArray[pos];
}
__syncthreads();
// Calculations using partial buffer contents
.......
__syncthreads();
}
// write final per thread result to output
globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.