Is it worthwhile to pass kernel parameters via shared memory? - cuda

Suppose that we have an array int * data, each thread will access one element of this array. Since this array will be shared among all threads it will be saved inside the global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory because I allocated memory for this array using cudaMalloc. Now as for the other variables, I've seen some examples that pass an integer without allocating memory, immediately to the kernel function. In my case such variables are a b and c.
If I'm not mistaken, even though we do not call directly cudaMalloc to allocate 4 bytes for each three integers, CUDA will automatically do it for us, so in the end the variables a b and c will be allocated in the global memory.
Now these variables, are only auxiliary, the threads only read them and nothing else.
My question is, wouldn't it be better to transfer these variables to the shared memory?
I imagine that if we had for example 10 blocks with 1024 threads, we would need 10*3 = 30 reads of 4 bytes in order to store the numbers in the shared memory of each block.
Without shared memory and if each thread has to read all these three variables once, the total amount of global memory reads will be 1024*10*3 = 30720 which is very inefficient.
Now here is the problem, I'm somewhat new to CUDA and I'm not sure if it's possible to transfer the memory for variables a b and c to the shared memory of each block without having each thread reading these variables from the global memory and loading them to the shared memory, so in the end the total amount of global memory reads would be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[64];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
Here each thread loads different data inside the shared variable s. So each thread, according to its index, loads the specified data inside the shared memory.
In my case, I want to load only variables a b and c to the shared memory. These variables are always the same, they don't change, so they don't have anything to do with the threads themselves, they are auxiliary and are being used by each thread to run some algorithm.
How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?

The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:
In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
In compute capability 2.x/3.x/4.x/5.x/6.x devices, kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.

Related

cuda racecheck error if using double in kernel [duplicate]

My questions are:
1) Did I understand correct, that when you declare a variable in the global kernel, there will be different copies of this variable for each thread. That allows you to store some intermediate result in this variable for every thread. Example: vector c=a+b:
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;
int p;
p = a[i] + b[i];
c[i] = p;
}
Here we declare intermediate variable p. But in reality there are N copies of this variable, each one for each thread.
2) Is it true, that if I will declare array, N copies of this array will be created, each for each thread? And as long as everything inside the global kernel happens on gpu memory, you need N times more memory on gpu for any variable declared, where N is the number of your threads.
3) In my current program I have 35*48= 1680 blocks, each block include 32*32=1024 threads. Does it mean, that any variable declared within a global kernel will cost me N=1024*1680=1 720 320 times more than outside the kernel?
4) To use shared memory, I need M times more memory for each variable than usually. Here M is the number of blocks. Is that true?
1) Yes. Each thread has a private copy of non-shared variables declared in the function. These usually go into GPU register memory, though can spill into local memory.
2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to have enough private memory for every thread at once. This is because in hardware, not all threads need to execute simultaneously. For example, if you launch N threads it may be that half are active at a given time and the other half won't start until there are free resources to run them.
The more resources your threads use the fewer can be run simultaneously by the hardware, but that doesn't limit how many you can ask to be run, as any threads the GPU doesn't have resource for will be run once some resources free up.
This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it is able to run threads in parallel. To run these threads in parallel it needs to fit a lot of threads at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.
First, as mentioned in this post, shared memory may declare in two different ways:
Either dynamically shared memory allocated, like the following
// Lunch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may use inside a kernel as mention:
extern __shared__ int s[];
Or static shared memory, which can use in kernel call like the following:
__shared__ int s[64];
Both are use for different reasons, however which one is better and why ?
Second, I'm running a multi blocks, 256 threads per block kernel. I'm using static shared memory in global and device kernels, both of them uses shared memory. An example is given:
__global__ void startKernel(float* p_d_array)
{
__shared double matA[3*3];
float a1 =0 ;
float a2 = 0;
float a3 = 0;
float b = p_d_array[threadidx.x];
a1 += reduce( b, threadidx.x);
a2 += reduce( b, threadidx.x);
a3 += reduce( b, threadidx.x);
// continue...
}
__device__ reduce ( float data , unsigned int tid)
{
__shared__ float data[256];
// do reduce ...
}
I'd like to know how the shared memory is allocated in such case. I presume each block receive its own shared memory.
What's happening when block # 0 goes into reduce function?
Does the shared memory is allocated in advance to the function call?
I call three different reduce device function, in such case, theoretically in block # 0 , threads # [0,127] may still execute ("delayed due hard word") on the first reduce call, while threads # [128,255] may operate on the second reduce call. In this case, I'd like to know if both reduce function are using the same shared memory?
Even though if they are called from two different function calls ?
On the other hand, Is that possible that a single block may allocated 3*256*sizeof(float) shared memory for both functions calls? That's seems superfluous in CUDA manners, but I still want to know how CUDA operates in such case.
Third, is that possible to gain higher performance in shared memory due to compiler optimization using
const float* p_shared ;
or restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter be it set by your code or by code generated by the compiler.
Re: 2nd, compiler will sum the shared memory requirement from the kernel function and functions called by kernel.

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float a;
a = delt[i] + deltb[i];
a += 1;
}
and the kernel call looks something like below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);
Is "a" stored on the stack?
Is a new "a" created for each thread when they are initialized?
Or will each thread independently access "a" at an unknown time, potentially messing up the algorithm?
Any variable (scalar or array) declared inside a kernel function, without an extern specifier, is local to each thread, that is each thread has its own "copy" of that variable, no data race among threads will occur!
Compiler chooses whether local variables will reside on registers or in local memory (actually global memory), depending on transformations and optimizations performed by the compiler.
Further details on which variables go on local memory can be found in the NVIDIA CUDA user guide, chapter 5.3.2.2
None of the above. The CUDA compiler is smart enough and aggressive enough with optimisations that it can detect that a is unused and the complete code can be optimised away.You can confirm this by compiling the kernel with -Xptxas=-v as an option and look at the resource count, which should be basically no registers and no local memory or heap.
In a less trivial example, a would probably be stored in a per thread register, or in per thread local memory, which is off-die DRAM.

CUDA Constant Memory Best Practices

I present here some code
__constant__ int array[1024];
__global__ void kernel1(int *d_dst) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = array[tId];
}
__global__ void kernel2(int *d_dst, int *d_src) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = d_src[tId];
}
int main(int argc, char **argv) {
int *d_array;
int *d_src;
cudaMalloc((void**)&d_array, sizeof(int) * 1024);
cudaMalloc((void**)&d_src, sizeof(int) * 1024);
int *test = new int[1024];
memset(test, 0, sizeof(int) * 1024);
for (int i = 0; i < 1024; i++) {
test[i] = 100;
}
cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
kernel1<<< 1, 1024 >>>(d_array);
cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
kernel2<<<1, 32 >>>(d_array, d_src),
free(test);
cudaFree(d_array);
cudaFree(d_src);
return 0;
}
Which simply shows constant memory and global memory usage. On its execution the "kernel2" executes about 4 times faster (in terms of time) than "kernel1"
I understand from the Cuda C programming guide, that this this because accesses to constant memory are getting serialized. Which brings me to the idea that constant memory can be best utilized if a warp accesses a single constant value such as integer, float, double etc. but accessing an array is not beneficial at all. In other terms, I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more? I mean a structure might contain multiple simple types and array for example; when accessing these simple types, are these accesses also serialized or not?
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Anyone can refer me some example code where an efficient constant memory usage is shown.
regards,
I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
Yes this is generally correct and is the principal intent of usage of constant memory/constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording is as follows:
The constant memory space resides in device memory and is cached in the constant cache.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
An important takeaway from the text above is the desire for uniform access across a warp to achieve best performance. If a warp makes a request to __constant__ memory where different threads in the warp are accessing different locations, those requests will be serialized. Therefore if each thread in a warp is accessing the same value:
int i = array[20];
then you will have the opportunity for good benefit from the constant cache/memory. If each thread in a warp is accessing a unique quantity:
int i = array[threadIdx.x];
then the accesses will be serialized, and the constant data usage will be disappointing, performance-wise.
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more?
You can certainly put structures in constant memory. The same rules apply:
int i = constant_struct_ptr->array[20];
has the opportunity to benefit, but
int i = constant_struct_ptr->array[threadIdx.x];
does not. If you access the same simple type structure element across threads, that is ideal for constant cache usage.
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Yes, if you know that in general your accesses will break the constant memory one 32-bit quantity per cycle rule, then you'll probably be better off leaving the data in ordinary global memory.
There are a variety of cuda sample codes that demonstrate usage of __constant__ data. Here are a few:
graphics volumeRender
imaging bilateralFilter
imaging convolutionTexture
finance MonteCarloGPU
and there are others.
EDIT: responding to a question in the comments, if we have a structure like this in constant memory:
struct Simple { int a, int b, int c} s;
And we access it like this:
int p = s.a + s.b + s.c;
^ ^ ^
| | |
cycle: 1 2 3
We will have good usage of the constant memory/cache. When the C code gets compiled, under the hood it will generate machine code accesses corresponding to 1,2,3 in the diagram above. Let's imagine that access 1 occurs first. Since access 1 is to the same memory location independent of which thread in the warp, during cycle 1, all threads will receive the value in s.a and it will take advantage of the cache for best possible benefit. Likewise for accesses 2 and 3. If on the other hand we had:
struct Simple { int a[32], int b[32], int c[32]} s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];
This would not give good usage of constant memory/cache. Instead, if this were typical of our accesses to s, we'd probably have better performance locating s in ordinary global memory.

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy in shared memory for each block as to speed up my read-only access. Each thread in each block will use the whole buffer at different positions concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
extern __shared__ int s_array[];
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if( idx < N )
{
...?
}
}
The first point to make is that shared memory is limited to a maximum of either 16kb or 48kb per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only has the scope and lifetime of the block it is associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistant results. It is probably better to think of shared memory as a small, block scope L1 cache which is fully programmer managed than some sort of faster version of global memory.
With those points out of the way, a kernel which loaded sucessive segments of a large input array, processed them and then wrote some per thread final result back input global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
__shared__ int s_array[blocksize];
int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
if( pos < N ) {
s_array[threadIdx.x] = globalMemArray[pos];
}
__syncthreads();
// Calculations using partial buffer contents
.......
__syncthreads();
}
// write final per thread result to output
globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.