In CUDA, which thread is responsible for shared memory allocation requested inside a kernel?

I understand that the instructions inside a kernel are executed by all the threads. Let us consider the following case:
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[64];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
Basically, this code will be running on different cores as threads.
Now there is a shared memory allocation. Since it (the shared memory allocation) will be encountered by all the threads, will it be allocated by all the threads? (Logically not.) But I am sure at least one thread must allocate it. I want to know which thread does it.
Kindly help me understand where my understanding is wrong.

No threads are responsible for this allocation - the threads do not run any SASS code that is involved in allocating this memory.
The same statement is true if you use dynamic (extern) shared allocation - no threads are responsible - meaning the threads do not run any SASS code that is involved in allocating this memory. There are no function calls or other mechanisms involved.
The memory is already allocated, and a pointer to it is already established, by the time the thread SASS code (i.e. the kernel) begins executing.
There is a wrinkle to be aware of. If the shared memory declaration involves a constructor, then the constructor will be run on all threads. This can be confusing behavior.
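For reference, this is roughly what the dynamic (extern) case mentioned above looks like (the kernel name and launch configuration below are only illustrative): the size comes from the third kernel launch parameter, and again no thread executes any allocation code.
__global__ void dynamicReverse(int *d, int n)
{
// size is not known at compile time; it is taken from the launch configuration
extern __shared__ int s[];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
// host side: request n*sizeof(int) bytes of dynamic shared memory per block
// (d_d is a hypothetical device pointer)
// dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);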

Related

CUDA shared memory read/write order within a single thread

Shared memory is not automatically synchronized between the threads in a block. But I don't know whether shared memory is guaranteed to be consistent within the writing thread itself.
For example, in this example:
__global__ void kernel()
{
__shared__ int i, j;
if(threadIdx.x == 0)
{
i = 10;
j = i;
}
// #1
}
Is it guaranteed at #1 that, for thread 0, i == 10 and j == 10, or do I need a memory fence or a local variable?
I'm going to assume that by
for thread 0
you mean, "the thread that passed the if-test". And for the sake of this discussion, I will assume there is only one of those.
Yes, it's guaranteed. Otherwise basic C++ compliance would be broken in CUDA.
Challenges in CUDA arise with inter-thread communication or visibility. However, that is not what your question is about.
As an example, it is certainly not guaranteed that for some other thread, i will be visible as 10, without some sort of fence or barrier.
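For contrast, a minimal sketch of that inter-thread case (not what was asked, but it shows where a barrier becomes necessary): __syncthreads() acts as both a barrier and a memory fence for the block, after which every thread sees the values written by thread 0.
__global__ void kernel()
{
__shared__ int i, j;
if(threadIdx.x == 0)
{
i = 10;
j = i; // within thread 0, reading i back as 10 here is guaranteed
}
__syncthreads(); // barrier + fence: now all threads in the block see i == 10 and j == 10
// #1
}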

cuda racecheck error if using double in kernel [duplicate]

My questions are:
1) Did I understand correctly that when you declare a variable in a __global__ kernel, there will be a separate copy of that variable for each thread? That allows you to store an intermediate result in this variable for every thread. Example: the vector sum c = a + b:
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;
int p;
p = a[i] + b[i];
c[i] = p;
}
Here we declare the intermediate variable p, but in reality there are N copies of this variable, one for each thread.
2) Is it true that if I declare an array, N copies of this array will be created, one for each thread? And since everything inside a __global__ kernel happens in GPU memory, do you need N times more memory on the GPU for any declared variable, where N is the number of your threads?
3) In my current program I have 35*48 = 1680 blocks, and each block includes 32*32 = 1024 threads. Does this mean that any variable declared within a __global__ kernel will cost me N = 1024*1680 = 1,720,320 times more memory than outside the kernel?
4) To use shared memory, do I need M times more memory for each variable than usual, where M is the number of blocks?
1) Yes. Each thread has a private copy of the non-shared variables declared in the function. These usually go into GPU registers, though they can spill into local memory.
2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to have enough private memory for every thread at once. This is because in hardware, not all threads need to execute simultaneously. For example, if you launch N threads it may be that half are active at a given time and the other half won't start until there are free resources to run them.
The more resources your threads use the fewer can be run simultaneously by the hardware, but that doesn't limit how many you can ask to be run, as any threads the GPU doesn't have resource for will be run once some resources free up.
This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it is able to run threads in parallel, and to do that it needs to keep a lot of threads resident at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.
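If you want to see this trade-off concretely, the runtime can report how many blocks of a given kernel fit on one multiprocessor at a time. A hedged sketch, reusing the addKernel above with an arbitrary block size of 1024:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;
int p; // private, per-thread; typically held in a register
p = a[i] + b[i];
c[i] = p;
}

int main()
{
int numBlocks = 0;
// how many 1024-thread blocks of this kernel can be resident per SM,
// given its register/shared memory usage (0 bytes of dynamic shared memory)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, addKernel, 1024, 0);
printf("Resident blocks per SM: %d\n", numBlocks);
return 0;
}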

Are needless write operations in multi-thread kernels in CUDA inefficient?

I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; // declare shared memory (one copy per block)
sharemem = 0; // set it to its initial value
__syncthreads();
// do various calculations on the shared memory;
// for example I use atomicAdd() to add each thread's
// result to sharemem...
__syncthreads();
*globalmem = sharemem; // write shared memory to global memory
}
The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?
At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.
However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).
It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; // declare shared memory (one copy per block)
if (!threadIdx.x) sharemem = 0; //set it to initial value
__syncthreads();
//do various calculations on the shared memory
//for example I use atomicAdd() to add each thread's
//result to sharemem...
__syncthreads();
if (!threadIdx.x) *globalmem = sharemem;//write shared memory to global memory
}
Although you didn't ask about it, using atomics on shared memory within a threadblock can often be replaced (possibly for better performance) by a shared memory reduction method.
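A hedged sketch of such a block-level reduction, replacing the atomicAdd accumulation (this assumes a power-of-two block size of 256 and a hypothetical per-thread input array in):
__global__ void calc(float *globalmem, const float *in)
{
__shared__ float vals[256]; // assumes blockDim.x == 256
vals[threadIdx.x] = in[threadIdx.x]; // each thread deposits its own result
__syncthreads();
// tree reduction: halve the number of active threads at each step
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (threadIdx.x < s)
vals[threadIdx.x] += vals[threadIdx.x + s];
__syncthreads();
}
if (!threadIdx.x) *globalmem = vals[0]; // a single thread writes the block sum
}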

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float a;
a = delt[i] + deltb[i];
a += 1;
}
and the kernel call looks something like below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);
Is "a" stored on the stack?
Is a new "a" created for each thread when they are initialized?
Or will each thread independently access "a" at an unknown time, potentially messing up the algorithm?
Any variable (scalar or array) declared inside a kernel function without the __shared__ qualifier is local to each thread; that is, each thread has its own "copy" of that variable, and no data race among threads will occur!
The compiler chooses whether local variables reside in registers or in local memory (which physically lives in global device memory), depending on the transformations and optimizations it performs.
Further details on which variables go to local memory can be found in the NVIDIA CUDA C Programming Guide, section 5.3.2.2.
None of the above. The CUDA compiler is smart enough, and aggressive enough with optimisations, that it can detect that a is unused and optimise the complete code away. You can confirm this by compiling the kernel with -Xptxas=-v and looking at the resource counts, which should show basically no registers and no local memory or heap.
In a less trivial example, a would probably be stored in a per-thread register, or in per-thread local memory, which is off-die DRAM.
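For example, if the kernel actually stored its result, the compiler could no longer discard a, and -Xptxas=-v would then report register usage for it. A sketch (the out parameter is added purely for illustration):
__global__ void kernel(float *delt, float *deltb, float *out) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float a; // per-thread; now live, so it will typically occupy a register
a = delt[i] + deltb[i];
a += 1;
out[i] = a; // this store keeps the computation from being optimised away
}
// compile with: nvcc -Xptxas=-v ...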

Is it worthwhile to pass kernel parameters via shared memory?

Suppose that we have an array int *data, and each thread will access one element of this array. Since this array will be shared among all threads, it will be stored in global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory because I allocated memory for it using cudaMalloc. As for the other variables, I've seen some examples that pass an integer directly to the kernel function, without allocating memory for it. In my case such variables are a, b and c.
If I'm not mistaken, even though we do not directly call cudaMalloc to allocate 4 bytes for each of the three integers, CUDA will automatically do it for us, so in the end the variables a, b and c will be allocated in global memory.
Now these variables are only auxiliary; the threads only read them and nothing else.
My question is, wouldn't it be better to transfer these variables to the shared memory?
I imagine that if we had, for example, 10 blocks with 1024 threads each, we would need 10*3 = 30 reads of 4 bytes in order to store the numbers in the shared memory of each block.
Without shared memory, if each thread has to read all three variables once, the total number of global memory reads will be 1024*10*3 = 30720, which is very inefficient.
Here is the problem: I'm somewhat new to CUDA, and I'm not sure whether it's possible to move the variables a, b and c into the shared memory of each block without having every thread read them from global memory and load them into shared memory, which would again mean 1024*10*3 = 30720 global memory reads instead of 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[64];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
Here each thread loads different data into the shared array s; each thread, according to its index, loads the specified data into shared memory.
In my case, I want to load only the variables a, b and c into shared memory. These variables are always the same, they don't change, so they don't have anything to do with the threads themselves; they are auxiliary and are used by each thread to run some algorithm.
How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?
The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:
In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
In compute capability 2.x and newer devices, kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.
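For completeness, this is roughly what the proposed manual staging would look like; it only adds instructions and a barrier on top of what the hardware already does (the names sa, sb, sc are just illustrative):
__global__ void test(int *data, int a, int b, int c)
{
__shared__ int sa, sb, sc;
if (threadIdx.x == 0) { // one thread per block copies the arguments...
sa = a; sb = b; sc = c;
}
__syncthreads(); // ...and every other thread waits for it
// reading sa/sb/sc here is no cheaper than reading a/b/c directly,
// because the arguments already live in shared or constant memory with a broadcast path
}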