Getting an unexpected value in global device memory when multiple threads write to it - cuda

Here is problem with cuda threads , memory magament, it returns single threads result "100" but would expect 9 threads result "900".
#indudel <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
__global__
void test(int in1,int*ptr){
int e = 0;
for (int i = 0; i < 100; i++){
e++;
}
*ptr +=e;
}
int main(int argc, char **argv)
{
int devID = 0;
cudaError_t error;
error = cudaGetDevice(&devID);
if (error == cudaSuccess)
{
printf("GPU Device fine\n");
}
else{
printf("GPU Device problem, aborting");
abort();
}
int* d_A;
cudaMalloc(&d_A, sizeof(int));
int res=0;
//cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
test <<<3, 3 >>>(0,d_A);
cudaDeviceSynchronize();
cudaMemcpy(&res, d_A, sizeof(int),cudaMemcpyDeviceToHost);
printf("res is : %i",res);
Sleep(10000);
return 0;
}
It returns:
GPU Device fine\n
res is : 100
Would expect it to return higher number because 3x3(blocks,threads), insted of just one threads result?
What is done wrong and where does the numbers get lost?

You can't write your sum in this way to global memory.
You have to use an atomic function to ensure that the store is atomic.
In general, when having multiple device threads writing into the same values on global memory, you have to use either atomic functions :
float atomicAdd(float* address, float val);
double atomicAdd(double*
address, double val);
reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes (old + val), and stores the result
back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
or thread synchronization :
Throughput for __syncthreads() is 16 operations per clock cycle for
devices of compute capability 2.x, 128 operations per clock cycle for
devices of compute capability 3.x, 32 operations per clock cycle for
devices of compute capability 6.0 and 64 operations per clock cycle
for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the
multiprocessor to idle as detailed in Device Memory Accesses.

(adapting another answer of mine:)
You are experiencing the effects of the increment operator not being atomic. (C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (not necessarily in the same order of threads though):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here as long as all of the threads try to store the same false value.
What you need to do is either:
Use atomic operations - the least amount of modifications to your code, but very inefficient, since it serializes your work to a great extent, or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory - using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have less threads add up pairs of pair-sums and so on. Here's an nVIDIA blog post on implementing fast reductions on their more modern GPU architectures.
block-local or warp-local and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.

Related

Where do shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works, when blocks use alot of it.
So my gpu (RTX 2080 ti) has 48 kb of shared memory per SM, and the same per threadblock. In my example below i have 2 blocks forced on the same SM, each using the full 48 kb of memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program however does terminate, whether i run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock, and switched with block 2? If yes, where does the 48 kb of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
const uint32_t size = 48000;
__shared__ uint8_t data[size];
for (int i = 0; i < size; i++)
data[i] = 1;
globalmem_message_buffer[blockIdx.x] = 1;
while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);
uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();
testKernel << <n, 1, 0, astream >> > (globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"
So my gpu (RTX 2080 ti) has 48 kb of shared memory per SM, and the same per threadblock. In my example below i have 2 blocks forced on the same SM, each using the full 48 kb of memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on a SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore in the normal CUDA scheduling model, the only way a block can be non-resident is if it has never been scheduled yet on a SM. In that case, it uses no resources, while it is waiting in the queue.
The exceptions to this would be in the case of CUDA preemption. This mechanism is not well documented, but would occur for example at the point of a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However preemption is not applicable in the case where we are analyzing the behavior of a single kernel launch.
You haven't provided a complete code example, however, for the n=2 case, your claim that these will somehow deposit on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel << <n, 1, 0, astream >> > (globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is or intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)

Bank conflict in CUDA when reading from the same location

I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
if (key_pts[blockIdx.x] < 0 return;
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
__shared__ float x;
if (threadIdx.x == 0 && threadIdx.y == 0)
x = key_pts[blockIdx.x];
__syncthreads();
if (x < 0) return;
}
Doing some timing though, I do not see any difference between the two but so far my tests are with limited data.
bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory because there is a broadcast mechanism so that all threads in the same warp, requesting the same location/value from global memory, will retrieve that without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.

CUDA context lifetime

In my application I have some part of the code that works as follows
main.cpp
int main()
{
//First dimension usually small (1-10)
//Second dimension (100 - 1500)
//Third dimension (10000 - 1000000)
vector<vector<vector<double>>> someInfo;
Object someObject(...); //Host class
for (int i = 0; i < N; i++)
someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
//GPU COMPUTING
computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
//CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So as (I hope) you can see in the code, the function that prepares the GPU is called many times depending on the value of the first dimension.
All the values that I send to constant memory always remain the same and the sizes of the pointers allocated in global memory are always the same (the data is the only one changing).
This is the actual workflow in my code but I'm not getting any speedup when using GPU, I mean the kernel does execute faster but the memory transfers became my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes to see if there is a way to do only once the copies to constant memory and memory allocations.
Normally, the cuda context begins with the first CUDA call in your application, and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app) and populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repetetively, if they are not changing in size.

CUDA shared memory broadcast and __syncthreads behavior

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, access from multiple warps to the same bank do not cause bank conflicts, they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test(float *output)
{
// Create some shared memory
__shared__ int dm_delays[NUM_VALUES];
// Loop over NUM_LOOPS
float accumulator = 0;
for(unsigned c = 0; c < NUM_LOOPS; c++)
{
// Force shared memory update
for(int d = threadIdx.x; d < NUM_VALUES; d++)
dm_delays[d] = c * d;
// __syncthreads();
for(int d = 0; d < NUM_VALUES; d++)
accumulator += dm_delays[d];
}
// Store accumulated value to global memory
for(unsigned d = 0; d < NUM_VALUES; d++)
output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some number to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts
I'm running this on a GTX 670, set at compute_capability 3.0
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.
Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.

Thrust: transform_reduce : cudaMalloc in unary_op.operator

In my unary_op.operator, I need to create a temporary array.
I guess cudaMalloc is the way to go.
But, is it performance efficient or is there a better design?
struct my_unary_op
{
__host__ __device__ int operator()(const int& index) const
{
int* array;
cudaMalloc((void**)&array, 10*sizeof(int));
for(int i = 0; i < 10; i++)
array[i] = index;
int sum=0;
for(int i=0; i < 10 ; i++)
sum += array[i];
return sum;
};
};
int main()
{
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first+100;
my_unary_op unary_op = my_unary_op();
thrust::plus<int> binary_op;
int init = 0;
int sum = thrust::transform_reduce(first, last, unary_op, init, binary_op);
return 0;
};
You won't be able to compile cudaMalloc() in a __device__ function, because it is a host-only function. You can, however, use plain malloc() or new (on devices of compute capability >= 2.0), but these are not very efficient when running on the device. There are two reasons for this. The first is that concurrently running threads are serialized during the memory allocation call. The second is that the calls allocate global memory in chunks that become arranged in such a way that when the memory load and store instructions are run by the 32 threads in a warp, they are not adjacent, so you don't get properly coalesced memory accesses.
You can address both of these issues by using fixed size C style arrays in your __device__ functions (ie., int array[10];). Small, fixed size arrays can sometimes be optimized by the compiler so that they are stored in the register file, for extremely fast access. If the compiler stores them in global memory, it will use local memory. Local memory is stored in global memory, but it is interleaved in such a way that when the 32 threads in a warp run a load or store instruction, each thread accesses adjacent locations in memory, enabling the transactions to be fully coalesced.
If you don't know at runtime what the size of your C arrays will be, allocate a max size in the array and leave some of it unused.
I think that the total amount of memory that is used by the fixed sized array will depend on the total number of threads that are processed concurrently on the GPU, not on the total number of threads launched by the kernel. In this answer #mharris shows how to calculate the maximum possible number of concurrent threads, which is 24,576 for a GTX580. So, if the fixed size array is 16 32-bit values, the maximum possible amount of memory used by the array would be 1536KiB.
If you need a wide range of array sizes, you can use templates to compile kernels with a number of different sizes. Then, at runtime, select one that is able to accommodate the size that you need. However, chances are that if you simply allocate the maximum of what you might need, the memory usage will not be the limiting factor in the number of threads that you can launch.