CUDA shared memory broadcast and __syncthreads behavior


I am experiencing an issue that looks strange, at least to me, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the best of my knowledge, if all the threads within a half-warp access the same shared memory bank, then the value is broadcast to the threads in the warp. Also, accesses from multiple warps to the same bank do not cause bank conflicts; they are just serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test(float *output)
{
    // Create some shared memory
    __shared__ int dm_delays[NUM_VALUES];
    // Loop over NUM_LOOPS
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // Force shared memory update
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;
        // __syncthreads();
        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }
    // Store accumulated value to global memory
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some numbers to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts
I'm running this on a GTX 670, set at compute_capability 3.0
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.

Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.
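To separate the broadcast reads from everything else, one option is a variant of the test kernel in which each thread writes exactly one element and barriers separate the write phase from the read phase. The following is only a minimal sketch under stated assumptions: the names shared_memory_test_synced and run_shared_memory_test_synced are hypothetical, the block is assumed to have 16 threads, and the accumulator is an int rather than a float so that the sum stays exact. It says nothing about what the profiler counters will report on a given device.

```cuda
#include <cuda_runtime.h>

#define NUM_VALUES 16
#define NUM_LOOPS 1024

// Variant of the test kernel: each thread writes exactly one element, and
// barriers separate the write phase from the read phase in every iteration.
__global__ void shared_memory_test_synced(int *output)
{
    __shared__ int dm_delays[NUM_VALUES];

    int accumulator = 0;  // int instead of float (assumption) so the sum is exact
    for (unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // One write per thread, each to a different element
        if (threadIdx.x < NUM_VALUES)
            dm_delays[threadIdx.x] = c * threadIdx.x;
        __syncthreads();  // all writes are visible before any thread reads

        // In each step every thread reads the same element
        for (int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
        __syncthreads();  // all reads are done before the next overwrite
    }

    if (threadIdx.x < NUM_VALUES)
        output[threadIdx.x] = accumulator;
}

// Hypothetical host helper: launches one block of 16 threads and returns
// output[0], or -1 on any CUDA error.
int run_shared_memory_test_synced()
{
    int *d_out = nullptr;
    int h_out[NUM_VALUES] = {0};
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    shared_memory_test_synced<<<1, NUM_VALUES>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[0] : -1;
}
```

Because the values written in each iteration depend on c and are produced by other threads, the reads cannot be hoisted out of the loop, so the with-barrier and without-barrier comparison from the question no longer applies to this variant.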

Related

Where do shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works when blocks use a lot of it.
My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program does terminate, however, whether I run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock, and swapped out for block 2? If so, where does the 48 KB of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (int i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = 1;
    while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
    printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);
uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"

My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on a SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore in the normal CUDA scheduling model, the only way a block can be non-resident is if it has never been scheduled yet on a SM. In that case, it uses no resources, while it is waiting in the queue.
The exceptions to this would be in the case of CUDA preemption. This mechanism is not well documented, but would occur for example at the point of a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However preemption is not applicable in the case where we are analyzing the behavior of a single kernel launch.
You haven't provided a complete code example, however, for the n=2 case, your claim that these will somehow deposit on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is or intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)
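The resource-based residency described above can be checked from the host with the CUDA occupancy API. The following is a minimal sketch, not the asker's program: smem_heavy_kernel and max_resident_blocks_per_sm are hypothetical names, and the kernel is only a stand-in with the same 48000-byte static shared memory footprint. The number returned depends on the device, so no particular value is claimed here.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical) with the same static shared memory footprint
// as the kernel in the question.
__global__ void smem_heavy_kernel(uint8_t *flags)
{
    __shared__ uint8_t data[48000];
    for (int i = threadIdx.x; i < 48000; i += blockDim.x)
        data[i] = 1;
    __syncthreads();
    if (threadIdx.x == 0)
        flags[blockIdx.x] = data[47999];
}

// Asks the runtime how many blocks of smem_heavy_kernel can be resident on
// one SM at the same time for the given block size. Returns -1 on error.
int max_resident_blocks_per_sm(int block_size)
{
    int num_blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, smem_heavy_kernel, block_size,
        0 /* dynamic shared memory bytes */);
    return err == cudaSuccess ? num_blocks : -1;
}
```

Calling max_resident_blocks_per_sm(1) mirrors the one-thread-per-block launch in the question. If the result is 1, a second block of this kernel can only be placed on that SM after the first one has retired.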

cuda racecheck error if using double in kernel [duplicate]

My questions are:
1) Did I understand correctly that when you declare a variable in a __global__ kernel, there is a separate copy of that variable for each thread? That would allow you to store an intermediate result in the variable for every thread. Example: vector c=a+b:
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    int p;
    p = a[i] + b[i];
    c[i] = p;
}
Here we declare an intermediate variable p. But in reality there are N copies of this variable, one for each thread.
2) Is it true that if I declare an array, N copies of this array will be created, one for each thread? And since everything inside the __global__ kernel happens in GPU memory, do you need N times more GPU memory for any variable declared, where N is the number of threads?
3) In my current program I have 35*48 = 1680 blocks, and each block contains 32*32 = 1024 threads. Does that mean that any variable declared within a __global__ kernel will cost me N = 1024*1680 = 1,720,320 times more than outside the kernel?
4) To use shared memory, I need M times more memory for each variable than usual, where M is the number of blocks. Is that true?

1) Yes. Each thread has a private copy of non-shared variables declared in the function. These usually go into GPU register memory, though can spill into local memory.
2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to have enough private memory for every thread at once. This is because in hardware, not all threads need to execute simultaneously. For example, if you launch N threads it may be that half are active at a given time and the other half won't start until there are free resources to run them.
The more resources your threads use, the fewer can be run simultaneously by the hardware, but that doesn't limit how many you can ask to be run, as any threads the GPU doesn't have resources for will be run once some resources free up.
This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it is able to run threads in parallel. To run these threads in parallel it needs to fit a lot of threads at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.

Bank conflict in CUDA when reading from the same location

I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
    if (key_pts[blockIdx.x] < 0) return;
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort of bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
    __shared__ float x;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        x = key_pts[blockIdx.x];
    __syncthreads();
    if (x < 0) return;
}
Doing some timing though, I do not see any difference between the two but so far my tests are with limited data.

Bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory because there is a broadcast mechanism so that all threads in the same warp, requesting the same location/value from global memory, will retrieve that without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.

Thrust: transform_reduce : cudaMalloc in unary_op.operator

In my unary_op.operator, I need to create a temporary array.
I guess cudaMalloc is the way to go.
But, is it performance efficient or is there a better design?
struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int* array;
        cudaMalloc((void**)&array, 10*sizeof(int));
        for(int i = 0; i < 10; i++)
            array[i] = index;
        int sum=0;
        for(int i=0; i < 10 ; i++)
            sum += array[i];
        return sum;
    };
};
int main()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first+100;
    my_unary_op unary_op = my_unary_op();
    thrust::plus<int> binary_op;
    int init = 0;
    int sum = thrust::transform_reduce(first, last, unary_op, init, binary_op);
    return 0;
};

You won't be able to compile cudaMalloc() in a __device__ function, because it is a host-only function. You can, however, use plain malloc() or new (on devices of compute capability >= 2.0), but these are not very efficient when running on the device. There are two reasons for this. The first is that concurrently running threads are serialized during the memory allocation call. The second is that the calls allocate global memory in chunks that become arranged in such a way that when the memory load and store instructions are run by the 32 threads in a warp, they are not adjacent, so you don't get properly coalesced memory accesses.
You can address both of these issues by using fixed-size C-style arrays in your __device__ functions (i.e., int array[10];). Small, fixed-size arrays can sometimes be optimized by the compiler so that they are stored in the register file, for extremely fast access. If the compiler cannot keep them in registers, it places them in local memory. Local memory resides in global memory, but it is interleaved in such a way that when the 32 threads in a warp run a load or store instruction, each thread accesses adjacent locations in memory, enabling the transactions to be fully coalesced.
If you don't know at compile time what the size of your C arrays will be, declare the array with a maximum size and leave some of it unused.
I think that the total amount of memory used by the fixed-size array depends on the total number of threads that are processed concurrently on the GPU, not on the total number of threads launched by the kernel. In this answer @mharris shows how to calculate the maximum possible number of concurrent threads, which is 24,576 for a GTX580. So, if the fixed-size array holds 16 32-bit values, the maximum possible amount of memory used by the array would be 24,576 × 16 × 4 bytes = 1536 KiB.
If you need a wide range of array sizes, you can use templates to compile kernels with a number of different sizes. Then, at runtime, select one that is able to accommodate the size that you need. However, chances are that if you simply allocate the maximum of what you might need, the memory usage will not be the limiting factor in the number of threads that you can launch.
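As a minimal sketch of the fixed-size array approach applied to the functor from the question (the names my_unary_op_fixed and sum_with_fixed_array are hypothetical, and passing thrust::device explicitly is a choice made here, not something the question did):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>

// Same computation as my_unary_op, but the temporary is a fixed-size,
// per-thread array instead of a heap allocation.
struct my_unary_op_fixed
{
    __host__ __device__ int operator()(const int &index) const
    {
        int array[10];  // private to each thread; registers or local memory
        for (int i = 0; i < 10; i++)
            array[i] = index;
        int sum = 0;
        for (int i = 0; i < 10; i++)
            sum += array[i];
        return sum;
    }
};

// Reduces my_unary_op_fixed over the indices 0..99 on the device.
int sum_with_fixed_array()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 100;
    return thrust::transform_reduce(thrust::device, first, last,
                                    my_unary_op_fixed(), 0,
                                    thrust::plus<int>());
}
```

Each call returns 10 * index, so the reduction over 0..99 gives 10 * 4950 = 49500. Whether the compiler keeps the array in registers here is up to the compiler; the point of the sketch is only that no allocation call is needed.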

CUDA (JCUDA) shared memory (?) problems / undefined behaviour

I'm working on my game project (tower defence) and I'm trying to compute the distance between all critters and a tower with JCuda using shared memory. For each tower I run 1 block with N threads, where N equals the number of critters on the map. I compute the distance between all critters and the tower for the given block, and I store the smallest distance found so far in the block's shared memory. My current code looks like this:
extern "C"
__global__ void calcDistance(int** globalInputData, int size, int critters, int** globalQueryData, int* globalOutputData) {
    //shared memory
    __shared__ float minimum[2];
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = blockIdx.y;
    if (x < critters) {
        int distance = 0;
        //Calculate the distance between tower and critter
        for (int i = 0; i < size; i++) {
            int d = globalInputData[x][i] - globalQueryData[y][i];
            distance += d * d;
        }
        if (x == 0) {
            minimum[0] = distance;
            minimum[1] = x;
        }
        __syncthreads();
        if (distance < minimum[0]) {
            minimum[0] = distance;
            minimum[1] = x;
        }
        __syncthreads();
        globalOutputData[y * 2] = minimum[0];
        globalOutputData[y] = minimum[1];
    }
}
The problem is that if I rerun the code using the same input multiple times (I free all the memory on both host and device after each run), I get different output each time the code gets executed when the number of blocks (towers) is > 27. I'm fairly sure it has something to do with the shared memory and the way I'm dealing with it, as rewriting the code to use global memory gives the same result whenever the code gets executed. Any ideas?

There is a memory race (a read-after-write correctness problem) in this part of the kernel:
if (distance < minimum[0]) {
    minimum[0] = distance;
    minimum[1] = x;
}
When executed, every thread in the block is going to try and simultaneously read and write the value of minimum. There are no guarantees what will happen when multiple threads in a warp try writing to the same shared memory location, and there are no guarantees what values that other warps in the same block will read when loading from a memory location to which is being written. Memory access is not atomic, and there is no locking or serialization which would ensure that code performed the type of reduction operation you seem to be trying to do.
A milder version of the same problem applies to the write back to global memory at the end of the kernel:
__syncthreads();
globalOutputData[y * 2] = minimum[0];
globalOutputData[y] = minimum[1];
The barrier before the writes ensures that all writes to minimum have completed, so that a "final" (although inconsistent) value is stored in minimum, but then every thread in the block will execute the write.
If your intention is to have each thread compute a distance, and then have the minimum of the distance values over the block written out to global memory, you will have to either use atomic memory operations (for shared memory these are supported on devices of compute capability 1.2 and later), or write an explicit shared memory reduction. After that, only one thread should execute the write back to global memory.
Finally, you also have a potential synchronization correctness problem that could cause the kernel to hang. __syncthreads() (which maps to the PTX bar instruction) demands that every thread in the block arrive and execute the instruction prior to the kernel continuing. Having this sort of control flow:
if (x < critters) {
    ....
    __syncthreads();
    ....
}
will cause the kernel to hang if some threads in the block can branch around the barrier and exit while others wait at the barrier. There should never be any branch divergence around a __syncthreads() call to ensure execution correctness of a kernel in CUDA.
So, in summary, back to the drawing board on at least three issues in the current code.
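The explicit shared memory reduction mentioned above could look roughly like the following. This is a sketch under several assumptions that are not in the original code: the coordinates are stored in flat, row-major int arrays rather than int**, the block size is a power of two equal to BLOCK_SIZE, the grid is launched as (1, towers), and the critter index is written to output[tower * 2 + 1]. The names calcMinDistance and run_calc_min_distance are hypothetical.

```cuda
#include <climits>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // threads per block; assumed to be a power of two

// One block per tower (blockIdx.y). Assumed layout: critterPos[c * size + i]
// and towerPos[t * size + i].
__global__ void calcMinDistance(const int *critterPos, int size, int critters,
                                const int *towerPos, int *output)
{
    __shared__ int minDist[BLOCK_SIZE];
    __shared__ int minIdx[BLOCK_SIZE];
    int tower = blockIdx.y;
    int tid = threadIdx.x;

    // Each thread scans a strided subset of critters, keeping a private minimum
    int best = INT_MAX, bestIdx = -1;
    for (int c = tid; c < critters; c += blockDim.x) {
        int distance = 0;
        for (int i = 0; i < size; i++) {
            int d = critterPos[c * size + i] - towerPos[tower * size + i];
            distance += d * d;
        }
        if (distance < best) { best = distance; bestIdx = c; }
    }
    minDist[tid] = best;
    minIdx[tid] = bestIdx;
    __syncthreads();  // reached by every thread; nothing exits early above

    // Tree reduction; the barrier sits outside the conditional
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride && minDist[tid + stride] < minDist[tid]) {
            minDist[tid] = minDist[tid + stride];
            minIdx[tid] = minIdx[tid + stride];
        }
        __syncthreads();
    }

    // Only one thread writes the block's result
    if (tid == 0) {
        output[tower * 2] = minDist[0];
        output[tower * 2 + 1] = minIdx[0];
    }
}

// Hypothetical host helper: copies inputs, launches one block per tower and
// copies back 2 ints per tower. Returns 0 on success, -1 on a CUDA error.
int run_calc_min_distance(const int *hCritters, int critters,
                          const int *hTowers, int towers, int size, int *hOut)
{
    int *dC = nullptr, *dT = nullptr, *dO = nullptr;
    size_t cBytes = sizeof(int) * critters * size;
    size_t tBytes = sizeof(int) * towers * size;
    size_t oBytes = sizeof(int) * towers * 2;
    bool ok = cudaMalloc(&dC, cBytes) == cudaSuccess &&
              cudaMalloc(&dT, tBytes) == cudaSuccess &&
              cudaMalloc(&dO, oBytes) == cudaSuccess &&
              cudaMemcpy(dC, hCritters, cBytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dT, hTowers, tBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        calcMinDistance<<<dim3(1, towers), BLOCK_SIZE>>>(dC, size, critters, dT, dO);
        ok = cudaMemcpy(hOut, dO, oBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dC); cudaFree(dT); cudaFree(dO);
    return ok ? 0 : -1;
}
```

The strided loop means the number of critters no longer has to match the block size, and because no thread returns before the barriers, the divergence problem around __syncthreads() does not arise. On ties the sketch keeps the entry held by the lower thread index, which is deterministic but not necessarily the lowest critter index.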
