Related
I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way to get the best performance you should make your threads coalesce and align any access. My GPU has Compute Capability 6.1 "Pascal", has 48 kiB of shared memory per thread block and 2 GiB DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block each may simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with on offset of 48 kiB / 32 banks / sizeof(double) which equals 192:
__shared__ double cache[6144];
__global__ void kernel(double *buf_out, double a, double b, double c)
{
for(...)
{
// Perform calculation on shared memory
cache[threadIdx.x * 192] = ...
}
// Write result to global memory
buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double make sure that each thread will access the first element of a different bank, at the same time. I haven't gotten around to modify and test the code, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
cache[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
cache[tid] += cos(temp * (Y * y + Z * z));
}
}
buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).
So, I have crafted a "minimal" working example, which follows:
#include <cuComplex.h>
#include <iostream>
int const n = 512 * 100;
typedef float real;
template < class T >
struct my_complex {
T x;
T y;
};
__global__ void set( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 1.0f, 0.0f };
}
__global__ void duplicate_whole( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 2.0f * d.x, 2.0f * d.y };
}
__global__ void duplicate_half( real * a )
{
real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d *= 2.0f;
}
int main()
{
my_complex< real > * a;
cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );
set<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_whole<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
cudaDeviceSynchronize();
my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );
std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )" << std::endl;
return 0;
}
When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.
However, when I analyze the kernels using nvvp I get different reports for each of the kernels in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing
Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4
I agree that I am loading 8 byte words. What I do not understand is why 4 bytes is the ideal word size. In special, there is no performance difference between the kernels.
I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?
And why is that I do not get a performance hit?
I hope that this edit has clarified some unclear points.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'll start wit some kernel code to exemplify my question, which will follow below
template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
std::complex< data_t > * data,
int M,
int row_length,
int b,
int i_0
)
{
#ifndef CUGALE_MUL_SHUFFLE
// Output array length:
int plane_area = row_length * M;
// Process element:
int i = blockIdx.x * row_length + threadIdx.x + i_0;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
my_complex< data_t > datum;
my_complex< data_t > datum_new;
for ( int i_b = 0; i_b < b; ++ i_b )
{
my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
datum = ref_datum;
datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
ref_datum = datum_new;
}
#else
// Output array length:
int plane_area = row_length * M;
// Element to process:
int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
// Real and imaginary part of datum (not respectively for odd threads):
data_t datum_a;
data_t datum_b;
// Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
int parity = ( threadIdx.x % 2 );
int shuffle_dir = 1 - 2 * parity;
int inwarp_tid = threadIdx.x % warpSize;
for ( int i_b = 0; i_b < b; ++ i_b )
{
int data_idx = i_b * plane_area + i;
datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );
// Even TIDs compute real part, odd TIDs compute imaginary part:
reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
}
#endif // #ifndef CUGALE_MUL_SHUFFLE
}
Let us consider the case where data_t is float, which is memory bandwidth limited. As it can be seen above, there are two versions of the kernel, one which reads/writes 8 bytes (a whole complex number) per thread and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.
The reason why I have written the version using shuffle is because nvvp insisted that reading 8 bytes per thread was not the best idea because this memory access pattern would be inefficient. This is the case even though in both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to theoretical maximum.
Surely enough I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:
Why is that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?
As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double, there is no execution time difference too between the two kernel versions, even though in this case the kernel is compute bound and the shuffle version performs some more flops than the original version.
Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines) and the chirp_factor array is row_length * M. Both kernels run perfecly fine (I can edit the question to show you the calls to both versions if you have doubts about it).
The issue here has to do with how the compiler is processing your code. nvvp is merely dutifully reporting what is happening when you run your code.
If you use the cuobjdump -sass tool on your executable, you will discover that the duplicate_whole routine is doing two 4-byte loads and two 4-byte stores. This is not optimal, partly becuase there is a stride in each load and store (each load and store touches alternate elements in memory).
The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal for use in situations that would prevent the compiler from generating a (legal) 8-byte load. As discussed here we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to your struct looks like this:
template < class T >
struct __align__(8) my_complex {
T x;
T y;
};
With that change to your code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this sort of decoration only when you understand what it means and are willing to enter into a contract with the compiler that you will ensure this is the case. If you do something unusual, like unusual pointer casting, you can violate your end of the bargain and generate a machine fault.
The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches
When you do a strided load, the GPU loads an entire cacheline anyway, even though (in this case) you only need half the elements (the real elements) for that particular load operation. However you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and this instruction most likely hits in the cache, due to the previous load.
On a strided store in this case, writing strided elements in one instruction and the alternate elements in the next instruction will end up using one of the caches as a "coalescing buffer". This isn't coalescing in the typical sense used in CUDA terminology; that sort of coalescing only applies to a single instruction. However the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line, before that line gets written out or evicted. This is approximately equivalent to "write-back" cache behavior.
Suppose I have a 32 or 64 bit unsigned integer.
What is the fastest way to find the index i of the leftmost bit such that the number of 0s in the leftmost i bits equals the number of 1s in the leftmost i bits?
I was thinking of some bit tricks like the ones mentioned here.
I am interested in recent x86_64 processor. This might be relevant as some processor support instructions as POPCNT (count the number of 1s) or LZCNT (counts the number of leading 0s).
If it helps, it is possible to assume that the first bit has always a certain value.
Example (with 16 bits):
If the integer is
1110010100110110b
^
i
then i=10 and it corresponds to the marked position.
A possible (slow) implementation for 16-bit integers could be:
mask = 1000000000000000b
pos = 0
count=0
do {
if(x & mask)
count++;
else
count--;
pos++;
x<<=1;
} while(count)
return pos;
Edit: fixed bug in code as per #njuffa comment.
I don't have any bit tricks for this, but I do have a SIMD trick.
First a few observations,
Interpreting 0 as -1, this problem becomes "find the first i so that the first i bits sum to 0".
0 is even but all the bits have odd values under this interpretation, which gives the insight that i must be even and this problem can be analyzed by blocks of 2 bits.
01 and 10 don't change the balance.
After spreading the groups of 2 out to bytes (none of the following is tested),
// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
_mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
spread = _mm_and_si128(spread, _mm_set1_epi8(3));
Replace 00 by -1, 11 by 1, and 01 and 10 by 0:
__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1, 0,0,0,0,0,0,0,0,0,0,0,0),
spread);
Calculate the prefix sum:
__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));
Find the highest 0:
__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
The << 15 and *2 appear because the resulting mask is 16 bits but the clz is 32 bit, it's shifted one less because if the top byte is zero that indicates that 1 group of 2 is taken, not zero.
This is a solution for 32-bit data using classical bit-twiddling techniques. The intermediate computation requires 64-bit arithmetic and logic operations. I have to tried to stick to portable operations as far as it was possible. Required is an implementation of the POSIX function ffsll to find the least-significant 1-bit in a 64-bit long long, and a custom function rev_bit_duos that reverses the bit-duos in a 32-bit integer. The latter could be replaced with a platform-specific bit-reversal intrinsic, such as the __rbit intrinsic on ARM platforms.
The basic observation is that if a bit-group with an equal number of 0-bits and 1-bits can be extracted, it must contain an even number of bits. This means we can examine the operand in 2-bit groups. We can further restrict ourselves to tracking whether each 2-bit increases (0b11), decreases (0b00) or leaves unchanged (0b01, 0b10) a running balance of bits. If we count positive and negative changes with separate counters, 4-bit counters will suffice unless the input is 0 or 0xffffffff, which can be handled separately. Based on comments to the question, these cases shouldn't occur. By subtracting the negative change count from the positive change count for each 2-bit group we can find at which group the balance becomes zero. There may be multiple such bit groups, we need to find the first one.
The processing can be parallelized by expanding each 2-bit group into a nibble that then can serve as a change counter. The prefix sum can be computed via integer multiply with an appropriate constant, which provides the necessary shift & add operations at each nibble position. Efficient ways for parallel nibble-wise subtraction are well-known, likewise there is a well-known technique due to Alan Mycroft for detecting zero-bytes that is trivially changeable to zero-nibble detection. POSIX function ffsll is then applied to find the bit position of that nibble.
Slightly problematic is the requirement for extraction of a left-most bit group, rather than a right-most, since Alan Mycroft's trick only works for finding the first zero-nibble from the right. Also, handling the prefix-sum for left-most bit group require use of a mulhi operation which may not be easily available, and may be less efficient than standard integer multiplication. I have addressed both of these issues by simply bit-reversing the original operand up front.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
/* Reverse bit-duos using classic binary partitioning algorithm */
inline uint32_t rev_bit_duos (uint32_t a)
{
uint32_t m;
a = (a >> 16) | (a << 16); // swap halfwords
m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos
return a;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int solution (uint32_t x)
{
const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words
const uint64_t mask8 = 0x00ff00ff00ff00ffULL; // alternate bytes
const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo
const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo
const uint64_t nibble_lsb = 0x1111111111111111ULL;
const uint64_t nibble_msb = 0x8888888888888888ULL;
uint64_t a, b, r, s, t, expx, pc_expx, nc_expx;
int res;
/* common path can't handle all 0s and all 1s due to counter overflow */
if ((x == 0) || (x == ~0)) return 0;
/* make zero-nibble detection work, and simplify prefix sum computation */
x = rev_bit_duos (x); // reverse bit-duos
/* expand each bit-duo into a nibble */
expx = x;
expx = ((expx << 16) | expx) & mask16;
expx = ((expx << 8) | expx) & mask8;
expx = ((expx << 4) | expx);
expx = ((expx & mask4h) * 4) + (expx & mask4l);
/* compute positive and negative change counts for each nibble */
pc_expx = expx & ( expx >> 1) & nibble_lsb;
nc_expx = ~expx & (~expx >> 1) & nibble_lsb;
/* produce prefix sums for positive and negative change counters */
a = pc_expx * nibble_lsb;
b = nc_expx * nibble_lsb;
/* subtract positive and negative prefix sums, nibble-wise */
s = a ^ ~b;
r = a | nibble_msb;
t = b & ~nibble_msb;
s = s & nibble_msb;
r = r - t;
r = r ^ s;
/* find first nibble that is zero using Alan Mycroft's magic */
r = (r - nibble_lsb) & (~r & nibble_msb);
res = ffsll (r) / 2; // account for bit-duo to nibble expansion
return res;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int reference (uint32_t x)
{
int count = 0;
int bits = 0;
uint32_t mask = 0x80000000;
do {
bits++;
if (x & mask) {
count++;
} else {
count--;
}
x = x << 1;
} while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT)));
return (count) ? 0 : bits;
}
int main (void)
{
uint32_t x = 0;
do {
uint32_t ref = reference (x);
uint32_t res = solution (x);
if (res != ref) {
printf ("x=%08x res=%u ref=%u\n\n", x, res, ref);
}
x++;
} while (x);
return EXIT_SUCCESS;
}
A possible solution (for 32-bit integers). I'm not sure if it can be improved / avoid the use of lookup tables. Here x is the input integer.
//Look-up table of 2^16 elements.
//The y-th is associated with the first 2 bytes y of x.
//If the wanted bit is in y, LUT1[y] is minus the position of the bit
//If the wanted bit is not in y, LUT1[y] is the number of ones in excess in y minus 1 (between 0 and 15)
LUT1 = ....
//Look-up talbe of 16 * 2^16 elements.
//The y-th element is associated to two integers y' and y'' of 4 and 16 bits, respectively.
//y' is the number of excess ones in the first byte of x, minus 1
//y'' is the second byte of x. The table contains the answer to return.
LUT2 = ....
if(LUT1[x>>16] < 0)
return -LUT1[x>>16];
return LUT2[ (LUT1[x>>16]<<16) | (x & 0xFFFF) ]
This requires ~1MB for the lookup tables.
The same idea also works using 4 lookup tables (one per byte of x). The requires more operations but brings down the memory to 12KB.
LUT1 = ... //2^8 elements
LUT2 = ... //8 * 2^8 elements
LUT3 = ... //16 * 2^8 elements
LUT3 = ... //24 * 2^8 elements
y = x>>24
if(LUT1[y] < 0)
return -LUT1[y];
y = (LUT1[y]<<8) | ((x>>16) & 0xFF);
if(LUT2[y] < 0)
return -LUT2[y];
y = (LUT2[y]<<8) | ((x>>8) & 0xFF);
if(LUT3[y] < 0)
return -LUT3[y];
return LUT4[(LUT2[y]<<8) | (x & 0xFF) ];
About two years ago, I wrote a kernel for work on several numerical grids simultaneously. Some very strange behaviour emerged, which resulted in wrong results. When hunting down the bug utilizing printf()-statements inside the kernel, the bug vanished.
Due to deadline constraints, I kept it that way, though recently I figured that this was no appropriate coding style. So I revisited my kernel and boiled it down to what you see below.
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
This kernel computes the derivative of some value at all grid points for a number of numerical 1D-grids.
Things to consider: (see full code base at the bottom)
a grid consists of 1300 grid points
each grid has to be worked upon by two blocks (due to memory/register limitations)
each block successively works on 37 grids (or better: grid halves, the while-loop takes care of that)
each thread is responsible for the same grid point in each grid
for the derivative to be computed, the threads need access to data from the four next grid points
in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
due to the boiling down process, the two threads on each side of the blocks do no computations, in the original they are responsible for writing the corresponing grid values to shared memory
The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r).
Each grid occupies 1300 consecutive values in each of these arrays.
The while-loop in the kernel iterates over 37 grids for each block.
To evaluate the workings of the kernel, each grid is initialized with the exact same values, so a deterministic program will produce the same result for each grid.
This does not happen with my code.
The weirdness of the Heisenbug:
I compared the computed values of grid 0 with each of the other grids, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous.
The first thing that came to mind was that two threads at this overlap try to concurrently write to memory, though that does not seem to be the case (I checked.... and re-checked).
Commenting or un-commenting lines or using printf() for debugging purposes will alter
the outcome of the program as well: When "asking" the threads responsible for the grid points in question, they tell me that everything is allright, and they are actually correct. As soon as I force a thread to print out its variables, they will be computed (and more importantly: stored) correctly.
The same goes for debugging with Nsight Eclipse.
Memcheck / Racecheck:
cuda-memcheck (memcheck and racecheck) report no memory/racecondition problems, though even the usage of one of these tools have the ability to impact the correctness of the results.
Valgrind gives some warnings, though I think they have something to do with the CUDA API which I can not influence and which seem unrelated to my problem.
(Update)
As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand has a race condition on d_u, i.e., global memory.
Testing environment:
The original kernel has been tested on different CUDA devices and with different compute capabilities (2.0, 3.0 and 3.5) with the bug showing up in every configuration (in some form or another).
My (main) testsystem is the following:
2 x GTX 460, tested on both the GPU that ran the X-server as well as
the other one
Driver Version: 340.46
Cuda Toolkit 6.5
Linux Kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)
State of solution:
By now I am pretty sure that some memory access is the culprit, maybe some optimization from the compiler or use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm.
The fact that printf() statements inside the kernel (which through some dark magic have to utilize device and host memory as well) and memcheck algorithms (cuda-memcheck and valgrind) influence
the bevavior point in the same direction.
I am sorry for this somewhat complicated kernel, but I boiled the original kernel and invocation down as much as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.
Two "solutions", which force the kernel do work as intended, are marked in the code.
(Update) As mentioned in the correct answer below, the problem with my code is a race condition at the border of the thread-blocks. As there are two blocks working on each grid and there is no guarantee as to which block works first, resulting in the behavior outlined below. It also explains the correct results when employing "Solution 1" as mentioned in the code, because the input/output value d_u is not altered when uchange = 1.0.
The simple solution is to split this kernel into two kernels, one computing d_u, the other computing the derivative d_du. It would be more desirable to have just one kernel invocation instead of two, though I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, though the overhead for the second kernel invocation is negligible.
heisenbug.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
const float r_sol = 6.955E8f;
__constant__ float d_fourpoint_constant = 0.2f;
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
}
bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid){
bool retval = true;
for(uint i=0; i<numNodesPerGrid; ++i)
if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
{
printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
label, i, id0, id1, matarray[id0 * numNodesPerGrid + i], matarray[id1 * numNodesPerGrid + i],
matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
retval = false;
}
return retval;
}
int main(int argc, const char* argv[])
{
float *d_u;
float *d_du;
float *d_r;
float *u_arr;
float *du_arr;
float *r_arr;
int numNodesPerGrid = 1300;
int numBlocksPerSM = 2;
int numGridsPerSM = 37;
int numSM = 7;
int TPB = 672;
int radius = 2;
int numGrids = 259;
int memsize_grid = sizeof(float) * numNodesPerGrid;
int numBlocksPerGrid = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);
printf("---------------------------------------------------------------------------\n");
printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
printf("---------------------------------------------------------------------------\n\n");
cudaSetDevice(0);
cudaDeviceReset();
cudaMalloc((void **) &d_u, memsize_grid * numGrids);
cudaMalloc((void **) &d_du, memsize_grid * numGrids);
cudaMalloc((void **) &d_r, memsize_grid * numGrids);
u_arr = new float[numGrids * numNodesPerGrid];
du_arr = new float[numGrids * numNodesPerGrid];
r_arr = new float[numGrids * numNodesPerGrid];
for(uint k=0; k<numGrids; ++k)
for(uint i=0; i<numNodesPerGrid; ++i)
{
uint index = k * numNodesPerGrid + i;
if (i < 585)
r_arr[index] = i * (6000.0f);
else
{
if (i == 585)
r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
else
r_arr[index] = r_arr[index - 1] + 1.02102f * ( r_arr[index - 1] - r_arr[index - 2] );
}
u_arr[index] = 1E-10f * (i+1);
du_arr[index] = 0.0f;
}
/*
printf("\n\nbefore kernel start\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
bool equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Input values are not identical for different grids!\n\n");
else
printf("All grids contain the same values at same grid points.!\n\n");
cudaMemcpy(d_u, u_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_du, du_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_r, r_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
printf("Configuration:\n\n");
printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
printf("Kernel launch parameters:\n\n");
printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
printf("Launching Kernel...\n\n");
heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
cudaDeviceSynchronize();
cudaMemcpy(u_arr, d_u, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(du_arr, d_du, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(r_arr, d_r, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
/*
printf("\n\nafter kernel finished\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Results are wrong!!\n");
else
printf("All went well!\n");
cudaFree(d_u);
cudaFree(d_du);
cudaFree(d_r);
delete [] u_arr;
delete [] du_arr;
delete [] r_arr;
return 0;
}
Makefile:
CUDA = 1
DEFINES =
ifeq ($(CUDA), 1)
DEFINES += -DCUDA
CUDAPATH = /usr/local/cuda-6.5
CUDAINCPATH = -I$(CUDAPATH)/include
CUDAARCH = -arch=sm_20
endif
CXX = g++
CXXFLAGS = -pipe -g -std=c++0x -fPIE -O0 $(DEFINES)
VALGRIND = valgrind
VALGRIND_FLAGS = -v --leak-check=yes --log-file=out.memcheck
CUDAMEMCHECK = cuda-memcheck
CUDAMC_FLAGS = --tool memcheck
RACECHECK = $(CUDAMEMCHECK)
RACECHECK_FLAGS = --tool racecheck
INCPATH = -I. $(CUDAINCPATH)
LINK = g++
LFLAGS = -O0
LIBS =
ifeq ($(CUDA), 1)
NVCC = $(CUDAPATH)/bin/nvcc
LIBS += -L$(CUDAPATH)/lib64/
LIBS += -lcuda -lcudart -lcudadevrt
NVCCFLAGS = -g -G -O0 --ptxas-options=-v
NVCCFLAGS += -lcuda -lcudart -lcudadevrt -lineinfo --machine 64 -x cu $(CUDAARCH) $(DEFINES)
endif
all:
$(NVCC) $(NVCCFLAGS) $(INCPATH) -c -o $(DST_DIR)heisenbug.o $(SRC_DIR)heisenbug.cu
$(LINK) $(LFLAGS) -o heisenbug heisenbug.o $(LIBS)
clean:
rm heisenbug.o
rm heisenbug
memrace: all
./heisenbug > out
$(VALGRIND) $(VALGRIND_FLAGS) ./heisenbug > out.memcheck.log
$(CUDAMEMCHECK) $(CUDAMC_FLAGS) ./heisenbug > out.cudamemcheck
$(RACECHECK) $(RACECHECK_FLAGS) ./heisenbug > out.racecheck
Note that in the entirety of your writeup, I do not see a question being explicitly asked, therefore I am responding to:
I am looking forward to learning what is going on here.
You have a race condition on d_u.
by your own statement:
•in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
Furthermore, if you comment out the write to d_u, according to your statement in the code, the problem disappears.
CUDA threadblocks can execute in any order. You have at least 2 different blocks that are reading from grid points 666, 667, 668, 669. The results will be different depending on which case actually occurs:
both blocks read the value before any writes occur.
one block reads the value, then a write occurs, then the other block reads the value.
The blocks are not independent of each other (contrary to your statement) if one block is reading a value that can be written to by another block. The order of block execution will determine the result in this case, and CUDA does not specify the order of block execution.
Note that cuda-memcheck with the -tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, therefore I would not expect cuda-memcheck to report anything.
cuda-memcheck, in order to gather its data, does influence the order of block execution, so it's not surprising that it affects the behavior.
in-kernel printf represents a costly function call, writing to a global memory buffer. So it also affects execution behavior/patterns. And if you are printing out a large amount of data, exceeding the buffer lines of output, the effect is extremely costly (in terms of execution time) in the event of buffer overflow.
As an aside, Linux Mint is not a supported distro for CUDA, as far as I can see. However I don't think this is relevant to your issue; I can reproduce the behavior on a supported config.
I am trying to optimize this FDTD code with CUDA Fortran. I have three 3-D cube matrix with input, output and costant.
attributes (global) subroutine kernel_h(k,num_cells_x,num_cells_y,num_cells_z,Hx,Hy,Hz,Ex,Ey,Ez,Cbdx,Cbdy,Cbdz)
implicit none
integer :: idx,idy
integer,value :: k,num_cells_x,num_cells_y,num_cells_z
real(kind=8), intent(in), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Ex, Ey, Ez
real(kind=8), intent(inout), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Hx, Hy, Hz
real(kind=8), intent(in), constant, dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Cbdx,Cbdy,Cbdz
idx = threadIdx%x + ((blockIdx%x-1) * blockDim%x)
idy = threadIdx%y + ((blockIdx%y-1) * blockDim%y)
do while (idx < num_cells_x)
Hz(idx,idy,k) = Hz(idx,idy,k) + ((Ex(idx,idy+1,k)-Ex(idx,idy,k))*Cbdy(idx,idy,k) + (Ey(idx,idy,k)-Ey(idx+1,idy,k))*Cbdx(idx,idy,k))
Hx(idx,idy,k) = Hx(idx,idy,k) + ((Ey(idx,idy,k+1)-Ey(idx,idy,k))*Cbdz(idx,idy,k) + (Ez(idx,idy,k)-Ez(idx,idy+1,k))*Cbdy(idx,idy,k))
Hy(idx,idy,k) = Hy(idx,idy,k) + ((Ez(idx+1,idy,k)-Ez(idx,idy,k))*Cbdx(idx,idy,k) + (Ex(idx,idy,k)-Ex(idx,idy,k+1))*Cbdz(idx,idy,k))
idx = idx + (blockDim%x * gridDim%x)
idy = idy + (blockDim%y * gridDim%y)
end do
end subroutine kernel_h
and my kernel launch is:
bdim=dim3(16,16,1)
gdim=dim3((num_cells_x+(bdim%x-1))/bdim%x,(num_cells_y+(bdim%y-1))/bdim%y,1)
do k=1,num_cells_z
call kernel_h<<<gdim,bdim>>>(k,num_cells_x,num_cells_y,num_cells_z,Hx_d,Hy_d,Hz_d,Ex_d,Ey_d,Ez_d,Cbdx_d,Cbdy_d,Cbdz_d)
end do
My questions are: why i can't load more than 100x100x100 matrix? If i try i get a kernel error launch failure. And can i improve my code performace? I think it could be written in a better way.
I would guess that you are accessing out of bounds.
Consider a 10x10x10 volume (x,y,z). In that case you will launch a single block of 16x16 threads. These threads will access a 17x17 slice (since stencil radius is 1) which is clearly going to end up out of bounds. You would need to disable those threads that will access out of bounds and also disable those threads that will reach beyond the boundary to apply their stencil.
Consider looking at the FDTD3D sample in the CUDA SDK. Granted it's in C but it illustrates how to handle this problem and it also shows how to use shared memory to have a much more efficient implementation.