CUDA's nvvp reports non-ideal memory access pattern, but bandwidth is almost peaking - cuda

EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).
So, I have crafted a "minimal" working example, which follows:
#include <cuComplex.h>
#include <iostream>
int const n = 512 * 100;
typedef float real;
template < class T >
struct my_complex {
T x;
T y;
};
__global__ void set( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 1.0f, 0.0f };
}
__global__ void duplicate_whole( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 2.0f * d.x, 2.0f * d.y };
}
__global__ void duplicate_half( real * a )
{
real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d *= 2.0f;
}
int main()
{
my_complex< real > * a;
cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );
set<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_whole<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
cudaDeviceSynchronize();
my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );
std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )" << std::endl;
return 0;
}
When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.
However, when I analyze the kernels using nvvp I get different reports for each of the kernels in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing
Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4
I agree that I am loading 8 byte words. What I do not understand is why 4 bytes is the ideal word size. In special, there is no performance difference between the kernels.
I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?
And why is that I do not get a performance hit?
I hope that this edit has clarified some unclear points.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'll start wit some kernel code to exemplify my question, which will follow below
template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
std::complex< data_t > * data,
int M,
int row_length,
int b,
int i_0
)
{
#ifndef CUGALE_MUL_SHUFFLE
// Output array length:
int plane_area = row_length * M;
// Process element:
int i = blockIdx.x * row_length + threadIdx.x + i_0;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
my_complex< data_t > datum;
my_complex< data_t > datum_new;
for ( int i_b = 0; i_b < b; ++ i_b )
{
my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
datum = ref_datum;
datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
ref_datum = datum_new;
}
#else
// Output array length:
int plane_area = row_length * M;
// Element to process:
int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
// Real and imaginary part of datum (not respectively for odd threads):
data_t datum_a;
data_t datum_b;
// Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
int parity = ( threadIdx.x % 2 );
int shuffle_dir = 1 - 2 * parity;
int inwarp_tid = threadIdx.x % warpSize;
for ( int i_b = 0; i_b < b; ++ i_b )
{
int data_idx = i_b * plane_area + i;
datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );
// Even TIDs compute real part, odd TIDs compute imaginary part:
reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
}
#endif // #ifndef CUGALE_MUL_SHUFFLE
}
Let us consider the case where data_t is float, which is memory bandwidth limited. As it can be seen above, there are two versions of the kernel, one which reads/writes 8 bytes (a whole complex number) per thread and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.
The reason why I have written the version using shuffle is because nvvp insisted that reading 8 bytes per thread was not the best idea because this memory access pattern would be inefficient. This is the case even though in both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to theoretical maximum.
Surely enough I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:
Why is that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?
As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double, there is no execution time difference too between the two kernel versions, even though in this case the kernel is compute bound and the shuffle version performs some more flops than the original version.
Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines) and the chirp_factor array is row_length * M. Both kernels run perfecly fine (I can edit the question to show you the calls to both versions if you have doubts about it).

The issue here has to do with how the compiler is processing your code. nvvp is merely dutifully reporting what is happening when you run your code.
If you use the cuobjdump -sass tool on your executable, you will discover that the duplicate_whole routine is doing two 4-byte loads and two 4-byte stores. This is not optimal, partly becuase there is a stride in each load and store (each load and store touches alternate elements in memory).
The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal for use in situations that would prevent the compiler from generating a (legal) 8-byte load. As discussed here we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to your struct looks like this:
template < class T >
struct __align__(8) my_complex {
T x;
T y;
};
With that change to your code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this sort of decoration only when you understand what it means and are willing to enter into a contract with the compiler that you will ensure this is the case. If you do something unusual, like unusual pointer casting, you can violate your end of the bargain and generate a machine fault.
The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches
When you do a strided load, the GPU loads an entire cacheline anyway, even though (in this case) you only need half the elements (the real elements) for that particular load operation. However you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and this instruction most likely hits in the cache, due to the previous load.
On a strided store in this case, writing strided elements in one instruction and the alternate elements in the next instruction will end up using one of the caches as a "coalescing buffer". This isn't coalescing in the typical sense used in CUDA terminology; that sort of coalescing only applies to a single instruction. However the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line, before that line gets written out or evicted. This is approximately equivalent to "write-back" cache behavior.

Related

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way to get the best performance you should make your threads coalesce and align any access. My GPU has Compute Capability 6.1 "Pascal", has 48 kiB of shared memory per thread block and 2 GiB DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block each may simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with on offset of 48 kiB / 32 banks / sizeof(double) which equals 192:
__shared__ double cache[6144];
__global__ void kernel(double *buf_out, double a, double b, double c)
{
for(...)
{
// Perform calculation on shared memory
cache[threadIdx.x * 192] = ...
}
// Write result to global memory
buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double make sure that each thread will access the first element of a different bank, at the same time. I haven't gotten around to modify and test the code, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
cache[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
cache[tid] += cos(temp * (Y * y + Z * z));
}
}
buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.

CUDA float precision not matching CPU implementation

I am using CUDA 5.5 compute 3.5 on GTX 1080Ti and want to compute this formula:
y = a * a * b / 64 + c * c
Suppose I have these parameters:
a = 5876
b = 0.4474222958088
c = 664
I am computing this both via GPU and on the CPU and they give me different inexact answers:
h_data[0] = 6.822759375000e+05,
h_ref[0] = 6.822760000000e+05,
difference = -6.250000000000e-02
h_data is the CUDA answer, h_ref is the CPU answer. When I plug these into my calculator the GPU answer is closer to the exact answer, and I suspect this has to do with floating point precision. My question now is, how can I get the CUDA solution to match the precision/roundoff of CPU version? If I offset the a parameter by +/-1 the solutions match, but if I offset say the c parameter I still get a difference of 1/16
Here's the working code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
__global__ void test_func(float a, float b, int c, int nz, float * __restrict__ d_out)
{
float *fdes_out = d_out + blockIdx.x * nz;
float roffout2 = a * a / 64.f;
//float tmp = fma(roffout2,vel,index*index);
for (int tid = threadIdx.x; tid < nz; tid += blockDim.x) {
fdes_out[tid] = roffout2 * b + c * c;
}
}
int main (int argc, char **argv)
{
// parameters
float a = 5876.0f, b = 0.4474222958088f;
int c = 664;
int nz = 1;
float *d_data, *h_data, *h_ref;
h_data = (float*)malloc(nz*sizeof(float));
h_ref = (float*)malloc(nz*sizeof(float));
// CUDA
cudaMalloc((void**)&d_data, sizeof(float)*nz);
dim3 nb(1,1,1); dim3 nt(64,1,1);
test_func <<<nb,nt>>> (a,b,c,nz,d_data);
cudaMemcpy(h_data, d_data, sizeof(float)*nz, cudaMemcpyDeviceToHost);
// Reference
float roffout2 = a * a / 64.f;
h_ref[0] = roffout2*b + c*c;
// Compare
printf("h_data[0] = %1.12e,\nh_ref[0] = %1.12e,\ndifference = %1.12e\n",
h_data[0],h_ref[0],h_data[0]-h_ref[0]);
// Free
free(h_data); free(h_ref);
cudaFree(d_data);
return 0;
}
I'm compiling only with the-O3 flag.
This small numerical difference of one single-precision ulp occurs because the CUDA compiler applies FMA-merging by default, whereas the host compiler does not do that. FMA-merging can be turned off by adding the command line flag -fmad=false to the invocation of the CUDA compiler driver nvcc.
FMA-merging is a compiler optimization in which an FMUL and a dependent FADD are transformed into a single fused multiply-add, or FMA, instruction. An FMA instruction computes a*b+c such that the full unrounded product a*b enters into the addition with c before a final rounding is applied to produce the final result.
Usually, this has performance advantages, since a single FMA instruction is executed instead of two instructions FMUL, FADD, and all the instructions have similar latency. Usually, this also has accuracy advantages as the use of FMA eliminates one rounding step and guards against subtractive cancellation when a*c and c have opposite signs.
In this case, as noted by OP, the GPU result computed with FMA is slightly more accurate than the host result computed without FMA. Using a higher precision reference, I find that the relative error in the GPU result is -4.21e-8, while the relative error in the host result is 4.95e-8.

Best strategy for grid search with CUDA

Recently I started working with CUDA and I read an introductory book on the computing language. To see if I understood it well, I considered the following problem.
Consider a function minimize f(x,y) on the grid [-1,1] X [-1,1]. This provided me with a few practical questions and I would like to have your look on things.
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.
Thanks!
This was an interesting question for me because it represents a type of problem that I think is rare:
potentially high compute load
little to no data that needs to be communicated host->device
very low volume of results that need to be communicated device->host
In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.
Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.
To answer your questions:
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task - which GPUs are good at, and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this (compute the functional value at each grid point.) Since we won't be using global access to data for this, texture memory is not an issue.
Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.
Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.
Here's a sample code that might be interesting to play with or crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum value (in this case).
$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>
// grid dimensions and divisions
#define XNR -1.0f
#define XPR 1.0f
#define YNR -1.0f
#define YPR 1.0f
#define DX 0.0001f
#define DY 0.0001f
// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16
// optimization functions - these are currently set for minimization
#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)
// error check macro
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) +0.1f;
}
// variable for block-draining reduction block counter
__device__ int blkcnt = 0;
// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){
__shared__ float sh_f[BLK_X*BLK_Y];
__shared__ float sh_x[BLK_X*BLK_Y];
__shared__ float sh_y[BLK_X*BLK_Y];
__shared__ int lblock;
// compute x,y coordinates for this thread
float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;
int thid = (threadIdx.y*BLK_X)+threadIdx.x;
lblock = 0;
sh_x[thid] = x;
sh_y[thid] = y;
sh_f[thid] = f(x,y); // compute functional value of f(x,y)
__syncthreads();
// perform block-level shared memory reduction
// assume block size is a power of 2
for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
volatile float *vf = sh_f;
volatile float *vx = sh_x;
volatile float *vy = sh_y;
for (int i = 16; i > 0; i>>=1)
if (thid < i)
if (TST(vf[thid],vf[thid+i])){
vf[thid] = OPT(vf[thid],vf[thid+i]);
vx[thid] = OPT(vx[thid],vx[thid+i]);
vy[thid] = OPT(vy[thid],vy[thid+i]);}
// save block reduction result, and check if last block
if (!thid){
bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
int myblock = atomicAdd(&blkcnt, 1);
if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
__syncthreads();
if (lblock){
// do last-block reduction
float my_x, my_y, my_f;
int myid = thid;
if (myid < gridDim.x * gridDim.y){
my_x = bx[myid];
my_y = by[myid];
my_f = bf[myid];}
else { assert(0);} // does not work correctly if block dims are greater than grid dims
myid += blockDim.x*blockDim.y;
while (myid < gridDim.x*gridDim.y){
if TST(my_f,bf[myid]){
my_x = OPT(my_x,bx[myid]);
my_y = OPT(my_y,by[myid]);
my_f = OPT(my_f,bf[myid]);}
myid += blockDim.x*blockDim.y;}
sh_f[thid] = my_f;
sh_x[thid] = my_x;
sh_y[thid] = my_y;
__syncthreads();
for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
if (!thid){
bf[0] = sh_f[0];
bx[0] = sh_x[0];
by[0] = sh_y[0];
}
}
}
// cpu (naive,serial) function for comparison
float3 opt_cpu(){
float optx = XNR;
float opty = YNR;
float optf = f(optx,opty);
for (float x = XNR; x < XPR; x += DX)
for (float y = YNR; y < YPR; y += DY){
float test = f(x,y);
if (TST(optf,test)){
optf = OPT(optf,test);
optx = OPT(optx,x);
opty = OPT(opty,y);}}
return make_float3(optf, optx, opty);
}
int main(){
// compute threadblock and grid dimensions
int nx = ceil(XPR-XNR)/DX;
int ny = ceil(YPR-YNR)/DY;
int bx = ceil(nx/(float)BLK_X);
int by = ceil(ny/(float)BLK_Y);
dim3 threads(BLK_X, BLK_Y);
dim3 blocks(bx, by);
float *d_bx, *d_by, *d_bf;
cudaFree(0);
// run GPU test case
unsigned long gtime = dtime_usec(0);
cudaMalloc(&d_bx, bx*by*sizeof(float));
cudaMalloc(&d_by, bx*by*sizeof(float));
cudaMalloc(&d_bf, bx*by*sizeof(float));
opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
float rf, rx, ry;
cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
gtime = dtime_usec(gtime);
printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
//run CPU test case
unsigned long ctime = dtime_usec(0);
float3 cpu_res = opt_cpu();
ctime = dtime_usec(ctime);
printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
Notes:
This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain: x(-1,1), y(-1,1)
The test was run on Fedora 20, CUDA 7, Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. Different CPU or GPU will obviously affect the comparison.
The observed speedup here is about 14x. The CPU code is a naive, single threaded code.
It should be possible, for example, via modification of the OPT and TST macros, to perform a different kind of optimization - such as maximum instead of minimum.
The domain (and grid) dimensions and granularity to search over can be modified by the compile time constants such as XNR, XPR, etc.

Heisenbug in CUDA kernel, global memory access

About two years ago, I wrote a kernel for work on several numerical grids simultaneously. Some very strange behaviour emerged, which resulted in wrong results. When hunting down the bug utilizing printf()-statements inside the kernel, the bug vanished.
Due to deadline constraints, I kept it that way, though recently I figured that this was no appropriate coding style. So I revisited my kernel and boiled it down to what you see below.
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
This kernel computes the derivative of some value at all grid points for a number of numerical 1D-grids.
Things to consider: (see full code base at the bottom)
a grid consists of 1300 grid points
each grid has to be worked upon by two blocks (due to memory/register limitations)
each block successively works on 37 grids (or better: grid halves, the while-loop takes care of that)
each thread is responsible for the same grid point in each grid
for the derivative to be computed, the threads need access to data from the four next grid points
in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
due to the boiling down process, the two threads on each side of the blocks do no computations, in the original they are responsible for writing the corresponing grid values to shared memory
The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r).
Each grid occupies 1300 consecutive values in each of these arrays.
The while-loop in the kernel iterates over 37 grids for each block.
To evaluate the workings of the kernel, each grid is initialized with the exact same values, so a deterministic program will produce the same result for each grid.
This does not happen with my code.
The weirdness of the Heisenbug:
I compared the computed values of grid 0 with each of the other grids, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous.
The first thing that came to mind was that two threads at this overlap try to concurrently write to memory, though that does not seem to be the case (I checked.... and re-checked).
Commenting or un-commenting lines or using printf() for debugging purposes will alter
the outcome of the program as well: When "asking" the threads responsible for the grid points in question, they tell me that everything is allright, and they are actually correct. As soon as I force a thread to print out its variables, they will be computed (and more importantly: stored) correctly.
The same goes for debugging with Nsight Eclipse.
Memcheck / Racecheck:
cuda-memcheck (memcheck and racecheck) report no memory/racecondition problems, though even the usage of one of these tools have the ability to impact the correctness of the results.
Valgrind gives some warnings, though I think they have something to do with the CUDA API which I can not influence and which seem unrelated to my problem.
(Update)
As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand has a race condition on d_u, i.e., global memory.
Testing environment:
The original kernel has been tested on different CUDA devices and with different compute capabilities (2.0, 3.0 and 3.5) with the bug showing up in every configuration (in some form or another).
My (main) testsystem is the following:
2 x GTX 460, tested on both the GPU that ran the X-server as well as
the other one
Driver Version: 340.46
Cuda Toolkit 6.5
Linux Kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)
State of solution:
By now I am pretty sure that some memory access is the culprit, maybe some optimization from the compiler or use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm.
The fact that printf() statements inside the kernel (which through some dark magic have to utilize device and host memory as well) and memcheck algorithms (cuda-memcheck and valgrind) influence
the bevavior point in the same direction.
I am sorry for this somewhat complicated kernel, but I boiled the original kernel and invocation down as much as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.
Two "solutions", which force the kernel do work as intended, are marked in the code.
(Update) As mentioned in the correct answer below, the problem with my code is a race condition at the border of the thread-blocks. As there are two blocks working on each grid and there is no guarantee as to which block works first, resulting in the behavior outlined below. It also explains the correct results when employing "Solution 1" as mentioned in the code, because the input/output value d_u is not altered when uchange = 1.0.
The simple solution is to split this kernel into two kernels, one computing d_u, the other computing the derivative d_du. It would be more desirable to have just one kernel invocation instead of two, though I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, though the overhead for the second kernel invocation is negligible.
heisenbug.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
const float r_sol = 6.955E8f;
__constant__ float d_fourpoint_constant = 0.2f;
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
}
bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid){
bool retval = true;
for(uint i=0; i<numNodesPerGrid; ++i)
if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
{
printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
label, i, id0, id1, matarray[id0 * numNodesPerGrid + i], matarray[id1 * numNodesPerGrid + i],
matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
retval = false;
}
return retval;
}
int main(int argc, const char* argv[])
{
float *d_u;
float *d_du;
float *d_r;
float *u_arr;
float *du_arr;
float *r_arr;
int numNodesPerGrid = 1300;
int numBlocksPerSM = 2;
int numGridsPerSM = 37;
int numSM = 7;
int TPB = 672;
int radius = 2;
int numGrids = 259;
int memsize_grid = sizeof(float) * numNodesPerGrid;
int numBlocksPerGrid = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);
printf("---------------------------------------------------------------------------\n");
printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
printf("---------------------------------------------------------------------------\n\n");
cudaSetDevice(0);
cudaDeviceReset();
cudaMalloc((void **) &d_u, memsize_grid * numGrids);
cudaMalloc((void **) &d_du, memsize_grid * numGrids);
cudaMalloc((void **) &d_r, memsize_grid * numGrids);
u_arr = new float[numGrids * numNodesPerGrid];
du_arr = new float[numGrids * numNodesPerGrid];
r_arr = new float[numGrids * numNodesPerGrid];
for(uint k=0; k<numGrids; ++k)
for(uint i=0; i<numNodesPerGrid; ++i)
{
uint index = k * numNodesPerGrid + i;
if (i < 585)
r_arr[index] = i * (6000.0f);
else
{
if (i == 585)
r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
else
r_arr[index] = r_arr[index - 1] + 1.02102f * ( r_arr[index - 1] - r_arr[index - 2] );
}
u_arr[index] = 1E-10f * (i+1);
du_arr[index] = 0.0f;
}
/*
printf("\n\nbefore kernel start\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
bool equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Input values are not identical for different grids!\n\n");
else
printf("All grids contain the same values at same grid points.!\n\n");
cudaMemcpy(d_u, u_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_du, du_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_r, r_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
printf("Configuration:\n\n");
printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
printf("Kernel launch parameters:\n\n");
printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
printf("Launching Kernel...\n\n");
heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
cudaDeviceSynchronize();
cudaMemcpy(u_arr, d_u, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(du_arr, d_du, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(r_arr, d_r, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
/*
printf("\n\nafter kernel finished\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Results are wrong!!\n");
else
printf("All went well!\n");
cudaFree(d_u);
cudaFree(d_du);
cudaFree(d_r);
delete [] u_arr;
delete [] du_arr;
delete [] r_arr;
return 0;
}
Makefile:
CUDA = 1
DEFINES =
ifeq ($(CUDA), 1)
DEFINES += -DCUDA
CUDAPATH = /usr/local/cuda-6.5
CUDAINCPATH = -I$(CUDAPATH)/include
CUDAARCH = -arch=sm_20
endif
CXX = g++
CXXFLAGS = -pipe -g -std=c++0x -fPIE -O0 $(DEFINES)
VALGRIND = valgrind
VALGRIND_FLAGS = -v --leak-check=yes --log-file=out.memcheck
CUDAMEMCHECK = cuda-memcheck
CUDAMC_FLAGS = --tool memcheck
RACECHECK = $(CUDAMEMCHECK)
RACECHECK_FLAGS = --tool racecheck
INCPATH = -I. $(CUDAINCPATH)
LINK = g++
LFLAGS = -O0
LIBS =
ifeq ($(CUDA), 1)
NVCC = $(CUDAPATH)/bin/nvcc
LIBS += -L$(CUDAPATH)/lib64/
LIBS += -lcuda -lcudart -lcudadevrt
NVCCFLAGS = -g -G -O0 --ptxas-options=-v
NVCCFLAGS += -lcuda -lcudart -lcudadevrt -lineinfo --machine 64 -x cu $(CUDAARCH) $(DEFINES)
endif
all:
$(NVCC) $(NVCCFLAGS) $(INCPATH) -c -o $(DST_DIR)heisenbug.o $(SRC_DIR)heisenbug.cu
$(LINK) $(LFLAGS) -o heisenbug heisenbug.o $(LIBS)
clean:
rm heisenbug.o
rm heisenbug
memrace: all
./heisenbug > out
$(VALGRIND) $(VALGRIND_FLAGS) ./heisenbug > out.memcheck.log
$(CUDAMEMCHECK) $(CUDAMC_FLAGS) ./heisenbug > out.cudamemcheck
$(RACECHECK) $(RACECHECK_FLAGS) ./heisenbug > out.racecheck
Note that in the entirety of your writeup, I do not see a question being explicitly asked, therefore I am responding to:
I am looking forward to learning what is going on here.
You have a race condition on d_u.
by your own statement:
•in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
Furthermore, if you comment out the write to d_u, according to your statement in the code, the problem disappears.
CUDA threadblocks can execute in any order. You have at least 2 different blocks that are reading from grid points 666, 667, 668, 669. The results will be different depending on which case actually occurs:
both blocks read the value before any writes occur.
one block reads the value, then a write occurs, then the other block reads the value.
The blocks are not independent of each other (contrary to your statement) if one block is reading a value that can be written to by another block. The order of block execution will determine the result in this case, and CUDA does not specify the order of block execution.
Note that cuda-memcheck with the -tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, therefore I would not expect cuda-memcheck to report anything.
cuda-memcheck, in order to gather its data, does influence the order of block execution, so it's not surprising that it affects the behavior.
in-kernel printf represents a costly function call, writing to a global memory buffer. So it also affects execution behavior/patterns. And if you are printing out a large amount of data, exceeding the buffer lines of output, the effect is extremely costly (in terms of execution time) in the event of buffer overflow.
As an aside, Linux Mint is not a supported distro for CUDA, as far as I can see. However I don't think this is relevant to your issue; I can reproduce the behavior on a supported config.

Texture fetch slower than direct global access, chapter 7 from "Cuda by example" book

I am reading and testing the examples in the book "Cuda By example. An introduction to General Purpose GPU Programming".
When testing the examples in chapter 7, relative to texture memory, I realized that access to global memory via texture cache is much slower than direct access (My NVIDIA GPU is GeForceGTX 260, compute capability 1.3 and I am using NVDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double checked the code several times and I have also been reading the "CUDA C Programming guide" and "CUDA C Best practices Guide" which come along with the SDK, and I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism implementation to make it look like a cache (?). I am wondering whether coalesced access to global memory will make texture fetch slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in NVIDIA forums for a similar problem, but the link is no longer available.)
The testing code looks this way, only including the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with the nvvp (NVIDIA profiler)
The result are quite curious as they show that there are a lot of texture cache misses (which are probably the cause for the bad performance).
The result from the profiler show also information that is difficult to understand even using the guide "CUPTI_User_GUide):
text_cache_hit: Number of texture cache hits (they are accounted only for one SM according to 1.3 capability).
text_cache_miss: Number of texture cache miss (they are accounted only for one SM according to 1.3 capability).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These result show several interesting info:
There are no cache hits at all for the "copy const" function (although ideally the memory is "spatially located", in the sense that each thread accesses memory which is near to the memory acceded by other near threads). I guess that this is because the threads within this function do not access memory from other threads, which seems to be the way for the texture cache to be usable (being the "spatially located" concept quite confusing)
There are some cache hits in the 1D and a lot more in the 2D case for the function "blend_kernel". I guess that it is due to the fact that within that function, any thread access memory from their neighbours threads. I cannot understand why there are more in 2D than 1d.
The duration time is greater in the texture cases than in the no texture case (nearly about one order of magnitude). Perhaps related with the so many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM and 5120 for the "blend kernel". The relation 5:1 is correct due to the fact that there are 5 fetches in "blend" and only 1 in "copy_const". Anyway, I cannot understand where all this 1024 come from: ideally, this event "text cache miss/hot" only accounts for one SM (I have 24 in my GeForceGTX 260) and it only accounts for warps ( 32 thread size). Therefore, I have 256 threads/32=8 warps per SM and 256 blocks/24 = 10 or 11 "iterations" per SM, so I would be expecting something like 80 or 88 fetches (more over, some other event like sm_cta_launched, which is the number of thread blocks per SM, which is supposed to be supported in my 1.3 device, is always 0...)