CUDA float precision not matching CPU implementation - cuda

I am using CUDA 5.5 compute 3.5 on GTX 1080Ti and want to compute this formula:
y = a * a * b / 64 + c * c
Suppose I have these parameters:
a = 5876
b = 0.4474222958088
c = 664
I am computing this both via GPU and on the CPU and they give me different inexact answers:
h_data[0] = 6.822759375000e+05,
h_ref[0] = 6.822760000000e+05,
difference = -6.250000000000e-02
h_data is the CUDA answer, h_ref is the CPU answer. When I plug these into my calculator the GPU answer is closer to the exact answer, and I suspect this has to do with floating point precision. My question now is, how can I get the CUDA solution to match the precision/roundoff of CPU version? If I offset the a parameter by +/-1 the solutions match, but if I offset say the c parameter I still get a difference of 1/16
Here's the working code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
__global__ void test_func(float a, float b, int c, int nz, float * __restrict__ d_out)
{
float *fdes_out = d_out + blockIdx.x * nz;
float roffout2 = a * a / 64.f;
//float tmp = fma(roffout2,vel,index*index);
for (int tid = threadIdx.x; tid < nz; tid += blockDim.x) {
fdes_out[tid] = roffout2 * b + c * c;
}
}
int main (int argc, char **argv)
{
// parameters
float a = 5876.0f, b = 0.4474222958088f;
int c = 664;
int nz = 1;
float *d_data, *h_data, *h_ref;
h_data = (float*)malloc(nz*sizeof(float));
h_ref = (float*)malloc(nz*sizeof(float));
// CUDA
cudaMalloc((void**)&d_data, sizeof(float)*nz);
dim3 nb(1,1,1); dim3 nt(64,1,1);
test_func <<<nb,nt>>> (a,b,c,nz,d_data);
cudaMemcpy(h_data, d_data, sizeof(float)*nz, cudaMemcpyDeviceToHost);
// Reference
float roffout2 = a * a / 64.f;
h_ref[0] = roffout2*b + c*c;
// Compare
printf("h_data[0] = %1.12e,\nh_ref[0] = %1.12e,\ndifference = %1.12e\n",
h_data[0],h_ref[0],h_data[0]-h_ref[0]);
// Free
free(h_data); free(h_ref);
cudaFree(d_data);
return 0;
}
I'm compiling only with the-O3 flag.

This small numerical difference of one single-precision ulp occurs because the CUDA compiler applies FMA-merging by default, whereas the host compiler does not do that. FMA-merging can be turned off by adding the command line flag -fmad=false to the invocation of the CUDA compiler driver nvcc.
FMA-merging is a compiler optimization in which an FMUL and a dependent FADD are transformed into a single fused multiply-add, or FMA, instruction. An FMA instruction computes a*b+c such that the full unrounded product a*b enters into the addition with c before a final rounding is applied to produce the final result.
Usually, this has performance advantages, since a single FMA instruction is executed instead of two instructions FMUL, FADD, and all the instructions have similar latency. Usually, this also has accuracy advantages as the use of FMA eliminates one rounding step and guards against subtractive cancellation when a*c and c have opposite signs.
In this case, as noted by OP, the GPU result computed with FMA is slightly more accurate than the host result computed without FMA. Using a higher precision reference, I find that the relative error in the GPU result is -4.21e-8, while the relative error in the host result is 4.95e-8.

Related

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way to get the best performance you should make your threads coalesce and align any access. My GPU has Compute Capability 6.1 "Pascal", has 48 kiB of shared memory per thread block and 2 GiB DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block each may simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with on offset of 48 kiB / 32 banks / sizeof(double) which equals 192:
__shared__ double cache[6144];
__global__ void kernel(double *buf_out, double a, double b, double c)
{
for(...)
{
// Perform calculation on shared memory
cache[threadIdx.x * 192] = ...
}
// Write result to global memory
buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double make sure that each thread will access the first element of a different bank, at the same time. I haven't gotten around to modify and test the code, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
cache[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
cache[tid] += cos(temp * (Y * y + Z * z));
}
}
buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.

CUDA's nvvp reports non-ideal memory access pattern, but bandwidth is almost peaking

EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).
So, I have crafted a "minimal" working example, which follows:
#include <cuComplex.h>
#include <iostream>
int const n = 512 * 100;
typedef float real;
template < class T >
struct my_complex {
T x;
T y;
};
__global__ void set( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 1.0f, 0.0f };
}
__global__ void duplicate_whole( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 2.0f * d.x, 2.0f * d.y };
}
__global__ void duplicate_half( real * a )
{
real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d *= 2.0f;
}
int main()
{
my_complex< real > * a;
cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );
set<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_whole<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
cudaDeviceSynchronize();
my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );
std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )" << std::endl;
return 0;
}
When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.
However, when I analyze the kernels using nvvp I get different reports for each of the kernels in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing
Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4
I agree that I am loading 8 byte words. What I do not understand is why 4 bytes is the ideal word size. In special, there is no performance difference between the kernels.
I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?
And why is that I do not get a performance hit?
I hope that this edit has clarified some unclear points.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'll start wit some kernel code to exemplify my question, which will follow below
template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
std::complex< data_t > * data,
int M,
int row_length,
int b,
int i_0
)
{
#ifndef CUGALE_MUL_SHUFFLE
// Output array length:
int plane_area = row_length * M;
// Process element:
int i = blockIdx.x * row_length + threadIdx.x + i_0;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
my_complex< data_t > datum;
my_complex< data_t > datum_new;
for ( int i_b = 0; i_b < b; ++ i_b )
{
my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
datum = ref_datum;
datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
ref_datum = datum_new;
}
#else
// Output array length:
int plane_area = row_length * M;
// Element to process:
int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
// Real and imaginary part of datum (not respectively for odd threads):
data_t datum_a;
data_t datum_b;
// Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
int parity = ( threadIdx.x % 2 );
int shuffle_dir = 1 - 2 * parity;
int inwarp_tid = threadIdx.x % warpSize;
for ( int i_b = 0; i_b < b; ++ i_b )
{
int data_idx = i_b * plane_area + i;
datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );
// Even TIDs compute real part, odd TIDs compute imaginary part:
reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
}
#endif // #ifndef CUGALE_MUL_SHUFFLE
}
Let us consider the case where data_t is float, which is memory bandwidth limited. As it can be seen above, there are two versions of the kernel, one which reads/writes 8 bytes (a whole complex number) per thread and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.
The reason why I have written the version using shuffle is because nvvp insisted that reading 8 bytes per thread was not the best idea because this memory access pattern would be inefficient. This is the case even though in both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to theoretical maximum.
Surely enough I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:
Why is that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?
As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double, there is no execution time difference too between the two kernel versions, even though in this case the kernel is compute bound and the shuffle version performs some more flops than the original version.
Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines) and the chirp_factor array is row_length * M. Both kernels run perfecly fine (I can edit the question to show you the calls to both versions if you have doubts about it).
The issue here has to do with how the compiler is processing your code. nvvp is merely dutifully reporting what is happening when you run your code.
If you use the cuobjdump -sass tool on your executable, you will discover that the duplicate_whole routine is doing two 4-byte loads and two 4-byte stores. This is not optimal, partly becuase there is a stride in each load and store (each load and store touches alternate elements in memory).
The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal for use in situations that would prevent the compiler from generating a (legal) 8-byte load. As discussed here we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to your struct looks like this:
template < class T >
struct __align__(8) my_complex {
T x;
T y;
};
With that change to your code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this sort of decoration only when you understand what it means and are willing to enter into a contract with the compiler that you will ensure this is the case. If you do something unusual, like unusual pointer casting, you can violate your end of the bargain and generate a machine fault.
The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches
When you do a strided load, the GPU loads an entire cacheline anyway, even though (in this case) you only need half the elements (the real elements) for that particular load operation. However you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and this instruction most likely hits in the cache, due to the previous load.
On a strided store in this case, writing strided elements in one instruction and the alternate elements in the next instruction will end up using one of the caches as a "coalescing buffer". This isn't coalescing in the typical sense used in CUDA terminology; that sort of coalescing only applies to a single instruction. However the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line, before that line gets written out or evicted. This is approximately equivalent to "write-back" cache behavior.

Best strategy for grid search with CUDA

Recently I started working with CUDA and I read an introductory book on the computing language. To see if I understood it well, I considered the following problem.
Consider a function minimize f(x,y) on the grid [-1,1] X [-1,1]. This provided me with a few practical questions and I would like to have your look on things.
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.
Thanks!
This was an interesting question for me because it represents a type of problem that I think is rare:
potentially high compute load
little to no data that needs to be communicated host->device
very low volume of results that need to be communicated device->host
In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.
Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.
To answer your questions:
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task - which GPUs are good at, and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this (compute the functional value at each grid point.) Since we won't be using global access to data for this, texture memory is not an issue.
Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.
Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.
Here's a sample code that might be interesting to play with or crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum value (in this case).
$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>
// grid dimensions and divisions
#define XNR -1.0f
#define XPR 1.0f
#define YNR -1.0f
#define YPR 1.0f
#define DX 0.0001f
#define DY 0.0001f
// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16
// optimization functions - these are currently set for minimization
#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)
// error check macro
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) +0.1f;
}
// variable for block-draining reduction block counter
__device__ int blkcnt = 0;
// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){
__shared__ float sh_f[BLK_X*BLK_Y];
__shared__ float sh_x[BLK_X*BLK_Y];
__shared__ float sh_y[BLK_X*BLK_Y];
__shared__ int lblock;
// compute x,y coordinates for this thread
float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;
int thid = (threadIdx.y*BLK_X)+threadIdx.x;
lblock = 0;
sh_x[thid] = x;
sh_y[thid] = y;
sh_f[thid] = f(x,y); // compute functional value of f(x,y)
__syncthreads();
// perform block-level shared memory reduction
// assume block size is a power of 2
for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
volatile float *vf = sh_f;
volatile float *vx = sh_x;
volatile float *vy = sh_y;
for (int i = 16; i > 0; i>>=1)
if (thid < i)
if (TST(vf[thid],vf[thid+i])){
vf[thid] = OPT(vf[thid],vf[thid+i]);
vx[thid] = OPT(vx[thid],vx[thid+i]);
vy[thid] = OPT(vy[thid],vy[thid+i]);}
// save block reduction result, and check if last block
if (!thid){
bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
int myblock = atomicAdd(&blkcnt, 1);
if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
__syncthreads();
if (lblock){
// do last-block reduction
float my_x, my_y, my_f;
int myid = thid;
if (myid < gridDim.x * gridDim.y){
my_x = bx[myid];
my_y = by[myid];
my_f = bf[myid];}
else { assert(0);} // does not work correctly if block dims are greater than grid dims
myid += blockDim.x*blockDim.y;
while (myid < gridDim.x*gridDim.y){
if TST(my_f,bf[myid]){
my_x = OPT(my_x,bx[myid]);
my_y = OPT(my_y,by[myid]);
my_f = OPT(my_f,bf[myid]);}
myid += blockDim.x*blockDim.y;}
sh_f[thid] = my_f;
sh_x[thid] = my_x;
sh_y[thid] = my_y;
__syncthreads();
for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
if (thid < i)
if (TST(sh_f[thid],sh_f[thid+i])){
sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
__syncthreads();}
if (!thid){
bf[0] = sh_f[0];
bx[0] = sh_x[0];
by[0] = sh_y[0];
}
}
}
// cpu (naive,serial) function for comparison
float3 opt_cpu(){
float optx = XNR;
float opty = YNR;
float optf = f(optx,opty);
for (float x = XNR; x < XPR; x += DX)
for (float y = YNR; y < YPR; y += DY){
float test = f(x,y);
if (TST(optf,test)){
optf = OPT(optf,test);
optx = OPT(optx,x);
opty = OPT(opty,y);}}
return make_float3(optf, optx, opty);
}
int main(){
// compute threadblock and grid dimensions
int nx = ceil(XPR-XNR)/DX;
int ny = ceil(YPR-YNR)/DY;
int bx = ceil(nx/(float)BLK_X);
int by = ceil(ny/(float)BLK_Y);
dim3 threads(BLK_X, BLK_Y);
dim3 blocks(bx, by);
float *d_bx, *d_by, *d_bf;
cudaFree(0);
// run GPU test case
unsigned long gtime = dtime_usec(0);
cudaMalloc(&d_bx, bx*by*sizeof(float));
cudaMalloc(&d_by, bx*by*sizeof(float));
cudaMalloc(&d_bf, bx*by*sizeof(float));
opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
float rf, rx, ry;
cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("some error");
gtime = dtime_usec(gtime);
printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
//run CPU test case
unsigned long ctime = dtime_usec(0);
float3 cpu_res = opt_cpu();
ctime = dtime_usec(ctime);
printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
Notes:
This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain: x(-1,1), y(-1,1)
The test was run on Fedora 20, CUDA 7, Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. Different CPU or GPU will obviously affect the comparison.
The observed speedup here is about 14x. The CPU code is a naive, single threaded code.
It should be possible, for example, via modification of the OPT and TST macros, to perform a different kind of optimization - such as maximum instead of minimum.
The domain (and grid) dimensions and granularity to search over can be modified by the compile time constants such as XNR, XPR, etc.

Dot Product in CUDA using atomic operations - getting wrong results

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>
#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float
// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
__shared__ num_t temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads(); //Synchronize!
*c = 0.00;
// Does it need to be tid==0 that
// undertakes this task?
if (0 == threadIdx.x) {
num_t sum = 0.00;
int i;
for (i=0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
//WRONG: *c += sum; This read-write operation must be atomic!
}
}
// Initialize the vectors:
void init_vector(num_t *x)
{
int i;
for (i=0 ; i<N ; i++){
x[i] = 0.001 * i;
}
}
// MAIN
int main(void)
{
num_t *a, *b, *c;
num_t *dev_a, *dev_b, *dev_c;
size_t size = N * sizeof(num_t);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a = (num_t*)malloc(size);
b = (num_t*)malloc(size);
c = (num_t*)malloc(size);
init_vector(a);
init_vector(b);
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);
printf("a = [\n");
int i;
for (i=0;i<10;i++){
printf("%g\n",a[i]);
}
printf("...\n");
for (i=N-10;i<N;i++){
printf("%g\n",a[i]);
}
printf("]\n\n");
printf("a*b = %g.\n", *c);
free(a); free(b); free(c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I'm getting significantly different results (For instance, for N=2048*32 CUDA reutrns 6.73066e+07 while MATLAB returns 9.3823e+07. For N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I incline to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: The variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks or every block operates on a different copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match matlab, more or less.
The fact that you have this line in your kernel unprotected by any synchronization is messing you up. Every thread of every block, whenever they happen to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed by threadblock, since the threads in a threadblock are executing together to a large extent. However, even within a threadblock, execution is not in perfect lockstep across all threads, necessarily. Normally when we think of lockstep execution, we think of a warp which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. However, if there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and (if threadIdx.x == 0) etc. But these mechanisms are useless for the general case of controlling execution across multiple threadsblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore to fix your issue, we had to zero out c prior to the kernel launch.

Texture fetch slower than direct global access, chapter 7 from "Cuda by example" book

I am reading and testing the examples in the book "Cuda By example. An introduction to General Purpose GPU Programming".
When testing the examples in chapter 7, relative to texture memory, I realized that access to global memory via texture cache is much slower than direct access (My NVIDIA GPU is GeForceGTX 260, compute capability 1.3 and I am using NVDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double checked the code several times and I have also been reading the "CUDA C Programming guide" and "CUDA C Best practices Guide" which come along with the SDK, and I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism implementation to make it look like a cache (?). I am wondering whether coalesced access to global memory will make texture fetch slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in NVIDIA forums for a similar problem, but the link is no longer available.)
The testing code looks this way, only including the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with the nvvp (NVIDIA profiler)
The result are quite curious as they show that there are a lot of texture cache misses (which are probably the cause for the bad performance).
The result from the profiler show also information that is difficult to understand even using the guide "CUPTI_User_GUide):
text_cache_hit: Number of texture cache hits (they are accounted only for one SM according to 1.3 capability).
text_cache_miss: Number of texture cache miss (they are accounted only for one SM according to 1.3 capability).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These result show several interesting info:
There are no cache hits at all for the "copy const" function (although ideally the memory is "spatially located", in the sense that each thread accesses memory which is near to the memory acceded by other near threads). I guess that this is because the threads within this function do not access memory from other threads, which seems to be the way for the texture cache to be usable (being the "spatially located" concept quite confusing)
There are some cache hits in the 1D and a lot more in the 2D case for the function "blend_kernel". I guess that it is due to the fact that within that function, any thread access memory from their neighbours threads. I cannot understand why there are more in 2D than 1d.
The duration time is greater in the texture cases than in the no texture case (nearly about one order of magnitude). Perhaps related with the so many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM and 5120 for the "blend kernel". The relation 5:1 is correct due to the fact that there are 5 fetches in "blend" and only 1 in "copy_const". Anyway, I cannot understand where all this 1024 come from: ideally, this event "text cache miss/hot" only accounts for one SM (I have 24 in my GeForceGTX 260) and it only accounts for warps ( 32 thread size). Therefore, I have 256 threads/32=8 warps per SM and 256 blocks/24 = 10 or 11 "iterations" per SM, so I would be expecting something like 80 or 88 fetches (more over, some other event like sm_cta_launched, which is the number of thread blocks per SM, which is supposed to be supported in my 1.3 device, is always 0...)