Simplest Possible Example to Show GPU Outperforming CPU Using CUDA

I am looking for the most concise code possible that can be written both for a CPU (using g++) and a GPU (using nvcc), for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc), for which the GPU outperforms the CPU, preferably on the scale of seconds or milliseconds. The shortest code pair possible.

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) completely misses the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPUs, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#include <stdio.h>

#define N (1024*1024)
#define M (1000000)

int main()
{
    static float data[N]; // static: a 4 MB array would risk overflowing the stack
    for(int i = 0; i < N; i++)
    {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
            data[i] = data[i] * data[i] - 0.25f;
        }
    }
    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
    return 0;
}
Use something like this in CUDA/C:
#include <stdio.h>

#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    buf[i] = 1.0f * i / N;
    for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
}

int main()
{
    static float data[N]; // static for the same stack-size reason as the CPU version
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudakernel<<<N/256, 256>>>(d_data);
    cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost); // blocks until the kernel finishes
    cudaFree(d_data);
    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
    return 0;
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
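Assuming the two files are saved as cpu_version.cpp and gpu_version.cu (names are placeholders), a plausible way to build and time them is:

$ g++ -O3 cpu_version.cpp -o cpu_version
$ nvcc -O3 gpu_version.cu -o gpu_version
$ time ./cpu_version <<< 0
$ time ./gpu_version <<< 0

The here-string feeds the index prompt, so the measured time isn't dominated by waiting for keyboard input.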

A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and lends itself to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
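In case it's code: a minimal CUDA sketch of the squares idea (my test was OpenCL, so this is a translation of the idea rather than the code I benchmarked) could look like:

__global__ void squares(long long *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (long long)i * i; // 64-bit result avoids overflow for larger i
}

// e.g. for the first 100,000 integers:
// squares<<<(100000 + 255) / 256, 256>>>(d_out, 100000);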
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shut down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two-year-old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)

For reference, I made a similar example with time measurements. With a GTX 660, the GPU speedup was 24X, where the GPU timing includes the data transfers in addition to the actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}

I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.

Related

How to make use of swap space on disk when running out of modern GPU memory?

On Pascal and later GPUs, unified memory (UM) can allocate more memory than the GPU has, with pages automatically swapped in and out between GPU memory and host memory.
So what happens when both GPU memory and host memory run out? How can I use the swap space on disk? Virtual-memory swap space does not seem to be used in the cudaMallocManaged case. Here is how I did the experiment:
create swap space: dd if=/dev/zero of=./swapfile bs=1G count=16, then mkswap and swapon
create a host memory occupier that burns through 99% of host memory:
for (i = 0; i < 8000; i++)
    malloc(1 << 20); // 8000 x 1 MB
create a GPU memory occupier with cudaMalloc, leaving 1 GB of GPU memory free
run the official UVM demo, but with a 12 GB workload that will exhaust all available GPU memory and host memory:
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays (grid-stride loop)
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void) {
    long long N = 6LL * (1 << 30) / sizeof(float); // <<<<<<<<<<<
    float *x, *y;

    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Launch the add kernel on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
It just gets killed by the oom-killer and leaves the swap space alone.
If instead it is a plain libc malloc with a 100 GB workload, you can see the swap space usage growing.
Shouldn't UM/UVA be able to use GPU memory + host memory + swap space, just as ordinary virtual memory does?
There are 3 types of memory allocations that are accessible to GPU device code:
ordinary (e.g. cudaMalloc)
pinned (e.g. cudaHostAlloc)
managed (e.g. cudaMallocManaged)
None of these will make use of, or have any bearing on, traditional Linux swap space (or the equivalent on Windows). The first is limited by available device memory, and the latter two are limited by available host memory (or some other lower limit). All host-based allocations accessible to GPU device code must be resident in non-swappable memory; "swappable" here refers to the ordinary host virtual memory management system that may swap pages out to disk.
The only space that benefits from this form of swapping is host pageable memory allocations, and these are not directly accessible from CUDA device code.
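For reference, a minimal sketch of the three allocation flavors (the size and variable names are arbitrary):

#include <cuda_runtime.h>

int main()
{
    float *d_ordinary, *h_pinned, *u_managed;
    size_t bytes = 1 << 20; // 1 MB, arbitrary

    cudaMalloc((void **)&d_ordinary, bytes);                        // ordinary: device memory only
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault); // pinned: page-locked host memory
    cudaMallocManaged((void **)&u_managed, bytes);                  // managed: migrates between host and device on demand

    cudaFree(d_ordinary);
    cudaFreeHost(h_pinned);
    cudaFree(u_managed);
    return 0;
}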

dram_write_bytes result on P100

I used nvprof to profile a simple vecadd example (n=1024) on a P100 but observed that dram_write_bytes is only 256 (rather than the 1024*4 I expected). Can someone explain why this number is so small? What other metrics do I need to include to account for global memory writes? Thanks. The flop_count_sp number is correct (1024).
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void vecAdd(float* a, float* b, float* c, int n){
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if(id < n) c[id] = a[id] + b[id];
}

int main(int argc, char* argv[]){
    int n = 1024;
    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;
    size_t bytes = n*sizeof(float);

    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    for(i = 0; i < n; i++){
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i+1);
    }

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd <<<1, 1024>>> (d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    float sum = 0;
    for(i = 0; i < n; i++)
        sum += h_c[i] - h_a[i] - h_b[i];
    printf("final diff: %f\n", sum/n);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Is it related to the sampling of nvprof? One time I got 384 bytes; sometimes I even got 0 bytes. The weird thing is: if I change n to 1024*1024, I get more bytes than I expected (4688032). 4688032/1024/1024/4 = 1.11.
There are several reasons why your expectations are not being observed and the data is changing:
The GPU memory system is shared by all engines. The primary engine is the graphics/compute engine, but other engines such as copy engines, display, etc. also access device memory, and the memory controller (FB = framebuffer) counters have no way to track the requester.
NVPROF injection does not attempt to evict all context memory from the L2 cache. The cudaMemcpys prior to the launch and the kernel replay code in nvprof will leave the L2 cache in an inconsistent state.
The initial size of 4KB is simply too small to track accurately. The full data set could be in L2 from either the cudaMemcpy or replay. Furthermore, the bytes you see can come from other clients such as the constant caches.
It is highly recommended that you scale the buffer size up to a reasonable size. On newer GPUs the Nsight Compute profiler has an improved L2-level breakdown of the various clients to help detect unexpected traffic. In addition, Nsight Compute replay logic clears the L2 cache so that each replay has a consistent start state.
If you have a monitor attached, it is recommended to move the monitor to a different GPU when looking at DRAM counters. nvprof L2 counters generally filter the count to traffic from the SMs, so traffic from copy engines, the display controller, MMU, constant caches, etc. will not show up in the L2 counters.
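For reference, the counters in question can be collected with a command along these lines (the binary name is assumed):

$ nvprof --metrics dram_write_bytes,flop_count_sp ./vecadd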

Best strategy for grid search with CUDA

Recently I started working with CUDA and I read an introductory book on the computing language. To see if I understood it well, I considered the following problem.
Consider minimizing a function f(x,y) on the grid [-1,1] X [-1,1]. This raised a few practical questions for me, and I would like your take on things.
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
Suppose I don't explicitly make a grid. I can discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use one 1D list of blocks.
Thanks!
This was an interesting question for me because it represents a type of problem that I think is rare:
potentially high compute load
little to no data that needs to be communicated host->device
very low volume of results that need to be communicated device->host
In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.
Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.
To answer your questions:
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task, which GPUs are good at, and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this instead (compute the functional value at each grid point). Since we won't be using global access to data for this, texture memory is not an issue.
Suppose I don't explicitly make a grid. I can discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use one 1D list of blocks.
Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.
Here's a sample code that might be interesting to play with or crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum value (in this case).
$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>

// grid dimensions and divisions
#define XNR -1.0f
#define XPR  1.0f
#define YNR -1.0f
#define YPR  1.0f
#define DX   0.0001f
#define DY   0.0001f

// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16

// optimization functions - these are currently set for minimization
#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)

// error check macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
    return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) + 0.1f;
}

// variable for block-draining reduction block counter
__device__ int blkcnt = 0;

// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){
    __shared__ float sh_f[BLK_X*BLK_Y];
    __shared__ float sh_x[BLK_X*BLK_Y];
    __shared__ float sh_y[BLK_X*BLK_Y];
    __shared__ int lblock;

    // compute x,y coordinates for this thread
    float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
    float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;

    int thid = (threadIdx.y*BLK_X)+threadIdx.x;
    lblock = 0;
    sh_x[thid] = x;
    sh_y[thid] = y;
    sh_f[thid] = f(x,y); // compute functional value of f(x,y)
    __syncthreads();

    // perform block-level shared memory reduction
    // assume block size is a power of 2
    for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
        if (thid < i)
            if (TST(sh_f[thid],sh_f[thid+i])){
                sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
                sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
                sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
        __syncthreads();}
    volatile float *vf = sh_f;
    volatile float *vx = sh_x;
    volatile float *vy = sh_y;
    for (int i = 16; i > 0; i>>=1)
        if (thid < i)
            if (TST(vf[thid],vf[thid+i])){
                vf[thid] = OPT(vf[thid],vf[thid+i]);
                vx[thid] = OPT(vx[thid],vx[thid+i]);
                vy[thid] = OPT(vy[thid],vy[thid+i]);}

    // save block reduction result, and check if last block
    if (!thid){
        bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
        bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
        by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
        int myblock = atomicAdd(&blkcnt, 1);
        if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
    __syncthreads();

    if (lblock){
        // do last-block reduction
        float my_x, my_y, my_f;
        int myid = thid;
        if (myid < gridDim.x * gridDim.y){
            my_x = bx[myid];
            my_y = by[myid];
            my_f = bf[myid];}
        else { assert(0);} // does not work correctly if block dims are greater than grid dims
        myid += blockDim.x*blockDim.y;
        while (myid < gridDim.x*gridDim.y){
            if (TST(my_f,bf[myid])){
                my_x = OPT(my_x,bx[myid]);
                my_y = OPT(my_y,by[myid]);
                my_f = OPT(my_f,bf[myid]);}
            myid += blockDim.x*blockDim.y;}
        sh_f[thid] = my_f;
        sh_x[thid] = my_x;
        sh_y[thid] = my_y;
        __syncthreads();
        for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
            if (thid < i)
                if (TST(sh_f[thid],sh_f[thid+i])){
                    sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
                    sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
                    sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
            __syncthreads();}
        if (!thid){
            bf[0] = sh_f[0];
            bx[0] = sh_x[0];
            by[0] = sh_y[0];
        }
    }
}

// cpu (naive, serial) function for comparison
float3 opt_cpu(){
    float optx = XNR;
    float opty = YNR;
    float optf = f(optx,opty);
    for (float x = XNR; x < XPR; x += DX)
        for (float y = YNR; y < YPR; y += DY){
            float test = f(x,y);
            if (TST(optf,test)){
                optf = OPT(optf,test);
                optx = OPT(optx,x);
                opty = OPT(opty,y);}}
    return make_float3(optf, optx, opty);
}

int main(){
    // compute threadblock and grid dimensions
    int nx = ceil(XPR-XNR)/DX;
    int ny = ceil(YPR-YNR)/DY;
    int bx = ceil(nx/(float)BLK_X);
    int by = ceil(ny/(float)BLK_Y);
    dim3 threads(BLK_X, BLK_Y);
    dim3 blocks(bx, by);
    float *d_bx, *d_by, *d_bf;
    cudaFree(0); // establish the CUDA context before timing

    // run GPU test case
    unsigned long gtime = dtime_usec(0);
    cudaMalloc(&d_bx, bx*by*sizeof(float));
    cudaMalloc(&d_by, bx*by*sizeof(float));
    cudaMalloc(&d_bf, bx*by*sizeof(float));
    opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
    float rf, rx, ry;
    cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("some error");
    gtime = dtime_usec(gtime);
    printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);

    // run CPU test case
    unsigned long ctime = dtime_usec(0);
    float3 cpu_res = opt_cpu();
    ctime = dtime_usec(ctime);
    printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);

    return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
Notes:
This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain x in (-1,1), y in (-1,1).
The test was run on Fedora 20, CUDA 7, a Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. A different CPU or GPU will obviously affect the comparison.
The observed speedup here is about 14x. The CPU code is a naive, single threaded code.
It should be possible, via modification of the OPT and TST macros, to perform a different kind of optimization - such as maximum instead of minimum (see the sketch after these notes).
The domain (and grid) dimensions and the granularity of the search can be modified via the compile-time constants such as XNR, XPR, etc.
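For instance, a maximization variant is just a matter of flipping the comparison in the TST macro; a minimal sketch (not benchmarked here):

#define TST(X1,X2) ((X1)<(X2)) // keep the larger of the two values
#define OPT(X1,X2) (X2)        // unchanged: always select the second argument when TST fires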

Shared memory, branching performance and register count

I came across some peculiar performance behaviour when trying out the CUDA shuffle instruction. The test kernel below is based on an image processing algorithm that adds input-dependent values to all neighbouring pixels within a square of side rad. The output for each block is accumulated in shared memory. If only one thread per warp adds its result to shared memory, the performance is poor (Option 1); if, on the other hand, all threads add to shared memory (one thread adds the desired value, the rest just add 0), the execution time drops by a factor of 2-3 (Option 2).
#include <iostream>
#include "cuda_runtime.h"

#define warpSz 32
#define tileY 32
#define rad 32

__global__ void test(float *out, int pitch)
{
    // Set shared mem to 0
    __shared__ float tile[(warpSz + 2*rad) * (tileY + 2*rad)];
    for (int i = threadIdx.y*blockDim.x+threadIdx.x; i<(tileY+2*rad)*(warpSz+2*rad); i+=blockDim.x*blockDim.y) {
        tile[i] = 0.0f;
    }
    __syncthreads();

    for (int row=threadIdx.y; row<tileY; row += blockDim.y) {
        // Loop over pixels in neighbourhood
        for (int i=0; i<2*rad+1; ++i) {
            float res = 0.0f;
            int rowStartIdx = (row+i)*(warpSz+2*rad);
            for (int j=0; j<2*rad+1; ++j) {
                res += float(threadIdx.x+row); // Substitute for real calculation

                // Option 1: one thread writes to shared mem
                if (threadIdx.x == 0) {
                    tile[rowStartIdx + j] += res;
                    res = 0.0f;
                }

                //// Option 2: all threads write to shared mem
                //float tmp = 0.0f;
                //if (threadIdx.x == 0) {
                //    tmp = res;
                //    res = 0.0f;
                //}
                //tile[rowStartIdx + threadIdx.x+j] += tmp;

                res = __shfl(res, (threadIdx.x+1) % warpSz);
            }
            res += float(threadIdx.x+row);
            tile[rowStartIdx + threadIdx.x+2*rad] += res;
            __syncthreads();
        }
    }

    // Add result back to global mem
    for (int row=threadIdx.y; row<tileY+2*rad; row+=blockDim.y) {
        for (int col=threadIdx.x; col<warpSz+2*rad; col+=warpSz) {
            int idx = (blockIdx.y*tileY + row)*pitch + blockIdx.x*warpSz + col;
            atomicAdd(out+idx, tile[row*(warpSz+2*rad) + col]);
        }
    }
}

int main(void)
{
    int2 dim = make_int2(512, 512);
    int pitchOut = (((dim.x+2*rad)+warpSz-1) / warpSz) * warpSz;
    int sizeOut = pitchOut*(dim.y+2*rad);
    dim3 gridDim((dim.x+warpSz-1)/warpSz, (dim.y+tileY-1)/tileY, 1);
    float *devOut;
    cudaMalloc((void**)&devOut, sizeOut*sizeof(float));
    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaFree(0);

    cudaEventRecord(start, 0);
    test<<<gridDim, dim3(warpSz, 8)>>>(devOut, pitchOut);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    cudaFree(devOut);
    cudaDeviceReset();
    std::cout << "Elapsed time: " << elapsedTime << " ms.\n";
    std::cin.ignore();
}
Is this expected behaviour/can anyone explain why this happens?
One thing I have noted is that Option 1 uses only 15 registers, whereas Option 2 uses 37, which seems like a big difference to me.
Another is that the if-statement in the innermost loop is compiled to explicit bra instructions in the PTX code for Option 1, whereas for Option 2 it is compiled to two selp instructions. Could it be that the explicit branching is behind the 2-3x slowdown, similar to what's suspected in this question?
There are two reasons why I am reluctant to go for Option 2. First, when profiling the original application it seems to be limited by shared memory bandwidth, which indicates that there is potential to increase the performance by having fewer threads accessing it. Second, unless we use the volatile keyword, writes to shared memory can be optimised into registers. Since we are only interested in the contribution from the last thread to access each memory location (threadIdx.x == 0), and all others add 0, this is not a problem as long as all changes temporarily held in registers are guaranteed to be written back to shared memory in the same order they were issued. Is this the case, though? (So far, both options have produced exactly the same result.)
Any thoughts or ideas are much appreciated!
PS. I compile for compute capability 3.0. (However, the shuffle instruction is not necessary to demonstrate the behaviour and can be commented out.)

Dot Product in CUDA using atomic operations - getting wrong results

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float

// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads(); //Synchronize!
    *c = 0.00;
    // Does it need to be tid==0 that
    // undertakes this task?
    if (0 == threadIdx.x) {
        num_t sum = 0.00;
        int i;
        for (i=0; i<THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
        //WRONG: *c += sum; This read-write operation must be atomic!
    }
}

// Initialize the vectors:
void init_vector(num_t *x)
{
    int i;
    for (i=0 ; i<N ; i++){
        x[i] = 0.001 * i;
    }
}

// MAIN
int main(void)
{
    num_t *a, *b, *c;
    num_t *dev_a, *dev_b, *dev_c;
    size_t size = N * sizeof(num_t);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = (num_t*)malloc(size);
    b = (num_t*)malloc(size);
    c = (num_t*)malloc(size);

    init_vector(a);
    init_vector(b);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);

    printf("a = [\n");
    int i;
    for (i=0;i<10;i++){
        printf("%g\n",a[i]);
    }
    printf("...\n");
    for (i=N-10;i<N;i++){
        printf("%g\n",a[i]);
    }
    printf("]\n\n");
    printf("a*b = %g.\n", *c);

    free(a); free(b); free(c);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
    a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I get significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I'm inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on a different copy of this variable? Should I think of the kernel dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match MATLAB, more or less.
The fact that you have this line in your kernel unprotected by any synchronization is messing you up. Every thread of every block, whenever they happen to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
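A sketch of that combined approach, reusing the question's num_t and THREADS_PER_BLOCK definitions (my own illustration, not code from the linked examples):

__global__ void dot_reduced(const num_t *a, const num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    // classical shared-memory tree reduction within the block
    // (assumes THREADS_PER_BLOCK is a power of 2)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            temp[threadIdx.x] += temp[threadIdx.x + s];
        __syncthreads();
    }

    // one atomic update per block, instead of a serial loop over shared memory
    if (threadIdx.x == 0)
        atomicAdd(c, temp[0]);
}

As before, c must still be zeroed from the host before the launch.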
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed by threadblock, since the threads in a threadblock are executing together to a large extent. However, even within a threadblock, execution is not necessarily in perfect lockstep across all threads. Normally when we think of lockstep execution, we think of a warp, which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. However, if there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and if (threadIdx.x == 0) etc. But these mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore to fix your issue, we had to zero out c prior to the kernel launch.