Small program that burdens the GPU?

What is the most efficient way to burden a GPU and increase energy consumption, for testing purposes?
I want the program to be as small as possible. Is there a specific kernel function that does the job?
Any suggestions for Metal or CUDA would be perfect.

I am sketching a possible solution here. You will need to experiment to maximize the thermal load of your GPU. Generally speaking, data movement is energetically expensive, much more so than computation in modern processors. So shuffling a lot of data will drive up power consumption. At the same time, we want an additive contribution to power consumption from computational units. Multipliers tend to be the largest power hogs; in modern processors we might want to target FMA (fused multiply-add) units.
Various GPUs have low throughput of double-precision math ops, and others have low throughput of half-precision math ops. Therefore we would want to focus on single-precision math for the computational portion of our load. We want to be able to change the ratio of computation to memory activity easily. One approach would be to use the unrolled evaluation of a polynomial with the Horner scheme as the basic building block, using POLY_DEPTH steps. This we repeat REPS times in a loop. Prior to the loop we retrieve source data from global memory, and after the loop terminates we store the result to global memory. By varying REPS we can experiment with various compute/memory balances.
One could further experiment with instruction-level parallelism, data patterns (as the power consumption of multipliers often differs based on bit patterns), and adding PCIe activity by using CUDA streams to achieve overlap of kernel execution and PCIe data transfers. Below I just used some random constants as multiplier data.
Obviously we would want to fill the GPU with lots of threads. For this we may use a fairly small THREADS_PER_BLK value, giving us fine granularity for filling each SM. We might want to choose the number of blocks as a multiple of the number of SMs to spread the load as evenly as possible, or use a MAX_BLOCKS value that is evenly divisible by common SM counts. How much source and destination memory we should touch is up to experimentation: we can define arrays of LEN elements as a multiple of the number of blocks. Lastly, we want to execute the kernel thus defined and configured ITER times to create a continuous load for some time.
Note that as we apply load, the GPU will heat up and this will in turn further increase its power draw. To achieve maximum thermal load it will be necessary to run the load-generating app for 5 minutes or more. Note further that the GPU power management may dynamically reduce clock frequencies and voltages to reduce power consumption, and the power cap may kick in before you reach the thermal limit. Depending on the GPU you may be able to set a power cap higher than the one used by default with the nvidia-smi utility.
The program below keeps my Quadro P2000 pegged at the power cap, with a GPU load of 98% and a memory controller load of 83%-86%, as reported by TechPowerUp's GPU-Z utility. It will certainly require adjustments for other GPUs.
#include <stdlib.h>
#include <stdio.h>
#define THREADS_PER_BLK (128)
#define MAX_BLOCKS (65520)
#define LEN (MAX_BLOCKS * 1024)
#define POLY_DEPTH (30)
#define REPS (2)
#define ITER (100000)
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
    cudaError_t err = call; \
    if (cudaSuccess != err) { \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                 __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
    /* Check synchronous errors, i.e. pre-launch */ \
    cudaError_t err = cudaGetLastError(); \
    if (cudaSuccess != err) { \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                 __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */ \
    err = cudaDeviceSynchronize(); \
    if (cudaSuccess != err) { \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                 __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while (0)
__global__ void burn (const float * __restrict__ src,
                      float * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        float p = src[i] + 1.0f; // float literals keep all math in single precision
        float q = src[i] + 3.0f;
        for (int k = 0; k < REPS; k++) {
#pragma unroll POLY_DEPTH
            for (int j = 0; j < POLY_DEPTH; j++) {
                p = fmaf (p, 0.68073987f, 0.8947237f);
                q = fmaf (q, 0.54639739f, 0.9587058f);
            }
        }
        dst[i] = p + q;
    }
}
int main (int argc, char *argv[])
{
    float *d_a, *d_b;
    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * LEN));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * LEN));
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * LEN)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * LEN)); // NaN
    /* Compute execution configuration */
    dim3 dimBlock(THREADS_PER_BLK);
    int threadBlocks = (LEN + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > MAX_BLOCKS) threadBlocks = MAX_BLOCKS;
    dim3 dimGrid(threadBlocks);
    printf ("burn: using %d threads per block, %d blocks, %f GB\n",
            dimBlock.x, dimGrid.x, 2e-9*LEN*sizeof(d_a[0]));
    for (int k = 0; k < ITER; k++) {
        burn<<<dimGrid,dimBlock>>>(d_a, d_b, LEN);
        CHECK_LAUNCH_ERROR();
    }
    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));
    return EXIT_SUCCESS;
}
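As one illustration of the PCIe-traffic idea mentioned earlier, the launch loop could be reworked to run the kernel in one stream while a second stream pushes data across the bus. This is a minimal sketch, not part of the program above: xfer, h_buf, d_buf and the stream names are illustrative, and the transfer size is arbitrary.
// Sketch only: keep burn kernels running in one stream while a second
// stream generates PCIe traffic with asynchronous host->device copies.
// h_buf must be pinned memory (cudaMallocHost) for cudaMemcpyAsync to
// overlap with kernel execution.
const size_t xfer = 4 * 1048576;  // arbitrary transfer size (4M floats)
cudaStream_t compute_stream, copy_stream;
CUDA_SAFE_CALL (cudaStreamCreate(&compute_stream));
CUDA_SAFE_CALL (cudaStreamCreate(&copy_stream));
float *h_buf, *d_buf;
CUDA_SAFE_CALL (cudaMallocHost((void**)&h_buf, sizeof(float) * xfer));
CUDA_SAFE_CALL (cudaMalloc((void**)&d_buf, sizeof(float) * xfer));
for (int k = 0; k < ITER; k++) {
    burn<<<dimGrid,dimBlock,0,compute_stream>>>(d_a, d_b, LEN);
    CUDA_SAFE_CALL (cudaMemcpyAsync(d_buf, h_buf, sizeof(float) * xfer,
                                    cudaMemcpyHostToDevice, copy_stream));
}
CUDA_SAFE_CALL (cudaDeviceSynchronize());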

Related

Why doesn't the console show the entire output? [duplicate]

For the purpose of testing printf() calls on the device, I wrote a simple program which copies an array of moderate size to the device and prints the values of the device array to the screen. Although the array is correctly copied to the device, the printf() function does not work correctly: the first several hundred numbers are lost. The array size in the code is 4096. Is this a bug, or am I not using this function properly? Thanks in advance.
EDIT: My GPU is a GeForce GTX 550i, with compute capability 2.1
My code:
#include <stdio.h>
#include <stdlib.h>
#define N 4096
__global__ void Printcell(float *d_Array, int n){
    int k = 0;
    printf("\n=========== data of d_Array on device==============\n");
    for( k = 0; k < n; k++ ){
        printf("%f ", d_Array[k]);
        if((k+1)%6 == 0) printf("\n");
    }
    printf("\n\nTotally %d elements has been printed", k);
}
int main(){
    int i = 0;
    float Array[N] = {0}, rArray[N] = {0};
    float *d_Array;
    for(i=0;i<N;i++)
        Array[i] = i;
    cudaMalloc((void**)&d_Array, N*sizeof(float));
    cudaMemcpy(d_Array, Array, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    Printcell<<<1,1>>>(d_Array, N); //Print the device array by a kernel
    cudaDeviceSynchronize();
    /* Copy the device array back to host to see if it was correctly copied */
    cudaMemcpy(rArray, d_Array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("\n\n");
    for(i=0;i<N;i++){
        printf("%f ", rArray[i]);
        if((i+1)%6 == 0) printf("\n");
    }
}
printf from the device has a limited queue. It's intended for small scale debug-style output, not large scale output.
referring to the programmer's guide:
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten.
Your in-kernel printf output overran the buffer, and so the first printed elements were lost (overwritten) before the buffer was dumped into the standard I/O queue.
The linked documentation indicates that the buffer size can be increased, also.
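For illustration, a sketch of how the buffer could be enlarged before the kernel launch in the code above; the 16 MB size is an arbitrary example, chosen to comfortably hold 4096 printed floats:
// Enlarge the device-side printf FIFO before launching the kernel.
// The 16 MB size here is an arbitrary example, not a recommendation.
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1048576);
Printcell<<<1,1>>>(d_Array, N);
cudaDeviceSynchronize();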

Best strategy for grid search with CUDA

Recently I started working with CUDA and read an introductory book on the computing language. To see if I understood it well, I considered the following problem.
Consider minimizing a function f(x,y) on the grid [-1,1] X [-1,1]. This provided me with a few practical questions and I would like to have your take on things.
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
Suppose I don't explicitly make a grid. I can discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use one list of blocks.
Thanks!
This was an interesting question for me because it represents a type of problem that I think is rare:
potentially high compute load
little to no data that needs to be communicated host->device
very low volume of results that need to be communicated device->host
In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.
Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.
To answer your questions:
Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?
I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task - which GPUs are good at - and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this instead (compute the functional value at each grid point). Since we won't be using global access to data for this, texture memory is not an issue.
Suppose I don't explicitly make a grid. I can discretise the X and Y directions with constant float arrays (which provide fast memory access) and then use one list of blocks.
Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.
Here's some sample code that might be interesting to play with or to crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum (in this case).
$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>
// grid dimensions and divisions
#define XNR -1.0f
#define XPR 1.0f
#define YNR -1.0f
#define YPR 1.0f
#define DX 0.0001f
#define DY 0.0001f
// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16
// optimization functions - these are currently set for minimization
#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)
// error check macro
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
    return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) + 0.1f;
}
// variable for block-draining reduction block counter
__device__ int blkcnt = 0;
// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){
    __shared__ float sh_f[BLK_X*BLK_Y];
    __shared__ float sh_x[BLK_X*BLK_Y];
    __shared__ float sh_y[BLK_X*BLK_Y];
    __shared__ int lblock;
    // compute x,y coordinates for this thread
    float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
    float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;
    int thid = (threadIdx.y*BLK_X)+threadIdx.x;
    lblock = 0;
    sh_x[thid] = x;
    sh_y[thid] = y;
    sh_f[thid] = f(x,y); // compute functional value of f(x,y)
    __syncthreads();
    // perform block-level shared memory reduction
    // assume block size is a power of 2
    for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
        if (thid < i)
            if (TST(sh_f[thid],sh_f[thid+i])){
                sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
                sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
                sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
        __syncthreads();}
    volatile float *vf = sh_f;
    volatile float *vx = sh_x;
    volatile float *vy = sh_y;
    for (int i = 16; i > 0; i>>=1)
        if (thid < i)
            if (TST(vf[thid],vf[thid+i])){
                vf[thid] = OPT(vf[thid],vf[thid+i]);
                vx[thid] = OPT(vx[thid],vx[thid+i]);
                vy[thid] = OPT(vy[thid],vy[thid+i]);}
    // save block reduction result, and check if last block
    if (!thid){
        bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
        bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
        by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
        int myblock = atomicAdd(&blkcnt, 1);
        if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
    __syncthreads();
    if (lblock){
        // do last-block reduction
        float my_x, my_y, my_f;
        int myid = thid;
        if (myid < gridDim.x * gridDim.y){
            my_x = bx[myid];
            my_y = by[myid];
            my_f = bf[myid];}
        else { assert(0);} // does not work correctly if block dims are greater than grid dims
        myid += blockDim.x*blockDim.y;
        while (myid < gridDim.x*gridDim.y){
            if (TST(my_f,bf[myid])){
                my_x = OPT(my_x,bx[myid]);
                my_y = OPT(my_y,by[myid]);
                my_f = OPT(my_f,bf[myid]);}
            myid += blockDim.x*blockDim.y;}
        sh_f[thid] = my_f;
        sh_x[thid] = my_x;
        sh_y[thid] = my_y;
        __syncthreads();
        for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
            if (thid < i)
                if (TST(sh_f[thid],sh_f[thid+i])){
                    sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
                    sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
                    sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
            __syncthreads();}
        if (!thid){
            bf[0] = sh_f[0];
            bx[0] = sh_x[0];
            by[0] = sh_y[0];
        }
    }
}
// cpu (naive,serial) function for comparison
float3 opt_cpu(){
    float optx = XNR;
    float opty = YNR;
    float optf = f(optx,opty);
    for (float x = XNR; x < XPR; x += DX)
        for (float y = YNR; y < YPR; y += DY){
            float test = f(x,y);
            if (TST(optf,test)){
                optf = OPT(optf,test);
                optx = OPT(optx,x);
                opty = OPT(opty,y);}}
    return make_float3(optf, optx, opty);
}
int main(){
    // compute threadblock and grid dimensions
    int nx = ceil(XPR-XNR)/DX;
    int ny = ceil(YPR-YNR)/DY;
    int bx = ceil(nx/(float)BLK_X);
    int by = ceil(ny/(float)BLK_Y);
    dim3 threads(BLK_X, BLK_Y);
    dim3 blocks(bx, by);
    float *d_bx, *d_by, *d_bf;
    cudaFree(0);
    // run GPU test case
    unsigned long gtime = dtime_usec(0);
    cudaMalloc(&d_bx, bx*by*sizeof(float));
    cudaMalloc(&d_by, bx*by*sizeof(float));
    cudaMalloc(&d_bf, bx*by*sizeof(float));
    opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
    float rf, rx, ry;
    cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("some error");
    gtime = dtime_usec(gtime);
    printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
    // run CPU test case
    unsigned long ctime = dtime_usec(0);
    float3 cpu_res = opt_cpu();
    ctime = dtime_usec(ctime);
    printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);
    return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$
Notes:
This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain: x(-1,1), y(-1,1)
The test was run on Fedora 20, CUDA 7, a Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. A different CPU or GPU will obviously affect the comparison.
The observed speedup here is about 14x. The CPU code is naive, single-threaded code.
It should be possible, via modification of the OPT and TST macros, to perform a different kind of optimization - such as finding a maximum instead of a minimum (see the sketch after these notes).
The domain (and grid) dimensions and the granularity of the search can be modified via the compile-time constants such as XNR, XPR, etc.
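For instance, a maximization variant would just flip the comparison in the TST macro; a sketch:
// maximization: treat X2 as the better value when it is larger than X1
#define TST(X1,X2) ((X1)<(X2))
#define OPT(X1,X2) (X2)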

Unified Memory and Streams in C

I am trying to use streams with CUDA 6 and unified memory in C. My previous stream implementation looked like this:
for(x=0; x<DSIZE; x+=N*2){
    gpuErrchk(cudaMemcpyAsync(array_d0, array_h+x, N*sizeof(char), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(array_d1, array_h+x+N, N*sizeof(char), cudaMemcpyHostToDevice, stream1));
    gpuErrchk(cudaMemcpyAsync(data_d0, data_h, wrap->size*sizeof(int), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(data_d1, data_h, wrap->size*sizeof(int), cudaMemcpyHostToDevice, stream1));
    searchGPUModified<<<N/128,128,0,stream0>>>(data_d0, array_d0, out_d0);
    searchGPUModified<<<N/128,128,0,stream1>>>(data_d1, array_d1, out_d1);
    gpuErrchk(cudaMemcpyAsync(out_h+x, out_d0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0));
    gpuErrchk(cudaMemcpyAsync(out_h+x+N, out_d1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1));
}
but I cannot find an example of streams and unified memory using the same technique, where chunks of data are sent to the GPU. I am thus wondering if there is a way to do this?
You should read section J.2.2 of the programming guide (and preferably all of appendix J).
With Unified Memory, memory allocated using cudaMallocManaged is by default attached to all streams ("global"), and we must modify this in order to make effective use of streams, e.g. for compute/copy overlap. We can do this with the cudaStreamAttachMemAsync function as described in section J.2.2.3. By associating each memory "chunk" with a stream in this fashion, the UM subsystem can make intelligent decisions about when to transfer each data item.
The following example demonstrates this:
#include <stdio.h>
#include <string.h> // for memset
#include <time.h>
#define DSIZE 1048576
#define DWAIT 100000ULL
#define nTPB 256
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
typedef int mytype;
__global__ void mykernel(mytype *data){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < DSIZE) data[idx] = 1;
    unsigned long long int tstart = clock64();
    while (clock64() < tstart + DWAIT);
}
int main(){
    mytype *data1, *data2, *data3;
    cudaStream_t stream1, stream2, stream3;
    cudaMallocManaged(&data1, DSIZE*sizeof(mytype));
    cudaMallocManaged(&data2, DSIZE*sizeof(mytype));
    cudaMallocManaged(&data3, DSIZE*sizeof(mytype));
    cudaCheckErrors("cudaMallocManaged fail");
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);
    cudaCheckErrors("cudaStreamCreate fail");
    cudaStreamAttachMemAsync(stream1, data1);
    cudaStreamAttachMemAsync(stream2, data2);
    cudaStreamAttachMemAsync(stream3, data3);
    cudaDeviceSynchronize();
    cudaCheckErrors("cudaStreamAttach fail");
    memset(data1, 0, DSIZE*sizeof(mytype));
    memset(data2, 0, DSIZE*sizeof(mytype));
    memset(data3, 0, DSIZE*sizeof(mytype));
    mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream1>>>(data1);
    mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream2>>>(data2);
    mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream3>>>(data3);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    for (int i = 0; i < DSIZE; i++){
        if (data1[i] != 1) {printf("data1 mismatch at %d, should be: %d, was: %d\n", i, 1, data1[i]); return 1;}
        if (data2[i] != 1) {printf("data2 mismatch at %d, should be: %d, was: %d\n", i, 1, data2[i]); return 1;}
        if (data3[i] != 1) {printf("data3 mismatch at %d, should be: %d, was: %d\n", i, 1, data3[i]); return 1;}
    }
    printf("Success!\n");
    return 0;
}
The above program creates a kernel that runs artificially long using clock64(), so as to give us a simulated opportunity for compute/copy overlap (simulating a compute-intensive kernel). We are launching 3 instances of this kernel, each instance operating on a separate "chunk" of data.
When we profile the above program, the following is seen:
First, note that the 3rd kernel launch is highlighted in yellow, and it begins immediately after the second kernel launch highlighted in purple. The actual cudaLaunch runtime API event that launches this 3rd kernel is indicated in the runtime API line by the mouse pointer, also highlighted in yellow (and is preceded by the cudaLaunch events for the first 2 kernels). Since this launch happens during execution of the first kernel, and there is no intervening "empty space" from that point until the start of the 3rd kernel, we can observe that the transfer of the data for the 3rd kernel launch (i.e. data3) occurred while kernels 1 and 2 were executing. Therefore we have effective overlap of copy and compute. (We could make a similar observation about kernel 2).
Although I haven't shown it here, if we omit the cudaStreamAttachMemAsync lines, the program still compiles and runs correctly, but if we profile it, we observe a different relationship between the cudaLaunch events and the kernels. The overall profile looks similar, and the kernels are executing back to back, but the entire cudaLaunch process now begins and ends before the first kernel begins executing, and there are no cudaLaunch events during the kernel execution. This indicates that (since all the cudaMallocManaged memory is global) all of the data transfers are taking place prior to the first kernel launch. The program has no way to associate a "global" allocation with any particular kernel, so all such allocated memory must be transferred before the first kernel launch (even though that kernel is only using data1).

execute CUDA kernel few times

I have a problem with executing a CUDA kernel several times. Something is wrong with the environment in my code. The first time the code works properly; the second time, during cleanup of the environment before the third call, there are random crashes.
I think for some reason I have memory corruption. The crashes occur sometimes in the CUDA driver, sometimes in a simple printf, or in the heap (kernel32.dll). I suppose I have a problem with memory management in my code.
What should be done before executing the kernel again?
This code works when I execute it once.
I'm using CURAND to initialize random generators.
Here is my code:
#define GRID_BLOCK 64
#define GRID_THREAD 8
#define CITIES 100
#define CIPOW2 101
int lenghtPaths = GRID_BLOCK*GRID_THREAD;
int cities = CITIES;
//prepare CURAND
curandState *devStates;
CUDA_CALL(cudaMalloc((void **)&devStates, GRID_BLOCK*GRID_THREAD*sizeof(curandState)));
/* Setup prng states */
setup_kernel<<<GRID_BLOCK ,GRID_THREAD>>>(devStates);
CUDA_CALL(cudaDeviceSynchronize());
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess)
    fprintf(stderr, "CURAND preparation failed: %s\n", cudaGetErrorString(cudaStatus));
//copy distance grid to constant memory
cudaMemcpyToSymbol(cdist, dist, sizeof(int)*CIPOW2*CIPOW2);
CUDA_CALL(cudaMalloc((void**)&dev_pathsForThreads, lenghtPaths * cities * sizeof(int)));
CUDA_CALL(cudaMalloc((void**)&d_results, GRID_BLOCK*GRID_THREAD * sizeof(int)));
for (int k = 0; k < 5; k++){
    int* pathsForThreads;
    pathsForThreads = (int*)malloc(lenghtPaths * cities * sizeof(int));
    pathsForThreads = PreaparePaths(Path, lenghtPaths, cities);
    CUDA_CALL(cudaMemcpy(dev_pathsForThreads, pathsForThreads, lenghtPaths*cities*sizeof(int), cudaMemcpyHostToDevice));
    GPUAnnealing<<<GRID_BLOCK ,GRID_THREAD>>>(dev_pathsForThreads, devStates, iterationLimit, temperature, coolingRate, absoluteTemperature, cities, d_results);
    CUDA_CALL(cudaDeviceSynchronize());
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
        fprintf(stderr, "GPUAnnealing launch failed: %s\n", cudaGetErrorString(cudaStatus));
    h_results = (int*)malloc(GRID_BLOCK*GRID_THREAD * sizeof(int));
    //Copy lenght of each path to CPU
    CUDA_CALL(cudaMemcpy(h_results, d_results, GRID_BLOCK*GRID_THREAD * sizeof(int), cudaMemcpyDeviceToHost));
    //Copy paths to CPU
    CUDA_CALL(cudaMemcpy(pathsForThreads, dev_pathsForThreads, lenghtPaths*cities*sizeof(int), cudaMemcpyDeviceToHost));
    //check the shortest path
    shortestPath = FindTheShortestPath(h_results);
    fprintf(stdout, "Shortest path on index = %d value = %d \n", shortestPath, h_results[shortestPath]);
    for (int i = 0; i < GRID_BLOCK*GRID_BLOCK; i++)
        Path[i] = pathsForThreads[shortestPath*CITIES + i];
    free(pathsForThreads);
    free(h_results);
}
CUDA_CALL(cudaFree(dev_pathsForThreads));
CUDA_CALL(cudaFree(d_results));
CUDA_CALL(cudaFree(devStates));
CUDA_CALL(cudaDeviceReset());
This is a bad idea:
pathsForThreads = (int*)malloc(lenghtPaths * cities * sizeof(int));
pathsForThreads = PreaparePaths(Path, lenghtPaths, cities);
If the call to PreaparePaths assigns some other value to pathsForThreads than what was assigned to it by the malloc operation, then later when you do this:
free(pathsForThreads);
You're going to get unpredictable results.
You should not reassign a pointer that you're subsequently going to pass to free to some other value. The man page for free indicates:
free() frees the memory space pointed to by ptr, which must have been
returned by a previous call to malloc(), calloc() or realloc().
So reassigning the pointer to something else is not allowed if you intend to pass it to free.
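A sketch of one possible fix, assuming PreaparePaths can be changed to fill a caller-provided buffer; the signature below is hypothetical, since the question does not show the real one:
/* hypothetical signature: PreaparePaths fills the buffer passed to it */
int *pathsForThreads = (int*)malloc(lenghtPaths * cities * sizeof(int));
PreaparePaths(pathsForThreads, Path, lenghtPaths, cities);
/* ... use pathsForThreads, copy to/from the device ... */
free(pathsForThreads); /* frees the same pointer that malloc returned */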

Simplest Possible Example to Show GPU Outperform CPU Using CUDA

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc), for which the GPU outperforms the CPU. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#include <stdio.h>
#define N (1024*1024)
#define M (1000000)
int main()
{
    float data[N]; int count = 0;
    for(int i = 0; i < N; i++)
    {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
            data[i] = data[i] * data[i] - 0.25f;
        }
    }
    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
}
Use something like this in CUDA/C:
#include <stdio.h>
#define N (1024*1024)
#define M (1000000)
__global__ void cudakernel(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    buf[i] = 1.0f * i / N;
    for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
}
int main()
{
    float data[N]; int count = 0;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudakernel<<<N/256, 256>>>(d_data);
    cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
A very, very simple method would be to calculate the squares of, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and lends itself to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual-core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shut down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to #Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
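For what it's worth, a minimal CUDA sketch of the integer-squares idea above might look like the following; the original comparison was done in OpenCL, and the array size and launch configuration here are arbitrary:
#include <stdio.h>
#define COUNT 100000
__global__ void squares(float *out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < COUNT) out[i] = (float)i * (float)i; // square of each integer, stored as float
}
int main()
{
    float *d_out, h_out[4];
    cudaMalloc(&d_out, COUNT * sizeof(float));
    squares<<<(COUNT + 255) / 256, 256>>>(d_out);
    // spot-check the first few results
    cudaMemcpy(h_out, d_out, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", h_out[0], h_out[1], h_out[2], h_out[3]);
    cudaFree(d_out);
    return 0;
}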
For reference, I made a similar example with time measurements. With a GTX 660, the GPU speedup was 24X, where the measured GPU time includes the data transfers in addition to the actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (at around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.