How can I make use of swap space on disk when running out of GPU memory? - cuda

Post-Pascal Unified Memory (UM) can allocate more memory than the GPU has, swapping pages in and out between GPU memory and host memory automatically.
So what happens when both GPU memory and host memory run out? How can I use the swap space on disk? Virtual memory swap space does not seem to be used in the cudaMallocManaged case. Here is how I did the experiment:
Create swap space: dd if=/dev/zero of=./swapfile bs=1G count=16, then mkswap and swapon.
Create a host memory occupier that burns through ~99% of host memory:
for (i = 0; i < 8000; i++)
    malloc(1 << 20);
Create a GPU memory occupier with cudaMalloc, leaving 1 GB of GPU memory free.
Run the official UVM demo, but with a 12 GB workload, which will exhaust all available GPU memory and host memory:
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void) {
    long long N = 6LL * (1 << 30) / sizeof(float); // <<<<<<<<<<<
    float *x, *y;
    // Allocate Unified Memory -- accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Launch kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
The process just gets killed by the oom-killer and leaves the swap space untouched.
If instead it is a plain libc malloc with a 100 GB workload, you can see the swap space usage growing.
I expected UM/UVA to be able to use GPU memory + host memory + swap space, just as ordinary virtual memory does.

There are 3 types of memory allocations that are accessible to GPU device code:
ordinary (e.g. cudaMalloc)
pinned (e.g. cudaHostAlloc)
managed (e.g. cudaMallocManaged)
None of these will make use of, or have any bearing on, traditional Linux swap space (or the equivalent on Windows). The first is limited by available device memory, and the second two are limited by available host memory (or some other lower limit). All host-based allocations accessible to GPU device code must be resident in non-swappable memory, using "swappable" here to refer to the ordinary host virtual memory management system that may swap pages out to disk.
The only space that benefits from this form of swapping is host pageable memory allocations, and these are not directly accessible from CUDA device code.
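For reference, here is a minimal sketch of the three allocation types side by side; the sizes and variable names are arbitrary and chosen only for illustration.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20; // 1 MiB, arbitrary size for illustration

    // 1. Ordinary device allocation: lives in GPU memory, limited by device memory.
    float *d_ordinary = NULL;
    cudaMalloc((void**)&d_ordinary, bytes);

    // 2. Pinned (page-locked) host allocation: lives in host memory that the OS
    //    will never swap out to disk, which is what makes it GPU-accessible.
    float *h_pinned = NULL;
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);

    // 3. Managed allocation: pages migrate on demand between device and host
    //    memory, but still never to disk-backed swap.
    float *managed = NULL;
    cudaMallocManaged((void**)&managed, bytes);

    printf("allocated 3 x %zu bytes\n", bytes);

    cudaFree(d_ordinary);
    cudaFreeHost(h_pinned);
    cudaFree(managed);
    return 0;
}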

Related

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, to get the best performance you should make your accesses coalesced and aligned. My GPU has Compute Capability 6.1 "Pascal", 48 kiB of shared memory per thread block, and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block may each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 kiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double makes sure that each thread will access the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
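As an illustration of that point, here is a hedged sketch of the "MWE added" kernel with the accumulator kept in a register instead of shared memory; the arithmetic is taken from the question's kernel, only the cache array is removed.
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;

    // Per-thread accumulator kept in a register: no shared-memory traffic,
    // no bank-conflict considerations, and a single global write at the end.
    double acc = 0.0;
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            acc += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = acc;
}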

dram_write_bytes result on P100

I used nvprof to profile a simple vecadd example (n = 1024) on a P100 but observed that dram_write_bytes is only 256 (rather than the 1024*4 that I expected). Can someone explain why this number is so small? What other metrics do I need to add to account for the global memory writes? Thanks. The flop_count_sp number is correct (1024).
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void vecAdd(float* a, float* b, float* c, int n){
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if(id < n) c[id] = a[id] + b[id];
}

int main(int argc, char* argv[]){
    int n = 1024;
    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;
    size_t bytes = n*sizeof(float);

    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    for(i = 0; i < n; i++){
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i+1);
    }

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd <<<1, 1024>>> (d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    float sum = 0;
    for(i = 0; i < n; i++)
        sum += h_c[i] - h_a[i] - h_b[i];
    printf("final diff: %f\n", sum/n);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
Is it related to sampling in nvprof? One time I got 384 bytes; sometimes I even got 0 bytes. The weird thing is: if I change n to 1024*1024, I get more bytes than I expected (4688032). 4688032/1024/1024/4 = 1.11.
There are several reasons why your expectations are not being observed and the data varies from run to run:
The GPU memory system is shared by all engines. The primary engine is the graphics/compute engine, but other engines such as the copy engines, the display controller, etc. also access device memory, and the memory controller (FB = framebuffer) counters have no way to track which client made the request.
NVPROF injection does not attempt to evict all context memory from the L2 cache. The cudaMemcpys prior to the launch and the kernel replay code in nvprof will leave the L2 cache in an inconsistent state.
The initial size of 4 KB is simply too small to track accurately. The full data set could be in L2 from either the cudaMemcpy or the replay. Furthermore, the bytes you see can come from other clients such as the constant caches.
It is highly recommended that you scale the buffer size up to something reasonable. On newer GPUs the Nsight Compute profiler has an improved L2-level breakdown of the various clients to help detect unexpected traffic. In addition, Nsight Compute's replay logic clears the L2 cache so that each replay has a consistent starting state.
If you have a monitor attached, it is recommended to move the monitor to a different GPU when looking at DRAM counters. nvprof L2 counters generally filter the count to traffic from the SMs, so traffic from the copy engines, the display controller, the MMU, the constant caches, etc. will not show up in the L2 counters.
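As a practical note, after scaling n up to several million elements, the metric can be collected in a single invocation along these lines (the executable name here is assumed):
nvprof --metrics dram_write_bytes ./vecadd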

Where is the boundary of start and end of CPU launch and GPU launch of Nvidia Profiling NVPROF?

What is the definition of the start and end of a kernel launch on the CPU and on the GPU (the yellow blocks)? Where is the boundary between them?
Please notice that the start, end, and duration of those yellow blocks on the CPU and the GPU are different. Why does the CPU invocation of vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); take that long?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    //printf("id = %d \n", id);
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 1000000;
    // Host input vectors
    double *h_a;
    double *h_b;
    //Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    //Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = 1024;
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
CPU yellow block: (profiler screenshot)
GPU yellow block: (profiler screenshot)
Note that you mention NVPROF, but the pictures you are showing are from nvvp, the visual profiler; nvprof is the command-line profiler.
GPU Kernel launches are asynchronous. That means that the CPU thread launches the kernel but does not wait for the kernel to complete. In fact, the CPU activity is actually placing the kernel in a launch queue - the actual execution of the kernel may be delayed if anything else is happening on the GPU.
So there is no defined relationship between the CPU (API) activity, and the GPU activity with respect to time, except that the CPU kernel launch must obviously precede (at least slightly) the GPU kernel execution.
The CPU (API) yellow block represents the duration of time that the CPU thread spends in a library call into the CUDA Runtime library, to launch the kernel (i.e. place it in the launch queue). This library call activity usually has some time overhead associated with it, in the range of 5-50 microseconds. The start of this period is marked by the start of the call into the library. The end of this period is marked by the time at which the library returns control to your code (i.e. your next line of code after the kernel launch).
The GPU yellow block represents the actual time period during which the kernel was executing on the GPU. The start and end of this yellow block are marked by the start and end of kernel activity on the GPU. The duration here is a function of what the code in your kernel is doing, and how long it takes.
I don't think the exact reason why a GPU kernel launch takes ~5-50 microseconds of CPU time is documented or explained anywhere in an authoritative fashion, and it is a closed source library, so you will need to acknowledge that overhead as something you have little control over. If you design kernels that run for a long time and do a lot of work, this overhead can become insignificant.
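To see the asynchrony directly, here is a hedged sketch (not from the original question) that times the launch call separately from the completed kernel; busyKernel and its parameters are made up purely so the GPU work clearly outlasts the CPU-side launch call.
#include <cstdio>
#include <chrono>

// Hypothetical busy kernel: arbitrary arithmetic so the kernel runs for a while.
__global__ void busyKernel(float *buf, int iters)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float v = (float)i;
    for (int j = 0; j < iters; j++)
        v = v * 0.999f + 1.0f;
    buf[i] = v;
}

int main()
{
    using clk = std::chrono::steady_clock;
    const int n = 1 << 20;

    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // Warm-up launch so context creation does not skew the measurement.
    busyKernel<<<n / 256, 256>>>(d_buf, 1);
    cudaDeviceSynchronize();

    auto t0 = clk::now();
    busyKernel<<<n / 256, 256>>>(d_buf, 100000); // returns as soon as the launch is queued
    auto t1 = clk::now();
    cudaDeviceSynchronize();                     // blocks until the kernel has actually finished
    auto t2 = clk::now();

    printf("launch call (CPU side): %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("launch + execution:     %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t0).count());

    cudaFree(d_buf);
    return 0;
}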

What is the general way to launch appropriate amount of reduction kernels?

As I have read in NVIDIA's presentation at this link http://www.cuvilib.com/Reduction.pdf, for arrays bigger than blockSize, I should launch multiple reduction kernels to achieve global synchronization. What is the general way to determine how many times I should launch the reduction kernel? I tried the approach below, but I need to cudaMalloc two additional buffers, which takes a lot of processing time.
My job is to reduce the array d_logLuminance to a single minimum value min_logLum.
void your_histogram_and_prefixsum(const float* const d_logLuminance,
                                  float &min_logLum,
                                  const size_t numRows,
                                  const size_t numCols)
{
    const dim3 blockSize(512);
    unsigned int pixel = numRows*numCols;
    const dim3 gridSize(pixel/blockSize.x+1);

    //Reduction kernels to find max and min value
    float *d_tempMin, *d_min;
    checkCudaErrors(cudaMalloc((void**) &d_tempMin, sizeof(float)*pixel));
    checkCudaErrors(cudaMalloc((void**) &d_min, sizeof(float)*pixel));
    checkCudaErrors(cudaMemcpy(d_min, d_logLuminance, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));

    dim3 subGrid = gridSize;
    for(int reduceLevel = pixel; reduceLevel > 0; reduceLevel /= blockSize.x) {
        checkCudaErrors(cudaMemcpy(d_tempMin, d_min, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
        reduceMin<<<subGrid,blockSize,blockSize.x*sizeof(float)>>>(d_tempMin, d_min);
        cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
        subGrid.x = subGrid.x / blockSize.x + 1;
    }

    checkCudaErrors(cudaMemcpy(&min_logLum, d_min, sizeof(float), cudaMemcpyDeviceToHost));
    std::cout<< "Min value = " << min_logLum << std::endl;

    checkCudaErrors(cudaFree(d_tempMin));
    checkCudaErrors(cudaFree(d_min));
}
And if you are curious, here is my reduction kernel:
__global__
void reduceMin(const float* const g_inputRange,
               float* g_outputRange)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    sdata[tid] = g_inputRange[i];
    __syncthreads();
    for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
        if (tid < s){
            sdata[tid] = min(sdata[tid],sdata[tid+s]);
        }
        __syncthreads();
    }
    if(tid == 0){
        g_outputRange[blockIdx.x] = sdata[0];
    }
}
There are many ways to skin the cat, but if you want to minimize kernel launches, it can always be done with at most two kernel launches.
The first kernel launch is composed of up to as many blocks as the number of threads per block that your device supports: newer devices support 1024, older devices 512.
Each of these (at most 512 or 1024) blocks in the first kernel will participate in a grid-looping sum of all the data elements in global memory.
Each of these blocks will then do a partial reduction and write a partial result to global memory. There will be 512 or 1024 of these partial results.
The second kernel launch will be composed of 512 or 1024 threads in a single block. Each thread will pick up one of the partial results from global memory, and then the threads in that single block will cooperatively reduce the partial results to a single final result, and write it back to global memory.
The "grid-looping sum" is described in reduction #7 here as "multiple add/thread". All of the reductions described in this document are available in the NVIDIA reduction sample code

Simplest Possible Example to Show GPU Outperform CPU Using CUDA

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
First off, I'll reiterate my comment: GPUs are high-bandwidth, high-latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) completely misses the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#include <stdio.h>

#define N (1024*1024)
#define M (1000000)

int main()
{
    float data[N]; int count = 0;
    for(int i = 0; i < N; i++)
    {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
            data[i] = data[i] * data[i] - 0.25f;
        }
    }
    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
}
Use something like this in CUDA/C:
#include <stdio.h>

#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    buf[i] = 1.0f * i / N;
    for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
}

int main()
{
    float data[N]; int count = 0;
    float *d_data;

    cudaMalloc(&d_data, N * sizeof(float));
    cudakernel<<<N/256, 256>>>(d_data);
    cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    int sel;
    printf("Enter an index: ");
    scanf("%d", &sel);
    printf("data[%d] = %f\n", sel, data[sel]);
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
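For completeness, the two versions would typically be built along these lines (the file names here are assumed):
g++ -O2 cpu_version.cpp -o cpu_version
nvcc -O2 gpu_version.cu -o gpu_version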
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. It's easy to implement and plays to the GPU's strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ a while back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual-core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shut down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to @Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
For reference, I made a similar example with time measurements. With a GTX 660, the GPU speedup was 24X, where the measured time includes the data transfers in addition to the actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.