Cuda Unified memory vs cudaMalloc - cuda

I am trying to do some benchmarking to ensure using CUDA's Unified Memory(UM) approach will not hurt us wrt performance.
I am performing an FFT. One way i use UM, one way i use the cudaMalloc
I compare the results afterwards and they all match up (which is good).
however, the timing i'm getting for the UM approach is ~.5ms vs the cudaMalloc way of ~.04 (after performing the run multiple times an averaging)
I am using Event records to do the timing. I have one right before and after the cufftExecC2C call.
Furthermore, I added two more event records to measure the time before any memory transfer to the device, and after using the data once i get it back from the device.
when doing this, i see the UM approach take ~1.6ms and the cudaMalloc approach taking ~.7.
Below is a snippet of code that does the UM approach:
cufftHandle plan;
cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);
cudaMallocManaged(&inData, dataSize * sizeof(cufftComplex));
cudaMallocManaged(&outData, dataSize * sizeof(cufftComplex));
cudaEvent_t start_before_memHtoD, start_kernel, stop_kernel,
stop_after_memDtoH;
cudaEventCreate(&start_kernel);
cudaEventCreate(&start_before_memHtoD);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&stop_after_memDtoH);
setupWave(dataSize, inData);
cudaEventRecord(start_before_memHtoD);
cudaMemPrefetchAsync(inData, dataSize * sizeof(cufftComplex), 1);
cudaDeviceSynchronize();
cudaEventRecord(start_kernel);
cufftExecC2C(plan, inData, outData, CUFFT_FORWARD);
cudaEventRecord(stop_kernel);
cudaEventSynchronize(stop_kernel);
float sum = 0;
for (int i = 0; i < dataSize; i++) {
sum += outData[i].x + outData[i].y;
}
cudaEventRecord(stop_after_memDtoH);
cudaEventSynchronize(stop_after_memDtoH);
std::cout << "sum for UM is " << sum << std::endl;
float umTime = 0;
float overallUmTime = 0;
cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
cudaEventElapsedTime(&overallUmTime, start_before_memHtoD,
stop_after_memDtoH);
resultString_um += std::to_string(dataSize) + " samples took "
+ std::to_string(umTime) + "ms, Overall: "
+ std::to_string(overallUmTime) + "\n";
cudaFree(outData);
cudaFree(inData);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_before_memHtoD);
cudaEventDestroy(stop_after_memDtoH);
cufftDestroy(plan);
The following is for the cudaMalloc approach
cufftComplex *d_inData;
cufftComplex *d_outData;
inData = (cufftComplex*) (malloc(sizeof(cufftComplex) * dataSize));
outData = (cufftComplex*) (malloc(sizeof(cufftComplex) * dataSize));
cudaMalloc((void**) (&d_inData), dataSize * sizeof(cufftComplex));
cudaMalloc((void**) (&d_outData), dataSize * sizeof(cufftComplex));
cufftHandle plan;
cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);
cudaEvent_t start_before_memHtoD, start_kernel, stop_kernel,
stop_after_memDtoH;
cudaEventCreate(&start_kernel);
cudaEventCreate(&start_before_memHtoD);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&stop_after_memDtoH);
setupWave(dataSize, inData);
cudaEventRecord(start_before_memHtoD);
cudaMemcpy(d_inData, inData, dataSize * sizeof(cufftComplex),
cudaMemcpyHostToDevice);
cudaEventRecord(start_kernel);
cufftExecC2C(plan, d_inData, d_outData, CUFFT_FORWARD);
cudaEventRecord(stop_kernel);
cudaEventSynchronize(stop_kernel);
cudaMemcpy(outData, d_outData, dataSize * sizeof(cufftComplex),
cudaMemcpyDefault);
cudaEventRecord(stop_after_memDtoH);
float sum = 0;
for (int i = 0; i < dataSize; i++) {
sum += outData[i].x + outData[i].y;
}
cudaEventRecord(stop_after_memDtoH);
cudaEventSynchronize(stop_after_memDtoH);
std::cout << "sum for UM is " << sum << std::endl;
float umTime = 0;
float overallUmTime = 0;
cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
cudaEventElapsedTime(&overallUmTime, start_before_memHtoD,
stop_after_memDtoH);
resultString_um += std::to_string(dataSize) + " samples took "
+ std::to_string(umTime) + "ms, Overall: "
+ std::to_string(overallUmTime) + "\n";
cudaFree(outData);
cudaFree(inData);
cudaFree(d_outData);
cudaFree(d_inData);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_before_memHtoD);
cudaEventDestroy(stop_after_memDtoH);
cufftDestroy(plan);
Is there something else I could be doing when using the unified memory approach to speed it up? I expected UM to be slower, but not by this much.
We are using the P100 on redhat 7.3 with Cuda 9

One problem with your posted code is that you are not doing a cudaMemPrefetchAsync on the output data from the FFT. According to my testing, this makes a significant difference. There were a few other problems with your code, for example we do not call cudaFree on a pointer allocated with malloc.
Here's a complete code built around what you have shown. When I run this on CentOS7.4, CUDA 9.1, Tesla P100, I get comparable times for the FFT performed in the managed memory case (3.52ms) vs. the FFT performed in the non-managed memory case (3.45ms):
$ cat t43.cu
#include <cufft.h>
#include <iostream>
#include <string>
//using namespace std;
const int dataSize = 1048576*32;
void setupWave(const int ds, cufftComplex *d){
for (int i = 0; i < ds; i++){
d[i].x = 1.0f;
d[i].y = 0.0f;}
}
int main(){
cufftComplex *inData, *outData;
cufftHandle plan;
cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);
cudaMallocManaged(&inData, dataSize * sizeof(cufftComplex));
cudaMallocManaged(&outData, dataSize * sizeof(cufftComplex));
cudaEvent_t start_before_memHtoD, start_kernel, stop_kernel,
stop_after_memDtoH;
cudaEventCreate(&start_kernel);
cudaEventCreate(&start_before_memHtoD);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&stop_after_memDtoH);
setupWave(dataSize, inData);
cudaEventRecord(start_before_memHtoD);
cudaMemPrefetchAsync(inData, dataSize * sizeof(cufftComplex), 0);
cudaMemPrefetchAsync(outData, dataSize * sizeof(cufftComplex), 0);
cudaDeviceSynchronize();
cudaEventRecord(start_kernel);
cufftExecC2C(plan, inData, outData, CUFFT_FORWARD);
cudaEventRecord(stop_kernel);
cudaEventSynchronize(stop_kernel);
float sum = 0;
for (int i = 0; i < dataSize; i++) {
sum += outData[i].x + outData[i].y;
}
cudaEventRecord(stop_after_memDtoH);
cudaEventSynchronize(stop_after_memDtoH);
std::cout << "sum for UM is " << sum << std::endl;
float umTime = 0;
float overallUmTime = 0;
cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
cudaEventElapsedTime(&overallUmTime, start_before_memHtoD,
stop_after_memDtoH);
std::string resultString_um = std::to_string(dataSize) + " samples took " + std::to_string(umTime) + "ms, Overall: " + std::to_string(overallUmTime) + "\n";
std::cout << resultString_um;
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaFree(inData);
cudaFree(outData);
cudaEventDestroy(start_before_memHtoD);
cudaEventDestroy(stop_after_memDtoH);
cufftDestroy(plan);
cufftComplex *d_inData;
cufftComplex *d_outData;
inData = (cufftComplex*) (malloc(sizeof(cufftComplex) * dataSize));
outData = (cufftComplex*) (malloc(sizeof(cufftComplex) * dataSize));
cudaMalloc((void**) (&d_inData), dataSize * sizeof(cufftComplex));
cudaMalloc((void**) (&d_outData), dataSize * sizeof(cufftComplex));
//cufftHandle plan;
cufftPlan1d(&plan, dataSize, CUFFT_C2C, 1);
//cudaEvent_t start_before_memHtoD, start_kernel, stop_kernel,
// stop_after_memDtoH;
cudaEventCreate(&start_kernel);
cudaEventCreate(&start_before_memHtoD);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&stop_after_memDtoH);
setupWave(dataSize, inData);
cudaEventRecord(start_before_memHtoD);
cudaMemcpy(d_inData, inData, dataSize * sizeof(cufftComplex),
cudaMemcpyHostToDevice);
cudaEventRecord(start_kernel);
cufftExecC2C(plan, d_inData, d_outData, CUFFT_FORWARD);
cudaEventRecord(stop_kernel);
cudaEventSynchronize(stop_kernel);
cudaMemcpy(outData, d_outData, dataSize * sizeof(cufftComplex),
cudaMemcpyDefault);
sum = 0;
for (int i = 0; i < dataSize; i++) {
sum += outData[i].x + outData[i].y;
}
cudaEventRecord(stop_after_memDtoH);
cudaEventSynchronize(stop_after_memDtoH);
std::cout << "sum for non-UM is " << sum << std::endl;
//float umTime = 0;
//float overallUmTime = 0;
cudaEventElapsedTime(&umTime, start_kernel, stop_kernel);
cudaEventElapsedTime(&overallUmTime, start_before_memHtoD,
stop_after_memDtoH);
resultString_um = std::to_string(dataSize) + " samples took "
+ std::to_string(umTime) + "ms, Overall: "
+ std::to_string(overallUmTime) + "\n";
std::cout << resultString_um;
free(outData);
free(inData);
cudaFree(d_outData);
cudaFree(d_inData);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_before_memHtoD);
cudaEventDestroy(stop_after_memDtoH);
cufftDestroy(plan);
}
$ nvcc -std=c++11 -arch=sm_60 -o t43 t43.cu -lcufft
$ ./t43
sum for UM is 3.35544e+07
33554432 samples took 3.520640ms, Overall: 221.909988
sum for non-UM is 3.35544e+07
33554432 samples took 3.456160ms, Overall: 278.099426
$

Related

CUDA Graph Problem: Results not computed for the first iteration

I am trying to utilize CUDA Graphs for the computation of Fast Fourier Transform (FFT) using CUDA's cuFFT APIs.
I modified the sample FFT code present on Github into the following FFT code using CUDA Graphs:
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "device_functions.h"
#include <iostream>
#include <cufft.h>
// Complex data type
typedef float2 Complex;
static __device__ inline Complex ComplexScale(Complex, float);
static __device__ inline Complex ComplexMul(Complex, Complex);
static __global__ void ComplexPointwiseMulAndScale(Complex*, const Complex*, int, float);
#define CUDA_CALL( call ) \
{ \
cudaError_t result = call; \
if ( cudaSuccess != result ) \
std::cerr << "CUDA error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " << cudaGetErrorString( result ) << " (" << #call << ")" << std::endl; \
}
#define CUDA_FFT_CALL( call ) \
{ \
cufftResult result = call; \
if ( CUFFT_SUCCESS != result ) \
std::cerr << "FFT error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " << result << std::endl; \
}
// The filter size is assumed to be a number smaller than the signal size
#define SIGNAL_SIZE 10
#define FILTER_KERNEL_SIZE 4
static __device__ inline Complex ComplexScale(Complex a, float s)
{
Complex c;
c.x = s * a.x;
c.y = s * a.y;
return c;
}
// Complex multiplication
static __device__ inline Complex ComplexMul(Complex a, Complex b)
{
Complex c;
c.x = a.x * b.x - a.y * b.y;
c.y = a.x * b.y + a.y * b.x;
return c;
}
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex* a, const Complex* b, int size, float scale)
{
const int numThreads = blockDim.x * gridDim.x;
const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = threadID; i < size; i += numThreads)
{
a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
}
}
int main()
{
printf("[simpleCUFFT] is starting...\n");
int minRadius = FILTER_KERNEL_SIZE / 2;
int maxRadius = FILTER_KERNEL_SIZE - minRadius;
int padded_data_size = SIGNAL_SIZE + maxRadius;
// Allocate HOST Memories
Complex* h_signal = (Complex*)malloc(sizeof(Complex) * SIGNAL_SIZE); //host signal
Complex* h_filter_kernel = (Complex*)malloc(sizeof(Complex) * FILTER_KERNEL_SIZE); //host filter
Complex* h_padded_signal= (Complex*)malloc(sizeof(Complex) * padded_data_size); // host Padded signal
Complex* h_padded_filter_kernel = (Complex*)malloc(sizeof(Complex) * padded_data_size); // host Padded filter kernel
Complex* h_convolved_signal = (Complex*)malloc(sizeof(Complex) * padded_data_size); // to store convolution RESULTS
memset(h_convolved_signal, 0, padded_data_size * sizeof(Complex));
//Allocate DEVICE Memories
Complex* d_signal; //device signal
cudaMalloc((void**)&d_signal, sizeof(Complex) * padded_data_size);
Complex* d_filter_kernel;
cudaMalloc((void**)&d_filter_kernel, sizeof(Complex) * padded_data_size); //device kernel
//CUDA GRAPH
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStream_t stream;
cudaStreamCreate(&stream);
// CUFFT plan
cufftHandle plan;
CUDA_FFT_CALL(cufftPlan1d(&plan, padded_data_size, CUFFT_C2C, 1));
cufftSetStream(plan, stream); // bind plan to the stream
// Initalize the memory for the signal
for (unsigned int i = 0; i < SIGNAL_SIZE; ++i)
{
h_signal[i].x = rand() / (float)RAND_MAX;
h_signal[i].y = 0;
}
// Initalize the memory for the filter
for (unsigned int i = 0; i < FILTER_KERNEL_SIZE; ++i)
{
h_filter_kernel[i].x = rand() / (float)RAND_MAX;
h_filter_kernel[i].y = 0;
}
//REPEAT 3 times
int nRepeatationsNeeded = 3;
for (int repeatations = 0; repeatations < nRepeatationsNeeded; repeatations++)
{
std::cout << "\n\n" << "Repeatation ------ " << repeatations << std::endl;
if (!graphCreated)
{
//Start Graph Recording --------------!!!!!!!!
CUDA_CALL(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
//Pad Data
CUDA_CALL(cudaMemcpyAsync(h_padded_signal + 0, h_signal, SIGNAL_SIZE * sizeof(Complex), cudaMemcpyHostToHost, stream));
memset(h_padded_signal + SIGNAL_SIZE, 0, (padded_data_size - SIGNAL_SIZE) * sizeof(Complex));
//CUDA_CALL(cudaMemsetAsync(h_padded_signal + SIGNAL_SIZE, 0, (padded_data_size - SIGNAL_SIZE) * sizeof(Complex), stream));
CUDA_CALL(cudaMemcpyAsync(h_padded_filter_kernel + 0, h_filter_kernel + minRadius, maxRadius * sizeof(Complex), cudaMemcpyHostToHost, stream));
/*CUDA_CALL(cudaMemsetAsync(h_padded_filter_kernel + maxRadius, 0, (padded_data_size - FILTER_KERNEL_SIZE) * sizeof(Complex), stream));*/
memset(h_padded_filter_kernel + maxRadius, 0, (padded_data_size - FILTER_KERNEL_SIZE) * sizeof(Complex));
CUDA_CALL(cudaMemcpyAsync(h_padded_filter_kernel + padded_data_size - minRadius, h_filter_kernel, minRadius * sizeof(Complex), cudaMemcpyHostToHost, stream));
// MemCpy H to D
CUDA_CALL(cudaMemcpyAsync(d_signal, h_padded_signal, sizeof(Complex) * padded_data_size, cudaMemcpyHostToDevice, stream)); //Signal
CUDA_CALL(cudaMemcpyAsync(d_filter_kernel, h_padded_filter_kernel, sizeof(Complex) * padded_data_size, cudaMemcpyHostToDevice, stream)); //Kernel
//COMPUTE FFT
CUDA_FFT_CALL(cufftExecC2C(plan, (cufftComplex*)d_signal, (cufftComplex*)d_signal, CUFFT_FORWARD)); // Transform signal
CUDA_FFT_CALL(cufftExecC2C(plan, (cufftComplex*)d_filter_kernel, (cufftComplex*)d_filter_kernel, CUFFT_FORWARD)); // Transform kernel
ComplexPointwiseMulAndScale << <64, 1, 0, stream >> > (d_signal, d_filter_kernel, padded_data_size, 1.0f / padded_data_size); // Multiply and normalize
CUDA_CALL(cudaGetLastError());
CUDA_FFT_CALL(cufftExecC2C(plan, (cufftComplex*)d_signal, (cufftComplex*)d_signal, CUFFT_INVERSE)); // Transform signal back
// Copy device memory to host
CUDA_CALL(cudaMemcpyAsync(h_convolved_signal, d_signal, sizeof(Complex) * padded_data_size, cudaMemcpyDeviceToHost, stream));
//END Graph Recording
CUDA_CALL(cudaStreamEndCapture(stream, &graph));
CUDA_CALL(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
graphCreated = true;
}
else
{
CUDA_CALL(cudaGraphLaunch(instance, stream));
CUDA_CALL(cudaStreamSynchronize(stream));
}
//verify results
for (int i = 0; i < SIGNAL_SIZE; i++)
std::cout << "index: " << i << ", fft: " << h_convolved_signal[i].x << std::endl;
}
//Destroy CUFFT context
cufftDestroy(plan);
// cleanup memory
cudaStreamDestroy(stream);
free(h_signal);
free(h_filter_kernel);
free(h_padded_signal);
free(h_padded_filter_kernel);
cudaFree(d_signal);
cudaFree(d_filter_kernel);
return 0;
}
PROBLEM: The Output of the above program is below, in which it can be seen that the values of the result are also ZEROS for the first iteration. How can I resolve this?
The results are zero for the first iteration, because for the first iteration, the work is all issued in capture mode.
In capture mode, no CUDA work actually gets done. From here:
When a stream is being captured, work launched into the stream is not enqueued for execution.
I pointed you to this same area of the documentation in a comment to your last question. You might wish to read the entire programming guide section on graphs, and there are also blogs available.

False dependency issue for the Fermi architecture

I am trying to achieve "3-way overlapping" using 3 streams as in the examples in CUDA streams and concurrency webinar. But I couldn't achieve it.
I have Geforce GT 550M (Fermi Architecture with one copy engine) and I am using Windows 7 (64 bit).
Here is the code that I have written.
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// includes, project
#include "helper_cuda.h"
#include "helper_functions.h" // helper utility functions
#include <stdio.h>
using namespace std;
#define DATA_SIZE 6000000
#define NUM_THREADS 32
#define NUM_BLOCKS 16
#define NUM_STREAMS 3
__global__ void kernel(const int *in, int *out, int dataSize)
{
int start = blockIdx.x * blockDim.x + threadIdx.x;
int end = dataSize;
for (int i = start; i < end; i += blockDim.x * gridDim.x)
{
out[i] = in[i] * in[i];
}
}
int main()
{
const int dataSize = DATA_SIZE;
int *h_in = new int[dataSize];
int *h_out = new int[dataSize];
int *h_groundTruth = new int[dataSize];
// Input population
for(int i = 0; i < dataSize; i++)
h_in[i] = 5;
for(int i = 0; i < dataSize; i++)
h_out[i] = 0;
// CPU calculation for ground truth
for(int i = 0; i < dataSize; i++)
h_groundTruth[i] = h_in[i] * h_in[i];
// Choose which GPU to run on, change this on a multi-GPU system.
checkCudaErrors( cudaSetDevice(0) );
int *d_in = 0;
int *d_out = 0;
int streamSize = dataSize / NUM_STREAMS;
size_t memSize = dataSize * sizeof(int);
size_t streamMemSize = memSize / NUM_STREAMS;
checkCudaErrors( cudaMalloc( (void **)&d_in, memSize) );
checkCudaErrors( cudaMalloc( (void **)&d_out, memSize) );
// registers host memory as page-locked (required for asynch cudaMemcpyAsync)
checkCudaErrors(cudaHostRegister(h_in, memSize, cudaHostRegisterPortable));
checkCudaErrors(cudaHostRegister(h_out, memSize, cudaHostRegisterPortable));
// set kernel launch config
dim3 nThreads = dim3(NUM_THREADS,1,1);
dim3 nBlocks = dim3(NUM_BLOCKS,1,1);
cout << "GPU Kernel Configuration : " << endl;
cout << "Number of Streams :\t" << NUM_STREAMS << " with size: \t" << streamSize << endl;
cout << "Number of Threads :\t" << nThreads.x << "\t" << nThreads.y << "\t" << nThreads.z << endl;
cout << "Number of Blocks :\t" << nBlocks.x << "\t" << nBlocks.y << "\t" << nBlocks.z << endl;
// create cuda stream
cudaStream_t streams[NUM_STREAMS];
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamCreate(&streams[i]));
// create cuda event handles
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
cudaEventRecord(start, 0);
// overlapped execution using version 2
for(int i = 0; i < NUM_STREAMS; i++)
{
int offset = i * streamSize;
cudaMemcpyAsync(&d_in[offset], &h_in[offset], streamMemSize, cudaMemcpyHostToDevice, streams[i]);
}
//cudaMemcpy(d_in, h_in, memSize, cudaMemcpyHostToDevice);
for(int i = 0; i < NUM_STREAMS; i++)
{
int offset = i * streamSize;
dim3 subKernelBlock = dim3((int)ceil((float)nBlocks.x / 2));
//kernel<<<nBlocks, nThreads, 0, streams[i]>>>(&d_in[offset], &d_out[offset], streamSize);
kernel<<<subKernelBlock, nThreads, 0, streams[i]>>>(&d_in[offset], &d_out[offset], streamSize/2);
kernel<<<subKernelBlock, nThreads, 0, streams[i]>>>(&d_in[offset + streamSize/2], &d_out[offset + streamSize/2], streamSize/2);
}
for(int i = 0; i < NUM_STREAMS; i++)
{
int offset = i * streamSize;
cudaMemcpyAsync(&h_out[offset], &d_out[offset], streamMemSize, cudaMemcpyDeviceToHost, streams[i]);
}
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamSynchronize(streams[i]));
cudaEventRecord(stop, 0);
checkCudaErrors(cudaStreamSynchronize(0));
checkCudaErrors(cudaDeviceSynchronize());
float gpu_time = 0;
checkCudaErrors(cudaEventElapsedTime(&gpu_time, start, stop));
// release resources
checkCudaErrors(cudaEventDestroy(start));
checkCudaErrors(cudaEventDestroy(stop));
checkCudaErrors(cudaHostUnregister(h_in));
checkCudaErrors(cudaHostUnregister(h_out));
checkCudaErrors(cudaFree(d_in));
checkCudaErrors(cudaFree(d_out));
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamDestroy(streams[i]));
cudaDeviceReset();
cout << "Execution Time of GPU: " << gpu_time << "ms" << endl;
// GPU output check
int sum = 0;
for(int i = 0; i < dataSize; i++)
sum += h_groundTruth[i] - h_out[i];
cout << "Error between CPU and GPU: " << sum << endl;
delete[] h_in;
delete[] h_out;
delete[] h_groundTruth;
return 0;
}
Using Nsight for profiling, I have this result:
It may seem correct, but why does the D2H transfer in stream #1 only start when the last kernel launch of stream #2 and not before?
I tried also to use 8 streams (just by changing NUM_STREAM to 8) to achieve such a "3-way overlap" and here is the result:
The interesting thing is that when I use 8 streams, the overlappings between computation and memory transfers seem to be much better.
What is the reason for this problem? Is it due to WDDM driver or is there something wrong with my program?
From the comments above, it seems that the OP's problem is a false dependency issue, suffered by the Fermi architecture and solved by the Hyper-Q feature of the Kepler architecture.
To summarize, the OP is highlighting the fact that the first D2H transfer (stream #1) does not start immediately after the last H2D (stream #3) finishes, while in principle it could. The time gap is highlighted by the red circle in the following figure (henceforth, but for the differently specified, all the tests refer to a GeForce GT540M belonging to the Fermi family):
The OP's approach is a breadth-first approach, which operates according to the following scheme:
for(int i = 0; i < NUM_STREAMS; i++)
cudaMemcpyAsync(..., cudaMemcpyHostToDevice, streams[i]);
for(int i = 0; i < NUM_STREAMS; i++)
{
kernel_launch_1<<<..., 0, streams[i]>>>(...);
kernel_launch_2<<<..., 0, streams[i]>>>(...);
}
for(int i = 0; i < NUM_STREAMS; i++)
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, streams[i]);
Using a depth-first approach, operating according to the following scheme
for(int i = 0; i < NUM_STREAMS; i++)
{
cudaMemcpyAsync(...., cudaMemcpyHostToDevice, streams[i]);
kernel_launch_1<<<...., 0, streams[i]>>>(....);
kernel_launch_2<<<...., 0, streams[i]>>>(....);
cudaMemcpyAsync(...., cudaMemcpyDeviceToHost, streams[i]);
}
does not seem to improve the situation, according to the following timeline (the depth-first code is reported at the bottom of the answer), but it seems to show a worse overlapping:
Under the breadth-first approach, and commenting the second kernel launch, the first D2H copy starts immediately as it can, as reported by the following timeline:
Finally, running the code on a Kepler K20c, the problem does not show up, as illustrated by the following figure:
Here is the code for the depth-first approach:
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// includes, project
#include "helper_cuda.h"
#include "helper_functions.h" // helper utility functions
#include <stdio.h>
using namespace std;
#define DATA_SIZE 6000000
#define NUM_THREADS 32
#define NUM_BLOCKS 16
#define NUM_STREAMS 3
__global__ void kernel(const int *in, int *out, int dataSize)
{
int start = blockIdx.x * blockDim.x + threadIdx.x;
int end = dataSize;
for (int i = start; i < end; i += blockDim.x * gridDim.x)
{
out[i] = in[i] * in[i];
}
}
int main()
{
const int dataSize = DATA_SIZE;
int *h_in = new int[dataSize];
int *h_out = new int[dataSize];
int *h_groundTruth = new int[dataSize];
// Input population
for(int i = 0; i < dataSize; i++)
h_in[i] = 5;
for(int i = 0; i < dataSize; i++)
h_out[i] = 0;
// CPU calculation for ground truth
for(int i = 0; i < dataSize; i++)
h_groundTruth[i] = h_in[i] * h_in[i];
// Choose which GPU to run on, change this on a multi-GPU system.
checkCudaErrors( cudaSetDevice(0) );
int *d_in = 0;
int *d_out = 0;
int streamSize = dataSize / NUM_STREAMS;
size_t memSize = dataSize * sizeof(int);
size_t streamMemSize = memSize / NUM_STREAMS;
checkCudaErrors( cudaMalloc( (void **)&d_in, memSize) );
checkCudaErrors( cudaMalloc( (void **)&d_out, memSize) );
// registers host memory as page-locked (required for asynch cudaMemcpyAsync)
checkCudaErrors(cudaHostRegister(h_in, memSize, cudaHostRegisterPortable));
checkCudaErrors(cudaHostRegister(h_out, memSize, cudaHostRegisterPortable));
// set kernel launch config
dim3 nThreads = dim3(NUM_THREADS,1,1);
dim3 nBlocks = dim3(NUM_BLOCKS,1,1);
cout << "GPU Kernel Configuration : " << endl;
cout << "Number of Streams :\t" << NUM_STREAMS << " with size: \t" << streamSize << endl;
cout << "Number of Threads :\t" << nThreads.x << "\t" << nThreads.y << "\t" << nThreads.z << endl;
cout << "Number of Blocks :\t" << nBlocks.x << "\t" << nBlocks.y << "\t" << nBlocks.z << endl;
// create cuda stream
cudaStream_t streams[NUM_STREAMS];
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamCreate(&streams[i]));
// create cuda event handles
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
cudaEventRecord(start, 0);
for(int i = 0; i < NUM_STREAMS; i++)
{
int offset = i * streamSize;
cudaMemcpyAsync(&d_in[offset], &h_in[offset], streamMemSize, cudaMemcpyHostToDevice, streams[i]);
dim3 subKernelBlock = dim3((int)ceil((float)nBlocks.x / 2));
kernel<<<subKernelBlock, nThreads, 0, streams[i]>>>(&d_in[offset], &d_out[offset], streamSize/2);
kernel<<<subKernelBlock, nThreads, 0, streams[i]>>>(&d_in[offset + streamSize/2], &d_out[offset + streamSize/2], streamSize/2);
cudaMemcpyAsync(&h_out[offset], &d_out[offset], streamMemSize, cudaMemcpyDeviceToHost, streams[i]);
}
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamSynchronize(streams[i]));
cudaEventRecord(stop, 0);
checkCudaErrors(cudaStreamSynchronize(0));
checkCudaErrors(cudaDeviceSynchronize());
float gpu_time = 0;
checkCudaErrors(cudaEventElapsedTime(&gpu_time, start, stop));
// release resources
checkCudaErrors(cudaEventDestroy(start));
checkCudaErrors(cudaEventDestroy(stop));
checkCudaErrors(cudaHostUnregister(h_in));
checkCudaErrors(cudaHostUnregister(h_out));
checkCudaErrors(cudaFree(d_in));
checkCudaErrors(cudaFree(d_out));
for(int i = 0; i < NUM_STREAMS; i++)
checkCudaErrors(cudaStreamDestroy(streams[i]));
cudaDeviceReset();
cout << "Execution Time of GPU: " << gpu_time << "ms" << endl;
// GPU output check
int sum = 0;
for(int i = 0; i < dataSize; i++)
sum += h_groundTruth[i] - h_out[i];
cout << "Error between CPU and GPU: " << sum << endl;
delete[] h_in;
delete[] h_out;
delete[] h_groundTruth;
return 0;
}

Dense-to-sparse and sparse-to-dense conversions using cuSPARSE

The following program tests the dense to sparse conversion using cuSPARSE. It produces garbage in the first several lines of output. But if I move the lines marked with (2) to the place after the lines marked with (1), the program works fine. Can someone tell me what could be the reason?
EDIT:
To make the presentation clearer, I rewrote the program with thrust, the same issue persists.
EDIT:
As suggested by Robert, I changed it back to the version without thrust and added api level error check code.
#include <iostream>
#include <cusparse_v2.h>
using std::cerr;
using std::cout;
using std::endl;
#define WRAP(x) do {x} while (0)
#define CHKcusparse(x) WRAP( \
cusparseStatus_t err = (x); \
if (err != CUSPARSE_STATUS_SUCCESS) { \
cerr << "Cusparse Error #" << int(err) << "\"TODO\" at Line " \
<< __LINE__ << " of " << __FILE__ << ": " << #x << endl; \
exit(1); \
} \
)
#define CHKcuda(x) WRAP( \
cudaError_t err = (x); \
if (err != cudaSuccess) { \
cerr << "Cuda Error #" << int(err) << ", \"" \
<< cudaGetErrorString(err) << "\" at Line " << __LINE__ \
<< " of " << __FILE__ << ": " << #x << endl; \
exit(1); \
} \
)
#define ALLOC(X, T, N) do { \
h##X = (T*) malloc(sizeof(T) * (N)); \
CHKcuda(cudaMalloc((void**)&d##X, sizeof(T) * (N))); \
} while(0)
int main() {
srand(100);
cusparseHandle_t g_cusparse_handle;
CHKcusparse(cusparseCreate(&g_cusparse_handle));
const int n = 100, in_degree = 10;
int nnz = n * in_degree, nn = n * n;
int *dnnz, *dridx, *dcols;
int *hnnz, *hridx, *hcols;
float *dvals, *dmat;
float *hvals, *hmat;
// (1) The number of non-zeros in each column.
ALLOC(nnz, int, n);
// The dense matrix.
ALLOC(mat, float, nn);
// The values in sparse matrix.
ALLOC(vals, float, nnz);
// (2) The row indices of the sparse matrix.
ALLOC(ridx, int, nnz);
// The column offsets of the sparse matrix.
ALLOC(cols, int, n+1);
// Fill and copy dense matrix and number of non-zeros.
for (int i = 0; i < nn; i++) {hmat[i] = rand();}
for (int i = 0; i < n; i++) {hnnz[i] = in_degree;}
CHKcuda(cudaMemcpyAsync(dnnz, hnnz, sizeof(int) * n, cudaMemcpyHostToDevice));
CHKcuda(cudaMemcpyAsync(dmat, hmat, sizeof(float) * nn, cudaMemcpyHostToDevice));
CHKcuda(cudaDeviceSynchronize());
// Perform dense to CSC format
cusparseMatDescr_t cspMatDesc;
CHKcusparse(cusparseCreateMatDescr(&cspMatDesc));
CHKcusparse(cusparseSdense2csc(
g_cusparse_handle, n, n, cspMatDesc, dmat, n,
dnnz, dvals, dridx, dcols
));
// Copy row indices back.
CHKcuda(cudaMemcpyAsync(hridx, dridx, sizeof(int) * nnz, cudaMemcpyDeviceToHost));
CHKcuda(cudaDeviceSynchronize());
CHKcusparse(cusparseDestroyMatDescr(cspMatDesc));
// Display row indices.
for (int i = 0; i < n; i++) {
for (int j = 0; j < in_degree; j++) {
std::cout << hridx[i * in_degree + j] << ", ";
}
std::cout << std::endl;
}
CHKcuda(cudaFree(dnnz));
CHKcuda(cudaFree(dvals));
CHKcuda(cudaFree(dridx));
CHKcuda(cudaFree(dcols));
CHKcuda(cudaFree(dmat));
free(hnnz);
free(hmat);
free(hvals);
free(hridx);
free(hcols);
return 0;
}
The basic problem is that you are passing internally inconsistent data to the dense-to-sparse routine. You are passing a dense matrix which has 100 non-zero elements per column, but you are telling cusparse that there are only 10 non-zero elements per column.
If you run your code with cuda-memcheck, you will see that there are errors coming out of cusparse.
For this code, you can fix the issue by changing your in_degree variable to 100.
For the general case, cusparse provides a convenient routine to populate the number of non-zero elements per column correctly.
As already underlined by Robert Crovella, passing from dense to sparse can be effectively performed using cuSPARSE by the cusparse<t>nnz() and cusparse<t>dense2csr() routines. The vice versa can be done by the cusparse<t>csr2dense() routine. Below, there is a fully worked out example showing how passing from dense to sparse and vice versa using cuSPARSE in CSR format.
cuSparseUtilities.cuh
#ifndef CUSPARSEUTILITIES_CUH
#define CUSPARSEUTILITIES_CUH
#include "cusparse_v2.h"
void setUpDescriptor(cusparseMatDescr_t &, cusparseMatrixType_t, cusparseIndexBase_t);
void dense2SparseD(const double * __restrict__ d_A_dense, int **d_nnzPerVector, double **d_A,
int **d_A_RowIndices, int **d_A_ColIndices, int &nnz, cusparseMatDescr_t descrA,
const cusparseHandle_t handle, const int Nrows, const int Ncols);
#endif
cuSparseUtilities.cu
#include "cuSparseUtilities.cuh"
#include "Utilities.cuh"
/*****************************/
/* SETUP DESCRIPTOR FUNCTION */
/*****************************/
void setUpDescriptor(cusparseMatDescr_t &descrA, cusparseMatrixType_t matrixType, cusparseIndexBase_t indexBase) {
cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSafeCall(cusparseSetMatType(descrA, matrixType));
cusparseSafeCall(cusparseSetMatIndexBase(descrA, indexBase));
}
/********************************************************/
/* DENSE TO SPARSE CONVERSION FOR REAL DOUBLE PRECISION */
/********************************************************/
void dense2SparseD(const double * __restrict__ d_A_dense, int **d_nnzPerVector, double **d_A,
int **d_A_RowIndices, int **d_A_ColIndices, int &nnz, cusparseMatDescr_t descrA,
const cusparseHandle_t handle, const int Nrows, const int Ncols) {
const int lda = Nrows; // --- Leading dimension of dense matrix
gpuErrchk(cudaMalloc(&d_nnzPerVector[0], Nrows * sizeof(int)));
// --- Compute the number of nonzero elements per row and the total number of nonzero elements in the dense d_A_dense
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector[0], &nnz));
// --- Device side sparse matrix
gpuErrchk(cudaMalloc(&d_A[0], nnz * sizeof(double)));
gpuErrchk(cudaMalloc(&d_A_RowIndices[0], (Nrows + 1) * sizeof(int)));
gpuErrchk(cudaMalloc(&d_A_ColIndices[0], nnz * sizeof(int)));
cusparseSafeCall(cusparseDdense2csr(handle, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector[0], d_A[0], d_A_RowIndices[0], d_A_ColIndices[0]));
}
kernel.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <cusparse_v2.h>
#include "cuSparseUtilities.cuh"
#include "Utilities.cuh"
/********/
/* MAIN */
/********/
int main() {
cusparseHandle_t handle;
// --- Initialize cuSPARSE
cusparseSafeCall(cusparseCreate(&handle));
cusparseMatDescr_t descrA = 0;
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int Nrows = 5; // --- Number of rows
const int Ncols = 4; // --- Number of columns
const int N = Nrows;
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(Nrows * Ncols * sizeof(*h_A_dense));
// --- Column-major storage
h_A_dense[ 0] = 0.4612f; h_A_dense[ 5] = -0.0006f; h_A_dense[10] = 1.3f; h_A_dense[15] = 0.0f;
h_A_dense[ 1] = 0.0f; h_A_dense[ 6] = 1.443f; h_A_dense[11] = 0.0f; h_A_dense[16] = 0.0f;
h_A_dense[ 2] = -0.0006f; h_A_dense[ 7] = 0.4640f; h_A_dense[12] = 0.0723f; h_A_dense[17] = 0.0f;
h_A_dense[ 3] = 0.3566f; h_A_dense[ 8] = 0.0723f; h_A_dense[13] = 0.7543f; h_A_dense[18] = 0.0f;
h_A_dense[ 4] = 0.f; h_A_dense[ 9] = 0.0f; h_A_dense[14] = 0.0f; h_A_dense[19] = 0.1f;
// --- Create device array and copy host array to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, Nrows * Ncols * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, Nrows * Ncols * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
/*******************************/
/* FROM DENSE TO SPARSE MATRIX */
/*******************************/
// --- Descriptor for sparse matrix A
setUpDescriptor(descrA, CUSPARSE_MATRIX_TYPE_GENERAL, CUSPARSE_INDEX_BASE_ONE);
int nnz = 0; // --- Number of nonzero elements in dense matrix
int *d_nnzPerVector; // --- Device side number of nonzero elements per row
double *d_A; // --- Sparse matrix values - array of size nnz
int *d_A_RowIndices; // --- "Row indices"
int *d_A_ColIndices; // --- "Column indices"
dense2SparseD(d_A_dense, &d_nnzPerVector, &d_A, &d_A_RowIndices, &d_A_ColIndices, nnz, descrA, handle, Nrows, Ncols);
/*******************************************************/
/* CHECKING THE RESULTS FOR DENSE TO SPARSE CONVERSION */
/*******************************************************/
// --- Host side number of nonzero elements per row
int *h_nnzPerVector = (int *)malloc(Nrows * sizeof(int));
gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, Nrows * sizeof(int), cudaMemcpyDeviceToHost));
printf("Number of nonzero elements in dense matrix = %i\n\n", nnz);
for (int i = 0; i < Nrows; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]);
printf("\n");
// --- Host side sparse matrix
double *h_A = (double *)malloc(nnz * sizeof(double));
int *h_A_RowIndices = (int *)malloc((Nrows + 1) * sizeof(int));
int *h_A_ColIndices = (int *)malloc(nnz * sizeof(int));
gpuErrchk(cudaMemcpy(h_A, d_A, nnz * sizeof(double), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices, (Nrows + 1) * sizeof(int), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices, nnz * sizeof(int), cudaMemcpyDeviceToHost));
printf("\nOriginal matrix in CSR format\n\n");
for (int i = 0; i < nnz; ++i) printf("A[%i] = %f\n", i, h_A[i]); printf("\n");
printf("\n");
for (int i = 0; i < (Nrows + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");
for (int i = 0; i < nnz; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);
/*******************************/
/* FROM SPARSE TO DENSE MATRIX */
/*******************************/
double *d_A_denseReconstructed; gpuErrchk(cudaMalloc(&d_A_denseReconstructed, Nrows * Ncols * sizeof(double)));
cusparseSafeCall(cusparseDcsr2dense(handle, Nrows, Ncols, descrA, d_A, d_A_RowIndices, d_A_ColIndices,
d_A_denseReconstructed, Nrows));
/*******************************************************/
/* CHECKING THE RESULTS FOR SPARSE TO DENSE CONVERSION */
/*******************************************************/
double *h_A_denseReconstructed = (double *)malloc(Nrows * Ncols * sizeof(double));
gpuErrchk(cudaMemcpy(h_A_denseReconstructed, d_A_denseReconstructed, Nrows * Ncols * sizeof(double), cudaMemcpyDeviceToHost));
printf("\nReconstructed dense matrix \n");
for (int m = 0; m < Nrows; m++) {
for (int n = 0; n < Ncols; n++)
printf("%f\t", h_A_denseReconstructed[n * Nrows + m]);
printf("\n");
}
return 0;
}

cuda addvectors memory intuitive explanation

I have the following code and
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <ctime>
#include <vector>
#include <numeric>
float random_float(void)
{
return static_cast<float>(rand()) / RAND_MAX;
}
std::vector<float> add(float alpha, std::vector<float>& v1, std::vector<float>& v2 )
{ /*Do quick size check on vectors before proceeding*/
std::vector<float> result(v1.size());
for (unsigned int i = 0; i < result.size(); ++i)
{
result[i]=alpha*v1[i]+v2[i];
}
return result;
}
__global__ void Addloop( int N, float alpha, float* x, float* y ) {
int i;
int i0 = blockIdx.x*blockDim.x + threadIdx.x;
for( i = i0; i < N; i += blockDim.x*gridDim.x )
y[i] = alpha*x[i] + y[i];
/*
if ( i0 < N )
y[i0] = alpha*x[i0] + y[i0];
*/
}
int main( int argc, char** argv ) {
float alpha = 0.3;
// create array of 256k elements
int num_elements = 10;//1<<18;
// generate random input on the host
std::vector<float> h1_input(num_elements);
std::vector<float> h2_input(num_elements);
for(int i = 0; i < num_elements; ++i)
{
h1_input[i] = random_float();
h2_input[i] = random_float();
}
for (std::vector<float>::iterator it = h1_input.begin() ; it != h1_input.end(); ++it)
std::cout << ' ' << *it;
std::cout << '\n';
for (std::vector<float>::iterator it = h2_input.begin() ; it != h2_input.end(); ++it)
std::cout << ' ' << *it;
std::cout << '\n';
std::vector<float> host_result;//(std::vector<float> h1_input, std::vector<float> h2_input );
host_result = add( alpha, h1_input, h2_input );
for (std::vector<float>::iterator it = host_result.begin() ; it != host_result.end(); ++it)
std::cout << ' ' << *it;
std::cout << '\n';
// move input to device memory
float *d1_input = 0;
cudaMalloc((void**)&d1_input, sizeof(float) * num_elements);
cudaMemcpy(d1_input, &h1_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
float *d2_input = 0;
cudaMalloc((void**)&d2_input, sizeof(float) * num_elements);
cudaMemcpy(d2_input, &h2_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
Addloop<<<1,3>>>( num_elements, alpha, d1_input, d2_input );
// copy the result back to the host
std::vector<float> device_result(num_elements);
cudaMemcpy(&device_result[0], d2_input, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
for (std::vector<float>::iterator it = device_result.begin() ; it != device_result.end(); ++it)
std::cout << ' ' << *it;
std::cout << '\n';
cudaFree(d1_input);
cudaFree(d2_input);
h1_input.clear();
h2_input.clear();
device_result.clear();
std::cout << "DONE! \n";
getchar();
return 0;
}
I am trying to understand the gpu memory access. The kernel, for reasons of simplicity, is launched as Addloop<<<1,3>>>. I am trying to understand how this code is working by imagining the for loops working on the gpu as instances. More specifically, I imagine the following instances but they do not help.
Instance 1:
for( i = 0; i < N; i += 3*1 ) // ( i += 0*1 --> i += 3*1 after Eric's comment)
y[i] = alpha*x[i] + y[i];
Instance 2:
for( i = 1; i < N; i += 3*1 )
y[i] = alpha*x[i] + y[i];
Instance 3:
for( i = 3; i < N; i += 3*1 )
y[i] = alpha*x[i] + y[i];
Looking inside of every loop it does not make any sense in the logic of adding two vectors. Can some one help?
The reason I am adopting this logic of instances is because it is working well in the case of the code inside the kernel which is in comments.
If these thoughts are correct what would be the instances in case we have multiple blocks inside the grid? In other words what would be the i values and the update rates (+=updaterate) in some examples?
PS: The kernel code borrowed from here.
UPDATE:
After Eric's answer I think the execution for N = 15, e.i the number of elements, goes like this (correct me if I am wrong):
For the instance 1 above i = 0 , 3, 6, 9, 12 which computes the corresponding y[i] values.
For the instance 2 above i = 1 , 4, 7, 10, 13 which computes the corresponding remaining y[i] values.
For the instance 3 above i = 2 , 5, 8, 11, 14 which computes the rest y[i] values.
Your blockDim.x is 3 and gridDim.x is 1 according to your setup <<<1,3>>>. So in each thread (you call it instance), it should be i+=3*1
update
With the for loop you can compute 15 element using only 3 threads. Generally you can use limited number of threads to do "infinit" work. And more work per threads can improve the performance by reducing the launch overhead and hiding the instruction stalls.
Another advantage is you could use fixed number of threads/blocks to do work of various sizes, thus requires less tuning.

Cuda Reduction in 2d Array

I want to calculate the average of the values over the whole image in Cuda. To test how reduction in 2D array work, I write this kernel below. The final output o should be the sum of all the image values. The input g is a 2D array with value 1 in every pixel. But the result of this program is 0 as the sum. A bit weird to me.
I imitate the reduction in 1D array in this tutorial http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf I write this 2D form. I am new to Cuda. And suggestions to potential bugs and improvement are welcomed!
Just add one comment. I know it makes sense just to calculate the average in 1D array. But I want to exploit more and test more complicated reduction behaviours. It might not be right. But just a test. Hope anyone can give me suggestions more about reduction common practices.
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
cudaEvent_t start, stop;
float elapsedTime;
__global__ void
reduce(float *g, float *o, const int dimx, const int dimy)
{
extern __shared__ float sdata[];
unsigned int tid_x = threadIdx.x;
unsigned int tid_y = threadIdx.y;
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= dimx || j >= dimy)
return;
sdata[tid_x*blockDim.y + tid_y] = g[i*dimy + j];
__syncthreads();
for(unsigned int s_y = blockDim.y/2; s_y > 0; s_y >>= 1)
{
if (tid_y < s_y)
{
sdata[tid_x * dimy + tid_y] += sdata[tid_x * dimy + tid_y + s_y];
}
__syncthreads();
}
for(unsigned int s_x = blockDim.x/2; s_x > 0; s_x >>= 1 )
{
if(tid_x < s_x)
{
sdata[tid_x * dimy] += sdata[(tid_x + s_x) * dimy];
}
__syncthreads();
}
float sum;
if( tid_x == 0 && tid_y == 0)
{
sum = sdata[0];
atomicAdd (o, sum); // The result should be the sum of all pixel values. But the program produce 0
}
//if(tid_x==0 && tid__y == 0 )
//o[blockIdx.x] = sdata[0];
}
int
main()
{
int dimx = 320;
int dimy = 160;
int num_bytes = dimx*dimy*sizeof(float);
float *d_a, *h_a, // device and host pointers
*d_o=0, *h_o=0;
h_a = (float*)malloc(num_bytes);
h_o = (float*)malloc(sizeof(float));
srand(time(NULL));
for (int i=0; i < dimx; i++)
{
for (int j=0; j < dimy; j++)
{
h_a[i*dimy + j] = 1;
}
}
cudaMalloc( (void**)&d_a, num_bytes );
cudaMalloc( (void**)&d_o, sizeof(int) );
cudaMemcpy( d_a, h_a, num_bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_o, h_o, sizeof(int), cudaMemcpyHostToDevice);
dim3 grid, block;
block.x = 4;
block.y = 4;
grid.x = dimx / block.x;
grid.y = dimy / block.y;
cudaEventCreate(&start);
cudaEventRecord(start, 0);
int sizeofSharedMemory = dimx*dimy*sizeof(float);
reduce<<<grid, block, sizeofSharedMemory>>> (d_a, d_o, block.x, block.y);
cudaEventCreate(&stop);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "This kernel runs: " << elapsedTime << "ms" << std::endl;
std::cout << block.x << " " << block.y << std::endl;
std::cout << grid.x << " " << grid.y << std::endl;
std::cout << dimx << " " << dimy << " " << dimx*dimy << std::endl;
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost );
std::cout << "The sum is:" << *h_o << std::endl;
free(h_a);
free(h_o);
cudaFree(d_a);
cudaFree(d_o);
}
If you do basic cuda error checking you will discover that your reduce kernel is not even running. The reason is as follows:
int dimx = 320;
int dimy = 160;
...
int sizeofSharedMemory = dimx*dimy*sizeof(float); // = 204800
reduce<<<grid, block, sizeofSharedMemory>>> (d_a, d_o, block.x, block.y);
^
|
204800 is illegal here
You cannot request 204800 bytes of shared memory dynamically (or any other way). The maximum is slightly less than 48K bytes.
If you had done proper cuda error checking, you would discover your kernel is not running and would have gotten an instructive error message which suggests the launch configuration (the numbers between the <<< ... >>> ) is invalid. Shared memory is requested on a per-block basis, and it's probably not sensible that you need to request enough shared memory to cover your entire 2D data set, when each block only consists of a 4x4 thread array. You probably just need enough data for what will be accessed by each 4x4 thread array.
After you have properly instrumented your code with cuda error checking, and detected and corrected all the errors, then run your code with cuda-memcheck. This will do an additional level of error checking to point out any kernel access errors. You may also use cuda-memcheck if you are getting an unspecified launch failure, and it may help pinpoint the issue.
After you have done these basic trouble shooting steps, then it might make sense to ask others for help. But use the power of the tools you have been given first.
I also want to point out one other error before you come back and post this code again, asking for help.
This will not be useful:
std::cout << "The sum is:" << *h_o << std::endl;
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost );
You are printing out the sum before you have copied the sum from the device to the host.
Reverse the order of these steps:
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost );
std::cout << "The sum is:" << *h_o << std::endl;