Is DeviceToDevice copy in a CUDA program needed? - cuda

I am doing the following two operations:
Addition of two array => a + b = AddResult
Multiply of two arrays => AddResult * a = MultiplyResult
In the above logic, the AddResult is an intermediate result and is used in the next mupltiplication operation as input.
#define N 4096 // size of array
__global__ void add(const int* a, const int* b, int* c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N)
{
c[tid] = a[tid] + b[tid];
}
}
__global__ void multiply(const int* a, const int* b, int* c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N)
{
c[tid] = a[tid] * b[tid];
}
}
int main()
{
int T = 1024, B = 4; // threads per block and blocks per grid
int a[N], b[N], c[N], d[N], e[N];
int* dev_a, * dev_b, * dev_AddResult, * dev_Temp, * dev_MultiplyResult;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_AddResult, N * sizeof(int));
cudaMalloc((void**)&dev_Temp, N * sizeof(int));
cudaMalloc((void**)&dev_MultiplyResult, N * sizeof(int));
for (int i = 0; i < N; i++)
{
// load arrays with some numbers
a[i] = i;
b[i] = i * 1;
}
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_AddResult, c, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_Temp, d, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_MultiplyResult, e, N * sizeof(int), cudaMemcpyHostToDevice);
//ADD
add << <B, T >> > (dev_a, dev_b, dev_AddResult);
cudaDeviceSynchronize();
//Multiply
cudaMemcpy(dev_Temp, dev_AddResult, N * sizeof(int), cudaMemcpyDeviceToDevice); //<---------DO I REALLY NEED THIS?
multiply << <B, T >> > (dev_a, dev_Temp, dev_MultiplyResult);
//multiply << <B, T >> > (dev_a, dev_AddResult, dev_MultiplyResult);
//Copy Final Results D to H
cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++)
{
printf("(%d+%d)*%d=%d\n", a[i], b[i], a[i], e[i]);
}
// clean up
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_AddResult);
cudaFree(dev_Temp);
cudaFree(dev_MultiplyResult);
return 0;
}
In the above sample code, I am transferring the Addition results (i.e. dev_AddResult) to another device array (i.e. dev_Temp) to perform the multiplication operation.
QUESTION: Since the Addition results array (i.e. dev_AddResult) is already on the GPU device, do I really need to transfer it to another array? I have already tried to execute the next kernel by directly providing dev_AddResult as input and it produced the same results. Is there any risk involved in directly passing the output of one kernel as the input of the next kernel? Any best practices to follow?

Yes, for the case you have shown, you can use the "output" of one kernel as the "input" to the next, without any copying. You've already done that and confirmed it works, so I will dispense with any example. The changes are trivial anyway - eliminate the intervening cudaMemcpy operation, and use the same dev_AddResult pointer in place of the dev_Temp pointer on your multiply kernel invocation.
Regarding "risks" I'm not aware of any for the example you have given. Moving away from that example to possibly more general usage, you would want to make sure that the add output calculations are finished before being used somewhere else.
Your example already does this, redundantly, using at least 2 mechanisms:
intervening cudaDeviceSynchronize() - this forces the previously issued work to complete
stream semantics - one rule of stream semantics is that work issued into a particular stream will execute in issue order. Item B issued into stream X, will not begin until the previously issued item A into stream X has completed.
So you don't really need the cudaDeviceSynchronize() in this case. It isn't "hurting" anything from a functionality perspective, but it is probably adding a few microseconds to the overall execution time.
More generally, if you had issued your add and multiply kernel into separate streams, then CUDA provides no guarantees of execution order, even though you "issued" the multiply kernel after the add kernel.
In that case (not the one you have here) if you needed the multiply operation to use the previously computed add results, you would need to enforce that somehow (enforce the completion of the add kernel before the multiply kernel). You have already shown one method to do that here, using a synchronize call.

Related

What is wrong with my understanding of "__shared__" variables in cuda?

After reading the manual of NVIDIA, I wrotea parrell reduction code as follows:
__global__ void kernel(int *devData)
{
__shared__ int sum;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (threadIdx.x == 0)
sum = 0;
__syncthreads();
sum += devData[i];
__syncthreads();
if (threadIdx.x == 0)
printf("sum of block %d is %d\n", blockIdx.x, sum);
}
int main(void)
{
// init device
int devIdx = 0;
cudaError_t err = cudaSuccess;
gpuDeviceInit(devIdx);
int i;
int data[100];
int *devData;
for (i = 0; i < 100; i++)
data[i] = 1;
err = cudaMalloc(&devData, 100 * sizeof(int));
checkCudaErrors(err);
// copy data to device
err = cudaMemcpy(devData, data, 100 * sizeof(int), cudaMemcpyHostToDevice);
checkCudaErrors(err);
int blocksPerGrid = 10;
int threadsPerBlock = 10;
// call kernel function
kernel <<<blocksPerGrid, threadsPerBlock>>> (devData);
checkCudaErrors(cudaGetLastError());
cudaDeviceReset();
return 0;
}
I'm trying to sum integers for each block and then print this sum.
But I found the result was as follows:
sum of block 0 is 1
sum of block 6 is 1
sum of block 2 is 1
sum of block 8 is 1
sum of block 1 is 1
sum of block 7 is 1
sum of block 4 is 1
sum of block 3 is 1
sum of block 9 is 1
sum of block 5 is 1
The result I expected was 10.Is the __shared__ variable "sum" shared by every thread in a block? What's wrong with my understanding of "__shared__" variables in cuda?
you have multiple threads trying to access (read-modify-write) sum at the same time, here:
sum += devData[i];
This doesn't work for either global or shared data in CUDA (i.e. CUDA won't sort that out for you, automatically). To sort this out, the usual approaches are either to use atomics or else to use a canonical parallel reduction
There are numerous questions on both of these topics here on the cuda SO tag, and you can get some focused training on parallel reduction methods in unit 5 of this online training series.
For example, in your code, a trivial change to "fix" would be to replace the above line of code with an atomic add:
atomicAdd(&sum,devData[i]);
atomics force serialization, so a preferred approach is a canonical parallel reduction.

Cuda: Copy 1D Array From CPU to GPU

I am newbie to Cuda, trying to copy array from Host to Device via cudaMemcpy(...)
However,the data passed to GPU seems to be totally different (for cost: totally wrong, for G: wrong after index of 5)
My data is a malloc array (written in C) of size 25 for example,
I tried to copy through the following way (MAX = 5):
Declaration:
int *cost, int* G
int *dev_cost, *dev_G;
Allocation:
cost = (int*)malloc(MAX* MAX * sizeof(int));
G = (int*)malloc(MAX* MAX* sizeof(int));
cudaMalloc((void**)&dev_cost, MAX*MAX);
cudaMalloc((void**)&dev_G, MAX*MAX);
Data transfer:
cudaMemcpy(dev_cost, cost, MAX*MAX, cudaMemcpyHostToDevice);
cudaMemcpy(dev_G, G, MAX*MAX, cudaMemcpyHostToDevice);
Kernel Trigger:
assignCost<<<1,MAX*MAX>>>(dev_G,dev_cost);
Kernel Function:
__global__ void assignCost(int *G, int *cost)
{
int tid = threadIdx.x + blockDim.x*blockIdx.x;
printf("cost[%d]: %d G[%d] = %d\n", tid, cost[tid], tid, G[tid]);
if(tid<MAX*MAX)
{
if (G[tid] == 0)
cost[tid] = INT_MAX;
else
cost[tid] = G[tid];
}
}
Is there anything wrong with my approach? If then, how should i modify?
cudaMemcpy(dev_cost, cost, MAX*MAX*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_G, G, MAX*MAX*sizeof(int), cudaMemcpyHostToDevice);
The 3rd argument for cudaMemcpy is the count in bytes. As you have MAX*MAX integers and each integer has a size sizeof(int) bytes, replace MAX*MAX as MAX*MAX*sizeof(int)

What is the general way to launch appropriate amount of reduction kernels?

As I have read from NVIDIA's instruction in this link http://www.cuvilib.com/Reduction.pdf, for arrays bigger than blockSize, I should launch multiple reduction kernels to achieve global synchronization. What is the general way to determine how many times I should launch the reduction kernel? I tried as below but I need to Malloc 2 additional pointers, which takes a lot of processing times.
My job is to Reduce the array d_logLuminance into one minimum value min_logLum
void your_histogram_and_prefixsum(const float* const d_logLuminance,
float &min_logLum,
const size_t numRows,
const size_t numCols)
{
const dim3 blockSize(512);
unsigned int pixel = numRows*numCols;
const dim3 gridSize(pixel/blockSize.x+1);
//Reduction kernels to find max and min value
float *d_tempMin, *d_min;
checkCudaErrors(cudaMalloc((void**) &d_tempMin, sizeof(float)*pixel));
checkCudaErrors(cudaMalloc((void**) &d_min, sizeof(float)*pixel));
checkCudaErrors(cudaMemcpy(d_min, d_logLuminance, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
dim3 subGrid = gridSize;
for(int reduceLevel = pixel; reduceLevel > 0; reduceLevel /= blockSize.x) {
checkCudaErrors(cudaMemcpy(d_tempMin, d_min, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
reduceMin<<<subGrid,blockSize,blockSize.x*sizeof(float)>>>(d_tempMin, d_min);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
subGrid.x = subGrid.x / blockSize.x + 1;
}
checkCudaErrors(cudaMemcpy(&min_logLum, d_min, sizeof(float), cudaMemcpyDeviceToHost));
std::cout<< "Min value = " << min_logLum << std::endl;
checkCudaErrors(cudaFree(d_tempMin));
checkCudaErrors(cudaFree(d_min));
}
And if you are curious, here is my reduction kernel:
__global__
void reduceMin(const float* const g_inputRange,
float* g_outputRange)
{
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
sdata[tid] = g_inputRange[i];
__syncthreads();
for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
if (tid < s){
sdata[tid] = min(sdata[tid],sdata[tid+s]);
}
__syncthreads();
}
if(tid == 0){
g_outputRange[blockIdx.x] = sdata[0];
}
}
There are many ways to skin the cat, but if you want to minimize kernel launches, it can always be done with at most two kernel launches.
The first kernel launch is composed of up to however many blocks correspond to the number of threads per block that your device supports. Newer devices will support 1024, older devices, 512.
Each of these (at most 512 or 1024) blocks in the first kernel will participate in a grid-looping sum of all the data elements in global memory.
Each of these blocks will then do a partial reduction and write a partial result to global memory. There will be 512 or 1024 of these partial results.
The second kernel launch will be composed of 512 or 1024 threads in a single block. Each thread will pick up one of the partial results from global memory, and then the threads in that single block will cooperatively reduce the partial results to a single final result, and write it back to global memory.
The "grid-looping sum" is described in reduction #7 here as "multiple add/thread". All of the reductions described in this document are available in the NVIDIA reduction sample code

Understanding Cublas: vector addition (asum) [duplicate]

This question already has answers here:
Finding maximum and minimum with CUBLAS
(2 answers)
Closed 7 years ago.
According to CUBLAS reference, asum function (for getting the sum of the elements of a vector) is:
cublasStatus_t cublasSasum(cublasHandle_t handle, int n, const float *x, int incx, float *result)
You can see in the link to the reference the parameters explanation, roughly we have a vector x of n elements with incx distance between elements.
My code is (quite simplified, but I also tested this one and there is still the error):
int arraySize = 10;
float* a = (float*) malloc (sizeof(float) * arraySize);
float* d_a;
cudaMalloc((void**) &d_a, sizeof(float) * arraySize);
for (int i=0; i<arraySize; i++)
a[i]=0.8f;
cudaMemcpy(d_a, a, sizeof(float) * arraySize, cudaMemcpyHostToDevice);
cublasStatus_t ret;
cublasHandle_t handle;
ret = cublasCreate(&handle);
float* cb_result = (float*) malloc (sizeof(float));
ret = cublasSasum(handle, arraySize, d_a, sizeof(float), cb_result);
printf("\n\nCUBLAS: %.3f", *cb_result);
cublasDestroy(handle);
I have removed error checking for simplifying the code (there are no errors, CUBLAS functions return CUDA_STATUS_SUCCESS) and free and cudaFree.
It compiles, it runs, it doesn't throw any error, but result printed is 0, and, debugging, it is actually 1.QNAN.
What did i miss?
One of the arguments to cublasSasum is incorrect. The call should look like this:
ret = cublasSasum(handle, arraySize, d_a, 1, cb_result);
Note that the second last argument, incx, should be in words, not bytes.

Simplest Possible Example to Show GPU Outperform CPU Using CUDA

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it take some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#define N (1024*1024)
#define M (1000000)
int main()
{
float data[N]; int count = 0;
for(int i = 0; i < N; i++)
{
data[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
{
data[i] = data[i] * data[i] - 0.25f;
}
}
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
Use something like this in CUDA/C:
#define N (1024*1024)
#define M (1000000)
__global__ void cudakernel(float *buf)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
buf[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
buf[i] = buf[i] * buf[i] - 0.25f;
}
int main()
{
float data[N]; int count = 0;
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
cudakernel<<<N/256, 256>>>(d_data);
cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. Ita easy to implement and lends itself to the the GPUs strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ awhile back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shot down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to #Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
For reference, I made a similar example with time measurements. With GTX 660, the GPU speedup was 24X where its operation includes data transfers in addition to actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (about around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.