multi-GPU basic usage - cuda

How can I use two devices in order to improve for example
the performance of the following code (sum of vectors)?
Is it possible to use more devices "at the same time"?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>
#define NB 32
#define NT 500
#define N NB*NT
__global__ void add( double *a, double *b, double *c);
//===========================================
__global__ void add( double *a, double *b, double *c){
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N){
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
//============================================
//BEGIN
//===========================================
int main( void ) {
double *a, *b, *c;
double *dev_a, *dev_b, *dev_c;
// allocate the memory on the CPU
a=(double *)malloc(N*sizeof(double));
b=(double *)malloc(N*sizeof(double));
c=(double *)malloc(N*sizeof(double));
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(double) );
cudaMalloc( (void**)&dev_b, N * sizeof(double) );
cudaMalloc( (void**)&dev_c, N * sizeof(double) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = (double)i;
b[i] = (double)i*2;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);
for(int i=0;i<10000;++i)
add<<<NB,NT>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);
// display the results
// for (int i=0; i<N; i++) {
// printf( "%g + %g = %g\n", a[i], b[i], c[i] );
// }
printf("\nGPU done\n");
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
// free the memory allocated on the CPU
free( a );
free( b );
free( c );
return 0;
}
Thank you in advance.
Michele

Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have need to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use mutliple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};
// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are preforming operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
cudaSetDevice(dev);
cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0;i<10000;++i) {
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
}
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I Will leave it to you to work out the modifications required.
But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities are more), but this should be enough to get you started. There is also a simple muLti-GPU application in the CUDA SDK you can look at for more ideas.

Related

Compare Thrust fill with kernel launch speed [duplicate]

I want to add 128-bit vectors with carry. My 128-bit version (addKernel128 in the code below) is twice slower than the basic 32-bit version (addKernel32 below).
Do I have memory coalescing problems ? How can I get better performance ?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#define UADDO(c, a, b) asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define UADDC(c, a, b) asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
__global__ void addKernel32(unsigned int *c, const unsigned int *a, const unsigned int *b, const int size)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
while (tid < size)
{
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
__global__ void addKernel128(unsigned *c, const unsigned *a, const unsigned *b, const int size)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
while (tid < size / 4)
{
uint4 a4 = ((const uint4 *)a)[tid],
b4 = ((const uint4 *)b)[tid],
c4;
UADDO(c4.x, a4.x, b4.x)
UADDC(c4.y, a4.y, b4.y) // add with carry
UADDC(c4.z, a4.z, b4.z) // add with carry
UADDC(c4.w, a4.w, b4.w) // add with carry (no overflow checking for clarity)
((uint4 *)c)[tid] = c4;
tid += blockDim.x * gridDim.x;
}
}
int main()
{
const int size = 10000000; // 10 million
unsigned int *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, size * sizeof(int));
cudaMalloc((void**)&d_b, size * sizeof(int));
cudaMalloc((void**)&d_c, size * sizeof(int));
cudaMemset(d_a, 1, size * sizeof(int)); // dummy init just for the example
cudaMemset(d_b, 2, size * sizeof(int)); // dummy init just for the example
cudaMemset(d_c, 0, size * sizeof(int));
int nbThreads = 512;
int nbBlocks = 1024; // for example
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
addKernel128<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float m = 0;
cudaEventElapsedTime(&m, start, stop);
cudaFree(d_c);
cudaFree(d_b);
cudaFree(d_a);
cudaDeviceReset();
printf("Elapsed = %g\n", m);
return 0;
}
Timing CUDA code on a WDDM GPU can be quite difficult for a variety of reasons. Most of these revolve around the fact that the GPU is being managed as a display device by Windows, and this can introduce a variety of artifacts into the timing. One example is that the windows driver and WDDM will batch work for the GPU, and may interleave display work in the middle of CUDA GPU work.
if possible, time your cuda code on linux, or else on a windows GPU
in TCC mode.
for performance, always build without the -G switch. In visual studio, this usually corresponds to building the release, not the debug version of the project.
To get a good performance comparison, it's usually advisable to do some "warm up runs" before actually measuring the timing results. These will eliminate "start-up" and other one-time measurement issues, are you are more likely to get sensible results. You may also wish to run your code a number of times and average the results.
It's also usually advisable to compile with an arch flag that corresponds to your GPU, so for example -arch=sm_20 for a cc2.0 GPU.

Experiment to find out affect of block size on cuda program speed

I want to find out how the number of threads in a block affects the performance and speed of a cuda program. I wrote a simple vector addition code, here is my code:
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void gpuVecAdd(float *a, float *b, float *c, int n) {
int id = blockIdx.x * blockDim.x + threadIdx.x;
if (id < n) {
c[id] = a[id] + b[id];
}
}
int main() {
int n = 1000000;
float *h_a, *h_b, *h_c, *t;
srand(time(NULL));
size_t bytes = n* sizeof(float);
h_a = (float*) malloc(bytes);
h_b = (float*) malloc(bytes);
h_c = (float*) malloc(bytes);
for (int i=0; i<n; i++)
{
h_a[i] =rand()%10;
h_b[i] =rand()%10;
}
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
gpuErrchk( cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
gpuErrchk( cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));
clock_t t1,t2;
t1 = clock();
int block_size = 1024;
gpuVecAdd<<<ceil(float(n/block_size)),block_size>>>(d_a, d_b, d_c, n);
gpuErrchk( cudaPeekAtLastError() );
t2 = clock();
cout<<(float)(t2-t1)/CLOCKS_PER_SEC<<" seconds";
gpuErrchk(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
free(h_a);
free(h_b);
free(h_c);
}
I read this post and Based on the talonmies' answer "The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware."
I checked the code with a different number of threads per block, for example, 2 and 1024 (which is the multiply of 32 and also the maximum number of thread per block). The average running time for both sizes is almost equal and I don't see a huge difference between them. Why is that? Is my benchmarking incorrect?
GPU kernel launches in CUDA are asynchronous. This means that control will be returned to the CPU thread before the kernel has finished executing.
If we want the CPU thread to time the duration of the kernel, we must cause the CPU thread to wait until the kernel has finished. We can do this by putting a call to cudaDeviceSynchronize() in the timing region. Then the measured time will include the full duration of kernel execution.

Using cluster of GPU's [duplicate]

How can I use two devices in order to improve for example
the performance of the following code (sum of vectors)?
Is it possible to use more devices "at the same time"?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>
#define NB 32
#define NT 500
#define N NB*NT
__global__ void add( double *a, double *b, double *c);
//===========================================
__global__ void add( double *a, double *b, double *c){
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N){
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
//============================================
//BEGIN
//===========================================
int main( void ) {
double *a, *b, *c;
double *dev_a, *dev_b, *dev_c;
// allocate the memory on the CPU
a=(double *)malloc(N*sizeof(double));
b=(double *)malloc(N*sizeof(double));
c=(double *)malloc(N*sizeof(double));
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(double) );
cudaMalloc( (void**)&dev_b, N * sizeof(double) );
cudaMalloc( (void**)&dev_c, N * sizeof(double) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = (double)i;
b[i] = (double)i*2;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);
for(int i=0;i<10000;++i)
add<<<NB,NT>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);
// display the results
// for (int i=0; i<N; i++) {
// printf( "%g + %g = %g\n", a[i], b[i], c[i] );
// }
printf("\nGPU done\n");
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
// free the memory allocated on the CPU
free( a );
free( b );
free( c );
return 0;
}
Thank you in advance.
Michele
Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have need to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use mutliple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};
// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are preforming operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
cudaSetDevice(dev);
cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0;i<10000;++i) {
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
}
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I Will leave it to you to work out the modifications required.
But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities are more), but this should be enough to get you started. There is also a simple muLti-GPU application in the CUDA SDK you can look at for more ideas.

Kernel function and cudaMemcpy

I don't know why my kernel function doesn't work. Theoretically my program should display a = 14 but it displays a = 5.
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
using namespace std;
__global__ void AddIntCUDA(int* a, int* b)
{
a[0] += b[0];
}
int main()
{
int a = 5;
int b = 9;
int *d_a ;
int *d_b ;
cudaMalloc(&d_a, sizeof(int));
cudaMalloc(&d_b, sizeof(int));
cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
AddIntCUDA<<<1, 1>>>(d_a, d_b);
cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
cout<<"The answer is a = "<<a<<endl;
cudaFree(d_a);
cudaFree(d_b);
return 0;
}
Also I don't understand why if I have:
cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice); //d_b = 9 on device
cudaMemcpy(&a, d_b, sizeof(int), cudaMemcpyDeviceToHost); //a = 9 on host
a is still 5?
Whenever you are having trouble with a CUDA program, the first step should be to use proper cuda error checking on all cuda API calls and kernel calls. With error checking, this error (driver issue) would have been immediately obvious.
Additional suggestions can be found on the cuda tag info tab.
Maybe you need to put cudaDeviceSynchronize(); after AddIntCUDA<<<1, 1>>>(d_a, d_b);
When you executed AddIntCUDA<<<1, 1>>>(d_a, d_b); The host doesn't wait to the CUDA kernel if you don't put cudaDeviceSynchronize();

Multi-GPU cuda programming: launching one kernel from multiple devices [duplicate]

How can I use two devices in order to improve for example
the performance of the following code (sum of vectors)?
Is it possible to use more devices "at the same time"?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>
#define NB 32
#define NT 500
#define N NB*NT
__global__ void add( double *a, double *b, double *c);
//===========================================
__global__ void add( double *a, double *b, double *c){
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N){
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
//============================================
//BEGIN
//===========================================
int main( void ) {
double *a, *b, *c;
double *dev_a, *dev_b, *dev_c;
// allocate the memory on the CPU
a=(double *)malloc(N*sizeof(double));
b=(double *)malloc(N*sizeof(double));
c=(double *)malloc(N*sizeof(double));
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(double) );
cudaMalloc( (void**)&dev_b, N * sizeof(double) );
cudaMalloc( (void**)&dev_c, N * sizeof(double) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = (double)i;
b[i] = (double)i*2;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);
for(int i=0;i<10000;++i)
add<<<NB,NT>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);
// display the results
// for (int i=0; i<N; i++) {
// printf( "%g + %g = %g\n", a[i], b[i], c[i] );
// }
printf("\nGPU done\n");
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
// free the memory allocated on the CPU
free( a );
free( b );
free( c );
return 0;
}
Thank you in advance.
Michele
Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have need to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use mutliple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};
// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are preforming operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
cudaSetDevice(dev);
cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0;i<10000;++i) {
for(int dev=0; dev<2; dev++) {
cudaSetDevice(dev);
add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
}
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I Will leave it to you to work out the modifications required.
But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities are more), but this should be enough to get you started. There is also a simple muLti-GPU application in the CUDA SDK you can look at for more ideas.