Issue with using multiple GPUs with NVIDIA CUDA

I'm learning how to use multiple GPUs in my CUDA application. I tried a simple program which ran successfully on a system with two Tesla C2070s. But when I run the same program on a different system with a Tesla K40c and a Tesla C2070, it shows a segmentation fault. What might be the problem? I'm sure there is no problem with the code. Are there any settings that need to be changed in the environment? I have attached my code here for reference.
#include <stdio.h>
#include "device_launch_parameters.h"
#include "cuda_runtime_api.h"

__global__ void testA(int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = a[i] * 2;
}

int main()
{
    int *ai, *bi, *ao, *bo;
    int iter;
    cudaStream_t streamA, streamB;

    cudaSetDevice(0);
    cudaStreamCreate(&streamA);
    cudaMalloc((void**)&ao, 10 * sizeof(int));
    cudaHostAlloc((void**)&ai, 10 * sizeof(int), cudaHostAllocMapped);
    for(iter = 0; iter < 10; iter++)
    {
        ai[iter] = iter + 1;
    }

    cudaSetDevice(1);
    cudaStreamCreate(&streamB);
    cudaMalloc((void**)&bo, 10 * sizeof(int));
    cudaHostAlloc((void**)&bi, 10 * sizeof(int), cudaHostAllocMapped);
    for(iter = 0; iter < 10; iter++)
    {
        bi[iter] = iter + 11;
    }

    cudaSetDevice(0);
    cudaMemcpyAsync(ao, ai, 10 * sizeof(int), cudaMemcpyHostToDevice, streamA);
    testA<<<1, 10, 0, streamA>>>(ao);
    cudaMemcpyAsync(ai, ao, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamA);

    cudaSetDevice(1);
    cudaMemcpyAsync(bo, bi, 10 * sizeof(int), cudaMemcpyHostToDevice, streamB);
    testA<<<1, 10, 0, streamB>>>(bo);
    cudaMemcpyAsync(bi, bo, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamB);

    cudaSetDevice(0);
    cudaStreamSynchronize(streamA);
    cudaSetDevice(1);
    cudaStreamSynchronize(streamB);

    printf("%d %d %d %d %d\n", ai[0], ai[1], ai[2], ai[3], ai[4]);
    printf("%d %d %d %d %d\n", bi[0], bi[1], bi[2], bi[3], bi[4]);
    return 0;
}
The segmentation fault occurs when the bi array is initialized inside the for loop, which means memory was never allocated for bi.

With the new information you've provided based on the error checking, the problem you were having was due to an ECC error.
When a GPU has a double-bit ECC error detected in the current session, it is no longer usable for compute activities until either the GPU is reset (e.g. via system reboot, via driver unload/reload, or manually via nvidia-smi), or ECC is disabled (which usually also requires a system reboot or GPU reset).
You can review the ECC status of your GPUs with the nvidia-smi command. You probably already know which GPU was reporting the ECC error, since you disabled ECC, but in case not: based on your initial report it would be the one associated with the cudaSetDevice(1); call, which should have been the Tesla C2070 (i.e. not the K40).
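For reference, here is a minimal sketch (my own, not from the original post) of the kind of error checking that would have surfaced the failure before the segfault; the device index and allocation size are taken from the question's code:

#include <stdio.h>
#include <cuda_runtime_api.h>

int main()
{
    int *bi = NULL;
    cudaError_t err;

    /* Select the second GPU, as in the question's code. */
    err = cudaSetDevice(1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(1) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* A double-bit ECC error makes this allocation fail; checking the return
       value reports that instead of leaving bi unallocated and segfaulting later. */
    err = cudaHostAlloc((void**)&bi, 10 * sizeof(int), cudaHostAllocMapped);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Safe to initialize bi here. */
    cudaFreeHost(bi);
    return 0;
}

With checks like these in place, the ECC condition shows up as an explicit runtime error rather than a crash.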


dram_write_bytes result on P100

I used nvprof to profile a simple vecAdd example (n = 1024) on a P100, but the observed dram_write_bytes is only 256 (rather than the 1024*4 I expected). Can someone explain why this number is so small? What other metrics do I need to add to account for global memory writes? Thanks. The float_count_sp number is correct (1024).
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void vecAdd(float* a, float* b, float* c, int n){
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if(id < n) c[id] = a[id] + b[id];
}

int main(int argc, char* argv[]){
    int n = 1024;
    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;
    size_t bytes = n*sizeof(float);

    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    for(i = 0; i < n; i++){
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i+1);
    }

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, 1024>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    float sum = 0;
    for(i = 0; i < n; i++)
        sum += h_c[i] - h_a[i] - h_b[i];
    printf("final diff: %f\n", sum/n);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
Is it related to the sampling of nvprof? One time I got 384 bytes; sometimes I even got 0 bytes. The weird thing is: if I change n to 1024*1024, I get more bytes than I expected (4688032). 4688032/1024/1024/4 = 1.11.
There are several reasons why your expectations are not being met and the data keeps changing:
The GPU memory system is shared by all engines. The primary engine is the graphics/compute engine, but other engines such as copy engines, display, etc. also access device memory, and the memory controller (FB = framebuffer) counters have no way to track the requester.
NVPROF injection does not attempt to evict all context memory from the L2 cache. The cudaMemcpy calls prior to the launch and the kernel-replay code in nvprof leave the L2 cache in an inconsistent state.
The initial size of 4 KB is simply too small to track accurately. The full data set could already be in L2 from either the cudaMemcpy or the replay. Furthermore, the bytes you see can come from other clients such as the constant caches.
It is highly recommended that you scale the buffer size to something reasonable. On newer GPUs the Nsight Compute profiler has an improved L2-level breakdown of the various clients to help detect unexpected traffic. In addition, the Nsight Compute replay logic clears the L2 cache so that each replay has a consistent start state.
If you have a monitor attached, it is recommended to move the monitor to a different GPU when looking at DRAM counters. nvprof L2 counters generally filter the count by traffic from the SMs, so traffic from copy engines, the display controller, MMU, constant caches, etc. will not show up in the L2 counters.
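As a concrete illustration of the "scale the buffer size" advice, here is a hedged sketch (my own, not from the answer) that takes n from the command line so the working set can be made much larger than the L2 cache before reading the DRAM counters; the binary name in the profiling command is just an example:

/* Example profiling command (hypothetical binary name):
     nvprof --metrics dram_write_bytes ./vecadd 16777216 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(float* a, float* b, float* c, int n){
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if(id < n) c[id] = a[id] + b[id];
}

int main(int argc, char* argv[]){
    /* Default to 16M elements (64 MB per buffer), far larger than the L2 cache. */
    int n = (argc > 1) ? atoi(argv[1]) : 16 * 1024 * 1024;
    size_t bytes = (size_t)n * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    int block = 256;
    int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}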

cudaEventElapsedTime: unexpected behaviour

I'm trying to compute the total time taken on the GPU to compute something. I'm using cudaEventRecord and cudaEventElapsedTime to determine this, but I'm seeing unexpected behavior, or at least unexpected for me :) I wrote this example to understand what's happening and I'm still confused.
In the example below I was expecting the same time to be reported for the three iterations, but the result is:
2.80342
1003
2005.6
This means that the total time is including the CPU sleep time.
Am I doing something wrong? If not, is it possible to do what I want?
#include <iostream>
#include <thread>
#include <chrono>
#include <cuda.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"

__global__ void kernel_test(int *a, int N) {
    for(int i = threadIdx.x; i < N; i += N) {
        if(i < N)
            a[i] = 1;
    }
}

int main(int argc, char ** argv) {
    cudaEvent_t start[3], stop[3];
    for(int i = 0; i < 3; i++) {
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int N = 1024 * 1024;
    int *h_a = (int*)malloc(N * sizeof(int));
    int *a = 0;
    cudaMalloc((void**)&a, N * sizeof(int));

    for(int i = 0; i < 3; i++) {
        cudaEventRecord(start[i], stream);
        cudaMemcpyAsync(a, h_a, N * sizeof(int), cudaMemcpyHostToDevice, stream);
        kernel_test<<<1, 1024, 0, stream>>>(a, N);
        cudaMemcpyAsync(h_a, a, N * sizeof(int), cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(stop[i], stream);

        std::this_thread::sleep_for(std::chrono::seconds(i));

        cudaEventSynchronize(stop[i]);
        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start[i], stop[i]);
        std::cout << milliseconds << std::endl;
    }
    return 0;
}
I attach the nsight result to verify the behaviour of my example.
Windows 8.1
Geforce GTX 780 Ti
Nvidia drivers: 358.50
EDIT:
Added complete code
Attached nsight result
Added OS and driver info
If you're running the program on Windows using the WDDM driver (in contrast to TCC with Tesla cards, or Linux), this may be the issue:
With WDDM, kernels are not executed immediately after invocation but are instead enqueued to a command buffer. Once the buffer is full it gets flushed and the enqueued commands are actually executed. Another way to force the command buffer to be flushed explicitly is to synchronize.
Now what happens is that you wait before the command buffer is actually flushed...
Edit
Also see https://devtalk.nvidia.com/default/topic/548639/is-wddm-causing-this-/ for the problem and how cudaEventQuery(0) may help.
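Here is a minimal, self-contained sketch of that workaround (my own condensed example, assuming a WDDM setup, with a simplified kernel): querying the stop event right after recording it nudges the driver to flush the command buffer, so the host-side sleep no longer falls between the two events.

#include <chrono>
#include <thread>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_test(int *a, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = 1;
}

int main() {
    const int N = 1024 * 1024;
    int *a = 0;
    cudaMalloc((void**)&a, N * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel_test<<<(N + 1023) / 1024, 1024>>>(a, N);
    cudaEventRecord(stop);

    // On WDDM the work above may still sit in the software command queue.
    // Querying the event encourages the driver to submit it now, so the
    // sleep below is not counted between the two events.
    cudaEventQuery(stop);

    std::this_thread::sleep_for(std::chrono::seconds(2));

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("%f ms\n", ms);

    cudaFree(a);
    return 0;
}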

cudaMemcpy is too slow on Tesla C2075

I'm currently working on a server with two CUDA-capable GPUs: a Quadro 400 and a Tesla C2075. I made a simple vector-addition test program. My problem is that while the Tesla C2075 is supposed to be more powerful than the Quadro 400, it takes more time to do the job. I found that cudaMemcpy takes up most of the execution time, and it is slower on the more powerful GPU.
Here's the source:
// Note: d_A, d_B and the vector_add kernel are assumed to be declared/defined
// elsewhere in the full source.
void get_matrix(float* arr1, float* arr2, int N1, int N2)
{
    int Nx, Ny;
    int n_blocks, n_threads;
    int dev = 0; // 1
    float time;
    size_t size;
    clock_t start, end;

    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    start = clock();
    cudaGetDeviceProperties(&deviceProp, dev);

    Nx = N1;
    Ny = N2;
    n_threads = 256;
    n_blocks = (Nx*Ny + n_threads - 1) / n_threads;
    size = Nx*Ny*sizeof(float);

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_A, arr1, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, arr2, size, cudaMemcpyHostToDevice);
    vector_add<<<n_blocks, n_threads>>>(d_A, d_B, size);
    cudaMemcpy(arr1, d_A, size, cudaMemcpyDeviceToHost);

    printf("Running device %s \n", deviceProp.name);
    end = clock();
    time = float(end - start) / float(CLOCKS_PER_SEC);
    printf("time = %e\n", time);
}

int main()
{
    int const nx = 20000, ny = nx;
    static float a[nx*ny], b[nx*ny];
    for(int i = 0; i < nx; i++)
    {
        for(int j = 0; j < ny; j++)
        {
            a[j+ny*i] = j + 10*i;
            b[j+ny*i] = -(j + 10*i);
        }
    }
    get_matrix(a, b, nx, ny);
    return 0;
}
The output is:
Running device Quadro 400
time = 1.100000e-01
Running device Tesla C2075
time = 1.050000e+00
And my questions are:
Should I modify the code depending on what GPU I am going to use?
Is there any connection between the number of blocks, threads per block specified in the code and the number of multiprocessors, cores per multiprocessor available on a GPU?
I'm running Linux Open Suse 11.2. The source code is compiled using the nvcc compiler (version 4.2).
Thanks for your help!
Try invoking get_matrix(a,b,nx,ny) twice and take the second timing result. The first call into the CUDA API creates the CUDA context, which often takes a long time.
Please refer to this section in the CUDA C Best Practices Guide for how to determine the block size and grid size.
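To make the context-creation point concrete, here is a small sketch (my own, not from the answer) that uses cudaFree(0), a common idiom for forcing context creation, as a warm-up before the timed region:

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

int main()
{
    clock_t start, end;

    /* Warm-up: cudaFree(0) forces creation of the CUDA context on the
       currently selected device, which can take a large fraction of a second. */
    start = clock();
    cudaFree(0);
    end = clock();
    printf("context creation: %e s\n", (double)(end - start) / CLOCKS_PER_SEC);

    /* Time the real work (cudaMalloc, cudaMemcpy, kernel launch) only after
       this point, or simply call get_matrix() twice and keep the second timing. */
    return 0;
}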

CUDA memory transfer issue

I am trying to execute a code that first transfers data from CPU to GPU memory and vice versa. Despite increasing the volume of data, the data transfer time remains the same, as if no data transfer were actually taking place. I am posting the code.
#include <stdio.h>    /* Core input/output operations */
#include <stdlib.h>   /* Conversions, random numbers, memory allocation, etc. */
#include <math.h>     /* Common mathematical functions */
#include <time.h>     /* Converting between various date/time formats */
#include <cuda.h>     /* CUDA related stuff */
#include <sys/time.h>

__global__ void device_volume(float *x_d, float *y_d)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
}

int main(void)
{
    float *x_h, *y_h, *x_d, *y_d, *z_h, *z_d;
    long long size = 9999999;
    long long nbytes = size * sizeof(float);
    timeval t1, t2;
    double et;

    x_h = (float*)malloc(nbytes);
    y_h = (float*)malloc(nbytes);
    z_h = (float*)malloc(nbytes);

    cudaMalloc((void **)&x_d, size*sizeof(float));
    cudaMalloc((void **)&y_d, size*sizeof(float));
    cudaMalloc((void **)&z_d, size*sizeof(float));

    gettimeofday(&t1, NULL);
    cudaMemcpy(x_d, x_h, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(z_d, z_h, nbytes, cudaMemcpyHostToDevice);
    gettimeofday(&t2, NULL);
    et = (t2.tv_sec - t1.tv_sec) * 1000.0;    // sec to ms
    et += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms
    printf("\n %lld\t\t%f\t\t", nbytes, et);
    et = 0.0;

    //printf("%f %d\n",seconds,CLOCKS_PER_SEC);
    // launch a kernel with a single thread to greet from the device
    //device_volume<<<1,1>>>(x_d,y_d);

    gettimeofday(&t1, NULL);
    cudaMemcpy(x_h, x_d, nbytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(y_h, y_d, nbytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(z_h, z_d, nbytes, cudaMemcpyDeviceToHost);
    gettimeofday(&t2, NULL);
    et = (t2.tv_sec - t1.tv_sec) * 1000.0;    // sec to ms
    et += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms
    printf("%f\n", et);

    cudaFree(x_d);
    cudaFree(y_d);
    cudaFree(z_d);
    return 0;
}
Can anybody help me with this issue?
Thanks
Try cudaEvent for capturing the time of the GPU code.
Try using the Visual Profiler to see how much time is spent on the memcpy. The profiler will show the execution time spent on the GPU for every CUDA-related operation.
It stays the same because it takes the same time. In your code you don't add up the transfer times.
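As a sketch of the cudaEvent suggestion above (my own example, with the buffer size chosen to match the question), event-based timing brackets just the copy and reports milliseconds directly:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    long long size = 9999999;
    size_t nbytes = size * sizeof(float);

    float *x_h = (float*)malloc(nbytes);
    float *x_d;
    cudaMalloc((void**)&x_d, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time only the host-to-device copy. */
    cudaEventRecord(start, 0);
    cudaMemcpy(x_d, x_h, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy of %zu bytes: %f ms\n", nbytes, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x_d);
    free(x_h);
    return 0;
}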

Simple addition of two ints in CUDA, result always the same

I'm starting my journey to learn CUDA. I am playing with some hello-world-type CUDA code, but it's not working, and I'm not sure why.
The code is very simple: take two ints, add them on the GPU, and return the result. But no matter what I change the numbers to, I get the same result (if math worked that way I would have done a lot better in the subject than I actually did).
Here's the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;
    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm working on setting up my environment and just wanted to know whether there is a problem with the code; otherwise it's probably my environment.
Update: I tried to add some code to show the error, but I don't get any error output, and the number changes (is it outputting error codes instead of answers? Even if I don't do any work in the kernel other than assign a variable, I still get similar results).
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function
void runCudaPart() {
    int c;
    int *dev_c;
    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if(err != cudaSuccess){
        printf("The error is %s", cudaGetErrorString(err));
    }
    add<<<1,1>>>( 1, 4, dev_c );
    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if(err2 != cudaSuccess){
        printf("The error is %s", cudaGetErrorString(err2));
    }
    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );
}
The code appears to be fine, so maybe it's related to my setup. It's been a nightmare to get CUDA installed on OS X Lion, but I thought it worked since the examples in the SDK seemed to be fine. The steps I took so far: go to the Nvidia website and download the latest Mac releases of the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and PATH=/usr/local/cuda/bin:$PATH. I ran deviceQuery and it passed with the following info about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
UPDATE: What's really weird is that even if I remove all the work in the kernel, I still get a result for c. I have reinstalled CUDA and used make on the examples, and all of them pass.
Basically there are two problems here:
You are not compiling the kernel for the correct architecture (gleaned from comments)
Your code contains imperfect error checking which misses the point at which the runtime error occurs, leading to mysterious and unexplained symptoms.
In the runtime API, most context related actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to intelligently find a suitable CUBIN image from inside the fat binary image emitted by the toolchain for the target hardware and load it into the context. This can also include JIT recompilation of PTX for a backwards compatible architecture, but not the other way around. So if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture. But the reverse doesn't work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
If your code contained error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
    printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
    printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
    printf("The error is %s", cudaGetErrorString(err2));
}
the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
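For reference, a reusable checking macro (my own sketch, not part of the original answer) makes it easy to wrap every call plus the post-launch check; the build line in the comment is a hedged example assuming the CUDA 4.2 toolchain and the compute capability 1.2 GeForce 320M reported by deviceQuery above.

/* Hypothetical build line matching the device's architecture:
     nvcc -arch=sm_12 -o add add.cu */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t e = (call);                                   \
        if (e != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(e), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void add(int a, int b, int *c) { *c = a + b; }

int main()
{
    int c;
    int *dev_c;
    CUDA_CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
    add<<<1,1>>>(1, 4, dev_c);
    CUDA_CHECK(cudaPeekAtLastError());     /* catches launch/configuration errors    */
    CUDA_CHECK(cudaDeviceSynchronize());   /* catches errors raised during execution */
    CUDA_CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("1 + 4 = %d\n", c);
    CUDA_CHECK(cudaFree(dev_c));
    return 0;
}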
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void Addition(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main()
{
    int a, b, c;
    int *dev_a, *dev_b, *dev_c;
    int size = sizeof(int);

    // allocate device storage for the two operands and the result
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    // copy the inputs to the device, add them, and copy the result back
    a = 5, b = 6;
    cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    Addition<<< 1,1 >>>(dev_a, dev_b, dev_c);
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    printf("%d\n", c);
    getch();
    return 0;
}