CUDA: Fermi (Tesla M2090) generating CUDA_EXCEPTION_10 without reason

I have a small piece of code that runs perfectly on an older NVIDIA architecture (Tesla T10 processor) but not on Fermi (Tesla M2090).
I learned that Fermi behaves slightly differently: unsafe code might appear to work correctly on older architectures, while Fermi catches the bug.
But I don't know how to resolve it.
Here is my code:
__global__ void exec (int *arr_ptr, int size, int *result) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    *result = arr_ptr[-2];
}
void run(int *arr_dev, int size, int *result) {
    cudaStream_t stream = 0;
    int *arr_ptr = arr_dev + 5;
    dim3 threads(1,1,1);
    dim3 grid (1,1);
    exec<<<grid, threads, 0, stream>>>(arr_ptr, size, result);
}
Since I am accessing arr_ptr[-2], Fermi throws CUDA_EXCEPTION_10 (Device Illegal Address). But as far as I can tell, the address is legal.
Can anyone help me with this?
My driver code is:
int main(){
    int *arr;
    int *arr_dev = NULL;
    int result = 1;
    arr = (int*)malloc(10*sizeof(int));
    for(int i = 0; i < 10; i++)
        arr[i] = i;
    if(arr_dev == NULL)
    {
        cudaMalloc((void**)&arr_dev, 10);
        cudaMemcpy(arr_dev, arr, 10*sizeof(int), cudaMemcpyHostToDevice);
    }
    run(arr_dev, 10, &result);
    printf("%d \n", result);
    return 0;
}

Fermi cards have much better memory protection on the device and will detect out-of-bounds accesses that appear to "work" on older cards. Use cuda-memcheck (or the memcheck mode in cuda-gdb) to get a better handle on what is going wrong.
EDIT:
This is the culprit:
cudaMalloc((void**)&arr_dev, 10);
which should be
cudaMalloc((void**)&arr_dev, 10*sizeof(int));
Because only 10 bytes are allocated instead of room for 10 ints, this line
int *arr_ptr = arr_dev + 5;
passes the kernel a pointer that is already out of bounds.
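The byte arithmetic behind this can be checked on the host. The sketch below is only an illustration: element_in_bounds and check_allocation_example are made-up helper names, and it assumes a 4-byte int, as on CUDA devices. With 10 bytes allocated, only two complete ints fit, so arr_dev[3] (which is what arr_ptr[-2] reads) lies past the end of the allocation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: does element `index` of an int array lie entirely
// inside an allocation of `alloc_bytes` bytes? Assumes sizeof(int) == 4.
bool element_in_bounds(std::size_t alloc_bytes, std::size_t index) {
    return (index + 1) * sizeof(int) <= alloc_bytes;
}

void check_allocation_example() {
    // arr_ptr = arr_dev + 5, so arr_ptr[-2] is arr_dev[3].
    const std::size_t index = 5 - 2;
    assert(!element_in_bounds(10, index));               // cudaMalloc(..., 10): 10 bytes
    assert(element_in_bounds(10 * sizeof(int), index));  // room for 10 ints
}
```

Calling check_allocation_example() passes silently, which matches the diagnosis above: the access is illegal with the 10-byte allocation and legal once the size is 10*sizeof(int).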

Related

Cuda printf() overlapping when using multiple devices

I have a printf in my __global__ code. It works as intended most of the time. However, on a multi-GPU system (typically when run on a 4-8 GPU system), the printed lines occasionally merge. By occasionally I mean about 100-500 lines out of 167000 lines.
I was wondering how this can be remedied without adding too much overhead from transferring the data back to the host (if possible). I was thinking of trying a mutex lock around the printing, but I don't think that sort of thing exists for use in a kernel. Are there any other solutions I could try?
Note: The actual kernel is long running, usually taking around 20-50 minutes to complete depending on the GPU.
Note2: I barely know what I'm doing with C/C++.
Example of merged Output
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869fxLuZJu3ZkX
qVOuKlQ0ZcMrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
What it should look like
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869f
MrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a
qVOuKlQ0Zc,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
xLuZJu3ZkX,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
My Example Code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;
pthread_mutex_t solutionLock;
__global__ void kernel(unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine*/
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine*/
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Print to console - randomStr,Hash */
    printf("%s,%s\n", random, data);
}
void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(rngSeed);
        cudaDeviceSynchronize();
    }
    cudaDeviceReset();
    return NULL;
}
int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    pthread_mutex_lock(&solutionLock);
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I spent 4 days trying different things to no avail. I don't understand memory management in C/C++ well enough to get past the endless segmentation faults.
What I ended up doing was using Unified Memory, as it seemed the easiest way to handle the memory for both device and host, and it doesn't seem to add much overhead to the whole process. Each CPU thread (one per GPU) can then write to its own file. I ran nvprof a couple of times, and after the initial setup of the memory with cudaMallocManaged the remaining overhead seemed to be measured in microseconds. Since each loop takes 20 minutes, this is barely noticeable.
I created two __device__ functions to copy the data over to the host-accessible arrays, because I wanted to use #pragma unroll. I'm not really sure whether that helps or what it even does, but I decided to do things this way.
If anyone has further suggestions on ways to improve I am open to trying more things out.
Here is my new example code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;
/* dst must have room for 65 bytes (64 characters plus the terminator) */
__device__ void mycpydigest(BYTE *__restrict__ dst, const BYTE *__restrict__ src) {
#pragma unroll 64
    for (BYTE i = 0; i < 64; i++) {
        dst[i] = src[i];
    }
    dst[64] = '\0';
}
/* dst must have room for 11 bytes (10 characters plus the terminator) */
__device__ void mycpyrandom(BYTE *__restrict__ dst, const BYTE *__restrict__ src) {
#pragma unroll 10
    for (BYTE i = 0; i < 10; i++) {
        dst[i] = src[i];
    }
    dst[10] = '\0';
}
__global__ void kernel(BYTE **d_random, BYTE **d_hashes, unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine*/
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine*/
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Send to host - randomStr & Hash */
    mycpydigest(d_hashes[idx], data);
    mycpyrandom(d_random[idx], random);
}
void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    /* BLOCKS and THREADS are macros, not members of HandlerInput */
    int threadBlocks = BLOCKS * THREADS;
    BYTE **randoms;
    BYTE **hashes;
    cudaMallocManaged(&randoms, sizeof(BYTE *) * (threadBlocks), cudaMemAttachGlobal);
    cudaMallocManaged(&hashes, sizeof(BYTE *) * (threadBlocks), cudaMemAttachGlobal);
    for (int i = 0; i < threadBlocks; i++) {
        cudaMallocManaged(&randoms[i], sizeof(BYTE) * (RANDOM_LEN), cudaMemAttachGlobal);
        cudaMallocManaged(&hashes[i], sizeof(BYTE) * (DIGEST_LEN), cudaMemAttachGlobal);
    }
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(randoms, hashes, rngSeed);
        cudaDeviceSynchronize();
        print2File(randoms, hashes, threadBlocks, hi->device);
    }
    cudaFree(hashes);
    cudaFree(randoms);
    cudaDeviceReset();
    return NULL;
}
int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I want to thank @paleonix for the help in the comments. I was working on this issue for a week before I posted, and your comments helped guide me down a different path.
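The print2File helper is not shown in the post. As a rough idea of what a per-device writer could look like, here is a host-only sketch. Everything in it is an assumption made for illustration: the names writeBatch, deviceFileName and writeBatchDemo, the file naming scheme, the BYTE typedef, and the requirement that every random string and hash is NUL-terminated. Because each device gets its own file, the host threads never share an output stream, so lines from different GPUs cannot interleave.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

typedef unsigned char BYTE;  // assumption: BYTE is an 8-bit unsigned type

// Hypothetical file naming scheme: one output file per device.
std::string deviceFileName(int device) {
    return "gpu" + std::to_string(device) + ".csv";
}

// Hypothetical stand-in for print2File: appends "random,hash" rows to `path`.
// Assumes every randoms[i] and hashes[i] is a NUL-terminated string.
bool writeBatch(const std::string &path, BYTE *const *randoms,
                BYTE *const *hashes, int count) {
    std::FILE *f = std::fopen(path.c_str(), "a");
    if (!f) return false;
    for (int i = 0; i < count; i++) {
        std::fprintf(f, "%s,%s\n",
                     reinterpret_cast<const char *>(randoms[i]),
                     reinterpret_cast<const char *>(hashes[i]));
    }
    return std::fclose(f) == 0;
}

// Self-check: write two rows for device 0, then read the file back.
bool writeBatchDemo() {
    BYTE r0[] = "JmHp8rwXAw", h0[] = "031a";
    BYTE r1[] = "MrhGXAnZ75", h1[] = "08bf";
    BYTE *randoms[] = {r0, r1};
    BYTE *hashes[] = {h0, h1};
    const std::string path = deviceFileName(0);
    std::remove(path.c_str());  // start from an empty file
    if (!writeBatch(path, randoms, hashes, 2)) return false;
    std::FILE *f = std::fopen(path.c_str(), "r");
    assert(f != nullptr);
    char buf[64] = {0};
    std::size_t n = std::fread(buf, 1, sizeof(buf) - 1, f);
    std::fclose(f);
    std::remove(path.c_str());
    return std::string(buf, n) == "JmHp8rwXAw,031a\nMrhGXAnZ75,08bf\n";
}
```

In the code above, the call would then read something like writeBatch(deviceFileName(hi->device), randoms, hashes, threadBlocks), since the managed arrays are readable from the host once cudaDeviceSynchronize() has returned.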

Overlap compute and transfer on Windows

I'm facing some issues when trying to overlap computation and transfers on Windows (using VS2015 and CUDA 10.1). The code doesn't overlap at all, but the exact same code on Linux has the expected behaviour.
Here are the views from NVVP:
Windows 10 NVVP screenshot:
Linux NVVP screenshot:
Please note the following points:
my host memory is page-locked
I'm using two different streams
I'm using cudaMemcpyAsync to transfer between host and device
if I run my code on Linux, everything is fine
I don't see anything in the documentation describing different behaviour between these two systems.
So the questions are the following:
Am I missing something?
Is there a way to achieve overlapping on this configuration (Windows 10 + 1080Ti)?
You can find some code to reproduce the issue here:
#include "cuda_runtime.h"
constexpr int NB_ELEMS = 64*1024*1024;
constexpr int BUF_SIZE = NB_ELEMS * sizeof(float);
constexpr int BLK_SIZE=1024;
using namespace std;
__global__
void dummy_operation(float* ptr1, float* ptr2)
{
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx<NB_ELEMS)
    {
        float value = ptr1[idx];
        for(int i=0; i<100; ++i)
        {
            value += 1.0f;
        }
        ptr2[idx] = value;
    }
}
int main()
{
    float *h_data1 = nullptr, *h_data2 = nullptr,
          *h_data3 = nullptr, *h_data4 = nullptr;
    cudaMallocHost(&h_data1, BUF_SIZE);
    cudaMallocHost(&h_data2, BUF_SIZE);
    cudaMallocHost(&h_data3, BUF_SIZE);
    cudaMallocHost(&h_data4, BUF_SIZE);
    float *d_data1 = nullptr, *d_data2 = nullptr,
          *d_data3 = nullptr, *d_data4 = nullptr;
    cudaMalloc(&d_data1, BUF_SIZE);
    cudaMalloc(&d_data2, BUF_SIZE);
    cudaMalloc(&d_data3, BUF_SIZE);
    cudaMalloc(&d_data4, BUF_SIZE);
    cudaStream_t st1, st2;
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);
    const dim3 threads(BLK_SIZE);
    const dim3 blocks(NB_ELEMS / BLK_SIZE + 1);
    for(int i=0; i<10; ++i)
    {
        float* tmp_dev_ptr = (i%2)==0? d_data1 : d_data3;
        float* tmp_host_ptr = (i%2)==0? h_data1 : h_data3;
        cudaStream_t tmp_st = (i%2)==0? st1 : st2;
        // destination is device memory, source is pinned host memory
        cudaMemcpyAsync(tmp_dev_ptr, tmp_host_ptr, BUF_SIZE, cudaMemcpyHostToDevice, tmp_st);
        dummy_operation<<<blocks, threads, 0, tmp_st>>>(tmp_dev_ptr, d_data2);
        //cudaMemcpyAsync(d_data2, h_data2);
    }
    cudaStreamSynchronize(st1);
    cudaStreamSynchronize(st2);
    return 0;
}
As pointed out by @talonmies, to overlap compute and transfers on Windows you need a graphics card in Tesla Compute Cluster (TCC) mode.
I've checked this behaviour using an old Quadro P620.
[Edit] Overlapping between kernels and copies seems to be working since I applied the Windows 10 update 1909.
I'm not sure whether the Windows update included a graphics driver update or not. But it's fine :)

Has cudaMalloc changed to be asynchronous?

I've read in other places that cudaMalloc will synchronize across kernels
(e.g. "will cudaMalloc synchronize host and device?").
However, I just tested this code, and based on what I'm seeing in the visual profiler, it seems like cudaMalloc is not synchronizing. If you add cudaFree into the loop, that does synchronize. I'm using CUDA 7.5. Does anyone know if cudaMalloc changed its behavior? Or am I missing some subtlety? Thanks very much!
__global__ void slowKernel()
{
    float input = 5;
    for( int i = 0; i < 1000000; i++ ){
        input = input * .9999999;
    }
}
__global__ void fastKernel()
{
    float input = 5;
    for( int i = 0; i < 100000; i++ ){
        input = input * .9999999;
    }
}
void mallocSynchronize(){
    cudaStream_t stream1, stream2;
    cudaStreamCreate( &stream1 );
    cudaStreamCreate( &stream2 );
    slowKernel <<<1, 1, 0, stream1 >>>();
    int *dev_a = 0;
    for( int i = 0; i < 10; i++ ){
        cudaMalloc( &dev_a, 4 * 1024 * 1024 );
        fastKernel <<<1, 1, 0, stream2 >>>();
        // cudaFree( dev_a ); // If you uncomment this, the second fastKernel launch will wait until slowKernel completes
    }
}
Your methodology is flawed, but your conclusion looks correct to me (if you look at your profile data, you should see that both the long and short kernels take the same amount of time and run very quickly, because aggressive compiler optimisation is eliminating all the code in both cases).
I turned your example into something more reasonable:
#include <time.h>
__global__ void slowKernel(float *output, bool write=false)
{
    float input = 5;
#pragma unroll
    for( int i = 0; i < 10000000; i++ ){
        input = input * .9999999;
    }
    if (write) *output -= input;
}
__global__ void fastKernel(float *output, bool write=false)
{
    float input = 5;
#pragma unroll
    for( int i = 0; i < 100000; i++ ){
        input = input * .9999999;
    }
    if (write) *output -= input;
}
void burntime(long val) {
    struct timespec tv[] = {{0, val}};
    nanosleep(tv, 0);
}
void mallocSynchronize(){
    cudaStream_t stream1, stream2;
    cudaStreamCreate( &stream1 );
    cudaStreamCreate( &stream2 );
    const size_t sz = 1 << 21;
    slowKernel <<<1, 1, 0, stream1 >>>((float *)(0));
    burntime(500000000L); // 500ms wait - slowKernel around 1300ms
    int *dev_a = 0;
    for( int i = 0; i < 10; i++ ){
        cudaMalloc( &dev_a, sz );
        fastKernel <<<1, 1, 0, stream2 >>>((float *)(0));
        burntime(1000000L); // 1ms wait - fastKernel around 15ms
    }
}
int main()
{
    mallocSynchronize();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
[Note: this requires POSIX time functions, so it won't run on Windows.]
On a fairly fast Maxwell device (GTX970), I see that the cudaMalloc calls in the loop overlap with the still-executing slowKernel call in the profile trace, and then with the running fastKernel calls in the other stream. I was willing to accept the initial conclusion that minor timing variations could be causing the effect you saw in your broken example. However, in this code, a 0.5 second time shift in synchronisation between the host and device traces seems very improbable. You might need to vary the duration of the burntime calls to get the same effect, depending on how fast your GPU is.
So this is a very long way of saying: yes, it looks like it is a non-synchronising call on Linux with CUDA 7.5 and a Maxwell device. I don't believe that has always been the case, but then again the documentation has never, as best as I can tell, said whether it should block/synchronize or not. I don't have access to older CUDA versions and supported hardware to see what this example would do with an older driver and a Fermi or Kepler device.

CUDA atomic function usage with volatile shared memory

I have a CUDA kernel that needs to use an atomic function on volatile shared integer memory. However, when I try to declare the shared memory as volatile and use it in an atomic function, I get an error message.
Below is some minimalist code that reproduces the error. Please note that the following kernel does nothing and horribly abuses why you would ever want to declare shared memory as volatile (or even use shared memory at all). But it does reproduce the error.
The code uses atomic functions on shared memory, so, to run it, you probably need to compile with "arch12" or higher (in Visual Studio 2010, right click on your project and go to "Properties -> Configuration Properties -> CUDA C/C++ -> Device" and enter "compute_12,sm_12" in the "Code Generation" line). The code should otherwise compile as is.
#include <cstdlib>
#include <cuda_runtime.h>
static int const X_THRDS_PER_BLK = 32;
static int const Y_THRDS_PER_BLK = 8;
__global__ void KernelWithSharedMemoryAndAtomicFunction(int * d_array, int numTotX, int numTotY)
{
    __shared__ int s_blk[Y_THRDS_PER_BLK][X_THRDS_PER_BLK]; // compiles
    //volatile __shared__ int s_blk[Y_THRDS_PER_BLK][X_THRDS_PER_BLK]; // will not compile
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int mx = blockIdx.x*blockDim.x + threadIdx.x;
    int my = blockIdx.y*blockDim.y + threadIdx.y;
    int mi = my*numTotX + mx;
    if (mx < numTotX && my < numTotY)
    {
        s_blk[ty][tx] = d_array[mi];
        __syncthreads();
        atomicMin(&s_blk[ty][tx], 4); // will compile with volatile shared memory only if this line is commented out
        __syncthreads();
        d_array[mi] = s_blk[ty][tx];
    }
}
int main(void)
{
    // Declare and initialize some array on host
    int const NUM_TOT_X = 4*X_THRDS_PER_BLK;
    int const NUM_TOT_Y = 6*Y_THRDS_PER_BLK;
    int * h_array = (int *)malloc(NUM_TOT_X*NUM_TOT_Y*sizeof(int));
    for (int i = 0; i < NUM_TOT_X*NUM_TOT_Y; ++i) h_array[i] = i;
    // Copy array to device
    int * d_array;
    cudaMalloc((void **)&d_array, NUM_TOT_X*NUM_TOT_Y*sizeof(int));
    cudaMemcpy(d_array, h_array, NUM_TOT_X*NUM_TOT_Y*sizeof(int), cudaMemcpyHostToDevice);
    // Declare block and thread variables
    dim3 thdsPerBlk;
    dim3 blks;
    thdsPerBlk.x = X_THRDS_PER_BLK;
    thdsPerBlk.y = Y_THRDS_PER_BLK;
    thdsPerBlk.z = 1;
    blks.x = (NUM_TOT_X + X_THRDS_PER_BLK - 1)/X_THRDS_PER_BLK;
    blks.y = (NUM_TOT_Y + Y_THRDS_PER_BLK - 1)/Y_THRDS_PER_BLK;
    blks.z = 1;
    // Run kernel
    KernelWithSharedMemoryAndAtomicFunction<<<blks, thdsPerBlk>>>(d_array, NUM_TOT_X, NUM_TOT_Y);
    // Cleanup
    free (h_array);
    cudaFree(d_array);
    return 0;
}
Anyway, if you comment out the "s_blk" declaration towards the top of the kernel and uncomment the commented-out declaration immediately following it, then you should get the following error:
error : no instance of overloaded function "atomicMin" matches the argument list
I do not understand why declaring the shared memory as volatile would affect its type, as (I think) this error message is indicating, nor why it cannot be used with atomic operations.
Can anyone please provide any insight?
Thanks,
Aaron
Just replace
atomicMin(&s_blk[ty][tx], 4);
with
atomicMin((int *)&s_blk[ty][tx], 4);
The cast converts &s_blk[ty][tx] from volatile int * to int *, so it matches the parameter type of atomicMin(..).
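The error is ordinary C++ overload resolution: the int overload of atomicMin takes a plain int *, and a volatile int * does not convert to that implicitly. The host-only sketch below reproduces the situation with a made-up stand-in; min_in_place, clamp_volatile_cell and volatile_cast_demo are hypothetical names, and min_in_place is not atomic. It only mirrors the parameter shape of atomicMin.

```cpp
#include <cassert>

// Hypothetical stand-in with the same parameter shape as the int overload
// of atomicMin: it takes a plain `int *`. It is NOT atomic.
int min_in_place(int *address, int val) {
    int old = *address;
    if (val < old) *address = val;
    return old;  // like atomicMin, hand back the previous value
}

int clamp_volatile_cell(volatile int &cell, int val) {
    // min_in_place(&cell, val);  // does not compile: volatile int * -> int *
    // The cast drops the volatile qualifier so the pointer type matches.
    return min_in_place(const_cast<int *>(&cell), val);
}

void volatile_cast_demo() {
    int storage = 9;               // defined non-volatile, viewed as volatile below
    volatile int &cell = storage;
    int previous = clamp_volatile_cell(cell, 4);
    assert(previous == 9);
    assert(storage == 4);
}
```

Here const_cast<int *> does the same job as the C-style (int *) cast in the answer. One consequence worth keeping in mind is that the access made inside the called function is no longer treated as volatile.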

Cuda kernel producing the resultant vector as zero

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
{
    for(int j = 0; j < rowsize;j++)
    {
        for(int k = 0;k < colsize;k++)
        {
            if(Aj[j] == Bi[k])
            {
                return true;
            }
        }
    }
    return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    if(tid < cols)
    {
        int beg = Bptr[tid];
        int end = Bptr[tid+1];
        for(i = 0;i < rows;i++)
        {
            int cbeg = Aptr[i];
            int cend = Aptr[i+1];
            if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
            {
                Cjc[tid+1] += 1;
                //atomicAdd(Cjc+tid+1,1);
            }
        }
    }
}
And here is how I decide the configuration of the grid and blocks:
int numBlocks,numThreads;
if(q % 32 == 0)
{
    numBlocks = q/32;
    numThreads = 32;
}
else
{
    numBlocks = (q+31)/32;
    numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I am using a GTX 480 with CC 2.0.
Now the problem that I am facing is that whenever q increases beyond 4096, the values in the Cjc array all come out as 0.
I know the maximum number of blocks that I can use in the X direction is 65535 and each block can have dimensions of at most (1024,1024,64), with at most 1024 threads in total. Then why does this kernel calculate the wrong output for the Cjc array?
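As a side note on the launch configuration above: the two branches compute the same value, because (q+31)/32 already equals q/32 when q is a multiple of 32. A minimal host-only sketch of that rounding-up division (blocks_for and check_blocks_for are illustrative names, not part of the original code):

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n.
int blocks_for(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

void check_blocks_for() {
    assert(blocks_for(4096, 32) == 4096 / 32);  // multiple of 32: same as q/32
    assert(blocks_for(4097, 32) == 129);        // one extra, partly filled block
}
```

The if(tid < cols) guard in the kernel then takes care of the surplus threads in the last block, so a single numBlocks = (q+31)/32 would cover both cases.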
It seems like there are a couple of things wrong with the code you posted:
1. I guess findkernel is kernel in the CUDA code above?
2. kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
3. In kernel, you test if(tid < cols) - I guess this should be if(tid < count)?
4. Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
5. Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
OK, so I finally figured out, by checking the returned cudaError_t, that when I tried to cudaMemcpy the d_Cjc array from device to host, it reported the following error:
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel take long enough that the display driver terminates the program because of the OS 'watchdog' time limit.
I believe I will have to shut down the X server, or remove the display from the GPU machine and ssh into it from another machine. That way the calculations will no longer be subject to the OS 'watchdog' limit.