cudaMallocManaged sometimes returns null pointer? - cuda

Why does CUDA's cudaMallocManaged function sometimes return a null pointer? I understand that on older architecture it isn't supported and returns a null pointer in that case, but I have a strange situation where it works sometimes but not others:
int main()
{
const int n = pow(10, 5);
int blockSize = 256;
int N_blocks = (n + blockSize - 1)/blockSize;
double *val;
cudaMallocManaged(&val, n*sizeof(double));
for (int i = 0; i < n; ++i)
{
val[i] = 0;
}
r_line<<<N_blocks, blockSize>>>(n, val);
cudaDeviceSynchronize();
works fine, and r_line returns the expected value. If I then add this to the main function
double *rng_test_array;
const int n_rng = 10;
cudaMallocManaged(&rng_test_array, n_rng*sizeof(double));
if (rng_test_array == nullptr)
{
printf("Null pointer \n");
return 1;
}
cudaFree(val);
cudaFree(rng_test_array);
}
and running it shows that the second call to cudaMallocManaged returns a null pointer. Removing the if statement, and attempting to perform any operations on the rng_test_array causes a segfault, as you'd expect. Why does this happen?
Thanks.

Robert Crovella's comment was correct; using cuda-memcheck identified the issue in the r_line kernel, which was causing issues further down the line.

Related

Cuda printf() overlapping when using multiple devices

I have a printf in my __global__ code. It works as intended most of the time. However when using a multi GPU system (typically happens when ran on an 4-8 GPU system), once in a while, the prints will merge. By once in a while Its about 100-500 lines out of 167000 lines.
I was wondering how this situation can be remedied without adding too much overhead of transferring the data back to host (if possible). I was thinking to try a mutex lock for printing but I dont think that sort of thing exists for use in the kernel. Any other solutions I could try?
Note: The actual kernel is a long running kernel usually around 20-50 minutes to complete depending on the GPU.
Note2: I barely know what I'm doing with C/C++.
Example of merged Output
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869fxLuZJu3ZkX
qVOuKlQ0ZcMrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
What it should look like
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869f
MrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a
qVOuKlQ0Zc,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
xLuZJu3ZkX,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
My Example Code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
unsigned char device;
} HandlerInput;
pthread_mutex_t solutionLock;
__global__ void kernel(unsigned long baseSeed) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
BYTE random[RANDOM_LEN];
BYTE data[DIGEST_LEN];
SHA256_CTX ctx;
/* Randomization routine*/
d_getRandomString((unsigned long)idx + baseSeed, random);
/* Hashing routine*/
sha256_hash(&ctx, random, data, RANDOM_LEN);
/* Print to console - randomStr,Hash */
printf("%s,%s\n", random, data);
}
void *launchGPUHandlerThread(void *vargp) {
HandlerInput *hi = (HandlerInput *)vargp;
cudaSetDevice(hi->device);
unsigned long rngSeed = timeus();
while (1) {
hostRandomGen(&rngSeed);
kernel<<<BLOCKS, THREADS>>>(rngSeed);
cudaDeviceSynchronize();
}
cudaDeviceReset();
return NULL;
}
int main() {
int GPUS;
cudaGetDeviceCount(&GPUS);
pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
for (int i = 0; i < GPUS; i++) {
HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
hi->device = i;
pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
usleep(23);
}
pthread_mutex_lock(&solutionLock);
for (int i = 0; i < GPUS; i++)
pthread_join(tids[i], NULL);
return 0;
}
I spent 4 days trying different things to no avail. I really don't understand memory management enough in C/C++ to get past the endless segmentation fault errors.
What I ended up doing was using Unified Memory as it seemed the easiest way to handle the memory for both device and host and it doesn't seem to add too much overhead to the whole process. Then each cpu thread (gpu) can write to its own file. I ran a couple of nvprof and it seemed that after the initial setup for the memory cudaMallocManaged the rest of the overhead seemed to be measured in the microseconds. Since each loop takes 20 minutes these are really barely noticeable.
I created two __device__ functions to copy the data over to the host accessible arrays, because I wanted to utilize the #pragma unroll feature. Not really sure if that helps or what it even does, but I decided to do things this way.
If anyone has further suggestions on ways to improve I am open to trying more things out.
Here is my new example code:
#define BLOCKS 384
#define THREADS 64
typedef struct HandlerInput {
unsigned char device;
} HandlerInput;
__device__ void mycpydigest(__restrict__ BYTE *dst, __restrict__ const BYTE *src) {
#pragma unroll 64
for (BYTE i = 0; i < 64; i++) {
dst[i] = src[i];
}
dst[64] = '\0';
}
__device__ void mycpyrandom(__restrict__ BYTE *dst, __restrict__ const BYTE *src) {
#pragma unroll 10
for (BYTE i = 0; i < 10; i++) {
dst[i] = src[i];
}
dst[10] = '\0';
}
__global__ void kernel(BYTE **d_random, BYTE **d_hashes, unsigned long baseSeed) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
BYTE random[RANDOM_LEN];
BYTE data[DIGEST_LEN];
SHA256_CTX ctx;
/* Randomization routine*/
d_getRandomString((unsigned long)idx + baseSeed, random);
/* Hashing routine*/
sha256_hash(&ctx, random, data, RANDOM_LEN);
/* Send to host - randomStr & Hash */
mycpydigest(d_hashes[idx], data);
mycpyrandom(d_random[idx], random);
}
void *launchGPUHandlerThread(void *vargp) {
HandlerInput *hi = (HandlerInput *)vargp;
cudaSetDevice(hi->device);
unsigned long rngSeed = timeus();
int threadBlocks = hi->BLOCKS * hi->THREADS;
BYTE **randoms;
BYTE **hashes;
cudaMallocManaged(&randoms, sizeof(BYTE *) * (threadBlocks), cudaMemAttachGlobal);
cudaMallocManaged(&hashes, sizeof(BYTE *) * (threadBlocks), cudaMemAttachGlobal);
for (int i = 0; i < threadBlocks; i++) {
cudaMallocManaged(&randoms[i], sizeof(BYTE) * (RANDOM_LEN), cudaMemAttachGlobal);
cudaMallocManaged(&hashes[i], sizeof(BYTE) * (DIGEST_LEN), cudaMemAttachGlobal);
}
while (1) {
hostRandomGen(&rngSeed);
kernel<<<hi->BLOCKS, hi->THREADS>>>(randoms, hashes, rngSeed);
cudaDeviceSynchronize();
print2File(randoms, hashes, threadBlocks, hi->device)
}
cudaFree(hashes);
cudaFree(randoms);
cudaDeviceReset();
return NULL;
}
int main() {
int GPUS;
cudaGetDeviceCount(&GPUS);
pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
for (int i = 0; i < GPUS; i++) {
HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
hi->device = i;
pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
usleep(23);
}
for (int i = 0; i < GPUS; i++)
pthread_join(tids[i], NULL);
return 0;
}
I want to thank #paleonix for the help in the comments. I was working on this issue for a week before I posted and your comments helped guide me down a different path.

Has cudaMalloc changed to be asynchronous?

I've read in other places that cudaMalloc will synchronize across kernels.
(e.g. will cudaMalloc synchronize host and device?)
However, I just tested this code out and based on what I'm seeing in the visual profiler, it seems like cudaMalloc is not synchronizing. if you add cudaFree into the loop, that does synchronize. I'm using CUDA 7.5. Does anyone know if cudaMalloc changed its behavior? Or am I missing some subtlety? Thanks very much!
__global__ void slowKernel()
{
float input = 5;
for( int i = 0; i < 1000000; i++ ){
input = input * .9999999;
}
}
__global__ void fastKernel()
{
float input = 5;
for( int i = 0; i < 100000; i++ ){
input = input * .9999999;
}
}
void mallocSynchronize(){
cudaStream_t stream1, stream2;
cudaStreamCreate( &stream1 );
cudaStreamCreate( &stream2 );
slowKernel <<<1, 1, 0, stream1 >>>();
int *dev_a = 0;
for( int i = 0; i < 10; i++ ){
cudaMalloc( &dev_a, 4 * 1024 * 1024 );
fastKernel <<<1, 1, 0, stream2 >>>();
// cudaFree( dev_a ); // If you uncomment this, the second fastKernel launch will wait until slowKernel completes
}
}
Your methodology is flawed, but you conclusion looks correct to me (if you look at your profile data you should see that both long and short kernels are taking the same amount of time and run very quickly, because aggressive compiler optimisation is eliminating all the code in both cases).
I turned your example into something more reasonable
#include <time.h>
__global__ void slowKernel(float *output, bool write=false)
{
float input = 5;
#pragma unroll
for( int i = 0; i < 10000000; i++ ){
input = input * .9999999;
}
if (write) *output -= input;
}
__global__ void fastKernel(float *output, bool write=false)
{
float input = 5;
#pragma unroll
for( int i = 0; i < 100000; i++ ){
input = input * .9999999;
}
if (write) *output -= input;
}
void burntime(long val) {
struct timespec tv[] = {{0, val}};
nanosleep(tv, 0);
}
void mallocSynchronize(){
cudaStream_t stream1, stream2;
cudaStreamCreate( &stream1 );
cudaStreamCreate( &stream2 );
const size_t sz = 1 << 21;
slowKernel <<<1, 1, 0, stream1 >>>((float *)(0));
burntime(500000000L); // 500ms wait - slowKernel around 1300ms
int *dev_a = 0;
for( int i = 0; i < 10; i++ ){
cudaMalloc( &dev_a, sz );
fastKernel <<<1, 1, 0, stream2 >>>((float *)(0));
burntime(1000000L); // 1ms wait - fastKernel around 15ms
}
}
int main()
{
mallocSynchronize();
cudaDeviceSynchronize();
cudaDeviceReset();
return 0;
}
[note requires POSIX time functions so this won't run on Windows]
On a fairly fast Maxwell device (GTX970), I see that cudaMalloc calls in the loop overlap with the still executing slowKernel call in the profile trace, and then with running fastKernel calls in the other stream. I was willing to accept the initial conclusion that minor timing variations could be cause the effect you saw in your broken example. However, in this code, 0.5 seconds time shift in synchronisation between the host and device traces seems very improbable. You might need to vary the duration of the burntime calls to get the same effect, depending on how fast your GPU is.
So this is a very long way of saying, yes it looks like it is a non-synchronising call on Linux with CUDA 7.5 and a Maxwell device. I don't believe that has always been the case, but then again the documentation has never, as best as I can tell, said whether is should block/synchronize or not. I don't have access to older CUDA versions and supported hardware to see what this example would do with an older driver and a Fermi or Kepler device.

Cuda kernel - possible optimizations

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
My launch configurations and kernel call are as follows.
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I have to admit, this kernel is running pretty slow.Once I get the array back to host side, I use thrust::inclusive_scan to find my resultant array.
My question is, is there any room for improvement / optimization for my kernel? I tried using shared memory but its producing either wrong answers or throwing runtime exceptions.
Also, how does the dynamically allocated shared memory ( which is allocated by third parameter in kernel launch ) is distributed among the blocks?
Any help/hints/insinuations will be appreciated.
Thanks in advance.
As for the shared memory allocated using kernel<<<blocks,threads,mem>>> mem is the amount of memory allocated each block. So each block gets mem amount of memory.
For your code, I don't understand why are there 2 for loops in the mult function. Just want to point out that each thread will be executing these 2 for loops. Moreover, as you also have a for loop in the kernel function, it means that each thread will be executing the 2 for loops in the mult function several times. THis is slow. Moreover, doing
int beg = Bptr[tid];
int end = Bptr[tid+1];
is not exactly coalesced access. Non coalesced access is slow.

Cuda kernel producing the resultant vector as zero

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
And here is how I decide the configuration of grid and blocks
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I am using GTX 480 with CC 2.0.
Now the problem that I am facing is that whenever q increases beyond 4096 the values in Cjc array are all produced as 0.
I know maximum number of blocks that I can use in X direction is 65535 and each block can have at most (1024,1024,64) threads. Then why does this kernel calculate the wrong output for Cjc array?
I seems like there are a couple of things wrong with the code you posted:
I guess findkernel is kernel in the CUDA code above?
kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
In kernel, you test if(tid < cols) - I guess this should be if(tid < count)??
Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
OK so I finally figured out using cudaError_t that when I tried to cudaMemcpy the d_Cjc array from device to host, it throws following error.
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel are taking reasonably large amount of time which causes the display driver to terminate the program because of OS 'watchdog' time limit.
I believe I will have to shut down X server or ssh my gpu machine (from another machine) by removing its display.This will buy me some time to do the calculations that will not exceed the 'watchdog' limit of OS.

CUDA: Fermi (Tesla M2090) generating CUDA_EXCEPTION_10 without reason

I have a small piece of code which runs perfectly on Nvidia old architecture (Tesla T10 processor) but not on Fermi (Tesla M2090)
I learned that Fermi behaves slightly differently. Due to which unsafe code might work correctly on old architectures, while on Fermi it catches the bug.
But I don't know how to resolve it.
Here is my code:
__global__ void exec (int *arr_ptr, int size, int *result) {
int tx = threadIdx.x;
int ty = threadIdx.y;
*result = arr_ptr[-2];
}
void run(int *arr_dev, int size, int *result) {
cudaStream_t stream = 0;
int *arr_ptr = arr_dev + 5;
dim3 threads(1,1,1);
dim3 grid (1,1);
exec<<<grid, threads, 0, stream>>>(arr_ptr, size, result);
}
since I am accessing arr_ptr[-2], the fermi throws CUDA_EXCEPTION_10, Device Illegal Address. But it is not. The address is legal.
Can anyone help me on this.
My driver code is
int main(){
int *arr;
int *arr_dev = NULL;
int result = 1;
arr = (int*)malloc(10*sizeof(int));
for(int i = 0; i < 10; i++)
arr[i] = i;
if(arr_dev == NULL)
{
cudaMalloc((void**)&arr_dev, 10);
cudaMemcpy(arr_dev, arr, 10*sizeof(int), cudaMemcpyHostToDevice);
}
run(arr_dev, 10, &result);
printf("%d \n", result);
return 0;
}
Fermi cards have much better memory protection on the device and will detect out of bounds conditions which will appear to "work" on older cards. Use cuda-memchk (or the cuda-memchk mode in cuda-gdb) to get a better handle on what is going wrong.
EDIT:
This is the culprit:
cudaMalloc((void**)&arr_dev, 10);
which should be
cudaMalloc((void**)&arr_dev, 10*sizeof(int));
This will result in this code
int *arr_ptr = arr_dev + 5;
passing a pointer to the device which is out of bounds.