cudaMalloc always gives out of memory - cuda

I'm facing a simple problem: all my calls to cudaMalloc fail with an out-of-memory error, even if it's just a single byte I'm allocating.
The CUDA device is available and there is also plenty of memory free (both checked with the corresponding API calls).
Any idea what the problem could be?

Try calling cudaSetDevice() followed by cudaDeviceSynchronize() at the very beginning of your code. Use cudaSetDevice(0) if there is only one device; by default the CUDA runtime will initialize device 0. (cudaThreadSynchronize() is just a deprecated alias for cudaDeviceSynchronize(), so it adds nothing here.)
cudaSetDevice(0);
cudaDeviceSynchronize();
Please report back what you observe. If it still fails, specify your OS, architecture, CUDA SDK version, and CUDA driver version, and if possible provide the code snippet that is failing.

Thank you everybody for your help.
The problem was not really with cudaMalloc itself; it masked the real problem, which was that CUDA initialisation was failing.
Because the first call into CUDA happened on a separate thread, I didn't have a GL context available there, leading to the failures. I had to make sure I initialised CUDA with a dummy malloc in the main thread, after the context had been created.
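A minimal sketch of the dummy-malloc initialisation described above (the function name is my own, not from the original code); the point is simply to force lazy context creation on the main thread before any worker thread touches CUDA:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Call this on the main thread, after the GL context exists.
bool initCudaContext()
{
    void *dummy = nullptr;
    // A tiny cudaMalloc (cudaFree(0) also works) forces the CUDA context
    // to be created here and now, so later calls from other code paths
    // find an already-initialized context.
    cudaError_t err = cudaMalloc(&dummy, 1);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    cudaFree(dummy);
    return true;
}
```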

Related

Returning custom cudaError or force copy to host from device

I have a CUDA kernel which is called many times and adds some values to an array of allocated size N. I keep track of the inserted elements with a device variable to which I apply atomicAdd.
When the number of added values approaches N, I would like to know about it so I can call cudaMalloc again and reallocate the array. The most obvious solution is to do a cudaMemcpy of that device variable every time the kernel is called, and thereby keep track of the size of the array on the host. What I would like to know is whether there is a way of ONLY doing the cudaMemcpy to the host when the values are approaching N.
One possible solution I thought of is setting the cudaError_t return value to 30 (cudaErrorUnknown), or some custom error, which I could later check. But I haven't found how to do it and I guess that it's not possible. Is there any way to do what I want and perform the cudaMemcpy only when the device finds that it's running out of space?
But I haven't found how to do it and I guess that it's not possible
Error codes from the runtime are set by the host driver. They are not available to the programmer, and they cannot be set in kernels either, so your guess is correct. There are assertions available in device code for debugging, and there are ways to make a kernel abort abnormally, but the latter causes context destruction and the loss of the contents of device memory, which I suspect won't help you.
About the best you can do is use a mapped host or managed allocation as a way for the host to keep track of the consumption of allocated memory on the device. Then you don't need an explicit memcpy and the latency will be minimized. But you will need some sort of synchronization with the running kernel in that case.
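A minimal sketch of the managed-allocation approach, under my own assumptions about the kernel (names and payload are made up); the counter lives in managed memory so the host can read it after synchronizing, without an explicit cudaMemcpy:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical insertion kernel: each thread claims a slot via atomicAdd.
__global__ void insertValues(int *data, int *count, int n_insert, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_insert) {
        int pos = atomicAdd(count, 1);  // counter is host-visible (managed)
        if (pos < N)
            data[pos] = i;              // placeholder payload
    }
}

int main()
{
    const int N = 1024;
    int *data, *count;
    cudaMalloc(&data, N * sizeof(int));
    cudaMallocManaged(&count, sizeof(int));  // shared between host and device
    *count = 0;

    insertValues<<<4, 256>>>(data, count, 512, N);
    cudaDeviceSynchronize();                 // required before reading *count

    if (*count > (3 * N) / 4)                // e.g. "approaching N"
        printf("time to reallocate (count = %d)\n", *count);

    cudaFree(data);
    cudaFree(count);
    return 0;
}
```

The synchronization is the price mentioned above: the host must wait for (or otherwise coordinate with) the running kernel before the counter value is trustworthy.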

Would CPU continue executing the next line code when the GPU kernel is running? Would it cause an error?

I am reading the book "CUDA by example" by Sanders, where the author mentioned that p.441: For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes. -- Highlighted mar 4, 2014
I am wondering whether this statement is correct. For example, what if the next instruction the CPU executes depends on the variables that the GPU kernel outputs? Would that cause an error? From my experience, it does not. So what does the author really mean?
Many thanks!
Yes, the author is correct. Suppose my kernel launch looks like this:
int *h_in_data, *d_in_data, *h_out_data, *d_out_data;
// code to allocate host and device pointers, and initialize host data
...
// copy host data to device
cudaMemcpy(d_in_data, h_in_data, size_of_data, cudaMemcpyHostToDevice);
mykernel<<<grid, block>>>(d_in_data, d_out_data);
// some other host code happens here
// at this point, h_out_data does not point to valid data
...
cudaMemcpy(h_out_data, d_out_data, size_of_data, cudaMemcpyDeviceToHost);
//h_out_data now points to valid data
Immediately after the kernel launch, the CPU continues executing host code. But the data generated by the device (either d_out_data or h_out_data) is not ready yet. If the host code attempts to use whatever is pointed to by h_out_data, it will just be garbage data. This data only becomes valid after the 2nd cudaMemcpy operation.
Note that using the data (h_out_data) before the 2nd cudaMemcpy will not generate an error, if by that you mean a segmentation fault or some other run time error. But any results generated will not be correct.
Kernel launches in CUDA are asynchronous by default, i.e., control returns to the CPU right after the launch. If the next instruction is another kernel launch, you don't need to worry: launches issued to the same stream are serialized, so it will execute only after the previously launched kernel has finished.
However, if the next instruction is host code that accesses the results of the kernel, you can end up reading garbage values. Therefore care has to be taken, and device synchronization should be performed as and when needed.
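The pattern both answers describe can be condensed into a small self-contained example (kernel and sizes are my own, for illustration); the explicit cudaDeviceSynchronize() marks the point before which the host must not touch the results:

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(const int *in, int *out)
{
    out[threadIdx.x] = in[threadIdx.x] * 2;  // trivial placeholder work
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    mykernel<<<1, 32>>>(d_in, d_out);
    // The launch returns immediately; reading h_out here would be garbage.
    cudaDeviceSynchronize();  // explicit barrier; the blocking cudaMemcpy
                              // below would also synchronize implicitly
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    // h_out is now valid.

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```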

How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (constant). The threads are independent except for reading from a constant array and the tree. In each thread there can be some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I used 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for up to about N = 100000 threads but fails for more than that. It also works on the CPU, and on the GPU when run thread by thread. When N is large (e.g. > 100000), the kernel crashes and the subsequent cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
I now set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1 GB; cudaDeviceSetLimit succeeds, but the problem remains.
Does anyone know a common reason for this problem, or any hints for further debugging? I tried adding some printf's, but there are tons of output. Moreover, once a thread crashes, all remaining printf output is discarded, so it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR (Timeout Detection and Recovery) event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel launches and CUDA API calls. A Windows TDR event will be more easily evident from the error codes you receive, or the error codes may steer you in another direction.
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
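A minimal sketch of the error-checking pattern suggested above (the macro name is my own; many variants of this idiom exist):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA API call; report and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage for API calls and kernel launches:
//   CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));
//   mykernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous errors (e.g. a TDR)
```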

HELP! CUDA kernel will no longer launch after using too much memory

I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, and the maximum possible number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
If you compile your code like this:
nvcc -Xptxas="-v" [other compiler options]
the assembler will report the register, shared-memory, and local-memory usage of the code. This can be a useful diagnostic of the kernel's memory footprint. There is also an API call, cudaThreadSetLimit (since superseded by cudaDeviceSetLimit), which can be used to control the amount of per-thread heap memory a kernel will try to consume during execution.
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory accesses, including buffer overflows and illegal memory usage. It might be that your code is overflowing a buffer somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
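As a sketch, the device malloc heap limit can be queried and raised with the current-API calls (the 64 MB figure here is arbitrary); the limit must be set before the first kernel launch:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("current device malloc heap: %zu bytes\n", heapSize);

    // Raise the heap available to in-kernel malloc/free.
    // cudaDeviceSetLimit is the current-API spelling of the older
    // cudaThreadSetLimit mentioned above.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    return 0;
}
```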
I got it! NVIDIA Nsight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.

CUDA host to device (or device to host) memcpy operations with application rendering graphics using OpenGL on the same graphics card

I have posted my problem in the CUDA forums, but I'm not sure if it's appropriate to post a link here for more ideas, in case the two forums have significantly different audiences. The link is here. I apologize for any inconvenience and appreciate any comments on this question, as I haven't heard back yet on some specifics of a particular CUDA memory access and management problem. Thanks in advance.
I'm not sure if this is relevant without seeing more of your code but where is CudaObj's destructor being called from?
you said:
However, if I do it this way, I run into errors when exiting the application, in the line of CudaObj's destructor where cudaFree() is called. Because of the error, the memory cleanup code that follows the CUDA context's cleanup code is not executed, leaving behind a mess of memory leaks.
After your description of how you moved the CUDA setup to the beginning of thread 2's main function: if you're calling CudaObj's destructor from a different thread, then the cudaFree cleanup will fail for the same reason you had to move the CUDA initialization into thread 2. It sounds like you know this already, but the CUDA context is specific to a single thread in your process. Cleaning up from a different thread is not supported according to the documentation, though I've never tried it myself.
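A minimal sketch of the thread-affinity rule described above (function and variable names are my own, not from the poster's code): allocation, use, and cleanup all stay on the thread that owns the context.

```cuda
#include <cuda_runtime.h>

// Entry point of the worker thread (thread 2 in the discussion above).
void thread2Main()
{
    cudaSetDevice(0);                      // context is created/bound on this thread

    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));

    // ... all kernel launches and memcpys happen on this thread ...

    cudaFree(d_buf);                       // free here, before this thread
    cudaDeviceSynchronize();               // tears down its CUDA state
}
```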
Hope this helps