cuSolverRf sample status alloc failed - cuda

Running the cuSolverRf sample with the provided .mtx files lap2D_5pt_n100.mtx and lap3D_7pt_n20.mtx works smoothly. However, when I use my own .mtx file, I get an error after step 8:
"CUDA error at cuSolverRf.cpp:649 code=2..."
I've narrowed down the problem to here:
checkCudaErrors(cusolverRfSetupHost(
    rowsA, nnzA,
    h_csrRowPtrA, h_csrColIndA, h_csrValA,
    nnzL,
    h_csrRowPtrL, h_csrColIndL, h_csrValL,
    nnzU,
    h_csrRowPtrU, h_csrColIndU, h_csrValU,
    h_P,
    h_Q,
    cusolverRfH));
which jumps into this check helper (from helper_cuda.h):
template <typename T>
void check(T result, char const *const func, const char *const file, int const line)
{
    if (result)
    {
        fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
                file, line, static_cast<unsigned int>(result), _cudaGetErrorEnum(result), func);
        DEVICE_RESET
        // Make sure we call CUDA Device Reset before exiting
        exit(EXIT_FAILURE);
    }
}
My question is: how is "result" derived, and what can I do to overcome the problem? What am I doing wrong?
Additional info: my matrix is 196530 by 196530 with 2530798 nnz.

The error code 2 corresponds to CUSOLVER_STATUS_ALLOC_FAILED. Quoting the cuSOLVER documentation:
Resource allocation failed inside the cuSolver library. This is usually caused by a cudaMalloc() failure.
To correct: prior to the function call, deallocate previously allocated memory as much as possible.
As for how "result" is derived: checkCudaErrors is a macro that passes the status value returned by the wrapped call (here cusolverRfSetupHost) into check(), so any non-zero status triggers the error path. The failure means memory for your matrix could not be allocated, most likely because it exceeds your GPU's memory. Try deallocating memory beforehand (as the documentation suggests), use a smaller input matrix, or use a GPU with more memory.
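For reference, helper_cuda.h in the CUDA samples defines the macro along these lines:

#define checkCudaErrors(val) check((val), #val, __FILE__, __LINE__)

As a quick sanity check, you can also compare the device's free memory against a rough estimate of the CSR footprint of A, L and U before the call. This is a minimal sketch, not part of the sample; it reuses the variables from your snippet, assumes double-precision values, and the 4x headroom factor is a guess, since cuSolverRF allocates additional internal workspace beyond the raw factors:

size_t free_bytes = 0, total_bytes = 0;
checkCudaErrors(cudaMemGetInfo(&free_bytes, &total_bytes));

// Raw CSR footprint: one value + one column index per nonzero,
// plus three row-pointer arrays of rowsA + 1 entries each.
size_t csr_bytes = (size_t)(nnzA + nnzL + nnzU) * (sizeof(double) + sizeof(int))
                 + (size_t)(rowsA + 1) * 3 * sizeof(int);

if (4 * csr_bytes > free_bytes)  // 4x headroom is an assumption, not a documented bound
    fprintf(stderr, "Warning: refactorization data may not fit (%zu bytes free)\n", free_bytes);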

Related

cudaMemGetInfo runs asynchronously with cudaMalloc

I have been trying to see whether I can cudaMalloc the amount of free memory returned by cudaMemGetInfo. But I encounter a strange problem: cudaMalloc seems to run before cudaMemGetInfo, as a result of which the latter reports the available memory as zero. How do I enforce that the calls are not reordered?
Here is the code:
#include <stdio.h>
#include <cuda.h>
#define cudaMallocError(s) error = cudaGetLastError(); \
    if (error != cudaSuccess) \
    { \
        printf("CUDA Error: %s\n", cudaGetErrorString(error)); \
        printf("Failed to cudaMalloc %s\n", s); \
        exit(1); \
    }
int main()
{
    size_t f, t;
    int *x;
    cudaError_t error;

    cudaMemGetInfo(&f, &t);
    error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("cudaMemGetInfo went wrong!\n");
        printf("Error: %s\n", cudaGetErrorString(error));
    }
    printf("Available memory = %zu\n", f);
    cudaDeviceSynchronize();
    cudaMalloc(&x, f);
    cudaMallocError("x");
    cudaFree(x);
    printf("Success\n");
    return 0;
}
It triggers both error-handling paths. This is the output:
cudaMemGetInfo went wrong!
Error: out of memory
Available memory = 0
CUDA Error: out of memory
Failed to cudaMalloc x
But if I altogether remove the call to cudaMalloc, then it shows available memory as some non-zero value, clearly indicating that it is calling cudaMalloc before cudaMemGetInfo, even though the latter appears before the former in program order. Why is this so?
There is no reordering; cudaMalloc is executed after cudaMemGetInfo.
You are probably just observing physical memory allocation granularity: the requested bytes are rounded up to the physical page size, and if that results in more physical memory being requested than is available, the allocation fails.
On my machine, it seems to be sufficient to round the free byte count down to the nearest lower multiple of 2 megabytes.
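A minimal sketch of that workaround, assuming a 2 MiB granularity (the actual granularity is not documented and may differ between devices and drivers):

size_t f, t;
cudaMemGetInfo(&f, &t);

// Round the reported free bytes down to a 2 MiB boundary before allocating.
const size_t granularity = 2 * 1024 * 1024;
size_t request = (f / granularity) * granularity;

int *x = NULL;
if (cudaMalloc(&x, request) == cudaSuccess)
    cudaFree(x);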

CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error

When I call a CUDA kernel (a __global__ function) through a function pointer, everything appears to work just fine. However, if I forget to provide the launch configuration when calling the kernel, NVCC emits no error or warning; the program compiles, and then crashes when I attempt to run it.
#include <cstdio>

__global__ void bar(float x) { printf("foo: %f\n", x); }

typedef void (*FuncPtr)(float);

void invoker(FuncPtr func)
{
    func<<<1, 1>>>(1.0f);
}

int main()
{
    invoker(bar);
    cudaDeviceSynchronize();
    return 0;
}
Compile and run the above; everything works just fine. Then remove the kernel's launch configuration (i.e., <<<1, 1>>>). The code still compiles, but it crashes when you try to run it.
Any idea what is going on? Is this a bug, or am I not supposed to pass around pointers to __global__ functions?
CUDA version: 8.0
OS version: Debian (Testing repo)
GPU: NVIDIA GeForce 750M
If we take a slightly more complex version of your repro, and look at the code emitted by the CUDA toolchain front-end, it becomes possible to see what is happening:
#include <cstdio>

__global__ void bar_func(float x) { printf("foo: %f\n", x); }

typedef void (*FuncPtr)(float);

void invoker(FuncPtr passed_func)
{
#ifdef NVCC_FAILS_HERE
    bar_func(1.0);
#endif
    bar_func<<<1,1>>>(1.0);
    passed_func(2.0);
    passed_func<<<1,1>>>(3.0);
}
So let's compile it a couple of ways:
$ nvcc -arch=sm_52 -c -DNVCC_FAILS_HERE invoker.cu
invoker.cu(10): error: a __global__ function call must be configured
i.e. the front end can detect that bar_func is a __global__ function and requires launch parameters. Another attempt:
$ nvcc -arch=sm_52 -c -keep invoker.cu
As you note, this produces no compile error. Let's look at what happened:
void bar_func(float x) ;
# 5 "invoker.cu"
typedef void (*FuncPtr)(float);
# 7 "invoker.cu"
void invoker(FuncPtr passed_func)
# 8 "invoker.cu"
{
# 12 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : (bar_func)((1.0));
# 13 "invoker.cu"
passed_func((2.0));
# 14 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : passed_func((3.0));
# 15 "invoker.cu"
}
The standard kernel invocation syntax <<<>>> gets expanded into an inline call to cudaConfigureCall, and then a host wrapper function is called. The host wrapper has the API internals required to launch the kernel:
void bar_func( float __cuda_0)
# 3 "invoker.cu"
{__device_stub__Z8bar_funcf( __cuda_0); }
void __device_stub__Z8bar_funcf(float __par0)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
{ volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(float))bar_func));
(void)cudaLaunch(((char *)((void ( *)(float))bar_func)));
};
}
So the stub only handles the arguments and launches the kernel via cudaLaunch; it does not handle the launch configuration.
The underlying reason for the crash (actually an undetected runtime API error) is that the kernel launch happens without a prior configuration call. This happens because neither the CUDA front end nor the C++ compiler can perform pointer introspection at compile time and detect that your function pointer refers to the host stub for a kernel.
I think the only way to describe this is as a "limitation" of the runtime API and compiler. I wouldn't say what you are doing is wrong, but in such a situation I would probably use the driver API and manage the kernel launch explicitly myself.
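For illustration, here is a minimal driver-API sketch of that approach. It assumes the kernel has been compiled to a PTX file (e.g. nvcc -ptx invoker.cu -o invoker.ptx; the file name is hypothetical) and uses the mangled name _Z8bar_funcf for bar_func(float). With cuLaunchKernel the launch configuration is an explicit argument, so it cannot be forgotten:

#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kfunc;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "invoker.ptx");               // hypothetical module name
    cuModuleGetFunction(&kfunc, mod, "_Z8bar_funcf"); // mangled name of bar_func(float)

    float x = 1.0f;
    void *args[] = { &x };
    // Grid and block dimensions are explicit parameters here.
    cuLaunchKernel(kfunc, 1, 1, 1, 1, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}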

Dynamic parallelism cudaDeviceSynchronize() crashes

I have a kernel that calls another, empty kernel. However, when the calling kernel invokes cudaDeviceSynchronize(), the kernel crashes and execution goes straight back to the host. The memory checker does not report any memory access issues.
Does anyone know what could be the reason for such uncivilized behavior?
The crash seems to happen only if I run the code from the debugger (Visual Studio -> Nsight -> Start CUDA Debugging).
The crash does not happen every time I run the code - sometimes it crashes, and sometimes it finishes ok.
Here is the complete code to reproduce the problem:
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include "device_launch_parameters.h"
#include <stdio.h>
#define CUDA_RUN(x_, err_) { cudaStatus = x_; \
    if (cudaStatus != cudaSuccess) { \
        fprintf(stderr, err_ " %d - %s\n", cudaStatus, cudaGetErrorString(cudaStatus)); \
        int k; scanf("%d", &k); goto Error; } }
struct computationalStorage {
    float rotMat;
};

__global__ void drawThetaFromDistribution() {}

__global__ void chainKernel() {
    computationalStorage *c = (computationalStorage *)malloc(sizeof(computationalStorage));
    if (!c) {
        printf("malloc error\n");
        return;
    }
    c->rotMat = 1.0f;

    int n = 1;
    while (n < 1000) {
        cudaError_t err;
        drawThetaFromDistribution<<<1, 1>>>();
        if ((err = cudaGetLastError()) != cudaSuccess)
            printf("drawThetaFromDistribution Sync kernel error: %s\n", cudaGetErrorString(err));
        printf("0");
        if ((err = cudaDeviceSynchronize()) != cudaSuccess)
            printf("drawThetaFromDistribution Async kernel error: %s\n", cudaGetErrorString(err));
        printf("1\n");
        ++n;
    }
    free(c);
}
int main() {
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    CUDA_RUN(cudaSetDevice(0), "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
    // Set to use on-chip memory as 16 KB shared / 48 KB L1
    CUDA_RUN(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1), "Can't set CUDA to use on chip memory for L1");
    // Set a large device malloc heap
    CUDA_RUN(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 10 * 192), "Can't set the Heap size");

    chainKernel<<<10, 192>>>();

    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        printf("Something was wrong! Error code: %d", cudaStatus);
    }
    CUDA_RUN(cudaDeviceReset(), "cudaDeviceReset failed!");

Error:
    int k;
    scanf("%d", &k);
    return 0;
}
If all goes well I expect to see:
00000000000000000000000....0000000000000001
1
1
1
1
....
This is what I get when everything works ok. When it crashes however:
000000000000....0000000000000Something was wrong! Error code: 30
As you can see, the statement err = cudaDeviceSynchronize(); does not finish, and execution goes straight to the host, where its own cudaDeviceSynchronize() fails with error code 30 (cudaErrorUnknown).
System: CUDA 5.5, NVIDIA Titan (headless), Windows 7 x64, Win32 application.
UPDATE: an additional NVIDIA card drives the display; Nsight version 3.2.0.13289.
That last fact may have been the critical one. You don't mention which version of Nsight VSE you are using nor your exact machine configuration (e.g. whether there are other GPUs in the machine and, if so, which one drives the display), but at least until recently it was not possible to debug a dynamic parallelism application in single-GPU mode with Nsight VSE.
The current feature matrix also suggests that single-GPU CDP debugging is not yet supported.
One possible workaround in your case would be to add another GPU to drive the display and make the Titan headless (i.e. attach no monitors to it and do not extend the Windows desktop onto that GPU).
I ran your application with and without cuda-memcheck and it does not appear to me that there are any problems with it.

Execution time issue in CUDA benchmarks

I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. For that, I execute the benchmark and the profiler simultaneously; the profiler essentially spawns a pthread that samples the GPU state using the NVML library.
The issue is that a benchmark's execution time is much higher (about 3 times) when I do not run the profiler alongside it than when the benchmark executes together with the profiler. The CPU frequency scaling governor is userspace, so I do not think the CPU frequency is changing. Could it be due to fluctuations in the GPU frequency?
Below is the code for the profiler.
#include <pthread.h>
#include <stdio.h>
#include "nvml.h"
#include "unistd.h"
#define NUM_THREADS 1
void *PrintHello(void *threadid)
{
    long tid = (long)threadid;
    // printf("Hello World! It's me, thread #%ld!\n", tid);

    nvmlReturn_t result;
    nvmlDevice_t device;
    nvmlUtilization_t utilization;
    unsigned int device_count, powergpu, clo;
    char version[80];

    result = nvmlInit();
    result = nvmlSystemGetDriverVersion(version, 80);
    printf("\n Driver version: %s \n\n", version);

    result = nvmlDeviceGetCount(&device_count);
    printf("Found %d device%s\n\n", device_count, device_count != 1 ? "s" : "");
    printf("Listing devices:\n");

    result = nvmlDeviceGetHandleByIndex(0, &device);

    while (1)
    {
        result = nvmlDeviceGetPowerUsage(device, &powergpu);
        result = nvmlDeviceGetUtilizationRates(device, &utilization);
        printf("\n%d\n", powergpu);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", utilization.gpu);
            printf("%d\n", utilization.memory);
        }
        result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clo);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", clo);
        }
        usleep(500000);
    }
    pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* Last thing that main() should do */
    pthread_exit(NULL);
}
With your profiler running, the GPU(s) are pulled out of their sleep state (due to the access to the NVML API, which queries data from the GPUs). This makes them respond much more quickly to a CUDA application, so the application appears to run "faster" if you time the entire application execution (e.g. using the Linux time command).
One solution is to place the GPUs in "persistence mode" with the nvidia-smi command (use nvidia-smi --help for command-line help).
Another solution is to do the timing from within the application, excluding the CUDA start-up time from the measurement, for example by executing a CUDA call such as cudaFree(0); before the timed region.
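(On Linux, persistence mode can typically be enabled with nvidia-smi -pm 1, run as root.) Here is a minimal sketch of the in-application timing approach; the timer and the placeholder kernel region are illustrative, not taken from the benchmarks:

#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

int main()
{
    cudaFree(0);  // forces lazy CUDA context creation, so start-up cost is paid here

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... launch the benchmark kernels here ... */
    cudaDeviceSynchronize();

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("GPU work took %.3f ms (start-up excluded)\n", ms);
    return 0;
}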

CUDA: cudaEventElapsedTime returns device not ready error

I tried to measure elapsed time on a Tesla (T10 processor), and cudaEventElapsedTime returns a "device not ready" error. But when I tested it on Fermi (Tesla M2090), it gave me the result.
Can anyone tell me what is happening?
Here is my code
cudaError_t err;
cudaEvent_t start, stop;
float elapsed_time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

err = cudaEventRecord(start, 0);
if (err != cudaSuccess) {
    printf("\n\n 1. Error: %s\n\n", cudaGetErrorString(err));
    exit(1);
}

// actual code

cudaThreadSynchronize();
err = cudaEventRecord(stop, 0);
if (err != cudaSuccess) {
    printf("\n\n2. Error: %s\n\n", cudaGetErrorString(err));
    exit(1);
}

err = cudaEventElapsedTime(&elapsed_time, start, stop);
if (err != cudaSuccess) {
    printf("\n\n 3. Error: %s\n\n", cudaGetErrorString(err));
    exit(1);
}
It's because cudaEventRecord is asynchronous: it returns immediately, regardless of the GPU's status. Asynchronous functions simply put an order on a "CUDA execution queue". When the GPU finishes its current assignment, it pops the next order and executes it. This all happens in the background, handled by the CUDA driver, separately from your program's host thread.
cudaEventRecord is an order which says, more or less: "when you are done with all previous work, flag me in this variable".
If your host thread then asks for cudaEventElapsedTime but the GPU hasn't finished its work yet, it reports "not ready yet!". cudaEventSynchronize() stalls the current host thread until the GPU reaches the cudaEventRecord order that you placed earlier. After that, you are guaranteed that cudaEventElapsedTime will have a meaningful answer for you.
cudaThreadSynchronize() is just a stronger tool: it stalls the current thread until the GPU finishes all assigned tasks, not just those up to the event.
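A minimal sketch of the fix, applied to the code from the question: synchronize on the stop event before querying the elapsed time.

err = cudaEventRecord(stop, 0);
// Wait until the GPU has actually reached the 'stop' event.
cudaEventSynchronize(stop);
err = cudaEventElapsedTime(&elapsed_time, start, stop);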
I also faced this issue, so based on the answer by @CygnusX1 I keep all my execution code in one cell and the cudaEventElapsedTime call in another cell. This solved the issue because Colab (or a Jupyter notebook) moves on to the next cell only once the current cell has finished.
Thus,
with torch.no_grad():
    model.eval() # warm up
    model(x)
    start.record()
    model(x)
    model(x)
    model(x)
    end.record()
    print('execution time in MILLISECONDS: {}'.format(start.elapsed_time(end)/3.0))
raised the "device not ready" error reported in the question, and was solved by
with torch.no_grad():
    model.eval()
    model(x) # warm up
    start.record()
    model(x)
    model(x)
    model(x)
    end.record()

# Shift the print command to the next code CELL !!!
print('execution time in MILLISECONDS: {}'.format(start.elapsed_time(end)/3.0))