cudaMemGetInfo runs asynchronously with cudaMalloc - cuda

I have been trying to see whether we can cudaMalloc the amount of free memory returned by cudaMemGetInfo. But I encounter a strange problem: the cudaMalloc seems to run before the cudaMemGetInfo, as a result of which the latter returns available memory as zero. How do I enforce no reordering of the calls?
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define cudaMallocError(s) error = cudaGetLastError();\
    if (error != cudaSuccess)\
    {\
        printf("CUDA Error: %s\n", cudaGetErrorString(error));\
        printf("Failed to cudaMalloc %s\n", s);\
        exit(1);\
    }

int main()
{
    size_t f, t;
    int *x;
    cudaError_t error;

    cudaMemGetInfo(&f, &t);
    error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("cudaMemGetInfo went wrong!\n");
        printf("Error: %s\n", cudaGetErrorString(error));
    }
    printf("Available memory = %ld\n", f);

    cudaDeviceSynchronize();
    cudaMalloc(&x, f);
    cudaMallocError("x");

    cudaFree(x);
    printf("Success\n");
    return 0;
}
It triggers both error-handling paths. This is the output:
cudaMemGetInfo went wrong!
Error: out of memory
Available memory = 0
CUDA Error: out of memory
Failed to cudaMalloc x
But if I remove the call to cudaMalloc altogether, it reports the available memory as some non-zero value, which to me clearly indicates that cudaMalloc is being called before cudaMemGetInfo, even though the latter appears before the former in program order. Why is this so?

There is no reordering; cudaMalloc is executed after cudaMemGetInfo.
You are probably just observing physical memory allocation granularity: the requested byte count is rounded up to the physical allocation page size, and if that rounded request exceeds the memory actually available, the allocation fails.
On my machine, rounding the free byte count down to the nearest multiple of 2 megabytes before allocating seems to be sufficient.
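A minimal sketch of that workaround (the 2 MiB granularity is just what I observed on my machine, not a documented value, so treat it as an assumption):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    size_t f = 0, t = 0;
    cudaMemGetInfo(&f, &t);

    const size_t granularity = 2 * 1024 * 1024;        // assumed allocation granularity
    size_t request = (f / granularity) * granularity;  // round free bytes down to a 2 MiB multiple

    int *x = NULL;
    cudaError_t err = cudaMalloc((void **)&x, request);
    printf("Requested %zu of %zu free bytes: %s\n",
           request, f, cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(x);
    return 0;
}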

Related

cudaGetLastError. Which kernel execution raised it?

I have implemented a pipeline where many kernels are launched in a specific stream. The kernels are enqueued into the stream and executed when the scheduler decides it’s best.
In my code, after every kernel enqueue, I check if there's any error by calling cudaGetLastError, which, according to the documentation, "returns the last error from a runtime call" and "may also return error codes from previous asynchronous launches". Thus, if the kernel has only been enqueued, not executed, I understand that the returned error only reflects whether the kernel was enqueued correctly (parameter checking, grid and block size, shared memory, etc.).
My problem is: I enqueue many different kernels without waiting for each one's execution to finish. Now imagine I have a bug in one of my kernels (let's call it Kernel1) which causes an illegal memory access, for instance. If I check cudaGetLastError right after enqueuing it, the return value is success because it was correctly enqueued. So my CPU thread moves on and keeps enqueuing kernels to the stream. At some point Kernel1 is executed and raises the illegal memory access. The next time I check cudaGetLastError I get the error, but by then the CPU thread is further along in the code. Consequently, I know there has been an error, but I have no idea which kernel raised it.
An option is to synchronize (block the CPU thread) until the execution of every kernel has finished and then check the error code, but this is not an option for performance reasons.
The question is: is there any way to query which kernel raised a given error code returned by cudaGetLastError? If not, what is, in your opinion, the best way to handle this?
There is an environment variable CUDA_LAUNCH_BLOCKING which you can use to serialize kernel execution of an otherwise asynchronous sequence of kernel launches. This should allow you to isolate the kernel instance which is causing an error, either via internal error checking in your host code, or via an external tool like cuda-memcheck.
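If blocking every launch for a whole run is too heavy-handed, a rough equivalent for debug builds only is to synchronize after each launch behind a compile-time flag. Here is a minimal sketch; DEBUG_SYNC_LAUNCHES and CHECK_LAUNCH are illustrative names of my own, not CUDA APIs:

#include <cstdio>
#include <cuda_runtime.h>

#ifdef DEBUG_SYNC_LAUNCHES
// Debug build: force each launch to complete so the failing kernel is pinpointed.
#define CHECK_LAUNCH(name)                                                      \
  do {                                                                          \
    cudaError_t e = cudaGetLastError();                /* launch/config errors */ \
    if (e == cudaSuccess) e = cudaDeviceSynchronize(); /* execution errors */   \
    if (e != cudaSuccess)                                                       \
      fprintf(stderr, "kernel %s failed at %s:%d: %s\n",                        \
              name, __FILE__, __LINE__, cudaGetErrorString(e));                 \
  } while (0)
#else
// Release build: only check that the kernel was enqueued correctly.
#define CHECK_LAUNCH(name)                                                      \
  do {                                                                          \
    cudaError_t e = cudaGetLastError();                                         \
    if (e != cudaSuccess)                                                       \
      fprintf(stderr, "kernel %s enqueue failed: %s\n",                         \
              name, cudaGetErrorString(e));                                     \
  } while (0)
#endif

// Usage after any launch: Kernel1<<<grid, block, 0, stream>>>(args); CHECK_LAUNCH("Kernel1");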
I have tested 3 different options:
1. Set the CUDA_LAUNCH_BLOCKING environment variable to 1. This forces the CPU thread to block until each kernel execution has finished, so we can check after each execution whether there has been an error, catching the exact point of failure. This has an obvious performance impact, but it may help to bound the bug in a production environment without having to make any change on the client side.
2. Distribute the production code compiled with the flag -lineinfo and run it again under cuda-memcheck. This has no performance impact and requires no change on the client side either. However, we have to execute the binary in a slightly different environment, and in some cases, such as a service running GPU tasks, that can be difficult to achieve.
3. Insert a callback after each kernel call. In the userData parameter, include a unique id for the kernel call, and possibly some information on the parameters used. This can be distributed directly in a production environment and always gives us the exact point of failure, with no change needed on the client side. However, the performance impact of this approach is huge: apparently the callback functions are processed by a driver thread, which accounts for the impact. I wrote some code to test it:
#include <cuda_runtime.h>
#include <vector>
#include <chrono>
#include <iostream>

#define BLOC_SIZE 1024
#define NUM_ELEMENTS BLOC_SIZE * 32
#define NUM_ITERATIONS 500

__global__ void KernelCopy(const unsigned int *input, unsigned int *result) {
  unsigned int pos = blockIdx.x * BLOC_SIZE + threadIdx.x;
  result[pos] = input[pos];
}

void CUDART_CB myStreamCallback(cudaStream_t stream, cudaError_t status, void *data) {
  if (status) {
    std::cout << "Error: " << cudaGetErrorString(status) << "-->";
  }
}

#define CUDA_CHECK_LAST_ERROR cudaStreamAddCallback(stream, myStreamCallback, nullptr, 0)

int main() {
  cudaError_t c_ret;
  c_ret = cudaSetDevice(0);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  unsigned int *input;
  c_ret = cudaMalloc((void **)&input, NUM_ELEMENTS * sizeof(unsigned int));
  if (c_ret != cudaSuccess) {
    return -1;
  }

  std::vector<unsigned int> h_input(NUM_ELEMENTS);
  for (unsigned int i = 0; i < NUM_ELEMENTS; i++) {
    h_input[i] = i;
  }

  c_ret = cudaMemcpy(input, h_input.data(), NUM_ELEMENTS * sizeof(unsigned int), cudaMemcpyKind::cudaMemcpyHostToDevice);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  unsigned int *result;
  c_ret = cudaMalloc((void **)&result, NUM_ELEMENTS * sizeof(unsigned int));
  if (c_ret != cudaSuccess) {
    return -1;
  }

  cudaStream_t stream;
  c_ret = cudaStreamCreate(&stream);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point end;

  start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
    dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
    KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
    CUDA_CHECK_LAST_ERROR;
  }
  cudaStreamSynchronize(stream);
  end = std::chrono::steady_clock::now();
  std::cout << "With callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';

  start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
    dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
    KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
    c_ret = cudaGetLastError();
    if (c_ret) {
      std::cout << "Error: " << cudaGetErrorString(c_ret) << "-->";
    }
  }
  cudaStreamSynchronize(stream);
  end = std::chrono::steady_clock::now();
  std::cout << "Without callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';

  c_ret = cudaStreamDestroy(stream);
  if (c_ret != cudaSuccess) {
    return -1;
  }
  c_ret = cudaFree(result);
  if (c_ret != cudaSuccess) {
    return -1;
  }
  c_ret = cudaFree(input);
  if (c_ret != cudaSuccess) {
    return -1;
  }
  return 0;
}
Output:
With callback took (ms): 47.8729
Without callback took (ms): 1.9317
(CUDA 9.2, Windows 10, Visual Studio 2015, Nvidia Tesla P4)
To me, in a production environment, the only valid approach is number 2.

cuSolverRf sample status alloc failed

Running the cuSolverRf sample with the provided .mtx files lap2D_5pt_n100.mtx and lap3D_7pt_n20.mtx allows the program to run smoothly. However, when I use my own .mtx file, I get an error after step 8:
"CUDA error at cuSolverRf.cpp:649 code=2..."
I've narrowed down the problem to here:
checkCudaErrors(cusolverRfSetupHost(
    rowsA, nnzA,
    h_csrRowPtrA, h_csrColIndA, h_csrValA,
    nnzL,
    h_csrRowPtrL, h_csrColIndL, h_csrValL,
    nnzU,
    h_csrRowPtrU, h_csrColIndU, h_csrValU,
    h_P,
    h_Q,
    cusolverRfH));
Which would jump to
template <typename T>
void check(T result, char const *const func, const char *const file, int const line)
{
    if (result)
    {
        fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
                file, line, static_cast<unsigned int>(result), _cudaGetErrorEnum(result), func);
        DEVICE_RESET
        // Make sure we call CUDA Device Reset before exiting
        exit(EXIT_FAILURE);
    }
}
My question is: how is the "result" derived, and what can I do to overcome the problem, or what am I doing wrong?
Additional info: my matrix is 196530 by 196530 with 2530798 nnz.
Error code 2 corresponds to CUSOLVER_STATUS_ALLOC_FAILED. Quoting the cuSOLVER documentation:
Resource allocation failed inside the cuSolver library. This is
usually caused by a cudaMalloc() failure.
To correct: prior to the function call, deallocate previously
allocated memory as much as possible.
This means memory for your matrix could not be allocated, probably because your GPU does not have enough free memory for it. Try deallocating memory beforehand (as stated in the documentation), use a smaller input matrix, or use a GPU with more memory.
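As a rough sanity check (a sketch only: the parameter names are the variables already used in the sample, and the real cusolverRf working set will be larger than this lower bound), you can compare the size of the CSR data you are passing in against the free device memory before the call:

#include <stdio.h>
#include <cuda_runtime.h>

// Lower-bound estimate of the device memory needed for the A, L, U factors and permutations.
static int enoughDeviceMemory(int rowsA, int nnzA, int nnzL, int nnzU)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    size_t csrBytes = 0;
    csrBytes += (size_t)(nnzA + nnzL + nnzU) * (sizeof(double) + sizeof(int)); // values + column indices
    csrBytes += (size_t)(rowsA + 1) * 3 * sizeof(int);                         // three row-pointer arrays
    csrBytes += (size_t)rowsA * 2 * sizeof(int);                               // P and Q permutations

    printf("free: %zu bytes, CSR lower bound: %zu bytes\n", freeBytes, csrBytes);
    return csrBytes < freeBytes;
}

If even this lower bound is close to or above the free memory, a CUSOLVER_STATUS_ALLOC_FAILED is to be expected.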

Unified Memory and Streams in C

I am trying to use streams with CUDA 6 and unified memory in C. My previous stream implementation looked like this:
for (x = 0; x < DSIZE; x += N*2) {
    gpuErrchk(cudaMemcpyAsync(array_d0, array_h+x, N*sizeof(char), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(array_d1, array_h+x+N, N*sizeof(char), cudaMemcpyHostToDevice, stream1));
    gpuErrchk(cudaMemcpyAsync(data_d0, data_h, wrap->size*sizeof(int), cudaMemcpyHostToDevice, stream0));
    gpuErrchk(cudaMemcpyAsync(data_d1, data_h, wrap->size*sizeof(int), cudaMemcpyHostToDevice, stream1));
    searchGPUModified<<<N/128,128,0,stream0>>>(data_d0, array_d0, out_d0);
    searchGPUModified<<<N/128,128,0,stream1>>>(data_d1, array_d1, out_d1);
    gpuErrchk(cudaMemcpyAsync(out_h+x, out_d0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0));
    gpuErrchk(cudaMemcpyAsync(out_h+x+N, out_d1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1));
}
but I cannot find an example of streams and unified memory using the same technique, where chunks of data are sent to the GPU. I am thus wondering if there is a way to do this?
You should read section J.2.2 of the programming guide (and preferably all of appendix J).
With Unified Memory, memory allocated using cudaMallocManaged is by default attached to all streams ("global"), and we must modify this in order to make effective use of streams, e.g. for compute/copy overlap. We can do this with the cudaStreamAttachMemAsync function as described in section J.2.2.3. By associating each memory "chunk" with a stream in this fashion, the UM subsystem can make intelligent decisions about when to transfer each data item.
The following example demonstrates this:
#include <stdio.h>
#include <time.h>

#define DSIZE 1048576
#define DWAIT 100000ULL
#define nTPB 256

#define cudaCheckErrors(msg) \
    do { \
      cudaError_t __err = cudaGetLastError(); \
      if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
      } \
    } while (0)

typedef int mytype;

__global__ void mykernel(mytype *data){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < DSIZE) data[idx] = 1;
  unsigned long long int tstart = clock64();
  while (clock64() < tstart + DWAIT);
}

int main(){
  mytype *data1, *data2, *data3;
  cudaStream_t stream1, stream2, stream3;
  cudaMallocManaged(&data1, DSIZE*sizeof(mytype));
  cudaMallocManaged(&data2, DSIZE*sizeof(mytype));
  cudaMallocManaged(&data3, DSIZE*sizeof(mytype));
  cudaCheckErrors("cudaMallocManaged fail");
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  cudaStreamCreate(&stream3);
  cudaCheckErrors("cudaStreamCreate fail");
  cudaStreamAttachMemAsync(stream1, data1);
  cudaStreamAttachMemAsync(stream2, data2);
  cudaStreamAttachMemAsync(stream3, data3);
  cudaDeviceSynchronize();
  cudaCheckErrors("cudaStreamAttach fail");
  memset(data1, 0, DSIZE*sizeof(mytype));
  memset(data2, 0, DSIZE*sizeof(mytype));
  memset(data3, 0, DSIZE*sizeof(mytype));
  mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream1>>>(data1);
  mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream2>>>(data2);
  mykernel<<<(DSIZE+nTPB-1)/nTPB, nTPB, 0, stream3>>>(data3);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  for (int i = 0; i < DSIZE; i++){
    if (data1[i] != 1) {printf("data1 mismatch at %d, should be: %d, was: %d\n", i, 1, data1[i]); return 1;}
    if (data2[i] != 1) {printf("data2 mismatch at %d, should be: %d, was: %d\n", i, 1, data2[i]); return 1;}
    if (data3[i] != 1) {printf("data3 mismatch at %d, should be: %d, was: %d\n", i, 1, data3[i]); return 1;}
  }
  printf("Success!\n");
  return 0;
}
The above program creates a kernel that runs artificially long using clock64(), so as to give us a simulated opportunity for compute/copy overlap (simulating a compute-intensive kernel). We are launching 3 instances of this kernel, each instance operating on a separate "chunk" of data.
When we profile the above program, the following is seen:
First, note that the 3rd kernel launch is highlighted in yellow, and it begins immediately after the second kernel launch highlighted in purple. The actual cudaLaunch runtime API event that launches this 3rd kernel is indicated in the runtime API line by the mouse pointer, also highlighted in yellow (and is preceded by the cudaLaunch events for the first 2 kernels). Since this launch happens during execution of the first kernel, and there is no intervening "empty space" from that point until the start of the 3rd kernel, we can observe that the transfer of the data for the 3rd kernel launch (i.e. data3) occurred while kernels 1 and 2 were executing. Therefore we have effective overlap of copy and compute. (We could make a similar observation about kernel 2).
Although I haven't shown it here, if we omit the cudaStreamAttachMemAsync lines, the program still compiles and runs correctly, but if we profile it, we observe a different relationship between the cudaLaunch events and the kernels. The overall profile looks similar, and the kernels are executing back to back, but the entire cudaLaunch process now begins and ends before the first kernel begins executing, and there are no cudaLaunch events during the kernel execution. This indicates that (since all the cudaMallocManaged memory is global) all of the data transfers are taking place prior to the first kernel launch. The program has no way to associate a "global" allocation with any particular kernel, so all such allocated memory must be transferred before the first kernel launch (even though that kernel is only using data1).
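To map this back onto the chunked loop in the question, one possible adaptation (a sketch only, not tested against your code: array_m0, array_m1, out_m0, out_m1 and data_m are assumed managed buffers replacing the original device copies, with data_m filled once before the loop) is to attach each chunk's buffers to its stream and let UM handle the transfers:

// Hedged sketch of the question's loop with managed memory; names with the
// _m suffix are assumed managed allocations, not from the original code.
cudaMallocManaged(&array_m0, N * sizeof(char));
cudaMallocManaged(&array_m1, N * sizeof(char));
cudaMallocManaged(&out_m0, N * sizeof(int));
cudaMallocManaged(&out_m1, N * sizeof(int));

// Attach each chunk's buffers to its stream so UM can migrate them independently.
cudaStreamAttachMemAsync(stream0, array_m0);
cudaStreamAttachMemAsync(stream0, out_m0);
cudaStreamAttachMemAsync(stream1, array_m1);
cudaStreamAttachMemAsync(stream1, out_m1);
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);

for (x = 0; x < DSIZE; x += N*2) {
    // Host code may touch a stream-attached buffer only while that stream is idle.
    memcpy(array_m0, array_h + x,     N * sizeof(char));
    memcpy(array_m1, array_h + x + N, N * sizeof(char));

    searchGPUModified<<<N/128, 128, 0, stream0>>>(data_m, array_m0, out_m0);
    searchGPUModified<<<N/128, 128, 0, stream1>>>(data_m, array_m1, out_m1);

    cudaStreamSynchronize(stream0);
    memcpy(out_h + x, out_m0, N * sizeof(int));
    cudaStreamSynchronize(stream1);
    memcpy(out_h + x + N, out_m1, N * sizeof(int));
}

Note that because the host must wait for a stream to be idle before touching its managed buffers, this version has less copy/compute overlap than the original cudaMemcpyAsync code; it trades that overlap for simpler memory management.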

Dynamic parallelism cudaDeviceSynchronize() crashes

I have a kernel which calls another, empty kernel. However, when the calling kernel calls cudaDeviceSynchronize(), the kernel crashes and execution goes straight to the host. The memory checker does not report any memory access issues.
Does anyone know what could be the reason for such uncivilized behavior?
The crash seems to happen only if I run the code from the debugger (Visual Studio -> Nsight -> Start CUDA Debugging).
The crash does not happen every time I run the code - sometimes it crashes, and sometimes it finishes ok.
Here is the complete code to reproduce the problem:
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include "device_launch_parameters.h"
#include <stdio.h>

#define CUDA_RUN(x_, err_) {cudaStatus = x_; if (cudaStatus != cudaSuccess) {fprintf(stderr, err_ " %d - %s\n", cudaStatus, cudaGetErrorString(cudaStatus)); int k; scanf("%d", &k); goto Error;}}

struct computationalStorage {
    float rotMat;
};

__global__ void drawThetaFromDistribution() {}

__global__ void chainKernel() {
    computationalStorage* c = (computationalStorage*)malloc(sizeof(computationalStorage));
    if (!c) printf("malloc error\n");

    c->rotMat = 1.0f;

    int n = 1;
    while (n < 1000) {
        cudaError_t err;
        drawThetaFromDistribution<<<1, 1>>>();
        if ((err = cudaGetLastError()) != cudaSuccess)
            printf("drawThetaFromDistribution Sync kernel error: %s\n", cudaGetErrorString(err));
        printf("0");
        if ((err = cudaDeviceSynchronize()) != cudaSuccess)
            printf("drawThetaFromDistribution Async kernel error: %s\n", cudaGetErrorString(err));
        printf("1\n");
        ++n;
    }
    free(c);
}

int main() {
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    CUDA_RUN(cudaSetDevice(0), "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
    // Set to use on chip memory 16KB for shared, 48KB for L1
    CUDA_RUN(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1), "Can't set CUDA to use on chip memory for L1");
    // Set a large heap
    CUDA_RUN(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 10 * 192), "Can't set the Heap size");

    chainKernel<<<10, 192>>>();
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        printf("Something was wrong! Error code: %d", cudaStatus);
    }

    CUDA_RUN(cudaDeviceReset(), "cudaDeviceReset failed!");

Error:
    int k;
    scanf("%d", &k);
    return 0;
}
If all goes well I expect to see:
00000000000000000000000....0000000000000001
1
1
1
1
....
This is what I get when everything works ok. When it crashes however:
000000000000....0000000000000Something was wrong! Error code: 30
As you can see, the statement err = cudaDeviceSynchronize(); does not finish, and execution goes straight to the host, where its own cudaDeviceSynchronize() fails with an unknown error code (30 = cudaErrorUnknown).
System: CUDA 5.5, NVidia-Titan(Headless), Windows 7x64, Win32 application.
UPDATE: additional Nvidia card driving the display, Nsight 3.2.0.13289.
That last fact may have been the critical one. You don't mention which version of Nsight VSE you are using nor your exact machine configuration (e.g. are there other GPUs in the machine, and if so, which is driving the display?), but at least until recently it was not possible to debug a dynamic parallelism application in single-GPU mode with Nsight VSE.
The current feature matrix also suggests that single-GPU CDP debugging is not yet supported.
One possible workaround in your case would be to add another GPU to drive the display and make the Titan card headless (i.e. don't attach any monitors to it and don't extend the Windows desktop onto that GPU).
I ran your application with and without cuda-memcheck and it does not appear to me that there are any problems with it.

cudaHostRegister fails with 'invalid argument' error even with page-aligned memory

I have allocated page-aligned memory on host using posix_memalign. The call to posix_memalign does not return any error. However, using this pointer as argument to cudaHostRegister gives me an 'invalid argument' error. What could be the issue?
CUDA API version: 4.0
gcc version: 4.4.5
GPU compute capability: 2.0
The memory allocation is done in the application code, and a pointer is passed to a library routine.
Application code snippet:
if (posix_memalign((void **)&h_A, getpagesize(), n * n * sizeof(float))) {
    printf("Error allocating aligned memory for A\n");
    return 1;
}
Shared library code snippet:
if ((ret = cudaSetDeviceFlags(cudaDeviceMapHost)) != cudaSuccess) {
    fprintf(stderr, "Error setting device flag: %s\n",
            cudaGetErrorString(ret));
    return NULL;
}
if ((ret = cudaHostRegister(h_A, n2 * sizeof(float),
        cudaHostRegisterMapped)) != cudaSuccess) {
    fprintf(stderr, "Error registering page-locked memory for A: %s\n",
            cudaGetErrorString(ret));
    return NULL;
}
I cannot reproduce this. If I take the code snippets you supplied and make them into a minimal executable:
#include <unistd.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdio.h>

int main(void)
{
    const int n2 = 100 * 100;
    float *h_A;
    cudaError_t ret;

    if (posix_memalign((void **)&h_A, getpagesize(), n2 * sizeof(float))) {
        printf("Error allocating aligned memory for A\n");
        return -1;
    }

    if ((ret = cudaSetDeviceFlags(cudaDeviceMapHost)) != cudaSuccess) {
        fprintf(stderr, "Error setting device flag: %s\n",
                cudaGetErrorString(ret));
        return -1;
    }

    if ((ret = cudaHostRegister(h_A, n2 * sizeof(float),
            cudaHostRegisterMapped)) != cudaSuccess) {
        fprintf(stderr, "Error registering page-locked memory for A: %s\n",
                cudaGetErrorString(ret));
        return -1;
    }

    return 0;
}
it compiles and runs without error under both CUDA 4.2 and CUDA 5.0 on a 64 bit linux host with the 304.54 driver. I would, therefore, conclude that either you have a broken CUDA installation or your code has a problem somewhere you haven't shown us.
Perhaps you can compile and run this code exactly as I posted and see what happens. If it works, it might help narrow down what it is that might be going wrong here.