Any CUDA API to clear the already pinned memory? - cuda

I have a program which uses cudaMallocHost() to allocate pinned memory, but I forgot to use cudaFreeHost() to free it...
I ran this program once and exited, but the next time I ran the same program, it threw a segmentation fault when I called cudaMallocHost().
I suspect this is because the memory was pinned the first time I ran the program, so when I ran it again the OS could not find any more memory that could be pinned...
My question is: is there any CUDA API I can call to release the already pinned host memory without knowing the host memory address?
I looked through the CUDA documentation but didn't find one, and rebooting didn't help either.
Edit
I ran htop and found 17 GB of memory that nobody seems to be using.
I wonder if this is the memory that I pinned?
[htop screenshot]

I made some tests using htop and a small application.
Here is the code I used:
#include <cuda_runtime.h>
#include <unistd.h>
#include <vector>

int main(void)
{
    std::vector<void*> arPtrs;
    for (int i = 0; i < 5; ++i)
    {
        void* ptr = nullptr;
        cudaMallocHost(&ptr, 1 * 1024 * 1024 * 1024);
        arPtrs.push_back(ptr);
        sleep(2);
    }
    return 0;
}
As you can see, I don't call cudaFreeHost() on my pointers.
In parallel I monitored the memory with htop. According to htop, the memory is released when the application exits.
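For comparison, here is a minimal sketch of the same loop that does release its pinned allocations before exiting (untested, written against the standard runtime API):

#include <cuda_runtime.h>
#include <unistd.h>
#include <vector>

int main(void)
{
    std::vector<void*> arPtrs;
    for (int i = 0; i < 5; ++i)
    {
        void* ptr = nullptr;
        // Allocate 1 GiB of pinned (page-locked) host memory.
        if (cudaMallocHost(&ptr, 1024 * 1024 * 1024) == cudaSuccess)
            arPtrs.push_back(ptr);
        sleep(2);
    }
    // Release every pinned allocation explicitly before exit.
    for (void* p : arPtrs)
        cudaFreeHost(p);
    return 0;
}

Either way, the OS reclaims page-locked memory when the process terminates, which matches what htop shows.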
Could your memory be in use by another user, so that you can't see it?

Related

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations using malloc or new in device code are allocated on a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
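To illustrate the device-side route (a sketch only; the kernel names setup_kernel and copy_kernel and the 64 MB heap size are invented for this example):

#include <cuda_runtime.h>
#include <vector>

__device__ char* buffer;

__global__ void setup_kernel(size_t sz)
{
    // Allocate from the device runtime heap and publish the pointer.
    buffer = (char*)malloc(sz);
}

__global__ void copy_kernel(char* staging, size_t sz)
{
    // Copy from the runtime-heap allocation into memory the host APIs can see.
    for (size_t i = threadIdx.x; i < sz; i += blockDim.x)
        staging[i] = buffer[i];
}

int main()
{
    const size_t sz = 800 * 600;
    // Size the device heap before any in-kernel malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    setup_kernel<<<1,1>>>(sz);

    // Kernels using buffer go here

    char* staging;
    cudaMalloc(&staging, sz);
    copy_kernel<<<1,256>>>(staging, sz);
    std::vector<char> results(sz);
    cudaMemcpy(results.data(), staging, sz, cudaMemcpyDeviceToHost);
    cudaFree(staging);
    return 0;
}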
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <cuda_runtime.h>
#include <vector>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has lifespan until free'd here
    cudaFree(d_buffer);

    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

Why do I need to declare CUDA variables on the Host before allocating them on the Device

I've just started trying to learn CUDA again and came across some code I don't fully understand.
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
When the GPU memory pointers are declared, they allocate memory on the host. The cudaMalloc calls throw away the information that d_in and d_out are pointers to floats.
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored. It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
So, what is the purpose of the original variable declarations on the host?
======================================================================
I would've thought something like this would make more sense:
// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
This way, everything GPU related takes place on the GPU. If d_in or d_out are accidentally used in host code, an error can be thrown at compile time, since those variables wouldn't be defined on the host.
I guess what I also find confusing is that by storing device memory addresses on the host, it feels like the device isn't fully in charge of managing its own memory. It feels like there's a risk of host code accidentally overwriting the value of either d_in or d_out, either by accidentally assigning to them in host code or through another more subtle error, which could cause the GPU to lose access to its own memory. Also, it seems strange that the addresses assigned to d_in & d_out are chosen by the host, instead of the device. Why should the host know anything about which addresses are or are not available on the device?
What am I failing to understand here?
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored
That is just the C pass by reference idiom.
It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
Ok, so let's design the API your way. Here is a typical sequence of operations on the host -- allocate some memory on the device, copy some data to that memory, launch a kernel to do something to that memory. You can think for yourself how it would be possible to do this without having the pointers to the allocated memory stored in a host variable:
cudaMalloc(somebytes);
cudaMemcpy(?????, hostdata, somebytes, cudaMemcpyHostToDevice);
kernel<<<1,1>>>(?????);
If you can explain what should be done with ????? if we don't have the address of the memory allocation on the device stored in a host variable, then you are really onto something. If you can't, then you have deduced the basic reason why we store the return address of memory allocated on the GPU in host variables.
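Spelled out as a complete (if contrived) sketch, with the missing pieces of the fragment above filled in by assumption, the host variable d_data is what takes the place of ?????:

#include <cuda_runtime.h>

__global__ void kernel(float* data) { data[0] = 1.0f; }

int main()
{
    const size_t somebytes = 1024 * sizeof(float);
    float hostdata[1024] = {0};

    float* d_data;                              // host variable holding a device address
    cudaMalloc((void**)&d_data, somebytes);     // the device address is written into d_data
    cudaMemcpy(d_data, hostdata, somebytes, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(d_data);                    // the same address is passed to the kernel
    cudaFree(d_data);
    return 0;
}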
Further, because typed host pointers are used to store the addresses of device allocations, the compiler can do type checking. So this:
__global__ void kernel(double *data, int N);
// .....
int N = 1 << 20;
float * d_data;
cudaMalloc((void **)&d_data, N * sizeof(float));
kernel<<<1,1>>>(d_data, N);
can report type mismatches at compile time, which is very useful.
Your fundamental conceptual failure is mixing up host-side code and device-side code. If you call cudaMalloc() from code executing on the CPU, then, well, it's on the CPU: it's you who wants to have the arguments in CPU memory, and the result in CPU memory. You asked for it. cudaMalloc has told the GPU/device how much of its (the device's) memory to allocate, but if the CPU/host wants to access that memory, it needs a way to refer to it that the device will understand. The memory location on the device is a way to do this.
Alternatively, you can call it from device-side code; then everything takes place on the GPU. (Although, frankly, I've never done it myself and it's not such a great idea except in special cases).

Ensure that thrust doesn't memcpy from host to device

I have used the following method, expecting to avoid a memcpy from host to device. Does the thrust library ensure that there won't be a memcpy from host to device in the process?
void EScanThrust(float* d_in, float* d_out)
{
    // 'size' is assumed to be defined elsewhere (e.g. a global in the original code)
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are prepared using cudaMalloc and d_in is filled with data using cudaMemcpy before calling this function
Does thrust library ensure that there wont be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual codes, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
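As an illustration of the NVTX approach (a sketch only; it assumes the nvToolsExt header and library that ship with the CUDA toolkit, linked with -lnvToolsExt, and it passes size explicitly, which the original function did not):

#include <nvToolsExt.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>
#include <cstddef>

void EScanThrust(float* d_in, float* d_out, size_t size)
{
    nvtxRangePushA("exclusive_scan");   // named range shows up in the profiler timeline
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
    nvtxRangePop();
}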

CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu with such a program (to run on a Linux machine):
__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestion on why the kernel doesn't launch?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
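For example, an explicit specification for a Fermi-class device could look like this (the file names are just placeholders):

nvcc -gencode arch=compute_20,code=sm_20 test.cu -o test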
In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and then add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all these architectures makes your code a bit slower, so eliminate them one by one to find which compute and sm code generation settings are required for your GPU.
But if you are shipping this to others, it is better to include all of them.

CUDA host to device transfer faster than device to host transfer

I was working on a simple CUDA program in which I found that 90% of the time was spent in a single statement, a cudaMemcpy from device to host. The program transfers some 2 MB of data from host to device in 600-700 microseconds, but copying back 4 MB of data from device to host takes 10 ms. The total time taken by my program is 13 ms. My question is why there is such an asymmetry between the host-to-device and device-to-host copies. Is it because CUDA developers assumed that the copy back would usually be smaller in bytes? My second question is whether there is any way to circumvent it.
I am using a Fermi GTX 560 graphics card with 343 cores and 1 GB of memory.
Timing CUDA functions is a bit different from timing CPU code. First of all, make sure you do not include the CUDA initialization cost in your measurement by calling a CUDA function at the start of your application; otherwise initialization may happen inside your timed region.
int main(int argc, char** argv)
{
    cudaFree(0);
    // ... CUDA is now initialized ...
}
Use a cutil timer like this:
unsigned int timer;
cutCreateTimer(&timer);
cutStartTimer(timer);

// your code, to assess elapsed time...

cutStopTimer(timer);
printf("Elapsed: %.3f\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
Now, after these preliminary steps, let's look at the problem. When a kernel is called, the CPU is stalled only until the call is delivered to the GPU; the GPU then continues execution while the CPU continues too. If you call cudaThreadSynchronize(), the CPU stalls until the GPU finishes the current call. A cudaMemcpy operation also requires the GPU to finish its execution, because the values that the kernel is supposed to fill in are being requested.
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaError_t err = cudaThreadSynchronize();
if (cudaSuccess != err) {
    fprintf(stderr, "cudaCheckError() failed at %s:%i : %s.\n",
            __FILE__, __LINE__, cudaGetErrorString(err));
    exit(1);
}
// now the kernel is complete..
cutStopTimer(timer);
So place a synchronization before calling the stop-timer function. If you place a memory copy right after the kernel call, then the elapsed time of the memory copy will include part of the kernel execution. The memcpy operation should therefore be placed after the timing operations.
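As an alternative to the cutil timer (which is a sample utility rather than part of the CUDA runtime), the same measurement can be made with CUDA events; a minimal sketch, not from the original answer:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaFree(0);                       // force CUDA initialization outside the timed region

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // kernel launches and memory copies to be timed go here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until all work before 'stop' has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}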
There are also some profiler counters that may be used to assess some sections of the kernels.
How to profile the number of global memory transactions for cuda kernels?
How Do You Profile & Optimize CUDA Kernels?