Hi, I want to allocate pinned memory without using cudaMallocHost. I've read this post and tried to use a fixed mmap to emulate cudaMallocHost:
data_mapped_ = (void *)mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd_, 0);
// unmap and re-map the same range at the same address with MAP_FIXED
if (munmap(data_mapped_, sb.st_size) == -1) {
    cerr << "munmap failed" << endl;
    exit(-1);
}
data_mapped_ = (void *)mmap(data_mapped_, sb.st_size, PROT_READ, MAP_SHARED | MAP_FIXED, fd_, 0);
But this is still not as fast as cudaMallocHost. So what's the correct C implementation of pinned memory?
CUDA pinned memory (e.g. those pointers returned by cudaMallocHost, cudaHostAlloc, or cudaHostRegister) has several characteristics. One characteristic is that it is non-pageable and this characteristic is largely provided by underlying system/OS calls.
Another characteristic is that it is registered with the CUDA driver. This registration means the driver keeps track of the starting address and size of the pinned allocation. It uses that information to decide exactly how it will process future API calls that touch that region, such as cudaMemcpy or cudaMemcpyAsync.
You could conceivably provide the non-pageable aspect by performing your own system calls (e.g. mlock). The only way to perform the CUDA driver registration, however, is to actually call one of the aforementioned CUDA API calls.
Therefore there is no sequence of purely C library or system library calls that can completely mimic the behavior of one of the aforementioned CUDA API calls that provide "pinned" memory.
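For completeness, here is a minimal sketch (untested) of the approach this implies: mmap the file as usual, then hand the mapping to the CUDA driver with cudaHostRegister so that it is both page-locked and registered. The function name map_and_pin and the omission of error checking are illustrative only.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void* map_and_pin(const char* path, size_t* out_size)
{
    int fd = open(path, O_RDONLY);
    struct stat sb;
    fstat(fd, &sb);

    void* p = mmap(nullptr, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed

    // Register the mapping with the CUDA driver so it is treated as pinned memory.
    // Note: on some toolkit/platform combinations a read-only file mapping may need
    // the cudaHostRegisterReadOnly flag (CUDA 11.1+) instead of cudaHostRegisterDefault.
    cudaHostRegister(p, sb.st_size, cudaHostRegisterDefault);

    *out_size = sb.st_size;
    return p;  // later: cudaHostUnregister(p); munmap(p, size);
}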
Related
I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I have encountered a situation where I had to print inside the kernel, and I noticed that std::cout does not work inside the kernel whereas printf does. For example, consider the following SYCL code:
This works:
void print(float* A, size_t N) {
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler) {
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx) {
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying: Accessing non-const global variable is not allowed within SYCL device code.
A similar thing happens with CUDA kernels.
This got me thinking about what the difference between printf and std::cout might be that causes such behavior.
Also, suppose I wanted to implement a custom print function to be called from the GPU; how should I do it?
TIA
This got me thinking about what the difference between printf and std::cout might be that causes such behavior.
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists at all as CUDA C). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, where the CUDA driver forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But - I decided I don't really like it, and its compilation is rather expensive, so instead I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals, the only way to do this is to build on top of printf() calls, rather than the C standard library or system calls you would use on the host side. You essentially need to layer your entire stream implementation over that low-level primitive I/O facility. It is far from trivial.
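To make that concrete, here is a hedged, untested sketch of what such a custom device-side print helper might look like, layered over the in-kernel printf(). The name gpu_log is hypothetical, and this is nowhere near a full stream implementation.
#include <cstdio>

template <typename... Args>
__device__ void gpu_log(const char* fmt, Args... args)
{
    // Prefix each message with the global thread index, then delegate to device printf.
    // Note: the two printf calls can interleave with output from other threads.
    printf("[thread %d] ", blockIdx.x * blockDim.x + threadIdx.x);
    printf(fmt, args...);
}

__global__ void kernel(const float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    gpu_log("value = %f\n", data[i]);
}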
In SYCL you cannot use std::cout for output in code not running on the host, for reasons similar to those given in the answer for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
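As a rough illustration (untested, with arbitrary buffer sizes), printing from a SYCL kernel with the stream class looks something like this:
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    q.submit([&](sycl::handler& cgh) {
        // 1024-byte output buffer in total, at most 256 bytes per work item
        sycl::stream out(1024, 256, cgh);
        cgh.parallel_for(sycl::range<1>{4}, [=](sycl::id<1> idx) {
            out << "work item " << idx[0] << sycl::endl;
        });
    }).wait();
    return 0;
}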
There is no __device__ version of std::cout, so only printf can be used in device code.
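For example (a trivial, untested sketch), the in-kernel printf works directly in device code:
#include <cstdio>

__global__ void hello()
{
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();
    cudaDeviceSynchronize();  // flush the device-side print buffer before exiting
    return 0;
}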
I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations using malloc or new in device code are allocated on the device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer them back to the host.
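For reference, a hedged sketch (untested) of that device-side path might look like the following; the kernel names and sizes are illustrative only.
#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

__global__ void setup(size_t n)
{
    buffer = static_cast<char*>(malloc(n));  // allocated on the device runtime heap
}

__global__ void copy_out(char* dst, size_t n)
{
    // Copy from the runtime-heap allocation into cudaMalloc'ed memory,
    // which host side APIs can access.
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = buffer[i];
}

__global__ void teardown()
{
    free(buffer);
}

int main()
{
    const size_t n = 800 * 600;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * n);  // size the heap before using device malloc
    setup<<<1, 1>>>(n);

    // Kernels using buffer go here

    char* d_staging;
    cudaMalloc(&d_staging, n);
    copy_out<<<(n + 255) / 256, 256>>>(d_staging, n);

    std::vector<char> results(n);
    cudaMemcpy(results.data(), d_staging, n, cudaMemcpyDeviceToHost);

    teardown<<<1, 1>>>();
    cudaFree(d_staging);
    return 0;
}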
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has lifespan until free'd here
    cudaFree(d_buffer);

    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
I want to copy data between two GPUs in different processes using the old API for GPUs that don't support peer-to-peer (they are not on the same PCI root hub). However, I'm having trouble with synchronization. The basic steps as I understand them are:
(Process 0, Device 0):
void * d_X;
cudaMalloc(&d_X, size);
// Put something into d_X;
cudaIpcMemHandle_t data;
cudaIpcGetMemHandle(&data, (void *)d_X);
-> Send the handle (data) and the buffer size to Process 1 via MPI_Send/MPI_Recv
(Process 1, Device 1):
cudaSetDevice(1);
void * d_Y;
cudaMalloc(&d_Y, size);
cudaSetDevice(0); // Need to be on device 0 to copy from device 0
void * d_X;
cudaIpcOpenMemHandle(&d_X, data, cudaIpcMemLazyEnablePeerAccess);
cudaMemcpyPeer(d_Y, 1, d_X, 0, size);
cudaIpcCloseMemHandle(d_X);
Is this basically correct? Once I'm sure this is the right approach I need to work out how to synchronize correctly because it's clear I have sync problems (stale memory being copied across, basically).
My GPUs do support UVA but cudaDeviceCanAccessPeer returns 0. I'm actually trying to write some code that works for both P2P and this, but this is the bit I'm having trouble with.
I don't think that what you're asking for is possible.
If you read the documentation for cudaIpcOpenMemHandle (which would be necessary in any event to convert a memory handle from another process into a device pointer usable in the local process), the only possible flag is cudaIpcMemLazyEnablePeerAccess. If you run this call with this flag on devices that are not peer capable, it will return an error (according to my testing, and it should be pretty evident anyway).
Therefore there is no way, in process A, to get a usable device pointer for a device allocation in process B, unless the devices are peer capable (or unless it is on the same device as the one being used by process A - which is demonstrated in the cuda simpleIPC sample code).
The "fallback" option would be to copy data from device to host in process B, and use ordinary Linux IPC mechanisms (e.g. mapped memory, as demonstrated in the simpleIPC sample code) to make that host data available in process A. From there you could copy it to the device in process A if you wish.
Although this seems tedious, it is more-or-less exactly what cudaMemcpyPeer does, for two devices in the same process, when P2P is not possible between those two devices. The fallback mode is to copy the data through a host staging buffer.
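A hedged sketch of that fallback (untested): process B stages its device data into POSIX shared memory, and process A copies it onto its own device. The name "/xfer_buf" is arbitrary, error checking is omitted, and real code also needs an IPC synchronization primitive (e.g. a semaphore or an MPI message) so that A only reads after B has finished writing.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Process B (owns d_X on its device): device -> host shared memory
void export_buffer(const void* d_X, size_t size)
{
    int fd = shm_open("/xfer_buf", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);
    void* staging = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    cudaMemcpy(staging, d_X, size, cudaMemcpyDeviceToHost);
    munmap(staging, size);
    close(fd);
}

// Process A (owns d_Y on its device): host shared memory -> device
void import_buffer(void* d_Y, size_t size)
{
    int fd = shm_open("/xfer_buf", O_RDWR, 0600);
    void* staging = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    cudaMemcpy(d_Y, staging, size, cudaMemcpyHostToDevice);
    munmap(staging, size);
    close(fd);
}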
I have used the following method, expecting to avoid a memcpy from host to device. Does the thrust library ensure that there won't be a memcpy from host to device in the process?
void EScanThrust(float* d_in, float* d_out)
{
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    // size is assumed to come from the surrounding scope
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are prepared using cudaMalloc, and d_in is filled with data using cudaMemcpy before calling this function.
Does the thrust library ensure that there won't be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual codes, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
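For instance, a hedged sketch (untested) of narrowing the profiled region around the thrust call from the question, using cudaProfilerStart/cudaProfilerStop and an NVTX range; the wrapper name profile_scan is illustrative.
#include <cuda_profiler_api.h>
#include <nvToolsExt.h>   // <nvtx3/nvToolsExt.h> on newer toolkits

void EScanThrust(float* d_in, float* d_out);  // the function from the question above

void profile_scan(float* d_in, float* d_out)
{
    cudaProfilerStart();               // with nvprof, pair this with --profile-from-start off
    nvtxRangePushA("exclusive_scan");  // named range shows up in the profiler trace
    EScanThrust(d_in, d_out);
    nvtxRangePop();
    cudaProfilerStop();
}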
After reading CUDA's "overlap of data transfer and kernel execution" section in the CUDA C Programming Guide, I have a question: what exactly does data transfer refer to? Does it include cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, and cudaMemcpy? Of course, the memory allocated for the memcpy is pinned.
In the implicit synchronization (streams) section, the guide says "a device memory set" may serialize the streams. So, does that refer to cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, or cudaMemcpy? I am not sure.
Any function call with an Async at the end has a stream parameter. Additionally, some of the libraries provided by the CUDA toolkit also have the option of setting a stream. By using this, you can have multiple streams running concurrently.
This means that unless you specifically create and set a stream, you will be using the default stream. For example, there is no pre-made pair of streams for data transfer and kernel execution; you have to create two (or more) streams yourself and assign each one a task of your choice.
A common use case is to have the two streams as mentioned in the programming guide. Keep in mind, this is only useful if you have multiple kernel launches. You can get the data needed for the next (independent) kernel or the next iteration of the current kernel while computing the results for the current kernel. This can maximize both compute and bandwidth capabilities.
For the function calls you mention, cudaMemcpy and cudaMemcpyAsync are the only functions performing data transfers. I don't think cudaMemset and cudaMemsetAsync can be termed as data transfers.
Both cudaMemcpyAsync and cudaMemsetAsync can be used with streams, while cudaMemset and cudaMemcpy are blocking calls that do not make use of streams.
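To illustrate (a minimal, untested sketch with arbitrary sizes): two streams each get their own copy-in, kernel, and copy-out, so one chunk's kernel can overlap the other chunk's transfers. The host buffers are pinned with cudaMallocHost so the async copies can actually overlap with kernel execution.
#include <cuda_runtime.h>

__global__ void work(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) {
        cudaMallocHost(&h[k], n * sizeof(float));  // pinned host memory
        cudaMalloc(&d[k], n * sizeof(float));
        cudaStreamCreate(&s[k]);
    }

    // Each chunk's copy-in, kernel, and copy-out are issued into its own stream,
    // so chunk 0's kernel can overlap with chunk 1's transfers (and vice versa).
    for (int k = 0; k < 2; ++k) {
        cudaMemcpyAsync(d[k], h[k], n * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        work<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
        cudaMemcpyAsync(h[k], d[k], n * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < 2; ++k) {
        cudaStreamDestroy(s[k]);
        cudaFree(d[k]);
        cudaFreeHost(h[k]);
    }
    return 0;
}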