Reading a CUDA buffer in DirectX 12 - cuda

After creating a GPU buffer in CUDA, exporting it with cuMemExportToShareableHandle() to get a HANDLE, and then passing that CUDA HANDLE to ID3D12Device::OpenSharedHandle(), the call fails with "Access violation writing location". So:
(A) can OpenSharedHandle() open CUDA handles, or
(B) does OpenSharedHandle() only open handles created by CreateSharedHandle()?
If it's (A), then there is something wrong with my handle.
If it's (B), then is there another way to import the CUDA handle into DirectX 12?

I asked on the official DirectX Discord:
OpenSharedHandle() can only open DirectX handles,
but a handle created by DirectX 12's CreateSharedHandle() can be opened by CUDA's cuMemImportFromShareableHandle().
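Following that answer, the working direction is to allocate on the D3D12 side, export with CreateSharedHandle(), and import on the CUDA side. Below is a minimal sketch of that flow; the helper names are mine, the heap flags and the CU_MEM_HANDLE_TYPE_WIN32 handle type are assumptions about the required configuration, and all error checking is omitted.

    // D3D12 -> CUDA sharing sketch (the direction the Discord answer says works).
    // Helper names are illustrative; heap flags and the CU_MEM_HANDLE_TYPE_WIN32
    // handle type are assumptions, and error checking is omitted.
    #include <d3d12.h>
    #include <cuda.h>

    // D3D12 side: create a shareable heap and export an NT HANDLE for it.
    HANDLE ExportD3D12Heap(ID3D12Device* device, UINT64 sizeBytes, ID3D12Heap** heapOut)
    {
        D3D12_HEAP_DESC desc = {};
        desc.SizeInBytes = sizeBytes;                  // assumed to already satisfy alignment/granularity
        desc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
        desc.Flags = D3D12_HEAP_FLAG_SHARED;           // heap must be shareable across APIs
        device->CreateHeap(&desc, IID_PPV_ARGS(heapOut));

        HANDLE sharedHandle = nullptr;
        device->CreateSharedHandle(*heapOut, nullptr, GENERIC_ALL, nullptr, &sharedHandle);
        return sharedHandle;
    }

    // CUDA side: import the handle and map it into the CUDA address space.
    CUdeviceptr ImportIntoCuda(HANDLE sharedHandle, size_t sizeBytes)
    {
        CUmemGenericAllocationHandle cuHandle;
        cuMemImportFromShareableHandle(&cuHandle, sharedHandle, CU_MEM_HANDLE_TYPE_WIN32);

        CUdeviceptr ptr;
        cuMemAddressReserve(&ptr, sizeBytes, 0, 0, 0); // reserve a VA range, then map the imported allocation
        cuMemMap(ptr, sizeBytes, 0, cuHandle, 0);

        CUmemAccessDesc access = {};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id = 0;                        // device ordinal 0 assumed
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(ptr, sizeBytes, &access, 1);
        return ptr;
    }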

Related

Synchronization between CUDA applications

Is there a way to synchronize two different CUDA applications on the same GPU?
I have two different parts of the processing: the original process and post-processing. The original process uses the GPU, and now we're going to migrate the post-processing to the GPU as well. In our architecture there is a requirement that these two processes be organized as two separate applications.
And now I'm thinking about the synchronization problem:
If I synchronize them at the CPU level, I have to know from outside when the GPU work of the first app is finished.
The ideal way, as I see it, would be to synchronize them somehow at the GPU level.
Is there some flag for that purpose? Or some workaround?
Is there a way to synchronize two different CUDA applications on the same GPU?
In a word, no. You would have to do this via some sort of inter-process communication mechanism on the host side.
If you are on Linux or on Windows with a GPU in TCC mode, host IPC will still be required, but you can "interlock" CUDA activity in one process to CUDA activity in another process using the CUDA IPC mechanism. In particular, it is possible to communicate an event handle to another process, using cudaIpcGetEventHandle, and cudaIpcOpenEventHandle. This would provide an event that you could use for a cudaStreamWaitEvent call. Of course, this is really only half of the solution. You would also need to have CUDA IPC memory handles. The CUDA simpleIPC sample code has most of the plumbing you need.
You should also keep in mind that CUDA cannot be used in a child process if CUDA has been initialized in a parent process. This concept is also already provided for in the sample code.
So you would do something like this:
Process A:
create (cudaMalloc) allocation for buffer to hold results to send to post-process
create event for synchronization
get cuda IPC memory and event handles
using host-based IPC, communicate these handles to process B
launch processing work (i.e. GPU kernel) on the data, results should be put in the buffer created above
into the same stream as the GPU kernel, record the event
signal to process B via host based IPC, to launch work
Process B:
receive memory and event handles from process A using host IPC
extract memory pointer and create IPC event from the handles
create a stream for work issue
wait for signal from process A (indicates event has been recorded)
perform cudaStreamWaitEvent using the local event and created stream
in that same stream, launch the post-processing kernel
This should allow the post-processing kernel to begin only when kernel from process A is complete, using the event interlock. Another caveat with this is that you cannot allow process A to terminate at any time during this. Since it is the owner of the memory and event, it must continue to run as long as that memory or that event is required, even if required in another process. If that is a concern, it might make sense to make process B the "owner" and communicate the handles to process A.
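A minimal sketch of this recipe follows. The kernels and the host-side IPC transport (socket, pipe, shared memory, ...) are placeholders, and error checking is omitted.

    // Sketch of the event interlock described above. The kernels and the host-side
    // IPC transport are placeholders; error checking is omitted.
    #include <cuda_runtime.h>

    __global__ void produce(float* buf, int n);       // hypothetical kernel in process A
    __global__ void postprocess(float* buf, int n);   // hypothetical kernel in process B

    // ---- Process A (owner of the buffer and the event) ----
    void processA(int n)
    {
        float* buf;
        cudaMalloc(&buf, n * sizeof(float));

        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventDisableTiming | cudaEventInterprocess);

        cudaIpcMemHandle_t memHandle;
        cudaIpcEventHandle_t evtHandle;
        cudaIpcGetMemHandle(&memHandle, buf);
        cudaIpcGetEventHandle(&evtHandle, done);
        // ... send memHandle and evtHandle to process B via host IPC ...

        produce<<<256, 256>>>(buf, n);
        cudaEventRecord(done, 0);                     // same (default) stream as the kernel
        // ... signal process B via host IPC that the event has been recorded ...
    }

    // ---- Process B ----
    void processB(int n, cudaIpcMemHandle_t memHandle, cudaIpcEventHandle_t evtHandle)
    {
        float* buf;
        cudaIpcOpenMemHandle((void**)&buf, memHandle, cudaIpcMemLazyEnablePeerAccess);

        cudaEvent_t done;
        cudaIpcOpenEventHandle(&done, evtHandle);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // ... wait for the host IPC signal from process A here ...
        cudaStreamWaitEvent(stream, done, 0);         // gate the stream on process A's event
        postprocess<<<256, 256, 0, stream>>>(buf, n);
    }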

Kernel mode GPGPU usage

Is it possible to run CUDA or OpenCL applications from a Linux kernel module?
I have found a project which provides this functionality, but it needs a userspace helper in order to run CUDA programs (https://code.google.com/p/kgpu/).
While this project already avoids redundant memory copying between user and kernel space I am wondering if it is possible to avoid the userspace completely?
EDIT:
Let me expand my question. I am aware that kernel components can only call the API provided by the kernel and other kernel components. So I am not looking to call OpenCL or CUDA API directly.
The CUDA or OpenCL API in the end has to call into the graphics driver in order to make its magic happen. Most probably this interface is completely non-standard, changing with every release, and so on...
But suppose that you have a compiled OpenCL or CUDA kernel that you would want to run. Do the OpenCL/CUDA userspace libraries do some heavy lifting before actually running the kernel or are they just lightweight wrappers around the driver interface?
I am also aware that the userspace helper is probably the best bet for doing this since calling the driver directly would most likely get broken with a new driver release...
The short answer is, no you can't do this.
There is no way to call any code which relies on glibc from kernel space. That implies that there is no way of making CUDA or OpenCL API calls from kernel space, because those libraries rely on glibc and a host of other user space helper libraries and user space system APIs which are unavailable in kernel space. CUDA and OpenCL aren't unique in this respect -- it is why the whole of X11 runs in userspace, for example.
A userspace helper application working via a simple kernel module interface is the best you can do.
[EDIT]
The runtime components of OpenCL are not lightweight wrappers around a few syscalls to push a code payload onto the device. Amongst other things, they include a full just-in-time compilation toolchain (in fact, that is all OpenCL has supported until very recently), internal ELF code and object management, and a bunch of other stuff. There is very little likelihood that you could emulate the interface and functionality from within a driver at runtime.

CUDA runtime API and dynamic kernel definition

Using the driver API precludes the usage of the runtime API in the same application ([1]). Unfortunately cublas, cufft, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5, which is rumored to support peaceful coexistence of the driver and runtime APIs in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver API, but everything else from the runtime API. No idea if there is any hope for this.
I'm not holding my breath, because jcuda and pycuda are in pretty much the same bind, and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime
To summarize, you are tilting at windmills here. By relying on extremely out-of-date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been supported since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
There is documentation here which succinctly describes how the driver and runtime APIs interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cublas
at the same time, what are the options?
The basic approach goes something like this:
Use the driver API to establish a context on the device as you would normally do.
Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number in the driver API, the same number will select the same GPU in the runtime API.
You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on the runtime API's "lazy" context establishment. A sketch of these steps appears below.
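The sketch assumes a hypothetical kernels.ptx module and kernel name, and omits error checking.

    // Driver-API context + cuModuleLoad, then a runtime-API-based library (cuBLAS)
    // in the same process. "kernels.ptx" and "my_kernel" are placeholders; error
    // checking is omitted.
    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        // 1. Driver API: initialize and create a context on device 0.
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // Driver API: dynamic kernel definition via module loading.
        CUmodule mod;
        cuModuleLoad(&mod, "kernels.ptx");            // hypothetical PTX file
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "my_kernel");   // hypothetical kernel name

        // 2. Runtime API: binds to the existing driver context on the same device.
        cudaSetDevice(0);

        // 3. A runtime-API-based library now works in the same context.
        cublasHandle_t blas;
        cublasCreate(&blas);

        // ... launch fn with cuLaunchKernel, call cublas routines, etc. ...

        cublasDestroy(blas);
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }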

Understanding 'Native Memory profiling' in Chrome developer tools

I am building an application with a simple search panel with few search attributes and a result panel. In the result panel, I am rendering the data in a tabular form using Slickgrid.
After a few searches (AJAX calls to the server), the page becomes very heavy and eventually crashes after some time. I have checked the DOM count and the JavaScript heap usage for possible memory leaks and couldn't find anything wrong there. However, when I ran the experimental native memory profiler, I saw that the "JavaScript external resource" section uses 600+ MB of memory. On running the garbage collector, it comes down to a few MB. I have a couple of questions here:
What contributes to the "JavaScript external resource" section? I thought it corresponds to the JSON data / JavaScript sources that get transferred from the server. FYI, the gzipped JSON response from the server is ~1 MB.
Why is Chrome not releasing the memory proactively instead of crashing the page? Again, when I manually run the garbage collector, it releases the memory used by "JavaScript external resources".
How do I fix the original problem?
The JS heap profiler takes a snapshot of the objects in JavaScript, but JavaScript code may use native memory with the help of "Int8Array", "Uint8Array", "Uint8ClampedArray", "Int16Array", "Uint16Array", "Int32Array", "Uint32Array", "Float32Array", and "Float64Array".
So when you take a snapshot, it will contain only the small wrappers that point to native memory blocks.
Unfortunately, the heap snapshot does not provide data about the native memory used for these kinds of objects.
The native heap snapshot is able to count that memory, so now we know that the page uses native memory via an array or via an external string.
I'd also like to know how you checked that the page has no memory leaks. Did you use the three-snapshot technique, or did you just check particular objects?

Read already allocated memory / vector in Thrust

I am loading a simple variable into GPU memory using Mathematica:
mem = CUDAMemoryLoad[{1, 2, 3}]
And get the following result:
CUDAMemory["<135826556>", "Integer32"]
Now, with this data in GPU memory, I want to access it from a separate .cu program (outside of Mathematica) using Thrust.
Is there any way to do this? If so, can someone please explain how?
No, there isn't a way to do this. CUDA contexts are private, and there is no way in the standard APIs for one process to access memory which was allocated in another process's context.
During the CUDA 4 release cycle, a new API, cudaIpc, was introduced. This allows two processes with CUDA contexts running on the same host to export and exchange handles to GPU memory allocations. The API is only supported on Linux hosts running with unified addressing support. To the best of my knowledge, Mathematica doesn't currently support this.
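For completeness: if a raw device pointer could be obtained in the second process (for example via cudaIpcOpenMemHandle, on a platform that supports it), Thrust can operate on it by wrapping it in a thrust::device_ptr. A small sketch, assuming rawPtr already points to n valid ints of device memory:

    // Wrapping an already-allocated device pointer for use with Thrust.
    // rawPtr is assumed to come from cudaIpcOpenMemHandle (or any other valid
    // device allocation); error checking is omitted.
    #include <thrust/device_ptr.h>
    #include <thrust/reduce.h>

    int sumImported(int* rawPtr, int n)
    {
        thrust::device_ptr<int> dptr = thrust::device_pointer_cast(rawPtr);
        return thrust::reduce(dptr, dptr + n, 0);     // runs on the device
    }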