Read already allocated memory / vector in Thrust - cuda

I am loading a simple variable to the GPU memory using Mathematica:
mem = CUDAMemoryLoad[{1, 2, 3}]
And get the following result:
CUDAMemory["<135826556>", "Integer32"]
Now, with this data in the GPU memory I want to access it from a separate .cu program (outside of Mathematica), using Thrust.
Is there any way to do this? If so, can someone please explain how?

No, there isn't a way to do this. CUDA contexts are private, and there is no way in the standard APIs for a process to access memory which is allocated in another processes context.
During the CUDA 4 release cycle, a new API called cudaIpc was released. This allows two processes with CUDA contexts running on the same host to export and exchange handles to GPU memory allocations. The API is only supported on Linux hosts running with unified addressing support. To the best of my knowledge Mathematica doesn't currently support this.

Related

I don't know how to export a matrix from CUDA [duplicate]

I am a newbie in CUDA programming and in the process of re-writing a C code into a parallelized CUDA new code.
Is there a way to write output data files directly from the device without bothering copying arrays from device to host? I assume if cuPrintf exists, there must be away to write a cuFprintf?
Sorry, if the answer has already been given in a previous topic, I can't seem to find it...
Thanks!
The short answer is, no there is not.
cuPrintf and the built-in printf support in Fermi and Kepler runtime is implemented using device to host copies. The mechanism is no different to using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead throughout the life of your application), and that the CUDA driver can't use zero-copy with any other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, ie. you will still need explicit host side file IO code to get from a zero-copy buffer to disk.

configure local (shared) memory for OpenCL using Nvidia platforms

I want to optimize my local memory access pattern within my OpenCL kernel. I read at somewhere about configurable local memory. E.g. we should be able to configure which amount is used for local mem and which amount is used for automatic caching.
Also i read that the bank size can be chosen for the latest (Kepler) Nvidia hardware here:
http://www.acceleware.com/blog/maximizing-shared-memory-bandwidth-nvidia-kepler-gpus. This point seems to be very crucial for double precision value storing in local memory.
Does Nvidia provide the functionality of setting up the local memory exclusively for CUDA users? I can't find similar methods for OpenCL. So is this maybe called in a different way or does it really not exist?
Unfortunately there is no way to control the L1 cache/local memory configuration when using OpenCL. This functionality is only provided by the CUDA runtime (via cudaDeviceSetCacheConfig or cudaFuncSetCacheConfig).

Writing output files from CUDA devices

I am a newbie in CUDA programming and in the process of re-writing a C code into a parallelized CUDA new code.
Is there a way to write output data files directly from the device without bothering copying arrays from device to host? I assume if cuPrintf exists, there must be away to write a cuFprintf?
Sorry, if the answer has already been given in a previous topic, I can't seem to find it...
Thanks!
The short answer is, no there is not.
cuPrintf and the built-in printf support in Fermi and Kepler runtime is implemented using device to host copies. The mechanism is no different to using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead throughout the life of your application), and that the CUDA driver can't use zero-copy with any other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, ie. you will still need explicit host side file IO code to get from a zero-copy buffer to disk.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated thru device memory, but how to sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide at the end of section 2.2 Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try you may find partial success. However, this will fail on different OSes, different executions of your application, and this will be broken by future driver updates and new hardware (GK110 supports enhanced concurrency model).
If I correctly caught your question, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As I know, there is no reliable facilities for inter-kernel synchronization provided by CUDA. And I'm unaware about any suitable trick that can be applied here.
CUDA C Programming Gide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. From the top of my head, calling cudaDeviceSynchronize for every stream in turn should do the trick but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must send this data back to the second kernel
your code must be something similar to this:
*dataToExchange_h,*dataToExchange_d;
cudaMalloc((void**)dataToExchange, sizeof(data));
kernel1<<< M1,N1,0,stream1>>>(dataToExchange);
cudaStreamSynchronize(stream1);
kernel2<<< M2,N2,0,stream2>>>(dataToExchange);
But note that stream synchronization slow down the process, so you should avoid it as much as possible.
You can also get stream synchronization through cuda events, it less obvious and does not give special advantage, but it's useful to know it ;-)

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing?
Note: I would really like to see an example were we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know "Unified Virtual Addressing" is really more about using multiple devices, abstracting from explicit memory management. Think of it as a single virtual GPU, everything else still valid.
Using host memory automatically is already possible with device-mapped-memory. See cudaMalloc* in the reference manual found at the nvidia cuda website.
CUDA 4.0 UVA (Unified Virtual Address) does not help you in accessing the main memory from the CUDA threads. As in the previous versions of CUDA, you still have to map the main memory using CUDA API for direct access from GPU threads, but it will slow down the performance as mentioned above. Similarly, you cannot access GPU device memory from CPU thread just by dereferencing the pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory), and does not provide coherent accessibility.