What are CUDA shared memory capabilities?

I am a newbie in CUDA. As far as I know, I must use global memory to make blocks communicate with each other, but my understanding of streams and the memory model got stuck somewhere. After searching, I figured out that streams queue multiple kernels in sequence and can be used to apply different kernels to different blocks.
Now I NEED to exchange arrays between two or more blocks. Can a kernel be used to swap or exchange data held in shared memory between blocks without involving global/device memory?

If I allocate a block for each sub-population to calculate fitness using some kernel and shared memory, can I transfer data between blocks?
No. Shared memory has block scope. It is not portable between blocks. Global memory or heap memory is portable and could potentially be used to hold data to be accessed by multiple blocks.
However, the standard execution model in CUDA doesn't support grid-level synchronization. Since CUDA 9, and on the newest generations of hardware, there is support for a grid-level synchronization mechanism if you use cooperative groups; however, neither PyCUDA nor Numba exposes that facility as far as I am aware.
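To illustrate the global-memory route, here is a minimal sketch in CUDA C++ (not PyCUDA/Numba): each block publishes its data to a staging buffer in global memory, a cooperative-groups grid sync acts as the grid-wide barrier, and each block then reads what the other block wrote. The kernel name, buffer layout, and launch configuration are illustrative assumptions, not something from the question.

```cuda
// Minimal sketch: two blocks exchange values through global memory.
// Grid sync requires a cooperative launch; compile with e.g.
//   nvcc -arch=sm_70 -rdc=true exchange.cu
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void exchangeKernel(int *exchange)
{
    cg::grid_group grid = cg::this_grid();

    // Each block publishes one value to its slot in global memory.
    if (threadIdx.x == 0)
        exchange[blockIdx.x] = (int)blockIdx.x * 100;

    // Grid-wide barrier: every block has published before anyone reads.
    grid.sync();

    // Each block reads the value the other block published.
    if (threadIdx.x == 0) {
        int other = exchange[1 - blockIdx.x];
        printf("block %d sees %d\n", (int)blockIdx.x, other);
    }
}

int main()
{
    int *exchange;
    cudaMalloc(&exchange, 2 * sizeof(int));

    // Cooperative kernels must be launched with cudaLaunchCooperativeKernel.
    void *args[] = { (void *)&exchange };
    cudaLaunchCooperativeKernel((void *)exchangeKernel, dim3(2), dim3(32), args, 0, 0);
    cudaDeviceSynchronize();

    cudaFree(exchange);
    return 0;
}
```

Without cooperative launch support (e.g. from PyCUDA or Numba), the portable alternative is to split the work into two kernel launches, since a kernel launch boundary is the only grid-wide synchronization point the standard execution model gives you.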

Related

Can Unified Memory Migration use NVLink?

I want to check whether unified memory migration (as previously discussed on this page) across different GPUs can now leverage NVLink on later versions of CUDA and newer GPU architectures.
Yes, unified memory migration can use NVLink.
For a unified memory allocation, this would typically occur when one GPU accesses that allocation, then another GPU accesses that allocation. If those 2 GPUs are in a direct NVLink relationship, the migration of pages from the first to the second will flow over NVLink.
In addition, although you didn't ask about it, NVLink also provides a path for peer-mapped pages, where they do not migrate, but instead a mapping is provided from one GPU to another. The pages may stay on the first GPU, and the 2nd GPU will access them using memory read or write cycles which take place over NVLink.
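As a rough sketch of the migration case (CUDA C++; the kernel and sizes are illustrative, and it assumes a system with at least two GPUs):

```cuda
// A single managed allocation touched by two GPUs in turn. The second access
// triggers page migration from GPU 0 to GPU 1; if the two devices are direct
// NVLink peers, that migration traffic can flow over NVLink.
#include <cstdio>

__global__ void touch(float *data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const size_t n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // single unified allocation

    cudaSetDevice(0);
    touch<<<(n + 255) / 256, 256>>>(data, n);      // pages populate on GPU 0
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    touch<<<(n + 255) / 256, 256>>>(data, n);      // access from GPU 1 migrates pages
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```

Whether the resulting page traffic actually travels over NVLink depends on the two devices being direct NVLink peers; otherwise it falls back to PCIe.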

Any guarantees that Torch won't mess with an already allocated CUDA array?

Assume we allocated some array on our GPU through means other than PyTorch, for example by creating a GPU array using numba.cuda.device_array. Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array? In general, since PyTorch and Numba use the same CUDA runtime, and thus I assume the same mechanism for memory management, are they automatically aware of memory regions used by other CUDA programs, or does each one of them see the entire GPU memory as its own? If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
EDIT: figured this would be an important assumption: assume that all allocations are done by the same process.
Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array?
No.
are they automatically aware of memory regions used by other CUDA programs ...
They are not "aware", but each process gets its own separate context ...
... or does each one of them see the entire GPU memory as his own?
.... and contexts have their own address spaces and isolation. So neither, but there is no risk of memory corruption.
If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
If by "aware" you mean "safe", then that happens automatically. If by "aware" you imply some sort of interoperability, then that is possible on some platforms, but it is not automatic.
... assume that all allocations are done by the same process.
That is a different situation. In general, the same process implies a shared context, and shared contexts share a memory space, but all the normal address-space protection rules and facilities apply, so there is no loss of safety.
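To make the shared-context point concrete, here is a minimal CUDA C++ sketch (illustrative sizes) standing in for two libraries allocating in the same process: both allocations live in the same address space of the shared context, but they occupy disjoint ranges, so neither can silently overwrite the other.

```cuda
// Two independent allocations in the same process/context never overlap.
#include <cstdio>

int main()
{
    char *a, *b;
    cudaMalloc(&a, 1 << 20);   // first "library's" allocation
    cudaMalloc(&b, 1 << 20);   // second "library's" allocation

    // Both pointers live in the same unified address space of this process's
    // context, but the ranges handed out are disjoint.
    printf("a: [%p, %p)\n", (void *)a, (void *)(a + (1 << 20)));
    printf("b: [%p, %p)\n", (void *)b, (void *)(b + (1 << 20)));

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```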

Splitting an array on a multi-GPU system and transferring the data across the different GPUs

I'm using CUDA on a dual-GPU system with NVIDIA GTX 590 cards, and I have an array partitioned according to the figure below.
If I'm going to use cudaSetDevice() to split the sub-arrays across the GPUs, will they share the same global memory? Could the first device access the updated data on the second device and, if so, how?
Thank you.
Each device's memory is separate, so if you call cudaSetDevice(A) and then cudaMalloc(), you are allocating memory on device A. If you subsequently access that memory from device B, you will incur higher access latency since the access has to go through the external PCIe link.
An alternative strategy would be to partition the result across the GPUs and store all the input data needed on each GPU. This means you have some duplication of data but this is common practice in GPU (and indeed any parallel method such as MPI) programming - you'll often hear the term "halo" applied to the data regions that need to be transferred between updates.
Note that you can check whether one device can access another's memory using cudaDeviceCanAccessPeer(); in cases where you have a dual-GPU card this is always true.
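As a minimal sketch (CUDA C++, assuming devices 0 and 1; buffer and halo sizes are illustrative), the peer check and a device-to-device copy of a halo region might look like this:

```cuda
// Check peer capability, enable peer access, and copy a halo region
// from device 0's sub-array into device 1's sub-array.
#include <cstdio>

int main()
{
    const size_t n = 1 << 20;
    float *d0, *d1;

    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(float));          // sub-array owned by device 0

    cudaSetDevice(1);
    cudaMalloc(&d1, n * sizeof(float));          // sub-array owned by device 1

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 access device 0's memory?
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);        // enable direct peer access from device 1
    }

    // Copy a "halo" of 1024 elements from device 0 into device 1's buffer.
    cudaMemcpyPeer(d1, 1, d0, 0, 1024 * sizeof(float));
    cudaDeviceSynchronize();

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```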

Question on CUDA programming model

Hi, I am new to CUDA programming and I have two questions about the CUDA programming model.
In brief, the model says there is a memory hierarchy in terms of threads, blocks, and then grids. Threads within a block have shared memory and are able to communicate with each other easily, but cannot communicate if they are in different blocks. There is also a global memory on the GPU device.
My questions are:
(1) Why do we need to have such a memory hierarchy consisting of threads and then blocks?
That way, any two threads could communicate with each other if needed, which would probably simplify the programming effort.
(2) Why is there a restriction of setting up threads only in up to 3D configurations and not beyond?
Thank you.
1) This allows you to have a generalized programming model that supports hardware with different numbers of processors. It is also a reflection of the underlying GPU hardware, which treats threads within a block differently from threads in different blocks with respect to memory access and synchronization.
Threads can communicate via global memory, or via shared memory depending on their block affinity. You can also use synchronization primitives, like __syncthreads() (see the sketch after this answer).
2) This is part of the programming model. I suspect it is largely due to user demand for data decomposition of 3-dimensional problems, and that there was little demand for support of further dimensions.
The CUDA Programming Guide covers a lot of this sort of thing. There are also a couple of books available. There's a good discussion in Programming Massively Parallel Processors: A Hands-on Approach that goes into why GPU hardware is the way it is and how that has been reflected in the programming model.
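Here is a minimal sketch of the intra-block communication mentioned in point 1; the kernel, block size, and data are illustrative, assuming a single block of 256 threads:

```cuda
// Each thread writes one element to shared memory, the block synchronizes
// with __syncthreads(), and then each thread reads an element written by a
// different thread (here: reverse order within the block).
#include <cstdio>

__global__ void reverseInBlock(int *out)
{
    __shared__ int buf[256];                 // visible to all threads in this block

    buf[threadIdx.x] = threadIdx.x;          // each thread publishes its value
    __syncthreads();                         // wait until every thread has written

    out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    int *d_out, h_out[256];
    cudaMalloc(&d_out, 256 * sizeof(int));
    reverseInBlock<<<1, 256>>>(d_out);
    cudaMemcpy(h_out, d_out, 256 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d, h_out[255] = %d\n", h_out[0], h_out[255]);  // 255, 0
    cudaFree(d_out);
    return 0;
}
```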
(1) Local memory is used to store local values that don't fit into registers. Shared memory is used to store common data that is shared by threads. Local memory plus registers compose the execution context of a thread, and shared memory is storage for the data to be processed.
(2) You can easily use 1D to represent any number of dimensions. For example, if you have a 1D index you can convert it to 2D space using x = i % width, y = i / width, and the inverse is i = y*width + x. 2D and 3D were added for your convenience. It is much the same way that N-D arrays are implemented in C++.
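A small sketch of that index mapping inside a kernel (the image dimensions and kernel are illustrative):

```cuda
// A flat 1D launch addressing a 2D array: x = i % width, y = i / width.
#include <cstdio>

__global__ void fill2DFrom1D(float *img, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // flat 1D index
    if (i >= width * height) return;

    int x = i % width;           // column
    int y = i / width;           // row
    img[y * width + x] = x + y;  // same linearization in the other direction
}

int main()
{
    const int width = 640, height = 480;
    float *img;
    cudaMalloc(&img, width * height * sizeof(float));

    int threads = 256;
    int blocks = (width * height + threads - 1) / threads;
    fill2DFrom1D<<<blocks, threads>>>(img, width, height);
    cudaDeviceSynchronize();

    cudaFree(img);
    return 0;
}
```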

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices while abstracting away explicit memory management. Think of it as a single virtual GPU; everything else is still valid.
Using host memory from the device is already possible with device-mapped (pinned) memory. See cudaHostAlloc (with the cudaHostAllocMapped flag) in the reference manual found on the NVIDIA CUDA website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and that will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing the pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
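For the kind of example the question asks for, here is a minimal sketch (CUDA C++; the array size and kernel are illustrative) of mapped, zero-copy pinned host memory being dereferenced directly by a kernel. Every such access crosses the PCIe bus, which is why the answers above warn about performance:

```cuda
// A kernel reading and writing host RAM directly via mapped pinned memory.
#include <cstdio>

__global__ void incrementInHostRam(int *hostData, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hostData[i] += 1;   // dereferences host RAM from the GPU
}

int main()
{
    const int n = 1024;
    int *h_data = nullptr, *d_alias = nullptr;

    // Must be set before the context is created for mapped memory to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, device-mapped host allocation.
    cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = i;

    // Device-side pointer aliasing the same host memory (with UVA this is
    // usually numerically equal to h_data, but this call is the portable way).
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

    incrementInHostRam<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    printf("h_data[0] = %d, h_data[%d] = %d\n", h_data[0], n - 1, h_data[n - 1]);

    cudaFreeHost(h_data);
    return 0;
}
```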