Any guarantees that Torch won't mess with an already allocated CUDA array?

Assume we allocated an array on our GPU by means other than PyTorch, for example by creating a GPU array using numba.cuda.device_array. Will PyTorch, when later allocating GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array? In general, since PyTorch and Numba use the same CUDA runtime, and thus I assume the same mechanism for memory management, are they automatically aware of memory regions used by other CUDA programs, or does each one of them see the entire GPU memory as its own? If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
EDIT: figured this would be an important assumption: assume that all allocations are done by the same process.

Will PyTorch, when later allocating GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array?
No.
are they automatically aware of memory regions used by other CUDA programs ...
They are not "aware", but each process gets its own separate context ...
... or does each one of them see the entire GPU memory as its own?
... and contexts have their own address spaces and isolation. So neither, but there is no risk of memory corruption.
If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
If by "aware" you mean "safe", then that happens automatically. If by "aware" you imply some sort of interoperability, then that is possible on some platforms, but it is not automatic.
... assume that all allocations are done by the same process.
That is a different situation. In general, the same process implies a shared context, and shared contexts share a memory space, but all the normal address space protection rules and facilities apply, so there is no loss of safety.
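For illustration, a minimal sketch in CUDA C of the same-process case: two independent allocations made through the same context come out of one shared address space but never overlap. The raw cudaMalloc calls below are just a stand-in for whatever PyTorch's caching allocator or Numba does internally, since both ultimately request device memory from the same driver-level allocator.

    #include <cassert>
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Two independent allocations in the same process/context.
        char *a = nullptr, *b = nullptr;
        size_t bytes = 1 << 20;                 // 1 MiB each, arbitrary
        cudaMalloc((void **)&a, bytes);
        cudaMalloc((void **)&b, bytes);

        // The ranges [a, a+bytes) and [b, b+bytes) never alias.
        bool overlap = (a < b + bytes) && (b < a + bytes);
        printf("a = %p, b = %p, overlap = %s\n", (void *)a, (void *)b,
               overlap ? "yes" : "no");
        assert(!overlap);

        cudaFree(a);
        cudaFree(b);
        return 0;
    }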

Related

Can Unified Memory Migration use NVLink?

I want to check whether unified memory migration across different GPUs (as previously discussed in this page) can now leverage NVLink with later versions of CUDA and newer GPU architectures.
Yes, unified memory migration can use NVLink.
For a unified memory allocation, this would typically occur when one GPU accesses that allocation, then another GPU accesses that allocation. If those 2 GPUs are in a direct NVLink relationship, the migration of pages from the first to the second will flow over NVLink.
In addition, although you didn't ask about it, NVLink also provides a path for peer-mapped pages, where they do not migrate, but instead a mapping is provided from one GPU to another. The pages may stay on the first GPU, and the 2nd GPU will access them using memory read or write cycles which take place over NVLink.
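As an illustration, here is a minimal sketch (assuming a system with at least two GPUs that support managed memory): touching a managed allocation first on one GPU and then on another is what triggers the page migration described above, and that traffic flows over NVLink when the two GPUs are directly NVLink-connected.

    #include <cuda_runtime.h>

    __global__ void touch(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;   // forces the page holding p[i] to be resident here
    }

    int main() {
        int n = 1 << 20;
        int *data = nullptr;
        cudaMallocManaged((void **)&data, n * sizeof(int));   // unified memory allocation

        cudaSetDevice(0);
        touch<<<(n + 255) / 256, 256>>>(data, n);   // pages migrate to GPU 0
        cudaDeviceSynchronize();

        cudaSetDevice(1);
        touch<<<(n + 255) / 256, 256>>>(data, n);   // pages migrate GPU 0 -> GPU 1,
                                                    // over NVLink if the GPUs are
                                                    // directly NVLink-connected
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }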

What are Cuda shared memory capabilities

I am a newbie in CUDA. As far as I know, I must use global memory to make blocks communicate with each other, but my understanding of the stream concept and of the memory capabilities is stuck somewhere. After searching, I figured that streams queue multiple kernels in sequence and can be used to apply different kernels to different blocks.
Now I NEED to exchange arrays between 2 or more blocks. Can a kernel be used to swap or exchange data in shared memory between blocks without involving global/device memory?
if I allocated a block for each sub-population to calculate fitness using some kernel and shared memory, can I transfer data between blocks
No. Shared memory has block scope. It is not portable between blocks. Global memory or heap memory is portable and could potentially be used to hold data to be accessed by multiple blocks.
However, the standard execution model in CUDA doesn't support grid-level synchronization. Since CUDA 9, and with the newest generations of hardware, there is support for a grid-level synchronization mechanism if you use cooperative groups; however, neither PyCUDA nor Numba exposes that facility as far as I am aware.
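To make the standard pattern concrete, here is a sketch in CUDA C (the kernel and variable names are just illustrative): each block writes a partial result to global memory, and a second kernel launch, which acts as the grid-wide synchronization point, then reads all of those results. The same two-launch structure is what you would express in Numba or PyCUDA.

    #include <cuda_runtime.h>

    // Stage 1: each block reduces its slice in shared memory and writes one
    // partial sum to global memory (the only memory visible across blocks).
    __global__ void block_sums(const float *in, float *partial, int n) {
        __shared__ float s[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = s[0];
    }

    // Stage 2: a second kernel launch consumes every block's result.
    // The launch boundary is the grid-wide synchronization point.
    __global__ void total_sum(const float *partial, float *out, int nblocks) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            float acc = 0.0f;
            for (int b = 0; b < nblocks; ++b) acc += partial[b];
            *out = acc;
        }
    }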

What are the lifetimes for CUDA constant memory?

I'm having trouble wrapping my head around the restrictions on CUDA constant memory.
Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed size variable with near-global scope?
When is constant memory actually loaded, or unloaded? I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory? Relatedly, is there a cost to binding and unbinding, similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
Where does constant memory reside on the chip?
I'm primarily interested in answers as they relate to Pascal and Volta.
It is probably easiest to answer these six questions in reverse order:
Where does constant memory reside on the chip?
It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
Is there a cost to binding and unbinding, similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
No. But there also isn't "binding" or "unbinding" because reservations are performed statically. The only runtime costs are host to device memory transfers and the cost of loading the symbols into the context as part of context establishment.
I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory?
No. There is only one "allocation" for the entire GPU (although, as noted above, there are specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
When is constant memory actually loaded, or unloaded?
It depends on what you mean by "loaded" and "unloaded". Loading is really a two-phase process: first, the symbol is retrieved and loaded into the context (if you use the runtime API this is done automagically), and second, any user runtime operations that alter the contents of the constant memory via cudaMemcpyToSymbol.
Why do I need to compile in a fixed size variable with near-global scope?
As already noted, constant memory is basically a logical address space in the PTX memory hierarchy which is reflected by a finite size reserved area of the GPU DRAM map and which requires the compiler to emit specific instructions to access uniformly via a dedicated on chip cache or caches. Given its static, compiler analysis driven nature, it makes sense that its implementation in the language would also be primarily static.
Why can't we allocate __constant__ memory at runtime?
Primarily because NVIDIA have chosen not to expose it. But given all the constraints outlined above, I don't think it is an outrageously poor choice. Some of this might well be historic, as constant memory has been part of the CUDA design since the beginning. Almost all of the original features and functionality in the CUDA design map to hardware features which existed for the hardware's first purpose, which was the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct 3D, but I am not familiar enough with either to say for sure.
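Putting those pieces together, a minimal sketch of how the static reservation and the runtime load phase fit (the array name and sizes here are just illustrative): the __constant__ array is sized at compile time, and cudaMemcpyToSymbol only fills in its contents at runtime.

    #include <cstdio>
    #include <cuda_runtime.h>

    // The reservation is static: the size is fixed at compile time and the
    // symbol has near-global (translation unit) scope.
    __constant__ float coeffs[16];

    __global__ void scale(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * coeffs[i % 16];   // read via the constant cache
    }

    int main() {
        float host_coeffs[16];
        for (int i = 0; i < 16; ++i) host_coeffs[i] = 0.5f * i;

        // Second phase of "loading": fill the already-reserved symbol at runtime.
        cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));

        int n = 1024;
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));

        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }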

CUDA malloc, mmap/mremap

CUDA device memory can be allocated and freed using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which will hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
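A sketch of the manual route (grow_device_buffer is just a hypothetical helper name): allocate the larger buffer, copy the old contents device-to-device, then release the old allocation. This is essentially what a thrust device vector resize has to do behind the scenes.

    #include <cuda_runtime.h>

    // "Grow" a device allocation: allocate a larger buffer, copy the old
    // contents over on the device, then free the old allocation.
    cudaError_t grow_device_buffer(void **d_ptr, size_t old_bytes, size_t new_bytes) {
        void *d_new = nullptr;
        cudaError_t err = cudaMalloc(&d_new, new_bytes);
        if (err != cudaSuccess) return err;

        err = cudaMemcpy(d_new, *d_ptr, old_bytes, cudaMemcpyDeviceToDevice);
        if (err != cudaSuccess) { cudaFree(d_new); return err; }

        cudaFree(*d_ptr);   // release the old allocation
        *d_ptr = d_new;     // caller now owns the larger buffer
        return cudaSuccess;
    }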
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and, apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
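To make that concrete, a minimal sketch of mapped (zero-copy) pinned memory: the kernel dereferences a device pointer that aliases the host allocation, so every access it makes travels over PCI-E rather than being staged in device DRAM. (On older devices you may need cudaSetDeviceFlags(cudaDeviceMapHost) before the context is created; newer devices map pinned host allocations by default.)

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void inc(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1;   // each access crosses the PCI-E bus; nothing
                                // is duplicated in device memory
    }

    int main() {
        int n = 1 << 20;
        int *h = nullptr, *d = nullptr;

        // Pinned host allocation that is also mapped into the device address space.
        cudaHostAlloc((void **)&h, n * sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d, h, 0);

        for (int i = 0; i < n; ++i) h[i] = i;

        inc<<<(n + 255) / 256, 256>>>(d, n);   // GPU reads/writes host memory directly
        cudaDeviceSynchronize();

        printf("h[0] = %d\n", h[0]);           // prints 1: the write landed in host RAM
        cudaFreeHost(h);
        return 0;
    }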

Coalescence vs Bank conflicts (Cuda)

What is the difference between coalescence and bank conflicts when programming with CUDA?
Is it only that coalescence happens in global memory while bank conflicts happen in shared memory?
Should I worry about coalescence if I have a GPU with compute capability >1.2? Does it handle coalescence by itself?
Yes, coalesced reads/writes are applicable to global memory and bank conflicts are applicable to shared memory reads/writes.
Different compute capability devices have different behaviours here, but a 1.2 GPU still needs care to ensure that you're coalescing reads and writes - it's just that there are some optimisations that make things easier for you.
You should read the CUDA Best Practices guide. This goes into lots of detail about both these issues.
Yes: coalesced accesses are relevant to global memory only, bank conflicts are relevant to shared memory only.
Also check out the Advanced CUDA C training session: the first section goes into some detail to explain how the hardware in >1.2 GPUs helps you and what optimisations you still need to consider. It also explains shared memory bank conflicts. Check out this recording for example.
The scan and reduction samples in the SDK also explain shared memory bank conflicts really well with progressive improvements to a kernel.
A >1.2 GPU will try to do the best it can wrt coalescing, in that it is able to group memory accesses of the same size that fit within the same memory atom of 256 bytes and write them out as 1 memory write. The GPU will take care of reordering accesses and aligning them to the right memory boundary. (In earlier GPUs, memory transactions within a warp had to be aligned to the memory atom and had to be in the right order.)
However, for optimal performance, you still need to make sure that those coalescing opportunities are available. If all threads within a warp have memory transactions to completely different memory atoms, there is nothing the coalescer can do, so it still pays to be aware about the memory locality behavior of your kernel.
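For illustration, a sketch of the access patterns being discussed (kernel names are just illustrative): the first copy gives the coalescer consecutive addresses within each warp, the second defeats it with a large stride, and the shared memory kernel contrasts a conflict-free read with a 32-way bank conflict on 32-bank hardware.

    #include <cuda_runtime.h>

    // Coalesced: consecutive threads in a warp touch consecutive addresses,
    // so each warp's accesses collapse into a few wide memory transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses 32 floats apart, so the
    // accesses land in different memory segments and cannot be coalesced.
    __global__ void copy_strided(const float *in, float *out, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        if (i < n) out[i] = in[i];
    }

    // Shared memory: assumes blockDim.x == 32 (one warp) for simplicity.
    __global__ void bank_conflict_demo(float *out) {
        __shared__ float s[32 * 32];
        int tid = threadIdx.x;
        for (int j = 0; j < 32; ++j) s[j * 32 + tid] = (float)tid;
        __syncthreads();
        float a = s[tid];        // conflict-free: one bank per thread
        float b = s[tid * 32];   // 32-way conflict: every thread hits the same bank
        out[tid] = a + b;
    }

    int main() {
        int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));

        copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        copy_strided<<<(n + 255) / 256, 256>>>(d_in, d_out, n);   // wastes most of each transaction
        bank_conflict_demo<<<1, 32>>>(d_out);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }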