Is non-aligned pinned memory allowed with CUDA?

Is passing a pointer to cudaHostRegister that's not page-aligned allowed/portable? I'm asking because the simpleStream example does manual page-alignment, but I can't find this requirement in the documentation. Maybe it's a portability problem (similar to mlock() supporting non-aligned on linux, but POSIX does not in general)?
I modified the bandwidth test, and non-aligned but registered memory performs the same as memory returned by cudaHostAlloc. Since I use these pinned buffers to overlap copies and computation, I'm also interested in whether non-alignment prevents that (so far I could not detect a performance loss).
All my tests were on x86-64 linux.

Maybe it's a portability problem (similar to mlock() supporting non-aligned on linux, but POSIX does not in general)?
Both Linux's mlock and Windows' VirtualLock will lock all pages containing a byte or more of the address range you want to lock; manual alignment is not needed. But as you noted, POSIX allows an implementation to require the argument of mlock to be page-aligned. This is notably the case with OS X's mlock, which will round a page-unaligned address up to the next page boundary, therefore not locking the entirety of the address range.
The documentation of cudaHostRegister makes no mention of any alignment constraint on its arguments. As such, a consumer of this API would be within their rights to expect that any concern about alignment on the underlying platform is the responsibility of cudaHostRegister, not the user. But without seeing the source of cudaHostRegister, it's impossible to tell whether this is actually the case. Since the sample deliberately takes care of alignment manually, it is possible that cudaHostRegister doesn't have such transparent alignment-fixing functionality.
Therefore, yes, it is likely the sample was written to ensure its portability across OSes supported by CUDA (Windows, Linux, Mac OS X).

I just found the following lines in the old CUDA 4.0 NVIDIA Library documentation... Maybe it can be helpful for future questions:
The CUDA context must have been created with the cudaMapHost flag in order for the cudaHostRegisterMapped flag to have any effect.
The cudaHostRegisterMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostRegisterPortable flag.
and finally
The pointer ptr and size size must be aligned to the host page size (4 KB).
so it is about the host page size.
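For completeness, here is a minimal sketch of the kind of manual alignment the simpleStream sample performs before registering, assuming the 4 KB host page size quoted above (the names and sizes are illustrative, not the sample's exact code):

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define PAGE_SIZE 4096  /* assumed host page size, per the doc quote above */

int main() {
    const size_t bytes = 1 << 20;  /* already a multiple of the page size */

    /* Over-allocate, then round the pointer up to the next page boundary. */
    void *raw = malloc(bytes + PAGE_SIZE);
    void *aligned = (void *)(((uintptr_t)raw + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1));

    cudaError_t err = cudaHostRegister(aligned, bytes, cudaHostRegisterDefault);
    printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaHostUnregister(aligned);
    free(raw);
    return 0;
}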

Related

Does cuMemcpy "care" about the current context?

Suppose I have a GPU and driver version supporting unified addressing; two GPUs, G0 and G1; a buffer allocated in G1 device memory; and that the current context C0 is a context for G0.
Under these circumstances, is it legitimate to cuMemcpy() from my buffer to host memory, despite it having been allocated in a different context for a different device?
So far, I've been working under the assumption that the answer is "yes". But I've recently experienced some behavior which seems to contradict this assumption.
Calling cuMemcpy from another context is legal, regardless of which device the context was created on. Depending on which case you are in, I recommend the following:
If this is a multi-threaded application, double-check your program and make sure you are not releasing your device memory before the copy is completed
If you are using the cuMallocAsync/cuFreeAsync API to allocate and/or release memory, please make sure that operations are correctly stream-ordered
Run compute-sanitizer on your program
If you keep experiencing issues after these steps, you can file a bug with NVIDIA here.
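For reference, a minimal driver-API sketch of the scenario in the question: the buffer is allocated on device 1 under its own context, then cuMemcpy is called while device 0's context is current. The CHECK macro is just an illustrative error-check helper, and real code would want proper cleanup:

#include <cstdint>
#include <cstdio>
#include <cuda.h>

/* Illustrative error-check helper, not part of the driver API. */
#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { printf("error %d at line %d\n", (int)r, __LINE__); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));

    CUdevice d0, d1;
    CHECK(cuDeviceGet(&d0, 0));
    CHECK(cuDeviceGet(&d1, 1));

    CUcontext c0, c1;
    CHECK(cuCtxCreate(&c0, 0, d0));
    CHECK(cuCtxCreate(&c1, 0, d1));

    /* Allocate the buffer on device 1, under its context C1. */
    CHECK(cuCtxSetCurrent(c1));
    CUdeviceptr buf;
    CHECK(cuMemAlloc(&buf, 1 << 20));

    /* Make C0 (device 0) current and copy the device-1 buffer to pinned host
       memory. With unified addressing, cuMemcpy works out what each pointer
       refers to, even though buf belongs to a different context and device. */
    CHECK(cuCtxSetCurrent(c0));
    void *host;
    CHECK(cuMemAllocHost(&host, 1 << 20));
    CHECK(cuMemcpy((CUdeviceptr)(uintptr_t)host, buf, 1 << 20));
    printf("copy completed\n");

    CHECK(cuMemFreeHost(host));
    CHECK(cuCtxSetCurrent(c1));
    CHECK(cuMemFree(buf));
    return 0;
}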

Is Pinned memory non-atomic read/write safe on Xavier devices?

Cross-posted here; since I did not get a response, I will post here as well.
CUDA version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX
I have been trying to find the answer to this across this forum, stack overflow, etc.
So far, what I have seen is that there is no need for an atomicRead in CUDA because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella
However I have not found anything about cache flushing/loading for the iGPU on a tegra device. Is this also on “32-byte cachelines”?
My use case is to have one kernel writing to various parts of a chunk of memory (not atomically, i.e. not using atomic* functions), but also have a second kernel only reading those same bytes in a non-tearing manner. I am okay with slightly stale data in my read (given that the writing kernel flushes/updates the memory such that subsequent read kernels/processes get the update within a few milliseconds). The write kernel launches and completes after 4-8 ms or so.
At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?
Can/should pinned memory be used for this use case, or would unified be more appropriate such that I can take advantage of the cache safety within the iGPU?
According to the Memory Management section here, iGPU access to pinned memory is Uncached. Does this mean we cannot trust the iGPU to still have safe access as Robert said above?
If using pinned, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined/segfault territory?
Additionally if using pinned and an atomic write and read occur at the same time, what is the outcome?
My goal is to remove the use of CPU-side mutexing around the memory being used by my various kernels, since this is causing a coupling/slow-down of two parts of my system.
Any advice is much appreciated. TIA.
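To make the use case concrete, here is a rough, illustrative sketch of the pattern I mean (the launch configuration and names are made up); whether the uncached pinned path on the iGPU preserves the non-tearing property described in the quote above is exactly what I am asking:

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

/* Writer: plain (non-atomic) stores of naturally aligned 64-bit values. */
__global__ void writer(uint64_t *buf, int n, uint64_t stamp) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = (stamp << 32) | (uint64_t)i;
}

/* Reader: volatile 64-bit loads; slightly stale values are acceptable,
   but each individual load is expected not to be torn into halves. */
__global__ void reader(const uint64_t *buf, int n, uint64_t *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = *((volatile const uint64_t *)&buf[i]);
}

int main() {
    const int n = 1 << 16;
    cudaSetDeviceFlags(cudaDeviceMapHost);  /* allow mapped pinned memory */

    uint64_t *buf, *out;
    cudaHostAlloc(&buf, n * sizeof(uint64_t), cudaHostAllocMapped);
    cudaHostAlloc(&out, n * sizeof(uint64_t), cudaHostAllocMapped);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* With unified addressing (64-bit Tegra), the host pointers can be passed
       directly to the kernels; writer and reader run concurrently in streams. */
    writer<<<(n + 255) / 256, 256, 0, s0>>>(buf, n, 42);
    reader<<<(n + 255) / 256, 256, 0, s1>>>(buf, n, out);
    cudaDeviceSynchronize();

    printf("out[0] = %llx\n", (unsigned long long)out[0]);
    cudaFreeHost(buf);
    cudaFreeHost(out);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}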

Native Client inner/outer sandbox

I am dealing with the Chrome Native Client and have some difficulties in the following points:
As I understand it so far, the first 64 KB of the 256 MB NaCl segment are dedicated to the inner sandbox. This inner sandbox contains the trampoline and the springboard, which communicate from the trusted code to the untrusted code and vice versa. When I am in these first 64 KB, can I jump to the middle of a 32-byte instruction bundle? For example, if I have a 32-byte bundle in the trampoline, can I jump from it to the middle (not 32-byte aligned) of another 32-byte bundle in the trampoline? Are all the instructions in the trampoline and the springboard also 32-byte aligned?
Can I combine several x86 instructions into one 32-byte-aligned NaCl bundle (for example, putting AND 0xffffffe0 %eax and JMP EAX in one 32-byte-aligned NaCl bundle)?
I understand that the service runtime deals with process creation, memory management, etc., and that it is accessed through the trampoline. How exactly does the trampoline instruction access the service runtime? Where is the service runtime located in memory? When the service runtime finishes, can it access a non-32-byte-aligned instruction in the springboard?
What is the actual duty of the outer sandbox? How does it monitor and filter system calls? If there is a bug in the validator of the inner sandbox, in what cases can the outer sandbox catch an illegal/malicious instruction?
Thank you all
I'm not 100% sure off the top of my head, but I would guess from looking just at the directory layout of the source that they are both part of the trusted service runtime code (they are in the src/trusted/service_runtime directory), and are therefore built with the system compiler and not subject to validation.
Yes, there is no limit on the number of instructions in a 32-byte bundle. The restriction is just that no instruction (or multi-instruction sandboxing sequence such as the one you mentioned for indirect jumps) may cross the bundle boundary. So in your example, both of those instructions would be required to be in the same bundle.
Again I'm a bit fuzzy on the details of how the trampolines work but when control transfers from the trampoline, it ends up in the service runtime, which is just ordinary machine code built according to the native ABIs for the OS. So the service runtime can use any system calls (at least any allowed by the outer sandbox) and can read or execute any part of the untrusted code.
The outer sandbox is, strictly speaking, a defense in depth (i.e. the inner sandbox is in theory sufficient to contain the untrusted code). It filters system calls in different ways on different OSes. In Chrome's embedding of NaCl, the outer sandbox is the same implementation as the Chrome sandbox used for the renderer and GPU processes.

cudaMemcpy D2D flag - semantics w.r.t. multiple devices, and is it necessary?

I've not had the need before to memcpy data between 2 GPUs. Now, I'm guessing I'm going to do it with cudaMemcpy() and the cudaMemcpyDeviceToDevice flag, but:
is the cudaMemcpyDeviceToDevice flag used both for copying data within a single device's memory space and between the memory spaces of all devices?
If it is,
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
And if that's the case, then
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
For devices with UVA in effect, you can use the mechanism you describe. This doc section may be of interest (both the one describing device-to-device transfers as well as the subsequent section on UVA implications). Otherwise there is a cudaMemcpyPeer() API available, which has somewhat different semantics.
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
Yes, see the previously referenced doc sections.
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
cudaMemcpyDefault is the transfer flag that was added when UVA first appeared, to enable the use of generically-flagged transfers, where the direction is inferred by the runtime upon inspection of the supplied pointers.
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
I'm assuming the generically-flagged method described above would meet whatever needs you have (or perhaps I'm not understanding this question).
Such discussions could engender the question "Why would I ever use anything but cudaMemcpyDefault?"
One possible reason I can think of to use an explicit flag is that the runtime API will do explicit error checking if you supply one. If you're sure that a given invocation of cudaMemcpy will always be an H2D transfer, for example, then explicitly using cudaMemcpyHostToDevice will cause the runtime API to throw an error if the supplied pointers do not match the indicated direction. Whether you attach any value to such a concept is probably a matter of opinion.
As a matter of lesser importance (IMO), code that uses the explicit flags does not depend on UVA being available, but such execution scenarios are "disappearing" with newer environments.
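To illustrate both points, here is a small sketch assuming two UVA-capable devices in the system (the pointer names and sizes are made up):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    void *d0_buf, *d1_buf;

    cudaSetDevice(0);
    cudaMalloc(&d0_buf, bytes);
    cudaSetDevice(1);
    cudaMalloc(&d1_buf, bytes);

    /* Generic flag: with UVA the runtime infers the direction from the pointers. */
    cudaError_t e1 = cudaMemcpy(d0_buf, d1_buf, bytes, cudaMemcpyDefault);
    printf("cudaMemcpyDefault: %s\n", cudaGetErrorString(e1));

    /* Explicit flag that does not match the pointers: per the error-checking
       behaviour described above, the runtime can reject this. */
    cudaError_t e2 = cudaMemcpy(d0_buf, d1_buf, bytes, cudaMemcpyHostToDevice);
    printf("mismatched explicit flag: %s\n", cudaGetErrorString(e2));

    /* The explicitly peer-aware API mentioned above. */
    cudaError_t e3 = cudaMemcpyPeer(d0_buf, 0, d1_buf, 1, bytes);
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(e3));

    cudaFree(d1_buf);
    cudaSetDevice(0);
    cudaFree(d0_buf);
    return 0;
}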

A single program appears on two GPU cards

I have multiple GPU cards (no. 0, no. 1, ...), and every time I run a Caffe process on card no. 1, 2, ... (any card except 0), it uses up 73 MiB on card no. 0.
For example, in the figure below, process 11899 uses 73 MiB on card no. 0 but actually runs on card no. 1.
Why? Can I disable this feature?
The CUDA driver is like an operating system. It will reserve memory for various purposes when it is active. Certain features, such as managed memory, may cause substantial side-effect allocations to occur (although I don't think this is the case with Caffe). And it's even possible that the application itself is doing some explicit allocations on those devices, for some reason.
If you want to prevent this, one option is to use the CUDA_VISIBLE_DEVICES environment variable when you launch your process.
For example, if you want to prevent CUDA from doing anything with card "0", you could do something like this (on linux):
CUDA_VISIBLE_DEVICES="1,2" ./my_application ...
Note that the enumeration used above (the CUDA enumeration) is the same enumeration that would be reported by the deviceQuery sample app, but not necessarily the same enumeration reported by nvidia-smi (the NVML enumeration). You may need to experiment or else run deviceQuery to determine which GPUs you want to use, and which you want to exclude.
Also note that using this option actually affects the devices that are visible to an application, and will cause a re-ordering of device enumeration (the device that was previously "1" will appear to be enumerated as device "0", for example). So if your application is multi-GPU aware, and you are selecting specific devices for use, you may need to change the specific devices you (or the application) are selecting, when you use this environment variable.
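One way to check which physical boards you ended up with after setting CUDA_VISIBLE_DEVICES is to print the PCI IDs reported by the runtime and match them against the nvidia-smi output; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d device(s) visible to CUDA\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* The PCI IDs are stable identifiers for the physical boards; the CUDA
           index i is not, because CUDA_VISIBLE_DEVICES re-orders the enumeration. */
        printf("CUDA device %d: %s (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}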