Is Pinned memory non-atomic read/write safe on Xavier devices? - cuda

Cross-posted from here: since I did not get a response there, I am posting here as well.
Cuda Version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX
I have been trying to find the answer to this across this forum, stack overflow, etc.
So far what I have seen is that there is no need for an atomicRead in CUDA because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella
However, I have not found anything about cache flushing/loading for the iGPU on a Tegra device. Does it also use “32-byte cachelines”?
My use case is to have one kernel writing to various parts of a chunk of memory (not atomically, i.e. not using atomic* functions), while a second kernel only reads those same bytes in a non-tearing manner. I am okay with slightly stale data in my read (given that the writing kernel flushes/updates the memory such that subsequent read kernels/processes get the update within a few milliseconds). The write kernel launches and completes after 4-8 ms or so.
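To make the use case concrete, here is a minimal sketch of the access pattern, assuming each record fits in one naturally aligned 64-bit word. The names `pack`, `writer`, `reader` and `run_pinned_demo` are hypothetical. The two kernels run back to back in the same stream, so this only shows the single-store/single-load pattern on mapped pinned memory; it does not demonstrate that concurrent access is safe on Xavier.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical record layout: a sequence number and a payload packed into one
// naturally aligned 64-bit word, so every update is a single 8-byte store.
__host__ __device__ inline unsigned long long pack(unsigned int seq, unsigned int payload) {
    return (static_cast<unsigned long long>(seq) << 32) | payload;
}

__global__ void writer(volatile unsigned long long* slots, int n, unsigned int seq) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) slots[i] = pack(seq, 2u * seq);  // one aligned 64-bit store
}

__global__ void reader(const volatile unsigned long long* slots, int n, int* mismatches) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned long long w = slots[i];  // one aligned 64-bit load
        unsigned int seq = static_cast<unsigned int>(w >> 32);
        unsigned int payload = static_cast<unsigned int>(w & 0xffffffffull);
        if (payload != 2u * seq) atomicAdd(mismatches, 1);  // the two halves disagree
    }
}

// Returns the number of inconsistent slots seen by the reader, or -1 on a CUDA error.
int run_pinned_demo(int n) {
    unsigned long long* h_slots = nullptr;
    unsigned long long* d_slots = nullptr;
    int* d_mismatches = nullptr;
    int mismatches = -1;
    size_t bytes = n * sizeof(unsigned long long);

    // Pinned, mapped host memory: the GPU accesses it through d_slots.
    if (cudaHostAlloc((void**)&h_slots, bytes, cudaHostAllocMapped) != cudaSuccess) return -1;
    memset(h_slots, 0, bytes);
    if (cudaHostGetDevicePointer((void**)&d_slots, h_slots, 0) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_mismatches, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_mismatches, 0, sizeof(int));

    int threads = 128, blocks = (n + threads - 1) / threads;
    writer<<<blocks, threads>>>(d_slots, n, 7u);
    reader<<<blocks, threads>>>(d_slots, n, d_mismatches);  // same stream: runs after writer
    cudaMemcpy(&mismatches, d_mismatches, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_mismatches);
    cudaFreeHost(h_slots);
    return mismatches;
}

int main() {
    printf("inconsistent slots: %d\n", run_pinned_demo(1 << 16));
    return 0;
}
```

Packing the fields that must stay consistent into one aligned word is the design choice here: whatever the caching behaviour turns out to be, the question then reduces to whether a single 8-byte access can tear.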
At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?
Can/should pinned memory be used for this use case, or would unified be more appropriate such that I can take advantage of the cache safety within the iGPU?
According to the Memory Management section here we see that the iGPU access to pinned memory is Uncached. Does this mean we cannot trust the iGPU to still have safe access like Robert said above?
If using pinned, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined/segfault territory?
Additionally if using pinned and an atomic write and read occur at the same time, what is the outcome?
My goal is to remove the use of cpu side mutexing around the memory being used by my various kernels since this is causing a coupling/slow-down of two parts of my system.
Any advice is much appreciated. TIA.

Related

is it possible to force cudaMallocManaged allocate on a specific gpu id (e.g. via cudaSetDevice)

I want to use cudaMallocManaged, but is it possible force it allocate memory on a specific gpu id (e.g. via cudaSetDevice) on a multiple GPU system?
The reason is that I need allocate several arrays on the GPU, and I know which set of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched CUDA documents, but didn't find any info related to this. Can someone help? Thanks!
No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
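As a rough illustration of the prefetch approach, the sketch below allocates managed memory, touches it on the CPU, and then asks for it to be migrated to a chosen GPU before a kernel uses it. The kernel `scale` and the helper `run_prefetch_demo` are hypothetical names. It assumes the `cudaMemPrefetchAsync(ptr, bytes, int dstDevice, stream)` form from CUDA 12.x and earlier; newer toolkits may expect a `cudaMemLocation` argument instead, so check the documentation for your version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns true if every element was scaled correctly on the chosen device.
bool run_prefetch_demo(int device, int n) {
    float* data = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaSetDevice(device) != cudaSuccess) return false;
    if (cudaMallocManaged((void**)&data, bytes) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the CPU first

    // Request migration of the pages to `device` ahead of the kernel.
    // ASSUMPTION: (ptr, bytes, int dstDevice, stream) signature, CUDA 12.x or earlier.
    if (cudaMemPrefetchAsync(data, bytes, device, 0) != cudaSuccess) {
        cudaGetLastError();  // not supported here: clear the error, rely on on-demand migration
    }
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(data); return false; }

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (data[i] == 2.0f);
    cudaFree(data);
    return ok;
}

int main() {
    printf("prefetch demo %s\n", run_prefetch_demo(0, 1 << 20) ? "passed" : "failed");
    return 0;
}
```

The prefetch is treated as optional in this sketch because some platforms reject it; the kernel still produces the right result through on-demand migration.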
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
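A minimal sketch of the first-touch idea, assuming a device of compute capability 6.x or greater as described in the quoted documentation: the allocation is made with cudaMallocManaged, and the first write comes from a kernel launched on the GPU where the data should live. The names `init` and `run_first_touch_demo` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void init(float* data, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;  // first write to each page comes from this GPU
}

// Allocates managed memory and initializes it from `device`; returns true on success.
bool run_first_touch_demo(int device, int n) {
    float* data = nullptr;
    if (cudaSetDevice(device) != cudaSuccess) return false;
    // Per the quoted documentation, no physical pages are populated yet on CC 6.x+.
    if (cudaMallocManaged((void**)&data, n * sizeof(float)) != cudaSuccess) return false;

    init<<<(n + 255) / 256, 256>>>(data, n, 3.0f);  // launched on `device`
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(data); return false; }

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (data[i] == 3.0f);  // CPU read may migrate pages back
    cudaFree(data);
    return ok;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    int device = count - 1;  // hypothetical choice: the last GPU in the system
    printf("first touch on device %d %s\n", device,
           run_first_touch_demo(device, 1 << 20) ? "succeeded" : "failed");
    return 0;
}
```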
Therefore I also don't think a feature request will find support. In this memory model allocation and placement just are completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be that much easier to read / less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with e.g. corrupted data that might be overlooked.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
This means that if threads within a warp access memory locations near each other, the accesses can be coalesced. In your case each thread accesses the same node until the threads reach a point where they branch.
When every thread in a warp reads the same address, the hardware can serve the read with a single transaction, so the accesses stay efficient within the warp until the threads branch off.
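A small sketch of this access pattern, assuming a hypothetical scene stored as a complete binary tree in an array (children of node i at 2*i+1 and 2*i+2). Every thread reads the root first and then follows a path chosen by the bits of its pixel index, so the reads start out identical and diverge level by level. The function names are made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Walks `levels` nodes from the root; usable on both host and device.
__host__ __device__ inline int walk(const int* values, int levels, unsigned int pixel) {
    int node = 0, sum = 0;
    for (int l = 0; l < levels; ++l) {
        sum += values[node];  // level 0: every thread reads values[0]
        node = 2 * node + 1 + static_cast<int>((pixel >> l) & 1u);  // then paths diverge
    }
    return sum;
}

__global__ void trace(const int* values, int levels, int* out, int pixels) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < pixels) out[p] = walk(values, levels, static_cast<unsigned int>(p));
}

// Returns the number of pixels where the GPU result differs from the host result, or -1 on error.
int run_trace_demo(int levels, int pixels) {
    int nodes = (1 << levels) - 1;
    std::vector<int> h_values(nodes), h_out(pixels, 0);
    for (int i = 0; i < nodes; ++i) h_values[i] = i;

    int *d_values = nullptr, *d_out = nullptr;
    if (cudaMalloc((void**)&d_values, nodes * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_out, pixels * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d_values, h_values.data(), nodes * sizeof(int), cudaMemcpyHostToDevice);

    trace<<<(pixels + 127) / 128, 128>>>(d_values, levels, d_out, pixels);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, pixels * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int p = 0; p < pixels; ++p)
        if (h_out[p] != walk(h_values.data(), levels, static_cast<unsigned int>(p))) ++mismatches;
    return mismatches;
}

int main() {
    printf("mismatching pixels: %d\n", run_trace_demo(10, 4096));
    return 0;
}
```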
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. On devices with a compute capability below 2.0, a misaligned data access reduces the effective bandwidth; this is not a serious issue on devices with a compute capability of 2.0 or higher. That being said, pretty much regardless of your device generation, the effective bandwidth becomes poor when accessing global memory with large strides (Reference). I would assume that the same is likely for random access.
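One way to see the stride effect on your own device is to time the same copy with different strides. The sketch below is an assumption-laden micro-benchmark, not a rigorous measurement: `copy_permuted` and `timed_copy` are hypothetical names, n is assumed to be a power of two, and the stride is assumed to be odd so that every element is still copied exactly once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void copy_permuted(const float* in, float* out, size_t n, size_t stride) {
    size_t t = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (t < n) {
        size_t j = (t * stride) % n;  // stride 1: neighbouring threads touch neighbouring words
        out[j] = in[j];
    }
}

// Copies n floats with the given stride. Returns the kernel time in ms,
// or -1 if a CUDA call failed or the copy was incomplete.
// ASSUMPTION: n is a power of two and stride is odd, so j visits every index once.
float timed_copy(size_t n, size_t stride) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1024);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    copy_permuted<<<blocks, threads>>>(d_in, d_out, n, stride);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    copy_permuted<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.0f;
    for (size_t i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) return -1.0f;
    return ms;
}

int main() {
    size_t n = 1u << 22;
    printf("stride 1:  %.3f ms\n", timed_copy(n, 1));
    printf("stride 33: %.3f ms\n", timed_copy(n, 33));
    return 0;
}
```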
If you are changing the structure while it is being read, which I assume you do (if it's a scene, you probably render each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem: an atomic operation performs its read-modify-write on a word as one indivisible step, so concurrent updates to that word cannot interleave.
You can try putting the 'scene' into shared memory if it fits.
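A minimal sketch of that idea, assuming the part of the scene that every thread needs is small enough to fit (here a hypothetical 256 floats): each block copies the nodes into shared memory cooperatively, synchronizes, and then reads from the on-chip copy. `NODES`, `gather` and `run_gather_demo` are names made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NODES 256  // hypothetical size of the hot part of the scene

__global__ void gather(const float* nodes, float* out, int n) {
    __shared__ float s_nodes[NODES];
    // Cooperative load: the threads of the block split the copy between them.
    for (int k = threadIdx.x; k < NODES; k += blockDim.x) s_nodes[k] = nodes[k];
    __syncthreads();  // the whole copy must be complete before anyone reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int k = 0; k < 4; ++k) sum += s_nodes[(i + k) % NODES];  // repeated reads hit shared memory
        out[i] = sum;
    }
}

// Returns the number of wrong outputs, or -1 on a CUDA error.
int run_gather_demo(int n) {
    std::vector<float> h_nodes(NODES), h_out(n, 0.0f);
    for (int k = 0; k < NODES; ++k) h_nodes[k] = static_cast<float>(k);

    float *d_nodes = nullptr, *d_out = nullptr;
    if (cudaMalloc((void**)&d_nodes, NODES * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_nodes, h_nodes.data(), NODES * sizeof(float), cudaMemcpyHostToDevice);

    gather<<<(n + 127) / 128, 128>>>(d_nodes, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_nodes);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        float expected = 0.0f;
        for (int k = 0; k < 4; ++k) expected += static_cast<float>((i + k) % NODES);
        if (h_out[i] != expected) ++wrong;
    }
    return wrong;
}

int main() {
    printf("wrong outputs: %d\n", run_gather_demo(10000));
    return 0;
}
```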
You can also try using streams to increase concurrency; kernels launched into the same stream execute in order, which also gives you some synchronization between them.

Memory coalescing and transaction

After reading about the topic, I have 2 questions related to Global Memory coalescing access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is valid only for devices with compute capability less than 1.3. Is that right? For later devices (>= 1.3), can a thread even access one or two bytes and still get a coalesced memory access?
2- Will it matter (mainly in terms of time) if a (half) warp global memory access generates a 128-byte instead of a 64-byte memory transaction because of word misalignment? And what about the extra data transferred: will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes did not result in abysmal performance, but resulted in e.g. 2 write commands being issued instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% of performance in many applications.
/Henrik
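The bus picture can be turned into a small calculation. The host-side sketch below counts how many aligned segments a group of consecutive threads touches for a given starting byte, word size and segment size; `segments_touched` is a hypothetical helper and the model ignores caching and architecture-specific details.

```cuda
#include <cstdio>
#include <cstddef>

// Number of distinct aligned segments touched when `threads` consecutive
// threads each access one word of `word_bytes`, starting at byte `first_byte`.
int segments_touched(size_t first_byte, size_t word_bytes, int threads, size_t segment_bytes) {
    size_t last_byte = first_byte + word_bytes * threads - 1;
    return static_cast<int>(last_byte / segment_bytes - first_byte / segment_bytes + 1);
}

int main() {
    // Half warp, 4-byte words, aligned: fits in one 64-byte segment.
    printf("%d\n", segments_touched(0, 4, 16, 64));   // prints 1
    // Same access shifted by one word: spans two 64-byte segments...
    printf("%d\n", segments_touched(4, 4, 16, 64));   // prints 2
    // ...or a single 128-byte segment, half of which is not needed.
    printf("%d\n", segments_touched(4, 4, 16, 128));  // prints 1
    // Full warp, 4-byte words, 32-byte segments: aligned vs shifted by one word.
    printf("%d\n", segments_touched(0, 4, 32, 32));   // prints 4
    printf("%d\n", segments_touched(4, 4, 32, 32));   // prints 5
    // Full warp of 1-byte words: only 32 useful bytes, one 32-byte segment.
    printf("%d\n", segments_touched(0, 1, 32, 32));   // prints 1
    return 0;
}
```

In this model a misaligned half-warp access costs either a second 64-byte segment or one 128-byte segment with unused bytes, which is the trade-off the question asks about.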

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically for several values of block and of thread, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that threads in a block may have a specific cache memory, but it's quite fuzzy for me. For the moment, I parallelize my functions in N parts, which are allocated on blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Would that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very much limited resource and it's not unlikely for kernels to have very specific needs that constrain those many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor you might want to opt for larger (48KB) L1-caches calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
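As a sketch of what varying the launch bounds can look like in code, the kernel below is annotated with `__launch_bounds__`, and the register count the compiler ended up with is queried through cudaFuncGetAttributes. The values 256 and 2 are placeholder assumptions to experiment with, not recommendations, and `saxpy` and `run_saxpy` are hypothetical names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_THREADS 256  // hypothetical cap on the block size for this kernel
#define MIN_BLOCKS  2    // hypothetical minimum number of resident blocks per SM

__global__ void __launch_bounds__(MAX_THREADS, MIN_BLOCKS)
saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Runs saxpy with the given block size; returns true if the result is correct.
bool run_saxpy(int n, int threads) {
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f);
    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc((void**)&d_x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_y, n * sizeof(float)) != cudaSuccess) { cudaFree(d_x); return false; }
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + threads - 1) / threads, threads>>>(3.0f, d_x, d_y, n);
    cudaError_t err = cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_y[i] != 5.0f) return false;  // 3 * 1 + 2
    return true;
}

int main() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, saxpy) == cudaSuccess) {
        // Values to compare against profiler output while changing the bounds above.
        printf("registers per thread: %d, max threads per block: %d\n",
               attr.numRegs, attr.maxThreadsPerBlock);
    }
    printf("saxpy with %d threads per block: %s\n", MAX_THREADS,
           run_saxpy(1 << 20, MAX_THREADS) ? "ok" : "failed");
    return 0;
}
```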
It's theoretically possible to find optimal values from within an application, however, having the client code adjust optimally to both different device and launch parameters can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe that automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and on the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
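For a starting point rather than an optimum, the CUDA runtime offers an occupancy-based heuristic, cudaOccupancyMaxPotentialBlockSize. The sketch below asks it for a block size for a trivial kernel and derives the grid size from the problem size. As the note above says, maximum occupancy does not guarantee best performance, so treat the suggestion as a first guess to profile against. `square` and `run_square` are hypothetical names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Returns the block size suggested by the occupancy heuristic, or 0 on error.
int suggested_block_size() {
    int minGridSize = 0, blockSize = 0;
    // No dynamic shared memory (0) and no upper limit on the block size (0).
    if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, square, 0, 0) != cudaSuccess)
        return 0;
    return blockSize;
}

// Launches `square` with the suggested block size; returns true if the result is correct.
bool run_square(int n) {
    int blockSize = suggested_block_size();
    if (blockSize <= 0) return false;
    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements

    std::vector<float> h(n, 3.0f);
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<gridSize, blockSize>>>(d, n);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != 9.0f) return false;
    return true;
}

int main() {
    printf("suggested block size: %d\n", suggested_block_size());
    printf("kernel result %s\n", run_square(1 << 20) ? "ok" : "wrong");
    return 0;
}
```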
I have a quite good answer here; in short, computing the optimal distribution of blocks and threads is a difficult problem.

CUDA thread synchronization

I am a bit confused about synchronization.
Using __syncthreads you can synchronize threads in a block. Must this (the use of __syncthreads) be done only with shared memory? Or does using shared memory with __syncthreads give the best performance?
Generally, threads may safely communicate with each other only if they exist within the same thread block, right? So, why don't we always use shared memory? Because it's not big enough? And, if we don't use shared memory, how can we ensure that the results are right?
I have a program that sometimes runs ok (I get the results) and sometimes I get 'nan' results without altering anything. Can that be a problem of synchronization?
The use of __syncthreads does not involve shared memory, it only ensures synchronization within a block. But you need to synchronize threads when you want them to share data through shared memory.
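A minimal sketch of sharing data through shared memory, assuming the array length is a multiple of the block size: each block reverses its own tile. Without the barrier, a thread could read a tile element that another thread has not written yet. `BLOCK`, `reverse_block` and `run_reverse_demo` are hypothetical names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 64

__global__ void reverse_block(int* data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;
    tile[t] = data[base + t];              // each thread writes one element
    __syncthreads();                       // wait until the whole tile is written
    data[base + t] = tile[BLOCK - 1 - t];  // read an element written by another thread
}

// Reverses each BLOCK-sized tile of an array of `blocks` tiles.
// Returns the number of wrong elements, or -1 on a CUDA error.
int run_reverse_demo(int blocks) {
    int n = blocks * BLOCK;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_block<<<blocks, BLOCK>>>(d);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < BLOCK; ++t)
            if (h[b * BLOCK + t] != b * BLOCK + (BLOCK - 1 - t)) ++wrong;
    return wrong;
}

int main() {
    printf("wrong elements: %d\n", run_reverse_demo(16));
    return 0;
}
```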
We don't always use shared memory because it is quite small, and because it can slow down your application when used badly, due to potential bank conflicts from poorly chosen access patterns. Moreover, recent architectures (compute capability 2.0 and later) implement shared memory in the same hardware as the cache. Thus, some seasoned CUDA developers recommend not using shared memory and relying on the cache mechanisms only.
Can be. If you want to know whether it is a deadlock, try to increase the number of blocks you're using. If it is a deadlock, your GPU should freeze. If it is not, post your code, it will be easier for us to answer ;)
__syncthreads() and shared memory are independent ideas, you don't need one to use the other. The only requirement for using __syncthreads() that comes to my mind is that all the threads must eventually arrive at the point in the code, otherwise your program will simply hang.
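A sketch of how to satisfy that requirement when the problem size is not a multiple of the block size: keep the barrier outside any condition and make only the work conditional. The block-sum kernel below is a hypothetical example in which out-of-range threads contribute zero but still reach every __syncthreads().

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128

__global__ void block_sum(const int* in, int* partial, int n) {
    __shared__ int s[BLOCK];
    int t = threadIdx.x;
    int i = blockIdx.x * BLOCK + t;
    s[t] = (i < n) ? in[i] : 0;  // out-of-range threads load 0 instead of returning early
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];  // only the work is conditional
        __syncthreads();                        // the barrier is reached by every thread
    }
    if (t == 0) partial[blockIdx.x] = s[0];
}

// Sums n ones on the GPU; returns the total, or -1 on a CUDA error.
int run_block_sum(int n) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<int> h_in(n, 1), h_partial(blocks, 0);

    int *d_in = nullptr, *d_partial = nullptr;
    if (cudaMalloc((void**)&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_partial, blocks * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK>>>(d_in, d_partial, n);
    cudaError_t err = cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];  // finish the sum on the host
    return total;
}

int main() {
    printf("sum of 1000 ones: %d\n", run_block_sum(1000));
    return 0;
}
```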
As for shared memory, yes, it's probably a matter of size that you don't see it being used all the time. From my understanding, shared memory is split amongst all the blocks resident on an SM. For example, 100 blocks that each use 1 KB of shared memory would need 100 KB to be resident together, which exceeds what is available on the SM, so fewer blocks can run on it at the same time.
Although shared memory and __syncthreads() are independent concepts, they often go hand in hand. If threads operate independently, there is no need to use __syncthreads().
Two aspects restrict the use of shared memory: 1) the size of shared memory is limited; 2) to achieve the best performance, you need to avoid bank conflicts when using shared memory.
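How shared memory per block limits the number of blocks an SM can hold at once can be queried from the runtime. The sketch below uses cudaOccupancyMaxActiveBlocksPerMultiprocessor with a hypothetical kernel that takes its shared memory dynamically; the block size of 128 and the sizes tried are arbitrary assumptions, and the numbers printed depend on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose shared memory size is chosen at launch time.
__global__ void uses_dynamic_smem(float* out) {
    extern __shared__ float buf[];
    buf[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

// Number of blocks of this kernel that can be resident on one SM at the same
// time for the given block size and dynamic shared memory per block (0 on error).
int resident_blocks(int blockSize, size_t smemBytes) {
    int numBlocks = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, uses_dynamic_smem,
                                                      blockSize, smemBytes) != cudaSuccess)
        return 0;
    return numBlocks;
}

int main() {
    const size_t sizesKB[] = {1, 4, 16, 32};
    for (size_t kb : sizesKB) {
        // More shared memory per block leaves room for fewer resident blocks.
        printf("%2zu KB per block -> %d resident blocks per SM\n",
               kb, resident_blocks(128, kb * 1024));
    }
    return 0;
}
```

The remaining blocks of a larger grid are not lost; they are scheduled as resident blocks finish.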
It could be due to the lack of __syncthreads(). Sometimes, using shared memory without __syncthreads() could lead to unpredictable results.