What does "object affine" mean? - terminology

What does "object affine" mean? For instance, there are object affine thread pools. While I understand both thread pools and affine transformations in math, I can't think of an association between them.

Are you thinking about Affinity? If so, I suggest that what it means with respect to threads is that certain threads will be linked to a certain set of resources, for example a CPU or a core, and will not be switched across to another set of resources. This can allow for certain low-level optimisations such as maximising L1 cache hits.

Related

Is it possible to force cudaMallocManaged to allocate on a specific GPU ID (e.g. via cudaSetDevice)?

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU ID (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation, but didn't find any info related to this. Can someone help? Thanks!
No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
Therefore I also don't think a feature request will find support. In this memory model allocation and placement just are completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be that much easier to read / less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with e.g. corrupted data that might be overlooked.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

Cuda dynamic parallelism: depth of children threads one can create

I am reading the CUDA programming guide which I find dense. I came to the section where they explain that a parent grid can create a child grid, and the parent grid is considered completed only when all its spawned child threads have completed.
My question is: how "deep" is the parent-child tree allowed to grow in Cuda? Is it only constrained by the compute capabilities of the hardware in question, e.g. can one spawn as many parent/child blocks of threads as one wants, provided we don't exceed the max number of threads that can run on the hardware at once, or are there further constraints? I am asking because, absent this capability, I don't see how recursion can be implemented on GPUs.
thanks,
Amine
My question is: how "deep" is the parent-child tree allowed to grow in Cuda
The documentation indicates a maximum nesting depth of 24.
As indicated in the documentation, there typically will be other limits that you may hit first, before actually reaching a nesting depth of 24. One of these would be general limits on device kernel launches, including memory requirements as well as launch pending limits. Another possible limit is the synchronization limit. This has to do with whether a parent kernel is explicitly waiting on a child kernel to complete (e.g. via device-side cudaDeviceSynchronize()), and to what depth this synchronization is extended.
provided we don't exceed the max number of threads that can run on the hardware at once
None of this depends explicitly on how many threads are in the parent kernel, or child kernel(s). CUDA kernels don't have any basic limitation on the number of threads the hardware can run at once, and neither does CUDA Dynamic Parallelism (CDP).
As a practical matter then, large depth CDP launches may run into a variety of limits. Further, such design patterns may not be the best from a performance perspective. A CDP launch has time and resource overheads associated with it, and for any pattern that would subdivide the work this way, it's generally desirable in a CUDA kernel to do more work in the kernel, not less.

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene);
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you ask whether concurrent reads cripple the degree of parallelism, because that is what it means to be parallel: each thread is reading concurrently. Instead you should be asking whether performance suffers from the extra memory accesses when each thread basically wants the same thing, i.e. the same node.
Well, according to the article here, memory accesses can be coalesced if data locality is present, and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth usage. On devices with a compute capability lower than 2.0, a misaligned data access reduces the effective bandwidth. This is not a serious issue on devices with a compute capability of 2.0 or higher. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that the same behavior is likely for random access.
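To make the coalescing argument a bit more concrete, here is a rough host-side C++ sketch that counts how many aligned segments a 32-thread warp touches for a given access pattern. It is a simplified model under stated assumptions (4-byte elements, 128-byte segments, one transaction per distinct segment), not a description of how any particular GPU actually schedules its memory transactions. The names count_transactions and warp_addresses are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Simplified model (an assumption for illustration): every distinct aligned
// segment touched by the warp costs exactly one memory transaction.
std::size_t count_transactions(const std::vector<std::uint64_t>& byte_addresses,
                               std::uint64_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addresses) {
        segments.insert(addr / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}

// Byte addresses read by a 32-thread warp in which lane i loads the 4-byte
// element at index i * stride_elems, starting from byte address `base`.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::uint64_t stride_elems) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(base + lane * stride_elems * 4);
    }
    return addrs;
}
```

In this model, a warp where every lane reads the same node (stride 0) or neighboring elements (stride 1) touches a single segment, while a stride of 32 elements touches 32 different segments, which is the large-stride case described above.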
If you are changing the structure while reading it, which I assume you are (if it's a scene you probably render each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem, since an atomic operation on a location cannot be interleaved with another access to it.
You can also try putting the scene in shared memory, if it fits.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.

The latency of accessing shared memory

Which latency is longer in the two situations below?
(1) The data is loaded into shared memory from global memory, and all the threads access the shared memory concurrently. The data may be the same for multiple threads.
(2) All the threads access global memory, but the data are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
Perhaps I don't understand the difference between the two cases. But if I do:
The second is faster if your hardware architecture allows it. For example, on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe for such fears as race-conditions due to interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.

Concurrency and memory models

I'm watching this video by Herb Sutter on GPGPU and the new C++ AMP library. He is talking about memory models and mentions Weak Memory Models and then Strong Memory Models and I think he's referring to read/write ordering etc, but I am however not sure.
Google turns up some interesting results (mostly science papers) on memory models, but can someone explain what is a Weak Memory Model and what is a Strong Memory Model and their relation to concurrency?
In terms of concurrency, a memory model specifies the constraints on data accesses, and the conditions under which data written by one thread/core/processor becomes visible to another.
The terms weak and strong are somewhat ambiguous, but the basic premise is that a strong memory model places a lot of constraints on the hardware to ensure that writes by one thread/core/processor are visible to other threads/cores/processors in clearly-defined orders, whilst allowing the programmer maximum freedom of data access.
On the other hand, a weak model places very few constraints on the hardware, but instead places the responsibility of ensuring visibility in the hands of the programmer.
The strongest memory model is Sequential Consistency: all operations to all data by all processors form a single total order agreed on by all processors, which is consistent with the order of operations on each processor individually. This is essentially an interleaving of the operations of each processor.
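As a small illustration of "an interleaving of the operations of each processor", the following C++ sketch enumerates every interleaving of the classic two-thread "store buffering" test and collects the possible results. It simulates sequential consistency in a single thread rather than running real concurrent code, and the names State, step, explore and sequentially_consistent_outcomes are made up for this example.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Shared state for the "store buffering" test:
//   thread A: x = 1; r1 = y;
//   thread B: y = 1; r2 = x;
struct State {
    int x = 0, y = 0, r1 = 0, r2 = 0;
};

// Execute step `pc` (0 or 1) of thread A or thread B on the state.
static void step(State& s, bool thread_a, int pc) {
    assert(pc == 0 || pc == 1);
    if (thread_a) {
        if (pc == 0) s.x = 1; else s.r1 = s.y;
    } else {
        if (pc == 0) s.y = 1; else s.r2 = s.x;
    }
}

// Recursively try every interleaving that keeps each thread's program order.
static void explore(State s, int pc_a, int pc_b,
                    std::set<std::pair<int, int>>& outcomes) {
    if (pc_a == 2 && pc_b == 2) {
        outcomes.insert(std::make_pair(s.r1, s.r2));
        return;
    }
    if (pc_a < 2) {
        State next = s;
        step(next, true, pc_a);
        explore(next, pc_a + 1, pc_b, outcomes);
    }
    if (pc_b < 2) {
        State next = s;
        step(next, false, pc_b);
        explore(next, pc_a, pc_b + 1, outcomes);
    }
}

// All (r1, r2) results that some sequentially consistent execution can produce.
std::set<std::pair<int, int>> sequentially_consistent_outcomes() {
    std::set<std::pair<int, int>> outcomes;
    explore(State{}, 0, 0, outcomes);
    return outcomes;
}
```

Under this model the only possible results are (0, 1), (1, 0) and (1, 1). The result (0, 0) never appears in any interleaving, whereas weaker models that let a store be delayed past a later load can produce it.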
The weakest memory model will not impose any restrictions on the order that processors see each other's writes. Different processors in the same system may see writes in different orders, and some processors may use "stale" data from their own cache for a long time after a write to the same memory address by another processor. Sometimes, whole cache lines are treated as a single unit, so a write to one variable on a cache line will cause writes from other processors to other variables on that cache line that are not yet visible to the first processor to be effectively discarded, as the stale values are written over the top when it eventually writes the cache line to memory. Under such a scheme, extreme care must be taken to ensure that data is transferred to other processors in the correct order, using explicit synchronization instructions.
For example, the Intel x86 memory model is generally considered to be on the stronger end, as there are strict rules about the order in which writes become visible to other processors, whereas the DEC Alpha and ARM processors are generally considered to have weak memory models, as writes from one processor are only required to be visible to other processors in a particular order if you explicitly put ordering instructions (memory fences or barriers) in your code.
Some systems have memory that is only accessible by particular processors. Transferring data between these processors therefore requires explicit data transfer instructions. This is the case with the Cell processors, and is often the case with GPUs as well. This can be viewed as an extreme of a weak memory model --- data is only visible to other processors if you explicitly invoke the data transfer.
Programming languages usually impose their own memory models on top of whatever is provided by the underlying processors. For example, C++0x specifies a complete set of ordering constraints ranging from completely relaxed to full sequential consistency, so you can specify in code what you require. On the other hand, Java has a very specific set of ordering constraints that must be adhered to and cannot be varied. In both cases the compiler must translate the desired constraints into the relevant instructions for the underlying processor, which may be quite involved if you request sequential consistency on a weakly ordered machine.
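For the C++ case, a minimal sketch of what "specifying in code what you require" can look like is shown below, using the std::atomic facilities that were standardized in C++11. One thread publishes an ordinary variable with a release store and another picks it up with an acquire load. The function name message_passing_demo and the value 42 are only illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publish a plain int from one thread to another using release/acquire ordering.
int message_passing_demo() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    std::thread consumer([&] {
        // Spin until the flag is set; the acquire load pairs with the release store.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // the write to payload happens-before this read
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

If both operations on the flag used std::memory_order_relaxed instead, the language would no longer guarantee that the consumer sees the payload, even though on a strongly ordered machine such as x86 it might appear to work.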
The two terms aren't clearly defined, and it's not a black/white thing.
Memory models can be extremely weak, extremely strong, or anywhere in between.
It basically refers to the guarantees offered about concurrent memory accesses.
Naively, you would expect a write made on one thread, to be immediately visible to all other threads. And you would expect events to appear in the same order on all threads as well.
But in a weaker memory model, neither of those may hold.
Sequential consistency is the term for a memory model which guarantees that events are seen in the same order across all threads. So a memory model which ensures sequential consistency is pretty strong.
A weaker guarantee is causal consistency: the guarantee that events are observed after the events they depend on.
In other words, if you first write a value x to some address A, and then write a second value y to the same address, then no thread will ever read the value x after reading the value y. Because the two writes are to the same address, it would violate causal consistency if not all threads observed the same order.
But this says nothing about what should happen to unrelated events. The result of writing a third value to a different memory address could be observed at absolutely any time by other threads (so different threads may observe events in a different order, unlike under sequential consistency)
There are plenty other such levels of "consistency", some stronger, some weaker, and offering all sorts of subtle guarantees about what you can rely on.
Fundamentally, a stronger memory model is going to offer more guarantees about the order in which events are observed, and will normally guarantee behavior closer to what you'd intuitively expect.
But a weaker model allows more room for optimization, and especially, it scales better with more cores (because less synchronization is required)
Sequential consistency is basically free on a single-core CPU, is doable on a quad-core, but would be prohibitively expensive on a 32-core system, or a system with 4 physical CPUs. Or a shared-memory system between multiple physical machines.
The more cores you have, and the further apart they are, the harder it is to ensure that they all observe events in the same order. So compromises are made, and you settle for a weaker memory model which makes looser guarantees.
Yes, you are right: the difference between weak and strong memory models is a difference in which optimizations are available (the ordering of reads/writes and the related fences).
You can specify a memory model by starting with a sequentially consistent model (the most restrictive, or strongest model), and then specify how reads and writes from a single thread can be introduced, removed, or moved with respect to one another
In this model (sequentially consistent) the memory is independent of any of the processors (threads) that use it. The memory is connected to each of the threads by a controller that feeds read and write requests from each thread. The reads and writes from a single thread reach memory in exactly the order specified by the thread, but they might be interleaved with reads and writes from other threads in an unspecified way
Understand the Impact of Low-Lock Techniques in Multithreaded Apps
However, there's no exact boundary between strong and weak memory models, unless you consider the sequentially consistent model vs. all others. Some of them are just stronger/weaker and therefore more open to optimizations by reordering than others. For example, the memory model in .NET 2.0 for x86 allows a few more optimizations than the version in .NET 1.1, so it can be considered a weaker model.
Google turns up some interesting results (mostly science papers) on memory models, but can someone explain what is a Weak Memory Model and what is a Strong Memory Model and their relation to concurrency?
A strong memory model is one where, from the point of view of other cores, reads and writes appear to happen exactly as written in the program and, in particular, in program order. This is known as sequential consistency.
A weak memory model is one where memory operations may be changed by the CPU, e.g. reordered. All practical CPU architectures allow instructions to be reordered.
Note that Herb Sutter uses "strong memory model" to mean one where atomic intrinsics are not reordered. This is not the commonly accepted definition.