I understand what priority inheritance is. I also understand, from the Mars Pathfinder's system reset issue, that most of the time, depending upon the criticality of the operation, it is good to enable/implement priority inheritance.
However, are there any cases where priority inheritance is not desirable and may actually cause problems if enabled/implemented? If so, could you please provide an example while preferably describing a problem?
The OP gives an example where priority inheritance is desirable: a short mutex-protected transaction between the highest- and lowest-priority threads, and a long-running medium-priority thread that could block the transaction if priority inheritance is not used. In that example the high-priority thread is also more important to the application than the medium-priority thread.
To make priority inheritance undesirable, we can construct an application where all of these assumptions are reversed. Let the transaction between the highest- and lowest-priority threads take longer than the time the medium-priority thread has to perform its task, and assume that the results of the medium-priority thread are more important than the results of the high-priority thread.
Imagine an application that should serve 100 interrupts per second with its high-priority thread and 10 interrupts per second with its medium-priority thread (where each of those interrupts needs 30 ms to process). The low-priority thread can lock a resource that it shares with the high-priority thread, and occasionally (very rarely) it holds that lock for a long time (1 sec). With priority inheritance, the first access to this resource by the high-priority thread boosts the priority of the background thread for up to 1 sec, so the medium-priority thread misses 10 interrupts; at the same time the high-priority thread could miss 100 interrupts. Without priority inheritance the medium-priority thread serves all of its interrupts, but the high-priority thread misses up to 130 interrupts (1 sec of lock holding plus 10 × 30 ms of medium-priority work). If the interrupts served by the medium-priority thread are more valuable, we should prefer to disable priority inheritance.
I have never seen a real-life application exactly like this, so I invented one to illustrate the case. Let it be a video-capture application with a computer-vision task in the background and (as an additional bonus) sound recording. Video frames are captured by the medium-priority thread. Sound is captured by the high-priority thread (because otherwise the video-capture process could block a significant portion of the sound interrupts). Video is captured into a pre-allocated buffer. Audio is captured with silence suppression (no memory is needed to capture silence), so the audio thread needs dynamically allocated memory blocks (this is our shared resource). The computer-vision task also sometimes allocates memory blocks. When we run out of memory, the garbage-collector thread blocks memory allocation and does some work. Without priority inheritance, video capture works flawlessly; with priority inheritance, the audio thread sometimes blocks video capture (which is considered bad for this application).
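For completeness: on a POSIX system this choice is made per mutex, so an application like the one above can enable priority inheritance only for the locks where it helps. A minimal sketch (error handling omitted; the two lock names are made up for illustration):

#include <pthread.h>

/* Lock for short transactions shared with the high-priority thread:
   priority inheritance enabled. */
static pthread_mutex_t short_lock;
/* Lock that can be held for a long time (e.g. by the garbage collector):
   priority inheritance deliberately left off. */
static pthread_mutex_t long_lock;

static void init_locks(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);

    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&short_lock, &attr);

    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
    pthread_mutex_init(&long_lock, &attr);

    pthread_mutexattr_destroy(&attr);
}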
Also there are applications where priority inheritance does not give any benefits:
When there are (1) several background threads and (2) several short-lived high-priority threads, and each group of threads shares resources only within its own group.
When one resource is shared by threads with priorities 1 and 2, another by threads with priorities 3 and 4, and so on.
When the processor has more cores than the application has time-critical threads.
In many other cases.
If priority inheritance is enabled in such cases, the only effect is likely some performance (or power-efficiency) degradation, because the kernel's priority-inheritance implementation most likely needs some additional resources.
I have recently started coding in Verilog. I have completed my first project, prototyping a MIPS 32 processor with 5-stage pipelining. Now my next task is to implement a single-level cache hierarchy on the instruction memory.
I have successfully implemented a 2-way set-associative cache.
Previously I had declared the instruction memory as an array of registers, so whenever I needed to access the next instruction in the IF stage, the data (instruction) was available immediately for further decoding (since a blocking/non-blocking assignment reads any memory location in the same cycle).
But now that I have a single-level cache added on top of it, the cache FSM takes a few extra cycles to do its work (searching for the data, and the replacement policy in case of a cache miss). The maximum delay is about 5 cycles when there is a cache miss.
Since each pipeline stage advances to the next stage within a single cycle, whenever there is a cache miss the cache fails to deliver the instruction before the pipeline moves on, so the output is always wrong.
To counteract this, I have increased the cache clock to 5 times the processor's pipeline clock. This does do the job: since the cache clock is much faster, the cache always finishes before the next processor clock edge.
But is this workaround legitimate? I mean, I haven't heard of multiple clocks in a processor system. How do real-world processors overcome this issue?
Yes, of course, there is another way: using stall cycles in the pipeline until the data is available from the cache (a hit). But I am just wondering whether making the memory system faster by increasing its clock is justified.
P.S. I am a newbie to computer architecture and Verilog, and I don't know much about VLSI. This is my first question ever, because whatever questions come up I usually find readily answered on web pages, but I can't find much detail about this problem, so I am here.
I also asked my professor; she told me to research the topic further, because none of my colleagues or seniors have worked much on pipelined processors.
But is this workaround legitimate?
No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes and is designed around single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists)
A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
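To make the stall-as-NOP idea concrete, here is a rough cycle-level model in C++ (not Verilog, and not tied to your design; icache_lookup and the latch names are invented for illustration) of what the fetch stage does on every clock edge:

struct Instr { unsigned word; bool bubble; };

// Placeholder cache model: always hits and returns the NOP encoding.
// A real model would sometimes miss and take several cycles to fill.
static bool icache_lookup(unsigned addr, unsigned *word)
{
    (void)addr;
    *word = 0;               // 0x00000000 is the MIPS NOP (sll $0,$0,0)
    return true;
}

static unsigned pc = 0;                     // program counter
static Instr if_id = { 0, true };           // IF/ID pipeline register
static const Instr BUBBLE = { 0, true };    // a NOP marked as a stall bubble

// One clock edge of the fetch stage.
static void fetch_stage_cycle()
{
    unsigned word;
    if (icache_lookup(pc, &word)) {
        if_id = { word, false };   // hand the real instruction to the ID stage
        pc += 4;                   // advance the PC only on a hit
    } else {
        if_id = BUBBLE;            // stall == insert a NOP bubble
        // pc is held, so the same fetch is retried while the miss is serviced
    }
}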
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so they could do hit under miss and so on. Not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion of the acronym (which would still fit MIPS II, where the load delay slot was removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages", but apparently I made that up in my head; thanks @PaulClayton for catching that.
While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
This appears to contradict the other paragraph, because it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe the way it used to work before Volta is that each warp would basically have a stack of "return addresses" as well as the active mask or probably actually the inverse of the active mask, i.e., the mask for running the other part of the branch once you return. With this design, each warp can only have a single active branch at any point in time. A consequence of this is that the warp scheduler could only ever schedule the one active branch of a warp. This makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
I believe what they basically did with Volta is that there is now a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really want to find out about this implementation detail you maybe could design some experiment based on checking at which point you run out of stack space). This change gives all current branches an explicit representation and allows the warp scheduler to at any time pick threads from any branch to run. As a result, the warp scheduling can be made starvation-free, which gets rid of many of the restrictions that earlier architectures had…
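If you want to poke at the first question (how the active mask behaves across nested branches) empirically, a small probe kernel that prints __activemask() at each nesting level is one way to look at it. This is purely observational; the exact masks you see are not guaranteed by the programming model:

#include <cstdio>

// Run with a single warp, e.g. probe<<<1, 32>>>();
__global__ void probe()
{
    int lane = threadIdx.x % 32;
    printf("lane %2d outer  mask=%08x\n", lane, __activemask()); // typically all 32 lanes
    if (lane < 16) {
        printf("lane %2d if     mask=%08x\n", lane, __activemask()); // typically lanes 0-15
        if (lane < 8) {
            printf("lane %2d nested mask=%08x\n", lane, __activemask()); // typically lanes 0-7
        }
    } else {
        printf("lane %2d else   mask=%08x\n", lane, __activemask()); // typically lanes 16-31
    }
    __syncwarp(); // force reconvergence before the final print
    printf("lane %2d reconv mask=%08x\n", lane, __activemask());
}

int main()
{
    probe<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}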
I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you ask whether concurrent reads cripple the degree of parallelism, because that is what being parallel means: each thread is reading concurrently. Instead, you should be asking whether performance suffers from the extra memory accesses when each thread basically wants the same thing, i.e. the same node.
Well, according to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
This means that if threads within a warp are trying to access memory locations near each other, those accesses can be coalesced. In your case each thread is trying to access the "same" node until it reaches a point where the traversals branch apart.
This means the memory accesses will be coalesced within the warp until the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access on devices with a compute capability of less than 2.0 reduces the effective bandwidth; this is not a serious issue on devices with a compute capability of 2.0 or higher. That being said, pretty much regardless of your device generation, accessing global memory with large strides gives poor effective bandwidth (Reference). I would assume that the same behavior is likely for random access.
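To make the stride point concrete, here is a minimal pair of kernels (the names and the float element type are arbitrary): consecutive threads reading consecutive elements can be coalesced into a few transactions, while a large stride forces many more transactions for the same amount of useful data:

__global__ void coalesced_read(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];      // neighbouring threads touch neighbouring addresses
}

__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];      // large stride: poor effective bandwidth
}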
If you are changing the structure while it is being read (which I assume you are, since if it's a scene you probably update it each frame), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; using atomic operations guarantees that the race conditions don't happen.
You can try stuffing the scene into shared memory if you can fit it.
You can also try using streams to increase concurrency; streams also give you a form of synchronization between the kernels that are run in the same stream.
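If you want to try the shared-memory suggestion, a sketch of caching just the root node (which every thread reads first) could look like this; the Node/Scene layout and the final write are stand-ins for whatever your real structures do:

// Hypothetical types standing in for the real scene layout.
struct Node  { float bounds[6]; int left, right; };
struct Scene { Node *nodes; int root; };

__global__ void trace(int width, int height, float *frameBuffer, Scene scene)
{
    // Copy the root node (read by every thread) into shared memory once per block.
    __shared__ Node root;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        root = scene.nodes[scene.root];
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Placeholder: a real kernel would start the traversal from the shared
    // copy here instead of re-reading the root from global memory each time.
    frameBuffer[y * width + x] = root.bounds[0];
}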
I use Observables in Couchbase.
What is the difference between Schedulers.io() and Schedulers.computation()?
A brief introduction to RxJava schedulers:
Schedulers.io() – used to perform non-CPU-intensive operations such as making network calls, reading from disk/files, database operations, etc. It maintains a pool of threads.
Schedulers.newThread() – with this, a new thread is created each time a task is scheduled. It is usually suggested not to use this scheduler unless the operation is very long-running; the threads created via newThread() won't be reused.
Schedulers.computation() – this scheduler can be used to perform CPU-intensive operations such as processing huge data sets, bitmap processing, etc. The number of threads created by this scheduler depends on the number of CPU cores available.
Schedulers.single() – this scheduler executes all tasks sequentially, in the order they are added. It can be used when sequential execution is required.
Schedulers.immediate() – this scheduler executes the task immediately and synchronously, blocking the calling thread.
Schedulers.trampoline() – executes tasks in a first-in, first-out manner. All scheduled tasks are executed one by one on the current thread, so only one runs at a time.
Schedulers.from() – this allows us to create a scheduler from an Executor, limiting the number of threads that are created. When the thread pool is occupied, tasks are queued.
From the Rx documentation:
Schedulers.computation( ) - meant for computational work such as event-loops and callback processing; do not use this scheduler for I/O (use Schedulers.io( ) instead); the number of threads, by default, is equal to the number of processors
Schedulers.io( ) - meant for I/O-bound work such as asynchronous performance of blocking I/O, this scheduler is backed by a thread-pool that will grow as needed; for ordinary computational work, switch to Schedulers.computation( ); Schedulers.io( ) by default is a CachedThreadScheduler, which is something like a new thread scheduler with thread caching
This question is related to: Does Nvidia Cuda warp Scheduler yield?
However, my question is about forcing a thread block to yield by doing some controlled memory operation (which is heavy enough to make the thread block yield). The idea is to allow another ready-state thread block to execute on the now vacant multiprocessor.
The PTX manual v2.3 mentions (section 6.6):
...Much of the delay to memory can be hidden in a number of ways. The first is to have multiple threads of execution so that the hardware can issue a memory operation and then switch to other execution. Another way to hide latency is to issue the load instructions as early as possible, as execution is not blocked until the desired result is used in a subsequent (in time) instruction...
So it sounds like this can be achieved (despite being an ugly hack). Has anyone tried something similar? Perhaps with block_size = warp_size kind of setting?
EDIT: I've raised this question without clearly understanding the difference between resident and non-resident (but assigned to the same SM) thread blocks. So, the question should be about switching between two resident (warp-sized) thread blocks. Apologies!
In the CUDA programming model as it stands today, once a thread block starts running on a multiprocessor, it runs to completion, occupying resources until it completes. There is no way for a thread block to yield its resources other than returning from the global function that it is executing.
Multiprocessors will switch among warps of all resident thread blocks automatically, so thread blocks can "yield" to other resident thread blocks. But a thread block can't yield to a non-resident thread block without exiting -- which means it can't resume.
Starting from compute capability 7 (Volta), you have the __nanosleep() instruction, which will put a thread to sleep for a given nanosecond duration.
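__nanosleep() does not release the block's resources, but it does stop a spinning warp from monopolising issue slots while it waits. A rough back-off loop might look like this (the flag and the timing constants are arbitrary; requires compiling for sm_70 or newer):

__device__ void wait_for_flag(volatile int *flag)
{
    unsigned ns = 100;
    while (*flag == 0) {
        __nanosleep(ns);           // put this thread to sleep for ~ns nanoseconds
        if (ns < 100000) ns *= 2;  // exponential back-off, capped at ~0.1 ms
    }
}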
Another option (available since compute capability 3.5) is to start a grid with lower priority using the cudaStreamCreateWithPriority() call. This allows you to run one stream at lower priority. Do note that (some) GPUs only have 2 priorities, meaning that you may have to run your main code at high priority in order to be able to dodge the default priority.
Here's a code snippet:
// get the range of stream priorities for this device
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
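Work is then launched into those streams; for example (the kernel names and launch configurations below are placeholders, not part of the original snippet):

// background work goes to the low-priority stream,
// latency-critical work to the high-priority one
background_kernel<<<bg_grid, bg_block, 0, st_low>>>(bg_args);
critical_kernel<<<hi_grid, hi_block, 0, st_high>>>(hi_args);

// wait for the high-priority work (or cudaDeviceSynchronize() for both)
cudaStreamSynchronize(st_high);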