Is a "single cycle cpu" possible if asynchronous components are used? - mips

I have heard the term "single cycle CPU" and was trying to understand what it actually means. Is there a clear, agreed-upon definition, and a consensus on what it means?
Some homebrew "single cycle CPUs" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks."
This rings true to me.
Needing both the rising and falling edges (or the high and low phases) is effectively the same as needing the rising edges of two cycles of a single clock running twice as fast; and that would be a "two cycle" CPU, wouldn't it?
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle CPU must perform all state-changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible provided the data storage is all synchronous. If we have a synchronous system that has settled, then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is, for example, async RAM, then surely the control lines would be changing while that data is being stored, leading to unintended behaviours.
Am I wrong? Are there any examples of a "single cycle CPU" that include async storage in the mix?
It would seem that using async RAM in one's design means one must use at least two logical clock cycles to achieve the state change.
Of course, with some more complexity one could perhaps have a CPU that uses a single edge where instructions touch solely synchronous components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle CPU, rather a mostly single cycle CPU.
So no CPU that writes to async RAM (or another async component) can honestly be considered a single cycle CPU, because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (i.e. falling and rising), and this breaks the single clock principle.
So is there a commonly accepted definition of a single cycle CPU, and are we applying the term consistently?
What's the story?
(Also posted in my hackday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also on a private group in hackaday)
=====
Update: Looking at simple MIPS implementations, it seems the models use synchronous memory and so can probably operate off a single edge, and maybe they do - therefore warranting the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term used inconsistently elsewhere - i.e. by most homebrew TTL computers out there?
Or am I just plain wrong?
====
Update:
Some may have misunderstood my point.
Numerous homebrew TTL CPUs claim "single cycle CPU" status (for the purposes of this discussion I'm not interested in more complex beasts that do pipelining or whatever).
By "single cycle" these CPUs typically mean that they do something like advancing the PC on one edge of the clock and then using the opposing edge of the clock to update flip-flops with the result. Or they will use the other phase of the clock to update async components like latches and SRAM.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle, or even a second clock. BTW, Ben Eater in his vids even compares the inverted clock he uses to update his SRAM to a second clock.
My objection to the use of "single cycle CPU" for such CPUs (basically most/all homebrew TTL CPUs I've seen, as they all work that way) is that I agree with ZipCPU: using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock, and this makes a mockery of the "single cycle" claim.
If using the opposing edge is effectively the same as using a single edge of two clock cycles, then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand, it seems perfectly possible to build a CPU that uses only sync components (i.e. edge-triggered flip-flops) and only a single edge, where on each edge we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is distinctly different to the common TTL CPU pattern of using both edges of the clock.
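To make that concrete, here's a minimal Verilog-style sketch of the single-edge pattern (a sketch only, with hypothetical names and widths, not any particular design):

module single_edge_cpu (input wire clk, input wire reset);
    reg  [15:0] pc;
    reg  [15:0] regs [0:7];

    // Stand-ins for the combinational decode/ALU network (not shown);
    // by the next posedge these have settled to the instruction's result.
    wire [15:0] result = regs[0] + regs[1];  // e.g. a settled ALU output
    wire [2:0]  dest   = 3'd2;               // destination selected for write
    wire        we     = 1'b1;               // write enable from the decoder

    always @(posedge clk) begin
        if (reset)
            pc <= 16'd0;
        else begin
            if (we)
                regs[dest] <= result;  // commit the settled result ...
            pc <= pc + 16'd1;          // ... and advance the PC on the same edge
        end
    end
endmodule

Note that this only works because every state element here is an edge-triggered flip-flop; an async SRAM in place of regs would need a write strobe shaped somewhere inside the cycle, which is exactly the problem described above.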
BTW, my impression of FPGAs (which I'm not referring to here) is that the storage elements in an FPGA are all synchronous flip-flops. I don't know for sure, but that's what my reading suggests. Anyway, if this is true, then a trivial FPGA-based CPU probably has a CPI=1 and uses only, say, the +ve edge, and so might well meet my narrow definition of "single cycle CPU". Also, my reading suggests that various MIPS implementations (educational efforts, probably) meet my definition OK.

This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU" - truly using only one clock edge to set everything in motion for an instruction - then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases. But we still talk about their frequency in whole cycles, not the edge frequency - except in cases like DDR memory, where we do talk about the transfer rate being twice the memory clock frequency. What sets DDR apart is that it always uses both edges, and for approximately equal things, not just some extra timing / synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to store-data is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just derive it from the longest critical path of any instruction), and a couple of gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back.) If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve intentionally building a long chain of gate delays, with inverters dedicated to that purpose. Or perhaps even an analog RC time delay. Either way, that's obviously worse than just using the other edge of the main clock, and you'd only ever do it as an exercise in single-cycle dogmatic purity. (It can also be flaky, because gate delay isn't constant: it depends on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
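As a rough illustration of that delay-chain idea, here's a simulation-only Verilog sketch; the #2 delays stand in for a deliberate inverter/buffer chain, all names are hypothetical, and it is not synthesizable as written:

module delayed_write_strobe (
    input  wire clk,         // main CPU clock
    input  wire data_ready,  // settles after the longest critical path
    output reg  mem_clk      // derived memory clock, non-uniform duty cycle
);
    always @(posedge clk)
        mem_clk <= 1'b0;     // deassert at the start of each CPU cycle

    always @(posedge data_ready)
        mem_clk <= #2 1'b1;  // assert a couple of "gate delays" after the
                             // store data is stable on the bus
endmodule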

By definition, a single cycle CPU takes just one instruction per cycle. So it's possible to conclude, in theory, that there are other CPUs that take more or fewer instructions per cycle. Check out concepts like the multi-cycle processor and the pipelined processor (instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps," according to Wikipedia. I don't know exactly how it works, but maybe it just makes use of the available registers (maybe ECX gets used like EAX, for example, or maybe it works some other way); one thing that is 100% true is that the number of registers keeps increasing, so maybe that's one of the main purposes. Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer to the question "is a single cycle CPU possible if asynchronous components are used" depends on the CPU controller that controls both the CPU and the RAM with opcodes. I found interesting information about this on this site: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU", it should mean the simplest possible construction. The term "asynchronous" implies something more complex than "synchronous", so the two terms are not equivalent - it's something like asking "Can a basic data type be considered a structure?". In my opinion the word "single" means the simplest possible, and "asynchronous" means some modification, hence something more complex, so I just think it's not possible. But maybe "are used" can be bypassed by "are used at the time" - if some switch, some controller, can turn off asynchronous mode and make it all as simple as possible. But generally I just think it's not possible.

Related

Cache Implementation in Pipelined Processor

I have recently started coding in Verilog. I have completed my first project, prototyping a MIPS 32 processor with 5-stage pipelining. Now my next task is to implement a single-level cache hierarchy on the instruction memory.
I have successfully implemented a 2-way set-associative cache.
Previously I had declared the instruction memory as an array of registers, so whenever I needed to access the next instruction in the IF stage, the data (instruction) was allotted to the register instantaneously for further decoding (since a blocking/non-blocking assignment is instantaneous from any memory location).
But now that I have a single-level cache added on top of it, it takes a few more cycles for the cache FSM to do its work (data searching, and replacement policies in case of a cache miss). The max delay is about 5 cycles when there is a cache miss.
Since my pipeline proceeds to the next stage every single cycle, whenever there is a cache miss the cache fails to deliver the instruction before the pipeline moves to the next stage, so the desired output is always wrong.
To counteract this, I have increased the cache clock to 5 times the processor's pipeline clock. This does do the job: since the cache clock is much faster, it need not worry about the processor clock.
But is this workaround legit?? I mean, I haven't heard of multiple clocks in a processor system. How do processors in the real world overcome this issue?
Yes, of course, there is another way: using stall cycles in the pipeline until the data is readily available in the cache (a hit). But I'm just wondering, is making the memory system faster by increasing its clock justified??
P.S. I am a newbie to computer architecture and Verilog. I don't know much about VLSI. This is my first question ever, because whatever questions strike me I usually find readily answered on webpages, but I can't find much about this problem, so I am here.
I also asked my professor; she told me to research the topic more, because none of my colleagues/seniors have worked much on pipelined processors.
But is this workaround legit??
No, it isn't :P You're not only increasing the cache clock, but apparently also the memory clock. And if you can run your cache 5x faster and still meet your timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes, and is designed around, single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX and cache access in MEM, which is why that stage exists.)
A stall is logically equivalent to inserting a NOP, so you can do that on a cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
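As a rough sketch of that stall logic in Verilog (hypothetical signal names, datapath reduced to single words; the real thing would carry all the pipeline-register fields):

module stall_on_miss (
    input  wire        clk,
    input  wire        icache_miss,     // from the cache FSM
    input  wire [31:0] next_pc,
    input  wire [31:0] fetched_instr,
    input  wire [31:0] decoded_fields,  // whatever ID passes to EX
    output reg  [31:0] pc,
    output reg  [31:0] if_id,
    output reg  [31:0] id_ex
);
    localparam [31:0] NOP_INSTR = 32'h0000_0000;  // MIPS nop = sll $0,$0,0

    always @(posedge clk) begin
        if (!icache_miss) begin
            pc    <= next_pc;        // normal advance
            if_id <= fetched_instr;  // latch the fetched instruction
        end
        // else: pc and if_id hold their values - that's the stall

        // Feed a bubble into the next stage while stalled:
        id_ex <= icache_miss ? NOP_INSTR : decoded_fields;
    end
endmodule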
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so it could do hit-under-miss and so on, not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion of the acronym (which would still fit MIPS II, where the load delay slot was removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages", but apparently I made that up in my head - thanks @PaulClayton for catching that.

Is it true that if we can always fill the delay slot, there is no need for branch prediction?

I'm looking at the five-stage MIPS pipeline (IF, ID, EX, MEM, WB) in H&P 3rd ed., and it seems to me that the branch decision is resolved in the ID stage, so that while the branch instruction is in its EX stage, the second instruction after the branch can be fetched correctly. But this leaves us the problem of possibly still wasting the 1st instruction right after the branch instruction.
I also encountered the concept of the branch delay slot, which means you want to fill the slot right after the branch with something useful as well as "harmless", such that whether the branch is taken or not, the instruction executes as desired and the slot after the branch is not wasted.
My question is, first of all, is my understanding above correct? If it is, then the problem comes from the concept of branch prediction, which seems to try to fill that first slot with an instruction from the predicted destination of the program. But if we can always find some instruction to fill the branch delay slot, we would not need branch prediction, right?
For the classic MIPS (R2000) pipeline, the branch delay slot makes branch prediction useless as you perceive. (Technically, a design could combine a predictor/indicator of whether the delay slot instruction is a nop with a branch predictor. This would allow the nop to be skipped, modestly improving performance on a correct branch prediction.)
However, processor pipelines are often long and wide enough (and branch condition evaluation sufficiently delayed) that a single delay slot is not sufficient to fill the delay between when the post-branch instruction address is needed and the branch direction and target are known.
For example, a follow-on processor, the MIPS R4000, significantly lengthened the pipeline and as a result could not determine the location of the post-branch instruction early enough. The designers chose to use a simple static predict not-taken strategy.
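A minimal Verilog sketch of what static predict-not-taken implies in hardware (hypothetical names): keep fetching PC+4, and if a branch resolves taken in EX, squash the two younger instructions into bubbles:

module predict_not_taken (
    input  wire        clk,
    input  wire        branch_taken_ex,  // branch resolved in the EX stage
    input  wire [31:0] branch_target,
    input  wire [31:0] fetched_instr,
    input  wire [31:0] decoded_instr,
    output reg  [31:0] pc,
    output reg  [31:0] if_id,
    output reg  [31:0] id_ex
);
    localparam [31:0] BUBBLE = 32'h0000_0000;

    always @(posedge clk) begin
        pc    <= branch_taken_ex ? branch_target : pc + 32'd4;
        if_id <= branch_taken_ex ? BUBBLE : fetched_instr;  // flush IF's work
        id_ex <= branch_taken_ex ? BUBBLE : decoded_instr;  // flush ID's work
    end
endmodule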
If one did not care about binary compatibility, one could add more delay slots. However, finding useful instructions to fill such slots increases in difficulty as the number of slots increases. For certain loop-rich code, regularly filling two delay slots might be practical, and I think at least one DSP had two delay slots.
Branch prediction can also be used to decouple fetch from execution so that even if the condition cannot be evaluated (e.g., depending on the result of a high latency operation such as a data cache miss or a division), fetch can continue. Such decoupling could be used to generate instruction cache misses early (hiding some of their latency) and to reduce the impact of variable throughput at different stages (so an earlier stage can continue operating with maximum throughput when a later stage stalls or has reduced throughput and the buffered instructions can then hide later stalls or reduced throughput in the earlier stage).
The fact is that the compiler may not always find an instruction to fill the delay slot.
What is more, branch behaviour is highly predictable.
Before the IF stage, you don't even know whether an instruction is a branch (you have to fetch it from instruction memory first).
Within a MIPS core like that, with zero-wait-state randomly accessed RAM, sure. But depending on how the fetching is implemented, and the caching behind it, you may still want/need the concept of branch prediction to start those fetches earlier. The pipeline is just a small part of a bigger system. System busses are usually not single-cycle, "here is my address, I want my data by the end of this cycle" affairs; there are address busses and data busses, and tags that cross them, so you can have multiple transactions in flight at the same time - like a pipeline - trying to optimize the bandwidth of the data bus, knowing the peripherals and memory on the far side are too slow for that bus.
Prediction "could" be used to assist these other features in getting instructions into the pipe faster or more efficiently.
From an academic standpoint, though, the idea of the slot is to give the pipe a cycle to switch gears along another execution path. It only actually saves you if the incoming end of the pipe can be fed any random thing it wants every clock cycle - which isn't the real world.
Another academic solution is the ARM one of conditional execution on every instruction: you can construct execution sequences that keep the pipe full without flushing or stalling... again, so long as whatever feeds the pipe can keep up. ARM dumped the conditional-execution idea in the new 64-bit instruction set, and on some newer MIPS you can disable the branch shadow/delay slot.

How does a zero register improve performance?

In the MIPS ISA, there's a zero register ($r0) which always gives a value of zero. This allows the processor to:
Discard the result of any instruction, by directing the instruction's target to this register
Use it as a constant source of 0
It is said in this source that this improved the speed of the CPU. How does it improve performance? And why have not all ISAs adopted a zero register?
$r0 is not general purpose. It is hardwired to 0. No matter what you
do to this register, it always has a value of 0. You might wonder why
such a register is needed in MIPS.
The designers of MIPS used benchmarks (programs used to determine the
performance of a CPU), which convinced them that having a register
hardwired to 0 would improve the performance (speed) of the CPU as
opposed to not having it. Not everyone agrees a register hardwired to
0 is essential, so not all ISAs have a zero register.
There are a few potential ways this can improve performance; it's not clear which ones apply to that particular processor, but I've listed them roughly in order from most to least likely.
It avoids spurious pipeline stalls. Without an explicit zero register, it's necessary to take a register, zero it out, and use its value. This means that the zero-using operation is dependent on the zeroing operation, and (depending on how powerful the pipeline forwarding system is) possibly on the zeroed register's previous value. Architectures like x86, which have quite small register files and basically virtualize their registers to keep that from causing problems, have extremely powerful hazard analysis tools. The same is not generally true of RISC processors.
Certain operations may be more pipelineable if they can avoid a register read. If an explicit zero register is used, the fact that the operand will be zero is known at the instruction decode stage, rather than later on in the register fetch stage. Thus, the register read stage can be skipped.
Similarly, the ability to explicitly discard results avoids the need for a register write stage.
Certain operations may generate simpler microcode when one of their operands is known to be zero, or when the result is known to be discarded.
An explicit zero register takes some pressure off the compiler's optimizer, as it doesn't need to be as careful with its register assignment (no need to identify a register which won't cause a stall on read or write).
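For reference, the hardwiring itself is typically trivial in a register file. This is a generic Verilog sketch with hypothetical names, not any particular chip's design: writes to register 0 are dropped and reads of register 0 always return 0, so no instruction needs special-casing in the decoder.

module regfile (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr, raddr1, raddr2,
    input  wire [31:0] wdata,
    output wire [31:0] rdata1, rdata2
);
    reg [31:0] regs [0:31];

    always @(posedge clk)
        if (we && waddr != 5'd0)  // silently discard writes to $zero
            regs[waddr] <= wdata;

    assign rdata1 = (raddr1 == 5'd0) ? 32'b0 : regs[raddr1];
    assign rdata2 = (raddr2 == 5'd0) ? 32'b0 : regs[raddr2];
endmodule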
For each of your items, here's an answer.
Consider instructions that must take a register for output, where you want to discard this output. Normally, you'd have to make sure that you have a free register available, and if not, push some of your current registers onto the stack, which is a costly operation. Evidently, it happens a lot that the output of operations is discarded, and the easiest way to deal with this is to have an 'unused' register available.
Now that we have such an unused register, why not use it? It happens a lot that you want to zero-initialize something or compare something to zero. The long way is to first write zero to a register (which requires an extra instruction and the literal zero in your machine code, which may be of the form 0x00000000 - rather long) and then use it. So: one instruction shaved off, and a little bit of your program size as well.
These optimizations may seem a bit trivial, and may raise the question 'how much does that actually improve anything?' The answer is that the operations described above are apparently used a lot on your MIPS processor.
The concept of a zero register is not new. I first encountered it on a CDC 6600 mainframe, which dates back to the mid-to-late 1960s. In some ways it was one of the first RISC processors, and it was the world's fastest computer for 5 years. In that architecture, the "B0" register was hardwired to always be zero. http://en.wikipedia.org/wiki/CDC_6600
The benefit of such a register is primarily that it simplified the instruction set. When the decoding and orchestration of simple and regular instruction sets can be implemented without microcode, performance increases. In addition, for the 6600, as for most LSI chips today, the time a signal spends travelling the length of a "wire" is one of the key factors in execution speed, and keeping the instruction set simple (and avoiding microcode) requires fewer transistors and results in shorter circuit paths.
A zero register allows saving some opcodes when designing a new instruction set architecture (ISA).
For example, the main RISC-V spec has 32 pseudo-instructions that depend on the zero register (cf. Tables 26.2 and 26.3). A pseudo-instruction is an instruction that is mapped by the assembler to another real instruction (for example, branch-if-equal-to-zero is mapped to branch-if-equal). For comparison: the main RISC-V spec lists 164 real instruction opcodes (i.e. counting the RV(32|64)[IMAFD] base/extensions, a.k.a. RV64G). That means without a zero register, RISC-V RV64G would occupy 32 more opcodes for those instructions (i.e. 20% more). For a concrete RISC-V CPU implementation, this real-to-pseudo instruction ratio may shift in either direction depending on which extensions are selected.
Having fewer opcodes simplifies the instruction decoder. A more complex decoder needs more time to decode instructions, or occupies more gates (which can't be used for more useful CPU units), or both.
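As a sketch of why those pseudo-instructions cost the decoder nothing (hypothetical names, RV32I-style fields): beqz rs, off is assembled as beq rs, x0, off, and x0 simply reads as 0 from the register file, so the ordinary beq path handles both with no extra opcode:

module beq_decode (
    input  wire [6:0]  opcode,
    input  wire [2:0]  funct3,
    input  wire [31:0] rs1_val, rs2_val,  // rs2_val is 0 whenever rs2 = x0
    output wire        take_branch
);
    // RV32I: BRANCH major opcode is 7'b1100011, funct3 000 selects beq
    wire is_beq = (opcode == 7'b1100011) && (funct3 == 3'b000);
    assign take_branch = is_beq && (rs1_val == rs2_val);  // beqz for free
endmodule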
Existing, incrementally developed ISAs have to deal with backwards compatibility. Thus, if your original ISA design doesn't include a zero register, you can't just add one in a later revision without breaking compatibility. Also, if your existing ISA already requires a very complex decoder, then adding a zero register doesn't pay off.
Besides the modern RISC-V ISA (developed since 2010, first ratification in 2019), ARMv8 AArch64 (a 64-bit ISA released in 2011) also features a zero register, in contrast to the previous ARM 32-bit ISAs. Because of this and other changes, the AArch64 ISA has much less in common with the previous ARM 32-bit ISAs than, say, x86-64 has with x86.
In contrast to AArch64, x86-64 doesn't have a zero register. Although x86-64 is more modern than the previous 32-bit x86 ISA, its ISA only changed incrementally: it features all the existing x86 opcodes plus 64-bit variants, and thus the decoder is already very complex.

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in a computer architecture course that data hazards can be prevented by inserting several arbitrary, independent NOP instructions between two mutually dependent instructions. This can be done at the assembly level, by the compiler.
The alternative way to avoid data hazards is to use data forwarding.
I am a bit confused about how these two alternatives differ as far as performance, speed and hardware are concerned, because as per my knowledge data forwarding is implemented at the hardware level, whereas NOPs can be inserted at the assembly level.
Could anybody please explain which approach is better, considering factors such as performance, speed, hardware etc.?
Thanks.
Obviously, having the compiler insert NOPs into the code stream to fill pipeline slots allows the hardware to be simplified, which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding: higher latency for dependent instructions is very bad for typical programs.
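To make the hardware side of that contrast concrete, here's a minimal Verilog sketch of an EX-stage forwarding mux (hypothetical names; a real design has one of these per operand):

module forward_mux (
    input  wire [4:0]  ex_rs,          // source register being read in EX
    input  wire [4:0]  mem_rd, wb_rd,  // destinations of the older instructions
    input  wire        mem_we, wb_we,  // their register-write enables
    input  wire [31:0] rf_val, mem_val, wb_val,
    output wire [31:0] ex_operand
);
    // Prefer the youngest in-flight value; never forward $zero.
    assign ex_operand =
        (mem_we && mem_rd != 5'd0 && mem_rd == ex_rs) ? mem_val :
        (wb_we  && wb_rd  != 5'd0 && wb_rd  == ex_rs) ? wb_val  :
        rf_val;  // no hazard: use the register-file read
endmodule

Without such a mux, the only correct options are stalling or compiler-inserted NOPs until the value reaches the register file.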
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because this did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions, and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows the avoiding of nops; by compiler scheduling with instruction-level parallelism or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
Well, basically, since the hardware is optimised with forwarding, there should be no need for explicitly declared software NOPs. But that's not the case.
Though forwarding proves helpful in reducing data hazards, some hazards cannot be dealt with by forwarding. It just isn't possible.
E.g.:
beq R1,R5,label
instruction 2nd
Here the 2nd instruction will not be fetched until the beq has completed its execution stage and decided whether or not to branch. Until then, the 2nd instruction has to be stalled (for 2 clock cycles). This is done in software by inserting NOPs.
With improvements in technology and hardware optimizations, the beq instruction can effectively complete its execution in the register fetch/decode stage, by inserting a comparator in the decode stage itself. Even so, the 2nd instruction will be stalled (for 1 clock cycle now). Again, a NOP is needed.
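A minimal Verilog sketch of that early-resolution idea - a dedicated equality comparator on the register-file read ports in the decode stage, with hypothetical names:

module early_beq (
    input  wire        is_beq,          // from the ID-stage decoder
    input  wire [31:0] rs_val, rt_val,  // register-file read ports
    input  wire [31:0] pc_plus_4,
    input  wire [31:0] branch_offset,   // sign-extended and shifted
    output wire        take_branch,
    output wire [31:0] next_pc
);
    assign take_branch = is_beq && (rs_val == rt_val);  // simple comparator
    assign next_pc     = take_branch ? pc_plus_4 + branch_offset : pc_plus_4;
endmodule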

"Work stealing" vs. "Work shrugging"?

Why is it that I can find lots of information on "work stealing" and nothing on "work shrugging" as a dynamic load-balancing strategy?
By "work-shrugging" I mean pushing surplus work away from busy processors onto less loaded neighbours, rather than have idle processors pulling work from busy neighbours ("work-stealing").
I think the general scalability should be the same for both strategies. However I believe that it is much more efficient, in terms of latency & power consumption, to wake an idle processor when there is definitely work for it to do, rather than having all idle processors periodically polling all neighbours for possible work.
Anyway a quick google didn't show up anything under the heading of "Work Shrugging" or similar so any pointers to prior-art and the jargon for this strategy would be welcome.
Clarification
I actually envisage the work-submitting processor (which may or may not be the target processor) being responsible for looking around the immediate locality of the preferred target processor (based on data/code locality) to decide whether a near neighbour should be given the new work instead, because it doesn't have as much work to do.
I don't think the decision logic would require much more than an atomic read of the estimated queue lengths of the immediate (typically 2 to 4) neighbours. I do not think this is any more coupling than is implied by the thieves polling and stealing from their neighbours. (I am assuming lock-free, wait-free queues in both strategies.)
Resolution
It seems that what I meant (but only partially described!) as the "work shrugging" strategy is in the domain of "normal" upfront scheduling strategies that happen to be smart about processor, cache and memory locality, and scalable.
I find plenty of references searching on those terms, and several of them look pretty solid. I will post a reference when I identify the one that best matches (or demolishes!) the logic I had in mind with my definition of "work shrugging".
Load balancing is not free; it has a cost of a context switch (to the kernel), finding the idle processors, and choosing work to reassign. Especially in a machine where tasks switch all the time, dozens of times per second, this cost adds up.
So what's the difference? Work-shrugging means you further burden already-oversubscribed resources (busy processors) with the overhead of load balancing. Why interrupt a busy processor with administrivia when there's a processor next door with nothing to do? Work stealing, on the other hand, lets the idle processors run the load balancer while busy processors get on with their work. Work stealing saves time.
Example
Consider: Processor A has two tasks assigned to it. They take time a1 and a2, respectively. Processor B, nearby (the distance of a cache bounce, perhaps), is idle. The processors are identical in all respects. We assume the code for each task and the kernel is in the i-cache of both processors (no added page fault on load balancing).
A context switch of any kind (including load-balancing) takes time c.
No Load Balancing
The time to complete the tasks will be a1 + a2 + c. Processor A will do all the work, and incur one context switch between the two tasks.
Work-Stealing
Assume B steals a2, incurring the context switch time itself. The work will be done in max(a1, a2 + c) time. Suppose processor A begins working on a1; while it does that, processor B will steal a2 and avoid any interruption in the processing of a1. All the overhead on B is free cycles.
If a2 was the shorter task, here, you have effectively hidden the cost of a context switch in this scenario; the total time is a1.
Work-Shrugging
Assume B completes a2, as above, but A incurs the cost of moving it ("shrugging" the work). The work in this case will be done in max(a1, a2) + c time; the context switch is now always in addition to the total time, instead of being hidden. Processor B's idle cycles have been wasted, here; instead, a busy processor A has burned time shrugging work to B.
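Plugging in some purely hypothetical numbers makes the difference concrete. Suppose a1 = 10, a2 = 4 and c = 2:

no balancing:    a1 + a2 + c     = 10 + 4 + 2     = 16
work-stealing:   max(a1, a2 + c) = max(10, 4 + 2) = 10   (c hidden behind a1)
work-shrugging:  max(a1, a2) + c = max(10, 4) + 2 = 12   (c always added)

As long as a2 + c <= a1, stealing hides the context-switch cost entirely, while shrugging always pays it.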
I think the problem with this idea is that it makes the threads with actual work to do waste their time constantly looking for idle processors. Of course there are ways to make that faster, like having a queue of idle processors, but then that queue becomes a concurrency bottleneck. So it's just better to have the threads with nothing better to do sit around and look for jobs.
The basic advantage of 'work stealing' algorithms is that the overhead of moving work around drops to 0 when everyone is busy. So there's only overhead when some processor would otherwise have been idle, and that overhead cost is mostly paid by the idle processor with only a very small bus-synchronization related cost to the busy processor.
Work stealing, as I understand it, is designed for highly-parallel systems, to avoid having a single location (single thread, or single memory region) responsible for sharing out the work. In order to avoid this bottleneck, I think it does introduce inefficiencies in simple cases.
If your application is not so parallel that a single point of work distribution causes scalability problems, then I would expect you could get better performance by managing it explicitly as you suggest.
No idea what you might google for though, I'm afraid.
Some issues... if a thread is busy, wouldn't you want it spending its time processing real work instead of speculatively looking for idle threads to offload onto?
How does your thread decide when it has so much work that it should stop doing that work to look for a friend that will help?
How do you know that the other threads don't have just as much work and you won't be able to find a suitable thread to offload onto?
Work stealing seems more elegant, because it solves the same problem (contention) in a way that guarantees that the threads doing the load balancing are only doing the load balancing while they would otherwise have been idle.
It's my gut feeling that what you've described will not only be much less efficient in the long run, but will also require lots of per-system tweaking to get acceptable results.
Though in your edit you suggest that you want the submitting processor to handle this, not the worker threads as you suggested earlier and in some of the comments here. If the submitting processor is searching for the lowest queue length, you're potentially adding latency to the submit, which isn't really desirable.
But more importantly, it's a supplementary technique to work stealing, not a mutually exclusive one. You've potentially alleviated some of the contention that work stealing was invented to control, but you still have a number of things to tweak before you'll get good results, those tweaks won't be the same for every system, and you still risk running into situations where work stealing would help you.
I think your edited suggestion, with the submission thread doing "smart" work distribution, is potentially a premature optimization against work stealing. Are your idle threads slamming the bus so hard that your non-idle threads can't get any work done? That's when it's time to optimize work stealing.
So, by contrast with "work stealing", what is really meant here by "work shrugging" is a normal upfront work-scheduling strategy that is smart about processor, cache and memory locality, and scalable.
Searching on combinations of the terms/jargon above yields many substantial references to follow up. Some address the added complication of machine virtualisation, which wasn't in fact a concern of the questioner, but the general strategies are still relevant.