What is a "tight loop"? - terminology

I've heard that phrase a lot. What does it mean?
An example would help.

From Wiktionary:
(computing) In assembly languages, a loop which contains few instructions and iterates many times.
(computing) Such a loop which heavily uses I/O or processing resources, failing to adequately share them with other programs running in the operating system.
For case 1, it is probably something like:
for (unsigned int i = 0; i < 0xffffffff; ++i) {}

I think the phrase is generally used to designate a loop which iterates many times and which can have a serious effect on the program's performance - that is, it can use a lot of CPU cycles. You would usually hear this phrase in a discussion of optimization.
For examples, I think of games, where a loop might need to process every pixel on the screen, or scientific applications, where a loop processes entries in giant arrays of data points.
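As a rough sketch (the framebuffer size and the per-pixel operation below are invented purely for illustration), such a loop might look like:
#include <cstdint>
#include <vector>

int main() {
    // A made-up 1920x1080 framebuffer of 8-bit grayscale pixels.
    std::vector<std::uint8_t> framebuffer(1920 * 1080);

    // The tight loop: a tiny body executed about two million times per frame,
    // so it dominates the profile and is the first place to look when optimizing.
    for (std::uint8_t &pixel : framebuffer) {
        pixel = static_cast<std::uint8_t>(pixel * 3 / 4);  // darken each pixel
    }
    return 0;
}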

A tight loop is one which is CPU cache-friendly. It is a loop which fits in the instruction cache, which does no branching, and which effectively hides memory fetch latency for data being processed.
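A hedged sketch of such a loop (assuming the data is a plain contiguous array, so the body is tiny and branch-free apart from the loop test, and the hardware prefetcher can hide the memory latency):
#include <cstddef>

// Sums a contiguous array. The body is only a couple of instructions, so it
// fits easily in the instruction cache; the sequential access pattern lets
// the prefetcher hide the latency of fetching data[i].
long long sum(const int *data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}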

There's a good example of a tight loop (here, effectively an infinite loop) in the video Jon Skeet and Tony the Pony.
The example is:
while (text.IndexOf("  ") != -1) text = text.Replace("  ", " ");
which produces a tight loop when the text contains a Unicode zero-width character between two spaces: IndexOf ignores the zero-width character (and so keeps finding two adjacent spaces), but Replace does not ignore it (and so never collapses them).
There are already good definitions in the other answers, so I won't repeat them here.

SandeepJ's answer is the correct one in the context of network appliances (for an example, see the Wikipedia entry on middlebox) that deal with packets. I would add that the thread/task running the tight loop tries to stay scheduled on a single CPU and avoid being context-switched out.

According to Webster's dictionary, "A loop of code that executes without releasing any resources to other programs or the operating system."
http://www.websters-online-dictionary.org/ti/tight+loop.html

From experience, I've noticed that whenever you write a loop that runs indefinitely, for instance something like:
while (true)
{
    // do some processing
}
Such a loop will almost always be resource intensive. If you check the CPU and memory usage of the process running this loop, you will find that it has shot up. Such is the idea some people call a "tight loop".

Many programming environments nowadays don't expose the programmer to the conditions that need a tight loop. For example, Java web services run in a container that calls your code, and you should minimise/eliminate loops within a servlet implementation. Systems like Node.js handle the tight loop for you, and again you should minimise/eliminate loops in your own code.
Tight loops are used in cases where you have complete control of program execution, e.g. an OS or real-time/embedded environments. In the case of an OS, you can think of the CPU's idle time as the amount of time it spends in the tight loop, because the tight loop is where it checks whether other processes need to run or whether there are queues that need servicing. If no processes need to run and the queues are empty, the CPU is just whizzing round the tight loop, and this gives a notional indication of how 'not busy' the CPU is.
A tight loop should be designed to perform only checks, so it becomes like a big list of if..then statements; in assembly it boils down to a COMPARE and then a branch, which is very efficient. When none of the checks result in a branch, the tight loop can execute millions of times per second. Operating/embedded systems usually have some detection for CPU hogging, to handle cases where a process has not relinquished control of the CPU; checks for this kind of occurrence could also be performed in the tight loop.
Ultimately, a program needs to have a loop at some point, otherwise you couldn't do anything useful with a CPU; if you never see the need for a loop, it's because your environment handles all that for you. A CPU keeps executing instructions until there's nothing left to execute or you get a crash, so a program like an OS must have a tight loop to actually function, otherwise it would run for just microseconds.
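A minimal sketch of that kind of idle loop (all the function names here - runnable_process_exists, queues_need_service, run_next_process, service_queues, detect_cpu_hog - are hypothetical placeholders, not a real OS API):
// Hypothetical helpers; in a real OS these would be scheduler and queue code.
bool runnable_process_exists();
bool queues_need_service();
void run_next_process();
void service_queues();
void detect_cpu_hog();

// The "tight loop" shape described above: nothing but a list of checks,
// each compiling down to a compare and a branch.
void idle_tight_loop() {
    for (;;) {
        if (runnable_process_exists()) {
            run_next_process();      // branch out only when there is work
            continue;
        }
        if (queues_need_service()) {
            service_queues();
            continue;
        }
        detect_cpu_hog();            // watchdog-style check for CPU hogging
        // Nothing to do: the CPU is "not busy" and whizzes round again.
    }
}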

Related

performance differences between iterative and recursive method in mips

I have 2 versions of a program I must analyze. One is recursive and the other is iterative. I must compare the cache hit rate for both, examine how performance varies between the two methods, and compare their instruction counts.
For both methods, regardless of the block settings, I get roughly 100 fewer memory accesses for the iterative method. Both trigger 2 misses. I can only drop the cache hit rate to 85% if I set the cache to 1 block of size 256.
For the instruction count, the iterative version is roughly 1000 instructions shorter.
Can someone explain why this happens, or point me to some literature on it? I can't seem to find anything. I would just like a general overview of why this occurs.
Took some info from this question: Recursion or Iteration? and some from COMP 273 at McGill, which I assume you're also in and that's why you asked.
Each recursive call generally requires the return address of that call to be pushed onto the stack; in MIPS (assembly) this must be done manually, otherwise the return address gets overwritten by each jal. As such, a recursive solution usually uses more cache space, and so the memory access count is higher. In an iterative solution this isn't necessary at all.
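A rough C++ analogue of the two versions (this is not the MIPS code from the assignment, just a sketch of why the recursive one touches memory more):
// Iterative sum: one stack frame; the loop runs almost entirely in registers.
long long sum_iterative(int n) {
    long long total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Recursive sum: every call pushes a return address (and any saved registers)
// onto the stack -- in MIPS you must spill $ra yourself before each jal --
// which is exactly the extra memory traffic and instruction count you measured.
long long sum_recursive(int n) {
    if (n == 0) return 0;
    return n + sum_recursive(n - 1);
}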

comparison and addition of two integers: in detail

I would like to know how many machine cycles it takes to compare two integers, how many it takes to add them, and which one is easier.
Basically, I'm looking to see which one is more expensive in general.
Also, I need an answer from a C, C++, and Java perspective.
Help is appreciated, thanks!
The answer is yes. And no. And maybe.
There are machines that can compare two values in their spare time between cycles, and others that need several cycles. On the old PDP-8 you first had to negate one operand, do an add, and then test the result to do a compare.
But other machines can compare much faster than they add, because no register needs to be modified.
But on still other machines the operations take the same time, yet it takes several cycles for the result of the compare to reach a place where it can be tested; so, if you can use those cycles, the compare is almost free, but it is fairly expensive if you have no other operations to shove into those cycles.
The simple answer is one cycle; both operations are equally easy.
A totally generic answer is difficult to give, since processor architectures are amazingly complex when you get down into the details.
All modern processors are pipelined. That is, there are no instructions where the operands go in on cycle c, and the result is available on cycle c+1. Instead, the instruction is broken down into multiple steps.
The instructions are read into the front end of the processor, which decodes the instruction. This may include breaking it down into multiple micro-ops. The operands are then read into registers, and then the execution units handle the actual operation. Eventually the answer is returned back to a register.
The instructions go through one pipeline stage each cycle, and modern CPUs have 10-20 pipeline stages. So it could take up to 20 processor cycles to add or compare two numbers. However, once one instruction has passed through one stage of the pipeline, another instruction can be read into that stage. The ideal is that on each clock cycle, one instruction goes into the front end while one set of results comes out the other end.
There is massive complexity involved in getting all this to work. If you want to do a+b+c, you need to add a+b before you can add c. So a lot of the work in the front end of the processor involves scheduling. Modern processors employ out-of-order execution, so that the processor will examine the incoming instructions, and re-order them such that it does a+b, then gets on with some other work, and then does result+c once the result is available.
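A small sketch of why the dependency chain matters (illustrative only; whether your compiler already performs this transformation depends on the types, the target, and the flags):
#include <cstddef>

// One long dependency chain: each addition must wait for the previous one,
// so the pipeline cannot overlap them.
double sum_chained(const double *a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

// Two independent chains: the out-of-order core can keep two additions in
// flight at once and merge the partial sums at the end.
double sum_two_chains(const double *a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n) s0 += a[i];   // leftover element when n is odd
    return s0 + s1;
}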
Which all brings us back to the original question of which is easier. Usually, if you're comparing two integers, it is to make a decision about what to do next, which means you won't know your next instruction until you have the result of the last one. Because the instructions are pipelined, you can lose 20 clock cycles of work if you wait.
So modern CPUs have a branch predictor which guesses what the result will be and continues executing instructions. If it guesses wrong, the pipeline has to be thrown out and work restarted on the other branch. The branch predictor helps enormously, but still, if the comparison is a decision point in the code, it is far more difficult for the CPU to deal with than the addition.
Comparison is done via subtraction, which is almost the same as addition, except that the carry and subtrahend are complemented, so a - b - c becomes a + ~b + ~c. This is already accounted for in the CPU and basically takes the same amount of time either way.
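A hedged illustration of that point (this just mimics in C++ what the ALU does; a real compare instruction sets flags from the same adder rather than storing the difference):
#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t a = 7, b = 12;

    // In two's complement, a - b is computed as a + ~b + 1; the flags a
    // compare instruction sets (zero, sign, ...) come from this result.
    std::uint32_t diff = a + ~b + 1u;       // same bits as a - b
    bool zero_flag = (diff == 0);           // a == b
    bool sign_flag = (diff >> 31) & 1u;     // sign of the signed difference

    std::printf("a == b ? %d\n", zero_flag);
    std::printf("a <  b ? %d (valid here because there is no signed overflow)\n", sign_flag);
    return 0;
}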

Recovering from stack overflow or heap exhaustion in a Haskell program

I am currently writing a genetic algorithm in Haskell in which my chromosomes are rather complex structures representing executable systems.
In order to evaluate the fitness of my chromosomes, I have to run an evolution function which performs one computational cycle of a given system. The fitness is then calculated simply by counting how many times the evolution can be applied before there is no change in the system (in which case the system terminates).
The problem now is as follows: some systems can run infinitely long and will never terminate - I want to penalise those (by giving them a low score). I could simply put a limit on the number of steps, but that does not solve another problem.
Some of my systems perform exponential computation (i.e. even for small numbers of evolution steps they grow to a giant size) and they cause ERROR - Control stack overflow. For a human observer it is clear that they will never terminate, but the algorithm has no way of knowing, so it runs and crashes.
My question is: is it possible to recover from such an error? I would like my algorithm to continue running after encountering this problem and just adjusting the chromosome score accordingly.
It seems to me like the best solution would be to tell the program: "Hey, try doing this, but if you fail don't worry. I know how to handle it". However I am not even sure if that's possible. If not - are there any alternatives?
This will be hard to do reliably from inside Haskell -- though under some conditions GHC will raise exceptions for them. (You will need GHC 7.)
import Control.Exception
If you really just want to catch stack overflows, this is possible, as this example shows:
> handle (\StackOverflow -> return Nothing) $
      return . Just $! foldr (+) 0 (replicate (2^25) 1)
Nothing
Or catching any async exception (including heap exhaustion):
> handle (\(e :: AsyncException) -> print e >> return Nothing) $
      return . Just $! foldr (+) 0 (replicate (2^25) 1)
stack overflow
Nothing
However, this is fragile.
Alternatively, with GHC flags you can enforce a maximum stack (or heap) size on a GHC-compiled process, causing it to be killed if it exceeds those limits (by default, GHC appears to impose no maximum stack limit these days).
If you compile your Haskell program with GHC (as is recommended), for example:
$ ghc -O --make A.hs -rtsopts
then running it with the RTS options below enforces a low heap (and stack) limit:
$ ./A +RTS -M1M -K1M
Heap exhausted;
This requires GHC. (Again, you shouldn't be using Hugs for this kind of work). Finally, you should ensure your programs don't use excessive stack in the first place, via profiling in GHC.
I think a general solution here is to provide a way to measure computation time and kill the computation if it takes too long. You can simply add a counter parameter to your evaluation function if it's recursive; when it drops to zero you return an error value, for example Nothing, otherwise Just result.
This approach can be implemented in ways other than an explicit counter parameter, for example by putting the counter into the monad used by the evaluation (if your code is monadic) or, impurely, by running the computation in a separate thread that is killed on timeout.
I would rather use any pure solution, since it would be more reliable.
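A minimal sketch of that counter ("fuel") idea, written in C++ purely for illustration (std::optional plays the role of Maybe; System, evolve_once and has_terminated are stand-ins for the poster's own types and functions):
#include <optional>

struct System { /* chromosome state elided */ };

// Hypothetical helpers standing in for the poster's own code:
System evolve_once(const System &s);                              // one computational cycle
bool has_terminated(const System &before, const System &after);   // "no change" check

// Returns the number of evolution steps ("Just steps"), or std::nullopt
// ("Nothing") if the step budget runs out, in which case the caller can
// assign a penalty score instead of crashing.
std::optional<int> fitness(System s, int fuel) {
    for (int steps = 0; fuel > 0; --fuel, ++steps) {
        System next = evolve_once(s);
        if (has_terminated(s, next)) {
            return steps;
        }
        s = next;
    }
    return std::nullopt;   // budget exhausted: probably non-terminating
}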
It seems to me like the best solution would be to tell the program:
"Hey, try doing this, but if you fail don't worry. I know how to handle it"
In most languages that would be a try/catch block. I'm not sure what the equivalent is in Haskell, or even if an equivalent exists. Furthermore, I doubt that a try/catch construct could effectively trap/handle a stack overflow condition.
But would it be possible to apply some reasonable constraints to prevent the overflow from occurring? For instance, perhaps you could set some upper-bound on the size of a system, and monitor how each system approaches the boundary from one iteration to the next. Then you could enforce a rule like "if on a single evolution a system either exceeded its upper-bound or consumed more than 50% of the space remaining between its previous allocation and its upper-bound, that system is terminated and suffers a score penalty".
A thought on your genetic algorithm: Part of the fitness of your chromosomes is that they do not consume too many computational resources. The question you asked defines "too many resources" as crashing the runtime system. This is a rather arbitrary and somewhat random measure.
Knowing that it will add to the complexity of your evolve function, I would still suggest that this function be made aware of the computational resources that a chromosome consumes. This allows you to fine-tune when it has "eaten" too much and dies prematurely of "starvation". It might also allow you to adjust your penalty based on how rapidly the chromosome went exponential, with the idea that a chromosome that is just barely exponential is more fit than one with an extremely high branching factor.

producer/consumer model and concurrent kernels

I'm writing a CUDA program that can be interpreted as a producer/consumer model.
There are two kernels:
one produces data in device memory,
and the other kernel consumes the produced data.
The number of consuming threads is set to a multiple of 32, which is the warp size,
and each warp waits until 32 data items have been produced.
I've got a problem here.
If the consumer kernel is launched later than the producer,
the program doesn't halt.
The program sometimes runs indefinitely even though the consumer is launched first.
What I'm asking is: is there a nice implementation model for producer/consumer in CUDA?
Can anybody give me a direction or reference?
Here is the skeleton of my code.
**kernel1**:
while LOOP_COUNT
compute something
if SOME CONDITION
atomically increment PRODUCE_COUNT
write data into DATA
atomically increment PRODUCER_DONE
**kernel2**:
while FOREVER
CURRENT=0
if FINISHED CONDITION
return
if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
return
if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT
else if PRODUCER_DONE==TOTAL_PRODUCER
if CURRENT*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT
Since you did not provide actual code, it is hard to check where the bug is. Usually the skeleton is correct, but the problem lies in the details.
One possible issue that I can think of:
By default, in CUDA there is no guarantee that global memory writes by one kernel will be visible to another kernel, with the exception of atomic operations. It can then happen that your first kernel increments PRODUCER_DONE, but there is still no data in DATA.
Fortunately, you are given the intrinsic function __threadfence(), which halts the execution of the calling thread until its prior writes to global memory are visible to other threads. You should put it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 in the CUDA Programming Guide.
Another issue that may or may not appear:
From the point of view of kernel2, the compiler may deduce that PRODUCE_COUNT, once read, never changes. It may optimise the code so that, once the value is loaded into a register, it is reused instead of querying global memory every time. The solution? Mark the variable volatile, or read the value using another atomic operation.
(Edit)
Third issue:
I forgot about one more problem. On pre-Fermi cards (GeForce before the 400 series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for the consumer kernel to end before the producer kernel starts executing. If you want both to run at the same time, put both into a single kernel and have an if-branch based on some block index.

High-level/semantic optimization

I'm writing a compiler, and I'm looking for resources on optimization. I'm compiling to machine code, so anything at runtime is out of the question.
What I've been looking for lately is less code optimization and more semantic/high-level optimization. For example:
free(malloc(400)); // should be completely optimized away
Even if these functions were completely inlined, they could eventually call OS memory functions which can never be inlined. I'd love to be able to eliminate that statement completely without building special-case rules into the compiler (after all, malloc is just another function).
Another example:
string Parenthesize(string str) {
    StringBuilder b; // similar to C#'s class of the same name
    foreach (str : ["(", str, ")"])
        b.Append(str);
    return b.Render();
}
In this situation I'd love to be able to initialize b's capacity to str.Length + 2 (enough to exactly hold the result, without wasting memory).
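In other words, roughly this (sketched in C++ for illustration, with std::string standing in for the StringBuilder):
#include <string>

std::string Parenthesize(const std::string &str) {
    std::string b;
    b.reserve(str.size() + 2);   // capacity deduced by the compiler: "(" + str + ")"
    b.append("(");
    b.append(str);
    b.append(")");
    return b;
}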
To be completely honest, I have no idea where to begin in tackling this problem, so I was hoping for somewhere to get started. Has there been any work done in similar areas? Are there any compilers that have implemented anything like this in a general sense?
To do an optimization across 2 or more operations, you have to understand the algebraic relationship of those two operations. If you view operations in their problem domain, they often have such relationships.
Your free(malloc(400)) is possible because free and malloc are inverses in the storage allocation domain.
Lots of operations have inverses, and teaching the compiler that they are inverses, and demonstrating that the result of one flows unconditionally into the other, is what is needed. You have to make sure that your inverses really are inverses and that there isn't a surprise somewhere; a/x*x looks like just the value a, but if x is zero you get a trap. If you don't care about the trap, it is an inverse; if you do care about the trap, then the optimization is more complex:
(if (x==0) then trap() else a)
which is still a good optimization if you think divide is expensive.
Other "algebraic" relationships are possible. For instance, there are
may idempotent operations: zeroing a variable (setting anything to the same
value repeatedly), etc. There are operations where one operand acts
like an identity element; X+0 ==> X for any 0. If X and 0 are matrices,
this is still true and a big time savings.
Other optimizations can occur when you can reason abstractly about what the code is doing. "Abstract interpretation" is a set of techniques for reasoning about values by classifying results into various interesting bins (e.g., this integer is unknown, zero, negative, or positive). To do this you need to decide what bins are helpful, and then compute the abstract value at each point. This is useful when there are tests on categories (e.g., "if (x<0) { ...") and you know abstractly that x is less than zero; you can then optimize away the conditional.
Another way is to define what a computation does symbolically, and simulate the computation to see the outcome. That is how you would compute the effective size of the required buffer: you compute the buffer size symbolically before the loop starts, and simulate the effect of executing the loop for all iterations. For this you need to be able to construct symbolic formulas representing program properties, compose such formulas, and often simplify such formulas when they get unusably complex (this kind of fades into the abstract interpretation scheme). You also want such symbolic computation to take into account the algebraic properties described above. Tools that do this well are good at constructing formulas, and program transformation systems are often good foundations for this. One source-to-source program transformation system that can be used to do this is the DMS Software Reengineering Toolkit.
What's hard is deciding which optimizations are worth doing, because you can end up keeping track of vast amounts of stuff which may not pay off. Computer cycles are getting cheaper, though, and so it makes sense to track more properties of the code in the compiler.
The Broadway framework might be in the vein of what you're looking for. Papers on "source-to-source transformation" will probably also be enlightening.