performance differences between iterative and recursive method in mips - mips

I have 2 version of a program I must analyze. One is recursive and the other is iterative. I must compare cache hit rate for both and examine as performance varies for both methods as well as their instructions count
For both methods regardless of block setting I get roughly 100 less memory access for the iterative method. both trigger 2 misses. I can only drop the cache hit rate to 85% if i set the setting to 1 block of size 256.
for the instructions count the iterative is roughly 1000 instructions less
Can someone explain to me why this happens or provide some literature I can read this in I can't seem to find anything. I would just like a general overview of why this occurs.

Took some info from this question: Recursion or Iteration? and some from in COMP 273 at McGill, which I assume you're also in and that's why you asked.
Each recursive call generally requires the return address of that call to be pushed onto the stack; in MIPS (assembly) this must be done manually otherwise the return address gets overwritten with each jal. As such, usually more cache space is used for a recursive solution and as such the memory access count is higher. In an iterative solution this isn't necessary whatsoever.

Related

Do different arithmetic operations have different processing times?

Are the basic arithmetic operations same with respect to processor usage. For e.g if I do an addition vs division in a loop, will the calculation time for addition be less than that for division?
I am not sure if this question belongs here or computer science SE
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
those are the microcode and the timing of the operation of a massively old architecture, the 8086. it is a fairly simple point to start.
of relevant note, they are measured in cycles, or clocks, and everything move at the speed of the cpu (they are synchronized on the main clock or frequency of the microprocessor)
if you scroll down on that table you'll see a division taking anywhere from 80 to 150 cycles.
also note operation speed is affected by which area of memory the operand reside.
note that on modern processor you can have parallel instruction executed concurrently (even if the cpu is single threaded) and some of them are executed out of order, then vector instruction murky the question even more.
i.e. a SSE multiplication can multiply multiple number in a single operation (taking multiple time)
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.

How to create Large Bit array in cuda?

I need to keep track of around 10000 elements of an array in my algorithm .So for this I need boolean for each record.If I used char array to keep track of 10000 arrays (as 0/1),it would take up lot of memory.
So Can I create an bit array of 10000 bits in Cuda where each bit represents corresponding array record?
As Roger said, the answer is yes, CUDA provides the same bitwise operations (i.e. >>, << and &) as normal C so you can implement your bit array essentially normally (almost, see thread synchronisation issues below).
However, for your situation it is almost certainly not a good idea.
There are problems with thread syncronisation. Imagine two of the threads on your GPU are inverting two bits of a single entry of your array. Each thread will read the same value out of memory, and apply their operation to it, but the thread that writes its value back to memory last will overwrite the result of the other thread. (Note: if your bit array is not being modified by the GPU code then this isn't a problem.)
And, unless this is explicitly required, you shouldn't be optimising for memory use, an array with 10K elements does not take much memory at all: even if you were storing each boolean in an 64 bit integer it's only 80 KB. And obviously you can store them in a smaller datatype. (You should only start worrying about compressing the array as much as possible when you are getting upwards of tens of millions, or even hundreds of millions of elements.)
Also, the way GPUs work means that you might get best performance by using a reasonably large data type for each boolean (most likely a 32 bit one) so that, for example, memory coalescing works better. (I haven't tested this assertion, you would need to run some benchmarks to check it.)

Recovering from stack overflow or heap exhaustion in a Haskell program

I am currently writting a genetic algorithm in Haskell in which my chromosomes are rather complex structures representing executable systems.
In order for me to evaluate the fitness of my chromosomes I have to run an evolution function which performs one computational cycle of a given system. The fitness then is calculated just by counting how many times the evolution can be applied before there is no change in the system (in which case the system terminates).
The problem now is as follows: some systems can run infinitely long and will never terminate - I want to penalise those (by giving them little score). I could simply put a certain limit on number of steps but it does not solve another problem.
Some of my systems perform exponential computation (i.e. even for small values of evloution steps they grow to giant size) and they cause ERROR - Control stack overflow. For human observer it is clear that they will never terminate but the algorithm has no way of knowing so it runs and crushes.
My question is: is it possible to recover from such an error? I would like my algorithm to continue running after encountering this problem and just adjusting the chromosome score accordingly.
It seems to me like the best solution would be to tell the program: "Hey, try doing this, but if you fail don't worry. I know how to handle it". However I am not even sure if that's possible. If not - are there any alternatives?
This will be hard to do reliably from inside Haskell -- though under some conditions GHC will raise exceptions for these conditions. (You will need GHC 7).
import Control.Exception
If you really just want to catch stack overflows, this is possible, as this example shows:
> handle (\StackOverflow -> return Nothing) $
return . Just $! foldr (+) 0 (replicate (2^25) 1)
Nothing
Or catching any async exception (including heap exhaustion):
> handle (\(e :: AsyncException) -> print e >> return Nothing) $
return . Just $! foldr (+) 0 (replicate (2^25) 1)
stack overflow
Nothing
However, this is fragile.
Alternately, with GHC flags you can enforce maximum stack (or heap) size on a GHC-compiled process, causing it to be killed if it exceeds those limits (GHC appears to have no maximum stack limit these days).
If you compile your Haskell program with GHC (as is recommended), running it as:
$ ghc -O --make A.hs -rtsopts
the low heap limit below is enforced:
$ ./A +RTS -M1M -K1M
Heap exhausted;
This requires GHC. (Again, you shouldn't be using Hugs for this kind of work). Finally, you should ensure your programs don't use excessive stack in the first place, via profiling in GHC.
I think a general solution here is to provide a way to measure computation time, and kill it if it takes too much time. You can simply add counter to your evaluation function if it's recursive and if it drops to zero you return an error value - for example Nothing, otherwise it's Just result.
This approach can be implemented in other ways than explicit count parameter, for example by putting this counter into monad used by evaluation (if your code is monadic) or, impurely, by running computation in separate thread that will be killed on timeout.
I would rather use any pure solution, since it would be more reliable.
It seems to me like the best solution would be to tell the program:
"Hey, try doing this, but if you fail don't worry. I know how to handle it"
In most languages that would be a try/catch block. I'm not sure what the equivalent is in haskell, or even if some equivalent exists. Furthermore, I doubt that a try/catch construct could effectively trap/handle a stack overflow condition.
But would it be possible to apply some reasonable constraints to prevent the overflow from occurring? For instance, perhaps you could set some upper-bound on the size of a system, and monitor how each system approaches the boundary from one iteration to the next. Then you could enforce a rule like "if on a single evolution a system either exceeded its upper-bound or consumed more than 50% of the space remaining between its previous allocation and its upper-bound, that system is terminated and suffers a score penalty".
A thought on your genetic algorithm: Part of the fitness of your chromosomes is that they do not consume too many computational resources. The question you asked defines "too many resources" as crashing the runtime system. This is a rather arbitrary and somewhat random measure.
Knowing that it will add to the complexity of your evolve function, I still would suggest that this function be made aware of the computational resources that a chromosome consumes. This allows you to fine tune when it has "eaten" too much and dies prematurely of "starvation". It might also allow you to adjust your penalty based on how rapidly the chromosome went exponential with the idea that a chromosome that is just barely exponential is more fit then one with an extremely high branching factor.

cache behaviour on redundant writes

Edit - I guess the question I asked was too long so I'm making it very specific.
Question: If a memory location is in the L1 cache and not marked dirty. Suppose it has a value X. What happens if you try to write X to the same location? Is there any CPU that would see that such a write is redundant and skip it?
For example is there an optimization which compares the two values and discards a redundant write back to the main memory? Specifically how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?
Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply implies setting all values to 0. Each time we recycle, size-n locations are already 0. To me it seems (intuitively) that avoiding so many redundant write backs would make the recycling process faster and hence the question.
Doing this in code wouldn't make sense, since branch instruction itself might cause an unnecessary cache miss (if (buf[i]) {...} )
I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.
I wrote a great big reply, and then I remembered: this is called "Silent Stores" in the literature. See "Silent Stores for Free", K. Lepak and M Lipasti, UWisc, MICRO-33, 2000.
Anyway, in my reply I described some of the implementation issues.
By the way, topics like this are often discussed in the USEnet newsgroup comp.arch.
I also write about them on my wiki, http://comp-arch.net
Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:
The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
The old and new values are compared.
If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.
Step 1 may actually take longer time than steps 2 and 3. It is because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it was implemented in software.
Consider if we simply write the new values to the cache, without checking the old value. It is actually faster than the three-step process mentioned above, for two reasons. Firstly, there is no need to wait for the old value. Secondly, the CPU can simply schedule the write operation in an output buffer. The output buffer can perform the cache write simutaneously while the ALU can start working on something else.
So far, the only latencies involved are that of between the CPU and the cache, not between the cache and the main memory.
The situation is more complicated in modern-day microprocessors, because their cache is organized into cache-lines. When a byte value is written to a cache-line, the complete cache-line has to be loaded because the other part of the cache-line that is not rewritten has to keep its old values.
http://blogs.amd.com/developer/tag/sse4a/
Read
Cache hit: Data is read from the cache line to the target register
Cache miss: Data is moved from memory to the cache, and read into the target register
Write
Cache hit: Data is moved from the register to the cache line
Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
This is not an answer to your original question on computer-architecture, but might be relevant to your goal.
In this discussion, all array index starts with zero.
Assuming n is much smaller than size, change your algorithm so that it saves two pieces of information:
An array of size
An array of n, and a counter, used to emulate a set container. Duplicate values allowed.
Every time a non-zero value is written to the index k in the full-size array, insert the value k to the set container.
When the full-size array needs to be cleared, get each value stored in the set container (which will contain k, among others), and set each corresponding index in the full-size array to zero.
A similar technique, known as a two-level histogram or radix histogram, can also be used.
Two pieces of information are stored:
An array of size
An boolean array of ceil(size / M), where M is the radix. ceil is the ceiling function.
Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.
Let's say, bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.
When the full-size array needs to be cleared, scan the boolean array for any marked elements, and its corresponding range in the full-size array should be cleared.

What is a "tight loop"?

I've heard that phrase a lot. What does it mean?
An example would help.
From Wiktionary:
(computing) In assembly languages, a loop which contains few instructions and iterates many times.
(computing) Such a loop which heavily uses I/O or processing resources, failing to adequately share them with other programs running in the operating system.
For case 1 it is probably like
for (unsigned int i = 0; i < 0xffffffff; ++ i) {}
I think the phrase is generally used to designate a loop which iterates many times, and which can have a serious effect on the program's performance - that is, it can use a lot of CPU cycles. Usually you would hear this phrase in a discussion of optimization.
For examples, I think of gaming, where a loop might need to process every pixel on the screen, or scientific app, where a loop is processing entries in giant arrays of data points.
A tight loop is one which is CPU cache-friendly. It is a loop which fits in the instruction cache, which does no branching, and which effectively hides memory fetch latency for data being processed.
There's a good example of a tight loop (~ infinite loop) in the video Jon Skeet and Tony the Pony.
The example is:
while(text.IndexOf(" ") != -1) text = text.Replace(" ", " ");
which produces a tight loop because IndexOf ignores a Unicode zero-width character (thus finds two adjacent spaces) but Replace does not ignore them (thus not replacing any adjacent spaces).
There are already good definitions in the other answers, so I don't mention them again.
SandeepJ's answer is the correct one in the context of Network appliances (for an example, see Wikipedia entry on middlebox) that deal with packets. I would like to add that the thread/task running the tight loop tries to remain scheduled on a single CPU and not get context switched out.
According to Webster's dictionary, "A loop of code that executes without releasing any resources to other programs or the operating system."
http://www.websters-online-dictionary.org/ti/tight+loop.html
From experience, I've noticed that if ever you're trying to do a loop that runs indefinitely, for instance something like:
while(true)
{
//do some processing
}
Such a loop will most likely always be resource intensive. If you check the CPU and Memory usage by the process with this loop, you will find that it will have shot up. Such is the idea some people call a "tight loop".
Many programming environments nowadays don't expose the programmer to the conditions that need a tight-loop e.g. Java web-services run in a container that calls your code and you need to minimise/eliminate loops within a servlet implementation. Systems like Node.js handle the tight-loop and again you should minimise/eliminate loops in your own code. They're used in cases where you have complete control of program execution e.g. OS or real-time/embedded environments. In the case of an OS you can think of the CPU being in an idle state as the amount of time it spends in the tight-loop because the tight-loop is where it performs checks to see if other processes need to run or if there are queues that need serviced so if there are no processes that need to run and queues are empty then the CPU is just whizzing round the tight-loop and this produces a notional indication of how 'not busy' the CPU is. A tight-loop should be designed as just performing checks so it just becomes like a big list of if .. then statements so in Assembly it will boil down to a COMPARE operand and then a branch so it is very efficient. When all the checks result in not branching, the tight-loop can execute millions of times per second. Operating/embedded Systems usually have some detection for CPU hogging to handle cases where some process has not relinquished control of the CPU - checks for this type of occurrence could be performed in the tight-loop. Ultimately you need to understand that a program needs to have a loop at some point otherwise you couldn't do anything useful with a CPU so if you never see the need for a loop it's because your environment handles all that for you. A CPU keeps executing instructions until there's nothing left to execute or you get a crash so a program like an OS must have a tight-loop to actually function otherwise it would run for just microseconds.