registers vs stacks - language-agnostic

What exactly are the advantages and disadvantages to using a register-based virtual machine versus using a stack-based virtual machine?
To me, it would seem as though a register based machine would be more straight-forward to program and more efficient. So why is it that the JVM, the CLR, and the Python VM are all stack-based?

Implemented in hardware, a register-based machine is going to be more efficient simply because there are fewer accesses to the slower RAM. In software, however, even a register based architecture will most likely have the "registers" in RAM. A stack based machine is going to be just as efficient in that case.
In addition a stack-based VM is going to make it a lot easier to write compilers. You don't have to deal with register allocation strategies. You have, essentially, an unlimited number of registers to work with.
Update: I wrote this answer assuming an interpreted VM. It may not hold true for a JIT compiled VM. I ran across this paper which seems to indicate that a JIT compiled VM may be more efficient using a register architecture.

This has already been answered, to a certain level, in the Parrot VM's FAQ and associated documents:
A Parrot Overview
The relevant text from that doc is this:
the Parrot VM will have a register architecture, rather than a stack architecture. It will also have extremely low-level operations, more similar to Java's than the medium-level ops of Perl and Python and the like.
The reasoning for this decision is primarily that by resembling the underlying hardware to some extent, it's possible to compile down Parrot bytecode to efficient native machine language.
Moreover, many programs in high-level languages consist of nested function and method calls, sometimes with lexical variables to hold intermediate results. Under non-JIT settings, a stack-based VM will be popping and then pushing the same operands many times, while a register-based VM will simply allocate the right amount of registers and operate on them, which can significantly reduce the amount of operations and CPU time.
You may also want to read this: Registers vs stacks for interpreter design
Quoting it a bit:
There is no real doubt, it's easier to generate code for a stack machine. Most freshman compiler students can do that. Generating code for a register machine is a bit tougher, unless you're treating it as a stack machine with an accumulator. (Which is doable, albeit somewhat less than ideal from a performance standpoint) Simplicity of targeting isn't that big a deal, at least not for me, in part because so few people are actually going to directly target it--I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about? The numbers are small. The other issue there is that many of the folks with compiler knowledge already are comfortable targeting register machines, as that's what all hardware CPUs in common use are.

Traditionally, virtual machine implementors have favored stack-based architectures over register-based due to 'simplicity of VM implementation' ease of writing a compiler back-end - most VMs are originally designed to host a single language and code density and executables for stack architecture are invariably smaller than executables for register architectures. The simplicity and code density are a cost of performance.
Studies have shown that a registered-based architecture requires an average of 47% less executed VM instructions than stack-based architecture, and the register code is 25% larger than corresponding stack code but this increase cost of fetching more VM instructions due to larger code size involves only 1.07% extra real machine loads per VM instruction which is negligible. The overall performance of the register-based VM is that it takes, on average, 32.3% less time to execute standard benchmarks.

One reason for building stack-based VMs is that that actual VM opcodes can be smaller and simpler (no need to encode/decode operands). This makes the generated code smaller, and also makes the VM code simpler.

How many registers do you need?
I'll probably need at least one more than that.

Stack based VM's are simpler and the code is much more compact. As a real world example, a friend built (about 30 years ago) a data logging system with a homebrew Forth VM on a Cosmac. The Forth VM was 30 bytes of code on a machine with 2k of ROM and 256 bytes of RAM.

It is not obvious to me that a "register-based" virtual machine would be "more straight-forward to program" or "more efficient". Perhaps you are thinking that the virtual registers would provide a short-cut during the JIT compilation phase? This would certainly not be the case, since the real processor may have more or fewer registers than the VM, and those registers may be used in different ways. (Example: values that are going to be decremented are best placed in the ECX register on x86 processors.) If the real machine has more registers than the VM, then you're wasting resources, fewer and you've gained nothing using "register-based" programming.

Stack based VMs are easier to generate code for.
Register based VMs are easier to create fast implementations for, and easier to generate highly optimized code for.
For your first attempt, I recommend starting with a stack based VM.

Related

CUDA profiling information on part of a code [duplicate]

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback ?
The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Running it on a CUDA processor, and doing the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in computer architecture course that, data hazard can be prevented by using several arbitrary, independent nop instructions in between two mutually dependent instructions. This can be done at assembly level in compiler design.
The alternative way to avoid data hazard is to use data forwarding.
I am bit confused, How these two alternatives differ as far as performance, speed and hardware is concerned. Because as per my knowledge data forwarding is to be implemented at hardware level, whereas nop can be implemented at assembly level.
Anybody please explain me which approach is better if we consider factors such as performance, speed, hardware etc?
Thanks.
Obviously, having the compiler insert nops into the code stream to fill pipeline slots allows hardware to be simplified which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding. Higher latency for dependent instructions is very bad for typical programs.
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because such did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows the avoiding of nops; by compiler scheduling with instruction-level parallelism or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
Well, basically since hardware is optimised with feed forwarding, there has to be no use of explicitly declared software NOPs. But that's not the case.
Though, feed forwarding proves helpful in reducing data hazards, but some hazards cannot be dealt with feed forwarding. It just isn't possible.
Eg.
beq R1,R5,label
instruction 2nd
Here the instruction 2nd will not be fetched until instruction 1 has completed its execution stage and decided whether or not to branch. Until then the 2nd instruction has to be stalled. (stalled for 2 memory cycles). This is done by software by sending out NOPs.
With improvements in technology and hardware optimizations, the beq instruction can complete its execution stage in its register fetch/decode stage by inserting a comparator in the fetch stage itself. Even so, the 2nd instruction will be stalled for(1 memory cycle now). Again NOP is needed.

Simple Compute-Intensive CUDA Program

I'm preparing an acceptance test for a new machine with Nvidia graphics cards and I'd like a simple CUDA program that will fully exercise the GPU for a full day. The intent is to generate large amounts of heat and ensure the new machine is stable under the load. I'd like the code to be very easy to compile and run (no dependencies, no large input data sets), and also very easy to verify (small amounts of output). Also, I'd like it to be command-line only, no GUI (the test will have to be automated).
I was originally thinking of repeatedly running Vector Dot Products of large vectors. However, that's mostly memory-intensive. So if the GPUs are constantly waiting on memory accesses, then they probably aren't generating as much heat as they could.
I'm running on a CentOS Linux machine.
Does anyone have any suggestions?
You didn't mention which OS you are on.
Ideally, you would want to stress the floating point units, the logic/integer units, the GPU memory, the GPU voltage regulators (VRMs) and the main PSU. I don't think there is any single utility out there that does that.
Memory:
http://sourceforge.net/projects/cudagpumemtest/
Integer (?):
http://sourceforge.net/projects/cudalucas/
PSU and VRMs (In the past, this program could cause GPUs to run out-of-spec, breaking the card. I don't think that's the case anymore):
http://www.ozone3d.net/benchmarks/fur/

What advantages are there to programming for a non-cache-coherent multi-core machine?

What advantages are there to programming for a non-cache-coherent multi-core machine? Cache_coherence has many benefits, but how would one take advantage of the opposite of this feature - an independent cache for each individual core. What programming paradigm and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
You don't as such take advantage of cache non-coherence. You can't write code which relies on different cores having different views of memory, because a non-coherent cache doesn't guarantee to show different memory to different cores. It just reserves the right to do that.
Cache coherence costs circuits and time. Non-coherent caches are therefore cheaper (and cooler, perhaps?) and faster. Memory access might be faster in cycles, or might be the same best-case speed but with fewer stalls due to cache synchronisation and especially false sharing.
So it's not so much extra things you do to take advantage of non-coherence, it's the things that you don't have to do because you've dropped the disadvantages of coherence - you don't have to redesign your parallel code because it's spending all its time sitting around waiting for the result of a memory store from another core.
The downside on a non-coherent cache architecture at first appears to be that find yourself using additional synchronisation that's provided automatically by coherent caches. No double-checked locking for you. Then you realise that in effect, the coherent-cache architectures do this synchronisation (albeit in a super-fast hardware-implemented form) for every single memory access, and block if the cache line is dirty, whether you need it to or not. That cheers me right up :-)
What programming paradigm
Message passing.
and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
Pattern matching - the input block of memory could very well be "read-only": the "output" result can very well be placed in separate blocks waiting for a "reducer" of some sort.
Of course, this is just an example amongst many I am sure.
Just to make things clear: the principal reasons for going with "non-cache-coherent" architecture are cost & speed (assuming the problems at hand are more efficiently tackled using this architecture).
You can get a bit of extra performance, but you shoul never rely on each processor having different cache values, as you can never know when the cache is flushed.
I'm not an expert; but I don't think it has any advantage over a cache coherent architecture, besides from being simpler to implement. Of course, such simplicity can allow other optimizations that could be prohibitive in a more complex coherent system, making the non-coherent machine faster when carefully programmed.
said that, i concur with jldupont, message passing doesn't need coherency, so it's (almost) the mandatory way to do IPC.
You could think of the Cell SPE local memory as a sort of cache. It isn't cache really since it isn't automatic at all, but the speed is the same and it isn't coherent.
It has big speed advantages because the hardware does not need to spend any time synchronizing the cache line states between cores.
In a Cell, the programmer must do the synchronization manually by writing code to copy SPE local memory back and forth. So a disadvantage is much greater program complexity.

What is Turing Complete?

What does the expression "Turing Complete" mean?
Can you give a simple explanation, without going into too many theoretical details?
Here's the briefest explanation:
A Turing Complete system means a system in which a program can be written that will find an answer (although with no guarantees regarding runtime or memory).
So, if somebody says "my new thing is Turing Complete" that means in principle (although often not in practice) it could be used to solve any computation problem.
Sometimes it's a joke... a guy wrote a Turing Machine simulator in vi, so it's possible to say that vi is the only computational engine ever needed in the world.
Here is the simplest explanation
Alan Turing created a machine that can take a program, run that program, and show some result. But then he had to create different machines for different programs. So he created "Universal Turing Machine" that can take ANY program and run it.
Programming languages are similar to those machines (although virtual). They take programs and run them. Now, a programing language is called "Turing complete", if it can run any program (irrespective of the language) that a Turing machine can run given enough time and memory.
For example: Let's say there is a program that takes 10 numbers and adds them. A Turing machine can easily run this program. But now imagine that for some reason your programming language can't perform the same addition. This would make it "Turing incomplete" (so to speak). On the other hand, if it can run any program that the universal Turing machine can run, then it's Turing complete.
Most modern programming languages (e.g. Java, JavaScript, Perl, etc.) are all Turing complete because they each implement all the features required to run programs like addition, multiplication, if-else condition, return statements, ways to store/retrieve/erase data and so on.
Update: You can learn more on my blog post: "JavaScript Is Turing Complete" — Explained
Informal Definition
A Turing complete language is one that can perform any computation. The Church-Turing Thesis states that any performable computation can be done by a Turing machine. A Turing machine is a machine with infinite random access memory and a finite 'program' that dictates when it should read, write, and move across that memory, when it should terminate with a certain result, and what it should do next. The input to a Turing machine is put in its memory before it starts.
Things that can make a language NOT Turing complete
A Turing machine can make decisions based on what it sees in memory - The 'language' that only supports +, -, *, and / on integers is not Turing complete because it can't make a choice based on its input, but a Turing machine can.
A Turing machine can run forever - If we took Java, Javascript, or Python and removed the ability to do any sort of loop, GOTO, or function call, it wouldn't be Turing complete because it can't perform an arbitrary computation that never finishes. Coq is a theorem prover that can't express programs that don't terminate, so it's not Turing complete.
A Turing machine can use infinite memory - A language that was exactly like Java but would terminate once it used more than 4 Gigabytes of memory wouldn't be Turing complete, because a Turing machine can use infinite memory. This is why we can't actually build a Turing machine, but Java is still a Turing complete language because the Java language has no restriction preventing it from using infinite memory. This is one reason regular expressions aren't Turing complete.
A Turing machine has random access memory - A language that only lets you work with memory through push and pop operations to a stack wouldn't be Turing complete. If I have a 'language' that reads a string once and can only use memory by pushing and popping from a stack, it can tell me whether every ( in the string has its own ) later on by pushing when it sees ( and popping when it sees ). However, it can't tell me if every ( has its own ) later on and every [ has its own ] later on (note that ([)] meets this criteria but ([]] does not). A Turing machine can use its random access memory to track ()'s and []'s separately, but this language with only a stack cannot.
A Turing machine can simulate any other Turing machine - A Turing machine, when given an appropriate 'program', can take another Turing machine's 'program' and simulate it on arbitrary input. If you had a language that was forbidden from implementing a Python interpreter, it wouldn't be Turing complete.
Examples of Turing complete languages
If your language has infinite random access memory, conditional execution, and some form of repeated execution, it's probably Turing complete. There are more exotic systems that can still achieve everything a Turing machine can, which makes them Turing complete too:
Untyped lambda calculus
Conway's game of life
C++ Templates
Prolog
From wikipedia:
Turing completeness, named after Alan
Turing, is significant in that every
plausible design for a computing
device so far advanced can be emulated
by a universal Turing machine — an
observation that has become known as
the Church-Turing thesis. Thus, a
machine that can act as a universal
Turing machine can, in principle,
perform any calculation that any other
programmable computer is capable of.
However, this has nothing to do with
the effort required to write a program
for the machine, the time it may take
for the machine to perform the
calculation, or any abilities the
machine may possess that are unrelated
to computation.
While truly Turing-complete machines
are very likely physically impossible,
as they require unlimited storage,
Turing completeness is often loosely
attributed to physical machines or
programming languages that would be
universal if they had unlimited
storage. All modern computers are
Turing-complete in this sense.
I don't know how you can be more non-technical than that except by saying "turing complete means 'able to answer computable problem given enough time and space'".
Fundamentally, Turing-completeness is one concise requirement, unbounded recursion.
Not even bounded by memory.
I thought of this independently, but here is some discussion of the assertion. My definition of LSP provides more context.
The other answers here don't directly define the fundamental essence of Turing-completeness.
Turing Complete means that it is at least as powerful as a Turing Machine. This means anything that can be computed by a Turing Machine can be computed by a Turing Complete system.
No one has yet found a system more powerful than a Turing Machine. So, for the time being, saying a system is Turing Complete is the same as saying the system is as powerful as any known computing system (see Church-Turing Thesis).
In the simplest terms, a Turing-complete system can solve any possible computational problem.
One of the key requirements is the scratchpad size be unbounded and that is possible to rewind to access prior writes to the scratchpad.
Thus in practice no system is Turing-complete.
Rather some systems approximate Turing-completeness by modeling unbounded memory and performing any possible computation that can fit within the system's memory.
Super-brief summary from what Professor Brasilford explains in this video.
Turing Complete ≅ do anything that a Turing Machine can do.
It has conditional branching (i.e. "if statement"). Also, implies "go to" and thus permitting loop.
It gets arbitrary amount of memory (e.g. long enough tape) that the program needs.
I think the importance of the concept "Turing Complete" is in the the ability to identify a computing machine (not necessarily a mechanical/electrical "computer") that can have its processes be deconstructed into "simple" instructions, composed of simpler and simpler instructions, that a Universal machine could interpret and then execute.
I highly recommend The Annotated Turing
#Mark i think what you are explaining is a mix between the description of the Universal Turing Machine and Turing Complete.
Something that is Turing Complete, in a practical sense, would be a machine/process/computation able to be written and represented as a program, to be executed by a Universal Machine (a desktop computer). Though it doesn't take consideration for time or storage, as mentioned by others.
A Turing Machine requires that any program
can perform condition testing. That is fundamental.
Consider a player piano roll. The player piano can
play a highly complicated piece of music,
but there is never any conditional logic in the
music. It is not Turing Complete.
Conditional logic is both the power and the
danger of a machine that is Turing Complete.
The piano roll is guaranteed to halt every time.
There is no such guarantee for a TM. This
is called the “halting problem.”
In practical language terms familiar to most programmers, the usual way to detect Turing completeness is if the language allows or allows the simulation of nested unbounded while statements (as opposed to Pascal-style for statements, with fixed upper bounds).
We call a language Turing-complete if and only if (1) it is decidable by a Turing machine but (2) not by anything less capable than a Turing machine. For instance, the language of palindromes over the alphabet {a, b} is decidable by Turing machines, but also by pushdown automata; so, this language is not Turing-complete. Truly Turing-complete languages - ones that require the full computing power of Turing machines - are pretty rare. Perhaps the language of strings x.y.z where x is a number, y is a Turing-machine and z is an initial tape configuration, and y halts on z in fewer than x! steps - perhaps that qualifies (though it would need to be shown!)
A common imprecise usage confuses Turing-completeness with Turing-equivalence. Turing-equivalence refers to the property of a computational system which can simulate, and which can be simulated by, Turing machines. We might say Java is a Turing-equivalent programming language, for instance, because you can write a Turing-machine simulator in Java, and because you could define a Turing machine that simulates execution of Java programs. According to the Church-Turing thesis, Turing machines can perform any effective computation, so Turing-equivalence means a system is as capable as possible (if the Church-Turing thesis is true!)
Turing equivalence is a much more mainstream concern that true Turing completeness; this and the fact that "complete" is shorter than "equivalent" may explain why "Turing-complete" is so often misused to mean Turing-equivalent, but I digress.
As Waylon Flinn said:
Turing Complete means that it is at least as powerful as a Turing Machine.
I believe this is incorrect, a system is Turing complete if it's exactly as powerful as the Turing Machine, i.e. every computation done by the machine can be done by the system, but also every computation done by the system can be done by the Turing machine.
Can a relational database input latitudes and longitudes of places and roads, and compute the shortest path between them - no. This is one problem that shows SQL is not Turing complete.
But C++ can do it, and can do any problem. Thus it is.