Can every DFA be simulated by a PDA? - proof

Given a Deterministic Finite Automata (DFA) M_1, does there always exist a Pushdown Automata (PDA) M_2 that accepts the same language as M_1? I.e. can any DFA be simulated by a PDA? Intuitively, it makes sense to me that a PDA is more powerful since it has an arbitrary amount of memory and can therefore accept more languages than a DFA, but how could this be formally proven?

A finite automaton is nothing but a Turing machine that does not use its memory. A PDA is nothing but a Turing machine that restricts its access to its memory to LIFO.
So, a finite automaton is a PDA.

Related

Do CUDA cores have vector instructions?

According to most NVidia documentation CUDA cores are scalar processors and should only execute scalar operations, that will get vectorized to 32-component SIMT warps.
But OpenCL has vector types like for example uchar8.It has the same size as ulong (64 bit), which can be processed by a single scalar core. If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
If there are 1024 work items in a block (work group), and each work items processes a uchar8, will this effectively process 8120 uchar in parallel?
Edit:
My question was if on CUDA architectures specifically (independently of OpenCL), there are some vector instructions available in "scalar" cores. Because if the core is already capable of handling a 32-bit type, it would be reasonable if it can also handle addition of a 32-bit uchar4 for example, especially since vector operations are often used in computer graphics.
CUDA has "built-in" (i.e. predefined) vector types up to a size of 4 for 4-byte quantities (e.g. int4) and up to a size of 2 for 8-byte quantities (e.g. double2). A CUDA thread has a maximum read/write transaction size of 16 bytes, so these particular size choices tend to line up with that maximum.
These are exposed as typical structures, so you can reference for example .x to access just the first element of a vector type.
Unlike OpenCL, CUDA does not provide built-in operations ("overloads") for basic arithmetic e.g. +, -, etc. for element-wise operations on these vector types. There's no particular reason you couldn't provide such overloads yourself. Likewise, if you wanted a uchar8 you could easily provide a structure definition for such, as well as any desired operator overloads. These could probably be implemented just as you would expect for ordinary C++ code.
Probably an underlying question is, then, what is the difference in implementation between CUDA and OpenCL in this regard? If I operate on a uchar8, e.g.
uchar8 v1 = {...};
uchar8 v2 = {...};
uchar8 r = v1 + v2;
what will the difference be in terms of machine performance (or low-level code generation) between OpenCL and CUDA?
Probably not much, for a CUDA-capable GPU. A CUDA core (i.e. the underlying ALU) does not have direct native support for such an operation on a uchar8, and furthermore, if you write your own C++ compliant overload, you're probably going to use C++ semantics for this which will inherently be serial:
r.x = v1.x + v2.x;
r.y = v1.y + v2.y;
...
So this will decompose into a sequence of operations performed on the CUDA core (or in the appropriate integer unit within the CUDA SM). Since the NVIDIA GPU hardware doesn't provide any direct support for an 8-way uchar add within a single core/clock/instruction, there's really no way OpenCL (as implemented on a NVIDIA GPU) could be much different. At a low level, the underlying machine code is going to be a sequence of operations, not a single instruction.
As an aside, CUDA (or PTX, or CUDA intrinsics) does provide for a limited amount of vector operations within a single core/thread/instruction. Some examples of this are:
a limited set of "native" "video" SIMD instructions. These instructions are per-thread, so if used, they allow for "native" support of up to 4x32 = 128 (8-bit) operands per warp, although the operands must be properly packed into 32-bit registers. You can access these from C++ directly via a set of built-in intrinsics. (A CUDA warp is a set of 32 threads, and is the fundamental unit of lockstep parallel execution and scheduling on a CUDA capable GPU.)
a vector (SIMD) multiply-accumulate operation, which is not directly translatable to a single particular elementwise operation overload, the so-called int8 dp2a and dp4a instructions. int8 here is somewhat misleading. It does not refer to an int8 vector type but rather a packed arrangement of 4 8-bit integer quantities in a single 32-bit word/register. Again, these are accessible via intrinsics.
16-bit floating point is natively supported via half2 vector type in cc 5.3 and higher GPUs, for certain operations.
The new Volta tensorCore is something vaguely like a SIMD-per-thread operation, but it operates (warp-wide) on a set of 16x16 input matrices producing a 16x16 matrix result.
Even with a smart OpenCL compiler that could map certain vector operations into the various operations "natively" supported by the hardware, it would not be complete coverage. There is no operational support for an 8-wide vector (e.g. uchar8) on a single core/thread, in a single instruction, to pick one example. So some serialization would be necessary. In practice, I don't think the OpenCL compiler from NVIDIA is that smart, so my expectation is that you would find such per-thread vector operations fully serialized, if you studied the machine code.
In CUDA, you could provide your own overload for certain operations and vector types, that could be represented approximately in a single instruction. For example a uchar4 add could be performed "natively" with the __vadd4() intrinsic (perhaps included in your implementation of an operator overload.) Likewise, if you are writing your own operator overload, I don't think it would be difficult to perform a uchar8 elementwise vector add using two __vadd4() instructions.
If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
AFAIK it'll always be on a single core (instructions from a single kernel / workitem don't cross cores, except special instructions like barriers), but it may be more than one instruction. This depends on whether your hardware support operations on uchar8 natively. If it does not, then uchar8 will be broken up to as many pieces as required, and each piece will be processed with a separate instruction.
OpenCL is very "generic" in the sense that it supports many different vector type/size combos, but real-world hardware usually only implements some vector type/size combinations. You can query OpenCL devices for "preferred vector size" which should tell you what's the most efficient for that hardware.

Halting really undecidable for computers with limited memory?

Turing proved that the halting problem is undecidable over Turing machines. However, real computers are not actually Turing-complete: They would be, if they had an infinite amount of memory.
Given the fact that computers have a finite amount of memory, hence are not quite Turning-complete, does the halting problem become decidable? My intuition tells me that yes, but the program that solves this restricted halting problem might have a time and space complexity exponential to the size of the memory of the targeted computer.
The halting problem can be solved for a Turing machine with finite tape by a Turing machine with infinite tape. All the infinite Turing machine has to do is enumerate every possible state of the finite Turing machine (and there will be a finite, though very large, number of possible states) and mark which states have been visited by the Turing machine in the course of running a program. Eventually, one of two things will happen:
The finite Turing machine will halt.
The finite Turing machine will revisit a state. If it revisits a state, then you know there is an infinite loop, since the machine is deterministic, and the next state is therefore determined entirely by the previous state. If there are n states, the machine is guaranteed to revisit one of them on the n+1th step.

about floating point operation

Recently, I have been making program (FDTD Operation) using the CUDA
development environment, OS is Windows server 2008 , Graphic card is TeslaC2070, compiler is VS2010. This program calculates using single and double precision floating-point.
I was reading the CUDA programming guide 3.2 and 4.0 . In appendix, guide tell me sin(), cos() has maximum accuracy of 2 ULP. My original CPU program produces results which are different to the CUDA Version.
I want to make results correctly same. Is it possible?
To quote Goldberg (a paper that every Computer Scientist, Computational Scientist, and possibly even every scientist who programs, should read):
Due to roundoff errors, the associative laws of algebra do not
necessarily hold for floating-point numbers.
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
Parallelism, by definition, results in different ordering of operations relative to serial arithmetic. "Embarrasingly parallel" computations, that is, computations where each output element is computed independently from all others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such stencils (as in FDTD), do experience this effect.
In practice, even using a different compiler (and even different compiler options) can change the result of floating point computation, even when compiling the same code, with or without parallelism.

Terminology for a "complete" programming language?

The full definition of "Turing Completeness" requires infinite memory.
Is there a better term than Turing Complete for a programming language and implementation that seems useably complete, except for being limited by finite (say 100 word or 16-bit or 32-bit, etc.) address space?
I guess, you can bring the limited memory into the definition. Something like
A programming language for a given architecture (!) is Limitedly Turing Complete, if for every Turing Machine there exists a program that either
a) simulates the Turing Machine and returns the same result (iff the Turing Machine returns) or
b) at some point uses at least one available limited resource (e.g. memory) completely and returns an arbitrary result.
The question is, whether this intuitive definition really helps, or if it is better to assume that your architecture has unlimited memory (even though it is actually finite). Note that you don't even have to try hard in order to satisfy Limited Turing Completeness (as defined above), if you simply go into an infinite loop that mallocs one byte each time, you found your program for all Turing Machines.
The problem seems to be that you cannot pin implementation specific properties. For instance, if you have 500K ram, you may be able to express a program that computes 1+1 but maybe you're not, who knows.
I'd argue that languages like Haskell and Brainfuck (yes, I'm serious) are actually Turing Complete because they abstract resources away. While languages like C++ are only Limitedly Turing Complete, because at some point the address-space of pointers is exhausted and it is not possible to address any more data (e.g. sort a list of 2^2^2^2^100 items).
You could say that an implementation requires infinite memory to be truly turing complete, but languages themselves have no concept of a memory limit. You can make a compiler for a million-bit machine or a 4-bit machine without changing the language.

What is Turing Complete?

What does the expression "Turing Complete" mean?
Can you give a simple explanation, without going into too many theoretical details?
Here's the briefest explanation:
A Turing Complete system means a system in which a program can be written that will find an answer (although with no guarantees regarding runtime or memory).
So, if somebody says "my new thing is Turing Complete" that means in principle (although often not in practice) it could be used to solve any computation problem.
Sometimes it's a joke... a guy wrote a Turing Machine simulator in vi, so it's possible to say that vi is the only computational engine ever needed in the world.
Here is the simplest explanation
Alan Turing created a machine that can take a program, run that program, and show some result. But then he had to create different machines for different programs. So he created "Universal Turing Machine" that can take ANY program and run it.
Programming languages are similar to those machines (although virtual). They take programs and run them. Now, a programing language is called "Turing complete", if it can run any program (irrespective of the language) that a Turing machine can run given enough time and memory.
For example: Let's say there is a program that takes 10 numbers and adds them. A Turing machine can easily run this program. But now imagine that for some reason your programming language can't perform the same addition. This would make it "Turing incomplete" (so to speak). On the other hand, if it can run any program that the universal Turing machine can run, then it's Turing complete.
Most modern programming languages (e.g. Java, JavaScript, Perl, etc.) are all Turing complete because they each implement all the features required to run programs like addition, multiplication, if-else condition, return statements, ways to store/retrieve/erase data and so on.
Update: You can learn more on my blog post: "JavaScript Is Turing Complete" — Explained
Informal Definition
A Turing complete language is one that can perform any computation. The Church-Turing Thesis states that any performable computation can be done by a Turing machine. A Turing machine is a machine with infinite random access memory and a finite 'program' that dictates when it should read, write, and move across that memory, when it should terminate with a certain result, and what it should do next. The input to a Turing machine is put in its memory before it starts.
Things that can make a language NOT Turing complete
A Turing machine can make decisions based on what it sees in memory - The 'language' that only supports +, -, *, and / on integers is not Turing complete because it can't make a choice based on its input, but a Turing machine can.
A Turing machine can run forever - If we took Java, Javascript, or Python and removed the ability to do any sort of loop, GOTO, or function call, it wouldn't be Turing complete because it can't perform an arbitrary computation that never finishes. Coq is a theorem prover that can't express programs that don't terminate, so it's not Turing complete.
A Turing machine can use infinite memory - A language that was exactly like Java but would terminate once it used more than 4 Gigabytes of memory wouldn't be Turing complete, because a Turing machine can use infinite memory. This is why we can't actually build a Turing machine, but Java is still a Turing complete language because the Java language has no restriction preventing it from using infinite memory. This is one reason regular expressions aren't Turing complete.
A Turing machine has random access memory - A language that only lets you work with memory through push and pop operations to a stack wouldn't be Turing complete. If I have a 'language' that reads a string once and can only use memory by pushing and popping from a stack, it can tell me whether every ( in the string has its own ) later on by pushing when it sees ( and popping when it sees ). However, it can't tell me if every ( has its own ) later on and every [ has its own ] later on (note that ([)] meets this criteria but ([]] does not). A Turing machine can use its random access memory to track ()'s and []'s separately, but this language with only a stack cannot.
A Turing machine can simulate any other Turing machine - A Turing machine, when given an appropriate 'program', can take another Turing machine's 'program' and simulate it on arbitrary input. If you had a language that was forbidden from implementing a Python interpreter, it wouldn't be Turing complete.
Examples of Turing complete languages
If your language has infinite random access memory, conditional execution, and some form of repeated execution, it's probably Turing complete. There are more exotic systems that can still achieve everything a Turing machine can, which makes them Turing complete too:
Untyped lambda calculus
Conway's game of life
C++ Templates
Prolog
From wikipedia:
Turing completeness, named after Alan
Turing, is significant in that every
plausible design for a computing
device so far advanced can be emulated
by a universal Turing machine — an
observation that has become known as
the Church-Turing thesis. Thus, a
machine that can act as a universal
Turing machine can, in principle,
perform any calculation that any other
programmable computer is capable of.
However, this has nothing to do with
the effort required to write a program
for the machine, the time it may take
for the machine to perform the
calculation, or any abilities the
machine may possess that are unrelated
to computation.
While truly Turing-complete machines
are very likely physically impossible,
as they require unlimited storage,
Turing completeness is often loosely
attributed to physical machines or
programming languages that would be
universal if they had unlimited
storage. All modern computers are
Turing-complete in this sense.
I don't know how you can be more non-technical than that except by saying "turing complete means 'able to answer computable problem given enough time and space'".
Fundamentally, Turing-completeness is one concise requirement, unbounded recursion.
Not even bounded by memory.
I thought of this independently, but here is some discussion of the assertion. My definition of LSP provides more context.
The other answers here don't directly define the fundamental essence of Turing-completeness.
Turing Complete means that it is at least as powerful as a Turing Machine. This means anything that can be computed by a Turing Machine can be computed by a Turing Complete system.
No one has yet found a system more powerful than a Turing Machine. So, for the time being, saying a system is Turing Complete is the same as saying the system is as powerful as any known computing system (see Church-Turing Thesis).
In the simplest terms, a Turing-complete system can solve any possible computational problem.
One of the key requirements is the scratchpad size be unbounded and that is possible to rewind to access prior writes to the scratchpad.
Thus in practice no system is Turing-complete.
Rather some systems approximate Turing-completeness by modeling unbounded memory and performing any possible computation that can fit within the system's memory.
Super-brief summary from what Professor Brasilford explains in this video.
Turing Complete ≅ do anything that a Turing Machine can do.
It has conditional branching (i.e. "if statement"). Also, implies "go to" and thus permitting loop.
It gets arbitrary amount of memory (e.g. long enough tape) that the program needs.
I think the importance of the concept "Turing Complete" is in the the ability to identify a computing machine (not necessarily a mechanical/electrical "computer") that can have its processes be deconstructed into "simple" instructions, composed of simpler and simpler instructions, that a Universal machine could interpret and then execute.
I highly recommend The Annotated Turing
#Mark i think what you are explaining is a mix between the description of the Universal Turing Machine and Turing Complete.
Something that is Turing Complete, in a practical sense, would be a machine/process/computation able to be written and represented as a program, to be executed by a Universal Machine (a desktop computer). Though it doesn't take consideration for time or storage, as mentioned by others.
A Turing Machine requires that any program
can perform condition testing. That is fundamental.
Consider a player piano roll. The player piano can
play a highly complicated piece of music,
but there is never any conditional logic in the
music. It is not Turing Complete.
Conditional logic is both the power and the
danger of a machine that is Turing Complete.
The piano roll is guaranteed to halt every time.
There is no such guarantee for a TM. This
is called the “halting problem.”
In practical language terms familiar to most programmers, the usual way to detect Turing completeness is if the language allows or allows the simulation of nested unbounded while statements (as opposed to Pascal-style for statements, with fixed upper bounds).
We call a language Turing-complete if and only if (1) it is decidable by a Turing machine but (2) not by anything less capable than a Turing machine. For instance, the language of palindromes over the alphabet {a, b} is decidable by Turing machines, but also by pushdown automata; so, this language is not Turing-complete. Truly Turing-complete languages - ones that require the full computing power of Turing machines - are pretty rare. Perhaps the language of strings x.y.z where x is a number, y is a Turing-machine and z is an initial tape configuration, and y halts on z in fewer than x! steps - perhaps that qualifies (though it would need to be shown!)
A common imprecise usage confuses Turing-completeness with Turing-equivalence. Turing-equivalence refers to the property of a computational system which can simulate, and which can be simulated by, Turing machines. We might say Java is a Turing-equivalent programming language, for instance, because you can write a Turing-machine simulator in Java, and because you could define a Turing machine that simulates execution of Java programs. According to the Church-Turing thesis, Turing machines can perform any effective computation, so Turing-equivalence means a system is as capable as possible (if the Church-Turing thesis is true!)
Turing equivalence is a much more mainstream concern that true Turing completeness; this and the fact that "complete" is shorter than "equivalent" may explain why "Turing-complete" is so often misused to mean Turing-equivalent, but I digress.
As Waylon Flinn said:
Turing Complete means that it is at least as powerful as a Turing Machine.
I believe this is incorrect, a system is Turing complete if it's exactly as powerful as the Turing Machine, i.e. every computation done by the machine can be done by the system, but also every computation done by the system can be done by the Turing machine.
Can a relational database input latitudes and longitudes of places and roads, and compute the shortest path between them - no. This is one problem that shows SQL is not Turing complete.
But C++ can do it, and can do any problem. Thus it is.