Why compilers don't translate in simpler languages? - language-agnostic

Usually compilers translate from the language they support to assembly. Or at most to an assembly-like language (bytecode), like GIMPLE/GENERIC for GCC or Python/Java/.NET bytecode.
Wouldn't it be simpler for a compiler translate to a simpler language, which already implement a big subset of their grammar?
For example an Objective-C compiler, which is 100% compatible with C, could add the semantics only for the syntax it extends to C's, translating it into C. I can see many advantages of doing this; one could use this Objective-C compiler to translate its code into C in order to compile the generated C code with a different compiler that doesn't support C++ (but that optimizes more, or that compiles quicker, or able to compile for more architectures). Or one would be able to use the generated C code in a project where only C is allowed.
I guess/hope that if things were working like this, it would have been a lot easier to write extensions for current languages (eg: adding to C++ keywords to ease the implementation of common patterns, or, still in C++, removing the declare before use rule by moving inline member functions to the end of header files)
What kind of penalties would there be? Generated code would be very difficult to be understood by humans? Compilers wouldn't be able to optimize as much as they can now? What else?

This is actually used by a lot of languages, through the use of Intermediate languages. The biggest example for this would be Pascal, which had the Pascal-P system: Pascal was compiled into a hypothetical assembly language. To port pascal would only mean making a compiler for this assembly language: a task a lot simpler than porting the entire pascal compiler. After writing this compiler, you'd only need to compile the (machine-independent) pascal compiler that was written in this.
Bootstrapping is also used quite often in programming language design. Many languages have their compilers written in the same language(Haskell comes to mind here). By doing this, writing a new functionality for the language simply means translating that idea into the current language, putting it into the compiler, then recompiling.
I don't think the problem with this method is really the readability of generated code(I don't sift through assembly byte-code generated through compilers, personally), but one of optimization. Many ideas in higher-level programming languages( weak-typing comes to mind) are hard to automatically translate into lower-level system languages such as C. There's a reason why GCC tends to do its optimization before code generation.
But for the most part, compilers do translate into simpler languages except for maybe the most basic of system languages.

Incidentally, as a counterexample, Tcl is one language that is known to be very-very hard (if not totally impossible) to translate to C. Over the last 20 years there have been a couple of projects that tried this, even one promise of a commercial product but none have materialized.
In part it is because Tcl is a very dynamic language (as any language with an eval function is). In part it is because the only way to know if something is code or data is to run the program.

Since Objective-C is a strict superset of C and C++ contains a very large amount that is a lot like C, to parse either you effectively already need to be able to parse C. In which case, outputting to machine code and outputting to more C code aren't substantially different in processing cost, the main cost to the user being that compiling now takes as long as it originally did plus the amount of time a second compiler takes.
Any attempt to copy and paste the stuff that looks like C and translate the rest around it would be prone to problems. Firstly, C++ isn't a strict superset of C so things that look like C don't necessarily compile exactly the same anyway (especially versus C99). And even if they did, supposing a user made an error in their C stuff, compilers don't tend to provide error information in a machine readable format, so it'd be really hard for the Objective-C to C layer to give the user a meaningful error after receiving e.g. "error at line 99".
That said, many compiler suites, like GCC and even more so like the upcoming Clang + LLVM, use an intermediate form to decouple the bit that knows about the specifics of one architecture from the bit that knows the specifics of a particular language. However, it tends to be more of a data structure than something intentionally easy to express as a written language.
So: compilers don't work like this for purely practical reasons.

Haskell is actually compiled this way: the GHC compiler first translates the source code to an intermediary functional language (which is less rich than Haskell self), performs optimizations and then lowers the whole thing to C code which is then compiled by GCC. This solutions has problems tough, and projects were started to replace this backend.
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html

There is a compilers construction stack which is fully based on this idea. Any new language is implemented as a trivial translation into a lower level language or a combination of languages which are already defined within this stack.
http://www.meta-alternative.net/mbase.html
However, in order to be able to do so, you'd need at least some metaprogramming capabilities in every little language you add to a hierarchy. This requirement adds some severe limitations on languages semantics.

Related

How to translate from a language to another?

I don't want an automated solution.
When you have to translate a program from a language to another, what do you do? You prefer to rewrite it from the beginning or copy and paste it and change only what need to be changed?
What's the best choice?
It depends on
the goals (quick hack for one time use? long-lived production project for work?)
the resources I have (how many man-hours? Test suit and/or functional spec for old code? familiarity with both languages?)
most importantly, differences between the languages. Both the conceptual (OO? functional? reflection? control structures?) as well as available libraries.
Please note that this bullet is not as trivial as it seems - this depends in large part on how idiomatic the original program is - as an example, some people write very "C-like" Perl code (e.g. using C control flow and very C++ like OO design) which can be trivially copied to C or C++, and some people write incredibly intricate idiomatic Perl using functional programming, closures and reflective capabilties; which can't be obviously translated into C/C++.
Also, quality of the original code. E.g. good program will have separated business logic, usually expressed in standard configuration and control flow that's easier to directly clone.
E.g. translating from PHP to Perl for a hack job, you can often start out with copying, since many PHP constructs can be 1-to-1 mapped onto equivalent Perl constructs (just take your pick of Perl templating web library). The resulting code won't be GOOD Perl but will be Good Enough for some purposes.
On the other hand, translating, say, LISP code to Java, you're better off just translating the original code into a functionality specification and re-write from scratch. Your example of Python and JavaScript is probably in the same box.
Usually you have two languages that share at least some concepts (e.g. both have OO, and some imperative control structures) and thus you end up with some combination of the two approaches - parts of the code can be "thoughtlessly" translated, parts need to be re-written from scratch.
The more of the second (complete rewrite) approach, the better quality idiomatic and powerful code you end up with.
Normally, a rewrite using the original for inspiration and direction is the best choice.
The way you may do something in one language might very well not be the right way to do it in another.
When it comes to copy-paste, it is rare that you can just do that - languages are different and follow different syntax rules.
Of course, this all depends on the source and destination languages.
With your comment - javascript and python, I would say a rewrite is the best option.
That probably mostly depends on the similarity between the two languages and the meaning of "translation".
For instance, translating a bit of C89 code to C++ might not be so hard considering it should compile out of the box when you copy-paste (C++ is a compatible super-set to C89). I would hardly consider that a "translation", though.
On the other hand, translating Java to Haskell would certainly require a complete rewrite as the language paradygms, even worse than the syntax, are completely different.
Consider Wikipedia's list of programming languages. I'm too lazy to count how many of them are on this list, but let's assume there are 100.
If you want to translate one of them into another, that means that there are at least 100*99 = 9900 possible combinations for translation.
And that's an awful lot. Since most languages are unique, translation is very, very dependent on source and destination language.
Consider this Pascal to C converter. The author states it took him one and a half year to make a good translator for these particular languages. Obviously, this isn't a trivial task.
Depending on your ambitions, you might spend either one day, or many years translating a program from language A to language B.
How long this takes depends on your skill, size of your source code, complexity of languages A and B and their similarity.
As you can see, this isn't a trivial task and is highly dependent on your situation.

Are preprocessors obsolete in modern languages?

I'm making a simple compiler for a simple pet language I'm creating and coming from a C background(though I'm writing it in Ruby) I wondered if a preprocessor is necessary.
What do you think? Is a "dumb" preprocessor still necessary in modern languages? Would C#'s conditional compilation capabilities be considered a "preprocessor"? Does every modern language that doesn't include a preprocessor have the utilities necessary to properly replace it? (for instance, the C++ preprocessor is now mostly obsolete(though still depended upon) because of templates.)
C's preprocessing can do really neat things, but if you look at the things it's used for you realize it's often for just adding another level of abstraction.
Preprocessing for different operations on different platforms? It's basically a layer of abstraction for platform independence.
Preprocessing for easily adding complex code? Abstraction because the language isn't generic enough.
Preprocessing for adding extensions into your code? Abstraction because your code / your language isn't flexible enough.
So my answer is: you don't need a preprocessor if your language is high-level enough *. I wouldn't call preprocessing evil or useless, I just say that the more abstract the language gets, the less reason I can think for it needing preprocessing.
* What's high-level enough? That is, of course, entirely subjective.
EDIT: Of course, I'm only really referring to macros. Using preprocessors for interfacing with other code files or for defining constants is evil.
The preprocessor is a cheap method to provide incomplete metaprogramming facilities to a language in an ugly fashion.
Prefer true metaprogramming or Lisp-style macros instead.
A preprocesssor is not necessary. For real metaprogramming, you should have something like MetaML or Template Haskell or hygienic macros à la Scheme. For quick and dirty stuff, if your users absolutely must have it, there's always m4.
However, a modern language should support the equivalent of C's #line directives. Such directives enable the compiler to locate errors in the original source, even when that source is embedded in a parser generator or a lexer generator or a literate program. In other words,
Design your language so as not to need a preprocessor.
Don't bundle your language with a blessed preprocessor.
But if others have their own reasons for using a preprocessor (parser generation is a popular one), provide support for accurate error messages.
I think that preprocessors are a crutch to keep a language with poor expressive power walking.
I have seen so much abuse of preprocessors that I hate them with a passion.
A preprocessor is a separated phase of compilation.
While preprocessing can be useful in some cases, the headaches and bugs it can cause make it a problem.
In C, preprocessor is used mostly for:
Including data - While powerful, the most common use-cases do not need such power, and "import"/"using" stuff(like in Java/C#) is much cleaner to use, and few people need the remaining cases;
Defining constants - Why not just provide a "const" statement
Macros - While C-style macros are very powerful(they can include statements such as returns), they also harm readability. Generics/Templates are cleaner and, while less powerful in a few ways, they are easier to understand.
Conditional compilation - This is possibly the most legitimate use-case for preprocessors, but once again it's painful for readability. Separating platform-specific code in platform-specific source code and using common if statements ends up being better for readability.
So my answer is while powerful, the preprocessor harms readability and/or isn't the best way to deal with some problems. Newer languages tend to consider code maintenance very important, and for those reasons the preprocessor appears to be obsolete.
It's your language so you can build whatever capabilities you want into the language itself, without a need for a preprocessor. I don't think a preprocessor should be necessary, and it adds a layer of complexity and obscurity on top of a language. Most modern languages don't have preprocessors, and in C++ you only use it when you have no other choice.
By the way, I believe D handles conditional compilation without a preprocessor.
It depends on exactly what other features you offer. For example, if I have a const int N, do you offer for me to take N variables? Have N member variables, take an argument to construct all of them? Create N functions? Perform N operations that don't necessarily work in loops (for example, pass N arguments)? N template arguments? Conditional compilation? Constants that aren't integral?
The C preprocessor is so absurdly powerful in the proper hands, you'd need to make a seriously powerful language not to warrant one.
I would say that although you should avoid the pre-processor for most everything you normally do, it's still necessary.
For example, in C++, in order to write a unit-testing library like Catch, a pre-processor is absolutely necessary. They use it in two different ways: One for assertion expansion1, and one for nesting sections in test cases2.
But, the pre-processor shouldn't be abused to do compile-time computations in C++ where const-expressions and template meta-programming can be used.
Sorry, I don't have enough reputation to post more than two links, so I'm putting this here:
github.com/philsquared/Catch/blob/master/docs/assertions.md
github.com/philsquared/Catch/blob/master/docs/test-cases-and-sections.md
A others have pointed out, much of the functionality provided by the C preprocessor exists to compensate for limitations of the C language. For example, #include and inclusion guards exist due to the lack of an import statement, and macros largely exist due to the lack of inline functions and constant declarations.
However, the one feature of the C preprocessor that would still be beneficial in more modern languages is the #line directive, since this supports the use of semantically-rich preprocessors/compilers. An an example, consider yacc, which is a domain-specific-language (DSL) for writing a parser as a collection of BNF grammar rules. A central feature of yacc is that chunks of C code called actions can be embedded within BNF rules. When a BNF rule is used to parse a piece of an input file, an action embedded in that rule will be executed. The yacc compiler generates a C file that implements the BNF-based parser specified in the input file, and any actions that appeared in the input Yacc file are copied to the generated C file, but each action is surrounded by #line directives. This use of #line directives provides two important benefits.
First, if there is a syntax error in an action, then the error message generated by the C compiler can specify that the error occurred in, say, <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
Second, the location information provided by #line directives is copied into generated object code files created by the C compiler. So if you are using a debugger to investigate a program crash, if the bug that caused the crash originated from an action embedded in a Yacc input file, then the debugger will report the location of that buggy line of source code as being in <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
The designers of C# and Perl wisely provided a #line directive. Unfortunately, the designers of many other languages (Java being one that springs to mind) neglected to provide a #line directive. Because of this, Yacc-like parser generators for many languages are unable to communicate the source location of embedded actions to compilers (and, therefore, to debuggers).

What features of interpreted languages can a compiled one not have?

Interpreted languages are usually more high-level and therefore have features as dynamic typing (including creating new variables dynamically without declaration), the infamous eval and many many other features that make a programmer's life easier - but why can't compiled languages have these as well?
I don't mean languages like Java that run on a VM, but those that compile to binary like C(++).
I'm not going to make a list now but if you are going to ask which features I mean, please look into what PHP, Python, Ruby etc. have to offer.
Which common features of interpreted languages can't/don't/do exist in compiled languages? Why?
Whether source code is compiled - to native binaries, some kind of intermediate language (Java Bytecode/IL) - or interpreted is absolutely no trait of the language. It's just a question of the implementation.
You can actually have both compilers and interpreters for the same language like
Haskell: GHC <-> GHCI
C: gcc <-> ch
VB6: VS IDE <-> VB6 compiler
Certain language features like eval or dynamic typing may suggest a distinction between so called "dynamic languages" and static ones, but how this is run can never be the primary question.
Initially, one of the largest benefits of interpreted languages was debugging. That way you can get incredibly accurate and detailed information when looking for the reason a program isn't working. However, most compilers have become advanced enough that that is not too big of a deal any more.
The other main benefit (in my opinion anyway), is that with interpreted languages, you don't have to wait for eternity for your project to compile to test it out.
You couldn't plausibly do eval, for example, for reasons I'd have thought were pretty obvious: exactly how would you implement it? Make the runtime contain a full copy of the compiler? Every time you wanted to evaluate a string (keeping in mind that each time it could be different!) you'd save the string to a file, run the compiler on it to make a DLL/shared-lib, then load that DLL/shared-lib and call your code? You can't see why this might be a wee bit impractical? ;)
You can find this kind of thing in dynamic languages all over the place that you can't do with static code short of basically running an interpreter, in effect, behind the scenes.
Continuing on from Dario - I think you are really asking why a compiled program can't evaluate statements at runtime (e.g. eval). Here's some reasons I can think of:
The full compiler would have to be distributed with the program (or be part of the program)
For an eval function to have access to type information and symbols (such as variable names and function names) in the environment it was used the original program would have to be compiled with those symbols accessible (compiled languages usually remove these symbols at compile time).
Edit: As noted neither of these reasons make it impossible for a language/compiler to be able to evaluate code at runtime, but they are definitely things that need to be taken into consideration when developing a compiler or when designing a language.
Maybe the question is not about interpreted/compiled languages (compile is ambiguous anyway) but about languages that do/don't carry their own compiler around with them? For instance we've said C++ could do eval with a handy compiler floating around in the app, and reflection presumably is similar in some ways.

What language features are required in a programming language to make a compiler?

Programming languages seem to go through several stages. Firstly, someone dreams up a new language, Foo Language. The compiler/interpreter is written in another language, usually C or some other low level language. At some point, FooL matures and grows, and eventually someone, somewhere will write a compiler and/or interpreter for FooL in FooL itself.
My question is this: What is the minimal subset of language features such that someone could implement that language in itself?
Compiler can be written even using a Turing machine - a Universal Turing Machine is basically a compiler/interpreter of any Turing machine, so any Turing-complete language should be enough :)
In theory, surprisingly little. A computability theorist would say that all you need is mu-recursion or a Turing machine or the like.
However, from a practical point of view, you're not going to be very happy trying to implement a programming language in a Turing machine. I would say that, at a minimum, you would want to have all the usual control-flow constructs, the primitive datatypes, subroutines, as well as arrays and structs. That should be enough to let you implement that subset of the language in the language itself -- and you can then bootstrap yourself up from there.
One option is a read-eval-print loop. This can be used to build many higher-level constructs. I believe this is the path taken by LISP.
I am unsure about the beginnings of C, but I think it started with a few system calls to implement branching, loops, assignment and single-character I/O, and built from there.
Id assume a assembler would make the cut.
My question is this: What is the minimal subset of language features such that someone could implement that language in itself?
There is no requirement for the language to be useful for anything other than compiling itself? I present to you Useless, the language in which every text is a proper program and means "a program that takes any input and produces itself" (this is also known as Useless compiler).

runnable pseudocode?

I am attempting to determine prior art for the following idea:
1) user types in some code in a language called (insert_name_here);
2) user chooses a destination language from a list of well-known output candidates (javascript, ruby, perl, python);
3) the processor translates insert_name_here into runnable code in destination language;
4) the processor then runs the code using the relevant system call based on the chosen language
The reason this works is because there is a pre-established 1 to 1 mapping between all language constructs from insert_name_here to all supported destination languages.
(Disclaimer: This obviously does not produce "elegant" code that is well-tailored to the destination language. It simply does a rudimentary translation that is runnable. The purpose is to allow developers to get a quick-and-dirty implementation of algorithms in several different languages for those cases where they do not feel like re-inventing the wheel, but are required for whatever reason to work with a specific language on a specific project.)
Does this already exist?
The .NET CLR is designed such that C++.Net, C#.Net, and VB.Net all compile to the same machine language, and you can "decompile" that CLI back in to any one of those languages.
So yes, I would say it already exists though not exactly as you describe.
There are converters available for different languages. The problem you are going to have is dealing with libraries. While mapping between language statements might be easy, finding mappings between library functions will be very difficult.
I'm not really sure how useful that type of code generator would be. Why would you want to write something in one language and then immediately convert it to something else? I can see the rationale for 4th Gen languages that convert diagrams or models into code but I don't really see the point of your effort.
Yes, a program that transform a program from one representation to another does exist. It's called a "compiler".
And as to your question whether that is always possible: as long as your target language is at least as powerful as the source language, then it is possible. So, if your target language is Turing-complete, then it is always possible, because there can be no language that is more powerful than a Turing-complete language.
However, there does not need to be a dumb 1:1 mapping.
For example: the Microsoft Volta compiler which compiles CIL bytecode to JavaScript sourcecode has a problem: .NET has threads, JavaScript doesn't. But you can implement threads with continuations. Well, JavaScript doesn't have continuations either, but you can implement continuations with exceptions. So, Volta transforms the CIL to CPS and then implements CPS with exceptions. (Newer versions of JavaScript have semi-coroutines in the form of generators; those could also be used, but Volta is intended to work across a wide range of JavaScript versions, including obviously JScript in Internet Explorer.)
This seems a little bizarre. If you're using the term "prior art" in its most common form, you're discussing a potentially patentable idea. If that is the case, you have:
1/ Published the idea, starting the clock running on patent filing - I'm assuming, perhaps incorrectly, that you're based in the U.S. Other jurisdictions may have other rules.
2/ Told the entire planet your idea, which means it's pretty much useless to try and patent it, unless you act very fast.
If you're not thinking about patenting this and were just using the term "prior art" in a laypersons sense, I apologize. I work for a company that takes patents very seriously and it's drilled into us, in great detail, what we're allowed to do with information before filing.
Having said that, patentable ideas must be novel, useful and non-obvious. I would think that your idea would not pass on the third of these since you're describing a language translator which would have the prior art of the many pascal-to-c and fortran-to-c converters out there.
The one glimmer of hope would be the ability of your idea to generate one of multiple output languages (which p2c and f2c don't do) but I think even that would be covered by the likes of cross compilers (such as gcc) which turn source into one of many different object languages.
IBM has a product called Visual Age Generator in which you code in one (proprietary) language and it's converted into COBOL/C/Java/others to run on different target platforms from PCs to the big honkin' System z mainframes, so there's your first problem (thinking about patenting an idea that IBM, the biggest patenter in the world, is already using).
Tons of them. p2c, f2c, and the original implementation s of C++ and Objective C strike me immediately. Beyond that, it's kind of hard to distinguish what you're describing from any compiler, especially for us old guys whose compilers generated ASM code for an intermediate represetation anyway.