Are preprocessors obsolete in modern languages? - language-agnostic

I'm making a simple compiler for a simple pet language I'm creating and coming from a C background(though I'm writing it in Ruby) I wondered if a preprocessor is necessary.
What do you think? Is a "dumb" preprocessor still necessary in modern languages? Would C#'s conditional compilation capabilities be considered a "preprocessor"? Does every modern language that doesn't include a preprocessor have the utilities necessary to properly replace it? (for instance, the C++ preprocessor is now mostly obsolete(though still depended upon) because of templates.)

C's preprocessing can do really neat things, but if you look at the things it's used for you realize it's often for just adding another level of abstraction.
Preprocessing for different operations on different platforms? It's basically a layer of abstraction for platform independence.
Preprocessing for easily adding complex code? Abstraction because the language isn't generic enough.
Preprocessing for adding extensions into your code? Abstraction because your code / your language isn't flexible enough.
So my answer is: you don't need a preprocessor if your language is high-level enough *. I wouldn't call preprocessing evil or useless, I just say that the more abstract the language gets, the less reason I can think for it needing preprocessing.
* What's high-level enough? That is, of course, entirely subjective.
EDIT: Of course, I'm only really referring to macros. Using preprocessors for interfacing with other code files or for defining constants is evil.

The preprocessor is a cheap method to provide incomplete metaprogramming facilities to a language in an ugly fashion.
Prefer true metaprogramming or Lisp-style macros instead.

A preprocesssor is not necessary. For real metaprogramming, you should have something like MetaML or Template Haskell or hygienic macros à la Scheme. For quick and dirty stuff, if your users absolutely must have it, there's always m4.
However, a modern language should support the equivalent of C's #line directives. Such directives enable the compiler to locate errors in the original source, even when that source is embedded in a parser generator or a lexer generator or a literate program. In other words,
Design your language so as not to need a preprocessor.
Don't bundle your language with a blessed preprocessor.
But if others have their own reasons for using a preprocessor (parser generation is a popular one), provide support for accurate error messages.

I think that preprocessors are a crutch to keep a language with poor expressive power walking.
I have seen so much abuse of preprocessors that I hate them with a passion.

A preprocessor is a separated phase of compilation.
While preprocessing can be useful in some cases, the headaches and bugs it can cause make it a problem.
In C, preprocessor is used mostly for:
Including data - While powerful, the most common use-cases do not need such power, and "import"/"using" stuff(like in Java/C#) is much cleaner to use, and few people need the remaining cases;
Defining constants - Why not just provide a "const" statement
Macros - While C-style macros are very powerful(they can include statements such as returns), they also harm readability. Generics/Templates are cleaner and, while less powerful in a few ways, they are easier to understand.
Conditional compilation - This is possibly the most legitimate use-case for preprocessors, but once again it's painful for readability. Separating platform-specific code in platform-specific source code and using common if statements ends up being better for readability.
So my answer is while powerful, the preprocessor harms readability and/or isn't the best way to deal with some problems. Newer languages tend to consider code maintenance very important, and for those reasons the preprocessor appears to be obsolete.

It's your language so you can build whatever capabilities you want into the language itself, without a need for a preprocessor. I don't think a preprocessor should be necessary, and it adds a layer of complexity and obscurity on top of a language. Most modern languages don't have preprocessors, and in C++ you only use it when you have no other choice.
By the way, I believe D handles conditional compilation without a preprocessor.

It depends on exactly what other features you offer. For example, if I have a const int N, do you offer for me to take N variables? Have N member variables, take an argument to construct all of them? Create N functions? Perform N operations that don't necessarily work in loops (for example, pass N arguments)? N template arguments? Conditional compilation? Constants that aren't integral?
The C preprocessor is so absurdly powerful in the proper hands, you'd need to make a seriously powerful language not to warrant one.

I would say that although you should avoid the pre-processor for most everything you normally do, it's still necessary.
For example, in C++, in order to write a unit-testing library like Catch, a pre-processor is absolutely necessary. They use it in two different ways: One for assertion expansion1, and one for nesting sections in test cases2.
But, the pre-processor shouldn't be abused to do compile-time computations in C++ where const-expressions and template meta-programming can be used.
Sorry, I don't have enough reputation to post more than two links, so I'm putting this here:
github.com/philsquared/Catch/blob/master/docs/assertions.md
github.com/philsquared/Catch/blob/master/docs/test-cases-and-sections.md

A others have pointed out, much of the functionality provided by the C preprocessor exists to compensate for limitations of the C language. For example, #include and inclusion guards exist due to the lack of an import statement, and macros largely exist due to the lack of inline functions and constant declarations.
However, the one feature of the C preprocessor that would still be beneficial in more modern languages is the #line directive, since this supports the use of semantically-rich preprocessors/compilers. An an example, consider yacc, which is a domain-specific-language (DSL) for writing a parser as a collection of BNF grammar rules. A central feature of yacc is that chunks of C code called actions can be embedded within BNF rules. When a BNF rule is used to parse a piece of an input file, an action embedded in that rule will be executed. The yacc compiler generates a C file that implements the BNF-based parser specified in the input file, and any actions that appeared in the input Yacc file are copied to the generated C file, but each action is surrounded by #line directives. This use of #line directives provides two important benefits.
First, if there is a syntax error in an action, then the error message generated by the C compiler can specify that the error occurred in, say, <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
Second, the location information provided by #line directives is copied into generated object code files created by the C compiler. So if you are using a debugger to investigate a program crash, if the bug that caused the crash originated from an action embedded in a Yacc input file, then the debugger will report the location of that buggy line of source code as being in <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
The designers of C# and Perl wisely provided a #line directive. Unfortunately, the designers of many other languages (Java being one that springs to mind) neglected to provide a #line directive. Because of this, Yacc-like parser generators for many languages are unable to communicate the source location of embedded actions to compilers (and, therefore, to debuggers).

Related

Pros and cons of weak and strong typing

I'm making the transition from Java to PHP/Javascript and discovering all the practical aspects of using a weakly typed language.
As I'm in a position to fully compare the two I'd like to know the pros and cons of each approach. Also, are there any other forms of typing out there?
A weakly dynamically typed programming language (like PHP) made that the programmer's mistakes occur as non-coherent behaviours (for instance, the program gonna display stupid informations).
With a strongly dynamically typed language (like python), the programming mistakes causes error message. It makes the mistakes easier to uncover and diagnosis but in general the program became not usable after the message has been shown.
Finally, with a strongly statically typed language (like Java, Ada, OCaml, Haskell, ...) some mistakes can be uncovered at compile time and hence reduce the risk to provide an bugged program. (but the release occurs later)
Yes. Python uses Dynamic Typing.
Generally it's a matter of personal preference and the role that the architects of a given language's intended use.
PHP (a scripting language) for example makes sense to be weakly typed, as the tasks it generally performs are far less complex, and require less constraints then say a compiled language.
Regarding your final question, Mathematica is said to be "typeless."
High-level, typeless, dynamic language with consistent symbolic syntax and semantics across all data, functions, and interfaces
PHP/javascript can be used to develope better looking UI's than Java. PHP will be having less constraints and easy to learn and execute than java.

How to translate from a language to another?

I don't want an automated solution.
When you have to translate a program from a language to another, what do you do? You prefer to rewrite it from the beginning or copy and paste it and change only what need to be changed?
What's the best choice?
It depends on
the goals (quick hack for one time use? long-lived production project for work?)
the resources I have (how many man-hours? Test suit and/or functional spec for old code? familiarity with both languages?)
most importantly, differences between the languages. Both the conceptual (OO? functional? reflection? control structures?) as well as available libraries.
Please note that this bullet is not as trivial as it seems - this depends in large part on how idiomatic the original program is - as an example, some people write very "C-like" Perl code (e.g. using C control flow and very C++ like OO design) which can be trivially copied to C or C++, and some people write incredibly intricate idiomatic Perl using functional programming, closures and reflective capabilties; which can't be obviously translated into C/C++.
Also, quality of the original code. E.g. good program will have separated business logic, usually expressed in standard configuration and control flow that's easier to directly clone.
E.g. translating from PHP to Perl for a hack job, you can often start out with copying, since many PHP constructs can be 1-to-1 mapped onto equivalent Perl constructs (just take your pick of Perl templating web library). The resulting code won't be GOOD Perl but will be Good Enough for some purposes.
On the other hand, translating, say, LISP code to Java, you're better off just translating the original code into a functionality specification and re-write from scratch. Your example of Python and JavaScript is probably in the same box.
Usually you have two languages that share at least some concepts (e.g. both have OO, and some imperative control structures) and thus you end up with some combination of the two approaches - parts of the code can be "thoughtlessly" translated, parts need to be re-written from scratch.
The more of the second (complete rewrite) approach, the better quality idiomatic and powerful code you end up with.
Normally, a rewrite using the original for inspiration and direction is the best choice.
The way you may do something in one language might very well not be the right way to do it in another.
When it comes to copy-paste, it is rare that you can just do that - languages are different and follow different syntax rules.
Of course, this all depends on the source and destination languages.
With your comment - javascript and python, I would say a rewrite is the best option.
That probably mostly depends on the similarity between the two languages and the meaning of "translation".
For instance, translating a bit of C89 code to C++ might not be so hard considering it should compile out of the box when you copy-paste (C++ is a compatible super-set to C89). I would hardly consider that a "translation", though.
On the other hand, translating Java to Haskell would certainly require a complete rewrite as the language paradygms, even worse than the syntax, are completely different.
Consider Wikipedia's list of programming languages. I'm too lazy to count how many of them are on this list, but let's assume there are 100.
If you want to translate one of them into another, that means that there are at least 100*99 = 9900 possible combinations for translation.
And that's an awful lot. Since most languages are unique, translation is very, very dependent on source and destination language.
Consider this Pascal to C converter. The author states it took him one and a half year to make a good translator for these particular languages. Obviously, this isn't a trivial task.
Depending on your ambitions, you might spend either one day, or many years translating a program from language A to language B.
How long this takes depends on your skill, size of your source code, complexity of languages A and B and their similarity.
As you can see, this isn't a trivial task and is highly dependent on your situation.

Why compilers don't translate in simpler languages?

Usually compilers translate from the language they support to assembly. Or at most to an assembly-like language (bytecode), like GIMPLE/GENERIC for GCC or Python/Java/.NET bytecode.
Wouldn't it be simpler for a compiler translate to a simpler language, which already implement a big subset of their grammar?
For example an Objective-C compiler, which is 100% compatible with C, could add the semantics only for the syntax it extends to C's, translating it into C. I can see many advantages of doing this; one could use this Objective-C compiler to translate its code into C in order to compile the generated C code with a different compiler that doesn't support C++ (but that optimizes more, or that compiles quicker, or able to compile for more architectures). Or one would be able to use the generated C code in a project where only C is allowed.
I guess/hope that if things were working like this, it would have been a lot easier to write extensions for current languages (eg: adding to C++ keywords to ease the implementation of common patterns, or, still in C++, removing the declare before use rule by moving inline member functions to the end of header files)
What kind of penalties would there be? Generated code would be very difficult to be understood by humans? Compilers wouldn't be able to optimize as much as they can now? What else?
This is actually used by a lot of languages, through the use of Intermediate languages. The biggest example for this would be Pascal, which had the Pascal-P system: Pascal was compiled into a hypothetical assembly language. To port pascal would only mean making a compiler for this assembly language: a task a lot simpler than porting the entire pascal compiler. After writing this compiler, you'd only need to compile the (machine-independent) pascal compiler that was written in this.
Bootstrapping is also used quite often in programming language design. Many languages have their compilers written in the same language(Haskell comes to mind here). By doing this, writing a new functionality for the language simply means translating that idea into the current language, putting it into the compiler, then recompiling.
I don't think the problem with this method is really the readability of generated code(I don't sift through assembly byte-code generated through compilers, personally), but one of optimization. Many ideas in higher-level programming languages( weak-typing comes to mind) are hard to automatically translate into lower-level system languages such as C. There's a reason why GCC tends to do its optimization before code generation.
But for the most part, compilers do translate into simpler languages except for maybe the most basic of system languages.
Incidentally, as a counterexample, Tcl is one language that is known to be very-very hard (if not totally impossible) to translate to C. Over the last 20 years there have been a couple of projects that tried this, even one promise of a commercial product but none have materialized.
In part it is because Tcl is a very dynamic language (as any language with an eval function is). In part it is because the only way to know if something is code or data is to run the program.
Since Objective-C is a strict superset of C and C++ contains a very large amount that is a lot like C, to parse either you effectively already need to be able to parse C. In which case, outputting to machine code and outputting to more C code aren't substantially different in processing cost, the main cost to the user being that compiling now takes as long as it originally did plus the amount of time a second compiler takes.
Any attempt to copy and paste the stuff that looks like C and translate the rest around it would be prone to problems. Firstly, C++ isn't a strict superset of C so things that look like C don't necessarily compile exactly the same anyway (especially versus C99). And even if they did, supposing a user made an error in their C stuff, compilers don't tend to provide error information in a machine readable format, so it'd be really hard for the Objective-C to C layer to give the user a meaningful error after receiving e.g. "error at line 99".
That said, many compiler suites, like GCC and even more so like the upcoming Clang + LLVM, use an intermediate form to decouple the bit that knows about the specifics of one architecture from the bit that knows the specifics of a particular language. However, it tends to be more of a data structure than something intentionally easy to express as a written language.
So: compilers don't work like this for purely practical reasons.
Haskell is actually compiled this way: the GHC compiler first translates the source code to an intermediary functional language (which is less rich than Haskell self), performs optimizations and then lowers the whole thing to C code which is then compiled by GCC. This solutions has problems tough, and projects were started to replace this backend.
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html
There is a compilers construction stack which is fully based on this idea. Any new language is implemented as a trivial translation into a lower level language or a combination of languages which are already defined within this stack.
http://www.meta-alternative.net/mbase.html
However, in order to be able to do so, you'd need at least some metaprogramming capabilities in every little language you add to a hierarchy. This requirement adds some severe limitations on languages semantics.

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?
They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.
The main benefit os using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters implemented as ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax you look at the AST parser , if a pices of code is not producing the expected results than you look at the code that interprets the AST.
The other great advantage is when you syntax requires "lookahead" e.g. if your syntax allows a subroutine to be used before it is defined it is trivial to validate the existence of a subroutine when you are using an AST - its much more difficult with an "on the fly" parser.
You need "syntax trees" to represent the structure of most programming langauges, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specially used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax such as statement terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs is less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually as isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.
In general you are going to parse you code into some form of AST, it may be more or less of a formal model. So I think what Kirk Woll was getting at by his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST which you can operate on like a much simpler version the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with with it once you parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terrence Parr the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.
Late to the question but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied in the parsing grammar. For simple languages, especially dynamically typed languages, this is a perfectly ok strategy. For more complex languages or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, ie your variables are declared with fixed types then the AST can be used to check that you're not assigning the wrong type to a variable. eg assigning a string to a variable that is declared to hold an integer would be wrong and this can be caught more conveniently with the AST.
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.

What features of interpreted languages can a compiled one not have?

Interpreted languages are usually more high-level and therefore have features as dynamic typing (including creating new variables dynamically without declaration), the infamous eval and many many other features that make a programmer's life easier - but why can't compiled languages have these as well?
I don't mean languages like Java that run on a VM, but those that compile to binary like C(++).
I'm not going to make a list now but if you are going to ask which features I mean, please look into what PHP, Python, Ruby etc. have to offer.
Which common features of interpreted languages can't/don't/do exist in compiled languages? Why?
Whether source code is compiled - to native binaries, some kind of intermediate language (Java Bytecode/IL) - or interpreted is absolutely no trait of the language. It's just a question of the implementation.
You can actually have both compilers and interpreters for the same language like
Haskell: GHC <-> GHCI
C: gcc <-> ch
VB6: VS IDE <-> VB6 compiler
Certain language features like eval or dynamic typing may suggest a distinction between so called "dynamic languages" and static ones, but how this is run can never be the primary question.
Initially, one of the largest benefits of interpreted languages was debugging. That way you can get incredibly accurate and detailed information when looking for the reason a program isn't working. However, most compilers have become advanced enough that that is not too big of a deal any more.
The other main benefit (in my opinion anyway), is that with interpreted languages, you don't have to wait for eternity for your project to compile to test it out.
You couldn't plausibly do eval, for example, for reasons I'd have thought were pretty obvious: exactly how would you implement it? Make the runtime contain a full copy of the compiler? Every time you wanted to evaluate a string (keeping in mind that each time it could be different!) you'd save the string to a file, run the compiler on it to make a DLL/shared-lib, then load that DLL/shared-lib and call your code? You can't see why this might be a wee bit impractical? ;)
You can find this kind of thing in dynamic languages all over the place that you can't do with static code short of basically running an interpreter, in effect, behind the scenes.
Continuing on from Dario - I think you are really asking why a compiled program can't evaluate statements at runtime (e.g. eval). Here's some reasons I can think of:
The full compiler would have to be distributed with the program (or be part of the program)
For an eval function to have access to type information and symbols (such as variable names and function names) in the environment it was used the original program would have to be compiled with those symbols accessible (compiled languages usually remove these symbols at compile time).
Edit: As noted neither of these reasons make it impossible for a language/compiler to be able to evaluate code at runtime, but they are definitely things that need to be taken into consideration when developing a compiler or when designing a language.
Maybe the question is not about interpreted/compiled languages (compile is ambiguous anyway) but about languages that do/don't carry their own compiler around with them? For instance we've said C++ could do eval with a handy compiler floating around in the app, and reflection presumably is similar in some ways.