I'm making a simple compiler for a simple pet language I'm creating, and coming from a C background (though I'm writing the compiler in Ruby), I wondered whether a preprocessor is necessary.
What do you think? Is a "dumb" preprocessor still necessary in modern languages? Would C#'s conditional compilation capabilities be considered a "preprocessor"? Does every modern language that doesn't include a preprocessor have the facilities necessary to properly replace it? (For instance, the C++ preprocessor is now mostly obsolete, though still depended upon, because of templates.)
C's preprocessor can do really neat things, but if you look at what it's actually used for, you realize it's often just adding another level of abstraction.
Preprocessing for different operations on different platforms? It's basically a layer of abstraction for platform independence (a minimal sketch follows this list).
Preprocessing for easily adding complex code? Abstraction because the language isn't generic enough.
Preprocessing for adding extensions into your code? Abstraction because your code / your language isn't flexible enough.
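To make the platform-independence point concrete, here's a minimal C sketch (the helper name sleep_ms is made up for illustration); the preprocessor simply selects one implementation at compile time:

#ifdef _WIN32
#include <windows.h>
static void sleep_ms(unsigned ms) { Sleep(ms); }        /* Windows API */
#else
#include <unistd.h>
static void sleep_ms(unsigned ms) { usleep(ms * 1000); } /* POSIX, takes microseconds */
#endif

int main(void) { sleep_ms(100); return 0; }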
So my answer is: you don't need a preprocessor if your language is high-level enough *. I wouldn't call preprocessing evil or useless; I'm just saying that the more abstract the language gets, the fewer reasons I can think of for it to need preprocessing.
* What's high-level enough? That is, of course, entirely subjective.
EDIT: Of course, I'm only really referring to macros. Using preprocessors for interfacing with other code files or for defining constants is evil.
The preprocessor is a cheap method to provide incomplete metaprogramming facilities to a language in an ugly fashion.
Prefer true metaprogramming or Lisp-style macros instead.
A preprocessor is not necessary. For real metaprogramming, you should have something like MetaML or Template Haskell or hygienic macros à la Scheme. For quick and dirty stuff, if your users absolutely must have it, there's always m4.
However, a modern language should support the equivalent of C's #line directives. Such directives enable the compiler to locate errors in the original source, even when that source is embedded in a parser generator or a lexer generator or a literate program. In other words,
Design your language so as not to need a preprocessor.
Don't bundle your language with a blessed preprocessor.
But if others have their own reasons for using a preprocessor (parser generation is a popular one), provide support for accurate error messages.
I think that preprocessors are a crutch to keep a language with poor expressive power walking.
I have seen so much abuse of preprocessors that I hate them with a passion.
A preprocessor is a separate phase of compilation.
While preprocessing can be useful in some cases, the headaches and bugs it can cause make it a problem.
In C, the preprocessor is mostly used for:
Including files - While powerful, the most common use-cases do not need such power; "import"/"using" mechanisms (like in Java/C#) are much cleaner to use, and few people need the remaining cases;
Defining constants - Why not just provide a "const" statement?
Macros - While C-style macros are very powerful (they can include statements such as returns), they also harm readability. Generics/templates are cleaner and, while less powerful in a few ways, are easier to understand (a small contrast is sketched after this list);
Conditional compilation - This is possibly the most legitimate use-case for preprocessors, but once again it's painful for readability. Separating platform-specific code into platform-specific source files and using ordinary if statements ends up being better for readability.
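As a small illustration of the constants and macros points above (all names and values are made up): the typed constant and the inline function behave predictably where the textual macro does not.

#include <stdio.h>

#define BUFFER_SIZE 1024                 /* textual substitution: no type, no scope */
static const int kBufferSize = 1024;     /* typed, scoped, visible to a debugger */

#define SQUARE(x) x * x                  /* SQUARE(n + 1) expands to n + 1 * n + 1 */
static inline int square(int x) { return x * x; }

int main(void) {
    int n = 3;
    printf("%d %d\n", SQUARE(n + 1), square(n + 1));  /* prints 7 16 */
    return 0;
}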
So my answer is: while powerful, the preprocessor harms readability and/or isn't the best way to deal with some problems. Newer languages tend to treat code maintenance as very important, and for those reasons the preprocessor appears to be obsolete.
It's your language so you can build whatever capabilities you want into the language itself, without a need for a preprocessor. I don't think a preprocessor should be necessary, and it adds a layer of complexity and obscurity on top of a language. Most modern languages don't have preprocessors, and in C++ you only use it when you have no other choice.
By the way, I believe D handles conditional compilation without a preprocessor.
It depends on exactly what other features you offer. For example, if I have a const int N, do you offer for me to take N variables? Have N member variables, take an argument to construct all of them? Create N functions? Perform N operations that don't necessarily work in loops (for example, pass N arguments)? N template arguments? Conditional compilation? Constants that aren't integral?
The C preprocessor is so absurdly powerful in the proper hands, you'd need to make a seriously powerful language not to warrant one.
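One concrete illustration of that power, not taken from the answer itself: the X-macro idiom, in which a single list generates an enum and a matching string table in lockstep (all names are made up). Reproducing this without a preprocessor requires genuine metaprogramming support.

#include <stdio.h>

#define COMMAND_LIST \
    X(OPEN)          \
    X(CLOSE)         \
    X(READ)

#define X(name) CMD_##name,
enum command { COMMAND_LIST CMD_COUNT };   /* CMD_OPEN, CMD_CLOSE, CMD_READ, CMD_COUNT */
#undef X

#define X(name) #name,
static const char *command_names[] = { COMMAND_LIST };
#undef X

int main(void) {
    for (int i = 0; i < CMD_COUNT; ++i)
        printf("%d -> %s\n", i, command_names[i]);
    return 0;
}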
I would say that although you should avoid the preprocessor for almost everything you normally do, it's still necessary.
For example, in C++, in order to write a unit-testing library like Catch, a preprocessor is absolutely necessary. They use it in two different ways: one for assertion expansion [1], and one for nesting sections in test cases [2]. (A minimal sketch of the assertion trick follows the links below.)
But, the pre-processor shouldn't be abused to do compile-time computations in C++ where const-expressions and template meta-programming can be used.
Sorry, I don't have enough reputation to post more than two links, so I'm putting them here:
[1] github.com/philsquared/Catch/blob/master/docs/assertions.md
[2] github.com/philsquared/Catch/blob/master/docs/test-cases-and-sections.md
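Catch's real macros are considerably more elaborate, but a rough C sketch of the core trick is below: only the preprocessor can capture the expression's source text and the call site, via #expr, __FILE__, and __LINE__.

#include <stdio.h>

#define CHECK(expr)                                            \
    do {                                                       \
        if (!(expr))                                           \
            fprintf(stderr, "FAILED: %s (%s:%d)\n",            \
                    #expr, __FILE__, __LINE__);                \
    } while (0)

int main(void) {
    int answer = 41;
    CHECK(answer == 42);   /* prints something like: FAILED: answer == 42 (test.c:13) */
    return 0;
}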
As others have pointed out, much of the functionality provided by the C preprocessor exists to compensate for limitations of the C language. For example, #include and inclusion guards exist due to the lack of an import statement, and macros largely exist due to the lack of inline functions and constant declarations.
However, the one feature of the C preprocessor that would still be beneficial in more modern languages is the #line directive, since it supports the use of semantically rich preprocessors/compilers. As an example, consider yacc, which is a domain-specific language (DSL) for writing a parser as a collection of BNF grammar rules. A central feature of yacc is that chunks of C code, called actions, can be embedded within BNF rules. When a BNF rule is used to parse a piece of an input file, an action embedded in that rule is executed. The yacc compiler generates a C file that implements the BNF-based parser specified in the input file; any actions that appeared in the input yacc file are copied to the generated C file, and each action is surrounded by #line directives. This use of #line directives provides two important benefits.
First, if there is a syntax error in an action, then the error message generated by the C compiler can specify that the error occurred in, say, <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
Second, the location information provided by #line directives is copied into the object code files generated by the C compiler. So if you are using a debugger to investigate a program crash, and the bug that caused the crash originated from an action embedded in a yacc input file, the debugger will report the location of that buggy line of source code as being in <input-file-to-yacc>, line 42 rather than in <output-file-generated-by-yacc>.c, line 3967.
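Here's a small runnable sketch of the effect (the file name calc.y is made up): after a #line directive, the compiler attributes the following lines, and therefore its error messages and debug information, to the named file and line, which is exactly how yacc maps generated code back to the grammar.

#include <stdio.h>

int main(void) {
#line 42 "calc.y"
    printf("reported as %s:%d\n", __FILE__, __LINE__);   /* prints: reported as calc.y:42 */
    return 0;
}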
The designers of C# and Perl wisely provided a #line directive. Unfortunately, the designers of many other languages (Java being one that springs to mind) neglected to provide a #line directive. Because of this, Yacc-like parser generators for many languages are unable to communicate the source location of embedded actions to compilers (and, therefore, to debuggers).
In theory, source code should not contain hardcoded values beyond 0, 1 and the empty string. In practice, I find it very hard to avoid all hardcoded values while being on very tight delivery times, so I end up using a few of them and feeling a little guilty.
How do you reconcile avoiding hardcoded values with tight delivery times?
To avoid hard-coding you should
use configuration files (put your values in XML or ini-like text files).
use database table(s) to store your values.
Of course, not all values qualify to be moved to a config file. For those, you should use constructs provided by the programming language, such as constants, enums, etc.
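As a minimal sketch of that last point (all names and values are made up), values that don't warrant a config file can still be named constants or enums rather than magic numbers scattered through the code:

#include <stdio.h>

enum { MAX_RETRIES = 3, CONNECT_TIMEOUT_MS = 5000 };
static const double VAT_RATE = 0.21;

int main(void) {
    for (int attempt = 1; attempt <= MAX_RETRIES; ++attempt)
        printf("attempt %d of %d, timeout %d ms, VAT %.2f\n",
               attempt, MAX_RETRIES, CONNECT_TIMEOUT_MS, VAT_RATE);
    return 0;
}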
I just saw an answer suggesting a "Constant Interface". With all due respect to the poster and the voters, this is not recommended. You can read more about that at:
http://en.wikipedia.org/wiki/Constant_interface
The assumption behind the question seems invalid to me.
For most software, configuration files are massively more difficult to change than source code. For widely installed software, this could easily be a factor of a million times more difficult: there could easily be that many files hanging around on user installations, over which you have little knowledge and no control.
Having numeric literals in the software is no different from having functional or algorithmic literals: it's just source code. It is the responsibility of any software that intends to be useful to get those values right.
Failing that, make them at least maintainable: well named and organised.
Making them configurable is the kind of last-ditch compromise you might be forced into if you are on a tight schedule.
This just takes a little bit of planning; in most cases it is as simple as having a configuration file, or possibly a database table, that stores critical configuration items. I don't see any reason that you "have" to have hardcoded values, and offloading them to a configuration mechanism shouldn't take so much additional time that tight timelines would be a valid excuse.
The problem with hardcoded values is that sometimes it's not obvious that particular code relies on them. For example, in Java it is possible to move all constants into a separate interface and separate particular constants into inner sub-interfaces. It's quite convenient and obvious. It's also easy to find a constant's usages just by using IDE facilities ("find usages" functionality) and to change or refactor them.
Here's an example:
public interface IConstants {
    public interface URL {
        String ALL = "/**";
    }
    public interface REST_URL {
        String DEBUG = "/debug";
        String SYSTEM = "/system";
        String GENERATE = "/generate";
    }
}
Referencing is quite human readable: IConstants.REST_URL.SYSTEM
Most non-trivial enterprise-y projects will have some central concept of properties or configuration options, which already takes care of loading options from a file/database. In these cases, it's usually simple (as in, less than 5 minutes' work) to extend it to support the new properties you need.
If your project doesn't have one, then either:
It could benefit from one - in which case write it, taking values from a flat .properties file to start with. This shouldn't take more than an hour, tops, and is reusable for any config stuff in the future.
Even that hour would be a waste - in which case you can still have a default value, but allow it to be overridden by a system property. This requires no infrastructure work and minimal time to implement in your class (a tiny sketch follows this answer).
There's really no excuse for hardcoding values - it only saves you a few minutes at most, and if your project deadline is measured in minutes then you've got bigger problems than how to code for configurability.
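A tiny sketch of that second option, in C terms (the variable name APP_PAGE_SIZE is made up): a hard default that an environment variable - the closest C analogue of a system property - can override, with no configuration infrastructure at all.

#include <stdio.h>
#include <stdlib.h>

static int page_size(void) {
    const char *s = getenv("APP_PAGE_SIZE");   /* e.g. APP_PAGE_SIZE=100 ./app */
    return (s && atoi(s) > 0) ? atoi(s) : 50;  /* 50 is the built-in default */
}

int main(void) {
    printf("page size: %d\n", page_size());
    return 0;
}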
Admittedly, I hardcode a lot of stuff in my current hobby project. Configuration files are ridiculously easy to use instead (at least with Python, which comes with a great and simple .cfg parser); I just don't bother to use them because I am 99% confident that I will never have to change the values - and even if that assumption proved false, the project is small enough to refactor with reasonable effort. For anything larger/more important, however, I would never type if foo == "hardcoded bar", but rather if foo == cfg.bar (likely with a more meaningful name than cfg). cfg is a global singleton (yeah, I know...) which is fed the .cfg file at startup, and the next time some sentinel value changes, you change the configuration file and not the source.
With a dynamic/reflective language, you don't even need to change the part loading the .cfg when you add another value to it - make it populate the cfg object dynamically with all entries in the file (or use a hashmap, for that matter) and be done.
2 suggestions here:
First, if you are working on an embedded system using a language like C, simply work out a coding convention to use a #define for any string or constant. All the #defines should be categorized in a .h file (a small header sketch appears after these suggestions). That should be enough - not too complex, but enough for maintainability. You don't need to scatter the constants among the lines of code.
Second, if you are working on an application with access to a database, it is simple to keep all the constant values in a database table. You just need a very simple interface API to do the retrieval.
With simple tricks, both methods can be extended to support multi-language (localization) features.
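As a small sketch of the first suggestion (all names and values are made up), one header collects every constant and user-visible string, grouped by category; swapping out the string block is also one way the multi-language extension could work.

/* config_constants.h */
#ifndef CONFIG_CONSTANTS_H
#define CONFIG_CONSTANTS_H

/* --- timing --- */
#define UART_TIMEOUT_MS    250
#define WATCHDOG_PERIOD_S  2

/* --- limits --- */
#define MAX_SENSORS        8

/* --- user-visible strings --- */
#define MSG_BOOT           "Booting..."
#define MSG_SENSOR_FAILURE "Sensor failure"

#endif /* CONFIG_CONSTANTS_H */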
The problem:
You have some data, and your program needs input in a specified format - for example, strings that represent numbers. You are searching for a way to transform the original data into the format you need.
And the problem is: the source can be anything. It can be XML, property lists, or binary that contains the needed data deeply embedded in binary junk. And your output format may vary as well: it can be number strings, floats, doubles...
You don't want to program. You want routines that give you commands capable of transforming the data into the form you wish. Such a tool would surely include regular expressions, but it would be well designed and offer capabilities that are sometimes much easier to use and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read, and write data supplied by other sources. If they can't, the users are stuck, or they resort to programs like business intelligence suites. That is NOT the problem.
I am talking about a tool for a developer who knows what he is doing, but who is also tired of writing such routines every time in a general-purpose language. A professional data manipulation tool: something like a hex editor, regex, vi, grep, and a parser melted together, accessible through routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
writing macros that allow a command chain to be executed repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read XML strings embedded in a binary file with variable-length records. Your tool reads the record length and deletes the junk surrounding your text. Now it splits open the XML and extracts the strings. The digits are Indian numeral glyphs and use decimal commas instead of decimal points, so your tool transliterates them into ASCII and replaces the commas with points. Now the results must be stored in matrices of variable length... etc., etc.
I am searching for a good language / language design and, if possible, an implementation.
Which design do you like - or which one, even if it does not fulfill all the conditions, wouldn't you want to be without?
EDIT: The question is whether a solution to the problem exists and, if so, which implementations are available. You DO NOT implement your own sorting algorithm when Quicksort, Mergesort, and Heapsort are available. You DO NOT invent your own text parsing method when you have regular expressions. You DO NOT invent your own 3D graphics language when OpenGL/Direct3D is available. There are existing solutions, or at least papers describing the problem and giving suggestions. And there are people who may have worked on and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and that I should work it out and implement it myself without any background knowledge seems to me, I must admit, totally off the mark.
UPDATE:
Unfortunately, I had less time than anticipated to delve into the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer, and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who wants to give implementing a good parsing language a try, the smallest set of operators capable of directly transforming any input data into any output data - if (!) they were powerful enough - seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups and supergroups (data structures, lists, tables, etc.)
Transform:
Substitution: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap, merge, etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to do some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. those just trying to quickly get an overview of what SnapLogic can do and whether it would generally meet their needs, without investing much time or learning the system in earnest.
Also, I realize that the scope and typical uses of SnapLogic differ somewhat from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to create "pipelines" [virtually] without code, i.e. processes made from pre-built components.
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonable size])
merge sources
filter records within a source (to select and, at a later step, work on, say, only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of SnapLogic's functionality (for the purpose described by the OP) is parsing. Standard components will only read generic file formats (XML, RSS, CSV, fixed-length, DBMSes...); therefore, structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such, are unlikely to ever be handled by a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways: with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parses the file at "record" level (or line level or block level, whatever that may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the feature gap concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regard to creating various processing pipelines.
I also hedged my suggestion with an expression like "get inspiration from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per se, nor a critique of business-oriented use of open source, but rather a warning that business/commercial pressures may shape the offering in various directions.)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more general remark is that SnapLogic is based on Python, which is a very swell language for coding various connectors, conversion logic, etc.
In reply to Paul Nathan: you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be. After all, all of our code will be thrown away and replaced eventually, no matter how perfectly we wrote it. So my opinion is that writing throwaway code is pretty much OK, as long as you don't spend too much time writing it.
So, it seems that there are two approaches to solving your problem: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it, and store it in some specific structure), or b) use some general-purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable, because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I may well be wrong, so if you do find a perfect tool, please drop a link here (I myself do lots of data processing in my day job and can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with a bunch of useful libraries (regexps, XML manipulation, parser construction...) it shouldn't be too hard, and it may gradually be turned into a DSL for the very purpose. Besides Perl, which was already mentioned, Python and Ruby sound like good candidates for such languages (I bet some Lisp derivatives do too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.
I'm having trouble finding good advice and common practices for the use of namespaces in Clojure. I realize that namespaces are not the same as Java packages so I'm trying to tease out the conventions in Clojure, which seem surprisingly hard to determine.
I think I have a pretty good idea how to split functions into clj files and even roughly how I'd want to organize those files into directories. But beyond that I'm having trouble finding the mechanics for my dev environment. Some inter-related questions:
Do I use the same uniqueness conventions for Clojure namespaces as I would normally use for Java packages? [ie backwards-company-domain.project.subsystem]
Should I save my files in a directory structure that matches my namespaces? [ala Java]
If I have multiple namespaces, do I need to compile all of my code into a jar and add it to my classpath to make it accessible?
Should each namespace compile to one jar? Or should I create a single jar that contains clj code from many namespaces?
Thanks...
I guess it's ok if you think it helps, but many Clojure projects don't do so -- cf. Compojure (using a top-level compojure ns and various compojure.* ns's for specific functionality), Ring, Leiningen... Clojure itself uses clojure.* (and clojure.contrib.* for contrib libraries), but that's a special case, I suppose.
Yes! You absolutely must do so, or else Clojure won't be able to find your namespaces. Also note that you mustn't use underscores in namespace names or hyphens in filenames, and that wherever you use a hyphen in a namespace name, you must use an underscore in the filename (so that the ns my.cool-project is defined in a file called cool_project.clj in a directory called my).
You need to make sure all your stuff is on the classpath, but it doesn't matter if it's in a jar, multiple jars, a mixture of jars and directories on the filesystem... As long as it obeys the correct naming conventions (your point no. 2) you should be fine.
However, do not compile things ahead-of-time if there's no particular reason to do so -- this may prevent your code from being portable across various versions of Clojure without providing any benefits besides a slightly improved loading time.
You'll still need to use AOT compilation sometimes, notably in some Java interop scenarios -- the documentation of the relevant functions / macros always mentions that. There are examples of things requiring AOT in clojure.contrib; I've never needed it, so I can't provide much in the way of details.
I'd say you should use jars for functional units of code. E.g. Compojure and Ring get packaged as single jars containing many namespaces which together compose the whole package. Also, clojure.contrib is notably packaged as a single jar with multiple unrelated libraries; but that again may be a special case.
On the other hand, a single jar containing all of your project's code together with its dependencies might occasionally be useful for deployment. Check out the Leiningen build tool and its 'uberjar' facility if you think that sort of thing may be useful to you.
Strictly speaking, not necessary, though many Java projects have dropped that convention as well, especially for internal projects or private APIs. Do avoid single-segment namespaces though, which would result in classfiles being generated in the default package.
Yes.
Regarding 3 & 4, packaging and AOT compilation are entirely orthogonal to the question of namespace conventions.
Does anyone out there know of examples of, and the theory behind, parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice versa? Mathematically, at least intuitively, I believe the function code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?
Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimizing or generating code.
Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, somewhat like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have to see all the ugly detail.
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.
I rather like lewap's response:
find a mathematical way to express a visitor and you have a dual to the parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. In the way both it and the code editors are implemented, all of them constitute views of the model, and editing either one modifies the model so that all views remain in sync.
Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.
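A minimal unparser sketch (not from any of the answers above) makes the asymmetry concrete: a tiny expression AST is printed back to source text, and fully parenthesizing the output sidesteps the precedence and ambiguity questions that make the parsing direction harder.

#include <stdio.h>

typedef struct Node {
    char op;                    /* '+' or '*', or 0 for a leaf */
    int value;                  /* used only when op == 0 */
    const struct Node *left, *right;
} Node;

static void unparse(const Node *n) {
    if (n->op == 0) { printf("%d", n->value); return; }
    printf("(");
    unparse(n->left);
    printf(" %c ", n->op);
    unparse(n->right);
    printf(")");
}

int main(void) {
    Node two = {0, 2, NULL, NULL}, three = {0, 3, NULL, NULL}, four = {0, 4, NULL, NULL};
    Node sum  = {'+', 0, &two, &three};
    Node prod = {'*', 0, &sum, &four};
    unparse(&prod);             /* prints ((2 + 3) * 4) */
    printf("\n");
    return 0;
}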
There is theory, along with working implementations and examples, of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json
I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.
In addition to 'Visitor', 'unparser' is another good keyword to web-search for.
That sounds a lot like the back end of a non-optimizing compiler whose target language is the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.
I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you read each item's data and transform it into destination code.
There are several "lens languages" that allow bidirection transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.