What language(s) have comments with side effects? In essence, comments which are not comments....
English. Do I win?
DOS Batch Shell programming
The REM (Remark) command allows you to put in a comment. But it has the side effect of resetting the ERRORLEVEL variable to 0.
In a sense, it makes the last operation a success.
I don't know how a comment can fail, but if it does, you are covered.
I can think of several places where comments aren't really comments.
HTML comments inside script tags (providing support for browsers that don't allow or support scripts).
And then, considerably more obscurely:
IBM Informix 4GL (I4GL) and 4J's Genero (successor to Informix Dynamic 4GL, D4GL). The notation '--#' was used by D4GL to include material only applicable to D4GL; I4GL would see that as a comment. The inverse notation was '--#', which looked like a comment to D4GL but was treated as active material by I4GL.
And, even more obscurely:
I wrote an I4GL file which was dual-languaged, exploiting I4GL's multiple comment facilities. Material starting with '#' (hash) marked the start of a comment outside of strings, running up to the next newline, as did '--' (double-dash). Also, '{...}' (braces) enclosed multiline comments.
The top of the source file was actually a shell script, mostly enclosed in '{...}' which is, of course, perfectly legitimate in shell. The shell script was a data-driven code generator that copied itself to the top of the output, and then generated about 100 functions which were all depressingly similar but slightly different (in a language without templates or a pre-processor). The code had to validate what was in the database for a given ship against incoming data from an external source (Lloyds of London, in fact), to see what had changed since the last time the external data was received. Non-trivial comparison work, especially since it had to deal with database (SQL) nulls.
The file was not really a Quine program, but it had some points in common with it. In particular, you could feed the script broken I4GL code and the regenerated file would be perfect again, basically because it ignored the existing I4GL code.
Haskell can turn the usual comments-in-code paradigm upside down by having code in comments - as can Mathematica and the like; literate programming is a nice feature for the more mathematically inclined languages.
I also find annotations in Java are like comments with behaviour.
Then of course there are "polyglots" -- programs which can be compiled/executed in multiple languages. Usually these rely on the fact that the same line is a comment in one language, but an actual line of code in another.
QBasic has a use of comments all its own: REM $STATIC or REM $DYNAMIC set how arrays are allocated.
Another example: When web browsers parse comments <!-- -- -->in<!-- -- -->correctly.
CSS for clever cross-browser hacks. Of course, I wouldn't really call CSS a language.
Just stumbled upon this old question and my first thought was javadoc comments.
Related
I would like to program a parser for a markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation and embedding of images, while also allowing HTML for more complex formatting.
Has anyone done similar things or has worked closely with those systems and could describe generally how this could be done efficiently?
Although efficiency is not really of concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.
For your general discussion…
You should start with the following blueprint:
you need to iterate character by character over the entire input
you need to identify every character by its context, since it may open a tag ('<', '[' etc.) or just be an ordinary character. This can be done with an escape flag, triggered by an escape character (the way backslashes work in some languages). If you use that approach, you also need to check for an escaped escape character.
you may also need a flag telling you whether you are inside a comment or a special data section, which may have different escaping rules.
you need to build a tree-like structure, or at least a stack, for nested tags; a minimal sketch follows this list. This is why regexes are a bad idea: they not only carry much too much overhead, they are also of no use if you want to find the correct closing tag for the second x (x = any tag) in the following snippet: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>
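To make the blueprint concrete, here is a minimal, hedged Python sketch of that character-by-character approach: an escape flag plus a tag stack for nesting. The '[' and ']' delimiters, the backslash escape and the error handling are illustrative assumptions, not part of any real markup spec.

    # Minimal sketch: escape flag + tag stack, character by character.
    def parse(text):
        root = {"tag": None, "children": []}   # tree node: tag name plus children
        stack = [root]                         # open tags; top of stack is the current parent
        escaped = False
        buf = []                               # plain-text characters collected so far

        def flush():
            if buf:
                stack[-1]["children"].append("".join(buf))
                buf.clear()

        i = 0
        while i < len(text):
            ch = text[i]
            if escaped:                        # previous character was the escape character
                buf.append(ch)
                escaped = False
            elif ch == "\\":
                escaped = True                 # take the next character literally
            elif ch == "[":                    # tag opener, e.g. [b] or [/b]
                end = text.index("]", i)
                name = text[i + 1:end]
                flush()
                if name.startswith("/"):       # closing tag: pop the stack
                    if stack[-1]["tag"] != name[1:]:
                        raise ValueError("mismatched closing tag [%s]" % name)
                    stack.pop()
                else:                          # opening tag: push a new node
                    node = {"tag": name, "children": []}
                    stack[-1]["children"].append(node)
                    stack.append(node)
                i = end
            else:
                buf.append(ch)
            i += 1
        flush()
        return root

    print(parse(r"[b]bold [i]and\[literal\] italic[/i][/b] plain"))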
The problem:
You have some data and your program needs specified input, for example strings which are numbers. You are searching for a way to transform the original data into a format you need.
And the problem is: the source can be anything. It can be XML, property lists, or binary which contains the needed data deeply embedded in binary junk. And your output format may vary as well: it can be number strings, floats, doubles....
You don't want to program. You want routines which give you commands capable of transforming the data into the form you wish. Surely it contains regular expressions, but it is very well designed and it offers capabilities which are sometimes much easier to use and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If they can't, the users are doomed or have to resort to programs like business intelligence suites. That is NOT the problem.
I am talking about a tool for a developer who knows what he is doing, but who is also tired of writing such routines in a regular language every time. A professional data manipulation tool: something like a hex editor, regexes, vi, grep and a parser melted together, accessible through routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allow a command chain to be executed repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read XML strings embedded in a binary file with variable-length records. Your tool reads the record length and deletes the junk surrounding your text. Now it splits open the XML and extracts the strings. Since they are written with Indian number glyphs and contain decimal commas instead of decimal points, your tool transforms them into ASCII and replaces the commas with points. Now the results must be stored into matrices of variable length... etc. etc.
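Purely to illustrate how such a pipeline might be scripted by hand today, here is a rough Python sketch of the example; the 4-byte big-endian length prefix, the <value> tag name and the Devanagari digit mapping are all invented for illustration.

    import struct
    import xml.etree.ElementTree as ET

    DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

    def records(blob):
        """Yield the payload of each length-prefixed record in the binary blob."""
        pos = 0
        while pos < len(blob):
            (length,) = struct.unpack_from(">I", blob, pos)   # record length
            yield blob[pos + 4:pos + 4 + length]              # strip the surrounding junk
            pos += 4 + length

    def values(blob):
        """Extract <value> strings from the embedded XML and normalise them."""
        for payload in records(blob):
            root = ET.fromstring(payload.decode("utf-8"))
            for node in root.iter("value"):
                text = node.text.translate(DEVANAGARI_DIGITS)  # Indian digits -> ASCII
                yield float(text.replace(",", "."))            # decimal comma -> point

    xml_payload = "<row><value>१२,५</value></row>".encode("utf-8")
    blob = struct.pack(">I", len(xml_payload)) + xml_payload
    print(list(values(blob)))   # [12.5]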
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like, or, even if it does not fulfil all the conditions, which one wouldn't you want to miss?
EDIT: The question is whether a solution for the problem exists and, if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort are available. You DO NOT invent your own text-parsing method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions, or at least papers describing the problem and giving suggestions. And there are people who may have worked on and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and that I should work it out and implement it myself without background knowledge seems to me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve into the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime, and as far as I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who wants to try implementing a good parsing language, the smallest set of operators needed to directly transform any input data into any output data, if (!) they are powerful enough, seems to be:
Insert/Remove: self-explanatory
Group/Ungroup: split the input data into a set of tokens and organize them into groups and supergroups (data structures, lists, tables etc.)
Transform
Substitution: change the content of the tokens (special operation: replace)
Transposition: change the order of tokens (swap, merge etc.)
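As a rough sketch of what that minimal operator set could look like, here is one possible Python modelling over a flat token list; the function names and token representation are assumptions, not taken from TextTransformer or any existing tool.

    def insert(tokens, index, item):               # Insert
        return tokens[:index] + [item] + tokens[index:]

    def remove(tokens, predicate):                 # Remove
        return [t for t in tokens if not predicate(t)]

    def group(tokens, size):                       # Group: flat list -> list of groups
        return [tokens[i:i + size] for i in range(0, len(tokens), size)]

    def ungroup(groups):                           # Ungroup: list of groups -> flat list
        return [t for g in groups for t in g]

    def substitute(tokens, mapping):               # Transform / Substitution
        return [mapping.get(t, t) for t in tokens]

    def transpose(tokens, i, j):                   # Transform / Transposition (swap)
        tokens = list(tokens)
        tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    # Chaining the operators, command-style:
    data = ["3,14", "junk", "2,72"]
    data = remove(data, lambda t: t == "junk")
    data = substitute(data, {"3,14": "3.14", "2,72": "2.72"})
    print(group(data, 1))   # [['3.14'], ['2.72']]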
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to do some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do and whether it would generally meet their needs, without investing much time or learning the system in earnest.
Also, I realize that the scope and typical uses of SnapLogic differ somewhat from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonable size])
merge sources
filter records within a source (to select and, at a later step, work on, say, only records with the attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of SnapLogic's functionality (for the purpose described by the OP) is parsing. Standard components will only read generic file formats (XML, RSS, CSV, fixed length, DBMSes...), so structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such, are unlikely ever to be covered by a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways: with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parses the file at the "record" level (or line level or block level, whatever that may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different file formats, whereby each new format requires piecing together components with little or no coding. Whatever the approach used for the input/parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashions.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the feature gap concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regard to creating various processing pipelines.
I also hedged my suggestion, with an expression like "get inspiration from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per se, nor a critique of business-oriented use of open source, but rather a warning that business/commercial pressures may shape the offering in various directions.)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python, which is a very swell language for coding various connectors, conversion logic etc.
In reply to Paul Nathan: you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfectly we wrote it. So my opinion is that writing throwaway code is pretty much OK, as long as you don't spend too much time writing it.
So, it seems that there are two approaches to solving your problem: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and store it in some specific structure), or b) use some general-purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might well be wrong, so if you do find a perfect tool, please drop a link here (I myself do lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with a bunch of useful libraries (regexps, XML manipulation, parser construction...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Besides Perl, which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.
I'm looking for some good overviews of best practices and common patterns for enabling syntax highlighting in a textbox. It seems like a very common exercise; almost all languages have a UI control that enables syntax highlighting for different languages. I'm just curious to see if there is a common pattern of implementation.
Is everyone using regular expressions? Is there a repository for regular expressions that are commonly used in syntax highlighting scenarios?
Are there alternative/better approaches to syntax highlighting?
Update
Links to relevant resources about performing syntax highlighting in a given language or concepts related to syntax highlighting would be great. Lexing (lexical analysis) was brought up in an answer but without a link to learn more. Anything to help better understand this commonly solved problem would be great.
Lexical Analysis on Wikipedia
Regular expressions are definitely the first place most start out. However, they can't really cope with many edge cases that one meets in most languages: text that looks like keywords can be found in string literals, string literals in turn can contain escaped delimiters as well as special characters. The same goes for comments, etc.
Basically to do a good job of syntax highlighting you need to perform lexing of the source - parsing it with the application of language-specific heuristics to build a list of regions, where each region of the source is annotated with how it is to be styled.
As edits take place, you can again apply language rules to see how far this change can alter the presentation of a region. For example typing a letter inside a string literal simply makes the string literal region longer, but typing a closing quote truncates the region and turns the leftover part of it into code, subject to all the other lexing rules.
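To make the region idea concrete, here is a small, hedged Python sketch of lexing a toy language into styled regions; the keyword list and token rules are invented for illustration, and a real editor would also keep per-line lexer state so that edits only re-lex the affected area.

    import re

    KEYWORDS = {"if", "else", "while", "return"}

    TOKEN_RE = re.compile(r"""
        (?P<comment>\#[^\n]*)           |   # line comment
        (?P<string>"(?:\\.|[^"\\])*")   |   # string literal, escapes allowed
        (?P<word>[A-Za-z_]\w*)          |   # identifier or keyword
        (?P<other>.|\n)                     # anything else
    """, re.VERBOSE)

    def lex_regions(source):
        """Return (start, end, style) regions covering the whole source."""
        regions = []
        for m in TOKEN_RE.finditer(source):
            kind = m.lastgroup
            if kind == "word":
                kind = "keyword" if m.group() in KEYWORDS else "identifier"
            elif kind == "other":
                kind = "plain"
            regions.append((m.start(), m.end(), kind))
        return regions

    code = 'if x: return "a # not a comment"  # real comment'
    for start, end, style in lex_regions(code):
        print(style, repr(code[start:end]))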
Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?
Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimize or generate code.
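For instance, a minimal code-generating visitor might look like the following Python sketch; the tuple-based AST shapes are assumptions made up for this example, not any particular compiler's representation.

    def emit(node):
        kind = node[0]
        if kind == "num":                   # ("num", 42)
            return str(node[1])
        if kind == "var":                   # ("var", "x")
            return node[1]
        if kind == "binop":                 # ("binop", "+", left, right)
            _, op, left, right = node
            return "(%s %s %s)" % (emit(left), op, emit(right))
        if kind == "assign":                # ("assign", "x", expr)
            _, name, expr = node
            return "%s = %s" % (name, emit(expr))
        raise ValueError("unknown node kind: %r" % kind)

    ast = ("assign", "y",
           ("binop", "*", ("var", "x"), ("binop", "+", ("num", 1), ("num", 2))))
    print(emit(ast))   # y = (x * (1 + 2))

Note that the generator always parenthesizes, so every tree maps to exactly one string; that determinism is part of why this direction is easier than parsing.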
Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, somewhat like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have to see all the ugly detail.
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.
I rather like lewap's response:
find a mathematical way to express a visitor and you have a dual to the parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. Both it and the code editors are implemented as views of the model, and editing either modifies the model, so that all views remain in sync.
Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.
There are theory, working implementations and examples of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at the following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json
I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.
In addition to 'Visitor', 'unparser' is another good keyword to web-search for.
That sounds a lot like the back end of a non-optimizing compiler that has its target language the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.
I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you read each item's data and transform it into destination code.
There are several "lens languages" that allow bidirectional transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.
To take an example, consider a set of discounts available to a supermarket shopper.
We could define these rules as data in some standard fashion (lists of qualifying items, applicable dates, coupon codes) and write generic code to handle these. Or, we could write each as a chunk of code, which checks for the appropriate things given the customer's shopping list and returns any applicable discounts.
You could reasonably store the rules as objects, serialised into Blobs or stored in code files, so that each rule could choose its own division between data and code, to allow for future rules that wouldn't fit the type of generic processor considered above.
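For illustration only, the two alternatives might look like the following Python sketch; the rule fields, item names and percentages are invented, not taken from any real system.

    from datetime import date

    # Option 1: each rule is data, handled by generic code.
    rules_as_data = [
        {"qualifying_items": {"milk", "bread"}, "percent": 10,
         "valid_until": date(2030, 1, 1)},
    ]

    def discounts_from_data(shopping_list, today=None):
        today = today or date.today()
        for rule in rules_as_data:
            if rule["valid_until"] >= today and rule["qualifying_items"] & shopping_list:
                yield rule["percent"]

    # Option 2: each rule is a chunk of code with an agreed interface.
    def ten_percent_off_dairy(shopping_list):
        if {"milk", "bread"} & shopping_list:
            yield 10

    rules_as_code = [ten_percent_off_dairy]

    def discounts_from_code(shopping_list):
        for rule in rules_as_code:
            yield from rule(shopping_list)

    basket = {"milk", "eggs"}
    print(list(discounts_from_data(basket)), list(discounts_from_code(basket)))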
It's often easy to criticise code that mixes data in, via if statements that check for 6 different things that should be in a file or a database, but is there a rule that helps in the edge cases?
Or is this the point of Object Oriented design, to stop us worrying about the line between data and code?
To clarify, the underlying question is this: How would you code the above example? Is there a rule of thumb that made you decide what is data and what is code?
(Note: I know, code can be compiled, but in a world of dynamic languages and JIT compilation, even that is a blurry concept.)
Fundamentally, there is of course no difference between data and code, but for real software infrastructures, there can be a big difference. Apart from obvious things like, as you mentioned, compilation, the biggest issue is this:
Most sufficiently large projects are designed to produce "releases" that are one big bundle, produced in 3-month (or longer) cycles, tested extensively and not changed afterwards except in tightly controlled ways. "Code" most definitely cannot be changed, so anything that does need to be changed has to be factored out and made "configuration data", so that changing it becomes palatable to those whose job it is to ensure that a release works.
Of course, in most cases bad configuration data can break a release just as thoroughly as bad code, so the whole thing is largely an illusion - in reality it doesn't matter whether it's code or "configuration data" that changes, what matters is that the interface between the main system and the parts that change is narrow and well-defined enough to give you a good chance that the person who does the change understands all consequences of what he's doing.
This is already harder than most people think when it's really just a few strings and numbers that are configured (I've personally witnessed a production mainframe system crash because it had one boolean value set differently than another system it was talking to). When your "configuration data" contains complex logic, it's almost impossible to achieve. But the situation isn't going to be any better just because you use a badly designed ad hoc "rules configuration" language instead of "real" code.
This is a rather philosophical question (which I like) so I'll answer it in a philosophical way: with nothing much to back it up. ;)
Data is the part of a system that can change. Code defines behavior; the way in which data can change into new data.
To put it more accurately: Data can be described by two components: a description of what the datum is supposed to represent (for instance, a variable with a name and a type) and a value.
The value of the variable can change according to rules defined in code. The description does not change, of course, because if it does, we have a whole new piece of information.
The code itself does not change, unless requirements (what we expect of the system) change.
To a compiler (or a VM), code is actually the data on which it performs its operations. However, the to-be-compiled code does not specify behavior for the compiler, the compiler's own code does that.
It all depends on the requirement. If the data is like lookup data and changes frequently, you don't really want to do it in code; but things like the days of the week should not change for the next 200 years or so, so code that.
You might consider changing your topic, as the first thing I thought of when I saw it was the age-old LISP discussion of code vs data. Luckily, in Scheme code and data look the same, but that's about it; you can never accidentally mix code with data, as is very possible in LISP with unhygienic macros.
Data are information that are processed by instructions called Code. I'm not sure I feel there's a blurring in OOD, there are still properties (Data) and methods (Code). The OO theory encapsulates both into a gestalt entity called a Class but they are still discrete within the Class.
How flexible you want to make your code is a matter of choice. Including constant values (what you are doing by using if statements as described above) is inflexible without re-processing your source, whereas using dynamically sourced data is more flexible. Is either approach wrong? I would say it really depends on the circumstances. As leppie said, there are certain 'data' points that are invariant, like the days of the week, which can be hard-coded, but even there it may be advantageous to do it dynamically in certain circumstances.
In Lisp, your code is data, and your data is code.
In Prolog, clauses are terms, and terms are clauses.
The important point is that you want to separate out the part of your code that will execute the same every time (i.e. applying a discount) from the part of your code which could change (i.e. the products to be discounted, or the % of the discount, etc.).
This is simply for safety. If a discount changes, you won't have to re-write your discount code, you'll only need to go into your discounts repository (DB, or app file, or xml file, or however you choose to implement it) and make a small change to a number.
Also, if the discount configuration is separated into an XML file, then you can give the entire application to a manager, and with sufficient instructions, they won't need to pester you whenever they want to change the discount rates.
When you mix data and code, you exponentially increase the odds of something breaking when anything changes. So, as leppie said, you need to extract the constantly changing parts and put them in a separate place.
Huge difference. Data is a given to the system, while code is a part of the system.
Wrong data is merely meaningless: our code (the handler) is good, and what you put in is what you get out; it is not the system's fault that you meant something else. But if the code is bad, the system is bad.
As an example, consider some JSON, some bad parser.js code written by me, and, let's say, the good V8. For my system the bad parser.js is code, and my system works incorrectly. But for Google's system my bad parser is just data, and it says nothing about the quality of V8.
The question is very practical, not sophistic.
Systems engineering (https://en.wikipedia.org/wiki/Systems_engineering) tries to give good answers here (and make money).
Data is information. It's not about where you decide to put it, be it a db, config file, config through code or inside the classes.
The same happens for behaviors / code. It's not about where you decide to put it or how you choose to represent it.
The line between data and code (program) is blurry. It's ultimately just a question of terminology - for example, you could say that data is everything that is not code. But, as you wrote, they can be happily mixed together (although usually it's better to keep them separate).
Code is any data which can be executed. Now since all data is used as input to some program at some point of time, it can be said that this data is executed by a program! Thus your program acts as a virtual machine for your data. Hence in theory there is no difference between data and code!
In the end what matters is software engineering/development considerations like performance, efficiency etc. For example data driven programs may not be as efficient as programs which have hard coded (and hence fragile) conditional statements. Hence I choose to define code as any data which can be efficiently executed and all else being plain data.
It's a tradeoff between flexibility and efficiency. Executable data (like XML rules) offers more flexibility (sometimes) while the same data/rules when coded as part of the application will run more efficiently but changing it frequently becomes cumbersome. In other words executable data is easy to deploy but is inefficient and vice-versa. So ultimately the decision rests with you - the software designer.
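A tiny Python sketch of that tradeoff, with made-up rates: the same dispatch written once as hard-coded conditionals and once as data driving a generic lookup.

    def shipping_cost_hardcoded(country):
        # Fast and simple, but changing a rate means changing (and redeploying) code.
        if country == "US":
            return 5
        elif country == "DE":
            return 9
        else:
            return 20

    SHIPPING_RATES = {"US": 5, "DE": 9}   # the "data" form: easy to edit or load from a file

    def shipping_cost_data_driven(country):
        return SHIPPING_RATES.get(country, 20)

    print(shipping_cost_hardcoded("DE"), shipping_cost_data_driven("DE"))   # 9 9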
Please correct me if I am wrong.
The relationship between code and data is as follows:
code, after being compiled into a program, processes data during execution
a program can extract data, transform data, load data, generate data...
Also
a program can extract code, transform code, load code, generate code too...
Hence code that is not compiled or interpreted is useless, while data always has worth; but code, once compiled, can do all of the above activities.
For example:
a source control system processes source code; here the source code itself is code
backup scripts process files; here the files are data, and so on...
I would say that the distinction between data, code and configuration is something to be made within the context of a particular component. Sometimes it's obvious, sometimes less so.
For example, to a compiler, the source code it consumes and the object code it creates are both data - and should be separated from the compiler's own code.
In your case you seem to be describing the option of a particularly powerful configuration file, which can contain code. Much as, for example, the GIMP lets you 'configure' plugins using Scheme. As the developer of the component that reads this configuration, you would think of it as data. When working at a different level -- writing the configuration -- you would think of it as code.
This is a very powerful way of designing.
Applying this to the underlying question ("How would you code the above example?"), one option might be to adopt or design a high level Domain Specific Language (DSL) for specifying rules. At startup, or when first required, the server reads the rule and executes it.
Provide an admin interface allowing the administrator to
test a new rule file
replace the current configuration with that from a new rule file
... all of which would happen at runtime.
A DSL might be something as simple as a table parser or an XML parser, or it could be something as sophisticated as a scripting language. From C, it's easy to embed Python or Lua. From Java it's easy to embed Groovy or Clojure.
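Here is a rough sketch of the "read the rule file at startup and execute it" idea, using plain Python as the embedded scripting language; the file name, the exposed rules list and the lambda-based rule format are assumptions for illustration only.

    RULE_FILE = "discount_rules.py"   # hypothetical path, edited by the administrator

    def load_rules(path):
        """Execute the rule file in its own namespace and collect its rules."""
        namespace = {"rules": []}
        with open(path) as f:
            exec(compile(f.read(), path, "exec"), namespace)   # run the admin-edited code
        return namespace["rules"]

    def apply_rules(rules, basket_total):
        return sum(rule(basket_total) for rule in rules)

    # discount_rules.py might contain, for instance:
    #     rules.append(lambda total: total * 0.10 if total > 100 else 0)
    #
    # "Replace the current configuration" then amounts to swapping the file
    # and calling load_rules() again at runtime.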
You could switch in compiled code at runtime, with clever linking or classloader tricks. This seems more difficult and less valuable than the embedded DSL option, in my opinion.
The best practical answer to this question I found is this:
Any class that needs to be serialized, now or in any foreseeable future, is data.
Everything else is code.
That's why, for example, Java's HashMap is data - although it has a lot of code, API methods and a specific implementation (i.e., it might look like code at first glance).