what is the name of the convention used in this syntax diagram

I found this diagram in the JSON specification:
(source: json.org)
Where does this diagramming convention come from? Is it just some random convention cooked up by D.C.?

That's called a railroad diagram. Have a look at the Wikipedia article for more information.

That diagram is known as a syntax diagram or railroad diagram. It's used to visually represent context-free grammars. It's a graphical depiction of the Extended Backus-Naur Form, which is also used to represent context-free grammars.

Related

Differences between EER and UML

I've downloaded MySQL Workbench and can create an EER diagram.
What's the difference between this and a UML diagram?
Where does an ERD come into this?

I think by UML diagram you mean a UML class diagram (there are other kinds of UML diagram as well).
EER (Enhanced entity–relationship) Diagram-Model
Used for database design. Like class diagrams, EER diagrams support subclass-superclass relationships (specialization and generalization). Entities in EER diagrams have attributes but no methods, because they show just plain data.
Note: ER (entity–relationship) diagrams are the origin of EER. They come from Structured Analysis and are also used for database modeling. How did ER become EER? I think because of the object-oriented style hype.
UML Class Diagrams
Used for object-oriented analysis and design.
Can also be used to model databases: there are UML profiles for it.
(I think UML profiles for database design are NOT as good as ER diagrams.)
In simple terms, classes are blueprints from which objects are instantiated. So classes may have methods (functions) as well as attributes. Software classes definitely have methods, but conceptual classes (used for domain modeling) may not.
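To make the attributes-versus-methods distinction concrete, here is a rough sketch in Python. The names CustomerEntity, Customer and email_domain are made up for illustration; the first class only mimics what an EER entity describes, the second what a UML software class might add.

```python
from dataclasses import dataclass

# EER-style entity: attributes only, roughly what a table row holds.
@dataclass
class CustomerEntity:
    customer_id: int
    name: str
    email: str

# UML-style software class: the same attributes plus behaviour (methods).
class Customer:
    def __init__(self, customer_id: int, name: str, email: str) -> None:
        self.customer_id = customer_id
        self.name = name
        self.email = email

    def email_domain(self) -> str:
        """Behaviour that an ER/EER entity would not model."""
        return self.email.split("@", 1)[1]

row = CustomerEntity(1, "Ada", "ada@example.org")
obj = Customer(row.customer_id, row.name, row.email)
domain = obj.email_domain()
```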

The primary difference between ERD and UML is that ERD stands for Entity Relationship Diagram, a type of diagram (as explained above), while UML stands for Unified Modeling Language, which is essentially a standard defining a modeling language commonly used in software development, especially in object-oriented programming.
UML also proposes standard diagram types (as noted above), usually grouped into structural or behavioral diagrams. Most DB GUIs use ERDs, which are better suited to the domain, and most users don't need formal, academic diagrams.
Note that most software teams like to cite UML as a reference, yet don't usually implement UML to full spec or with academic rigor when creating documentation diagrams.
As a rule of thumb, if you want to model DB entities/relationships you are probably looking for an ERD, but if you want to model an entire program/system then you probably need one or more of the different UML diagrams.

What does "Markup should be rigorous" mean?

The ISO definition of generalized markup states:
Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and databases can be used for processing documents as well.
What does "rigorous" mean in this context?
I found a paper which says:
...the type definition and the marked up document together [...] constitute the rigorously described document that machine processing requires.
...but I'm still unclear on the exact definition.

rigorous (comparative more rigorous, superlative most rigorous)
Manifesting, exercising, or favoring rigour; allowing no abatement or mitigation; scrupulously accurate; exact; strict; severe; relentless; as, a rigorous officer of justice; a rigorous execution of law; a rigorous definition or demonstration.
I.e., the rules for a markup language should be specified in such a way as to leave no ambiguity and no doubt as to their interpretation.

Rigorous XML for instance just means valid XML that follows all the markup rules.
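As a small illustration of what "following the markup rules" buys you, a strict parser can accept or reject a document mechanically. This sketch uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is my own, and note that it only checks well-formedness, since ElementTree does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<note><to>Ann</to></note>")
bad = is_well_formed("<note><to>Ann</note>")  # <to> is never closed
```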

Tools for describing JSON schemas

I'm writing a spec and need to describe some JSON objects. Big JSON documents tend to get too confusing with text and tabs alone. Is there any tool (preferably online) to create diagrams like the ones on http://www.json.org/ or http://www.sqlite.org/lang_altertable.html? They use them to describe syntax, but is there anything like it to describe JSON objects? They are great for representing objects that are required, optional, arrays, etc.

These types of syntax diagrams are known as "railroad diagrams".
There is an online tool at http://bottlecaps.de/rr/ui that you can use to generate your own diagrams. You must specify your grammar in EBNF notation.

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?

They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.
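A quick way to see this, sketched with Python's built-in ast module (used here only as a convenient example of an AST; your interpreter's tree will look different): two spellings of the same expression give the same tree, because spacing and redundant parentheses are concrete syntax.

```python
import ast

# Two spellings of the same expression: different spacing and redundant parentheses.
a = ast.parse("1 + 2 * 3", mode="eval")
b = ast.parse("1+(2*3)", mode="eval")

# ast.dump prints the tree structure without source positions by default.
same_tree = ast.dump(a) == ast.dump(b)

# The root is an addition whose right operand is the multiplication.
root = a.body
```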

The main benefit of using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters built around ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax, you look at the parser that builds the AST; if a piece of code is not producing the expected results, then you look at the code that interprets the AST.
The other great advantage shows up when your syntax requires "lookahead", e.g. if your syntax allows a subroutine to be used before it is defined. It is trivial to validate the existence of a subroutine when you are using an AST; it's much more difficult with an "on the fly" parser.
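The use-before-definition check can be sketched with Python's ast module. The function name undefined_calls and the sample program are made up for illustration, and the check is deliberately naive: it only looks at plain function names and would also flag builtins.

```python
import ast

SOURCE = """
def main():
    helper()      # used before its definition appears
    missing()     # never defined anywhere

def helper():
    pass
"""

def undefined_calls(source):
    """Return names called as plain functions but never defined in `source`."""
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    called = {
        n.func.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    }
    return called - defined

result = undefined_calls(SOURCE)
```

Because the whole tree exists before the check runs, the order of definitions in the source does not matter.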

You need "syntax trees" to represent the structure of most programming languages, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specifically used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax such as statement terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs is less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually an isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.

In general you are going to parse your code into some form of AST; it may be more or less of a formal model. So I think what Kirk Woll was getting at in his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages, and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST, which you can operate on like a much simpler version of the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with it once you have parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terence Parr, the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.

Late to the question but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied in the parsing grammar. For simple languages, especially dynamically typed languages, this is a perfectly OK strategy. For more complex languages, or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, i.e. your variables are declared with fixed types, then the AST can be used to check that you're not assigning the wrong type to a variable. E.g. assigning a string to a variable that is declared to hold an integer would be wrong, and this can be caught more conveniently with the AST.
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.
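The string-assigned-to-an-integer check might look something like this, again using Python's ast module as a stand-in for your own tree. The function name mismatched_assignments is hypothetical, and the check only handles int and str annotations with literal values; a real type checker does far more.

```python
import ast

def mismatched_assignments(source):
    """List variables declared `int` or `str` but initialised with the other literal type."""
    expected = {"int": int, "str": str}
    problems = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.AnnAssign)
            and isinstance(node.target, ast.Name)
            and isinstance(node.annotation, ast.Name)
            and node.annotation.id in expected
            and isinstance(node.value, ast.Constant)
        ):
            if type(node.value.value) is not expected[node.annotation.id]:
                problems.append(node.target.id)
    return problems

errors = mismatched_assignments('count: int = "three"\nname: str = "Ada"\n')
```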

Theory, examples of reversible parsers?

Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?

Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimizing or generating code.
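As a minimal sketch of a visitor that generates code, here is one built on Python's ast.NodeVisitor. The class name ExprPrinter is my own, and it only handles names, constants and the four basic arithmetic operators.

```python
import ast

class ExprPrinter(ast.NodeVisitor):
    """Turn a small arithmetic AST back into fully parenthesised source text."""

    SYMBOLS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def visit_Expression(self, node):
        return self.visit(node.body)

    def visit_BinOp(self, node):
        op = self.SYMBOLS[type(node.op)]
        return f"({self.visit(node.left)} {op} {self.visit(node.right)})"

    def visit_Constant(self, node):
        return repr(node.value)

    def visit_Name(self, node):
        return node.id

text = ExprPrinter().visit(ast.parse("a+2*3", mode="eval"))
# text is "(a + (2 * 3))"
```

The output is not the original text, but it parses back to the same tree.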

Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, something like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But as with the parser generator, you don't have to see all the ugly detail.
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.

I rather like lewap's response:
find a mathematical way to express a visitor and you have a dual to the parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. Both it and the code editors are implemented as views of the same model, so editing in either one modifies the model and all views stay in sync.

Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.

There is theory, along with working implementations and examples, of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at the following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json

I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.

In addition to 'Visitor', 'unparser' is another good keyword to web-search for.

That sounds a lot like the back end of a non-optimizing compiler that has its target language the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
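You can see the difference with a round trip through Python's own AST, assuming Python 3.9 or later for ast.unparse. The regenerated code is functionally equivalent, but the comment and the odd spacing in the sample source are gone because the tree never stored them.

```python
import ast

original = "x  =  1   # set x\nif x:\n        print( x )\n"

# Parse to an AST, then generate source text back from the tree.
regenerated = ast.unparse(ast.parse(original))

# Same tree, different text: layout and comments did not survive.
same_tree = ast.dump(ast.parse(regenerated)) == ast.dump(ast.parse(original))
comment_survived = "#" in regenerated
```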
One thing to look at would be automatic code refactoring tools.

I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.

The "Visitor Pattern" idea is good. But I would treat the plain "Visitor" pattern as a pattern for linear lists, or as a generic pattern, and add patterns for more specific cases like lists, matrices, and trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you have to read each item's data and transform it into destination code.

There are several "lens languages" that allow bidirectional transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.
