Syntax highlight design pattern - language-agnostic

I'm looking for some good overviews of best practices and common patterns for enabling syntax highlighting in a textbox. It seems like a very common exercise: almost all languages have a UI control that enables syntax highlighting in different languages. I'm just curious to see if there is a common pattern of implementation.
Is everyone using regular expressions? Is there a repository for regular expressions that are commonly used in syntax highlighting scenarios?
Are there alternative/better approaches to syntax highlighting?
Update
Links to relevant resources about performing syntax highlighting in a given language or concepts related to syntax highlighting would be great. Lexing (lexical analysis) was brought up in an answer but without a link to learn more. Anything to help better understand this commonly solved problem would be great.
Lexical Analysis on Wikipedia

Regular expressions are definitely the first place most people start. However, they can't really cope with the many edge cases found in most languages: text that looks like keywords can be found in string literals, and string literals in turn can contain escaped delimiters as well as special characters. The same goes for comments, etc.
Basically to do a good job of syntax highlighting you need to perform lexing of the source - parsing it with the application of language-specific heuristics to build a list of regions, where each region of the source is annotated with how it is to be styled.
As edits take place, you can again apply language rules to see how far this change can alter the presentation of a region. For example typing a letter inside a string literal simply makes the string literal region longer, but typing a closing quote truncates the region and turns the leftover part of it into code, subject to all the other lexing rules.
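A minimal sketch of the "list of regions" idea in Python. The mini-language here is an assumption made up for illustration (`#` line comments, double-quoted strings with backslash escapes, four keywords), and `highlight_regions` and the style names are hypothetical, not taken from any real editor control. Regexes are still used, but only to recognize one token at a time, with comments and strings tried before identifiers:

```python
import re

# Hypothetical keyword set for an imaginary language.
KEYWORDS = {"if", "else", "while", "return"}

# Order matters: comments and strings are tried before words, so
# keyword-like text inside them is never styled as a keyword.
TOKEN_RE = re.compile(r"""
    (?P<comment>\#[^\n]*)               # line comment up to end of line
  | (?P<string>"(?:\\.|[^"\\\n])*"?)    # string, escaped quotes allowed
  | (?P<word>[A-Za-z_]\w*)              # identifier or keyword
  | (?P<other>.)                        # anything else, one char at a time
""", re.VERBOSE | re.DOTALL)


def highlight_regions(source):
    """Return a list of (start, end, style) regions covering `source`."""
    regions = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "word":
            style = "keyword" if m.group() in KEYWORDS else "plain"
        elif kind == "other":
            style = "plain"
        else:
            style = kind  # "comment" or "string"
        # Merge with the previous region when the style is unchanged.
        if regions and regions[-1][2] == style:
            regions[-1] = (regions[-1][0], m.end(), style)
        else:
            regions.append((m.start(), m.end(), style))
    return regions


if __name__ == "__main__":
    code = 'x = "if \\"a\\"" # return\nif x'
    for start, end, style in highlight_regions(code):
        print(style, repr(code[start:end]))
```

This sketch rescans the whole text on every call. An editor would more likely keep the region list and rescan only from the start of the region touched by an edit, as described above.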

Related

How to write a parser for Markup?

I would like to program a parser for a Markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation, embedding images but allowing HTML for more complex formatting.
Has anyone done similar things or has worked closely with those systems and could describe generally how this could be done efficiently?
Although efficiency is not really of concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.
For your general discussion…
You should start with the following blueprint:
you need to iterate charwise over entire data
you need to identify every char by its context, for it may be a tag-opening ('<', '[' etc) or just the char. This may be done by having an escapement flag, triggered by an escape-char (like backslashes in some languages do). if you use that approach, you also need to check for an escaped escapement.
you may also need some flag telling you to be inside a comment or special data section, that may have different escapement rules.
you need to build a tree-like structure or at least some stack for nested tags. This is why regexes are a bad idea: they not only add far too much overhead, they're also of no use if you want to get the correct closing tag for the second x (x=any tag) in the following snippet: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>
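The blueprint above can be sketched in Python roughly as follows. `match_tags` is a hypothetical helper written for this example; it assumes tags have no attributes, treats `<!-- ... -->` as the only special section, and leaves out the escape-flag handling mentioned in the list:

```python
def match_tags(text):
    """Map the position of each opening tag to the position of its closing tag."""
    stack = []   # (tag name, position) of tags that are still open
    pairs = {}   # opening position -> closing position
    i = 0
    while i < len(text):
        if text.startswith("<!--", i):
            # Inside a comment nothing counts as a tag: skip to its end.
            end = text.find("-->", i + 4)
            i = len(text) if end == -1 else end + 3
        elif text.startswith("</", i):
            end = text.index(">", i)
            name = text[i + 2:end]
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unexpected </{name}> at {i}")
            _, open_pos = stack.pop()
            pairs[open_pos] = i
            i = end + 1
        elif text[i] == "<":
            end = text.index(">", i)
            stack.append((text[i + 1:end], i))
            i = end + 1
        else:
            i += 1   # ordinary character
    return pairs


if __name__ == "__main__":
    snippet = "<x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>"
    close_pos = match_tags(snippet)[3]   # the second <x> starts at index 3
    print(snippet[close_pos - 10:close_pos + 4])
```

The stack is what lets the scan skip over the inner pairs and the commented-out `</x>` and land on the closing tag marked "this one".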

What would it take to evolve regex into something that can parse HTML? [duplicate]

Reading this amusing rant ( RegEx match open tags except XHTML self-contained tags ) I wondered ... how could regexes be changed to successfully parse HTML?
I'm looking here for suggestions that :
make the minimal addition to regexes as we know and love them (ie. not "make them look like XSLT!" type answers)
are robust enough to work properly.
suggest syntax (not just list the general requirements)
Has anyone actually made something like this?
Add a new escape sequence:
\H -- match HTML document
DOM/XML parsers internally use regex to parse html. The difference between them and using ONLY regex is to make up for the shortcomings of regex. One of the major shortcomings of regex is handling nested tags and malformed code (like missing tags). So around the basic regex, all sorts of algorithms and conditions are written to try and handle those things. And then there is of course the parts that actually create an object out of it.
So you asked what it would take to make regex do what a DOM/XML parser does? You would have to somehow cram all those algorithms and conditions into the regex engine, internally and within pattern syntax.
I personally do not wish for this to happen. IMO regex should be straight pattern matching. IMO it already has some stuff in it that IMO is questionable (some regex flavors do in fact have a way to use conditions, for instance). Taking the regex engine and then building a larger tool around it (like a DOM/XML parser) IMO is the best way to go.
It's interesting that real world tools can be and often are modified to perform tasks they might not otherwise be suited for. For example, if someone were to attempt to eat broth with a fork, they would be largely unsuccessful. Enter the spork.
I don't think programmers necessarily work that way all the time. It's not uncommon for tools to expand their scope, but it's also been a long tradition that programmers try to use specific tools for specific purposes.
Now, it just so happens that in order for regex to be able to parse HTML, it would have to be a pattern matcher/recognizer that also remembered state. This is, to a T, exactly what a parser does. It uses pattern matching (indeed, it often uses regex!) in order to match tokens. It then remembers combinations of tokens.
So in fact regex is used very frequently to parse HTML, along with other functions that remember larger patterns that cannot be described or processed using regex alone.
Hope that answers the question.
Perl 6 has a regex extension that is designed to do that: http://en.wikipedia.org/wiki/Perl_6_rules.
Depends what you mean by "parse". Typically this involves transforming a character stream into an object tree. To do this with regular expressions you would need to completely change capturing groups to be runtime-variable multi-node tree, rather than the compile-time-fixed array that they currently are. Once you've done that you've just re-written lex/yacc.

How do HTML parsers work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing for me; I always thought that in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?
One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert a HTML string to DOM nodes.
Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.
So how does a HTML parser work? Doesn't it use regular expressions to parse?
Well, no.
If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.
The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages uses state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, the list of tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.
A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.
A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.
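As a small illustration of the stack-as-memory idea, here is a sketch in Python. `is_balanced` and the `BRACKETS` table are names invented for this example, and it checks three bracket kinds rather than parentheses only:

```python
# Hypothetical bracket table: opener -> expected closer.
BRACKETS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(BRACKETS.values())


def is_balanced(text):
    """Return True if every bracket in `text` is correctly matched and nested."""
    stack = []  # closers we still expect, innermost last
    for ch in text:
        if ch in BRACKETS:
            stack.append(BRACKETS[ch])
        elif ch in CLOSERS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != ch:
                return False
    return not stack  # nothing may be left open


if __name__ == "__main__":
    for sample in ["(()())", "(()", "([)]"]:
        print(sample, is_balanced(sample))
```

The stack can grow as deep as the input requires, which is exactly the memory a finite automaton lacks.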
Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.
The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.
To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.
HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.
Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".
Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.
The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).
Parsing HTML is the transformation of a linear text into a tree structure. Regular expressions cannot generally handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.
If you want to have a 100% solution: You need to write your own custom code that iterates through the HTML character-by-character and you need to have a tremendous amount of logic to determine if you should stop the current node and start the next.
The reason is that this is valid HTML:
<ul>
<li>One
<li>Two
<li>Three
</ul>
But so is this:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
If you are ok with "90% solution": Then using an XML parser to load a document is fine. Or using Regex (though the xml is easier if you are then master of the content).
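To make the `<li>` case concrete, here is a rough sketch that uses Python's standard `html.parser.HTMLParser` as the tokenizer and adds a stack on top. `ListTreeBuilder` and `parse` are hypothetical names, and the only implied-end-tag rule handled is "a new `<li>` closes an open `<li>`"; the real HTML parsing algorithm has many more such rules:

```python
from html.parser import HTMLParser


class ListTreeBuilder(HTMLParser):
    """Builds a nested dict tree; a new <li> implicitly closes an open <li>."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        # The one "tremendous amount of logic" rule this sketch implements.
        if tag == "li" and self.stack[-1]["tag"] == "li":
            self.stack.pop()
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element and everything inside it;
        # an end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


def parse(html):
    builder = ListTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    loose = "<ul>\n<li>One\n<li>Two\n<li>Three\n</ul>"
    strict = "<ul>\n<li>One</li>\n<li>Two</li>\n<li>Three</li>\n</ul>"
    print(parse(loose) == parse(strict))
```

Both spellings of the list end up as the same tree, which is the part a strict XML parser or a single regex would not give you.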

Which language(s) have comments that are not comments?

What language(s) have comments with side effects? In essence, comments which are not comments....
English. Do I win?
DOS Batch Shell programming
The REM (Remark) allows you to put in a comment. But it has the side-effect of modifying the ERRORLEVEL variable to 0.
In a sense, it makes last operation a success.
I don't know how a comment can fail, but if it does, you are covered.
I can think of several places where comments aren't really comments.
HTML and script tags (providing support for browsers that don't allow or support scripts).
And then, considerably more obscurely:
IBM Informix 4GL (I4GL) and 4J's Genero (successor to Informix Dynamic 4GL, D4GL). The notation '--#' was used by D4GL to include material only applicable to D4GL; I4GL would see that as a comment. The inverse notation was '--#', which looked like a comment to D4GL but was treated as active material by I4GL.
And, even more obscurely:
I wrote an I4GL file which was dual-languaged, exploiting I4GL's multiple comment facilities. Material starting '#' (hash) marked the start of a comment outside of strings - up to the next newline, as does '--' (double-dash). Also, '{...}' (braces) enclose multiline comments.
The top of the source file was actually a shell script, mostly enclosed in '{...}' which is, of course, perfectly legitimate in shell. The shell script was a data-driven code generator that copied itself to the top of the output, and then generated about 100 functions which were all depressingly similar but slightly different (in a language without templates or a pre-processor). The code had to validate what was in the database for a given ship against incoming data from an external source (Lloyds of London, in fact), to see what had changed since the last time the external data was received. Non-trivial comparison work, especially since it had to deal with database (SQL) nulls.
The file was not really a Quine program, but it had some points in common with it. In particular, you could feed the script broken I4GL code and the regenerated file would be perfect again, basically because it ignored the existing I4GL code.
Haskell can turn the usual comments-in-code paradigm upside down by having code in comments, as can Mathematica and the like; literate programming is a nice feature for the more mathematically inclined languages.
I also find annotations in Java are like comments with behaviour.
Then of course there are "polyglots" -- programs which can be compiled/executed in multiple languages. Usually these rely on the fact that the same line is a comment in one language, but an actual line of code in another.
QBasic has a use of comments all its own: REM $STATIC or REM $DYNAMIC set how arrays are allocated.
Another example: When web browsers parse comments <!-- -- -->in<!-- -- -->correctly.
CSS for clever cross-browser hacks. Of course, I wouldn't really call CSS a language.
Just stumbled upon this old question and my first thought was javadoc comments.

Theory, examples of reversible parsers?

Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?
Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimizing or generating code.
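A small sketch of a visitor that turns a tree back into text, using Python's standard `ast` module (Python 3.8 or later assumed). `ExprPrinter` is a hypothetical class that handles only names, constants and four binary operators, and it parenthesizes everything rather than reproducing the original spelling:

```python
import ast


class ExprPrinter(ast.NodeVisitor):
    """Visitor that regenerates source text for simple arithmetic expressions."""

    OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def visit_Expression(self, node):
        return self.visit(node.body)

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        # Always parenthesize, so precedence never has to be worked out.
        return f"({left} {self.OPS[type(node.op)]} {right})"

    def visit_Constant(self, node):
        return repr(node.value)

    def visit_Name(self, node):
        return node.id

    def generic_visit(self, node):
        raise NotImplementedError(type(node).__name__)


if __name__ == "__main__":
    tree = ast.parse("a + 2 * b", mode="eval")
    print(ExprPrinter().visit(tree))
```

Parsing the printed text again gives back an equivalent tree, even though the text itself differs from the input.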
Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, something like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have to see all the ugly detail.
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.
I rather like lewap's response:
"find a mathematical way to express a visitor and you have a dual to the parser"
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. The way both it and the editors are implemented, all constitute views of the model, and editing either modifies the model resulting in all remaining in synch.
Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.
There are theory, working implementations and examples of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at the following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json
I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.
In addition to 'Visitor', 'unparser' is another good keyword to web-search for.
That sounds a lot like the back end of a non-optimizing compiler that has its target language the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
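To see the "functionally equivalent, not identical" case in practice, here is a short sketch using Python's standard `ast.parse` and `ast.unparse` (the latter needs Python 3.9 or later). `round_trip` is a hypothetical helper written for this example:

```python
import ast


def round_trip(source):
    """Parse `source` to an AST and regenerate source text from the tree."""
    tree = ast.parse(source)
    return ast.unparse(tree)


if __name__ == "__main__":
    original = "x   =  (1+2)   # layout and comments are not in the AST\n"
    regenerated = round_trip(original)
    print(regenerated)
    # The text differs, but both versions parse to the same tree.
    print(ast.dump(ast.parse(original)) == ast.dump(ast.parse(regenerated)))
```

The comment and the original spacing are gone from the output because the tree never stored them; only the structure survives the round trip.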
One thing to look at would be automatic code refactoring tools.
I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
The "Visitor Pattern" idea is good. But I would consider the "Visitor" pattern either a linear-list pattern or a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you have to read each item's data and transform it into destination code.
There are several "lens languages" that allow bidirectional transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.