How do HTML parses work if they're not using regexp?

How do HTML parses work if they're not using regexp? - html

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing for me, I always thought that in general, the best way to parse any complicated string is to use a regular expression. So how does a HTML parser work? Doesn't it use regular expressions to parse.
One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert a HTML string to DOM nodes.
Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.

So how does a HTML parser work? Doesn't it use regular expressions to parse?
Well, no.
If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.
The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages use state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, the list of tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.
A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.
A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.
Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.
The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.
To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.
HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.

Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".

Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.
The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).

Parsing HTML is the transformation of a linear text into a tree structure. Regular expressions cannot generally handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.

If you want to have a 100% solution: You need to write your own custom code that iterates through the HTML character-by-character and you need to have a tremendous amount of logic to determine if you should stop the current node and start the next.
The reason is that this is valid HTML:
<ul>
<li>One
<li>Two
<li>Three
</ul>
But so is this:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
If you are ok with "90% solution": Then using an XML parser to load a document is fine. Or using Regex (though the xml is easier if you are then master of the content).

Related

What should I read on json and html parsers to build one myself?

I want to create a json and html parser to deepen my knowledge in them (I don't want to reinvent it to be "more efficient", as you could think).
What should I read to succede with it?
P.S: I know about parsing laws, but couldn't find some on json.
P.P.S: C++ implementation is my target.

JSON is specified in RFC 8259 (using EBNF) and ECMA-404 (using railroad diagrams). Since they both define the same grammar, which of the two you use is unimportant; go for the one you fibd easier.
JSON parsing is pretty simple. HTML, on the other hand, is a huge project, made more complicated by the absence of a versioned authoritative standard which makes it a bit of a moving target.
HTML parsing as currently defined by the "living standard" is a procedure which probably cannot be encapsulated in a context-free grammar. No real attempt is made to use grammatical descriptions in the standard, although it is possible to extract at least a lexical grammar, if you ignore the sections dealing with the handling of lexical errors.
Certainly, you could write a parser for a well-behaved subset, but that parser might not cope well with many of the "HTML" documents you will want to process. Personally, for learning purposes, I'd suggest trying your hand at XML. (Also see XML Namespaces].

Custom parser for HTML5 and other languages

I'm attempting to write my very own custom parser (in C#) for (X)HTML5 and whatever might be embedded (EcmaScript, CSS) - just to learn and have fun. Although I'm an intermediate programmer I don't know much about parsers and all the technical stuff. I am able to create a lexical analyser (tokeniser) for HTML5 fairly easily but the syntactical analysis (parsing) is a bit tricky. I'm not sure if I should first lexically analyse all the source input and then do the other or try both at the same time; get char until I have a token, realise what the token syntactially means and then expect a certain token relevant to the previous one. The problem that I face is that HTML might have other languages such as CSS and JavaScript embedded and they, as far as I can see, will have different categories of tokens, so I'm not sure how to "know" where I am in the code as I tokenise it in order to have varying definitions of what a token "is". Any thoughts? Also, what are the benefits/drawbacks of analysing lexically first and then syntactically vs. doing both in at the same time?

If this is purely for your own education regarding parsing, I would suggest tacking a much smaller / easier field than HTML , CSS and JS parsing as HTML and JS both represent some really quite nasty parsing problems which even the most experienced parser writer would feel nervous tackling.
A language based off Scheme or Basic would probably be my first pick.
(A personal favourite is building a parser / interpreter as I go through http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-10.html )
(Also picking up a copy of something like Modern Complier Design probably wouldn't hurt: http://www.amazon.com/Modern-Compiler-Design-D-Grune/dp/0471976970 )
If it has to be web related in order to keep your interest, I'd take a stab at doing your parser for one of the smaller web related languages such as sass ( http://sass-lang.com )
On the other hand, if this is something work related where you really need to parse those specific things, I'd suggest skipping the effort of writing your own parser entirely and hook into something like the Razor or Chromium libraries.
And to directly answer at least the second half of your question: I would recommend always splitting the various phases of parsing / interpreting out as far as possible from each other.
Each problem is difficult enough on it's own without trying to be "too smart" and attempting to combine functionality into a single sweep.
Where ever possible I'd suggest keeping things as high level, abstract and "clean" as possible... thus construct a tree of nodes specifically for lexical parsing and another for syntactical parsing... and in the case of combined languages as HTML, CSS and JS, a different AST and parsing code for each.

There is a great course on Udacity [1] called Programming Languages that covers the full concept of HTML and Javacript processing .
It covers in depth the lexical analysis, parsing and interpretation . It only covers a subset of Javascript so you have further development ahead of you after you finish the course, but you will have acquired the general structure and the concepts .
[1] http://www.udacity.com/overview/Course/cs262/CourseRev/apr2012

How to write a parser for Markup?

I would like to program a parser for a Markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation, embedding images but allowing HTML for more complex formatting.
Has anyone done similar things or has worked closely with those systems and could describe generally how this could be done efficiently?
Although efficiency is not really of concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.

For your general discussion…
You should start with the following blueprint:
you need to iterate charwise over entire data
you need to identify every char by its context, for it may be a tag-opening ('<', '[' etc) or just the char. This may be done by having an escapement flag, triggered by an escape-char (like backslashes in some languages do). if you use that approach, you also need to check for an escaped escapement.
you may also need some flag telling you to be inside a comment or special data section, that may have different escapement rules.
you need to build a tree-like structure or at least some stack for nested tags. This is why regexes are a bad idea: they not only take much to much overhead, they're also of no use if you want to get the correct closing tag for the second x (x=any tag) in the following snipped: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?

They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.

The main benefit os using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters implemented as ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax you look at the AST parser , if a pices of code is not producing the expected results than you look at the code that interprets the AST.
The other great advantage is when you syntax requires "lookahead" e.g. if your syntax allows a subroutine to be used before it is defined it is trivial to validate the existence of a subroutine when you are using an AST - its much more difficult with an "on the fly" parser.

You need "syntax trees" to represent the structure of most programming langauges, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specially used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax such as statement terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs is less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually as isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.

In general you are going to parse you code into some form of AST, it may be more or less of a formal model. So I think what Kirk Woll was getting at by his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST which you can operate on like a much simpler version the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with with it once you parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terrence Parr the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.

Late to the question but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied in the parsing grammar. For simple languages, especially dynamically typed languages, this is a perfectly ok strategy. For more complex languages or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, ie your variables are declared with fixed types then the AST can be used to check that you're not assigning the wrong type to a variable. eg assigning a string to a variable that is declared to hold an integer would be wrong and this can be caught more conveniently with the AST.
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.

Theory, examples of reversible parsers?

Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?

Such thing is called a Visitor. Is traverses the tree and does whatever has to be done, for example optimize or generate code.

Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms something like Latex, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have see all the ugly detail.
DMS has some 30+ sets of these language front ends for a various programming langauge and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages from some machines, temporaral property specifications, specs for composable abstract algebras, etc.

I rather like lewap's response:
find a mathematical way to express a
visitor and you have a dual to the
parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. The way both it and the editors are implemented, all constitute views of the model, and editing either modifies the model resulting in all remaining in synch.

Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.

There are theory, working implementations and examples of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json

I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.

In addition to 'Visitor', 'unparser' is another good keyword to web-search for.

That sounds a lot like the back end of a non-optimizing compiler that has it's target language the same as it's source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.

I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.

The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure, that represents the result of scanning/parsing some source code. Then you have read each item's data, and transform it into destination code.

There are several "lens languages" that allow bidirection transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008