Why so much HTML input sanitization necessary? - html

I have implemented a search engine in C for my html website. My entire web is programmed in C.
I understand that html input sanitization is necessary because an attacker can input these 2 html snippets into my search page to trick my search page into downloading and displaying foreign images/scripts (XSS):
<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ? Wouldn't that render both scripts useless since they would not be considered html ? I've seen html filtering that goes way beyond this where they filter absolutely all the JavaScript commands and html markup !

Input sanitisation is not inherently ‘necessary’.
It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that eg. a phone number contains digits).
But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.
HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to <, & to &, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.
And there are other forms of escaping. If you try to create an SQL query with the user value in, you need to do SQL string literal escaping at that point, which is completely different to HTML escaping. If you want to put a submitted value in a JavaScript string literal you would have to do JSON-style escaping, which is again completely different. If you wanted to put a value in a URL query string parameter you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query ?
Well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!
< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.
My entire web is programmed in C.
I'm so sorry.

Encoding brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain-text.

Related

Uses for the '"' entity in HTML

I am revising some XHTML files authored by another party. As part of this effort, I am doing some bulk editing via Linq to XML.
I've just noticed that some of the original source XHTML files contain the " HTML entity in text nodes within those files. For instance:
<p>Greeting: "Hello, World!"</p>
And that when recovering the XHTML text via XElement.ToString(), the " entities are being replaced by plain double-quotes:
<p>Greeting: "Hello, World!"</p>
Question: Can anyone tell me what the motivation might have been for the original author to use the " entities instead of plain double-quotes? Did those entities serve a purpose which I don't fully appreciate? Or, were they truly unnecessary as I suspect?
I do understand that " would be necessary in certain contexts, such as when there is a need to place a double-quote within an HTML attribute. For instance:
<a href="/images/hello_world.jpg" alt="Greeting: "Hello, World!"">
Greeting</a>
It is impossible, and unnecessary, to know the motivation for using " in element content, but possible motives include: misunderstanding of HTML rules; use of software that generates such code (probably because its author thought it was “safer”); and misunderstanding of the meaning of ": many people seem to think it produces “smart quotes” (they apparently never looked at the actual results).
Anyway, there is never any need to use " in element content in HTML (XHTML or any other HTML version). There is nothing in any HTML specification that would assign any special meaning to the plain character " there.
As the question says, it has its role in attribute values, but even in them, it is mostly simpler to just use single quotes as delimiters if the value contains a double quote, e.g. alt='Greeting: "Hello, World!"' or, if you are allowed to correct errors in natural language texts, to use proper quotation marks, e.g. alt="Greeting: “Hello, World!”"
Reason #1
There was a point where buggy/lazy implementations of HTML/XHTML renderers were more common than those that got it right. Many years ago, I regularly encountered rendering problems in mainstream browsers resulting from the use of unencoded quote chars in regular text content of HTML/XHTML documents. Though the HTML spec has never disallowed use of these chars in text content, it became fairly standard practice to encode them anyway, so that non-spec-compliant browsers and other processors would handle them more gracefully. As a result, many "old-timers" may still do this reflexively. It is not incorrect, though it is now probably unnecessary, unless you're targeting some very archaic platforms.
Reason #2
When HTML content is generated dynamically, for example, by populating an HTML template with simple string values from a database, it's necessary to encode each value before embedding it in the generated content. Some common server-side languages provided a single function for this purpose, which simply encoded all chars that might be invalid in some context within an HTML document. Notably, PHP's htmlspecialchars() function is one such example. Though there are optional arguments to htmlspecialchars() that will cause it to ignore quotes, those arguments were (and are) rarely used by authors of basic template-driven systems. The result is that all "special chars" are encoded everywhere they occur in the generated HTML, without regard for the context in which they occur. Again, this is not incorrect, it's simply unnecessary.
In my experience it may be the result of auto-generation by a string-based tools, where the author did not understand the rules of HTML.
When some developers generate HTML without the use of special XML-oriented tools, they may try to be sure the resulting HTML is valid by taking the approach that everything must be escaped.
Referring to your example, the reason why every occurrence of " is represented by " could be because using that approach, you can safely use such "special" characters in both attributes and values.
Another motivation I've seen is where people believe, "We must explicitly show that our symbols are not part of the syntax." Whereas, valid HTML can be created by using the proper string-manipulation tools, see the previous paragraph again.
Here is some pseudo-code loosely based on C#, although it is preferred to use valid methods and tools:
public class HtmlAndXmlWriter
{
private string Escape(string badString)
{
return badString.Replace("&", "&").Replace("\"", """).Replace("'", "&apos;").Replace(">", ">").Replace("<", "<");
}
public string GetHtmlFromOutObject(Object obj)
{
return "<div class='type_" + Escape(obj.Type) + "'>" + Escape(obj.Value) + "</div>";
}
}
It's really very common to see such approaches taken to generate HTML.
As other answers pointed out, it is most likely generated by some tool.
But if I were the original author of the file, my answer would be: Consistency.
If I am not allowed to put double quotes in my attributes, why put them in the element's content ? Why do these specs always have these exceptional cases ..
If I had to write the HTML spec, I would say All double quotes need to be encoded. Done.
Today it is like In attribute values we need to encode double quotes, except when the attribute value itself is defined by single quotes. In the content of elements, double quotes can be, but are not required to be, encoded. (And I am surely forgetting some cases here).
Double quotes are a keyword of the spec, encode them.
Lesser/greater than are a keyword of the spec, encode them. etc..
It is likely because they used a single function for escaping attributes and text nodes. & doesn't do any harm so why complicate your code and make it more error-prone by having two escaping functions and having to pick between them?

Is it dangerous to allow the user to input HTML entities?

I'm trying to figure out what the the minimum amount of encoding would be that would protect a site from XSS.
I know for sure I'll need to encode < (<) and > (>) inside of tags, " (") and ' (&#39) inside attributes.
Do I also need to encode & (&)? I was having trouble with double encoding when the user was saving data (because & would become &amp;). Are there any security vulnerabilities or downsides that would happy if I didn't encode the ampersands? This would mean they'd be able to input any HTML entities they wanted.
By HTML entities I specifically mean ampersand-prefixed sequences that correspond to entities (like © ™).
This question is language-agnostic (except for the HTML part, of course).
Edit: heh, stack-overflow lets me keep my html encoded entities :) That might be telling.
You only need to encode these entities if you are displaying them on a page (and & needs to be escaped just as much as > and < because it is the escape sequence identifier).
If you're having trouble with double encoding of & signs, it sounds like you're doing it before you insert the data into your storage mechanism (database?) Stop that. You should only escape the data for the page when it comes to display on the page.

HTML: Should I encode greater than or not? ( > > )

When encoding possibly unsafe data, is there a reason to encode >?
It validates either way.
The browser interprets the same either way, (In the cases of attr="data", attr='data', <tag>data</tag>)
I think the reasons somebody would do this are
To simplify regex based tag removal. <[^>]+>? (rare)
Non-quoted strings attr=data. :-o (not happening!)
Aesthetics in the code. (so what?)
Am I missing anything?
Strictly speaking, to prevent HTML injection, you need only encode < as <.
If user input is going to be put in an attribute, also encode " as ".
If you're doing things right and using properly quoted attributes, you don't need to worry about >. However, if you're not certain of this you should encode it just for peace of mind - it won't do any harm.
The HTML4 specification in its section 5.3.2 says that
authors should use ">" (ASCII decimal 62) in text instead of ">"
so I believe you should encode the greater > sign as > (because you should obey the standards).
Current browsers' HTML parsers have no problems with uquoted >s
However, unfortunately, using regular expressions to "parse" HTML in JS is pretty common. (example: Ext.util.Format.stripTags). Also poorly written command line tools, IDEs, or Java classes etc. may not be sophisticated enough to determine the limiter of an opening tag.
So, you may run into problems with code like this:
<script data-usercontent=">malicious();//"></script>
(Note how the syntax highlighter treats this snippet!)
Always
This is to prevent XSS injections (through users using any of your forms to submit raw HTML or javascript). By escaping your output, the browser knows not to parse or execute any of it - only display it as text.
This may feel like less of an issue if you're not dealing with dynamic output based on user input, however it's important to at least understand, if not to make a good habit.
Yes, because if signs were not encoded, this allows xss on forms social media and many other because a attacker can use <script> tag. If you parse the signs the browser would not execute it but instead show the sign.
Encoding html chars is always a delicate job. You should always encode what needs to be encoded and always use standards. Using double quotes is standard, and even quotes inside double quotes should be encoded. ENCODE always. Imagine something like this
<div> this is my text an img></div>
Probably the img> will be parsed from the browser as an image tag. Browsers always try to resolve unclosed tags or quotes. As basile says use standards, otherwise you could have unexpected results without understanding the source of errors.

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- < -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi​s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (eg. by turning document.write('</p>') into document.write('<\/p>');. (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>' which is still quite invalid.)
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish & Chips</h1>
<img alt="Awesome picture of Meat Pie & Chips" />
Fish & Chips
The last example includes some URL Encoding for the ampersand too (&) and its at this point things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, its HTML Encoding, & or > etc. Othertimes, when trying to pass these characters via a URL, use URL Encoding %20, %26 etc.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references <, >, " and &, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.

Regex to match attributes in HTML?

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I am interested in getting the text Connectivity Framework to be extraced and written to a separate file.
Like this, there are many such tags each having a different text after the title='some text here which i need to extract '
I want to extract all such instances of the text from the html source/txt file and write to a separate txt file. The text can contain lower case, upper case letters and number only. The length of each text string(in characters) will vary.
I am using PowerGrep for windows. Powergrep allows me to search a text file with regular expression inout.
I tried using the search as
title='[a-zA-Z0-9]
It shows the correct matches, but it matches only first character of the string and writes only the first character of the text string matched to the second txt file, not all string.
I want all string to be matched and written to the second file.
What is the correct regular expression or way to do what i want to do, using powergrep?
-AD.
I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row until it hits a character that is not in that character class (in your case, the space character since you didn't include that in your brackets).
Regexes though can be pretty complex to describe the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle so a simple "everything but apostrophes" character set won't cut it) though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.
I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
Here's the regex you need
title='([a-zA-Z0-9]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.
Try this instead:
title=\'[a-zA-Z0-9]*\'