Is there XSS risk when using a template literal with an untrusted string to set an attribute value? - ecmascript-6

I'm building an iframe, not with innerHTML, but with createElement. I have two untrusted strings that are used:
iframeEl.title = untrustedStr1;
iframeEl.src = `http://example.com/?id=${untrustedStr2}`;
According to the OWASP XSS cheatsheet, the title attribute is totally safe, so I'm not worried about that.
However, I'm not 100% sure about the iframeEl.src case.
I'm thinking about the 5 significant characters that typically need to be encoded: <, >, &, ", and '. I don't see any way to escape out of the template literal, and I also don't see a mechanism for untrustedStr2 to run as JavaScript. (For example, if untrustedStr2 = 'document.cookie', it's interpolated as a string, not evaluated.)
I suppose if untrustedStr2 is a getter method somehow, I could have a problem. But if it's absolutely a string, this is safe and I don't need to encode, not even for the 5 significant characters. Is that right?

When working with the DOM, there are no HTML encoding issues in any element properties. The characters <, >, &, ", and ' do not need escaping.
However, you still need to deal with the semantics of the respective attribute. While title is just a plain string that's not used for anything but displaying tooltips, others are not safe:
on… event handlers contain JavaScript code. It's a bad practice to assign strings to them anyway, but if you do, interpolated values must follow JavaScript escaping rules.
⇨ Rule #3
style properties contain CSS rules which need their own escaping.
⇨ Rule #4
src or href attributes are urls that the browser will load at some point. Those definitely are sensitive, and when interpolating values into urls you need to follow URL encoding rules.
⇨ Rule #5
… (not meant to be exhaustive)
In your particular case, if you fail to URL-encode untrustedStr2, the attacker may send arbitrary query parameters or fragments to example.com. This is not a security issue in itself if example.com isn't susceptible to reflected XSS (the attacker could send the same link to the user via other channels), but it is broken functionality (undesired behaviour), and it's still your page endorsing the linked content.
So if untrustedStr2 is meant as a value of the id URI query parameter, you should definitely use
iframeEl.src = `http://example.com/?id=${encodeURIComponent(untrustedStr2)}`;
// ^^^^^^^^^^^^^^^^^^

It seems unlikely for untrustedStr2 to evaluate and/or break out of the string. However if you don't encode it you may allow "HTTP Parameter Pollution (HPP)".
// untrustedStr2 = '9&id=42';
iframeEl.src = `http://example.com/?id=${untrustedStr2}`;
By itself, this is not necessarily an indication of vulnerability. However, if the developer is not aware of the problem, the presence of duplicated parameters may produce an anomalous behavior in the application that can be potentially exploited by an attacker. As often in security, unexpected behaviors are a usual source of weaknesses that could lead to HTTP Parameter Pollution attacks in this case.
See https://owasp.org/www-project-web-security-testing-guide/v41/4-Web_Application_Security_Testing/07-Input_Validation_Testing/04-Testing_for_HTTP_Parameter_Pollution.html
Or you could allow an attempt at a CSRF attack:
// untrustedStr2 = '9&action=delete';
iframeEl.src = `http://example.com/?id=${untrustedStr2}`;
I think it would be safer to encode it.
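To make the difference concrete, here is a small sketch (plain JavaScript; untrustedStr2 is the hypothetical attacker value from the examples above) showing how encodeURIComponent neutralizes the parameter-pollution payload:

```javascript
// Hypothetical attacker value from the example above, attempting
// HTTP Parameter Pollution
const untrustedStr2 = '9&id=42';

// Unencoded: the & and = survive, smuggling in a second id parameter
const unsafe = `http://example.com/?id=${untrustedStr2}`;
console.log(unsafe); // http://example.com/?id=9&id=42

// Encoded: & and = become %26 and %3D, so it is all one parameter value
const safe = `http://example.com/?id=${encodeURIComponent(untrustedStr2)}`;
console.log(safe); // http://example.com/?id=9%26id%3D42
```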

Related

XSS validation: Is it safe to check only if value contains <> and %3C %3E

I've been doing server side XSS validation. Here is what I found to use:
List of forbidden attributes: javascript:,mocha:,eval(,alert(,vbscript:,livescript:,expression(,url(,&{,&#,/*,*/,onclick,oncontextmenu,ondblclick,onmousedown,onmouseenter,onmouseleave,onmousemove,onmouseover,onmouseout,onmouseup,onkeydown,onkeypress,onkeyup,onblur,onchange,onfocus,onfocusin,onfocusout,oninput,oninvalid,onreset,onsearch,onselect,onsubmit,ondrag,ondragend,ondragenter,ondragleave,ondragover,ondragstart,ondrop,oncopy,oncut,onpaste,ontouchcancel,ontouchend,ontouchmove,ontouchstart,onwheel
However, this list seems too strict, since some correct values such as "™" are considered illegal if I use it.
I'm wondering whether checking only that the value doesn't contain any of the characters '<', '>', "%3C", "%3E" would be safe enough to prevent XSS attacks?
You do not need to block any of those characters; a lesson learned long ago is that you'd potentially just be cutting out support for other languages, and you can work with these characters safely.
Imagine we have a chat box, since a textarea is where these characters get used most often. A user can send the following: <html></html>. If it isn't handled properly, every user receiving the message will open up a new HTML document inside the chat (scary).
Fortunately this gets handled on the client side, with a few tricks to render the message as "text-only".
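One common way to make messages "text-only" is to escape the significant HTML characters before rendering (in the DOM you can also simply assign to textContent instead of innerHTML). A minimal sketch of such an escaper; the function name is mine, not from any particular library:

```javascript
// Escape the five HTML-significant characters so user input renders as text.
// & must be replaced first, or the entities produced by the later
// replacements would themselves get escaped.
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

console.log(escapeHtml('<html></html>'));
// &lt;html&gt;&lt;/html&gt;
```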
When dealing with login, you could regex-check the username; if that passes, look the username up in SQL, and if a match is found, check whether the password matches too by hashing the password just given and comparing it to the stored hash. No match, get out.
I've never needed to block characters or treat anything specially once I knew proper encoding/decoding (escaping practices and whatnot). Maybe this will help your search.

Should lexer distinguish different types of string tokens?

I'm writing a Jade-like language that will transpile to HTML. Here's what a tag definition looks like:
section #mainWrapper .container
this transpiles to:
<section id="mainWrapper" class="container">
Should the lexer tell class and id apart or should it only spit out the special characters with names?
In other words, should the token array look like this:
[
{type: 'tag', value: 'section'},
{type: 'id', value: 'mainWrapper'},
{type: 'class', value: 'container'}
]
and then the parser just assembles these into a tree
or should the lexer be very primitive and only return matched strings, and then the parser takes care of distinguishing them?:
[
{type: 'name', value: 'section'},
{type: 'name', value: '#mainWrapper'},
{type: 'name', value: '.container'}
]
As a rule of thumb, tokenisers shouldn't parse and parsers shouldn't tokenise.
In this concrete case, it seems to me unlikely that every unadorned use of a name-like token -- such as section -- would necessarily be a tag. It's more likely that section is a tag because of its syntactic context. If the tokeniser attempts to mark it as a tag, then the tokeniser is tracking syntactic context, which means that it is parsing.
The sigils . and # are less clear-cut. You could consider them single-character tokens (which the syntax will insist be followed by a name) or you might consider them to be the first character of a special type of string. Some things that might sway you one way or the other:
Can the sigil be separated from the following name by whitespace? (# mainWrapper). If so, the sigil is probably a token.
Is the lexical form of a class or id different from a name? Think about the use of special characters, for example. If you can't accurately recognise the object without knowing what sigil (if any) preceded it, then it might better be considered as a single token.
Are there other ways to represent class names? For example, how do you represent multiple classes? Some possibilities off the top of my head:
.classA .classB
.(classA classB)
."classA classB"
class = "classA classB"
If any of the options other than the first one are valid, you probably should just make . a token. But correct handling of the quoted strings might generate other challenges. In particular, it could require retokenising the contents of the string literal, which would be a violation of the heuristic that parsers shouldn't tokenise. Fortunately, these aren't absolute rules; retokenisation is sometimes necessary. But keep it to a minimum.
The separation into lexical and syntactic analysis should not be a strait-jacket. It's a code organization technique intended to make the individual parts easier to write, understand, debug and document. It is often (but not always) the case that the separation makes it easier for users of your language to understand the syntax, which is also important. But it is not appropriate for every parsing task, and the precise boundary is flexible (but not porous: you can put the boundary where it is most convenient but once it's placed, don't try to shove things through the cracks.)
If you find that this separation of concerns too difficult for your project, you should either reconsider your language design or try scannerless parsing.
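To make the trade-off concrete, here is a hypothetical minimal tokenizer (a sketch, not a full implementation of the language) that takes the "primitive" route: sigils become single-character tokens, every name-like run becomes a generic 'name' token, and deciding that a name after '#' is an id and a name after '.' is a class is left to the parser:

```javascript
// A primitive tokenizer: sigils are their own tokens, everything
// name-like is a generic 'name'. The parser decides later that a
// name following '#' is an id and one following '.' is a class.
function tokenize(line) {
  const tokens = [];
  const re = /([#.])|([A-Za-z][\w-]*)|(\s+)/g;
  let m;
  while ((m = re.exec(line)) !== null) {
    if (m[1]) tokens.push({ type: 'sigil', value: m[1] });
    else if (m[2]) tokens.push({ type: 'name', value: m[2] });
    // whitespace (m[3]) is skipped
  }
  return tokens;
}

console.log(tokenize('section #mainWrapper .container'));
// name 'section', sigil '#', name 'mainWrapper', sigil '.', name 'container'
```

Note that this tokenizer happily accepts whitespace between a sigil and its name, which illustrates the first question above: if `# mainWrapper` should be illegal, that rule now has to live in the parser.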

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found on SO, but none were supported here: https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions, in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use to get just the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according to this article about numbered backreferences from Regular-Expressions.info.
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters: (?:"([^"]*)"|'([^']*)')
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) of grabbing anything between two " symbols. Note that it does not check for escaped quotes; it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower.
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than the line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
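As a concrete illustration (assuming a JavaScript environment with String.prototype.matchAll), Method 1's pattern can be applied to the OPML line from the question, extended here with a second capture group so the value comes out separately:

```javascript
const line = '<outline text="Software Engineering Daily" type="rss" ' +
  'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" ' +
  'htmlUrl="http://softwareengineeringdaily.com" />';

// Method 1's pattern, with an extra group to capture the attribute value
const re = /\b(text|xmlUrl)="([^"]*)"/g;
for (const m of line.matchAll(re)) {
  console.log(m[1], '=>', m[2]);
}
// text => Software Engineering Daily
// xmlUrl => http://softwareengineeringdaily.com/feed/podcast/
```

Note how the \b keeps htmlUrl from matching even though it ends in "mlUrl", exactly as described above.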
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

Sanitizing data before MySQL injection and XSS - am I doing it right with PDO and HTML Purifier

I am still working on securing my web app. I decided to use the PDO library to prevent MySQL injection and HTML Purifier to prevent XSS attacks. Because all the data that comes from input goes to the database, I perform these steps when working with data:
get data from input field
start pdo, prepare query
bind each variable (POST variable) to query, with sanitizing it using html purifier
execute query (save to database).
In code it looks like this:
// start htmlpurifier
require_once '/path/to/htmlpurifier/library/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
// start pdo
$pdo = new PDO('mysql:host=host;dbname=dbname', 'login', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// prepare and bind
$stmt = $pdo->prepare('INSERT INTO `table` (`field1`) VALUES ( :field1 )');
// purify data and bind it.
$stmt->bindValue(':field1', $purifier->purify($_POST['field1']), PDO::PARAM_INT);
// execute (save to database)
$stmt->execute();
Here are the questions:
Is that all I have to do to prevent XSS and MySQL injection? I am aware that I can't be 100% sure, but in most cases should it work fine, and is it enough?
Should I sanitize the data once again when grabbing it from the DB and putting it into the browser, or is filtering before saving enough?
I was reading on a wiki that it's smart to turn off magic_quotes. Of course, if magic quotes puts in unnecessary slashes it can be annoying, but if I don't care about those slashes, isn't turning it off just losing another line of defense?
Answer:
Please note that the code I have written in this example is just an example. There are a lot of inputs, and the queries to the DB are much more complicated. Unfortunately I can't agree with you that if the PDO variable type should be int, I do not have to filter it against XSS attacks. Correct me if I am wrong:
If the input should be an integer, and it is, then it's OK - I can put it in the DB. But remember that any input can be changed, and we have to expect the worst. So if everything is alright then it is alright, but if a malicious user inputs XSS code then I have multiple lines of defense:
client side defense - check if it is numeric value. Easy to compromise, but can stop total newbies.
server side - XSS injection test (with HTML Purifier or e.g. htmlspecialchars)
db side - if somehow somebody puts in malicious code that gets past the XSS protection, then the database is going to return an error, because it expects an integer, not any other kind of variable.
I guess it is not doing anything wrong, and it can do a lot of good. Of course we lose some time computing everything, but I guess we have to weigh performance against security and determine which is more important. My app is going to be used by 2-3 users at a time. Not many. And security is much more important to me than performance.
Fortunately my whole site is with UTF8 so I do not expect any problems with encoding.
While searching the net I met a lot of opinions about addslashes(), stripslashes(), htmlspecialchars(), htmlentities()... and I've chosen HTML Purifier and PDO. Everyone is saying that they are the best solutions against XSS and MySQL injection threats. If you have any other opinion, please share.
As for SQL injection, yes, you can be 100% sure if you always use prepared statements. As for XSS, you must also make sure that all your pages are UTF-8. HTML Purifier sanitizes data with the assumption that it's encoded in UTF-8, so there may be unexpected problems if you put that data in a page with a different encoding. Every page should have a <meta> tag that specifies the encoding as UTF-8.
Nope, you don't need to sanitize the data after you grab it from the DB, provided that you already sanitized it and you're not adding any user-submitted stuff to it.
If you always use prepared statements, magic quotes is nothing but a nuisance. It does not provide any additional lines of defense because prepared statements are bulletproof.
Now, here's a question for you. PDO::PARAM_INT will turn $field1 into an integer. An integer cannot be used in an SQL injection attack. Why are you passing it through HTML Purifier if it's just an integer?
HTML Purifier slows down everything, so you should only use it on fields where you want to allow HTML. If it's an integer, just do intval($var) to destroy anything that isn't a number. If it's a string that shouldn't contain HTML anyway, just do htmlspecialchars($var, ENT_COMPAT, 'UTF-8') to destroy all HTML. Both of these are much more efficient and equally secure if you don't need to allow HTML. Every field should be sanitized, but each field should be sanitized according to what it's supposed to contain.
Response to your additions:
I didn't mean to imply that if a variable should contain an integer, then it need not be sanitized. Sorry if my comment came across as suggesting that. What I was trying to say is that if a variable should contain an integer, it should not be sanitized with HTML Purifier. Instead, it should be validated/sanitized with a different function, such as intval() or ctype_digit(). HTML Purifier will not only use unnecessary resources in this case, but it also can't guarantee that the variable will contain an integer afterwards. intval() guarantees that the result will be an integer, and the result is equally secure because nobody can use an integer to carry out an XSS or SQL injection attack.
Similarly, if the variable should not contain any HTML in the first place, like the title of a question, you should use htmlspecialchars() or htmlentities(). HTML Purifier should only be used if you want your users to enter HTML (using a WYSIWYG editor, for example). So I didn't mean to suggest that some kinds of inputs don't need sanitization. My view is that inputs should be sanitized using different functions depending on what you want them to contain. There is no single solution that works on all types of inputs. It's perfectly possible to write a secure website without using HTML Purifier if you only ever accept plain-text comments.
"Client-side defense" is not a line of defense, it's just a convenience.
I'm also getting the nagging feeling that you're lumping XSS and SQL injection together when they are completely separate attack vectors. "XSS injection"? What's that?
You'll probably also want to add some validation to your code in addition to sanitization. Sanitization ensures that the data is safe. Validation ensures that the data is not only safe but also correct.

Using magic strings or constants in processing punctuation?

We do a lot of lexical processing with arbitrary strings which include arbitrary punctuation. I am divided as to whether to use magic characters/strings or symbolic constants.
The examples should be read as language-independent although most are Java.
There are clear examples where punctuation has a semantic role and should be identified as a constant:
File.separator not "/" or "\\"; // a no-brainer as it is OS-dependent
and I write XML_PREFIX_SEPARATOR = ":";
However, let's say I need to replace all occurrences of "" (two double-quote characters) with an empty string. I can write:
s = s.replaceAll("\"\"", "");
or
s = s.replaceAll(S_QUOT+S_QUOT, S_EMPTY);
(I have defined all common punctuation as S_FOO (string) and C_FOO (char))
In favour of magic strings/characters:
It's shorter
It's natural to read (sometimes)
The named constants may not be familiar (C_APOS vs '\'')
In favour of constants
It's harder to make typos (e.g. contrast "''" + '"' with S_APOS+S_APOS + C_QUOT)
It removes escaping problems: should a regex be "\\s+" or "\s+" or "\\\\s+"?
It's easy to search the code for punctuation
(There is a limit to this - I would not write regexes this way even though regex syntax is one of the most cognitively dysfunctional parts of all programming. I think we need a better syntax.)
If the definitions may change over time or between installations, I tend to put these things in a config file, and pick up the information at startup or on-demand (depending on the situation). Then provide a static class with read-only interface and clear names on the properties for exposing the information to the system.
Usage could look like this:
s = s.replaceAll(CharConfig.Quotation + CharConfig.Quotation, CharConfig.EmptyString);
For general string processing, I wouldn't use special symbols. A space is always going to be a space, and it's just more natural to read (and write!):
s.replace("String", " ");
Than:
s.replace("String", S_SPACE);
I would take special care to use things like "\t" to represent tabs, for example, since they can't easily be distinguished from spaces in a string.
As for things like XML_PREFIX_SEPARATOR or FILE_SEPARATOR, you should probably never have to deal with constants like that, since you should use a library to do the work for you. For example, you shouldn't be hand-writing: dir + FILE_SEPARATOR + filename, but rather be calling: file_system_library.join(dir, filename) (or whatever equivalent you're using).
This way, you'll not only have an answer for things like the constants, you'll actually get much better handling of various edge cases which you probably aren't thinking about right now.