Securely rendering hyperlinks in untrusted text - html

As part of a project, I'm accepting text from a user via web form and displaying it on a web page. The text they provide may contain URLs; if so, I'd like to render each one as a hyperlink for a better experience. For example, the user might submit text containing http://www.google.com and I want to convert it to <a href="http://www.google.com">...
I'm wondering what security issues I should be aware of while doing this. I've already taken measures to avoid any simple XSS insertions, because my XML library will escape any special characters, but I imagine there are more sophisticated attacks.

In addition to ignoring javascript:, you should probably only make hyperlinks for the http: and https: protocols, because there are certain applications that can be launched or controlled through other protocols. Steam, Skype, and AOL Messenger come to mind.

If you are only surrounding URLs with <a> elements, the only problem that should arise is if someone enters a malicious URL (it might be shortened) and a reader ends up clicking it, provided all other means of attack are secured in your software (e.g. arbitrary JavaScript cannot be executed, etc).
Make sure you don't consider the javascript: pseudo protocol when you are matching URLs. Nothing nice could come of that.
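The advice in the answers above (escape everything, then link only http: and https: URLs) can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: the function name linkify, the URL pattern, and the rel attribute values are assumptions made for the example, and the pattern is deliberately crude (it will, for instance, swallow trailing punctuation).

```python
import html
import re

# Assumption for this sketch: only these schemes ever become links.
ALLOWED_SCHEMES = {"http", "https"}

# Crude URL matcher: a scheme, "://", then anything up to whitespace,
# angle brackets, or a quote character.
URL_PATTERN = re.compile(r"\b(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)://[^\s<>\"']+")


def linkify(text: str) -> str:
    """HTML-escape untrusted text and wrap http/https URLs in <a> elements."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        # Everything between URLs is escaped as plain text.
        parts.append(html.escape(text[last:match.start()]))
        escaped_url = html.escape(match.group(0), quote=True)
        if match.group("scheme").lower() in ALLOWED_SCHEMES:
            parts.append(
                f'<a href="{escaped_url}" rel="nofollow noopener">{escaped_url}</a>'
            )
        else:
            # Other schemes (steam://, skype://, ...) stay as escaped text.
            parts.append(escaped_url)
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Because the pattern requires "://", a javascript: pseudo-URL is never matched at all and is simply escaped like any other text.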

Related

Safe Way to Include User Text Input in HTML

This feels like an easy one, but I'm having trouble finding the right search terms to get me what I need...
I have a requirement for part of my web page to display a previously-entered note from a user. The note is saved in the database, and I am currently incorporating it using Razor like this:
<span>@Model.UserNote</span>
This works fine, but it gets my spidey senses tingling... what if the user decides that he wants his note to be something like "</span><script>...</script><span>". I know how to use parameters to avoid injection attacks in SQL Server, but is there an HTML equivalent or another approach to avoid saving or injecting malicious markup in HTML? Displaying the text in a control like a textbox feels safer, but may not give me the visual appearance that I am looking for. Thanks in advance!
The thing you want to search for is cross-site scripting (xss).
The general solution is to encode output according to its context. For example if you are writing such data into plain html, you need html encoding, which is basically replacing < with &lt; and so on in dynamic data (~user input), so that everything only gets rendered as text. For a javascript context (for example but not only inside a <script> tag) you would need javascript encoding.
In .net, there is HttpUtility that includes such methods, eg. HttpUtility.JavaScriptStringEncode(). Also there is the formerly separate AntiXSS library that can help by providing even stricter (whitelist-based) encoding, as opposed to the blacklist-based HttpUtility. So don't roll your own, it's trickier than it may first appear - just use a well-known implementation.
Also Razor has built-in protection against trivial xss attack vectors. By using @myVar, Razor automatically applies html encoding, so your code above is secure. Note that it would not be secure in a javascript context, where you need to apply javascript encoding yourself (ie. call the relevant method from HttpUtility for instance).
Note that without proper encoding, it is not more secure to use an input field or a textarea - an injection is an injection, doesn't matter much what characters need to be used if injection is possible.
Also slightly related, .net provides another protection besides the automatic html encoding. It uses "request validation", and by default won't allow request parameters (either get or post) to contain a less than character (<), immediately followed by a letter. Such a request would be blocked by the framework as potentially unsafe, unless this feature is deliberately turned off.
Your original example is blocked by both of these mechanisms (automatic encoding and request validation).
It's very important to note though, that in terms of xss, this is the very tip of the iceberg. While these protections in .net help somewhat, they are by no means sufficient in general. While your example is secure, in general you need to understand xss and what exactly these protections do to be able to produce secure code.
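To illustrate what "encode according to context" means, here is a small sketch in Python rather than .net, using only the standard library. The function names are made up for this example, and the JavaScript variant is only an approximation of what a dedicated encoder does; a well-known implementation is still the better choice in real code.

```python
import html
import json


def encode_for_html(value: str) -> str:
    """HTML context: turn markup characters into entities so they render as text."""
    return html.escape(value, quote=True)


def encode_for_js_string(value: str) -> str:
    """JavaScript string context: produce a quoted literal that is also safe
    to place inside a <script> block (assumption: the page is served as UTF-8)."""
    literal = json.dumps(value)  # handles quotes, backslashes, control chars
    # json.dumps leaves <, > and & alone, so "</script>" could still end the block.
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )
```

The same note therefore has two different safe forms: encode_for_html(note) for a <span>, and encode_for_js_string(note) for something like var note = ...; inside a script block. Using the HTML form in a JavaScript context (or the other way round) is not safe.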

What web admin wysiwyg preview options are available?

I have a web admin where there is a wysiwyg editor when a user edits information.
There is also a view-only template. The user views the information before clicking an edit action.
Currently the view template results in one line for the saved field value.
<p><b>Hello</b></p><p>there</p>
What options do I have to at least make this a little more readable when the user is "viewing"?
Options I can think of are:
1. Leave it as is. Well, that can become a long line of text.
2. Somehow avoid the encoding MVC3 applies and add an actual <br> in place of each </p> or <br> that is in the content. At least the lines will break up.
3. Have the content actually presented as html. That is, you will see bold. But what if there is an unclosed tag?
With any of the above, I may place it in a scrollable div.
(I had trouble tagging this question. Feel free to retag).
Typically when you are working with editors you are going to eventually be presenting the HTML live on the site anyway, so encoding shouldn't be a big concern as you are already trusting them.
Now, in my past experience, editors such as ckeditor clean up the content, which would address your concern about an unclosed tag.
So I would go with option 3 on your list.
Also ensure that any editor you support encodes data before sending it to the server. Do not turn off request validation.
Use the [AllowHtml] attribute on a model property if necessary.
Also use the Anti-xss library from Microsoft - specifically the HTML sanitizer to help remove evil script and help protect against cross site scripting.
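To give a feel for what an allowlist-based HTML sanitizer does, here is a rough Python sketch built on the standard library parser. It is only an illustration of the idea: the tag list and the names AllowlistSanitizer and sanitize are assumptions for this example, and it is no substitute for a maintained sanitizer such as the one mentioned above.

```python
import html
from html.parser import HTMLParser

# Assumption for this sketch: the only formatting the editor may produce.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "br", "ul", "ol", "li"}
VOID_TAGS = {"br"}
DROP_CONTENT_TAGS = {"script", "style"}  # drop the tag and everything inside


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.drop_depth += 1
        elif tag in ALLOWED_TAGS and self.drop_depth == 0:
            self.out.append(f"<{tag}>")  # all attributes are discarded
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag in self.open_tags and self.drop_depth == 0:
            # Close any tags left open inside this one, then the tag itself.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if self.drop_depth == 0:
            self.out.append(html.escape(data))


def sanitize(fragment: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()
    # Close whatever the author left unclosed.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

With this sketch, sanitize('<p><b>Hello') comes back with both tags closed, and event-handler attributes and script blocks are removed rather than rendered.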

Why are iframes considered dangerous and a security risk?

Why are iframes considered dangerous and a security risk? Can someone describe an example of a case where it can be used maliciously?
The IFRAME element may be a security risk if your site is embedded inside an IFRAME on a hostile site. Google "clickjacking" for more details. Note that it does not matter if you use <iframe> or not. The only real protection from this attack is to add HTTP header X-Frame-Options: DENY and hope that the browser knows its job.
If anybody claims that using an <iframe> element on your site is dangerous and causes a security risk, they do not understand what the <iframe> element does, or they are speaking about the possibility of <iframe> related vulnerabilities in browsers. Security of the <iframe src="..."> tag is equal to <img src="..."> or <a href="..."> as long as there are no vulnerabilities in the browser. And if there's a suitable vulnerability, it might be possible to trigger it even without using an <iframe>, <img> or <a> element, so it's not worth considering for this issue.
In addition, the IFRAME element may be a security risk if any page on your site contains an XSS vulnerability which can be exploited. In that case the attacker can expand the XSS attack to any page within the same domain that can be persuaded to load within an <iframe> on the page with the XSS vulnerability. This is because vulnerable content from the same origin (same domain) inside <iframe> is allowed to access the parent content DOM (practically execute JavaScript in the "host" document). The only real protection methods from this attack are to add HTTP header X-Frame-Options: DENY and/or always correctly encode all user submitted data (that is, never have an XSS vulnerability on your site - easier said than done).
However, be warned that content from an <iframe> can initiate top-level navigation by default. That is, content within the <iframe> is allowed to automatically open a link over the current page location (the new location will be visible in the address bar). The only way to avoid that is to add the sandbox attribute without the value allow-top-navigation. For example, <iframe sandbox="allow-forms allow-scripts" ...>. Unfortunately, sandbox also disables all plugins, always. For example, historically YouTube couldn't be sandboxed because Flash player was still required to view all YouTube content. No browser supports using plugins and disallowing top-level navigation at the same time. However, unless you have some very special reasons, you cannot trust any plugins to work at all for the majority of your users in 2021, so you can just use sandbox always and guard your site against forced redirects from user generated content, too. Note that this will break poorly implemented content that tries to modify window.top.location. The content in a sandboxed <iframe> can still open links in new tabs, so well implemented content will work just fine. Also notice that if you use <iframe sandbox="... allow-scripts allow-same-origin ..." src="blob:..."> any XSS attack within the blob: content can be extended to the host document because blob: URLs always inherit the origin of their parent document. You cannot wrap unfiltered user content in blob: and render it as an <iframe> any more than you can put that content directly on your own page.
An example attack goes like this: assume that users can insert user generated content with an iframe; an <iframe> without the sandbox attribute can be used to run JS code saying window.top.location.href = ... and force a redirect to another page. If that redirect goes to a well executed phishing site and your users do not pay attention to the address bar, the attacker has a good chance of getting your users to leak their credentials. They cannot fake the address bar but they can force the redirect and control all content that users can see after that. Leaving allow-top-navigation out of the sandbox attribute value avoids this problem. However, due to historical reasons, <iframe> elements do not have this limitation by default, so you'll be more vulnerable to phishing if your users can add an <iframe> element without the sandbox attribute.
Note that X-Frame-Options: DENY also protects from rendering performance side-channel attack that can read content cross-origin (also known as "Pixel perfect Timing Attacks").
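The two technical measures described above (the X-Frame-Options: DENY response header, and a sandbox attribute that leaves out allow-top-navigation) might look like this on the server side. This is a sketch under assumptions: the function names are invented for the example, and how the headers are actually attached depends on your framework.

```python
import html


def framing_protection_headers() -> dict:
    """Headers to add to your own pages so other sites cannot frame them."""
    return {"X-Frame-Options": "DENY"}


def sandboxed_iframe(src: str, allow=("allow-forms", "allow-scripts")) -> str:
    """Build an <iframe> for embedded content that can never navigate the
    top-level page: any allow-top-navigation token is stripped from `allow`."""
    tokens = [t for t in allow if not t.startswith("allow-top-navigation")]
    return '<iframe sandbox="{}" src="{}"></iframe>'.format(
        html.escape(" ".join(tokens), quote=True),
        html.escape(src, quote=True),
    )
```

For example, sandboxed_iframe("https://example.com/widget") yields an iframe with sandbox="allow-forms allow-scripts". Note that the sketch does not guard against the allow-scripts plus allow-same-origin combination discussed above; that still needs a decision per use case.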
That's the technical side of the issue. In addition, there's the issue of user interface. If you teach your users to trust that URL bar is supposed to not change when they click links (e.g. your site uses a big iframe with all the actual content), then the users will not notice anything in the future either in case of actual security vulnerability. For example, you could have an XSS vulnerability within your site that allows the attacker to load content from hostile source within your iframe. Nobody could tell the difference because the URL bar still looks identical to previous behavior (never changes) and the content "looks" valid even though it's from hostile domain requesting user credentials.
As soon as you're displaying content from another domain, you're basically trusting that domain not to serve-up malware.
There's nothing wrong with iframes per se. If you control the content of the iframe, they're perfectly safe.
I'm assuming cross-domain iFrame since presumably the risk would be lower if you controlled it yourself.
Clickjacking is a problem if your site is included as an iframe
A compromised iFrame could display malicious content (imagine the iFrame displaying a login box instead of an ad)
An included iframe can make certain JS calls like alert and prompt which could annoy your user
An included iframe can redirect via location.href (yikes, imagine a 3p frame redirecting the customer from bankofamerica.com to bankofamerica.fake.com)
Malware inside the 3p frame (java/flash/activeX) could infect your user
IFRAMEs are okay; urban legends are not.
When you "use iframes", it doesn't just mean one thing. It's a lexical ambiguity. Depending on the use case, "using iframes" may mean one of the following situations:
Someone else displays your content in an iframe
You display someone else's content in an iframe
You display your own content in an iframe
So which of these cases can put you in risk?
1. Someone else displays your content
This case is almost always referred to as clickjacking - mimicking your site's behaviour, trying to lure your users into using a fake UI instead of the real site. The misunderstanding here is that you using or not using iframes is irrelevant, it's simply not your call - it's someone else using iframes, which you can do nothing about. Btw, even they don't need them specifically: they can copy your site any other way, stealing your html, implementing a fake site from scratch, etc.
So, ditching iframes in attempt to prevent clickjacking - it makes exactly zero sense.
2. You display someone else's content
Of the three above, this is the only one that's somewhat risky, but most of the scary articles you read all the time come from a world before same-origin policy was introduced. Right now, it's still not recommended to include just any site into your own (who knows what it will contain tomorrow?), but if it's a trusted source (accuweather, yahoo stock info etc), you can safely do it. The big no-no here is letting users (therefore, malicious users) control the src of the iframe, telling it what to display. Don't let users load arbitrary content into your page, that's the root of all evil. But it's true with or without iframes. It has nothing to do with them; it could happen using a script or a style tag (good luck living without them) - the problem is you let them out. Any output on your site containing any user-given content is RISKY. Without sanitizing (de-HTMLifying) it, you're basically opening your site up for XSS attacks, anyone can insert a <script> tag into your content, and that is bad news. Like, baaaad news.
Never output any user input without making dead sure it's harmless.
So, while iframes are innocent again, the takeaway is: don't make them display 3rd-party content unless you trust the source. In other words, don't include untrusted content in your site. (Also, don't jump in front of fast-approaching freight trains. Duuh.)
3. You display your own content in an iframe
This one is obviously harmless. Your page is trusted, the inner content of the iframe is trusted, nothing can go wrong. Iframe is no magic trick; it's just an encapsulation technique, you absolutely have the right to show a piece of your content in a sandbox. It's much like putting it inside a div or anything else, only it will have its own document environment.
TL;DR
Case 1: doesn't matter if you use iframes or not,
Case 2: not an iframe problem,
Case 3: absolutely harmless case.
Please stop believing urban legends. The truth is, iframe-s are totally safe. You could as well blame script tags for being dangerous; anything can cause trouble when maliciously inserted in a site. But how did they insert it in the first place? There must be an existing backend vulnerability if someone was able to inject html content into a site. Blaming one piece of technology for a common attack (instead of finding the real cause) is just a synonym for keeping security holes open. Find the dragon behind the fire.
Unsanitized output is bad; iframes are not.
Stop the witch-hunt.
UPDATE:
There is an attribute called sandbox, worth checking out: https://www.w3schools.com/tags/att_sandbox.asp
UPDATE 2:
Before you comment against iframes - please think about hammers. Hammers are dangerous. They also don't look very nice, they're difficult to swim with, bad for teeth, and some guy in a movie once misused a hammer causing serious injuries. Also, just googled it and tons of literature says mortals can't even move them. If this looks like a good reason to never ever use a hammer again, iframes may not be your real enemy. Sorry for going offroad.
"Dangerous" and "Security risk" are not the first things that spring to mind when people mention iframes … but they can be used in clickjacking attacks.
iframe is also vulnerable to Cross Frame Scripting:
https://www.owasp.org/index.php/Cross_Frame_Scripting

Why do I need Markdown?

Why do I need Markdown with a front-end editor like WMD? What does Markdown do to the content that’s sent from the WMD editor?
How does Markdown store the content in the backend? Is it stored the same way, like *bold*, or in some other format? Why can’t I just do an html encode?
Sorry if I sounded very naïve.

It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: At some point, for any piece of software to enable rich text it has to describe the richness in some manner, however that may be.
We could call that description of richness (by which I mean things like "this bit of text is bold" or "this bit of text is a hyperlink") "markup": it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is then needed to allow users to interact with it. Such markup languages include HTML, which uses <tag>, and BB code, which uses [tag].
Markdown is one of these languages.
Unlike the languages just mentioned, Markdown was designed so that its markup resembles the plain-ASCII conventions people already use. For example, it's common for people to put asterisks around text to set it off, *important*, and in Markdown this notation indicates italics.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most markdown implementations will also recognize HTML intermingled with Markdown formatting, so to be safe, you need to make sure your inputs and caches are sanitized properly.
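The "store the markdown, cache the rendered HTML in a second field" approach could be sketched like this. Everything here is illustrative: the Post class and its field names are assumptions, and render_markdown is a toy stand-in that only handles *emphasis*; a real system would call an actual Markdown library followed by a sanitizer.

```python
import html
import re
from dataclasses import dataclass


def render_markdown(source: str) -> str:
    """Toy stand-in for a real Markdown renderer: escapes any raw HTML first,
    then turns *text* into <em>text</em> and wraps the result in a paragraph."""
    escaped = html.escape(source)
    return "<p>" + re.sub(r"\*(.+?)\*", r"<em>\1</em>", escaped) + "</p>"


@dataclass
class Post:
    markdown: str = ""    # what the user typed; shown again when editing
    html_cache: str = ""  # rendered once on save, served on every view

    def set_markdown(self, source: str) -> None:
        # Re-render on every write so the cache can never go stale.
        self.markdown = source
        self.html_cache = render_markdown(source)
```

Views read html_cache directly, and the edit form is pre-filled from markdown, so rendering happens once per save rather than once per page view.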
The reason for using an alternate encoding system other than HTML is for security
Markdown and other such wiki style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user generated content to attempt malware actions on the content readers computer by scripting to access known security holes
Freeloaders using scripts to subvert the rest of the site by changing the content frame or styles, i.e. ads, menus, logos etc. This can also be criminal behaviour if not just annoying
By using an intermediate language such as Markdown you have total control on the rendered output
Filtering HTML is possible, but is also complex and risky
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. This usually makes for cleaner looking and more readable content (compare SO to Ebay)
The main reason for using Markdown is the readability of marked-up text. For instance, you can send it in a plain-text email and the reader will still understand the emphasis and bullets, the text will be divided into paragraphs, et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user has input it - in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining that they have to convert it back).
If you use the WMD editor, you can show the user what the output will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWYG control - a WYSIWYG control allows the user to use HTML in data you are displaying on your web page. So, you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <em>something</em>, remove any unknown HTML elements and you're done.

Block certain html element from getting indexed by search engines

For styling purposes I want to insert some dummy text on the page, but it shouldn't get associated with the actual content. Is there a way to block it from search engines, or do I have to use good old images for that?
Or would it be possible to load it dynamically via javascript? I ask because I heard that google will read a certain amount of javascript.
Can you show the content in a borderless iframe, and block the iframe's src (a completely separate "page") from the search engines?
Alternatively, add the content with javascript, storing the javascript in a .js file that you block from the engines?
If you load that text via AJAX it probably won't be indexed - last time I checked, GoogleBot doesn't actually execute JS, nor do the other spiders (but some spambots apparently can and do).
Caveat: the AJAX response should probably contain a X-Robots-Tag: noindex header, in case its URL is actually linked somewhere.
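As a rough idea of what that caveat means in practice, here is a tiny WSGI sketch of an endpoint that serves the decorative text with the X-Robots-Tag: noindex header. The function name decoration_app and the placeholder text are assumptions for the example; any server or framework that lets you set response headers would do the same job.

```python
def decoration_app(environ, start_response):
    """Minimal WSGI app: returns the decorative text and asks crawlers
    not to index this response, even if its URL gets linked somewhere."""
    body = "Lorem ipsum decorative text".encode("utf-8")  # placeholder content
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
            ("X-Robots-Tag", "noindex"),
        ],
    )
    return [body]
```

For local experimentation it can be run with wsgiref.simple_server.make_server from the standard library.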
I'd be extremely careful with whatever trick you decide on. Odds are just as likely google will think you're trying to display different content to the user than to it.
I've always believed that Google actually works by rendering the page (possibly using some server-side version of the Chrome rendering engine) and then reads the result back with OCR software to confirm that the text in the source matches what the user would see with JS and frames enabled. Google has always openly warned webmasters not to try serving robots different content than users see, and OCR would be the perfect way to find out (especially if your 'verifier' used IE's user-agent string and crawled from IP ranges not registered by Google).
Short answer then, serve the decoration as either:
an iframe
an object
an SVG image
Since you're clearly linking the document into your page, google will probably consider it a separate resource and rate things accordingly, especially if the same text appears on every page. Which brings me to:
Are you going to use the same text decor on all/most pages? If so Google will almost certainly treat it as "window dressing" and ignore it (it apparently does this with menus and such).
I'd guess that loading in the content after the page has finished loading (when the document.ready event fires, for example) would be a fairly safe way to do what you're talking about. Not 100% sure about this, though.