What I mean is, is there some functionality built into HTML that you can use to escape output? For example, some sort of tag that would tell the browser that everything inside it should not be considered regular HTML, but treated as plain text.
I know there are things like Google AutoEscape and Microsoft AntiXSS but these are not built into HTML.
And if there isn't, the obvious question is why? Since XSS is a fairly common and well-known type of attack that devs can easily miss, why isn't there functionality built into HTML to prevent it and make things easy on the devs?
There is <pre> that displays its content literally:
The HTML <pre> element (or HTML Preformatted Text) represents
preformatted text. Text within this element is typically displayed in
a non-proportional ("monospace") font exactly as it is laid out in the
file. Whitespace inside this element is displayed as typed.
However, it's not useful protection against XSS or other attacks, since the attacker could simply inject a closing </pre> and then go on doing whatever they want in the rest of the code, which will be interpreted as part of the document.
Security-wise, there's no simple alternative to escaping the data you want to output on the server side.
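To illustrate the point about escaping, here is a minimal Python sketch. Python and the helper name render_pre are my own choices for illustration, not something from the question. It relies on the standard library's html.escape, so an injected </pre> ends up displayed as text instead of closing the element:

```python
import html


def render_pre(user_text: str) -> str:
    """Wrap untrusted text in <pre> after HTML-escaping it (hypothetical helper)."""
    # html.escape turns &, <, > (and quotes) into entities, so the
    # browser shows them literally instead of parsing them as markup.
    return "<pre>" + html.escape(user_text, quote=True) + "</pre>"


# The break-out attempt described above: close the <pre>, then inject a script.
payload = "</pre><script>alert(1)</script><pre>"

unsafe = "<pre>" + payload + "</pre>"  # the script would run
safe = render_pre(payload)             # the script is shown as text

print(safe)
```

The <pre> wrapper adds no protection of its own here; the escaping does all the work.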
Assuming I am in control of the parsing environment and I'm certain it is only to be converted to HTML (and not any of the many other formats possible); is it ok to embed some HTML within one's Markdown, in order to side-step around a bug?
Could there be any basic side effects I (as a newbie) couldn't predict but should be aware of?
Non-conventional Markdown example:
_"<strong>This</strong> is an example sentence."_ -**OP**
Which outputs valid HTML:
<em>"<strong>This</strong> is an example sentence."</em> -<strong>OP</strong>
Resulting in successful content:
"This is an example sentence." -OP
Background (don't have to read):
I noticed that if I include HTML in my Markdown, it appears to get skipped during the conversion, resulting in it being seamlessly incorporated in the output HTML.
This appears to be a good thing, at least in my case (Using Hugo to build a website with a template theme) where the Markdown wasn't producing the correct result (leaving a pair of unwanted *s in the HTML: should have been *italic* but asterisks showing).
For those wondering - yes, I confirmed my Markdown was correct using other parsers that handled it fine.
Note: the examples here are simplifications of my specific case
Not only is it okay to do, but it is encouraged. As the rules state:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
And later:
If you want, you can even use HTML tags instead of Markdown formatting; e.g. if you’d prefer to use HTML <a> or <img> tags instead of Markdown’s link or image syntax, go right ahead.
Of course, there are a few things to take into consideration. For example block level tags must be at the document root level (cannot be nested inside blockquotes, lists, etc) and content inside them does not get parsed as Markdown. However, inline tags can be placed anywhere and do not restrict Markdown parsing.
For people using Markdown in highly modular or user-flexible environments (probably slightly more advanced readers):
One should note that although Markdown is most commonly converted to HTML, it can also be used with other formats[1].
For this reason, if you (as a publisher of content) are not the one who determines what the Markdown will be parsed with, or how it is converted, it may be 'safer' not to embed HTML in it.
[1] as stated in the Markdown Wikipedia page.
I have a simple webpage that takes query items and crafts them into the page.
Example URL:
http://quir.li/player.html?media=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0VqTwnAuHws
The page then has the URL displayed somewhere in the page:
<span id="sourceUrlDisplay">http://www.youtube.com/watch?v=0VqTwnAuHws</span>
I feel that this makes the page vulnerable to XSS if it gets loaded with a URL containing something like
http://quir.li/player.html?media=<script>alert('test')</script>
I have found that rendering the URL into a <pre> tag does not help. Is there a simple solution to this, like an HTML tag whose content really is not interpreted but just printed out?
Note: This question is somewhat similar to this one, but more general.
No, there is no such tag in HTML that would prevent XSS attacks, and it's impossible to make one. Let's assume that there was such a tag, say, <safe>. The attacker would only need to close it: </safe><script> malicious code </script><safe>.
The way to stop XSS in this specific case would be to escape special characters to their URL encoding counterparts, so that http://quir.li/player.html?media=<script>alert('test')</script> becomes http://quir.li/player.html?media=%3Cscript%3Ealert('test')%3C%2Fscript%3E.
You should escape the special characters of HTML to remove their special meaning. For example, in PHP, the htmlspecialchars() function is intended for such escaping.
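For readers not on PHP, the same idea can be sketched in Python. This is only an illustration under my own assumptions: the function name render_source_display is hypothetical, and I am guessing that the page reads the media query parameter and writes it into the sourceUrlDisplay span shown in the question.

```python
import html
from urllib.parse import parse_qs, urlsplit


def render_source_display(page_url: str) -> str:
    """Build the sourceUrlDisplay span from the 'media' query parameter."""
    # parse_qs percent-decodes the value, so an encoded <script> comes
    # back as real angle brackets at this point.
    query = parse_qs(urlsplit(page_url).query)
    media = query.get("media", [""])[0]
    # Escape at output time so the decoded value cannot become markup.
    return '<span id="sourceUrlDisplay">' + html.escape(media) + "</span>"


good = render_source_display(
    "http://quir.li/player.html?media=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0VqTwnAuHws"
)
evil = render_source_display(
    "http://quir.li/player.html?media=%3Cscript%3Ealert('test')%3C%2Fscript%3E"
)
print(good)
print(evil)
```

Note that the second URL is already percent-encoded and still needs HTML escaping, because the value is decoded again before it is written into the page.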
I think it's worth mentioning that there actually are 2 (deprecated) HTML tags that do exactly what you're looking for: <xmp> and <listing>. Any HTML inside these tags will be shown as text in most browsers.
The keyword here being: most. The tags have been deprecated since HTML 3.2, and while they still seem to be supported by Chrome and Firefox (and possibly more browsers), who knows when they will be removed completely. On top of that, it is not impossible that some browsers have implemented them incorrectly.
In the end, you're still better off escaping the HTML as the other answers say.
In our internal CRM we have a simple HTML textarea where you can leave notes and messages. We later send this information out by email, but since that email is in HTML, the formatting is all wrong.
So if, for example, I have the following in my MySQL table:
This is a test message!
Some line
Some more lines
If we later email this it comes out as:
This is a test message! Some line Some more lines
This is obviously not wanted but I don't want to add some complicated WYSIWYG editor to our CRM. Can I allow line-breaks? If so, how?
I don't want to use <pre></pre> tags because I believe it is not supported in all email clients (I could be wrong).
You could use text/plain header, if you don't intend on using any HTML tags in the message. (That would mean no colors, no links, and no text formatting).
You could also make a quick and dirty solution that replaces every \n in your text with <br>\n.
The problem is that HTML renders all whitespace as single spaces. If you look at the source of the email once it's received, I'll bet the newlines will be there (if they're not, then the problem is on the email generation side).
<pre></pre> is the simplest thing you can do, I think.
A basic solution would be to replace new lines with <br>s.
A smarter one would give special consideration to multiple line breaks (e.g. treating /\n\s*\n/ as a point to end a paragraph and start a new one (</p><p>)).
The specifics would depend on the language you are using to generate the email from the MySQL data. You might want to consider something like a Markdown parser.
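As a rough sketch of both variants in Python (the question doesn't say which language generates the email, so Python and the function name text_to_html are assumptions of mine): blank lines end a paragraph, single newlines become <br>, and the text is escaped first so the note can't inject markup into the email.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Convert plain text to simple HTML paragraphs with <br> line breaks."""
    # Normalise Windows/old-Mac line endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly containing spaces) ends a paragraph.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    rendered = []
    for para in paragraphs:
        escaped = html.escape(para)  # escaping leaves \n untouched
        rendered.append("<p>" + escaped.replace("\n", "<br>\n") + "</p>")
    return "\n".join(rendered)


note = "This is a test message!\nSome line\nSome more lines"
print(text_to_html(note))
```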
You can send emails in two flavours: HTML and plain text. In HTML, line breaks are not processed (just like in your browser). It looks like this is what you are doing here.
Two solutions: either you send emails in plain text, or you change line-breaks to <br>.
Assuming PHP is in the mix, there's the nl2br() function. Otherwise, rolling your own won't be hard.
The root of this issue is that browsers (mail clients can either use embedded browsers for rendering - Outlook for example - or behave like browsers) will take any amount of whitespace/new line/carriage returns/etc outside of tags in HTML and render them as a single whitespace. This is so you can do things like indent your markup and still have it look sane in the browser.
You will have to insert markup in order to control the rendering, as has been suggested: convert newlines to <br> or <p> tags and so on, much like CMS WYSIWYG editors do. Either that or choose a different format for your emails.
Given the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between <a>...</a> and <p>...</p>, but that is not working that well.
Keep in mind that this is for a school project, and I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded; that is, I can't assume I already have the whole HTML page. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL cases, just be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
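The "let a parser do the work and keep only text outside invisible elements" idea can be sketched with Python's standard-library html.parser, which accepts input in chunks as it arrives. Python, the class name VisibleTextExtractor and the helper visible_words are my own illustrative choices, and whether a standard-library parser counts as "external" for the assignment is something I can't tell from the question. It ignores CSS entirely, and a chunk boundary that falls in the middle of a word would yield two fragments.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect words from text nodes that are not inside <script>/<style>."""

    INVISIBLE = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside an invisible element
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INVISIBLE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.INVISIBLE and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._words.extend(data.split())

    def take_words(self):
        """Return the words seen since the last call and reset the buffer."""
        words, self._words = self._words, []
        return words


def visible_words(chunks):
    """Yield visible words as each downloaded chunk is fed to the parser."""
    extractor = VisibleTextExtractor()
    for chunk in chunks:
        extractor.feed(chunk)
        yield from extractor.take_words()
    extractor.close()  # flush anything the parser was still buffering
    yield from extractor.take_words()


chunks = [
    "<html><head><style>p { color: red }</style></head>",
    "<body><p>Hello <b>world</b></p>",
    "<script>var hidden = 1;</script><p>Bye</p></body></html>",
]
for word in visible_words(chunks):
    print(word)
```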
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
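A minimal sketch of those two passes in Python (my choice of language; the function name strip_tags and the pattern names are hypothetical). It is the "near enough" approach described above: it needs the whole string at once and will be fooled by unusual markup such as a script that contains the text </script> inside a string literal.

```python
import re

# Pass 1: whole elements whose content is never displayed.
INVISIBLE_ELEMENTS = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Pass 2: any remaining tag; quoted attribute values may contain '>'.
TAG = re.compile(r"<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>")


def strip_tags(markup: str) -> str:
    """Remove invisible elements, then remaining tags, then tidy whitespace."""
    without_invisible = INVISIBLE_ELEMENTS.sub(" ", markup)
    # Tags are replaced with a space so that words in adjacent block
    # elements are not glued together; this also splits a word that
    # has an inline tag in the middle of it.
    without_tags = TAG.sub(" ", without_invisible)
    return " ".join(without_tags.split())


page = "<p>Hello <b>world</b></p><script>var x = 1;</script><style>p{}</style>"
print(strip_tags(page))
```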
I'd consider writing a regex to remove all HTML tags, which should leave you with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags. Note, however, that you first need to remove the script, link, style, ... elements along with their content.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than what you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
I've got an input where the user can type either HTML or plain text. When the user copies and pastes text from MS Word, for example, it generates some weird HTML. Then, when you view that topic, you can see that the whole page's style is affected. I don't really know if the generated HTML has unclosed tags or something, but it looks like it does, and so the style of the page is affected.
Does anybody know how to "isolate" the HTML of that div (or whatever the container is) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Copying text from Word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>.
You can either sanitize the input or display it in an IFrame.
If it were me, I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I end up doing: I strip and convert all of Word's CSS styles into <strong>, <em>, etc.
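To give a rough idea of the "strip all but basic formatting" part, here is a small allowlist sketch using Python's standard-library html.parser. Python, the class name BasicFormattingSanitizer and the allowed tag set are my own assumptions, it does not convert Word's CSS into tags, and it is an illustration rather than a vetted security filter or a replacement for Tidy.

```python
import html
from html.parser import HTMLParser


class BasicFormattingSanitizer(HTMLParser):
    """Keep a few formatting tags, drop everything else, close what is open."""

    ALLOWED = {"b", "i", "strong", "em", "p", "br"}
    VOID = {"br"}                          # no closing tag needed
    DROP_WITH_CONTENT = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_WITH_CONTENT:
            self.drop_depth += 1
        elif tag in self.ALLOWED and self.drop_depth == 0:
            self.out.append("<" + tag + ">")  # attributes are discarded
            if tag not in self.VOID:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.DROP_WITH_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag in self.open_tags and self.drop_depth == 0:
            # Close anything opened after this tag as well.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append("</" + top + ">")
                if top == tag:
                    break

    def handle_data(self, data):
        if self.drop_depth == 0:
            self.out.append(html.escape(data))

    def result(self):
        self.close()
        # Close tags the input left open, innermost first.
        closing = "".join("</" + t + ">" for t in reversed(self.open_tags))
        self.open_tags.clear()
        return "".join(self.out) + closing


def sanitize(markup: str) -> str:
    sanitizer = BasicFormattingSanitizer()
    sanitizer.feed(markup)
    return sanitizer.result()


print(sanitize('<p class="MsoNormal"><b>Bold<o:p></o:p> text'))
```

Unknown tags such as o:p are dropped while their text is kept, and the unclosed <b> and <p> are closed at the end, so the fragment can't leak formatting into the rest of the page.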