How can I secure user submitted HTML markup? - html

Before I start I'd like to say I have read similar questions here, but I don't think they really answer it: Show HTML user input, security issue and Security risks from user-submitted HTML
I think these highlight the problems quite well but I am essentially asking advice for best practice in these circumstances.
I have been programming for a while and have just now come to the point where I want website administrators to submit HTML markup to display the content they want in their own sites.
Securing this content in the database is fine but now I want to display it on the site securely.
Even though this feature is only available to the site admins, I still want to secure against malicious script injections and try to prevent them from breaking the page with poor HTML.
Is the reality that I cannot safely guard against script injections as the threads above seemed to point out?
Do I use the mentality that if they break the site, it's down to them, or can I use some sort of markup validator when they update the content?

What do you think about Markdown?
It's a safe way to accept formatted content, and there are libraries for most popular languages.

You're correct: if you allow users to submit raw HTML, there's no way to prevent all possible injections. Even if you block the <script> tag in all its possible variations (and there are many), there are other vectors, like onfocus and onmouseover event attributes, that can be used to run malicious code.
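If the admins' markup isn't needed in a particular context, the safest baseline is to escape it so the browser renders it as plain text. A minimal sketch in plain JavaScript (the function name is my own):

```javascript
// Escape the characters that are significant in HTML so user input is
// rendered as text, never interpreted as markup or event handlers.
function escapeHtml(str) {
  return str
    .replace(/&/g, "&amp;")   // must run first so entities aren't double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

Escaping preserves none of the formatting, of course, so this only works where you want the raw text shown; for rich content you still need a real sanitizer.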

I would advise HTML Purifier; it's the best solution out there for sure. Google it!

Related

How can I pass form data from one html file to another without JS/PHP?

I'm learning basic web dev and started with HTML, CSS, Bootstrap. Haven't touched PHP or anything server side yet.
What I've done so far is I've created a pretty basic registration form with 5 fields, and what I'm trying to do is display the input of those fields in a table that I've created on another page. The submit button has the "method" and action. Now, I've Googled a ton to find some solutions and have gone through most of the questions on this site, but I still can't find out how to achieve what I'm trying to do without the use of PHP/JS.
So, is it even possible to read form data from another page like this without the use of JS/PHP? If so, how do I proceed and what needs to be done? I can post the source code but I don't think it's going to help since there isn't much there, everything else is working fine except for finding a solution to this.
Thank you.
You need a programming language.
If you want to do it entirely client-side, then that has to be JS.
If you want to do it server-side (which allows you to access the data and, optionally, make it available to other people, instead of limiting it to the user of the browser) then you can use any programming language at all (although JS and PHP are among the most common choices).
Since you are trying to create a registration page, you'll need to use server-side programming.
You necessarily need to use JavaScript / PHP.
Since you are just starting, I would highly recommend you to check out the W3Schools tutorials on HTML, CSS, JavaScript, PHP, Bootstrap and jQuery.
:)
So this is long gone, but I was actually able to resolve my problem without using anything other than what's built into the browser. Here's how I did it for anyone else who's trying to find the answer to this problem (probably not many; you don't usually do this professionally, and basically this was a challenge from a friend).
So, two things.
SessionStorage
LocalStorage
These are built into your browser, and you can use them for simple tasks by simply assigning values to them. The values stay there, and you can use them however you want.
But, as the names imply, sessionStorage will only retain those values for the session (the time you have your browser open), while localStorage can retain them indefinitely. I'm not sure if I can link other sites here, so just Google these terms to learn more about how to use them.
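As a sketch of how that looks in practice (the helper names and the "registration" key are my own; in a real page you would pass the browser's sessionStorage or localStorage object):

```javascript
// Pass form data between pages with the Web Storage API.
// Storage only holds strings, so the form fields are serialized as JSON.
function saveRegistration(storage, data) {
  storage.setItem("registration", JSON.stringify(data));
}

function loadRegistration(storage) {
  const raw = storage.getItem("registration");
  return raw === null ? null : JSON.parse(raw);
}
```

On the form page you would call saveRegistration(sessionStorage, {...}) in the submit handler, and on the table page loadRegistration(sessionStorage) to fill in the cells.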

How to block people from viewing source code?

So I take this class, and I'm way ahead of everyone else, and a lot of people steal code from my website. I have already disabled right-clicking, but it's rather easy to get around this. Is there any way to stop people from being able to view my source code?
tl;dr: Nope.
You could look into obfuscation, as well as CSS & JS minification.
"If you steal from one author, it’s plagiarism; Steal from many, it’s art."
No. If someone wants it, they will get it. You can make it harder, but you will just alienate your users from normal functionality. Focus on your backend code.
If they steal your code, your lecturer will hopefully notice; either way, they only hurt themselves.
Afaik the only way to hide your source code is if you put it on the server-side.
It is not possible to hide client-side source code from users - sorry.
One suggestion would be stopping the user from right-clicking but that might cause you more problems...
You could render the html pages server side and convert them into images which get sent to the client. You could then have some image maps that handle clicking on the various locations.
There isn't a perfect (100% bulletproof) solution to protect your JavaScript code on the client side; however, there are some tools on the market that can help you protect your code:
Code Compression/Minification (Usually don't protect the code)
Google Closure (Free)
Uglify JS (Free)
Code Obfuscation/Compression/Minification
JScrambler (Paid, but in my opinion the best one on the market)
Jasob (Paid)
Stunnix (Paid, it seems to be outdated)
Hope this answers your question!

keep user-generated content from breaking layout?

I have a site that wraps some user-generated content, and I want to be able to separate the markup for the layout, and the markup from the user-generated content, so the u-g content can't break the site layout.
The user-generated content is trusted, as it is coming from a known group of users on my network, but nonetheless only a small subset of html tags are allowed (p, ul/ol/li, em, strong, and a couple more). However, the user-generated content is not guaranteed to be well-formed, and we have had some instances of malformed user-generated content breaking the layout of the site.
We are working with our users to keep the content well-formed, but in the meantime I am trying to find a good way to separate the content from the layout. I have been looking into namespaces, but have been unable to find good documentation about CSS support for embedded namespaces.
Anyone have any good ideas?
EDIT
I have seen some really good suggestions here, but I should probably clarify that I have absolutely no control over the input mechanism that the users use. They are entering content into one system, and my page uses that system's API to pull content out of it. That system is using TinyMCE, but like I said, we are still getting some malformed content.
Why not use markdown
If your users are HTML-literate or can grasp the concept of Markdown syntax, I suggest you go with that. Stack Overflow works great with it. I can't imagine having a usual rich editor on Stack Overflow. Markdown editors are much simpler and faster to use and provide enough formatting capabilities for most situations. If you need some special additional features you can always add those in, but for starters the out-of-the-box capabilities will suffice.
Real-time view for self validation
But don't forget to include a real-time view of what users are writing. Self-validation works miracles, so they correct their own mistakes before posting data.
Instead of parsing the result or forcing the user to use a structured format, just display the content within an iframe:
<iframe id="user_html"></iframe>
<script>
document.getElementById("user_html").src = "data:text/html;charset=utf-8," + encodeURIComponent(content);
</script>
I built custom CMS systems exclusively for several years and always had great luck with a combination of a quality WYSIWYG, strong front-end validation, and relentless back-end validation.
I always gravitate toward CKEditor because it's the only front-end editor that can deal with Microsoft Word output on the front end...that's a must-have in my books. Sure, others have a paste from word solution, but good luck getting users to use it. I've actually had a client overload a db insert thanks to Microsoft Word that didn't get scrubbed in Tiny. HTML tidy is a great solution to clean things up prior to validation on the back end.
CK has built-in templates and classes, so I used those to help my users format without going overboard. On the back-end I checked to ensure they hadn't tried any funny business with CSS, but it was never a concern with that group of users. Give them enough (safe) features and they'll never HAVE to go rogue.
Maybe overkill, but HTML Tidy could help if you can use it.
Use a WYSIWYG like TinyMCE or CKEditor that has built-in cleanup methods.
Robert Koritnik's suggestion to use markdown seems brilliant, especially considering that you only allow a few harmless formatting tags.
I don't think there's anything you can do with CSS to stop layouts from breaking due to open HTML tags, so I would probably forget that idea.

Are there any alternatives to recaptcha.net, for stopping spam? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
A member of my company of greater rank than myself refuses to use recaptcha.net on his website to thwart spam on a public form. He thinks it would be difficult for anyone coming to our site to enter their information, since the Turing tests are "so darn hard to read".
Is there an alternative to using this method? That doesn't contain these sorts of difficult to read images?
(Okay stupid question...if it were up to me we'd use recaptcha because everyone else on earth does...but I just figured I'd check anyway.)
Also, is using a hidden field that is set by JavaScript and later checked on the server really a good way to thwart spam?
I myself don't really buy that it is, since there are all sorts of JavaScript engines that don't run in a browser yet can run JavaScript (Rhino etc.) that could easily be used to defeat a JS/server-side anti-spam method.
CAPTCHA will reduce your spam but it won't eliminate it. People are paid to decipher those glyphs. Some sites use the glyph that was presented to them for their own site so some hapless visitor will decipher it.
Just so you're aware that it's not a perfect solution.
Based on the principle of don't solve a problem until it's a problem: is spam a significant problem on your website? There is something to be said for not annoying your customers/visitors. Even here I sometimes need to make a few edits and I get the irritating "I'm a Human Being" test on typically the last edit I need to make. It's annoying.
People have proposed all sorts of other methods for dealing with this problem. One I read about used pictures of cats and dogs that you had to classify, because apparently there's a database of 30+ million of these in the US for abandoned animals or some such. This, or anything else that gets into widespread use, will be defeated.
The biggest problem with spam on sites is if you use software that's in widespread use (eg phpBB). Your best bet for those is to make enough modifications to defeat out-of-the-box scripting. You may get targeted anyway but spamming is a high-volume low-success game. There's no real reason to target your site until it accounts for a significant amount of traffic.
The other thing worth mentioning is techniques that can be used to defeat scripted spam:
Use Javascript to write critical content rather than including it as static HTML. That's a lot harder to deal with (but not impossible);
Rename and/or reorder key fields like username and password. For example, generate username and password form fields and store them as session variables so they only work for that user. That then requires the user to have visited the page with the login form (rather than scripting a form response that can be POSTed directly);
Obfuscate the form submission. Things like unobtrusive Javascript that you can do in jQuery and similar frameworks make this pretty easy;
Include a CAPTCHA image and field box and then don't display them (display: none in CSS). You'll confuse parsers.
The best way for not-so-popular sites is to insert a hidden field and check it. If it's filled in, then it's spam, because those bots just fill in any field they find.
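This hidden-field ("honeypot") check can be sketched like so; the field name "website" is an arbitrary example, and the input would be hidden with CSS so real users never see it:

```javascript
// A real user never sees the hidden honeypot field, so it arrives empty.
// Bots that blindly fill every field give themselves away.
function isLikelySpam(formFields) {
  const honeypot = formFields["website"]; // name of the hidden input
  return typeof honeypot === "string" && honeypot.trim() !== "";
}
```

The server-side handler would reject (or silently discard) any submission where this returns true.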
You might want to look into Akismet and/or Mollom.
Add a non-standard required input field. For example, require a check-box that says "check me" to be checked. That will defeat any automated scripts that aren't tailored to your site. Just keep in mind it won't defeat anyone specifically targeting your site.
A simple way is to display an image reading "orange", and asking users to type that.
Yes, reCAPTCHA will cut spam, but it will also cut conversions! You should consider using XVerify, which does real-time data verification. What makes those registrations spam is bogus data; with XVerify it will make sure the information put in is real data by verifying the email address, phone number, and physical address of users. If the information is fake, the user cannot click continue! SIMPLE!
I used to think CAPTCHAs were good and used reCAPTCHA on public forms. I noticed that spam submissions were gone but I also noticed that real submissions were cut drastically as well.
Now I don't believe in CAPTCHAs. They work but I feel they can do more harm than good. After having to enter in hard to read CAPTCHAs on other sites I understand why I don't get as many real submissions. Any input that a user must act on that is not related to their main goal is a deterrent.
I usually use several methods to prevent spam and it depends on what type of content I'm expecting in forms. I created server methods that scan comments and mark them as spam based on content. It works ok, but I'm no spam expert so it doesn't work great. I wish someone would make a web service that did this.
I think the links from Evan are pretty interesting!
Another method that I have heard about, which basically extends the javascript idea, is getting the client's browser to perform a configurable JavaScript calculation.
It has been implemented in the NoBot sample as part of the Microsoft AJAX Control Toolkit; see http://www.asp.net/AJAX/AjaxControlToolkit/Samples/NoBot/NoBot.aspx for more details on how it works.
I found an alternative called Are You A Human. Not that programmers should go on gut feelings, but from the start it seemed insecure. Since it's a fun game you play, I decided to try it. It didn't work for me. It's possible the host isn't set up for it. That's the last thing for me to check.
If anyone else has tried ayah, I'd like to know how it worked.
I've used Confident Captcha before and it was really easy to get set up and running. Also I haven't had any spam get through on the forum I manage.
It isn't a text based Captcha but instead uses images similar to picaptcha. I've tried 'Are you a human' before and it's definitely an interesting concept.
Found one called NuCaptcha which displays moving letters...
8 years later...
I have been looking for alternatives to Google's reCAPTCHA which don't ruin the UX, track users, etc., and found this gem: Coinhive Captcha.
It works by mining Monero coins in the background (the hash count is adjustable) and provides a server-side API to verify it. It should be noted that, depending on the selected hash count to solve, it may be slow, especially on mobile devices.

Should I sanitize HTML markup for a hosted CMS?

I am looking at starting a hosted CMS-like service for customers.
As you would expect, it would require the customer to input text which would be served up to anyone who comes to visit their site. I am planning on using Markdown, possibly in combination with WMD (the live Markdown preview that SO uses), for the big blocks of text.
Now, should I be sanitizing their input for HTML? Given that there would only be a handful of people editing their 'CMS', all paying customers, should I be stripping out the bad HTML, or should I just let them run wild? After all, it is their site.
Edit: The main reason as to why I would do it is to let them use their own javascript, and have their own css and divs and what not for the output
Why wouldn't you sanitize the input?
If you don't, you're inviting calamity - to either your customer or yourself or both.
Your question asks:
"Edit: The main reason as to why I would do it is to let them use their own javascript, and have their own css and divs and what not for the output".
If you allow users to supply arbitrary JavaScript, then sanitizing input is not worth the effort. The definition of Cross-Site Scripting (XSS) is basically "users can supply JavaScript and some users are bad".
Now, some websites do allow users to supply JavaScript and they mitigate the risk in one of two ways:
Host the individual users' CMSs under a different domain. Blogger and Tumblr (e.g. myblog.blogspot.com vs. blogger.com) do this to prevent one user's templates from stealing another user's cookies. You have to know what you are doing, and never host any of the user content under the root domain.
If user content is never shared between users then it does not matter what script malicious users supply. However, CMS's are about sharing so this probably doesn't apply here
There are some Blacklist filters out there that may work, but they only work today. The HTML spec and browsers change regularly which makes filters almost impossible to maintain. Blacklisting is a sure fire way to have both security and functional problems.
When dealing with user data, always treat it as untrusted. If you don't address this early in the product and your scenarios change, it is almost impossible to go back and find all of the XSS points or modify the product to prevent XSS without upsetting your users.
You would also be protecting against disgruntled employees, cross-customer attacks, or any other sort of idiotic behavior.
You should always sanitize, no matter the users or viewers.
At least parse their entry and only allow a certain "safe" subset of HTML tags.
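A sketch of that tag-whitelist idea, using a regex pass that keeps a handful of allowed tags (with attributes stripped, which also drops href on links) and removes everything else. Treat this as illustration only: text between removed tags remains, and a regex can't handle every HTML quirk, so a parser-based library like HTML Purifier is the safer choice in production.

```javascript
// Tags considered safe; everything else is stripped. (Example list only.)
const ALLOWED_TAGS = new Set(["p", "ul", "ol", "li", "em", "strong", "a"]);

function filterTags(html) {
  // Match any tag, capture whether it's a closing tag and its name,
  // then re-emit allowed tags bare (no attributes) and drop the rest.
  return html.replace(/<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g,
    (match, slash, name) =>
      ALLOWED_TAGS.has(name.toLowerCase()) ? `<${slash}${name.toLowerCase()}>` : "");
}
```

Note that stripping attributes is what removes onclick-style event handlers; a real sanitizer would instead keep a per-tag attribute whitelist so that, e.g., safe href values survive.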
I think you should always sanitize the input. Most people use a CMS because they don't want to create their own website from scratch and they want easy access to edit their pages. These users most likely will not be trying to put in text that would get sanitized, but by protecting against it you are protecting their users.