Cleaning up HTML from textarea - html

I have a page with two textareas where registered users can enter HTML. The first one uses TinyMCE (so the HTML is cleaned up), but the other one does not, since I expect it to hold embed codes from other sites (mostly sites that provide maps, e.g. Google Maps, MapMyRace.com, etc). The problem is that those other sites may provide different tags, not just <embed> or <iframe>, so I can't strip tags: I might strip tags that I didn't know other sites provided. I will save the HTML from these two textareas in my database, to be retrieved and displayed as parts of some other pages.
Do you have any suggestions to make this setup more secure? Or should I disallow free input of HTML in the 2nd textarea altogether? (Or.. I let the users tick a check box saying "I accept full responsibility for the behavior of the code I am inserting".. LOL)
Your opinion is highly appreciated :)
Thanks

The short answer is: free HTML is insecure and must be avoided. Nothing stops a user from creating an iframe that redirects visitors to some harmful page, puts ads on your page, or defaces your site.
My favorite approach to this problem is to let the user paste a link (not the "embed on page" iframe code) into a text box. Then I use a regex to identify the pasted link (is it YouTube, Bing Maps, ...) and I build the HTML from the pasted link myself, which isn't too complex for most iframe providers. It's much more work for you, and it restricts the APIs you can put on your page, but it's secure.
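The link-to-embed idea above could be sketched roughly like this in Python. This is only a minimal illustration, not a complete implementation: the `PROVIDERS` table, the `build_embed` helper, and the iframe dimensions are hypothetical names and values, and the URL patterns cover only YouTube as an example. Check each provider's documentation for its real link and embed formats.

```python
import re
from html import escape
from typing import Optional

# Hypothetical provider table: (pattern for the pasted link, embed URL template).
# Only YouTube is listed here as an example; other providers would be added
# after checking their documented URL formats.
PROVIDERS = [
    (
        re.compile(r"^https?://(?:www\.)?youtube\.com/watch\?v=([A-Za-z0-9_-]{11})(?:&.*)?$"),
        "https://www.youtube.com/embed/{id}",
    ),
    (
        re.compile(r"^https?://youtu\.be/([A-Za-z0-9_-]{11})$"),
        "https://www.youtube.com/embed/{id}",
    ),
]


def build_embed(pasted_link: str) -> Optional[str]:
    """Return iframe HTML for a recognized link, or None if it is not recognized."""
    link = pasted_link.strip()
    for pattern, template in PROVIDERS:
        match = pattern.match(link)
        if match:
            # Only the captured ID reaches the output, never the raw user input.
            src = template.format(id=match.group(1))
            return '<iframe src="{}" width="560" height="315" frameborder="0"></iframe>'.format(
                escape(src, quote=True)
            )
    return None  # unknown provider: reject instead of storing arbitrary HTML


print(build_embed("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
print(build_embed("javascript:alert(1)"))
```

Because the server generates the markup itself, anything that does not match a known pattern is simply rejected, which is where the security of this approach comes from.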

Letting your users enter arbitrary HTML is dangerous. You may want to maintain a blacklist and a whitelist of tags that you disallow and allow, respectively.
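As a rough illustration of the whitelist side, here is a minimal sketch using only Python's standard `html.parser`. The `ALLOWED` table, the `AllowlistSanitizer` class, and the `sanitize` helper are hypothetical names chosen for this example, and the set of permitted tags is an assumption. Real-world HTML has many edge cases, so a maintained, well-tested sanitizer is the safer choice in production.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> attributes permitted on that tag.
ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}}
# Tags whose text content is dropped along with the tag itself.
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drops onclick, onload, style, ...
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue  # drops javascript: and other schemes
            kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # re-escape text on the way out


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>'))
```

The sketch only keeps what is explicitly listed, so a tag or attribute nobody thought of is removed by default rather than let through.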

Related

Allow users to customize web page

I am working on a project to provide a platform that will allow my users to write blogs and customize the look of the page however they want, using just HTML and CSS. I will be using Python/Django. I am just concerned about how to go forward with it. Are there any security issues that I should be concerned about? If you could guide me on how to proceed I would be very grateful.
For starters, check this question. You will need to remove the tags (and attributes) that may create dangerous behaviours (like the script tag or the onload attribute).
Give them fields to add their CSS and HTML (for the HTML, add a nice WYSIWYG editor like CKEditor, TinyMCE, etc.).
For the CSS, stripping HTML and removing URLs should be enough (let me know if there is something additional on this part). Put the CSS inside a <style> tag in the head.
For the HTML, insert the content into the page with the safe filter, {{content|safe}}, after your mandatory content (if there is a global navbar, etc). The safe filter turns off Django's auto-escaping for that value, so the content must already be sanitized by the time it is rendered.
Again, kill dangerous tags as soon as possible...script, iframe, etc.
With something like this, users should have control over the layout of their content and the style of that section of the site. This assumes you want the same structure for all users (e.g. a sidebar on the right showing the 3 latest entries).
If you want to give them some more customization, the easiest way (both for developers and users) is to show them a list of options (e.g. the sidebar can show the n latest entries, it shows or hides blogger info, it has social share options, blog entries have comments enabled, etc).

HTML WYSIWYG editor: why is the editable content moved in an iFrame

Why is the editable HTML moved into an iFrame? I analysed different editors (TinyMCE, CKEditor, etc) and they all move the editable content into a separate iFrame, which they lay over the original text.
What is the technical reason for this? I experimented with contenteditable="true", which is the basis of all these editors too, and haven't found a reason to do this yet.
I'm a CKEditor core developer. Not for a long time - just for the last half year, but I've learnt a lot about why we use an iframed editable :)
Styling - content of the iframed editor doesn't inherit styles of the page. This is extremely important, because we cannot reset styles (sic! CSS really sucks). What's more - in iframe we can freely add our own styles which is helpful too.
Only in iframed editable we can work on entire page with head, metas, body styles, title, etc. Some of our users need this.
Browsers have very buggy (and incomplete) implementations of contenteditable. E.g. guess what will happen on Firefox when you paste a list into an editable which is an <h1> element (you can check that in this editor - http://createjs.org/demo/hallo/)? It will leak out of the editable area and become a non-editable element. We have to handle these cases manually in the editor and this is really hard work :).
I'm not sure about this, but I believe that designMode, which switches the entire document into an editable area, came first and contenteditable came later. So the reason may be historical too - it's hard to switch from one approach to another.
There are probably more reasons why we use an iframed editable. I'll update my answer when I learn them :)
From the TinyMCE forum
Hi Zappino!
It is the very nature of editors like TinyMCE to use an IFrame because
in a frame you can modify any part of an HTML document to suit your
needs without breaking anything in the main page's document.
Especially if you want to edit a complete HTML document including the
parts between and you won't be able to do so without an
IFrame.
Cross Domain Scripting will occur if you store TinyMCE's files on a
different (sub-)domain than the page from which you embed the editor.
Show us a test scenario of your installation with which you are having
trouble and someone might be able to help you out!
Greetings from Germany (back to Germany )
Felix Riesterer.

Prevent site configuration info from showing up on Google

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason, all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google, without having that Flash player config info show up?
And can I customize what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I want to show up, I can probably do most of the edits myself, I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!
If you don't want Google or other crawlers to access certain parts of your website, you should use a robots.txt file. In it you specify which parts are accessible and which aren't; when well-behaved crawlers reach your website, they look for this file for instructions.
You can check some documentation on how to do it here and here
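To make the robots.txt suggestion concrete, here is a small sketch that checks a set of rules with Python's standard `urllib.robotparser`. The `/player/` path and the `example.com` URLs are assumptions made up for this illustration; the question does not say where the Flash player configuration actually lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /player/ path is an assumed location
# for the Flash player configuration, not something taken from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /player/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler that honors robots.txt may fetch each URL.
print(parser.can_fetch("Googlebot", "https://example.com/player/config.xml"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/about/"))  # True
```

The same rules text would be served as /robots.txt at the root of the site. Testing it locally like this is a quick way to confirm a rule blocks only what you intend.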
To influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
Some more information from Google on controlling parts of a page. Apparently there are googleoff/googleon tags.
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
Hope this helps.
If you want Google to index only part of your pages, you can't follow normal SEO routines. You should provide a mechanism to understand whether the current client (requester) is a robot or not. If yes, then don't render that part. This is the only way. Otherwise, a robot either gets the whole rendered content, or doesn't have access based on robots.txt file (Robot Exclusion Protocol).
Another way (which is not really smart, and can't be guaranteed to work) is to inject your content into the page dynamically via JavaScript, because AFAIK robots don't run JavaScript.
As search spiders won't render JavaScript-generated markup (JS is not run, as it is client-side in the browser), a quick fix would be to not output any of the Flash markup initially in the HTML document and then use JS to add the Flash stuff on load.
Note: as far as I'm aware, Google is currently testing a JS reading spider so this may not work long term.
Google is returning this data because it simply can't find any content where it normally would. Search engines require content - they're not advanced enough to process your multimedia to determine what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (which currently consists only of iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines). Make sure it's not some dinky text like "download flash here", but a lengthy, descriptive piece of content about your site or the media they would normally experience if they could.
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and inline JS on your site (I'd recommend WP-minify since it's so obvious that you love plugins).

Insert my site's HTML into another static page

I would like to use my website on mechanical turk.
But I can only enter static HTML into Mechanical Turk's description,
so I need to somehow place my website there. How do I do that?
You can't effectively do this, with one exception.
See if they permit the iframe element. If they do, then you can use an iframe to reference your web site. Be warned, cookie behavior and other things may cause interaction problems on your site. iframe is also considered a security risk, so I would not be surprised if they don't allow it.
Your actual best bet is going to be merely linking to your site from the description field you're given.

FCKEditor breaking HTML forms

I'm in the process of reproducing some standalone HTML forms as pages in a CMS that uses FCKEditor by simply copying and pasting the relevant code into the editor.
But when I save and view the page, the HTML has been changed and the closing </form> tag has been moved up to just below the opening <form> tag -- and not at the bottom of the form. This obviously renders all of the fields in the form, including the submit button, useless.
Is there a way to tell FCKEditor that I know what I'm doing and I don't need it to validate the HTML output?
Unfortunately this is a hosted CMS service (actually part of an email blast tool) so making changes to the configuration will mean I need to go through the company's support system, which is fine -- but they haven't been able to solve it for me yet, so I'm hoping to get the answers for them.
Thanks!
This is a bit of a difficult thing because as far as I know, it's not necessarily the WYSIWYG editors that "fix" "broken" HTML, it's the browsers' HTML editing engines themselves, and it's often near impossible to talk them out of doing this.
You'd have to show your exact source to get detailed feedback, but check out whether protectedSource is something for you. It's supposed to protect code that is covered by the regular expression you specify.
I'm not sure about FCKEditor, but you might want to consider switching to TinyMCE. TinyMCE allows you both to edit the list of allowed tags and to turn HTML validation off completely if you like.