I have a site that wraps some user-generated content, and I want to be able to separate the markup for the layout, and the markup from the user-generated content, so the u-g content can't break the site layout.
The user-generated content is trusted, as it is coming from a known group of users on my network, but nonetheless only a small subset of html tags are allowed (p, ul/ol/li, em, strong, and a couple more). However, the user-generated content is not guaranteed to be well-formed, and we have had some instances of malformed user-generated content breaking the layout of the site.
We are working with our users to keep the content well-formed, but in the meantime I am trying to find a good way to separate the content from the layout. I have been looking into namespaces, but have been unable to find good documentation about CSS support for embedded namespaces.
Anyone have any good ideas?
EDIT
I have seen some really good suggestions here, but I should probably clarify that I have absolutely no control over the input mechanism that the users use. They are entering content into one system, and my page uses that system's API to pull content out of it. That system is using TinyMCE, but like I said, we are still getting some malformed content.
Why not use markdown
If your users are HTML literate or people that can grasp the concept of markdown syntax I suggest you rather go with that. Stackoverflow works great with it. I can't imagine having a usual rich editor on Stackoverflow. Markdown editors are much simpler and faster to use and provide enough formatting capabilities for most situations. If you need some special additional features you can always add those in but for starters oute of the box capabilities will suffice.
Real-time view for self validation
But don't forget to include a real time view of what users are writing. Self validation makes miracles so they correct their own mistakes before posting data.
Instead of parsing the result or forcing the user to use a structured format, just display the content within an iframe:
<iframe id="user_html"></iframe>
<script>
document.getElementById("user_html").src = "data:text/html;charset=utf-8," + escape(content);
</script>
I built custom CMS systems exclusively for several years and always had great luck with a combination of a quality WYSIWYG, strong front-end validation, and relentless back-end validation.
I always gravitate toward CKEditor because it's the only front-end editor that can deal with Microsoft Word output on the front end...that's a must-have in my books. Sure, others have a paste from word solution, but good luck getting users to use it. I've actually had a client overload a db insert thanks to Microsoft Word that didn't get scrubbed in Tiny. HTML tidy is a great solution to clean things up prior to validation on the back end.
CK has built-in templates and classes, so I used those to help my users format without going overboard. On the back-end I checked to ensure they hadn't tried any funny business with CSS, but it was never a concern with that group of users. Give them enough (safe) features and they'll never HAVE to go rogue.
Maybe overkill, but HTML
Tidy
could help if you can use it.
Use a WYSIWYG like
TinyMCE
or CKEditor that has built in cleanup methods.
Robert Koritnik's suggestion to use markdown seems brilliant, especially considering that you only allow a few harmless formatting tags.
I don't think there's anything you can do with CSS to stop layouts from breaking due to open HTML tags, so I would probably forget that idea.
Related
Before I start I'd like to say I have read similar questions here but I don't think it really answers the question: Show HTML user input, security issue and Security risks from user-submitted HTML
I think these highlight the problems quite well but I am essentially asking advice for best practice in these circumstances.
I have been programming for a while and have just now come to the point where I want website administrators to submit HTML markup to display the content they want in their own sites.
Securing this content in the database is fine but now I want to display it on the site securely.
Even though, this feature is only available to the site admins I still want to secure against malicious script injections and try to prevent them breaking the page by using poor HTML.
Is the reality that I cannot safely guard against script injections as the threads above seemed to point out?
Do I use the mentality that if they break the site, it's down to them, or can I use some sort of markup validator when they update the content?
What do you think about markdown?
It's a safe way to submit html, and have libraries to most popular languages.
You're correct, if you allow to submit pure HTML - there's no way to prevent all possible injections. Even if you disable <script> tag in all it's possible combinations (and there're many) there're other ways like onfocus onmouseover events that can be used to run malicious code.
I would advice HTMLPurifier, it's the best solution out there for sure. Google it!
I have to migrate a static HTML website to TYPO3. I know, I could read docus first, but I believe I will need to read some days first to only recognize which direction to run...
Do I have to learn TypoScript like
Default PAG
page = PAGE
page.typeNum = 0
page.20 = TEXT
page.20.value = HELLO UNIVERSE!
page.10 = TEXT
page.10.value = HELLO WORLD!
or is there another way to do it quickly? With markers?
thank you guys!
You will have to learn a little bit of TypoScript to do what you want. Sorry :-( But you won't have to learn that much, and what you do learn you'll be able to reuse when building other TYPO3 sites.
First thing: skip markers. Markers are a remnant of an old, deprecated templating system. The way you should be doing this is with TemplaVoila.
TemplaVoila works by giving you an interface to map TYPO3 content (or instructions to generate content) to blocks of markup in your HTML file. In other words, you take your static HTML file, then go through it and tell TemplaVoila "OK, that DIV is my sidebar, so put a list of all the site pages in there... that P is the footer, put a link to the privacy policy there... that DIV is the main content area, fill it with blocks of content created by the user," and so forth. This is a very powerful approach, because it means that if you work with other Web designers or graphic designers, they don't have to learn any special "magic tags" or markers; they can just give you well-formed HTML and with a few clicks you can turn it into a live template for a site. Pretty nifty.
There's a piece of TYPO3 documentation called "Futuristic Template Building" that explains pretty clearly how to go from a static HTML page to a TYPO3-ized site with TemplaVoila. Here's a direct link to the section of that doc that walks you through the process. (Don't be scared by the word "futuristic" into thinking that TemplaVoila isn't fully baked yet -- that doc was written six years ago, when TemplaVoila was pretty futuristic, but today it's quite mature and in use on TYPO3 sites all over the world.)
This should be enough to get you started, but if you hit roadblocks or can't wrap your head around it feel free to post your questions back to this thread and I'll help you out.
I'm reviving this, since a lot has happened since 2010.
There are multiple ways in TYPO3 to do the templating. All of them involve TypoScript, but in some there is only a minimal amount of TS needed.
Use "the old built-in way", doing all rendering in TypoScript and some HTML templates with markers in them. In this approach, you'd use the content elements provided by the core. Their rendering is defined with TypoScript in the core-extension "CSS Styled Content".
Use "the new built-in way". Here you'd also use the content elements provided by the core, and optionally self-defined ones. The rendering happens using the Fluid templating engine. You would do this using the core-extension "Fluid Styled Content". This is available since version 7.5.
Use a third party extension for content element rendering. I know of these:
Templavoilà - You probably should not use it, since it is not actively developed anymore, although there is a version claiming compatibility to TYPO3 7 LTS, but I don't know much about that.
FluidTYPO3 - This is a whole ecosystem of extensions with which you can define page templates and content element templates completely using the Fluid Templating engine (backend forms, backend preview and frontend rendering). It also provides a mechanism for nesting content elements.
DCE - Dynamic content elements. I don't know anything about them, you would need to read the docs.
Mask - TYPO3-core near wizard for own contentelements and pagetemplates. Uses database fields, not flexforms.
More extensions I don't know of.
It would be a bit much to explain all of these here in detail.
My current personal favorite is the FluidTYPO3 ecosystem, but I'm considering a switch to using Fluid Styled Content, because it is directly integrated into the core. I'm not sure if it supports nested content elements, so maybe one would need a separate solution for that (e.g. the extension gridelements).
I have always hated wysiwyg editors but most of the applications I develop they are necessary for our clients to edit their content. After trying out a few different ones I seemed to like tinyMCE the best. Although powerful and seems to generate fairly good HTML it is not without its issues. Recently I have been thinking about creating a custom wysiwyg that suits my needs perfectly taking advantage of the contentEditable attribute. Is this HTML5 feature ready? Will I have many cross browser issues? What are its limitations? I guess my question finally boils down to; Will it be worth throwing in the towel on 3rd party wysiwygs and moving to contentEditable regions?
The third party wysiwyg editors will also use the contenteditable attribute. The biggest problems is that they really do create tag soup and the same text created in different user agents will have different HTML source.
Personally I would say that you should stick with tinyMCE of CKEdit.
I say it depends on your scope. If you need something elaborate, massive, and the amount of loaded javascript does not bother you, the use some WYSIWYG. They give lots of possibilities, but also some problems (like this security issue: http://www.devilscafe.in/2011/10/tinymce-ajaxfilemanager-remote-file.html).
But if you need something simple, use html5 contenteditable combined wit execCommand, like this: http://www.quackit.com/html/codes/contenteditable.cfm.
A prospective client has a site with pages done in Dreamweaver(tm). I don't have Dreamweaver(tm), never use it, and have in the past have seen some spaghetti (html) code on pages created with it. As a result, I'm wondering: if I create clean sections of semantic html, will Dreamweaver continue to be able to edit it when the client wishes to modify the code that I have created by hand (assuming that I create clean and validating code). This is more of an experience based question than a research based question. I'm sure that Dreamweaver(tm)'s documentation would tell me that it's compatible with clean, semantic html, but it's hard to take that advice from the horse's mouth.
So, based on your experience, is Dreamweaver happy with clean semantic html created externally, or does the visual mode display badly when it's dealing with manually, externally-created semantic html?
Short answer: it is compatible.
It depends on how your clients want to modify the code. Dreamweaver is a good editing tool, and it supports and "understands" semantic (x)html/css. However, if your clients still use html for styling (such as table-based layouts etc.), then proper clean css-styling can be frustrating for them to deal with later. But it is another problem and it has nothing to do with the editor.
Editing HTML in the “code” mode of Dreamweaver leaves it alone. Not sure about the visual mode though.
Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to aware of?
Don't think image should really be last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet) there is no guarantee that tomorrow the site doesn't decide to change its layout.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or "transclusions". There is a library out there for it, but its not exactly actively developed. Here's a couple ajaxian articles on it.
I'd recommend using a server side solution with Python; using urllib2 to request the page, then using BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection api with which you can craft heuristics for the section you are interested in.
To illustrate:
soup = BeautifulSoup(html)
text = soup.find(text="Some text on the page that is unlikely to change")
print soup.parent.prettify()
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On client side <iframe> is the only practical option. It is possible to scroll it, but it might not work in the long term, because it's technically close to clickjacking attack.
There's also cross-site XHR, but requires opt-in from destination site, and today works only in few latest browsers.
Getting HTML on server side is easy (every decent web framework has ability to download page and parse HTML and you can use XPath/XSLT or DOM to extract bit you want).
Getting styles however is going to be tricky – CSS rules may not work with HTML fragment taken out of context. You'd have to parse CSS, extract and transform rules or use browser and read currentStyle of every node.
Obviously you have to heavily filter HTML you extract to avoid XSS. It's harder than it seems.
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract content fragment with styles.
That sounds like something that IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement for you to be able to take advantage of it.