Embed sandboxed HTML on a webpage + SEO - html

I want to embed some HTML on my website... I would like that:
SEO: that content can be crawled and indexed
Integration: it renders nicely (does not break my DOM trees for instance, or does not inherit my styles)
Security: it remains safe for our user (javascript disabled)
Flexibility: the HTML can be completely free (don't want any BBCode or MarkDown or even TinyMCE, it's our users that are writing the HTML code...)
I saw that I might be able to use the IFrame for that, but I am not sure it is a very good solution concerning my SEO constraint.
Any answer would be greatly appreciated!!! Thanks.

For your requirements (rendering and security, primarily), IFRAME seems to be your only option, especially when we consider no rules are specified for the HTML content except the JS removal. Even some CSS + 'a' tag can bring a serious security risk, like overlaying outgoing links on your standard interface.
For the SEO part, you can use SEO maps to show the search engines the relation between the content and the container, also use html tags like link to make connection.

To make sure the user's html is safe then you should use HTMLPurifer. In terms of the rest of the question, you should split this up into multiple questions.

Related

(Dis-)advantages of embedding HTML header/footer with <object> tags?

I am looking for a simple way to make my website modular and extract common parts that appear often like header and footer into separate files.
In my case, I can not use any server-side includes with e.g. PHP. I also generally would like to avoid including big libraries like JQuery just for such a simple task.
Now I came across https://stackoverflow.com/a/691059/4464570 which suggests to simply use an HTML <object> tag to include another HTML file into a page, like this:
<object data="parts/header.html" type="text/html">header goes here</object>
I might be missing something important here, but to me this way seems to perfectly fit my needs. It is very short and precise, the <object> tag is well supported by all browsers, I don't need to include any big libraries and actually I don't even need any JavaScript, which allows users blocking that to still view the correct page structure and layout.
So are there any disadvantages I'm currently not aware of yet with this approach? The main reason for my doubts is that out of dozens of answers on how to include HTML fragments, only one recommended <object> while all others went for a PHP or JavaScript/JQuery way.
Furthermore, do I have to pay attention to anything special regarding how to put the <object> tag into my main page, or regarding the structure of the file I want to include this way? Like, may the embedded file be a complete HTML document with <!DOCTYPE>, <html>, <head> and <body> or should/must I strip all those structures and leave only the inner HTML elements? Is there anything special about using JavaScript or CSS inside HTML embedded this way?
The use of the <object> tag for HTML content is very similar to the use of an <iframe>. Both just load a webpage as a seperate document inside a frame that itself is the child of your main document.
Frames used to be very popular in the early days of web development, then in the form of the <frame> tag. They are generally frowned upon, however, and you should really use them as little as possible.
Why not to use this technique for displaying your own content
The HTML content in the child frame cannot communicate with the parent. For example, you can't use a script in the parent HTML to communicate with the child and vice versa. That makes it not very useful for serving your own content when you want to display anything but static text.
Why not to use this technique for displaying someone else's content
You can't use it to serve a lot of external content either. Many websites (including eg. SO) send an X-Frame-Options header along with their webpage that has the value SAMEORIGIN. This prevents your content from being loaded and displayed.

Is Google microformats supposed to be visible on the web page?

I was trying to add microformats as following to my webpage:
<div itemscope itemtype="http://schema.org/Product">
<span itemprop="brand">Company Name</span>
<span itemprop="name">Product Name</span>
<span itemprop="description">Product Description</span>
Product #: <span itemprop="sku">12345</span>
</div>
I thought this microformat will only show up in a google search result page. But after adding it, those information became visible on my webpage, and not in a good shape.
Is there something wrong? Or should I use display:none to make it invisible on my webpage?
Microformats are meant to add machine readable meaning to existing content on the page. They're not invisible meta data, they augment content that's already there. So, yes, it'll show up. You can hide or style it via any of the usual ways in which you hide or style content.
You are using Microdata, not Microformats.
Microdata is a syntax to include structured data within HTML5. Ideally you would use your existing content (i.e., add the needed attributes like itemprop etc. to your already existing markup), and only if that’s not possible, the hidden elements meta and link (which are allowed in the body if used for Microdata).
If you don’t want to use your existing markup and the visible content, you could use an alternative syntax: JSON-LD. This gets included as a data block (using the script element), which is not visible by default.
Don't try to use hide or style on your content, it will have a bad impact on your site. You might get penalized for cloaking if you practice it on all of your pages.
If you are trying to mark/let the bots know about some more info that is not on your page you can try using either the Data Highlighter for simple things in you Search Engine Console (Webmaster Tools) or for more complicated stuff you can try using JSON-LD coding on you pages.
Microformats are HTML. Used to publish a standard API that is consumed and used by search engines, browsers, and other web sites. Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Microformats are a way to enable "smart scraping" of web pages, so that you can create tools and scripts that losslessly extract machine-readable information from cleanly-formatted, human-readable HTML. Structured Data is the name given to content which is marked up in a specific way, using MicroFormatting, to explain what that content is all about.
It is always recommended to show the Microdata information and not to hide it. You can probably try to give a good shape. It would show up in the Google and Bing result pages as well but you need to wait a little for that. There is nothing wrong with the Microformats applied by you. The thing is SEO need some more patience.

Is it safe to allow to embed an arbitrary external stylesheet into my web-page?

I have a dynamic web-page which I want other people to embed into their web-pages, with an iframe (not necessarily with any kind of more advanced techniques like JavaScript).
Instead of providing all sorts of designs and styles myself, I'm thinking about allowing them to provide their own stylesheet for my page through an HTTP GET parameter, and embed such external stylesheet through a URL w/ <link type="text/css" rel="stylesheet" href… on my page.
Is this safe? Will it violate the security paradigm of my web-site? I'm aware that extra text could be inserted with CSS alone, and indeed elements could be removed (which is the whole point of me providing such functionality for my users), but anything else I should be aware of?
Could malicious people insert links onto my site through such a CSS, to benefit from my http referer and potentially violate some checks, or is CSS insertion limited to text?
In the general case, no, allowing third-party CSS is not safe. Some implementations allow JavaScript in CSS, which means that allowing users to modify your CSS allows them to execute arbitrary JavaScript in the context of your page.
However, if this is meant to be sort of a "white-label" page, where it appears to be part of the site it's embedded in and the fact that it's really your page is just an implementation detail, this doesn't seem like a major concern. The person specifying the "third-party" CSS is the site owner, so it's not really third-party at that point — they're not going to XSS themselves!
But nobody else should ever be putting CSS on a page that's meant to be under your control, because it's really under the control of whoever is controlling the CSS.
CSS cannot insert linkable content. It can only style, position and hide what's already there. Sure, people can mess up your page with :before and :after text an perhaps make things look a little confusing or change labels on existing links, but not the URLs themselves.

How to convert text to speech on a web page?

I am making a web page that displays fragments of text from news sites (CNN, BBC, etc.) but I also want it to be read to people who can't see. How can I program the HTML page to read the text for them? Any ideas?
Thanks, Boda Cydo.
People who can't see will already be using either a screen reader (which will read the text to them), braille display or similar.
You just need to focus on making the text accessible and let their software handle "displaying" it to them.
The best way to make your website readable to people who cannot see is to use semantic HTML and follow standards. HTML readers can't magically infer your meaning if you don't. For example:
Use H1-H6 to designate the correct levels of titles in your site
Use P tags for body content
Use UL lists for navigation and A tags only for things that are really links
Use CSS for style - If an image is just used for style, put it in a background image instead
Only use tables for data that really is tabular.
If you have any content images, use IMG and provide ALT text
Use LABEL tags appropriately for forms
Use title attributes where appropriate
Most importantly - try turning off CSS in your browser. Does your web page still make sense to you? If so, you are probably on the right path.
No, you need to use Flash or a Java Applet to do this. There is nothing native in a browser for text-to-speech. Most people with these needs already have software that does this for them.
look to http://en.wikipedia.org/wiki/JAWS_%28screen_reader%29 . As far as i know it have integration with browsers
As Diodeus noted, if they have a need for text to speech then they will already have software to read for them. Just make the text available.
If you actually want to go through with implementing it yourself (though I wouldn't recommend it) then you can try to use the Google Translate API as described here. It looks like Google has taken down that text-to-speech site for now, but I assume (since it's Google after all) that they'll eventually release it. You may want to also look at the Android TTS library here.

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you want to strip out. Plus, there's stuff that's valid, but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that get applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would look into the requirements phase again.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to chose between your own markup language and HTML. I would chose a markup language. The pros are better security, the cons are incapability to add unexpected internet contents, like youtube videos. A good idea to prevent users' rage is adding an "HTML to my-site" feature that translates the corresponding HTML tags to your markup language, and delete all other tags.
The pros for HTML are consistency with standards, extendability to new contents types and simplicity. The big con is code injection security issues. Should you pick HTML tags, try to adopt some working system for filtering HTML (I think Drupal is doing quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
There are plenty of ways that code could be sneaked in - especially watch for situations like <img src="http://nasty/exploit/here.php"> that can feed a <script> tag to your clients, I've seen <script> blocked on sites before, but the tag got right through, which resulted in 30-40 passwords stolen.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
Is what I can think of. But to be sure, use a whitelist instead - things like <a>, <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
MediaWiki is more permissive than this site; yes, it accepts setting colors (even white on white), margins, indents and absolute positioning (including those that would put the text completely out of screen), null, clippings and "display;none", font sizes (even if they are ridiculously small or excessively large) and font-names (even if this is a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site which strips out almost everything.
But MediaWiki successifully strips out the dangerous active scripts from CSS (i.e. the behaviors, the onEvent handlers, the active filters or javascript link targets) without filtering completely the style attribute, and bans a few other active elements like object, embed, bgsound.
Both sits are banning marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users and there are policy rules to ban those users that are abusing repeatedly.
It offers support for animated iamges, and provides support for active extensions, such as to render TeX maths expressions, or other active extensions that have been approved (like timeline), or to create or customize a few forms.