Is importing text via CSS really permitted?

There is an HTML page that I have no control over, and the same goes for its JavaScript.
I can, however, style it, and one of the things I've done is inject a slab of text via content on a CSS pseudo-element.
However, this slab spans multiple lines, and since CSS strings can only be one line, that leads to a cumbersome property full of \0a escapes.
I was wondering whether I could use the url(blah/blah) syntax with content in place of a string, and the docs say yes!
However, when I try it (the slab now unencoded and hosted in its own file), the content doesn't show.
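For illustration only (the selector, text, and file path here are placeholders rather than the real rules), the two approaches look roughly like this:

/* Single-line string form: works, but gets unwieldy for a long slab. */
.target::after {
  content: "First line\A Second line\A Third line";
  white-space: pre; /* needed so the escaped newlines actually render */
}

/* What I tried instead: point content at the slab's own file. It doesn't render. */
.target::after {
  content: url(blah/slab.txt);
}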
Looking in the Network tab of the devtools shows that it is requested, but the browser appears to ignore the response.
At first I thought it was a headers issue (I was working straight off the filesystem), so I built a tiny server on localhost to serve the file as text/plain (I also tried text/html).
It appears the browsers only accept images for content; Chrome's devtools show the request going out with the header Accept: image/webp,image/*,*/*;q=0.8.
This happens in Firefox too, so why does MDN specifically use a .html example in the syntax?
Is there any way to get something like what I'm attempting up and running, or am I stuck with the long CSS declaration?

The docs say "an external resource (such as an image)" so they don't explicitly rule out such use of plain text, but they don't explicitly allow it either. It seems likely that "such as an image" is intended to allow further media types such as video or interactive SVG but deliberately vague so as to not second-guess future technologies.
As such, it would seem that it's "permitted" as in you haven't done anything invalid, but not supported as in there is no reason why you should expect it to actually do anything useful.
This happens in Firefox too, so why does MDN specifically use a .html example in the syntax?
I'd guess that it's simply an RFC 6761- and RFC 2606-compliant URI of the kind often used in documentation as an example. (Of course, to nitpick, there's no reason why a URI ending in .html should be assumed to always return HTML either, though it's a bit perverse to do otherwise.)

Related

How to programmatically find out (using Node) how a given string would be rendered in browser?

How do you get the string that the browser renders, programmatically, using Node/JS (the same thing as if you copied everything in a browser window)?
For example, for this given HTML source (notice spaces between "a" and "z"):
<html><head></head><body>a      z</body>
which renders with a single space in Chrome. How would you get this string with the single space, a z?
I tried Cheerio and JSDOM, but after I load the <html><head></head><body>a      z</body> as a string and query the body contents, I get the original piece of code back, the one with many spaces.
Thank you.
Good question; however, I don't think there is a feasible way to do it.
First, what is happening is explained well in the article When does white space matter in HTML?.
The white space isn't going anywhere; it is only displayed that way by the browser, so it is hard to reproduce that server side. There are reasons for that:
You don't know which browser the page will be rendered in. It could even be Lynx, and you don't know whether it will collapse the spaces or not.
That means that, if it were possible at all, you would have to test with every browser in the wild.
Server-Side Rendering (SSR), for instance, partially applies/renders pages on the server side, but because there is no actual device that will display the result, it stays partial. So most likely you would still get the same spaces.
One imaginable solution would be to use something like KarmaJS: install a headless browser on the server and execute some test cases, so that KarmaJS controls that browser to render the page; then you might be able to get access to the rendered, CSS-applied, and hopefully space-collapsed DOM. I'm not sure about that, and it would cover only a limited set of browsers.
Another imaginable solution would be to use the WebKit or Blink engines, or perhaps Electron, and somehow acquire that rendered DOM via their APIs.
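To make the headless-browser idea concrete, here is a minimal sketch using Puppeteer (one possible tool, not something named above; it assumes Node with the puppeteer package installed, and it only tells you what one engine, headless Chromium, renders):

// Render the HTML in headless Chromium and read back the displayed text.
const puppeteer = require('puppeteer');

async function renderedText(html) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(html);
  // innerText reflects the rendered text, with white space collapsed.
  const text = await page.evaluate(() => document.body.innerText);
  await browser.close();
  return text;
}

renderedText('<html><head></head><body>a      z</body>').then(console.log); // "a z"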

Switch browser to a strict mode in order to write proper html code

Is it possible to switch a browser to a "strict mode" in order to be forced to write proper code, at least during the development phase?
I constantly see invalid, dirty HTML code (besides bad JavaScript and CSS), and I feel that one reason for this is the high tolerance of all browsers. So I would at least like a stricter mode while I use the browser during development, to force myself to write proper code.
Is there anything like that in any of the known browsers?
I know about the W3C validator, but honestly, who really uses it frequently?
Is there maybe some sort of regular interface between browser and validator? Are there any development environments where validation is run automatically?
Is there anything like that in any of the known browsers? Is there maybe some sort of regular interface between browser and validator? Are there any development environments where validation is run automatically?
The answer to all those questions is “No”. No browsers have any built-in integration like what you describe. There are (or were) some browser extensions that would take every single document you load and send it to the W3C validator for checking, but using one of those extensions (or anything else that automatically sends things to the W3C validator in the background) is a great way to get the W3C to block your IP address (or the IP-address range for your entire company network) for abuse of W3C services.
I know about the W3C validator, but honestly, who really uses it frequently?
The W3C validator currently processes around 17 requests every second—around 1.5 million documents every day—so I guess there are quite a lot of people using it frequently.
I constantly see invalid, dirty HTML code… I would at least like a stricter mode while I use the browser during development, to force myself to write proper code.
I'm not sure what specifically you mean by “dirty HTML code” or “proper code”, but I can say that there are a lot of markup cases that are not bad or invalid but which some people mistakenly consider bad.
For example, some people think every <p> start tag should always have a matching </p> end tag, but the fact is that from the time when HTML was created, it has never required documents to have matching </p> end tags in all cases (in fact, when HTML was created, the <p> element was basically an empty element, not a container, and so the <p> tag was simply a marker).
Another example of a case that some people mistakenly think of as bad is unquoted attribute values; e.g., <link rel=stylesheet …>. But the fact is that unless an attribute value contains spaces, it generally doesn't need to be quoted. So in fact there's actually nothing wrong at all with a case like <link rel=stylesheet …>.
So there's basically no point in trying to find a tool or mechanism to check for cases like that, because those cases are not actually real problems.
All that said, the HTML spec does define some markup cases as being errors, and those cases are what the W3C validator checks.
So if you want to catch real problems and be able to fix them, the answer is pretty simple: Use the W3C validator.
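For illustration only, the checker behind the validator can also be driven as a web service and wired into a build; this is a sketch rather than official sample code (check the endpoint and parameters against the checker's documentation), and given the warning above about automated background use of the hosted service, frequent automated runs should go against a locally installed copy of the checker instead:

// Rough sketch (Node 18+ for built-in fetch); verify endpoint/params against the docs.
const { readFile } = require('fs/promises');

async function checkFile(path) {
  const html = await readFile(path, 'utf8');
  const res = await fetch('https://validator.w3.org/nu/?out=json', {
    method: 'POST',
    headers: { 'Content-Type': 'text/html; charset=utf-8' },
    body: html,
  });
  const report = await res.json();
  for (const m of report.messages) {
    console.log(`${m.type}: ${m.message}`);
  }
  return report.messages.length === 0;
}

checkFile('index.html');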
Disclosure: I'm the maintainer of the W3C validator. 😀
As #sideshowbarker notes, there isn't anything built into browsers at the moment.
However, I do like the idea and wish there were such a tool too (that's how I got to this question).
There is a "partial" solution, in that if you use Firefox, and view the source (not the developer tools, but the CTRL+U or right click "View Page Source") Firefox will highlight invalid tag nesting, and attribute issues in red in the raw HTML source. I find this invaluable as a first pass looking at a page that doesn't seem to be working.
It is quite nice because it isn't super picky about things like an id value such as asdf not being quoted, or an attribute being deprecated, but it highlights genuinely glitchy stuff: in my test page it flagged that the spacing in the td attributes was messed up (which would cause issues if the attributes were not quoted), that a span tag was not properly closed, and that a script tag sat outside the html element, and it also flags a missing doctype or content placed before it.
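A page along these lines (a reconstruction for illustration, not the actual page I was testing) would trigger the highlighting described:

<!DOCTYPE html>
<html>
<head><title>demo</title></head>
<body>
  <div id=asdf>unquoted attribute value: tolerated, not flagged</div>
  <table>
    <tr><td class = "cell" >odd spacing around the attribute: flagged</td></tr>
  </table>
  <span>never closed: flagged
</body>
</html>
<script>/* a script element after the html element: flagged */</script>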
Unfortunately "seeing" these issues is a manual process... I'd love to see these in the dev console, and in all browsers.
Most plugins/extensions only get access to the DOM after it has been parsed, when these errors are gone or negated... however, if there is a way to get at the raw HTML source in one of these extension models, so that an extension could test for these types of errors, I'd be more than willing to help write one (DM #scunliffe on Twitter). Alternatively, this may require writing something at a lower level, like a script to run in Fiddler.

Is an image within a noscript tag only downloaded if Javascript is disabled?

Consider the following example:
<noscript>
<img class="photo" src="example.png">
</noscript>
Does the client only download the image file if they have Javascript disabled? (I'm aware the client can only see the image if Javascript is disabled in this example)
The reason I'm asking is that I've been using base64 data URIs for several background-image properties in an external CSS file (avoiding HTTP requests). I would like to also use base64 data URIs for the src value of some img tags by updating their values via external JavaScript (to retain the benefits of caching).
Essentially, the whole point of this is to avoid/limit HTTP requests, so I was wondering whether I can degrade gracefully and only fetch the image files if JavaScript is disabled. Or is the image downloaded regardless?
Short Answer:
NO, images inside a <noscript> element are NOT downloaded (as long as JavaScript is enabled)
Technical Answer:
I had just been doing some testing on my personal website for functionality with JavaScript disabled, and came across this article… with JavaScript still disabled, btw.
Well, at the very top of this web page, Stack Overflow have a warning message which states:
“Stack Overflow works best with JavaScript enabled”
Being the type of web geek who tends to “view source” at practically every single website I look at (!), the HTML code for the message is as follows:
<noscript>
<div id="noscript-warning">Stack Overflow works best with JavaScript enabled<img src="http://example.com/path/to/1x1-pixel.gif" alt="" class="dno"></div>
</noscript>
Nothing too ground-breaking there. However, what interested me was the IMG element, which referenced a 1px by 1px invisible image.
I am assuming that this image must be used by analytics/statistics software to detect how many of their users are browsing without JavaScript. If this assumption is correct (and there isn't some other reason for the blank 1x1-pixel image that I've overlooked), it would basically confirm the following: browsers do not download the contents of anything within a NOSCRIPT element, except when JavaScript is actually disabled. (And there certainly does not seem to be any retro ’98-style layout technique going on, I am glad to say!) ;-)
(P.S. – I hope that I don’t somehow offend anyone at the Stack Exchange network for pointing this out!)
The HTML 4.01 specification says only that the content of noscript is not rendered in certain situations. It does not address fetching directly, but it does suggest that browsers should not perform any GET operations on the basis of that content in such situations, since such operations would be pointless and would reduce efficiency.
The HTML5 draft is more explicit and probably reflects actual browser behavior. It says about the noscript element, in an emphatic note, that it works “by essentially ‘turning off’ the parser when scripts are enabled, so that the contents of the element are treated as pure text and not as real elements”. (The note relates to why noscript does not work when using the XHTML syntax, but it also reveals the principle by which it works, when it works.)
So we can expect that when scripting is enabled, the content of noscript won’t even be parsed (except to recognize the end tag). Blender’s answer seems to confirm this, and so does my little experiment with Firefox:
<img src=foo style="foo: 1">
<noscript>
<img src=bar style="bla: 1">
</noscript>
Firefox makes a failing GET request for foo but no request for bar, when scripting is enabled. In addition, it shows a warning about erroneous CSS code foo: 1, in the error console, but no warning about bla: 1. So apparently the img tag was not even parsed.
However, I don’t see how the question relates to the scenario presented as a reason for asking it. I think you can simply use an img element outside noscript and put the desired initial content there using a data: URL (it will remain, as a fallback, the final content when scripting is disabled).
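For illustration (the data: URL below is just a 1x1 SVG placeholder standing in for the real base64 payload), that suggestion amounts to something like:

<!-- The image data travels inline, so there is no extra HTTP request,
     and the image still displays when scripting is disabled. -->
<img class="photo" alt=""
  src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='1' height='1'/%3E">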

What is the best way to handle user generated html content that will be viewed by the public?

In my web application I allow user-generated content to be posted for public consumption, similar to Stack Overflow.
What is the best practice for handling this?
My current steps for handling user-generated content are:
1. I use MarkItUp to give users an easy way to format their HTML.
2. After a user has submitted their changes, I run the content through an HTML sanitizer (scroll to the bottom) that uses a white-list approach.
3. If the sanitization process has removed any user-created content, I do not save the content. I then return their modified content with a warning message: "Some illegal content tags were detected and removed; double-check your work and try again."
4. If the content passes through the sanitization process cleanly, I save the raw HTML content to the database.
5. When rendering to the client, I just pass the raw HTML out of the DB to the page.
That's an entirely reasonable approach. For typical applications it will be entirely sufficient.
The trickiest part of white-listing raw HTML is the style attribute and embed/object. There are legitimate reasons why someone might want to put CSS styles into an otherwise untrusted block of formatted text, or say, an embedded YouTube video. This issue comes up most commonly with feeds. You can't trust the arbitrary block of text contained within a feed entry, but you don't want to strip out, e.g., syntax highlighting CSS or flash video, because that would fundamentally change the content and potentially confuse anyone reading it. Because CSS can contain dangerous things like behaviors in IE, you may have to parse the CSS if you decide to allow the style attribute to stay in. And with embed/object you may need to white-list hostnames.
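As a small illustration of that last point (a sketch only; the helper name and host list are made up), host white-listing for embed/object sources can be as simple as:

// Allow embeds only from a known set of hosts.
const ALLOWED_EMBED_HOSTS = ['www.youtube.com', 'player.vimeo.com'];

function isAllowedEmbedSrc(src) {
  try {
    return ALLOWED_EMBED_HOSTS.includes(new URL(src).hostname);
  } catch (e) {
    return false; // unparseable URL: reject it
  }
}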
Addenda:
In worst case scenarios, HTML escaping everything in sight can lead to a very poor user experience. It's much better to use something like one of the HTML5 parsers to go through the DOM with your whitelist. This is much more flexible in terms of how you present the sanitized output to your users. You can even do things like:
<div class="sanitized">
<div class="notice">
This was sanitized for security reasons.
</div>
<div class="raw"><pre>
<script>alert("XSS!");</script>
</pre></div>
</div>
Then hide the .raw stuff with CSS, and use jQuery to bind a click handler to the .sanitized div that toggles between .raw and .notice:
CSS:
.raw {
  display: none;
}
jQuery:
$('.sanitized').click(function() {
  $(this).find('.notice').toggle();
  $(this).find('.raw').toggle(); // toggle the raw view, not .sanitized itself
});
The white list is a good move. Any black-list solution is prone to letting through more than it should, because you just can't think of everything. I've seen some attempts at using black lists (for example on The Code Project), and even if they manage to catch everything, they generally still cause additional problems, like replacing characters in code so that it can't be used without manually restoring it first.
The safest method would be:
1. HTML-encode all the text.
2. Match a set of allowed tags and attributes and decode those.
Using a regular expression you can even require that each opening tag has a closing tag, so that an unclosed tag can't mess up the page.
You should be able to do this in something like ten lines of code, so the code that you linked to seems overly complicated.
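A rough sketch of that encode-then-selectively-decode idea (my illustration, in JavaScript since no language is specified; it allows only a fixed set of attribute-less tags and leaves out the closing-tag check mentioned above) might look like this:

// Encode everything, then decode a small white list of tags.
function sanitize(input) {
  // 1. HTML-encode all the text.
  let s = input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

  // 2. Decode only white-listed, attribute-less tags.
  const allowed = ['b', 'i', 'em', 'strong', 'p', 'ul', 'ol', 'li', 'pre', 'code'];
  const tagPattern = new RegExp('&lt;(/?)(' + allowed.join('|') + ')&gt;', 'gi');
  return s.replace(tagPattern, '<$1$2>');
}

console.log(sanitize('<b>bold</b> <script>alert("XSS")</script>'));
// -> <b>bold</b> &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt;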

Block certain html element from getting indexed by search engines

For styling purposes I want to insert some dummy text on the page, but it shouldn't be treated as part of the actual content. Is there a way to block it from search engines, or do I have to fall back on good old images for that?
Or would it be possible to load it dynamically via JavaScript? I've heard that Google will read a certain amount of JavaScript.
Can you show the content in a borderless iframe, and block the iframe's src (a completely separate "page") from the search engines?
Alternatively, add the content with JavaScript, storing the JavaScript in a .js file that you block from the engines?
If you load that text via AJAX it probably won't be indexed - last time I checked, GoogleBot doesn't actually execute JS (nor do the other spiders, though some spambots apparently can and do).
Caveat: the AJAX response should probably contain a X-Robots-Tag: noindex header, in case its URL is actually linked somewhere.
I'd be extremely careful with whatever trick you decide on. Odds are just as likely google will think you're trying to display different content to the user than to it.
I've always believed that Google actually works by rendering the page (possibly using some server-side version of the Chrome rendering engine) and then reading the result back with OCR software to confirm that the text in the source matches what the user would see with JS and frames enabled. Google has always openly warned webmasters not to try serving robots different content from what they serve users, and OCR would be the perfect way to find out (especially if such a 'verifier' used IE's user-agent string and crawled from IP ranges not registered to Google).
Short answer then, serve the decoration as either:
an iframe
an object
an SVG image
Since you're clearly linking the document into your page, Google will probably consider it a separate resource and rate things accordingly, especially if the same text appears on every page. Which brings me to:
Are you going to use the same text decor on all/most pages? If so, Google will almost certainly treat it as "window dressing" and ignore it (it apparently does this with menus and such).
I'd guess that loading in the content after the page has finished loading (when the document.ready event fires, for example) would be a fairly safe way to do what you're talking about. Not 100% sure about this, though.
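To make that concrete (a sketch only: the /decor.html fragment and #decoration container are made up, and per the earlier answer the response for that fragment should also send X-Robots-Tag: noindex or be blocked in robots.txt):

// Fetch the purely decorative text once the page is ready
// (jQuery assumed, as used elsewhere on this page).
$(document).ready(function () {
  $.get('/decor.html', function (fragment) {
    $('#decoration').html(fragment);
  });
});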