Basic encoding/decoding of characters for the web - html

I feel like this is something I should definitely know about, but I'm not entirely sure of the details of when a character is decoded by the browser (or even whether I'm thinking about it in the right way).
While inspecting the DOM of a site to which I've added some content (through a form, for example), I can see my &lt; (in the contents of my comment) appear as a string. Even if the angle brackets are well-balanced (e.g. &lt;something&gt;), it appears as a string rather than an element in the DOM. I appreciate this is critical in defending against injection attacks such as XSS, so (on the server) the content is written as a string literal rather than an element - but how does the browser recognise this and render it differently? And when does it decode it?
If the server does respond with &gt; or &lt;, why do I not see this in dev tools?
My confusion comes from the fact that, when inspecting, there is no difference between my <something> content and a <something> element (if there were such a thing).

So, I'd expect to see (when inspecting the DOM) &lt;content&gt;, but it seems not.
This is merely because your browser's DOM inspector is a bit loose in its representation. You're inspecting the DOM after all, a complex object-oriented internal memory structure, yet your browser is showing it to you in an HTML-like presentation. Either because of an oversight or as a conscious decision to make this presentation more readable, not everything that should be an HTML entity in valid HTML is being displayed as an HTML entity.
If you inspect the actual source code of the page, you'll see &lt;content&gt;.
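If it helps to pin down where the decoding and re-encoding actually happen, here is a small sketch you can run in a browser console (the element and strings are just for illustration):

const div = document.createElement('div');
div.innerHTML = '&lt;something&gt;';    // the HTML parser decodes the entities at parse time
console.log(div.textContent);           // "<something>" - stored as plain characters in a text node
console.log(div.firstChild.nodeType);   // 3 (Node.TEXT_NODE) - it is text, not an element
console.log(div.innerHTML);             // "&lt;something&gt;" - re-encoded only when serialized back to HTML

The entities only exist in the HTML serialization; inside the DOM the text node simply contains the raw characters, which is what the inspector chooses to show you.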


How to programmatically find out (using Node) how a given string would be rendered in browser?

How do you get the string that the browser rendered, programmatically, using Node/JS — the same thing you would get if you copied everything in a browser window?
For example, for this given HTML source (notice the multiple spaces between "a" and "z"):
<html><head></head><body>a     z</body>
which renders with a single space in Chrome.
How would you get this rendered string with the single space, "a z"?
I tried Cheerio and JSDom, but after I load the <html><head></head><body>a     z</body> as a string and query the body contents, I get the original piece of code back, the one with many spaces.
Thank you.
Good question; however, I don't think there is a feasible way to do it.
First, what is happening is explained well in the article When does white space matter in HTML?.
Since the white space isn't actually going anywhere - it is only presented that way by the browser - it will be hard to reproduce this server side. There are reasons for that:
You don't know which browser it will be rendered in - it could even be Lynx - and you don't know whether that browser will show the spaces or not.
That means that, even if it were possible, you would have to test with every browser in the wild.
For instance, Server-Side Rendering (SSR) partially applies / renders pages on the server side, but because there is no actual device that will display the result, the rendering stays partial. So most likely you will still get the same spaces.
One possible (if imaginary) solution would be to use something like KarmaJS: install a headless browser on the server side and execute some test cases, so that KarmaJS controls the browser to render the page, and maybe you will be able to get access to the rendered, CSS-applied and hopefully space-collapsed DOM. I'm not sure about that, and it would be limited to a specific set of browsers.
Another imaginary solution would be to use the WebKit or Blink engines, or perhaps Electron, and somehow attempt to acquire that DOM via their APIs.
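For what it's worth, the headless-browser idea is quite doable today. Here is a rough sketch using Puppeteer (one possible driver among several, not something the answer above prescribes), which actually lays the page out and can therefore give you the collapsed text:

// Rough sketch, assuming Puppeteer is installed (npm install puppeteer).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent('<html><head></head><body>a     z</body></html>');
  // innerText reflects the laid-out text, so runs of white space collapse.
  const rendered = await page.evaluate(() => document.body.innerText);
  console.log(JSON.stringify(rendered)); // "a z"
  await browser.close();
})();

This still only tells you what one particular engine (Chromium, in Puppeteer's case) renders, which is exactly the caveat raised above.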

Safe Way to Include User Text Input in HTML

This feels like an easy one, but I'm having trouble finding the right search terms to get me what I need...
I have a requirement for part of my web page to display a previously-entered note from a user. The note is saved in the database, and I am currently incorporating it using Razor like this:
<span>@Model.UserNote</span>
This works fine, but it gets my spidey senses tingling... what if the user decides that he wants his note to be something like "</span><script>...</script><span>". I know how to use parameters to avoid injection attacks in SQL Server, but is there an HTML equivalent or another approach to avoid saving or injecting malicious markup in HTML? Displaying the text in a control like a textbox feels safer, but may not give me the visual appearance that I am looking for. Thanks in advance!
The thing you want to search for is cross-site scripting (xss).
The general solution is to encode output according to its context. For example, if you are writing such data into plain html, you need html encoding, which is basically replacing < with &lt; and so on in dynamic data (~user input), so that everything only gets rendered as text. For a javascript context (for example, but not only, inside a <script> tag) you would need javascript encoding.
In .net, there is HttpUtility, which includes such methods, e.g. HttpUtility.JavaScriptStringEncode(). Also there is the formerly separate AntiXSS library that can help by providing even stricter (whitelist-based) encoding, as opposed to the blacklist-based HttpUtility. So don't roll your own, it's trickier than it may first appear - just use a well-known implementation.
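Purely to illustrate what html encoding does to the data (this is a throwaway sketch in JavaScript, not the HttpUtility/AntiXSS calls you would actually use in .net, and the advice above about not rolling your own still stands):

// Hand-rolled for illustration only - in real code use a well-known library.
function htmlEncode(input) {
  return input
    .replace(/&/g, '&amp;')   // ampersand first, so the entities added below aren't re-encoded
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const userNote = '</span><script>alert(1)</script><span>';
console.log(htmlEncode(userNote));
// &lt;/span&gt;&lt;script&gt;alert(1)&lt;/script&gt;&lt;span&gt; - renders as visible text, not as markup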
Also Razor has built-in protection against trivial xss attack vectors. By using @myVar, Razor automatically applies html encoding, so your code above is secure. Note that it would not be secure in a javascript context, where you need to apply javascript encoding yourself (ie. call the relevant method from HttpUtility for instance).
Note that without proper encoding, it is not more secure to use an input field or a textarea - an injection is an injection, doesn't matter much what characters need to be used if injection is possible.
Also slightly related, .net provides another protection besides the automatic html encoding, called "request validation": by default it won't allow request parameters (either get or post) to contain a less-than character (<) immediately followed by a letter. Such a request is blocked by the framework as potentially unsafe, unless this feature is deliberately turned off.
Your original example is blocked by both of these mechanisms (automatic encoding and request validation).
It's very important to note though, that in terms of xss, this is the very tip of the iceberg. While these protections in .net help somewhat, they are by no means sufficient in general. While your example is secure, in general you need to understand xss and what exactly these protections do to be able to produce secure code.

Is importing text via CSS really permitted?

There is an html page that I have no control over; ditto goes for any javascript.
I can however style it, and one of the things I've done is inject a slab of text via content in a CSS pseudo-element.
However, this slab spans multiple lines, and since CSS strings can only be one line, this leads to a cumbersome property full of \0a escapes.
I was wondering whether I could use the url(blah/blah) syntax with content in place of a string; and the docs say yes!
However, when I try it (the slab now unencoded and hosted in its own file), the content doesn't show.
Looking in the networking tab of the devtools shows it is requested, but it looks like the browser is ignoring it.
At first I thought it was a headers issue (I was working just out of the filesystem), so I built a tiny server to apply text/plain (I also tried text/html) on localhost.
It appears the browsers are only accepting images for content, with the following header seen to be sent with the request in chrome's devtools: Accept:image/webp,image/*,*/*;q=0.8.
This issue occurs in firefox too, so why does the mdn specifically use a .html example in the syntax?
Is there any way to get something like what I'm attempting up and running, or am I left to deal with the long CSS statement?
The docs say "an external resource (such as an image)" so they don't explicitly rule out such use of plain text, but they don't explicitly allow it either. It seems likely that "such as an image" is intended to allow further media types such as video or interactive SVG but deliberately vague so as to not second-guess future technologies.
As such, it would seem that it's "permitted" as in you haven't done anything invalid, but not supported as in there is no reason why you should expect it to actually do anything useful.
This issue occurs in firefox too, so why does the mdn specifically use a .html example in the syntax?
I'd guess that it is simply an RFC 6761- and RFC 2606-compliant example URI of the kind often used in the docs. (Of course, to nitpick, there's no reason why a URI ending in .html should be assumed to always return HTML either, though it's a bit perverse to do otherwise.)

Switch browser to a strict mode in order to write proper html code

Is it possible to switch a browser to a "strict mode" in order to write proper code at least during the development phase?
I always see invalid, dirty html code (besides bad javascript and css), and I feel that one reason is the high tolerance level of all browsers. So I would at least be ready to use a stricter mode while using the browser during development in order to force myself to write proper code.
Is there anything like that in any of the known browsers?
I know about the w3c-validator, but honestly, who is really using it frequently?
Is there maybe some sort of regular interface between browser and validator? Are there any development environments where the validation is tested automatically?
Is there anything like that in any of the known browsers? Is there maybe some sort of regular interface between browser and validator? Are there any development environments where the validation is tested automatically?
The answer to all those questions is “No”. No browsers have any built-in integration like what you describe. There are (or were) some browser extensions that would take every single document you load and send it to the W3C validator for checking, but using one of those extensions (or anything else that automatically sends things to the W3C validator in the background) is a great way to get the W3C to block your IP address (or the IP-address range for your entire company network) for abuse of W3C services.
I know about the w3c-validator, but honestly, who is really using it frequently?
The W3C validator currently processes around 17 requests every second—around 1.5 million documents every day—so I guess there are quite a lot of people using it frequently.
I always see invalid, dirty html code… I would at least be ready to use a stricter mode while using the browser during development in order to force myself to write proper code.
I'm not sure what specifically you mean by “dirty html code” or “proper code”, but I can say that there are a lot of markup cases that are not bad or invalid but which some people mistakenly consider bad.
For example, some people think every <p> start tag should always have a matching </p> end tag, but the fact is that from the time when HTML was created, it has never required documents to always have matching </p> end tags in all cases (in fact, when HTML was created, the <p> element was basically an empty element—not a container—and so the <p> tag was simply a marker).
Another example of a case that some people mistakenly think of as bad is unquoted attribute values; e.g., <link rel=stylesheet …>. But the fact is that unless an attribute value contains spaces, it generally doesn't need to be quoted. So there's actually nothing wrong at all with a case like <link rel=stylesheet …>.
So there's basically no point in trying to find a tool or mechanism to check for cases like that, because those cases are not actually real problems.
All that said, the HTML spec does define some markup cases as being errors, and those cases are what the W3C validator checks.
So if you want to catch real problems and be able to fix them, the answer is pretty simple: Use the W3C validator.
Disclosure: I'm the maintainer of the W3C validator. 😀
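If you do want something automated during development without hammering the W3C's public instance, the same checker (the Nu Html Checker, vnu.jar) can be run as a local web service and queried from a script. A rough sketch, assuming the checker is already running locally on port 8888 (e.g. java -cp vnu.jar nu.validator.servlet.Main 8888; see the checker's documentation for the exact invocation and output format) and Node 18+ for the built-in fetch:

// Rough sketch: POST a document to a locally running Nu Html Checker and print its findings.
const html = '<!DOCTYPE html><html><head><title>t</title></head><body><p>hi</body></html>';

fetch('http://localhost:8888/?out=json', {
  method: 'POST',
  headers: { 'Content-Type': 'text/html; charset=utf-8' },
  body: html,
})
  .then((res) => res.json())
  .then((report) => {
    // Each entry in report.messages has a type (e.g. "error" or "info") and a description.
    for (const message of report.messages) {
      console.log(message.type + ': ' + message.message);
    }
  });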
As @sideshowbarker notes, there isn't anything built in to all browsers at the moment.
However, I do like the idea and wish there were such a tool too (that's how I got to this question).
There is a "partial" solution, in that if you use Firefox, and view the source (not the developer tools, but the CTRL+U or right click "View Page Source") Firefox will highlight invalid tag nesting, and attribute issues in red in the raw HTML source. I find this invaluable as a first pass looking at a page that doesn't seem to be working.
It is quite nice because it isn't super picky about things like an unquoted id value (e.g. id=asdf) or a deprecated attribute, but it highlights glitchy stuff such as messed-up spacing in td attributes (which would cause issues if the attributes were not quoted), a span tag that is not properly closed, a script tag that sits outside of the html tag, and a missing doctype or content before it.
Unfortunately "seeing" these issues is a manual process... I'd love to see these in the dev console, and in all browsers.
Most plugins/extensions only get access to the DOM after it has been parsed and these errors are gone or negated... however, if there is a way to get the raw HTML source in one of these extension models, so that we could code an extension to test for these types of errors, I'd be more than willing to help write one (DM @scunliffe on Twitter). Alternatively this may require writing something at a lower level, like a script to run in Fiddler.

Is a browser obliged to use a DOM to render an HTML page?

I was reading the page about the Document Object Model on Wikipedia.
One sentence caught my interest; it says:
A Web browser is not obliged to use DOM in order to render an HTML document.
You can find the entire context on the page right here.
I don't understand that: is there any alternative way to render an HTML document? What exactly does this sentence mean?
Strictly speaking IE (at least < IE9) does not use a DOM to render an HTML document. It uses its own internal object model (which is not always a pure tree structure).
The DOM is an API, and IE maps the API methods and properties onto actions on its internal model. Since the DOM assumes a tree structure, the mapping is not always perfect, which accounts for a number of oddities when accessing the document via the DOM in IE.
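To make "the DOM is an API" a bit more concrete, here is a tiny sketch (purely illustrative; run it in any browser console): whatever the engine keeps internally, these calls have to behave as if the document were the node tree the DOM describes.

// Walk the document as the DOM API exposes it - a tree of nodes -
// regardless of how the browser engine stores the page internally.
function dump(node, depth) {
  console.log(' '.repeat(depth * 2) + node.nodeName);
  for (const child of node.childNodes) {
    dump(child, depth + 1);
  }
}
dump(document.documentElement, 0);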
The primary job of a browser is to display HTML. Most browsers use a DOM; they parse the HTML, create a DOM structure from it (which can also be used in JavaScript) and render the page based on that DOM.
But if a browser chooses not to, it is free to do so. I wouldn't know why it would, and I certainly don't understand why this line is explicitly mentioned in the Wiki article.