Link: Response Header VS HTML

I am currently working on a function to assist in preparing a Link: HTTP header or a set of <link> tags, and while reading different materials on this, I still have not been able to find an answer to a simple question: when to use the Link: header and when to use <link>.
So far I can only say that if you want to use HTTP/2 server push, it is recommended to utilize the header. On the other hand, even if I push a stylesheet, it will not be applied unless there is a respective tag in the HTML output.
Since I am preparing the library in order to help with some standardization and sanitization, I would like to catch at least some "weird" cases like this, if possible, but for that I need some set of recommendations or best practices. Sadly, I have been unable to find any thus far, so I am turning to more knowledgeable people: what best practices or weird cases should I consider catching, or should I just allow anything to be sent regardless of whether it's a header or a tag?
If anyone is interested, the code is available at https://github.com/Simbiat/HTTP20/blob/main/src/Headers.php (the links function).

They are supposed to be equivalent, as @Evert states, so in theory you can use either.
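For example (a minimal Node.js sketch for illustration only; the font URL and port are made up), the same preload hint can be expressed either way:

// Hypothetical server showing both forms of the same hint.
const http = require("http");

http.createServer((req, res) => {
  // Form 1: as an HTTP response header.
  res.setHeader("Link", "</fonts/main.woff2>; rel=preload; as=font; crossorigin");
  res.setHeader("Content-Type", "text/html; charset=utf-8");
  // Form 2: as a tag inside the document's <head>.
  res.end('<!DOCTYPE html><html><head>' +
    '<link rel="preload" href="/fonts/main.woff2" as="font" crossorigin>' +
    '</head><body>Hello</body></html>');
}).listen(8080);

(In PHP, the header form would just be header('Link: ...') with the same value.) However, there are some considerations: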
Headers are usually set in the web server config (at least for static pages), which may not be as easy for developers to update.
However, the header has the added advantage that you can set it for many pages at once (e.g. preload your core fonts for every .html file, rather than having to remember to set this on every page, or in every page template if using a CMS).
With the HTML version, on the other hand, it's often easier to configure things per page (or per page template) if different pages have different needs (e.g. different fonts are used on different pages).
Some also say there are slight performance benefits to doing it in the header, but honestly, as long as it's high enough in the <head> element, I think you'd struggle to notice the difference.
Perhaps of more importance is whether the header is passed on from hop to hop if your web server sits behind other infrastructure (e.g. a CDN or other proxy). In theory it should be for simple headers, but for things like HTTP/2 push that's not so easy. If it's in the HTML, you don't need to worry about this (assuming intermediaries are not changing the markup, of course!).
You mentioned the HTTP/2 push use case, and that definitely needs the header (though using the Link header to trigger push is not a formally standardised mechanism; many servers and CDNs use it, while some use other methods). However, given HTTP/2 push's complexities and concerns, it can cause more problems than it solves, which is maybe a reason to recommend the HTML method, to ensure a resource is never pushed.
All in all I recommend setting this in the HTML. It’s just easier.
This is not the case, however, with other similar things that can be set in both HTML and HTTP headers. CSP, for example, is limited in its HTML version, lacking some features of the HTTP header version, and the HTML version is also not recommended because it could be altered with JavaScript, whereas the HTTP header cannot. But for simple Link headers these concerns matter much less.
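To illustrate that CSP limitation (a hedged sketch; the policy itself is arbitrary): directives such as frame-ancestors, report-uri and sandbox only work in the header form and are ignored in <meta http-equiv>.

// Node.js sketch: the header form can carry the full policy.
const http = require("http");

http.createServer((req, res) => {
  // frame-ancestors works here...
  res.setHeader("Content-Security-Policy",
    "default-src 'self'; frame-ancestors 'none'");
  res.end("ok");
  // ...but would be silently ignored in the meta equivalent:
  // <meta http-equiv="Content-Security-Policy" content="default-src 'self'">
}).listen(8081);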

What are the concrete risks of using custom HTML elements and attributes in HTML5?

My question is similar to what this poster is asking:
What are the concrete risks of using custom HTML attributes?
but I want to know what can happen if I use custom elements and custom attributes under the current HTML spec (HTML5).
Example
<x a="5"> abc </x>
Visually I see no issues in any browser. JS works:
var x = document.getElementsByTagName('x');
alert('x has attribute a=' + x[0].getAttribute('a'));
CSS works too:
x {
  color: red;
}
x[a] {
  text-decoration: underline;
}
Possible risks include:
Backward compatibility. In particular, IE8 and below won't understand your tag, and you'll have to remember to write document.createElement('x') for all your new elements.
Semantics - having your html machine-readable may not be your goal, but there may come a time when it needs to be parsed in a moderately useful fashion.
Portability & maintenance - there are plenty of current html tags that almost certainly do what you want them to do. At some point, someone else may have to look after your code. Is there anything to be gained from having them spend time learning what all your new tags are for?
SEO - don't take the risk of a penalty just because it's something you can do.
For completeness, there are justified reasons to do it though. If you can demonstrate your new tag improves the semantics of your page (your example of 'x' obviously doesn't) and you can think of some use-case where your page will be machine-parsed by your own process, then go for it.
The only issue I can think of is that other applications, including search engines, won't recognize your custom elements and properties, so they won't know what to look for or how to use them, which is a decided disadvantage for SEO. Other applications trying to access your content, including RESTful apps, won't know what to do with them either, unless you tell the app developers.
This was always listed as one of the disadvantages of XML/XHTML, but here we are again, back full circle to where we should have been in the first place, the use of XML on the web... but I digress.
The main reason custom elements were frowned upon in the past is that browsers don't know what to do with them, and there was no standardised way of telling them what they are.
What are the risks of using custom HTML elements in HTML5 without following standardisation?
Browsers will handle them differently:
Some browsers may ignore the elements and pretend they're not there; <x>? I don't know what <x> is, let's get rid of that.
Some browsers may attempt to convert the element into something else; define a <tab> element and a browser may think you've mis-spelled <table>, for instance.
You'd have to handle what the element is supposed to do across a large range of devices; just because it works on your PC doesn't mean it works on your phone, or your TV, or your e-reader... or your WiFi-powered fridge...
The good news is that there is some new documentation being written up to allow developers to define their own custom elements in a standardised way. Custom Elements, as it's titled, gives both developers and browser vendors the know-how to allow developers to implement and script custom elements in a way which will work across all supporting browsers... or that's the idea, anyway.
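For reference, that proposal eventually shipped in browsers as Custom Elements (v1). A sketch of the standardised form (the element name x-note is made up; valid custom element names must contain a hyphen):

// Define a custom element the standardised way.
class XNote extends HTMLElement {
  connectedCallback() {
    // Runs when the element is inserted into the document.
    this.style.color = "red";
  }
}
customElements.define("x-note", XNote);
// Now <x-note>abc</x-note> is a first-class element the browser understands.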

Switch browser to a strict mode in order to write proper html code

Is it possible to switch a browser to a "strict mode" in order to write proper code, at least during the development phase?
I always see invalid, dirty HTML code (besides bad JavaScript and CSS), and I feel that one reason is the high tolerance level of all browsers. So, at least during development, I would be ready to have a stricter mode in the browser to force myself to write proper code.
Is there anything like that in any of the known browsers?
I know about the W3C validator, but honestly, who is really using that frequently?
Is there maybe some sort of regular interface between browser and validator? Are there any development environments where validation is tested automatically?
Is there anything like that in any of the known browsers? Is there maybe some sort of regular interface between browser and validator? Are there any development environments where validation is tested automatically?
The answer to all those questions is "No". No browsers have any built-in integration like what you describe. There are (or were) some browser extensions that would take every single document you load and send it to the W3C validator for checking, but using one of those extensions (or anything else that automatically sends things to the W3C validator in the background) is a great way to get the W3C to block your IP address (or the IP-address range for your entire company network) for abuse of W3C services.
I know about the W3C validator, but honestly, who is really using that frequently?
The W3C validator currently processes around 17 requests every second—around 1.5 million documents every day—so I guess there are quite a lot of people using it frequently.
I always see invalid, dirty HTML code… I would be ready to have a stricter mode in the browser to force myself to write proper code.
I'm not sure what specifically you mean by "dirty HTML code" or "proper code", but I can say that there are a lot of markup cases that are not bad or invalid but which some people mistakenly consider bad.
For example, some people think every <p> start tag should always have a matching </p> end tag, but the fact is that from the time HTML was created, it has never required documents to have matching </p> end tags in all cases. (In fact, when HTML was created, the <p> element was basically an empty element, not a container, and the <p> tag was simply a marker.)
Another example of a case that some people mistakenly think of as bad is unquoted attribute values, e.g., <link rel=stylesheet …>. But the fact is that unless an attribute value contains spaces or certain other special characters, it generally doesn't need to be quoted, so there's actually nothing wrong at all with a case like <link rel=stylesheet …>.
So there's basically no point in trying to find a tool or mechanism to check for cases like that, because those cases are not actually real problems.
All that said, the HTML spec does define some markup cases as being errors, and those cases are what the W3C validator checks.
So if you want to catch real problems and be able to fix them, the answer is pretty simple: Use the W3C validator.
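If you want automated validation as part of a development workflow, one option (a sketch, assuming you run your own checker instance rather than hammering the public service) is to run the same engine the W3C validator uses, the Nu Html Checker (vnu.jar), locally and call it from a script:

// Assumes a local checker started with: java -cp vnu.jar nu.validator.servlet.Main 8888
// Run with Node 18+ as an ES module (fetch and top-level await are built in).
const html = "<!DOCTYPE html><title>Test</title><p>Hello";

const res = await fetch("http://localhost:8888/?out=json", {
  method: "POST",
  headers: { "Content-Type": "text/html; charset=utf-8" },
  body: html,
});
const { messages } = await res.json();
for (const m of messages) {
  console.log(`${m.type}: ${m.message}`);
}

The out=json API returns the same messages the web UI shows, so a check like this can run in CI without touching W3C infrastructure.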
Disclosure: I'm the maintainer of the W3C validator. 😀
As @sideshowbarker notes, there isn't anything built into browsers at the moment.
However, I do like the idea and wish there were such a tool (that's how I got to this question).
There is a "partial" solution: if you use Firefox and view the source (not via the developer tools, but with Ctrl+U or right-click > "View Page Source"), Firefox will highlight invalid tag nesting and attribute issues in red in the raw HTML source. I find this invaluable as a first pass over a page that doesn't seem to be working.
It is quite nice because it isn't super picky about things like an unquoted attribute value or a deprecated attribute, but it highlights genuinely glitchy stuff: mangled spacing inside attributes (which would cause issues if the attributes were unquoted), elements that are not properly closed, a script tag outside the html element, and a missing doctype or content placed before it.
Unfortunately, "seeing" these issues is a manual process... I'd love to see them in the dev console, and in all browsers.
Most plugins/extensions only get access to the DOM after it has been parsed, by which point these errors are gone or have been papered over... however, if there is a way to get the raw HTML source in one of these extension models, so that we could code an extension to test for these types of errors, I'd be more than willing to help write one (DM @scunliffe on Twitter). Alternatively, this may require writing something at a lower level, like a script to run in Fiddler.
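For what it's worth, Firefox's WebExtensions API does now expose the raw response bytes before parsing, via webRequest.filterResponseData (Firefox-only). A sketch, not a finished extension; it assumes the manifest requests the webRequest, webRequestBlocking and host permissions:

// Background script: tap the raw HTML of top-level documents as it streams in.
browser.webRequest.onBeforeRequest.addListener(
  (details) => {
    const filter = browser.webRequest.filterResponseData(details.requestId);
    const decoder = new TextDecoder("utf-8");
    let raw = "";
    filter.ondata = (event) => {
      raw += decoder.decode(event.data, { stream: true });
      filter.write(event.data); // pass the bytes through unmodified
    };
    filter.onstop = () => {
      filter.disconnect();
      console.log("raw source, pre-parse:", raw.length, "bytes");
      // ...run tag-nesting / quoting checks on `raw` here...
    };
  },
  { urls: ["<all_urls>"], types: ["main_frame"] },
  ["blocking"]
);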

Disadvantages of using consistent-behaving yet deprecated HTML tags?

When users visit my website, they don't care about how perfect or how standards-compliant the page's code is. They only care about whether it works.
There are tags that are deprecated but behave consistently across all major, minor, and very minor browsers. They work now and will keep working. (I'm not talking about optional tags like <marquee> and <blink>, which will probably be removed in the future, since their absence doesn't break pages.) The tags I'm talking about are, for example:
<center> (used by google.com homepage, yes and it's May 2014)
<body bgcolor=, alink=, vlink=, link= (all used by google.com)
<font size= (also used by google.com)
If my HTML generator produces tags like <body bgcolor=black>, it is guaranteed to work for nearly 100% of users.
If it instead produces CSS like background:black;, it will be supported by fewer users than <body bgcolor=black>. (Start with https://superuser.com/q/732669/78897 and https://superuser.com/q/447269/78897, though I'm sure they are not the only cases in the whole world.)
Bear with me, this is a real question based on a true problem. What exactly are the real disadvantages of having these tags as output?
Potential disadvantages include the following:
1) Your customer might actually care about how standard the code is. Maybe not now, but in the future. Maybe for questionable reasons, but still.
2) Deprecated constructs do not always work consistently. For example, align=center attribute set on a table may have different effects depending on browser mode. This is a relatively weak argument, though, since the browser practices have been described rather well in HTML5 CR and you can manage the potential problems. (Besides, even CSS settings may work inconsistently.)
3) There is no guarantee that deprecated features will be supported by all future browsers. On the other hand, the same applies to standard features. In practice, very few features that have been defined in HTML specifications have actually been removed from browsers. (Regarding tags, I think basefont is the only case.) All the examples mentioned, and also marquee, have been described in HTML5 CR as “obsolete” but still well-defined, and according to HTML5 CR, browsers are expected, and partly required, to support them all.
4) Your colleagues (designers/developers/...) may regard your code (and you) as old-fashioned, non-semantic, and whatever.
5) Code maintenance and development may be more difficult. If you have 1,000 pages with <body bgcolor=black> and the customer says they want a somewhat different background color, you would need to edit each page. This argument is, however, weaker than it seems. First, how often do such things actually happen? Second, if the pages have actually been generated using suitable tools, perhaps you just need to change the value of one parameter and regenerate them (or just let the servers do that, if the pages are dynamically generated). Third, if you have a link element on all pages referring to a basic style sheet, as you normally should, you just need to add one rule to that style sheet. It is easy to override presentational HTML attributes with CSS, as sketched below.
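To make that last point concrete (a sketch; the colors are arbitrary): presentational attributes such as bgcolor only act as low-priority presentational hints, so a single ordinary author rule overrides them on every page that uses it.

// Even if every page says <body bgcolor=black>, this rule wins:
const style = document.createElement("style");
style.textContent = "body { background: #fff; }";
document.head.appendChild(style);
// (In practice the rule would simply live in the shared style sheet.)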
To summarize, the practical arguments against your approach are rather weak. The most important arguments relate to coding style and principles.
I'd add some more disadvantages:
Another disadvantage of using those tags is bandwidth. When you put center, bgcolor, and similar tags into the HTML, the browser needs to load that content every time, even if those tags are the same on every page and even if the user has visited the site many times. But when you place the design in a CSS file, browsers can cache it (especially when you set the caching headers properly), so on repeat visits they only load the HTML and images.
Another thing is that if you decide to redesign the site or restyle elements, it's much easier to make the changes only in the CSS files. It's also possible that in the future you won't be making those changes yourself; other companies or freelancers may be doing it, and it will be much easier for them to change the site. So the site will be cheaper to maintain.
In addition, if the HTML/PHP code is poor (or the site is very complex) and many "visual conditions" appear across many files (for example, on one page you put one colour in the HTML, on another page a different colour) and something goes wrong, it will be much easier to find the problem, because you can simply cut out some CSS and check where the problem is.
The disadvantage is when one of the major browsers chooses to get rid of the deprecated tag in a future release.
The advantage of using CSS over tags is that you can change the whole website's look and feel in one simple move.
Consider people who require larger font sizes or who are colour-blind, and CSS also enables the fullest use of screen readers.
Even those consistently behaving tags may be removed from browsers. What if you would like to create an HTML5 website? Then you would need to relearn everything and change literally everything on your website to make it work, because you never know whether those tags will be supported in future HTML5 browsers or only in older HTML documents.
CSS provides easier maintenance, for one; the client decides they want some elements aligned left instead of center? Change your CSS rule and poof, you're done. But if you're using old-school valign and such? Get ready to change every single instance of that in the file(s).

Dynamically Obfuscate HTML

I was wondering if there is any way to dynamically obfuscate HTML on a live server (but not offline), so that as soon as my website is visited, the source is obfuscated rather than in plain text.
Since the client (browser) will have to parse it into a sensible DOM tree, this is pretty much fruitless. These days it's a lot more common to inspect a site using Firebug/WebKit Inspector, which provides a nicely formatted, navigable tree. Most people won't even notice that the HTML is "obfuscated", much less be stopped by it.
Executable code can be obfuscated by minimizing variable names and such without changing the result. HTML is the result, though; if you change anything about it, the result will change. So "obfuscation" would mostly be limited to creative use of spacing anyway.
The real question you should ask yourself is "why do I need to obfuscate HTML?". If you're hiding sensitive information, then you should be either encrypting that data, or never presenting it to the client.
Most sensitive information or transactions should take place on the server, and the client only receives a token, or encrypted information, or a unique transaction identifier that can be passed back and forth.
Let me put it this way: There's no way to dynamically obfuscate the HTML on your site such that any reasonably competent person couldn't get it anyway.
You could use JavaScript to attempt to obfuscate it, but you'd have to do it in a way that didn't actually affect the DOM.
You could generate the contents of the page itself with JavaScript, but that is likely to damage accessibility, and once again the DOM will have to be in a condition the browser can use.
You could insert massive amounts of whitespace into the source, but that is easily overcome as well.
All this, and you make it harder and more annoying to manage your site. Minification has its purpose, but obfuscation here is lose-lose.
You could search for and remove all tabs, newlines, extra spaces, and comments.
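A naive sketch of that idea (the function name is made up; note that collapsing whitespace can change rendering wherever it is significant, e.g. inside <pre>):

// Strip HTML comments and collapse whitespace runs; "obfuscation" by spacing only.
function crushHtml(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, "") // drop comments
    .replace(/\s+/g, " ")            // collapse tabs, newlines, spaces
    .trim();
}

console.log(crushHtml("<p>\n  Hello,\t<b>world</b>!\n</p> <!-- note -->"));
// -> "<p> Hello, <b>world</b>! </p>"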
If you are using PHP, IonCube has a plugin. It can be found here: http://www.ioncube.com/html_encoder.php. It turns your HTML page into minified JavaScript.

Writing XSS Filter for (X)HTML Based on White List

I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white list of attributes for those tags. For example, typical HTML input might consist of <b>, <i> tags and an <a> tag with href. But a straightforward implementation is not good enough, because even allowed simple links may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
There are many other examples out there. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src, so that I always check whether the value starts with (https?|ftp)://.
Questions:
Are these assumptions good enough for most purposes? That is, if I do not allow style options and I check src/href against a white list of prefixes, does that solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML, for writing a simple parser that would clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET, though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, a bit extra:
You can potentially come up with a very strict white list. It should be well structured, pretty tight, and not very flexible. When you combine flexibility, lots of tags, lots of attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on certain tags such as img); then you need to do white listing on the attribute values, as you stated: http|https|ftp, or style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also, you need to think about some character white listing or some UTF-8 normalization, because different encodings can cause awkward issues, such as newlines in attributes or invalid UTF-8 sequences.
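Tying the prefix white list and the normalization point together, a sketch of the check (in JavaScript for brevity; the real implementation would be C++, and the helper name is made up):

// Reject-by-default: only values whose scheme is on the white list pass.
const SAFE_SCHEME = /^(https?|ftp):\/\//i;

function isSafeUrl(raw) {
  // Browsers strip tabs/newlines inside URLs and trim surrounding whitespace,
  // so mimic that before testing the prefix; otherwise the filter and the
  // browser could disagree about what the scheme actually is.
  const normalized = raw.replace(/[\t\n\r]/g, "").trim();
  return SAFE_SCHEME.test(normalized);
}

isSafeUrl("https://example.com/");    // true
isSafeUrl("javascript:alert(1)");     // false
isSafeUrl(" ja\nvascript:alert(1)");  // false: still not on the white list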
All the details of HTML parsing are specified in HTML5. However, implementing it is quite a lot of work, and it doesn't really matter whether you parse HTML exactly, with all its corner cases: at worst you'll end up with a slightly different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex an implementation you want to come up with.
A very restrictive white list is probably the "simplest" way, but if you want to be really comprehensive, I would look into converting one of the established versions to C++ rather than trying to write your own from scratch. There are so many tricks to worry about that I think you'd be better off standing on the shoulders of others who have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ can't duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it'd definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained; there's a comparison document where the author discusses some differences between his approach and others', which is probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!