What is the best SRI hash size? - html

I recently discovered a nifty little site for generating Subresource Integrity (SRI) tags for externally loaded resources. For example, entering the latest jQuery URL (https://code.jquery.com/jquery-3.3.1.min.js), one gets the following <script> tag:
<script src="https://code.jquery.com/jquery-3.3.1.min.js" integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8= sha384-tsQFqpEReu7ZLhBV2VZlAu7zcOV+rXbYlF2cqB8txI/8aZajjp4Bqd+V6D5IgvKT sha512-+NqPlbbtM1QqiK8ZAo4Yrj2c4lNQoGv8P79DPtKzj++l5jnN39rHA/xsqn8zE9l0uSoxaCdrOgFs6yjyfbBxSg==" crossorigin="anonymous"></script>
I understand the purpose of SRI hashes, and I know that they can use different hash sizes (256-, 384-, or 512-bit), but I had never seen all three used at once like this before. Digging into the MDN docs, I found that
An integrity value may contain multiple hashes separated by whitespace. A resource will be loaded if it matches one of those hashes.
But how exactly is that matching performed? Time for multiple questions in one SO post...
1. Do browsers attempt to match the longest hash first, since it's more secure, or the shortest first, since it's faster?
2. Would one really ever expect one hash to match and not all three (other than the trivial case of a developer mistyping a hash)?
3. Is there any benefit to providing all three hashes instead of just one?
4. Similar to #1, if you only provide one hash value, which should you use? I typically see sites (e.g., Bootstrap) providing sha384 values in their example code. Is that because it's right in the middle, not too big, not too small?
5. Out of curiosity, can the integrity attribute be used on any tags besides <script> and <link>? I'm particularly wondering about multimedia tags like <img>, <source>, etc.

Do browsers attempt to match the longest hash first, since it's more secure, or the shortest first, since it's faster?
Per https://w3c.github.io/webappsec-subresource-integrity/#agility, “the user agent will choose the strongest hash function in the list”.
Would one really ever expect one hash to match and not all three?
No. As for browser behavior: the browser checks only the strongest hash function it supports and ignores the rest, so it wouldn't matter anyway whether the others match too or not.
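The "strongest hash wins" selection described above can be sketched roughly as follows. This is only an illustration, not browser code: the names PRIORITY and strongestHashes are made up, and the ordering SHA-256 < SHA-384 < SHA-512 is the only part taken from the spec.

```javascript
// Hypothetical priority table: a higher number means a stronger hash function.
const PRIORITY = new Map([['sha256', 1], ['sha384', 2], ['sha512', 3]]);

// Returns only the entries that use the strongest supported algorithm
// found in a whitespace-separated integrity value.
function strongestHashes(integrity) {
  const entries = integrity
    .trim()
    .split(/\s+/)
    .map((token) => {
      const dash = token.indexOf('-');
      return { alg: token.slice(0, dash), value: token.slice(dash + 1) };
    })
    .filter((entry) => PRIORITY.has(entry.alg)); // unknown algorithms are skipped

  const best = Math.max(0, ...entries.map((e) => PRIORITY.get(e.alg)));
  return entries.filter((e) => PRIORITY.get(e.alg) === best);
}
```

With the three-hash jQuery value from the question, only the sha512 entry would be returned, and that is the one the resource would be compared against.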
Is there any benefit to providing all three hashes instead of just one?
There is no current benefit in practice. That’s because per https://w3c.github.io/webappsec-subresource-integrity/#hash-functions, “Conformant user agents MUST support the SHA-256, SHA-384, and SHA-512 cryptographic hash functions”.
So currently, if you just specify a SHA-512 hash, all browsers that support SRI will use that.
But per https://w3c.github.io/webappsec-subresource-integrity/#agility the intent of specifying multiple hashes is “to provide agility in the face of future cryptographic discoveries… Authors are encouraged to begin migrating to stronger hash functions as they become available”.
In other words, at some point in the future, browsers will start to add support for stronger hash functions (SHA-3-based ones https://en.wikipedia.org/wiki/SHA-3 or whatever).
Thus, since you’ll need to continue to target older browsers as well as newer ones, there will be a period of time when you’re targeting some browsers for which SHA-512 is the strongest hash function while also targeting the new browsers that have come along by that time which will have added support for some SHA-3 (or whatever) hash functions.
So in that case, you would need to specify multiple hashes in the integrity value.
Similar to #1, If you only provide one hash value, which should you use?
A SHA-512 value.
I typically see sites (e.g., Bootstrap) providing sha384 values in their example code. Is that because it's right in the middle, not too big, not too small?
I don't know why they choose to do it that way. But since browsers are required to support SHA-512 hashes, you don’t gain anything by specifying a SHA-384 hash instead — in fact you just lose the value of having the strongest hash function available.
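If you want to produce a single SHA-512 integrity value yourself, a minimal Node.js sketch might look like the following. The function name sriHash, the sample script text, and the example.com URL are placeholders for illustration; a real build step would hash the exact bytes of the file being served.

```javascript
const { createHash } = require('crypto');

// Builds an integrity value of the form "<algorithm>-<base64 digest>".
function sriHash(content, algorithm = 'sha512') {
  const digest = createHash(algorithm).update(content).digest('base64');
  return `${algorithm}-${digest}`;
}

// Example: hash an in-memory string (placeholder content, placeholder URL).
const value = sriHash('alert("hello");');
console.log(
  `<script src="https://cdn.example.com/app.js" integrity="${value}" crossorigin="anonymous"></script>`
);
```

Passing 'sha256' or 'sha384' as the second argument yields the shorter variants, which is how a three-hash value like the one in the question could be assembled.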
Out of curiosity, can the integrity attribute be used on any tags besides <script> and <link>?
No, it can’t be — not yet.
I'm particularly wondering about multimedia tags like <img>, <source>, etc.
As https://w3c.github.io/webappsec-subresource-integrity/#verification-of-html-document-subresources explains, the plan has always been for SRI to eventually be used for those too —
Note: A future revision of this specification is likely to include integrity support for all possible subresources, i.e., a, audio, embed, iframe, img, link, object, script, source, track, and video elements.
…but we are not yet in that future.

Related

How to prevent search engines from indexing a span of text?

From the information I have been able to find so far, <noindex> is supposed to achieve this, making a single section of a page hidden from search engine spiders. But it also seems this is not obeyed by many crawlers, so if that is the case, what markup should be used instead of, or in addition to, it?
Yahoo uses a built-in class: <span class="robots-nocontent">
Googlebot has no equivalent(?)
Yandex uses <noindex>
Others?
There is no way to stop crawlers from indexing anything; it's up to their authors to decide what the crawlers do. The rule-obeying ones, like Yahoo Slurp and Googlebot, each have their own rules, as you've already discovered, but it's still up to them whether to obey those rules completely. Say you set robots-nocontent: that part may still be indexed and stored somewhere else, perhaps for spam, illegal material, or malware checks.
And that's just the "good" ones; there's no telling what the bad ones would do. So think of all the noindex stuff as a set of guidelines, not a set of strict rules.
And the only thing that works for sure: if you have sensitive data, or you simply don't want something indexed - don't make it publicly available.

Can an ID attribute start with colon?

Was viewing the source of Gmail for purely academic purposes and I came across this.
<input id=":3f4"
name="attach"
type="checkbox"
value="13777be311c96bab_13777be311c96bab_0.1_-1"
checked="">
Wonder of wonders, most elements have ids that start with a colon (:).
I always thought the definition for ID attribute was this.
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
Or am I missing anything new? I mean is that OK with HTML5?
Permissibility
It is allowed in the latest working draft: http://www.w3.org/TR/html-markup/datatypes.html#common.data.id
Any string, with the following restrictions:
- must be at least one character long
- must not contain any space characters
The spec also notes:
Previous versions of HTML placed greater restrictions on the content
of ID values (for example, they did not permit ID values to begin with
a number).
The definition you quoted appears in the HTML 4 spec.
There is a widely visited SO thread which covers some of the considerations regarding IDs (mainly from an HTML 4 perspective).
Rationale
After thinking about it more, I realized that there are two good questions here:
Why does the spec allow this?
IDs which can contain any character have the potential to break all sorts of things, such as CSS selectors (if proper escaping is not used), Sizzle (which jQuery uses) pattern matches, server IDs (such as ASP.NET Web Forms use), and IDs which are generated from model properties (such as one might do with an MVC pattern).
All those things aside, I believe a key goal of HTML 5 was to not create restrictions that weren't absolutely necessary (which was a shortcoming of XHTML). Just because a purpose hasn't been identified for something yet doesn't mean that it won't be in the future.
Despite the many things which won't work, certain things work just fine, for example document.getElementById(":foo")
http://jsfiddle.net/Xjast/
As with most things, it is up to the developer to be knowledgeable of the tools that he or she is using.
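To make the escaping point concrete: document.getElementById(":foo") works, but "#:foo" is not a valid CSS selector, so the colon has to be backslash-escaped first. The helper below is a simplified sketch with a made-up name (idSelector); it assumes the id contains no whitespace and does not start with a digit or a hyphen. In a browser, the standard CSS.escape() function handles those cases as well.

```javascript
// Simplified sketch: backslash-escape every character outside [A-Za-z0-9_-].
// Assumption: the id has no whitespace and does not begin with a digit or "-"
// (a leading digit would need a code-point escape, which is not handled here).
function idSelector(id) {
  return '#' + id.replace(/[^A-Za-z0-9_-]/gu, (ch) => '\\' + ch);
}

// In a browser one could then write, for the Gmail-style id from the question:
//   document.querySelector(idSelector(':3f4'));
console.log(idSelector(':3f4'));
```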
Why does Google do this?
Obviously this can't be answered conclusively unless you are part of the Gmail team. However, Google heavily minifies and obfuscates their code; they also manage a huge amount of script, which suggests well-defined conventions.
Here's another thought. What if Google is leveraging the fact that CSS selectors require escaping of certain characters? This would go a long way towards reducing accidental restyling of content contained in an email message.

Does using custom data attributes produce browser compatibility issues?

I have to choose between custom data tags or ids. I would like to choose custom data tags, but I want to be sure that they do not cause browser compatibility issues for the most widely used browsers today.
I'm using jQuery 1.6 and my particular scenario involves a situation where I need to reference a commentId for several actions.
<div data-comment-id="comment-1" id="comment-1">
<a class="foo"></a>
</div>
It's easier to extract data tags in jQuery: $('.foo').data('commentId');
Extracting a substring from the id seems a bit complicated and could break for one reason or another: <a id="comment-1"
Are there any sweeping merits or fatal flaws for either approach?
I would advise in favor of data attributes for the following reasons:
ids need to be unique document-wide. Thus they are limited in the semantics they can carry
you can have multiple data-attributes per element
and probably less relevant in your case:
changing ids might break idrefs
However, I'm not sure whether I understand your specs completely as extracting the element id in jQuery is as trivial as getting the data attribute: $('.foo').attr('id');.
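For comparison, the substring extraction the question worries about could be sketched as below. The function name and the assumed "comment-<number>" id format are illustrative only; the point is that the parse depends on the id keeping that exact shape, whereas a data attribute can carry the value directly.

```javascript
// Hypothetical id format "comment-<number>"; returns null when the id
// does not follow that pattern, which is the fragile part of this approach.
function commentIdFromElementId(elementId) {
  const match = /^comment-(\d+)$/.exec(elementId);
  return match ? Number(match[1]) : null;
}

console.log(commentIdFromElementId('comment-1'));
```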
You might be interested in Caniuse.com, a browser compatibility site for web technologies.
If XHTML is an issue to you, you might also be interested in how to use custom data attributes in XHTML: see here for a discussion on SO and here for an XHTML-compatible approach using namespaces.
This guy says data attributes work on IE6.

Adding ids to HTML tags for QA automation

I have a query. In our application we have lots of HTML tags. During development many tags were not given any id because there was no requirement for one. Now the QA team wants to automate the test cases using QTP. In most cases the tool doesn't recognize the elements because it does not find ids on most of the HTML tags. Now we are asked to add ids to all the HTML tags.
I want to know whether adding the id attribute to these tags will have any effect. Even positive impacts are welcome.
I do not think there will be any effect, positive or negative: the size of the HTML page may increase a bit, but probably not by much.
Still, are you sure you need to put "id" attributes on every HTML tag of your pages? Wouldn't only a few of those be enough, such as on form fields, links, and error messages?
One thing you must take care of, though, is that ids, as in "identifiers", must be unique. That implies it might be good, before you start adding them, to define some kind of "id policy" that says, for instance, "ids for elements of that kind should be named that way".
And, for your next projects: have developers add those while they're developing ;-)
(And following the policy, of course)
Now that I'm thinking about it: a positive effect might be that it'll be easier to write JavaScript code interacting with your HTML document, but that will only hold for future projects or later evolutions of this one, when those ids are already present in the HTML by the time developers write the JS code.
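Since uniqueness is the one hard requirement mentioned above, a rough check for duplicate ids could look like the sketch below. It is deliberately crude: it scans the markup with a regular expression rather than a real HTML parser, only sees quoted id="..." attributes, and the function name findDuplicateIds is made up for illustration.

```javascript
// Rough sketch: regex scan, not an HTML parser. Only quoted id attributes
// preceded by whitespace are counted, so attributes such as data-comment-id
// are not mistaken for ids.
function findDuplicateIds(html) {
  const counts = new Map();
  const pattern = /\sid\s*=\s*(["'])(.*?)\1/gi;
  for (const match of html.matchAll(pattern)) {
    const id = match[2];
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  // Keep only the ids that were seen more than once.
  return [...counts].filter(([, n]) => n > 1).map(([id]) => id);
}

console.log(findDuplicateIds('<div id="a"></div><span id="b"></span><p id="a"></p>'));
```

Running something like this over rendered pages before handing them to QA would at least flag the ids that break whatever naming policy was agreed on.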
Since there are no QTP-related answers yet, here is one.
GUI recognition in QTP is object-oriented. In order to identify an object, QTP needs a unique combination of the object's properties, and checking them should be as fast as possible. That is why an HTML ID would be ideal.
Where it is especially critical is for objects that do not have other unique identifiers. The most typical example is HTML tables: their contents are dynamic, and their number on the page may vary. By adding an HTML ID you allow the recognition mechanism to get straight to the right table.
Objects with other unique properties can be recognized well without an HTML ID. For example, if you have a single "submit" link on the page, QTP will successfully recognize it by its inner text.
So the context-specific answer: don't start adding ids to every single tag. Ask the automation team to prepare a list of the objects they have problems with, and add ids to those objects.
PS. It also depends on automation programming skills. There are descriptive programming and dynamic recognition methods, which allow retrieving the right objects even when no ids are provided.
As Albert said, QTP doesn't rely solely on elements' ids. In fact, because many web applications generate different ids for each session, (as far as I remember) the id property isn't part of the default description for most web test objects.
QTP is pretty good at recognizing most simple web controls, and if you're facing problems it may be that a Web Extensibility project will help you bridge the gap between the semantics of your web application and the raw HTML it is built from. If a complex control is recognized by QTP as a WebElement (which is actually the div that contains the span that drives the code), you will understandably have object recognition problems, since there are many divs on the page but probably far fewer complex controls.
If you are talking about side-effects - NO. Adding ids won't cause any problems (apart from taking up some extra bytes of course)
If you really have the need to add ids, go ahead and add them.
http://www.w3.org/TR/html4/struct/links.html#anchors-with-id says: The id and name attributes share the same name space. This means that they cannot both define an anchor with the same name in the same document. It is permissible to use both attributes to specify an element's unique identifier for the following elements: A, APPLET, FORM, FRAME, IFRAME, IMG, and MAP. When both attributes are used on a single element, their values must be identical.

So what if custom HTML attributes aren't valid XHTML?

I know that is the reason some people don't approve of them, but does it really matter? I think that the power that they provide, in interacting with JavaScript and storing and sending information from and to the server, outweighs the validation concern. Am I missing something? What are the ramifications of "invalid" HTML? And wouldn't a custom DTD resolve them anyway?
The ramification is that the W3C comes along in 2, 5, or 10 years and creates an attribute with the same name. Now your page is broken.
HTML5 is going to provide a data attribute type for legal custom attributes (like data-myattr="foo") so maybe you could start using that now and be reasonably safe from future name collisions.
Finally, you may be overlooking that custom logic is the rationale behind the class attribute. Although it is generally thought of as a style attribute, it is in reality a legal way to set custom meta-properties on an element. Unfortunately you are basically limited to boolean properties, which is why HTML5 is adding the data prefix.
BTW, by "basically boolean" I mean in principle. In reality there is nothing to stop you from using a separator in your class names to define custom values as well as attributes.
class="document docId.56 permissions.RW"
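Reading such a class value back requires some parsing on the script side. A minimal sketch follows, assuming the dot-separated key.value convention shown above; the function name parseClassMeta and the flags/values split are made up for illustration.

```javascript
// Splits a class string into plain flags ("document") and key/value pairs
// ("docId.56"), assuming "." is the separator as in the example above.
function parseClassMeta(className) {
  const flags = [];
  const values = {};
  for (const token of className.trim().split(/\s+/)) {
    if (token === '') continue; // empty class attribute
    const dot = token.indexOf('.');
    if (dot > 0) {
      values[token.slice(0, dot)] = token.slice(dot + 1);
    } else {
      flags.push(token);
    }
  }
  return { flags, values };
}

console.log(parseClassMeta('document docId.56 permissions.RW'));
```

Note that class names containing a dot also have to be escaped before they can be used in a CSS selector, which is one cost of this convention.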
Yes you can legally add custom attributes by using "data".
For example:
<div id="testDiv" data-myData="just testing"></div>
After that, just use the latest version of jQuery to do something like:
alert($('#testDiv').data('myData'))
or to set a data attribute:
$('#testDiv').data('myData', 'new custom data')
And since jQuery works in almost all browsers, you shouldn't have any problems ;)
update
data-myData may be converted to data-mydata in some browsers, as far as the JavaScript engine is concerned. Best to keep it lowercase all the way.
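The casing issue comes from how attribute names map to script-side keys. The sketch below mimics, in plain JavaScript, the mapping used by the DOM dataset property: the HTML parser lowercases attribute names, and each hyphen followed by a lowercase letter becomes an uppercase letter. The function name datasetKey is made up, and the sketch assumes ASCII attribute names.

```javascript
// Mimics how a data-* attribute name becomes a dataset key.
// Assumption: ASCII attribute names; returns null for non-data attributes.
function datasetKey(attributeName) {
  const lower = attributeName.toLowerCase(); // HTML lowercases attribute names
  if (!lower.startsWith('data-')) return null;
  return lower
    .slice('data-'.length)
    .replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

console.log(datasetKey('data-comment-id')); // hyphenated form keeps word boundaries
console.log(datasetKey('data-myData'));     // the capital letter is lost
```

So writing the attribute as data-my-data is the way to end up with a camelCase key on the script side.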
Validation is not an end in itself, but a tool to be used to help catch mistakes early, and reduce the number of mysterious rendering and behavioural issues that your web pages may face when used on multiple browser types.
Adding custom attributes will not affect either of these issues now, and is unlikely to do so in the future. But because they don't validate, when you come to assess the output of a validation of your page you will need to pick carefully between the validation issues that matter and the ones that don't. Each time you change your page and revalidate, you have to repeat this operation. If your page validates entirely, you get a nice green PASS message and you can move on to the next stage of testing, or to the next change that needs to be made.
I've seen people obsessed with validation doing far worse/weird things than using a simple custom attribute:
<base href="http://example.com/" /><!--[if IE]></base><![endif]-->
In my opinion, custom attributes really don't matter. As others say, it may be good to watch out for future additions of attributes in the standards. But now we have data-* attributes in HTML5, so we're saved.
What really matters is that you have properly nested tags, and properly quoted attribute values.
I even use custom tag names (those introduced by HTML5, like header, footer, etc.), but these have problems in IE.
By the way, I often find it ironic how all those validation zealots bow before Google's clever tricks, like iframe uploads.
Instead of using custom attributes, you can associate your HTML elements with the attributes using JSON:
var customAttributes = { 'Id1': { 'custAttrib1': '', ... }, ... };
And as for the ramifications, see SpliFF's answer.
Storing multiple values in the class attribute is not proper encapsulation; it is just a convoluted hack. Take, for instance, a custom ad rotator that uses jQuery. It is much cleaner on the page to do
<div class="left blue imagerotator" AdsImagesDir="images/ads/" startWithImage="0" endWithImage="10" rotatorTimerSeconds="3" />
and let some simple jQuery code do the work from here.
Any developer or web designer can now work on the ad rotator and change these values when asked, without much ado.
Coming back to a project a year later, or coming into a new one where the previous developer split and went to an island somewhere in the Pacific, can be hell when you are trying to figure out intentions from code written in an unclear, cryptic manner like this:
<div class="left blue imagerotator dir:images-ads endwith:10 t:3 tf:yes" />
When we write code in C# and other languages, we don't put all custom properties into one property as a space-delimited string and then parse that string every time we need to read or write it. Think about the next person who will work on your code.
The thing with validation is that TODAY it may not matter, but you cannot know if it's going to matter tomorrow (and, by Murphy's law, it WILL matter tomorrow).
It's just better to choose a future-proof alternative. If one doesn't exist (it does in this particular case), the way to go is to invent a future-proof alternative.
Using custom attributes is probably harmless, but still, why choose a potentially harmful solution just because you think (you can never be sure) it will cause no harm? It might be worth discussing this further if the future-proof alternative were too costly or unwieldy, but this is certainly not the case.
Old discussion, but nevertheless: in my opinion, since HTML is a markup language and not a programming language, it should always be interpreted with leniency for markup "errors". A browser is perfectly able to do so. I don't think this will or should ever change. Therefore, the only important practical criterion is that your HTML is displayed correctly by most browsers and will continue to be so for, say, a few years. After that time, your HTML will probably be redesigned anyway.
Just to add my ingredient to the mix, validation is also important when you need to create content that can or could be post-processed using automated tools. If your content is valid, you can much more easily convert markup from one format to another. For example, converting valid XHTML to XML with a specific schema is much easier when you are parsing data that you know, and can verify, follows a predictable format.
I, for example, NEED my content to be valid XHTML because very often it is converted into XML for various jobs and then converted back without data loss or unexpected rendering results.
Well it depends on your client/boss/etc .. do they require it be validating XHTML?
Some people say there are a lot of workarounds, and depending on the scenario they can work great. These include adding classes, leveraging the rel attribute, and even, in one case, writing a custom parser to extract JSON from HTML comments.
HTML5 provides a standard way to do this: prefix your custom attributes with "data-". I would recommend doing this now anyway, as there is a chance you may otherwise use an attribute name that ends up in standard XHTML down the track.
Using non-standard HTML could make the browser render the page in "quirks mode", in which case some other parts of the page may render differently, and other things like positioning may be slightly different. Using a custom DTD may get around this, though.
Because they're not standard, you have no idea what might happen, either now or in the future. As others have said, the W3C might start using those same names in the future. But what's even more dangerous is that you don't know what the developers of "browser xxx" have done for when they encounter them.
Maybe the page is rendered in quirks mode, maybe the page doesn't render at all on some obscure mobile browser, maybe the browser will leak memory, maybe a virus killer will choke on your page, etc, etc, etc.
I know that following the standards religiously might seem like snobbery. However once you have experienced problems due to not following them, you tend to stop thinking like that. However, then it's mostly too late, and you need to start your application from scratch with a different framework...
I think developers validate just to validate, but there is something to be said for the fact that it keeps markup clean. However, because every (exaggeration warning!) browser displays everything differently, there really is no standard. We try to follow standards because it makes us feel like we at least have some direction. Some people argue that keeping code standard will prevent issues and conflicts in the future. My opinion: screw that. Nobody implements standards correctly and fully today anyway, so you might as well assume all your code will fail eventually. If it works, it works; use it, unless it's messy or you're just trying to ignore standards to stick it to the W3C or something. I think it's important to remember that standards are implemented very slowly: has the web changed all that much in 5 years? I'm sure anyone will have years of notice when they need to fix a potential conflict. There is no reason to plan for compatibility with future standards when you can't even rely on today's standards.
Oh I almost forgot, if your code doesn't validate 10 baby kittens will die. Are you a kitten killer?
jQuery .html(markup) doesn't work if markup is invalid.
Validation
You shouldn't need custom attributes to provide validation. A better approach would be to add validation based on each field's actual task.
Assign meaning by using classes. I have classnames like:
date (Dates)
zip (Zip code)
area (Areas)
ssn (Social security number)
Example markup:
<input class="date" name="date" value="2011-08-09" />
Example javascript (with jQuery):
$('.date').validate(); // use your custom function/framework etc here.
If you need special validators for a certain field or scenario, you just invent new classes (or use selectors) for your special case:
Example for checking if two passwords match:
<input id="password" />
<input id="password-confirm" />
if($('#password').val() != $('#password-confirm').val())
{
// do something if the passwords don't match
}
(This approach works quite seamlessly with both jQuery validation and the ASP.NET MVC framework, and probably others too)
Bonus: You can assign multiple classes separated with a space class="ssn custom-one custom-two"
Sending information "from and to the server"
If you need to pass data back, use <input type="hidden" />. They work out of the box.
(Make sure you don't pass any sensitive data with hidden inputs since they can be modified by the user with almost no effort at all)