Can an ID attribute start with a colon? - html

I was viewing the source of Gmail for purely academic purposes when I came across this:
<input id=":3f4"
name="attach"
type="checkbox"
value="13777be311c96bab_13777be311c96bab_0.1_-1"
checked="">
Wonder of wonders, most elements have IDs that start with a colon.
I always thought the definition of the ID attribute was this:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
followed by any number of letters, digits ([0-9]), hyphens ("-"),
underscores ("_"), colons (":"), and periods (".").
Or am I missing something new? I mean, is that OK in HTML5?

Permissibility
It is allowed in the latest working draft: http://www.w3.org/TR/html-markup/datatypes.html#common.data.id
Any string, with the following restrictions:
- must be at least one character long
- must not contain any space characters
The spec also notes:
Previous versions of HTML placed greater restrictions on the content
of ID values (for example, they did not permit ID values to begin with
a number).
The definition you quoted appears in the HTML 4 spec.
There is a widely-visited SO thread which covers some of the considerations regarding IDs (mainly from an HTML 4 perspective).
Rationale
After thinking about it more, I realized that there are two good questions here:
Why does the spec allow this?
IDs which can contain any character have the potential to break all sorts of things: CSS selectors (if proper escaping is not used), Sizzle pattern matching (the selector engine jQuery uses), server-generated IDs (such as ASP.NET Web Forms uses), and IDs generated from model properties (as one might do with an MVC pattern).
All those things aside, I believe a key goal of HTML 5 was to not create restrictions that weren't absolutely necessary (which was a shortcoming of XHTML). Just because a purpose hasn't been identified for something yet doesn't mean that it won't be in the future.
Despite the many things which won't work, certain things work just fine, for example document.getElementById(":foo")
http://jsfiddle.net/Xjast/
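A minimal sketch of the difference (reusing the Gmail-style ID from above; CSS.escape() is a newer API than this thread):
<input type="checkbox" id=":3f4">
<script>
  // getElementById takes a literal string, so the colon needs no escaping:
  document.getElementById(":3f4").checked = true;

  // querySelector parses CSS selector syntax, where an unescaped colon
  // would begin a pseudo-class, so the ID must be escaped:
  document.querySelector("#\\:3f4");                 // manual escape
  document.querySelector("#" + CSS.escape(":3f4"));  // or via CSS.escape()
</script>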
As with most things, it is up to the developer to be knowledgeable of the tools that he or she is using.
Why does Google do this?
Obviously this can't be answered conclusively unless you are part of the Gmail team. However, Google heavily minimizes and obfuscates their code; they also manage a huge amount of script, which suggests well-defined conventions.
Here's another thought. What if Google is leveraging the fact that CSS selectors require escaping of certain characters? This would go a long way towards reducing accidental restyling of content contained in an email message.
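To make that concrete, a rule written naively against such an ID is invalid CSS and is dropped entirely, so stray selectors in email content can't casually match Gmail's elements. A sketch:
<style>
  /* Invalid: the unescaped colon begins a pseudo-class, so the CSS
     parser drops the entire rule. */
  #:3f4 { background: red; }
  /* Matching the ID takes a deliberate escape: */
  #\:3f4 { background: green; }
</style>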

Related

Force screen reader to read numbers as individual digits

I'm using VoiceOver and I'm trying to get it to read numbers as individual digits. For example, if I have entered 2000, VoiceOver will read out "two thousand"; I want it to read out "two zero zero zero".
My current input element looks something like this:
<input class="some-class" id="some-id" name="some-name"
type="text">
I have tried setting the type attribute to type="number" and type="tel", and adding a style attribute of style="speak: spell-out", but none of them worked.
When I separate the numbers with whitespace like value="2 0 0 0" it works, but of course, you can't expect the user to do this.
I understand there may be a way to do this using JavaScript, but the solution cannot contain JavaScript in the browser due to business requirements.
Any suggestions?
You shouldn't try to force a particular pronunciation or digit grouping.
Add spaces if grouping has a particular importance or meaning.
Take as a base principle that a number shouldn't be read by a screen reader differently from how it is presented.
If digits must be separated in a particular way, add spaces, dots, dashes or another separation character.
Conversely, if there are no spaces, there's no special need to read the number digit by digit.
That's quite simple.
You shouldn't force the screen reader to read something in the way you view it yourself.
Other people may not have the same vision as you. Concerning numbers, some people will prefer to read digit by digit, but others will prefer having them grouped by two, three or four, for their ease of reading, writing and memorizing. Their screen reader is normally configured accordingly.
If a given grouping is important, then the groups must be separated with spaces or other characters. If there are no separators, then it implicitly means that the grouping has no particular importance.
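A hypothetical illustration of that advice (field name and label invented here):
<!-- The grouping is part of the content itself, so every screen
     reader announces the groups the same way, with no special markup. -->
<label for="code">Confirmation code</label>
<input id="code" name="code" type="text" value="2 0 0 0">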
Note that screen readers always give the possibility of reading numbers digit by digit, if the user wishes to do so. It is usually not the default.
Reading numbers digit by digit is usually done only for very big numbers (billions), or when mixing digits and letters.
Additionally, consider that:
- Different screen reader users have different preferences, and accessibility-wise it's generally a bad idea to go against preferences or common defaults.
- There are several screen readers, and a lot of different voices in many languages; all potentially behave in slightly different ways when reading numbers, and any small change intended to tweak one of them might create more problems than it solves.
- Screen reader users are used to pronunciation quirks, and they can fix them using a personal dictionary.
- Screen readers are nowadays not that bad at deciding whether to read numbers as a whole, in groups, or digit by digit.
So, avoid deciding a particular grouping or pronunciation. It's a bad idea, and anyway technically perilous.
I understand there may be a way to do this using JavaScript, but the solution cannot contain JavaScript in the browser due to business requirements.
You tried HTML and CSS, and you can't use JavaScript. Screen readers use the accessibility tree; they do not use CSS, and there is no instruction to tell them to spell out text. They might choose to spell out some abbreviations while reading other acronyms as words; that is the screen reader's choice.
Screen reader users are used to hearing numbers as strangely as they come, and if they want them read character by character they have the appropriate shortcut to spell them out deliberately.

What is the best SRI hash size?

I recently discovered the following nifty little site for generating Subresource Integrity (SRI) tags for externally loaded resources. For example, entering the latest jQuery URL (https://code.jquery.com/jquery-3.3.1.min.js), one gets the following <script> tag:
<script src="https://code.jquery.com/jquery-3.3.1.min.js" integrity="sha256-FgpCb/KJQlLNfOu91ta32o/NMZxltwRo8QtmkMRdAu8= sha384-tsQFqpEReu7ZLhBV2VZlAu7zcOV+rXbYlF2cqB8txI/8aZajjp4Bqd+V6D5IgvKT sha512-+NqPlbbtM1QqiK8ZAo4Yrj2c4lNQoGv8P79DPtKzj++l5jnN39rHA/xsqn8zE9l0uSoxaCdrOgFs6yjyfbBxSg==" crossorigin="anonymous"></script>
I understand the purpose of SRI hashes, and I know that they can use different hash sizes (256-, 384-, or 512-bit), but I had never seen all three used at once like this before. Digging into the MDN docs, I found that
An integrity value may contain multiple hashes separated by whitespace. A resource will be loaded if it matches one of those hashes.
But how exactly is that matching performed? Time for multiple questions in one SO post...
1. Do browsers attempt to match the longest hash first, since it's more secure, or the shortest first, since it's faster?
2. Would one really ever expect one hash to match and not all three (other than the trivial case of a developer mistyping a hash)?
3. Is there any benefit to providing all three hashes instead of just one?
4. Similar to #1, if you only provide one hash value, which should you use? I typically see sites (e.g., Bootstrap) providing sha384-values in their example code. Is that because it's right in the middle, not too big, not too small?
5. Out of curiosity, can the integrity attribute be used on any tags besides <script> and <link>? I'm particularly wondering about multimedia tags like <img>, <source>, etc.
Do browsers attempt to match the longest hash first, since it's more secure, or the shortest first, since it's faster?
Per https://w3c.github.io/webappsec-subresource-integrity/#agility, “the user agent will choose the strongest hash function in the list”.
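(For reference, the value being compared is just the base64 encoding of the raw digest of the fetched bytes. A sketch of computing one in the browser, assuming the host serves CORS headers and the page is a secure context; sriFor is a made-up helper name:)
<script>
  // Sketch: compute an SRI token the way a browser checks it,
  // i.e. base64 of the raw hash of the exact bytes served.
  async function sriFor(url, algo = "SHA-512") {
    const bytes = await (await fetch(url)).arrayBuffer();
    const digest = await crypto.subtle.digest(algo, bytes);
    const b64 = btoa(String.fromCharCode(...new Uint8Array(digest)));
    return algo.toLowerCase().replace("-", "") + "-" + b64;
  }
  sriFor("https://code.jquery.com/jquery-3.3.1.min.js").then(console.log);
</script>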
Would one really ever expect one hash to match and not all three?
No. But as far as the browser behavior: If the strongest hash matches, the browser uses that and just ignores the rest (so it wouldn’t matter anyway whether the others match too or not).
Is there any benefit to providing all three hashes instead of just one?
There is no current benefit in practice. That’s because per https://w3c.github.io/webappsec-subresource-integrity/#hash-functions, “Conformant user agents MUST support the SHA-256, SHA-384, and SHA-512 cryptographic hash functions”.
So currently, if you just specify a SHA-512 hash, all browsers that support SRI will use that.
But per https://w3c.github.io/webappsec-subresource-integrity/#agility the intent of specifying multiple hashes is “to provide agility in the face of future cryptographic discoveries… Authors are encouraged to begin migrating to stronger hash functions as they become available”.
In other words, at some point in the future, browsers will start to add support for stronger hash functions (SHA-3-based ones https://en.wikipedia.org/wiki/SHA-3 or whatever).
Thus, since you’ll need to continue to target older browsers as well as newer ones, there will be a period of time when you’re targeting some browsers for which SHA-512 is the strongest hash function while also targeting the new browsers that have come along by that time which will have added support for some SHA-3 (or whatever) hash functions.
So in that case, you would need to specify multiple hashes in the integrity value.
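Hypothetically, such a value might one day look like this; the sha3-512 token is invented for illustration (no such token is defined today), and per the spec a browser simply ignores tokens it doesn't recognize:
<script src="https://code.jquery.com/jquery-3.3.1.min.js"
        integrity="sha512-+NqPlbbtM1QqiK8ZAo4Yrj2c4lNQoGv8P79DPtKzj++l5jnN39rHA/xsqn8zE9l0uSoxaCdrOgFs6yjyfbBxSg==
                   sha3-512-hypotheticalFutureHashValue"
        crossorigin="anonymous"></script>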
Similar to #1, if you only provide one hash value, which should you use?
A SHA-512 value.
I typically see sites (e.g., Bootstrap) providing sha384-values in their example code. Is that because it's right in the middle, not too big, not too small?
I don't know why they chose to do it that way. But since browsers are required to support SHA-512 hashes, you don’t gain anything by specifying a SHA-384 hash instead — in fact you just lose the value of having the strongest hash function available.
Out of curiosity, can the integrity attribute be used on any tags besides <script> and <link>?
No, it can’t be — not yet.
I'm particularly wondering about multimedia tags like <img>, <source>, etc.
As https://w3c.github.io/webappsec-subresource-integrity/#verification-of-html-document-subresources explains, the plan has always been for SRI to eventually be used for those too —
Note: A future revision of this specification is likely to include integrity support for all possible subresources, i.e., a, audio, embed, iframe, img, link, object, script, source, track, and video elements.
…but we are not yet in that future.

Hyphenating arbitrary text automatically

What kinds of challenges are there facing automatic hyphenation? It seems that you could just draw word by word, breaking when the length of the line exceeds the length of the viewport (or whatever we're wrapping our text in), placing a hyphen after as many characters as can fit (provided at least two characters fit and the word is at least four characters long), and skipping words that already contain a hyphen (there's no requirement that every word be hyphenated).
But I note that Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens property. This seems to imply that there are further issues regarding where we can place hyphens.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
You have these issues in all languages. You can only place a hyphen where meaningful tokens result from the split, as has already been pointed out. You don't want to, for example, split a word like "wr-ong".
The meaningful token may or may not be a syllable, though in most languages (including English) it is. The main point is that you cannot pin the split points down with a few simple rules: you would need to consider a lot of phonology to get a highly accurate result, and those rules vary from language to language.
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research into the phonology of English syllables, i.e. so-called syllabification. You might want to start with this article on Wikipedia:
Wikipedia - Syllabification
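For what it's worth, the dictionary-driven approach is what CSS already gives you, provided the language is declared so the browser knows which patterns to apply; the soft hyphen is the manual fallback:
<!-- hyphens: auto uses language-specific patterns, so the browser
     needs to know the language via the lang attribute. -->
<p lang="en" style="width: 8em; hyphens: auto; -webkit-hyphens: auto;">
  Antidisestablishmentarianism is a famously long word.
</p>
<!-- Manual alternative: mark the allowed break points yourself. -->
<p style="width: 8em;">Anti&shy;dis&shy;establish&shy;ment&shy;arianism</p>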

Should the percent symbol (%) always be HTML-escaped?

I know the percent symbol has to be URL-encoded when being passed around, but when I display it in the browser, is it also necessary to escape it, like so: &#37;?
In URLs, the percent sign (%) has a special meaning, so it should be escaped. In HTML, it does not, so it is not necessary to escape it.
I agree with the chosen answer, but would like to qualify the statement “it is not necessary to escape it.”
If you have a need (or desire) to escape a percent sign in HTML code (and there are good reasons to do this with any potentially ambiguous character or symbol), then I would highly recommend using the named entity &percnt; as opposed to any numeric code. (I use numeric codes only when there is no entity name available.)
That was the answer I was looking for when I found this page, because I forgot it loses the final "e".
We should probably all be using at least the entities kindly listed here. (whoever Webmasterish is; thank you)
Reasoning: numeric codes (and particularly raw byte codes from unencoded characters) change with code–pages on systems using different default languages and/or different operating systems (Windows and Mac using slightly different code sets for “English” is the classic case, and it still plagues plain–text eMail sent between Apple Mail and Outlook). This is slowing down, and should stop with UTF, but I'm still seeing it pop up.
If you're converting HTML to some other mark–up, (note, I used "–" not a "-", or even "−" for the same reason) such as LaTeX, DVI, PostScript or even MarkDown, then it's useful to completely squash any ambiguity… And those processes tend to happen on the information you least expect to be used in such a way when you initially write it. So just get used to doing it everywhere and be grateful to your former self for having had the foresight to do so. Probably years down the line, when you're looking to update formulae to be more readable by utilising MathJax or such, and keep picking up hyphenated words. <swearmarks>
I'd like to add this: if you use JavaScript in href, you are in trouble too. Check this example:
http://jsfiddle.net/cs4MZ/
One of the workarounds might be using onclick instead of href.
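A minimal sketch of that trap (the alert text is arbitrary): the body of a javascript: URL is percent-decoded before it runs, so a literal % can silently change the script.
<!-- "%20" decodes to a space before execution, so this alerts
     "100 done" rather than the literal "100%20done": -->
<a href="javascript:alert('100%20done')">surprising</a>
<!-- An onclick handler is plain script; no URL decoding happens: -->
<a href="#" onclick="alert('100%20done'); return false;">literal</a>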
If you're talking about HTML text visible to the reader, no. It can't do anything harmful there.
...if you're talking about inside of HTML attributes, then yes, that would be good to consider.
URLs and HTML are different languages, as weird as that might seem, so they have different weaknesses.

Why shouldn't I use weird characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots to indicate which slide you're on and to change slides easily.
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases; they are style-able and scale-able, and screen readers would be able to make sense of them.
But, I don't see anyone doing this, and I've got a feeling this is a bad idea, I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters I'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○  △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ¥ 〒 ¢ £ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?
There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However, since CSS can control which font is used, you probably will not run into this problem: as long as you use a web-safe font and know that the character is available in that font, you should be okay.
You can also use an embedded font, though be sure to fall back on a web-safe font that contains the character you need, as many browsers will not support embedded fonts.
However, certain devices will sometimes not have multiple fonts to choose from. If their one font does not support your character, you will run into problems. Depending on what your site does and the audience you are targeting, this may not matter for you; devices like that are also very old and uncommon.
All in all it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important, however, to point out that you should never hard-code those characters; use HTML entities instead. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code; Word uses smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes, I got some weird symbol.
I could have either replaced them with normal quotes (") or kept the curly style with the HTML entities &ldquo; and &rdquo;.
Any Unicode character can be inserted this way (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
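A quick sketch of the two forms; either way the source file stays pure ASCII, which sidesteps the encoding problem described above:
<p>Direct sum: &oplus; (named entity) or &#x2295; (hex numeric reference)</p>
<p>Rating: &#9733;&#9733;&#9733;&#9734;&#9734; (decimal references for ★ and ☆)</p>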
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
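A minimal sketch of that suggestion (class names invented here):
<style>
  .dot { display: inline-block; width: 10px; height: 10px;
         border-radius: 50%; border: 1px solid #333; cursor: pointer; }
  .dot.active { background: #333; }
</style>
<span class="dot active"></span> <span class="dot"></span> <span class="dot"></span>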
As far as design goes, the characters are really limiting: finding one that fits the style of your page can be a hassle, and may mean that you will definitely need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finally, you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text; they are for telling someone something. So ask the question "What am I doing?" and then use what was designed for that task. If you are telling, use them; if you are showing, use images or CSS.
I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with @font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "Font Awesome" (only the upsides you mention, like scalability and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention, but none that I know of.
The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.
There are a number of problems:
- Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (e.g., use UTF-16 or at least UTF-8 encoded files). Most languages allow you to specify such characters using some form of escape notation (e.g., "\u1234" in C#), which avoids the problem but loses some of the advantages; see the sketch after this list.
- Font dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
- No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.
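The escape-notation idea from the portability item, translated to JavaScript (identifiers invented here):
<script>
  // The source stays 7-bit ASCII but still produces the characters:
  const star = "\u2605";   // ★ BLACK STAR
  const oplus = "\u2295";  // ⊕ CIRCLED PLUS
  document.body.append(star, oplus);
</script>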
The question is very broad; it could be split into literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about—not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8859-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against using them, but there are loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use it on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (💩), it would be unrealistic to expect most people to see anything but a symbol for an unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.
I've run into problems using unusual characters: the tools (editor, compiler, interpreter, etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!