Does 1 English letter = 1 Chinese character? - html

I am a UX designer and we are working on a product where there needs to be a text input field for the user to insert their note. There needs to be a word limit indication, whether they're typing in Traditional Chinese or English.
So my question is:
If the character limit is 15, am I correct to say:
I am in Sweden (11/15 characters)
我在瑞典 (4/15 characters)
I was told that 1 Chinese character counts as 2 bytes and 1 English letter counts as 1 byte. How does this affect the character limit? I want to make sure my design is as clear as possible for the developers.

So it’s about display size, right? Counting words won’t be useful in that case because a word can be as long as you want.
Counting characters is marginally more useful, but also doesn’t guarantee that the message will fit in the end because different characters have different widths. Just as an example, these four strings all consist of five characters each:
“​​​​​”
“     ”
“WWWWW”
“﷽﷽﷽﷽﷽”
There really is no elegant way to solve this. You’d need to know the precise metrics of the font you’re using and then calculate the visual width of each input.
If you’re fine with a “close enough” solution, you can just use the <input> element’s maxlength attribute. HTML and JavaScript count UTF-16 code units, however, which means that characters in the so-called Basic Multilingual Plane count as 1 and everything else counts as 2.
The Basic Multilingual Plane contains 99% of all characters in common, present-day use, so the vast majority of users probably won’t notice anything wrong. You could do something fancier with JavaScript, but I reckon it’s not really necessary for this kind of task.
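To make the counting concrete, here is a minimal JavaScript sketch comparing three ways of measuring the strings from the question: UTF-16 code units (what maxlength and .length count), Unicode code points, and UTF-8 bytes. The function names are made up for illustration and are not part of any API. Note that "I am in Sweden" comes out as 14 rather than 11 once the spaces are counted, and that each of these Chinese characters takes 3 bytes in UTF-8; the "2-byte" figure only holds for encodings such as UTF-16 or Big5.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// UTF-16 code units: what String.prototype.length and maxlength count.
function countCodeUnits(text) {
  return text.length;
}

// Unicode code points: string iteration keeps surrogate pairs together.
function countCodePoints(text) {
  return Array.from(text).length;
}

// Bytes when the text is encoded as UTF-8.
function countUtf8Bytes(text) {
  return new TextEncoder().encode(text).length;
}

// "\u{20BB7}" is a Chinese character outside the Basic Multilingual Plane.
for (const sample of ["I am in Sweden", "我在瑞典", "\u{20BB7}"]) {
  console.log(
    sample,
    countCodeUnits(sample),
    countCodePoints(sample),
    countUtf8Bytes(sample)
  );
}
```

Only the last sample differs between the first two counts: it is 2 code units but 1 code point, which is the case where maxlength would count one visible character as 2.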
Just keep in mind that this approach still won’t guarantee that the user’s input will fit visually on the print-out unless you leave a lot of empty room just in case. Definitely play around with some narrow and wide characters to see how much space they really take up when printed.


Force screen reader to read numbers as individual digits

I'm using voice over and I'm trying to get it to read numbers as individual digits, for example if I have inputted 2000, voice over will read out "two thousand". I want the desired behaviour to read out "two zero zero zero".
my current input element looks something like this
<input class="some-class" id="some-id" name="some-name"
type="text">
I have tried setting the type attribute to type="number", type="tel" and adding a style attribute equal to style="speak:spell-out", but none of them worked.
When I separate the numbers with whitespace like value="2 0 0 0" it works, but of course, you can't expect the user to do this.
I understand there may be a way to do this using javascript, but the solution can not contain javascript in the browser due to business requirements.
Any suggestions?
You shouldn't try to force a particular pronunciation or digit grouping.
Add spaces if grouping has a particular importance or meaning.
Take as the base principle that a screen reader shouldn't read numbers differently from how they are presented.
If digits must be separated in a particular way, add spaces, dots, dashes or another separation character.
Conversely, if there's no spaces, there's no special need to absolutely read a number digit by digit.
That's quite simple.
You shouldn't force the screen reader to read something in the way you view it yourself.
Other people may not have the same vision as you. Concerning numbers, some people will prefer to read digit by digit, but others will prefer having them grouped by two, three or four, for their ease of reading, writing and memorizing. Their screen reader is normally configured accordingly.
If a given grouping is important, then groups must be separated with spaces or other characters. If there's no separations, then it implicitly means that grouping has no particular importance.
Note that screen readers always give the possibility to read numbers digit by digit, if the user wishes to do so. It is usually not the default.
Reading numbers digit by digit is usually done only for very big numbers (billions), or when mixing digits and letters.
Additionally consider that:
Different screen reader users have different preferences, and from an accessibility standpoint, it's generally a bad idea to go against preferences or common defaults
There are several screen readers, and a lot of different voices in many languages; all potentially behave in slightly different ways when reading numbers, and any small change made to tweak this might create more problems than it solves.
Screen reader users are used to pronunciation quirks, and they can fix them using a personal dictionary
Screen readers are nowadays not that bad at deciding whether they need to read numbers as a whole, in groups, or digit by digit.
So, avoid deciding a particular grouping or pronunciation. It's a bad idea, and anyway technically perilous.
I understand there may be a way to do this using javascript, but the solution can not contain javascript in the browser due to business requirements.
You tried HTML and CSS and you can't use Javascript. Screenreaders use the Accessibility tree. They do not use CSS, there's no instruction to tell them to spell the text. They might choose to spell some abbreviations while reading some acronym. This is screenreader choice.
Screenreader users are used to hearing numbers as strangely as they come, and if they want them to be read char by char they have the appropriate shortcut to spell them at will.

Hyphenating arbitrary text automatically

What kinds of challenges are there facing automatic hyphenation? It seems that you could just draw word by word, breaking when the length of the line exceeds the length of the viewport (or whatever we're wrapping our text in), placing hyphens after as many characters as can fit (provided at least two characters fit and the word is at least four characters), skipping words that already contain a hyphen (there's no requirement that words have to be hyphenated).
But I note how Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens. This seems to imply that there are further issues regarding where we can place hyphens.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
You have these issues in all languages. You can only place a hyphen where meaningful tokens result from the split, as has already been pointed out. You don't want to, for example, split a word like "wr-ong".
A valid break point may or may not fall on a syllable boundary, although in most languages (including English) it does. But the main point is that you cannot pin it down just with some simple rules. You would need to consider a lot of phonology to get a highly accurate result, and these rules vary from language to language.
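To see why the rule proposed in the question is not enough, here is a small JavaScript sketch of it. The function name and the meaning of `room` (the columns left on the line, including the one the hyphen needs) are assumptions made for illustration. It follows the question's constraints: at least two characters before the hyphen, words of at least four characters, and no splitting of words that already contain a hyphen.

```javascript
// Naive rule from the question, for illustration only.
// `room` is the number of columns left on the current line,
// one of which is needed for the hyphen itself.
function naiveHyphenate(word, room) {
  // Leave at least one character for the next line.
  const headLength = Math.min(room - 1, word.length - 1);
  if (word.includes("-") || word.length < 4 || headLength < 2) {
    return null; // do not split; the whole word goes to the next line
  }
  return [word.slice(0, headLength) + "-", word.slice(headLength)];
}

console.log(naiveHyphenate("wrong", 3)); // [ 'wr-', 'ong' ]
console.log(naiveHyphenate("well-known", 6)); // null
```

The rule has no notion of where a break is acceptable, so it produces exactly the "wr-ong" split described above. That missing knowledge is what the dictionary supplies.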
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research in English phonology of syllables, or the so-called syllabification. You might want to start with this article on Wikipedia:
Wikipedia - Syllabification

Accessibility & HTML title tag separators - alternatives to the vertical line (pipe)

I am attempting to make a site a bit more screen reader friendly and in testing I noticed that a common pattern is quite annoying on a screen reader - the site is using a vertical line / pipe character as a separator in the <title> tag (e.g. <title>Page Name | Site Name</title>). When I use VoiceOver as a screen reader to do testing it is read as "Page Name Vertical Line Site Name" which sounds especially odd with the particular title of the site.
What are the best accessible alternatives to the pipe that also have no negative effect on SEO? I've tried a <title>Page Name - Site Name</title> and <title>Page Name · Site Name</title> and they work okay, but I'm afraid they might have gotchas (e.g. reading as 'dash' or 'ampersand m i d d o t semicolon') in some edge case or cause chaos with SEO. Is there an accepted best practice for this?
The pronunciation of punctuation or special characters varies by screen reader, so there is no optimal choice. While it is true that “vertical line” sounds odd, it’s an oddity that screen reader users are accustomed to, since the “|” is widely used (not that much in title elements, but rather in link lists and other contexts). The use of an en dash “–” might help, as it is a normal punctuation character and might be just ignored or even handled in an advanced way (e.g., a pause followed by raised tone). On the other hand, a comma “,” or a colon “:” might do the same thing, or do better.
It is very unlikely that such choices have any impact on SEO, since search engines generally ignore punctuation and special characters. (They might take note of some special characters in some contexts, e.g. distinguishing between C and C++.)
Depending on context and context language, you could also consider using purely verbal expressions, e.g. in English using “of” instead of a separator character. “New products of ACME Corporation” sounds better than “New products | ACME Corporation” (though the latter is in no way wrong). This may have a minor impact on SEO, since search engines may treat even small words like “of” as significant; but this would not matter much, due to the way people write things in search boxes.
Using either a hyphen, or comma will have no effect on the site's SEO ranking.
But shorter titles often get more clicks from search (my own impression, backed by some data below), so consider whether you really need the site's name on every page.
Also, keep your title around 55 characters or so for best results in Google. The character count is trickier than it used to be; see http://moz.com/blog/new-title-tag-guidelines-preview-tool for a detailed explanation of some recent changes by Google. The current length is actually determined by pixel counts, not characters.
See this experiment for PPC CTR based on title lengths: http://danzarrella.com/ppc-ad-line-lengths-and-clickthrough-rates.html# (not SEO, but correlates with my own experience in organic results).

Why shouldn't I use weird Characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots to indicate which slide you're on and to change slides easily:
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases, they are style-able and scale-able, and screen readers would be able to make sense of them.
But I don't see anyone doing this, and I've got a feeling this is a bad idea; I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters I'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○  △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ¥ 〒 ¢ £ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?
There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However since CSS can control font use you probably will not run into this problem. As long as you use a web safe font, and know that that character is available in that font, you should probably be okay.
You can also use an embedded font, though be sure to fall back on a web safe font that contains the character you need, as many browsers will not support embedded fonts.
However sometimes certain devices will not have multiple fonts to choose from. If that font does not support your character you will run into problems. However depending on what your site does and the audience you are targeting this may not be a problem for you. Not to mention that devices like that are very old, and uncommon.
All in all it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important however to point out that you should never hard code those characters, instead use HTML entities. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code, Word used smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes, I got some weird symbol.
I could have either replaced them with normal quotes (") or used HTML entities to keep the curly style: &ldquo; and &rdquo; (“ and ”).
Any Unicode character can be inserted this way (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
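For characters without a named entity, a numeric character reference of the form &#xHHHH; does the same job. As a rough sketch of the idea (the function name is invented for illustration), the following JavaScript replaces every non-ASCII character with its hexadecimal reference. It deliberately leaves &, < and > alone, so it is not a general HTML escaper.

```javascript
// Hypothetical helper: replace non-ASCII characters with numeric
// character references such as &#x201C;. ASCII is passed through as-is.
function toNumericReferences(text) {
  let out = "";
  // for...of iterates by code point, so characters outside the
  // Basic Multilingual Plane become a single reference.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? "&#x" + codePoint.toString(16).toUpperCase() + ";"
      : ch;
  }
  return out;
}

console.log(toNumericReferences("\u201Cquoted\u201D \u2605"));
// &#x201C;quoted&#x201D; &#x2605;
```

The output is plain ASCII, so it survives being saved or served in the wrong encoding, which is the failure described with the smart quotes above.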
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
As far as design goes they are really limiting, finding one that fits with the style of your page can be a hassle, and may mean that you will definitely need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finally you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text; they are for telling someone something. So ask the question: "What am I doing?" and then use what was designed for that task. If you are telling, use them; if you are showing, use an image or CSS.
I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with @font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "font awesome" (only the upsides you mention like scalability and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention but none that I know of.
The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.
There are a number of problems:
Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (eg. use UTF16 or at least UTF-8 encoded files). Most languages allow you to specify strings in characters using some form of escape notation (eg. "\u1234" in C#), which will avoid the problem, but loses some of the advantages.
Font-dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.
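The escape notation mentioned under portability can be illustrated in JavaScript, which uses the same "\u1234" form as the C# example. This is only a sketch with an invented function name: it rewrites a string so that the resulting source text is pure ASCII, and it does not handle quotes or backslashes.

```javascript
// Hypothetical helper: turn non-ASCII UTF-16 code units into \uXXXX
// escapes so the text can be pasted into an ASCII-only source file.
function toAsciiLiteral(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out += unit > 0x7f
      ? "\\u" + unit.toString(16).padStart(4, "0")
      : text[i];
  }
  return out;
}

console.log(toAsciiLiteral("\u2605 rating")); // \u2605 rating
console.log("\u2605" === "★"); // true when this file is read as UTF-8
```

The trade-off is the one named above: the file is safe in any ASCII-compatible encoding, but the source no longer shows the character itself.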
The question is very broad; it could be split into literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about, not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8859-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against using them, but loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use it on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (♨), it would be unrealistic to expect most people to see anything but a symbol for unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.
I've run into problems using unusual characters: the tools (editor, compiler, interpreter etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!

what are the disadvantages of having tons of &nbsp; entities?

I've been writing a source-to-display converter for a small project. Basically, it takes an input and transforms the input into an output that is displayable by the browser (think Wikipedia-like).
The idea is there, but it isn't like the MediaWiki style, nor is it like the Markdown style. It has a few innovations of its own. For example, when the user types in a chain of spaces, I would presume he wants the spaces preserved. Since HTML ignores spaces by default, I was thinking of converting these chains of spaces into respective &nbsp;s (for example 3 spaces in a row converted to 1 )
So what happens is that I can foresee a possibility of a ton of &nbsp; tags per post (and a single page may have multiple posts).
I've been hearing a lot of anti-&nbsps on the web, but most of it boils down to readability headaches (in this case, the input is supplied by the user; if he decides to make his post unreadable he can do so with any of the other formatting actions supplied) or maintenance headaches (which is not an issue in this case, since it's a converted output).
I'm wondering what are the disadvantages of having tons of &nbsp; tags on a webpage?
You are rendering every space as &nbsp;?
Besides wasting so much bandwidth, this will not allow dynamic line breaking as "nbsp" means "*n*on *b*reaking *sp*ace". This will most probably cause much trouble.
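One common workaround, which is not from the answers here and should be read as an assumption about what would suit this converter, is to turn only the extra spaces of a run into &nbsp; and keep one ordinary space at the end. The run keeps its width, single spaces cost nothing, and the browser still has a place to break the line. A minimal JavaScript sketch with an invented function name:

```javascript
// Hypothetical helper: for each run of 2 or more spaces, emit
// (n - 1) non-breaking spaces followed by one ordinary space, so the
// run keeps its width but the line can still wrap after it.
function preserveSpaceRuns(text) {
  return text.replace(/ {2,}/g, (run) => "&nbsp;".repeat(run.length - 1) + " ");
}

console.log(preserveSpaceRuns("a   b c")); // a&nbsp;&nbsp; b c
```

Single spaces between words are left untouched, so ordinary prose produces no entities at all.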
If it's just being dumped to a client, it's just a matter of size, and if it's gzipped, it barely matters in terms of network traffic.
It'll slow down rendering, I'm sure, and take up DOM space, but whether or not that matters depends on stuff I don't know about your use case(s). You might be able to achieve the same result in other ways, too; not sure.
&nbsp;s aren't tags, but are character entities like &copy;, &lt;, &gt;, etc.
I'd say that the disadvantages would be readability. When I see a word, I expect the spacing to be constant (unless it is in a block of justified text).
Can you show me a case where you'd need &nbsp;s?
Have you considered trying to figure out what the user, by inserting those spaces, is really trying to achieve? Rather than the how (they want to insert the spaces), the what (if the spaces are at the beginning of a line, they want to indent the text in question).
An example of this is many programming sites convert 4 spaces at the start of a line to a pre+code block.
For your purposes, maybe it should be a <block> block.
The end goal being that of converting the spaces not to what the user (with their limited resources) intended to show up there but, rather, what they meant to convey with it.
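As a rough illustration of converting intent rather than spaces, the following JavaScript sketch applies the rule mentioned above: lines starting with 4 spaces are grouped into a pre+code block. The function names and the exact output markup are assumptions for the example, not the behaviour of any particular site.

```javascript
// Escape the characters that are special in HTML text content.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical converter: consecutive lines indented by 4 spaces
// become one <pre><code> block; all other lines pass through escaped.
function convertIndentedBlocks(source) {
  const out = [];
  let codeLines = [];
  const flush = () => {
    if (codeLines.length > 0) {
      out.push("<pre><code>" + codeLines.join("\n") + "</code></pre>");
      codeLines = [];
    }
  };
  for (const line of source.split("\n")) {
    if (line.startsWith("    ")) {
      codeLines.push(escapeHtml(line.slice(4))); // drop the 4-space marker
    } else {
      flush();
      out.push(escapeHtml(line));
    }
  }
  flush();
  return out.join("\n");
}

console.log(convertIndentedBlocks("intro\n    x < 1\n    y\nend"));
```

Inside the pre element the browser preserves any remaining spaces by itself, so no &nbsp; entities are needed for that part of the post.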