What character should I use to maintain height of an empty (zero width) string? - html

I have a string that can potentially be empty, and in that case, I want to substitute it with a special character to maintain the ordinary text height while having zero width. In TeX, this would be called \strut. What is the counterpart for that in HTML? I came up with two candidates: ⁠ and . Should I use one of these?

On modern browsers, any zero-width character will do the job, provided that the browser either knows that the character is zero-width or uses a font that contains an empty glyph for it. But some characters may have effects, depending on the context and on software used to process the HTML file.
U+2060 WORD JOINER has the effect of preventing line break.
U+FEFF ZERO WIDTH NO-BREAK SPACE has the same effect. It is formally deprecated for any use except as Byte Order Mark, but in reality it works more often than WORD JOINER (though there are exceptions).
U+200B ZERO WIDTH SPACE has the effect of allowing a line break even when it would otherwise not be permitted; it’s like SPACE, but with zero width.
Usually the worst-case scenario for characters like this is an old version of IE. Checking in IE 6 shows that U+FEFF and U+200B are OK, but U+2060 shows as a small rectangle (i.e., the browser tries to render the character but finds no glyph for it).
So I’d use  or ​ depending on whether I’d like to prevent or allow line break at that point. If it does not matter, ​ is more logical to use.

I would suggest  or if zero width is not essential or if it is essential you could try the Unicode character ⁠ which is a zero width non-breaking space.

Related

Prevent Breaking of Negative Numbers

My HTML page contain tables with many negative numbers, like –0.25 . 8211 is the n-dash. Because my document is supposed to become epub2 eventually, javascript is not allowed. only xhtml+css.
Unfortunately, both ebook readers and the print function in Chrome think that it is a reasonable idea to line-break a negative number between the en-dash and the zero, even when there is a space before and/or after, e.g., in a table.
I need a "non-breaking" en-dash? there are non-breakable spaces, after all, too. Or is there a way to instruct css never to break such negative numbers anywhere throughout the entire document? (I doubt this one, but just had to ask.)
of course, I can wrap each negative number into a span to prevent breaking, but this is quite painful. literally, by the time I am all done, my number --0.25 would have to become <span class="nobreak">–0.25</span>. (joke: it's almost like a DOS 10x amplification attack, with 4 chars becoming 40 characters, all because I want to have negative numbers.)
advice appreciated.
/iaw
You can prevent negative numbers from breaking by using the proper MINUS SIGN “−” (U+2212). In
text rendering, browsers, ebook readers, and other software often treat EN DASH as well as HYPHEN-MINUS (the common Ascii hyphen) as allowing a line break after it, even when immediately followed by a digit. No such behavior has been observed for MINUS SIGN.
In HTML, you can write MINUS SIGN as − if you have difficulties in typing the character or if you wish to make it clear to anyone reading the HTML source that MINUS SIGN is used.

Why is a trailing punctuation mark rendered at the start with direction:rtl?

This is more a sort of curiosity. While working on a multilingual web application I noticed that certain characters like punctuation marks (!?.;,) at the end of a block element are rendered as if they were placed at the beginning instead when the writing direction is right-to-left (as it is the case for certain Asian languages I do not speak).
In other words, The string
Hello, World!
is rendered as
!Hello, World
when placed in a div block with direction: rtl
This becomes even more evident if the text is split in two parts and given different colors: a contiguous chunk of text at the end is rendered in two separated regions:
http://jsfiddle.net/22Qk9/
What's the point of this behavior? I guess this must be a peculiarity of (all?) right-to-left languages which is automatically handled by the browser, so I don't need to care about it, or should I?
If you want to fix this behavior add the LRM character ‎ in the end. It's a non=printing character.
Source : http://dotancohen.com/howto/rtl_right_to_left.html
Example : http://jsfiddle.net/yobjj6ed/
The reason is that the exclamation mark “!” has the BiDi class O.N. ('Other Neutrals'), which means effectively that it adapts to the directionality of the surrounding text. In the example case, it is therefore placed to the left of the text before it. This is quite correct for languages written right to left: the terminating punctuation mark appears at the end, i.e. on the left.
Normally, you use the CSS code direction: rtl or, preferably, the HTML attribute dir=rtl for texts in a language that is written right to left, and only for them. For them, this behavior is a solution, not a problem.
If you instead use direction: rtl or dir=rtl just for special effects, like making table columns laid out right to left, then you need to consider the implications. For example, in the table case, you would need to set direction to ltr for each cell of the table (unless you want them to be rendered as primarily right to left text).
If you have, say, an English sentence quoted inside a block of Arabic text, then you need to set the directionality of an element containing the English text to ltr, e.g.
<blockquote dir=ltr>Hello, World!</blockquote>
A similar case (just with Arabic inside English text) is discussed as use case 6 in the W3C document What you need to know about the bidi algorithm and inline markup (which has a few oddities, though, like using cite markup for quoted text, against W3C recommendations).
The accepted answer https://stackoverflow.com/a/20799360/477420 works if you can control markup/CSS of the value, if you have no control over HTML following approach could work.
If you don't know if page will be rendered RTL or LTR but some text is definitely LTR (i.e. English-only) you can wrap the value with LRE/PDF marks to signify that is LTR region. Text will be rendered LTR irrespective of page's LTR or RTL direction.
This works when you have some code that tries to render text without ability to change markup of how exactly it will show up on the page. I.e. you rendering value for "song tile" or "company name" field in some nested child component (or server side) without ability to control surrounding HTML elements.
One drawback of this and similar approaches (like LRM proposal in this question) with adding marks to text is copy-paste of such value from the resulting HTML page will generally preserve the marks but they are not visible/zero width. While for most cases it is fine consider if that is a problem for you.
Approximate sample code (some companies have "Inc." at the end which will end up with dot at the beginning when rendered as-is on RTL page):
// comanyName = "Alphabet Inc." - really likes dot at the end including RTL
if(stringIsDefinitelyAscii(companyName))
{
companyName = "\u202A" + companyName + "\u202C"
}
return companyName;
Details on LRE/PDF symbols can be found in https://unicode.org/reports/tr9/#Explicit_Directional_Embeddings:
LRE U+202A LEFT-TO-RIGHT EMBEDDING
Treat the following text as embedded left-to-right.
PDF U+202C POP DIRECTIONAL FORMATTING End the scope of the last LRE, RLE, RLO, or LRO.
Some approaches to figure out if string has RTL characters can be found in How to detect whether a character belongs to a Right To Left language?, JavaScript: how to check if character is RTL?, How to detect if a string contains any Right-to-Left character?.

How to force browser to break text where possible?

I need to determine minimum width adequate for displaying a possibly wrapped dynamic HTML string. Without word-wrapping this is simple: create a span, set its innerHTML and read offsetWidth. However, I'm not sure how to force linebreaks... Easiest incomplete approach I see is to replace all spaces by <br/>, but lines can be wrapped not only on spaces but also e.g. on hyphens.
So, basically, I want a browser to lay out sth. like
Max.
word-
wrapped
string
<----->
somewhere off-screen to measure width of the longest contained word. Is there a generic way to do that?
EDIT
I.e., for no line wraps:
function computeWidth (str) { // code to make it off-screen and caching omitted
var span = document.createElement ('span');
document.body.appendChild (span);
span.innerHTML = str;
return span.offsetWidth;
}
How would I write a similar function which forces line breaks on str, presumably playing with span.style?
You can use CSS work-break / word-wrap and programmatically inserted soft-hyphens (­, but I'd advise you to read up on cross browser problems regarding soft hyphens, as there are quite a few - I normally us the unicode for a soft hypen (U+00AD), but your mileage may vary), and then determine the width with javascript using the range object and measuring cursor offset from the left.
I'm suggesting the use of soft-hyphens, because even the same browser will normally break words differently depending on the OS / which dictionary (on OSX) is used. If that's not an issue for you, you can do it without soft hyphens.
Afaik there is no generic way to get what you want in html/js (it's different if you were using something like flash).
Range object:
https://developer.mozilla.org/en-US/docs/Web/API/range
A different approach would be using the canvas object, but you would probably not get exact results there, as there is just too much factors influencing text rendering in browsers nowadays (font, size, kerning, tracking, ...)
Again another approach would be using <pre> tags / whitespace: pre-wrap, setting the font to what you normally use, and then either emulate breaking words by inserting linebreaks or copying them from still another span/div/whatever set up with word wrap - I haven't tested this yet, but if it works, it might be easer than iterating with the range object.
Edit: Just so it's not only in the comments, still another solution:
Start your container with width 1px, then increase the width, checking the height every time ; when the height decreases, go back one step, and you got your width. Simplest implementation would use 1px increase/1px decrease, but you could of course optimize it to using something like a binary search algorithm, e.g. starting with 1px, then 2px, then 4px increases, then the same backwards, and forwards again and so on till you have a result at a 1px step. But that's only if the 1px inc/dec sollution is too slow ;)
Use the CSS word-break rule:
The word-break CSS property is used to specify how (or if) to break lines within words.
Normal
Use the default line break rule.
break-all
Word breaks may be inserted between any character for non-CJK (Chinese/Japanese/Korean) text.
keep-all
Don't allow word breaks for CJK text. Non-CJK text behavior is same as normal.
<p style="word-break:break-all;">
Max.word-wrapped string<----->
</p>
(source)
Try to play with the word-break Property.
More here: http://www.w3schools.com/cssref/tryit.asp?filename=trycss3_word-break
You should try:
word-break: break-all;
Add it to the CSS like in this fiddle, or read here.

How does Zalgo text work?

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11 of Combining Characters in the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with given height. Combining marks may be rendered above, below, or inside a base character
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
Zalgo text works because of combining characters. These are special characters that allow to modify character that comes before.
OR
y + ̆ = y̆ which actually is
y + ̆ = y̆
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
To produce a list of combining diacritical marks you can use the following script (since links keep on dying)
for(var i=768; i<879; i++){console.log(new DOMParser().parseFromString("&#"+i+";", "text/html").documentElement.textContent +" "+"&#"+i+";");}
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾

Difference between "+" and "%A0" - urlencoding?

I am url encoding a string of text to pass along to a function. However, it encodes the second space in a double-space as "%A0". This means that when I decode the string, the "%A0" is displayed as a question mark in a black box.
I really just need to be able to remove the extra space, but I'd like to understand what is causing this and how to handle it correctly.
For example:
Something  Something else
Encodes to:
Something+%A0Something+else
%A0 indicates a NBSP (U+00A0). + indicates a normal space (U+0020). The NBSP displays as a replacement character (U+FFFD) because the encoding of the character does not match the encoding of the page, so its byte sequence is not valid for the page.
A quick Googling shows that %A0 is the non-breaking space character or in html. A + is the form-encoding for a standard space character.
Source
The problem you're having is that the second "space" is not really a space, it's a character that that font doesn't have a glyph (I think that's the term) to represent (hence the black box with the question mark). %A0 is the escape code for that character. Your code is technically handling it correctly, I think the problem is with whatever is generating the string in the first place.
If I refer to the chart on this page, %A0 is not a space. %20 is the space caracter's encoded value.