Difference between "+" and "%A0" - urlencoding? - html

I am url encoding a string of text to pass along to a function. However, it encodes the second space in a double-space as "%A0". This means that when I decode the string, the "%A0" is displayed as a question mark in a black box.
I really just need to be able to remove the extra space, but I'd like to understand what is causing this and how to handle it correctly.
For example:
Something  Something else
Encodes to:
Something+%A0Something+else

%A0 indicates a NBSP (U+00A0). + indicates a normal space (U+0020). The NBSP displays as a replacement character (U+FFFD) because the encoding of the character does not match the encoding of the page, so its byte sequence is not valid for the page.
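A minimal sketch of the difference in JavaScript, assuming the goal is to turn the NBSP into an ordinary space before encoding. The helper name normalizeSpaces is made up for illustration. Note that encodeURIComponent always works in UTF-8 and writes a space as %20 (+ is the form-encoding variant), so it emits NBSP as %C2%A0; a bare %A0 is what a Latin-1 (ISO-8859-1) encoder produces, and it is not a valid UTF-8 sequence on its own.

```javascript
// Hypothetical helper: turn NBSP (U+00A0) into a plain space, then collapse runs of spaces.
function normalizeSpaces(text) {
  return text.replace(/\u00A0/g, " ").replace(/ {2,}/g, " ");
}

const raw = "Something \u00A0Something else"; // a space followed by an NBSP
const clean = normalizeSpaces(raw);

console.log(encodeURIComponent(raw));   // Something%20%C2%A0Something%20else
console.log(encodeURIComponent(clean)); // Something%20Something%20else

// A lone %A0 is a Latin-1 byte, not a UTF-8 sequence, so a UTF-8 decoder rejects it.
let decodeFailed = false;
try {
  decodeURIComponent("Something+%A0Something+else");
} catch (err) {
  decodeFailed = err instanceof URIError;
}
console.log(decodeFailed); // true
```

If the NBSP is meaningful and has to survive the round trip, the alternative is to make sure the encoder and the decoder agree on one character encoding (UTF-8 on both sides is the usual choice).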

A quick Googling shows that %A0 is the non-breaking space character, or &nbsp; in HTML. A + is the form-encoding for a standard space character.
Source

The problem you're having is that the second "space" is not really a space; it's a character that the font doesn't have a glyph (I think that's the term) to represent, hence the black box with the question mark. %A0 is the escape code for that character. Your code is technically handling it correctly; I think the problem is with whatever is generating the string in the first place.

If I refer to the chart on this page, %A0 is not a space. %20 is the space character's encoded value.

Related

My HTML has different spacing

I'm really confused about this. I have same shortcode but different spacing which causes the second shortcode to fail.
[button link="#"]btn[/button]
[button link="#"]btn[/button]
They look identical, but when running them through a compare tool, there's a difference in the space before link.
Please see here:
http://www.diff-online.com/view/589710128ca13
How is that possible? Any idea? Thanks.
You can check which character it is at "What Unicode Characters it is".
You would see that the first one is:
U+0020 : SPACE [SP]
While second is:
U+00A0 : NO-BREAK SPACE [NBSP]
Just because they look the same doesn't necessarily mean they are the same character.
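The same check can be done without an online tool. This is only a sketch; codePoints and firstDifference are hypothetical helper names, and the second string assumes the odd character is the NBSP reported above.

```javascript
// Hypothetical helper: list the code point of every character in a string.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// Hypothetical helper: index of the first character where two strings differ, or -1.
function firstDifference(a, b) {
  const left = Array.from(a);
  const right = Array.from(b);
  const limit = Math.max(left.length, right.length);
  for (let i = 0; i < limit; i++) {
    if (left[i] !== right[i]) return i;
  }
  return -1;
}

const first = '[button link="#"]btn[/button]';
const second = '[button\u00A0link="#"]btn[/button]'; // NBSP before "link"

const at = firstDifference(first, second);
console.log(at);                                 // 7
console.log(codePoints(Array.from(first)[at]));  // [ 'U+0020' ]
console.log(codePoints(Array.from(second)[at])); // [ 'U+00A0' ]
```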

Chrome doesn't respect Zero Width Joiner

If I create text with a hyphen at the start of a word (very common in German), Google Chrome puts the hyphen at the end of one line and the word at the start of the next line. This is the wrong behavior: the hyphen and the word should stay on one line. Even if I put a &zwj; entity between the hyphen and the word, it still doesn't work correctly.
In Firefox all is well.
Example here: https://jsfiddle.net/p6dp2hLb/2/
You can use ‑ (&#8209;) [Unicode Character 'NON-BREAKING HYPHEN' (U+2011)], which is treated like an alphabetic character, instead of the raw hyphen character, because the raw hyphen has a special meaning in line breaking.
Maybe you can use a hack to get round it?
<span style="white-space: nowrap;">-a</span>
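If the text is produced by a script, the substitution suggested above could be applied automatically. A rough sketch, assuming a "leading hyphen" means a hyphen preceded by whitespace (or the start of the string) and followed by a letter; protectLeadingHyphens is a made-up name.

```javascript
// Hypothetical helper: swap a hyphen that starts a word for U+2011 NON-BREAKING HYPHEN.
function protectLeadingHyphens(text) {
  // (^|\s) keeps the preceding whitespace; (?=\p{L}) requires a letter right after the hyphen.
  return text.replace(/(^|\s)-(?=\p{L})/gu, "$1\u2011");
}

const sample = "Hausbau und -verkauf";
const fixed = protectLeadingHyphens(sample);

console.log(fixed === "Hausbau und \u2011verkauf");  // true
console.log(protectLeadingHyphens("a - b") === "a - b"); // true: a free-standing hyphen is left alone
```

Keep in mind that this changes the characters in the text itself, so searching or copying the text will yield U+2011 rather than a plain hyphen; the nowrap span avoids that.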

What character should I use to maintain height of an empty (zero width) string?

I have a string that can potentially be empty, and in that case, I want to substitute it with a special character to maintain the ordinary text height while having zero width. In TeX, this would be called \strut. What is the counterpart for that in HTML? I came up with two candidates: &#x2060; and &#xFEFF;. Should I use one of these?
On modern browsers, any zero-width character will do the job, provided that the browser either knows that the character is zero-width or uses a font that contains an empty glyph for it. But some characters may have effects, depending on the context and on software used to process the HTML file.
U+2060 WORD JOINER has the effect of preventing line break.
U+FEFF ZERO WIDTH NO-BREAK SPACE has the same effect. It is formally deprecated for any use except as Byte Order Mark, but in reality it works more often than WORD JOINER (though there are exceptions).
U+200B ZERO WIDTH SPACE has the effect of allowing a line break even when it would otherwise not be permitted; it’s like SPACE, but with zero width.
Usually the worst-case scenario for characters like this is an old version of IE. Checking in IE 6 shows that U+FEFF and U+200B are OK, but U+2060 shows as a small rectangle (i.e., the browser tries to render the character but finds no glyph for it).
So I’d use &#xFEFF; or &#x200B; depending on whether I’d like to prevent or allow a line break at that point. If it does not matter, &#x200B; is more logical to use.
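When the string is filled in from a script, that choice could be wrapped up like this. It is only a sketch; strutIfEmpty and its preventBreak parameter are hypothetical names.

```javascript
const ZERO_WIDTH_SPACE = "\u200B";          // allows a line break at that point
const ZERO_WIDTH_NO_BREAK_SPACE = "\uFEFF"; // prevents a line break at that point

// Hypothetical helper: substitute a zero-width character for an empty string
// so the containing element keeps its ordinary text height.
function strutIfEmpty(text, preventBreak = false) {
  if (text !== "") return text;
  return preventBreak ? ZERO_WIDTH_NO_BREAK_SPACE : ZERO_WIDTH_SPACE;
}

console.log(strutIfEmpty("abc"));                                // abc
console.log(strutIfEmpty("").codePointAt(0).toString(16));       // 200b
console.log(strutIfEmpty("", true).codePointAt(0).toString(16)); // feff
```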
I would suggest &nbsp; if zero width is not essential; if it is essential, you could try the Unicode character &#8288; (WORD JOINER), which acts as a zero width non-breaking space.

How should Quoted-Printable Mime-Words be wrapped to the correct line length?

I ran into a bug in a mime parsing library where it blows up on subject lines that contain foreign characters beyond a certain length. It turns out that it would convert the subject into a Quoted-Printable MIME "Encoded-Word" and then try to word-wrap the whole thing to 78 characters. Because MIME-Word encoding has no spaces (they are replaced with underscores) it failed to wrap.
Example line being wrapped:
Subject: =?UTF-8?Q?lalalla_=E7=84=A1=E6=AD=A4=E7=84=A1=E6=AD=A4=E9=A0=85=E7=9B=AE=AE=AE=AE=AE=AE=AE=AE=AE?=
I thought I might contribute a patch to the library to wrap the line correctly but I couldn't find a reference as to how to break up a MIME-Word as part of a word-wrapping algorithm.
The RFC 5322 says to word-wrap at spaces but doesn't provide any guidance about what to do if there's a string of characters with no white-space that exceeds the target width.
Anyone know the correct action to take here?
Just split the line where you need to, and continue with a 2nd wrapped line. For example:
Subject: =?UTF-8?Q?lalalla_=E7=84=A1=E6=AD=A4=E7=84=A1=E6=AD=A4?=
 =?UTF-8?Q?=E9=A0=85=E7=9B=AE=AE=AE=AE=AE=AE=AE=AE=AE?=
Just make sure the 2nd (and following) wrapped lines start with either a space or a tab.
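For a patch, the rules to respect come from RFC 2047: an encoded-word may be at most 75 characters long, a header line containing encoded-words at most 76, a multi-byte character must not be split across two encoded-words, and whitespace between adjacent encoded-words is ignored when the header is displayed. Below is a rough sketch of that approach, not the library's actual code; qEncodeChar and foldEncodedHeader are hypothetical names, and it always uses UTF-8 with a conservative set of unescaped characters.

```javascript
// Q-encode a single character: space becomes "_", a conservative safe set stays
// literal, and everything else becomes =XX for each of its UTF-8 bytes.
function qEncodeChar(ch) {
  if (ch === " ") return "_";
  if (/^[A-Za-z0-9!*+\-\/]$/.test(ch)) return ch;
  const bytes = new TextEncoder().encode(ch);
  return Array.from(bytes, (b) =>
    "=" + b.toString(16).toUpperCase().padStart(2, "0")).join("");
}

// Hypothetical helper: build "Name: =?UTF-8?Q?...?=" folded into several
// encoded-words so that no line exceeds maxLine characters.
function foldEncodedHeader(name, text, maxLine = 76) {
  const prefix = "=?UTF-8?Q?";
  const suffix = "?=";
  const overhead = prefix.length + suffix.length;
  const words = [];
  // Payload room on the first line, after "Name: ".
  let budget = maxLine - (name.length + 2) - overhead;
  let current = "";
  for (const ch of text) { // iterates by code point, so byte sequences stay whole
    const piece = qEncodeChar(ch);
    if (current !== "" && current.length + piece.length > budget) {
      words.push(prefix + current + suffix);
      current = "";
      budget = maxLine - 1 - overhead; // continuation lines begin with one space
    }
    current += piece;
  }
  if (current !== "") words.push(prefix + current + suffix);
  return name + ": " + words.join("\r\n ");
}

const folded = foldEncodedHeader("Subject", "lalalla \u7121\u6B64\u7121\u6B64\u9805\u76EE");
console.log(folded);
// Subject: =?UTF-8?Q?lalalla_=E7=84=A1=E6=AD=A4=E7=84=A1=E6=AD=A4=E9=A0=85?=
//  =?UTF-8?Q?=E7=9B=AE?=
```

Splitting per character rather than per byte is the important design choice here: each encoded-word then decodes on its own, which is what a conforming reader expects.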

How does Zalgo text work?

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, of the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with a given height. Combining marks may be rendered above, below, or inside a base character.
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
Zalgo text works because of combining characters. These are special characters that modify the character that comes before them:
y + ̆ = y̆ which actually is
y + &#x0306; = y&#x0306;
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is y followed by a run of &#x0306; references, one for each stacked breve.
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is the same sequence with a run of &#x0330; (COMBINING TILDE BELOW) references added, one for each mark underneath.
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
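The stacking is easy to reproduce and to undo in code. A small sketch with made-up helper names; it only looks at the U+0300–U+036F block mentioned above, so marks from other combining blocks are not counted or removed.

```javascript
const BREVE = "\u0306";       // COMBINING BREVE, rendered above the base character
const TILDE_BELOW = "\u0330"; // COMBINING TILDE BELOW, rendered under it

// One base letter followed by eight combining marks.
const stacked = "y" + TILDE_BELOW.repeat(3) + BREVE.repeat(5);

// Hypothetical helper: count the characters that fall in U+0300–U+036F.
function countCombiningDiacritics(text) {
  let count = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0x0300 && cp <= 0x036f) count++;
  }
  return count;
}

// Hypothetical helper: remove every character in U+0300–U+036F.
function stripCombiningDiacritics(text) {
  return text.replace(/[\u0300-\u036F]/g, "");
}

console.log(stacked.length);                    // 9
console.log(countCombiningDiacritics(stacked)); // 8
console.log(stripCombiningDiacritics(stacked)); // y
```

Stripping this way also removes legitimate accents written with combining marks, so it is a blunt tool; capping the number of marks per base character is a gentler option.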
More about it here
To produce a list of combining diacritical marks, you can use the following script (since links keep on dying):
// Print each combining diacritical mark (U+0300 to U+036F, i.e. 768 to 879 inclusive)
// followed by its numeric character reference.
for (var i = 768; i <= 879; i++) {
  console.log(String.fromCodePoint(i) + " &#" + i + ";");
}
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾