How to display ASCII 26 (control characters) in HTML - html

We have a record in a SQL database that contains an ASCII 26 character:
SELECT char(26)
It looks like an arrow, which we can see in the Eclipse debugger. However, when we try to output it to the HTML front end, the character is simply skipped. Stranger still, the arrow does appear in the page source.
It seems 26 is a control character. So is it possible to display the arrow in HTML? And why can some places, like the Eclipse debugging window, show it?

It's a control character, unprintable by definition. Some character sets (or fonts, not sure which determines that) do print control characters; Unicode is not one of them. See Browser Test Page for Unicode Character 'SUBSTITUTE' (U+001A).
Decide what you actually want to display, and replace this character with an actually printable Unicode character.
You could, for example, use &rarr; or &#8594; (→), Unicode Character 'RIGHTWARDS ARROW' (U+2192).
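The replacement suggested above could be sketched like this in JavaScript, run on the text before it is written into the page. The function name showSubstituteAsArrow is hypothetical, and choosing U+2192 as the stand-in is an assumption about what you want readers to see.

```javascript
// Minimal sketch: swap the unprintable SUBSTITUTE control character (U+001A)
// for a printable RIGHTWARDS ARROW (U+2192) before rendering the text.
// The function name is hypothetical; pick whatever replacement suits your data.
function showSubstituteAsArrow(text) {
  return text.replace(/\u001A/g, "\u2192");
}

console.log(showSubstituteAsArrow("left\u001Aright"));
```

If the value comes from the database untouched, doing this in the output layer keeps the stored data unchanged.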

Related

How to display hidden characters in PhpStorm, especially line separators

I have some special characters in my code. Take a look:

 a




It's just shown in frontend with normal characters like an "a".
Now the same characters without any normal characters:
Characters start here





Characters end here
OK, it looks like this editor will not save the empty characters, so try it with a snippet:
<html><p>
 </p></html>
The problem is that PhpStorm does not show these characters, not even with
"Settings - Editor - General - Appearance - Show whitespaces" or
"Settings - Editor - General - Appearance - Show method separators".
Only Ctrl+F / Ctrl+R (find and replace) will find these characters.
I thought this character was Mac-only :) I'm working on Windows, and I can't test it on a Mac.
EDIT: Sorry, I was able to identify it as "U+2028 : LINE SEPARATOR" using
http://www.babelstone.co.uk/Unicode/whatisit.html
The big problem is that PhpStorm shows nothing in the code, as if there were no character at all. But when moving with the arrow keys, the cursor takes two steps at this position: between two tags it looks like "><" but it is really "> <".
Based on your update it is now clear what character you have in mind:
Sorry I could identify it as "U+2028 : LINE SEPARATOR" http://www.babelstone.co.uk/Unicode/whatisit.html
Install and use the Zero Width Characters locator 2 plugin: it can detect quite a few invisible characters (e.g. the UTF-8 BOM sequence, non-breaking space, Unicode line separator (your case), etc.).
It is implemented as a separate inspection with the highest (Error) severity, so problems will be easy to spot, and you can check the whole folder/project just for these issues.
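Outside the IDE, the same kind of check can be approximated with a short script. The sketch below is not the plugin's implementation; the character table and the function name findInvisibleCharacters are assumptions chosen for illustration, and the list is far from exhaustive.

```javascript
// Minimal sketch: report invisible characters in a string.
// The table is a small, hand-picked sample, not a complete list.
const INVISIBLE = {
  "\u00A0": "NO-BREAK SPACE",
  "\u200B": "ZERO WIDTH SPACE",
  "\u200C": "ZERO WIDTH NON-JOINER",
  "\u200D": "ZERO WIDTH JOINER",
  "\u2028": "LINE SEPARATOR",
  "\u2029": "PARAGRAPH SEPARATOR",
  "\uFEFF": "ZERO WIDTH NO-BREAK SPACE (BOM)",
};

function findInvisibleCharacters(text) {
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    const name = INVISIBLE[text[i]];
    if (name) {
      // Format the UTF-16 code unit as U+XXXX for the report.
      const hex = text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
      hits.push({ index: i, codePoint: "U+" + hex, name: name });
    }
  }
  return hits;
}

// The snippet from the question: a LINE SEPARATOR hiding between two tags.
console.log(findInvisibleCharacters("<html><p>\u2028</p></html>"));
```

Each hit carries the string index, so the position can be mapped back to a line and column in the file if needed.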
There is a ticket (Feature Request) to have an option to show invisible characters in the editor.
https://youtrack.jetbrains.com/issue/IDEA-115572 -- watch this ticket (star/vote/comment) to get notified of any progress. Update: implemented in version 2020.2.
Other related tickets:
https://youtrack.jetbrains.com/issue/IDEA-99899 (your case, as I understand)
https://youtrack.jetbrains.com/issue/IDEA-140567
https://youtrack.jetbrains.com/issue/WEB-13506
UPDATE 2021-11-10:
As of version 2020.2, the IDE can show invisible/special symbols right in the editor.

How to show a special Khmer character in UTF-8

I have a ឴ symbol that I can't display on a web page with a UTF-8 content type. The symbol has no width and can't be seen at all. How can I show it? Its code is &#6068; (U+17B4).
For example, at http://www.endmemo.com/unicode/khmer.php the characters 6068 and 6069 are not visible, but I need to show them, at least as a space.
Edited:
I'm using Arial or sans-serif, which I think are pretty common fonts. What people do is make text UNIQUE by inserting this symbol between normal characters. For example, a user writes "a(invisible Khmer symbol)b(invisible Khmer symbol)" and so on, and on the page I see only "ab" without any spaces. I tried putting the actual character in the HTML to see it, but with no luck. I thought a symbol that is not present in the font would be shown as a question mark or an empty square, but not in this case. The solution can't just be a simple replace in the text.
If your page is UTF-8 then it's better to use the actual character rather than an HTML entity.
Your requested character is not present in many fonts. You can try finding the latest version of Code2000 which appears to support it.
You can see fonts that support this particular character here:
http://www.fileformat.info/info/unicode/char/17b4/fontsupport.htm
If you can't find a font and you want to display an empty space instead you could replace it before showing it in the page or put it in a container. The page you linked uses a table cell to hold the character.
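The "replace it before showing it" option could look roughly like the sketch below. It assumes the two invisible characters from the linked table, U+17B4 (6068) and U+17B5 (6069), are the only ones of interest; the function name revealKhmerInherentVowels and the default replacement are hypothetical.

```javascript
// Minimal sketch: make the zero-width Khmer inherent vowels visible at display time
// by substituting a placeholder. The stored text is not modified.
// U+17B4 = 6068 (KHMER VOWEL INHERENT AQ), U+17B5 = 6069 (KHMER VOWEL INHERENT AA).
function revealKhmerInherentVowels(text, replacement = " ") {
  return text.replace(/[\u17B4\u17B5]/g, replacement);
}

console.log(JSON.stringify(revealKhmerInherentVowels("a\u17B4b\u17B4")));
```

Passing a visible marker such as "\u25A1" (a white square) instead of a space would make the insertions stand out when moderating user text.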

What is this INSANE space character??? (google chrome)

This is driving me absolutely, !&&%&$ insane... it defies everything that I can think of.
THIS character right here... " "
In between these quotes... open Google Chrome and inspect. You will see it's a &nbsp;... normal, right? Now right-click and actually view the source of this Stack Overflow page. It's a regular space... (also, the character I copied was an actual space).
I could understand if it's some kind of rich text editor or something, but in the raw html source is a regular space, so what gives?
Here's just with hitting the space key (which works fine)... " ".
You can even copy it and paste it everywhere and wreak havoc and make Chrome put &nbsp; everywhere, even though what's copied to your clipboard is just a SPACE.
I have these stupid characters showing up randomly all over my website and I have no idea where they come from, or WHY Google is converting a SPACE into a &nbsp;.
I have tried inspecting the actual character code and it's a regular space from all things I can find...
Every single method I try shows it as a NORMAL space... so what gives?
If I use Ruby and do " ".ord I get 32. If I do it with the broken space I also get 32.
Please help me, I'm losing my mind.
Edit: you can prove this... view source on this page and you will see two empty " " like normal. Now look in the console and only one of them will be a &nbsp;, yet the raw source is identical.
Image for people not using chrome (this is looking at this very post via chrome dev tools):
Here's the HTML of the same text you see when you view source... no nbsp to be found.
When I view this page's source in Internet Explorer, or download it directly from the server and view it in a text editor, the first space character in question is formatted like this in the actual HTML:
THIS character right here... "&nbsp;"
Notice the &nbsp; entity. That is Unicode codepoint U+00A0 NO-BREAK SPACE. Chrome is just being nice and re-formatting it as &nbsp; when inspecting the HTML. But make no mistake, it is a real non-breaking space, not Unicode codepoint U+0020 SPACE like you are expecting. U+00A0 is visually displayed the same as U+0020, but they are semantically different characters.
The second space character in question is formatted like this in the actual HTML:
<p>Here's just with hitting the space key (which works fine)... <code>" "</code>.</p>
So it is Unicode codepoint U+0020 and not U+00A0. Viewing the raw hex data of this page confirms that:
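For readers without a hex viewer at hand, the same byte-level check can be sketched in JavaScript. The helper name utf8Hex is hypothetical; it relies only on the standard TextEncoder, which always encodes to UTF-8.

```javascript
// Minimal sketch: show the UTF-8 bytes of a string as hex,
// so that look-alike characters can be told apart.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex(" "));      // regular space, U+0020
console.log(utf8Hex("\u00A0")); // no-break space, U+00A0
```

A regular space is the single byte 20, while a no-break space is the two-byte sequence c2 a0 in UTF-8.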
It turns out the two seemingly identical whitespace characters are not the same character.
Behold:
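var characters = ["a", "b", "c", "d", "\u00A0"]; // last entry is the copied space (U+00A0)
var typedSpace = " "; // typed with the space bar: U+0020
var copiedSpace = "\u00A0"; // copied from the page: U+00A0, escaped so it stays visible
alert("Typed: " + characters.indexOf(typedSpace)); // -1
alert("Copied: " + characters.indexOf(copiedSpace)); // 4
alert(typedSpace === copiedSpace); // false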
typedSpace.charCodeAt(0) returns 32, the classic space, whereas copiedSpace.charCodeAt(0) returns 160, the &#160; AKA &nbsp; character.
The difference between the two is that a whole bunch of &nbsp; repeated after one another will hold their ground and create additional space between them, whereas a whole bunch of repeated regular space characters will squish together into one space.
For instance:
A&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B results in: A       B
A       B (seven regular spaces) results in: A B
To replace the &nbsp; character with a regular space character in a string, try this:
.replace(new RegExp(String.fromCharCode(160),"g")," ");
To the people in the future like myself that had to debug this from a high level all the way down to the character codes, I salute you.
Don't get yer knickers in a knot. It's one of those special html characters that we old-school love because we was tort rite.
For many of us, we were taught that a sentence started with a capital letter and ended with a full-stop. But the next sentence is separated from this by TWO spaces.
Good-ol'-HTML doesn't like space(s). If you enter a string of words with 5 spaces between them (using an unintelligent editor like MS Notepad), then html shows it with single spaces.
SO, to get it looking like we old-farts like, we end a sentence with '.&NbSp; Next' This puts two spaces after the full-stop, and looks like '.  Next' rather than '. Next'.
Next point is that the real space (32) works as a linebreak, so that's good.
EXCEPT for we old-farts, who HATE to see our name split across a linebreak. That annoys us NO-END.
But, of course, that's where &NbSp; comes in handy again. If you enter 'John&NbSp;Brown', then the html thinks that's a single word, and it displays it just rite for we oldies.
How do these &NbSp; thingies get there? Well, good old Word (and I suspect many intelligent editors) see two spaces and output them as a non-breaking space followed by a normal space.
And when in Word, you can insert a non-breaking space between John and Brown by the key sequence alt-ctrl-space (sorry, you apple-users)
Lesson-over (with the exception that the term &NbSp; needs to be all lowercase - THIS viewer was even converting it)
It is a non-breaking space. &nbsp; is the entity used to represent a non-breaking space. It is essentially a standard space, the primary difference being that a browser should not break (or wrap) a line of text at the point that this &nbsp; occupies.
Most likely the character is being inserted by your HTML Editor. Could you give a more specific example in context?
This is not actually an answer to the question but instead a tool that can be used to detect this special white space in the html of the pages of a website so we can proceed to locate and remove it.
What the tool basically does is:
Fetches the content of a URL
Looks for occurrences of chr(194).chr(160) in the HTML contents
Replaces and highlights the occurrences with something more visible
This way you can actually know where the spaces are and edit your page properly to remove them.
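The detect-and-highlight steps can be sketched in a few lines. This is not the tool's own code: the function name highlightNbsp is hypothetical, the {NBSP} marker is borrowed from the hstring parameter in the example URL further down, and the fetching step is left out. Note that chr(194).chr(160) is the UTF-8 byte pair for U+00A0, so on an already decoded JavaScript string the match is against the single character "\u00A0".

```javascript
// Minimal sketch: count non-breaking spaces in an HTML string
// and replace each one with a visible marker.
function highlightNbsp(html, marker = "{NBSP}") {
  let count = 0;
  const highlighted = html.replace(/\u00A0/g, () => {
    count += 1;
    return marker;
  });
  return { count: count, highlighted: highlighted };
}

const result = highlightNbsp("<p>one\u00A0two three\u00A0four</p>");
console.log(result.count, result.highlighted);
```

Only the literal character is matched here; &nbsp; entities written out in the source would need a separate search.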
The online version of the tool can be found here:
http://tools.heavydots.com/nbsp-space-char-detect/
A working example can be seen with the URL of this question, which contains one occurrence:
http://tools.heavydots.com/nbsp-space-char-detect/?url=http%3A%2F%2Fstackoverflow.com%2Fquestions%2F26962323%2Fwhat-is-this-insane-space-character-google-chrome&highlight=1&hstring=%7BNBSP%7D
There's a Github repo available if someone wants the code to run it locally:
https://github.com/HeavyDots/nbsp-space-char-detect
Hope someone finds it useful, for any feedback there's a comments section on the tool's page.
Updated 5th of January 2017
At our company blog we just wrote a funny post about this annoying white space. You're invited to drop by and read it! :-)
http://heavydots.com/blog/when-the-white-space-became-a-beast
As the previous answers have mentioned, it's a non-breaking space (nbsp). On Macs, this character gets inserted when you accidentally press Alt + Space (most of the time, this happens when entering code that requires Alt for special characters, e.g. [ on a German keyboard layout).
To remap this key combination to a plain ol' SPACE character, you can change your default keybinding as suggested on Apple SE
For whitespace, press "Alt+0160", which also produces a &nbsp; character.

Detect Multibyte and Chinese Characters in rtf markup

I'm trying to parse an RTF-formatted message (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out).
Take the RTF code for the string a基bমূcΟιd pasted straight into Wordpad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you've not had much to do with RTF. So here's the bit I'm looking at
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice that 基 (U+57FA) is \'8a\'ee, while মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But Οι, which is two separate characters, Ο and ι, is \'cf\'e9.
Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'8a\'ee, or two characters, such as Ο and ι = \'cf\'e9?
I was thinking that maybe \lang was it, but that isn't the case at all, because \lang does not change from when it's first set. I am already accounting for the different code pages from the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two escapes next to each other as one double-byte character or not.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?
\'xx escapes represent bytes and should be interpreted using the font's \fcharset encoding (or potentially \cchs, falling back to \ansicpg if neither is present).
You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only a part of a multi-byte character; typically you will be consuming each section of text as a unit before converting that byte string into a Unicode string using whatever library or OS interface you have available, to avoid having to write byte-by-byte parsers for every code page supported by RTF.
\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (? is the fallback character for when the receiver can't cope with the Unicode.)
So:
The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).
The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).
The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
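The three cases above can be reproduced with a small decoder sketch. It is an illustration under several assumptions, not a full RTF parser: the helper decodeRtfRun is hypothetical, it handles only one run of text in a single font with the control words already stripped, it assumes \uc1 (one fallback character after each \u escape), and the caller is expected to map \fcharset to an encoding label. It also needs a runtime whose TextDecoder supports legacy encodings (browsers, or Node.js built with full ICU).

```javascript
// Minimal sketch: decode one single-font run of RTF text.
// `encoding` is the label for the font's code page, chosen by the caller,
// e.g. "shift_jis" for \fcharset128 or "windows-1253" for \fcharset161.
function decodeRtfRun(run, encoding) {
  const decoder = new TextDecoder(encoding);
  let out = "";
  let bytes = []; // pending \'xx bytes, decoded together as one unit

  const flush = () => {
    if (bytes.length > 0) {
      out += decoder.decode(new Uint8Array(bytes));
      bytes = [];
    }
  };

  // Matches a \'xx byte escape, a \uN escape plus its one fallback character,
  // or any other single character.
  const token = /\\'([0-9a-fA-F]{2})|\\u(-?\d+) ?([\s\S])?|([\s\S])/g;
  let m;
  while ((m = token.exec(run)) !== null) {
    if (m[1] !== undefined) {
      bytes.push(parseInt(m[1], 16));
    } else if (m[2] !== undefined) {
      flush();
      // \uN is a signed 16-bit decimal; masking recovers the UTF-16 code unit.
      out += String.fromCharCode(parseInt(m[2], 10) & 0xffff);
    } else {
      flush();
      out += m[4];
    }
  }
  flush();
  return out;
}

console.log(decodeRtfRun("\\'8a\\'ee", "shift_jis"));         // one character
console.log(decodeRtfRun("\\'cf\\'e9", "windows-1253"));      // two characters
console.log(decodeRtfRun("\\u2478?\\u2498?", "windows-1252")); // two code units
```

Collecting the bytes first and decoding them as a unit is what lets the code page decide whether two \'xx escapes form one character or two, so no per-encoding lead-byte table is needed.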
RTF has tags for specifying the codepage/encoding used to encode Unicode characters. The actual hex codes for the characters are the byte octets used by the specified encoding. In this case, \ansicpg1252 for Ansi codepage 1252.

SIFR and HTML entities

I have a webpage with an 'o' with an acute. How would I get this entity to display in sIFR3?
I have output the HTML as both &oacute; and &#243;.
Many thanks,
C
Quoting the doc:
Special characters
If you want to use special characters in your text; characters like » or special language characters that are not in font by default you must add it in Flash before publishing it to SWF. Open the hidden text box. Select it and click the Character button in the Properties palette. Add the special characters in the textbox at the bottom of the dialog box.