Vim editor showing some blue characters but cannot reach it - json

I was editing some data, and when I checked the JSON file it showed some blue characters like "<202b>202>" and "b>". I can see them in Vim, but I cannot change them or even search for them.
When I print the text in Python, Python doesn't show them either and just prints the normal text. Are those characters important? How can I get rid of them? Thank you.

<202b> is how Vim represents the character U+202B, the presence of which kind of makes sense because your data appears to be a mix of left-to-right and right-to-left scripts.
You can't search it with /<202b> because the characters <202b> are not actually in the text, it is just how Vim displays the character U+202B when it encounters it.
You can:
insert that character with <C-v>u202b, see :help i_ctrl-v_digit,
search for it with /\%u202b, see :help /\%u.
As for getting rid of them… it depends on whether they are here accidentally or purposely.
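Since Python prints the text without showing these characters, a small sketch like the following can make them visible and, if they turn out to be accidental, strip them. It assumes the unwanted characters are the Unicode bidirectional formatting controls listed in BIDI_CONTROLS; that selection and the function names are made up for illustration.

```python
import unicodedata

# Bidirectional formatting controls; which ones count as unwanted is an assumption.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """Return (index, 'U+XXXX', Unicode name) for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

def strip_bidi_controls(text):
    """Return text with all bidi controls removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

# Hypothetical sample: Latin text with an embedded right-to-left word.
sample = "abc \u202b\u05e9\u05dc\u05d5\u05dd\u202c def"
print(find_bidi_controls(sample))
print(repr(strip_bidi_controls(sample)))
```

Printing repr(text) instead of the text itself is another quick check, because repr() escapes these format characters as \u202b and so on.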

Related

How to find all this kind of UNICODE � in multiple html pages (or, with notepad++)? [duplicate]

I have a bizarre problem: Somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it I identified it as &#65279; (U+FEFF), or 'Zero width no-break space'. It shows up as a non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it, no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
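You don't see the character in the editor because text editors that understand it hide it. U+FEFF is the so-called byte-order mark (BOM); its byte-swapped form, FFFE, tells the reader that the file uses the opposite byte order. It is defined by the Unicode standard (and written by many Windows tools) to tell, in a Unicode file, in which order multi-byte characters are stored.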
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use some kind of truncation tool like, e.g., a hex editor that allows you to see how the file really looks.
On googling, it seems, that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
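It's a byte-order mark. Under Mac OS X: open a terminal window, go to your sources and type:
grep -rn $'\xEF\xBB\xBF' *
$'\xEF\xBB\xBF' is the UTF-8 encoding of U+FEFF. Note that bash reads $'\xFEFF' as the single byte FE followed by the literal text "FF", which does not match a BOM.
It will show you the filenames and line numbers containing a BOM.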
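A rough Python sketch of the same search, for cases where grep is not at hand. It only looks for a UTF-8 BOM (bytes EF BB BF) at the very start of each file; the function names are hypothetical, and strip_utf8_bom rewrites the file in place, so try it on a copy first.

```python
import codecs
from pathlib import Path

def files_with_utf8_bom(root):
    """Yield paths of regular files under root whose first three bytes are EF BB BF."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                if fh.read(3) == codecs.BOM_UTF8:
                    yield path

def strip_utf8_bom(path):
    """Rewrite the file without its leading UTF-8 BOM; return True if one was removed."""
    data = Path(path).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        Path(path).write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Calling list(files_with_utf8_bom(".")) from the source directory lists the affected files, and strip_utf8_bom(path) can then be applied to each of them.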
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" on Wikipedia.
I know it is a little late to answer this question, but I am adding how to change the encoding in Visual Studio; I hope it will be helpful for someone reading this later:
Go to File -> Save (your filename) as...
And in File Explorer window, select small arrow next to the Save button -> click Save with Encoding...
Click Yes (on Do you want to replace existing file dialog)
And finally select e.g. Unicode (UTF-8 without signature) - that removes BOM

Line ending character LFs are automatically changed to CRLFs in HTML textarea

I noticed that all LFs are automatically changed to CRLFs if I put them into a HTML textarea.
■ Questions:
where and what causes this behavior?
is this because of the Windows operating system, i.e. will it not happen on a different operating system such as macOS? (I have only seen this on a Windows machine; not yet tested on a Mac.)
or is this something which depends on Browser? (I have seen this behavior on Chrome, IE, and Firefox. Not yet tested on Safari...)
or is this something only happens on my editor? (i.e I am using sakura editor)
If possible, how to preserve the LF so that it does not get changed into CRLF?
■ Steps to reproduce this:
find a textarea where you can input, for example the following w3school website.
https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_textarea
prepare a text of at least 2 lines with some LFs, using an editor that can show the line ending characters (so that you can make sure you have some LFs).
※ I am using Sakura editor as an example.
copy and paste the text prepared in step 2 to the textarea.
once text is copied into the textarea, this time, copy the entire content of the textarea.
paste the content of the textarea back to your editor.
the line ending characters all become CRLFs.
■ P.S.
Please see the screenshots for details
(left side is the original text with 3 LFs,
right side is the content copied back from the textarea, where all LFs have become CRLFs)
「↓」indicates LF
「⏎」indicates CRLF
Thanks
I think I found the answer myself, or at least some helpful information. I will leave a record here in case other people are looking for the answer to a similar question.
where and what causes this behavior?
For historical reasons, the element’s value is normalized in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. The API value is the value used in the value IDL attribute. It is normalized so that line breaks use U+000A LINE FEED (LF) characters. Finally, there is the value, as used in form submission and other processing models in this specification. It is normalized so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, and in addition, if necessary given the element’s wrap attribute, additional line breaks are inserted to wrap the text at the given width.
for more information please read:
https://www.w3.org/TR/html5/forms.html#the-textarea-element
If possible, how to preserve the LF so that it does not get changed into CRLF?
I guess there are a lot of ways. Using JavaScript to replace all \r\n with \n before submitting the form would be a client-side solution. Or, if it does not have to be handled on the client side, which is exactly my case, do the replacement on the server side and force all line endings to LF.
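As a minimal sketch of the server-side variant, assuming the submitted textarea value has already arrived as a string (the function and variable names are made up for illustration):

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so a pair does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")

submitted = "line one\r\nline two\r\nline three"
print(repr(normalize_newlines(submitted)))  # 'line one\nline two\nline three'
```

Handling CRLF before lone CR matters; the other order would double every line break.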

How to display hidden characters in PhpStorm, especially line separators

I got some special characters in my codes, take a look at:

 a




It's just shown in frontend with normal characters like an "a".
Now the same characters without any normal characters:
Characters starts here





Characters ends here
Ok, it looks like this editor will not save the characters on their own, so try it with a snippet:
<html><p>
 </p></html>
The problem is that in PhpStorm these characters won't be shown, not even with
"Settings - Editor - General - Appearance - Show whitespaces" or
"Settings - Editor - General - Appearance - Show method separators"
Only find and replace (Ctrl+F, Ctrl+R) will find these characters.
I think this character is an "only-mac-char" :) I'm working with Windows, and I can't test it on mac.
EDIT: Sorry i could identify it as "U+2028 : LINE SEPARATOR"
http://www.babelstone.co.uk/Unicode/whatisit.html
The big problem is that PhpStorm doesn't show anything in the code, as if there were no character. But when moving with the arrow keys, the cursor takes 2 steps at this position: between the 2 tags it looks like "><" but it is actually "> <".
Based on your update it is now clear what character you have in mind:
Sorry I could identify it as "U+2028 : LINE SEPARATOR" http://www.babelstone.co.uk/Unicode/whatisit.html
Install and use the Zero Width Characters locator 2 plugin: it can detect quite a few invisible characters (e.g. the UTF-8 BOM sequence, non-breaking space, Unicode line separator (your case), etc.).
It is implemented as a separate inspection with highest (Error) severity so will be easy to spot or check the whole folder/project just for these issues.
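Outside the IDE, a similar scan can be sketched in a few lines of Python. This is not how the plugin works internally; it is an illustration that reports line and column for a handful of invisible characters, and both the SUSPECTS selection and the function name are assumptions.

```python
import unicodedata

# Characters to flag; this selection is an assumption, extend it as needed.
SUSPECTS = {
    "\u00a0",  # NO-BREAK SPACE
    "\u200b",  # ZERO WIDTH SPACE
    "\u2028",  # LINE SEPARATOR
    "\u2029",  # PARAGRAPH SEPARATOR
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def locate_invisibles(text):
    """Return 1-based (line, column, 'U+XXXX', name) tuples for each suspect character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on U+2028 and hide it.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((line_no, col_no, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits

print(locate_invisibles("<html><p>\u2028</p></html>"))
# [(1, 10, 'U+2028', 'LINE SEPARATOR')]
```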
There is a ticket (Feature Request) to have an option to show invisible characters in the editor.
https://youtrack.jetbrains.com/issue/IDEA-115572 -- watch this ticket (star/vote/comment) to get notified on any progress. implemented in 2020.2 version.
Other related tickets:
https://youtrack.jetbrains.com/issue/IDEA-99899 (your case, as I understand)
https://youtrack.jetbrains.com/issue/IDEA-140567
https://youtrack.jetbrains.com/issue/WEB-13506
UPDATE 2021-11-10:
As of 2020.2 version the IDE can show invisible/special symbols right in the editor.

What is this INSANE space character??? (google chrome)

This is driving me absolutely, !&&%&$ insane... it defies everything that I can think of.
THIS character right here... " "
In between these quotes... open google chrome and inspect. You will see it's a &nbsp;... normal right? Now right click and actually view the source of this stack overflow page. It's a regular space... (also, the character I copied was an actual space).
I could understand if it's some kind of rich text editor or something, but in the raw html source is a regular space, so what gives?
Here's just with hitting the space key (which works fine)... " ".
You can even copy it and paste it everywhere and wreak havoc and make chrome put &nbsp; everywhere. Even though what's copied in your clipboard is just a SPACE.
I have these stupid characters show up everywhere randomly in my website and I have no idea where they come from, or WHY google is converting a SPACE into a &nbsp;
I have tried inspecting the actual character code and it's a regular space from all things I can find...
Every single method I try shows it as a NORMAL space... so what gives?
If i use ruby and do " ".ord I get 32. If i do it with the broken space I also get 32.
Please help me im losing my mind.
edit: you can prove this... view source on this page and you will see two empty " " like normal. Now look in console and only the one will be a &nbsp;, yet the raw source is identical.
Image for people not using chrome (this is looking at this very post via chrome dev tools):
Here's the HTML of the same text you see when you view source... no nbsp to be found.
When I view this page's source in Internet Explorer, or download it directly from the server and view it in a text editor, the first space character in question is formatted like this in the actual HTML:
THIS character right here... "&nbsp;"
Notice the &nbsp; entity. That is Unicode codepoint U+00A0 NO-BREAK SPACE. Chrome is just being nice and re-formatting it as &nbsp; when inspecting the HTML. But make no mistake, it is a real non-breaking space, not Unicode codepoint U+0020 SPACE like you are expecting. U+00A0 is visually displayed the same as U+0020, but they are semantically different characters.
The second space character in question is formatted like this in the actual HTML:
<p>Here's just with hitting the space key (which works fine)... <code>" "</code>.</p>
So it is Unicode codepoint U+0020 and not U+00A0. Viewing the raw hex data of this page confirms that:
It turns out the two seemingly identical whitespace characters are not the same character.
Behold:
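var characters = ["a", "b", "c", "d", "\u00A0"]; // last element is a non-breaking space (U+00A0)
var typedSpace = " "; // regular space (U+0020)
var copiedSpace = "\u00A0"; // the copied "space", written as an escape so it survives copy and paste
alert("Typed: " + characters.indexOf(typedSpace)); // -1
alert("Copied: " + characters.indexOf(copiedSpace)); // 4
alert(typedSpace === copiedSpace); // false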
typedSpace.charCodeAt(0) returns 32, the classic space, whereas copiedSpace.charCodeAt(0) returns 160, the &#160; AKA &nbsp; character.
The difference between the two is that a whole bunch of &nbsp; repeated after one another will hold their ground and create additional space between them, whereas a whole bunch of repeated regular space characters will squish together into one space.
For instance:
A&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B results in: A       B
A       B results in: A B
To replace the &nbsp; character with a regular space character in a string, try this:
.replace(new RegExp(String.fromCharCode(160),"g")," ");
To the people in the future like myself that had to debug this from a high level all the way down to the character codes, I salute you.
Don't get yer knickers in a knot. It's one of those special html characters that we old-school love because we was tort rite.
For many of us, we were taught that a sentence started with a capital letter and ended with a full-stop. But the next sentence is separated from this by TWO spaces.
Good-ol'-HTML doesn't like space(s). If you enter a string of words with 5 spaces between them (using an unintelligent editor like MS Notepad), then html shows it with single spaces.
SO, to get it looking like we old-farts like, we end a sentence with '.&NbSp; Next' This puts two spaces after the full-stop, and looks like '.  Next' rather than '. Next'.
Next point is that the real space (32) works as a linebreak, so that's good.
EXCEPT for we old-farts, who HATE to see our name split across a linebreak. That annoys us NO-END.
But, of course, that's where &NbSp; comes in handy again. If you enter 'John&NbSp;Brown', then the html thinks that's a single word, and it displays it just rite for we oldies.
How do these &NbSp; thingies get there? Well, good old Word (and I suspect many intelligent editors) see two spaces and output them as a non-breaking space followed by a normal space.
And when in Word, you can insert a non-breaking space between John and Brown by the key sequence alt-ctrl-space (sorry, you apple-users)
Lesson-over (with the exception that the term &NbSp; needs to be all lowercase - THIS viewer was even converting it)
It is a non-breaking space. &nbsp; is the entity used to represent a non-breaking space. It is essentially a standard space, the primary difference being that a browser should not break (or wrap) a line of text at the point that this &nbsp; occupies.
Most likely the character is being inserted by your HTML Editor. Could you give a more specific example in context?
This is not actually an answer to the question but instead a tool that can be used to detect this special white space in the html of the pages of a website so we can proceed to locate and remove it.
What the tool basically does is:
Fetches the content of a URL
Looks for occurrences of chr(194).chr(160) in the HTML contents
Replaces and highlights the occurrences with something more visible
This way you can actually know where the spaces are and edit your page properly to remove them.
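The detection step can be sketched in Python as follows. This is not the tool's actual code: the fetch is left out and the HTML is assumed to be available as UTF-8 bytes. The byte pair C2 A0 is the UTF-8 encoding of U+00A0, which is what chr(194).chr(160) spells in PHP; the {NBSP} marker is borrowed from the hstring parameter of the tool's URL, and the function name is made up.

```python
NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 NO-BREAK SPACE

def highlight_nbsp(html_bytes, marker=b"{NBSP}"):
    """Return (count, highlighted) with every NBSP byte pair replaced by marker."""
    count = html_bytes.count(NBSP_UTF8)
    return count, html_bytes.replace(NBSP_UTF8, marker)

# Hypothetical page fragment containing one non-breaking space between the quotes.
page = 'THIS character right here... "\u00a0"'.encode("utf-8")
count, shown = highlight_nbsp(page)
print(count, shown.decode("utf-8"))
```

Searching the raw bytes is safe for valid UTF-8 input, because C2 is always the lead byte of a two-byte sequence and so C2 A0 can only be U+00A0.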
The online version of the tool can be found here:
http://tools.heavydots.com/nbsp-space-char-detect/
A working example can be seen with the url of this question that contains one occurrence:
http://tools.heavydots.com/nbsp-space-char-detect/?url=http%3A%2F%2Fstackoverflow.com%2Fquestions%2F26962323%2Fwhat-is-this-insane-space-character-google-chrome&highlight=1&hstring=%7BNBSP%7D
There's a Github repo available if someone wants the code to run it locally:
https://github.com/HeavyDots/nbsp-space-char-detect
Hope someone finds it useful, for any feedback there's a comments section on the tool's page.
Updated 5th of January 2017
At our company blog we just wrote a funny post about this annoying white space. You're invited to drop by and read it! :-)
http://heavydots.com/blog/when-the-white-space-became-a-beast
As the previous answers have mentioned, it's a non-breaking space (nbsp). On Macs, this character gets inserted when you accidentally press Alt + Space (most of the time, this happens when entering code that requires Alt for special characters, e.g. [ on a German keyboard layout).
To remap this key combination to a plain ol' SPACE character, you can change your default keybinding as suggested on Apple SE
For whitespace, press "Alt+0160", which also produces a &nbsp; character.

What type of collation to use in a table

I have what is a simple problem that hopefully has a simple solution:
I have a site written in PHP and HTML, using a Linux server with MySQL.
It has a form where users fill in some personal info, including a textarea into which they are meant to copy and paste a test CV.
I have also set up a back end for my client where she can query the database to see who registered and retrieve their info.
My problem is that when I query and echo the content of the table row that contains the CV (a lot of text), the line breaks are all gone - everything is printed in one line.
Does someone know if I can solve this by using the right kind of collation/character encoding for that specific row that contains the users' CVs? I am hoping that a collation exists that saves and maintains line breaks.
Collation has nothing to do with it - collations and charsets won't touch your newlines at all. If you want to see it, look at the page source of the echo'd text.
HTML, however, treats line breaks like all other whitespace under normal circumstances, so they won't be visible when you echo them to a browser. You shouldn't be outputting plain text as HTML anyway, because they're not the same. You must convert the plain text to HTML first; a simple method is to call htmlspecialchars() and nl2br() on the text (in that order, otherwise htmlspecialchars will eat your newly-created br tags and turn them into &lt;br/&gt;). Failing to do so will not only create undesired output, it can also be a major security risk (XSS).
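The escape-first, break-second order can be illustrated with a small Python sketch. It is an analogue of the PHP calls, not PHP itself: html.escape() plays the role of htmlspecialchars(), and the newline replacement stands in for nl2br(). The function name is hypothetical.

```python
import html

def text_to_html(text):
    """Escape HTML special characters first, then turn line breaks into <br> tags."""
    escaped = html.escape(text)  # & < > " ' become entities
    # Normalize CRLF and lone CR so each line break yields exactly one <br>.
    escaped = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return escaped.replace("\n", "<br>\n")

print(text_to_html("Skills: C & C++\r\n<script>alert(1)</script>"))
```

Doing it the other way round would escape the freshly inserted tags, which is the problem described above.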
Use nl2br($text) to add HTML line breaks.
I don't think collation is related to this. Line breaks from the textarea arrive as \n or \r characters. If you are not doing anything "weird", those line breaks should be stored in the DB.
I think your problem is when you echo the content of the table: since the browser doesn't display \n and \r as new lines, you have to either replace them with <br/> elements or wrap each paragraph in <p></p>.
You can use nl2br() for that.
or how about wrapping the text in a <pre> </pre>
see: http://www.w3schools.com/tags/tag_pre.asp