Properly display utf-8 HTML characters in non-multibyte Vim - html

I've been attempting to open .epub files in vim for reading (yes it's silly, let's ignore that for now) and I'm having trouble with how the internal html of epubs displays characters such as ' and " among other things.
Vim displays ' as â~#~Y while opening the file with less gives me <E2><80><99>. I'm not sure how vim deals with this (it seems to treat ~# and ~Y as single characters) and as such I'm not sure how to go about replacing the special HTML characters with their utf-8 equivalent.
Is there a encoding setting that will display this properly? Or a way to manually input these characters such that I could create a search and replace macro?
Thanks

It looks like Vim doesn't properly detect the UTF-8 encoding; you can check with
:setlocal fileencoding?
and force UTF-8 with
:edit ++enc=utf-8 file.epub
(or tweak your 'fileencodings' option to have it automatically detected).

Related

How to find all this kind of UNICODE � in multiple html pages (or, with notepad++)? [duplicate]

I have a bizarre problem: Somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it I identified it as  or 'Zero width no-break space'. It shows up as non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it, no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
You don't get the character in the editor, because you can't find it in text editors. #FEFF or #FFFE are so-called byte-order marks. They are a Microsoft invention to tell in a Unicode file, in which order multi-byte characters are stored.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use some kind of truncation tool like, e.g., a hex editor that allows you to see how the file really looks.
On googling, it seems, that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
It's a byte-order mark. Under Mac OS X: open terminal window, go to your sources and type:
grep -rn $'\xFEFF' *
It will show you the line numbers and filenames containing BOM.
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" in WikiPedia.
I know it is a little late to answer to this question, but I am adding how to change encoding in Visual Studio, hope it will be helpfull for someone who will be reading this sometime:
Go to File -> Save (your filename) as...
And in File Explorer window, select small arrow next to the Save button -> click Save with Encoding...
Click Yes (on Do you want to replace existing file dialog)
And finally select e.g. Unicode (UTF-8 without signature) - that removes BOM

How to insert these non-ascii characters as html content?

Any non-ascii representation is written as &#xYYYY.
As per below code,
Editor is Sublime Text.
How do I represent these emoticons in html?
I found this Sublime Text 3 Plugin to insert emojis into the editor.
https://packagecontrol.io/packages/Emoji
Is this what you are looking for?
As long as you save the file you're editing using the UTF-8 character encoding, and make sure it is delivered with a suitable content type header, such as Content-Type: text/html; charset=utf-8 you don't need to do anything at all.
Another option, as others have noted, is adding them as HTML entities instead. In order to do that you would need to know their character codes. How to do that differs between different environments, but there are multiple questions on SO about that.
Here's how you could do it in Python (Python 3, you'd need to use u"" strings in earlier versions):
chars = [
"😀",
"😐",
"😳",
"😫",
"💩"
]
for char in chars:
print("{}: &#{:02x};".format(char, ord(char)))
You can represent Emoticons in html as its unicode symbol formatted as 򪪪 (some unicode list)
<div>😁</div>
I am using sublime text as editor and when you see it on browser it should look like these:

utf-8 / utf-16 conversion

When I design a html page in Dreamweaver CS6 I use its validation tool (it sends the code to w3c) and I get no errors. However, when I validate the same page in UltraEdit 21 (it uses HTML Tidy) I get the warning:
"Specified input encoding (utf-8) does not match actual input encoding (utf-16)"
The page is set as html5 (with <!doctype html>), as utf-8 (with <meta charset="utf-8">) and contains greek text.
Well, the question is:
Does that problem affect the appearance of the page? I mean, when I publish it, will a user in China, Germany, or ...Tierra del Fuego see the greek text?
If yes, the rest are less important, but I'll ask them:
What makes HTML Tidy to define the document as utf-16? Is there a character, word or visible string of any kind that I can remove/delete to correct the problem?
If I use <meta charset="utf-16"> will browsers parse the code correctly (ending to greek text for the global user)?
The actual file encoding will be set in Dreamweaver properties for the file.
Dreamweaver Help / Set title and encoding properties for a page:
The Title/Encoding Page Properties options let you specify the document encoding type that is specific to the language used to author your web pages as well as specify which Unicode Normalization Form to use with that encoding type.
Select Modify > Page Properties, or click the Page Properties button in the text Property inspector.
Choose the Title/Encoding category and set the options.
...
Encoding
Specifies the encoding used for characters in the document.
If you select Unicode (UTF‑8) as the document encoding, entity encoding is not necessary because UTF‑8 can safely represent all characters. If you select another document encoding, entity encoding may be necessary to represent certain characters. For more information on character entities, see www.w3.org/TR/REC-html40/sgml/entities.html.
...
Include Unicode Signature (BOM)
Includes a Byte Order Mark (BOM) in the document. A BOM is 2 to 4 bytes at the beginning of a text file that identifies a file as Unicode, and if so, the byte order of the following bytes. Because UTF‑8 has no byte order, adding a UTF‑8 BOM is optional. For UTF‑16 and UTF‑32, it is required.
Choose UTF-8 without BOM.
UltraEdit automatically detects encoding of a file on opening and displays it at bottom in status bar. See in UltraEdit Advanced - Configuration - File Handling - Unicode/UTF-8 Detection and press button Help for some more details.
UTF-16 is displayed for a file encoded in UTF-16 Little Endian with or without BOM on using standard status bar since UE v19.00. Clicking on this list box in status bar and selecting Unicode - UTF-8 results in converting the file from UTF-16 LE to UTF-8 which then matches with the character set declaration in head of your HTML5 file.
When using basic status bar in UE v19.00 or any later version or using any UltraEdit version prior v19.00, the status bar field right to the field with line, column and clipboard number starts with U- for a file with UTF-16 LE encoding.
The UltraEdit help page about the Status Bar contains more information about information shown in standard and basic status bar in UltraEdit.
Conversion to UTF-8 can be done with UltraEdit also with command UNICODE/UTF-8 to UTF-8 (Unicode Editing) in submenu Conversions in menu File.
There are 2 configuration settings at Advanced - Configuration - File Handling - Save which define saving a UTF-8 encoded file with or without byte order mark (BOM):
Write UTF-8 BOM header to all UTF-8 files when saved
Write UTF-8 BOM on new files created within this program (if above is not set)
As UTF-8 encoded HTML files should be always without BOM, it is better to have both UTF-8 BOM settings unchecked when using UltraEdit mainly for editing HTML files.
Another possibility to convert a file with UltraEdit is using command Save As from menu File and use appropriate Encoding / Format setting. UTF-8 in Save As dialog means saving the file as UTF-8 encoded file with BOM and UTF-8 - NO BOM without BOM independent on the two configuration settings for standard Save.
For converting all files in a single folder, a folder tree, opened in UltraEdit, etc. to UTF-8 using UltraEdit, there is an UltraEdit scripting solution, see How to convert all files in a folder to UTF-8?
Unfortunately UE v21.30.0.1024 still does not recognize the short character set declaration <meta charset="utf-8"> as defined in HTML5 standard. See Short utf-8 charset declaration in HTML5 header with details about this limitation and how it can be worked around. This limitation does not matter if within first 64 KB at least one UTF-8 encoded character is found as it will be the case for your HTML5 files with Greek text.
HTML Tidy installed with UltraEdit v21.30.0.1024 is of version 25 March 2009. I'm not sure if HTML Tidy really supports short charset declaration of HTML5. But it looks so because otherwise you would not see the warning on validating the HTML5 file with HTML Tidy.
It might be useful for you to read UltraEdit power tip Unicode text and Unicode files in UltraEdit/UEStudio as it looks like you do not really know what encoding and character set really means and why it is important for applications that the declaration in the HTML5 matches with really used encoding.
I answer your questions now after all those general UltraEdit stuff.
Does that problem affect the appearance of the page?
Although the file contains the declaration that file contents is encoded with UTF-8, but is in real encoded with UTF-16 Little Endian, the browsers display the contents correct. UTF-16 detection is very easy, especially with BOM present and therefore browsers ignore wrong declaration and interpret the bytes of the HTML file from beginning right as UTF-16 encoded text file.
However, it would be much better to convert the UTF-16 encoded HTML files to UTF-8 without BOM. UTF-8 without BOM is most commonly used for HTML files worldwide and then the character set declaration in head of your HTML file would also match with really used encoding.
What makes HTML Tidy to define the document as utf-16?
The really used encoding of your HTML file is UTF-16 Little Endian and UltraEdit, HTML Tidy and the browsers detect that already after reading in the first 2 bytes of the text file - the byte order mark. That's the reason why HTML Tidy suggests to declare the encoding in head of HTML file correct as utf-16 as the file is really encoded with.
If I use <meta charset="utf-16"> will browsers parse the code correctly?
In case of keeping the file encoded in UTF-16 LE (always 2 bytes per character), it would be better to declare the character set right with <meta charset="utf-16">. But no Unicode aware text editor or browser has a problem to automatically detect UTF-16 Little Endian encoding with byte order mark.
The character set declaration becomes very important mainly for UTF-8 encoded files (1, 2, 3 or even 4 bytes per character) or files with single-byte coded characters using a code page like Windows-1252 / ISO 8859-1 (Latin 1) or Windows-1253 / ISO 8859-7 (Latin/Greek).

How to remove  extra character in my Website?

I see this character in Firebug .
I don't know why this is happening, there's no such character in my code. For Firefox it's OK, but in IE everything breaks. I can't even search for this character in Google.
I saved my file with utf-8 encoding without bom.
This occurs when you include files that contains the char  (ZERO WIDTH NO-BREAK SPACE).
this character is a BOM and you can only see BOMs with low level text editors (such as DOS edit).
you can debug it seeing the Element inspector, and commenting the requires or includes to see what file contains the BOM.
then , you can copy the content of the file, change the Encoding to "encode in utf8", save and you can include it again. (notepad++) can save it
success.
 is probably a UTF-8 byte order mark in one of the files Joomla uses or includes (it's only a zero-width no break space if it occurs inside a file).
You could try this tool to check for byte order marks - they're a bit of a pain to search for by hand.
just save again your php file with notepad++
I also had the same problem.
I opened the file with notepad and saved it as UTF8.
problem solved.

HTML encoding issues - "Â" character showing up instead of " "

I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.
The process works like this:
Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
Replace the tokens with real data
Tidy the HTML with a simple regex function that property formats HTML tag attribute values (ensures quotation marks, etc, since ActivePDF's rendering engine hates anything but single quotes around attribute values)
Send off the HTML to a web service that creates the PDF.
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (FireFox). ActivePDF pukes on these non-UTF8 characters.
My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it turns it all into gobbledegook doesn't change anything.
Private Shared Function ConvertToUTF8(ByVal html As String) As String
Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
Dim source As Byte() = isoEncoding.GetBytes(html)
Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function
Any ideas?
EDIT:
I'm getting by with this for now, though it hardly seems like a good solution:
Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character
That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as " ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:
for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
for HTML5: <meta charset="utf-8">
If you've done that, then any remaining problem is ActivePDF's fault.
If any one had the same problem as me and the charset was already correct, simply do this:
Copy all the code inside the .html file.
Open notepad (or any basic text editor) and paste the code.
Go "File -> Save As"
Enter you file name "example.html" (Select "Save as type: All Files (.)")
Select Encoding as UTF-8
Hit Save and you can now delete your old .html file and the encoding should be fixed
Problem:
Even I was facing the problem where we were sending '£' with some string in POST request to CRM System, but when we were doing the GET call from CRM , it was returning '£' with some string content. So what we have analysed is that '£' was getting converted to '£'.
Analysis:
The glitch which we have found after doing research is that in POST call we have set HttpWebRequest ContentType as "text/xml" while in GET Call it was "text/xml; charset:utf-8".
Solution:
So as the part of solution we have included the charset:utf-8 in POST request and it works.
In my case this (a with caret) occurred in code I generated from visual studio using my own tool for generating code. It was easy to solve:
Select single spaces ( ) in the document. You should be able to see lots of single spaces that are looking different from the other single spaces, they are not selected. Select these other single spaces - they are the ones responsible for the unwanted characters in the browser. Go to Find and Replace with single space ( ). Done.
PS: It's easier to see all similar characters when you place the cursor on one or if you select it in VS2017+; I hope other IDEs may have similar features
In my case I was getting latin cross sign instead of nbsp, even that a page was correctly encoded into the UTF-8. Nothing of above helped in resolving the issue and I tried all.
In the end changing font for IE (with browser specific css) helped, I was using Helvetica-Nue as a body font changing to the Arial resolved the issue .
I was having the same sort of problem. Apparently it's simply because PHP doesn't recognise utf-8.
I was tearing my hair out at first when a '£' sign kept showing up as '£', despite it appearing ok in DreamWeaver. Eventually I remembered I had been having problems with links relative to the index file, when the pages, if viewed directly would work with slideshows, but not when used with an include (but that's beside the point. Anyway I wondered if this might be a similar problem, so instead of putting into the page that I was having problems with, I simply put it into the index.php file - problem fixed throughout.
The reason for this is PHP doesn't recognise utf-8.
Here you can check it for all Special Characters in HTML
http://www.degraeve.com/reference/specialcharacters.php
Well I got this Issue too in my few websites and all i need to do is customize the content fetler for HTML entites. before that more i delete them more i got, so just change you html fiter or parsing function for the page and it worked. Its mainly due to HTML editors in most of CMSs. the way they store parse the data caused this issue (In My case). May this would Help in your case too