I know this problem is almost as old as the world and thousands of answers exist on the web, but I still cannot find what the problem is in my case and why characters show up as black question marks (�) :(
We have a multilingual site that currently supports 10 languages. Some characters are displayed incorrectly (ве��сией, 联合国���际). It can happen with regular characters in non-Latin languages, while in other words on the same page the same characters are displayed correctly. In Latin languages, all special and regular characters are displayed correctly.
I tried to play with the encoding, but when it fixes the problem in one place, the problem appears in another place.
Here is how my encodings are configured:
1) In MS SQL Server, we use an NVARCHAR(MAX) column with the SQL_Latin1_General_CP1_CI_AS collation.
2) In the web application, the web.config file has: <globalization requestEncoding="utf-8" responseEncoding="utf-8" />.
3) On the page itself, we have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
In the response headers, Chrome shows: Content-Type:text/html; charset=utf-8.
What am I missing? Why do I still see those black question marks? What should I check or change in order to display all characters correctly?
Thanks
UPDATE
I found the problem, and it is not related to transport encoding at all. I thought the problem was in how the text passes DB -> ASP.NET -> Browser, but after lots of debugging I found that it is in how the output is written to HttpContext.Current.Response.Filter. We have our own custom filter, and the buffer (byte[]) passed to the filter's Write method sometimes contains a corrupted encoding of the Unicode string, so the bytes of the last character of the string are translated as gibberish. I still have not found how to solve it correctly, but for now I can disable our filter and there are no more question marks.
Thanks to all.
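One plausible cause of the symptom in the update is a multi-byte character being split across two buffers, with each buffer decoded on its own. The sketch below reproduces that in Python rather than ASP.NET; the sample text and the chunk boundary are assumptions chosen for illustration, not taken from the original filter. In .NET, a stateful System.Text.Decoder (from Encoding.UTF8.GetDecoder()) plays the role of the incremental decoder used here.

```python
import codecs

text = "версией 联合国"          # sample non-Latin text (assumed for the demo)
data = text.encode("utf-8")      # the bytes a response filter would receive


def decode_naively(chunks):
    # Decodes every chunk on its own; a character split across two
    # chunks turns into U+FFFD replacement characters.
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_incrementally(chunks):
    # An incremental decoder keeps a trailing partial character and
    # completes it when the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = "".join(decoder.decode(c) for c in chunks)
    return out + decoder.decode(b"", final=True)


# Hypothetical chunk boundary: byte 3 falls inside the second Cyrillic letter.
chunks = [data[:3], data[3:]]

naive = decode_naively(chunks)        # contains U+FFFD where the split happened
safe = decode_incrementally(chunks)   # identical to the original text
```

If a filter must transform text, decoding through a stateful decoder (or buffering until the response is complete) avoids cutting characters in half.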
I don't know about MS SQL Server, but have you tried having it use UTF-8 encoding instead of Latin-1? A quick Google search shows the following (this is MySQL syntax, so the SQL Server equivalent may differ):
DEFAULT CHARACTER SET utf8;
DEFAULT COLLATE utf8_general_ci;
I would think that would be a better option than SQL_Latin1_General_CP1_CI_AS.
If the page renders in a font which lacks those glyphs, they will be rendered with placeholders.
For example, on my phone, several of the examples you say are displaying correctly for you are shown to me with placeholders for some of the text.
The question might be a bit basic – considering I'm not what the vast majority would consider a newcomer to front end web development.
I am teaching an 8-year-old HTML, CSS and JavaScript. I'm taking the opportunity to also teach about UTF-8 encoding, in particular the way HTML uses it to allow non-English characters to be encoded and displayed.
I want to show him how accented characters do not appear properly without including <meta charset="UTF-8"/>.
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
After some research I came to the conclusion that in modern IDEs the encoding system comes "built in", hence there's no real need to write down <meta charset />. If this is wrong please correct me, as I am currently confused about what exactly happened and I don't want to teach wrong information to an 8-year-old.
After some research I came to the conclusion that in modern IDEs the encoding system comes "built in", hence there's no real need to write down <meta charset />. If this is wrong please correct me
Yes, that is wrong!
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
This is also wrong, let me explain!
UTF-8 is an encoding system. This means it describes how to map bytes into textual characters. It's certainly possible to display "Á" without using utf-8.
The letter A (normal, no accents) is encoded with the number 65 in both ASCII and UTF-8. In fact, all English characters and punctuation are encoded the same way across virtually all encodings, so encoding problems rarely become apparent in English-only text.
However, accented letters, non-English characters and emojis (😁) are encoded differently in different encoding systems. What causes "corrupt" text to be displayed is an encoding mismatch: your web browser thinks the encoding used is X while the file was actually encoded with system Y, so byte values no longer map to the correct characters. For example, suppose system X uses the number 250 to encode 😁, while system Y uses the number 190, and under system Y 250 is mapped to "Ë". Now my 😁 appears as "Ë".
<meta charset="utf-8"/> specifies the encoding used for the HTML file. It is absolutely needed. Your webpage worked without it because browsers may use other ways to determine the encoding, including educated guesses, but it should always be explicitly written in the HTML to avoid problems down the line.
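The mismatch described above can be reproduced with real encodings. This is a minimal Python sketch; the choice of "é" and of Windows-1252 as the wrongly assumed encoding is an assumption made for the demonstration.

```python
text = "é"

# Plain ASCII letters get the same byte value (65) in all three encodings.
ascii_bytes = {enc: "A".encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

utf8_bytes = text.encode("utf-8")       # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")   # one byte: 0xE9

# Mismatch: UTF-8 bytes read as Windows-1252 become two wrong characters.
misread = utf8_bytes.decode("cp1252")   # "Ã©"

# Match: the same bytes read with the encoding they were written in.
correct = utf8_bytes.decode("utf-8")    # "é"
```

The bytes never change; only the table used to interpret them does, which is why declaring the encoding matters.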
You should specify the encoding for several reasons:
Even if the encoding system did come built in, you cannot know which default encoding the IDE has chosen.
The HTML5 specification says that, when the encoding is not specified in the document, it should be taken from the transport layer, and the default charset for HTTP/1.1 is ISO-8859-1.
See the full explanation here: Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?
I'm making a website that is in Croatian, and I need to use signs like: "č", "ć", "ž", "đ" and "š". They are currently displayed as little boxes.
Info:
I use Notepad ++.
I set the encoding there to UTF-8.
I put the following line of HTML in: <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
However, it does not work. Even Notepad ++ can't display my characters using UTF-8, so that would suggest that I should probably use something else...
http://webdesign.maratz.com/lab/utf_table/
Use HTML entities, for example
&#269; : č
&#382; : ž
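If you go the entity route, the numeric references do not have to be looked up by hand. A small sketch, assuming Python is available; the helper name to_numeric_entities is made up for this example.

```python
import html


def to_numeric_entities(text):
    # xmlcharrefreplace turns every non-ASCII character into &#NNN;
    # where NNN is the decimal Unicode code point.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


entities = to_numeric_entities("č ć ž đ š")
# entities == "&#269; &#263; &#382; &#273; &#353;"

# html.unescape converts the references back into the characters.
restored = html.unescape(entities)
```

The result is pure ASCII, so it survives any encoding mix-up, at the cost of a less readable source file.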
This sounds more like a font issue than a character encoding issue. If it were a character encoding issue, the characters would most likely be displayed as 2+ ASCII characters. The boxes, however, typically mean the character encoding is correct, but that specific character is not available in the font being used (which is especially common with lesser-used fonts). This would explain why it's behaving incorrectly in both the website and Notepad++.
To fix the issue, simply use a different font in your editor and website.
Note: I recommend a widely used font for the best chance of it working. Specifying a generic name in the website (e.g. serif or sans-serif) will probably have even better results, as the OS/browser would decide on the best font to use.
In short, be consistent about your character encoding throughout.
Configure your editor to save in the encoding you want
If you use any server side programming, make sure it isn't transcoding your data
If you use a database, make sure it is configured to use the same encoding
Configure your server to emit a Content-Type header that specifies that encoding
Use the meta tag in your question
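As a rough check of the first and last points in the list, the sketch below writes a page with an explicitly named encoding and confirms that the bytes on disk decode back to the same text. It is Python, and the file name and page content are assumptions for illustration.

```python
import os
import tempfile

page = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="utf-8"></head>\n'
    "<body>č ć ž đ š</body></html>\n"
)

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "croatian.html")

# Always name the encoding; the platform default may be something else.
# newline="" keeps line endings exactly as written.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(page)

with open(path, "rb") as f:
    raw = f.read()

# Decoding with the declared encoding must give back the original text.
round_trip = raw.decode("utf-8")
```

If round_trip differs from page, something between the editor and the disk is transcoding the data.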
The W3C provides useful material on encodings that starts here.
A useful site for special characters and their ASCII-codes: CopyPaste Character
To 'type' them, use the alt codes.
However, to use them in your site, you'd better use the HTML codes, which you can find on CPC
As a test, try this:
<span style="font-family:Arial Unicode MS">
č ć ž đ š
</span>
You should be able to see your characters correctly.
I've just copied and pasted a line from your question along with your meta tag, placed it into a plain text file in vi.
It works just fine - all characters are displayed fine: http://www.dusystems.com/tmp/1.html
If you can't do the same with your editor then the problem is with the editor and not character sets and encodings.
If you're on Windows you can use its built-in Notepad to edit UTF-8 files. Open Notepad, type all of your special characters, add the meta tag. When doing Save As select UTF-8 from the Encoding drop-down in the dialog. Save as something.html and open in IE. It will 100% work.
Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
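The effect is easy to reproduce. In this Python sketch the sample text is an assumption; it is saved as windows-1252, read back as UTF-8, and then repaired by decoding with the real encoding and re-encoding as UTF-8, which is what re-saving in an editor does.

```python
text = "caf\u00e9 \u2014 menu"        # contains é and an em dash

raw = text.encode("cp1252")           # what a windows-1252 editor writes to disk


def is_valid_utf8(data):
    # Strict decoding fails as soon as a byte sequence is not valid UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A browser told "this is UTF-8" substitutes U+FFFD for the invalid bytes.
wrong = raw.decode("utf-8", errors="replace")

# Repair: decode with the encoding actually used, then save as UTF-8.
fixed_bytes = raw.decode("cp1252").encode("utf-8")
fixed = fixed_bytes.decode("utf-8")
```

Once the replacement character has been written back to a file, the original byte is gone, so the repair has to start from the original bytes.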
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
Make sure your document is set to the correct character encoding in the actual code editor, as well as declared in the document's head. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text encoding setting in Coda.
<head>
<meta charset="utf-8">
Make sure your characters are actually UTF-8 characters. They will probably look something like this:
® or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.
I am really amazed by the magic of UTF-8 but can't understand the logic behind it. I went through several documents but am still confused, as I only know the basics.
Please take a look at the first example. It converts characters of a language to UTF-8. There are two text boxes: enter the characters in the first one, click the button, and get the UTF-8 values in the second one.
Please take a look at the second example. I took the UTF-8 values from example 1 and put them in the HTML, and here I really do not understand how they are translated. I tested three languages: Chinese, Hindi and Russian.
I used Google Translate to translate from English into several languages:
Hello = 您好 (Chinese)
Hello = नमस्ते (Hindi)
Hello = привет (Russian)
How does a web page identify a language's characters on the basis of UTF-8? Is it possible that different computers will show different characters?
The "magic" behind UTF-8 is called Unicode. It is one of several encodings of the standard.
Unicode does have character ranges that correspond to languages and many characters are specifically associated with a language.
I suggest reading this - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
UTF-8 is a variable-length byte encoding of Unicode, the character numbering system for all languages.
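To make "variable-length" concrete, here is a short Python sketch that prints the Unicode code point and the UTF-8 bytes for a few characters. The sample characters are an assumption picked to cover one, two, three and four bytes.

```python
samples = ["A", "п", "您", "😁"]

# Number of bytes each character needs in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    code_point = f"U+{ord(ch):04X}"          # the character's Unicode number
    utf8_hex = ch.encode("utf-8").hex(" ")    # its UTF-8 bytes in hex
    print(code_point, "->", utf8_hex)

# Decoding is deterministic: the same bytes give the same code point
# on every computer, as long as they are read as UTF-8.
same_everywhere = bytes.fromhex("e6 82 a8").decode("utf-8") == "您"
```

The page does not identify a language at all. It identifies code points, and whether a glyph can be drawn for a code point then depends on the fonts installed on each machine.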
Web pages are by default based on ISO-8859-1, so-called Latin-1. Other charsets can be set in two ways:
In the HTTP headers, the lines of text that precede an empty line and then the HTML content.
There, a header line such as:
Content-Type: text/html; charset=UTF-8
A Java EE server needs to do the following for this:
response.setContentType("text/html; charset=UTF-8");
In the HTML head, with a meta tag:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
...
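On the receiving side, the charset is read out of that header value. A minimal Python sketch using the standard library's header parser; the function name and the Latin-1 fallback are assumptions that follow the default stated above, not a description of what any particular browser does.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    # Parse the value of a Content-Type header and return its charset
    # parameter in lower case, or the assumed default when it is absent.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/html; charset=UTF-8")   # "utf-8"
fallback = charset_from_content_type("text/html")                  # "iso-8859-1"
```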
I found a website that contains the string "don’t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "don’t"). A Google search yielded nothing (except lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?
In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "’". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
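The three octets and the resulting garbage can be checked directly. A minimal Python sketch, assuming the page bytes are UTF-8 and the browser applies windows-1252:

```python
intended = "don\u2019t"                 # "don't" with a right single quote

utf8 = intended.encode("utf-8")         # the quote becomes 0xE2 0x80 0x99
quote_octets = utf8[3:6]

# Read as windows-1252, each of the three octets becomes its own character.
seen = utf8.decode("cp1252")            # "don’t"

# Reversing the wrong step recovers the text, as long as no byte was lost.
repaired = seen.encode("cp1252").decode("utf-8")
```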
According to Character encodings in HTML, an article on Wikipedia:
HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII two goals are worth considering: the information's integrity, and universal browser display.
I suppose the site you checked isn't implemented with this in mind.
This has all got to do with encoding. Take a look back at the source: is there a tag at the top specifying it (charset)? My guess is it'll be UTF-8, although it could be something completely different.
This thread explains it all: a combination of a weird UTF-8 apostrophe character (probably originating from a Word document) and a server that probably reports its encoding as non-UTF-8, despite the page having UTF-8 characters (and possibly even correctly reporting its own encoding).