Is UTF-8 an Encoding or a Document Character Set?

According to the W3C Recommendation, every application requires its document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I'm guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know if those are the "document character sets". And if they are,
why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
If I'm not wrong about all this:
Are there other document character sets that Windows does not let you choose?
How can I define other document character sets?

In my understanding:
ANSI is both a character set and an encoding of that character set.
Unicode is a character set; the encoding in question is probably UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
Also, see Joel on Software's mandatory article on the subject.
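To make the distinction between a character set and its encodings concrete, here is a minimal Python sketch. It assumes that Notepad's "ANSI" corresponds to Windows-1252 (typical on Western European systems, but not universal) and that "Unicode" and "Unicode big endian" correspond to little-endian and big-endian UTF-16. The variable names are illustrative only.

```python
# One abstract character: the Chinese character meaning "water".
ch = "\u6c34"

# Its code position in the Unicode character set is a plain integer.
code_point = ord(ch)

# Different encodings turn that same code position into different bytes.
utf8_bytes = ch.encode("utf-8")
utf16_le_bytes = ch.encode("utf-16-le")
utf16_be_bytes = ch.encode("utf-16-be")

print(hex(code_point))
print(utf8_bytes.hex(), utf16_le_bytes.hex(), utf16_be_bytes.hex())

# Windows-1252 has no code position for this character, so encoding fails.
try:
    ch.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False
print(fits_cp1252)
```

The code position stays the same in every case; only the byte sequence changes with the encoding chosen.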

UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own is not useful, so there is no real way to do it.
You can't define other character sets to save with Notepad. Try a programmer's editor such as Notepad++, which will give you more character sets to choose from.

Related

What's the difference between typing the Encoding of a Unicode character or just copying the character?

For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy and paste •. What's the real difference?

&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, and so the file it is in can be encoded using pure ASCII, or any compatible encoding. Since ASCII is the de-facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding and you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

A numeric character reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared with the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
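The octet counts above can be checked with a short Python sketch. It uses the standard library's html.unescape to stand in for what an HTML parser does with the reference; the variable names are illustrative.

```python
import html

reference = "&#8226;"   # numeric character reference, plain ASCII
literal = "\u2022"      # the bullet character itself

# The reference survives in pure ASCII; the literal needs a richer encoding.
reference_bytes = reference.encode("ascii")
literal_bytes = literal.encode("utf-8")

# An HTML-aware consumer turns the reference into the literal character.
decoded = html.unescape(reference)
print(len(reference_bytes), len(literal_bytes), decoded == literal)

# Outside HTML nobody performs that step, and ASCII cannot hold the literal.
try:
    literal.encode("ascii")
    literal_is_ascii = True
except UnicodeEncodeError:
    literal_is_ascii = False
print(literal_is_ascii)
```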

File encoding (UTF-8 not working properly)

In my webpage, there is a form with multiple inputs. However, the input chars behave differently from the input "label" chars. I tried setting the file encoding to UTF-8 and UTF-8 +BOM (I'm using EditPlus).
Using UTF-8:
Using UTF-8 + BOM:
The input chars come from a MySQL database where the collation is utf8_unicode_ci (set using phpMyAdmin), so I don't know whether that's the source of the problem. Any ideas?

This means both pieces of data are not in the same encoding. If the file is interpreted as Latin-1 (or a similar encoding), you get the first result in which the data in the input field is valid (meaning it's Latin-1 encoded) but the label is wrong (meaning it's not Latin-1 encoded). When the file is interpreted as UTF-8, the label is correct (meaning it's UTF-8 encoded) but the data in the input field is wrong (meaning it's not UTF-8 encoded). If data shows up as the � UNICODE REPLACEMENT CHARACTER, it's a sure sign the document is being interpreted as a Unicode encoding (e.g. UTF-8), but the byte sequence is invalid.
I'll guess that the label is hardcoded in the file but the data in the input field comes from a database. In this case you need to set the connection encoding for the database to return UTF-8.
As to why the file is interpreted as Latin-1 without a BOM and as UTF-8 with a BOM: the browser recognizes the BOM as signifying UTF-8; without it, it defaults to Latin-1. You need to set the correct HTTP header to tell the browser what encoding the file is in, and get rid of the BOM.
Read these resources:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
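The two failure modes described in this answer can be reproduced in a few lines of Python. This is only a sketch: the sample string is hypothetical and stands in for either the hardcoded label or the database value.

```python
sample = "caf\u00e9"  # hypothetical text containing "é"

utf8_data = sample.encode("utf-8")      # two bytes for "é"
latin1_data = sample.encode("latin-1")  # one byte for "é"

# UTF-8 bytes interpreted as Latin-1: every byte decodes, but wrongly.
utf8_read_as_latin1 = utf8_data.decode("latin-1")

# Latin-1 bytes interpreted as UTF-8: the lone byte is invalid, so it is
# replaced by U+FFFD, the replacement character.
latin1_read_as_utf8 = latin1_data.decode("utf-8", errors="replace")

print(repr(utf8_read_as_latin1))
print(repr(latin1_read_as_utf8))
```

Both pieces of data display correctly only when the bytes and the declared encoding agree, which is why the database connection, the file, and the HTTP header all have to name the same encoding.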

Solved it: I just changed the file encoding to "Western European (Windows) 1252" (using EditPlus) and now every character is shown correctly.

Two encodings in an HTML document representation?

I'm reading a chapter of the W3C HTML Document Representation.
Section 5.1 says this:
User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
Then section 5.2 says this:
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.
Char-Bytes
Bytes-Char
So am I wrong, or are there two encodings in the representation?

A "character encoding" such as UTF-8 is, strictly speaking, a specification for representing characters as a sequence of bytes. But the encodings are always reversible, so we can speak of a (single) character encoding as going both ways.
Other character encodings used in practice are UTF-16 and UTF-32.
Each of these is a specification under which you can encode text as bytes and decode bytes into characters. Two parts of the same specification.
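A small Python sketch of that reversibility, assuming nothing beyond the standard codecs; the sample text is arbitrary:

```python
text = "A\u0418\u6c34"  # Latin A, Cyrillic I, Chinese "water"

round_trips = {}
for name in ("utf-8", "utf-16", "utf-32"):
    data = text.encode(name)        # characters -> bytes
    restored = data.decode(name)    # bytes -> characters
    round_trips[name] = (len(data), restored == text)

for name, (size, ok) in round_trips.items():
    print(name, size, ok)
```

The byte counts differ per encoding (the "utf-16" and "utf-32" codecs also prepend a BOM), but decoding with the same encoding always gives back the original characters.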

How to distinguish UTF-8 and ASCII files?

How to distinguish UTF-8 (no BOM) and ASCII files?

If the file contains any bytes with the top bit set, then it is not ASCII.
So if the only possibilities are ASCII or UTF-8, then it's UTF-8.
If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.
Of course this doesn't distinguish UTF-8 from ISO Latin or CP1252, and neither does it confirm that the so-called UTF-8 is actually valid.
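The check described above could be sketched in Python roughly as follows. The function name classify and its three labels are made up for illustration; the extra "other" case covers the caveat that bytes with the top bit set are not necessarily valid UTF-8.

```python
def classify(data: bytes) -> str:
    """Rough guess: 'ascii', 'utf-8', or 'other' (hypothetical labels)."""
    # No byte has the top bit set: ASCII and UTF-8 are indistinguishable.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Some byte has the top bit set: accept as UTF-8 only if it decodes.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


print(classify(b"hello"))
print(classify("\u2022".encode("utf-8")))
print(classify(b"caf\xe9"))  # Latin-1 / CP1252 bytes
```

For a real file, the bytes would come from open(path, "rb").read(). Note that some Latin-1 or CP1252 byte sequences happen to be valid UTF-8, so "utf-8" here means "decodes as UTF-8", not proof of the author's intent.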

http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx
IsTextUnicode Function
Determines if a buffer is likely to contain a form of Unicode text.

What encoding does Stack Overflow use in MySQL?

I cannot save the character 𝑴 in my MySQL database, whose encoding is utf8, but I found that Stack Overflow can save and display it.
I made a mistake: Stack Overflow cannot save 𝑴 either.

If you can't store the character, you are encoding or decoding it incorrectly, or converting it to a character set that doesn't support the character.
The UTF-8 encoding can handle almost any character that exists in any language, so it's quite unlikely that it's a limitation of that encoding.
You have to use the Unicode character set or some Unicode encoding (UTF-7, UTF-8, UTF-16, UTF-32) for all steps of the process. If you convert the text to some other character set and then back, you can only support the characters of that specific character set.
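As a sanity check that the Unicode encodings themselves can represent this character, here is a minimal Python sketch. One detail the answers here do not mention, and which may be relevant, is that MySQL's character set named utf8 stores at most three bytes per character, while utf8mb4 is the variant that accepts four-byte sequences. Whether that was the cause in this case is an assumption, not something the thread confirms.

```python
ch = "\U0001D474"  # MATHEMATICAL BOLD ITALIC CAPITAL M

code_point = ord(ch)
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")

print(code_point)                         # decimal value used in &#...; references
print(len(utf8_bytes), utf8_bytes.hex())  # the character needs four bytes in UTF-8
print(utf16_bytes.hex())                  # a surrogate pair in UTF-16
```

The character lies outside the Basic Multilingual Plane, so any storage layer limited to three-byte UTF-8 sequences cannot hold it even though UTF-8 proper can.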

Stack Overflow is trying to display the character as &#119924;. So maybe that character value is being saved in the database (certainly some character value is being saved in the database), and the reason we can't see the character is the font used to display the HTML: perhaps it's the font, not the database, that doesn't support that character value.
