How to distinguish UTF-8 and ASCII files?

How to distinguish UTF-8 (no BOM) and ASCII files?

If the file contains any bytes with the top bit set, then it is not ASCII.
So if the only possibilities are ASCII or UTF-8, then it's UTF-8.
If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.
Of course this doesn't distinguish UTF-8 from ISO Latin or CP1252, and neither does it confirm that the so-called UTF-8 is actually valid.
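A minimal sketch of that check in Python (the function name is my own; this assumes the only candidates are ASCII, UTF-8, or some other 8-bit encoding):

# Classify a file as ASCII, UTF-8, or neither.
def classify(path):
    with open(path, "rb") as f:
        data = f.read()
    # Every byte has the top bit clear: plain ASCII (and also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ASCII"
    # Top-bit-set bytes present: UTF-8 only if they form valid sequences.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "neither (some other 8-bit encoding)"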

IsTextUnicode function (http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx):
Determines if a buffer is likely to contain a form of Unicode text.

Related

What's the difference between typing the Encoding of a Unicode character or just copying the character?

For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy-paste •. What's the real difference?
&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, so the file it appears in can be encoded using pure ASCII or any compatible encoding. Since ASCII is the de facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding; you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent it (like UTF-8), and you need to send the correct metadata so that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc.). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character entity reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared for the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
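A quick Python check of those octet counts (just an illustration, not part of the answers above):

entity = "&#8226;"                    # the character reference, as plain text
bullet = "\u2022"                     # the bullet point character itself
print(len(entity.encode("ascii")))    # 7 octets
print(len(bullet.encode("utf-8")))    # 3 octets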

File encoding (UTF-8 not working properly)

On my web page, there is a form with multiple inputs. However, the characters typed into the inputs behave differently from the characters in the input labels. I tried setting the file encoding to UTF-8 and to UTF-8 + BOM (I'm using EditPlus).
Using UTF-8: (screenshot omitted)
Using UTF-8 + BOM: (screenshot omitted)
The input characters come from a MySQL database whose collation is utf8_unicode_ci (using phpMyAdmin), so I don't know if that's the source of the problem. Any ideas?
This means both pieces of data are not in the same encoding. If the file is interpreted as Latin-1 (or a similar encoding), you get the first result in which the data in the input field is valid (meaning it's Latin-1 encoded) but the label is wrong (meaning it's not Latin-1 encoded). When the file is interpreted as UTF-8, the label is correct (meaning it's UTF-8 encoded) but the data in the input field is wrong (meaning it's not UTF-8 encoded). If data shows up as the � UNICODE REPLACEMENT CHARACTER, it's a sure sign the document is being interpreted as a Unicode encoding (e.g. UTF-8), but the byte sequence is invalid.
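Both failure modes are easy to reproduce; here is a short Python sketch (the sample character is my own):

label = "é"
# UTF-8 bytes misread as Latin-1: every byte is "valid", so you get mojibake.
print(label.encode("utf-8").decode("latin-1"))              # Ã©
# Latin-1 bytes misread as UTF-8: the sequence is invalid, so you get �.
print(label.encode("latin-1").decode("utf-8", "replace"))   # �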
I'll guess that the label is hardcoded in the file but the data in the input field comes from a database. In this case you need to set the connection encoding for the database to return UTF-8.
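How you do that depends on your driver. As one hedged example, with Python's PyMySQL the connection charset is set like this (host, credentials, and database name are placeholders):

import pymysql

# utf8mb4 makes the driver exchange text with MySQL as UTF-8.
conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb", charset="utf8mb4")

In PHP, mysqli_set_charset($link, 'utf8') or running the query SET NAMES utf8 accomplishes the same thing.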
As to why the file is interpreted as Latin-1 without a BOM and as UTF-8 with one: the browser recognizes the BOM as signifying UTF-8; without it, it falls back to Latin-1. You need to set the correct HTTP header (for example, Content-Type: text/html; charset=utf-8) to tell the browser what encoding the file is in, and get rid of the BOM.
Read these resources:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
solved it: Just changed the file encoding to "Western European (Windows) 1252" (using EditPlus) and now every character is shown correctly.

Confused by HTML5, UTF-8 and 8859-1

Yesterday I upgraded an HTML page from "4.01 strict" to HTML5.
http://r0k.us/rock/games/CoH/HallsOfHeroes/
The character encoding is ISO-8859-1. The W3C validator (http://validator.w3.org) fails and won't even parse the page when UTF-8 is specified as the charset, apparently because I use footnote characters such as ². They are in the upper 128 bytes of the character set. What confuses me is that I keep reading that the first 256 bytes of UTF-8 are the same as 8859-1.
Does anyone know why the page won't validate as utf-8 ?
Actually, the first 256 code points of Unicode do match 8859-1, but only the first 128 of them (the ASCII range) are encoded in UTF-8 as single, identical bytes. UTF-8 is not 8859-1: the next 128 code points are encoded differently, as two bytes each.
You need to re-save the files as UTF-8 if you want them to be served as UTF-8.
The character ² ("SUPERSCRIPT TWO") is represented by the number 0xb2 (178 decimal) -- but it's represented differently in 8859-1 and UTF-8.
In 8859-1, it's represented as a single byte with the value 0xb2.
In UTF-8, it's represented as two consecutive bytes with the values 0xc2, 0xb2.
(8859-1 is more compact than UTF-8 for files containing 8-bit characters, but it's incapable of representing any code point above 255. UTF-8 is compatible with ASCII and with 8859-1 for 7-bit characters, is reasonably compact for most text, and can represent more than a million distinct characters.)
A file containing only 7-bit characters can be interpreted as ASCII, 8859-1, or UTF-8 alike. A file containing 8-bit characters cannot; it has to be converted.
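You can see both representations from Python, for example:

s = "\u00b2"                  # ² SUPERSCRIPT TWO
print(s.encode("latin-1"))    # b'\xb2'      -- one byte in 8859-1
print(s.encode("utf-8"))      # b'\xc2\xb2'  -- two bytes in UTF-8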
If you're on a Unix-like system with the iconv command installed, this:
iconv -f iso-8859-1 -t utf-8 < input.html > output.html
will perform the appropriate translation. (The file names are placeholders; with no redirections, iconv reads standard input and writes standard output.)

Is UTF-8 an Encoding or a Document Character Set?

According to the W3C Recommendation, every application requires a document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I'm guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know if those are the "document character sets". And if they are,
Why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
If I'm not wrong about all this:
Are there other document character sets that Windows doesn't let you choose?
How do you define another document character set?
In my understanding:
ANSI (on Western Windows systems, usually code page 1252) is both a character set and an encoding of that character set.
Unicode is a character set; the encoding in question is probably little-endian UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
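A small Python illustration of that distinction: the character set (Unicode) stays fixed while the choice of encoding changes the bytes.

ch = "\u20ac"                   # € EURO SIGN, one Unicode character
print(ch.encode("utf-8"))       # 3 bytes: e2 82 ac
print(ch.encode("utf-16-le"))   # 2 bytes: ac 20 (Notepad's "Unicode")
print(ch.encode("utf-16-be"))   # 2 bytes: 20 ac (Notepad's "Unicode big endian")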
Also, see Joel on Software's mandatory article on the subject.
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own is rarely useful, so it's not really supported.
You can't make Notepad save in other character sets. Try using a programmer's editor such as Notepad++, which will give you more character sets to choose from.

2 encodings between an HTML representation

I'm reading a chapter from the W3C HTML Document Representation.
Section 5.1 says this:
User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
Then section 5.2 says this:
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.
Characters → bytes
Bytes → characters
So am I wrong, or are there two encodings involved in the representation?
A "character encoding" such as UTF-8 is, strictly speaking, a specification for representing characters as a sequence of bytes. But the encodings are always reversible, so we can speak of a (single) character encoding as going both ways.
Other character encodings used in practice are UTF-16 and UTF-32.
Each of these is a specification under which you can encode text as bytes and decode bytes back into characters: two directions of the same specification.
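In Python terms (a trivial illustration):

text = "héllo"
data = text.encode("utf-8")   # characters -> bytes (the 5.1 direction)
back = data.decode("utf-8")   # bytes -> characters (the 5.2 direction)
assert back == text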