latin1 vs utf8 charset and index usage (MySQL 5.5)

My understanding of latin1 vs utf8 is as follows:
"latin1 supports only Latin characters (like English), but utf8 supports international languages like French, Chinese, Arabic, etc. (though not fully, as MySQL's utf8 uses at most 3 bytes per character, while it would need 4 bytes per character to cover every UTF-8 character). By the standard, latin1 stores 1 character in 1 byte, while utf8 stores 1 character in 1-3 bytes. But if we store only Latin characters, even in a utf8 column, it will store 1 character in 1 byte."
latin1 vs utf8 index: "A column value takes bytes per character according to the column's charset, but an index always stores values in bytes."
Could someone clarify the queries below? I will be very thankful.
Suppose there is a title VARCHAR(250) column in a utf8 table, with an index on it created as ALTER TABLE mytable ADD INDEX (title(16));
If this column contains the string "This is my Title", which is 16 characters, all Latin, then please clarify:
1) Since the string contains 16 characters, all of Latin type, does it store only 16 bytes even though the table charset is utf8, or not?
2) Is an index on 16 bytes sufficient to cover this 16-character string, or not?
Thanks,
Zafar

1) Yes. 2) Yes.
Note that "latin" is not a character encoding. Encodings people usually call latin-something, like MySQL's "latin1," include characters that need 2 or 3 bytes when encoded in UTF-8. It's ASCII characters that can be stored with one byte in UTF-8.

1) latin1 (ISO-8859-1) characters can take more than 1 byte in utf8. If the characters are ASCII (as in your example string), they need only 1 byte each in utf8. If they're non-ASCII but still latin1, more bytes are needed.
2) Again, assuming the characters in the 16-byte string are always ASCII, 16 bytes in the utf8 index would cover it. However, note that for indexes on a CHAR/VARCHAR/TEXT column, the index prefix length is measured in characters, not bytes. So (16) means your index could be up to 48 bytes for utf8. The same applies to the column definition itself: VARCHAR(250) is 250 characters, which is up to 750 bytes for utf8.
Note that MySQL also supports the utf8mb4 encoding, which is proper UTF-8, i.e. characters can take up to 4 bytes to encode. However, if you use it and want longer indexes, you'll need to adjust the table row format and InnoDB settings, because index entries will take up more than the standard 767 bytes (e.g. a 250-character index would need space for 1000 bytes).
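To make the byte arithmetic concrete, here is a sketch under the same assumptions (MySQL 5.5 with InnoDB; mytable and title as in the question):

-- The prefix length in ADD INDEX (title(16)) is 16 characters;
-- the worst-case key size depends on the column's character set:
--   utf8:    16 * 3 = 48 bytes
--   utf8mb4: 16 * 4 = 64 bytes
ALTER TABLE mytable ADD INDEX (title(16));
-- A 250-character prefix under utf8mb4 would need 250 * 4 = 1000 bytes,
-- which exceeds the default 767-byte InnoDB key prefix limit.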

Related

SQL VARCHAR length

I read a lot of articles and SO questions about the topic, but all they did was confuse me more.
I'm trying to understand what the longest string a VARCHAR can hold is, and how to define it.
In some places it says a VARCHAR can be created with a max length of 255 (i.e. VARCHAR(255)); I don't understand whether that means 255 bytes or 255 characters.
In other places it says a VARCHAR can hold up to 8000 bytes, with the maximum string length then depending on the encoding (1 byte per character, such as Latin, or more).
In simple terms, what does the n in VARCHAR(n) stand for, and what is the range of n?
Is it bytes? Is it a number of characters? Between 0-255? Between 0-8000?
How is a really long text saved? Does it get split into multiple columns?
VARCHAR stores strings at 1 byte per symbol (as opposed to NVARCHAR, which can use 2 or more bytes per character).
A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored. This is similar to the definition of NCHAR(n) and NVARCHAR(n). The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n. However, for multi-byte encoding such as UTF-8, higher Unicode ranges (128-1,114,111) result in one character using two or more bytes. For example, in a column defined as CHAR(10), the Database Engine can store 10 characters that use single-byte encoding (Unicode range 0-127), but less than 10 characters when using multi-byte encoding (Unicode range 128-1,114,111). For more information about Unicode storage and character ranges, see Storage differences between UTF-8 and UTF-16.
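A minimal illustration of the bytes-vs-characters point (SQL Server syntax; the variable names are made up):

DECLARE @a VARCHAR(10) = 'hello';    -- n = 10 is a limit in bytes
DECLARE @b NVARCHAR(10) = N'hello';  -- n = 10 is a limit in byte-pairs
-- LEN() counts characters, DATALENGTH() counts bytes:
SELECT LEN(@a) AS chars_a, DATALENGTH(@a) AS bytes_a;  -- 5, 5
SELECT LEN(@b) AS chars_b, DATALENGTH(@b) AS bytes_b;  -- 5, 10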

How does SQL determine a character's length in a varchar?

After reading the documentation, I understood that there is a one-byte or two-byte length prefix on a varying-length character value, used to determine its length. I also understand that, in a varchar, each character might have a different length in bytes depending on the character itself.
So my question is:
How does the DBMS determine each character's length after it's stored?
Meaning: after a string is stored, let's say it's 4 characters long, and suppose the first character is 1 byte long, the second 2 bytes, the third 3 bytes and the fourth 4.
How does the DB know how long each character is when retrieving the string, so as to read it correctly?
I hope the question is clear; sorry for any English mistakes I made. Thanks
The way UTF-8 works as a variable-length encoding is that the 1-byte characters can only use 7 bits of that byte.
If the high bit is 0, then the byte is a 1-byte character (which happens to be encoded in the same way as the 128 ASCII characters).
If the high bit is 1, then the byte is part of a multi-byte character (a leading byte starts with 11; continuation bytes start with 10).
(Byte-layout diagram from https://en.wikipedia.org/wiki/UTF-8.)
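For reference, the UTF-8 byte patterns that diagram illustrates are:

0xxxxxxx                             -- 1 byte:  U+0000..U+007F (ASCII)
110xxxxx 10xxxxxx                    -- 2 bytes: U+0080..U+07FF
1110xxxx 10xxxxxx 10xxxxxx           -- 3 bytes: U+0800..U+FFFF
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  -- 4 bytes: U+10000..U+10FFFF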
If you're talking about UTF-8, that's not quite how it works. It uses the high bits of each byte to indicate whether the character continues into the next byte, and can store one-, two-, three- or four-byte characters fairly efficiently. This is in contrast to UTF-32, where every character is automatically four bytes, which is obviously very wasteful for some types of text.
When using UTF-8, or any character set where characters are a variable number of bytes, there's a disconnect between the length of the string in bytes and its length in characters. In a fixed-length encoding like latin1, which is rigidly 8-bit, there's no such drift.
Internally, the database is mostly concerned with the length of a field in bytes. The length in characters is only explicitly exposed when calling functions like CHAR_LENGTH(); otherwise it's just a bunch of bytes that, if necessary, can be interpreted as a string.
Historically speaking, the database stored the length of a field in a single byte, followed by the data itself. That's why VARCHAR(255) is so prevalent: it's the longest string whose length you can represent in a single byte. Newer databases like Postgres allow >2 GB character fields, so they use four or more bytes to represent the length.
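You can observe these byte sequences from SQL itself (a MySQL-flavored sketch, assuming the connection character set is utf8mb4):

SELECT HEX('A'), HEX('é'), HEX('€');
-- 'A' -> 41      (one byte, high bit 0)
-- 'é' -> C3A9    (lead byte 110xxxxx, one continuation byte 10xxxxxx)
-- '€' -> E282AC  (lead byte 1110xxxx, two continuation bytes)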

MySQL size in this case?

If I create a utf8 column in MySQL and all the characters are ASCII minus 1, how many characters will I be able to store?
Can someone please explain this?
21,844 is the maximum for a VARCHAR in utf8, and that limit is in actual characters, not bytes.
From the MySQL documentation:
For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters. See Section E.7.4, "Limits on Table Column Count and Row Size".

MySQL stores VARCHAR values as a 1-byte or 2-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A VARCHAR column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
From Wikipedia:
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts and mathematical symbols.
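The 21,844 figure follows from the 65,535-byte row-size limit; here is a minimal sketch (MySQL; the table names t1 and t2 are made up):

-- Rows are capped at 65,535 bytes. utf8 needs up to 3 bytes per character,
-- plus a 2-byte length prefix: floor((65535 - 2) / 3) = 21,844.
CREATE TABLE t1 (c VARCHAR(21844) CHARACTER SET utf8);  -- works
CREATE TABLE t2 (c VARCHAR(21845) CHARACTER SET utf8);  -- fails: row size too large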

MySQL VARCHAR byte length 255 issue

According to the MySQL documentation:

A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.

and

The maximum row size constrains the number of columns because the total width of all columns cannot exceed this size. For example, utf8 characters require up to three bytes per character, so for a CHAR(255) CHARACTER SET utf8 column, the server must allocate 255 × 3 = 765 bytes per value. Consequently, a table cannot contain more than 65,535 / 765 = 85 such columns.
For clarity, what then is the maximum value I can set in the varchar argument so it only uses 1 byte to store its length?
From the MySQL documentation:
The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. For example, CHAR(30) can hold up to 30 characters.

A [VARCHAR] column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
This makes the answer to your question depend on the character encoding.
With a single-byte encoding like windows-1252 (which MySQL calls latin1), the character length is the same as the byte length, so you can use a VARCHAR(255).
With UTF-8, a VARCHAR(N) may require up to 3N bytes, as would be the case if all characters were in the range U+0800 to U+FFFF. Thus, VARCHAR(85) is the largest that guarantees a single-byte length prefix (requiring a maximum of 255 bytes).
(Note that MySQL apparently does not support characters outside the BMP. The official definition of UTF-8 allows 4 bytes per character.)
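A minimal sketch of those cut-off points (MySQL syntax; the table and column names are made up):

CREATE TABLE t (
  a VARCHAR(85)  CHARACTER SET utf8,    -- worst case 85 * 3 = 255 bytes: 1 length byte
  b VARCHAR(86)  CHARACTER SET utf8,    -- worst case 86 * 3 = 258 bytes: 2 length bytes
  c VARCHAR(255) CHARACTER SET latin1   -- worst case 255 bytes: 1 length byte
);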
For clarity, what then is the maximum value I can set in the varchar argument so it only uses 1 byte to store its length?
This depends on the collation of the VARCHAR column.
As you noted, utf8 may use up to three bytes per character, so if you declare a utf8 column more than 85 characters long, there is a chance that it will use more than 255 bytes to store its data, and hence the length must be stored in a two-byte field.
If you use latin1, each character is stored in 1 byte.
So the answer is either VARCHAR(85) COLLATE UTF8_GENERAL_CI or VARCHAR(255) COLLATE LATIN1_GENERAL_CI.
I think you're confusing string size with character representation. For instance, you could have a character that takes 4 bytes to represent and put it inside a string whose maximum storage size still requires only one byte to hold the length, because the value can never need more than 255 bytes.

SQL tables using VARCHAR with UTF-8 (with respect to multi-byte character length)

Like Oracle's VARCHAR(60 CHAR), I would like to specify a varchar field whose length is counted in characters, independent of how many bytes the inserted characters take.
For example, the following should be possible (with UTF-8 as the default charset of the database):
CREATE TABLE X (text VARCHAR(3));
INSERT INTO X (text) VALUES ('äöü');
On DB2 I get this error: DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001 (character data, right truncation occurred; for example, an update or insert value is a string that is too long for the column, or a datetime value cannot be assigned to a host variable, because it is too small).
I'm looking for solutions for DB2, MS SQL Server, MySQL, and Hypersonic (HSQLDB).
DB2
The DB2 documentation says:
In multibyte UTF-8 encoding, each ASCII character is one byte, but non-ASCII characters take two to four bytes each. This should be taken into account when defining CHAR fields. Depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes can contain anywhere from n/4 to n characters.
This means that with a DB2 database you can't do what you asked for.
MySQL
The MySQL documentation says:
UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include four-byte sequences. (An older standard for UTF-8 encoding, RFC 2279, describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
This means that with a MySQL database you can use VARCHAR(3) CHARACTER SET utf8 as your column definition to get what you asked for.
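That is, a sketch reusing the question's own table:

CREATE TABLE X (text VARCHAR(3) CHARACTER SET utf8);
INSERT INTO X (text) VALUES ('äöü');  -- 3 characters (6 bytes in utf8), fits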
For SQL Server, you'd need to use NVARCHAR (Unicode). Hopefully someone can chip in with the others!
For HSQLDB (Hypersonic), VARCHAR(3) works, as the default encoding is UTF-16.