After reading the documentation, I understood that a varying-length character value (VARCHAR) is stored with a one-byte or two-byte length prefix that records its length. I also understand that, for a VARCHAR, each character might have a different length in bytes depending on the character itself.
So my question is:
How does the DBMS determine each character's length after it's stored?
Meaning: after a string is stored, let's say it's 4 characters long, and let's suppose that the first character is 1 byte long, the second 2 bytes, the third 3 bytes and the fourth 4 bytes.
How does the DB know how long each character is when retrieving the string, so that it can read it correctly?
I hope the question is clear, sorry for any English mistakes I made. Thanks
The way UTF-8 works as a variable-length encoding is that the 1-byte characters can only use 7 bits of that byte.
If the high bit is 0, then the byte is a 1-byte character (which happens to be encoded in the same way as the 128 ASCII characters).
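If the high bit is 1, then the byte is part of a multi-byte character. The number of leading 1 bits in the first byte gives the total number of bytes (110xxxxx means 2, 1110xxxx means 3, 11110xxx means 4), and every following byte of that character starts with 10.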
(The byte layout table is at https://en.wikipedia.org/wiki/UTF-8)
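To make the rule concrete, here is a minimal Python sketch. It is only an illustration, not what any particular DBMS does internally. The helper names utf8_sequence_length and split_utf8 are made up for this example, and the input is assumed to be well-formed UTF-8 (the sketch does not do full validation).

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with lead_byte occupies."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx: 2-byte character
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx: 3-byte character
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx: 4-byte character
        return 4
    # 10xxxxxx is a continuation byte and can never start a character.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_utf8(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


# Four characters that take 1, 2, 3 and 4 bytes, as in the question.
stored = "a\u00e9\u20ac\U0001F600".encode("utf-8")
pieces = split_utf8(stored)
print(len(stored))                 # 10
print([len(p) for p in pieces])    # [1, 2, 3, 4]
```

So nothing extra has to be stored per character: the length of each character can be read off its first byte.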
If you're talking about UTF-8, that's not quite how it works. The leading bits of the first byte of each character say how many bytes that character occupies, and each following byte is marked as a continuation byte, so it can store one-, two-, three- or four-byte characters fairly efficiently. This is in contrast to UTF-32, where every character is always four bytes, which is very wasteful for some types of text.
When using UTF-8, or any character set where the characters are a variable number of bytes, there's a disconnect between the length of the string in bytes and the length of the string in characters. In a fixed-length system like Latin1, which is rigidly 8-bit, there's no such drift.
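A quick way to see the difference, sketched in Python rather than SQL (the sample string is just an illustration):

```python
text = "na\u00efve caf\u00e9"      # "naïve café", 10 characters

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(len(text))          # 10 characters
print(len(latin1_bytes))  # 10 bytes: Latin1 is always one byte per character
print(len(utf8_bytes))    # 12 bytes: the two accented letters take 2 bytes each
```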
Internally the database is most concerned with the length of a field in terms of bytes. The length in terms of characters is only explicitly exposed when calling functions like CHAR_LENGTH() (in MySQL, LENGTH() returns the byte count), as otherwise it's just a bunch of bytes that, if necessary, can be interpreted as a string.
Historically speaking, databases stored the length of a field in bytes in a single byte, followed by the data itself. That's why VARCHAR(255) is so prevalent: it's the longest string you can represent with a single-byte length field. Newer databases like Postgres allow character fields of up to about 1GB, so they use up to four bytes to represent the length.
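As a rough sketch of that layout in Python (a simplified model with a one-byte prefix, not the actual on-disk format of any database; encode_short_varchar and decode_short_varchar are hypothetical names):

```python
def encode_short_varchar(text: str) -> bytes:
    """Store text as: one length byte, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    if len(payload) > 255:
        # A single byte can only hold the values 0..255.
        raise ValueError("payload does not fit a one-byte length prefix")
    return bytes([len(payload)]) + payload


def decode_short_varchar(record: bytes) -> str:
    """Read the length byte, then exactly that many payload bytes."""
    length = record[0]
    return record[1:1 + length].decode("utf-8")


record = encode_short_varchar("caf\u00e9")
print(record[0])                     # 5: the prefix counts bytes, not characters
print(decode_short_varchar(record))  # café
```

Note that the prefix holds the byte count (5), even though the string has only 4 characters.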
I read a lot of articles and SO questions about the topic, but all they did was to confuse me more.
I'm trying to understand what the longest string that a VARCHAR can hold is, and how to define it.
Some places say that a VARCHAR can be created with a max length of 255 (i.e. VARCHAR(255)), but I don't understand whether that means 255 bytes or 255 characters.
Other places say that a VARCHAR can hold up to 8000 bytes, and that the length of the longest string then depends on the language (whether it needs 1 byte per character, such as Latin, or more).
In simple terms, what does the n in VARCHAR(n) stand for, and what is the range of n?
Is it bytes? Is it the number of characters? Between 0 and 255? Between 0 and 8000?
How is a really long text saved? Does it get split into multiple columns?
With a single-byte encoding, VARCHAR stores strings using 1 byte per character (as opposed to NVARCHAR, which uses 2 or more bytes per character). The documentation gives the details:
A common misconception is to think that CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n) the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored. This is similar to the definition of NCHAR(n) and NVARCHAR(n). The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n. However, for multi-byte encoding such as UTF-8, higher Unicode ranges (128-1,114,111) result in one character using two or more bytes. For example, in a column defined as CHAR(10), the Database Engine can store 10 characters that use single-byte encoding (Unicode range 0-127), but less than 10 characters when using multi-byte encoding (Unicode range 128-1,114,111). For more information about Unicode storage and character ranges, see Storage differences between UTF-8 and UTF-16.
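The CHAR(10) example can be modelled with a small Python sketch. This only illustrates the byte budget; it is not how the Database Engine itself handles over-long values, and chars_that_fit is a hypothetical helper.

```python
def chars_that_fit(text: str, byte_budget: int, encoding: str = "utf-8") -> int:
    """Count how many leading characters of text fit into byte_budget bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > byte_budget:
            break
        used += size
        count += 1
    return count


print(chars_that_fit("abcdefghij", 10))    # 10 (1 byte each)
print(chars_that_fit("\u00e9" * 10, 10))   # 5  (é takes 2 bytes in UTF-8)
print(chars_that_fit("\u20ac" * 10, 10))   # 3  (€ takes 3 bytes in UTF-8)
```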
I came across the statement below while studying HTML character sets and character encoding:
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
When we convert any decimal value from the ASCII character set to its binary equivalent, we get a binary number that is at most 7 bits long.
E.g. the capital English letter 'E' has the decimal value 69 in the ASCII table. If we convert 69 to its binary equivalent, we get the 7-bit binary number 1000101.
Then why does the ASCII table show it as the 8-bit binary number 01000101 instead of the 7-bit binary number 1000101?
This is contradictory to the statement
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
The above statement is saying that ASCII used 7 bits for the character.
Please clear up my confusion about the binary equivalent of a decimal value. Should I consider the 7-bit or the 8-bit binary equivalent of a decimal value from the ASCII table? Please explain in easy-to-understand language.
Again, consider the below statement :
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
According to the above statement, how does the number of characters (128) that ASCII supports relate to the fact that ASCII uses 7 bits to represent any character?
Please clear up the confusion.
Thank You.
In most processors, memory is byte-addressable and not bit-addressable. That is, a memory address gives the location of an 8-bit value. So, almost all data is manipulated in multiples of 8 bits at a time.
If we were to store a value that has by its nature only 7 bits, we would very often use one byte per value. If the data is a sequence of such values, as text might be, we would still use one byte per value to make counting, sizing, indexing and iterating easier.
When we describe the value of a byte, we often show all of its bits, either in binary or hexadecimal. If a value is some sort of integer (say of 1, 2, 4, or 8 bytes) and its decimal representation would be more understandable, we would write the decimal digits for the whole integer. But in those cases, we might lose the concept of how many bytes it is.
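A short Python sketch of the 'E' example from the question may help. It shows that the 7-bit and 8-bit forms are the same number, with the 8-bit form simply padded with a leading zero to fill a whole byte:

```python
code = ord("E")

print(code)                 # 69
print(format(code, "b"))    # 1000101   (7 bits are enough for the value)
print(format(code, "08b"))  # 01000101  (same value shown as a full byte)

# Both spellings denote the same number.
print(int("1000101", 2) == int("01000101", 2))  # True

# 7 bits give 2**7 distinct patterns, hence 128 ASCII characters (0..127).
print(2 ** 7)               # 128
```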
BTW—HTML doesn't have anything to do with ASCII. And, Extended ASCII isn't one encoding. The fundamental rule of character encodings is to read (decode) with the encoding the text was written (encoded) with. So, a communication consists of the transferring of bytes and a shared understanding of the character encoding. (That makes saying "Extended ASCII" so inadequate as to be nearly useless.)
An HTML document represents a sequence of Unicode characters. So, one of the Unicode character encodings (UTF-8) is the most common encoding for an HTML document. Regardless, after it is read, the result is Unicode. An HTML document could be encoded in ASCII but, why do that? If you did know it was ASCII, you could just as easily know that it's UTF-8.
Outside of HTML, ASCII is used billions—if not trillions—of times per second. But, unless you know exactly how it pertains to your work, forget about it, you probably aren't using ASCII.
I understood that in a database an int takes less space than a string. But what if the int is longer than the string when written out? For example, 9.455.487 vs "John". Which one will take more space? Thank you.
From the documentation, size of int is 4 bytes, whereas for char it is "M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set." and M is the declared column size.
So when you talk of how much space is taken, the int will take up 4 bytes for a value as long as the value is within the range of int. A string like "John", if declared as char(4) will take up 4 * w bytes, so at least 4 bytes assuming w is 1.
Long story short, the size of a number is not how many characters long it is when you write it out, but the number of bytes to represent it in the binary form.
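To illustrate, here is a small Python sketch. It uses the struct module to produce a 4-byte integer; this mimics the size of an INT, not MySQL's actual storage code.

```python
import struct

as_int = struct.pack("<i", 9455487)    # the number as a 4-byte signed integer
as_text = "9455487".encode("ascii")    # the same digits stored as text
name = "John".encode("ascii")          # the string, 1 byte per character here

print(len(as_int))   # 4
print(len(as_text))  # 7
print(len(name))     # 4

# The largest value of a signed 4-byte integer still takes 4 bytes.
print(len(struct.pack("<i", 2147483647)))  # 4
```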
You should be aware of what an "int" (integer) is and what strings are. An integer always has a fixed length, and that length is the number of bytes in its binary representation. Strings, on the other hand, are sequences of bytes, so depending on the encoding, each symbol may be one or more bytes.
The fact that 9.455.487 looks "longer" than "John" is irrelevant here. What is relevant is how the DBMS (or whatever other environment) represents those things. You're seeing a "longer" integer versus a "shorter" string, but that is only a matter of "screen" representation (i.e. what you see on the screen).
To answer the question: for MySQL, INT is 4 bytes, while string data types such as VARCHAR may have dynamic length. The fixed-length string data type is CHAR, and from that viewpoint your number and your string will have the same length (4 bytes). Strings and integers are just different things to compare for "length", and the visual representation should not confuse you. These entities have different internal structures and therefore should not be compared on "length" according to their visual representation.
Also, you should be aware that an integer will not always be 4 bytes long, since even in MySQL your number may belong to, for example, the BIGINT data type (which is 8 bytes long). And, as mentioned above, for strings there's also the encoding issue. For instance, a UTF-8 encoded string may use two (or even more) bytes to represent some non-ASCII symbols. In this case each such symbol will add more than 1 byte to the total string length.
If I create a utf-8 column in MySQL and all the characters are ascii minus 1, how many characters will I be able to store?
Can someone please explain this?
21,844 will be the maximum for a VARCHAR in utf8.
That's actual characters, not bytes.
From the MySQL documentation:
For example, utf8 characters can require up to three bytes per
character, so a VARCHAR column that uses the utf8 character set can be
declared to be a maximum of 21,844 characters. See Section E.7.4,
“Limits on Table Column Count and Row Size”.
MySQL stores VARCHAR values as a 1-byte or 2-byte length prefix plus
data. The length prefix indicates the number of bytes in the value. A
VARCHAR column uses one length byte if values require no more than 255
bytes, two length bytes if values may require more than 255 bytes.
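The 21,844 figure can be reproduced with a little arithmetic, sketched here in Python. The assumptions are mine: the 65,535-byte row size limit from the MySQL documentation, a table whose only column is this VARCHAR, and a 2-byte length prefix. The function names are hypothetical.

```python
MAX_ROW_BYTES = 65_535   # MySQL row size limit (assumed: one column uses it all)


def length_prefix_bytes(max_value_bytes: int) -> int:
    """1 length byte if values need at most 255 bytes, otherwise 2."""
    return 1 if max_value_bytes <= 255 else 2


def max_varchar_chars(bytes_per_char: int) -> int:
    """Largest declarable length when each character may need bytes_per_char."""
    return (MAX_ROW_BYTES - 2) // bytes_per_char


print(max_varchar_chars(3))            # 21844 (utf8, up to 3 bytes per character)
print(max_varchar_chars(1))            # 65533 (single-byte character set)
print(length_prefix_bytes(255))        # 1
print(length_prefix_bytes(21844 * 3))  # 2
```

Note that the declared maximum is computed from the worst case (3 bytes per character), so it does not grow just because the stored text happens to be all ASCII.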
From Wikipedia:
The first 128 characters (US-ASCII) need one byte. The next 1,920
characters need two bytes to encode. This covers the remainder of
almost all Latin alphabets, and also Greek, Cyrillic, Coptic,
Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as
Combining Diacritical Marks. Three bytes are needed for characters in
the rest of the Basic Multilingual Plane (which contains virtually all
characters in common use[11]). Four bytes are needed for characters in
the other planes of Unicode, which include less common CJK characters
and various historic scripts and mathematical symbols.
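The ranges in that quote can be checked with a short Python sketch that encodes the first and last code point of each range (utf8_len is just a helper name for this example):

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 needs for the given Unicode code point."""
    return len(chr(code_point).encode("utf-8"))


# First and last code point of each range.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_len(cp) for cp in boundaries]
print(lengths)         # [1, 1, 2, 2, 3, 3, 4, 4]

# The two-byte range U+0080..U+07FF holds the "next 1,920 characters".
print(0x800 - 0x80)    # 1920
```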
Can MySQL VARCHARs store any char, like nulls and newlines?
MySQL can store any chars, but depending on the specified encoding, some characters may be misrepresented if different from your display encoding.
VARCHARs can contain any chars, even 0x00, since a VARCHAR is not a null-terminated string but a length-specified string, where the length is stored at the beginning of the string as a 1- or 2-byte value.
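Here is a small Python sketch of why the length prefix matters for embedded 0x00 bytes. It is a simplified model with a 1-byte prefix, not MySQL's actual storage code.

```python
value = b"ab\x00cd\n"    # contains a NUL byte and a newline

# A null-terminated reader stops at the first 0x00 and loses the rest.
c_style = value.split(b"\x00", 1)[0]
print(c_style)           # b'ab'

# A length-prefixed reader takes exactly as many bytes as the prefix says.
record = len(value).to_bytes(1, "big") + value
length = record[0]
restored = record[1:1 + length]
print(length)            # 6
print(restored == value) # True
```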
I'm not sure about all the details of CHAR columns, but as far as I know they are not null terminated either: MySQL right-pads CHAR values with spaces to the declared length (so storage still takes up the full length) and strips the trailing spaces on retrieval.
If you want to store true byte data though, you should use the BLOB type. That is also a length specified type, which is guaranteed to contain whatever data you put into it, and is not affected by encodings at all.
You should probably use the BINARY and VARBINARY data types for storing strings of non-printable characters, especially if you want strings of single bytes instead of actual text characters (which can be multi-byte).