We have some MySQL tables with 100,000 to 10,000,000 records. Some of the fields are VARCHAR(100) when in fact no entry exceeds 11 characters.
Clearly we are using up way more space than we should be... If one VARCHAR(100) field for a million-record table uses 100MB of space, then we might be wasting as much as several GB of space.
If we were to streamline these tables, and reduce the VARCHAR fields to their proper size, would it help us with more than just storage space? Could it possibly improve the lookup times for queries?
According to the MySQL documentation on data type storage requirements, the VARCHAR type stores values as follows:
L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes, where L represents the actual length in bytes of a given string value
It seems to me that changing the type from VARCHAR(100) to VARCHAR(11) will not affect query performance, because MySQL already stores each value in only as many bytes as it needs.
If the column were CHAR(100), strings shorter than 100 characters would be right-padded with spaces; in that case you would waste space and, I think, get worse query performance too.
The length of CHAR type, referring the documentation, is:
M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set, where M represents the declared column length in characters
But if all your values have a fixed length of 11, you could use CHAR(11), which should improve both storage and query performance.
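To put rough numbers on the two quoted formulas, here is a back-of-the-envelope sketch in Python. It assumes a single-byte character set such as latin1, a million rows that each hold an 11-character value, and it ignores per-row and per-page overhead. The helper names are made up for illustration.

```python
def varchar_bytes(value_len_bytes: int, max_len_bytes: int) -> int:
    """Bytes for one VARCHAR value: data plus a 1- or 2-byte length prefix."""
    prefix = 1 if max_len_bytes <= 255 else 2
    return value_len_bytes + prefix


def char_bytes(declared_chars: int, bytes_per_char: int = 1) -> int:
    """Bytes for one CHAR(M) value: M x w, regardless of the actual content."""
    return declared_chars * bytes_per_char


ROWS = 1_000_000  # assumed table size
VALUE_LEN = 11    # assumed: every value is 11 single-byte characters

for name, size in [
    ("VARCHAR(100)", varchar_bytes(VALUE_LEN, 100)),
    ("VARCHAR(11)", varchar_bytes(VALUE_LEN, 11)),
    ("CHAR(100)", char_bytes(100)),
    ("CHAR(11)", char_bytes(11)),
]:
    print(f"{name:<13} {size:>3} bytes/row, {size * ROWS / 1e6:.0f} MB total")
```

Under these assumptions both VARCHAR declarations come out at 12 bytes per row, CHAR(100) at 100 bytes and CHAR(11) at 11 bytes.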
Another important point about string storage concerns the character set. As the documentation says:
To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multi-byte characters. In particular, when using the utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes and can require up to three bytes per character.
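A quick way to see the multi-byte effect is to encode a few characters and count the bytes. This sketch uses Python's UTF-8 codec as a stand-in; the sample characters are arbitrary, and note that Python's codec can emit four-byte sequences that MySQL's three-byte utf8 character set cannot store.

```python
def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters picked for illustration: ASCII, Latin-1 accent, euro sign.
for ch in ["a", "é", "€"]:
    print(f"{ch!r}: {utf8_length(ch)} byte(s)")
```

So a column that holds 11 characters can need anywhere from 11 to 33 bytes of data, depending on which characters are stored.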
Hope it helps!
I don't know the specifics of the MySQL implementation, but I do know the typical implementation of a relational database, and in that implementation it does help.
Typically, records are stored consecutively in a file called a RID table. The record number in the RID table (using zero based counting) times the record size is an offset to where in the file the record is stored.
If the record size is smaller, then more records from the RID table fit into a disk sector fetched from the disk and more records fit into memory.
Even with a different implementation, a smaller record buffer allows more records to be cached in memory, which can reduce the number of disk accesses.
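The arithmetic behind this is simple enough to sketch. The code below models the generic fixed-size-record layout described above, not MySQL's actual storage; the block size and the two record sizes are assumed values chosen only for illustration.

```python
def record_offset(record_number: int, record_size: int) -> int:
    """Byte offset of a record in a file of fixed-size records (zero-based)."""
    return record_number * record_size


def records_per_block(block_size: int, record_size: int) -> int:
    """How many whole fixed-size records fit in one block."""
    return block_size // record_size


BLOCK_SIZE = 4096  # assumed block size in bytes

# Hypothetical record sizes before and after shrinking a wide column.
for record_size in (120, 31):
    print(
        f"record size {record_size}: "
        f"record 1000 starts at byte {record_offset(1000, record_size)}, "
        f"{records_per_block(BLOCK_SIZE, record_size)} records per block"
    )
```

With these numbers the smaller record fits roughly four times as many rows into each block, which is where the reduction in disk accesses would come from.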
I have heard that in MsSQL/Access databases, if you declare a varchar of length 100, it reserves those 100 chars in every row, even if there is only one char in that column.
I have two questions about this.
First: is this true? And if yes, does this also work like this in MySQL?
Why I'm asking this:
I work a lot with MySQL, and I came across a database table with 128 LONGTEXT columns. The reasoning behind this was "We cannot be certain how much data gets stored in these columns. Sometimes it's 1 char, sometimes thousands." I was wondering if this is the right approach storage-wise, or whether he has to make some changes.
No, VARCHAR is meant for variable length text, while CHAR is fixed length. The number parameter is the character limit for the text but VARCHAR only uses up as much space as the actual characters you enter in that row (+ some bytes to store the length used).
MySQL, Microsoft SQL Server and pretty much all relational databases work the same way with VARCHAR. Every column takes up some minimum amount of space in a row, but with VARCHAR that is the bytes needed to store the text plus the bytes needed to store the length of the text. No text entered would mean just 1 or 2 bytes used to save '0' as the length.
If you don't know how much text data will be entered, then use LONGTEXT in MySQL or NVARCHAR(MAX) in MS-SQL. These data types let you store very large amounts of text efficiently (up to 4 GB for LONGTEXT and 2 GB for NVARCHAR(MAX)). They are essentially bigger versions of the standard VARCHAR.
For SQL Server the answer is no. From the documentation on MSDN:
varchar [ ( n | max ) ]
Variable-length, non-Unicode string data. n defines the string length and can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual length of the data entered + 2 bytes. The ISO synonyms for varchar are char varying or character varying.
It is possible someone was confusing VARCHAR and CHAR. The CHAR data type requires a fixed amount of storage, based on the maximum allowed size.
EDIT
Rereading your question I'm not entirely sure I've followed your meaning. If you were not referring to the required storage space then please disregard.
I have a MySQL database table that has a column of type varchar(386). I chose this number of characters because I counted the characters of the longest entry beforehand. I have 400,000 entries currently, but it is expected to increase with time.
I ran a few tests and found that about 390,000 entries use 60 or fewer characters, whereas the last 10,000 entries use up to 386 characters.
Should I separate the 10,000 large entries into a separate table? How would I go about implementing that? Would this increase my querying speed efficiency in the long run?
VARCHAR is stored inline with the table. VARCHAR is faster when the size is reasonable; which approach is faster in your case depends on your data and your hardware, so you'd want to benchmark a real-world scenario with your data.
The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes, which is shared among all columns.
For example, a VARCHAR(255) column can hold a string with a maximum length of 255 characters. Assuming that the column uses the latin1 character set (one byte per character), the actual storage required is the length of the string (L), plus one byte to record the length of the string. For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: The length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes).
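The 'abcd' example from the documentation can be reproduced with a few lines of Python. This is only a sketch of the quoted rule: the function name and parameters are made up, and Python's "utf-16-be" codec is used as a stand-in for ucs2 (the two agree for characters in the Basic Multilingual Plane).

```python
def varchar_storage(value: str, max_chars: int, encoding: str,
                    max_bytes_per_char: int) -> int:
    """Data bytes plus the length prefix, following the quoted rule.

    The prefix is 1 byte when the column can never need more than
    255 bytes, otherwise 2 bytes.
    """
    data = value.encode(encoding)
    prefix = 1 if max_chars * max_bytes_per_char <= 255 else 2
    return len(data) + prefix


# VARCHAR(255) with latin1: 4 data bytes + 1 length byte
print(varchar_storage("abcd", 255, "latin-1", 1))

# VARCHAR(255) with ucs2: 8 data bytes + 2 length bytes
print(varchar_storage("abcd", 255, "utf-16-be", 2))
```

The two calls print 5 and 10, matching the figures in the documentation excerpt.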
For larger data, consider using TEXT or BLOB. TEXT and BLOB columns are implemented differently in the NDB storage engine, wherein each row in a TEXT column is made up of two separate parts. One of these is of fixed size (256 bytes), and is actually stored in the original table. The other consists of any data in excess of 256 bytes, which is stored in a hidden table. The rows in this second table are always 2,000 bytes long. This means that the size of a TEXT column is 256 if size <= 256 (where size represents the size of the row); otherwise, the size is 256 + size + (2000 – (size – 256) % 2000).
http://dev.mysql.com/doc/refman/5.6/en/storage-requirements.html
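The NDB sizing rule quoted above can be written out directly. This is just a literal transcription of the formula from the excerpt into Python, with a made-up function name, in case you want to plug in your own sizes.

```python
def ndb_text_size(size: int) -> int:
    """Storage for one TEXT value in NDB, per the formula quoted above.

    Values up to 256 bytes stay in the fixed 256-byte part; anything
    beyond that spills into 2,000-byte rows of a hidden table.
    """
    if size <= 256:
        return 256
    return 256 + size + (2000 - (size - 256) % 2000)


for size in (100, 1000, 5000):  # example sizes in bytes
    print(size, ndb_text_size(size))
```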
It depends on how your data is related and queried. If you rarely use that field in queries (for example, it only holds additional info), moving it into a separate table (normalizing) is a good option.
NOTE: VARCHAR is different from CHAR. If you create a VARCHAR(250) column and insert just 20 characters, it takes L + 1 bytes (21 bytes in a single-byte character set), whereas a CHAR(250) column takes 250 bytes for the same value.
Just because the field is a varchar(386) doesn't mean it takes up that much space for every row. If most of your data is 60 characters or fewer, then those records will only use 60 or fewer characters for that column.
I think you're safe leaving that column in your table if that makes sense for your logical data model.
I saw the comment "If you have 50 million values between 10 and 15 characters in a varchar(20) column, and the same 50 million values in a varchar(50) column, they will take up exactly the same space. That's the whole point of varchar, as opposed to char.". Can anybody tell me the reason? See What is a reasonable length limit on person "Name" fields?
MySQL offers a choice of storage engines. The physical storage of data depends on the storage engine.
MyISAM Storage of VARCHAR
In MyISAM, VARCHARs typically occupy just the actual length of the string plus a byte or two of length. This is made practical by the design limitation of MyISAM to table locking as opposed to a row locking capability. Performance consequences include a more compact cache profile, but also more complicated (slower) computation of record offsets.
(In fact, MyISAM gives you a degree of choice between fixed physical row size and variable physical row size table formats depending on column types occurring in the whole table. Occurrence of VARCHAR changes the default method only, but the presence of a TEXT blob forces VARCHARs in the same table to use the variable length method as well.)
The physical storage method is particularly important with indexes, which is a different story than tables. MyISAM uses space compression for both CHAR and VARCHAR columns, meaning that shorter data take up less space in the index in both cases.
InnoDB Storage of VARCHAR
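InnoDB, like most other current relational databases, uses a more sophisticated mechanism. VARCHAR columns whose maximum width is less than 768 bytes are stored inline in the row, occupying the actual length of the data plus the length bytes described below; longer values can be moved partly to overflow pages. The documentation describes the row format more precisely: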
For each non-NULL variable-length field, the record header contains the length of the column in one or two bytes. Two bytes will only be needed if part of the column is stored externally in overflow pages or the maximum length exceeds 255 bytes and the actual length exceeds 127 bytes. For an externally stored column, the two-byte length indicates the length of the internally stored part plus the 20-byte pointer to the externally stored part. The internal part is 768 bytes, so the length is 768+20. The 20-byte pointer stores the true length of the column.
InnoDB currently does not do space compression in its indexes, the opposite of MyISAM as described above.
Back to the question
All of the above is, however, just an implementation detail that may even change between versions. The true difference between CHAR and VARCHAR is semantic, and so is the one between VARCHAR(20) and VARCHAR(50). By ensuring that there is no way to store a 30-character string in a VARCHAR(20), the database makes life easier and better defined for the various processors and applications that it supposedly integrates into a predictably behaving solution. This is the big deal.
Regarding personal names specifically, this question may give you some practical guidance. People with full names over 70 UTF-8 characters are in trouble anyway.
Yes, that is indeed the whole point of VARCHAR. It only takes up as much space as the text is long.
If you had CHAR(50), it would take up 50 bytes (or characters) no matter how short the data really is (it would be padded, usually by spaces).
Can Anybody tell me the reason?
Because people thought it was wasteful to store a lot of useless padding, they invented VARCHAR.
The manual states:
The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. (...)
In contrast to CHAR, VARCHAR values are stored as a one-byte or two-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
Notice that VARCHAR(255) is not the same as VARCHAR(256): in a single-byte character set, the former needs one length byte and the latter needs two.
This is theory. As habeebperwad suggests, the actual footprint of one row depends on (engine) page size and (hard disk) block size.
I've begun to get very interested in the memory usage of MySQL. So I'm looking at this here:
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
I get very excited about the prospect of saving memory by (for example) needing only a signed smallint where I was using an unsigned int in many places. Then I read about varchars...
"VARCHAR(M) - Length + 1 bytes if column values require 0 – 255 bytes"
What?! Now it appears to me as though storing a single varchar would use up so much memory, that I may as well not even get excited with my int vs. smallint because it's vastly overshadowed by the varchar field. So I come here asking if this is true, because it simply can't be? Are varchars really that terrible? Or should I really not be getting excited at all for my smallint discovery?
edit: Sorry! I should've been more clear. So, let's say I store a varchar with 7 characters, meaning 8 bytes. That means, then, that it uses the same as a number stored in a BIGINT column? That's what I'm concerned about.
What this is saying is that for a given string length, the amount of storage used is equal to the length of the string in bytes, plus one byte to tell MySQL how long the string is.
So for instance, the word "automobile" is 10 bytes (1 for each character), so if it is stored in a varchar column it will take up 11 bytes: 1 for the number 10, and 1 for each of the characters in the string.
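To see how the integer and string savings compare, here is a small per-row estimate in Python. The integer sizes are the ones listed on the MySQL storage requirements page; the function name is made up, and the sketch assumes a single-byte character set, VARCHAR columns that need only a one-byte length prefix, and no other row overhead.

```python
# Storage size in bytes of MySQL integer types.
INTEGER_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}


def row_bytes(int_types, varchar_values) -> int:
    """Rough data size of one row: integer columns plus VARCHAR values.

    Each VARCHAR value costs its byte length plus a one-byte prefix
    (assumes latin1 and a column maximum of at most 255 bytes).
    """
    ints = sum(INTEGER_BYTES[t] for t in int_types)
    strings = sum(len(v.encode("latin-1")) + 1 for v in varchar_values)
    return ints + strings


print(row_bytes(["INT"], ["automobile"]))       # INT plus the example word
print(row_bytes(["SMALLINT"], ["automobile"]))  # same row with a SMALLINT
print(row_bytes([], ["1234567"]), row_bytes(["BIGINT"], []))
```

A 7-character VARCHAR value and a BIGINT both come out at 8 bytes here, and switching INT to SMALLINT still saves 2 bytes on every row regardless of what the string columns cost.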
From the link you posted:
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
The storage requirements depend on these factors:
- The actual length of the column value
- The column's maximum possible length
- The character set used for the column, because some character sets contain multi-byte characters
For example, a VARCHAR(255) column can hold a string with a maximum length of 255 characters. Assuming that the column uses the latin1 character set (one byte per character), the actual storage required is the length of the string (L), plus one byte to record the length of the string. For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: The length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes).
While I am no MySQL DBA, it appears there is a very simple answer to this question, and no need to go deeper into storage sizes, because it is NOT configurable.
Per the MySQL MEMORY storage engine documentation:
MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length.
Thus, you won't have any specific gains by using VARCHAR for a table using the MEMORY storage engine, no matter how VARCHAR is stored on other storage engines such as MyISAM or InnoDB.
How does MySQL store a varchar field? Can I assume that the following pattern represents sensible storage sizes :
1,2,4,8,16,32,64,128,255 (max)
A clarification via example: let's say I have a varchar field of 20 characters. Does MySQL, when creating this field, basically reserve space for 32 bytes (not sure if they are bytes or not) but only allow 20 to be entered?
I guess I am worried about optimising disk space for a massive table.
To answer the question: on disk, MySQL uses one length byte plus the bytes actually stored in the field. If the column is declared varchar(45) and the value is "FooBar", it uses 7 bytes on disk (or 13 bytes with a character set that stores each of these characters in two bytes, such as ucs2). So however you declare your columns, it won't make a difference on the storage end (you stated you are worried about disk optimization for a massive table). It does make a difference in queries, though: VARCHARs are converted to CHARs when MySQL builds a temporary table (for sorting, ORDER BY, etc.), and the more records you can fit into a single page, the less memory you use and the faster your table scans will be.
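Here is a rough sketch of why the declared length can matter for temporary tables. It assumes, as described above, that each VARCHAR is held at its full declared width there; the 16 MB budget and the helper names are assumptions for illustration, not MySQL settings.

```python
def fixed_width_row_bytes(declared_lengths, bytes_per_char: int = 1) -> int:
    """Row width when every VARCHAR is stored at its declared length."""
    return sum(n * bytes_per_char for n in declared_lengths)


def rows_in_buffer(buffer_bytes: int, declared_lengths,
                   bytes_per_char: int = 1) -> int:
    """How many fixed-width rows fit into a memory budget."""
    return buffer_bytes // fixed_width_row_bytes(declared_lengths, bytes_per_char)


BUFFER = 16 * 1024 * 1024  # assumed 16 MB budget for an in-memory temporary table

for declared in (100, 11):
    print(f"varchar({declared}): {rows_in_buffer(BUFFER, [declared])} rows fit")
```

With the same data, the narrower declaration lets about nine times as many rows fit in the same memory, so fewer temporary tables would need to spill to disk.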
MySQL stores a varchar field as a variable length record, with either a one-byte or a two-byte prefix to indicate the record size.
Having a pattern of storage sizes doesn't really make any difference to how MySQL will function when dealing with variable length record storage. The length specified in a varchar(x) declaration will simply determine the maximum length of the data that can be stored. Basically, a varchar(16) is no different disk-wise than a varchar(128).
This manual page has a more detailed explanation.
Edit: With regards to your updated question, the answer is still the same. A varchar field will only use up as much space on disk as the data you store in it (plus a one or two byte overhead). So it doesn't matter if you have a varchar(16) or a varchar(128), if you store a 10-character string in it, you're only going to use 10 bytes (plus 1 or 2) of disk space.