varchar maximum length for index with InnoDB and UTF-8 - mysql

I am reading that MySQL 5.6 can only index the first 767 bytes of a varchar (or other text-based types). My schema character set is utf-8, so each character can take up to 3 bytes. Since 767/3 = 255.66, this would indicate that the maximum length for a text column that needs to be indexed is 255 characters. Experience seems to confirm this, as the following goes through:
create table gaga (
val varchar(255),
index(val)
) engine = InnoDB;
But changing the definition of val to varchar(256) yields an "Error Code: 1071. Specified key was too long; max key length is 767 bytes".
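The arithmetic in the question can be checked with a short Python sketch. The 767-byte limit is the figure from the error message, the byte widths are the worst-case sizes per character for each character set, and the function name is made up for illustration.

```python
# Worst-case bytes per character for some MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def max_indexable_chars(limit_bytes, charset):
    """Longest column (in characters) whose worst case still fits in limit_bytes."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    # Prints 767 for latin1, 255 for utf8 and 191 for utf8mb4.
    print(charset, max_indexable_chars(767, charset))
```

This is why varchar(255) is accepted and varchar(256) is rejected under utf8: 256 characters could need 768 bytes.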
In this day and age, a limit of 255 characters seems awfully low, so: is this correct? If it is, what is the best way to get larger pieces of text indexed with MySQL? (Should I avoid it? Store a SHA? Use another sort of index? Use another database character encoding?)

Though the limitation might seem ridiculous, it makes you think over whether you really need the index for such a long varchar field. Even with 767 bytes the index size grows very fast, and for a large table (where it is most useful) it most probably won't fit into memory.
On the other hand, the only frequent case, at least in my experience, where I needed to index a long varchar field was a unique constraint. In all those cases a composite index of some group id and an MD5 of the varchar field was sufficient. The only problem is mimicking a case-insensitive collation (which treats accented and unaccented characters as equal), though in all my cases I used a binary collation anyway, so it was not a problem.
UPD. Another frequent case for indexing a long varchar is ordering. For this case I usually define a separate indexed sorter field which is a prefix of 5-15 characters, depending on data distribution. For me, a compact index is preferable even if the ordering is occasionally inaccurate.
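A minimal sketch of the hashing idea, done on the application side in Python. The function names are hypothetical, and the folding step is only a rough stand-in for a case- and accent-insensitive collation; it is an assumption for illustration and does not reproduce MySQL's collation rules exactly.

```python
import hashlib
import unicodedata

def fold(text):
    """Strip combining accents and case so that 'Café' and 'cafe' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold()

def unique_key(text, case_insensitive=False):
    """Return a 32-character hex MD5 digest that fits a short fixed-width indexed column."""
    if case_insensitive:
        text = fold(text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# With binary semantics the two spellings hash differently;
# with folding they collapse to the same key.
print(unique_key("Café") == unique_key("cafe"))
print(unique_key("Café", case_insensitive=True) == unique_key("cafe", case_insensitive=True))
```

The digest would be stored next to the group id and the unique index placed on that pair, so the index width no longer depends on the length of the text.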

Related

What is the maximum optimal size of MYSQL PRIMARY CHAR KEY

My main issue is that after we expanded CHAR(50) to CHAR(64) we started receiving timeouts on internal backup queries. The record size is a few kb and the database is very very big, so this column that is a primary key must be the reason for our trouble.
I searched through the internet but I found only selecting the type of the keys or comparison of CHAR vs VARCHAR but nothing about the optimal size.
For example, is there some special optimization in MySQL such that for indices smaller than, let's say, 60 bytes it uses some form of caching, while for larger ones it starts swapping stuff?
Any help would be appreciated. Even those suggesting there is no difference and simply the % of time spent on join was increased by % of the index size.
EDIT
THIS IS NOT THE ANSWER FITTING THE QUESTION however I have found out the reason our change got a HUGE performance hit.
We expanded column using
ALTER TABLE table MODIFY job_id CHAR(64);
This caused the CHARACTER SET to fall back to the default one (utf8mb4), dropping the previous latin1.
That would conclude my research, but I will leave this question open for anyone who is able to answer what impact resizing a key column has.
This question does look for suggestion on type change.
Thank you all for their time and inputs!
Short answer: There is no caching/swapping/optimal/etc size.
Long answer:
Don't use CHAR unless the data for that column really is fixed length -- such as country_code, postal_code, UUID, SSN, etc. Furthermore, use the minimal charset needed, such as ascii for those. CHAR wastes space by padding with spaces.
There is no such cutoff. There is no inherent problem in having a long PRIMARY KEY except that ...
Every secondary key has a copy of the PK. (This says the break-even is at 2 indexes; for more than 2, the extra bulk in secondary keys adds up for long PKs.)
Columns in other tables that need to JOIN to this PK (with or without a FOREIGN KEY declared) will be bulkier than an INT.
Many users (or 3rd party software that generates SQL) blindly use 8-byte BIGINT for ids. Even the 4-byte INT is usually overkill; see the smaller INT types.
Indexes are limited, but many things factor in:
767 / 1000 / 3072 bytes depending on engine and version
Character set of char/varchar: CHAR/VARCHAR(50) may take 50 / 100 / 150 / 200 bytes, depending on charset.
InnoDB does have caching: its buffer_pool is limited by innodb_buffer_pool_size, which should be set to something like 70% of RAM. This implies that the bigger a table or index is, the more I/O is likely to be done.
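The charset factor in the list above is plain multiplication. The sketch below uses the worst-case byte widths for four MySQL character sets; the function name is made up, and the CHAR(50) latin1 versus CHAR(64) utf8mb4 comparison is taken from the question's EDIT.

```python
# Worst-case bytes per character.
MAX_BYTES_PER_CHAR = {"latin1": 1, "ucs2": 2, "utf8": 3, "utf8mb4": 4}

def max_column_bytes(n_chars, charset):
    """Upper bound on the bytes a CHAR/VARCHAR(n_chars) value can occupy."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    # 50, 100, 150 and 200 bytes respectively.
    print(f"CHAR(50) {charset}: up to {max_column_bytes(50, charset)} bytes")

# The accidental change described in the question's EDIT:
before = max_column_bytes(50, "latin1")   # 50
after = max_column_bytes(64, "utf8mb4")   # 256
print(before, after)
```

So the key did not grow from 50 to 64 bytes; in the worst case it grew about fivefold, and every secondary index carries a copy of it.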
Bottom line: Your timeouts are coming from other things. Consider increasing the timeout.
Also
When doing ALTER TABLE ... MODIFY COLUMN ..., you must specify all the characteristics of the column, specifically including the ones you are not changing. These include CHARACTER SET, COLLATION, [NOT] NULL, DEFAULT, etc.
I like to do SHOW CREATE TABLE to get the current definition for the columns, then copy the one I want to change into my fresh ALTER, modifying the one thing I am changing.

MySQL create index error

The issue is creating an index on table 'visits_visit' (Django visit app), because every query takes at least 60 ms and it is going to get worse.
CREATE INDEX resource ON visits_visit (object_app(200), object_model(200), object_id(200));
It returns:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
What to do? Structure of table is on the screenshot.
See the reference to a possible duplicate question already answered in comments under your question. Or should I say a canonical duplicate target to close this question to if it does close. That said, not much there in that reference in terms of storage engines or character sets.
In your case the character set factors in with the use of string-type columns in your composite index.
A side note is certainly performance. Don't expect a great one in general with what you are attempting. Your index is way too wide and may very well not even be of the intended use. Indexes and their benefit need careful scrutiny. This can be ascertained with the use of mysql explain. See the following, in particular the General Comments section.
Please see the following article Using Innodb_large_prefix to Avoid ERROR 1071 and below is an excerpt.
The character limit depends on the character set you use. For example
if you use latin1 then the largest column you can index is
varchar(767), but if you use utf8 then the limit is varchar(255).
There is also a separate 3072 byte limit per index. The 767 byte limit
is per column, so you can include multiple columns (each 767 bytes or
smaller) up to 3072 total bytes per index, but no column longer than
767 bytes. (MyISAM is a little different. It has a 1000 byte index
length limit, but no separate column length limit within that). One
workaround for these limits is to only index a prefix of the longer
columns, but what if you want to index more than 767 bytes of a column
in InnoDB? In that case you should consider using innodb_large_prefix,
which was introduced in MySQL 5.5.14 and allows you to include columns
up to 3072 bytes long in InnoDB indexes. It does not affect the index
limit, which is still 3072 bytes.
Also see the Min and Max section from the Mysql Manual Page Limits on InnoDB Tables
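To see why the CREATE INDEX from the question hits the 1000-byte limit, the key width can be estimated in Python. This is a rough sketch: it assumes 3 bytes per character (utf8) and ignores any per-column overhead the engine adds, such as length bytes or a NULL flag, so the real ceiling may be slightly lower. The function names are hypothetical.

```python
def key_bytes(prefix_chars, bytes_per_char):
    """Worst-case width of a composite key given each column's prefix length."""
    return sum(n * bytes_per_char for n in prefix_chars)

def widest_equal_prefix(n_columns, bytes_per_char, limit_bytes):
    """Largest prefix length usable on every column while staying within the limit."""
    return limit_bytes // (n_columns * bytes_per_char)

prefixes = [200, 200, 200]               # object_app, object_model, object_id
print(key_bytes(prefixes, 3))            # 1800, well above 1000
print(key_bytes(prefixes, 1))            # 600 with a single-byte charset
print(widest_equal_prefix(3, 3, 1000))   # 111 characters per column at most
```

With utf8 the three 200-character prefixes need 1800 bytes, so either the prefixes shrink to roughly 111 characters each or the columns move to a narrower character set.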
The 'right' answer is to shorten the fields and/or normalize them.
Do you really have 200-character-long apps, models, etc? If not, shorten the fields.
Probably model is repeated in the table a lot? If so, normalize it and replace the column with the id from normalizing it.
You seem to be using MyISAM; you could (should) also switch to InnoDB. That will change the error message, or it might make it go away.
Are you using utf8 characters? Are you doing everything in English? Changing the CHARACTER SET could make 200 characters mean 200 bytes, not 600 (utf8) or 800 (utf8mb4).
Changing the character set for ip_address would shrink its footprint from 15 * (bytes/char). So would changing from CHAR to VARCHAR. Note also that 15 is insufficient to handle IPv6.

always use 255 chars for varchar fields decreases performance?

I usually use the maximum number of chars possible for varchar fields, so in most cases I set 255 but only use 16 chars in the columns...
Does this decrease performance for my database?
When it comes to storage, a VARCHAR(255) column will take up 1 byte to store the length of the actual value (2 bytes if values may require more than 255 bytes) plus the bytes required to store the actual value.
For a latin1 VARCHAR(255) column, that's at most 256 bytes. For a UTF8 column, where each character can take up to 3 bytes (though rarely), the data can reach 765 bytes, and because that exceeds 255 bytes the length prefix takes 2 bytes, for a maximum of 767 bytes. The maximum index length for a single column in InnoDB is also 767 bytes, which is perhaps why some declare 255 as the maximum supported column length.
So, again, when storing the value, it only takes up as much room as is actually needed.
However, if the column is indexed, the index automatically allocates the maximum possible size so that each node in the index has enough room to store any possible value. When searching through an index, MySQL loads the nodes in chunks of a specific byte size. Larger nodes mean fewer nodes per read, which means it takes longer to search the index.
MySQL will also use the maximum size when storing the values in a temp table for sorting.
So, even if you aren't using indexes, but are ever performing a query that can't utilize an index for sorting, you will get a performance hit.
Therefore, if performance is your goal, setting any VARCHAR column to 255 characters should not be a rule of thumb. Instead, you should use the minimum required.
There may be edge cases where you'd rather suffer the performance every day so that you never have to lock a table completely to increase the size of a column, but I don't think that's the norm.
One possible exception is if you are joining on a VARCHAR column between two tables. MySQL says:
MySQL can use indexes on columns more efficiently if they are declared
as the same type and size.
In that case, you might use the max size between the two.
Whenever you're talking about "performance" you can only find out one way: Benchmarking.
In theoretical terms there's no difference between VARCHAR(20) and VARCHAR(255) if they're both populated with the same data. Keep in mind if you get your length wrong you will have massive truncation problems and MySQL does not warn you before it starts chopping data to fit.
I try to avoid setting limits on VARCHAR columns unless the data would be completely invalid if it was longer. For instance, two-character ISO country codes can be stored in VARCHAR(2) because longer strings are meaningless. For other things, especially names or phone numbers, limiting the length is potentially and probably harmful.
Still, you will want to test any schema you create to be sure it meets your performance requirements. I expect you'd have a hard time detecting any difference at all between VARCHAR(25) and VARCHAR(16).
There are two ways in which this will decrease performance.
If you're loading those columns many, many times, performing a join on the column, or doing anything else that means they need to be accessed a large number of times. The number of times depends on your machine, but think on the order of millions.
If you're always filling the field (using 20 chars in a varchar(20)), then the length checks add a little overhead whenever you perform an insert.
The best way to determine this, though, is to benchmark your database.

varchar(20) and varchar(50) are same?

I saw comment "If you have 50 million values between 10 and 15 characters in a varchar(20) column, and the same 50 million values in a varchar(50) column, they will take up exactly the same space. That's the whole point of varchar, as opposed to char.". Can Anybody tell me the reason? See What is a reasonable length limit on person "Name" fields?
MySQL offers a choice of storage engines. The physical storage of data depends on the storage engine.
MyISAM Storage of VARCHAR
In MyISAM, VARCHARs typically occupy just the actual length of the string plus a byte or two of length. This is made practical by the design limitation of MyISAM to table locking as opposed to a row locking capability. Performance consequences include a more compact cache profile, but also more complicated (slower) computation of record offsets.
(In fact, MyISAM gives you a degree of choice between fixed physical row size and variable physical row size table formats depending on column types occurring in the whole table. Occurrence of VARCHAR changes the default method only, but the presence of a TEXT blob forces VARCHARs in the same table to use the variable length method as well.)
The physical storage method is particularly important with indexes, which is a different story than tables. MyISAM uses space compression for both CHAR and VARCHAR columns, meaning that shorter data take up less space in the index in both cases.
InnoDB Storage of VARCHAR
InnoDB, like most other current relational databases, uses a more sophisticated mechanism. VARCHAR columns whose maximum width is less than 768 bytes will be stored inline, with room reserved matching that maximum width. More accurately here:
For each non-NULL variable-length field, the record header contains
the length of the column in one or two bytes. Two bytes will only be
needed if part of the column is stored externally in overflow pages or
the maximum length exceeds 255 bytes and the actual length exceeds 127
bytes. For an externally stored column, the two-byte length indicates
the length of the internally stored part plus the 20-byte pointer to
the externally stored part. The internal part is 768 bytes, so the
length is 768+20. The 20-byte pointer stores the true length of the
column.
InnoDB currently does not do space compression in its indexes, the opposite of MyISAM as described above.
Back to the question
All of the above is, however, just an implementation detail that may even change between versions. The true difference between CHAR and VARCHAR is semantic, and so is the one between VARCHAR(20) and VARCHAR(50). By ensuring that there is no way to store a 30-character string in a VARCHAR(20), the database makes life easier and better defined for the various processors and applications that it supposedly integrates into a predictably behaving solution. This is the big deal.
Regarding personal names specifically, this question may give you some practical guidance. People with full names over 70 UTF-8 characters are in trouble anyway.
Yes, that is indeed the whole point of VARCHAR. It only takes up as much space as the text is long.
If you had CHAR(50), it would take up 50 bytes (or characters) no matter how short the data really is (it would be padded, usually by spaces).
Can Anybody tell me the reason?
Because people thought it was wasteful to store a lot of useless padding, they invented VARCHAR.
The manual states:
The CHAR and VARCHAR types are declared with a length that indicates the maximum number of characters you want to store. (...)
In contrast to CHAR, VARCHAR values are stored as a one-byte or two-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
Notice that VARCHAR(255) is not the same as VARCHAR(256).
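The length-prefix rule quoted from the manual can be written out as a small Python sketch. The function name and the bytes_per_char parameter are illustrative assumptions; the 255-byte threshold comes from the quoted text.

```python
def length_prefix_bytes(n_chars, bytes_per_char=1):
    """Bytes used for the VARCHAR length prefix, per the rule quoted above."""
    max_bytes = n_chars * bytes_per_char
    return 1 if max_bytes <= 255 else 2

print(length_prefix_bytes(255))      # 1: latin1 VARCHAR(255)
print(length_prefix_bytes(256))      # 2: latin1 VARCHAR(256)
print(length_prefix_bytes(255, 3))   # 2: utf8 VARCHAR(255) may need 765 bytes
```

Note that the threshold is in bytes, not characters, so with a multi-byte character set the switch to a two-byte prefix happens at a shorter declared length.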
This is theory. As habeebperwad suggests, the actual footprint of one row depends on (engine) page size and (hard disk) block size.

Database MySql design - varchar length for utf8 fields :: 1. password 2. username 3.email

Most of the time I define a varchar(255) length automatically.
But now I am wondering what varchar length would be best for these utf8 fields:
password
username
email
If these fields should be defined as less than varchar(255), how much will it improve performance?
Thanks
'password' should be char(40) if you use SHA1 hashes. This might have binary collation if you are sure that the case of the hash is always the same. This gives you better performance. If you're not, use latin1, but don't use utf8.
'email'... use 255, you cannot know how long someone's email address is.
For the username I'd just use whatever your max username length is. 20 or 30 would probably be good.
If you have an index on a character field (especially if it's part of the PK), choose the length very carefully, because longer and longer indexes might reduce performance heavily (and increase memory usage).
Also, if you use a UTF8 char field in an index, you have to be aware that MySQL reserves 3 times as many bytes as the character length of the field, preparing for the worst case (UTF8 might store certain characters in 3 bytes). This can also cause a lack of memory.
If you index any of those fields (and you don't use a prefix as the index), bear in mind that MySQL will index the field as though it were CHAR rather than VARCHAR, and each index record will use the maximum potential space (so 3n bytes for a VARCHAR(n), since a UTF8 character can be up to 3 bytes long). That could mean the index will be larger than necessary. To get around this, make the field smaller, or index on a prefix.
(I should say: I'm sure I've read that this is the case somewhere in the MySQL documentation, but I couldn't find it when I looked just now.)
Changing that won't have a big effect on performance (depending on how many rows are in that table, you probably won't notice any effect), but it may make your database use less disk space. (I use a length of 30 for user names, 64 for passwords (the length of the hash) and 50 for email addresses.)