MySQL lookup performance: PK unsigned INT vs indexed varchar(15) column

Would there be a significant performance difference between looking up rows in a MySQL table (currently 2M rows) by an unsigned INT primary key (4 bytes) versus a varchar(15) column with a unique index?

The problem with using a VARCHAR for any KEY is that it can hold whitespace: any non-printing character, such as spaces, tabs, and carriage returns. Using a VARCHAR as a key can make your life difficult when you end up hunting down why a table isn't returning records whose keys have extra spaces at the end.
Sure, you CAN use VARCHAR, but you have to be very careful with the input and output. VARCHAR keys also take up more space and are likely to be slower when running queries.
Integer types have only ten valid characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. They are a much better choice for keys.
You could always use an integer-based key and put a UNIQUE index on the VARCHAR column if you want the advantages of faster lookups.
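A minimal sketch of that layout, assuming InnoDB (the table and column names are illustrative):
CREATE TABLE account (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate integer key for joins and lookups
    code VARCHAR(15)  NOT NULL,                  -- external identifier, e.g. 'AB-12345'
    PRIMARY KEY (id),
    UNIQUE KEY uq_account_code (code)            -- still enforces uniqueness of the string
) ENGINE=InnoDB;
Point lookups then work through either key:
SELECT id   FROM account WHERE code = 'AB-12345';
SELECT code FROM account WHERE id = 42;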

Shorter keys mean more keys per index block, which means fewer index block retrievals and therefore faster lookups.

Related

Which is faster for getting a row: a primary key that holds numbers, or one that holds characters?

ID (INT(11), primary key, auto-increment)    TITLE
1                                            ...
2                                            ...
3                                            ...
4                                            ...
5                                            ...
... up to 10 million rows
ID (CHAR(32), primary key)                   TITLE
a4a0FCBbE614497581da84454f806FbA             ...
40D553d006EF43f4b8ef3BcE6B08a542             ...
781DB409A5Db478f90B2486caBaAdfF2             ...
fD07F0a9780B4928bBBdbb1723298F92             ...
828Ef8A6eF244926A15a43400084da5D             ...
... up to 10 million rows
If I want to get a specific row from the first table, roughly how much time will it take? And the same for the second table, roughly how much time will it take?
Will a primary key that holds numbers be found faster than one that holds characters?
I do not want to use an auto-increment INT like in the first table because of this problem.
UUIDs and MD5s and other hashes suck because of the "randomness" and lack of "locality of reference", not because of being characters instead of numeric.
You could convert those to BINARY(16), thereby making them half as big.
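A sketch of that conversion, assuming those 32-character values are hex digests (table and column names are illustrative):
CREATE TABLE t2 (
    id    BINARY(16)   NOT NULL,    -- 16 bytes instead of 32+ for CHAR(32)
    title VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
Convert on the way in and out with UNHEX()/HEX():
INSERT INTO t2 (id, title)
VALUES (UNHEX('a4a0FCBbE614497581da84454f806FbA'), 'some title');
SELECT HEX(id) AS id, title
FROM t2
WHERE id = UNHEX('a4a0FCBbE614497581da84454f806FbA');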
10M INT = 40MB = 600/block
10M CHAR(32) = 320MB = 300/block
10M VARCHAR(32) = 330MB = 300/block
10M BINARY(16) = 160MB = 450/block
Add that much more for each secondary key in that table.
Add again for each other table that references that PK (eg, FOREIGN KEY).
Let's look at the B+Tree that is the structure of the PK and of each secondary index. In a 16KB block, some number of entries can be placed; I have estimated those counts above. (Yes, the per-entry 'overhead' is much bigger than the INT itself.) For INT, the BTree for 10M rows will probably be 3 levels deep: with hundreds of entries per block, three levels cover well over 10M entries. Ditto for the others. (As the table grows, the VARCHAR key would move to 4 levels before the others.)
So, I conclude, there is little or no difference in how many BTree blocks are needed to do your "point query".
Summary of how much slower a string is than an INT:
BTree depth -- little or none
Cachability of index blocks -- some; not huge
CPU time to compare numbers vs strings -- some; not huge
Use of a fancy COLLATION -- some; not huge
Overall -- not enough difference to worry about.
What I will question in some cases is whether you need a fabricated PK at all. In 2/3 of the tables I build, I find that there is a 'natural' PK: some column(s) that are, by the business logic, naturally UNIQUE and NOT NULL. Those are the two main qualifications (in MySQL) for a PRIMARY KEY. In some situations the speedup afforded by a "natural PK" can be more than a factor of 2.
A Many-to-many mapping table is an excellent (and common) example of such.
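For instance, a mapping table like this needs no AUTO_INCREMENT id at all (a sketch; table and column names are illustrative):
CREATE TABLE student_class (
    student_id INT UNSIGNED NOT NULL,
    class_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (student_id, class_id),   -- the 'natural' PK: UNIQUE and NOT NULL by the business rules
    KEY (class_id, student_id)            -- for lookups in the other direction
) ENGINE=InnoDB;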
It is impossible to tell the exact times needed to retrieve a specific record, because it depends on lots of factors.
In general, numeric values take less storage space, so scanning the index requires fewer I/O operations, and lookups are therefore usually faster.
However in this specific case the second table looks like a hexadecimal representation of a large number. You can probably store it as a binary value to save storage space.
On top of the above, numeric values are generally not affected by various database and column settings, while strings are (collation, for example), which can also add some processing time while querying.
The real question is what the purpose of that representation is. 10 million values easily fit in an INT; what is the need for a key that can store far more (a 32-character hexadecimal value)?
As long as you are within the range of the numeric values and there is no other requirement, just to be able to store that many different values, I would go with an integer.
The 'problem' you mention in the question is usually not a problem. In most cases there is no need to avoid gaps in the identifiers. In fact, in lots of systems gaps occur naturally during normal operation: you most probably won't reassign records to other IDs when a record is deleted from the middle of the table.
Unless the ID carries semantic meaning (it should not), I would just go with AUTO_INCREMENT; there is no need to reinvent the wheel.

MySQL: 'UNIQUE' constraint over a large string

What could be the possible downsides of having a UNIQUE constraint on a large string (varchar, roughly 100 characters or so) in MySQL during:
insert phase
retrieval phase (on another primary key)
Can the length of the query impact the performance of reads/writes? (Apart from disk/memory usage for bookkeeping.)
Thanks
Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc, depending on various things).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
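One such workaround, sketched under the assumption of MySQL 5.7+ generated columns (table and column names are illustrative):
CREATE TABLE documents (
    id       INT UNSIGNED  NOT NULL AUTO_INCREMENT,
    long_str VARCHAR(3000) NOT NULL,                        -- potentially too long to index directly (depends on charset/row format)
    str_md5  BINARY(16) AS (UNHEX(MD5(long_str))) STORED,   -- 16-byte hash of the string
    PRIMARY KEY (id),
    UNIQUE KEY uq_str_md5 (str_md5)                         -- enforces uniqueness via the hash
) ENGINE=InnoDB;
(An MD5 collision between two genuinely different strings would be rejected as a duplicate; that is astronomically unlikely, but it is one of the inefficiencies alluded to above.)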
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bringing a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process, lock the row for reading and/or get blocked by some other connection that is, say, updating or deleting it.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for JOIN or ORDER BY), then that is a simple subroutine call that scans over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
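A sketch of why (assuming InnoDB; table and column names are illustrative): each secondary index entry stores its own columns plus a copy of the PK.
CREATE TABLE article (
    url       VARCHAR(100) NOT NULL,    -- natural key used as the PK
    author_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (url),
    KEY idx_author (author_id)          -- each entry = author_id + a copy of the VARCHAR(100) PK
) ENGINE=InnoDB;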
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
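A sketch of such a covering index (table and column names are illustrative); EXPLAIN should report "Using index" for the query below:
CREATE TABLE orders (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id INT UNSIGNED NOT NULL,
    status      VARCHAR(20)  NOT NULL,
    created_at  DATETIME     NOT NULL,
    PRIMARY KEY (id),
    KEY idx_cust (customer_id, status, created_at)   -- covers the query below
) ENGINE=InnoDB;
EXPLAIN SELECT status, created_at FROM orders WHERE customer_id = 42;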
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.

How will it affect the database if I change a field's type from varchar to char in a large MySQL table?

I have an existing database where the primary key of one table is a varchar.
I found that this degrades query performance. Now I want to change the data type from varchar to char. My question is: if I change the table, how will it affect the entire database? Note that the table currently stores more than 3 million records.
Is there anything I need to worry about here?
Thanks in advance for any help and suggestions.
The MySQL docs list differences between CHAR and VARCHAR, here are the significant ones.
Truncation.
For VARCHAR columns, trailing spaces in excess of the column length are truncated prior to insertion and a warning is generated, regardless of the SQL mode in use.
For CHAR columns, truncation of excess trailing spaces from inserted values is performed silently regardless of the SQL mode.
Padding.
When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed.
VARCHAR values are not padded when they are stored. Trailing spaces are retained when values are stored and retrieved, in conformance with standard SQL.
I seriously doubt that switching from VARCHAR to CHAR will have any effect on query performance, and will probably take up more space. I also doubt having your primary key be a string is affecting performance. More likely there is a lack of indexing or poorly written queries. Check with EXPLAIN.
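For example (a sketch; the table and column names are illustrative), check whether a slow lookup is even using an index before blaming the key's datatype:
EXPLAIN SELECT * FROM mytable WHERE code = 'ABC123';
A non-NULL "key" column and an access type like "ref" or "const" mean an index is being used; "ALL" means a full table scan, i.e. a missing or unusable index.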
In general, having your primary key be anything but an auto incremented integer type (INT or BIGINT) is bad table design. A primary key which is a string implies that string carries significance, and you will be tempted to change it. Changing a primary key causes all sorts of problems.
Instead, add an integer primary key field with AUTO_INCREMENT to your table and use that.
ALTER TABLE whatever
DROP PRIMARY KEY,
ADD COLUMN id INT PRIMARY KEY AUTO_INCREMENT;
Be sure to alter any tables which refer to this one to use the new key.

Is there a recommended size for a MySQL primary key?

Each entry in my 'projects' table has a unique 32-character hash identifier stored as a varchar(32).
Would it be considered bad practice to use this as the primary key? Is there a recommended size or datatype for primary keys?
I would say yes, it's a bad idea to use such a large column for a primary key. The reason is that every index you create on that table will have that 32 character column in it, which will bloat the size of all the indexes. Bigger indexes mean more disk space, memory and I/O.
Better to use an auto-increment integer key if possible, and simply create a unique index on your hash identifier column.
Depends™ ;)
Judging by your description, this field is intrinsic to your data and must be unique. If that really is the case, then you must make it a key. If you have child tables, consider introducing another, so called "surrogate" key simply to keep child FKs slimmer and possibly avoid ON UPDATE CASCADE. But beware that every additional index introduces overhead, especially for clustered tables. More on surrogate keys here.
On the other hand, if this key is not intrinsic to your data model, replace it with a smaller one (e.g auto-incremented integer). You'll save some disk space and (more importantly) increase the effectiveness of the cache.
How your primary key should be defined depends on your usage. I typically use an INT(11) for my primary keys; it makes foreign keys really easy.
I just saw your edit. I would personally use the INT(11) with auto-increment. Depending on your setup, this lets you add other tables with foreign key constraints very easily. You could do the same thing with varchar, but it has always been my understanding that int is faster than varchar, especially with indexes.
There's nothing inherently wrong with using this as the PKEY. If you've got many other tables using this as an FKEY, perhaps not. There's no one answer.
Also note, if you know it's always going to be exactly 32 chars, you should make it a CHAR(32) instead.
In database engines, one of the most important factors is disk space. Keeping data small and compact is normally associated with good performance, by reducing the quantity of data that is transmitted and transferred by the database. If a table is only going to have a few rows, there's no reason to define a PK of type INT; MEDIUMINT, SMALLINT or even TINYINT can be used instead (just as you would use DATE instead of DATETIME). It's all about keeping it succinct.
This key is bad for several reasons.
One is addressed by @Eric in that every secondary index will contain those same 32 characters.
Primary keys tend to be used as lookups from other tables, and those tables also need to store those 32 characters, perhaps in their own primary key, so the same problem arises again on those tables.
The biggest reason I can think of is performance. As you insert records of a hash type, you are basically inserting keys in random order, and that in turn will eventually lead to a lot of page splits and pages that are only between 50% and 90% full. That leads to an unnecessarily deep tree, longer search times, a bigger tablespace, and an index that takes more memory.

MySQL: Best type for this non-standard ID field

I have a db of people (< 400 total), imported from another system. The IDs are like this:
_200802190302239ILXNSL
I do queries and joins to other tables on that field
Because I'm lazy and ignorant about MySQL data types and performance, I just set it to Text.
What data type should I be using and what sort of index should I set on it for best performance?
varchar (or char, if they all have the same length).
http://dev.mysql.com/doc/refman/5.0/en/char.html
Type: as said, varchar or char (way better if the length of this ID is fixed).
Index type: a UNIQUE probably (if you won't have multiple entries with the same ID)
As a further observation, I would probably hesitate (for performance reasons) to use this field as a natural primary key, especially if it will be referenced by multiple foreign keys. I would probably just create a synthetic primary key(for instance an AUTO_INCREMENT column) and a UNIQUE index on this non-standard ID column.
On the other hand, with fewer than 400 rows it doesn't really matter; it will be smoking fast anyway, unless there are big/huge tables referencing this persons table.
Set it as a VARCHAR with appropriate length; this way you can index the whole column, or use it as primary key or unique index component.
If you are really paranoid about performance AND you're sure it won't contain any non-ascii characters, you can set it as ascii character set and save a few bytes by not needing space for utf8 (in things such as sort-buffers and memory temporary tables).
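A sketch of that (table and column names are illustrative; pick the length to match your real IDs):
CREATE TABLE people (
    id   VARCHAR(22) CHARACTER SET ascii NOT NULL,   -- ascii: 1 byte/char in keys, sort buffers, temp tables
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;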
But if you have 400 records in your DB, you almost definitely don't care.