MySQL - Speed of string comparison for primary key

I have a MySQL table where I would like my primary key to be a string. This string may potentially be a bit longer (hundreds of characters).
A very common query would be an INSERT ... ON DUPLICATE KEY UPDATE, which means MySQL would have to check whether the primary key already exists in the table a lot. If this is done with a naive strcmp I imagine this might take quite a while the longer the strings are. Would it thus be better to hash the string manually (either to a shorter string or some other data type) and use that as my primary key or can I just use the long string directly? Does MySQL hash primary key strings internally?

First off, when you have an index on a varchar field, MySQL doesn't run a strcmp against every entry to find the correct one. Instead it walks a B-tree, so a lookup needs only a handful of string comparisons (roughly logarithmic in the number of rows) rather than one per row.
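To make the difference concrete, here is a rough Python sketch that counts how many keys get looked at by a naive scan versus a search over sorted keys. It is only an illustration: a real B-tree is not a plain binary search, and the key format and row count below are made up.

```python
def linear_scan(keys, target):
    """Check the target against every key in turn; return (found, probes)."""
    probes = 0
    for key in keys:
        probes += 1
        if key == target:
            return True, probes
    return False, probes


def sorted_search(sorted_keys, target):
    """Binary search over sorted keys; return (found, probes)."""
    probes = 0
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1  # one key examined per step
        if sorted_keys[mid] == target:
            return True, probes
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, probes


# Hypothetical data: 100,000 long string keys (zero-padded, so already sorted).
keys = ["user-%07d-%s" % (i, "x" * 100) for i in range(100_000)]
target = keys[-1]

print("linear scan probes:  ", linear_scan(keys, target)[1])
print("sorted search probes:", sorted_search(keys, target)[1])
```

With 100,000 keys the sorted search never needs more than 17 probes, so the length of each individual comparison matters far less than it would in a full scan.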
Note: I include some info to improve performance if needs be below, but please do not do that until you hit an actual problem. Varchar indexes are quick, they have been optimized by a lot of very smart people, and in the large majority of cases it will be way more than you need.
With that said, if you have a lot of entries and/or very long keys it can be nice performance wise to use an index of hashes on top of it.
CREATE TABLE users
(
    username varchar(255) not null, -- length is only an example; varchar requires one
    username_hashed varchar(32) not null,
    primary key (username),
    index (username_hashed)
);
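When you insert, you can set username_hashed = md5(username), for example. You then search with something like select otherfields from users where username_hashed = md5('alice') and username = 'alice' (the second condition guards against two different usernames sharing a hash).
Note that MySQL supports HASH indexes natively only in some storage engines (MEMORY and NDB). On InnoDB and MyISAM tables an index is always a B-tree, so there you would still have to do this by hand.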
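If you compute the hash in application code rather than with MySQL's md5(), a minimal Python sketch could look like the following. The function name username_hash and the 500-character key are made up for illustration, and the sketch assumes the string is hashed as UTF-8 bytes.

```python
import hashlib


def username_hash(username: str) -> str:
    """Return the hex MD5 digest of the UTF-8 encoded username."""
    return hashlib.md5(username.encode("utf-8")).hexdigest()


long_key = "k" * 500  # hypothetical 500-character key
digest = username_hash(long_key)

# The digest length is fixed no matter how long the input is.
print(len(long_key), len(digest))
```

The digest is always 32 hex characters, which is why the username_hashed column above can be a fixed, short width however long the usernames get.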

Does the primary key need to be a string? Couldn't the string just get a unique index, with an auto-increment integer as the primary key?
Searching on integers is generally faster. It might take a bit of code rearrangement in your app, but you'll usually be better off searching numbered primary keys rather than strings.
Look at these two posts that show the difference in memory for int and varchar:
What is the size of column of int(11) in mysql in bytes?
Memory usage of storing strings as varchar in MySQL

Related

How to save varchar field for query performance in mysql innodb

I made the following sample table.
CREATE TABLE `User` (
`userId` INT(11) NOT NULL AUTO_INCREMENT,
`userKey` VARCHAR(20) NOT NULL,
PRIMARY KEY (`userId`),
UNIQUE KEY (`userKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I want faster queries on the userKey field.
SELECT * FROM `User` WHERE userKey='KEY123'
In order to do that, what else should I consider besides the index?
For example, when saving the userKey, would using a value such as a DATE(sortable value) as prefix be more beneficial to the search speed?
Or, if I save the userKey as the hash value, will it improve?
If none of these help, is the index alone enough for the above query?
You've declared userKey as UNIQUE, so MySQL will automatically create an index for you. That should generally be enough, given that you are searching by "exact match" or by "starting with" type of queries.
If you consider storing hash values, you would only be able to perform "exact match" queries. However, for short strings that may turn out to be not really worth it. Also, with hashes, you need to consider the risk of a collision. If you use a smaller hash value (say 4 bytes) that risk increases. Those are just some guidelines. It's not possible to provide a definitive solution without delving into more details of your case.
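To get a feel for the collision risk, here is a small Python sketch using the standard birthday-bound approximation. It assumes the hash values are uniformly distributed, and the figure of 100,000 keys is only an example.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys share a hash value."""
    space = 2.0 ** hash_bits
    exponent = num_keys * (num_keys - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


for bits in (32, 64, 128):
    print(bits, "bits:", collision_probability(100_000, bits))
```

With a 4-byte (32-bit) hash and 100,000 keys, a collision somewhere in the table is more likely than not, while with a 128-bit hash such as MD5 the probability is negligible for accidental collisions.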
One way to speed the table is to get rid of it.
Building a mapping from an under 20 char string to a 4-byte INT does not save much space. Getting rid of the lookup saves a little time.
If you need to do the 'right thing' and have one copy of each string to avoid redundancy, then what you have is probably the best possible.
If there are more columns, then there may be a slight improvement.
But first, answer this: Do you do the lookup by UserKey more often than by UserId?
I give a resounding NO to a Hash. (Such might be necessary if the strings were too big to index.)
A BTree is very efficient at doing a "point query" (FROM User WHERE UserId = 123 or FROM User WHERE UserKey = 'xyz'). Adding a date or hash would only make it slower.

Mysql timestamp and AUTO_INCREMENT as primary key

I am thinking about the best way to index my data. Is it a good idea to use the timestamp as my primary key? I am saving it anyway, and I thought about saving some columns. The timestamp should be an integer, not a datetime column, because of performance. Moreover, I don't want to be restricted in the amount of data I can store in a short time (within the same second). Therefore, I thought about an additional AUTO_INCREMENT column. Now I have a unique key (timestamp and AI), and I can get the most recently inserted id easily by using LAST_INSERT_ID. Is it possible to reset the AI counter every second, i.e. whenever there is a new timestamp? Or is it possible to detect whether there is already a row with the same timestamp and increase the AI value (I still want to be able to use LAST_INSERT_ID)?
Please share some thoughts.
The timestamp should be an integer not a datetime column, because of performance.
I think you are of the belief that datetime is stored as a string. It is stored as numbers quite efficiently and with a wider range and more accuracy than an integer.
Using an integer may decrease performance because the database may not be able to correctly index it for use as a timestamp. It will complicate queries because you will not be able to use the full suite of date and time functions without first converting the integer to a datetime.
Use the appropriate date/time type, index it, and let the database optimize it.
Moreover I don't want to be restricted on the amount of data in a short time (between two seconds). Therefore, I thought about an [additional] AUTO_INCREMENT column.
This would seem to defeat the point of "saving some columns". Now your primary key is two integers. Worse, it's a compound key which requires all references to store both values increasing storage requirements and complicating joins.
All the extra work necessary to determine the next primary key could be done in an insert trigger, but now you've added complexity and extra work to every insert.
Is it a good idea to use the timestamp as my primary key?
A primary key should be A) unique and B) immutable. A timestamp is not unique, and you might need to change it.
Your primary key is unlikely to be a performance or storage bottleneck. Unless you have a good reason, stick with a simple, auto-incrementing big integer. A big integer because 2 billion is smaller than you think.
MySQL encapsulates this in serial which is bigint unsigned not null auto_increment unique.
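A quick back-of-the-envelope sketch in Python shows why 2 billion is smaller than you think. The insert rate of 1,000 rows per second is a made-up figure; plug in your own.

```python
INT_SIGNED_MAX = 2**31 - 1        # largest value of a signed 32-bit INT
BIGINT_UNSIGNED_MAX = 2**64 - 1   # largest value of BIGINT UNSIGNED
SECONDS_PER_DAY = 86_400


def days_until_exhausted(max_id: int, inserts_per_second: float) -> float:
    """Days until an auto-increment counter reaches max_id at a steady rate."""
    return max_id / inserts_per_second / SECONDS_PER_DAY


rate = 1_000  # hypothetical sustained inserts per second
print("INT:    %.1f days" % days_until_exhausted(INT_SIGNED_MAX, rate))
print("BIGINT: %.0f years" % (days_until_exhausted(BIGINT_UNSIGNED_MAX, rate) / 365))
```

At that rate a signed INT runs out in under a month, while a BIGINT UNSIGNED lasts for hundreds of millions of years.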
TIMESTAMP and DATETIME are risky as a PRIMARY KEY since the PK must be Unique.
Otherwise, it is fine to use them for the PK or an index. But here are some caveats:
When using composite indexes (multi-column), put the things tested with = first; put the datetime last.
Smaller is slightly better when picking a PK. TIMESTAMP takes 4 bytes and DATETIME takes 5 bytes (when not including fractional seconds, in MySQL 5.6.4 and later); INT is 4 bytes; BIGINT is 8.
The time taken for comparing one PK value to another is insignificant. That includes character PKs. For example, country_code CHAR(2) CHARACTER SET ascii is only 2 bytes -- better than 'normalizing' it and replacing it with a 4-byte cc_id INT.
So, no, don't bother using INT instead of TIMESTAMP.
In my experience, 2/3 of tables have a "natural" PK and don't need an auto_increment PK.
One of the worst places to use an auto_inc is on a many-to-many mapping table. It is likely to slow down most operations by a factor of 2.
You hinted at PRIMARY KEY(timestamp, ai):
You need to add INDEX(ai) to keep AUTO_INCREMENT happy.
It provides locality of reference for temporally 'near' rows. But so does ai, by itself.
No, there is no practical way to reset the ai each second. (MyISAM has such, but do not use that engine.) Instead be sure to declare ai big enough to last 'forever' before overflowing.
But I can't think of a use case where there isn't a better way.

Mysql: Best type for this non-standard ID field

I have a db of people ( < 400 total ), imported from another system. The IDs are like this
_200802190302239ILXNSL
I do queries and joins to other tables on that field
Because I'm lazy and ignorant about MySQL data types and performance, I just set it to Text.
What data type should I be using and what sort of index should I set on it for best performance?
varchar (or char, if they all have the same length).
http://dev.mysql.com/doc/refman/5.0/en/char.html
Type: as said, varchar or char (way better if the length of this ID is fixed).
Index type: probably UNIQUE (if you won't have multiple entries with the same ID).
As a further observation, I would probably hesitate (for performance reasons) to use this field as a natural primary key, especially if it will be referenced by multiple foreign keys. I would probably just create a synthetic primary key (for instance an AUTO_INCREMENT column) and a UNIQUE index on this non-standard ID column.
On the other hand, with fewer than 400 rows, it doesn't really matter. It will be smoking fast anyway, unless there are huge tables referencing this persons table.
Set it as a VARCHAR with appropriate length; this way you can index the whole column, or use it as primary key or unique index component.
If you are really paranoid about performance AND you're sure it won't contain any non-ascii characters, you can set it as ascii character set and save a few bytes by not needing space for utf8 (in things such as sort-buffers and memory temporary tables).
But if you have 400 records in your DB, you almost definitely don't care.

Primary key in MySQL: INT(n) or UUID as varchar(36)

Does it make sense to use UUID as primary key in MySQL?
What would be pros and cons of using UUID instead of regular INT, beside trouble of hand querying?
From my point of view, using a UUID as the primary key in MySQL is a bad idea if we are talking about large databases (and a large number of inserts).
InnoDB, MySQL's default storage engine, always clusters the table on the primary key, and there is no option to switch that off.
Taking this into consideration, when you insert large numbers of records with non-sequential identifiers (UUIDs), the table gets fragmented and each new insert takes more time.
Advice: Use PostgreSQL / MS-SQL / Oracle with GUIDs. For MySQL use ints (bigints).
The major downside of UUIDs is that you have to create them beforehand if you want to refer back to the record for further usage afterwards (ie: adding child records in dependent foreign keyed tables):
INSERT INTO `table` (uuidfield, someotherfield) VALUES (uuid(), 'test');
will not let you see what the new UUID value is, and since you're not using a regular auto_incremented primary key, you can't use last_insert_id() to retrieve it. You'd have to do it in a two-step process:
SELECT @newuid := uuid();
INSERT INTO `table` (uuidfield, someotherfield) VALUES (@newuid, 'test');
INSERT INTO childtable ..... VALUES (@newuid, ....);
The PRO I can think of is that your ID will be unique, not only in your table but on every other table of your database. Furthermore, it should be unique among any table from any database in the world.
If your table's semantics need that feature, then use a UUID. Otherwise, just use a plain INT ID (faster, easier to handle, smaller).
The downside of a UUID is that it's bulkier and hence a bit slower to search. It's hard to type in a long hex string for every ad hoc query. It solves a need that you might not have, i.e. multi-server uniqueness.
By the way, INT(n) is always a 32-bit integer in MySQL, the (n) argument has nothing to do with size or range of allowed values. It's only a display width hint.
If you need an integer with a range of values greater than what 32-bit provides, use BIGINT.
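To put numbers on "bulkier", here is a short Python sketch comparing the size of a UUID in its textual form (what a varchar(36) column would hold) with its raw 16-byte form. Storing the raw bytes in something like a BINARY(16) column is an assumption on my part, not something discussed above.

```python
import uuid

u = uuid.uuid4()          # random UUID, generated client-side
as_text = str(u)          # e.g. 'xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx'
as_bytes = u.bytes        # the same value as raw bytes

print("text form: ", len(as_text), "characters")
print("binary form:", len(as_bytes), "bytes")
print("INT: 4 bytes, BIGINT: 8 bytes")

# The binary form loses nothing: it converts back to the same UUID.
assert uuid.UUID(bytes=as_bytes) == u
```

Generating the UUID in the application like this also sidesteps the "you can't see the new id" problem, since the value is known before the INSERT is issued.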

CHAR() or VARCHAR() as primary key in an ISAM MySQL table?

I need a simple table with a user name and password field in MySQL. Since user names must be unique, it makes sense to me to make them the primary key.
Is it better to use CHAR() or VARCHAR() as a primary key?
You may as well just use a user ID index; it's much faster for joins than char/varchar. The two seconds it takes to add that now could save you a lot of time later if you unexpectedly have to expand the functionality of your schema.
Some pitfalls to think about:
Say we add a few tables at a future date; what if someone wants to change a username?
Say the app is more successful than we think and we have to look at optimization; do you really want to redo your schema at that point to reduce the overhead of a varchar'ed index?
I would work hard to NOT use CHAR() or VARCHAR() as a PK but use an int with an auto_increment instead. This allows you to use that user_id in child tables if needed and queries on the PK should be faster. If you have to use either a CHAR() or VARCHAR(), I'd go with the CHAR() since it's a fixed width.
I'm not 100% sure how MySQL deals with VARCHAR()s, but most database engines have to do some magic under the hood to know where a VARCHAR() field ends and where the next field begins. A CHAR() makes it straightforward and keeps the engine from having to think too much.
[I would work hard to NOT use CHAR() or VARCHAR() as a PK but use an int with an auto_increment instead.]
+1
Put a unique constraint on the username but use the int field as the PK.
I don't see CHAR used much in any MySQL databases I've worked on. I would go with the VARCHAR.
For a CHAR(30) for example, the entire 30 characters are stored in the table meaning every entry will take up the same space, even if your username is only 10 characters long.
Using VARCHAR(30) it will only use enough space to store the string you have entered.
On a small table it won't make much difference, but on a larger table the VARCHAR should keep it smaller overall.
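As a rough illustration of that space difference, here is a Python sketch that estimates per-value storage. It assumes a single-byte character set and the usual rule that a VARCHAR stores its data plus a 1-byte length prefix (2 bytes when the column can hold more than 255 bytes). The sample usernames are made up, and real row formats add their own overhead.

```python
def char_bytes(declared_len: int) -> int:
    """CHAR(n) in a single-byte charset: always padded to n bytes."""
    return declared_len


def varchar_bytes(value: str, declared_len: int) -> int:
    """VARCHAR(n) in a single-byte charset: data bytes plus a length prefix."""
    data = len(value.encode("latin-1"))
    if data > declared_len:
        raise ValueError("value does not fit in the column")
    prefix = 1 if declared_len <= 255 else 2
    return data + prefix


usernames = ["alice", "bob", "christopher_longname"]  # hypothetical values

total_char = sum(char_bytes(30) for _ in usernames)
total_varchar = sum(varchar_bytes(name, 30) for name in usernames)
print("CHAR(30):   ", total_char, "bytes")
print("VARCHAR(30):", total_varchar, "bytes")
```

The shorter the typical username is relative to the declared width, the more the VARCHAR saves.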