Primary key in MySQL: INT(n) or UUID as varchar(36)

Does it make sense to use UUID as primary key in MySQL?
What would be the pros and cons of using a UUID instead of a regular INT, besides the hassle of querying by hand?

From my point of view, using a UUID as the primary key in MySQL is a bad idea if we are talking about large databases (and a large number of inserts).
MySQL (InnoDB) always creates the primary key as a clustered index, and there is no option to switch this off.
Taking this into consideration, when you insert large numbers of records with non-sequential identifiers (UUIDs), the database gets fragmented, and each new insert takes more time.
Advice: Use PostgreSQL / MS-SQL / Oracle with GUIDs. For MySQL use ints (bigints).
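For illustration, a minimal sketch of that advice (table and column names are made up) - a sequential BIGINT keeps the clustered index append-only:
CREATE TABLE orders (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_name VARCHAR(100) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id)  -- clustered in InnoDB; new rows append at the end
) ENGINE=InnoDB;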

The major downside of UUIDs is that you have to create them beforehand if you want to refer back to the record for further usage afterwards (i.e. adding child records in dependent foreign-keyed tables):
INSERT INTO table (uuidfield, someotherfield) VALUES (uuid(), 'test');
will not let you see what the new UUID value is, and since you're not using a regular auto_incremented primary key, you can't use last_insert_id() to retrieve it. You'd have to do it in a two-step process:
SELECT @newuid := uuid();
INSERT INTO table (uuidfield, someotherfield) VALUES (@newuid, 'test');
INSERT INTO childtable ..... VALUES (@newuid, ....);

The PRO I can think of is that your ID will be unique, not only in your table but on every other table of your database. Furthermore, it should be unique among any table from any database in the world.
If your table's semantics need that feature, then use a UUID. Otherwise, just use a plain INT ID (faster, easier to handle, smaller).

The cons of a UUID are that it's bulkier and hence a bit slower to search, and that it's a pain to type a long hex string into every ad hoc query. It also solves a need that you might not have, i.e. multi-server uniqueness.
By the way, INT(n) is always a 32-bit integer in MySQL, the (n) argument has nothing to do with size or range of allowed values. It's only a display width hint.
If you need an integer with a range of values greater than what 32-bit provides, use BIGINT.
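To illustrate (a hypothetical table), all three column definitions below behave exactly as described above - the (n) changes nothing but the display width:
CREATE TABLE size_demo (
    a INT(4),    -- same 4-byte storage and -2147483648..2147483647 range as INT(11)
    b INT(11),   -- (11) is only a display-width hint
    c BIGINT     -- 8 bytes, for values beyond the 32-bit range
);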

Related

MySQL query and insertion optimisation with varchar(255) UUIDs

I think this question has been asked in some way, shape or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is not indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc. - e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe - but understand that it is certainly not 'unguessable' in theory. But the issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that comes to mind is to make the uuid field unique and to index it. But I've been doing some reading on this, and various blog posts and StackOverflow answers describe putting indexes on UUIDs as really bad for performance. I also read that it will increase the time it takes to insert a new customer record, as the MySQL database will take time to find the correct location in which to place the record as part of the index.
The above mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance, and in turn the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement - the ability to store a customer record which is 'hard' to guess, easy to insert, and easy to query in a large data set?
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event you get a random duplicate UUID value, you can use an index like this:
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns and using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
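As a rough sketch of that idea (it assumes MySQL 8.0+, where UUID_TO_BIN() and BIN_TO_UUID() are available; the table and column names just follow the earlier example):
CREATE TABLE customer_compact (
    uuid BINARY(16) NOT NULL,    -- 16 bytes instead of 36 characters
    givenname VARCHAR(50),
    surname VARCHAR(50),
    email VARCHAR(100),
    INDEX uuid_idx (uuid)
);
INSERT INTO customer_compact (uuid, givenname) VALUES (UUID_TO_BIN(UUID()), 'Alice');
SELECT BIN_TO_UUID(uuid) AS uuid, givenname
FROM customer_compact
WHERE uuid = UUID_TO_BIN('3f06af63-a93c-11e4-9797-00505690773f');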

Mysql timestamp and AUTO_INCREMENT as primary key

I am thinking about the best way to index my data. Is it a good idea to use the timestamp as my primary key? I am saving it anyway, and I thought about saving some columns. The timestamp should be an integer, not a datetime column, because of performance. Moreover, I don't want to be restricted on the amount of data in a short time (between two seconds). Therefore, I thought about an additional AUTO_INCREMENT column. Now I have a unique key (timestamp and AI) and I can get the current inserted id easily by using the command "LAST_INSERT_ID". Is it possible to reset the AI counter every second / when there is a new timestamp? Or is it possible to detect if there is a dataset with the same timestamp and increase the AI value (I still want to be able to use LAST_INSERT_ID)?
Please share some thoughts.
The timestamp should be an integer not a datetime column, because of performance.
I think you are of the belief that datetime is stored as a string. It is stored as numbers quite efficiently and with a wider range and more accuracy than an integer.
Using an integer may decrease performance because the database may not be able to correctly index it for use as a timestamp. It will complicate queries because you will not be able to use the full suite of date and time functions without first converting the integer to a datetime.
Use the appropriate date/time type, index it, and let the database optimize it.
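A small sketch of the difference (table and column names are invented) - with a native DATETIME the range predicate can use the index directly, with no FROM_UNIXTIME() conversion needed:
CREATE TABLE events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    INDEX idx_created_at (created_at)
);

SELECT * FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';  -- range scan on the index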
Moreover I don't want to be restricted on the amount of data in a short time (between two seconds). Therefore, I thought about an additional AUTO_INCREMENT column.
This would seem to defeat the point of "saving some columns". Now your primary key is two integers. Worse, it's a compound key, which requires all references to store both values, increasing storage requirements and complicating joins.
All the extra work necessary to determine the next primary key could be done in an insert trigger, but now you've added complexity and extra work to every insert.
Is it a good idea to use the timestamp as my primary key?
A primary key should be A) unique and B) immutable. A timestamp is not unique, and you might need to change it.
Your primary key is unlikely to be a performance or storage bottleneck. Unless you have a good reason, stick with a simple, auto-incrementing big integer. A big integer because 2 billion is smaller than you think.
MySQL encapsulates this in SERIAL, which is BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE.
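In other words, the two column definitions below are equivalent (the table names are just for illustration):
CREATE TABLE measurements (
    id SERIAL,                 -- shorthand
    recorded_at DATETIME NOT NULL
);

CREATE TABLE measurements_explicit (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
    recorded_at DATETIME NOT NULL
);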
TIMESTAMP and DATETIME are risky as a PRIMARY KEY since the PK must be Unique.
Otherwise, it is fine to use them for the PK or an index. But here are some caveats:
When using composite indexes (multi-column), put the things tested with = first; put the datetime last.
Smaller is slightly better when picking a PK. TIMESTAMP and DATETIME take 5 bytes (when not including microseconds); INT is 4 bytes; BIGINT is 8.
The time taken for comparing one PK value to another is insignificant. That includes character PKs. For example, country_code CHAR(2) CHARACTER SET ascii is only 2 bytes -- better than 'normalizing' it and replacing it with a 4-byte cc_id INT.
So, no, don't bother using INT instead of TIMESTAMP.
In my experience, 2/3 of tables have a "natural" PK and don't need an auto_increment PK.
One of the worst places to use an auto_inc is on a many-to-many mapping table. It is likely to slow down most operations by a factor of 2.
You hinted at PRIMARY KEY(timestamp, ai):
You need to add INDEX(ai) to keep AUTO_INCREMENT happy (see the sketch below).
It provides locality of reference for temporally 'near' rows. But so does ai, by itself.
No, there is no practical way to reset the ai each second. (MyISAM has such, but do not use that engine.) Instead be sure to declare ai big enough to last 'forever' before overflowing.
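A rough sketch of that hinted layout (names are illustrative), with the extra INDEX(ai) mentioned above:
CREATE TABLE readings (
    ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ai BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    payload VARCHAR(255),
    PRIMARY KEY (ts, ai),
    INDEX (ai)          -- AUTO_INCREMENT needs an index that starts with ai
) ENGINE=InnoDB;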
But I can't think of a use case where there isn't a better way.

Index vs Auto Increment ID set as PK

In MySQL, what is the difference between Index and an ID set to AUTO_INCREMENT and as Primary Key?
Will this also increase the speed of searching the database? Or is the AUTO_INCREMENT ID just for the purpose of the user, and the computer doesn't consider it while searching the database? Reading about INDEX on w3schools.com I came across this line:
Indexes are used to retrieve data from the database very fast. The users cannot see the indexes, they are just used to speed up searches/queries.
In MySQL, the primary key creates an index on the key... but the original data pages are the leaves of the index. This may be a little convoluted, but the effect is that the data is actually sorted on the data pages.
A regular index is implemented as a b-tree (note: the "b" stands for "balanced" rather than "binary", contrary to what many people believe). The leaves are stored separately from the original data.
auto_increment is a property of one column of a table, where the value is set to a new value on each insert and the new value is larger than the previous value. The increment is usually 1, but that is not guaranteed. auto_increment does not directly relate to indexing, but is almost always associated with the primary key of the table.
So, in both cases, you have an index. The primary key index is slightly smaller because storage is combined with the data pages themselves. On the other hand, the data needs to be in order on the disk, which can complicate inserts and updates. On the other hand, auto-increment guarantees that all new rows go at the end of the data. On the other hand, I've run out of hands.
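As an illustration (a hypothetical table): in InnoDB the primary key below is the clustered index, so the rows themselves are stored in id order, while idx_email is a separate B-tree whose leaf entries refer back to the primary key:
CREATE TABLE members (
    id INT NOT NULL AUTO_INCREMENT,
    email VARCHAR(100) NOT NULL,
    PRIMARY KEY (id),           -- clustered: the data pages are its leaves
    INDEX idx_email (email)     -- secondary: stored separately from the data
) ENGINE=InnoDB;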
When you index a column, you make a binary tree (or possibly another data structure) to speed up the search process.
The ID or primary key is indexed by default.
Auto_Increment means you want MySQL to automatically set a value to the column whenever a new row gets inserted. The value will be set incrementally.
AUTO_INCREMENT is an integer sequence generator and nothing else. It has no inherent relation to indexes of any kind, and only exists to generate unique sequential numbers. It is frequently used to generate integer surrogate keys which are frequently used as primary keys.
You don't have to use either AUTO_INCREMENT or integer IDs as an index so long as you have one or more fields that you can use to uniquely identify a row.
In fact, in terms of scalability, sequence generators like AUTO_INCREMENT are counter-productive as you can only ever have a single instance of a sequence generator, limiting the number of 'master/write' servers and/or bottlenecking insert performance to that of the node running the generator.

MySQL - Speed of string comparison for primary key

I have a MySQL table where I would like my primary key to be a string. This string may potentially be a bit longer (hundreds of characters).
A very common query would be an INSERT ... ON DUPLICATE KEY UPDATE, which means MySQL would have to check whether the primary key already exists in the table a lot. If this is done with a naive strcmp, I imagine it might take quite a while the longer the strings are. Would it thus be better to hash the string manually (either to a shorter string or some other data type) and use that as my primary key, or can I just use the long string directly? Does MySQL hash primary key strings internally?
First off, when you have an index on a varchar field mysql doesn't do a strcmp on all entries to find the correct one; instead it uses a binary tree, which is a lot faster than strcmp to navigate through to find the proper entry.
Note: I include some info to improve performance if needs be below, but please do not do that until you hit an actual problem. Varchar indexes are quick, they have been optimized by a lot of very smart people, and in the large majority of cases it will be way more than you need.
With that said, if you have a lot of entries and/or very long keys it can be nice performance wise to use an index of hashes on top of it.
CREATE TABLE users
(
username varchar(255) not null,        -- varchar needs a length; 255 is just an example
username_hashed varchar(32) not null,  -- md5() produces 32 hex characters
primary key (username),
index (username_hashed)
);
When you insert, you can set username_hashed = md5(username), for example. And then you search with something like: select otherfields from users where username_hashed = md5('wanted name') and username = 'wanted name'.
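In parameterized form this is just a sketch, where :username stands for whatever bound value your application supplies:
-- insert: store both the raw value and its hash
INSERT INTO users (username, username_hashed)
VALUES (:username, MD5(:username));

-- lookup: the short hash index narrows the search; the username comparison
-- guards against the (unlikely) case of two usernames sharing an MD5 hash
SELECT username
FROM users
WHERE username_hashed = MD5(:username)
  AND username = :username;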
Note that it seems MySQL 5.5 supports hash indexes natively, which would allow you not to have to do that by hand.
Does the primary key need to be a string? Can't it just be a unique index, with an integer primary auto increment?
Searching will always be faster with integers, and it might take a bit of code rearrangement in your app, but you'll always be better off searching numbered primary keys vs. strings.
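That layout might look something like this (names are made up) - an auto-increment integer primary key plus a UNIQUE index on the string, which still lets INSERT ... ON DUPLICATE KEY UPDATE fire on the string:
CREATE TABLE documents (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    doc_key VARCHAR(200) NOT NULL,        -- the long string, no longer the PK
    PRIMARY KEY (id),
    UNIQUE KEY uq_doc_key (doc_key)
);

INSERT INTO documents (doc_key) VALUES ('some long key')
ON DUPLICATE KEY UPDATE doc_key = doc_key;  -- duplicate detected via the UNIQUE index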
Look at these two posts that show the difference in memory for int and varchar:
What is the size of column of int(11) in mysql in bytes?
Memory usage of storing strings as varchar in MySQL

Mysql: Best type for this non-standard ID field

I have a db of people ( < 400 total ), imported from another system. The IDs are like this:
_200802190302239ILXNSL
I do queries and joins to other tables on that field.
Because I'm lazy and ignorant about MySQL data types and performance, I just set it to Text.
What data type should I be using and what sort of index should I set on it for best performance?
varchar (or char, if they all have the same length).
http://dev.mysql.com/doc/refman/5.0/en/char.html
Type: as said, varchar or char (way better if the length of this ID is fixed).
Index type: a UNIQUE index, probably (if you won't have multiple entries with the same ID).
As a further observation, I would probably hesitate (for performance reasons) to use this field as a natural primary key, especially if it will be referenced by multiple foreign keys. I would probably just create a synthetic primary key (for instance an AUTO_INCREMENT column) and a UNIQUE index on this non-standard ID column.
On the other hand, with fewer than 400 rows it doesn't really matter; it will be smoking fast anyway, unless there are big/huge tables referencing this persons table.
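A minimal sketch of that suggestion (names are invented; the 22-character length simply matches the sample ID above):
CREATE TABLE people (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- synthetic primary key
    legacy_id CHAR(22) NOT NULL,               -- e.g. '_200802190302239ILXNSL'
    name VARCHAR(100),
    PRIMARY KEY (id),
    UNIQUE KEY uq_legacy_id (legacy_id)
);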
Set it as a VARCHAR with appropriate length; this way you can index the whole column, or use it as primary key or unique index component.
If you are really paranoid about performance AND you're sure it won't contain any non-ascii characters, you can set it as ascii character set and save a few bytes by not needing space for utf8 (in things such as sort-buffers and memory temporary tables).
But if you have 400 records in your DB, you almost definitely don't care.
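For example (a sketch only, assuming the IDs really are pure ASCII; names and lengths are illustrative):
CREATE TABLE people_ascii (
    person_id VARCHAR(22) CHARACTER SET ascii NOT NULL,
    name VARCHAR(100),
    PRIMARY KEY (person_id)   -- the whole column can be indexed or used as the PK
);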