MySQL: UNIQUE constraint over a large string

What could be the possible downsides of having a UNIQUE constraint on a large string (VARCHAR, roughly 100 characters or so) in MySQL during the:
insert phase
retrieval phase (on another primary key)
Can the length of the key impact the performance of reads/writes? (Apart from disk/memory usage for book-keeping.)
Thanks

Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc., depending on MySQL version, storage engine, row format, and character set).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
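For this case, the declaration is straightforward. A minimal sketch (table and column names here are hypothetical):
CREATE TABLE t (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    UNIQUE KEY uk_name (name)
);
-- or, on an existing table:
ALTER TABLE t ADD UNIQUE KEY uk_name (name);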
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
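For the record, here is one common shape of the hashing workaround -- a sketch only, with hypothetical names, and relying on MD5 collisions being vanishingly unlikely:
CREATE TABLE long_strings (
    str_hash BINARY(16) NOT NULL,  -- UNHEX(MD5(str)), computed by the application
    str TEXT NOT NULL,
    UNIQUE KEY uk_hash (str_hash)  -- enforces uniqueness of the full string via its hash
);
INSERT INTO long_strings (str_hash, str)
    VALUES (UNHEX(MD5('...some very long string...')), '...some very long string...');
A duplicate string produces a duplicate hash, so the second INSERT of the same string fails with a "duplicate key" error.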
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bringing a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process lock the row for reading and/or be blocked by some other connection that is, say, updating or deleting.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for JOIN or ORDER BY), then that is a simple subroutine call that scans over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.
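To make the covering-index idea concrete, a hypothetical example:
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(100) NOT NULL,
    status TINYINT NOT NULL,
    INDEX idx_status_email (status, email)
);
-- Every column the query touches is in idx_status_email, so the data BTree
-- is never visited; EXPLAIN reports "Using index" in the Extra column:
EXPLAIN SELECT email FROM users WHERE status = 1;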


Which one is faster for getting a row: a primary key that carries numbers, or one that carries characters?

First table -- ID is INT(11), PRIMARY KEY, AUTO_INCREMENT:

ID                                TITLE
1                                 ...
2                                 ...
3                                 ...
4                                 ...
5                                 ...
...up to 10 million rows

Second table -- ID is CHAR(32), PRIMARY KEY:

ID                                TITLE
a4a0FCBbE614497581da84454f806FbA  ...
40D553d006EF43f4b8ef3BcE6B08a542  ...
781DB409A5Db478f90B2486caBaAdfF2  ...
fD07F0a9780B4928bBBdbb1723298F92  ...
828Ef8A6eF244926A15a43400084da5D  ...
...up to 10 million rows
If I want to get a specific row from the first table, how much time will it take, approximately? And the same for the second table: how much time will it take, approximately?
Will a primary key that carries numbers be found faster than one that carries characters?
I do not want to use auto-increment with int like the first table because of this problem
UUIDs and MD5s and other hashes suck because of the "randomness" and lack of "locality of reference", not because of being characters instead of numeric.
You could convert those to BINARY(16), thereby making them half as big.
10M INT = 40MB = 600/block
10M CHAR(32) = 320MB = 300/block
10M VARCHAR(32) = 330MB = 300/block
10M BINARY(16) = 160MB = 450/block
Add that much more for each secondary key in that table.
Add again for each other table that references that PK (eg, FOREIGN KEY).
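A sketch of the BINARY(16) conversion mentioned above, using one of the values from the question (the table and column are hypothetical):
CREATE TABLE t2 (
    id BINARY(16) NOT NULL PRIMARY KEY
);
INSERT INTO t2 (id) VALUES (UNHEX('a4a0FCBbE614497581da84454f806FbA'));  -- 32 hex chars -> 16 bytes
SELECT HEX(id) FROM t2;  -- convert back for display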
Let's look at the B+Tree that is the structure of the PK and secondary indexes. In a 16KB block, some number of entries can be placed. I have estimated them above. (Yes, the 'overhead' is much more than for an INT.) For INT, the BTree for 10M rows will probably be 3 levels deep. Ditto for the others. (As the table grows, the VARCHAR version would move to 4 levels before the others.)
So, I conclude, there is little or no difference in how many BTree blocks are needed to do your "point query".
Summary of how much slower a string is than an INT:
BTree depth -- little or none
Cachability of index blocks -- some; not huge
CPU time to compare numbers vs strings -- some; not huge
Use of a fancy COLLATION -- some; not huge
Overall -- not enough difference to worry about.
What I will argue for in some cases is whether you need a fabricated PK. In 2/3 of the tables I build, I find that there is a 'natural' PK -- some column(s) that is, by the business logic, naturally UNIQUE and NOT NULL. These are the two main qualifications (in MySQL) for a PRIMARY KEY. In some situations the speedup afforded by a "natural PK" can be more than a factor of 2.
A Many-to-many mapping table is an excellent (and common) example of such.
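A sketch of such a mapping table (names hypothetical); the pair of ids is naturally UNIQUE and NOT NULL, so it serves as the PK without a fabricated id:
CREATE TABLE student_class (
    student_id INT NOT NULL,
    class_id   INT NOT NULL,
    PRIMARY KEY (student_id, class_id),  -- the 'natural' PK
    INDEX (class_id, student_id)         -- for traversing the other direction
);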
It is impossible to tell the exact times needed to retrieve a specific record, because it depends on lots of factors.
In general, numeric values take less storage space; thus scanning the index requires fewer I/O operations, and is therefore usually faster.
However in this specific case the second table looks like a hexadecimal representation of a large number. You can probably store it as a binary value to save storage space.
On top of the above, numeric values are generally not affected by various database and column settings, while strings are (collation, for example), which can also add some processing time while querying.
The real question is what the purpose of such a representation is. 10 million values easily fit in an INT; what is the need for a key that can store far more (a 32-character hexadecimal value)?
As long as you are within the range of the numeric values and there is no other requirement, just to be able to store that many different values, I would go with an integer.
The 'problem' you mention in the question is usually not a problem. There is no need for the identifiers to be gap-free in most cases. In fact, in lots of systems, gaps occur naturally during normal operation: you most probably won't reassign records to other IDs when a record is deleted from the middle of the table.
Unless the ID has a semantic meaning (it should not), I would just go with AUTO_INCREMENT; there is no need to reinvent the wheel.

MySQL: What is the impact of varchar length on performance when used as a primary key?

What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance wise to justify the work.
on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
The DB needs to maintain a B-Tree (or a similar structure) over the keys in a way that keeps them ordered.
If the key were hashed and the hash stored in the B-Tree, that would be fine for rapidly checking the uniqueness of the key -- the key could still be looked up efficiently. But you would not be able to search efficiently for a range of data (e.g. with LIKE), because the B-Tree would no longer be ordered according to the String value.
So I think most DBs really store the String in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be re-balanced if keys are inserted in arbitrary order (no notion of an ever-increasing value, as with a numeric PK).
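(As an aside, MySQL itself only offers hash indexes for the MEMORY and NDB engines; InnoDB indexes are B-Trees over the actual values. A hypothetical example:)
CREATE TABLE lookup (
    k VARCHAR(100) NOT NULL,
    v INT,
    UNIQUE KEY (k) USING HASH  -- honored by MEMORY; InnoDB silently falls back to a BTree
) ENGINE=MEMORY;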
The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join table, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't met performance issues with this. Our product is a web site under extremely high load, and stability is critical.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3,000,000 records, with lots of inserts and selects on them. I think that, in general, the benefit of migrating to an int key would be very low, but the problems while migrating would be very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered. By using an auto-increment integer you guarantee that each time you insert you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings however, the pk you insert may need to be placed in the middle of the other keys to maintain the pk order. That process of reordering the pks on the insert can get expensive.
It depends on several factors: the RDBMS, the number of indexes involving those columns. But in general it will be more efficient to use ints, followed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking of something naturally unique, like a social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK references updated when the referenced object changes.

How does MySQL determine if an INSERT is unique?

I would like to know if there is an implicit SELECT being run prior to performing an INSERT on a table that has any column defined as UNIQUE. I cannot find anything about this in the documentation for INSERT.
I have asked some other questions that nobody seems to be able to answer - perhaps because I'm not properly explaining myself - that are related to the above question.
If I understand correctly, then I assume the following would be true:
CASE 1:
You have a table with 1 billion rows. Each row has a UUID column which is unique. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE UUID = [new uuid] and determine if the count is 0 or 1. Correct?
CASE 2:
You have a table with 1 billion rows. Each row has a composite unique key consisting of a DATE and a UUID. If you perform an insert the server must do some kind of implicit SELECT COUNT(*) FROM table WHERE DATE = [date] AND UUID = [new uuid] and check if the count is 0 or 1. Yes?
I use the word implicit because at some point, somewhere in the process, the server MUST be checking the value. If not, it would require that the laws of physics dictate that two identical rows cannot exist -- and as far as I'm informed, physics doesn't play a big role when it comes to the uniqueness of numbers written down somewhere, in binary, on a magnetic disk in a computer.
Let's assume your 1 billion rows are equally and sequentially distributed across 2,000 different dates. Would this not mean that case 2 would perform the insert faster because it can look up the UUIDs segmented into dates? If not, then would it be better to use case 1 for insert speed - and in that case, why?
This question is theoretical, so don't bother with considering regular SELECT performance in this case. The primary key wouldn't be the UUID+DATE index.
As a response to comments: The UUID in my case is designed solely for the purpose of avoiding duplicate entries because of bad connections. Since you cannot make the same entry for a different date twice (without it logically being a new entry), the UUID does not need to be globally unique - it needs only be unique for each date. This is why I can permit it being part of a composite key.
There are a few flaws and misconceptions in the previous answers; rather than point them out, I will start from scratch.
Referring to InnoDB only...
An INDEX (including UNIQUE and PRIMARY KEY) is a BTree. BTrees are very efficient at locating one row based on the key the BTree is sorted on. (They are also efficient at scanning in key order.) The "fan out" of a typical BTree in MySQL is on the order of 100. So, for a million rows, the BTree is about 3 levels deep (log100(million)); for a trillion rows, it is only twice as deep (approximately). So, even if nothing is cached, it takes only 3 disk hits to locate one particular row in a million-row index.
I am being loose here with "index" versus "table" because they are essentially the same (in InnoDB, at least). Both are BTrees. What differs is what is in the leaf nodes: the leaf nodes of a table BTree have all the columns. (I am ignoring the off-block storage for TEXT/BLOB in InnoDB.) An INDEX (other than the PRIMARY KEY) has a copy of the PRIMARY KEY in its leaf nodes. This is how a secondary key can get from the INDEX BTree to the rest of the row's columns, and how InnoDB avoids storing multiple copies of all the columns.
The PRIMARY KEY is "clustered" with the data. That is one BTree contains both all the columns of all the rows, and it is ordered according to the PRIMARY KEY specification.
Locating a record by PRIMARY KEY is one BTree search. Locating a record by a SECONDARY KEY is two BTree searches, one in the secondary INDEX's BTree which gives you the PRIMARY KEY; then a second one to drill down the data/PK BTree.
PRIMARY KEY(UUID)... Since the UUID is very random, the "next" row you INSERT will be located at a 'random' spot. If the table is much bigger than can be cached in the buffer_pool, the block the new row needs to go into is very likely not to be cached. This leads to a disk hit to pull the block into cache (the buffer_pool), and eventually another disk hit to write it back to disk.
Since a PRIMARY KEY is a UNIQUE KEY, something else is going on at the same time (No SELECT COUNT(*) etc). The UNIQUEness is checked after the block is fetched and before deciding whether to give a "duplicate key" error, or to store the row. Also, if the block is "full" then the block will need to be 'split' to make room for the new row.
INDEX(UUID) or UNIQUE(UUID)... There is a BTree for that index. On INSERT, some randomly located block will need to be fetched, modified, possibly split, and written back to disk, very much like the PK discussion above. If you had UNIQUE(UUID), there would also be a check for UNIQUEness and possibly an error message. In either case, there is, now and/or later, disk I/O.
AUTO_INCREMENT PK... If the PRIMARY KEY is an auto_increment, then new records are added to the 'last' block in the data BTree. When it gets full (every 100 or so records) there is (logically) a block split and flush of the old block to disk. (Actually, the I/O is probably delayed and done in the background.)
PRIMARY KEY(id) + UNIQUE(UUID)... Two BTrees. On an INSERT, there is activity in both. This is likely to be worse than simply PRIMARY KEY(UUID). Add up the disk hits above to see what I mean.
"Disk hits" are the killer in huge tables, and especially with UUIDs. "Count the disk hits" to get a feel for performance, especially when comparing two possible techniques.
Now for your secret sauce... PRIMARY KEY(date, UUID)... You are allowing the same UUID to show up on two different days. This can help! Back to how a PK works and checking for UNIQUEness... The "compound" index (date, UUID) is checked for UNIQUEness as the record is inserted. The records are sorted by date+UUID, so all of today's records are clumped together. IF (and this might be a big IF) one day's data fits in the buffer pool (but the entire table does not), then this is what is happening every morning... INSERTs are suddenly adding new records to the "end" of the table because of the new "date". These inserts are occurring randomly within the new date. Blocks in the buffer_pool are being pushed out to disk to make room for the new blocks. But, nicely, what you see is smooth, fast, INSERTs. This is unlike what you saw with PRIMARY KEY(UUID), when many rows had to wait for a disk read before UNIQUEness could be checked. All of today's blocks stay cached, and you don't have to wait for I/O.
But, if you ever get so big that you cannot fit one day's data in the buffer pool, things will start slowing down, first at the end of the day, then it will creep earlier and earlier as the frequency of INSERTs increases.
By the way, PARTITION BY RANGE(date), together with PRIMARY KEY(uuid, date) has somewhat similar characteristics. (Yes I deliberately flipped the PK columns.)
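A sketch of that partitioned variant (table name, payload column, and partition boundaries are all hypothetical):
CREATE TABLE t (
    uuid BINARY(16) NOT NULL,
    `date` DATE NOT NULL,
    payload VARCHAR(100),
    PRIMARY KEY (uuid, `date`)          -- note the flipped column order
)
PARTITION BY RANGE (TO_DAYS(`date`)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
One nicety of this layout: expiring an old day's data becomes a cheap DROP PARTITION instead of a big DELETE.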
When inserting large amounts of data into a table, keep in mind that the data ends up being physically stored on a disk somewhere. To actually read and write the data from the disk, MySQL (and most other RDBMSs) uses something called a clustered index. If you specify a PRIMARY KEY on a table (or, in its absence, the first UNIQUE index over NOT NULL columns is promoted), the participating column or columns become the clustered index key. This means that on the disk, data is physically stored in the same order as the values in the key column(s).
By utilising the clustered index, the database engine can quickly determine whether a value already exists, without having to scan the whole table. In theory, if a table contains N = 1,000,000 records, the engine on average needs log2(N) ≈ 20 operations to check if a value exists, regardless of how many columns participate in the index. For secondary indexes, a B-tree or a hash table is typically used (search the web for these terms for a detailed explanation of how they work).
The conclusion of this article is wrong:
"... MySQL is unable to buffer enough data to guarantee a value is
unique and is therefore caused to perform a tremendous amount of
reading for each insert to guarantee uniqueness"
This is incorrect. Checking uniqueness does not really require any additional work, as the engine had to locate the place to insert the new record anyway. What causes the performance slowdown is the use of UUIDs. Remember that UUIDs are randomly generated whenever a new record is inserted. This means that the new record needs to be inserted at a random physical position on the disk, and this causes existing data to be shifted around to accommodate the new record. If, on the other hand, the index column is a value that increases monotonically (such as an auto-increment INT), new records will always be inserted after the last record, meaning no existing data will ever need to be moved.
In your case, there won't be any performance difference between case 1 and case 2. But you will still run into trouble because of the randomness of the UUIDs. It would be much better if you used an auto-incrementing value instead of the UUID. Also, since UUIDs are unique by construction, it rarely makes sense to index them with a UNIQUE constraint. Alternatively, if you really must use UUIDs, make sure your table has a primary key based on an auto-incrementing INT, to ensure that new records are never randomly inserted on the disk.
This is the very purpose of a UNIQUE constraint:
A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row [or update an existing row] with a key value that matches [another] existing row.
Earlier in the same manual page, it is stated that
A column list of the form (col1,col2,...) creates a multiple-column index. Index key values are formed by concatenating the values of the given columns.
How this constraint is implemented is not documented, but it must somehow equate to a preliminary SELECT with the values to be inserted/updated. The cost of such a check is often negligible, because, by definition, the fields are indexed (this overhead becomes relevant when dealing with bulk inserts).
The number of columns covered by the index is not meaningful in terms of performance (for example, compared to the number of rows in the table). It does impact the disk space occupied by the index, but this should really not matter in your design decisions.

What are the merits of using numeric row IDs in MySQL?

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, worrying about such efficiencies is premature optimization and probably not worth the effort.
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer bigtime if you apply functions to columns in search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
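When the test is really a prefix match, an index-friendly rewrite keeps the index usable (assuming a plain index on colm):
WHERE colm LIKE 'abc%' /* fast: can use the index on colm */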
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent      varchar(20)  pk, fk to parent table
birthorder  int          pk
name        varchar(20)
Then, you could have rows like...
parent  birthorder  name
homer   1           bart
homer   2           lisa
homer   3           maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision because (homer, 1) is occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
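In DDL, that compound key might look like this (assuming, hypothetically, that the parent table's key column is called name):
CREATE TABLE child (
    parent     VARCHAR(20) NOT NULL,
    birthorder INT NOT NULL,
    name       VARCHAR(20),
    PRIMARY KEY (parent, birthorder),          -- natural compound PK
    FOREIGN KEY (parent) REFERENCES parent(name)
);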
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
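A quick demonstration, with CAST as one way around it when a string column holds numbers:
SELECT '2' < '101';                                      -- 0: string comparison, '101' sorts first
SELECT CAST('2' AS UNSIGNED) < CAST('101' AS UNSIGNED);  -- 1: numeric comparison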
The main benefit you get from numeric values is that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data is in a nice, searchable format -- like a binary search tree -- it can be found in O(log n), much faster.
Indexing is the process MySQL uses to prepare data to be searched; it generates search trees and other clever structures that make finding data quick. It makes many searches much faster. However, to do so, it has to compare the value you are searching for to various 'key' values to determine whether your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to look up data quickly, your best bet is to have an integer 'key'.
Numeric row id's have many advantages over a string based id.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, so having a numeric key is always more efficient.
2. Numeric fields are stored much more compactly.
3. Joins are much faster with numeric keys.
4. A row id is often used as a foreign key; numeric ids are compact to store, making them efficient.
5. Using auto-increment on the primary key has its advantages too.

MySQL large index integers for few rows performance

A developer of mine was making an application and came up with the following schema
purchase_order      int(25)
sales_number        int(12)
fulfillment_number  int(12)
purchase_order is the index in this table. (There are other fields, but they are not relevant to this issue.) purchase_order is a concatenation of sales_number + fulfillment_number.
Instead I proposed an auto-incrementing id field.
The current format could be essentially 12-15 characters long and randomly generated (though always unique, as sales_number + fulfillment_number would always be unique).
My question here is:
if I have 3 rows, each with a random but unique ID, i.e. 983903004, 238839309, 288430274, vs three rows with the IDs 1, 2, 3, is there a performance hit?
As an aside, my other argument (for those interested) was that the schema makes little sense on the grounds of data redundancy (you can easily do SELECT CONCAT(sales_number, fulfillment_number) ... rather than storing the two columns together in a third).
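For instance (table name hypothetical):
SELECT CONCAT(sales_number, fulfillment_number) AS purchase_order
FROM orders
WHERE sales_number = 1234;  -- CONCAT of two ints yields the joined digit string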
The problem as I see it is not bigint vs int (an auto-increment column can be bigint as well; there is nothing wrong with that) but the random value for the primary key. If you use the InnoDB engine, the primary key is at the same time a clustered key, which defines the physical order of data. Inserting random values can potentially cause more page splits and, as a result, greater fragmentation, which in turn slows down not only insert/update queries but also selects.
Your argument about concatenating makes sense, but executing CONCAT also has its cost (unfortunately, MySQL doesn't support calculated persistent columns, so in some cases it's OK to store the result of the concatenation in a separate column).
AFAIK integers are stored and compared as integers so the comparisons should take the same length of time.
Concatenating two ints (32bit) into one bigint (64bit) may have a performance hit that is hardware dependent.
Having incremental ids will put records that were created around the same time near each other on the disk. This might make some queries faster, if this is the primary key on InnoDB or for an index that these ids are used in.
Incremental records can sometimes be inserted a little bit quicker; test to see.
You'll need to make sure that the random id is unique, so you'll need an extra lookup.
I don't know whether these points are material for your application.