MySQL: auto-increment a column or just have an integer, what's the difference? - mysql

Say I have a column that is set as the primary key and typed as INT.
If I don't set it as auto-increment and just insert random integers which are unique into it, does that slow down future queries compared to auto-incrementing?
Does it speed things up if I run OPTIMIZE on a table whose primary and only index is that INT? (Assume only 2 columns, and the second column is just some INT value.)
(The main worry is the upper limit on the auto-increment, as there are lots of adds and deletes in my table.)

If I don't set it as auto-increment and just insert random integers which are unique into it, does that slow it down compared to auto-incrementing?
In MyISAM it will in fact speed things up (marginally).
In InnoDB, this may slow the INSERT operations down due to page splits.
This of course implies that your numbers are really unique.
Does it speed things up if I run OPTIMIZE on a table whose primary and only index is that INT? (Assume only 2 columns, and the second column is just some INT value.)
AUTO_INCREMENT and INT may be used together.
OPTIMIZE TABLE will compact your table and indexes, freeing the space left over from deleted rows and page splits. If you had lots of DELETE operations on the table, or INSERTs out of order (as in your solution with random numbers), this will help.
It will also bring the logical and physical order of the index pages into consistency with each other, which will speed up full scans or range queries on the PK (PK BETWEEN val1 AND val2), but will hardly matter for random seeks.
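For the two-column table described in the question this is a single statement (a minimal sketch; the table name t is made up):

OPTIMIZE TABLE t;
-- For InnoDB, OPTIMIZE TABLE is mapped to a table rebuild (recreate + analyze),
-- which rewrites the clustered index in PK order and reclaims the free space.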
(the main worry is the upper limit on the auto-increment, as there are lots of adds and deletes in my table)
BIGINT UNSIGNED (which can also be used with AUTO_INCREMENT) can hold values up to 18446744073709551615.
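A minimal sketch of such a two-column table, matching the layout from the question (names are made up):

CREATE TABLE t (
  id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- upper limit 18446744073709551615
  val INT NOT NULL
);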

The upper limit for autoincremented integers is 18446744073709551615:
http://dev.mysql.com/doc/refman/5.1/en/numeric-types.html
Are you really hitting such a limit? Even if you are, letting MySQL add one to the previous number is an algorithm that can hardly be improved upon.

The upper limit on AUTO_INCREMENT is the upper limit of the number type in the respective column. Even with INT UNSIGNED, this can take a while to hit; with BIGINT it's going to be very hard to reach (and seriously, what kind of app are you building where 4 extra bytes per row are too much?). So if you're going to hit that limit, you'll hit it with auto-increment or without it.
Also, although not having AUTO_INCREMENT will speed your inserts up a tiny bit, I'm willing to bet that any code to generate a unique integer to use instead of the AUTO_INCREMENT will slow things down more than the auto-increment would (generating random non-conflicting numbers gets progressively harder as your table fills up).
In other words, IMNSHO this looks like premature optimization, and will not significantly contribute to faster code (if at all), but it will make it less maintainable (as the PK will need to be generated explicitly, instead of the database taking care of it).

Related

Which one is faster for getting a row: a primary key that carries numbers, or one that carries characters?

First table:
ID (INT 11, primary key, auto-increment)    TITLE
1                                           ...
2                                           ...
3                                           ...
4                                           ...
5                                           ...
... up to 10 million rows

Second table:
ID (CHAR 32, primary key)                   TITLE
a4a0FCBbE614497581da84454f806FbA            ...
40D553d006EF43f4b8ef3BcE6B08a542            ...
781DB409A5Db478f90B2486caBaAdfF2            ...
fD07F0a9780B4928bBBdbb1723298F92            ...
828Ef8A6eF244926A15a43400084da5D            ...
... up to 10 million rows
If I want to get a specific row from the first table, approximately how much time will it take? And the same for the second table, approximately how much time will it take?
Will a primary key that carries numbers be found faster than one that carries characters?
I do not want to use an auto-incremented INT like the first table because of this problem
UUIDs and MD5s and other hashes suck because of the "randomness" and lack of "locality of reference", not because of being characters instead of numeric.
You could convert those to BINARY(16), thereby making them half as big.
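A sketch of what that conversion looks like in SQL; the table and column names are made up, and the hex value is one of the examples above:

-- id BINARY(16) instead of CHAR(32)
INSERT INTO t2 (id, title)
VALUES (UNHEX('a4a0FCBbE614497581da84454f806FbA'), 'some title');

SELECT title
FROM   t2
WHERE  id = UNHEX('a4a0FCBbE614497581da84454f806FbA');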
10M INT = 40MB = 600/block
10M CHAR(32) = 320MB = 300/block
10M VARCHAR(32) = 330MB = 300/block
10M BINARY(16) = 160MB = 450/block
Add that much more for each secondary key in that table.
Add again for each other table that references that PK (eg, FOREIGN KEY).
Let's look at the B+Tree that is the structure of the PK and the secondary indexes. In a 16KB block, some number of entries can be placed; I have estimated them above. (Yes, the 'overhead' is much more than for an INT.) For INT, the BTree for 10M rows will probably be 3 levels deep. Ditto for the others. (As the table grows, VARCHAR would move to 4 levels before the others.)
So, I conclude, there is little or no difference in how many BTree blocks are needed to do your "point query".
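As a rough sanity check on that 3-level estimate, using the per-block figures above (my own round numbers, not exact InnoDB measurements):

600 entries/block, 3 levels: 600 * 600 * 600 = 216,000,000 addressable rows
300 entries/block, 3 levels: 300 * 300 * 300 =  27,000,000 addressable rows

Both comfortably cover 10,000,000 rows, which is why the depth, and hence the number of block reads per point query, comes out the same.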
Summary of how much slower a string is than an INT:
BTree depth -- little or none
Cacheability of index blocks -- some; not huge
CPU time to compare numbers vs strings -- some; not huge
Use of a fancy COLLATION -- some; not huge
Overall -- not enough difference to worry about.
What I will argue for in some cases is whether you need a fabricated PK. In 2/3 of the tables I build, I find that there is a 'natural' PK -- some column(s) that is, by the business logic, naturally UNIQUE and NOT NULL. These are the two main qualifications (in MySQL) for a PRIMARY KEY. In some situations the speedup afforded by a "natural PK" can be more than a factor of 2.
A Many-to-many mapping table is an excellent (and common) example of such.
It is impossible to tell the exact times needed to retrieve a specific record, because it depends on lots of factors.
In general, numeric values take less storage space, so scanning the index requires fewer I/O operations, and they are therefore usually faster.
However in this specific case the second table looks like a hexadecimal representation of a large number. You can probably store it as a binary value to save storage space.
On top of the above, numeric values are generally not affected by various database and column settings, while strings are (collation, for example), which can also add some processing time while querying.
The real question is what the purpose of the hexadecimal representation is. 10 million values easily fit in an INT, so why have a key which can store far more (a 32-character hexadecimal value)?
As long as you are within the range of the numeric type, and there is no requirement other than being able to store that many distinct values, I would go with an integer.
The 'problem' you mention in the question is usually not a problem. In most cases there is no need to avoid gaps in the identifiers. In fact, in lots of systems gaps occur naturally during normal operations. You most probably won't reassign records to other IDs when one record is deleted from the middle of the table.
Unless there is a semantic meaning of the ID (it should not), I would just go with an AUTO_INCREMENT, there is no need to reinvent the wheel.

MySQL: 'UNIQUE' constraint over a large string

What are the possible downsides of having a UNIQUE constraint on a large string (VARCHAR, roughly 100 characters or so) in MySQL during:
insert phase
retrieval phase (on another primary key)
Can the length of the key impact the performance of reads/writes? (Apart from disk/memory usage for bookkeeping.)
Thanks
Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc, depending on various things).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
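For the "won't fit" case, one common shape of the hash workaround looks like this (a sketch with made-up names; it relies on the MD5 being collision-free in practice, with the inefficiencies already mentioned):

CREATE TABLE long_strings (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  str     TEXT NOT NULL,                 -- too long to index directly
  str_md5 BINARY(16) NOT NULL,           -- UNHEX(MD5(str)), maintained by the application
  UNIQUE KEY uq_str_md5 (str_md5)
);

INSERT INTO long_strings (str, str_md5)
VALUES ('...a very long string...', UNHEX(MD5('...a very long string...')));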
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bringing a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process lock the row for reading and/or be blocked by some other connection that is, say, updating or deleting.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for JOIN or ORDER BY), then that is a simple subroutine call that scans over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.
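A sketch of the two alternatives being weighed (made-up names); the comments note where the PK copy ends up:

CREATE TABLE t_natural (
  name    VARCHAR(100) NOT NULL PRIMARY KEY,     -- the bulky natural key
  created DATE NOT NULL,
  INDEX   idx_created (created)                  -- each entry carries created + the VARCHAR(100) PK
);

CREATE TABLE t_surrogate (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name    VARCHAR(100) NOT NULL,
  created DATE NOT NULL,
  UNIQUE KEY uq_name (name),                     -- still needed to enforce uniqueness
  INDEX   idx_created (created)                  -- each entry carries created + the 4-byte id
);

With zero or one secondary index the surrogate mostly adds bulk; with several, the smaller PK copy starts to pay off, which is the tradeoff described above.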

Index on Boolean field to delete records in a partitioned table

I have a large MySQL table which may contain 100 million records. The schema of the table is something like this:
Id varchar(36),     -- guid, primary key
IsDirty bit(1),
CreatedOn date,
Info varchar(500)
I have partitioned the table on the CreatedOn field, with one partition per month of data. Some of the rows in the table are updated and IsDirty is set to 1. At most, only 10% of the rows would have IsDirty = 1. There is a process that runs every night and deletes data which is 6 months old with value IsDirty = 0.
Is there any performance gain if I create an index on the IsDirty field as well? From what I've read, creating an index on a bit field may not add much to performance, while maintaining the index after deleting records may degrade performance.
Is my understanding correct? Is there a better way to achieve the desired functionality?
There is a rule of thumb which says that it's best to index columns with a high cardinality. Cardinality is the estimated number of distinct values in the column. When you do SHOW INDEXES FROM your_table; you would see that an index on your IsDirty column has a cardinality of 2. Very bad.
However, this does not consider the distribution of the data. When only 10% of rows have IsDirty = 1, queries like SELECT * FROM your_table WHERE IsDirty = 1 would benefit from the index. Your delete job, on the other hand, which checks for IsDirty = 0, would not benefit: there it's cheaper to simply do a full table scan, because using a secondary index means first reading the primary key from the index (every secondary index stores the primary key, which is why it's always good to keep the primary key as small as possible) and then using it to locate the row itself.
The manual states the following about when a full table scan is preferred:
Each table index is queried, and the best index is used unless the optimizer believes that it is more efficient to use a table scan. At one time, a scan was used based on whether the best index spanned more than 30% of the table, but a fixed percentage no longer determines the choice between using an index or a scan. The optimizer now is more complex and bases its estimate on additional factors such as table size, number of rows, and I/O block size.
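To make the distinction concrete, a sketch using the column names from the question (the table name big_table is made up; whether the optimizer actually picks the index depends on its cost estimates):

ALTER TABLE big_table ADD INDEX idx_isdirty (IsDirty);

SELECT * FROM big_table WHERE IsDirty = 1;         -- ~10% of rows: the index is likely to help

DELETE FROM big_table                              -- most old rows match IsDirty = 0:
WHERE  IsDirty = 0                                 -- a scan is likely cheaper than going
  AND  CreatedOn < CURDATE() - INTERVAL 6 MONTH;   -- through the secondary index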
Also note that the bit datatype is not ideal for storing the values 0 or 1. There is a bool datatype (which is internally realised as tinyint(1); I think I've read a reason for this somewhere, but I've forgotten it).
Don't bother with partitioning; it is unlikely to help performance. In any case, you would need a growing number of partitions and PARTITION BY RANGE(TO_DAYS(..)), and you would not be able to use DROP PARTITION (which is what would have made the deletion very fast), since each old partition also contains IsDirty = 1 rows that must be kept.
I'll tentatively take that back. This may work, and may allow for DROP PARTITION, but I am baffled as to the syntax.
PARTITION BY RANGE(TO_DAYS(CreatedOn))
SUBPARTITION BY LINEAR KEY(IsDirty)
SUBPARTITIONS 2
If you do end up with a big DELETE every night, then either
Do it hourly (or continually) so that the delete is not too big, or
Chunk it as discussed here (a rough sketch follows below).
Also, have
INDEX(IsDirty, CreatedOn) -- in this order.
(Note: If the subpartitioning can be made to work, this index is not needed.)
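A rough sketch of the chunked delete (the table name big_table is made up; repeat until no rows are affected, e.g. from a cron job or an event):

DELETE FROM big_table
WHERE  IsDirty = 0
  AND  CreatedOn < CURDATE() - INTERVAL 6 MONTH
LIMIT  1000;          -- keep each transaction small; loop while ROW_COUNT() > 0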
Other tips:
Use InnoDB.
Set innodb_buffer_pool_size to about 70% of RAM size.
UUIDs are horrible for large tables due to the randomness of accessing -- hence high I/O.
Id varchar(36), --guid, primary key -- Pack it into BINARY(16); a sketch follows after these tips. Saving space --> shrinks table --> cuts back on I/O.
Because of the awfulness of UUIDs, the partitioning may help avoid a lot of the I/O: all of this month's inserts will be going into one partition, so the 'working set', and hence the required buffer_pool size, can be smaller.
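A sketch of the packing (same made-up table name as above; the 36-character textual GUID becomes 16 bytes once the dashes are stripped and the hex is decoded):

-- Id BINARY(16) instead of varchar(36)
INSERT INTO big_table (Id, IsDirty, CreatedOn, Info)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), b'0', CURDATE(), 'some info');

SELECT HEX(Id), Info FROM big_table;   -- HEX() to read it back in a human-readable form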

Short, single-field indexes or enormous covering indexes in MySQL

I am trying to understand exactly what is and is not useful in a multiple-field index. I have read this existing question (and many more) plus other sites/resources (MySQL Performance Blog, Percona slideshares, etc.) but I'm not totally confident that what I've found on the subject is current and accurate. So please bear with me while I repeat some of what I think I know.
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Assuming the above is at least generally accurate, here's my puzzle. I have a slowly changing dimension table defined with the following columns (more or less) and using MyISAM:
dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
person_ID INT UNSIGNED NOT NULL,
raw_name VARCHAR(92) NOT NULL,
first VARCHAR(30),
middle VARCHAR(50),
last VARCHAR(30),
suffix CHAR(3),
flag CHAR(1)
Each "owner" is a unique instance of a particular individual with a particular name, so if Sue Smith changes her name to Sue Brown, that results in two rows that are the same except for the last field and the surrogate key. My understanding is that the only way to enforce this constraint internally is to do:
UNIQUE INDEX uq_owner_complete (person_ID, raw_name, first, middle, last, suffix, flag)
And that's basically going to duplicate the entire table (except for the surrogate key).
I also need to index a few other fields for quick joins and searches. While there will be some writes, and disk space is neither free nor infinite, read performance is absolutely the #1 priority here. These smaller indexes should serve very well to cover the conditions of the queries that will be run against the table, but in almost every case, the entire row needs to be selected.
With that in mind:
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
If this were an InnoDB index, the PK would already be there, but since it's MyISAM it's got pointers to the full rows instead. So if I'm understanding things correctly, there's no point (no pun intended) to adding the PK to any other index, unless doing so would allow the retrieval of the desired result set directly from the index. Which is not likely here.
I understand if it seems like I'm trying too hard to optimize, and maybe I am, but the tasks I need to perform using this database take weeks at a time, so every little bit helps.
You have to understand one concept. An index (either InnoDB or MyISAM, either primary or secondary) is a data structure called a "B+ tree".
Each node in the B+ tree is a couple (k, v), where k is a key and v is a value. If you build an index on last_name, your keys will be "Smith", "Johnson", "Kuzminsky" etc.
The value in the index is some data. If the index is a secondary index, then the data is the primary key value.
So if you build an index on last_name, each node will be a couple (last_name, id), e.g. ("Smith", 5).
The primary index is an index where k is the primary key and the data is all the other fields.
Bearing the above in mind, let me comment on some of your points:
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
Not exactly. If your secondary index is good, you can quickly find v based on your query condition. E.g. you can quickly find the PK by last name.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
An index is a B+ tree where each node is a couple of the indexed field(s) value(s) and the PK.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Not exactly. If there were no index, you'd have to scan the whole table and choose only records where last_name = "Smith". But you have the index (last_name, PK), so given the key "Smith" you can quickly find all PKs where last_name = "Smith". Then you can quickly find your full result (because you need not only the last name, but the first name too). So you're right: queries like SELECT * FROM table WHERE last_name = "Smith" are executed in two steps:
Find all matching PKs.
By PK, find the full records.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
Exactly. If your index is actually (last_name, first_name, id) and your query is SELECT first_name FROM table WHERE last_name = "Smith", you don't do the second step. You have the first name in the secondary index, so you don't have to go to the primary index.
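A sketch of that covering case (the table name people is made up, the columns are the ones from the example):

ALTER TABLE people ADD INDEX idx_last_first (last_name, first_name);

EXPLAIN SELECT first_name FROM people WHERE last_name = 'Smith';
-- The Extra column shows "Using index": the secondary index alone answers the query,
-- so the primary index is never touched.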
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
Right. Two neighboring PK values will most likely be in the same page, except when one PK is the last value in a page and the next PK value is stored in the next page.
Basically, this is why the B+ tree structure was invented: it's efficient not only for searching but also for sequential access. And until recently we all had rotating hard drives.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Right. If you insert new records into a MyISAM table, the records are added to the end of the MYD file regardless of the PK order.
The primary index of a MyISAM table is a B+ tree with pointers to the records in the MYD file.
Now about your particular problem. I don't see any reason to define UNIQUE INDEX uq_owner_complete.
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
The best approach is to have a secondary index on all columns that are used in the WHERE clause, except fields with low selectivity (like sex). The most selective fields should go first in the index. For example, (last_name, eye_color) is good; (eye_color, last_name) is bad.
If a covering index lets you avoid the additional PK lookup, that's excellent. If not, that's acceptable too.
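The same made-up table, sketching that ordering advice:

ALTER TABLE people ADD INDEX idx_ln_ec (last_name, eye_color);  -- selective column first

SELECT * FROM people
WHERE  last_name = 'Smith' AND eye_color = 'blue';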
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Yes, it would essentially be just that.
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
The PK is already a part of the index. (Remember, it's stored as the data.) So it makes no sense to explicitly add PK fields to a secondary index. In MyISAM, by contrast, both the primary and the secondary indexes store pointers into the MYD file rather than PK values, as you noted above.
To summarize:
Make your PK as short as possible (a surrogate PK works great).
Add as many indexes as you need, until write performance becomes unacceptable for you.

MySQL large index integers for few rows performance

A developer of mine was making an application and came up with the following schema
purchase_order int(25)
sales_number int(12)
fulfillment_number int(12)
purchase_order is the index in this table. (There are other fields, but they are not relevant to this issue.) purchase_order is a concatenation of sales_number + fulfillment_number.
Instead I proposed an auto-incrementing id field.
The current format could be essentially 12-15 digits long and effectively random (though always unique, as sales_number + fulfillment_number would always be unique).
My question here is:
if I have 3 rows each with a random but unique ID, i.e. 983903004, 238839309, 288430274, vs. three rows with the IDs 1, 2, 3, is there a performance hit?
As an aside, my other argument (for those interested) was that the schema makes little sense on the grounds of data redundancy (you can easily do a SELECT CONCAT(sales_number, fulfillment_number) ... rather than storing the two columns together in a third).
The problem as I see it is not bigint vs. int (an auto-increment column can be bigint as well; there is nothing wrong with that) but the random value for the primary key. If you use the InnoDB engine, the primary key is at the same time a clustered key, which defines the physical order of the data. Inserting random values can potentially cause more page splits and, as a result, greater fragmentation, which in turn slows down not only INSERT/UPDATE queries but also SELECTs.
Your argument about concatenating makes sense, but executing CONCAT also has its cost (unfortunately, MySQL doesn't support persistent computed columns, so in some cases it's OK to store the result of the concatenation in a separate column).
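A sketch of the alternative argued for in the question (the table name is made up): keep the two real columns, make the pair unique, and derive the combined value only when it is needed:

CREATE TABLE purchase_orders (
  id                 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  sales_number       INT NOT NULL,
  fulfillment_number INT NOT NULL,
  UNIQUE KEY uq_sales_fulfillment (sales_number, fulfillment_number)
);

SELECT CONCAT(sales_number, fulfillment_number) AS purchase_order
FROM   purchase_orders
WHERE  id = 1;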
AFAIK integers are stored and compared as integers so the comparisons should take the same length of time.
Concatenating two ints (32bit) into one bigint (64bit) may have a performance hit that is hardware dependent.
Having incremental IDs will put records that were created around the same time near each other on disk, which might make some queries faster if this is the primary key on InnoDB, or for the index those IDs are used in.
Incremental records can sometimes be inserted a little bit quicker; test to see.
You'll need to make sure that the random ID is unique, so you'll need an extra lookup.
I don't know whether these points are material for your application.