I have a table with 100 million rows, and it's getting too big.
I see a lot of gaps, since I delete, add, delete, add.
I want to fill these gaps with auto-increment.
If I reset it, is there any harm?
If I do this, will it fill the gaps?:
mysql> ALTER TABLE tbl AUTO_INCREMENT = 1;
Potentially very dangerous, because you can get a number again that is already in use.
What you propose is resetting the sequence to 1 again. It will just produce 1,2,3,4,5,6,7,.. and so on, regardless of these numbers being in a gap or not.
Update: According to Martin's answer, because of the dangers involved, MySQL will not even let you do that. It will reset the counter to at least the current value + 1.
Think again about what real problem the existence of gaps actually causes. Usually it is only an aesthetic issue.
If the number gets too big, switch to a larger data type (bigint should be plenty).
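For reference, a minimal sketch of that switch, assuming the auto-increment column is named id (this rebuilds the table, so expect it to take a while on 100 million rows):

ALTER TABLE tbl MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;  -- widens the column in place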
FWIW... According to the MySQL docs applying
ALTER TABLE tbl AUTO_INCREMENT = 1
where tbl contains existing data should have no effect:
To change the value of the AUTO_INCREMENT counter to be used for new rows, do this:

ALTER TABLE t2 AUTO_INCREMENT = value;

You cannot reset the counter to a value less than or equal to any that have already been used. For MyISAM, if the value is less than or equal to the maximum value currently in the AUTO_INCREMENT column, the value is reset to the current maximum plus one. For InnoDB, if the value is less than the current maximum value in the column, no error occurs and the current sequence value is not changed.
I ran a small test that confirmed this for a MyISAM table.
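A test along those lines might look like this (a sketch; table name and ids are illustrative):

CREATE TABLE t (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;
INSERT INTO t VALUES (), (), ();    -- ids 1, 2, 3
DELETE FROM t WHERE id = 2;         -- leave a gap
ALTER TABLE t AUTO_INCREMENT = 1;   -- attempt the reset
INSERT INTO t VALUES ();            -- gets id 4: the counter snapped back to max + 1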
So the answers to your questions are: no harm, and no, it won't fill the gaps. As other responders have said, a change of data type looks like the least painful choice.
Chances are you wouldn't gain anything from doing this, and you could easily screw up your application by overwriting rows, since you'd be resetting the count for the IDs. (In other words, the next time you insert a row, it would overwrite the row with ID 1, then 2, and so on.) What would you gain from filling the gaps? If the number gets too big, just change the column to a larger type (such as BIGINT).
Edit: I stand corrected. It won't do anything at all, which supports my point that you should just change the type of the column to a larger integer type. The maximum value for an unsigned BIGINT is 2^64 - 1, which is over 18 quintillion. If you only have 100 million rows at the moment, that should be plenty for the foreseeable future.
I agree with musicfreak... The maximum for an integer (int(10)) is 4,294,967,295 (unsigned, of course). If you need to go even higher, switching to BIGINT brings you up to 18,446,744,073,709,551,615.
Since you can't change the next auto-increment value, you have other options. The datatype switch could be done, but it seems a little unsettling to me since you don't actually have that many rows. You'd have to make sure your code can handle IDs that large, which may or may not be tough for you.
Can you afford much downtime? If you can, there are two options I can think of:
Dump/reload the data. You can do this so it won't keep the ID numbers. For example, you could use an INSERT ... SELECT to copy the data, sans IDs, to a new table with identical DDL (see the sketch after these options). Then you drop the old table and rename the new table to the old name. Depending on how much data there is, this could take a noticeable amount of time (and temporary disk space).
You could make a little program to issue UPDATE statements to change the IDs. If you let that run slowly, it would "defragment" your IDs over time. Then you could temporarily stop the inserts (just a minute or two), update the last IDs, then restart it. After updating the last IDs you can change the AUTO_INCREMENT value to be the next number and your hole will be gone. This shouldn't cause any real downtime (at least on InnoDB), but it could take quite a while depending on how aggressive your program is.
Of course, both of these ignore referential integrity. I'm assuming that's not a problem (log statements that aren't used as foreign keys, or some such).
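For the first option, a rough sketch of the copy-without-IDs approach - assuming a hypothetical table items with non-id columns col_a and col_b, and no foreign keys:

CREATE TABLE items_new LIKE items;                    -- identical DDL, fresh counter
INSERT INTO items_new (col_a, col_b)                  -- omit the id column
    SELECT col_a, col_b FROM items ORDER BY id;       -- new ids assigned in the old order
RENAME TABLE items TO items_old, items_new TO items;  -- atomic swap
DROP TABLE items_old;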
Does it really matter if there are gaps?
If you really want to go back and fill them, you can always turn off auto increment, and manually scan for the next available id every time you want to insert a row -- remembering to lock the table to avoid race conditions, of course. But it's a lot of work to do for not much gain.
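If you do go that route, one way to scan for the lowest free id is a self-join (a sketch with assumed names; it returns NULL on an empty table and ignores a gap below the smallest id):

SELECT MIN(t1.id) + 1 AS next_free_id
FROM items t1
LEFT JOIN items t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL;   -- the smallest id whose successor is missing, plus one

Remember to serialize inserts around this, or two sessions can grab the same id.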
Do you really need a surrogate key anyway? Depending on the data (you haven't mentioned a schema) you can probably find a natural key.
I have a MYSQL table, where (to an already existing table) I added another column "Number" that is auto_incremented and has a UNIQUE KEY constraint.
There are 17,000+ records in the table. After adding the "Number" column, one value is missing: there is a value of 14,369 and the next one is 14,371.
I tried removing the column and adding it again, but the missing value is still missing.
What might be the problem, and what is the least painful way to solve this?
There is no problem and there is nothing to fix.
MySQL's auto_increment provides unique values, and it calculates them with a simple sequential algorithm - it just increments a number.
That algorithm is the fastest, most reliable way of generating unique values.
That's its job. It doesn't "reuse" numbers, and forcing it to do so carries disastrous performance and stability costs.
Since queries do fail sometimes, these numbers get "lost" and you can't have them back.
If you require sequential numbers for whatever reason, create a procedure or scheduled event and maintain the numbers yourself.
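For example, a minimal sketch of maintaining such numbers yourself, assuming a separate seq column added just for display - never renumber the primary key itself:

SET @n := 0;
UPDATE items SET seq = (@n := @n + 1) ORDER BY id;  -- renumber 1..N in id order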
You have to bear in mind that MySQL is a transactional database designed to operate under concurrent access. If it were to reuse these numbers, the performance would be abysmal since it'd have to use locks and force people to wait until it reorganizes the numbers.
The InnoDB engine, the default engine, uses primary key values to organize records on disk. If you were to change any of those values, it would start rewriting records, incurring a HUGE I/O wait that depends on the amount of data on disk - it could bring the whole server to a grinding halt.
TL;DR: there is no problem, there is nothing to fix, don't do it. If you persist, expect abnormal behavior.
I have encountered the fact that some people, after performing deletion of rows from a table, also reset the AUTO_INCREMENT for the primary key column of that table to re-number all the values as if they started from 1 again (or whatever the initial starting point).
My question is, is there a specific reason for doing this, other than just preference? As in, is there any detrimental impact on the database or future queries if you do not reset the auto-increment and just leave it as-is? If there is, could somebody provide an example where it would be necessary to reset AUTO_INCREMENT?
Thanks!
I don't think it is ever necessary to reset auto_increment, unless you are running out of values.
One case where auto-increment is often reset is when all the rows are deleted. If you use truncate table, then the auto-increment value is reset automatically. This does not always happen with delete without a where clause, so for consistency, you might want to reset it.
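To illustrate the difference (table name assumed; the statements are alternatives, not a sequence):

DELETE FROM t;                     -- rows gone; the counter usually keeps its old value
TRUNCATE TABLE t;                  -- rows gone; the counter is reset to 1 automatically
ALTER TABLE t AUTO_INCREMENT = 1;  -- manual reset after a DELETE, for consistency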
Another case is when a large insert fails, particularly if it fails repeatedly. You might not want the really large gaps.
When moving tables around you might want to keep the original id values. So, essentially, you ignore the auto-increment on inserts. Afterwards, though, you might want to set the automatic value to be consistent with other systems.
In general, though, resetting the auto-increment is not recommended.
Unfortunately, I've seen this behavior. And from what I observed, it's not due to a technical reason - it's closer to OCD.
Some people really don't like gaps in the ID column - they like the idea of it smoothly increasing by 1 for each record. The idea that some manual data manipulation they're doing might screw that up isn't pleasant - so they jump through hoops to make sure they don't cause gaps in the numbers.
But, yeah, this is a terrible practice. It's just asking for data integrity problems.
Resetting auto-inc is an uncommon operation. Under normal day to day work, just let it keep incrementing.
I've done reset of auto-inc in MySQL instances used for automated testing. A given set of tables is loaded with data over and over, and deletes its test data afterwards. Resetting the auto-inc may be the best way to make tests repeatable, if they're looking for specific values in the results.
Another scenario is when creating archive tables. Suppose you have a huge table, and you want to empty out the data efficiently (not using DELETE), but you want to archive the data, and you want new data to use id values higher than your old data.
CREATE TABLE mytable_new LIKE mytable;  -- clone the schema, indexes, and options
SELECT AUTO_INCREMENT FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME='mytable' AND TABLE_SCHEMA=DATABASE();  -- read the old table's next id
ALTER TABLE mytable_new AUTO_INCREMENT = /* value + 10000 */;  -- leave a comfortable gap
RENAME TABLE mytable TO mytable_archive, mytable_new TO mytable;  -- atomic swap
The above series of statements allows you to shuffle a new, empty table into place atomically, so your app can continue writing to the table by the name it's used to. The auto-inc value you set in the new table should be higher than the max id value in the old table, with a comfortable gap to avoid overlap during the moments between the statements.
Resetting the auto-increment usually helps in terms of organization: you won't see a gap between id 6 and id 60 when the rows in between have been deleted.
However, you should be careful about resetting auto-increments, because your code most likely depends on specific ids to fetch certain information.
In my opinion, just truncate the whole thing after your tests and seed the database with the correct information. If it's production, let it run wild and free; resetting could cause more harm than any benefit.
As per comment on abr's answer, assuming that auto-increment ids are contiguous (or even sequential) is not just a bad idea, it is a dangerous one.
There may be good reasons for deliberately creating gaps in the allocated ids if you intend to patch the data at a later point (e.g. if you have restored from an old backup and expect to recover some of the missing data but need to restore service asap), or when you migrate from a single active server to multiple master nodes. But in these scenarios you are setting the counter to a higher value than currently used - not resetting it back to the start.
If there is a risk that you are going to wrap around the numbers, then you've probably picked the wrong data type for your auto-increment attribute - changing the data type is the right way to fix the problem, not deleting data and resetting the counter to 0.
I'm going to create a table which will have between 1,000 and 20,000 rows, and it has string fields whose values repeat a lot: about 60% of the rows carry repeated values, with groups of roughly 50-100 rows sharing one value.
I've been concerned about efficiency lately, and I'm wondering whether it would be better to store this string on each row (it would be between 8 and 20 characters) or to create another table and link to it with a representative ID instead - so having ~1-50 rows in that lookup table, replacing about 300-5,000 strings with ints?
Is this a good approach, or even necessary at all?
Yes, it's a good approach in most circumstances. It's called normalisation, and is mainly done for two reasons:
Removing repeated data
Avoiding repeating entities
I can't tell from your question what the reason would be in your case.
The difference between the two is that the first reuses values that just happen to look the same, while the second connects values that have the same meaning. The practical difference is in what should happen if a value changes, i.e. if the value changes for one record, should the value itself change so that it changes for all other records also using it, or should that record be connected to a new value so that the other records are left unchanged.
If it's for the first reason then you will save space in the database, but it will be more complicated to update records. If it's for the second reason you will not only save space, but you will also reduce the risk of inconsistency, as a value is only stored in one place.
It is a good approach to have a look-up table for the strings; that way you can build more efficient indexes on the integer values. It isn't absolutely necessary, but as a good practice I would do it.
I would recommend using an int with a foreign key to a lookup table (like you describe in your second scenario). This will cause the index to be much smaller than indexing a VARCHAR so the storage required would be smaller. It should perform better, too.
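A minimal sketch of that layout, with illustrative names throughout:

CREATE TABLE labels (
    id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL UNIQUE      -- each distinct string stored once
) ENGINE=InnoDB;

CREATE TABLE items (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    label_id SMALLINT UNSIGNED NOT NULL,
    FOREIGN KEY (label_id) REFERENCES labels(id)
) ENGINE=InnoDB;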
Avitus is right that it's generally good practice to create lookups.
Think about the JOINs you will use this table in. 1,000-20,000 rows are not a lot for MySQL to handle. If you don't have any joins, I would not bother with the lookups - just index the column.
BUT as soon as you start joining the table with others of the same size, that's where the performance loss comes in, which you can (most likely) compensate for by introducing lookups.
I was previously under the impression that deleting rows in an autoincremented table can harm SELECT performance, and so I've been using a tinyint column called "removed" to mark whether an item is removed or not.
My SELECT queries are something like this:
SELECT * FROM items WHERE removed = 0 ORDER BY id DESC LIMIT 25
But I'm wondering whether it does, in fact, make sense to just delete those rows instead. Less than 1% of rows are marked as "removed", so it seems dumb for MySQL to have to check whether removed = 0 for each row.
So can deleting rows harm performance in any way?
That depends a lot on your use case - and on your users. Marking the row as deleted can help you in various situations:
if a user decides "oh, I did need that item after all", you don't need to go through the backups to restore it - just flip the "deleted" bit again (note potential privacy implications)
with foreign keys, you can't just go around deleting rows, you'd break the relationships in the database; same goes for security/audit logs
you aren't changing the number of rows (though keeping removed rows may decrease index efficiency as they add up)
Moreover, when properly indexed, in my measurements, the impact was always insignificant (note that I wrote "measurements" - go and profile likewise, don't just blindly trust some people on the Internet). So, my advice would be "use the removed column, it has significant benefits and no significant negative impact".
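For instance, one way to index for the query shown earlier (index name illustrative):

ALTER TABLE items ADD INDEX idx_removed_id (removed, id);
-- lets "WHERE removed = 0 ORDER BY id DESC LIMIT 25" walk the index backwards
-- instead of checking the flag row by row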
I don't think deleting rows harms SELECT queries. Normally people add an extra column named deleted [removed in your case] to provide restore-like functionality. So if you are not providing restore functionality, you can delete the rows; it will not affect SELECT queries as far as I know. But while deleting, keep relationships in mind: related rows should also be deleted, or you will get errors or wrong results.
You just fill the table with more and more records which you don't need. If you don't plan to use them in the future, I don't think you need to store them at all. If you want to keep them anyway, but don't plan to use them often, you can just create a temp table to hold your "removed" records.
The database I am working on right now has records being added and removed by the tens of thousands, and because of this there are gaps in the auto-incremented key hundreds of thousands wide, with auto-increment values well into the billions.
These numbers are never stored to reference the individual record but are used to reference the record when doing on-the-fly calculations.
Is there any reason to remove these gaps and reset the auto-increment number or is this inconsequential?
The id field is an unsigned INT; should I increase it to an unsigned BIGINT? From what I understand, right now if it hits 4,294,967,295 it will break.
The only reason I'd worry about it is if you find yourself close to that 2^32 limit. If you're not using the column as a row id, then don't even worry about it.
EDIT If you are using this column for any kind of identifying information, then I'd switch the column over to a GUID or something, because you're gonna get overflow, and then you'll get duplicate values. And that's no bueno.
I don't know what the growth rate of your autoincrement field is, but it should be simple math for you to estimate when you will hit the 4294967295 limit.
If you still feel that you need to do something, you have the following options:
Reset the current count to 1, ideally by dropping the column and recreating it. Since you are not using this field for referential integrity, this should be a quick and simple fix - until the next time...
Change the datatype to an unsigned BIGINT. Now you can go up to 18446744073709551615. But you need more space in the heap to store this increased amount of data, and you have only postponed your problem.
Change from an autoincrement (INT / BIGINT) to a UUID. Then you can stop worrying about numbers and the nature of infinity, but you most likely will have to change all of your client code.
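A sketch of the third option on MySQL 8+, where UUID_TO_BIN keeps the storage compact (names are placeholders):

CREATE TABLE t (
    id      BINARY(16) NOT NULL PRIMARY KEY,
    payload VARCHAR(255)
);
INSERT INTO t (id, payload) VALUES (UUID_TO_BIN(UUID()), 'example');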
On a separate note, I sense a poor decision or two somewhere earlier up the line here.