I was previously under the impression that deleting rows in an autoincremented table can harm SELECT performance, and so I've been using a tinyint column called "removed" to mark whether an item is removed or not.
My SELECT queries are something like this:
SELECT * FROM items WHERE removed = 0 ORDER BY id DESC LIMIT 25
But I'm wondering whether it does, in fact, make sense to just delete those rows instead. Less than 1% of rows are marked as "removed", so it seems dumb for MySQL to have to check whether removed = 0 for each row.
So can deleting rows harm performance in any way?
That depends a lot on your use case - and on your users. Marking the row as deleted can help you in various situations:
if a user decides "oh, I did need that item after all", you don't need to go through the backups to restore it - just flip the "deleted" bit again (note potential privacy implications)
with foreign keys, you can't just go around deleting rows, you'd break the relationships in the database; same goes for security/audit logs
you aren't changing the number of rows (though if the removed rows add up over time, this may decrease index efficiency)
Moreover, when properly indexed, in my measurements, the impact was always insignificant (note that I wrote "measurements" - go and profile likewise, don't just blindly trust some people on the Internet). So, my advice would be "use the removed column, it has significant benefits and no significant negative impact".
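For example, a composite index that covers both the filter and the sort keeps that kind of query cheap; a minimal sketch against the table from the question (the index name is just illustrative):
ALTER TABLE items ADD INDEX idx_removed_id (removed, id);
-- lets MySQL satisfy WHERE removed = 0 ORDER BY id DESC LIMIT 25
-- by walking the index backwards, without touching removed rows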
I don't think deleting rows harms SELECT queries. Normally people add an extra column named deleted (removed in your case) to provide restore-like functionality. So if you are not providing restore functionality, you can delete the rows; it will not affect the SELECT query as far as I know. But while deleting, keep relationships in mind: related rows should also be deleted, or you will get errors or wrong results.
You just fill the table with more and more records which you don't need. If you don't plan to use them in the future, I don't think you need to store them at all. If you want to keep them anyway, but don't plan to use them often, you can just create a temp table to hold your "removed" records.
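A rough sketch of that idea, assuming the items table from the question and a hypothetical items_removed archive table:
CREATE TABLE items_removed LIKE items;              -- one-time setup, same structure
INSERT INTO items_removed                           -- periodically move soft-deleted rows
  SELECT * FROM items WHERE removed = 1;
DELETE FROM items WHERE removed = 1;                -- wrap both in a transaction on InnoDB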
Related
As a MySQL database user, I'm working on a script that uses a MySQL database with auto-increment primary key tables, where users may need to remove (lots of) rows: mistaken, duplicated, canceled data and so on.
For now, I use a tinyint column named 'delete' as the last column of each table and update rows to delete=1 instead of deleting them.
Considering the deleted data as not important:
which way do you suggest for a better database and better performance?
does deleting (maybe lots of) rows every day affect SELECT queries on large tables?
is it better to delete the rows immediately?
or keep the rows using the 'delete' column, delete them (for example) monthly, and then re-index the data?
I've searched about this, but most of the results were based on personal opinions or preferences rather than referenced or tested data.
PS) Edit:
Referring to the question and considering the picture below, there's one more point to ask about in this topic, and I would be grateful if you could guide me.
Deleting a row (row 6) while the auto-increment counter was at 225 led the unsorted table to show the next inserted row (id=225) in the deleted row's place at id=6 (at least visually!). If deletions happen many times, the primary key column and its rows end up completely out of order and messed up.
Should this be considered a good feature of the database (that it fills up the deleted spaces), or something bad that reduces performance, or neither, so that it doesn't matter what order is shown in front?
Thanks.
What percentage of the table is "deleted"?
If it is less than, say, 20%, it would be hard to measure any difference between a soft "deleted=1" and a hard "DELETE FROM tbl". The disk space would probably be the same. A 16KB block would either have soft-deleted rows to ignore, or the block would be not "full".
Let's say 80% of the rows have been deleted. Now there are some noticeable differences.
In the "soft-delete" case, a SELECT will be looking at 5 rows to find only 1. While this sounds terrible, it does not translate into 5 times the effort. There is overhead for fetching a block; if it contains 4 soft-deleted rows and 1 useful row, that overhead is shared. Once a useful row is found, there is overhead to deliver that row to the client, but that applies only to the 1 row.
In the "hard-delete" case, blocks are sometimes coalesced. That is, when two "adjacent" blocks become less than half full, they may be combined into a single block. (Or so the documentation says.) This helps to cut down on the number of blocks that need to be touched. But it does not shrink the disk space -- hard-deleted rows leave space that can be reused; deleted blocks can be reused. Blocks are not returned to the OS.
A "point-query" is a SELECT where you specify exactly the row you want (eg, WHERE id = 123). That will be very fast with either type of delete. The only possible change is if the BTree is a different depth. But even if 80% of the rows are deleted, the BTree is unlikely to change in depth. You need to get to about 99% deleted before the depth changes. (A million rows has a depth of about 3; 100M -> 4.)
"Range queries (eg, WHERE blah BETWEEN ... AND ...) will notice some degradation if most are soft-deleted -- but, as already mentioned, there is a slight degradation in either deletion method.
So, is this my "opinion"? Yes. But it is based on an understanding of how InnoDB tables work. And it is based on "experience" in the sense that I have detected nothing to significantly shake this explanation in about 19 years of using InnoDB.
Further... With hard-delete, you have the option of reclaiming the unused space with OPTIMIZE TABLE. But I have repeatedly said "don't bother" and elaborated on why.
On the other hand, if you need to delete a big chunk of a table (either one-time or repeatedly), see my blog on efficient techniques: http://mysql.rjweb.org/doc.php/deletebig
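One common pattern along those lines (not a claim about what that blog specifically recommends) is deleting in limited chunks, so no single statement touches or locks too many rows; a hedged sketch with a hypothetical table and flag column:
DELETE FROM tbl WHERE deleted = 1 LIMIT 1000;   -- repeat until 0 rows affected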
(Re: the PS)
SELECT without an ORDER BY -- It is 'fair game' for the query to return the rows in any order it feels like. If you want a certain order, add ORDER BY.
What Engine is being used? MyISAM and InnoDB work differently; neither is predictable without ORDER BY.
If you wanted the new entry to have id=6, that is a different problem. (And I will probably argue against designing the ids like that.)
The simple answer is no. DBMS systems are designed to handle changes at any time while keeping performance in mind. Sometimes it will affect things a little bit, but there is no need to worry about it.
The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted) and therefore you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
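Concretely, with the hypothetical query above, that just means:
select *
from TABLE
where ID >= :x:
order by ID
limit 100
Because ID is the clustered primary key, the ORDER BY should add essentially no extra work, and the next page can simply start from the last ID returned.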
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal, it's hard to code, and it requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown on two different pages because the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, because both orderings complied with the requested condition.
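The usual fix is to add a unique column as a tie-breaker so the ordering is total; a sketch with hypothetical column names:
SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10, 10;   -- no row can now land on both pages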
I have encountered the fact that some people, after performing deletion of rows from a table, also reset the AUTO_INCREMENT for the primary key column of that table to re-number all the values as if they started from 1 again (or whatever the initial starting point).
My question is, is there a specific reason for doing this, other than just preference? As in, is there any detrimental impact on the database or future queries if you do not reset the auto-increment and just leave it as-is? If there is, could somebody provide an example where it would be necessary to reset AUTO_INCREMENT?
Thanks!
I don't think it is ever necessary to reset auto_increment, unless you are running out of values.
One case where auto-increment is often reset is when all the rows are deleted. If you use truncate table, then the auto-increment value is reset automatically. This does not always happen with delete without a where clause, so for consistency, you might want to reset it.
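For example (hypothetical table name):
TRUNCATE TABLE t;                    -- empties the table and resets AUTO_INCREMENT
DELETE FROM t;                       -- empties the table; the counter may keep its old value
ALTER TABLE t AUTO_INCREMENT = 1;    -- explicit reset after the DELETE, for consistency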
Another case is when a large insert fails, particularly if it fails repeatedly. You might not want the really large gaps.
When moving tables around you might want to keep the original id values. So, essentially, you ignore the auto-increment on inserts. Afterwards, though, you might want to set the automatic value to be consistent with other systems.
In general, though, resetting the auto-increment is not recommended.
Unfortunately, I've seen this behavior. And from what I observed, it's not due to a technical reason - it's closer to OCD.
Some people really don't like gaps in the ID column - they like the idea of it smoothly increasing by 1 for each record. The idea that some manual data manipulation they're doing is screwing that up isn't pleasant - so they jump through hoops to make sure they don't cause gaps in the numbers.
But, yeah, this is a terrible practice. It's just asking for data integrity problems.
Resetting auto-inc is an uncommon operation. Under normal day to day work, just let it keep incrementing.
I've done reset of auto-inc in MySQL instances used for automated testing. A given set of tables is loaded with data over and over, and deletes its test data afterwards. Resetting the auto-inc may be the best way to make tests repeatable, if they're looking for specific values in the results.
Another scenario is when creating archive tables. Suppose you have a huge table, and you want to empty out the data efficiently (not using DELETE), but you want to archive the data, and you want new data to use id values higher than your old data.
CREATE TABLE mytable_new LIKE mytable;
SELECT AUTO_INCREMENT FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA=DATABASE() AND TABLE_NAME='mytable';
ALTER TABLE mytable_new AUTO_INCREMENT = /* value + 10000 */;
RENAME TABLE mytable TO mytable_archive, mytable_new TO mytable;
The above series of statements allow you to shuffle a new empty table into place atomically, so your app can continue writing to the table by the name it's used to. The auto-inc value you reset in the new table should be a value higher than the max id value in the old table, plus some comfortable gap to avoid overlap during the moments between the statements.
Resetting the auto-increment usually helps in terms of organization; you see no gap between id 6 and 60 if the rows in between have been deleted.
However, you should be careful about resetting auto-increments, because most likely your code depends on specific ids to fetch certain information.
In my opinion, just truncate the whole thing after your tests and seed the database with the correct information. If it's production, let it run wild and free; resetting could cause more harm than benefit.
As per comment on abr's answer, assuming that auto-increment ids are contiguous (or even sequential) is not just a bad idea, it is a dangerous one.
There may be good reason for deliberately creating gaps in the allocated ids if you intend to patch the data at a later point (e.g. if you have restored from an old backup and expect to recover some of the missing data but need to restore a service asap) or when you migrate from a single active server to multiple master nodes. But in these scenarios you are setting the counter to higher value than currently used - not resetting it back to the start.
If there is a risk that you are going to wrap around the numbers, then you've probably picked the wrong data type for your auto-increment attribute - changing the data type is the right way to fix the problem, not deleting data and resetting the counter to 0.
I have a database with a table 'product' related to a table 'discount'.
The discount table, in turn, is related to 'brand' and 'category'.
I need to know the 'calculatedPrice' of each product and then sort by it (about 30,000 products).
But, this way, the query is too slow.
Is it acceptable to violate the third normal form and add 'calculatedPrice' as a column in the database?
The column would then be recalculated by a query run once every 5 minutes or so...
I don't see any other solution.
I think it depends on your situation: for example, I once had a table which contained the information needed to generate an invoice, such as the price, but also the VAT value at the time the invoice was issued.
The VAT value changes over time, but it cannot influence previously issued invoices, so the only way to avoid problems was to store the value itself rather than a reference into a "constant values" table. This, indeed, produces redundancy of the information inside the table and possible "inconsistencies" inside the database.
That said, I would consider very carefully the reasons behind your choice of adding a column (note that these are just to make you think, not pointing a finger anywhere :) ):
are you sure that performance is affected by a couple of joins between your tables?
if so, are you sure that the problem doesn't lie somewhere else in your design?
do you always need the value of calculatedPrice for all your products or can you reduce the number of rows, calculating only the values for the products you actually need?
If your answer is "Yes" for all the questions above, then go for the extra column.
P.S.: I would, in any case, avoid things like having "a query run once every 5 min or so": this opens your system to synchronization problems, and concurrency issues. What would happen if the discount has changed, but your "update query" has not yet run? Would then your program retrieve an old value?
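If, after weighing all of that, you still add the column, the refresh could be a single UPDATE with the joins baked in (ideally run right after a discount changes rather than on a blind timer); a rough sketch, where the join and price column names are guesses since the schema isn't shown:
ALTER TABLE product ADD COLUMN calculatedPrice DECIMAL(10,2);
UPDATE product p
  JOIN discount d ON d.id = p.discount_id                      -- hypothetical join column
   SET p.calculatedPrice = p.price * (1 - d.percentage / 100);  -- hypothetical formula
An index on calculatedPrice would then make sorting the ~30,000 products cheap.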
I have a table with 100 million rows, and it's getting too big.
I see a lot of gaps. (since I delete, add, delete, add.)
I want to fill these gaps with auto-increment.
If I do reset it, is there any harm?
If I do this, will it fill the gaps?:
mysql> ALTER TABLE tbl AUTO_INCREMENT = 1;
Potentially very dangerous, because you can get a number again that is already in use.
What you propose is resetting the sequence to 1 again. It will just produce 1,2,3,4,5,6,7,.. and so on, regardless of these numbers being in a gap or not.
Update: According to Martin's answer, because of the dangers involved, MySQL will not even let you do that. It will reset the counter to at least the current value + 1.
Think again what real problem the existence of gaps causes. Usually it is only an aesthetic issue.
If the number gets too big, switch to a larger data type (bigint should be plenty).
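Changing the type is a single DDL statement, though on a table this size it rebuilds the table and can take a while; a sketch, assuming the column is simply called id:
ALTER TABLE tbl MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;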
FWIW... According to the MySQL docs applying
ALTER TABLE tbl AUTO_INCREMENT = 1
where tbl contains existing data should have no effect:
To change the value of the AUTO_INCREMENT counter to be used for new rows, do this:
ALTER TABLE t2 AUTO_INCREMENT = value;
You cannot reset the counter to a value less than or equal to any that have already been used. For MyISAM, if the value is less than or equal to the maximum value currently in the AUTO_INCREMENT column, the value is reset to the current maximum plus one. For InnoDB, if the value is less than the current maximum value in the column, no error occurs and the current sequence value is not changed.
I ran a small test that confirmed this for a MyISAM table.
So the answers to your questions are: no harm, and no, it won't fill the gaps. As other responders have said, a change of data type looks like the least painful choice.
Chances are you wouldn't gain anything from doing this, and you could easily screw up your application by overwriting rows, since you're going to reset the count for the IDs. (In other words, the next time you insert a row, it'll overwrite the row with ID 1, and then 2, etc.) What will you gain from filling the gaps? If the number gets too big, just change it to a larger number (such as BIGINT).
Edit: I stand corrected. It won't do anything at all, which supports my point that you should just change the type of the column to a larger integer type. The maximum possible value for an unsigned BIGINT is 2^64 - 1, which is over 18 quintillion. If you only have 100 million rows at the moment, that should be plenty for the foreseeable future.
I agree with musicfreak... The maximum for an integer (int(10)) is 4,294,967,295 (unsigned, of course). If you need to go even higher, switching to BIGINT brings you up to 18,446,744,073,709,551,615.
Since you can't change the next auto-increment value, you have other options. The datatype switch could be done, but it seems a little unsettling to me since you don't actually have that many rows. You'd have to make sure your code can handle IDs that large, which may or may not be tough for you.
Are you able to do much downtime? If you are, there are two options I can think of:
Dump/reload the data. You can do this so it won't keep the ID numbers. For example, you could use INSERT ... SELECT to copy the data, sans IDs, to a new table with identical DDL (a sketch follows below). Then you drop the old table and rename the new table to the old name. Depending on how much data there is, this could take a noticeable amount of time (and temporary disk space).
You could make a little program to issue UPDATE statements to change the IDs. If you let that run slowly, it would "defragment" your IDs over time. Then you could temporarily stop the inserts (just a minute or two), update the last IDs, then restart it. After updating the last IDs you can change the AUTO_INCREMENT value to be the next number and your hole will be gone. This shouldn't cause any real downtime (at least on InnoDB), but it could take quite a while depending on how aggressive your program is.
Of course, both of these ignore referential integrity. I'm assuming that's not a problem (log statements that aren't used as foreign keys, or some such).
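A rough sketch of the first option, with hypothetical table and column names (list every column except the id so new ids are assigned from 1):
CREATE TABLE tbl_new LIKE tbl;                     -- identical DDL, fresh counter
INSERT INTO tbl_new (col_a, col_b)
  SELECT col_a, col_b FROM tbl ORDER BY id;        -- ids re-assigned 1, 2, 3, ...
RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;
DROP TABLE tbl_old;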
Does it really matter if there are gaps?
If you really want to go back and fill them, you can always turn off auto increment, and manually scan for the next available id every time you want to insert a row -- remembering to lock the table to avoid race conditions, of course. But it's a lot of work to do for not much gain.
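If you did go that route, the "scan for the next available id" part could be a self-join like this (hypothetical table name; it also assumes id 1 itself isn't the missing one):
SELECT MIN(t1.id) + 1 AS next_free_id     -- first id whose successor is missing
  FROM tbl t1
  LEFT JOIN tbl t2 ON t2.id = t1.id + 1
 WHERE t2.id IS NULL;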
Do you really need a surrogate key anyway? Depending on the data (you haven't mentioned a schema) you can probably find a natural key.