Does deleting rows from a table affect DB performance? - mysql

As a MySQL database user,
I'm working on a script that uses a MySQL database with auto-increment primary key tables. Users may need to remove (lots of) rows that are mistaken, duplicated, canceled, and so on.
For now, I use a TINYINT column named 'delete' as the last column of each table and update rows to delete=1 instead of deleting them.
Considering that the deleted data is not important:
Which approach do you suggest for a better database and better performance?
Does deleting (possibly lots of) rows every day affect SELECT queries on large tables?
Is it better to delete the rows immediately?
Or to keep the rows flagged via the 'delete' column, delete them (for example) monthly, and then re-index the data?
I've searched for this, but most of the results were based on personal opinions or preferences rather than on referenced or tested data.
PS) Edit:
Referring to the question and considering the picture below, there is one more point to ask about, and I would be grateful for your guidance.
Deleting a row (row 6) while the auto-increment counter was at 225 led the unsorted table to put the next inserted row (id=225) in the deleted row's place at id=6 (at least visually!). If deletions happen many times, the primary key column and its rows end up completely out of order.
Should this be considered a good thing (the database fills up the deleted space), something bad that reduces performance, or neither, with the displayed order simply not mattering?
Thanks.

What percentage of the table is "deleted"?
If it is less than, say, 20%, it would be hard to measure any difference between a soft "deleted=1" and a hard "DELETE FROM tbl". The disk space would probably be the same. A 16KB block would either have soft-deleted rows to ignore, or the block would be not "full".
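For concreteness, the two forms being compared look roughly like this (tbl and id are just placeholders):
UPDATE tbl SET deleted = 1 WHERE id = 123;   -- soft delete: keep the row, flag it
DELETE FROM tbl WHERE id = 123;              -- hard delete: remove the row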
Let's say 80% of the rows have been deleted. Now there are some noticeable differences.
In the "soft-delete" case, a SELECT will be looking at 5 rows to find only 1. While this sounds terrible, it does not translate into 5 times the effort. There is overhead for fetching a block; if it contains 4 soft-deleted rows and 1 useful row, that overhead is shared. Once a useful row is found, there is overhead to deliver that row to the client, but that applies only to the 1 row.
In the "hard-delete" case, blocks are sometimes coalesced. That is, when two "adjacent" blocks become less than half full, they may be combined into a single block. (Or so the documentation says.) This helps to cut down on the number of blocks that need to be touched. But it does not shrink the disk space -- hard-deleted rows leave space that can be reused; deleted blocks can be reused. Blocks are not returned to the OS.
A "point-query" is a SELECT where you specify exactly the row you want (eg, WHERE id = 123). That will be very fast with either type of delete. The only possible change is if the BTree is a different depth. But even if 80% of the rows are deleted, the BTree is unlikely to change in depth. You need to get to about 99% deleted before the depth changes. (A million rows has a depth of about 3; 100M -> 4.)
"Range queries (eg, WHERE blah BETWEEN ... AND ...) will notice some degradation if most are soft-deleted -- but, as already mentioned, there is a slight degradation in either deletion method.
So, is this my "opinion"? Yes. But it is based on an understanding of how InnoDB tables work. And it is based on "experience" in the sense that I have detected nothing to significantly shake this explanation in about 19 years of using InnoDB.
Further... With hard-delete, you have the option of freeing up the free space with OPTIMIZE TABLE. But I have repeatedly said "don't bother" and elaborated on why.
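If you do decide to reclaim the space anyway, the statement is simply:
OPTIMIZE TABLE tbl;   -- for InnoDB this is a full table rebuild, roughly ALTER TABLE tbl ENGINE=InnoDB
It copies the whole table, which is part of why it is usually not worth the trouble.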
On the other hand, if you need to delete a big chunk of a table (either one-time or repeatedly), see my blog on efficient techniques: http://mysql.rjweb.org/doc.php/deletebig
(Re: the PS)
SELECT without an ORDER BY -- It is 'fair game' for the query to return the rows in any order it feels like. If you want a certain order, add ORDER BY.
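A minimal illustration, using a placeholder table tbl:
SELECT * FROM tbl;               -- rows may come back in whatever order is convenient for the engine
SELECT * FROM tbl ORDER BY id;   -- rows are guaranteed to be sorted by id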
What Engine is being used? MyISAM and InnoDB work differently; neither is predictable without ORDER BY.
If you wanted the new entry to have id=6, that is a different problem. (And I will probably argue against designing the ids like that.)

The simple answer is no. DBMS systems are designed to handle changes at any time while keeping system performance in mind. Sometimes a delete will have a small effect, but it is nothing you need to worry about.

Related

MariaDB, do new inserts replace deleted rows on disk?

Can someone point me in the right direction? I can't find any documentation on this behavior.
We know when you delete rows from a table you end up with "holes" which you can defrag with OPTIMIZE. Do new inserts automatically fill in those holes if left alone? Is there a way to force that behavior if not? Using InnoDB tables for revolving logs, deleting old rows and adding new, would the table roll over or continuously consume disk space? Or would a different engine be better suited for this?
Yes, I know of table partitions; I want to explore all options first.
Since this is mostly a non-issue, I will assume you are asking for academic reasons?
InnoDB (you should be using that Engine!) stores the data (and each secondary index) in separate B+Trees.
The data's BTree is ordered by the PRIMARY KEY. The various leaf nodes will be filled to different degrees, based on the order of inserts, deletes, updates (that change the row length), temporary transactional locks on rows, etc, etc.
That last one is because one transaction sees effectively an instantaneous snapshot of the data, possibly different than another transaction's view. This implies that multiple copies of a row may coexist.
The buffer_pool holds 16KB blocks. Each block holds a variable number of rows. This number changes with the changing tides. If two adjacent blocks become "too empty", they will be combined.
Totally empty blocks (say, due to lots of deletes) will be put on a free chain for later reuse by Inserts. But note that the disk used by the table will not shrink.
The failure to shrink is usually not a problem -- most tables grow; any shrinkage is soon followed by a new growth spurt.
PARTITIONs are usually not worth using. However, that is the best way to "keep data for only 90 days", then use DROP PARTITION instead of a big, slow DELETE. (That is about the only use for PARTITION.)
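A sketch of that pattern, with made-up table and partition names, could look like this:
CREATE TABLE log (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created DATETIME NOT NULL,
  msg VARCHAR(255),
  PRIMARY KEY (id, created)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);
ALTER TABLE log DROP PARTITION p2024_01;   -- near-instant, unlike a big slow DELETE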
If you add up all the bytes in the INTs (4 bytes each), VARCHARs (pick the average length), etc, etc, you will get what seems like a good estimate of the disk space being used. But due to the things discussed above, you need to multiply that number by 2 to 3 to get a better estimate of the disk space actually consumed by the table.
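One way to see the space actually allocated (data, indexes, and free-but-reserved space) is to ask information_schema; your_db and your_table are placeholders:
SELECT table_name,
       data_length  / 1024 / 1024 AS data_mb,
       index_length / 1024 / 1024 AS index_mb,
       data_free    / 1024 / 1024 AS free_mb
FROM information_schema.tables
WHERE table_schema = 'your_db' AND table_name = 'your_table';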

Does MySQL table size matters when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to be coming in per 2 weeks time.
There are a few variables (each needing its own column) that I'll allow people to use with the click tracking - but I don't want to limit them to only 5 or so of these variables. That's why I thought about creating Table B, where I can store these variables for each click.
However, each click might have 5-15+ of these variables, depending on how many they are using. If I store them in a separate table, that will multiply the 10M/2 weeks by the number of variables the user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both writing and most importantly reading performance, is there any difference if I JOIN a 100M rows table to a:
500 rows table OR to a 100M rows table?
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values when they're not in use?
is there any difference if I JOIN a 100M rows table to a...
Yes, there is. A JOIN's performance depends mainly on how long it takes to find matching rows based on your ON condition. This means increasing the row count of a joined table will increase the JOIN time, since there are more rows to sift through for matches. In general, a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many optimization strategies the optimizer may take to change this value, but it works as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. An index marks a column that the database keeps a running, sorted track of, to allow quicker evaluation of values. This slows down write operations, since each write also has to modify the encompassing data structure (usually a B-Tree), but it speeds up read operations, since the data is presorted in that structure, allowing for quick lookups.
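As a rough sketch, assuming hypothetical tables clicks(id, ...) and click_vars(click_id, name, value) joined on click_id, the index that matters is on the join column of the many-rows side:
ALTER TABLE click_vars ADD INDEX idx_click_id (click_id);
SELECT c.id, v.name, v.value
FROM clicks AS c
JOIN click_vars AS v ON v.click_id = c.id
WHERE c.id = 12345;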
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values when they're not in use?
There's a lot of factors that would go into saying yes or no here. Mainly, would storage space be an issue and how likely is duplicate data to appear. If the answers are that storage space is not an issue and duplicates are not likely to appear, then one large table may be the right decision. If you have limited storage space, then storing the excess nulls may not be smart. If you have many duplicate values, then one large table may be more inefficient than a JOIN.
Another factor to consider when denormalizing is if another table would ever want to access values from just one of the previous two tables. If yes, then the JOIN to obtain these values after denormalizing would be more inefficient than having the two tables separate. This question is really something you need to handle yourself when designing the database and seeing how it is used.
First: there is a huge difference between joining 10M rows to 500 and joining 10M rows to 10M!
But using a proper index and a structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would totally NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20M entries in the table. So even if there are some good reasons that might favor denormalized tables (performance, tablespace, ...), it is a bad idea with respect to further changes - but in the end it's your decision ;)

Query performance increase from deleting rows in SQL database?

I have a database with a single table that keeps track of user state. When I'm done handling a row, it's no longer necessary to keep it in the database and it can be deleted.
Now let's say I wanted to keep the row instead of deleting it (for historical purposes, analytics, etc.). Would it be better to:
Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
Query performance will be better in the long run:
What is happening with forever inserts:
The table grows, the indexes grow, index (lookup) performance decreases with the size of the table, and insert performance especially is hurt.
What is happening with delete:
Table pages get fragmented, so the deleted space is not reused 100% as you might expect - closer to 50% in MySQL. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lopsided: it contains your new data but also the structure for your old data. How bad this gets depends on the structure of your data. This situation, however, stabilizes at a certain performance level. This performance point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation, however, this performance point is not equal to about twice your data amount; it tends to be a bit worse (benchmark it to see for yourself). The benefit of the delete scenario, however, is that since you have a smaller data set, you might be able to rebuild your index every reasonable period, thus improving your performance.
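The periodic rebuild itself is just a table rebuild; assuming a placeholder table name, either of these does it (both copy the whole table, so schedule them in a quiet window):
OPTIMIZE TABLE user_state;
ALTER TABLE user_state ENGINE=InnoDB;   -- equivalent rebuild for InnoDB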
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: This gains about 8% performance on large datasets (my observation, dataset just about 200GB compressed data)
Look at partitioning: If you have a handy partitioning parameter, you can create a series of "small tables" for you and prevent logic for delete, rebuild and historic data management. This might give you the best performance profile.
If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
If you are deleting records purely by age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP PARTITION is instantaneous, and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
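As a sketch, assuming a table partitioned by RANGE(TO_DAYS(created)) with a catch-all pfuture partition (names made up), the weekly maintenance is just two ALTERs:
ALTER TABLE events DROP PARTITION p2024w01;   -- drop the oldest week
ALTER TABLE events REORGANIZE PARTITION pfuture INTO (   -- split off the next week
  PARTITION p2024w07 VALUES LESS THAN (TO_DAYS('2024-02-19')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);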
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a bug deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.

Mysql: deleting rows vs. using a "removed" column

I was previously under the impression that deleting rows in an autoincremented table can harm SELECT performance, and so I've been using a tinyint column called "removed" to mark whether an item is removed or not.
My SELECT queries are something like this:
SELECT * FROM items WHERE removed = 0 ORDER BY id DESC LIMIT 25
But I'm wondering whether it does, in fact, make sense to just delete those rows instead. Less than 1% of rows are marked as "removed" so it seems dumb for mysql to have to check whether removed = 0 for each row.
So can deleting rows harm performance in any way?
That depends a lot on your use case - and on your users. Marking the row as deleted can help you in various situations:
if a user decides "oh, I did need that item after all", you don't need to go through the backups to restore it - just flip the "deleted" bit again (note potential privacy implications)
with foreign keys, you can't just go around deleting rows, you'd break the relationships in the database; same goes for security/audit logs
you aren't changing the number of rows (which may decrease index efficiency if the removed rows add up)
Moreover, when properly indexed, in my measurements, the impact was always insignificant (note that I wrote "measurements" - go and profile likewise, don't just blindly trust some people on the Internet). So, my advice would be "use the removed column, it has significant benefits and no significant negative impact".
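For the query in the question, "properly indexed" would mean something like a composite index covering both the filter and the ordering:
ALTER TABLE items ADD INDEX idx_removed_id (removed, id);
SELECT * FROM items WHERE removed = 0 ORDER BY id DESC LIMIT 25;   -- can walk the index backwards and stop after 25 rows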
I don't think deleting rows harms SELECT queries. Normally people add an extra column named deleted [removed, in your case] to provide restore-like functionality. So if you are not providing restore functionality, you can delete the row; it will not affect the SELECT query as far as I know. But while deleting, keep relationships in mind: related rows should also be deleted, or you will get errors or wrong results.
You just fill the table with more and more records which you don't need. If you don't plan to use them in the future, I don't think you need to store them at all. If you want to keep them anyway, but don't plan to use them often, you can just create a temp table to hold your "removed" records.

Does number of columns affect MYSQL speed?

I have a table. I only need to run one type of query: find a given unique value in column 1, then get, say, the first 3 columns out.
Now, how much would it affect speed if I added an extra few columns to the table basically for "data storage"? I know I should use a separate table, but let's assume I am constrained to having just 1 table, so the only way is to add some columns at the end.
So, if I add on some columns, say 10 at the end, VARCHAR(30) each, will this slow down the query described in the first sentence? If so, by how much of a factor, do you think, compared to without the extra redundant yet present columns?
Yes, extra data can slow down queries because it means fewer rows can fit into a page, and this means more disk accesses to read a certain number of rows and fewer rows can be cached in memory.
The exact factor in slow down is hard to predict. It could be negligible, but if you are near the boundary between being able to cache the entire table in memory or not, a few extra columns could make a big difference to the execution speed. The difference in the time it takes to fetch a row from a cache in memory or from disk is several orders of magnitude.
If you add a covering index the extra columns should have less of an impact as the query can use the relatively narrow index without needing to refer to the wider main table.
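A sketch of such a covering index, with made-up column names (col1 is the unique lookup column, col2 and col3 the values returned):
ALTER TABLE t ADD INDEX idx_cover (col1, col2, col3);
SELECT col1, col2, col3 FROM t WHERE col1 = 'some-key';   -- answered from the index alone ("Using index" in EXPLAIN)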
I don't understand the 'I know I should use a separate table' bit. What you've described is the reason you have a DB, to associate a key with some related data. Look at it another way, how else do you retrieve this information if you don't have the key?
To answer your question, the only way to know what the performance hit is going to be is empirical testing (though Mark's answer, posted just prior to mine, is one - of VERY many - factors to speed).
That depends a bit on how much data you already have in the records. The difference would normally be somewhere between almost none at all and not very big.
The difference comes from how much more data has to be loaded from disk to get to the data you want. The extra columns will likely mean there is room for fewer records in each page, but it is possible there happens to be enough room left in each page for most of the extra data, so that few extra blocks are needed. It depends on how well the current data lines up in the pages.