Query performance increase from deleting rows in SQL database? - mysql

I have a database with a single table that keeps track of user state. When I'm done handling a row, it's no longer necessary to keep it in the database and it can be deleted.
Now let's say I wanted to keep track of the row instead of deleting it (for historical purposes, analytics, etc.). Would it be better to:
1) Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
2) Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
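For illustration, choice #1 might look like this (a minimal sketch; the user_state table and column names are hypothetical):
-- Mark rows instead of deleting them:
ALTER TABLE user_state ADD COLUMN used TINYINT NOT NULL DEFAULT 0;
UPDATE user_state SET used = 1 WHERE id = 123;   -- instead of DELETE
-- Queries on live rows must then filter on the flag:
SELECT * FROM user_state WHERE user_id = 42 AND used = 0;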

Query performance will be better in the long run with the delete scenario:
What is happening with forever inserts:
The table grows, indexes grow, index (lookup) performance decreases with the size of the table, and insert performance in particular is hurt.
What is happening with delete:
Table pages get fragmented, so the deleted space is not re-used 100% as you might expect; in MySQL it is closer to 50%. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lopsided: it contains your new data but also the structure for your old data. How bad this gets depends on the structure of your data. This situation, however, stabilizes at a certain performance level. That performance point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation, however, this performance point is not equal to about twice your data amount; it tends to be a bit worse (benchmark it to see for yourself). The benefit of the delete scenario, however, is that since you have a smaller data set, you may be able to rebuild your indexes at some reasonable interval, thus improving your performance.
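For example, such a periodic rebuild could look like this (a sketch; the table name is an assumption, and for InnoDB both statements amount to a full rebuild of the table and its indexes):
OPTIMIZE TABLE user_state;
-- Equivalent explicit form for InnoDB:
ALTER TABLE user_state ENGINE=InnoDB;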
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: this gained about 8% performance on large datasets in my observation (a dataset of about 200GB of compressed data)
Look at partitioning: if you have a handy partitioning parameter, you can create a series of "small tables" and avoid custom logic for deletes, rebuilds, and historical data management. This might give you the best performance profile.

If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
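For example (index and column names are assumptions):
-- Adding the flag to an index lets queries skip deleted rows in the index itself:
ALTER TABLE tbl ADD INDEX idx_user_active (user_id, is_deleted);
-- A typical query that can then use the index:
SELECT * FROM tbl WHERE user_id = 42 AND is_deleted = 0;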
If you are deleting records purely on age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP PARTITION is instantaneous, and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
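A sketch of that layout, with hypothetical names and monthly partitions:
-- The partitioning column must be part of the primary key:
CREATE TABLE history (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created DATETIME NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id, created)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);
-- Purging the oldest month is effectively instantaneous:
ALTER TABLE history DROP PARTITION p202401;
-- Opening a new month by splitting the catch-all partition:
ALTER TABLE history REORGANIZE PARTITION pfuture INTO (
    PARTITION p202403 VALUES LESS THAN (TO_DAYS('2024-04-01')),
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);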
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a bug deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.

Related

Does deleting rows from a table affect DB performance?

As a MySQL database user,
I'm working on a script that uses a MySQL database with auto-increment primary key tables. Users may need to remove (lots of) data rows: mistaken, duplicated, canceled data, and so on.
For now, I use a TINYINT column named 'delete' as the last column of each table and update rows to delete=1 instead of deleting them.
Considering the deleted data as not important:
Which way do you suggest for a better database and better performance?
Does deleting (maybe lots of) rows every day affect SELECT queries on large tables?
Is it better to delete the rows instantly?
Or is it better to keep the rows using the 'delete' column, delete them monthly (for example), and then re-index the data?
I've searched about this, but most of the results were based on personal opinions or preferences, not on referenced or tested data.
PS) Edit:
And, referring to the question and considering the picture below, there's one more point to ask on this topic, and I would be grateful if you could guide me.
Deleting a row (row 6) while the auto-increment counter was at 225 led the unsorted table to place the next inserted row (id=225) in the deleted row's spot (at least visually!). If deletion happens lots of times, the primary key column and its rows end up completely out of order and messed up.
Should this be considered a good feature of the database (it fills up the deleted space), or something bad that reduces performance, or neither, with what it displays simply not mattering!?
Thanks.
What percentage of the table is "deleted"?
If it is less than, say, 20%, it would be hard to measure any difference between a soft "deleted=1" and a hard "DELETE FROM tbl". The disk space would probably be the same. A 16KB block would either have soft-deleted rows to ignore, or the block would not be "full".
Let's say 80% of the rows have been deleted. Now there are some noticeable differences.
In the "soft-delete" case, a SELECT will be looking at 5 rows to find only 1. While this sounds terrible, it does not translate into 5 times the effort. There is overhead for fetching a block; if it contains 4 soft-deleted rows and 1 useful row, that overhead is shared. Once a useful row is found, there is overhead to deliver that row to the client, but that applies only to the 1 row.
In the "hard-delete" case, blocks are sometimes coalesced. That is, when two "adjacent" blocks become less than half full, they may be combined into a single block. (Or so the documentation says.) This helps to cut down on the number of blocks that need to be touched. But it does not shrink the disk space -- hard-deleted rows leave space that can be reused; deleted blocks can be reused. Blocks are not returned to the OS.
A "point-query" is a SELECT where you specify exactly the row you want (eg, WHERE id = 123). That will be very fast with either type of delete. The only possible change is if the BTree is a different depth. But even if 80% of the rows are deleted, the BTree is unlikely to change in depth. You need to get to about 99% deleted before the depth changes. (A million rows has a depth of about 3; 100M -> 4.)
"Range queries (eg, WHERE blah BETWEEN ... AND ...) will notice some degradation if most are soft-deleted -- but, as already mentioned, there is a slight degradation in either deletion method.
So, is this my "opinion"? Yes. But it is based on an understanding of how InnoDB tables work. And it is based on "experience" in the sense that I have detected nothing to significantly shake this explanation in about 19 years of using InnoDB.
Further... With hard-delete, you have the option of freeing up the free space with OPTIMIZE TABLE. But I have repeatedly said "don't bother" and elaborated on why.
On the other hand, if you need to delete a big chunk of a table (either one-time or repeatedly), see my blog on efficient techniques: http://mysql.rjweb.org/doc.php/deletebig
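For a flavor of those techniques, one common pattern is to delete in small batches so each transaction stays short (a sketch only; the blog covers more efficient variants, and the table/column names and cutoff are assumptions):
-- Repeat until ROW_COUNT() reports fewer than 1000 rows deleted:
DELETE FROM tbl
WHERE created < '2020-01-01'
LIMIT 1000;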
(Re: the PS)
SELECT without an ORDER BY -- It is 'fair game' for the query to return the rows in any order it feels like. If you want a certain order, add ORDER BY.
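For example (table name assumed):
-- Without ORDER BY, the server may return rows in any order; this makes it explicit:
SELECT * FROM tbl ORDER BY id;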
What Engine is being used? MyISAM and InnoDB work differently; neither is predictable without ORDER BY.
If you wanted the new entry to have id=6, that is a different problem. (And I will probably argue against designing the ids like that.)
The simple answer is no. DBMSs are designed to handle changes at any time while keeping system performance up. Deletes may affect performance a little bit, but not enough to worry about.

How will partitioning affect my current queries in MySQL? When is it time to partition my tables?

I have a table that contains 1.5 million rows, has 39 columns, contains sales data of around 2 years, and grows every day.
I had no problems with it until we moved it to a new server, we probably have less memory now.
Queries are currently taking a very long time. Someone suggested partitioning the large table that is causing most of the performance issues but I have a few questions.
1) Is it wise to partition the table I described, and is it likely to improve its performance?
2) If I do partition it, will I have to make changes to my current INSERT or SELECT statements, or will they continue working the same way?
3) Does the partitioning take a long time to perform? I worry that with the slow performance, something would happen midway through and I would lose the data.
4) Should I be partitioning it by years or months? (We usually look at the numbers within the month, but sometimes we look at weeks or years.) And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later.)
(I agree with Bill's answer; I will approach the Question in a different way.)
When is it time to partition my tables?
Probably never.
is it likely to improve its performance?
It is more likely to decrease performance a little.
I have a table that contains 1.5 million rows
Not big enough to bother with partitioning.
Queries are currently taking a very long time
Usually that is due to the lack of a good index, probably a 'composite' one. Secondly is the formulation of the query. Please show us a slow query, together with SHOW CREATE TABLE.
data of around 2 years, and grows every day
Will you eventually purge "old" data? If so, the PARTITION BY RANGE(TO_DAYS(..)) is an excellent idea. However, it only helps during the purge. This is because DROP PARTITION is a lot faster than DELETE....
we probably have less memory now.
If you are mostly looking at "recent" data, then the size of memory (cf innodb_buffer_pool_size) may not matter. This is due to caching. However, it sounds like you are doing table scans, perhaps unnecessarily.
will I have to make changes to my current INSERT or SELECT
No. But you probably need to change what column(s) are in the PRIMARY KEY and secondary key(s).
Does the partition take a long time to perform?
Slow - yes, because it will copy the entire table over. Note: that means extra disk space, and the partitioned table will take more disk.
something would happen midway through and I would lose the data.
Do not worry. The new table is created, then a very quick RENAME TABLE swaps it into place.
Should I be partitioning it by years or months?
Rule of thumb: aim for about 50 partitions. With "2 years and growing", a likely choice is "monthly".
we usually look at the numbers within the month, but sometimes we take weeks or years
Smells like a typical "Data Warehouse" dataset? Build and incrementally augment a "Summary table" with daily stats. With that table, you can quickly get weekly/monthly/yearly stats -- possibly 10 times as fast. Ditto for any date range. This also significantly helps with "low memory".
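A sketch of such a summary table and its daily incremental load (all names are assumptions):
CREATE TABLE sales_daily (
    dy DATE NOT NULL,
    product_id INT NOT NULL,
    ct INT NOT NULL,                  -- number of sales that day
    total DECIMAL(12,2) NOT NULL,     -- revenue that day
    PRIMARY KEY (dy, product_id)
) ENGINE=InnoDB;
-- Run once a day to append yesterday's numbers:
INSERT INTO sales_daily (dy, product_id, ct, total)
SELECT DATE(sale_time), product_id, COUNT(*), SUM(amount)
FROM sales
WHERE sale_time >= CURDATE() - INTERVAL 1 DAY
  AND sale_time <  CURDATE()
GROUP BY DATE(sale_time), product_id;
-- Weekly/monthly/yearly stats then come from the small table:
SELECT product_id, SUM(ct), SUM(total)
FROM sales_daily
WHERE dy BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY product_id;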
And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
You should 'never' use SELECT *; instead, specify the columns you actually need. "Vertical partitioning" is the term for your suggestion. It is sometimes practical. But we need to see SHOW CREATE TABLE with realistic column names to discuss further.
More on partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
In most circumstances, you're better off using indexes instead of partitioning as your main method of query optimization.
The first thing you should learn about partitioning in MySQL is this rule:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
Read more about this rule here: Partitioning Keys, Primary Keys, and Unique Keys.
This rule makes many tables ineligible for partitioning, because you might want to partition by a column that is not part of the primary or unique key in that table.
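For example (table and column names are illustrative):
-- Fails: `created` is in the partitioning expression but not in the primary key
CREATE TABLE orders (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created DATE NOT NULL
)
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p0 VALUES LESS THAN (TO_DAYS('2024-01-01'))
);
-- Works: the partitioning column is included in a composite primary key
CREATE TABLE orders (
    id INT NOT NULL AUTO_INCREMENT,
    created DATE NOT NULL,
    PRIMARY KEY (id, created)
)
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p0 VALUES LESS THAN (TO_DAYS('2024-01-01'))
);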
The second thing to know is that partitioning only helps queries whose conditions unambiguously let the optimizer infer which partitions hold the data you're interested in. This is called Partition Pruning. If you run a query that could find data in any or all partitions, MySQL must search all the partitions, and you gain no performance benefit compared to having a regular non-partitioned table.
For example, if you partition by date, but then you run a query for data related to a specific user account, it would have to search all your partitions.
In fact, it might even be a little bit slower to use partitioned tables in such a query, because MySQL has to search each partition serially.
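You can see whether pruning happens with EXPLAIN, whose partitions column lists the partitions that will actually be read (shown by default in MySQL 5.7+; older versions need EXPLAIN PARTITIONS). A sketch, assuming a sales table partitioned by RANGE on sale_date:
-- Pruned: only the partitions covering January 2024 are read
EXPLAIN SELECT * FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Not pruned: rows for this user could be in any partition, so all are searched
EXPLAIN SELECT * FROM sales WHERE customer_id = 42;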
You asked how long it would take to partition the table. Converting to a partitioned table requires an ALTER TABLE to restructure the data, so it takes about the same time as any other alteration that copies the data to a new tablespace. This is proportional to the size of the table, but varies a lot depending on your server's performance. You'll just have to test it out, there's no way we can estimate how long it will take on your server.

MySQL: what if there is too much data in a table?

Data is increasing in one table every day, which might lower performance. I was thinking I could create a trigger that moves table A into A1 and creates a new table A every so often, so that inserts and updates in table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, if we insert or update 1000 rows per second in table A, what will performance be like after 3 years?)
We are designing software for a factory. There are product lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (1000 rows/sec was an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
Very large data may decrease server performance, so there is a way to handle this:
1) Create another table to store archive data (old data) using the ARCHIVE storage engine (https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html).
2) Create a MySQL job/scheduler to move older records to the archive table, scheduled in a time slot when the server is mostly idle.
3) After moving older records to the archive table, re-index the original table.
This will serve the purpose of performance.
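A sketch of step 2 using MySQL's event scheduler (all names and the 90-day cutoff are assumptions; the archive table must have the same columns as the original, and the event scheduler must be enabled):
SET GLOBAL event_scheduler = ON;   -- once per server
DELIMITER //
CREATE EVENT archive_old_rows
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
    -- Pin the cutoff so both statements see the same set of rows:
    SET @cutoff = NOW() - INTERVAL 90 DAY;
    INSERT INTO tbl_archive SELECT * FROM tbl WHERE created < @cutoff;
    DELETE FROM tbl WHERE created < @cutoff;
END //
DELIMITER ;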
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does add overhead in MySQL, especially at row-insert time. Assuming there are more reads than inserts, this is a net advantage, because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

In MySQL, does the quantity of data in one table affect performance?

Setting up a new database that has a comments table. I've been told to expect this table to get extremely large. I'm wondering if there is any particular reason why I wouldn't want to keep this table in the same database as the rest of the data for the site.
Quantity of data will certainly affect performance in any RDBMS; however, there's no reason that this table should exist in a separate database on the same server. If the table truly will become very large with a lot of insert activity, then you might want to consider an ETL process that keeps an indexed copy for fast selection (mostly because indexes can negatively affect inserts despite the performance gain on selects).
Keep the table in the same database for now, but it is well known that insert speed slows as the number of records increases.
There are some options to consider if the performance becomes unacceptable, for instance, if the data can be partitioned, split it across multiple tables in separate databases.

mysql index performance on small "fast-moving" tables

We've got a table we use as a queue. Entries are constantly being added, constantly updated, and then deleted. Though we might be adding 3 entries/sec, the table never grows to more than a few hundred rows.
To get entries out of table we are doing a simple select.
SELECT * FROM queue_table WHERE some_id = ?
We are debating adding an index on some_id. I think the small size and speed at which we are adding and removing rows would say no, but conventionally, it seems we should have an index.
Any thoughts?
If you are using InnoDB (which you should do having a table of this kind) and the table is accessed concurrently, then you should definitely create the index.
When performing DML operations, InnoDB locks all rows it scans, not only those that match the WHERE clause conditions.
This means that without an index, a query like this:
DELETE
FROM mytable
WHERE some_id = ?
will have to do a full table scan and lock all rows.
This kills all concurrency (even if the threads access different some_id's they'll still have to wait for each other), and may even result in deadlocks.
With 3 transactions per second, no index should be a problem, so just create it.
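Concretely (the index name is an assumption):
ALTER TABLE queue_table ADD INDEX idx_some_id (some_id);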
To be sure, a benchmark using both techniques would be needed.
But generally, if the access is 50% reads and 50% writes, the penalty of updating an index may well not be worth it. But if the number of rows increases, that weighs on both read and write performance such that an index should be used.
The only way to know for sure would be doing some benchmarks in actual/real conditions; for example, measure the time each query takes, and:
for one day, collect that information each time the query is run -- without the index
and for another day, do exactly the same -- with the index.
For a table with a few hundred rows doing both lots of inserts/deletes and selects/updates, the difference should not be that big, so I think you can test in your production environment (and in real conditions) without much danger.
Yes, I know, testing in production is bad; but in this case, it's the best way to know for sure: those conditions are probably too hard to replicate in a testing environment...