Any benefit to deleting a soft-deleted row in MySQL? - mysql

I have a table in MySQL dB that records when a person clicks on certain navigation tabs. Each time it will soft-delete the last entry and insert a new one. The reason for soft-delete is for analytics purposes, so I can track over time where/when/what users are clicking. The ratio of soft-deletes to new entries are 9:1, and the table size is about 20K at the moment but growing fast.
So my question is: if deleting the soft-delete entries would help optimize any queries that involve this table? There is one at the moment that joins 4 tables together and only needs the new entries. Since the analytics on the soft-deletes could be performed on backup copies, I don't need these rows on the production dB.

There is most likely a performance implication to have 90% of your table excluded out of all queries but the analytics: your indexes are probably bigger than they would be if these soft-deleted rows were in their own table, the disk head has to seek across bigger distances so your disk accesses are more expensive than they would be in table 1/10 the size etc.

Related

Mysql what if too much data in a table

Data is increasing in one table everyday, it might lower the performance . I was thinking if I can create a trigger which move table A into A1 and create a new table A every a period of time, so that insert or update could be faster in table A. Is this the right way to save performance ? If not, what should I do ?
(for example, insert or update 1000 rows per second in table A, how is the performance after 3 years ?)
We are designing softwares for a factory. There are product lines which pcb boards are made on. We need to insert almost 60 pcb records per second for years. (1000 rows seem to be exaggerated)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it all together (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
Very Huge data may decrease the performance of server, So there is a way to handle this :
1) you have to create another table to store archive data ( old data ) using Archive storage mechanism . ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html )
2) create MySQL job/scheduler to move older records to archive table. schedule in timeslot
when server is maximum idle.
3) after moving older records to archive table, re-index the original table.
this will serve the purpose of performance.
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does cause more mysql impact especially at row insert time. Assuming there are more reads than inserts, this is an advantage because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

Is adding and dropping indexes everyday on huge tables a good practice?

I'm building a Web Application that is connected to a MySQL database.
I've got two huge tables containing each about 40 millions rows at the moment, and they are receiving new rows everyday (which adds ~ 500 000-1000 000 rows everyday).
The process to add new rows runs during the night, while no one can use the application, and the new rows' content depends on the result of some basic SELECT queries on the current database.
In order to get the result of those SELECT statement fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.
The thing is, during the day, some totally different queries are run against those tables, including some "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)).
I found on stack this very helpful mini-cookbook that advises you on which INDEXes you should use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
They advice to use compound index: in my example query above it would give INDEX(a, b, date).
It indeed increased the speed of the queries run during the day (from 1 minute to 8 seconds so I was truly happy).
However, with those compound indexes, the required time to add new rows during the night totally explode (it would take more than one day to add the daily content).
Here is my question: would that be ok to drop all the indexes every night, add the new content, and set back up the daily indexes?
Or would that be dangerous since indexes are not meant to be rebuilt every day, especially on such big tables?
I know such an operation would take approximately two hours in total (drop and recreate INDEXes).
I am aware of the existence of ALTER TABLE table_name DISABLE KEYS; but I'm using InnoDB and I believe it is not made to work on InnoDB table.
I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.
I would ask about how you are inserting new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no index) and do a bulk insert:
insert into bigtable( . . .)
select . . .
from smalltable;
These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.
A digression... PARTITIONing by date should be very useful for you since you are deleting things over a year ago. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking it into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.
More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.
But back to your clever idea of dropping and rebuilding indexes. To others, I point out the caveat that you have the luxury of not otherwise touching the table for long enough to do the rebuild.
With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? This partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM, thereby be 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and RAM size, I can help you do the arithmetic to see if your 'last' partition will fit in RAM.
A note about index updates in InnoDB -- they are 'delayed' by sitting in the "Change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_size_max, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)
The default for that setting is 25, meaning that 25% of the buffer_pool is set aside for pending updates to indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, hence finish faster.
Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:
Just before the nightly INSERTs:
innodb_change_buffer_size_max = 70
innodb_old_blocks_pct = 10
Soon after the nightly INSERTs:
innodb_change_buffer_size_max = 25
innodb_old_blocks_pct = 37
(I am not sure about that other setting, but it seems reasonable to push it out of the way.)
Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.
In a similar application, I had big, hourly, dumps to load into a table, and a 90-day retention. I stretched my Partition rules by having 90 daily partitions and 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions is a performance killer until 8.0; so don't even consider daily partitions for you 1-year retention.)
The Summary tables made so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That will let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question ! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: Don't have any indexes (other than the PRIMARY KEY) on the Fact table; use Summary tables for things that need messier indexes.

Query performance increase from deleting rows in SQL database?

I have a database with a single table that keeps track of user state. When I'm done handling the row, its no longer necessary to keep it in the database and can be deleted.
Now lets say I wanted to keep track of the row instead of deleting it (for historical purposes, analytics, etc). Would it be better to:
Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
Query performance will be better in the long run:
What is happening with forever inserts:
The table grows, indexes grow, index performance (lookup) is decreases with the size of the table, especially insert performance is hurt.
What is happening with delete:
Table pages get fragmented, so the deleted space is not re-used 100% as expected, more near 50% in MySQL. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lob sided: It contains your new data but also the structure for your old data. It depends on the structure of your data on how bad this gets. This situation however stabilizes at a certain performance. This performance point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation however this performance point is not equal to about twice your data amount, it tends to be a bit worse (benchmark it to see yourself). The benefit of the delete scenario is however since you have a smaller data set, that you might be able to rebuild your index once every reasonable period, thus improving your performance.
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: This gains about 8% performance on large datasets (my observation, dataset just about 200GB compressed data)
Look at partitioning: If you have a handy partitioning parameter, you can create a series of "small tables" for you and prevent logic for delete, rebuild and historic data management. This might give you the best performance profile.
If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
If you are deleting records purely on age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP TABLE is instantaneous and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a bug deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.

Move inactive rows to another table?

I have a table where when a row is created, it will be active for 24 hours with some writes and lots of reads. Then it becomes inactive after 24 hours and will have no more writes and only some reads, if any.
Is it better to keep these rows in the table or move them when they become inactive (or via batch jobs) to a separate table? Thinking in terms of performance.
This depends largely on how big your table will get, but if it grows forever, and has a significant number of rows per day, then there is a good chance that moving old data to another table would be a good idea. There are a few different ways you could accomplish this, and which is best depends on your application and data access patterns.
Essentially as you said, when a row becomes "old", INSERT to the archive table, and DELETE from the current table.
Create a new table every day (or perhaps every week, or every month, depending on how big your dataset is), and never worry about moving old rows. You'll just have to query old tables when accessing old data, but for the current day, you only ever access the current table.
Have a "today" table and a "all time" table. Duplicate the "today" rows in both tables, keeping them in sync with triggers or other mechanisms. When a row becomes old, simply delete from the "today" table, leaving the "all time" row in tact.
One advantage to #2, that may not be immediately obvious, is that I believe MySQL indexes can be optimized for read-only tables. So by having old tables that are never written to, you can take advantage of this extra optimization.
Generally moving rows between tables in proper RDBMS should not be necessary.
I'm not familiar with mysql specifics, but you should do fine with the following:
Make sure your timestamp column is indexed
In addition, you can use active BOOLEAN default true column
Make a batch run every day to mark >24h old rows inactive
Use a partial index for timestamp column so only rows marked active are indexed
Remember to have timestamp and active = TRUE in your where conditions to hit indexes. Use EXPLAIN a lot.
That all depends on the balance between ease of programming, and performance. Performance wise, yes it will definitely be faster. But whether the speed increase is worth the effort is hard to say.
I've worked on systems that run perfectly fine with millions of rows. However, if the data is ever growing it does eventually become a problem.
I've worked on a database storing transaction logging for automated equipment. It generates hundreds of thousands of events per day. After a year, the queries just wouldn't run at acceptable speeds any more. We now keep the last month's worth of logs in the main table (millions of rows still), and move older data to archive tables.
None of the application's functionality ever looks in the archive table (if you do a query of the transaction log, it will return no results). It is only really kept for emergency use, and is just queried with any standalone database query tool. Because the archive has well over a hundred million rows, and the nature of this emergency use is generally unplannable (and therefore mostly un-indexed) queries, they can take a long time to run.
There is another solution. To have another table containing only the active records (tblactiverecords). When the number of active records is really small, you could just do an inner join and get the active records. This should take very less time because primary key by default are indexed in mysql. As your rows become inactive, you could delete them from the tblactiverecords table.
create table tblrecords (id int primary key, data text);
Then,
create table tblactiverecords (tblrecords_id primary key);
you can do
select data from tblrecords join tblactiverecords on tblrecords.id = tblactiverecords.tblrecords_id;
to get all data that are active.

mySQL database efficienty question

I have a database efficiency question.
Here is some info about my table:
-table of about 500-1000 records
-records are added and deleted every day.
- usually have about the same amount being added and deleted every day (size of active records stays the same)
Now, my question is.....when I delete records,...should I (A) delete the record and move it to a new table?
Or,...should I (B) just have and "active" column and set the record to 0 when it is no long active.
The reason I am hesitant to use B is because my site is based on the user being able to filter/sort this table of 500-1000 records on the fly (using ajax)....so I need it to be as fast as possible,..(i'm guessing a table with more records would be slower to filter)...and I am using mySQL InnoDB.
Any input would be great, Thanks
Andrew
~1000 records is a very small number.
If a record can be deleted and re-added later, maybe it makes sense to have an "active" indicator.
Realistically, this isn't a question about DB efficiency but about network latency and the amount of data you're sending over the wire. As far as MySQL goes, 1000 rows or 100k rows are going to be lightning-fast, so that's not a problem.
However, if you've got a substantial amount of data in those rows, and you're transmitting it all to the client through AJAX for filtering, the network latency is your bottleneck. If you're transmitting a handful of bytes (say 20) per row and your table stays around 1000 records in length, not a huge problem.
On the other hand, if your table grows (with inactive records) to, say, 20k rows, now you're transmitting 400k instead of 20k. Your users will notice. If the records are larger, the problem will be more severe as the table grows.
You should really do the filtering on the server side. Let MySQL spend 2ms filtering your table before you spend a full second or two sending it through Ajax.
It depends on what you are filtering/sorting on and how the table is indexed.
A third, and not uncommon, option, you could have a hybrid approach where you inactivate records (B) (optionally with a timestamp) and periodically archive them to a separate table (A) (either en masse or based on the timestamp age).
Realistically, if your table is in the order 1000 rows, it's probably not worth fussing too much over it (assuming the scalability of other factors is known).
If you need to keep the records for some future purpose, I would set an Inactive bit.
As long as you have a primary key on the table, performance should be excellent when SELECTing the records.
Also, if you do the filtering/sorting on the client-side then the records would only have to be retrieved once.