When to delete old entries from MySQL DB? - mysql

So I run this TYPO3 website with nearly 80 tables. TYPO3 doesn't really delete records; it only writes a "1" into the table field deleted to mark them. This leads to big tables with many records that are not visible in the application but still have to be processed in every database query.
My question is: how many dead entries can you keep before facing disadvantages like a performance decrease? Is there a known number of entries at which problems start, regardless of the server hardware?
Thanks in advance!

TYPO3 has an included task to do cleanups for old/deleted entries, called:
Table garbage collection : cleans up old records from any table in the database.
See https://docs.typo3.org/c/typo3/cms-scheduler/master/en-us/Installation/BaseTasks/Index.html#table-garbage-collection-task
You can decide which kinds of entries should be cleaned and at what interval, depending on your use case and your server environment.
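If you want to see roughly what such a cleanup does for a single table, here is a hedged SQL sketch (the scheduler task is the safer, supported route; tt_content is just an example table and the 180-day age is an arbitrary choice):
-- Remove soft-deleted records older than roughly 180 days from one TYPO3 table.
-- TYPO3 tables typically carry a deleted flag and a tstamp (last-change) column.
DELETE FROM tt_content
WHERE deleted = 1
  AND tstamp < UNIX_TIMESTAMP(NOW() - INTERVAL 180 DAY);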

It depends. If there are good indexes, the extra rows may not hurt performance much. Are you seeing a slowdown? (There's an old saying: "If it ain't broke, don't fix it.")
Something like DELETE FROM t WHERE deleted may be a viable way to clean up table t. But it may run into issues with FOREIGN KEYs.
How many rows in the tables? If there are millions of rows to DELETE, it gets tricky to do the task without bringing the system to its knees.
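If there are that many, a hedged sketch of deleting them in small batches, so each statement only locks a bounded set of rows (t and the batch size are placeholders):
-- Repeat until the statement reports 0 affected rows;
-- small batches keep locking and undo/redo pressure manageable.
DELETE FROM t
WHERE deleted = 1
LIMIT 1000;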

Related

Cleaning out an insanely large table

I have a backup table that - thanks to poorly planned management by a programmer who is bad at math - has 3.5 billion records in it. The data drive is nearly full, performance is suffering. I need to clean out this table but just a simple SELECT COUNT(1) statement takes 30 minutes or more to return.
There is no primary key on the table.
The database uses SIMPLE logging. There's only 25gb left on the drive, so I need to be mindful that whatever I do has to leave space for the database to continue functioning for everyone else. I'm waiting for confirmation as I type this, but I don't think I need to keep any of the data that's in there now.
On a table with that many records, would TRUNCATE TABLE grind the system to a halt?
Also looking into the solutions proposed here: How to delete large data of table in SQL without log?
The idea is to allow my clients to keep working while I'm doing all this.
TRUNCATE TABLE would work if you no longer need the records. It will not reduce the size of the database file on its own, though; if you need to reclaim disk space you would have to shrink the data file afterwards.
If you would rather delete the records Aaron Bertrand has good examples and test results he did located here: https://sqlperformance.com/2013/03/io-subsystem/chunk-deletes
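For reference, a hedged sketch of both options on SQL Server (table, column, and cutoff are placeholders; the batched form follows the pattern from the linked article):
-- Option 1: nothing needs to be kept; TRUNCATE is minimally logged and near-instant.
TRUNCATE TABLE dbo.BackupTable;
-- Option 2: keep some rows and delete the rest in small batches,
-- so the log stays small under SIMPLE recovery.
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM dbo.BackupTable
    WHERE CreatedAt < '2020-01-01';   -- hypothetical retention cutoff
    IF @@ROWCOUNT = 0 BREAK;          -- stop once nothing is left to delete
END;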

Optimising "NOT IN(...)" query for millions of rows

Note: I do not have access to the source code/database to which this question pertains. The two tables in question are located on different servers.
I'm working with a 3rd party company that have systems integrated with our own. They have a query that runs something like this;
DELETE FROM table WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
It's pretty much referencing around 3 million values in the NOT IN.
I'm trying to point out that this seems like an inefficient method for deleting multiple rows while keeping all the ones noted in the query. The problem is that, since I don't have access to the source code/database myself, I'm not totally sure what to suggest as a solution.
I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
With this limited knowledge, what possible suggestions could I present to them?
The first thing that comes to mind is having some kind of flag column that indicates whether it's been deleted or not. When the sync script runs it would first perform an update on the target server for all rows marked as deleted (or insert for new rows), then a second query to delete all rows marked for deletion.
Is there a more logical way to do something like this, bearing in mind that complete overhauls of the functionality are out of the question? Only small tweaks to the current process will be possible, for a number of reasons.
Instead of
DELETE FROM your_table
WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
you could do
-- Delete every row in your_table whose value has no match in the source table.
DELETE t1
FROM your_table t1
LEFT JOIN table_where_the_ids_come_from t2 ON t1.`column` = t2.id
WHERE t2.id IS NULL;
I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
I know this is obvious, but why don't these two servers stay in sync using replication? I'm guessing it's because aside from this one table, they don't have identical data.
If out-of-the-box replication isn't flexible enough, you could use a change-data capture tool.
The idea is that the tool monitors changes in a MySQL binary log stream, and reacts to them. The reaction is user-defined, and it can include applying the same change to another MySQL instance, which would keep them in sync.
Here's a blog post that shows how to use Maxwell, one of the open-source CDC tools, this one released by Zendesk:
https://www.percona.com/blog/2016/09/13/mysql-cdc-streaming-binary-logs-and-asynchronous-triggers/
A couple of advantages of this approach:
No need to re-sync the whole table. You'd only apply incremental changes as they occur.
No need to schedule re-syncs daily or whatever. Since incremental changes are likely to be small, you could apply the changes nearly immediately.
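Whatever CDC tool you pick, it can only read what the source server writes, so row-based binary logging has to be on. A minimal check, as a hedged sketch (log_bin itself is not dynamic and has to be enabled in the server configuration):
-- Verify binary logging is enabled and switch to row-based events,
-- which is what binlog-reading CDC tools such as Maxwell consume.
SHOW VARIABLES LIKE 'log_bin';
SET GLOBAL binlog_format = 'ROW';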
Deleting a large number of rows will take a huge amount of time. This is likely to require a full table scan. As it finds rows to delete, it will stress the undo/redo log. It will clog replication (if using such). Etc.
How many rows do you expect to delete?
Better would be to break the list up into chunks of 1000. (This applies whether using IN(list of constants) or JOIN.) But, since you are doing NOT, it gets stickier. Possibly the best way is to copy over what you want:
-- Placeholder names are backticked so they don't collide with reserved words (REAL in particular).
CREATE TABLE `new` LIKE `real`;
INSERT INTO `new`
SELECT * FROM `real` WHERE id IN (...); -- without NOT
RENAME TABLE `real` TO `old`,
             `new` TO `real`;
DROP TABLE `old`;
I go into details of chunking, partitioning, and other techniques in Big Deletes.
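As a hedged illustration of that chunking idea applied to the LEFT JOIN form above (the id range and all names are placeholders; advance the range and repeat until you pass the table's maximum value):
-- Each run touches at most one range, so no single DELETE scans or locks everything.
DELETE t1
FROM your_table t1
LEFT JOIN table_where_the_ids_come_from t2 ON t1.`column` = t2.id
WHERE t2.id IS NULL
  AND t1.`column` BETWEEN 1 AND 10000;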

Microsoft Access Record Corruption

I'm having another issue with a Microsoft Access database. Every so often, some records get corrupted: something happens and different shapes, Chinese characters, and wrong data end up in the records. I did find a way to avoid losing the corrupted records by keeping a backup of that table that I update every day. Still, it's a bit of an annoyance, especially when an update is run.
I've looked for different solutions to this problem, but none have really worked. It's a database that can be used by multiple users at the same time, and an older one that I've had to update a bit. I don't have any memo fields present in the table either.
If you are using an autonumber field as a primary key, that could cause an increased corruption risk if the autonumber seed is reset and begins duplicating existing values. This has since been fixed, but you may need to update your Jet Engine service pack.
If you are in a multi-user environment and have not split your database, you should try that. You can split the database using the database tools tab on the ribbon in the "Move Data" section. That can reduce corruption risk by better managing concurrent updates to the same record. See further discussion here.
Unfortunately I can't tell you the problem without more information regarding your tables and relationships. If the corruption is a common result of your update query, I would start by looking through your update routine for errors.

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I want to know is whether it's more efficient to just delete everything in the table each time the script is called and repopulate it, or to go through each record, determine whether any changes were made, and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know whether the time it takes to do this would put a huge load on the server. The script is written in PHP.
700 records is an extremely small number of records to have performance concerns. Don't even think about it, do whichever is easier for you.
But if it is performance that you are after: updating rows is slower than inserting rows (especially if you are not expecting any generated keys back, so an insertion is a one-way operation to the database instead of a round trip), and TRUNCATE TABLE tends to be faster than DELETE FROM.
If the inventory rows have proper IDs (speaking of a SQL database), then it would be good practice to update them in place, since in theory your auto-increment IDs will eventually be exhausted (overflow) if you keep deleting and re-inserting.
Another approach would be to use a NoSQL DB like MongoDB and simply upsert the given JSON bodies with their existing IDs, letting the DB figure out on its own whether to insert or update.
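For the SQL route, a minimal hedged sketch of updating rows in place with an upsert (inventory, id, name, and quantity are hypothetical names; assumes id is the primary key):
-- Insert a new inventory row, or update the existing one in place,
-- so keys are reused instead of being burned by delete-and-reinsert.
INSERT INTO inventory (id, name, quantity)
VALUES (42, 'widget', 17)
ON DUPLICATE KEY UPDATE
    name = VALUES(name),
    quantity = VALUES(quantity);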

How to continuously remove anything older than the newest 10 entries of a MySQL database (possibly in JPQL/JPA)

I'm looking for a way to continuously monitor and delete the oldest entries so that the database never grows beyond a certain size. I'm only interested in the latest 10 entries, for example, and everything past that number should be deleted. The database is updated by various programs, but the program that does the monitoring and deleting will probably be a Java EE application using JPA. I don't know at which layer of the implementation this should be done: whether MySQL has built-in management that does this, whether I'll have to write a query that does this, or whether there is a feature of Java that can do this.
Edit: I'm using an auto-incremented id that could be used to determine the threshold for deleting.
This is a complex problem, because unless your table stands completely alone (not linked to any other table), you might very well have the latest row in table A referencing a very old row in table B. In that case, although table B's row is very old, you can't delete it without breaking the coherence of your database.
Doing it "continuously" is even harder (read: impossible). I would first
examine if it's really needed. Disks are cheap, and 10 entries in an enterprise database is really nothing.
implement some purge mechanism and execute it very now and then, when the database is not used by anyone else.
I'll have a stab without knowing anything about your table schema:
-- The derived table works around MySQL's restrictions on using LIMIT (and the target table) inside an IN subquery.
DELETE FROM MyTable WHERE Id NOT IN (SELECT Id FROM (SELECT Id FROM MyTable ORDER BY Date DESC LIMIT 10) AS newest)
This is pretty inefficient to run all the time, and there may be a MySQL-specific feature that does the job more neatly. You'd probably get better performance from limiting your reads to the 10 rows you need, and actually archiving / deleting the extraneous data only periodically.
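If you'd rather keep this out of the Java layer entirely, a hedged sketch using MySQL's event scheduler (assumes the event scheduler is enabled and that the auto-increment Id defines "newest"; the event name and the 5-minute interval are placeholders):
-- Periodically delete everything older than the 10 newest rows.
-- The derived table finds the 10th-newest Id; anything below it goes away.
CREATE EVENT IF NOT EXISTS trim_mytable
ON SCHEDULE EVERY 5 MINUTE
DO
  DELETE FROM MyTable
  WHERE Id < (SELECT cutoff FROM (SELECT Id AS cutoff FROM MyTable ORDER BY Id DESC LIMIT 9, 1) AS k);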