MySQL mass deletion from 80 tables

I have a 50 GB MySQL database (80 tables) that I need to delete some contents from.
I have a reference table that contains a list of product ids that need to be deleted from the other tables.
The other tables can be 2 GB each and contain the items that need to be deleted.
My question is: since it is not a small database, what is the safest way to delete
the data in one shot in order to avoid problems?
What is the best method to verify that all of the data was deleted?

Probably this doesn't help anymore, but you should keep it in mind when creating the database. In MySQL (depending on the table storage engine, for instance InnoDB) you can specify relations, called foreign key constraints. These relations mean that if you delete a row from one table (for instance products), entries in other tables that reference that row (such as product_storage) can automatically be updated or deleted. These relations guarantee that you keep a 100% consistent state. However, they might be hard to add in hindsight. If you plan to do this more often, it is definitely worth researching whether you can add them to your database; they will save you a lot of work (all kinds of queries become simpler).
Without these relations you can't be 100% sure. So you'd have to go over all the tables, note which columns you want to check on, and write a bunch of SQL queries to make sure there are no entries left.
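For illustration, a hedged sketch of adding such a constraint after the fact (assuming both tables are InnoDB and that a hypothetical product_storage.product_id references products.id; the names are just for illustration):

ALTER TABLE product_storage
  ADD CONSTRAINT fk_product_storage_product
  FOREIGN KEY (product_id) REFERENCES products (id)
  ON DELETE CASCADE;

-- From then on, deleting a product also removes its dependent rows:
DELETE FROM products WHERE id = 123;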

As Thirler has pointed out, it would be nice if you had foreign keys. Without them, burnall's suggestion of wrapping the deletes in a transaction can be used to ensure that no inconsistencies creep in.
Regardless of how you do it, this could take a long time, even hours, so please be prepared for that.

As pointed out earlier, foreign keys would be nice here. But regarding question 1, you could run the changes within a transaction from the MySQL prompt. This assumes you are using a transaction-safe storage engine like InnoDB. You can convert from MyISAM to InnoDB if you need to. Anyway, something like this:
START TRANSACTION;
-- ...perform the changes...
-- ...verify the changes...
COMMIT;
-- ...or, if the verification fails...
ROLLBACK;
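As a more concrete sketch (assuming a hypothetical reference table to_delete holding the product ids and a hypothetical product_storage table; these are not names from the question):

START TRANSACTION;

-- perform the changes
DELETE FROM product_storage
WHERE product_id IN (SELECT product_id FROM to_delete);

-- verify the changes: this count should be 0 before committing
SELECT COUNT(*) FROM product_storage
WHERE product_id IN (SELECT product_id FROM to_delete);

COMMIT;
-- ...or ROLLBACK; if the count is not 0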
Is it acceptable to have any downtime?
When working with PostgreSQL databases larger than 250 GB, we use this technique on production servers to perform database changes. If the outcome isn't as expected, we just roll back the transaction. Of course there is a penalty, as the I/O system has to do some extra work.
// John

I agree with Thirler that using foreign keys is preferable. It guarantees referential integrity and consistency of the whole database.
I can believe that life sometimes requires more tricky logic.
So you could use manual queries like
delete from a where id in (select id from keys)
You could delete all records at once, by ranges of keys, or using LIMIT in DELETE. A proper index is a must.
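For the LIMIT variant, a hedged sketch of batched deletes (reusing the same a and keys tables from the query above; repeat until the statement affects 0 rows):

-- delete in chunks of 1000 to keep transactions and locks small
DELETE FROM a
WHERE id IN (SELECT id FROM keys)
LIMIT 1000;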
To verify consistency you need a function or query. For example:
DELIMITER //
create function check_consistency() returns boolean
reads sql data
begin
  return not exists(select * from child  where id not in (select id from parent))
     and not exists(select * from child2 where id not in (select id from parent));
  -- ...and so on for each child table
end //
DELIMITER ;
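After the deletes you could then run:

select check_consistency();

which should return 1 (true) if no orphaned rows remain.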

Also, something to look into is partitioning of MySQL tables. For more information check out the reference manual:
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
It comes down to being able to divide tables into separate partitions, for example by datetime values or by key ranges.
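A hedged sketch of what that can look like (table and column names are hypothetical, not from the question):

CREATE TABLE product_log (
  product_id INT NOT NULL,
  created_at DATE NOT NULL
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
  PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- dropping an entire partition is far cheaper than DELETEing its rows one by one
ALTER TABLE product_log DROP PARTITION p2013;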

Related

How many records can be deleted using a single transaction in MySQL InnoDB?

I wanted to delete old records from 10 related tables every 6 months using primary keys and foreign keys. I am planning to do it in a single transaction block, because in case of any failure I have to roll back the changes. My queries will be something like this:
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (1, 2, 3, etc);
DELETE FROM CHILD_TABLE1 WHERE PARENT_ID IN (1, 2, 3, etc);
The records to delete will be around 1 million. Is it safe to delete all of these in a single transaction? How will the performance be?
Edit
To be more clear on my question, I will detail my execution plan.
I am first retrieving the primary keys of all the records from the parent table which have to be deleted, and storing them in a temporary table.
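A hedged sketch of that first step (the age condition is hypothetical, not from the question):

CREATE TEMPORARY TABLE TEMP_ID_TABLE AS
  SELECT PARENT_ID
  FROM PARENT_TABLE
  WHERE created_at < DATE_SUB(NOW(), INTERVAL 6 MONTH);

Then, inside the transaction: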
START TRANSACTION;
DELETE FROM CHILD_ONE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM CHILD_TWO WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
COMMIT;
ROLLBACK on any failure.
Given that I can have around a million records to delete from all these tables, is it safe to put everything inside a single transaction block?
You can probably succeed. But it is not wise. Something random (eg, a network glitch) could come along to cause that huge transaction to abort. You might be blocking other activity for a long time. Etc.
Are the "old" records everything older than date X? If so, it would much more efficient to make use of PARTITIONing for DROPping old rows. We can discuss the details. Oops, you have FOREIGN KEYs, which are incompatible with PARTITIONing. Do all the tables have FKs?
Why do you wait 6 months before doing the delete? 6K rows a day would would have the same effect and be much less invasive and risky.
IN ( SELECT ... ) has terrible performance; use a JOIN instead.
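For example, the child deletes could be written as multi-table DELETEs, reusing the TEMP_ID_TABLE sketched in the question:

DELETE c
FROM CHILD_ONE AS c
JOIN TEMP_ID_TABLE AS t ON t.PARENT_ID = c.PARENT_ID;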
If some of the tables are just normalizations, why bother deleting from them?
Would it work to delete 100 ids per transaction? That would be much safer and less invasive.
First of all: create a proper backup AND test it before you start to delete the records.
How many records you can delete at once mostly depends on the configuration (hardware) of your database server. You have to test how many records can be deleted on that specific server without problems. Start with e.g. 1000 records, then increase the amount in each iteration until it becomes too slow. If you have replication, the setup and the slave's performance affect the row count too (too many write requests can cause serious delay in replication).
A piece of advice: remove all foreign keys and indexes (except the primary key and the indexes needed by the WHERE clauses you use to perform the deletes), if possible, before you start the delete.
Edit:
If the count of records which will be deleted is larger than the count of records which will remain, consider just copying the surviving records into a new table and then renaming the old and new tables. For the first step, copy the structure of the table using the CREATE TABLE .. LIKE statement, then drop all unnecessary indexes and constraints, copy the records, add the indexes back, and rename the tables. (Copy the latest new records from the original table into the copy if necessary.) Then you can drop the old table.
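Roughly, as a hedged sketch (big_table is a hypothetical name, and the keep-condition reuses the TEMP_ID_TABLE idea from the question):

CREATE TABLE big_table_new LIKE big_table;
-- drop the unnecessary indexes/constraints on big_table_new here

INSERT INTO big_table_new
SELECT * FROM big_table
WHERE PARENT_ID NOT IN (SELECT PARENT_ID FROM TEMP_ID_TABLE);  -- the rows to keep

-- add the indexes back on big_table_new here
RENAME TABLE big_table TO big_table_old, big_table_new TO big_table;
-- copy any rows that arrived in big_table_old in the meantime, then:
DROP TABLE big_table_old;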
What I believe: first you should copy the data to another database, then
use a single transaction to delete from all 10 tables, which is safe because you can roll back immediately if something goes wrong. Delete the data from the live database at a time when user interaction is very low.

Handling Deletes/Inserts/Select in a huge table

I have a dating website. On this website I send 10 daily photo matches to each user and store them in a structure like:
SENDER RECEIVER
11 1
12 1
13 1
14 1
I maintain a two-month log.
Users can also check the matches by logging in to my website.
This means there are parallel inserts and selects, which by itself is not an issue.
The problem is that when a user becomes inactive or deletes their id, I need to remove all the entries from the log where sender = 'inactive-id'.
The size of the log is approximately 60 million rows.
So whenever a delete query hits this huge table, all selects get locked and my site goes down.
Note that my table is a MERGE MyISAM table,
as I need to store 2-3 months of records, and on the 1st of every month I change the MERGE definition.
Normally, a table is the most granular object that is locked by a DELETE statement. Therefore, by using a MERGE table you combine several objects that could be locked independently into a single big object that will be locked when a DELETE hits ANY of its tables.
MERGE is a solution for tables which change rarely or never: MERGE Table Advantages and Disadvantages.
You have 2 options:
Minimise impact of locks:
Delete in small batches
Run delete job during low load hours
Consider not deleting at all, if it does not save you much space
Instead of deleting rows, mark them as "deleted" or obsolete and exclude them from SELECT queries
Have smaller objects locked (rather than locking all your tables at once):
Use several DELETE statements to delete from each of the underlying tables
Drop the MERGE definition, delete data from each underlying table, then recreate the MERGE table. However, I think you can do it without dropping the MERGE definition.
Use partitioning.
Quote from MySQL Manual:
An alternative to a MERGE table is a partitioned table, which stores
partitions of a single table in separate files. Partitioning enables
some operations to be performed more efficiently and is not limited to
the MyISAM storage engine. For more information, see Chapter 18, Partitioning.
I would strongly advocate for partitioning, because:
- You can fully automate your logging / data retention process: a script can create new and remove empty partitions, and move obsolete data to a different table and then truncate that table (sketched below).
- Key uniqueness is enforced.
- Only the partition that contains the data to be deleted is locked; selects on other partitions run as normal.
- Searches run on all partitions at the same time (as with MERGE), but you can use HASH subpartitioning to further speed up searches.
However, if you believe that the benefits of partitioning will be outweighed by the cost of development, then maybe you should not delete that data at all.
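A hedged sketch of a partitioned version of the log (the table and column names are hypothetical; the question only shows sender/receiver, so the date column is an assumption):

CREATE TABLE match_log (
  sender     INT  NOT NULL,
  receiver   INT  NOT NULL,
  created_at DATE NOT NULL
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p2015_05 VALUES LESS THAN (TO_DAYS('2015-06-01')),
  PARTITION p2015_06 VALUES LESS THAN (TO_DAYS('2015-07-01'))
);

-- monthly retention script: add next month's partition, drop the oldest one
ALTER TABLE match_log ADD PARTITION (PARTITION p2015_07 VALUES LESS THAN (TO_DAYS('2015-08-01')));
ALTER TABLE match_log DROP PARTITION p2015_05;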
I think that the best solution would be to partition the log based on user id. This way, when you run a delete, the DB will block only one partition.
If you Google "delete on huge table" you'll get some informative results. Here are the first three hits:
http://www.dba-oracle.com/t_oracle_fastest_delete_from_large_table.htm
Improving DELETE and INSERT times on a large table that has an index structure
http://www.dbforums.com/microsoft-sql-server/1635228-fastest-way-delete-large-table.html
One method they all mention is deleting in small batches instead of all at once. You say that the table contains data for a two-month period; maybe you could run delete statements for each day separately?
I hope this helps!
If you use InnoDB and create FOREIGN KEY relations, you can get the rows deleted automatically when the user themself is deleted:
CREATE TABLE `DailyChoices` (
  sender INT(11) NOT NULL,
  receiver INT(11) NOT NULL,
  -- deleting a user will automatically delete their rows here
  CONSTRAINT FOREIGN KEY (sender) REFERENCES users (userid) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE = InnoDB;

Does a MySQL index get modified on UPDATE?

Let's say I have a MySQL database table 'article' with the following fields: id, title, url, views.
I have the field title marked with a FULLTEXT index and the field url marked with a UNIQUE index.
My question is, if I do an ordinary update, something like:
UPDATE 'article' SET views = views + 1 WHERE id = {id}
...will this result in an update of the MySQL table indexes?
Is it safe (from a speed point of view) to keep the field views in the table article, or should I instead create a separate table, let's say article_stats, with the following fields: article_id, views?
Yes, UPDATE statements update indexes. MySQL manages indexes automatically - you never need to worry about updating them manually or triggering an update. If you are asking whether that particular UPDATE will change your indexes, which don't include the views column - no, it won't. Only related indexes get updated.
Keeping a views column is fine, unless you need to track extra information about each view (when it occurred, which user made the view, etc.).
Your SQL does contain a syntax error, however. You can't quote table names like 'article'. If you need to quote a table name (e.g. if it contains a SQL reserved word), then use backticks like this:
UPDATE `article` SET ...
I agree with Cal: an index gets updated by an UPDATE statement only if you update a column that the index covers. In your specific example the indexes are not updated, because they are not related to the views field. Updating views will still slow things down, though, because it is such a frequent operation. You can programmatically keep the view counts in shared memory (with a binary tree or hash table) and write them to the table together after some time or size threshold. For the best speed you can also use MEMORY tables, which are volatile, but you can transfer the data to the actual table from time to time. That way you don't hit the disk for every single "view" update.
Keeping a separate table will result in the same thing: your application slows down because there is an update happening while there are selects - the row you are updating is locked until you are done with it, and other readers wait on your row-level operation. You will still have selects to show that view count, even with a separate table.
Well, when you have that much load, you can also set up master and slave servers to separate reads and writes and synchronize them from time to time.
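A hedged sketch of that buffering idea (article_views_buffer is a hypothetical name; counters accumulate in a MEMORY table and are flushed into article periodically):

CREATE TABLE article_views_buffer (
  article_id INT NOT NULL PRIMARY KEY,
  views INT NOT NULL DEFAULT 0
) ENGINE = MEMORY;

-- on every page view: cheap, no disk write
INSERT INTO article_views_buffer (article_id, views) VALUES (42, 1)
ON DUPLICATE KEY UPDATE views = views + 1;

-- periodically (e.g. from a cron job): flush the counters, then reset the buffer
-- (in production you would guard these two statements, e.g. with LOCK TABLES)
UPDATE article a JOIN article_views_buffer b ON b.article_id = a.id
SET a.views = a.views + b.views;
DELETE FROM article_views_buffer;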

Setting MySQL unique key or checking for duplicate in application part?

Which is more reliable and has better performance: setting a MySQL unique key and using INSERT IGNORE, or first checking if the data exists in the database and acting according to the result?
If the answer is the second one, is there any way to make a single SQL query instead of two?
UPDATE: I ask because my colleagues at the company where I work believe that dealing with such issues should be done in the application layer, which is more reliable according to them.
Your application won't catch duplicates.
Two concurrent calls can insert the same data, because each process doesn't see the other while your application checks for uniqueness. Each process thinks it's OK to INSERT.
You can force some kind of serialisation, but then you have a bottleneck and a performance limit. And you will have other clients writing to the database, even if it is just a release script.
That is why there are such things as unique indexes and constraints generally. Foreign keys, triggers, check constraints, NULL/NOT NULL and datatype constraints are all there to enforce data integrity.
There is also the arrogance of some code monkey thinking they can do better.
See programmers.se: Constraints in relational databases - Why not remove them completely? and this: Enforcing Database Constraints In Application Code (SO)
Setting a unique key is better. It will reduce the number of round-trips to MySQL you have to make for a single operation, and item uniqueness is ensured, reducing errors caused by your own logic.
You definitely should set a unique key in your MySQL table, no matter what you decide.
As for the other part of your question, definitely use INSERT ... ON DUPLICATE KEY UPDATE if that is what you intend for your application.
I.e. if you're going to load a bunch of data and you don't care what the old data was, you just want the new data, that is the way to go.
On the other hand, if there is some sort of decision branch that is based on whether the change is an update or a new value, I think you would have to go with option 2.
I.e. If changes to the table are recorded in some other table (e.g. table: change_log with columns: id,table,column,old_val,new_val), then you couldn't just use INSERT IGNORE because you would never be able to tell which values were changed vs. which were newly inserted.
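For the record, a hedged sketch of both variants (items and its url column are hypothetical names):

ALTER TABLE items ADD UNIQUE KEY uq_items_url (url);

-- silently skip rows whose url already exists
INSERT IGNORE INTO items (url, title) VALUES ('http://example.com/a', 'A');

-- or overwrite the stored value when the key already exists
INSERT INTO items (url, title) VALUES ('http://example.com/a', 'A (new)')
ON DUPLICATE KEY UPDATE title = VALUES(title);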

Session / Log tables keys design question

I have almost always heard people say not to use FKs with user session and log tables, as those are usually high-write tables and, once written, the data almost always stays forever without any updates or deletes.
But the question is, I have columns like these:
User_id (links a session or activity log to the user)
activity_id (links the log activity table to the system activity lookup table)
session_id (links the user log table with the parent session)
... and there are 4-5 more columns.
So if I don't use FKs, then how will I "relate" these columns? Can I join tables and get the user info without FKs? Can I write correct data without FKs? Is there any performance impact, or do people just talk and say this is a no-no?
Another question I have is: if I don't use FKs, can I still connect my data with lookup tables?
In fact, you can build the whole database without real FKs in MySQL. If you're using MyISAM as a storage engine, the FKs aren't real anyway.
You can nevertheless do all the joins you like, as long as the join keys match.
The performance impact depends on how much data you stuff into a referenced table. It takes extra time if you have a FK in a table and insert data into it, or update a FK value. Upon insertion or modification, the FK needs to be looked up in the referenced table to ensure referential integrity.
On highly used tables which don't really need reference integrity, I'd just stick with loose columns instead of FKs.
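For example, a join works purely on matching values, whether or not a FK constraint exists (table and column names below are hypothetical):

SELECT u.username, l.activity_id, l.created_at
FROM activity_log AS l
JOIN users AS u ON u.user_id = l.user_id   -- no FOREIGN KEY needed for this join
WHERE l.session_id = 12345;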
AFAIK InnoDB is currently the only one supporting real foreign keys (unless MySQL 5.5 got new or updated storage engines which support them as well). Storage engines like MyISAM do support the syntax, but don't actually validate the referential integrity.
FK's can be detrimental in "history log" tables. This kind of table wants to preserve the exact state of what happened at a point in time.
The problem with FK's is they don't store the value, just a pointer to the value. If the value changes, then the history is lost. You DO NOT WANT updates to cascade into your history log. It's OK to have a "fake foreign key" that you can join on, but you also want to intentionally de-normalize relevant fields to preserve the history.
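A hedged sketch of that idea (order_history and all its columns are hypothetical names): the id columns stay as plain join columns with no constraints, while the fields whose historical values matter are copied into the log row at write time.

CREATE TABLE order_history (
  history_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
  order_id    INT NOT NULL,            -- "fake foreign key": joinable, but no constraint
  user_id     INT NOT NULL,            -- likewise no FK, so nothing cascades into the log
  user_email  VARCHAR(255) NOT NULL,   -- de-normalized copy, preserved even if the user changes it later
  order_total DECIMAL(10,2) NOT NULL,  -- de-normalized copy of the total at that point in time
  logged_at   DATETIME NOT NULL
) ENGINE = InnoDB;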