Efficiency when deleting rows in two tables - mysql

I have table1(id_table1) and table2(id_table2, id_table1). I'd like to remove records from table2 (under a given condition) and then also remove the rows in table1 that no longer have any related rows in table2. What is the most efficient way to do that in SQL? I'm using MySQL.
Thanks in advance!

If you use InnoDB, add a foreign key constraint with ON DELETE CASCADE. Related rows are then deleted automatically when the row they reference is removed, so you don't have to query the database after deleting rows in table2 to check whether the relation is still intact.
Foreign key constraints
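For reference, a minimal sketch of such a constraint, assuming InnoDB and the column names from the question (the constraint name is illustrative); with it in place, deleting a row from table1 automatically deletes the table2 rows that reference it:
ALTER TABLE table2
  ADD CONSTRAINT fk_table2_table1
  FOREIGN KEY (id_table1) REFERENCES table1 (id_table1)
  ON DELETE CASCADE;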

In addition to cularis's answer, here is a less efficient option for when you're using MyISAM, which doesn't have foreign key constraints.
Create a trigger:
DELIMITER $$
CREATE TRIGGER ad_table2_each AFTER DELETE ON table2 FOR EACH ROW
BEGIN
  -- remove the parent row only once no table2 rows reference it any more
  DELETE FROM table1
  WHERE table1.id_table1 = OLD.id_table1
    AND NOT EXISTS (SELECT 1 FROM table2 WHERE table2.id_table1 = OLD.id_table1);
END $$
DELIMITER ;
http://dev.mysql.com/doc/refman/5.5/en/triggers.html
http://dev.mysql.com/doc/refman/5.5/en/create-trigger.html

Assuming you did not set up any cascading deletes, and since you asked how to do it in SQL, I can see two options:
1) delete from table2 where (condition);
   delete from table1 where id_table1 not in (select distinct id_table1 from table2);
2) delete from table1 where id_table1 in (select distinct id_table1 from table2 where (condition));
   delete from table2 where id_table1 not in (select id_table1 from table1);
Assuming table2 is much larger than table1 and the condition shrinks it considerably:
method 1 scans the full table2 once, deletes many records, then scans it once more;
method 2 scans the full table2 twice.
This makes me think method 1 is a little more efficient when the tables are very large.

Related

Removing duplicate rows within a trigger in SQL

So I've made table1 and table2 and a trigger such that when there's an insert on table1, data gets inserted into table2 from table1. I then went a step further with another trigger: after an insert on table2, it inserts into another table, table3, with data from table2. The trigger in place is 'FOR EACH ROW', so unfortunately, when a second insert happens on table1, it goes into table2, and table3 reads in the new, second row AND the first row again.
Ideally to prevent this from happening, or to reduce the impact, it would make sense to remove duplicates at the end or at the start of a respective trigger so it's not exponentially filling up tables with duplicate rows. However, I've not been able to find a way to do it thus far within a trigger. Is it even possible? Any help? The tables also do not have Primary or Foreign Keys. Thanks in advance.
An example of what I've tried so far:
DELETE FROM table2 WHERE rowid NOT IN (SELECT MIN(rowid) FROM table2 GROUP BY col1, col2, col3, ...);
Though I think this is for SQLite? As I've seen this working for SQLite databases, whereas here I just get an error saying it can't recognise the column rowid.
I also tried WHERE NOT EXISTS during insert, which works for not inserting duplicates in the first place; however, I need an update as part of the trigger that changes some column values, so it won't work in this case, as the rows will always differ from their initial insert.
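For comparison, here is a hedged sketch of how this deduplication is commonly written in MySQL, which has no implicit rowid. It assumes you are free to add a surrogate AUTO_INCREMENT key (row_id and the col1/col2/col3 names are illustrative), and it would have to run as a standalone statement rather than inside a trigger on table2 itself, since MySQL does not let a trigger modify the table it fires on:
-- MySQL has no implicit rowid, so add a surrogate key to stand in for it
ALTER TABLE table2 ADD COLUMN row_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY;
-- keep the earliest row of each (col1, col2, col3) group and delete the later copies
DELETE later
FROM table2 AS later
JOIN table2 AS earlier
  ON  earlier.col1 = later.col1
  AND earlier.col2 = later.col2
  AND earlier.col3 = later.col3
  AND earlier.row_id < later.row_id;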

Substitute for SubQuery to delete records from table

I'm using this query to delete unique records from one table.
DELETE FROM TABLE1 WHERE ID NOT IN (SELECT ID FROM TABLE2)
But the problem is that both tables have millions of records, and using a subquery will be very slow.
Can anyone suggest an alternative?
DELETE t1
FROM table_1 t1
LEFT JOIN table_2 t2 ON t1.id = t2.id
WHERE t2.id IS NULL
Subqueries are really slow; this is exactly what joins exist for!
DELETE table1
FROM table1 LEFT JOIN table2 ON table1.id = table2.id
WHERE table2.id is null
Deleting millions of records from a table always has performance implications; you need to check whether the table has
1. constraints,
2. triggers, and
3. indexes
on it. These things will make your delete even slower, so please disable them before this activity. You should also check the ratio of the "to be deleted" records to the entire table volume. If the number of records to be deleted is more than 50% of the entire table volume, then you should consider the approach below:
Create a temporary table containing records that you want to retain from the original table.
Drop the original table.
Rename temporary table to original table.
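A hedged sketch of those three steps, assuming the rows to retain are the ones whose id also exists in table_2 (the _keep table name is illustrative); verify the copy before dropping the original:
-- 1. build a copy containing only the rows to retain
CREATE TABLE table_1_keep LIKE table_1;
INSERT INTO table_1_keep
SELECT t1.*
FROM table_1 AS t1
WHERE EXISTS (SELECT 1 FROM table_2 AS t2 WHERE t2.id = t1.id);
-- 2. drop the original (only after verifying the copy)
DROP TABLE table_1;
-- 3. rename the copy to the original name
RENAME TABLE table_1_keep TO table_1;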
Before going for the above approach, please make sure that you have a copy of the definition of each of the objects dependent on this original table, like the constraints, indexes, triggers etc. You may also need to check whether the table you are going to drop has any child tables.
Once this activity is complete, you can enable the constraints, indexes, triggers again!
Thanks,
Aditya

Should I check if rows exist across tables before deleting in MyISAM?

Let's say I have 5 MyISAM tables in my database. Each table has a key, let's call it "id_num" ...
"id_num" is the field which I use to connect all the tables together. A certain value of "id_num" may appear in all tables or sometimes only a subset of the tables.
If I want to delete all instances of a certain "id_num" in the database, can I just make a DELETE command on all tables or should I check to see if that value for "id_num" exists?
DELETE FROM table1 WHERE id_num = 123;
DELETE FROM table2 WHERE id_num = 123;
DELETE FROM table3 WHERE id_num = 123;
DELETE FROM table4 WHERE id_num = 123;
DELETE FROM table5 WHERE id_num = 123;
Or should I perform a SELECT command first on each table to check if these rows exist in the table before deletion? What is best practice?
(I am using MyISAM so cascading delete is not an option here.)
To answer your question about first running SELECT, there's no advantage to doing so. If there's no row in a given table, then the DELETE will simply affect zero rows. If there are matching rows, then doing the SELECT first and then the DELETE would just be doing double the work of finding the rows. So just do the DELETE and get it over with.
Are you aware that MySQL has multi-table DELETE syntax?
If you are certain that table1 has a matching row, you can use outer joins for the others:
DELETE table1.*, table2.*, table3.*, table4.*, table5.*
FROM table1
LEFT OUTER JOIN table2 USING (id_num)
LEFT OUTER JOIN table3 USING (id_num)
LEFT OUTER JOIN table4 USING (id_num)
LEFT OUTER JOIN table5 USING (id_num)
WHERE table1.id_num = 123;
I'm assuming id_num is indexed in all these tables, otherwise doing the JOIN will perform poorly. But doing the DELETE without the aid of an index to find the rows would perform poorly too.
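If any of the tables lack that index, adding it is a one-liner; a sketch, with an illustrative index name:
ALTER TABLE table1 ADD INDEX idx_id_num (id_num);
-- repeat for table2 through table5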
Sounds like you need to change your design as follows - have a table with id_num as a PK and make id_num a FK in the other tables, with on-delete-cascade. This will allow you to only run a single delete statement to delete all applicable data (and this is also generally the more correct way of doing things).
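A hedged sketch of that layout, assuming the tables are (or can be converted to) InnoDB, since MyISAM does not enforce foreign keys; the parent table name is illustrative:
CREATE TABLE id_master (
  id_num INT PRIMARY KEY
) ENGINE=InnoDB;
CREATE TABLE table1 (
  id_num INT NOT NULL,
  -- ... the table's other columns ...
  FOREIGN KEY (id_num) REFERENCES id_master (id_num) ON DELETE CASCADE
) ENGINE=InnoDB;
-- table2 through table5 get the same foreign key; one statement then removes the value everywhere:
DELETE FROM id_master WHERE id_num = 123;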
The above apparently doesn't work in MyISAM, but there is a workaround using triggers (though at that point it does seem like a less appealing option).
But I believe your queries above should work; there is no need to check whether something exists first, the DELETE will simply do nothing.
Most APIs provide you with some sort of rows affected count if you'd like to see whether data was actually deleted.
You should not execute a SELECT query before deleting from the table, as the SELECT will only put extra load on the server. After executing the DELETE you can check how many rows were deleted using the mysql_affected_rows() function in PHP.
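If you'd rather check it on the SQL side, MySQL's ROW_COUNT() function reports how many rows the last statement affected; for example:
DELETE FROM table1 WHERE id_num = 123;
SELECT ROW_COUNT();  -- number of rows the DELETE just removed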

Delete many rows in MySQL

I am deleting rows on the order of hundreds of thousands from a remote DB. Each delete has its own target, e.g.
DELETE FROM tablename
WHERE (col1=c1val1 AND col2=c2val1) OR (col1=c1val2 AND col2=c2val2) OR ...
This has been almost twice as fast for me as individual queries, but I was wondering if there's a way to speed this up more, as I haven't been working with SQL very long.
Create a temporary table and fill it with all your value pairs, one per row. Name the columns the same as the matching columns in your table.
CREATE TEMPORARY TABLE donotwant (
col1 INT NOT NULL,
col2 INT NOT NULL,
PRIMARY KEY (col1, col2)
);
INSERT INTO donotwant VALUES (c1val1, c2val1), (c1val2, c2val2), ...
Then execute a multi-table delete based on the JOIN between these tables:
DELETE t1 FROM `tablename` AS t1 JOIN `donotwant` USING (col1, col2);
The USING clause is shorthand for ON t1.col1=donotwant.col1 AND t1.col2=donotwant.col2, assuming the columns are named the same in both tables, and you want the join condition where both columns are equal to their namesake in the joined table.
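Spelled out with an explicit ON clause, purely for illustration, the same delete looks like this:
DELETE t1
FROM `tablename` AS t1
JOIN `donotwant` AS d
  ON  t1.col1 = d.col1
  AND t1.col2 = d.col2;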
Generally speaking, the fastest way to do bulk DELETEs is to put the ids to be deleted into a temp table of some sort, then use that as part of the query:
DELETE FROM table
WHERE (col1, col2) IN (SELECT col1, col2
FROM tmp)
Inserting can be done via a standard:
INSERT INTO tmp VALUES (...), (...), ...;
statement, or by using the DB's bulk-load utility.
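For MySQL, the bulk-load route is LOAD DATA INFILE; a sketch, assuming the value pairs sit in a local CSV file (the path is illustrative, and LOCAL must be enabled on both client and server):
LOAD DATA LOCAL INFILE '/tmp/pairs.csv'
INTO TABLE tmp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(col1, col2);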
I doubt it makes much difference to performance but you can write that kind of thing this way...
DELETE
FROM table
WHERE (col1,col2) IN(('c1val1','c2val1'),('c1val2','c2val2')...);

How to properly index a linking table for many-to-many connection in MySQL?

Let's say I have a simple many-to-many table between tables "table1" and "table2" that consists of two int fields: "table1-id" and "table2-id". How should I index this linking table?
I used to just make a composite primary index (table1-id, table2-id), but I read that this index might not be used if you change the order of the fields in the query. So what's the optimal solution then: independent indexes for each field without a primary index?
Thanks.
It depends on how you search.
If you search like this:
/* Given a value from table1, find all related values from table2 */
SELECT *
FROM table1 t1
JOIN table_table tt ON (tt.table_1 = t1.id)
JOIN table2 t2 ON (t2.id = tt.table_2)
WHERE t1.id = #id
then you need:
ALTER TABLE table_table ADD CONSTRAINT pk_table1_table2 PRIMARY KEY (table_1, table_2)
In this case, table1 will be the leading table in the NESTED LOOPS, and your index will be usable only when table_1 is its first column.
If you search like this:
/* Given a value from table2, find all related values from table1 */
SELECT *
FROM table2 t2
JOIN table_table tt ON (tt.table_2 = t2.id)
JOIN table1 t1 ON (t1.id = tt.table_1)
WHERE t2.id = #id
then you need:
ALTER TABLE table_table ADD CONSTRAINT pk_table1_table2 PRIMARY KEY (table_2, table_1)
for the reasons above.
You don't need independent indices here. A composite index can be used everywhere where a plain index on the first column can be used. If you use independent indices, you won't be able to search efficiently for both values:
/* Check if relationship exists between two given values */
SELECT 1
FROM table_table
WHERE table_1 = #id1
AND table_2 = #id2
For a query like this, you'll need at least one index that covers both columns.
It's never bad to have an additional index for the second field:
ALTER TABLE table_table ADD CONSTRAINT pk_table1_table2 PRIMARY KEY (table_1, table_2)
CREATE INDEX ix_table2 ON table_table (table_2)
The primary key will be used for searches on both values and for searches based on the value of table_1; the additional index will be used for searches based on the value of table_2.
As long as you are specifying both keys in the query, it doesn't matter what order they have in the query, nor does it matter what order you specify them in the index.
However, it's not unlikely that you will sometimes have only one or the other of the keys. If you sometimes have id_1 only, then that key should come first in the index (but you still only need one index).
If you sometimes have one, sometimes the other, sometimes both, you'll need one index with both keys, and a second (non-unique) index with one field - the more selective of the two keys - and the primary composite index should start with the other key.
@Quassnoi, in your first query you're actually using only the tt.table_1 key, as we can see from the WHERE clause: WHERE t1.id = #id. And in the second query, only tt.table_2.
So the multi-column index could be useful only in the third query, because of WHERE table_1 = #id1 AND table_2 = #id2. If queries of this kind are not going to be used, do you think it's worth using two separate one-column indices instead?