MySQL: updating a table from another table with a LEFT JOIN vs. iterating - mysql

I have two tables, T1 and T2, and I want to update one field of T1 from T2, where T2 holds massive data.
Which is more efficient:
updating T1 in a for loop, iterating over the values,
or
left-joining it with T2 and updating?
Please note that I'm running these updates from a shell script.

In general, the JOIN will perform much better than a loop. Size should not be an issue as long as the join columns are properly indexed.
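For example, a single set-based statement along these lines (T1.id, T2.id and some_field are placeholders for your actual key and data columns) lets MySQL do the work in one pass instead of one round trip per row from the shell script:
UPDATE T1
JOIN T2 ON T1.id = T2.id      -- join column assumed indexed, ideally the primary key
SET T1.some_field = T2.some_field;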

There is no simple answer as to which will be more efficient; it depends on the table sizes and on how much data you update in one go.
Suppose you are using the InnoDB engine and updating 1,000 or more rows in one go via a join of two heavy tables, and doing so frequently: that is not a good idea on a production server, because it will lock the table for some time, and that locking can in turn stall other operations.
Option 1: If you are updating a few rows, matched on properly indexed fields (ideally the primary key), then you can go with the join.
Option 2: If you are updating a large amount of data based on a multi-table join, then the approach below is better:
Step 1: Create a stored procedure.
Step 2: Keep the results of the query below in a cursor.
Suppose you want to update field1 of table1 with the corresponding field2 data from table2:
SELECT a.primary_key, b.field2 FROM table1 a JOIN table2 b ON a.primary_key = b.foreign_key WHERE [place condition here, if any];
Step 3: Update the rows one by one, by primary key, using the values held in the cursor.
Step 4: Call this stored procedure from your script. A sketch of such a procedure follows.
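A minimal sketch, assuming table1's primary key is an INT column named primary_key and that field1/field2 are VARCHARs (all names and types here are placeholders to adapt):
DELIMITER //
CREATE PROCEDURE update_field1_rowwise()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_key INT;
  DECLARE v_field2 VARCHAR(255);
  -- Cursor over the join results from Step 2
  DECLARE cur CURSOR FOR
    SELECT a.primary_key, b.field2
    FROM table1 a
    JOIN table2 b ON a.primary_key = b.foreign_key;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_key, v_field2;
    IF done = 1 THEN
      LEAVE read_loop;
    END IF;
    -- One short single-row update per iteration keeps each lock brief
    UPDATE table1 SET field1 = v_field2 WHERE primary_key = v_key;
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;
From the shell script it can then be invoked with something like: mysql -e "CALL update_field1_rowwise();"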

Related

Best way to write SQL delete statement, deleting pairs of records

I have a MySQL database with just 1 table:
Fields are: blocknr (not unique), btcaddress (not unique), txid (not unique), vin, vinvoutnr, netvalue.
Indexes exist on both btcaddress and txid.
[The post showed a screenshot of sample data, with one "deletable" pair highlighted in red.]
I need to delete all "deletable" record pairs.
Conditions are:
txid must be the same (there can be more than 2 records with the same txid)
vinvoutnr must be the same
vin must be different (it can only be 0 or 1, so one record must have 0 and the other 1)
In a table of 36M records, about 33M records will be deleted.
I've used this:
delete t1
from registration t1
inner join registration t2
where t1.txid=t2.txid and t1.vinvoutnr=t2.vinvoutnr and t1.vin<>t2.vin;
It works but takes 5 hours.
Maybe this would work too (not tested yet):
delete t1
from registration as t1, registration as t2
where t1.txid=t2.txid and t1.vinvoutnr=t2.vinvoutnr and t1.vin<>t2.vin;
Or should I forget about a DELETE query and instead build a new table containing all the non-deletable rows, then drop the original?
Database can be offline for this delete query.
Based on your question, you are deleting most of the rows in the table. That is just really expensive. A better approach is to empty the table and re-populate it:
create table temp_registration as
<query for the rows to keep here>;
truncate table registration;
insert into registration
select *
from temp_registration;
Your logic is a bit hard to follow, but I think the logic on the rows to keep is:
select r.*
from registration r
where not exists (select 1
                  from registration r2
                  where r2.txid = r.txid and
                        r2.vinvoutnr = r.vinvoutnr and
                        r2.vin <> r.vin
                 );
For best performance, you want an index on registration(txid, vinvoutnr, vin).
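If that index does not exist yet, it could be added with something like this (the index name is arbitrary):
ALTER TABLE registration ADD INDEX idx_txid_vinvoutnr_vin (txid, vinvoutnr, vin);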
Given that you expect to remove the majority of your data, it does sound like the simplest approach would be to create a new table with the correct data and then drop the original table, as you suggest. Otherwise, ADyson's corrections to the JOIN query might help to alleviate the performance issue.

How to select from multiple tables after joining?

I am trying to inner join multiple tables (table_A, table_B and table_C) with a table_X. table_X is selected from another table (table_Y) using LIKE, and it takes a very long time to create. How do I do this efficiently?
Currently, I run the following query for table_A, and repeat the process for table_B and table_C:
SELECT * FROM
table_A INNER JOIN
(SELECT ID FROM table_Y WHERE ID LIKE '%keyword%') AS table_X
USING (ID);
Since table_X takes a lot of time to create, I would like to select from table_A, table_B and table_C in one query. How do I do it?
Several things to note:
My expected result is three separate tables, not one combined table.
I do not have permission to create a temporary table in the database.
A query returns a result set, not a table, and a query can only return a single result set.
You will need three separate queries to get your desired results.
If your core goal is to reduce the cost of your table_X subquery, you could create a temporary table to store the results of the table_X subquery, and then join to that table in your queries with table_A, table_B, and table_C.
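A minimal sketch of that idea, for readers who do have the CREATE TEMPORARY TABLES privilege (the index name is arbitrary):
CREATE TEMPORARY TABLE table_X AS
SELECT ID FROM table_Y WHERE ID LIKE '%keyword%';
-- Index the temp table so the three joins below stay cheap
ALTER TABLE table_X ADD INDEX idx_id (ID);
SELECT table_A.* FROM table_A INNER JOIN table_X USING (ID);
SELECT table_B.* FROM table_B INNER JOIN table_X USING (ID);
SELECT table_C.* FROM table_C INNER JOIN table_X USING (ID);
-- Dropped automatically on disconnect, but explicit is safer with pooled connections
DROP TEMPORARY TABLE table_X;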
Edit: things to keep in mind:
True TEMPORARY tables are only visible to the connection in which they are created and are automatically dropped when that connection is closed; but they will persist if a connection is reused (from a connection pool, for example), so it is still good practice to drop them explicitly. True temporary tables also have limits on how they can be used, most notably that they can only be referenced once in any given query (no self joins, joins to multiple references, or unions with multiple parts referencing the same table).
Assuming you have the proper permissions, you can create normal tables that you intend to drop when finished; but care must be taken, because such tables can generally be seen by all connections, and a disconnect will not "clean up" such tables. They can perform better and do not have the limitations of true temporary tables, but you need to weigh the risks against the benefits.
If you do not have any CREATE TABLE permissions, most of your data processing happens client side, and you do not expect enormous results from the costly subquery, you could collect the subquery results first and use them to dynamically construct the later queries.
Very rough pseudocode:
query: SELECT ID FROM table_Y WHERE [expensive condition(s)];
code: convert ID values received into a comma separated list
query: SELECT [stuff] FROM Table_A WHERE ID IN ([ID values from expensive query]);
query: SELECT [other_stuff] FROM Table_B WHERE ID IN ([ID values from expensive query]);
query: SELECT [more_stuff] FROM Table_C WHERE ID IN ([ID values from expensive query]);

Raising MySQL's 61-table join limit

I have a Drupal 6 application that requires more joins than MySQL's 61-table join limit allows. I understand that this is an excessive number, but the query is run only once a day, and the results are cached for further reference.
Are there any MySQL configuration parameters that could help, or any other approaches short of changing the logic behind collecting the data?
My approach would be to split the humongous query into smaller, simpler queries, and use temporary tables to store the intermediate steps. I use this approach frequently and it helps me a lot (sometimes it is even faster to create some temp tables than to join all the tables in one big query).
Something like this:
drop table if exists temp_step01;
create temporary table temp_step01
select t1.*, t2.someField
from table1 as t1 inner join table2 as t2 on t1.id = t2.table1_id;
-- Add the appropriate indexes to optimize the subsequent queries
alter table temp_step01
add index idx_1 (field1);
-- Create all the temp tables that you need, and finally show the results
select sXX.*
from temp_stepXX as sXX;
Remember: temporary tables are visible only to the connection that creates them. If you need to make the result visible to other connections, you'll need to create a "real" table (of course, that is only worthwhile for the last step of your process).

Substitute for subquery to delete records from a table

I'm using this query to delete unique records from one table:
DELETE FROM table1 WHERE id NOT IN (SELECT id FROM table2);
But the problem is that both tables have millions of records, and using the subquery is very slow.
Can anyone suggest an alternative?
DELETE t1
FROM table_1 t1
LEFT JOIN table_2 t2 ON t1.id = t2.id
WHERE t2.id IS NULL;
Subqueries are really slow; in fact, that's what joins are for!
DELETE table1
FROM table1 LEFT JOIN table2 ON table1.id = table2.id
WHERE table2.id is null
Deleting millions of records from a table always has performance implications; you need to check whether the table has:
1. Constraints,
2. Triggers, and
3. Indexes
on it. These things will make your delete even slower, so please disable them before this activity. You should also check the ratio of the "to be deleted" records to the entire table volume. If the number of records to be deleted is more than 50% of the entire table, then you should consider the approach below:
Create a temporary table containing records that you want to retain from the original table.
Drop the original table.
Rename temporary table to original table.
Before going for the above approach, please make sure that you have a copy of the definition of each object dependent on the original table (constraints, indexes, triggers, etc.). You may also need to check whether the table you are going to delete has any child tables.
Once this activity is complete, you can enable the constraints, indexes, and triggers again! A sketch of the rebuild is given below.
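A minimal sketch of steps 1 through 3 for the tables in this question (assuming id is unique in table_2; all names are placeholders):
-- Capture the original definition (indexes, constraints, triggers) first:
-- SHOW CREATE TABLE table_1;
CREATE TABLE table_1_new AS
SELECT t1.*
FROM table_1 t1
INNER JOIN table_2 t2 ON t1.id = t2.id;   -- the rows to retain
-- Re-create the indexes/constraints captured above on table_1_new here, e.g.:
-- ALTER TABLE table_1_new ADD PRIMARY KEY (id);
RENAME TABLE table_1 TO table_1_old,
             table_1_new TO table_1;      -- atomic swap
DROP TABLE table_1_old;                   -- only after verifying the new table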
Thanks,
Aditya

Is the UPDATE command more resource-hungry than INSERT?

The scripts I've been working with in SQL handle close to 40,000 records, and I've noticed a huge increase in execution time when I use an UPDATE command.
For 2 tables that each have about 10 fields, the INSERTs for both combined execute quicker than this single UPDATE command:
UPDATE table1
INNER JOIN table2 ON table1.primarykey = table2.primarykey
SET table1.code = table2.code
What the UPDATE really does is copy the code from one table to the other where identical records exist. table1 is a staging table between 2 databases, while table2 is a processing table used to insert the staging table's data across multiple tables; both tables have the same number of records, about 40,000.
To my mind the UPDATE should execute a lot quicker: it's only joining 2 identical tables and setting data for 1 field, so it should run faster than 2 INSERTs that create 40,000 records across 10 fields each (in other words, inserting 800,000 pieces of data). I'm running the queries in a SQL console window to avoid PHP timeouts.
Is UPDATE somehow more resource-hungry than INSERT, and is there any way to make it go faster (apart from changing the fact that I use a separate table for processing)? The staging table updates frequently, so I copy the data like a snapshot and work with that. The code field is NULL to begin with, so I only copy over records with a NULL code, meaning records where code is not NULL have already been processed.
Is that UPDATE command the actual SQL? Because you need a WHERE clause to avoid updating every record in the table...
Also, an INSERT doesn't first need to find the record to update across 2 joined tables.
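For example, since the asker notes that only rows with a NULL code still need copying, a WHERE clause along these lines (using the column names from the question) would stop the UPDATE from rewriting all 40,000 rows on every run:
UPDATE table1
INNER JOIN table2 ON table1.primarykey = table2.primarykey
SET table1.code = table2.code
WHERE table1.code IS NULL;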