MySQL update query over 10 million records

I've got to update a field in a table (250 000 records) with the value of a field from another table (10 000 000 records) based on the email...
I've tried:
UPDATE table1 t1, table2 t2
SET t1.country = t2.country
WHERE t1.email = t2.email
But it just sat at "Query is being executed" forever.
What query should I use?
Thanks

This would be a good opportunity to employ a JOIN.
UPDATE table1 AS t1
JOIN table2 t2 ON t1.email = t2.email
SET t1.country = t2.country
It will still take a while for your query to process, but it should reduce the time by a significant amount.

I don't see an obvious error in your query (and the database would report one if there were). So the problem we're looking at is speeding up the execution of your update. Is either of your two email fields indexed? If not, could you add an index and try again? E.g. ALTER TABLE table2 ADD INDEX(email)
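To confirm the index is actually being used before re-running the slow UPDATE, you can EXPLAIN the equivalent SELECT (a sketch; table and column names are taken from the question):

```sql
-- EXPLAIN on the matching SELECT shows whether the email index is used;
-- look for the index in the "key" column and an access type other than ALL
EXPLAIN
SELECT t1.country, t2.country
FROM table1 t1
JOIN table2 t2 ON t1.email = t2.email;
```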

Related

Removing duplicates on mysql table [the table is > 2Gb]

Problem - we have many duplicate rows in our tables, which makes the calculations inaccurate
The solution I tried - I wrote a DELETE with an INNER JOIN to remove the duplicates (based on my research this is the fastest method), tested it on staging where it worked, then ran it on production hoping it would finish within 1-2 days at most.
Here is the query I was using:
DELETE t1 FROM table t1
INNER JOIN
table t2
WHERE t1.id > t2.id
AND t1.col1 = t2.col1
AND t1.col2 = t2.col2
AND t1.col3 = t2.col3
AND t1.col4 = t2.col4
Problems with the solution -
I expected the query to run for a few hours, or at worst 2-3 days, but after 4 days it was still running and I had to kill the process. I also tried it on a smaller table that was a segment of my original table, and again it took hours and hours. I cannot afford to have a query running on my table for weeks, as I am doing lots of calculations on this table and I don't want it to get locked.
Deleting a large number of rows from a table is very expensive. I would recommend creating a new table with the rows you want and then (maybe) repopulating the original table.
You can start with:
CREATE TABLE temp_t AS
SELECT t1.*
FROM t t1
WHERE t1.id = (SELECT MIN(t2.id)
               FROM t t2
               WHERE t2.col1 = t1.col1 AND
                     t2.col2 = t1.col2 AND
                     t2.col3 = t1.col3 AND
                     t2.col4 = t1.col4
);
For this to work in a reasonable amount of time, you need an index on t(col1, col2, col3, col4)! The index is quite important (and might take some time to build).
Then, you can decide if you want to repopulate your original table. If you have validated that the above is correct, you can do:
truncate table t;
insert into t
select * from temp_t;
Of course, you should back up your table/database before doing something like this.
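As an alternative to truncate-and-insert, the swap can be done with an atomic RENAME TABLE, which avoids a window where t is empty (a sketch; it assumes no foreign keys or triggers reference t, and t_old is a hypothetical name):

```sql
-- Swap the deduplicated copy in atomically, keeping the old data as t_old
RENAME TABLE t TO t_old, temp_t TO t;

-- Once the new t is validated, the old copy can be dropped
-- DROP TABLE t_old;
```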

SQL UPDATE query taking too long

So I am very new to MySQL, and I am trying to run a query to update a column if a cell value is present in both tables, and the query is taking forever to run (It's been running for 10 minutes now and no result yet). One of my tables is about 250,000 rows, and the other is about 80,000, so I'm not sure why it is taking so long. The query I am using is:
USE the_db;
UPDATE table1
JOIN table2
ON table2.a = table1.b
SET table1.c = "Y";
I've changed the names of the tables and columns, but the query is exactly the same. I've looked at other answers on here and all of them take a very long time as well. Any help would be appreciated, thanks.
For this query:
UPDATE table1 JOIN
table2
ON table2.a = table1.b
SET table1.c = 'Y';
You want an index on table2(a):
create index idx_table2_a on table2(a);
Also, if there are multiple values of a that match each b, then you could also be generating a lot of intermediate rows, and that would have a big impact on performance.
If that is the case, then phrase the query as:
UPDATE table1
SET table1.c = 'Y'
WHERE EXISTS (SELECT 1 FROM table2 WHERE table2.a = table1.b);
And you need the same index.
The difference between the queries is that this one stops at the first matching row in table2.

Is mysql update query slow?

I have been trying to update a column in a table through raw SQL in Ruby on Rails, as below:
db_connection.execute("update table1 t1 join table2 t2 on t1.s_number = t2.product_id set t1.name = (select name from table3 where mid_size = t2.level)")
It's very slow and takes too much time. Is there a better approach for bulk updates in Rails through SQL? I expect the same would happen if I did it from ActiveRecord as well.
More information: table1 has about 100,000 records and table2 about 250,000.
Share your thoughts.
It depends entirely on the indexes on both tables.
Try EXPLAIN on your query; it will give you insight.
It would also help if you shared the CREATE TABLE statements for both of your tables.
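As a sketch of that advice (the index names are hypothetical; the column names come from the query in the question):

```sql
-- See how MySQL plans the join; a "type" of ALL means a full table scan
EXPLAIN
SELECT t1.name
FROM table1 t1
JOIN table2 t2 ON t1.s_number = t2.product_id;

-- Indexes on the join and subquery lookup columns usually matter most here
ALTER TABLE table2 ADD INDEX idx_product_id (product_id);
ALTER TABLE table3 ADD INDEX idx_mid_size (mid_size);
```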

Performance of mysql WHERE ... IN

I have mysql queries with a WHERE IN statement.
SELECT * FROM table1 WHERE id IN (1, 2, 15, 17, 150 ....)
How will it perform with hundreds of IDs in the IN clause? Is it designed to work with many arguments? (My table will have hundreds of thousands of rows, and id is the primary key.)
Is there a better way to do it?
EDIT: I am getting the IDs from the result set of a search server query, so not from the database. I guess a JOIN statement wouldn't work.
I am not sure how WHERE ... IN performs, but it sounds to me like a JOIN or maybe a subselect would be the better choice here.
See also: MYSQL OR vs IN performance and http://www.slideshare.net/techdude/how-to-kill-mysql-performance
You should put the IN clause "arguments" into another table, table2 for instance.
Afterwards you run this:
SELECT t1.* FROM table1 t1
INNER JOIN table2 t2 ON t1.Id = t2.Id
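Since the IDs come from a search server rather than the database, one way to set up that join is to load them into a temporary table first (a sketch; temp_ids is a hypothetical name, and the IDs shown are the ones from the question):

```sql
-- Load the search-server IDs into a temporary table, then join on it;
-- the PRIMARY KEY gives the join an index and deduplicates the IDs
CREATE TEMPORARY TABLE temp_ids (id INT PRIMARY KEY);
INSERT INTO temp_ids (id) VALUES (1), (2), (15), (17), (150);

SELECT t1.*
FROM table1 t1
INNER JOIN temp_ids ti ON t1.id = ti.id;
```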

Single query to get data from 3 separate tables based on provided timestamp

I'm working with three tables which can be summarized as follows:
Table1 (pid, created, data, link1)
Table2 (pid, created, data)
Table3 (link1, created, data)
The created field is a UNIX timestamp, and the data field is the same format in each table.
I want to get the data in all 3 tables such that Table1.pid = Table2.pid AND Table1.link1 = Table3.link1 AND created is less than and closest to a given timestamp value.
So, as an example, say I provide a date of May 7, 2011, 1:00pm. I'd want the data record from each table created most recently before this date-time.
Now I've managed to do this in a rather ugly single query involving sub-queries (using either INNER JOINs or UNIONs), but I'm wondering whether it can be done w/o sub-queries in a single query? Thanks for any suggestions.
Obviously I haven't actually run the query, but it seems like it would do what you need, if I'm understanding your question right:
t1.pid and t2.pid are the same, t1.link1 and t3.link1 are the same, none of the created values are after your timestamp, and you take the row combination closest to that date.
SELECT t1.data, t2.data, t3.data
FROM Table1 t1, Table2 t2, Table3 t3
WHERE t1.pid = t2.pid
  AND t1.link1 = t3.link1
  AND GREATEST(t1.created, t2.created, t3.created) < 'your-formatted-time'
ORDER BY GREATEST(t1.created, t2.created, t3.created) DESC
LIMIT 1;