Removing duplicates from a MySQL table [the table is > 2 GB]

Problem - we have many duplicate rows in our tables, which makes our calculations inaccurate.
The solution I tried - I wrote a DELETE ... INNER JOIN query to delete the duplicates (based on my research this is the fastest method). I tested it on staging and it worked, then ran it on production, hoping it would take 1-2 days at most.
Here is the query I was using:
DELETE t1 FROM table t1
INNER JOIN
table t2
WHERE t1.id > t2.id
AND t1.col1 = t2.col1
AND t1.col2 = t2.col2
AND t1.col3 = t2.col3
AND t1.col4 = t2.col4
Problems with the solution -
I expected the query to run for a few hours, or 2-3 days at most, but when I ran it on the entire table it was still running after 4 days and I had to kill the process.
I then tried it on a smaller table, a segment of my original table, and again it took hours and hours. I cannot afford to have a query run on my table for weeks, as I am doing lots of calculations on this table and I don't want it to get locked.

Deleting a large number of rows from a table is very expensive. I would recommend creating a new table with the rows you want and then (maybe) repopulating the original table.
You can start with:
CREATE TABLE temp_t AS
SELECT t1.*
FROM t t1
WHERE t1.id = (SELECT MIN(t2.id)
               FROM t t2
               WHERE t2.col1 = t1.col1 AND
                     t2.col2 = t1.col2 AND
                     t2.col3 = t1.col3 AND
                     t2.col4 = t1.col4
              );
For this to work in a reasonable amount of time, you need an index on t(col1, col2, col3, col4)! The index is quite important (and might take some time to build).
Then, you can decide if you want to repopulate your original table. If you have validated that the above is correct, you can do:
truncate table t;
insert into t
select * from temp_t;
Of course, you should back up your table/database before doing something like this.
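As a rough illustration of the copy-the-keepers approach above, here is a minimal Python sketch. It assumes SQLite (through the standard sqlite3 module) as a stand-in for MySQL so that it can run anywhere; the sample rows and the index name idx_t_cols are made up for the example, and timings on a real 2 GB MySQL table will of course differ.

```python
import sqlite3


def build_deduplicated_copy(conn):
    """Create temp_t holding one row (the lowest id) per (col1..col4) group."""
    # The composite index lets the correlated subquery find each group quickly.
    conn.execute("CREATE INDEX idx_t_cols ON t (col1, col2, col3, col4)")
    conn.execute("""
        CREATE TABLE temp_t AS
        SELECT t1.*
        FROM t AS t1
        WHERE t1.id = (SELECT MIN(t2.id)
                       FROM t AS t2
                       WHERE t2.col1 = t1.col1
                         AND t2.col2 = t1.col2
                         AND t2.col3 = t1.col3
                         AND t2.col4 = t1.col4)
    """)
    conn.commit()


# Hypothetical sample data: ids 1, 2, 4 are duplicates, as are ids 3 and 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col1, col2, col3, col4)")
conn.executemany(
    "INSERT INTO t (col1, col2, col3, col4) VALUES (?, ?, ?, ?)",
    [(1, 2, 3, 4), (1, 2, 3, 4), (5, 6, 7, 8), (1, 2, 3, 4), (5, 6, 7, 8)],
)
build_deduplicated_copy(conn)

kept_ids = [row[0] for row in conn.execute("SELECT id FROM temp_t ORDER BY id")]
print(kept_ids)
```

One thing to check before relying on this: rows with NULL in any of the four columns never compare equal with `=`, so they would be left out of temp_t entirely. If the real table has NULLs in those columns, the comparison needs adjusting (MySQL offers the NULL-safe `<=>` operator for that).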

Related

MYSQL update from another table with multiple entries

I have seen a bunch of helpful answers about updating table values from a different table with multiple values based on a timestamp using a MAX() subquery.
e.g. Update another table based on latest record
I was wondering how this compares with doing an ALTER first and relying on the order in the table to simplify the UPDATE. Something like this:
ALTER TABLE `table_with_multiple_data` ORDER BY `timestamp` DESC;
UPDATE `table_with_single_data` as `t1`
LEFT JOIN `table_with_multiple_data` AS `t2`
ON `t1`.`id`=`t2`.`t1id`
SET `t1`.`value` = `t2`.`value`;
(Apologies for the pseudocode but I hope you get what I'm asking)
Both achieve the same result for me, but I don't really have a big enough data set to see any difference in speed.
Thanks!!
You would normally use a correlated subquery:
UPDATE table_with_single_data t1
SET t1.value = (select t2.value
from table_with_multiple_data t2
where t2.t1id = t1.id
order by t2.timestamp desc
limit 1
);
If your method happens to work, that is just happenstance. Even if MySQL respected the ordering of tables, such ordering would not be guaranteed to survive the join operation. Not to mention that there is no guarantee about which value is assigned when there are multiple matching rows.
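To see the correlated subquery pick the latest value regardless of insertion order, here is a small sketch. It assumes SQLite via Python's sqlite3 module as a stand-in for MySQL, and the sample rows are invented; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_with_single_data (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE table_with_multiple_data (t1id INTEGER, value TEXT, timestamp INTEGER);
    INSERT INTO table_with_single_data (id, value) VALUES (1, NULL), (2, NULL);
    -- Deliberately inserted out of timestamp order.
    INSERT INTO table_with_multiple_data (t1id, value, timestamp) VALUES
        (1, 'old', 100), (1, 'newest', 300), (1, 'middle', 200), (2, 'only', 50);
""")

# For each target row, take the value from the matching row with the largest timestamp.
conn.execute("""
    UPDATE table_with_single_data
    SET value = (SELECT t2.value
                 FROM table_with_multiple_data AS t2
                 WHERE t2.t1id = table_with_single_data.id
                 ORDER BY t2.timestamp DESC
                 LIMIT 1)
""")

latest = dict(conn.execute("SELECT id, value FROM table_with_single_data"))
print(latest)
```

Note that this form sets value to NULL for any row that has no match in table_with_multiple_data; adding a WHERE EXISTS (...) condition on the same subquery leaves such rows untouched.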

MySQL Query with inner join very slow

I have two tables containing 6M rows each. I'm trying to join the two using an inner join but the query ran for 2 days without finishing. The join is (note I've used count(*) just to enable me to run an explain, I'm actually using the join in a CTAS):
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
After a bit of investigation I've found the below query runs fine:
SELECT count(*)
FROM
(SELECT *
FROM table1) t1,
(SELECT *
FROM table2) t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
The only difference is that instead of the table, I use the sub-query SELECT * FROM table.
Running the explain plans shows that the latter query builds an index when it selects from table2, whereas the first query uses a join buffer (Block Nested Loop).
Surely MySQL is clever enough to work out that the two queries are practically identical and do the same with both queries? I don't see why an index should be needed, because a full scan is required for both tables anyway. These are temporary/transitory tables, so if I did put an index on, it would literally be just to perform this join.
Is there a way to fix this via MySQL configuration?
You NEED an index on at least ONE of the tables, for example:
create index Temp1 on Table2 ( colA, colB )
Your query goes from Table 1 and joins to Table 2, so even if all of Table 1 is scanned, the engine needs a way to quickly find the record(s) that match in Table 2. If NEITHER has an index, then think of it this way. For every record in Table1, scan through ALL records in Table 2 and grab all records that match on ColA, ColB. Now, go back to Table 1 for the SECOND record... and scan through ALL records in Table 2 again.
Being that you have 6M records, you could practically choke a cow (so-to-speak) on performance. With an index, even just on the SECOND table, the query can take the first record, immediately jump to the rows that match ColA, ColB, and as soon as those A/B records are done, go back to the first table for the next record.
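The difference in work between the two strategies can be sketched in plain Python. This is only an analogy under stated assumptions: the rows are made-up (col1, col2) tuples, and a dictionary stands in for the index (MySQL would typically use a B-tree, not a hash), but the counting shows why the unindexed join grows with the product of the table sizes.

```python
from collections import defaultdict


def nested_loop_join(left_rows, right_rows):
    """Compare every left row with every right row (no index available)."""
    comparisons = 0
    matches = []
    for left in left_rows:
        for right in right_rows:
            comparisons += 1
            if left == right:
                matches.append((left, right))
    return matches, comparisons


def indexed_join(left_rows, right_rows):
    """Build a lookup on the right rows once, then probe it once per left row."""
    index = defaultdict(list)
    for right in right_rows:
        index[right].append(right)
    probes = 0
    matches = []
    for left in left_rows:
        probes += 1
        for right in index.get(left, []):
            matches.append((left, right))
    return matches, probes


# Hypothetical data: 1,000 rows per table, joined on both columns.
table1 = [(i, i % 7) for i in range(1000)]
table2 = [(i, i % 7) for i in range(1000)]

slow_matches, comparisons = nested_loop_join(table1, table2)
fast_matches, probes = indexed_join(table1, table2)
print(comparisons, probes)
```

With 1,000 rows per side the unindexed version already does a million comparisons; at 6M rows per side the same pattern is on the order of 3.6 × 10^13.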
Now, for other efficiencies. If BOTH tables are indexed on their respective join columns, (Col1, Col2) and (ColA, ColB), then the engine can keep a whole block of entries for each common key range in its memory / cache and doesn't have to keep going back to the raw data pages repeatedly.
So, even though you think an index might not be practical here, it is still worthwhile for queries on large tables. Also, if you have multiple records in the first table with the same values for Col1, Col2 but different values in other columns, and similarly multiple records in the second table with the same ColA, ColB, you would get a Cartesian result. Consider the following scenario:
Table1
Col1  Col2  OtherColumn
X     Y     blah1
X     Y     blah2
X     Y     blah3
Table2
ColA  ColB  OtherColumn
X     Y     second blah1
X     Y     second blah2
X     Y     second blah3
A simple query like you have
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
would result in a count of 9. You have 6M records and a possible Cartesian result? Hopefully this clarifies some problems you may be encountering.
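The count of 9 can be reproduced with the scenario's own rows. This sketch assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL; the column name other_column is a stand-in for OtherColumn, and the second query is an added, hypothetical check for how many distinct join keys the first table really has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 TEXT, col2 TEXT, other_column TEXT);
    CREATE TABLE table2 (colA TEXT, colB TEXT, other_column TEXT);
    INSERT INTO table1 VALUES
        ('X', 'Y', 'blah1'), ('X', 'Y', 'blah2'), ('X', 'Y', 'blah3');
    INSERT INTO table2 VALUES
        ('X', 'Y', 'second blah1'), ('X', 'Y', 'second blah2'), ('X', 'Y', 'second blah3');
    CREATE INDEX Temp1 ON table2 (colA, colB);
""")

# Every table1 row matches every table2 row with the same key: 3 x 3 rows.
(joined_rows,) = conn.execute("""
    SELECT count(*)
    FROM table1 t1, table2 t2
    WHERE t1.col1 = t2.colA
      AND t1.col2 = t2.colB
""").fetchone()

# Only one distinct (col1, col2) key exists, so the join multiplies rows.
(distinct_keys,) = conn.execute("""
    SELECT count(*)
    FROM (SELECT DISTINCT col1, col2 FROM table1) AS k
""").fetchone()

print(joined_rows, distinct_keys)
```

Comparing the number of distinct keys with the table's row count on the real tables is a quick way to estimate how large the join result may get before running it.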

SQL UPDATE query taking too long

So I am very new to MySQL, and I am trying to run a query to update a column if a cell value is present in both tables, and the query is taking forever to run (It's been running for 10 minutes now and no result yet). One of my tables is about 250,000 rows, and the other is about 80,000, so I'm not sure why it is taking so long. The query I am using is:
USE the_db;
UPDATE table1
JOIN table2
ON table2.a = table1.b
SET table1.c = "Y";
I've changed the names of the tables and columns, but the query is exactly the same. I've looked at other answers on here and all of them take a very long time as well. Any help would be appreciated, thanks.
For this query:
UPDATE table1 JOIN
table2
ON table2.a = table1.b
SET table1.c = 'Y';
You want an index on table2(a):
create index idx_table2_a on table2(a);
Also, if there are multiple values of a that match each b, then you could also be generating a lot of intermediate rows, and that would have a big impact on performance.
If that is the case, then phrase the query as:
UPDATE table1
SET table1.c = 'Y'
WHERE EXISTS (SELECT 1 FROM table2 WHERE table2.a = table1.b);
And you need the same index.
The difference between the queries is that this one stops at the first matching row in table2.
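A small sketch of the EXISTS form, assuming SQLite through Python's sqlite3 module as a stand-in for MySQL and using invented sample rows. The value 1 appears three times in table2, yet the matching table1 row is updated only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (b INTEGER, c TEXT);
    CREATE TABLE table2 (a INTEGER);
    CREATE INDEX idx_table2_a ON table2 (a);
    INSERT INTO table1 (b, c) VALUES (1, NULL), (2, NULL), (3, NULL);
    -- a = 1 is duplicated on purpose; a = 2 is absent.
    INSERT INTO table2 (a) VALUES (1), (1), (1), (3);
""")

# Flag each table1 row that has at least one match in table2.
cursor = conn.execute("""
    UPDATE table1
    SET c = 'Y'
    WHERE EXISTS (SELECT 1 FROM table2 WHERE table2.a = table1.b)
""")
updated = cursor.rowcount

flags = conn.execute("SELECT b, c FROM table1 ORDER BY b").fetchall()
print(updated, flags)
```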

Mysql update query over 10 million records

I've got to update a field in a table (250 000 records) with the value of a field from another table (10 000 000 records) based on the email...
I've tried:
UPDATE table1 t1, table2 t2
SET t1.country = t2.country
WHERE t1.email = t2.email
But I got a "Query is being executed" forever.
What query should I use?
Thanks
This would be a good opportunity to employ a JOIN.
UPDATE table1 as t1
JOIN table2 t2 ON t1.email = t2.email
SET t1.country = t2.country
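It will still take a while for your query to process. Note that MySQL treats the comma syntax and an explicit JOIN as the same inner join, so the rewrite mainly improves readability; a substantial speedup requires an index on the join column.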
I don't see an obvious error in your query (and the database would yield an error in that case). So the problem we're looking at is to speed up the execution of your update. Is one of your two email fields indexed? If not, could you add an index and try again? E.g. ALTER TABLE table2 ADD INDEX(email)
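One way to confirm that an index is actually picked up is to look at the query plan before and after creating it. In MySQL that means prefixing the statement with EXPLAIN; the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a runnable stand-in, so the plan wording will not match MySQL's output. The index name idx_table2_email, the helper plan_for, and the sample address are assumptions made for the example.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (email TEXT, country TEXT)")

# The per-row lookup that the UPDATE has to perform for every table1 row.
lookup = "SELECT country FROM table2 WHERE email = 'a@example.com'"

plan_before = plan_for(conn, lookup)  # full scan of table2
conn.execute("CREATE INDEX idx_table2_email ON table2 (email)")
plan_after = plan_for(conn, lookup)   # search using the new index

print(plan_before)
print(plan_after)
```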

Single query to get data from 3 separate tables based on provided timestamp

I'm working with three tables which can be summarized as follows:
Table1 (pid, created, data, link1)
Table2 (pid, created, data)
Table3 (link1, created, data)
The created field is a UNIX timestamp, and the data field is the same format in each table.
I want to get the data in all 3 tables such that Table1.pid = Table2.pid AND Table1.link1 = Table3.link1 AND created is less than and closest to a given timestamp value.
So, as an example, say I provide a date of May 7, 2011, 1:00pm. I'd want the data record from each table created most recently before this date-time.
Now I've managed to do this in a rather ugly single query involving sub-queries (using either INNER JOINs or UNIONs), but I'm wondering whether it can be done w/o sub-queries in a single query? Thanks for any suggestions.
Obviously I haven't really run the query, but it seems like it would do whatever you need if I'm understanding your question right.
t1.pid and t2.pid are the same, t1.link1 and t3.link1 are the same, and none of the .created are above your time, and you want the row closest to the date.
SELECT t1.data, t2.data, t3.data
FROM Table1 t1, Table2 t2, Table3 t3
WHERE t1.pid = t2.pid
  AND t1.link1 = t3.link1
  AND GREATEST(t1.created, t2.created, t3.created) < 'your-formatted-time'
ORDER BY GREATEST(t1.created, t2.created, t3.created) DESC;