MySQL, how to merge table duplicates entries [duplicate] - mysql

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How can I remove duplicate rows?
Remove duplicates using only a MySQL query?
I have a large table with ~14M entries. The table type is MyISAM ans not InnoDB.
Unfortunately, I have some duplicate entries in this table that I found with the following request :
SELECT device_serial, temp, tstamp, COUNT(*) c FROM up_logs GROUP BY device_serial, temp, tstamp HAVING c > 1
To avoid these duplicates in the future, I want to convert my current index to a unique constraint using SQL request :
ALTER TABLE up_logs DROP INDEX UK_UP_LOGS_TSTAMP_DEVICE_SERIAL,
ALTER TABLE up_logs ADD INDEX UK_UP_LOGS_TSTAMP_DEVICE_SERIAL ( `tstamp` , `device_serial` )
But before that, I need to clean up my duplicates!
My question is : How can I keep only one entry of my duplicated entries? Keep in mind that my table contain 14M entries, so I would like avoid loops if it is possible.
Any comments are welcome!

Creating a new unique key on the over columns you need to have as uniques will automatically clean the table of any duplicates.
ALTER IGNORE TABLE `table_name`
ADD UNIQUE KEY `key_name`(`column_1`,`column_2`);
The IGNORE part does not allow the script to terminate after the first error occurs. And the default behavior is to delete the duplicates.

Since MySQL allows Subqueries in update/delete statements, but not if they refer to the table you want to update, I´d create a copy of the original table first. Then:
DELETE FROM original_table
WHERE id NOT IN(
SELECT id FROM copy_table
GROUP BY column1, column2, ...
);
But I could imagine that copying a table with 14M entries takes some time... selecting the items to keep when copying might make it faster:
INSERT INTO copy_table
SELECT * FROM original_table
GROUP BY column1, column2, ...;
and then
DELETE FROM original_table
WHERE id IN(
SELECT id FROM copy_table
);
It was some time since I used MySQL and SQL in general last time, so I´m quite sure that there is something with better performance - but this should work ;)

This is how you can delete duplicate rows... I'll write you my example and you'll need to apply to your code. I have Actors table with ID and I want to delete the rows with repeated first_name
mysql> select actor_id, first_name from actor_2;
+----------+-------------+
| actor_id | first_name |
+----------+-------------+
| 1 | PENELOPE |
| 2 | NICK |
| 3 | ED |
....
| 199 | JULIA |
| 200 | THORA |
+----------+-------------+
200 rows in set (0.00 sec)
-Now I use a Variable called #a to get the ID if the next row have the same first_name(repeated, null if it's not).
mysql> select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name;
+---------------+----------------+
| first_names | #a:=first_name |
+---------------+----------------+
| NULL | ADAM |
| 71 | ADAM |
| NULL | AL |
| NULL | ALAN |
| NULL | ALBERT |
| 125 | ALBERT |
| NULL | ALEC |
| NULL | ANGELA |
| 144 | ANGELA |
...
| NULL | WILL |
| NULL | WILLIAM |
| NULL | WOODY |
| 28 | WOODY |
| NULL | ZERO |
+---------------+----------------+
200 rows in set (0.00 sec)
-Now we can get only duplicates ID:
mysql> select first_names from (select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name) as t1;
+-------------+
| first_names |
+-------------+
| NULL |
| 71 |
| NULL |
...
| 28 |
| NULL |
+-------------+
200 rows in set (0.00 sec)
-the Final Step, Lets DELETE!
mysql> delete from actor_2 where actor_id in (select first_names from (select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name) as t1);
Query OK, 72 rows affected (0.01 sec)
-Now lets check our table:
mysql> select count(*) from actor_2 group by first_name;
+----------+
| count(*) |
+----------+
| 1 |
| 1 |
| 1 |
...
| 1 |
+----------+
128 rows in set (0.00 sec)
it works, if you have any question write me back

Related

MySQL database relationship without an ID

Hi StackOverflow community,
I have these two tables:
tbl_users
ID_user (PRIMARY KEY)
Username (UNIQUE)
Password
...
tbl_posts
ID_post (PRIMARY KEY)
Owner (UNIQUE)
Description
...
Why always everybody make database relationships with foreign keys? What about if I want to relate Username with Owner instead of doing ID_user with ID_user in both tables?
Username is UNIQUE and the Owner is the username of the creator of the post.
Can it be done like that? There is something to correct or make better? Maybe I have a misconception.
I would appreciate detailed and understandable answers.
Thank you in advance.
The reason is primarily for data integrity. The argument concerning performance is a little misleading. While neither exhaustive, nor definitive, I hope this little example will shed some light on that fact:
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(i INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,s CHAR(12) NOT NULL UNIQUE
);
STEP1:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
);
STEP2:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
)
FROM my_table;
[REPEAT STEP 2 SEVERAL TIMES]
SELECT COUNT(*) FROM my_table;
+----------+
| COUNT(*) |
+----------+
| 16384 |
+----------+
1 row in set (0.01 sec)
SELECT * FROM my_table ORDER BY i LIMIT 12;;
+----+------------+
| i | s |
+----+------------+
| 1 | kkxeehxsvy |
| 2 | iuyhrk{vaq |
| 3 | ngpedelooc |
| 4 | irkbyqgkhc |
| 6 | yqkcifcxdz |
| 7 | sgezlgvjjq |
| 8 | blavbvxbnl |
| 9 | wdbtqvgvgt |
| 13 | pakzpbnhxr |
| 14 | vpoy{gdwyd |
| 15 | ezlhz{drwg |
| 16 | ncwcwbpudh |
+----+------------+
SELECT * FROM my_table x JOIN my_table y ON y.i < x.i ORDER BY x.i,y.i LIMIT 1;
+---+------------+---+------------+
| i | s | i | s |
+---+------------+---+------------+
| 2 | iuyhrk{vaq | 1 | kkxeehxsvy |
+---+------------+---+------------+
1 row in set (1 min 22.60 sec)
SELECT * FROM my_table x JOIN my_table y ON y.s < x.s ORDER BY x.s,y.s LIMIT 1;
+-------+------------+------+------------+
| i | s | i | s |
+-------+------------+------+------------+
| 21452 | aabetdlvum | 6072 | aabdnegtav |
+-------+------------+------+------------+
1 row in set (1 min 13.59 sec)
So, we have two queries doing essentially the same thing (a comparison of 270 million values). The first joins the table to itself on an integer value. The second joins the table to itself on a string value. Both columns are indexed. As you can see, in this example, the string join actually performs better than the integer join - even though the hit on the CPU may actually be greater!

How to delete duplicate records but except one for Int?

Under my Table records for MySQL, i have these:
SELECT * FROM dbo.online;
+-------+
| Id |
+-------+
| 10128 |
| 10240 |
| 6576 |
| 32 |
| 10240 |
| 10128 |
| 10128 |
| 12352 |
+-------+
8 rows in set (0.00 sec)
How to make it to:
SELECT * FROM dbo.online;
+-------+
| Id |
+-------+
| 10128 |
| 10240 |
| 6576 |
| 32 |
| 12352 |
+-------+
8 rows in set (0.00 sec)
In other words, I want to do is, using DELETE command instead of SELECT * FROM dbo.online GROUP BY id.. So, any idea how?
Copy data to back up table with distinct, that steop eliminates duplicates
create table backUp_online as
SELECT distinct *
FROM online;
Clear source table
truncate table online
Copy data from back up to source table without duplicates
insert into online
select *
from backUp_online
There is a trick in MySQL:
ALTER IGNORE TABLE `dbo`.`online` ADD UNIQUE KEY `ukId`(`Id`)
This can also be useful.
Simplest query to do the same.
DELETE n1 FROM online n1, online n2 WHERE n1.id < n2.id AND n1.name = n2.name

Find value within a range in database table

I need the SQL equivalent of this.
I have a table like this
ID MN MX
-- -- --
A 0 3
B 4 6
C 7 9
Given a number, say 5, I want to find the ID of the row where MN and MX contain that number, in this case that would be B.
Obviously,
SELECT ID FROM T WHERE ? BETWEEN MN AND MX
would do, but I have 9 million rows and I want this to run as fast as possible. In particular, I know that there can be only one matching row, I now that the MN-MX ranges cover the space completely, and so on. With all these constraints on the possible answers, there should be some optimizations I can make. Shouldn't there be?
All I have so far is indexing MN and using the following
SELECT ID FROM T WHERE ? BETWEEN MN AND MX ORDER BY MN LIMIT 1
but that is weak.
If you have an index spanning MN and MX it should be pretty fast, even with 9M rows.
alter table T add index mn_mx (mn, mx);
Edit
I just tried a test w/ a 1M row table
mysql> select count(*) from T;
+----------+
| count(*) |
+----------+
| 1000001 |
+----------+
1 row in set (0.17 sec)
mysql> show create table T\G
*************************** 1. row ***************************
Table: T
Create Table: CREATE TABLE `T` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`mn` int(10) DEFAULT NULL,
`mx` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `mn_mx` (`mn`,`mx`)
) ENGINE=InnoDB AUTO_INCREMENT=1048561 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> select * from T order by rand() limit 1;
+--------+-----------+-----------+
| id | mn | mx |
+--------+-----------+-----------+
| 112940 | 948004986 | 948004989 |
+--------+-----------+-----------+
1 row in set (0.65 sec)
mysql> explain select id from T where 948004987 between mn and mx;
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 239000 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)
mysql> select id from T where 948004987 between mn and mx;
+--------+
| id |
+--------+
| 112938 |
| 112939 |
| 112940 |
| 112941 |
+--------+
4 rows in set (0.03 sec)
In my example I just had an incrementing range of mn values and then set mx to +3 that so that's why I got more than 1, but should apply the same to you.
Edit 2
Reworking your query will definitely be better
mysql> explain select id from T where mn<=947892055 and mx>=947892055;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 9 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
It's worth noting even though the first explain reported many more rows to be scanned I had enough innodb buffer pool set to keep the entire thing in RAM after creating it; so it was still pretty fast.
If there are no gaps in your set, a simple gte comparison will work:
SELECT ID FROM T WHERE ? >= MN ORDER BY MN ASC LIMIT 1

Getting max value from many tables

There are two ways, that I can think of, to obtain similar results from multiple tables. One is UNION and the other is JOIN. The similar questions on SO have all been answered with a UNION. Here's the coder I just found:
SELECT max(up.id) AS up, max(sc.id) AS sc, max(cl.id) AS cl
FROM updates up, chat_staff sc, change_log cl
explain:
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
My question is -- Is this better than the following?
SELECT "up.id" AS K, max(id) AS V FROM updates
UNION
SELECT "sc.id" AS K, max(id) AS V FROM chat_staff
UNION
SELECT "cl.id" AS K, max(id) AS V FROM change_log
explain:
+----+--------------+--------------+------+---------------+------+---------+------+-------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+------+---------------+------+---------+------+------+------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| 2 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+------+---------------+------+---------+------+------+------------------------------+
Both of those methods are just fine. In fact, I have another method:
SELECT
IFNULL(maxidup,0) max_id_up,
IFNULL(maxscup,0) max_sc_up,
IFNULL(maxclup,0) max_cl_up
FROM
(SELECT max(id) maxidup FROM updates) up,
(SELECT max(id) maxidsc FROM chat_staff) sc,
(SELECT max(id) maxidcl FROM change_log) cl
;
This method presents the three values side by side like your first example. It also shows 0 in the event one of the tables are empty.
mysql> DROP DATABASE IF EXISTS junk;
Query OK, 3 rows affected (0.11 sec)
mysql> CREATE DATABASE junk;
Query OK, 1 row affected (0.00 sec)
mysql> use junk
Database changed
mysql> CREATE TABLE updates (id int not null auto_increment primary key,x int);
Query OK, 0 rows affected (0.07 sec)
mysql> CREATE TABLE chat_staff LIKE updates;
Query OK, 0 rows affected (0.07 sec)
mysql> CREATE TABLE change_log LIKE updates;
Query OK, 0 rows affected (0.06 sec)
mysql> INSERT INTO updates (x) VALUES (37),(84),(12);
Query OK, 3 rows affected (0.06 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> INSERT INTO change_log (x) VALUES (37),(84),(12),(14),(35);
Query OK, 5 rows affected (0.09 sec)
Records: 5 Duplicates: 0 Warnings: 0
mysql> SELECT
-> IFNULL(maxidup,0) max_id_up,
-> IFNULL(maxidsc,0) max_sc_up,
-> IFNULL(maxidcl,0) max_cl_up
-> FROM
-> (SELECT max(id) maxidup FROM updates) up,
-> (SELECT max(id) maxidsc FROM chat_staff) sc,
-> (SELECT max(id) maxidcl FROM change_log) cl
-> ;
+-----------+-----------+-----------+
| max_id_up | max_sc_up | max_cl_up |
+-----------+-----------+-----------+
| 3 | 0 | 5 |
+-----------+-----------+-----------+
1 row in set (0.00 sec)
mysql> explain SELECT IFNULL(maxidup,0) max_id_up, IFNULL(maxidsc,0) max_sc_up, IFNULL(maxidcl,0) max_cl_up FROM (SELECT max(id) maxidup FROM updates) up, (SELECT max(id) maxidsc FROM chat_staff) sc, (SELECT max(id) maxidcl FROM change_log) cl;
+----+-------------+------------+--------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | NULL | NULL | NULL | NULL | 1 | |
| 1 | PRIMARY | <derived3> | system | NULL | NULL | NULL | NULL | 1 | |
| 1 | PRIMARY | <derived4> | system | NULL | NULL | NULL | NULL | 1 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No matching min/max row |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+---------------+------+---------+------+------+------------------------------+
6 rows in set (0.02 sec)
In my EXPLAIN plan, it has Select tables optimized away just like yours. Why ?
Since id is indexed in all the tables, the index is used to retrieve the max(id) rather than the table. Thus, Select tables optimized away is the correct response.
Six of one, half dozen of the other. How you present data from there is strictly your personal preference.
UPDATE 2011-10-20 15:32 EDT
You commented : Do you know how table locking would compromise this? Let's say one of the tables in question is locked. Would this query lock the other two and keep 'em locked until the first one was freed up?
This would depend on the storage engine. If all tables in question are MyISAM, definite possibility since MyISAM performs a full table lock on INSERT, UPDATE, DELETE. If the three tables are InnoDB, you have the benefit of MVCC to provide transaction isolation. This would allow everyone their view of the data in a point-in-time. Aside from DDL and an explcit LOCK TABLES against InnoDB, your query should not be blocked.
Actually, while they're similar, there's a subtle difference. The first gives you a one-row, three-column table (with the values going "across") and the second gives you a three-row, two-column table (with the values going "down").
Provided you're happy processing or viewing that data in either form, it's probably going to come down to performance.
In my experience (and this is nothing to do specifically with MySQL), the latter query will probably be better. That's because the DBMS' I work with are able to run queries like that in parallel for efficiency, combining them at completion of all. The fact that they're on different tables means that lock contention between them will be zero.
It may be that the query analysis engine of a DBMS could do a similar optimisation for the first query but it would require a lot more intelligence than I've seen from most of them.
One quick point, if you use union all instead of just union, you tell the database not to remove duplicate rows. You won't get any duplicates in this case due to the K column being different for all three sub-queries.
But, as with all optimisations, measure, don't guess! Certainly don't take as gospel the rants of random internet roamers (yes, even me).
Put together various candidate tables with the properties you're likely to have in production, and compare the performance of each.

Why is this MySQL JOIN statement returning more results?

I have two (characteristic_list and measure_list) tables that are related to each other by a column called 'm_id'. I want to retrieve records using filters (columns from characteristic_list) within a date range (columns from measure_list). When I gave the following SQL using INNER JOIN, it takes a while to retrieve the record. What am I doing wrong?
mysql> explain select c.power_set_point, m.value, m.uut_id, m.m_id, m.measurement_status, m.step_name from measure_list as m INNER JOIN characteristic_lis
t as c ON (m.m_id=c.m_id) WHERE (m.sequence_end_time BETWEEN '2010-06-18' AND '2010-06-20');
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | c | ALL | NULL | NULL | NULL | NULL | 82952 | |
| 1 | SIMPLE | m | ALL | NULL | NULL | NULL | NULL | 85321 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
2 rows in set (0.00 sec)
mysql> select count(*) from measure_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.18 sec)
mysql> select count(*) from characteristic_list;
+----------+
| count(*) |
+----------+
| 83635 |
+----------+
1 row in set (0.10 sec)
The reason this query takes a while to execute is because it has to scan the entire table. You never want to see "ALL" as the type of the query. To speed things up, you need to make smart decisions about what to index.
See the following documents at the MySQL site:
http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html
http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
As an add-on to the previous answer by Dan, you should consider indexing the join columns and the where columns. In this case, that means the m_id cols in both tables and the sequence_end_time in the measure_list table. They are small enough that you could add an index, run explain plan and time it, then change the index and compare. Should be relatively quick to solve.