We have a situation where duplicate entries have crept into our table with more than 60 million entries (duplicate here implies that all fields, except the AUTO_INCREMENT index field have the same value). We suspect that there are about 2 million duplicate entries in the table. We would like to delete these duplicate entries such that the earliest instances of the duplicate entries are retained.
Let me explain with an illustrative table:
CREATE TABLE people
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(40) NOT NULL DEFAULT '',
age INT NOT NULL DEFAULT 0,
phrase VARCHAR(40) NOT NULL DEFAULT '',
PRIMARY KEY (id)
);
INSERT INTO people(name, age, phrase) VALUES ('John Doe', 25, 'qwert'), ('William Smith', 19, 'yuiop'),
('Peter Jones', 19, 'yuiop'), ('Ronnie Arbuckle', 32, 'asdfg'), ('Ronnie Arbuckle', 32, 'asdfg'),
('Mary Evans', 18, 'hjklp'), ('Mary Evans', 18, 'hjklpd'), ('John Doe', 25, 'qwert');
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name | age | phrase |
+----+-----------------+-----+--------+
| 1 | John Doe | 25 | qwert |
| 2 | William Smith | 19 | yuiop |
| 3 | Peter Jones | 19 | yuiop |
| 4 | Ronnie Arbuckle | 32 | asdfg |
| 5 | Ronnie Arbuckle | 32 | asdfg |
| 6 | Mary Evans | 18 | hjklp |
| 7 | Mary Evans | 18 | hjklpd |
| 8 | John Doe | 25 | qwert |
+----+-----------------+-----+--------+
We would like to remove duplicate entries so that we get the following output:
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name | age | phrase |
+----+-----------------+-----+--------+
| 1 | John Doe | 25 | qwert |
| 2 | William Smith | 19 | yuiop |
| 3 | Peter Jones | 19 | yuiop |
| 4 | Ronnie Arbuckle | 32 | asdfg |
| 6 | Mary Evans | 18 | hjklp |
| 7 | Mary Evans | 18 | hjklpd |
+----+-----------------+-----+--------+
On smaller sized tables the following approach would work:
CREATE TABLE people_uniq LIKE people;
INSERT INTO people_uniq SELECT * FROM people GROUP BY name, age, phrase;
DROP TABLE people;
RENAME TABLE people_uniq TO people;
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name | age | phrase |
+----+-----------------+-----+--------+
| 1 | John Doe | 25 | qwert |
| 2 | William Smith | 19 | yuiop |
| 3 | Peter Jones | 19 | yuiop |
| 4 | Ronnie Arbuckle | 32 | asdfg |
| 6 | Mary Evans | 18 | hjklp |
| 7 | Mary Evans | 18 | hjklpd |
+----+-----------------+-----+--------+
Kindly suggest a solution that would scale to a table with tens of millions of entries and many more columns. We are using MySQL version 5.6.49.
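Why not delete the duplicates directly?
DELETE FROM people
WHERE id IN (
    -- The extra derived table (dup) is needed because MySQL rejects a subquery
    -- that reads directly from the table being deleted from (error 1093).
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM people
        GROUP BY name, age, phrase
        HAVING COUNT(*) > 1
    ) AS dup
);
Each run removes only the newest copy in every duplicated group, so if some rows occur more than twice, repeat it until no rows are affected. If it still takes too long, you can run it in batches.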
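The batching idea could look roughly like the sketch below. It is only an illustration: SQLite (through Python's sqlite3 module) stands in for MySQL, the helper name delete_duplicates_in_batches and the batch size are made up for the example, and the sample rows are the ones from the question. SQLite accepts a correlated subquery on the table being deleted from; in MySQL the same condition would probably have to be written as a self-join or wrapped in a derived table.

```python
import sqlite3

# SQLite stands in for MySQL here; the table and columns follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name TEXT NOT NULL, age INTEGER NOT NULL, phrase TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO people (name, age, phrase) VALUES (?, ?, ?)",
    [
        ("John Doe", 25, "qwert"),
        ("William Smith", 19, "yuiop"),
        ("Peter Jones", 19, "yuiop"),
        ("Ronnie Arbuckle", 32, "asdfg"),
        ("Ronnie Arbuckle", 32, "asdfg"),
        ("Mary Evans", 18, "hjklp"),
        ("Mary Evans", 18, "hjklpd"),
        ("John Doe", 25, "qwert"),
    ],
)


def delete_duplicates_in_batches(conn, batch_size):
    """Hypothetical helper: walk the id space in fixed-size ranges and delete
    every row that has an earlier row with the same (name, age, phrase).
    Returns the total number of rows deleted."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM people").fetchone()[0]
    deleted = 0
    for start in range(1, max_id + 1, batch_size):
        cur = conn.execute(
            """
            DELETE FROM people
            WHERE id BETWEEN ? AND ?
              AND EXISTS (
                  SELECT 1 FROM people AS earlier
                  WHERE earlier.name = people.name
                    AND earlier.age = people.age
                    AND earlier.phrase = people.phrase
                    AND earlier.id < people.id
              )
            """,
            (start, start + batch_size - 1),
        )
        conn.commit()  # commit per batch so each transaction stays small
        deleted += cur.rowcount
    return deleted


deleted = delete_duplicates_in_batches(conn, batch_size=3)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM people ORDER BY id")]
print(deleted, remaining_ids)
```

Because the earliest row of each group is never matched by the EXISTS condition, it survives regardless of the order in which the batches run. On a large table the comparison columns would presumably need an index for each batch to be fast.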
I read that the output of the subquery doesn't matter and only its existence matter. But, when I change the code in the subquery, why is my output changing?
These are the tables:
mysql> select * from boats;
+------+-----------+-------+
| bid | bname | color |
+------+-----------+-------+
| 101 | Interlake | blue |
| 102 | Interlake | red |
| 103 | Clipper | green |
| 104 | Marine | red |
+------+-----------+-------+
mysql> select * from sailors;
+------+---------+--------+------+
| sid | sname | rating | age |
+------+---------+--------+------+
| 22 | Dustin | 7 | 45 |
| 29 | Brutus | 1 | 33 |
| 31 | Lubber | 8 | 55.5 |
| 32 | Andy | 8 | 25.5 |
| 58 | Rusty | 10 | 35 |
| 64 | Horatio | 7 | 35 |
| 71 | Zorba | 10 | 16 |
| 74 | Horatio | 9 | 40 |
| 85 | Art | 3 | 25.5 |
| 95 | Bob | 3 | 63.5 |
+------+---------+--------+------+
10 rows in set (0.00 sec)
mysql> select * from reserves;
+------+------+------------+
| sid | bid | day |
+------+------+------------+
| 22 | 101 | 1998-10-10 |
| 22 | 102 | 1998-10-10 |
| 22 | 103 | 1998-10-08 |
| 22 | 104 | 1998-10-08 |
| 31 | 102 | 1998-11-10 |
| 31 | 103 | 1998-11-06 |
| 31 | 104 | 1998-11-12 |
| 64 | 101 | 1998-09-05 |
| 64 | 102 | 1998-09-08 |
| 74 | 103 | 1998-09-08 |
+------+------+------------+
select sname from sailors s where exists(select * from reserves r where r.bid=103);
+---------+
| sname |
+---------+
| Dustin |
| Brutus |
| Lubber |
| Andy |
| Rusty |
| Horatio |
| Zorba |
| Horatio |
| Art |
| Bob |
+---------+
10 rows in set (0.00 sec)
mysql> select sname from sailors s where exists(select * from reserves r where r.bid=103 and r.sid=s.sid);
+---------+
| sname |
+---------+
| Dustin |
| Lubber |
| Horatio |
+---------+
Also, I don't understand what r.sid=s.sid is doing here. All the sid values in reserves already come from the sailors table. Could someone please explain it?
EXISTS is a Boolean operator that reports whether the subquery passed to it returns ANY row. When you execute this:
EXISTS(SELECT * FROM reserves r WHERE r.bid=103)
It returns TRUE as soon as it finds the FIRST row in the Reserves table that satisfies bid = 103. The select list does not matter: MySQL ignores whatever you SELECT inside EXISTS, and only the WHERE clause makes a difference. You can even write it like this:
EXISTS(SELECT 1 FROM reserves r WHERE r.bid=103)
In the query above, nothing depends on values from the main query or on the Sailors table. If there is ANY row in the Reserves table with bid = 103, it returns TRUE for every row of the main query.
In the second subquery with EXISTS, the WHERE clause is different: it depends on a field of the main query, so the result can differ for each row:
EXISTS(SELECT * FROM reserves r WHERE r.bid=103 AND r.sid=s.sid)
In the above query, for each row in the Sailors table, MySQL uses that row's sid value to evaluate the WHERE condition of the subquery. EXISTS returns TRUE for a Sailors row if there is ANY row in the Reserves table with bid = 103 and sid = Sailors.sid, and FALSE for rows that have no such record. That is why the two queries give different results.
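The difference between the two forms can be reproduced with a small sketch. SQLite (via Python's sqlite3 module) is used here as a stand-in for MySQL, and only the columns needed for the comparison are created; the rows are copied from the question. The ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sailors (sid INTEGER, sname TEXT)")
conn.execute("CREATE TABLE reserves (sid INTEGER, bid INTEGER)")
conn.executemany(
    "INSERT INTO sailors VALUES (?, ?)",
    [(22, "Dustin"), (29, "Brutus"), (31, "Lubber"), (32, "Andy"), (58, "Rusty"),
     (64, "Horatio"), (71, "Zorba"), (74, "Horatio"), (85, "Art"), (95, "Bob")],
)
conn.executemany(
    "INSERT INTO reserves VALUES (?, ?)",
    [(22, 101), (22, 102), (22, 103), (22, 104), (31, 102),
     (31, 103), (31, 104), (64, 101), (64, 102), (74, 103)],
)

# Uncorrelated: the subquery never looks at the outer row, so it is either
# true for every sailor or false for every sailor.
uncorrelated = [row[0] for row in conn.execute(
    "SELECT sname FROM sailors s "
    "WHERE EXISTS (SELECT 1 FROM reserves r WHERE r.bid = 103) "
    "ORDER BY s.sid"
)]

# Correlated: r.sid = s.sid ties the subquery to the current sailor, so it is
# evaluated separately for each outer row.
correlated = [row[0] for row in conn.execute(
    "SELECT sname FROM sailors s "
    "WHERE EXISTS (SELECT 1 FROM reserves r WHERE r.bid = 103 AND r.sid = s.sid) "
    "ORDER BY s.sid"
)]

print(len(uncorrelated), correlated)
```

Only the sailors with sid 22, 31 and 74 have a reservation for boat 103, which is why the correlated form returns three names while the uncorrelated form returns all ten.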
I think I got that. EXISTS checks whether the subquery returns any row for the current row of the main query. In the first query I didn't link the main query and the subquery in any way.
So for every name in sailors, independently, the subquery returns a row, and I got all the names. In the second query I added s.sid=r.sid, which links the main query and the subquery: for each sailor, it checks whether there is a reservation with bid=103 whose sid equals that sailor's sid.
Please comment if I got that right.
I have the following sample data:
| key_id | name | name_id | data_id |
+--------+-------+---------+---------+
| 1 | jim | 23 | 098 |
| 2 | joe | 24 | 098 |
| 3 | john | 25 | 098 |
| 4 | jack | 26 | 098 |
| 5 | jim | 23 | 091 |
| 6 | jim | 23 | 090 |
I have tried this query:
INSERT INTO temp_table
SELECT
DISTINCT #key_id,
name,
name_id,
#data_id FROM table1,
I am trying to dedupe a table by all fields in a row.
My desired output:
| key_id | name | name_id | data_id |
+--------+-------+---------+---------+
| 1 | jim | 23 | 098 |
| 2 | joe | 24 | 098 |
| 3 | john | 25 | 098 |
| 4 | jack | 26 | 098 |
What I'm actually getting:
| key_id | name | name_id | data_id |
+--------+-------+---------+----------+
| 1 | jim | 23 | NULL |
| 2 | joe | 24 | NULL |
| 3 | john | 25 | NULL |
| 4 | jack | 26 | NULL |
I am able to dedupe the table, but I am setting the 'data_Id' value to NULL by attempting to override the field with '#'
Is there any way to select distinct on all fields while keeping the value for 'data_id'? I will take the highest (MAX) data_id if possible.
If you only want one row returned for a specific value (in this case, name), one option you have is to group by that value. This seems like a good approach because you also said you wanted the largest data_id for each name, so I would suggest grouping and using the MAX() aggregate function like this:
SELECT name, name_id, MAX(data_id) AS data_id
FROM myTable
GROUP BY name, name_id;
The only thing you should be aware of is the possibility that a name occurs multiple times under different name_ids. If that is possible in your table, you could group by the name_id too, which is what I did.
Since you stated you're not interested in the key_id but only the name, I just excluded it from the query altogether to get this:
| name | name_id | data_id |
+-------+---------+---------+
| jim | 23 | 098 |
| joe | 24 | 098 |
| john | 25 | 098 |
| jack | 26 | 098 |
Here is the SQL Fiddle example.
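For a quick local check, the grouping query can also be run against the sample rows, as in the sketch below. SQLite (via Python's sqlite3 module) stands in for MySQL. Two things here are assumptions rather than part of the answer: data_id is stored as text so that the leading zero survives, and MIN(key_id) is added to show one way of keeping a key_id per group in case it is needed after all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: data_id is TEXT so values such as '098' keep their leading zero.
conn.execute(
    "CREATE TABLE myTable (key_id INTEGER, name TEXT, name_id INTEGER, data_id TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        (1, "jim", 23, "098"),
        (2, "joe", 24, "098"),
        (3, "john", 25, "098"),
        (4, "jack", 26, "098"),
        (5, "jim", 23, "091"),
        (6, "jim", 23, "090"),
    ],
)

# One row per (name, name_id): the largest data_id, plus the smallest key_id
# (the MIN(key_id) column is an addition for illustration).
deduped = conn.execute(
    "SELECT MIN(key_id) AS key_id, name, name_id, MAX(data_id) AS data_id "
    "FROM myTable "
    "GROUP BY name, name_id "
    "ORDER BY name_id"
).fetchall()

for row in deduped:
    print(row)
```

Every selected column is either in the GROUP BY list or wrapped in an aggregate, so the result does not depend on which row the database happens to pick within a group.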
RENAME TABLE myTable TO Old_myTable,
myTable2 TO myTable;
INSERT INTO myTable
SELECT *
FROM Old_myTable
GROUP BY name, name_id;
This groups the rows by the values I want to dedupe on, keeps the table structure, and ignores the data_id column.
I have two tables, main_table and new_data, and I would like to update main_table with data from new_data. As you can see, the time column in main_table has several empty values; they should be filled in from the new_data table. The third table shows the desired result. What is the best solution for this?
main_table
---------------------
id | name | time
---------------------
1 | tom | 60
2 | daniel | 30
3 | monica | 42
4 | gabriela |
5 | rachel |
6 | michael | 15
7 | adriana |
---------------------
new_data
--------------------
id | name | time
--------------------
1 | gabriela | 22
2 | rachel | 15
3 | adriana | 17
--------------------
main_table after updating from new_data (the desired result)
---------------------
id | name | time
---------------------
1 | tom | 60
2 | daniel | 30
3 | monica | 42
4 | gabriela | 22
5 | rachel | 15
6 | michael | 15
7 | adriana | 17
---------------------
UPDATE main_table t2
JOIN new_data t1 ON t2.name = t1.name
SET t2.time = t1.time;
I need to write an SQL select statement that groups together values from one column into one cell.
e.g.
table name: Customer_Hobbies
+------------+------------+-----------+
| CustomerId | Age | Hobby |
+------------+------------+-----------+
| 123 | 17 | Golf |
| 123 | 17 | Football |
| 324 | 14 | Rugby |
| 627 | 28 | Football |
+------------+------------+-----------+
should return...
+------------+------------+----------------+
| CustomerId | Age | Hobbies |
+------------+------------+----------------+
| 123 | 17 | Golf,Football |
| 324 | 14 | Rugby |
| 627 | 28 | Football |
+------------+------------+----------------+
Is this possible?
N.B. I know the data's not laid out in a particularly sensible way, but I can't change that.
You want group_concat():
select customerId, age, group_concat(hobby) as hobbies
from Customer_Hobbies
group by customerId, age;
I'm trying to write an SQL statement that duplicates all rows WHERE employee = 16 (for example), but the new rows would have a different employee value.
Table before INSERT:
| employee | property_name | property_value |
|:--------:|:--------------|:---------------|
| 16 | Salary | 28,000 |
| 16 | Department | 12 |
| 17 | Salary | 38,000 |
| 17 | Department | 8 |
Desired outcome after INSERT:
| employee | property_name | property_value |
|:--------:|:--------------|:---------------|
| 16 | Salary | 28,000 |
| 16 | Department | 12 |
| 17 | Salary | 38,000 |
| 17 | Department | 8 |
| 18 | Salary | 28,000 |
| 18 | Department | 12 |
I've seen some threads that use variables. Could I set and reference a variable somehow that would replace values from an insert/select?
The answer to this thread looks like it would work. But I'd rather not create and drop tables like that.
insert into YourTable (employee,property_name, property_value)
select 18, property_name, property_value from YourTable where employee = 16