Simply delete duplicate content in a sql table - mysql

I would like to know whether there is an easy way to remove duplicate rows from a SQL table, rather than fetching the whole table and deleting the rows that appear twice.
Thank you in advance.
This is my structure:
CREATE TABLE IF NOT EXISTS `mups` (
`idgroupe` varchar(15) NOT NULL,
`fan` bigint(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

If you are using SQL Server, check this: SQL SERVER – 2005 – 2008 – Delete Duplicate Rows
Sample Code using CTE:
/* Delete Duplicate records */
WITH CTE (COl1,Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

Add a calculated column that takes the checksum of the entire row. Search for any duplicate checksums, rank and remove the duplicates.

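You can do something like this (someField is a placeholder for whichever column or columns define a duplicate in your table):
DELETE FROM yourTable WHERE tableID IN
(SELECT tableID FROM
(SELECT DISTINCT clone.tableID
FROM yourTable origine,
yourTable clone
WHERE clone.someField = origine.someField -- column(s) that define a duplicate
AND clone.tableID > origine.tableID -- keep the row with the lowest tableID
) AS dup)
In the inner WHERE, you can compare either the indexed columns or every other field, depending on how you identify your duplicates. Comparing tableID alone would match every row with itself and delete the whole table, hence the strict > on tableID. The extra derived table (dup) is there because MySQL does not let a DELETE select directly from its own target table.
Note that this solution has the advantage of letting you choose what counts as a duplicate (if the PK changes, for example).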

You can find the duplicates by joining the table to itself, grouping by the fields you are checking for duplicates, and adding a HAVING clause where the count is greater than one.
Let's say your table name is customers, and you're looking for duplicate name fields.
select cust_out.name, count(cust_count.name)
from customers cust_out
inner join customers cust_count on cust_out.name = cust_count.name
group by cust_out.name
having count(cust_count.name) > 1
If you use this in a delete statement you would be deleting all the duplicate records, when you probably intend to keep one of the records.
So, to select only the records to delete (every row that has a lower-id row with the same name):
select cust_dup.id
from customers cust
inner join customers cust_dup on cust.name = cust_dup.name and cust_dup.id > cust.id
group by cust_dup.id
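The select above can be plugged into a delete. Here is a minimal sketch of that step, run against SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows and the function name delete_duplicate_names are made up for illustration. On MySQL itself the subquery would probably need to be wrapped in a derived table, since MySQL refuses to select from the table it is deleting from.

```python
import sqlite3


def delete_duplicate_names(conn):
    """Delete every customers row that shares its name with a lower-id row."""
    cur = conn.execute(
        """
        DELETE FROM customers
        WHERE id IN (
            SELECT cust_dup.id
            FROM customers cust
            INNER JOIN customers cust_dup
                ON cust.name = cust_dup.name AND cust_dup.id > cust.id
        )
        """
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


# Hypothetical sample data: "Ann" appears three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Ann"), (4, "Ann"), (5, "Cid")],
)

deleted = delete_duplicate_names(conn)
remaining = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(deleted, remaining)
```

The row with the lowest id in each group of equal names is the one that survives, because only rows with a strictly higher id are selected for deletion.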

Related

Delete based on a distinct select

I have a mysql query that returns quite a few distinct, yet correct, results
select distinct page_id, display_id
from display_to_page;
But now I'm trying to delete everything that isn't in that result set (deleting duplicates) but I'm a bit stuck.
I know I can do something like:
delete from display_to_page dp
(select distinct page_id, display_id
from display_to_page) dp2 ...
But I'm unsure how to complete the syntax there.
How can I structure a delete that will remove anything not in that result set?
If you only have two columns in the table, then the easiest way is probably truncate/reload:
create temporary table temp_pd as
select distinct page_id, display_id
from display_to_page;
truncate table display_to_page;
insert into display_to_page (page_id, display_id)
select page_id, display_id
from temp_pd;
Be sure to copy the table before trying this on your data!
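For anyone who wants to try the copy / empty / reload idea on throwaway data first, here is a rough sketch using SQLite through Python's sqlite3 module as a stand-in for MySQL. SQLite has no TRUNCATE, so the sketch uses DELETE FROM instead. The sample rows and the function name reload_distinct are assumptions made for the example.

```python
import sqlite3


def reload_distinct(conn):
    """Rebuild display_to_page so each (page_id, display_id) pair appears once."""
    # Step 1: park the distinct pairs in a temporary table.
    conn.execute(
        "CREATE TEMPORARY TABLE temp_pd AS "
        "SELECT DISTINCT page_id, display_id FROM display_to_page"
    )
    # Step 2: empty the original table (SQLite has no TRUNCATE statement).
    conn.execute("DELETE FROM display_to_page")
    # Step 3: copy the distinct pairs back.
    conn.execute(
        "INSERT INTO display_to_page (page_id, display_id) "
        "SELECT page_id, display_id FROM temp_pd"
    )
    conn.execute("DROP TABLE temp_pd")
    conn.commit()


# Hypothetical sample data: the pair (1, 10) is stored three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE display_to_page (page_id INTEGER, display_id INTEGER)")
conn.executemany(
    "INSERT INTO display_to_page (page_id, display_id) VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (1, 10), (2, 21)],
)

reload_distinct(conn)
pairs = conn.execute(
    "SELECT page_id, display_id FROM display_to_page ORDER BY page_id, display_id"
).fetchall()
print(pairs)
```

Keep in mind that in MySQL a TRUNCATE cannot be rolled back, which is one more reason to copy the table before running this on real data.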

Subquery returns more rows than straight same query in MySQL

I want to remove duplicates based on the combination of listings.product_id and listings.channel_listing_id
This simple query returns 400.000 rows (the id's of the rows I want to keep):
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
While this variation returns 1.600.000 rows, which are all records on the table, not only is_verified = 0:
SELECT *
FROM (
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
) AS keepem
I'd expect them to return the same number of rows.
What's the reason for this? How can I avoid it (in order to use the subselect in the where condition of the DELETE statement)?
EDIT: I found that doing a SELECT DISTINCT in the outer SELECT "fixes" it (it returns 400.000 records as it should). I'm still not sure if I should trust this subquery, for there is no DISTINCT in the DELETE statement.
EDIT 2: Seems to be just a bug in the way phpMyAdmin reports the total count of the rows.
Your query as it stands is ambiguous. Suppose you have two listings with the same product_id and channel_id. Then what id is supposed to be returned? The first, the second? Or both, ignoring the GROUP request?
What if there is more than one id with different product and channel ids?
Try removing the ambiguity by selecting MAX(id) AS id and adding DISTINCT.
Are there any foreign keys to worry about? If not, you could pour the original table into a copy, empty the original and copy back in it the non-duplicates only. Messier, but you only do SELECTs or DELETEs guaranteed to succeed, and you also get to keep a backup.
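The MAX(id) suggestion can be sketched as follows, using SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows are invented, and restricting the delete itself to is_verified = 0 is an assumption about what the question intends. On MySQL the keep-set subquery would probably have to be wrapped in one more derived table before it can be used in a DELETE on the same table.

```python
import sqlite3

# One unambiguous id per (product_id, channel_listing_id) group: the highest one.
KEEP_IDS_SQL = """
    SELECT MAX(id) AS id
    FROM listings
    WHERE is_verified = 0
    GROUP BY product_id, channel_listing_id
"""


def delete_unverified_duplicates(conn):
    """Delete unverified listings that are not the chosen row of their group."""
    cur = conn.execute(
        f"DELETE FROM listings WHERE is_verified = 0 AND id NOT IN ({KEEP_IDS_SQL})"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, "
    "channel_listing_id TEXT, is_verified INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        (1, 100, "A", 0),
        (2, 100, "A", 0),  # duplicate of row 1
        (3, 100, "B", 0),
        (4, 100, "A", 1),  # verified, never touched
        (5, 200, "A", 0),
        (6, 200, "A", 0),  # duplicate of row 5
    ],
)

keep_ids = [row[0] for row in conn.execute(KEEP_IDS_SQL + " ORDER BY id")]
deleted = delete_unverified_duplicates(conn)
left = [row[0] for row in conn.execute("SELECT id FROM listings ORDER BY id")]
print(keep_ids, deleted, left)
```

Because MAX(id) is an aggregate, each group yields exactly one id, so the row count of the subquery no longer depends on how the server picks a value for a non-grouped column.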
Assign aliases in order to avoid field reference ambiguity:
SELECT
keepem.*
FROM
(
SELECT
innerStat.id
FROM
`listings` AS innerStat
WHERE
innerStat.is_verified = 0
GROUP BY
innerStat.product_id,
innerStat.channel_listing_id
) AS keepem

sql order by not working with group by only

I have a stock_activity table with multiple records attached to a single item_id. Note that item_id acts as a foreign key in the stock_activity table, so I am effectively tracking each item's movement (in, out) in the inventory. Now I want to retrieve the last activity record stored in the table for each item. I have written a query which is supposed to return the last record from the table, but it returns the first record.
Columns are :
activity_id pk
item_id fk
balance int(11)
Here is my query:
SELECT DISTINCT(item_id),balance
FROM `stock_activity`
GROUP BY (item_id)
ORDER BY(activity_id) DESC
Remember that a column which does not belong to the grouping key cannot be referenced reliably without some sort of aggregation, so the statement as written cannot return what you want.
So remember this little formula to get around the problem:
SELECT * FROM
(
SELECT * FROM `table`
ORDER BY AnotherColumn
) t1
GROUP BY SomeColumn
;
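Modify your query like this and it should hopefully work. Note that this relies on MySQL's non-standard GROUP BY handling, which does not guarantee which row of each group is returned: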
SELECT * FROM(
SELECT DISTINCT(item_id),balance
FROM `stock_activity`
ORDER BY(activity_id) DESC
) t1
GROUP BY (item_id)
This is a common problem. You want the groupwise maximum (or minimum) of a column. Fortunately, such an example exists right in the tutorial section of the manual.
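One way to write the groupwise maximum for this table is to join stock_activity to a subquery that finds the highest activity_id per item. The sketch below runs it on SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows are invented, and it assumes that a higher activity_id means a later activity.

```python
import sqlite3

# For each item_id, pick the row whose activity_id is the highest in its group.
LATEST_BALANCE_SQL = """
    SELECT sa.item_id, sa.balance
    FROM stock_activity sa
    INNER JOIN (
        SELECT item_id, MAX(activity_id) AS last_activity_id
        FROM stock_activity
        GROUP BY item_id
    ) latest
        ON latest.item_id = sa.item_id
       AND latest.last_activity_id = sa.activity_id
    ORDER BY sa.item_id
"""

# Hypothetical sample data: item 7 has three activities, item 8 has two.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_activity ("
    "activity_id INTEGER PRIMARY KEY, item_id INTEGER, balance INTEGER)"
)
conn.executemany(
    "INSERT INTO stock_activity VALUES (?, ?, ?)",
    [(1, 7, 50), (2, 7, 40), (3, 8, 5), (4, 7, 65), (5, 8, 0)],
)

latest_balances = conn.execute(LATEST_BALANCE_SQL).fetchall()
print(latest_balances)
```

Every selected column is either aggregated or comes from the joined row, so the result does not depend on how the server orders rows inside a group.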

INSERT ... SELECT Syntax issue when value(s) from two or more different query

I have a table named shop_balance, which has 3 columns (shop_balance_id (INT, PK), shop_balance (DOUBLE), balance_date (DATE)).
For the shop_balance (DOUBLE) column I use two subqueries:
1. Get the shop_balance value of the last row in the shop_balance table.
2. Get the purchase amount after a purchase of product(s).
Finally, I subtract them to get the current shop balance.
My query is here
INSERT INTO shop_balance
SELECT null,
(
(SELECT shop_balance FROM shop_balance
WHERE
shop_balance_id=(SELECT MAX(shop_balance_id) FROM shop_balance)
)
-
(
SELECT
SUM(pr_pur_cost_price*quantity) AS net FROM product_purchase_item AS i
LEFT JOIN
product_purchases AS p
ON
p.product_purchase_item_id=i.product_purchase_item_id
WHERE
p.insert_operation=$id
GROUP by
p.insert_operation
)
),curdate();
Clearly the two subqueries have different conditions and no direct relation to each other. The INSERT query above works well. But is it a good idea to use several subqueries, rather than the INSERT ... SELECT syntax, to insert one value? If not, how can I convert this to the INSERT ... SELECT syntax?
You have to do the calculation the way you do the calculation. I am guessing that a trigger might better meet your needs, but doing the logic in an insert is fine.
The following slightly simplifies your query. It eliminates the double subquery on shop_balance, changes the left join to an inner join (you have a condition on the second table), and eliminates the group by from the second subquery:
INSERT INTO shop_balance
SELECT null,
((SELECT shop_balance
FROM shop_balance
ORDER BY shop_balance_id desc
LIMIT 1
) -
(SELECT SUM(pr_pur_cost_price*quantity) AS net
FROM product_purchase_item i JOIN
product_purchases p
ON p.product_purchase_item_id=i.product_purchase_item_id
WHERE p.insert_operation=$id
)
), curdate();
You should also list the columns in the insert clause, and probably eliminate the first NULL (it would be set to NULL by default). The final curdate() suggests that you might want an automatic column to store the insertion time as well.

Find and remove duplicate rows by two columns

I read all the relevant duplicated questions/answers and I found this to be the most relevant answer:
INSERT IGNORE INTO temp(MAILING_ID,REPORT_ID)
SELECT DISTINCT MAILING_ID,REPORT_ID FROM table_1
;
The problem is that I want to remove duplicates by MAILING_ID and REPORT_ID, but I also want the insert to include all the other fields of table_1.
I tried to add all the relevant columns this way:
INSERT IGNORE INTO temp(M_ID,MAILING_ID,REPORT_ID,
MAILING_NAME,VISIBILITY,EXPORTED) SELECT DISTINCT
M_ID,MAILING_ID,REPORT_ID,MAILING_NAME,VISIBILITY,
EXPORTED FROM table_1
;
M_ID(int,primary),MAILING_ID(int),REPORT_ID(int),
MAILING_NAME(varchar),VISIBILITY(varchar),EXPORTED(int)
But it inserted all rows into temp (including duplicates)
The best way to delete duplicate rows by multiple columns is the simplest one:
Add a UNIQUE index:
ALTER IGNORE TABLE your_table ADD UNIQUE (field1,field2,field3);
The IGNORE above makes sure that only the first row found is kept and the rest are discarded. Note that ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only works on older versions.
(You can then drop that index if you need to allow future duplicates and/or know they won't happen again.)
This works perfectly in any version of MySQL including 5.7+. It also handles the error You can't specify target table 'my_table' for update in FROM clause by using a double-nested subquery. It only deletes ONE duplicate row (the later one) so if you have 3 or more duplicates, you can run the query multiple times. It never deletes unique rows.
DELETE FROM my_table
WHERE id IN (
SELECT calc_id FROM (
SELECT MAX(id) AS calc_id
FROM my_table
GROUP BY identField1, identField2
HAVING COUNT(id) > 1
) temp
)
I needed this query because I wanted to add a UNIQUE index on two columns but there were some duplicate rows that I needed to discard first.
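Since the query removes only one duplicate per group on each run, it can be repeated in a loop until nothing is deleted. A minimal sketch of that loop follows, using SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows and the function name remove_duplicates are made up, and the derived table alias is written as dup here. The double nesting is only needed on MySQL but is harmless in SQLite.

```python
import sqlite3

# Deletes the highest-id row of every group that still has more than one row.
DELETE_ONE_PER_GROUP_SQL = """
    DELETE FROM my_table
    WHERE id IN (
        SELECT calc_id FROM (
            SELECT MAX(id) AS calc_id
            FROM my_table
            GROUP BY identField1, identField2
            HAVING COUNT(id) > 1
        ) AS dup
    )
"""


def remove_duplicates(conn):
    """Repeat the delete until a pass removes nothing; return (rows, passes)."""
    total_deleted = 0
    passes = 0
    while True:
        cur = conn.execute(DELETE_ONE_PER_GROUP_SQL)
        if cur.rowcount == 0:
            break
        total_deleted += cur.rowcount
        passes += 1
    conn.commit()
    return total_deleted, passes


# Hypothetical sample data: ("a", "x") is stored three times, ("b", "y") twice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "id INTEGER PRIMARY KEY, identField1 TEXT, identField2 TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "a", "x"), (2, "a", "x"), (3, "a", "x"),
        (4, "b", "y"), (5, "b", "y"),
        (6, "c", "z"),
    ],
)

total_deleted, passes = remove_duplicates(conn)
surviving_ids = [row[0] for row in conn.execute("SELECT id FROM my_table ORDER BY id")]
print(total_deleted, passes, surviving_ids)
```

The first pass removes ids 3 and 5, the second removes id 2, and the third finds no group with more than one row, so the loop stops with the lowest id of each group left in place.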
For MySQL:
DELETE t1 FROM yourtable t1
INNER JOIN yourtable t2 WHERE t1.id < t2.id
AND t1.identField1 = t2.identField1
AND t1.identField2 = t2.identField2;
You will first need to find your duplicates by grouping on the two fields with a having clause.
Select identField1, identField2, count(*) FROM yourTable
GROUP BY identField1, identField2
HAVING count(*) >1
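If this returns what you want, you can then use it as a subquery. Note that this deletes every row of each duplicated group, including the copy you may want to keep; the extra derived table avoids MySQL's "You can't specify target table" error:
DELETE FROM yourTable WHERE (identField1, identField2) IN (
SELECT identField1, identField2 FROM (
SELECT identField1, identField2 FROM yourTable
GROUP BY identField1, identField2
HAVING count(*) > 1 ) AS dup )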
You can always get the primary ids by grouping on the two fields that should be unique (your_table, col_a and col_b are placeholders):
select count(*) as cnt, max(id) as id from your_table group by col_a, col_b having count(*) > 1;
and then
delete from your_table where id in (select id from (select max(id) as id from your_table group by col_a, col_b having count(*) > 1) as dup);
Here max(id) picks one row per duplicated group, so run the delete again until no rows are affected.
NOTE: This solution is an alternative & old school solution.
If you couldn't achieve what you wanted, then you can try my "oldschool" method:
First, run this query to get the duplicate records:
select column1,
column2,
count(*)
from table
group by column1,
column2
having count(*) > 1
order by count(*) desc
After that, select those results and paste them into Notepad++.
Now, using Notepad++'s find and replace feature, turn each line into a "delete" query followed by an "insert" query, like this (from now on, for security reasons, my values will be AAAA).
Special note: add an empty new line after the last line of your data in Notepad++, because the regex matches the '\r\n' at the end of each line:
Find what regex: \D*(\d+)\D*(\d+)\D*\r\n
Replace with string: delete from table where column1 = $1 and column2 = $2; insert into table set column1 = $1, column2 = $2;\r\n
Finally, paste those queries into MySQL Workbench's query console and execute them. You will see only one occurrence of each duplicate record.
This answer is for a relation table constructed of just two columns without ID. I think you can apply it to your situation.
In a large data set, suppose you are selecting multiple columns in the select clause, for example:
select x,y,z from table1.
and the requirement is to remove duplicates based on two columns, say y and z from the example above.
Then you may use the query below instead of a combination of "group by" and a subquery, which performs badly:
select x,y,z
from (
select x,y,z , row_number() over (partition by y,z) as index_num
from table1) main
where main.index_num=1
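To see what the row_number() query returns, here is a small sketch on SQLite through Python's sqlite3 module as a stand-in for MySQL. It needs an SQLite build with window function support (3.25 or later), just as the query itself needs MySQL 8.0 or later. The sample rows are invented, and the ORDER BY x inside the window is an addition so that the kept row is predictable; without it, which row of each (y, z) group gets number 1 is up to the server.

```python
import sqlite3

# Number the rows inside each (y, z) group and keep only the first of each group.
DEDUPED_SELECT_SQL = """
    SELECT x, y, z
    FROM (
        SELECT x, y, z,
               ROW_NUMBER() OVER (PARTITION BY y, z ORDER BY x) AS index_num
        FROM table1
    ) main
    WHERE main.index_num = 1
    ORDER BY x
"""

# Hypothetical sample data: ("a", "p") and ("b", "q") each appear twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (x INTEGER, y TEXT, z TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, "a", "p"), (2, "a", "p"), (3, "b", "q"), (4, "a", "r"), (5, "b", "q")],
)

deduped_rows = conn.execute(DEDUPED_SELECT_SQL).fetchall()
print(deduped_rows)
```

This only filters the result set; the duplicates are still stored in table1 afterwards.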