Find and remove duplicate rows by two columns

I read all the relevant duplicated questions/answers and I found this to be the most relevant answer:
INSERT IGNORE INTO temp(MAILING_ID,REPORT_ID)
SELECT DISTINCT MAILING_ID,REPORT_ID FROM table_1
;
The problem is that I want to remove duplicates by MAILING_ID and REPORT_ID, but I also want the insert to include all the other fields of table_1.
I tried to add all the relevant columns this way:
INSERT IGNORE INTO temp(M_ID,MAILING_ID,REPORT_ID,
MAILING_NAME,VISIBILITY,EXPORTED) SELECT DISTINCT
M_ID,MAILING_ID,REPORT_ID,MAILING_NAME,VISIBILITY,
EXPORTED FROM table_1
;
M_ID(int,primary),MAILING_ID(int),REPORT_ID(int),
MAILING_NAME(varchar),VISIBILITY(varchar),EXPORTED(int)
But it inserted all rows into temp (including duplicates)

The simplest way to delete duplicate rows by multiple columns is to add a UNIQUE index:
ALTER IGNORE TABLE your_table ADD UNIQUE (field1,field2,field3);
The IGNORE above makes sure that only the first row found is kept and the rest are discarded. Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this statement only works on older versions.
(You can then drop that index if you need to allow future duplicates or know they won't happen again.)
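The same idea also covers the question's case of copying every column while keeping one row per (MAILING_ID, REPORT_ID). The sketch below is only an illustration under stated assumptions: it uses SQLite through Python's sqlite3 module as a stand-in for MySQL, the table name dedup stands for the question's temp table, and the sample rows are made up. SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE.

```python
import sqlite3


def copy_without_duplicates(rows):
    """Copy rows into dedup, keeping one row per (MAILING_ID, REPORT_ID)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE table_1 (M_ID INTEGER PRIMARY KEY, MAILING_ID INT, "
        "REPORT_ID INT, MAILING_NAME TEXT)"
    )
    # The UNIQUE constraint on the target table is what rejects duplicates.
    con.execute(
        "CREATE TABLE dedup (M_ID INTEGER PRIMARY KEY, MAILING_ID INT, "
        "REPORT_ID INT, MAILING_NAME TEXT, UNIQUE (MAILING_ID, REPORT_ID))"
    )
    con.executemany("INSERT INTO table_1 VALUES (?, ?, ?, ?)", rows)
    # SQLite syntax: INSERT OR IGNORE. MySQL syntax: INSERT IGNORE.
    con.execute(
        "INSERT OR IGNORE INTO dedup "
        "SELECT M_ID, MAILING_ID, REPORT_ID, MAILING_NAME "
        "FROM table_1 ORDER BY M_ID"
    )
    return con.execute("SELECT * FROM dedup ORDER BY M_ID").fetchall()


# Hypothetical sample data: rows 1 and 2 share the same key pair.
sample = [(1, 10, 100, "a"), (2, 10, 100, "b"), (3, 11, 100, "c")]
kept = copy_without_duplicates(sample)
```

Plain SELECT DISTINCT over all columns cannot do this, because M_ID differs in every row; the unique key on the target table is what decides which rows count as duplicates.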

This works in any version of MySQL, including 5.7+. It also avoids the error "You can't specify target table 'my_table' for update in FROM clause" by using a double-nested subquery. Each run deletes only ONE duplicate row per group (the one with the highest id), so if you have 3 or more duplicates, run the query repeatedly until it affects no rows. It never deletes unique rows.
DELETE FROM my_table
WHERE id IN (
SELECT calc_id FROM (
SELECT MAX(id) AS calc_id
FROM my_table
GROUP BY identField1, identField2
HAVING COUNT(id) > 1
) temp
)
I needed this query because I wanted to add a UNIQUE index on two columns but there were some duplicate rows that I needed to discard first.
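Running the query "multiple times" can be automated by repeating it until a pass deletes nothing. The following is a minimal sketch of that loop, assuming SQLite via Python's sqlite3 module in place of MySQL; the helper name remove_all_duplicates, the alias latest, and the sample rows are made up for illustration.

```python
import sqlite3

# Same statement as above; only the derived-table alias was renamed.
DELETE_ONE_DUPLICATE = """
DELETE FROM my_table
WHERE id IN (
    SELECT calc_id FROM (
        SELECT MAX(id) AS calc_id
        FROM my_table
        GROUP BY identField1, identField2
        HAVING COUNT(id) > 1
    ) AS latest
)
"""


def remove_all_duplicates(con):
    """Repeat the delete until a pass removes nothing; return the pass count."""
    passes = 0
    while True:
        deleted = con.execute(DELETE_ONE_DUPLICATE).rowcount
        if deleted == 0:
            return passes
        passes += 1


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE my_table "
    "(id INTEGER PRIMARY KEY, identField1 INT, identField2 INT)"
)
# Three copies of (1, 1) and one unique row (2, 2).
con.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 2, 2)],
)
passes = remove_all_duplicates(con)
remaining = [row[0] for row in con.execute("SELECT id FROM my_table ORDER BY id")]
```

With three copies of the same pair, two passes are needed, and the row with the lowest id in the group survives.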

For MySQL (this keeps only the row with the highest id in each group):
DELETE t1 FROM yourtable t1
INNER JOIN yourtable t2
ON t1.identField1 = t2.identField1
AND t1.identField2 = t2.identField2
WHERE t1.id < t2.id;
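To check which row survives, the statement can be tried on a small table. This sketch assumes SQLite via Python's sqlite3 module; because SQLite has no multi-table DELETE, the self-join is rewritten as a correlated EXISTS with the same conditions, so it is an equivalent form rather than the MySQL statement itself. The sample rows are made up.

```python
import sqlite3

# EXISTS rewrite of the self-join: a row is deleted when another row with
# the same two fields and a higher id exists.
DELETE_OLDER_COPIES = """
DELETE FROM yourtable
WHERE EXISTS (
    SELECT 1 FROM yourtable AS t2
    WHERE yourtable.id < t2.id
      AND yourtable.identField1 = t2.identField1
      AND yourtable.identField2 = t2.identField2
)
"""

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE yourtable "
    "(id INTEGER PRIMARY KEY, identField1 INT, identField2 INT)"
)
con.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [(1, 5, 5), (2, 5, 5), (3, 5, 5), (4, 6, 6)],
)
con.execute(DELETE_OLDER_COPIES)
survivors = [row[0] for row in con.execute("SELECT id FROM yourtable ORDER BY id")]
```

Unlike the MAX(id) approach, a single run removes all extra copies. Flipping the comparison to > would keep the lowest id instead.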

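You will first need to find your duplicates by grouping on the two fields with a HAVING clause.
SELECT identField1, identField2, count(*) FROM yourTable
GROUP BY identField1, identField2
HAVING count(*) > 1
If this returns what you want, you can then use it as a subquery. The subquery must return only the two grouping columns, and MySQL needs it wrapped in a derived table because it reads from the table being deleted from. Note that this deletes every row in each duplicated group, not just the extra copies:
DELETE FROM yourTable WHERE (identField1, identField2) IN (
SELECT identField1, identField2 FROM (
SELECT identField1, identField2 FROM yourTable
GROUP BY identField1, identField2
HAVING count(*) > 1 ) AS dup )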
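Before deleting anything, it can help to inspect what the grouping query reports. A small sketch, assuming SQLite via Python's sqlite3 module as a stand-in for MySQL; the helper name duplicate_groups and the sample rows are hypothetical.

```python
import sqlite3


def duplicate_groups(con):
    """Return (identField1, identField2, count) for each duplicated pair."""
    return con.execute(
        "SELECT identField1, identField2, COUNT(*) FROM yourTable "
        "GROUP BY identField1, identField2 "
        "HAVING COUNT(*) > 1 "
        "ORDER BY identField1, identField2"
    ).fetchall()


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE yourTable "
    "(id INTEGER PRIMARY KEY, identField1 INT, identField2 INT)"
)
con.executemany(
    "INSERT INTO yourTable (identField1, identField2) VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2), (2, 2)],
)
groups = duplicate_groups(con)
# Rows beyond the first copy in each group, i.e. how many a cleanup removes.
extra_rows = sum(count - 1 for _, _, count in groups)
```

The pair (1, 2) occurs once and is therefore not reported.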

You can always get the primary ids by grouping on the two fields that should be unique:
select count(*) as cnt, max(id) as id from your_table group by col_a, col_b having count(*) > 1;
and then
delete from your_table where id in (select id from (select max(id) as id from your_table group by col_a, col_b having count(*) > 1) as dup);
Here max() picks one id per duplicated group, so each run removes one duplicate per group.

NOTE: This is an alternative, old-school solution.
If the other approaches didn't get you what you wanted, you can try this method.
First, run this query to get the duplicate records:
select column1,
column2,
count(*)
from table
group by column1,
column2
having count(*) > 1
order by count(*) desc
After that, select those results and paste them into Notepad++.
Now use Notepad++'s find and replace to turn each line into a "delete" query followed by an "insert" query, like this (for security reasons, my values are shown as AAAA).
Special note: add a new line after the last line of your data in Notepad++, because the regex matches the '\r\n' at the end of each line:
Find what regex: \D*(\d+)\D*(\d+)\D*\r\n
Replace with string: delete from table where column1 = $1 and column2 = $2; insert into table set column1 = $1, column2 = $2;\r\n
Finally, paste those queries into the MySQL Workbench query console and execute them. You will see only one occurrence of each duplicate record.
This answer is for a relation table made up of just two columns without an ID. I think you can apply it to your situation.
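The find-and-replace step can also be scripted. This is a rough sketch using Python's re module with the same pattern; it assumes that only the two key columns were pasted on each line (a third numeric column, such as the count, would shift the capture groups), that lines end in '\r\n', and that the table is called relation, which is a hypothetical name.

```python
import re

# Same pattern as in the Notepad++ dialog above.
PATTERN = re.compile(r"\D*(\d+)\D*(\d+)\D*\r\n")

# Python writes group references as \1 and \2 where Notepad++ uses $1 and $2.
TEMPLATE = (
    r"delete from relation where column1 = \1 and column2 = \2; "
    r"insert into relation set column1 = \1, column2 = \2;"
    "\r\n"
)


def to_queries(pasted):
    """Turn pasted 'column1<TAB>column2' lines into delete+insert statements."""
    return PATTERN.sub(TEMPLATE, pasted)


# Hypothetical pasted result: two duplicated pairs, one per line.
queries = to_queries("10\t20\r\n30\t40\r\n")
```

Each generated line deletes every copy of the pair and re-inserts it once, which is why this only suits tables that consist of nothing but the two columns.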

On a large data set, suppose you are selecting multiple columns, for example:
select x,y,z from table1
and the requirement is to remove duplicates based on two of the columns, say y and z.
Then, on MySQL 8.0+ (which supports window functions), you can use the query below instead of a combination of "group by" and a subquery, which performs poorly:
select x,y,z
from (
select x,y,z , row_number() over (partition by y,z) as index_num
from table1) main
where main.index_num=1
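A runnable sketch of this query, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for MySQL 8.0+. An ORDER BY x was added inside the window so that the surviving row per (y, z) group is deterministic; without it, which row gets number 1 is up to the engine. The sample rows are made up.

```python
import sqlite3

FIRST_ROW_PER_GROUP = """
SELECT x, y, z
FROM (
    SELECT x, y, z,
           ROW_NUMBER() OVER (PARTITION BY y, z ORDER BY x) AS index_num
    FROM table1
) AS main
WHERE main.index_num = 1
ORDER BY x
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (x INT, y INT, z INT)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 7, 7), (2, 7, 7), (3, 8, 7), (4, 8, 7), (5, 9, 9)],
)
unique_rows = con.execute(FIRST_ROW_PER_GROUP).fetchall()
```

Each (y, z) pair appears once in the result, paired with the smallest x of its group.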

Related

Keep only last two rows for grouped columns in table

I have a table "History" with about 300,000 rows, which is filled with new data daily. I want to keep only the last two rows of every refSchema/refId combination.
Currently I do it this way:
First step:
SELECT refSchema,refId FROM History GROUP BY refSchema,refId
With this statement I get all combinations (about 40,000).
Second step:
I run a foreach loop that looks up the existing rows for each combination from the query above, like this:
SELECT id
FROM History
WHERE refSchema = ? AND refId = ? AND state = 'done'
ORDER BY importedAt DESC
LIMIT 2, 2000
Please keep in mind that I want to keep the last two rows in my table, so I use LIMIT 2,2000 to skip them. If I find matching rows, I put the ids in an array called idList.
Final step:
I delete all ids in the array like this:
DELETE FROM History WHERE id in ($idList)
This does not seem to perform well, because I have to check every combination with an extra query. Is there a way to do this with a single DELETE statement and avoid the 40,000 extra queries?
Edit Update: I use AWS Aurora DB

If you are using MySQL 8+, then one conceptually simple way to proceed here is to use a CTE to identify the top two rows per group, which you do want to retain. Then, delete any record whose id does not appear in this whitelist:
WITH cte AS (
SELECT id
FROM
(
SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
FROM History
) t
WHERE rn IN (1, 2)
)
DELETE
FROM History
WHERE id NOT IN (SELECT id FROM cte);
Matching on id matters: every (refSchema, refId) pair has a row with rn = 1, so a whitelist of pairs would protect every row.
If you can't use a CTE, then try inlining it:
DELETE
FROM History
WHERE id NOT IN (
SELECT id
FROM
(
SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
FROM History
) t
WHERE rn IN (1, 2)
);
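A small end-to-end sketch of keeping the two newest rows per group, keyed on the id column. It assumes SQLite 3.25 or later via Python's sqlite3 module rather than MySQL or Aurora, stores importedAt as ISO date text, and uses made-up rows; the state = 'done' filter from the question is left out.

```python
import sqlite3

KEEP_LAST_TWO = """
DELETE FROM History
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY refSchema, refId
                   ORDER BY importedAt DESC
               ) AS rn
        FROM History
    ) AS t
    WHERE rn IN (1, 2)
)
"""

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE History (id INTEGER PRIMARY KEY, refSchema TEXT, "
    "refId INT, importedAt TEXT)"
)
con.executemany(
    "INSERT INTO History VALUES (?, ?, ?, ?)",
    [
        (1, "a", 1, "2024-01-01"),
        (2, "a", 1, "2024-01-02"),
        (3, "a", 1, "2024-01-03"),
        (4, "a", 1, "2024-01-04"),
        (5, "b", 1, "2024-01-01"),
    ],
)
deleted = con.execute(KEEP_LAST_TWO).rowcount
remaining = [row[0] for row in con.execute("SELECT id FROM History ORDER BY id")]
```

The group ("a", 1) loses its two oldest rows, while the single-row group ("b", 1) is untouched.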

MySQL: SELECT UNIQUE VALUE

In my table I have several duplicates. I need to find the unique values in a MySQL table column.
SQL
SELECT column FROM table
WHERE column is unique
SELECT column FROM table
WHERE column = DISTINCT
I've been trying to Google, but almost all queries are more complex.
The result I'd like is all non-duplicate values.
EDIT
I'd like to have UNIQUE values...
Not all values one time... (Distinct)

Try to use DISTINCT like this:
SELECT DISTINCT mycolumn FROM mytable
EDIT:
Try
select mycolumn, count(mycolumn) c from mytable
group by mycolumn having c = 1
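The difference between the two queries is easy to see on a tiny table. A minimal sketch, assuming SQLite via Python's sqlite3 module in place of MySQL, with made-up values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (mycolumn TEXT)")
con.executemany(
    "INSERT INTO mytable VALUES (?)",
    [("a",), ("a",), ("b",), ("c",), ("c",), ("d",)],
)

# DISTINCT: every value once, whether or not it was duplicated.
distinct_values = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT mycolumn FROM mytable ORDER BY mycolumn"
    )
]

# HAVING COUNT = 1: only the values that occur exactly once.
single_values = [
    row[0]
    for row in con.execute(
        "SELECT mycolumn FROM mytable "
        "GROUP BY mycolumn HAVING COUNT(mycolumn) = 1 "
        "ORDER BY mycolumn"
    )
]
```

DISTINCT returns a, b, c and d, while the grouped query returns only b and d, which is what the edit to the question asks for.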

Here is the query that you want!
SELECT column FROM table GROUP BY column HAVING COUNT(column) = 1
This query took 00.34 seconds on a data set of 1 Million rows.
Here is a query for you though, in the future if you DO want duplicates, but NOT non-duplicates...
SELECT column, COUNT(column) FROM table GROUP BY column HAVING COUNT(column) > 1
This query took 00.59 seconds on a data set of 1 Million rows. This query will give you the (column) value for every duplicate, and also the COUNT(column) result for how many duplicates. You can obviously choose not to select COUNT(column) if you don't care how many there are.
You can also check this out, if you need access to more than just the column with possible duplicates... Finding duplicate values in a SQL table

Try this one:
SELECT COUNT(column_name) AS `counter`, column_name
FROM tablename
GROUP BY column_name
HAVING COUNT(column_name) = 1
Have a look at this fiddle: http://sqlfiddle.com/#!9/15147/2/0

Try this:
SELECT DISTINCT (column_name) FROM table_name

using distinct with all attributes

We can use * to select all attributes from a table. I am using DISTINCT and my table contains 16 columns. How can I use DISTINCT with it? I cannot do select distinct Id,* from abc;
What would be the best way?
Another way could be select distinct id,col1,col2 etc.

If you want one row per id in the results, you can use GROUP BY id. But then, it's not advisable to use the other columns in the SELECT list (even if MySQL allows it, which depends on whether the ONLY_FULL_GROUP_BY SQL mode is enabled). It's advisable to use the other columns with aggregate functions like MIN(), MAX(), COUNT(), etc. In MySQL, there is also a GROUP_CONCAT() aggregate function that will collect all values from a column for a group:
SELECT
id
, COUNT(*) AS number_of_rows_with_same_id
, MIN(col1) AS min_col1
, MAX(col1) AS max_col1
--
, GROUP_CONCAT(col1) AS gc_col1
, GROUP_CONCAT(col2) AS gc_col2
--
, GROUP_CONCAT(col16) AS gc_col16
FROM
abc
GROUP BY
id ;
The query:
SELECT *
FROM abc
GROUP BY id ;
is not valid in SQL up to SQL-92, because the SELECT list contains non-aggregated columns. SQL:2003 and later allow such columns, but only when they are functionally dependent on the grouping column (id). Here they are not, so the query is still invalid. With ONLY_FULL_GROUP_BY disabled, MySQL unfortunately allows such queries and does no checking of functional dependency.
So, you never know which row (of the many with the same id) will be returned, or even whether (horror!) you get results from different rows with the same id. As @Andriy comments, the consequence is that values for columns other than id will be chosen arbitrarily. If you want predictable results, just don't use such a technique.
An example solution: If you want just one row from every id, and you have a datetime or timestamp (or some other) column that you can use for ordering, you can do this:
SELECT t.*
FROM abc AS t
JOIN
( SELECT id
, MIN(some_column) AS m -- or MAX()
FROM abc
GROUP BY id
) AS g
ON g.id = t.id
AND g.m = t.some_column ;
This will work as long as the (id, some_column) combination is unique.
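As an illustration of this join, here is a small sketch that assumes SQLite via Python's sqlite3 module in place of MySQL. The column some_column holds ISO date text, col1 stands for the remaining columns, and the rows are made up.

```python
import sqlite3

EARLIEST_ROW_PER_ID = """
SELECT t.*
FROM abc AS t
JOIN
  ( SELECT id
         , MIN(some_column) AS m
    FROM abc
    GROUP BY id
  ) AS g
  ON  g.id = t.id
  AND g.m = t.some_column
ORDER BY t.id
"""

con = sqlite3.connect(":memory:")
# id is deliberately not a key here: several rows may share the same id.
con.execute("CREATE TABLE abc (id INT, some_column TEXT, col1 TEXT)")
con.executemany(
    "INSERT INTO abc VALUES (?, ?, ?)",
    [
        (1, "2024-01-02", "late"),
        (1, "2024-01-01", "early"),
        (2, "2024-01-05", "only"),
    ],
)
one_row_per_id = con.execute(EARLIEST_ROW_PER_ID).fetchall()
```

Every column of the chosen row comes from the same source row, which is exactly what GROUP BY id with bare columns cannot guarantee.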

Use GROUP BY instead of DISTINCT:
group by col1, col2,col3
It behaves like DISTINCT.

SELECT DISTINCT * FROM `some_table`
is absolutely valid syntax.
The error is caused by selecting Id, *. The * already includes the Id column, which is usually unique anyway.
So what you'll need in your case is just:
SELECT DISTINCT * FROM `abc`

SELECT * FROM abc where id in(select distinct id from abc);
You can totally do this.
Hope this helps.
Edit: initially I thought this would work and that GROUP BY was the best option, but this is the same as doing select * from abc. Sorry, guys.

Simply delete duplicate content in a sql table

I wanted to know if there is an easy way to remove duplicates from an SQL table,
rather than fetching the whole table and deleting the rows that appear twice.
Thank you in advance.
This is my structure:
CREATE TABLE IF NOT EXISTS `mups` (
`idgroupe` varchar(15) NOT NULL,
`fan` bigint(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

If you are using SQL Server,
check this: SQL SERVER – 2005 – 2008 – Delete Duplicate Rows
Sample code using a CTE:
/* Delete Duplicate records */
WITH CTE (COl1,Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

Add a calculated column that takes the checksum of the entire row. Search for any duplicate checksums, rank and remove the duplicates.

You can do something like this:
DELETE from yourTable WHERE tableID in
(SELECT clone.tableID
from yourTable origine,
yourTable clone
where clone.tableID= origine.tableID)
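In the inner WHERE, you can either compare the indexes or compare any of the other fields, depending on how you identify your duplicates. As written, the condition matches every row with itself, so replace it with the comparison that defines a duplicate for you.
Note: this solution has the advantage of letting you choose what counts as a duplicate (if the PK changes, for example).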

You can find the duplicates by joining the table to itself, grouping by the fields you are checking for duplicates, and adding a HAVING clause where the count is greater than one.
Let's say your table name is customers, and you're looking for duplicate name fields.
select cust_out.name, count(cust_count.name)
from customers cust_out
inner join customers cust_count on cust_out.name = cust_count.name
group by cust_out.name
having count(cust_count.name) > 1
If you used this in a delete statement, you would delete all of the duplicated records, when you probably intend to keep one of them.
So, to select only the records to delete:
select cust_dup.id
from customers cust
inner join customers cust_dup on cust.name = cust_dup.name and cust_dup.id > cust.id
group by cust_dup.id
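A sketch of feeding that selection into a delete, assuming SQLite via Python's sqlite3 module as a stand-in; the helper name delete_duplicate_customers and the sample customers are made up. In MySQL the subquery would likely need to be wrapped in a derived table to avoid the "can't specify target table" error.

```python
import sqlite3

# Every row that has an earlier row (lower id) with the same name.
IDS_TO_DELETE = """
SELECT cust_dup.id
FROM customers cust
INNER JOIN customers cust_dup
    ON cust.name = cust_dup.name AND cust_dup.id > cust.id
GROUP BY cust_dup.id
"""


def delete_duplicate_customers(con):
    """Delete all but the lowest-id row for each name; return rows removed."""
    cursor = con.execute(f"DELETE FROM customers WHERE id IN ({IDS_TO_DELETE})")
    return cursor.rowcount


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Ann"), (4, "Ann")],
)
removed = delete_duplicate_customers(con)
left = con.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

The GROUP BY in the selection only collapses repeated ids: row 4 matches both row 1 and row 3, but it is listed once.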

MySQL GROUP BY Multiple Columns

I need to perform a GROUP BY on 2 columns separately...
In common terms, I'd like the query to say: GROUP BY column 1, then once this grouping has been performed, and the rows returned have been refined, go back to the top and GROUP BY column 2 to refine the rows returned again.
For instance, instead of stating:
GROUP BY column_1, column_2
I want to state (I Understand this is incorrect syntax):
GROUP BY column_1
GROUP BY column_2
If this is unclear I can include a sample query with expected returned results.

Are you trying to do something like this?
select ...
from (
select ...
from some_table
where ...
group by column1
) as dt
group by column2
That's the closest thing I can think of that matches what your question appears to be asking.
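One case where such two-level grouping is meaningful is aggregating an aggregate. The sketch below is only an example under assumptions: a hypothetical sales table, SQLite via Python's sqlite3 module instead of MySQL, and made-up figures. The inner query totals sales per store, and the outer query averages those totals per region.

```python
import sqlite3

AVG_STORE_TOTAL_PER_REGION = """
SELECT region, AVG(store_total) AS avg_store_total
FROM (
    SELECT store, MAX(region) AS region, SUM(amount) AS store_total
    FROM sales
    GROUP BY store
) AS dt
GROUP BY region
ORDER BY region
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store TEXT, region TEXT, amount INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("s1", "north", 10),
        ("s1", "north", 30),
        ("s2", "north", 20),
        ("s3", "south", 5),
    ],
)
per_region = con.execute(AVG_STORE_TOTAL_PER_REGION).fetchall()
```

A single GROUP BY store, region could not produce this, because the average has to be taken over the per-store totals rather than over the raw rows.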

In MySQL you can group by multiple columns. The query is:
select * from table group by col1, col2
But that may not give the answer you want. In that case, you can use a subquery:
select * from (select * from table group by col2) tabl group by col1
