I'm having a hard time removing duplicates from a database. It's MariaDB (protocol version: 10, 10.3.34-MariaDB Server). I need to remove rows where three columns are equal. I tried using a WITH clause, but the database throws an error saying it can't recognize 'WITH', so I focused on the traditional way.
I need to remove rows where foreignId, column1 and column2 are equal.
I check whether there are duplicates like this:
SELECT foreignId, column1, column2, COUNT(*)
FROM table1
GROUP BY foreignId, column1, column2
HAVING COUNT(*) > 1
Trying to remove duplicates...
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
p.foreignId,
p.column1,
p.column2,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.foreignId,
p.column2,
p.column1
) AS row_number
FROM table1 p
GROUP BY p.foreignId, p.column1, p.column2
) dup
ON table1.column1 = dup.column1
WHERE dup.row_number > 1;
I have modified this code a lot but still can't make it work as intended... What am I doing wrong?
You have a few issues with your query:
You need to remove the GROUP BY in the subquery
You should change the ORDER BY in the OVER clause to ORDER BY p.ttimestamp DESC (where ttimestamp is the name of your timestamp column)
You need to JOIN on the unique id column, i.e. ON table1.id = dup.id; otherwise you will delete every row whose values match a duplicated row, including the copy you want to keep
That will give you:
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.ttimestamp DESC
) AS rn
FROM table1 p
) dup
ON table1.id = dup.id
WHERE dup.rn > 1
Note I would not use row_number as a column alias as it is a reserved word, so I've changed it to rn above.
Demo (thanks to @JonasMetzler) on dbfiddle
Note that if it's possible for duplicate rows to also have the same timestamp value, which of those tied rows this query keeps is arbitrary. If you want a deterministic result, change the ORDER BY clause to
ORDER BY
p.ttimestamp DESC,
p.id DESC
which will keep the row with the highest (or lowest if you remove the DESC after p.id) id value.
Demo on dbfiddle
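As a rough, self-contained check of the logic above, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for MariaDB (an assumption made only so the example runs anywhere; window functions need SQLite 3.25 or newer). SQLite has no multi-table DELETE ... JOIN, so the sketch filters with id IN (...) instead, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, foreignId INT, "
    "column1 TEXT, column2 TEXT, ttimestamp TEXT)"
)
# Hypothetical sample data: two duplicate groups and one unique row.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "a", "x", "2022-01-01"),
        (2, 10, "a", "x", "2022-01-03"),  # newest row of its group
        (3, 10, "a", "x", "2022-01-02"),
        (4, 20, "b", "y", "2022-01-01"),  # same timestamp as id 5
        (5, 20, "b", "y", "2022-01-01"),
        (6, 30, "c", "z", "2022-01-01"),  # no duplicate
    ],
)

# Number the rows inside each duplicate group, newest first, with id as
# the tie-breaker, then delete everything except row number 1.
conn.execute(
    """
    DELETE FROM table1
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY foreignId, column1, column2
                       ORDER BY ttimestamp DESC, id DESC
                   ) AS rn
            FROM table1
        ) AS dup
        WHERE rn > 1
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
print(remaining_ids)
```

With the id DESC tie-breaker, the group whose rows share a timestamp keeps the row with the higher id.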
Assuming you have a unique column like id, you can do the following:
DELETE FROM table1 WHERE ID NOT IN
(SELECT x.id FROM
(SELECT MAX(id) id, MAX(foreignId) foreignId,
MAX(column1) column1, MAX(column2) column2
FROM table1
WHERE ttimestamp IN (SELECT MAX(ttimestamp) FROM table1
GROUP BY foreignID, column1, column2)
GROUP BY foreignId, column1, column2)x);
Please see the working example here: db<>fiddle
Related
Consider the following table:
As shown in the image, I want to return all the data from only the first row of each distinct id. How can I achieve that in MySQL?
You can filter with a subquery. Assuming that by first you mean the row with the earliest start_time, that would be:
select t.*
from mytable t
where t.start_time = (
select min(t1.start_time) from mytable t1 where t1.call_unique_id = t.call_unique_id
)
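To see the correlated subquery at work, here is a small Python sketch using sqlite3 as a stand-in for MySQL (an assumption for runnability). The status column and the sample rows are hypothetical, since the original table is only shown as an image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (call_unique_id TEXT, start_time TEXT, status TEXT)"
)
# Hypothetical rows: two calls, each with two events.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("c1", "2020-01-01 10:00:05", "answered"),
        ("c1", "2020-01-01 10:00:00", "ringing"),
        ("c2", "2020-01-01 11:00:00", "ringing"),
        ("c2", "2020-01-01 11:00:09", "hangup"),
    ],
)

# Keep only the row whose start_time is the minimum for its call_unique_id.
first_rows = conn.execute(
    """
    SELECT t.*
    FROM mytable t
    WHERE t.start_time = (
        SELECT MIN(t1.start_time)
        FROM mytable t1
        WHERE t1.call_unique_id = t.call_unique_id
    )
    ORDER BY t.call_unique_id
    """
).fetchall()
print(first_rows)
```

If two rows of the same call share the earliest start_time, both are returned, so a further tie-breaker would be needed in that case.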
select t1.*
from your_table t1
join
(
select min(call_unique_id) as id
from your_table
group by start_time
) t2 on t1.id = t2.id
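group by may also do the job, but only when the ONLY_FULL_GROUP_BY SQL mode is disabled, and the row MySQL returns for each group is then arbitrary rather than guaranteed to be the first. So try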
select * from your_table group by call_unique_id
I have two tables, let's say table1 and table2, with common columns id and update_date. I am looking to get the ids and update_date values ordered by the latest update_date, descending. I have used 'union' and 'order by' together, which gave the results in descending order of update_date, but there are duplicate ids which I am not sure how to get rid of.
My query is like,
(select id,update_date from table1 where [condition])
UNION
(select id,update_date from table2 where [condition])
order by update_date desc;
I can get rid of the duplicate ids by adding select distinct id from (above query) as temp; but the problem is that I need the update_date too.
Can anyone suggest how to get rid of the duplicates and still get both the id and update_date information?
Assuming you want the latest update out of duplicates this one should work:
SELECT id, max(update_date) AS last_update
FROM
( (select id,update_date from table1 where [conditions])
UNION
(select id,update_date from table2 where [conditions]) ) both_tables
GROUP BY id
ORDER by last_update DESC
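A minimal sketch of this UNION plus GROUP BY approach, run through Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). The [conditions] placeholders are left out and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INT, update_date TEXT)")
conn.execute("CREATE TABLE table2 (id INT, update_date TEXT)")
# Hypothetical data: id 1 appears in both tables with different dates.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "2021-01-01"), (2, "2021-03-01")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [(1, "2021-02-01"), (3, "2021-01-15")],
)

# Combine both tables, then keep one row per id with its latest date.
latest = conn.execute(
    """
    SELECT id, MAX(update_date) AS last_update
    FROM (
        SELECT id, update_date FROM table1
        UNION
        SELECT id, update_date FROM table2
    ) AS both_tables
    GROUP BY id
    ORDER BY last_update DESC
    """
).fetchall()
print(latest)
```

Each id appears once, paired with the most recent update_date found in either table.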
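Wrap the query in a DISTINCT block (note that this removes only rows where both id and update_date match, which UNION already does, so an id with two different dates still appears twice):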
SELECT DISTINCT * FROM (
select id,update_date from table1 where [condition]
UNION
select id,update_date from table2 where [condition]
) AS temp
order by update_date desc;
Limit the second query's results:
select id, update_date
from table1
where [conditions]
union
select id, update_date
from table2
where [conditions]
and id not in (select id from table1 where [conditions])
I have a table with id, item_id, value (int), run (datetime) and I need to select the value difference between the last two runs per item_id.
SELECT item_id, ABS(value1 - value2) AS diff
FROM ( SELECT h.item_id, h.value AS value1, h2.value AS value2
FROM ( SELECT id, item_id, value
FROM table_name
GROUP BY item_id
ORDER BY run DESC) AS h
INNER JOIN ( SELECT id, item_id, value
FROM table_name
ORDER BY run DESC) AS h2
ON h.item_id = h2.item_id AND h.id != h2.id
GROUP BY item_id) AS h3
I believe this should do the trick for you. Just replace table_name with the correct name.
Explanation:
Basically, I join the table with itself in run DESC order, joining on item_id while excluding rows with the same id. Then I GROUP BY again to drop any third and later rows. Lastly, I calculate the difference between them with ABS(value1 - value2).
SELECT t2.id, t2.item_id, (t2.value- t1.value) valueDiff, t2.run
FROM ( table_name AS t1
INNER JOIN
table_name AS t2
ON t1.run = (SELECT MAX(run) FROM table_name where run < t2.run)
and t1.item_id = t2.item_id)
This assumes you want the difference between each record and the record from the previous run.
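For comparison, here is a sketch of a different technique than either answer above: ranking the runs per item with ROW_NUMBER() and subtracting the second-newest value from the newest. It assumes a server with window functions (MySQL 8.0+ or MariaDB 10.2+) and is run here through Python's sqlite3 as a stand-in; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (id INTEGER PRIMARY KEY, item_id INT, value INT, run TEXT)"
)
# Hypothetical data: item 1 has three runs, item 2 has two.
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10, "2020-01-01"),
        (2, 1, 15, "2020-01-02"),
        (3, 1, 12, "2020-01-03"),
        (4, 2, 100, "2020-01-01"),
        (5, 2, 130, "2020-01-02"),
    ],
)

# rn = 1 is the newest run per item, rn = 2 the one before it.
diffs = conn.execute(
    """
    SELECT item_id,
           MAX(CASE WHEN rn = 1 THEN value END)
         - MAX(CASE WHEN rn = 2 THEN value END) AS diff
    FROM (
        SELECT item_id, value,
               ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY run DESC) AS rn
        FROM table_name
    ) AS ranked
    WHERE rn <= 2
    GROUP BY item_id
    ORDER BY item_id
    """
).fetchall()
print(diffs)
```

The result is signed (newest minus previous); wrap the subtraction in ABS() if only the magnitude matters. An item with a single run yields a NULL diff.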
I want to make sure that the order of the results from a subquery is preserved when using UNION DISTINCT. Please note that "union distinct" is required to filter out duplicates while doing the union.
For example:
select columnA1, columnA2 from tableA order by [columnA3] asc
union distinct
select columnB1, columnB2 from tableB
When I run this, I expect the records from the first subquery ( select columnA1, columnA2 from tableA order by [columnA3] asc ) to come first (in the order returned by order by columnA3 asc), followed by those from tableB.
I am assuming that I cannot add another dummy column because that would stop union distinct from removing duplicates. So, this won't work:
select column1, column2 from
( select column1, column2, 1 as ORD from tableA order by [columnA3] asc
union distinct
select column1, column2, 2 as ORD from tableB
) order by ORD
Essentially, MySQL isn’t preserving the order of records from sub-query while using “Union distinct” construct. After a bit of research, I found that it works if we put in a limit clause or have nested queries. So, below are the two approaches:
Approach-1: Use Limit clause
select columnA1, columnA2 from tableA order by [columnA3] asc Limit 100000000
union distinct
select columnB1, columnB2 from tableB
I have tested this behavior using a few datasets and it seems to work consistently. Also, there is a reference to this behavior in MySQL's documentation ( http://dev.mysql.com/doc/refman/5.1/en/union.html ):
“Use of ORDER BY for individual SELECT statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of ORDER BY in this context is typically in conjunction with LIMIT, so that it is used to determine the subset of the selected rows to retrieve for the SELECT, even though it does not necessarily affect the order of those rows in the final UNION result. If ORDER BY appears without LIMIT in a SELECT, it is optimized away because it will have no effect anyway.”
Please note that there is no particular reason for choosing a LIMIT of 100000000 other than having a sufficiently high number to make sure we cover all cases.
Approach-2: A nested query like the one below also works.
select column1, column2 from
( select column1, column2 from tableA order by [columnA3] asc ) alias1
union distinct
( select column1, column2 from tableB )
I couldn’t find a reason why the nested query works. There have been some references online (like the one from Phil McCarley at http://dev.mysql.com/doc/refman/5.0/en/union.html ) but no official documentation from MySQL.
select column1, column2 from
( select column1, column2, 1 as ORD from tableA
union distinct
select tableB.column1, tableB.column2, 2 as ORD from tableB
LEFT JOIN tableA
ON tableA.column1 = tableB.column1 AND tableA.column2 = tableB.column2
WHERE tableA.column1 IS NULL
) AS u order by ORD
Note that UNION de-dupes not only across the separate sets but also within each set.
Alternatively:
select column1, column2 from
( select column1, column2, 1 as ORD from tableA
union distinct
select column1, column2, 2 as ORD from tableB
WHERE (column1, column2) NOT IN (SELECT column1, column2 from tableA)
) AS u order by ORD
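The LEFT JOIN variant above can be sketched as follows, using Python's sqlite3 as a stand-in for MySQL (an assumption for runnability). SQLite spells UNION DISTINCT as plain UNION, which has the same meaning. The sample rows are hypothetical, and column1, column2 are added to the final ORDER BY only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (column1 TEXT, column2 INT)")
conn.execute("CREATE TABLE tableB (column1 TEXT, column2 INT)")
# Hypothetical data: a duplicate inside tableA, one row shared with
# tableB, and a duplicate inside tableB.
conn.executemany("INSERT INTO tableA VALUES (?, ?)", [("x", 1), ("y", 2), ("x", 1)])
conn.executemany("INSERT INTO tableB VALUES (?, ?)", [("y", 2), ("z", 3), ("z", 3)])

# ORD tags each row with its source. Rows of tableB that already exist
# in tableA are filtered out by the anti-join, so the extra ORD column
# no longer prevents de-duplication across the two sets.
rows = conn.execute(
    """
    SELECT column1, column2 FROM (
        SELECT column1, column2, 1 AS ORD FROM tableA
        UNION
        SELECT b.column1, b.column2, 2 AS ORD
        FROM tableB b
        LEFT JOIN tableA a
               ON a.column1 = b.column1 AND a.column2 = b.column2
        WHERE a.column1 IS NULL
    ) AS u
    ORDER BY ORD, column1, column2
    """
).fetchall()
print(rows)
```

Rows from tableA come first, followed by the rows that exist only in tableB, and each distinct pair appears once.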
I have this query in which one of the columns is a calculated one. Everything is working except that it does not order the results when I use that calculated column in the ORDER BY. The query is a very large one, so I have simplified it below for understanding. Here the calculated column is "remaining".
SELECT t1.id, t1.name, t2.duration - datediff(now(), t1.posting_time) as "remaining"
FROM table1 t1, table2 t2
WHERE td.id = t1.timefield
ORDER BY id, name, remaining DESC
Even if I remove this "remaining" from order by clause or use it with asc or desc, nothing happens and order stays the same.
The only time when 'remaining' alters the sort order is when the id and the name in two rows are the same. Otherwise, it has no effect on the ordering.
You need to fix the typo of 'td.id' to 't2.id'.
You should learn the JOIN notation too:
SELECT t1.id, t1.name, t2.duration - datediff(now(), t1.posting_time) as "remaining"
FROM table1 t1
JOIN table2 t2 ON t2.id = t1.timefield
ORDER BY id, name, remaining DESC
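To illustrate why remaining seems to have no effect, here is a small Python sketch using sqlite3 as a stand-in for MySQL (an assumption for runnability). SQLite has no datediff(), so a hypothetical age_days column replaces datediff(now(), t1.posting_time); the sample rows are also made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (id INT, duration INT)")
conn.execute("CREATE TABLE table1 (id INT, name TEXT, timefield INT, age_days INT)")
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 30), (2, 60)])
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(1, "alpha", 1, 10), (2, "beta", 2, 10), (3, "gamma", 1, 25)],
)

base = """
    SELECT t1.id, t1.name, t2.duration - t1.age_days AS remaining
    FROM table1 t1
    JOIN table2 t2 ON t2.id = t1.timefield
"""

# id is unique here, so the later sort keys never get a chance to matter.
by_id_first = conn.execute(
    base + " ORDER BY t1.id, t1.name, remaining DESC"
).fetchall()

# Putting the calculated column first makes it the primary sort key.
by_remaining_first = conn.execute(
    base + " ORDER BY remaining DESC, t1.id, t1.name"
).fetchall()

print(by_id_first)
print(by_remaining_first)
```

ORDER BY compares keys left to right, so a later key only breaks ties left by the earlier ones.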
Use a trick: wrap the query in a derived table and sort in the outer query:
SELECT *
FROM (
SELECT
t1.id,
t1.name,
t2.duration - datediff(now() , t1.posting_time) as "remaining"
FROM table1 t1, table2 t2
WHERE t2.id = t1.timefield
) AS i
ORDER BY id, name, remaining DESC