Redshift: delete duplicate records but keep the latest

Current Table
select * from PG_TABLE_DEF where tablename='currenttable';
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+--------------+-----------------------+-----------------------+----------+---------+---------+---------
public | currenttable | kafkaoffset | integer | az64 | f | 0 | t
public | currenttable | operation | character varying(25) | lzo | f | 0 | t
public | currenttable | othertablepk | integer | az64 | f | 0 | t
public | currenttable | othertableorderstatus | character varying(25) | lzo | f | 0 | t
select * from currentTable;
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:-------------:|-----------------------:|
| 1024 | CREATE | 23 | Cooking |
| 1025 | UPDATE | 23 | Shipped |
| 1026 | UPDATE | 23 | Delivered |
| 1027 | CREATE | 51 | Cooking |
| 1028 | UPDATE | 51 | Shipped |
| 1029 | CREATE | 52 | Cooking |
I want to dedupe my current table so that, for each otherTablePK, only the latest record (the one with the highest kafkaOffset) is kept.
Deduped Table (Expected Result)
select * from currentTable;
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:-------------:|-----------------------:|
| 1026 | UPDATE | 23 | Delivered |
| 1028 | UPDATE | 51 | Shipped |
| 1029 | CREATE | 52 | Cooking |
Solution-1: Using an inner join and max
A MySQL-like query in Redshift using an inner join and max. More Info.
DELETE
FROM currentTable
INNER JOIN
  (SELECT max(kafkaOffset) AS lastOffset,
          otherTablePk AS otherTablePkID
   FROM currentTable
   WHERE otherTablePk IN
       (SELECT otherTablePk
        FROM currentTable
        GROUP BY otherTablePk
        HAVING count(*) > 1)
   GROUP BY otherTablePk) lastTable ON lastTable.otherTablePkID = currentTable.otherTablePk
WHERE currentTable.kafkaOffset < lastTable.lastOffset;
Solution-2: Using USING and doing Self Join.
DELETE from currentTable t1
JOIN currentTable t2 USING (otherTablePK)
WHERE t1.kafkaOffset < t2.kafkaOffset
Solution-3: Using TEMP table and surgical deletes
As explained in this blog and this answer, but the use case is a little different here: we need to delete everything except the latest record, and computing max makes the query slow.
All of the solutions above would be slow in Redshift because it is a columnar store. What would be the fastest way to do this operation in Redshift?

Please include the table DDL. Otherwise it's all speculation.
Talking about speculation, please try this untested query (I don't have Redshift):
DELETE FROM currentTable
WHERE kafkaOffset IN (
    SELECT kafkaOffset
    FROM (
        SELECT kafkaOffset
             , row_number() OVER (PARTITION BY otherTablePK ORDER BY kafkaOffset DESC) rn
        FROM currentTable
    ) t
    WHERE rn > 1
);
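To check the logic of the row_number() delete without a Redshift cluster, here is a minimal sketch that runs the same statement against the sample rows from the question in an in-memory SQLite database. SQLite (3.25 or later, for window functions) is only a stand-in here, so this says nothing about Redshift performance, and it assumes kafkaOffset is unique across the table.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption: used only
# to check the logic of the query, not its speed).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE currentTable (
        kafkaOffset INTEGER NOT NULL,
        operation TEXT NOT NULL,
        otherTablePK INTEGER NOT NULL,
        otherTableOrderStatus TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO currentTable VALUES (?, ?, ?, ?)",
    [
        (1024, "CREATE", 23, "Cooking"),
        (1025, "UPDATE", 23, "Shipped"),
        (1026, "UPDATE", 23, "Delivered"),
        (1027, "CREATE", 51, "Cooking"),
        (1028, "UPDATE", 51, "Shipped"),
        (1029, "CREATE", 52, "Cooking"),
    ],
)

# Rank rows per otherTablePK from newest to oldest offset and delete
# everything that is not ranked first.
conn.execute("""
    DELETE FROM currentTable
    WHERE kafkaOffset IN (
        SELECT kafkaOffset
        FROM (
            SELECT kafkaOffset,
                   row_number() OVER (
                       PARTITION BY otherTablePK ORDER BY kafkaOffset DESC
                   ) AS rn
            FROM currentTable
        ) t
        WHERE rn > 1
    )
""")

remaining = conn.execute(
    "SELECT * FROM currentTable ORDER BY kafkaOffset"
).fetchall()
for row in remaining:
    print(row)
```

The three remaining rows match the expected result in the question.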

Related

Get greatest rows based on multiple columns in MySQL

I have a table that looks like:
id | title | value | language
---+-------+-------+---------
1 | a | 1800 | NULL
2 | a | 1900 | NULL
3 | b | 1700 | NULL
4 | b | 1750 | NULL
5 | b | 1790 | 1
6 | c | 1892 | NULL
7 | c | 1900 | 1
8 | c | 1910 | 2
9 | d | 3020 | NULL
Would like to have the following result:
id | title | value | language
---+-------+-------+---------
2 | a | 1900 | NULL
4 | b | 1750 | NULL
5 | b | 1790 | 1
6 | c | 1892 | NULL
7 | c | 1900 | 1
8 | c | 1910 | 2
9 | d | 3020 | NULL
The point is to select the greatest value in value column of every language of every title - greatest being the latest. Secondly, would like to avoid Aggregate functions like MAX, DISTINCT or GROUP-BY as I am building a MySQL View using the MERGE algorithm, and don't want to end up creating a temporary table (See the bottom section of https://dev.mysql.com/doc/refman/5.6/en/view-algorithms.html).
So far this works, but only returns greatest row per title:
SELECT t1.title
FROM table t1
LEFT OUTER JOIN table t2
ON t1.title = t2.title
AND t1.value < t2.value
WHERE t2.title IS NULL
How can I create one that takes language into account like the results above? Thanx.
You can do it with NOT EXISTS:
select t.*
from tablename t
where not exists (
    select 1 from tablename
    where
        title = t.title and
        coalesce(language, 0) = coalesce(t.language, 0) and
        value > t.value
)
See the demo.
Results:
| id | title | value | language |
| --- | ----- | ----- | -------- |
| 2 | a | 1900 | NULL |
| 4 | b | 1750 | NULL |
| 5 | b | 1790 | 1 |
| 6 | c | 1892 | NULL |
| 7 | c | 1900 | 1 |
| 8 | c | 1910 | 2 |
| 9 | d | 3020 | NULL |
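As a quick check of the NOT EXISTS approach, the sketch below loads the sample rows from the question into an in-memory SQLite database and runs the same query. SQLite is an assumption made for the sake of a self-contained example; the query itself is unchanged. Note that coalesce(language, 0) treats NULL and 0 as the same language, which is fine for this data but would matter if 0 were a real language id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER, title TEXT, value INTEGER, language INTEGER)"
)
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [
        (1, "a", 1800, None),
        (2, "a", 1900, None),
        (3, "b", 1700, None),
        (4, "b", 1750, None),
        (5, "b", 1790, 1),
        (6, "c", 1892, None),
        (7, "c", 1900, 1),
        (8, "c", 1910, 2),
        (9, "d", 3020, None),
    ],
)

# Keep a row only if no other row with the same title and language
# has a greater value.
greatest = conn.execute("""
    SELECT t.*
    FROM tablename t
    WHERE NOT EXISTS (
        SELECT 1 FROM tablename
        WHERE title = t.title
          AND coalesce(language, 0) = coalesce(t.language, 0)
          AND value > t.value
    )
    ORDER BY t.id
""").fetchall()

kept_ids = [row[0] for row in greatest]
print(kept_ids)
```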
This answer assumes that you are using MySQL 8+, in which case your query becomes very easy. MySQL 8 and later versions support analytic (window) functions, which were added to solve problems such as this.
We can try using ROW_NUMBER here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY title, language ORDER BY value DESC) rn
FROM yourTable
)
SELECT id, title, value, language
FROM cte
WHERE rn = 1;
Demo
There is a way to handle this with earlier versions of MySQL, but it requires user variables, and tends to be very ugly. So maybe consider upgrading if you expect to have many queries similar to this one.
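The ROW_NUMBER query can be tried out on the question's sample data with a small sketch like the one below. It uses an in-memory SQLite database (3.25 or later) as a stand-in for MySQL 8, which is an assumption made only so the example is self-contained; the table name yourTable follows the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (id INTEGER, title TEXT, value INTEGER, language INTEGER)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "a", 1800, None),
        (2, "a", 1900, None),
        (3, "b", 1700, None),
        (4, "b", 1750, None),
        (5, "b", 1790, 1),
        (6, "c", 1892, None),
        (7, "c", 1900, 1),
        (8, "c", 1910, 2),
        (9, "d", 3020, None),
    ],
)

# Number the rows inside each (title, language) group from the greatest
# value downwards and keep only the first one. Rows with a NULL language
# end up in the same partition.
top_rows = conn.execute("""
    WITH cte AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY title, language ORDER BY value DESC
               ) AS rn
        FROM yourTable
    )
    SELECT id, title, value, language
    FROM cte
    WHERE rn = 1
    ORDER BY id
""").fetchall()

top_ids = [row[0] for row in top_rows]
print(top_ids)
```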
This should give you what you want.
SELECT t1.id, t1.title, t1.value, t1.language
FROM yourTable t1
LEFT OUTER JOIN yourTable t2 ON
    t1.title = t2.title AND
    IFNULL(t1.language, '') = IFNULL(t2.language, '') AND
    t1.value < t2.value
WHERE
    t2.title IS NULL;

How to write conditions in self join that changes with current iteration of a row

A table (test) has a description
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| task | varchar(2) | NO | | NULL | |
| time | int(11) | NO | | NULL | |
| type | char(1) | NO | | NULL | |
+-------+-------------+------+-----+---------+-------+
and contains data
+------+------+------+
| task | time | type |
+------+------+------+
| T1 | 1 | S |
| T2 | 2 | S |
| T1 | 7 | E |
| T1 | 8 | S |
| T1 | 14 | E |
| T2 | 15 | E |
| T1 | 16 | S |
| T2 | 17 | S |
| T3 | 20 | S |
| T1 | 21 | E |
| T3 | 25 | E |
+------+------+------+
The table records, for each task, when it was started (S) or completed (E), in time units. Is it possible to join it in a way that outputs a table with each task's start time and end time? Here (T2, 17, S) is skipped in the final output because there is no end time for it yet.
Final result:
+------+------+------+
| task | start| end |
+------+------+------+
| T1 | 1 | 7 |
| T2 | 2 | 15 |
| T1 | 8 | 14 |
| T1 | 16 | 21 |
| T3 | 20 | 25 |
+------+------+------+
As can be seen in the final result, all time frames for a given task (e.g. T1) are mutually exclusive: [(1,7),(8,14),(16,21)].
I can't figure out the condition rules for the join:
select S_table.task, S_table.time as start, E_table.time as end
from (select * from test where type='S') as S_table
left join (select * from test where type='E') as E_table
on
S_table.task = E_table.task
and
E_table.time should be greater than previous E_table.time for same task
and
E_table.time should be least within S_table.time < E_table.time
For the first row of the result table, every E_table.time (7, 15, 14, 21, 25) is greater than the S_table.time of the current row (1), but 7 is the smallest, so it is picked.
For the second row, the E_table.time values greater than the previous one (7), i.e. (15, 14, 21, 25), are all greater than 2, and the smallest one for the same task, 15, is selected.
For each start time you need to get the min time of type 'E' that is greater than that start time:
select t.* from (
    select
        t.task,
        t.time start,
        (select min(time) from test where type = 'E' and task = t.task and time > t.time) end
    from test t
    where t.type = 'S'
) t
where t.end is not null
See the demo.
Results:
| task | start | end |
| ---- | ----- | --- |
| T1 | 1 | 7 |
| T2 | 2 | 15 |
| T1 | 8 | 14 |
| T1 | 16 | 21 |
| T3 | 20 | 25 |
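The correlated-subquery approach can be reproduced with the sketch below, which loads the sample data into an in-memory SQLite database. SQLite is an assumption made to keep the example self-contained, and the aliases start_time and end_time are used instead of start and end because END is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (task TEXT, time INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        ("T1", 1, "S"), ("T2", 2, "S"), ("T1", 7, "E"), ("T1", 8, "S"),
        ("T1", 14, "E"), ("T2", 15, "E"), ("T1", 16, "S"), ("T2", 17, "S"),
        ("T3", 20, "S"), ("T1", 21, "E"), ("T3", 25, "E"),
    ],
)

# For every start row, look up the earliest end row of the same task that
# comes after it; starts without a matching end are dropped.
intervals = conn.execute("""
    SELECT task, start_time, end_time
    FROM (
        SELECT s.task,
               s.time AS start_time,
               (SELECT min(e.time)
                FROM test e
                WHERE e.type = 'E'
                  AND e.task = s.task
                  AND e.time > s.time) AS end_time
        FROM test s
        WHERE s.type = 'S'
    ) paired
    WHERE end_time IS NOT NULL
    ORDER BY start_time
""").fetchall()

for row in intervals:
    print(row)
```

The start at (T2, 17) has no later end row, so the subquery yields NULL for it and the outer filter removes it.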
You can get the same results with an inner self join like your code:
select S_table.task, S_table.time as start, E_table.time as end
from (select * from test where type='S') as S_table
inner join (select * from test where type='E') as E_table
    on S_table.task = E_table.task
    and E_table.time = (
        select min(time) from test where type = 'E' and task = S_table.task and time > S_table.time
    )
order by S_table.time

Select all queries with reference id in a chain

I have this table, in which I would like to store a chain of records.
CREATE TABLE table_name (
    id INT,
    unique_id varchar,
    reference_id varchar
);
I want to implement an SQL query for MariaDB that, for a given id, prints that record together with all records linked to it through reference_id. Something like this:
| id | unique_id | reference_id |
|----|-----------|--------------|
| 43 | 55544 | |
| 45 | 45454 | 43 |
| 66 | 55655 | 45 |
| 78 | 88877 | 66 |
| 99 | 454 | 33 |
When I select record 66, I would like to get all transactions up and down the chain, because each one uses an id that points to another. How can I implement this using a recursive CTE? Is there a better way?
Expected result for the record with id 66:
| id | unique_id | reference_id |
|----|-----------|--------------|
| 43 | 55544 | |
| 45 | 45454 | 43 |
| 66 | 55655 | 45 |
| 78 | 88877 | 66 |
I tried this, but the rows above are not printed.
select @ref:=id as id, unique_id, reference_id
from mytable
join (select @ref:=id from mytable WHERE reference_id=@ref or id = 66)tmp
where reference_id=@ref
Demo on DB Fiddle
Can you give me a hand finding a solution?
EDIT: Attempt with CTE:
with recursive cte as (
select t.*
from mytable
where t.id = 66
union all
select t.*
from cte join
mytable t
on cte.id = t.reference_id
)
select *
from cte;
I get error Unknown table 't'
I'm not familiar with recursive CTEs. You can try the query below.
select t.id, t.unique_id, @uid := t.reference_id reference_id
from (select * from mytable order by id desc) t
join (select @uid := 66) tmp
where t.id = @uid or reference_id=66
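For the recursive CTE route the question asks about, one possible shape is two recursive CTEs, one walking up the chain through reference_id and one walking down, combined with UNION. The sketch below is an illustration under assumptions, not a tested MariaDB solution: it runs on an in-memory SQLite database, stores reference_id as an integer rather than varchar, uses the hypothetical CTE names ancestors and descendants, and assumes the chain has no cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: reference_id is stored as an integer so it compares cleanly
# with id; the question declares it as varchar.
conn.execute(
    "CREATE TABLE mytable (id INTEGER, unique_id TEXT, reference_id INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (43, "55544", None),
        (45, "45454", 43),
        (66, "55655", 45),
        (78, "88877", 66),
        (99, "454", 33),
    ],
)

chain = conn.execute("""
    WITH RECURSIVE
    ancestors(id, unique_id, reference_id) AS (
        -- start row, then repeatedly follow reference_id upwards
        SELECT id, unique_id, reference_id FROM mytable WHERE id = :start
        UNION ALL
        SELECT t.id, t.unique_id, t.reference_id
        FROM mytable t
        JOIN ancestors a ON t.id = a.reference_id
    ),
    descendants(id, unique_id, reference_id) AS (
        -- start row, then repeatedly collect rows that point at the chain
        SELECT id, unique_id, reference_id FROM mytable WHERE id = :start
        UNION ALL
        SELECT t.id, t.unique_id, t.reference_id
        FROM mytable t
        JOIN descendants d ON t.reference_id = d.id
    )
    SELECT * FROM ancestors
    UNION
    SELECT * FROM descendants
    ORDER BY id
""", {"start": 66}).fetchall()

for row in chain:
    print(row)
```

UNION (rather than UNION ALL) in the final select removes the start row that both CTEs contain. Record 99 is left out because its reference_id does not connect it to the chain of record 66.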

How do I create a table without single records in MySQL

For example, I have the following table (in MySQL):
| a | 1002 |
| b | 1002 |
| c | 1015 |
| a | 1005 |
| b | 1016 |
| a | 1106 |
| d | 1006 |
| a | 1026 |
| f | 1106 |
I want to select the objects that are duplicates.
| a | 1002 |
| a | 1106 |
| a | 1026 |
| a | 1005 |
| b | 1002 |
| b | 1016 |
Thank you
If I understand the question, you want to select rows where the letter column is duplicated. One way to do it is to join against a subquery that returns the list of letter values that occur more than once.
SELECT A.letter, A.number
FROM myTable A
INNER JOIN (
    SELECT letter
    FROM myTable
    GROUP BY letter
    HAVING COUNT(*) > 1
) B ON A.letter = B.letter
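The expected output in the question keeps every row whose letter occurs more than once, so here is a minimal sketch of the join-against-a-grouped-subquery idea applied to the letter column. It runs on an in-memory SQLite database, and the column names letter and number are assumptions, since the question's table has no header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the two unnamed columns are called letter and number.
conn.execute("CREATE TABLE myTable (letter TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [
        ("a", 1002), ("b", 1002), ("c", 1015), ("a", 1005), ("b", 1016),
        ("a", 1106), ("d", 1006), ("a", 1026), ("f", 1106),
    ],
)

# The subquery lists letters that appear more than once; the join keeps
# only the rows that carry one of those letters.
duplicates = conn.execute("""
    SELECT A.letter, A.number
    FROM myTable A
    INNER JOIN (
        SELECT letter
        FROM myTable
        GROUP BY letter
        HAVING COUNT(*) > 1
    ) B ON A.letter = B.letter
    ORDER BY A.letter, A.number
""").fetchall()

for row in duplicates:
    print(row)
```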
As an alternative, if you want the list of all values where there are duplicates, you can use group_concat:
select col1, group_concat(col2)
from t
group by col1
having count(*) > 1
This does not return the exact format you want. Instead it would return:
| a | 1002,1106,1026,1005 |
| b | 1002,1016 |
But you might find it useful.

MySQL: Select all twos of a kind with highest ids

I have a table that contains multiple rows of each kind, each with a different id. (There are many kinds. Ids are unique. Both columns are indexed.)
Now I need to select the two rows with the highest ids for each kind.
Here is what I do.
select max(c.id), max(d.id)
from theTable c
left join theTable d on c.id > d.id and c.kind = d.kind
where c.id > constant
group by c.kind;
However, the query above doesn't perform very well, which is not a big surprise.
I've figured out a faster version of it...
select c.id, max(d.id)
from (select max(id) id, kind from theTable
      where id > constant group by kind) c
left join theTable d on c.id > d.id and c.kind = d.kind
group by c.kind;
... but it is still not fast enough.
Is there a more efficient way to achieve the same result?
Thanks!
Edit:
theTable is a history table, so my task is to get the current and the previous values for each kind, compare them as part of an expression (logical operations, coalesces, ifs, etc.) and determine whether the expression results differ.
Here is an example result set:
+-----------+-----------+
| max(c.id) | max(d.id) |
+-----------+-----------+
| 1747 | NULL |
| 1701 | 1432 |
| 1703 | 1434 |
| 1706 | 1437 |
| 1707 | 1438 |
| 1751 | NULL |
| 1713 | 1444 |
| 1750 | NULL |
| 1709 | 1440 |
| 1742 | 1741 |
| 1711 | 1442 |
| 1746 | 1745 |
| 1708 | 1439 |
| 1719 | 1450 |
| 1725 | 1456 |
| 1723 | 1454 |
| 1740 | 1733 |
| 1705 | 1436 |
| 1702 | 1433 |
| 1749 | 1748 |
| 1712 | 1443 |
| 1718 | 1449 |
| 1722 | 1453 |
| 1728 | 1459 |
| 1721 | 1452 |
| 1739 | 1731 |
| 1714 | 1445 |
| 1717 | 1448 |
| 1716 | 1447 |
| 1724 | 1455 |
| 1710 | 1441 |
| 1727 | 1458 |
| 1720 | 1451 |
| 1738 | NULL |
| 1715 | 1446 |
| 1704 | 1435 |
| 1726 | 1457 |
| 1758 | 1757 |
+-----------+-----------+
What if instead of producing (kind, id, id) tuples with one row for each kind, your result set was (kind, id) with two rows per kind? I'm not sure if this will be more performant without running it myself, though.
SELECT x.kind, x.id
FROM (SELECT a.kind, a.id
      FROM theTable a
      LEFT OUTER JOIN theTable b
        ON a.kind = b.kind
       AND a.id < b.id
      GROUP BY a.id
      HAVING COUNT(*) < 2
      ORDER BY b.id) x
WHERE x.id > constant
ORDER BY x.kind;
The last ORDER BY clause is just to make it easier for you to verify results, so omit it when evaluating performance. Note that some kinds may only have one id exceeding your constant, so you'll only have one (kind, id) row for that kind.
The following may perform pretty well:
select kind, max(id) as maxid,
       (select id from t t2 where t2.kind = t.kind and t2.id < max(t.id) order by id desc limit 1) as secondId
from t
group by kind
This will work well if you have an index on (kind, id).
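A variant of the same idea, sketched here under assumptions, first computes the maximum id per kind in a derived table and then looks up the next-lower id with a correlated subquery. This avoids referencing an outer aggregate inside the subquery. The example runs on an in-memory SQLite database with made-up sample rows, so it only illustrates the logic and says nothing about MySQL performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theTable (id INTEGER PRIMARY KEY, kind TEXT)")
# Index on (kind, id), as suggested above.
conn.execute("CREATE INDEX idx_kind_id ON theTable (kind, id)")
# Hypothetical sample data: kind y has only a single row.
conn.executemany(
    "INSERT INTO theTable VALUES (?, ?)",
    [(1, "x"), (2, "x"), (5, "x"), (3, "y"), (4, "z"), (6, "z")],
)

# m holds the highest id per kind; the correlated subquery fetches the
# highest id below it, or NULL when the kind has no second row.
top_two = conn.execute("""
    SELECT m.kind,
           m.maxid,
           (SELECT max(t2.id)
            FROM theTable t2
            WHERE t2.kind = m.kind
              AND t2.id < m.maxid) AS secondid
    FROM (SELECT kind, max(id) AS maxid
          FROM theTable
          GROUP BY kind) m
    ORDER BY m.kind
""").fetchall()

for row in top_two:
    print(row)
```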