ClickHouse deduplication/upsert with different functions per column - duplicates

I have a ClickHouse table which looks like this:
CREATE TABLE test
(
    id Int,
    property_id Int,
    created_at DateTime('UTC'),
    modified_at DateTime('UTC'),
    data Int,
    json_str Nullable(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (property_id, created_at);
When inserting new rows, I want to update (upsert) existing rows with matching id and property_id according to these rules:
created_at: Keep the earliest
modified_at: Keep the latest
data: Keep the value of the row with the latest modified_at
json_str: Ideally, deep merge json objects (stored as strings) of all matching rows
I did quite a bit of research and tried setting up a deduplication pipeline using a source table, a destination table (ENGINE = AggregatingMergeTree) and a materialized view (using minState, maxState, argMaxState), but I haven't been able to figure it out so far. I keep running into errors related to the primary key, partitioning, wrong aggregation functions, etc. Even a setup without merging json_str would be very helpful.

After a lot of trial and error, I found a solution (ignoring json_str for now):
-- Source table with duplicates
DROP TABLE IF EXISTS ingest;
CREATE TABLE ingest
(
    id Int,
    property_id Int,
    created_at DateTime('UTC'),  -- Should be preserved
    modified_at DateTime('UTC'), -- Should be updated
    data Int                     -- Should be updated
) ENGINE = MergeTree
ORDER BY (property_id, created_at);

-- Destination table without duplicates
DROP TABLE IF EXISTS dedup;
CREATE TABLE dedup
(
    id Int,
    property_id Int,
    created_at_state AggregateFunction(min, DateTime),
    modified_at_state AggregateFunction(max, DateTime),
    data_state AggregateFunction(argMax, Int, DateTime)
) ENGINE = SummingMergeTree
ORDER BY (property_id, id);

-- Transformation pipeline
DROP VIEW IF EXISTS pipeline;
CREATE MATERIALIZED VIEW pipeline TO dedup
AS SELECT
    id,
    property_id,
    minState(created_at) AS created_at_state,
    maxState(modified_at) AS modified_at_state,
    argMaxState(data, modified_at) AS data_state
FROM ingest
GROUP BY property_id, id;
-- Insert data with a duplicate
INSERT INTO ingest (id, property_id, created_at, modified_at, data)
VALUES (1, 100, '2022-01-01 08:00:00', '2022-01-01 08:00:00', 2000),
       (1, 100, '2022-01-01 08:01:00', '2022-01-01 08:01:00', 3000),
       (2, 100, '2022-01-01 08:00:00', '2022-01-01 08:00:00', 4000),
       (3, 200, '2022-01-01 08:05:00', '2022-01-01 08:05:00', 5000);

-- Query deduplicated table with merge functions
SELECT id,
       property_id,
       toDateTime(minMerge(created_at_state), 'UTC') AS created_at,
       toDateTime(maxMerge(modified_at_state), 'UTC') AS modified_at,
       argMaxMerge(data_state) AS data
FROM dedup
GROUP BY property_id, id
ORDER BY id, property_id;
id | property_id | created_at        | modified_at       | data
1  | 100         | 2022-01-01T08:00Z | 2022-01-01T08:01Z | 3000
2  | 100         | 2022-01-01T08:00Z | 2022-01-01T08:00Z | 4000
3  | 200         | 2022-01-01T08:05Z | 2022-01-01T08:05Z | 5000
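Deep-merging json_str would probably have to happen at read time or outside ClickHouse, but the column can at least ride along in the same pipeline by keeping the value from the row with the latest modified_at. A minimal sketch under that assumption (the json_str_state name is my own, not from the schema above, and the materialized view would have to be recreated to pick up the new expression):
-- Sketch: keep the json_str of the latest modified_at row (no deep merge)
ALTER TABLE ingest ADD COLUMN json_str Nullable(String);
ALTER TABLE dedup ADD COLUMN json_str_state AggregateFunction(argMax, Nullable(String), DateTime);
-- Recreate the materialized view with one extra aggregate:
--     argMaxState(json_str, modified_at) AS json_str_state
-- and read it back in the final query with:
--     argMaxMerge(json_str_state) AS json_str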

Related

Generate Oracle JSON from CLOB datatype column

The requirement is to generate JSON from a CLOB data type column.
Environment: Oracle 12.2.
I have a table with fields id (number data type) and details (CLOB type), like below:
ID  | details
100 | 134332:10.0, 1481422:1.976, 1483734:1.688, 2835036:1.371
101 | 134331:0.742, 319892:0.734, 1558987:0.7, 2132090:0.697
Example output:
{
    "pId": 100,
    "cid": [
        { "cId": 134332, "wt": "10.0" },
        { "cId": 1481422, "wt": "1.976" },
        { "cId": 1483734, "wt": "1.688" },
        { "cId": 2835036, "wt": "1.371" }
    ]
}
Please help with an Oracle SQL query to generate this output.
Below I set up a table with a few input rows for testing; then I show one way you can solve your problem, and the output from that query. I didn't try to write the most efficient (fastest) query; rather, I hope this will show you how this can be done. If speed then becomes a problem, you can work on that. (In that case, it would be best to reconsider the inputs first, which break First Normal Form.)
I added a couple of input rows to see how null is handled; you can decide if that is the desired handling. (It is possible that no nulls occur in your data, in which case you should have said so when you asked the question.)
Setting up the test table:
create table input_tbl (id number primary key, details clob);
insert into input_tbl (id, details) values
(100, to_clob('134332:10.0, 1481422:1.976, 1483734:1.688, 2835036:1.371'));
insert into input_tbl (id, details) values
(101, '134331:0.742, 319892:0.734, 1558987:0.7, 2132090:0.697');
insert into input_tbl (id, details) values
(102, null);
insert into input_tbl (id, details) values
(103, '2332042: ');
commit;
Query:
with
    tokenized (pid, ord, cid, wt) as (
        select i.id, q.ord, q.cid, q.wt
        from input_tbl i cross apply
        (
            select level as ord,
                   regexp_substr(details, '(, |^)([^:]+):', 1, level, null, 2) as cid,
                   regexp_substr(details, ':([^,]*)', 1, level, null, 1) as wt
            from dual
            connect by level <= regexp_count(details, ':')
        ) q
    )
, arrayed (pid, json_arr) as (
        select pid, json_arrayagg(json_object(key 'cId' value to_number(trim(cid)),
                                              key 'wt' value to_number(trim(wt))))
        from tokenized
        group by pid
    )
select pid, json_object(key 'pId' value pid, key 'cid' value json_arr) as json
from arrayed
;
Output:
PID JSON
---- -----------------------------------------------------------------------------------------------------------------------------
100 {"pId":100,"cid":[{"cId":134332,"wt":10},{"cId":2835036,"wt":1.371},{"cId":1483734,"wt":1.688},{"cId":1481422,"wt":1.976}]}
101 {"pId":101,"cid":[{"cId":134331,"wt":0.742},{"cId":2132090,"wt":0.697},{"cId":1558987,"wt":0.7},{"cId":319892,"wt":0.734}]}
102 {"pId":102,"cid":[{"cId":null,"wt":null}]}
103 {"pId":103,"cid":[{"cId":2332042,"wt":null}]}

Looking For Most Frequent Values SQL Statement

I have a data set that looks like this:
id | Unit_Ids
1 | {"unit_ids" : ["5442","28397"]}
2 | {"unit_ids" : ["5442","3492","2290"]}
etc.
I'm trying to find the most frequently appearing value in Unit_Ids. In my example, 5442 appears in both rows 1 and 2, so it would be the most frequent value. I was just having trouble finding a good way of writing this statement.
Thank you in advance!
EDIT: Sorry everyone I'm working with MySQL
If SQL Server 2016+
Example
Declare @YourTable Table ([id] int,[Unit_Ids] varchar(max))
Insert Into @YourTable Values
 (1,'{"unit_ids" : ["5442","28397"]}')
,(2,'{"unit_ids" : ["5442","3492","2290"]}')

Select top 1 value
From @YourTable A
Cross Apply OpenJSON([Unit_Ids],'$.unit_ids') R
Order By sum(1) over (partition by value) desc
Returns
value
5442
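Since the question turned out to be about MySQL, a similar approach is possible in MySQL 8.0+ with JSON_TABLE. A sketch, assuming a table named YourTable whose Unit_Ids column holds valid JSON:
-- Expand each unit_ids array into rows, then count each value
SELECT jt.unit_id, COUNT(*) AS cnt
FROM YourTable,
     JSON_TABLE(Unit_Ids, '$.unit_ids[*]'
                COLUMNS (unit_id VARCHAR(20) PATH '$')) AS jt
GROUP BY jt.unit_id
ORDER BY cnt DESC
LIMIT 1;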
I'm assuming you are storing JSON strings in the Unit_Ids field. If you do that, you won't be able to extract or aggregate the data stored in that field.
You can, however, create a child table and query it to get aggregated data, i.e.:
-- Master table
create table parent(id number primary key);
-- List of units per parent
create table units(
id number not null,
parent_id number not null,
primary key (id, parent_id),
foreign key (parent_id) references parent(id)
);
-- Insert sample data
insert into parent values (1);
insert into parent values (2);
insert into units(parent_id, id) values(1, 5442);
insert into units(parent_id, id) values(1, 28397);
insert into units(parent_id, id) values(2, 5442);
insert into units(parent_id, id) values(2, 3492);
insert into units(parent_id, id) values(2, 2290);
-- Count the number of times a unit id is in the table
select id, count(id) from units group by id;
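To pull out just the most frequent unit from that child table, sort the aggregate and keep the top row (MySQL syntax):
-- Most frequent unit id across all parents
select id, count(*) as cnt
from units
group by id
order by cnt desc
limit 1;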

Sort the table by C_criticality field in SQL query

I have the fields wonum and C_criticality in the workorder table.
The C_criticality field can hold:
1. critical
2. non-critical
3. tools
4. null or empty
Some rows have a null (empty) C_criticality, and those should come last.
I need the output sorted in the order critical, non-critical, tools, and then the null (empty) values.
CREATE TABLE workorder
(
    wonum int,
    C_criticality varchar(255)
);
INSERT INTO workorder (wonum,C_criticality)
VALUES (2, 'critical');
INSERT INTO workorder (wonum,C_criticality)
VALUES (1, 'non-critical');
INSERT INTO workorder (wonum,C_criticality)
VALUES (15, 'critical');
INSERT INTO workorder (wonum,C_criticality)
VALUES (12, 'tool');
INSERT INTO workorder (wonum,C_criticality)
VALUES (21, 'non-critical');
INSERT INTO workorder (wonum,C_criticality)
VALUES (11, '');
Output:
C_criticality  wonum
critical       2
critical       15
non-critical   21
tool           12
null           11
We can try a two-tiered sort using ISNULL and FIELD:
SELECT *
FROM yourTable
ORDER BY
    ISNULL(C_criticality), -- 0 for non NULL, 1 for NULL
    FIELD(C_criticality, 'critical', 'non-critical', 'tools');
The call to FIELD will order according to the list you provided.
This works in all SQL engines, not just MySQL:
select *
from workorder
order by case when C_criticality = 'critical' then 1
              when C_criticality = 'non-critical' then 2
              when C_criticality = 'tools' then 3
              else 4
         end;
Since you have tagged mysql, the query would be something like:
SELECT * FROM workorder ORDER BY FIELD(C_criticality, 'critical', 'non-critical', 'tools', 'null');
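One caveat: FIELD() returns 0 for anything not in its list, including NULL and the empty string, so those rows would sort first rather than last. A sketch of one way to push them to the end:
-- Sort the "no match" flag (0 for matched, 1 for unmatched) first,
-- then by position in the list.
SELECT *
FROM workorder
ORDER BY FIELD(C_criticality, 'critical', 'non-critical', 'tools') = 0,
         FIELD(C_criticality, 'critical', 'non-critical', 'tools');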
It depends on the DBMS. Some offer the options NULLS FIRST and NULLS LAST.
For MySQL you may find this article interesting: https://www.designcise.com/web/tutorial/how-to-order-null-values-first-or-last-in-mysql
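For engines that do support them (PostgreSQL, Oracle, and others, but not MySQL), the clause looks like this; note it only affects real NULLs, not empty strings:
-- For these particular values, plain alphabetical order happens to
-- match the desired critical / non-critical / tools sequence.
SELECT *
FROM workorder
ORDER BY C_criticality NULLS LAST;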

Update query - incrementing int field value and if statement

I am currently learning SQL through my local MySQL db. I have a table named transactions that has 5 fields. I am running an update query for the row where name = 'jane'. Essentially, I want to integrate an if statement: when date_created - tran_date = 1 month, reset the values to transactions = 0 and tran_date = 0000-00-00, and change date_created to the new current date.
Query (help), UPDATED:
UPDATE transactions SET transactions = transactions + 1, tran_date = CURDATE() WHERE name = 'jim'
Create tables and set values:
CREATE TABLE transactions
(
id int auto_increment primary key,
date_created DATE,
name varchar(20),
transactions int(6),
tran_date DATE
);
INSERT INTO transactions
(date_created, name, transactions, tran_date)
VALUES
(NOW(), 'jim', 0, '0000-00-00'),
(NOW(), 'jane', 0, '0000-00-00');
Your UPDATE syntax is wrong:
UPDATE transactions SET transactions = transactions + 1, tran_date = CURDATE() WHRE name = 'jim'
You must not use AND in the SET clause; use a comma in its place. (WHRE should also be WHERE.)
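For the monthly reset described in the question, something along these lines might work; this is a sketch of one interpretation of the rule, not a tested solution:
-- Reset rows whose created date is a month or more in the past.
-- '0000-00-00' only inserts under a permissive sql_mode (no NO_ZERO_DATE).
UPDATE transactions
SET transactions = 0,
    tran_date = '0000-00-00',
    date_created = CURDATE()
WHERE name = 'jane'
  AND date_created <= CURDATE() - INTERVAL 1 MONTH;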

Insert record into table with position without updating all the records' position fields

I am using MySQL, and I don't have a good way to do this.
I have a table with a position field, which I need to keep track of, holding values from 1 to 10,000.
Let's say I insert a record in the middle, at the 5000th position. Positions 5000 to 10,000 then need to be shifted: the old 5000 becomes 5001, 5001 becomes 5002, and so on.
Is there a good way to implement this without touching so many records each time a single position is added?
Inserting at the 1st position is the worst case.
I'd rethink the database design. If you're going to be limited to on the order of 10K records then it's not too bad, but if this is going to grow without bound then you'll want to do something else. I'm not sure what you are doing, but if you want a simple ordering (assuming you're not doing a lot of traversal) then you can have a prev_id and next_id column to indicate sibling relationships (see the sketch after the query below). Here's the answer to your question, though:
update some_table
set some_position = some_position + 1
where some_position >= 5000 and some_position <= 10000
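The prev_id/next_id idea mentioned above could look like the sketch below; the table and column names are hypothetical:
-- Linked-list ordering: each row points at its neighbours, so an
-- insert between two rows touches exactly three rows.
CREATE TABLE items (
    id      INT PRIMARY KEY,
    prev_id INT NULL, -- preceding row, NULL for the head
    next_id INT NULL  -- following row, NULL for the tail
);
-- Insert row 42 between rows 10 and 11:
INSERT INTO items (id, prev_id, next_id) VALUES (42, 10, 11);
UPDATE items SET next_id = 42 WHERE id = 10;
UPDATE items SET prev_id = 42 WHERE id = 11;
Reading rows back in order then means walking the chain (or using a recursive CTE), which is the traversal cost mentioned above.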
You can try the below approach :
USE tempdb;
GO
CREATE TABLE dbo.Test
(
ID int identity(1,1) primary key clustered,
OrderNo int,
CreatedDate datetime
);
--Insert values for testing the approach
INSERT INTO dbo.Test
VALUES
(1, GETUTCDATE()),
(2, GETUTCDATE()),
(3, GETUTCDATE()),
(4, GETUTCDATE()),
(5, GETUTCDATE()),
(6, GETUTCDATE());
SELECT *
FROM dbo.Test;
INSERT INTO dbo.Test
VALUES
(3, GETUTCDATE()),
(3, GETUTCDATE());
SELECT *
FROM dbo.Test;
--To accomplish correct order using ROW_NUMBER()
SELECT ID,
OrderNo,
CreatedDate,
ROW_NUMBER() OVER(ORDER BY OrderNo, ID) AS Rno
FROM dbo.Test;
--Again ordering change
INSERT INTO dbo.Test
VALUES
(3, GETUTCDATE()),
(4, GETUTCDATE());
SELECT ID,
OrderNo,
CreatedDate,
ROW_NUMBER() OVER(ORDER BY OrderNo, ID) AS Rno
FROM dbo.Test
DROP TABLE dbo.Test;