Generate Oracle JSON from CLOB datatype column

The requirement is to generate JSON from a CLOB data type column.
Environment: Oracle 12.2.
I have a table with columns id (NUMBER data type) and details (CLOB data type), like below:
ID - details
100 - 134332:10.0, 1481422:1.976, 1483734:1.688, 2835036:1.371
101 - 134331:0.742, 319892:0.734, 1558987:0.7, 2132090:0.697
Example output:
{
"pId":100,
"cid":[
{
"cId":134332,
"wt":"10.0"
},
{
"cId":1481422,
"wt":"1.976"
},
{
"cId":1483734,
"wt":"1.688"
},
{
"cId":2835036,
"wt":"1.371"
}
]
}
Please help with an Oracle SQL query to generate this output.

Below I set up a table with a few input rows for testing; then I show one way you can solve your problem, and the output from that query. I didn't try to write the most efficient (fastest) query; rather, I hope this will show you how this can be done. Then if speed is a problem you can work on that. (In that case, it would be best to reconsider the inputs first, which break First Normal Form.)
I added a couple of input rows for testing, to see how null is handled. You can decide if that is the desired handling. (It is possible that no nulls are possible in your data - in which case you should have said so when you asked the question.)
Setting up the test table:
create table input_tbl (id number primary key, details clob);
insert into input_tbl (id, details) values
(100, to_clob('134332:10.0, 1481422:1.976, 1483734:1.688, 2835036:1.371'));
insert into input_tbl (id, details) values
(101, '134331:0.742, 319892:0.734, 1558987:0.7, 2132090:0.697');
insert into input_tbl (id, details) values
(102, null);
insert into input_tbl (id, details) values
(103, '2332042: ');
commit;
Query:
with
tokenized (pid, ord, cid, wt) as (
select i.id, q.ord, q.cid, q.wt
from input_tbl i cross apply
(
select level as ord,
regexp_substr(details, '(, |^)([^:]+):', 1, level, null, 2)
as cid,
regexp_substr(details, ':([^,]*)', 1, level, null, 1) as wt
from dual
connect by level <= regexp_count(details, ':')
) q
)
, arrayed (pid, json_arr) as (
select pid, json_arrayagg(json_object(key 'cId' value to_number(trim(cid)),
key 'wt' value to_number(trim(wt)))
)
from tokenized
group by pid
)
select pid, json_object(key 'pId' value pid, key 'cid' value json_arr) as json
from arrayed
;
Output:
PID JSON
---- -----------------------------------------------------------------------------------------------------------------------------
100 {"pId":100,"cid":[{"cId":134332,"wt":10},{"cId":2835036,"wt":1.371},{"cId":1483734,"wt":1.688},{"cId":1481422,"wt":1.976}]}
101 {"pId":101,"cid":[{"cId":134331,"wt":0.742},{"cId":2132090,"wt":0.697},{"cId":1558987,"wt":0.7},{"cId":319892,"wt":0.734}]}
102 {"pId":102,"cid":[{"cId":null,"wt":null}]}
103 {"pId":103,"cid":[{"cId":2332042,"wt":null}]}
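One detail worth noting: in the output above, the array elements are not in the original token order from the details string (the ord column is computed but never used). If that order matters, JSON_ARRAYAGG accepts an ORDER BY clause. Here is the same query again with only that one change - a sketch against the test table above, not a tuned version:
-- Same query as above; the only change is ORDER BY ord inside JSON_ARRAYAGG
with
tokenized (pid, ord, cid, wt) as (
select i.id, q.ord, q.cid, q.wt
from input_tbl i cross apply
(
select level as ord,
regexp_substr(details, '(, |^)([^:]+):', 1, level, null, 2)
as cid,
regexp_substr(details, ':([^,]*)', 1, level, null, 1) as wt
from dual
connect by level <= regexp_count(details, ':')
) q
)
, arrayed (pid, json_arr) as (
select pid, json_arrayagg(json_object(key 'cId' value to_number(trim(cid)),
key 'wt' value to_number(trim(wt)))
order by ord)
from tokenized
group by pid
)
select pid, json_object(key 'pId' value pid, key 'cid' value json_arr) as json
from arrayed
;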

Related

ClickHouse deduplication/upsert with different functions per column

I have a ClickHouse table which looks like this:
CREATE TABLE test
(
id Int,
property_id Int,
created_at DateTime('UTC'),
modified_at DateTime('UTC'),
data Int,
json_str Nullable(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (property_id, created_at);
When inserting new rows, I want to update (upsert) existing rows with matching id and property_id according to these rules:
created_at: Keep the earliest
modified_at: Keep the latest
data: Keep the value of the row with the latest modified_at
json_str: Ideally, deep merge json objects (stored as strings) of all matching rows
I did quite a bit of research and tried setting up a deduplication pipeline, using a source table, a destination table (ENGINE = AggregatingMergeTree) and a materialized view (using minState, maxState, argMaxState) but I couldn't figure it out so far. I'm running into errors related to primary key, partitioning, wrong aggregation functions, etc. Even a setup without merging json_str would be very helpful.
After a lot of trial and error, I found a solution (ignoring json_str for now):
-- Source table with duplicates
DROP TABLE IF EXISTS ingest;
CREATE TABLE ingest
(
id Int,
property_id Int,
created_at DateTime('UTC'), -- Should be preserved
modified_at DateTime('UTC'), -- Should be updated
data Int -- Should be updated
) ENGINE = MergeTree
ORDER BY (property_id, created_at);
-- Destination table without duplicates
DROP TABLE IF EXISTS dedup;
CREATE TABLE dedup
(
id Int,
property_id Int,
created_at_state AggregateFunction(min, DateTime),
modified_at_state AggregateFunction(max, DateTime),
data_state AggregateFunction(argMax, Int, DateTime)
) ENGINE = SummingMergeTree
ORDER BY (property_id, id);
-- Transformation pipeline
DROP VIEW IF EXISTS pipeline;
CREATE MATERIALIZED VIEW pipeline TO dedup
AS SELECT
id,
property_id,
minState(created_at) AS created_at_state,
maxState(modified_at) AS modified_at_state,
argMaxState(data, modified_at) AS data_state
FROM ingest
GROUP BY property_id, id;
-- Insert data with a duplicate
INSERT INTO ingest (id, property_id, created_at, modified_at, data)
VALUES (1, 100, '2022-01-01 08:00:00', '2022-01-01 08:00:00', 2000),
(1, 100, '2022-01-01 08:01:00', '2022-01-01 08:01:00', 3000),
(2, 100, '2022-01-01 08:00:00', '2022-01-01 08:00:00', 4000),
(3, 200, '2022-01-01 08:05:00', '2022-01-01 08:05:00', 5000);
-- Query deduplicated table with merge functions
SELECT id,
property_id,
toDateTime(minMerge(created_at_state), 'UTC') AS created_at,
toDateTime(maxMerge(modified_at_state), 'UTC') AS modified_at,
argMaxMerge(data_state) AS data
FROM dedup
GROUP BY property_id, id
ORDER BY id, property_id;
id | property_id | created_at        | modified_at       | data
---+-------------+-------------------+-------------------+-----
 1 |         100 | 2022-01-01T08:00Z | 2022-01-01T08:01Z | 3000
 2 |         100 | 2022-01-01T08:00Z | 2022-01-01T08:00Z | 4000
 3 |         200 | 2022-01-01T08:05Z | 2022-01-01T08:05Z | 5000
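The json_str column is still left out above. One way to carry it through the same pipeline, if a deep merge is not strictly required, is to treat it like data and keep the value from the row with the latest modified_at; an actual deep merge of the JSON objects would have to happen outside ClickHouse (for example after collecting all values with groupArrayState / groupArrayMerge). The following is an untested sketch and assumes the ingest table also gets a json_str Nullable(String) column:
-- Sketch: extra state column on the destination table
ALTER TABLE dedup
ADD COLUMN json_str_state AggregateFunction(argMax, Nullable(String), DateTime);
-- Recreate the materialized view with the extra aggregation
DROP VIEW IF EXISTS pipeline;
CREATE MATERIALIZED VIEW pipeline TO dedup
AS SELECT
id,
property_id,
minState(created_at) AS created_at_state,
maxState(modified_at) AS modified_at_state,
argMaxState(data, modified_at) AS data_state,
argMaxState(json_str, modified_at) AS json_str_state
FROM ingest
GROUP BY property_id, id;
-- Read json_str back alongside the other columns
SELECT id,
property_id,
argMaxMerge(json_str_state) AS json_str
FROM dedup
GROUP BY property_id, id
ORDER BY id, property_id;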

Need advice on database schema for storing complex json [closed]

I can't decide whether to use Postgres, Cassandra or another NoSQL database.
the json collection is like this
[{
id:
name:
year:
credits: [{id: "", name: "", role: ""}...]
genres: [{id: "", name: ""}...]
seasons: [
{ id:
name:
season_number:
episodes: [
{id: "",
name: "",
season_number: "",
episode_number: ""},
...]
},
...]
},
...]
From the above, the collection itself is an array of objects. Each object has four nested arrays for values, namely the keys credits, genres, seasons and episodes.
The full object needs to be read every time, because all the fields need to be shown on the frontend.
The items in the episodes key will be inserted or deleted the most.
They are grouped, split into seasons, which is also an array.
The initial SQL table schema was one row per episode, but that creates a lot of JSON redundancy.
It's important to have a query that returns the above object in its original structure.
I concur with Laurenz' answer that PostgreSQL would be a better option because of the update/deletes. You can store the data in standard SQL tables and use PostgreSQL JSON support to generate the necessary JSON.
To demonstrate, I took your example and created a working example in db_fiddle, which you can access here: https://www.db-fiddle.com/f/81Acp5wdEEaWMMxKfvbUaS/3.
create schema test;
create table test.tvshow ( id serial, name text );
create table test.credit ( id serial, show_id integer, name text , role text );
create table test.genre (id serial, name text, show_id integer );
create table test.season ( id serial, name text, season_number text, show_id integer);
create table test.episode ( id serial, name text, season_id integer, episode_number text, show_id integer);
insert into test.tvshow (name) values ('gunsmoke');
insert into test.credit (name, show_id) values ('credits #1', 1);
insert into test.credit (name, show_id) values ('credits #2', 1);
insert into test.credit (name, show_id) values ('credits #3', 1);
insert into test.genre (name, show_id) values ('western', 1);
insert into test.genre (name, show_id) values ('drama', 1);
insert into test.season (name, show_id, season_number) values ('Season 1', 1, '01');
insert into test.season (name, show_id, season_number) values ('Season 2', 1, '02');
insert into test.episode (name, season_id, episode_number, show_id) values ('Episode#1',1, '01', 1);
insert into test.episode (name, season_id, episode_number, show_id) values ('Episode#2',1, '01', 1);
insert into test.episode (name, season_id, episode_number, show_id) values ('Episode#3',1, '01', 1);
insert into test.episode (name, season_id, episode_number, show_id) values ('Episode#1',2, '01', 1);
insert into test.episode (name, season_id, episode_number, show_id) values ('Episode#1',2, '02', 1);
Once populated you can execute the following SQL.
SELECT array_to_json(array_agg(row_to_json(t, true)))
FROM (
SELECT name
,(
SELECT array_to_json(array_agg(row_to_json(c)))
FROM (
SELECT id
,name
,ROLE
FROM test.credit
WHERE test.credit.show_id = test.tvshow.id
) c
) AS credits
,
--
(
SELECT array_to_json(array_agg(row_to_json(g)))
FROM (
SELECT id
,name
FROM test.genre
WHERE test.genre.show_id = test.tvshow.id
) g
) AS genres
,
--
(
SELECT array_to_json(array_agg(row_to_json(s, true)))
FROM (
SELECT id
,name
,season_number
,(
SELECT array_to_json(array_agg(row_to_json(e)))
FROM (
SELECT id
,name
,season_number
,episode_number
FROM test.episode
WHERE test.episode.show_id = test.tvshow.id
AND test.episode.season_id = test.season.id
) e
) AS episodes
FROM test.season
WHERE test.season.show_id = test.tvshow.id
) s
) AS seasons
--
FROM test.tvshow
) t
That will return a row with the JSON for all the shows.
From what you write, it sounds like you will want to update parts of that JSON frequently. In that case, all I can recommend if you want to use PostgreSQL is to not store the data as JSON, but to split it up into several tables: credits, genres, seasons and episodes.
Sure, this will make the INSERT and the SELECT of the data somewhat more complicated, as you will have to split up and assemble the JSON, but it will pay off as soon as you try to implement a request like "delete a certain episode" or you want to efficiently search for data using pattern matching.
Also, if you need to retain the whitespace or attribute order in the JSON, this will not work as expected. But that should not matter for JSON data.
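For example, with the normalized schema from the query above, removing a single episode is a one-row DELETE instead of rewriting a large JSON document. A small sketch against the test data above (the values are illustrative):
-- Remove one episode: season 2, episode '02' of show 1 in the test data
delete from test.episode
where show_id = 1
and season_id = 2
and episode_number = '02';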

Looking For Most Frequent Values SQL Statement

I have a data set that looks like this:
id | Unit_Ids
1 | {"unit_ids" : ["5442","28397"]}
2 | {"unit_ids" : ["5442","3492","2290"]}
etc.
And I'm trying to find the most frequently appearing values in Unit_Ids. As in my example, 5442 appears in both rows 1 and 2, so it would be the most frequent value. I was just having trouble finding a good way of creating this statement.
Thank you in advance!
EDIT: Sorry everyone I'm working with MySQL
If SQL Server 2016+
Example
Declare #YourTable Table ([id] int,[Unit_Ids] varchar(max))
Insert Into #YourTable Values
(1,'{"unit_ids" : ["5442","28397"]}')
,(2,'{"unit_ids" : ["5442","3492","2290"]}')
Select top 1 value
From #YourTable A
Cross Apply OpenJSON([Unit_Ids],'$.unit_ids') R
Order By sum(1) over (partition by value) desc
Returns
value
5442
I'm assuming you are storing JSON strings in the Unit_Ids field. If you do that, you won't be able to extract or aggregate data stored in that field.
You can however create a child table and query it to get aggregated data. Ie:
-- Master table
create table parent(id int primary key);
-- List of units per parent
create table units(
id int not null,
parent_id int not null,
primary key (id, parent_id),
foreign key (parent_id) references parent(id)
);
-- Insert sample data
insert into parent values (1);
insert into parent values (2);
insert into units(parent_id, id) values(1, 5442);
insert into units(parent_id, id) values(1, 28397);
insert into units(parent_id, id) values(2, 5442);
insert into units(parent_id, id) values(2, 3492);
insert into units(parent_id, id) values(2, 2290);
-- Count the number of times a unit id is in the table
select id, count(id) from units group by id;

How to do Where clause on simple Json Array in SQL Server 2017?

Say I have a column in my database called attributes which has this value as an example:
{"pages":["Page1"]}
How can I do a WHERE clause so I can filter down to rows that have "Page1" in it?
select JSON_QUERY(Attributes, '$.pages')
from Table
where JSON_QUERY(Attributes, '$.pages') in ('Page1')
Edit:
From the docs, it seems like this might work, though it seems overly complicated for what it is doing.
select count(*)
from T c
cross apply Openjson(c.Attributes)
with (pages nvarchar(max) '$.pages' as json)
outer apply openjson(pages)
with ([page] nvarchar(100) '$')
where [page] = 'Page1'
Something like this:
use tempdb
create table T(id int, Attributes nvarchar(max))
insert into T(id,Attributes) values (1, '{"pages":["Page1"]}')
insert into T(id,Attributes) values (2, '{"pages":["Page3","Page4"]}')
insert into T(id,Attributes) values (3, '{"pages":["Page3","Page1"]}')
select *
from T
where exists
(
select *
from openjson(T.Attributes,'$.pages')
where value = 'Page1'
)
returns
id Attributes
----------- ---------------------------
1 {"pages":["Page1"]}
3 {"pages":["Page3","Page1"]}
(2 rows affected)

auto_increment column for a group of rows?

I am trying to figure out how to do a table with 3 columns:
unique_id, type, version
Where unique_id is AUTO_INCREMENT for each record, and version is AUTO_INCREMENT for each type.
The purpose being, when I insert I only have to specify 'type', and the unique_id and version are automatically generated, e.g.:
insert type 'a', then values are: 1 , a , 1
insert type 'a', then values are: 2 , a , 2
insert type 'a', then values are: 3 , a , 3
insert type 'b', then values are: 4 , b , 1
insert type 'b', then values are: 5 , b , 2
insert type 'a', then values are: 6 , a , 4
Also would it be fair to say that such a setup is not really normalised? That instead it should be two tables?
You can't have auto-generated sequences for each type, but you can generate your own.
insert into the_table (type, version)
select 'a', 1 + IFNULL(max(version), 0)
from the_table
where type = 'a';
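For completeness, here is a sketch of the table that insert assumes (the name the_table and the column sizes are placeholders, not from the question). Note that two concurrent inserts can compute the same version; a unique key over (type, version) turns that race into an error you can retry, or you can serialize the inserts with locking:
-- Hypothetical DDL matching the insert above (MySQL)
create table the_table (
unique_id int not null auto_increment primary key, -- auto-increment for each record
type varchar(10) not null,
version int not null, -- per-type counter, maintained by the insert below
unique key uq_type_version (type, version) -- catches duplicate versions from concurrent inserts
);
-- Usage: only the type is supplied; version becomes max(version) + 1 for that type
insert into the_table (type, version)
select 'a', 1 + ifnull(max(version), 0)
from the_table
where type = 'a';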
If the table will have a lot of rows (and depending on what you're trying to do), then it would be best to have separate tables for each version. Then, insert each type into its proper version-table.