Unknown duplicates from querying a nested JSON

I would like to run a text search on a JSON object stored in a table.
I have a table called Audio that is structured as follows:
id| keyword | transcript | user_id | company_id | client_id
-----------------------------------------------------------
This is the JSON data structure of transcript:
{"transcript": [
{"duration": 2390.0,
"interval": [140.0, 2530.0],
"speaker": "Speaker_2",
"words": [
{"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"},
{"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
{"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"},
{"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"},
{"p": 4, "s": 1010, "e": 1250, "c": 0.814, "w": "recorded"}
]
},
{"duration": 4360.0,
"interval": [3280.0, 7640.0],
"speaker": "Speaker_1",
"words": [
{"p": 5, "s": 5000, "e": 5020, "c": 0.079, "w": "as"},
{"p": 6, "s": 5020, "e": 5100, "c": 0.238, "w": "a"},
{"p": 7, "s": 5100, "e": 5409, "c": 0.689, "w": "group"},
{"p": 8, "s": 5410, "e": 5590, "c": 0.802, "w": "called"},
{"p": 9, "s": 5590, "e": 5870, "c": 0.834, "w": "tricks"}
]
},
...
]}
What I am trying to do is run a text search on the "w" field within "words". This is the query I tried:
WITH info_data AS (
SELECT transcript_info->'words' AS info
FROM Audio t, json_array_elements(transcript->'transcript') AS transcript_info)
SELECT info_item->>'w', id
FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item
WHERE info_item->>'w' ilike '%this';
Right now the table has five rows in total: four contain transcript data and the fifth is null. However, I got the following result, in which even the row without data produces output:
?column? | id
----------+----
This | 2
This | 5
This | 1
This | 3
This | 4
This | 2
This | 5
I would love to know what is wrong with my query and whether there is a more efficient way of doing this.

The problem is that you make a cartesian join between table Audio on the one hand and info_data and info_item on the other hand (there is an implicit lateral join between these latter two) here:
FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item
You can solve this by adding Audio.id to the select list of the CTE and then adding WHERE Audio.id = idata.id to the outer query (idata is the alias the outer query gives to info_data).
It is doubtful that this is the most efficient solution (CTEs rarely are). If you just want to get those rows where the word "this" is a word in the transcript, then you are most likely better off like this:
SELECT DISTINCT id
FROM (
SELECT id, transcript_info->'words' AS info
FROM Audio, json_array_elements(transcript->'transcript') AS transcript_info) AS t,
json_array_elements(info) AS words
WHERE words->>'w' ILIKE 'this';
Note that the leading % in the pattern string is very inefficient. Since very few words in the English language other than "this" end in "this", I have taken the liberty of removing it.
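To see why the unqualified join produces extra rows, here is a minimal Python sketch of the same logic. The toy `audio` dictionary, its ids and its words are made up for illustration; it only models the join behaviour and is not a replacement for the SQL above.

```python
# Toy model of the Audio table: id -> list of words (hypothetical data).
audio = {
    1: ["This", "call"],
    2: ["This", "is"],
    3: None,  # a row whose transcript is NULL
}

# What the original CTE yields: one word list per row with data, id dropped.
info_data = [words for words in audio.values() if words is not None]

# Original query: every Audio row is paired with every id-less CTE row.
cross = [(w, audio_id)
         for audio_id in audio
         for words in info_data
         for w in words
         if w.lower().endswith("this")]

# Fixed query: the id travels with the words, so nothing is cross joined.
keyed = [(w, audio_id)
         for audio_id, words in audio.items() if words is not None
         for w in words
         if w.lower().endswith("this")]

print(len(cross))  # 6: two matching words paired with all three Audio rows
print(keyed)       # [('This', 1), ('This', 2)]
```

In the cross-joined version even id 3, which has no transcript, shows up in the result, which is the symptom described in the question.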

Related

How to divide json array in hive table into batches the size of a given batch size?

For example, the input is a Hive table like this:
id | entities
---+---------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}, {"a": "a3", "b": "b3"}, {"a": "a4", "b": "b4"}, {"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}, {"c": "c3", "d": "d3"}, {"c": "c4", "d": "d4"}, {"c": "c5", "d": "d5"}]
And with batchSize = 3 we should get no more than three elements in each array, like this:
id | entities
---+---------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}, {"a": "a3", "b": "b3"}]
1  | [{"a": "a4", "b": "b4"}, {"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}, {"c": "c3", "d": "d3"}]
2  | [{"c": "c4", "d": "d4"}, {"c": "c5", "d": "d5"}]
With batchSize = 2 no more than two elements in arrays:
id | entities
---+---------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}]
1  | [{"a": "a3", "b": "b3"}, {"a": "a4", "b": "b4"}]
1  | [{"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}]
2  | [{"c": "c3", "d": "d3"}, {"c": "c4", "d": "d4"}]
2  | [{"c": "c5", "d": "d5"}]
Are there any built-ins for this in Hive, or can you suggest a Spark UDF?
I tried to write a UDF, but I don't know how to make it return several rows with subarrays instead of one. A UDF usually takes a few columns and returns a single result, and I don't know how to split the array into several rows of subarrays.
Thanks!

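You can use the slice function to get subarrays and then explode to get multiple rows.
import org.apache.spark.sql.functions.{explode, expr}
import spark.implicits._

val df = // input; entities must be an array column, not a JSON string
val batchSize = 2
// sequence(...) yields the 1-based start index of each batch;
// slice(...) takes up to batchSize elements starting at that index.
df.select('id, explode(expr(
  s"transform(sequence(1, size(entities), $batchSize), " +
  s"s -> slice(entities, s, $batchSize))")))
  .show(false)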
+---+--------------------+
|id |col |
+---+--------------------+
|1 |[[a1, b1], [a2, b2]]|
|1 |[[a3, b3], [a4, b4]]|
|1 |[[a5, b5]] |
+---+--------------------+
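The batching logic itself is independent of Spark. As a rough illustration, the same start-index-plus-slice idea can be sketched in plain Python; the helper name `split_into_batches` and the sample row are assumptions made for this example only.

```python
def split_into_batches(entities, batch_size):
    """Return consecutive sub-lists holding at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # range() plays the role of sequence(1, size, batchSize), but 0-based;
    # the list slice plays the role of slice(entities, start, batchSize).
    return [entities[start:start + batch_size]
            for start in range(0, len(entities), batch_size)]


# One input row (id, entities); the nested loop mimics explode().
rows = [(1, ["a1", "a2", "a3", "a4", "a5"])]
exploded = [(row_id, batch)
            for row_id, entities in rows
            for batch in split_into_batches(entities, 2)]

for row in exploded:
    print(row)
# (1, ['a1', 'a2'])
# (1, ['a3', 'a4'])
# (1, ['a5'])
```

The last batch is simply shorter when the array length is not a multiple of the batch size, which matches the expected output in the question.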

How to query nested array of jsonb

I am working on a PostgreSQL 11 table with a jsonb column that holds an array of nested objects.
To reproduce the issue:
CREATE TABLE public.test
(
id integer NOT NULL DEFAULT nextval('test_id_seq'::regclass),
testcol jsonb
);
insert into test (testcol) values
('[{"type": {"value": 1, "displayName": "flag1"}, "value": "10"},
{"type": {"value": 2, "displayName": "flag2"}, "value": "20"},
{"type": {"value": 3, "displayName": "flag3"}, "value": "30"},
{"type": {"value": 4, "displayName": "flag4"}},
{"type": {"value": 4, "displayName": "flag4"}},
{"type": {"value": 6, "displayName": "flag6"}, "value": "40"}]');
I am trying to:
1. Get the outer value when the type matches a specific value, e.g. get the value 30 when displayName is flag3.
2. Count the occurrences of flag4 in the inner JSON.

You could use jsonb_to_recordset to parse it:
WITH cte AS (
SELECT test.id, sub."type"->'value' AS t_value, sub."type"->'displayName' AS t_name, value
FROM test
,LATERAL jsonb_to_recordset(testcol) sub("type" jsonb, "value" int)
)
SELECT *
FROM cte
-- WHERE ...
-- GROUP BY ...;
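The query above leaves the WHERE and GROUP BY steps open. As a hedged illustration of what those two steps would compute on the sample data, here is the same flatten-then-filter logic in Python; the field names `t_value`, `t_name` and `value` mirror the CTE columns, and everything else is assumed for the example.

```python
import json

testcol = json.loads("""
[{"type": {"value": 1, "displayName": "flag1"}, "value": "10"},
 {"type": {"value": 2, "displayName": "flag2"}, "value": "20"},
 {"type": {"value": 3, "displayName": "flag3"}, "value": "30"},
 {"type": {"value": 4, "displayName": "flag4"}},
 {"type": {"value": 4, "displayName": "flag4"}},
 {"type": {"value": 6, "displayName": "flag6"}, "value": "40"}]
""")

# Equivalent of the CTE: one flat record per array element.
records = [
    {"t_value": item["type"]["value"],
     "t_name": item["type"]["displayName"],
     "value": int(item["value"]) if "value" in item else None}
    for item in testcol
]

# 1. Outer value for a given display name (the WHERE step).
flag3_values = [r["value"] for r in records if r["t_name"] == "flag3"]

# 2. Number of elements with a given display name (the count step).
flag4_count = sum(1 for r in records if r["t_name"] == "flag4")

print(flag3_values)  # [30]
print(flag4_count)   # 2
```

When translating this back to SQL, keep in mind that the -> operator returns jsonb, so comparing the display name with a plain text literal needs ->> instead.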

Extract key from JSON string in MySQL

My table contains strings in JSON format. I need to get the sum and average of the values for each id.
+----+------------------------------------------------------------------------------------+------------+
| id | json_data | subject_id |
+----+------------------------------------------------------------------------------------+------------+
| 1 | {"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
| 2 | {"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
| 3 | {"id": "a", "value": "20"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
Expected result is
{"id": "a", "sum": 90, "avg": 30},
{"id": "b", "sum": 70, "avg": 35},
{"id": "c", "sum": 120, "avg": 40}
I've tried
SELECT (
JSON_OBJECT('id', id, 'sum', sum_data, 'avg', avg_data)
) FROM (
SELECT
JSON_EXTRACT(json_data, "$.id") as id,
SUM(JSON_EXTRACT(json_data, "$.sum_data")) as sum_data,
AVG(JSON_EXTRACT(json_data, "$.avg_data")) as avg_data
FROM Details
GROUP BY JSON_EXTRACT(json_data, "$.id")
) as t
But no luck. How can I sort this out?

The input JSON needs to be corrected first: each row has to hold a valid JSON array.
create table json_sum (id int primary key auto_increment, json_data json);
insert into json_sum values (0,'[{"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"}]');
insert into json_sum values (0,'[{"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"}]');
insert into json_sum values (0,'[{"id": "a", "value": "20"}]');
select
json_object("id", jt.id, "sum", sum(jt.value), "avg", avg(jt.value))
from json_sum, json_table(json_data, "$[*]" columns (
row_id for ordinality,
id varchar(10) path "$.id",
value varchar(10) path "$.value")
) as jt
group by jt.id
Output:
json_object("id", jt.id, "sum", sum(jt.value), "avg", avg(jt.value))
{"id": "a", "avg": 30.0, "sum": 90.0}
{"id": "b", "avg": 35.0, "sum": 70.0}
{"id": "c", "avg": 45.0, "sum": 90.0}
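As a cross-check of the figures, the same group-by can be sketched in Python over the three corrected rows. This is only an illustration of the aggregation, with the row strings copied from the inserts above.

```python
import json
from collections import defaultdict

rows = [
    '[{"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"}]',
    '[{"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"}]',
    '[{"id": "a", "value": "20"}]',
]

# Unnest every array element and collect its numeric value under its id.
values_by_id = defaultdict(list)
for json_data in rows:
    for item in json.loads(json_data):
        values_by_id[item["id"]].append(float(item["value"]))

# One output object per id, like GROUP BY jt.id with SUM and AVG.
result = [
    {"id": key, "sum": sum(vals), "avg": sum(vals) / len(vals)}
    for key, vals in sorted(values_by_id.items())
]
for entry in result:
    print(json.dumps(entry))
```

For id "c" this gives a sum of 90 and an average of 45, in line with the query output above; the 120 and 40 listed in the question do not follow from the sample data.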

How to make pgsql return the json array

Hi everyone, I am having trouble converting table data into JSON objects. There is a table called milestone with the following data:
id name parentId
a test1 A
b test2 B
c test3 C
I want to convert the result into a json type in Postgres:
[{"id": "a", "name": "test1", "parentId": "A"}]
[{"id": "b", "name": "test2", "parentId": "B"}]
[{"id": "c", "name": "test3", "parentId": "C"}]
If anyone knows how to handle this, please let me know. Thanks, all.

You can get each row of the table as simple json object with to_jsonb():
select to_jsonb(m)
from milestone m
to_jsonb
-----------------------------------------------
{"id": "a", "name": "test1", "parentid": "A"}
{"id": "b", "name": "test2", "parentid": "B"}
{"id": "c", "name": "test3", "parentid": "C"}
(3 rows)
If you want to get a single element array for each row, use jsonb_build_array():
select jsonb_build_array(to_jsonb(m))
from milestone m
jsonb_build_array
-------------------------------------------------
[{"id": "a", "name": "test1", "parentid": "A"}]
[{"id": "b", "name": "test2", "parentid": "B"}]
[{"id": "c", "name": "test3", "parentid": "C"}]
(3 rows)
You can also get all rows as a json array with jsonb_agg():
select jsonb_agg(to_jsonb(m))
from milestone m
jsonb_agg
-----------------------------------------------------------------------------------------------------------------------------------------------
[{"id": "a", "name": "test1", "parentid": "A"}, {"id": "b", "name": "test2", "parentid": "B"}, {"id": "c", "name": "test3", "parentid": "C"}]
(1 row)
Read about JSON Functions and Operators in the documentation.
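The three result shapes differ only in how the rows are wrapped. A small Python sketch, using a hypothetical in-memory copy of the milestone rows, may make the difference easier to see:

```python
import json

# Hypothetical in-memory copy of the milestone table.
milestone = [
    {"id": "a", "name": "test1", "parentid": "A"},
    {"id": "b", "name": "test2", "parentid": "B"},
    {"id": "c", "name": "test3", "parentid": "C"},
]

# One JSON object per row, like to_jsonb(m).
objects = [json.dumps(row) for row in milestone]

# One single-element JSON array per row, like jsonb_build_array(to_jsonb(m)).
single_arrays = [json.dumps([row]) for row in milestone]

# One JSON array holding all rows, like jsonb_agg(to_jsonb(m)).
aggregated = json.dumps(milestone)

print(objects[0])        # {"id": "a", "name": "test1", "parentid": "A"}
print(single_arrays[0])  # [{"id": "a", "name": "test1", "parentid": "A"}]
print(len(objects), len(single_arrays))  # 3 3
```

The first two shapes keep one result per table row, while the aggregated form collapses the whole table into a single value.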

You can use ROW_TO_JSON
From Documentation :
Returns the row as a JSON object. Line feeds will be added between
level-1 elements if pretty_bool is true.
For the query :
select
row_to_json(tbl)
from
(select * from tbl) as tbl;

Get the value from nested JSON in Postgres

I have a table called "Audio" with a column "transcript" as the following:
{"transcript": [
{"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"},
{"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
{"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"},
{"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"}
...
]}
I would like to get the value of "p" where "w" matches certain keywords.
If I run the following query, it returns every "s" value of each Audio row in which at least one "w" matches "google" or "all".
select json_array_elements(transcript->'transcript')->>'s'
from Audio,
json_array_elements(transcript->'transcript') as temp
where temp->>'w' ilike any(array['all','google'])
How could I get only value of "p" where the condition is satisfied?
Edit:
How could I get the value of "p" and its corresponding Audio ID at the same time?

Select your transcript array elements into a common table expression and match from there:
WITH transcript AS (
SELECT json_array_elements((transcript -> 'transcript')) AS line
FROM audio
)
SELECT line ->> 'p'
FROM transcript
WHERE line ->> 'w' ILIKE ANY (ARRAY ['all', 'google']);
This will select matching lines from all rows in the audio table. I'm guessing that you'll want to restrict the results to a subset of rows, in which case you'll have to narrow the query. Assuming an id column, do something like this:
WITH transcript AS (
SELECT
id,
json_array_elements((transcript -> 'transcript')) AS line
FROM audio
WHERE id = 1
)
SELECT
id,
line ->> 'p'
FROM transcript
WHERE line ->> 'w' ILIKE ANY (ARRAY ['call', 'google'])
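For comparison, the same per-word filter can be sketched in Python. The `audio` dictionary and the helper `matching_positions` are hypothetical stand-ins for the table and the query, and the sketch assumes the keywords contain no % or _ wildcards, so that ILIKE behaves like a case-insensitive whole-word comparison.

```python
import json

# Hypothetical stand-in for the Audio table: id -> transcript JSON text.
audio = {
    1: '{"transcript": [{"p": 0, "w": "This"}, {"p": 1, "w": "call"}]}',
    2: '{"transcript": [{"p": 0, "w": "Google"}, {"p": 1, "w": "all"}]}',
}


def matching_positions(audio_rows, keywords):
    """Return (audio id, p) for every word equal to a keyword, ignoring case."""
    wanted = {k.lower() for k in keywords}
    return [(audio_id, word["p"])
            for audio_id, transcript in audio_rows.items()
            for word in json.loads(transcript)["transcript"]
            if word["w"].lower() in wanted]


print(matching_positions(audio, ["call", "google"]))  # [(1, 1), (2, 0)]
```

Because the id is carried along with each word, every match comes back paired with the row it belongs to, which is what the edit to the question asks for.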
