Unknown duplicates from querying a nested JSON - json

I would like to do text search in a JSON object in a table.
I have a table called Audio that is structured like below:
id| keyword | transcript | user_id | company_id | client_id
-----------------------------------------------------------
This is the JSON data structure of transcript:
{"transcript": [
{"duration": 2390.0,
"interval": [140.0, 2530.0],
"speaker": "Speaker_2",
"words": [
{"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"},
{"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
{"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"},
{"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"},
{"p": 4, "s": 1010, "e": 1250, "c": 0.814, "w": "recorded"}
]
},
{"duration": 4360.0,
"interval": [3280.0, 7640.0],
"speaker": "Speaker_1",
"words": [
{"p": 5, "s": 5000, "e": 5020, "c": 0.079, "w": "as"},
{"p": 6, "s": 5020, "e": 5100, "c": 0.238, "w": "a"},
{"p": 7, "s": 5100, "e": 5409, "c": 0.689, "w": "group"},
{"p": 8, "s": 5410, "e": 5590, "c": 0.802, "w": "called"},
{"p": 9, "s": 5590, "e": 5870, "c": 0.834, "w": "tricks"}
]
},
...
]}
What I am trying to do is to do a text search in the "w" field within "words". This is the query that I tried to run:
WITH info_data AS (
  SELECT transcript_info->'words' AS info
  FROM Audio t, json_array_elements(transcript->'transcript') AS transcript_info
)
SELECT info_item->>'w', id
FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item
WHERE info_item->>'w' ILIKE '%this';
Right now the table has five rows in total; four rows have data and the fifth is null. However, I got the following result, where even the row that has no data shows up in the output:
?column? | id
----------+----
This | 2
This | 5
This | 1
This | 3
This | 4
This | 2
This | 5
I would love to know what the problem with my query is, and whether there is a more efficient way of doing this.

The problem is that you create a cartesian join between the table Audio on the one hand and info_data and info_item on the other (there is an implicit lateral join between the latter two) here:
FROM Audio, info_data idata, json_array_elements(idata.info) AS info_item
You can solve this by adding Audio.id to the CTE and then adding WHERE Audio.id = idata.id to the outer query.
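To see why the cross join multiplies rows, here is a minimal Python sketch with hypothetical data (the ids and match counts are made up for illustration): the matches produced inside the CTE carry no id, so joining them back against Audio pairs every match with every row.

```python
audio_ids = [1, 2, 3, 4, 5]        # hypothetical ids of the rows in Audio
cte_matches = ["This", "This"]     # words matched inside the CTE; no id attached

# The implicit cross join pairs every CTE match with every Audio row:
rows = [(word, audio_id) for word in cte_matches for audio_id in audio_ids]
print(len(rows))       # 10 rows: 2 matches x 5 ids

# Carrying the id through the CTE keeps each match tied to its own row:
keyed_matches = [(2, "This"), (5, "This")]   # (id, word) pairs from the CTE
keyed_rows = [(word, rid) for rid, word in keyed_matches]
print(len(keyed_rows)) # 2 rows, one per actual match
```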
It is doubtful that this is the most efficient solution (CTEs rarely are). If you just want to get those rows where the word "this" is a word in the transcript, then you are most likely better off like this:
SELECT DISTINCT id
FROM (
  SELECT id, transcript_info->'words' AS info
  FROM Audio, json_array_elements(transcript->'transcript') AS transcript_info
) AS t,
json_array_elements(info) AS words
WHERE words->>'w' ILIKE 'this';
Note that the leading % in the pattern string is very inefficient. Since very few words in the English language other than "this" itself end in "this", I have taken the liberty of removing it.

Related

How to divide json array in hive table into batches the size of a given batch size?

For example, as input we have a Hive table like this:
id | entities
---+--------------------------------------------------------------------------------------------------------------------------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}, {"a": "a3", "b": "b3"}, {"a": "a4", "b": "b4"}, {"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}, {"c": "c3", "d": "d3"}, {"c": "c4", "d": "d4"}, {"c": "c5", "d": "d5"}]
And with batchSize = 3 we should get no more than three elements in each array, like this:
id | entities
---+--------------------------------------------------------------------------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}, {"a": "a3", "b": "b3"}]
1  | [{"a": "a4", "b": "b4"}, {"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}, {"c": "c3", "d": "d3"}]
2  | [{"c": "c4", "d": "d4"}, {"c": "c5", "d": "d5"}]
With batchSize = 2 no more than two elements in arrays:
id | entities
---+---------------------------------------------------
1  | [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}]
1  | [{"a": "a3", "b": "b3"}, {"a": "a4", "b": "b4"}]
1  | [{"a": "a5", "b": "b5"}]
2  | [{"c": "c1", "d": "d1"}, {"c": "c2", "d": "d2"}]
2  | [{"c": "c3", "d": "d3"}, {"c": "c4", "d": "d4"}]
2  | [{"c": "c5", "d": "d5"}]
Are there any built-ins in Hive for this, or can you suggest a Spark UDF?
I tried to write a UDF, but I don't know how to make it return several rows with sub-arrays instead of one: a UDF usually performs some operation on columns and returns a single result, and I don't see how to split an array into several rows of sub-arrays.
Thanks!
You can use the slice function to get sub-arrays and then explode to get multiple rows.
import spark.implicits._

val df = // input
val batchSize = 2

df.select('id, explode(expr(
    s"transform(sequence(1, size(entities), $batchSize)," +
    s" s -> slice(entities, s, $batchSize))")))
  .show(false)
+---+--------------------+
|id |col |
+---+--------------------+
|1 |[[a1, b1], [a2, b2]]|
|1 |[[a3, b3], [a4, b4]]|
|1 |[[a5, b5]] |
+---+--------------------+
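The same batching logic, sketched in plain Python for clarity: the Spark expression computes the slice start points with sequence and cuts each batch with slice, while here a list slice plays both roles.

```python
def batch(entities, batch_size):
    """Split a list into consecutive sub-lists of at most batch_size elements,
    mirroring transform(sequence(1, size(entities), batchSize),
                        s -> slice(entities, s, batchSize))."""
    return [entities[i:i + batch_size] for i in range(0, len(entities), batch_size)]

row = [{"a": "a1"}, {"a": "a2"}, {"a": "a3"}, {"a": "a4"}, {"a": "a5"}]
print(batch(row, 2))
# [[{'a': 'a1'}, {'a': 'a2'}], [{'a': 'a3'}, {'a': 'a4'}], [{'a': 'a5'}]]
```

explode then emits one output row per sub-list, repeating the id, which produces the shape shown in the expected tables above.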

How to query nested array of jsonb

I am working on a PostgreSQL 11 table with a column of nested, repeated jsonb objects.
To simulate the issue:
CREATE TABLE public.test
(
  id integer NOT NULL DEFAULT nextval('test_id_seq'::regclass),
  testcol jsonb
);

insert into test (testcol) values
('[{"type": {"value": 1, "displayName": "flag1"}, "value": "10"},
  {"type": {"value": 2, "displayName": "flag2"}, "value": "20"},
  {"type": {"value": 3, "displayName": "flag3"}, "value": "30"},
  {"type": {"value": 4, "displayName": "flag4"}},
  {"type": {"value": 4, "displayName": "flag4"}},
  {"type": {"value": 6, "displayName": "flag6"}, "value": "40"}]');
I am trying to:
1. get the outer "value" when "type" matches a specific value, e.g. get the value 30 when "displayName" is flag3;
2. count the occurrences of flag4 in the inner JSON.
You could use jsonb_to_recordset to parse it:
WITH cte AS (
  SELECT test.id, sub."type"->'value' AS t_value, sub."type"->'displayName' AS t_name, value
  FROM test,
  LATERAL jsonb_to_recordset(testcol) sub("type" jsonb, "value" int)
)
SELECT *
FROM cte
-- WHERE ...
-- GROUP BY ...;
db<>fiddle demo
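For reference, here is what the two lookups should compute, sketched in Python over the same sample data (this only illustrates the expected semantics, not a replacement for the SQL):

```python
import json

testcol = json.loads("""[
  {"type": {"value": 1, "displayName": "flag1"}, "value": "10"},
  {"type": {"value": 2, "displayName": "flag2"}, "value": "20"},
  {"type": {"value": 3, "displayName": "flag3"}, "value": "30"},
  {"type": {"value": 4, "displayName": "flag4"}},
  {"type": {"value": 4, "displayName": "flag4"}},
  {"type": {"value": 6, "displayName": "flag6"}, "value": "40"}]""")

# 1. outer "value" where displayName is flag3
flag3_values = [e.get("value") for e in testcol
                if e["type"]["displayName"] == "flag3"]
print(flag3_values)    # ['30']

# 2. occurrences of flag4 among the inner objects
flag4_count = sum(e["type"]["displayName"] == "flag4" for e in testcol)
print(flag4_count)     # 2
```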

Extract key from JSON string in MySQL

My table contains strings in JSON format. I need to get the sum and the average for each key.
+----+------------------------------------------------------------------------------------+------------+
| id | json_data | subject_id |
+----+------------------------------------------------------------------------------------+------------+
| 1 | {"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
| 2 | {"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
| 3 | {"id": "a", "value": "20"} | 1 |
+----+------------------------------------------------------------------------------------+------------+
Expected result is
{"id": "a", "sum": 90, "avg": 30},
{"id": "b", "sum": 70, "avg": 35},
{"id": "c", "sum": 120, "avg": 40}
I've tried
SELECT (
JSON_OBJECT('id', id, 'sum', sum_data, 'avg', avg_data)
) FROM (
SELECT
JSON_EXTRACT(json_data, "$.id") as id,
SUM(JSON_EXTRACT(json_data, "$.sum_data")) as sum_data,
AVG(JSON_EXTRACT(json_data, "$.avg_data")) as avg_data
FROM Details
GROUP BY JSON_EXTRACT(json_data, "$.id")
) as t
But no luck. How can I sort this out?
The input JSON needs to be corrected first, by wrapping each list of objects in an array:
create table json_sum (id int primary key auto_increment, json_data json);
insert into json_sum values (0,'[{"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"}]');
insert into json_sum values (0,'[{"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"}]');
insert into json_sum values (0,'[{"id": "a", "value": "20"}]');
select
json_object("id", jt.id, "sum", sum(jt.value), "avg", avg(jt.value))
from json_sum, json_table(json_data, "$[*]" columns (
row_id for ordinality,
id varchar(10) path "$.id",
value varchar(10) path "$.value")
) as jt
group by jt.id
Output:
json_object("id", jt.id, "sum", sum(jt.value), "avg", avg(jt.value))
{"id": "a", "avg": 30.0, "sum": 90.0}
{"id": "b", "avg": 35.0, "sum": 70.0}
{"id": "c", "avg": 45.0, "sum": 90.0}
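The grouping that the json_table query performs, restated as a Python sketch over the corrected (array-wrapped) input, to make the per-key sums and averages easy to verify:

```python
import json
from collections import defaultdict

rows = [
    '[{"id": "a", "value": "30"}, {"id": "b", "value": "20"}, {"id": "c", "value": "30"}]',
    '[{"id": "a", "value": "40"}, {"id": "b", "value": "50"}, {"id": "c", "value": "60"}]',
    '[{"id": "a", "value": "20"}]',
]

# Flatten every array element (what json_table with "$[*]" does),
# then group the numeric values by id (the GROUP BY jt.id step).
acc = defaultdict(list)
for r in rows:
    for obj in json.loads(r):
        acc[obj["id"]].append(float(obj["value"]))

result = [{"id": k, "sum": sum(v), "avg": sum(v) / len(v)} for k, v in acc.items()]
print(result)
# [{'id': 'a', 'sum': 90.0, 'avg': 30.0},
#  {'id': 'b', 'sum': 70.0, 'avg': 35.0},
#  {'id': 'c', 'sum': 90.0, 'avg': 45.0}]
```

Note that for id "c" the data yields sum 90 and avg 45 (values 30 and 60), matching the query output above rather than the expected result stated in the question.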

How to make pgsql return the json array

Everyone, I am facing an issue converting data into a JSON object. There is a table called milestone with the following data:
id | name  | parentId
---+-------+---------
a  | test1 | A
b  | test2 | B
c  | test3 | C
I want to convert the result into a json type in Postgres:
[{"id": "a", "name": "test1", "parentId": "A"}]
[{"id": "b", "name": "test2", "parentId": "B"}]
[{"id": "c", "name": "test3", "parentId": "C"}]
If anyone knows how to handle this, please let me know. Thanks, all.
You can get each row of the table as simple json object with to_jsonb():
select to_jsonb(m)
from milestone m
to_jsonb
-----------------------------------------------
{"id": "a", "name": "test1", "parentid": "A"}
{"id": "b", "name": "test2", "parentid": "B"}
{"id": "c", "name": "test3", "parentid": "C"}
(3 rows)
If you want to get a single element array for each row, use jsonb_build_array():
select jsonb_build_array(to_jsonb(m))
from milestone m
jsonb_build_array
-------------------------------------------------
[{"id": "a", "name": "test1", "parentid": "A"}]
[{"id": "b", "name": "test2", "parentid": "B"}]
[{"id": "c", "name": "test3", "parentid": "C"}]
(3 rows)
You can also get all rows as a json array with jsonb_agg():
select jsonb_agg(to_jsonb(m))
from milestone m
jsonb_agg
-----------------------------------------------------------------------------------------------------------------------------------------------
[{"id": "a", "name": "test1", "parentid": "A"}, {"id": "b", "name": "test2", "parentid": "B"}, {"id": "c", "name": "test3", "parentid": "C"}]
(1 row)
Read about JSON Functions and Operators in the documentation.
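The relationship between the three shapes above, sketched in Python: a row becomes one object, the per-row array wraps each object, and aggregation collects all objects into a single array. (Note that unquoted identifiers fold to lower case in Postgres, hence "parentid".)

```python
rows = [("a", "test1", "A"), ("b", "test2", "B"), ("c", "test3", "C")]
cols = ("id", "name", "parentid")

objs = [dict(zip(cols, r)) for r in rows]   # to_jsonb(m): one object per row
wrapped = [[o] for o in objs]               # jsonb_build_array(to_jsonb(m)): single-element array per row
aggregated = objs                           # jsonb_agg(to_jsonb(m)): one array holding all rows
```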
You can also use ROW_TO_JSON.
From the documentation:
Returns the row as a JSON object. Line feeds will be added between
level-1 elements if pretty_bool is true.
For the query :
select
row_to_json(tbl)
from
(select * from tbl) as tbl;
You can check the DEMO here.

Get the value from nested JSON in Postgres

I have a table called "Audio" with a column "transcript" as the following:
{"transcript": [
{"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"},
{"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
{"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"},
{"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"}
...
]}
I would like to get the value of "p" where "w" matches certain keywords.
If I run the following query, it gives me all the "s" entries of Audio where one of the "w" values matches "google" or "all":
select json_array_elements(transcript->'transcript')->>'s'
from Audio,
json_array_elements(transcript->'transcript') as temp
where temp->>'w' ilike any(array['all','google'])
How could I get only value of "p" where the condition is satisfied?
Edit:
How could I get the value of "p" and its corresponding Audio ID at the same time?
Select your transcript array elements into a common table expression and match from there:
WITH transcript AS (
  SELECT json_array_elements((transcript -> 'transcript')) AS line
  FROM audio
)
SELECT line ->> 'p'
FROM transcript
WHERE line ->> 'w' ILIKE ANY (ARRAY ['all', 'google']);
This will select matching lines from all rows in the audio table. I'm guessing that you'll want to restrict the results to a subset of rows, in which case you'll have to narrow the query. Assuming an id column, do something like this:
WITH transcript AS (
  SELECT
    id,
    json_array_elements((transcript -> 'transcript')) AS line
  FROM audio
  WHERE id = 1
)
SELECT
  id,
  line ->> 'p'
FROM transcript
WHERE line ->> 'w' ILIKE ANY (ARRAY ['call', 'google']);
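As a sanity check of the semantics, the same extraction in Python over a hypothetical single-row audio table (note that ILIKE without wildcards behaves as a case-insensitive equality test):

```python
import json

# Hypothetical rows of the audio table: (id, transcript jsonb)
audio_rows = [
    (1, json.loads("""{"transcript": [
        {"p": 0, "s": 0,   "e": 320, "c": 0.545, "w": "This"},
        {"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
        {"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"}]}""")),
]

keywords = {"call", "google"}
hits = [(audio_id, word["p"])
        for audio_id, doc in audio_rows
        for word in doc["transcript"]
        if word["w"].lower() in keywords]
print(hits)   # each hit is the Audio id together with the "p" of the matching word
```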