Presto/Athena - query to discover JSON attribute frequencies?

I have defined a Hive table where a single column contains JSON text:
CREATE EXTERNAL TABLE IF NOT EXISTS my.rawdata (
    json string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '\n',
    'quoteChar' = '\0',
    'escapeChar' = '\r'
)
STORED AS TEXTFILE
LOCATION 's3://mydata/';
Is there a Presto/Athena query that can list out all field names that occur within the JSON and their frequency (i.e. total number of times the attribute appears in the table)?

Use the JSON functions to parse the JSON and turn it into a map. Then extract the keys and unnest them. Finally, use a normal SQL aggregation:
SELECT key, count(*)
FROM (
    SELECT map_keys(cast(json_parse(json) AS map(varchar, json))) AS keys
    FROM rawdata
)
CROSS JOIN UNNEST(keys) AS t (key)
GROUP BY key
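If some rows hold text that is not a valid top-level JSON object, the cast above fails the whole query. A minimal sketch of a more defensive variant, assuming the my.rawdata table from the question and using Presto's TRY/TRY_CAST so that unparseable rows simply contribute no keys:
SELECT key,
       count(*) AS occurrences
FROM (
    SELECT map_keys(
               coalesce(
                   -- NULL when the row is not valid JSON or not a top-level object
                   try_cast(try(json_parse(json)) AS map(varchar, json)),
                   map()  -- substitute an empty map so such rows contribute no keys
               )
           ) AS keys
    FROM my.rawdata
)
CROSS JOIN UNNEST(keys) AS t (key)
GROUP BY key
ORDER BY occurrences DESC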

Supports multi-level documents.
Ignores keys whose value is a nested object or array (they do not match the pattern).
select key, count(*)
from t
cross join unnest(regexp_extract_all(json, '"([^"]+)"\s*:\s*("[^"]+"|[^,{}]+)', 1)) u (key)
group by key;
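Here t stands for whatever table holds the raw JSON text. A sketch of the same idea against the table from the question (assuming the my.rawdata name and json column); keys are counted at any nesting depth, while keys whose value is itself an object or array never match the pattern:
select key, count(*) as occurrences
from my.rawdata
cross join unnest(
    regexp_extract_all(json, '"([^"]+)"\s*:\s*("[^"]+"|[^,{}]+)', 1)
) u (key)
group by key
order by occurrences desc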

Related

Is there a way to query an array of JSON objects returned by a CTE in PostgreSQL?

I have a PostgreSQL query that uses a CTE and the SELECT within the CTE uses json_agg() to aggregate data as JSON objects. Is there a way to query the results of the CTE by searching for a specific object in the array based on the value of a field of objects?
For example, let's say the CTE creates a temporary table named results. The values from json_agg() are available in a field called owners, and each owner object has a field called name. I want to SELECT * FROM results WHERE owner.name = 'John Smith'. I am not sure how to write the WHERE clause below so that the name field of each object in the owners array is checked for the value.
WITH results AS (
    -- some other fields here
    (SELECT json_agg(owners)
     FROM (
         SELECT id, name, telephone, email
         FROM owner
     ) owners
    ) as owners
)
SELECT *
FROM results
WHERE owners->>'name' == 'John Smith'
To do that query you can typically use the jsonpath language after converting your json data to jsonb (see the PostgreSQL manual sections on JSON types and on JSON functions and operators):
WITH results AS (
    -- some other fields here
    (SELECT json_agg(owners)
     FROM (
         SELECT id, name, telephone, email
         FROM owner
     ) owners
    ) as owners
)
SELECT *
FROM results
WHERE jsonb_path_exists(owners::jsonb, '$[*] ? (@.name == "John Smith")')
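Since the CTE in the question is only sketched (the surrounding SELECT is omitted), here is a self-contained sketch of the same jsonb_path_exists filter over an inlined owners array; the owner rows are made up for illustration:
-- Hypothetical data: build an owners JSON array inline, then filter with a jsonpath predicate.
WITH results AS (
    SELECT json_agg(o) AS owners
    FROM (
        VALUES (1, 'John Smith', '555-0100', 'john@example.com'),
               (2, 'Jane Doe',   '555-0101', 'jane@example.com')
    ) AS o(id, name, telephone, email)
)
SELECT *
FROM results
WHERE jsonb_path_exists(owners::jsonb, '$[*] ? (@.name == "John Smith")');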

Combine two JSON arrays as key-value pairs in MySQL and create one JSON object

I have two JSON array fields in MySQL like this:
["a", "b", "c"]
["apple", "banana", "coconut"]
Now I want to combine them into one JSON object like this:
{"a":"apple", "b":"banana", "c":"coconut"}
Is there any MySQL function for this?
I would approach this in a simple way.
Unnest the two JSON structures using JSON_TABLE().
Join the two tables together.
Construct the appropriate JSON objects and aggregate.
The following implements this logic. The first CTE extracts the keys. The second extracts the values, and finally these are combined:
WITH the_keys as (
    SELECT j.*
    FROM t CROSS JOIN
         JSON_TABLE(t.jsdata1,
                    '$[*]'
                    columns (seqnum for ordinality, the_key varchar(255) path '$')
         ) j
),
the_values as (
    SELECT j.*
    FROM t CROSS JOIN
         JSON_TABLE(t.jsdata2,
                    '$[*]'
                    columns (seqnum for ordinality, val varchar(255) path '$')
         ) j
)
select json_objectagg(the_keys.the_key, the_values.val)
from the_keys join
     the_values
     on the_keys.seqnum = the_values.seqnum;
Here is a db<>fiddle.
Note that this is quite generalizable (you can add more elements to the rows). You can readily adjust it to return multiple rows of data if you have key/value pairs on different rows, and it uses no deprecated functionality.
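For reference, a minimal sketch of a table the query above could run against (MySQL 8.0+, which JSON_TABLE() requires; the table and column names match the query, and the row matches the question):
CREATE TABLE t (
    jsdata1 JSON,
    jsdata2 JSON
);

INSERT INTO t (jsdata1, jsdata2)
VALUES ('["a", "b", "c"]', '["apple", "banana", "coconut"]');

-- The CTE-based query above should then return:
-- {"a": "apple", "b": "banana", "c": "coconut"}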
You can extract each element of the arrays by index with JSON_EXTRACT(), generating the index with a row-number variable incremented through a join against a table from information_schema, and then aggregate the results of the subquery with JSON_OBJECTAGG(), such as:
SELECT JSON_OBJECTAGG(Js1, Js2)
FROM
(
    SELECT JSON_UNQUOTE(JSON_EXTRACT(jsdata1, CONCAT('$[', @rn + 1, ']'))) AS Js1,
           JSON_UNQUOTE(JSON_EXTRACT(jsdata2, CONCAT('$[', @rn + 1, ']'))) AS Js2,
           @rn := @rn + 1 AS rn
    FROM tab AS t1
    JOIN (SELECT @rn := -1) AS r
    JOIN information_schema.tables AS t2
    -- WHERE @rn < JSON_LENGTH(jsdata1) - 1  -- redundant for MariaDB, but needed for MySQL
) AS j
where
'["a", "b", "c"]' is assumed to be the value of the column jsdata1 and
'["apple", "banana", "coconut"]' is assumed to be the value of the column jsdata2
within a table (tab) containing only a single inserted row.
A basic way to do it with JSON functions:
select JSON_OBJECT(
JSON_UNQUOTE(JSON_EXTRACT(a, '$[0]')), JSON_EXTRACT(b, '$[0]'),
JSON_UNQUOTE(JSON_EXTRACT(a, '$[1]')), JSON_EXTRACT(b, '$[1]'),
JSON_UNQUOTE(JSON_EXTRACT(a, '$[2]')), JSON_EXTRACT(b, '$[2]')
) result from tbl;

Facing a problem in a PostgreSQL query for JSON data

I have the following data:
{
    "City": "Fontana",
    "Timezone": "America/Los_Angeles",
    "Longitude": "-117.4864123",
    "Timestamp": "2020-07-15T12:13:00-07:00",
    "refs": ["123", "456", "789"],
    "tZone": "PPP"
}
The above data is stored in the analytics.col_json column.
The table structure is:
CREATE TABLE analytics
(
    id bigint NOT NULL,
    col_typ character varying(255) COLLATE pg_catalog."default",
    col_json json,
    cre_dte timestamp without time zone,
    CONSTRAINT clbk_logs_pkey PRIMARY KEY (id)
);
There are n such rows in the table.
I am trying to fetch records on the basis of 'refs' by sending a list of strings; for example,
I have a separate list of values to be filtered against on the right-hand side of my query.
My query is the following:
select * FROM public.analytics
where col_json-> 'refs' in (
'123',
'pqa',
'bhu',
'qwerty'
);
But the above query is not working for me.
The more advanced JSON capabilities are only available when using the jsonb type, so you will have to cast your column every time you want to do something non-trivial. It would be better to define the column as jsonb in the long run.
You can use the ?| operator:
select a.*
from analytics a
where col_json::jsonb -> 'refs' ?| array['123','pqa','bhu','qwerty'];
Note that this only works if all array elements are strings; it does not work with numbers, e.g. if the JSON contained "refs": [123,456].
Alternatively you can use an EXISTS condition with a sub-query:
select a.*
from analytics a
where exists (select *
              from json_array_elements_text(a.col_json -> 'refs') as x(item)
              where x.item in ('123','pqa','bhu','qwerty'));
If you want refs to contain all of the values in your list, you can use the containment operator @>:
select a.*
from analytics a
where a.col_json::jsonb -> 'refs' @> '["123", "456"]';
Or alternatively: where a.col_json::jsonb @> '{"refs": ["123", "456"]}'
The above will only return rows where both values are contained in the refs array.
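For a concrete check, here is a minimal sketch that loads the sample document from the question into the analytics table and applies the first (?|) variant; the id and col_typ values are made up:
INSERT INTO analytics (id, col_typ, col_json, cre_dte)
VALUES (1, 'demo',
        '{"City": "Fontana", "Timezone": "America/Los_Angeles",
          "Longitude": "-117.4864123",
          "Timestamp": "2020-07-15T12:13:00-07:00",
          "refs": ["123", "456", "789"], "tZone": "PPP"}',
        now());

-- Matches the row above, because "123" appears in refs:
SELECT a.*
FROM analytics a
WHERE a.col_json::jsonb -> 'refs' ?| array['123', 'pqa', 'bhu', 'qwerty'];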

Querying a JSON array of objects in Postgres

I have a postgres db with a json data field.
The json I have is an array of objects:
[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]
I'm trying to return values for a specific key in a JSON array, so in the above example I'd like to return the values for name.
When I use the following query I just get a NULL value returned:
SELECT data->'name' AS name FROM json_test
I'm assuming this is because it's an array of objects? Is it possible to directly address the name key?
Ultimately, what I need to do is return a count of every unique name. Is this possible?
Thanks!
You have to unnest the array of JSON objects first using the json_array_elements function (or jsonb_array_elements if you have the jsonb data type); then you can access the values by specifying the key.
WITH json_test (col) AS (
values (json '[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]')
)
SELECT
y.x->'name' "name"
FROM json_test jt,
LATERAL (SELECT json_array_elements(jt.col) x) y
-- outputs:
name
--------------
"Mickey Mouse"
"Donald Duck"
To get a count of unique names, it's a similar query to the above, except the count distinct aggregate function is applied to y.x->>'name':
WITH json_test (col) AS (
values (json '[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]')
)
SELECT
COUNT( DISTINCT y.x->>'name') distinct_names
FROM json_test jt,
LATERAL (SELECT json_array_elements(jt.col) x) y
It is necessary to use ->> instead of -> as the former (->>) casts the extracted value as text, which supports equality comparison (needed for distinct count), whereas the latter (->) extracts the value as json, which does not support equality comparison.
Alternatively, convert the json to jsonb and use jsonb_array_elements. JSONB supports the equality comparison, thus it is possible to use COUNT DISTINCT along with extraction via ->, i.e.
COUNT(DISTINCT (y.x::jsonb)->'name')
Updated answer for PostgreSQL versions 12+
It is now possible to extract / unnest specific keys from a list of objects using jsonb path queries, so long as the field queried is jsonb and not json.
example:
WITH json_test (col) AS (
values (jsonb '[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]')
)
SELECT jsonb_path_query(col, '$[*].name') "name"
FROM json_test
-- replaces this original snippet:
-- SELECT
-- y.x->'name' "name"
-- FROM json_test jt,
-- LATERAL (SELECT json_array_elements(jt.col) x) y
You can do it like this:
SELECT * FROM json_test WHERE (column_name @> '[{"name": "Mickey Mouse"}]');
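Note that @> is a jsonb operator, so with a json column you need a cast (or a jsonb column to begin with). A self-contained sketch with the same inline data; this filters whole rows rather than extracting names:
WITH json_test (data) AS (
    VALUES (jsonb '[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]')
)
SELECT * FROM json_test WHERE (data @> '[{"name": "Mickey Mouse"}]');
-- returns the row, because one array element contains {"name": "Mickey Mouse"}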
You can use jsonb_array_elements (when using jsonb) or json_array_elements (when using json) to expand the array elements.
For example:
WITH sample_data_array(arr) AS (
VALUES ('[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]'::jsonb)
)
, sample_data_elements(elem) AS (
SELECT jsonb_array_elements(arr) FROM sample_data_array
)
SELECT elem->'name' AS extracted_name FROM sample_data_elements;
In this example, sample_data_elements is equivalent to a table with a single jsonb column called elem, with two rows (the two array elements in the initial data).
The result consists of two rows (one jsonb column, or of type text if you used ->>'name' instead):
extracted_name
----------------
"Mickey Mouse"
"Donald Duck"
(2 rows)
You should then be able to group and aggregate as usual to return the count of individual names.
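For example, a sketch of that grouping step, continuing from the same CTEs and using ->> so the names come back as text:
WITH sample_data_array(arr) AS (
    VALUES ('[{"name":"Mickey Mouse","age":10},{"name":"Donald Duck","age":5}]'::jsonb)
)
, sample_data_elements(elem) AS (
    SELECT jsonb_array_elements(arr) FROM sample_data_array
)
SELECT elem->>'name' AS name, COUNT(*) AS name_count
FROM sample_data_elements
GROUP BY elem->>'name';
-- one row per distinct name: ('Mickey Mouse', 1) and ('Donald Duck', 1)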

How to cross join unnest a JSON array in Presto

Given a table that contains a column of JSON like this:
{"payload":[{"type":"b","value":"9"}, {"type":"a","value":"8"}]}
{"payload":[{"type":"c","value":"7"}, {"type":"b","value":"3"}]}
How can I write a Presto query to give me the average b value across all entries?
So far I think I need to use something like Hive's lateral view explode, whose equivalent is cross join unnest in Presto.
But I'm stuck on how to write the Presto query for cross join unnest.
How can I use cross join unnest to expand all array elements and select them?
Here's an example of that
with example(message) as (
    VALUES
        (json '{"payload":[{"type":"b","value":"9"},{"type":"a","value":"8"}]}'),
        (json '{"payload":[{"type":"c","value":"7"}, {"type":"b","value":"3"}]}')
)
SELECT
    n.type,
    avg(n.value)
FROM example
CROSS JOIN
    UNNEST(
        CAST(
            JSON_EXTRACT(message, '$.payload')
            as ARRAY(ROW(type VARCHAR, value INTEGER))
        )
    ) as x(n)
WHERE n.type = 'b'
GROUP BY n.type
with defines a common table expression (CTE) named example with a column aliased as message
VALUES returns a verbatim table rowset
UNNEST is taking an array within a column of a single row and returning the elements of the array as multiple rows.
CAST is changing the JSON type into an ARRAY type that is required for UNNEST. It could just as easily have been ARRAY(MAP(...)), but I find ARRAY(ROW(...)) nicer as you can specify column names and use dot notation in the select clause (a sketch of the MAP alternative follows this list).
JSON_EXTRACT is using a jsonPath expression to return the array value of the payload key
avg() and group by should be familiar SQL.
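As mentioned in the CAST note above, the cast could just as easily target ARRAY(MAP(VARCHAR, VARCHAR)). A sketch of that alternative on the same inline data, where fields are accessed with subscripts and the value needs an explicit cast; it returns the average b value the question asked for:
with example(message) as (
    VALUES
        (json '{"payload":[{"type":"b","value":"9"},{"type":"a","value":"8"}]}'),
        (json '{"payload":[{"type":"c","value":"7"},{"type":"b","value":"3"}]}')
)
SELECT avg(cast(n['value'] AS INTEGER)) AS avg_b_value  -- values arrive as varchar, so cast
FROM example
CROSS JOIN UNNEST(
    CAST(JSON_EXTRACT(message, '$.payload') AS ARRAY(MAP(VARCHAR, VARCHAR)))
) as x(n)
WHERE n['type'] = 'b'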
As you pointed out, this was finally implemented in Presto 0.79. :)
Here is an example of the syntax for the cast:
select cast(cast ('[1,2,3]' as json) as array<bigint>);
A special word of advice: there is no 'string' type in Presto like there is in Hive.
That means if your array contains strings, make sure you use the type 'varchar'; otherwise you get an error message saying 'type array does not exist', which can be misleading.
select cast(cast ('["1","2","3"]' as json) as array<varchar>);
The problem was that I was running an old version of Presto.
unnest was added in version 0.79
https://github.com/facebook/presto/blob/50081273a9e8c4d7b9d851425211c71bfaf8a34e/presto-docs/src/main/sphinx/release/release-0.79.rst